Tag Suggestion from Content
I am researching which companies have technology that can suggest tags from the content of posts. For example, if I post a blog entry, the technology would automatically tag my content with appropriate tags. Most link tagging sites like del.icio.us and ma.gnolia.com do this by suggesting tags that you and other people have already used. That’s great when you have several different people all tagging and marking the same links differently. That’s not so great when the content lives in one central place, like Flickr or a blog like this one. There’s one copy of the article, one copy of a picture, etc., so there just isn’t a wider pool of tags to suggest from. The only way then is to analyze the content and see how other similar content has been tagged.
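To make "see how other similar content has been tagged" a bit more concrete, here is a minimal Python sketch. It is only an illustration under assumptions of mine: the function names, the tiny hand-made corpus, and the choice of plain word counts with cosine similarity are all hypothetical, not something any of the sites above is known to use.

```python
import math
import re
from collections import Counter

def word_counts(text):
    """Lowercase the text and count words of three or more letters."""
    return Counter(re.findall(r"[a-z]{3,}", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def suggest_tags(new_text, tagged_posts, top_posts=3, max_tags=5):
    """Let the most similar already-tagged posts vote for their tags."""
    target = word_counts(new_text)
    scored = sorted(
        ((cosine(target, word_counts(text)), tags) for text, tags in tagged_posts),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter()
    for score, tags in scored[:top_posts]:
        if score > 0:  # ignore posts that share no words at all
            for tag in tags:
                votes[tag] += score
    return [tag for tag, _ in votes.most_common(max_tags)]

# Hypothetical corpus of (text, tags) pairs, made up for this example.
corpus = [
    ("Uploading vacation photos from my camera to a photo sharing site", ["photography", "flickr"]),
    ("Writing a blog entry about tagging links and bookmarks", ["blogging", "tagging"]),
    ("Baking bread at home with a sourdough starter", ["cooking"]),
]
suggestions = suggest_tags("A new blog entry on how bookmarks get tagged", corpus)
print(suggestions)
```

With this toy corpus only the second post shares any words with the new text, so only its tags come back. A real system would need stop-word handling, stemming, and far more tagged content before the suggestions meant much.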
The academic approach involves natural language processing, storing contextual models of both the tag space and the content. You can get pretty accurate with that kind of approach, and even discover when tags are misspelled but mean the same thing. I’ve done some searching around and found Language Computer, which is the research arm of Lymba. I also found a paper from TagAssist about the topic.
No matter how you slice it, this approach is going to take some number crunching and disk space. At any real volume, that means multiple machines to process the content on the way in. For very low-volume submission sites like my blog, it might be possible to do everything on one machine. For higher-volume submission sites like the one I’m working on, that’s a real problem to work through.
The question I have, and I haven’t been able to find much on the subject, is whether there are low-tech solutions that will get us 50 percent of the way there for a small investment. We may have to do the “cool” integration at a later stage, depending on the costs involved. I need to find a set of alternatives and choose the best match, but this is a relatively new application for this type of technology. If anyone has some clues, please let me know.
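As a reference point for what "low-tech" could mean here, below is a rough Python sketch of about the simplest baseline I can think of: check which tags from an existing tag list literally appear in the post and rank them by how often they occur. The function name, the tag list, and the sample post are hypothetical, and this is an assumption about what a cheap first pass might look like, not a known solution. It only handles single-word tags and does no language processing at all.

```python
import re
from collections import Counter

def suggest_from_vocabulary(text, known_tags, max_tags=5):
    """Return known tags that literally occur in the text, most frequent first."""
    words = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    hits = {tag: words[tag.lower()] for tag in known_tags if words[tag.lower()] > 0}
    # Sort by descending frequency, then alphabetically to break ties.
    return sorted(hits, key=lambda tag: (-hits[tag], tag))[:max_tags]

# Hypothetical site-wide tag list and sample post.
known = ["flickr", "blog", "tagging", "python"]
post = (
    "My blog now pulls pictures from Flickr. Every blog entry shows the "
    "latest Flickr set, and one blog page lists them all."
)
print(suggest_from_vocabulary(post, known))
```

Something like this costs almost nothing to run on the way in, but it will miss any tag whose exact word never shows up in the post, which is exactly the gap the heavier approaches are meant to close.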
