

Thoughts on ECML PKDD Discovery Challenge (RSDC'08) (Posted by Paul Heymann)

Over the past few years, I have looked into a variety of problems in collaborative tagging systems, like whether they can be better organized, whether they can help web search, and how to avoid spam. Most recently, I have been looking at tag prediction; I will be presenting a paper called "Social Tag Prediction" at SIGIR'08 next month (more details on that in a later blog post). As such, I am very curious to see the outcome of the ECML PKDD Discovery Challenge (RSDC'08). (These are my initial thoughts as a non-participant, so if you have any complaints or find any errors, do leave a comment!)

BibSonomy and the University of Kassel

The Challenge is being put on by four members of the Knowledge and Data Engineering team at the University of Kassel.
This team, along with about ten other project members, is behind BibSonomy. BibSonomy is one of the three (or so) major collaborative tagging systems for academic publications (the others that I know of being CiteULike and Connotea). BibSonomy supports bookmarking URLs as well as publications, but its main sell over something like del.icio.us is that it helps academics organize and share publications. One particularly nice thing about BibSonomy is that the team has always been willing to share its data with the academic community (see the FAQ for more details; CiteULike shares some data in an anonymized form).

Dataset

Most collaborative tagging systems focus on a single object type: e.g., URLs, books, products. BibSonomy is different in that users post two distinct object types: academic publications and URL bookmarks. Furthermore, the dataset (at least for the discovery challenge) has about equal numbers of (non-spam) unique publications and URLs (approximately 200,000 each). The dataset also denotes whether each user is a spammer or not, but here things are a bit less balanced. There are about ten times as many spam bookmarks as ham bookmarks, but almost no one seems to have bothered to spam academic publications. The full dataset statistics are:
  • (tag, user, object) triples {ham: 816,197; spam: 13,258,759}
  • URL bookmark objects {ham: 181,833; spam: 2,059,991}
  • academic publication objects {ham: 219,417; spam: 716}
  • users {ham: 2,467; spam: 29,248}
See the dataset page for more details about the Challenge dataset.
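
To make the shape of the data concrete, here is a minimal counting sketch in Python. The tab-separated layout it assumes (one (user, object, tag, is_spam) row per tag assignment) is purely illustrative on my part; the actual Challenge dump has its own schema.

```python
# A minimal counting sketch, not the official loader. The file layout assumed
# here (one tab-separated user, object, tag, is_spam row per tag assignment)
# is an illustration, not the actual Challenge dump format.
from collections import Counter

def count_triples(path):
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            user, obj, tag, is_spam = line.rstrip("\n").split("\t")
            counts["spam" if is_spam == "1" else "ham"] += 1
    return counts

# With the Challenge data, the ham/spam split of triples should come out to
# roughly 816,197 vs. 13,258,759, as listed above.
```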

Tasks: Tag Recommendation and Tag Spam

The Challenge consists of two tasks, tag recommendation and tag spam detection.

Tag Recommendation

Tag recommendation seems to be equivalent to what other researchers call tag suggestion, but slightly different from what I call "tag prediction." In a tag recommendation or tag suggestion task, the goal is to assist in the tagging process. As the user is typing in tags describing a particular object, the system also provides the user with a list of helpful suggestions.

In the Challenge, the "ideal" recommendation is assumed to be whatever tag the user ended up choosing, though one could imagine different definitions for what makes a recommendation "good."

For example, suppose most users in the system tag articles about music with "music", but a particular user tags such articles with "audio". A good recommendation in a real world system might be to recommend that such articles be tagged with "music" in the future so that the user has the same labels as other users in the system (enhancing discoverability of the resources in the system), but this would be a bad strategy in the Challenge.

By contrast, when I set up a "tag prediction" task, the gold standard was to detect all tags that could be applied to a particular object, rather than particular tags that particular users chose to apply to a particular object.
  • Tag Recommendation (Suggestion): Given (user, object), try to guess the tags that user will apply to that object.
  • Tag Prediction: Given (object), try to guess all potential tags that could occur in (tag, user, object) triples in the future.
Thus, the goal of "tag recommendation" is to assist and speed up tagging by users, while the goal of "tag prediction" is to guess all of the tags that could be applied to a particular object. One helps the users add tagging metadata, while the other adds tagging metadata directly.
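
To make the distinction concrete, here is a rough Python sketch of a recommendation-style baseline and one simple way to score it. This is my own illustration under assumed data structures (a history of (tag, user, object) triples), not the Challenge's evaluation code: blend the requesting user's frequent tags with the object's frequent tags, and score against whatever tags the user actually chose.

```python
# A minimal recommendation-style baseline (my own illustration, not a Challenge entry).
from collections import Counter
from typing import Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]  # (tag, user, object)

def baseline_recommender(history: Iterable[Triple], user: str, obj: str, k: int = 5) -> List[str]:
    """Blend the user's most frequent tags with the object's most frequent tags."""
    user_tags, obj_tags = Counter(), Counter()
    for tag, u, o in history:
        if u == user:
            user_tags[tag] += 1
        if o == obj:
            obj_tags[tag] += 1
    combined = user_tags + obj_tags  # simple additive blend
    return [tag for tag, _ in combined.most_common(k)]

def recall_at_k(recommended: List[str], chosen: Set[str]) -> float:
    """Score against the tags the user actually chose (the Challenge's 'ideal')."""
    return len(set(recommended) & chosen) / len(chosen) if chosen else 0.0
```

A tag prediction system, by contrast, would drop the user argument entirely and be judged on whether it recovers every tag the object could plausibly receive.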

Neither is necessarily a better or worse task, but I am struck by how even for a relatively specific goal like "predict tags in a tagging system" relatively minor details can have a huge impact on the design and applicability of solutions to the problem.

Tag Spam Detection

The spam detection task seems similar to previous spam detection tasks. The goal is to guess which users are spammers, based on previous spammers labeled by BibSonomy. (This is different from our previous work, which looked primarily at methods for prevention, ranking, and ways to simulate spam in tagging systems.)

The traditional difficulty with such tasks is defining what exactly constitutes "spam content" and how much spam content constitutes a "spammer." However, it seems likely that even if some spammers go unlabeled, or if some legitimate users are mislabeled as spammers, the relative ordering of the spam detection algorithms will be approximately the same.

The bigger issue that I am worried about with this Challenge is that most of the spam will actually be too easy to identify. Unlike the web or e-mail, where spammers have been competing against spam detection algorithms for a decade or more, tag spam is relatively new.

As a result, it seems like certain really obvious signals might catch most spam, because tag spammers are not really trying that hard, yet. For example, a quick glance at the Challenge dataset shows that BibSonomy got about 56 legitimate URLs in China (.cn) and about 127,000 spam URLs in China (.cn). So you can eliminate about 6 percent of spam URLs by just ignoring all of China. Another top level domain constitutes about 20 percent of the spam URLs and just about 1 percent of the legitimate URLs.

In fact, about 40 percent of the legitimate URLs in the dataset come from ".net", ".hu", ".org", ".edu", or ".de" while only about 9 percent of the illegitimate URLs come from those domains. In other words, there appears to be a lot of signal in the top level domains alone.
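
These percentages are easy to reproduce. The sketch below is my own back-of-the-envelope version, assuming ham_urls and spam_urls are plain lists of URL strings already separated out of the dataset.

```python
# Back-of-the-envelope TLD statistics (my own illustration, assuming plain URL lists).
from collections import Counter
from urllib.parse import urlparse

def tld_fractions(urls):
    """Fraction of URLs per top-level domain (e.g., 'cn', 'de', 'org')."""
    tlds = Counter()
    for url in urls:
        host = urlparse(url).hostname or ""
        tlds[host.rsplit(".", 1)[-1]] += 1
    total = sum(tlds.values()) or 1
    return {tld: count / total for tld, count in tlds.items()}

# Comparing tld_fractions(ham_urls) with tld_fractions(spam_urls) makes the
# skew visible: a handful of TLDs carry most of the spam and little of the ham.
```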

Likewise, of the 2,467 legitimate users in the dataset, about half of them (1,211) post academic publications, while only 113 of the spammers bother to do so. As a result, it looks like the difficulty in the task will be finding the legitimate users who are just posting URLs (as opposed to academic publications).
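
Putting the two signals together, even a deliberately naive rule goes a surprisingly long way. The sketch below is purely illustrative (the per-user inputs and the hard-coded TLD set are my assumptions, not anything from the Challenge), and it is certainly not what a winning system would look like.

```python
# A deliberately naive spam rule, for illustration only.
from urllib.parse import urlparse

# Illustrative only: which TLDs count as "spam-heavy" would be estimated from
# the training data rather than hard-coded.
SPAM_HEAVY_TLDS = {"cn"}

def looks_like_spammer(user_urls, user_publication_count,
                       spam_tlds=SPAM_HEAVY_TLDS, threshold=0.5):
    """Naive rule: no publications posted, and mostly spam-heavy TLDs."""
    if user_publication_count > 0:
        return False   # almost no spammers bother to post academic publications
    if not user_urls:
        return False   # nothing to judge
    spammy = sum(1 for u in user_urls
                 if (urlparse(u).hostname or "").rsplit(".", 1)[-1] in spam_tlds)
    return spammy / len(user_urls) >= threshold
```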

Factors That Might Impact The Challenge

There are three factors that I think will have a big impact on the Challenge:
  1. Compared to some other systems (e.g., del.icio.us), BibSonomy is relatively small. For example, the data in the dataset is equivalent to about two to four days' worth of del.icio.us bookmark postings. I wonder if this will end up having a big impact on the types of algorithms that do well, since they will have less data to work with. One of my favorite simple algorithms, association rules, might not do as well when there is less data to be had (a minimal sketch of the co-occurrence idea appears after this list).
  2. For the tag recommendation task, it seems like there might be big differences between predicting tags for academic publications and predicting them for URLs. In the former case, the dataset seems to have more text data, but on the other hand, the tags might be much more specific and sparse (e.g., "neuralnetworks" versus "funny"). Likewise, URLs may have easier tags to predict, but less text to predict them from. With more text, it might make more sense to use algorithms that look more like text categorization, whereas with less text, it might make sense to use algorithms that look more like collaborative filtering. (Our previous work looks at predicting tags based on text and tags based on other tags in the context of tag prediction; Zhichen Xu et al. looked at tag suggestion in a more collaborative-filtering-style way in "Towards the Semantic Web: Collaborative Tag Suggestions.")
  3. It also seems as though the setup of the task will force the best teams to model the user, as opposed to just the objects. Because a system does not get credit for recommending "rubyonrails" when the user chose "RoR", it seems likely that it will be important to model not just what the object is about, but what the user is likely to say the object is about. (Incidentally, machine translation faces a similar problem: if two systems translate a piece of text, and both translations differ from a gold standard, which one is better?) Furthermore, if the users tagging objects are themselves seeing BibSonomy's tag recommender, one may need to (directly or indirectly) model the user, the tag recommender, and the user's reaction to the recommender!
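
As promised in item 1, here is a minimal sketch of the association-rule-style idea: recommend tags that frequently co-occur with tags already present on a post. This is my own illustration of the general approach rather than a tuned system, and the sparsity concern above is exactly where it will hurt, since with few posts most co-occurrence counts are zero or one.

```python
# A minimal co-occurrence (association-rule-flavored) recommender sketch.
from collections import defaultdict, Counter
from itertools import permutations

def build_cooccurrence(posts):
    """posts: an iterable of tag sets, one per (user, object) post."""
    co = defaultdict(Counter)
    for tags in posts:
        for a, b in permutations(tags, 2):
            co[a][b] += 1
    return co

def recommend_from_cooccurrence(co, seed_tags, k=5):
    """Score candidate tags by how often they co-occur with the seed tags."""
    scores = Counter()
    for t in seed_tags:
        scores.update(co.get(t, Counter()))
    for t in seed_tags:  # don't recommend what is already there
        scores.pop(t, None)
    return [tag for tag, _ in scores.most_common(k)]
```
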
Ultimately, all of these reflect necessary choices made to set up the Challenge, but it will be interesting to see what impact they have on the winning systems. Good luck to all the teams!

Update (2008-06-20): A small clarification: it looks like the tag recommendation task will be somewhere in between predicting tags based on (user, object) and predicting tags based on (object). The tag recommendation submissions will be in the form (object, tag1, tag2, tag3...), so systems will not really be predicting tags for a particular user. On the other hand, none of the objects seem to have really complete coverage of all of the tags that could apply to them, so the task is not exactly tag prediction either.

Update (2008-06-23): Actually, contrary to the previous update, it looks like the tag recommendation task will be based on (user, object) after all. The source of confusion is that content_id actually identifies a (user, object) combination rather than just an object. This mailing list posting gives a few more details.

