
Stanford Entity Resolution Framework: An Overview (Posted by Steven Whang)


The goal of the Stanford Entity Resolution Framework (SERF) project is to develop a generic infrastructure for Entity Resolution (ER). ER (also known as deduplication, or record linkage) is an important information integration problem: The same "real-world entities" (e.g., customers, or products) are referred to in different ways in multiple data records. For instance, two records on the same person may provide different name spellings, and addresses may differ. The goal of ER is to "resolve" entities, by identifying the records that represent the same entity and reconciling them to obtain one record per entity.

Early Work

Our initial work focuses on ER performance. We first identify three main challenges of ER: matching two records, merging two records that match, and chaining, where we iteratively compare merged records with other records to discover additional matches. We abstract the first two operations into binary match and merge functions, which we treat as black-box functions provided by the user. Given these functions, we derive ER solutions using the minimum number of potentially expensive record comparisons (thus our approach focuses on performance rather than accuracy). We also identify important properties of the match and merge functions that make ER natural and efficient. Our Swoosh algorithms are provably optimal in the number of record comparisons in worst-case scenarios.
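To make the black-box abstraction concrete, here is a minimal Python sketch in the spirit of the Swoosh approach. The record representation and the re-processing strategy are illustrative assumptions of mine, not the paper's exact algorithm, and this toy version makes no optimality claims:

```python
def swoosh(records, match, merge):
    """Generic ER with black-box match/merge functions (illustrative sketch).

    records: list of records (any type the match/merge functions accept)
    match(r1, r2) -> bool and merge(r1, r2) -> record are user-provided.
    """
    result = []                 # records resolved so far
    pending = list(records)     # records still to be processed
    while pending:
        r = pending.pop()
        partner = next((s for s in result if match(r, s)), None)
        if partner is None:
            result.append(r)    # r matches nothing resolved so far
        else:
            # a merged record may trigger new matches (chaining),
            # so remove the partner and re-process the merged record
            result.remove(partner)
            pending.append(merge(r, partner))
    return result
```

For example, with records modeled as sets of aliases, match as non-empty intersection, and merge as union, the first three records of `[{"a"}, {"a", "b"}, {"b", "c"}, {"x"}]` chain together into a single resolved record.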

In addition, we have addressed the following challenges:

Distribution: We distribute the operation of Swoosh on several machines. The important property to satisfy is that for any two records, there exists a machine that compares them. We investigate several schemes that guarantee this property while parallelizing ER.

Confidence: We consider numerical confidences associated with records and derive an ER result consisting of high-confidence records. The challenge is to perform ER efficiently when confidences are involved.

Negative Rules: We drop the assumption that the match and merge functions are always correct, and specify integrity checks in the form of negative rules to derive a consistent ER result using the match and merge functions.

Current Work

More recently, we have been focusing on ER scalability. Although Swoosh is an efficient algorithm, it is exhaustive: all record pairs may be compared (making the algorithm quadratic), so it cannot run on very large datasets (say, millions of records). Various blocking techniques can be used to scale ER: we divide records into (possibly overlapping) blocks, assuming that records in different blocks are not likely to match each other. We thus save a significant amount of processing by running ER on one block at a time. The problem with blocking is that it may miss matching records. Many works have focused on improving the accuracy and performance of blocking by figuring out the correct "blocking criteria" to use.

Complementing various ER and blocking techniques, we propose a novel generic framework for blocking (called iterative blocking) where the ER results of blocks are propagated to other blocks. Blocks are iteratively processed until no block contains any more matching records. Unlike simple blocking, we now have an additional step of re-distributing the merged records generated from one block to other blocks. This process can improve the accuracy of ER (by possibly identifying additional record matches) while improving performance (by saving the processing time for other blocks) compared to simple blocking.
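The following Python sketch illustrates the iterative-blocking idea under my own simplifying assumptions (hashable records, a `block_keys` function assigning each record to its blocks, and well-behaved union-like merges so the loop terminates); it is not the paper's exact algorithm:

```python
def find_match(blk, match):
    """Return some matching pair of distinct records in blk, or None."""
    for r in blk:
        for s in blk:
            if r is not s and match(r, s):
                return r, s
    return None

def iterative_blocking(records, block_keys, match, merge):
    """Iterative-blocking sketch: resolve each block with the black-box
    match/merge functions, push every merged record into the other blocks
    it belongs to, and repeat until no block produces further merges.
    block_keys(r) returns the set of block ids for record r."""
    blocks = {}
    for r in records:                      # build (possibly overlapping) blocks
        for k in block_keys(r):
            blocks.setdefault(k, set()).add(r)
    changed = True
    while changed:
        changed = False
        for k in list(blocks):             # snapshot: redistribution may add keys
            blk = blocks[k]
            pair = find_match(blk, match)
            while pair:
                r, s = pair
                m = merge(r, s)
                blk -= {r, s}
                blk.add(m)
                for k2 in block_keys(m):   # redistribute the merged record
                    blocks.setdefault(k2, set()).add(m)
                changed = True
                pair = find_match(blk, match)
    result = set()                         # union of all blocks = ER result
    for blk in blocks.values():
        result |= blk
    return result
```

With records as frozensets of aliases and one block per alias, a chain like {a,b}, {b,c}, {c,d} resolves to a single record even though no single initial block contains all three, which is exactly the effect redistribution buys over simple blocking.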


Thoughts on ECML PKDD Discovery Challenge (RSDC'08) (Posted by Paul Heymann)

Over the past few years, I have looked into a variety of problems in collaborative tagging systems, like whether they can be better organized, whether they can help web search, and how to avoid spam. Most recently, I have been looking at tag prediction; I will be presenting a paper called "Social Tag Prediction" at SIGIR'08 next month (more details on that in a later blog post). As such, I am very curious to see the outcome of the ECML PKDD Discovery Challenge (RSDC'08). (These are my initial thoughts as a non-participant, so if you have any complaints or find any errors, do leave a comment!)

BibSonomy and the University of Kassel

The Challenge is being put on by four members of the Knowledge and Data Engineering team at the University of Kassel.
This team, along with about ten other project members, is behind BibSonomy. BibSonomy is one of the three (or so) major collaborative tagging systems for academic publications (the others that I know of being CiteULike and Connotea). BibSonomy supports bookmarking URLs as well as publications, but its main selling point is that it helps academics organize and share publications. One particularly nice thing about BibSonomy is that the team has always been willing to share its data with the academic community (see the FAQ for more details; CiteULike shares some data in an anonymized form).


Most collaborative tagging systems focus on a single object type: e.g., URLs, books, products. BibSonomy is different in that users post two distinct object types: academic publications and URL bookmarks. Furthermore, the dataset (at least for the discovery challenge) has about equal numbers of (non-spam) unique publications and URLs (approximately 200,000 each). The dataset also denotes whether each user is a spammer or not, but here things are a bit less balanced. There are about ten times as many spam bookmarks as ham bookmarks, but almost no one seems to have bothered to spam academic publications. The full dataset statistics are:
  • (tag,user,object) triples {ham: 816197, spam: 13258759}
  • URL bookmark objects {ham: 181833, spam: 2059991}
  • academic publication objects {ham: 219417, spam: 716}
  • users {ham: 2467, spam: 29248}
See the dataset page for more details about the Challenge dataset.
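Given counts like these, the class imbalance is easy to quantify. A few lines of Python (numbers copied from the list above) compute the spam fraction per category:

```python
# Challenge dataset counts as reported above
stats = {
    "triples":      {"ham": 816_197, "spam": 13_258_759},
    "bookmarks":    {"ham": 181_833, "spam": 2_059_991},
    "publications": {"ham": 219_417, "spam": 716},
    "users":        {"ham": 2_467,   "spam": 29_248},
}

for name, c in stats.items():
    frac = c["spam"] / (c["spam"] + c["ham"])
    print(f"{name:12s} spam fraction: {frac:.1%}")
```

The publication side is almost entirely ham, while triples, bookmarks, and users are each over 90% spam.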

Tasks: Tag Recommendation and Tag Spam

The Challenge consists of two tasks, tag recommendation and tag spam detection.

Tag Recommendation

Tag recommendation seems to be equivalent to what other researchers call tag suggestion, but slightly different from what I call "tag prediction." In a tag recommendation or tag suggestion task, the goal is to assist in the tagging process. As the user is typing in tags describing a particular object, the system also provides the user with a list of helpful suggestions.

In the Challenge, the "ideal" recommendation is assumed to be whatever tag the user ended up choosing, though one could imagine different definitions for what makes a recommendation "good."

For example, suppose most users in the system tag articles about music with "music", but a particular user tags such articles with "audio". A good recommendation in a real world system might be to recommend that such articles be tagged with "music" in the future so that the user has the same labels as other users in the system (enhancing discoverability of the resources in the system), but this would be a bad strategy in the Challenge.

By contrast, when I set up a "tag prediction" task, the gold standard was to detect all tags that could be applied to a particular object, rather than particular tags that particular users chose to apply to a particular object.
  • Tag Recommendation (Suggestion): Given (user,object) try to guess tag.
  • Tag Prediction: Given (object) try to guess all potential tags that could occur in (tag,user,object) triples in the future.
Thus, the goal of "tag recommendation" is to assist and speed up tagging by users, while the goal of "tag prediction" is to guess all of the tags that could be applied to a particular object. One helps the users add tagging metadata, while the other adds tagging metadata directly.

Neither is necessarily a better or worse task, but I am struck by how even for a relatively specific goal like "predict tags in a tagging system" relatively minor details can have a huge impact on the design and applicability of solutions to the problem.

Tag Spam Detection

The spam detection task seems similar to previous spam detection tasks. The goal is to guess which users are spammers, based on previous spammers labeled by BibSonomy. (This is different from our previous work, which looked primarily at methods for prevention, ranking, and ways to simulate spam in tagging systems.)

The traditional difficulty with such tasks is defining what exactly constitutes "spam content" and how much spam content constitutes a "spammer." However, it seems likely that even if some spammers go unlabeled, or if some legitimate users are mislabeled as spammers, the relative ordering of the spam detection algorithms will be approximately the same.

The bigger issue that I am worried about with this challenge is that most of the spam signals will actually be too easy to identify. Unlike the web or e-mail where spammers have been competing against spam detection algorithms for a decade or more, tag spam is relatively new.

As a result, it seems like certain really obvious signals might catch most spam, because tag spammers are not really trying that hard, yet. For example, a quick glance at the Challenge dataset shows that BibSonomy got about 56 legitimate URLs in China (.cn) and about 127,000 spam URLs in China (.cn). So you can eliminate about 6 percent of spam URLs by just ignoring all of China. Another top level domain constitutes about 20 percent of the spam URLs and just about 1 percent of the legitimate URLs.

In fact, about 40 percent of the legitimate URLs in the dataset come from ".net", ".hu", ".org", ".edu", or ".de" while only about 9 percent of the illegitimate URLs come from those domains. In other words, there appears to be a lot of signal in the top level domains alone.
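As a toy illustration of how much signal sits in the TLD alone, here is a trivial baseline scorer. The helper names and the threshold values are my own invention for this sketch, not tuned on the Challenge data:

```python
from urllib.parse import urlsplit

# TLDs the statistics above suggest are ham-heavy
# (~40% of legitimate URLs vs ~9% of spam URLs)
HAM_HEAVY_TLDS = {"net", "hu", "org", "edu", "de"}

def tld(url):
    """Extract the last label of a URL's hostname ('foo.example.cn' -> 'cn')."""
    host = urlsplit(url).hostname or ""
    return host.rsplit(".", 1)[-1].lower()

def tld_spam_score(url):
    """Toy baseline: score in [0, 1], higher = more spam-like.
    The numeric scores are illustrative assumptions."""
    t = tld(url)
    if t == "cn":                # almost all .cn URLs in the dump are spam
        return 0.95
    if t in HAM_HEAVY_TLDS:      # TLDs where ham is over-represented
        return 0.2
    return 0.5                   # uninformative TLD
```

A real entry would of course combine many such features, but a baseline this crude already separates a surprising share of the dataset.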

Likewise, of the 2,467 legitimate users in the dataset, about half of them (1,211) post academic publications, while only 113 of the spammers bother to do so. As a result, it looks like the difficulty in the task will be finding the legitimate users who are just posting URLs (as opposed to academic publications).

Factors That Might Impact The Challenge

There are three factors that I think will have a big impact on the Challenge:
  1. Compared to some other systems, BibSonomy is relatively small. For example, the data in the dataset is equivalent to about two to four days' worth of bookmark postings. I wonder if this will end up having a big impact on the types of algorithms that will do well, in that they will have less data to work with. One of my favorite simple algorithms, association rules, might not do as well when there is less data to be had.
  2. For the tag recommendation task, it seems like there might be big differences between predicting tags for academic publications and predicting them for URLs. In the former case, the dataset seems to have more text data, but on the other hand, the tags might be much more specific and sparse (e.g., "neuralnetworks" versus "funny"). Likewise, URLs may have easier tags to predict, but less text to predict them based on. With more text, it might make more sense to use algorithms that look more like text categorization, whereas with less text, it might make sense to use algorithms that look more like collaborative filtering. (Our previous work looks at predicting tags based on text and tags based on other tags in the context of tag prediction, Zhichen Xu et al. looked at tag suggestion in a more collaborative filtering style way in "Towards the Semantic Web: Collaborative Tag Suggestions.")
  3. It also seems as though the setup of the task will force the best teams to model the user, as opposed to just the objects. Because a system does not get credit for recommending "rubyonrails" when the user chose "RoR", it seems likely that it will be important to model not just what the object is about, but what the user is likely to say the object is about. (Incidentally, machine translation also faces this problem: if two systems translate a piece of text, and their translations are both different from a gold standard, which one is better?) Furthermore, if the users tagging objects are themselves seeing BibSonomy's tag recommender, one may need to (directly or indirectly) model the user, the tag recommender, and the user's reaction to the recommender!
Ultimately, all of these reflect necessary choices made to set up the Challenge, but it will be interesting to see what impact they have on the winning systems. Good luck to all the teams!

Update (2008-06-20): A small clarification, it looks like the tag recommendation task will be somewhere in between predicting tags based on (user, object) and predicting tags based on (object). The tag recommendation submissions will be in the form (object, tag1, tag2, tag3...) so systems will not really be predicting tags for a particular user. On the other hand, none of the objects seem to have really complete coverage of all of the tags that could apply to them, so the task is not exactly tag prediction either.

Update (2008-06-23): Actually, contrary to the previous update, it looks like the tag recommendation task will be based on (user, object) after all. The confusion is that content_id is actually a combination of both user and object as opposed to object. This mailing list posting gives a few more details.


Report: 2008 Berkeley Database Self-Assessment (Posted by Hector Garcia-Molina)

Periodically, a group of self-anointed database "experts" meets to assess the state of the database field. The most recent of these self-assessment meetings was held in Berkeley, CA, on May 29-30, 2008.

The Berkeley meeting was the fifth in the series.

The four earlier meetings were held in Laguna Beach (1988), Stanford (1990), Asilomar (1998), and Lowell, MA (2003). After each meeting, a report has been written; the previous reports are available online.
Typically, each report is written by one or two of the participants, trying to summarize the discussion at the meeting. The rest of the participants get to put their name on the report after providing comments. The report does not capture everyone's opinion, and more than once I have been surprised to find material in the report that I did not recall hearing at the meeting. But that is not a bad thing: The reports are usually more coherent than the meetings!

At this year's meeting, many of the same topics from previous years made an appearance: data integration, privacy, and scalability are some old-time favorites. However, some "new" topics surfaced as well (by new I mean new to these reports). For instance, this time we discussed social networking and virtual worlds as two applications that may require new data management technology.

The topic that generated the most passionate discussion was that of academic publications and conferences. Given that the meeting was a few days after VLDB rejection notices had been mailed out, it is not surprising that people were complaining about the poor review process these days. It seems reviewers are more interested in looking for ways to kill a paper ("my paper was not cited", "the margins are a nanometer too wide", "the idea is useful but there are not enough theorems") than in identifying promising ideas that will have impact. I suspect there will not be much discussion on this topic in the workshop report, as it was felt that these issues are best handled by the organizations that run conferences, such as SIGMOD, VLDB and ICDE. I would not hold my breath...

After the workshop, all participants attended the Jim Gray Tribute on May 31, also at Berkeley.

There were about 800 participants at the tribute. It was a very moving event. It was great to see so many friends from the database and systems communities, but it was unfortunate that it took Jim's disappearance to bring us all together.

We'll let all of our Stanford InfoBlog readers know when the report for the 2008 Berkeley self-assessment appears... Stay tuned.


Simrank++: Putting together 3 simple ideas (Posted by Ioannis Antonellis)

Two ideas that really pushed web search further came from two completely different directions.

Two ideas

The first one, link analysis, came into play as a tool for improving the quality of search results. The idea, as people express it today, is simple, yet its power has proven invaluable: the more web pages link to your web page, the more important (and popular) your web page is deemed to be.

The second idea, sponsored search, allowed search engines to make revenue, grow and keep providing their service for free. The idea again is simple: The search engine sells each keyword to up to 10 advertisers and it displays their ads to the users that issue a query with that keyword. Since the keywords are not real goods that you can easily set a price for, the search engines rely on auctions to determine the price for each keyword. (Had the search engines been able to set fixed keyword prices, many Ph.D. students, including myself, would have no thesis topic to work on...)

Another idea: The wisdom of the crowds

But my favorite simple idea that has been floating around recently is that of exploiting the wisdom of the crowds: the information, in all its forms, that billions of web users provide. The most concrete and recent example of a successful application based on this idea is Wikipedia. Now that this large repository of knowledge has been created, many researchers have started to think about using it to improve existing data mining methods.

Another set of techniques based on the wisdom of the crowds, still immature and under development, uses click logs as a source of the implicit information that web search users provide.

The paper

In our paper "Simrank++: Query Rewriting through Link Analysis of the Click Graph", which will be presented at VLDB 2008 in Auckland, New Zealand, we try to put all three of these ideas into play together. Specifically, we develop a link analysis technique for click graphs that can be used to increase the revenue of a sponsored search engine.

So, what is a click graph, how exactly do we analyze its link structure and how does this improve the revenue of a search engine?

Click graph

The click graph is a bipartite graph with query nodes on one side, ad nodes on the other, and an edge between a query and an ad whenever a user who issued the query clicked on the ad. In addition, each edge carries a weight corresponding to the number of clicks the query brought to the ad.
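A click graph of this kind is straightforward to build from a log of (query, ad) click events. The input format below is an assumption of mine for illustration; the sketch keeps both the edge weights and the adjacency for each side:

```python
from collections import defaultdict

def build_click_graph(click_log):
    """Build a weighted bipartite click graph from (query, ad) click events."""
    weights = defaultdict(int)      # (query, ad) -> number of clicks
    ads_of = defaultdict(set)       # query -> ads it is connected to
    queries_of = defaultdict(set)   # ad -> queries it is connected to
    for query, ad in click_log:
        weights[(query, ad)] += 1
        ads_of[query].add(ad)
        queries_of[ad].add(query)
    return weights, ads_of, queries_of
```

For instance, two clicks on the same ad from the query "camera" yield a single edge of weight 2, not two edges.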

How to increase search engine revenue

The potential for revenue increase comes from queries that no advertiser has bid on. Since the search engine has no indication which ads are relevant to those queries, there are no ads it can display. Given such a query, if there was a way to guess which ads are relevant (but for some reason the associated advertisers decided not to bid on it), we could display the relevant ads and keep both the advertisers and the search engine happy.

The solution comes from the click graph: the wisdom of the crowds. Given a query, we use the click graph to generate a list of query rewrites, i.e., of other queries that are "similar" to it. For example, for the query "camera," the queries "digital camera" and "photography" may be useful because the user may also be interested in ads for those related queries. The query "battery" may also be useful because users that want a camera may also be in the market for a spare battery. Both the query and its rewrites can then be used to find ads.

The schemes we present analyze the connections in the click graph to identify rewrites that may be useful. Our techniques identify not only queries that are directly connected by an ad (e.g., users that submit either "mp3" or "i-tunes" click on an ad for "iPod") but also queries that are more indirectly related.


The intuition of our solution, Simrank++, is the following:
  • two queries are similar if they are connected to similar ads and two ads are similar if they are connected to similar queries.
Sounds confusing, but it reveals the power of recursion. Initially each query/ad is only similar to itself, but by continuously applying these two rules, you end up with a similarity score for each pair of queries.
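To make the mutual recursion concrete, here is a sketch of basic (unweighted) SimRank on a bipartite click graph. Note that Simrank++ itself extends this with edge weights and an "evidence" factor, which this toy version omits; the decay constant C and iteration count are conventional defaults, not the paper's settings:

```python
def simrank_bipartite(ads_of, queries_of, C=0.8, iters=7):
    """Basic SimRank on a bipartite click graph (unweighted edges).
    ads_of: query -> set of ads; queries_of: ad -> set of queries."""
    queries, ads = list(ads_of), list(queries_of)

    def identity(nodes):
        # initially every node is similar only to itself
        return {(x, y): 1.0 if x == y else 0.0 for x in nodes for y in nodes}

    def step(sim_other, nodes, neighbors):
        # new similarity of (x, y) = C * average similarity of their neighbors
        new = {}
        for x in nodes:
            for y in nodes:
                if x == y:
                    new[(x, y)] = 1.0
                elif neighbors[x] and neighbors[y]:
                    total = sum(sim_other[(u, v)]
                                for u in neighbors[x] for v in neighbors[y])
                    new[(x, y)] = C * total / (len(neighbors[x]) * len(neighbors[y]))
                else:
                    new[(x, y)] = 0.0
        return new

    sim_q, sim_a = identity(queries), identity(ads)
    for _ in range(iters):
        # simultaneous update: both sides use the previous iteration's scores
        sim_q, sim_a = (step(sim_a, queries, ads_of),
                        step(sim_q, ads, queries_of))
    return sim_q, sim_a
```

On a tiny graph where "mp3" and "i-tunes" both click on the same ad, the two queries end up with a high similarity score even though they share no words, which is exactly the behavior the rewriting scheme exploits.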

Are these scores always correct? How can we even define what a correct score is? And how does everything change when you try to take into account the edge weights of the click graph in the computation of the similarity scores?

The answers to these more detailed questions are in the paper...
