Stanford Entity Resolution Framework: An Overview (Posted by Steven Whang)
The goal of the Stanford Entity Resolution Framework (SERF) project is to develop a generic infrastructure for Entity Resolution (ER). ER (also known as deduplication or record linkage) is an important information integration problem: the same "real-world entities" (e.g., customers or products) are referred to in different ways in multiple data records. For instance, two records on the same person may spell the name differently, and the addresses may differ. The goal of ER is to "resolve" entities by identifying the records that represent the same entity and reconciling them to obtain one record per entity.
Our initial work focuses on ER performance. We identify three main challenges of ER: matching two records, merging two records that match, and chaining, where merged records are iteratively compared with other records to discover additional matches. We abstract the first two operations into binary match and merge functions, which we treat as black-box functions provided by the user. Given these functions, we derive ER solutions that use the minimum number of potentially expensive record comparisons (our approach thus focuses on performance rather than accuracy). We also identify important properties of the match and merge functions that make ER natural and efficient. Our Swoosh algorithms are provably optimal in the number of record comparisons in the worst case.
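The black-box model above can be sketched as a small Python loop in the spirit of the Swoosh algorithms. The record representation and the match/merge rules below are illustrative assumptions, not the actual SERF implementation:

```python
def swoosh(records, match, merge):
    """Generic ER sketch: match and merge are user-supplied black boxes;
    merged records are compared again against the partial result (chaining)."""
    result = []              # records with no known match among themselves
    queue = list(records)    # records still to be processed
    while queue:
        r = queue.pop()
        # find a record in the partial result that matches r, if any
        buddy = next((s for s in result if match(r, s)), None)
        if buddy is None:
            result.append(r)
        else:
            result.remove(buddy)
            queue.append(merge(r, buddy))  # re-process the merged record
    return result
```

For example, with records holding sets of names and e-mail addresses, a match that checks for a shared e-mail, and a merge that unions both fields, chaining lets a merged record match a third record that neither original record matched on its own.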
In addition, we have addressed the following challenges:
Distribution: We distribute the operation of Swoosh on several machines. The important property to satisfy is that for any two records, there exists a machine that compares them. We investigate several schemes that guarantee this property while parallelizing ER.
Confidence: We consider numerical confidences associated with records and derive an ER result consisting of high-confidence records. The challenge is to perform ER efficiently when confidences are involved.
Negative Rules: We drop the assumption that the match and merge functions are always correct, and specify integrity checks in the form of negative rules to derive a consistent ER result using the match and merge functions.
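The distribution property above (for any two records, some machine compares them) can be illustrated with a simple bucket-pair scheme: hash each record to one of k buckets and dedicate one "machine" to each unordered pair of buckets. The function below is a hypothetical sketch of that property, not one of the actual schemes studied in the project:

```python
from itertools import combinations_with_replacement

def distribute(records, bucket, num_buckets):
    """Assign each record to every machine responsible for its bucket.
    One machine handles each unordered bucket pair (i, j), so any two
    records meet on the machine for their pair of buckets."""
    machines = {pair: []
                for pair in combinations_with_replacement(range(num_buckets), 2)}
    for r in records:
        b = bucket(r) % num_buckets
        for pair, assigned in machines.items():
            if b in pair:
                assigned.append(r)
    return machines
```

Note that this naive scheme needs on the order of k^2 machines and replicates each record k times; the trade-off between replication and parallelism is exactly what more refined schemes address.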
More recently, we have been focusing on ER scalability. Although Swoosh is efficient, it is exhaustive: every pair of records may be compared (making the algorithm quadratic), so it cannot run on very large datasets (say, millions of records). Various blocking techniques can be used to scale ER: we divide the records into (possibly overlapping) blocks, assuming that records in different blocks are unlikely to match each other. We thus save a significant amount of processing by running ER on one block at a time. The drawback of blocking is that it may miss matching records. Much work has focused on improving the accuracy and performance of blocking by identifying the right "blocking criteria" to use.
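A minimal sketch of plain (non-iterative) blocking, assuming a hypothetical scalar blocking key such as a zip code: grouping n records into blocks shrinks the comparison count from n(n-1)/2 to the sum of the per-block pair counts.

```python
from collections import defaultdict

def make_blocks(records, key):
    """Partition records into blocks by a blocking key; ER is then run
    independently on each block, comparing only within-block pairs."""
    blocks = defaultdict(list)
    for r in records:
        blocks[key(r)].append(r)
    return blocks
```

For instance, six records split into two blocks of three require only 3 + 3 = 6 within-block comparisons instead of 15, but any true match that straddles the two blocks is lost.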
Complementing various ER and blocking techniques, we propose a novel generic framework for blocking (called iterative blocking) in which the ER results of one block are propagated to the other blocks. Blocks are iteratively processed until no block contains any matching records. Unlike standard blocking, there is an additional step of re-distributing the merged records generated in one block to the other blocks. Compared to simple blocking, this process can improve the accuracy of ER (by identifying additional record matches) while also improving performance (by saving processing time in other blocks).
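A toy version of the iterate-and-redistribute loop, again with illustrative record formats and match/merge rules rather than the project's actual implementation. Each key function here maps a record to the set of blocks it belongs to, so a merged record is automatically re-distributed to every block it now qualifies for:

```python
from collections import defaultdict
from itertools import combinations

def iterative_blocking(records, keys, match, merge):
    """Iterative-blocking sketch: blocks are rebuilt and reprocessed, with
    merged records re-distributed, until no block contains a matching pair."""
    pool = list(records)
    changed = True
    while changed:
        changed = False
        blocks = defaultdict(list)
        for r in pool:
            for i, key in enumerate(keys):
                for v in key(r):            # one block per (criterion, value)
                    blocks[(i, v)].append(r)
        for members in blocks.values():
            for a, b in combinations(members, 2):
                # skip records already consumed by an earlier merge
                if a in pool and b in pool and match(a, b):
                    pool.remove(a)
                    pool.remove(b)
                    pool.append(merge(a, b))
                    changed = True
    return pool
```

As an example of the benefit: if record 1 matches record 2 on zip code and record 2 matches record 3 on phone number, plain per-block ER produces two overlapping merged records, whereas the loop above propagates the first merge into the phone block and converges to a single record.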