Wednesday, October 22, 2008
Generic Entity Resolution with Negative Rules (Posted by Steven Whang)
Entity Resolution (ER) is a process of identifying records that refer to the same real-world entity and merging them together. For example, two companies that merge may want to combine their customer records: for a given customer that dealt with the two companies they create a composite record that combines the known information.
However, the process for matching and merging records is most often application-specific, complex, and error-prone. The input records may contain ambiguous and not-fully specified data, and it may be impossible to capture all the application nuances and subtleties in whatever logic is used to decide when records match and how they should be merged. Thus, the set of resolved records (after ER) may contain "errors" that may be apparent to a domain specialist. For example, we may have a customer record with an address in a country we do not do business with. Or two different company records where the expert happens to know that one company recently required the other so they are now the same entity.
In our paper, we address the identification and handling of such inconsistencies using negative rules, which are predicates that take a number of records and return whether the records are consistent or not. For example, if ER mistakenly merges two person records with different genders, we can specify a binary negative rule that flags an inconsistency for any record that has both genders. The negative rules are then used with the original ER rules to find a consistent ER solution.
There are two reasons why we cannot simply modify the ER rules instead of using the negative rules. First, the negative rules are only used to check the consistency of the final ER result and not the consistency of the intermediate records that are created during the ER process. For example, a record that has both genders may turn out to be Female (based on more evidence) at the end of the ER process (and thus consistent). Second, the negative rules can be viewed as an effort of a second party to fix the errors made by the ER rules of the first party.
We propose a general algorithm that finds a consistent ER solution and an enhanced algorithm that is more efficient than the general algorithm, but assumes certain properties on the negative and ER rules. Our main findings are as follows:
However, the process for matching and merging records is most often application-specific, complex, and error-prone. The input records may contain ambiguous and not-fully specified data, and it may be impossible to capture all the application nuances and subtleties in whatever logic is used to decide when records match and how they should be merged. Thus, the set of resolved records (after ER) may contain "errors" that may be apparent to a domain specialist. For example, we may have a customer record with an address in a country we do not do business with. Or two different company records where the expert happens to know that one company recently required the other so they are now the same entity.
In our paper, we address the identification and handling of such inconsistencies using negative rules, which are predicates that take a number of records and return whether the records are consistent or not. For example, if ER mistakenly merges two person records with different genders, we can specify a binary negative rule that flags an inconsistency for any record that has both genders. The negative rules are then used with the original ER rules to find a consistent ER solution.
There are two reasons why we cannot simply modify the ER rules instead of using the negative rules. First, the negative rules are only used to check the consistency of the final ER result and not the consistency of the intermediate records that are created during the ER process. For example, a record that has both genders may turn out to be Female (based on more evidence) at the end of the ER process (and thus consistent). Second, the negative rules can be viewed as an effort of a second party to fix the errors made by the ER rules of the first party.
We propose a general algorithm that finds a consistent ER solution and an enhanced algorithm that is more efficient than the general algorithm, but assumes certain properties on the negative and ER rules. Our main findings are as follows:
- Negative rules can significantly improve the accuracy of the ER process. Depending on the strategy, a domain expert may need to help resolve the inconsistencies.
- Different combinations of negative rules may result in different accuracy and runtime results. The negative rules themselves also need to be correct and effectively pinpoint the errors of the ER rules.
- Applying negative rules is an expensive process (at least quadratic) that should not be run on the entire dataset if the dataset is very large. A common technique in ER is to use blocking techniques (many variations exist) where the entire dataset is divided into smaller blocks, and the blocks are processed one at a time. The negative rules can be used within each block.
- Even without the properties, the enhanced algorithm can efficiently produce results that are nearly identical to those of the general algorithm.