Matching Hierarchies Using Shared Objects (Posted by Robert Ikeda)
At Stanford, we have studied the problem of hierarchy matching, in particular, how to determine corresponding edges between two related hierarchies. The need to match hierarchies arises in many situations where the objects come from different systems. In our book hierarchy example above, we may want to combine the book catalogs of two different libraries; in our tourism example, we may want to build a meta-web-site that combines the resources of two or more French tourism sites.
Traditionally, hierarchy matching is done by comparing the text associated with each edge. In contrast, we explored using the placement of objects present in both hierarchies to infer how the hierarchies relate. We chose to explore this alternative approach since the text matching method is not foolproof. For instance, there could be different wordings in the facets; one book hierarchy may refer to "drugs" while the other uses the term "medications."
Suppose we are matching two book hierarchies and both hierarchies contain a particular book. We would infer from the presence of this shared book in the two hierarchies that the two corresponding paths in the hierarchies have a relationship; perhaps they share edges in common. Given many shared objects, we would have a lot of information about possible relationships to reconcile, and how to best reconcile the information is not obvious. We developed two algorithms (one rule-based, one statistics-based) that use shared objects to determine feasible facets for a second hierarchy, given a hierarchy with known facets (label-value pairs that define what objects are placed under an edge).
We ran experiments on both real and synthetic data and compared the performances of the two algorithms. What we found was that given clean synthetic data (data that conformed to the properties we encoded in our rule-based algorithm), our rule-based algorithm performed consistently better than the statistics-based algorithm. However, with real data, or even with some of our synthetic data, the statistics-based algorithm proved to be more robust. For details, please see here.