Certain Answers From Uncertain Data (Posted by Parag Agrawal)

In his blog entry, Anish argues that maintaining uncertainty through the data management system (Approach-M) can yield benefits. Processing uncertain data correctly involves capturing dependencies (or correlations) along with probability values in the system. For example, consider a simplified weather forecasting system that predicts rain in either Palo Alto or Sunnyvale with probabilities .3 and .7, because of uncertainty about wind direction. It also predicts rain in Fremont with probability .2 if it rains in Palo Alto, since Fremont is downwind from Palo Alto (the uncertainty here perhaps being with respect to wind speed). Similarly, it predicts rain in Milpitas with probability .5 if it rains in Sunnyvale. Capturing these correlations correctly lets us conclude that at least one of Sunnyvale or Fremont will be dry. While such correlations are crucial to drawing correct conclusions, end users may often prefer a final result that is simpler and more certain.
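
The conclusion about Sunnyvale and Fremont can be checked by enumerating possible worlds. The sketch below is my own encoding of the example's (hypothetical) dependency structure, not any particular system's representation:

```python
# Enumerate possible worlds of the toy weather model, honoring correlations:
# rain falls in exactly one of Palo Alto (p=.3) or Sunnyvale (p=.7);
# Fremont can rain (p=.2) only if Palo Alto does; Milpitas (p=.5) only if
# Sunnyvale does.
from itertools import product

def world_prob(pa_rain, fremont_rain, milpitas_rain):
    """Probability of one possible world; Sunnyvale rains iff Palo Alto doesn't."""
    sv_rain = not pa_rain
    p = 0.3 if pa_rain else 0.7
    if pa_rain:                       # Fremont depends on Palo Alto rain
        p *= 0.2 if fremont_rain else 0.8
    elif fremont_rain:
        return 0.0                    # impossible: Fremont rain without Palo Alto rain
    if sv_rain:                       # Milpitas depends on Sunnyvale rain
        p *= 0.5
    elif milpitas_rain:
        return 0.0
    return p

worlds = [(pa, fr, mi, world_prob(pa, fr, mi))
          for pa, fr, mi in product([True, False], repeat=3)]
worlds = [w for w in worlds if w[3] > 0]

assert abs(sum(p for *_, p in worlds) - 1.0) < 1e-9
# In every possible world, at least one of Sunnyvale or Fremont is dry:
assert all(not ((not pa) and fr) for pa, fr, mi, p in worlds)
print(f"P(rain in Fremont) = {sum(p for pa, fr, mi, p in worlds if fr):.2f}")
```

Note that ignoring the correlations (treating the cities as independent) would wrongly assign positive probability to rain in both Sunnyvale and Fremont.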

Allowing the user to compute the most likely answers is a common way to provide a simple-to-use result. The result may be restricted to high-probability answers using a confidence threshold, or a top-k by confidence query. This paper is part of the large body of work that addresses this problem. For the example above, a user might only want a travel warning when the chance of rain in a city of interest exceeds a threshold (say .5). This can be posed as a confidence threshold query with a predicate restricting the search to the cities of interest to the user. Queries like this just "clean up" the result to remove some of the uncertainty, allowing the user to "zoom in" on the interesting information. I am interested in exploring other ways of cleaning uncertainty that may be useful to some applications.
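
A threshold query over already-computed marginals is just a filter. A minimal sketch, using per-city probabilities that follow from the (hypothetical) weather example above:

```python
# Confidence-threshold query sketch: keep only cities of interest whose
# rain probability exceeds the user's threshold. The marginals below
# follow from the toy example (Fremont = .3 * .2, Milpitas = .7 * .5).
rain_prob = {"Palo Alto": 0.3, "Sunnyvale": 0.7, "Fremont": 0.06, "Milpitas": 0.35}

cities_of_interest = {"Sunnyvale", "Fremont"}   # the user's predicate
THRESHOLD = 0.5

warnings = [c for c in cities_of_interest if rain_prob[c] > THRESHOLD]
print(warnings)   # only Sunnyvale exceeds the .5 threshold
```

A top-k by confidence query would instead sort the candidate cities by probability and keep the k highest.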

While the techniques above return more certain answers, they don't resolve any uncertainty. Can throwing more data at the problem improve results by actually reconciling uncertainty? Consider weather forecast information from multiple sources: each could be uncertain, and they could be mutually inconsistent or mutually reinforcing. Can careful resolution of these data sources yield better, more certain results? I am betting that the answer is "yes"; this paper provides the foundation for such resolution in a principled manner.


SpotSigs: Are Stopwords Finally Good for Something? (Posted by Martin Theobald)

In almost all classical Information Retrieval settings that have a text processing component, stopwords are first discarded before anything interesting happens with the document. “Interesting” here might mean indexing the content for search, extracting features for automatic classification, or some other form of content analysis of whatever flavor. Jonathan (my co-author on the SpotSigs paper) had the amazing idea that stopwords may however be very good indicators of the actual interesting parts of a web page. It is especially useful to know where the interesting parts of a web page are when they are interspersed with “added-value” content such as advertisements or navigational banners. This is most strikingly the case with online news articles, but applies more generally across the web.

In our SpotSigs project, we tried to detect near-duplicate web pages in the news domain. This is a particularly challenging setting because the page layouts are often drenched with ads and navigational banners added by the different sites. The actual core article constitutes only a minor fraction of the overall page, which makes online news a very hard setting for any unsupervised clustering or deduplication approach. Moreover, near duplicates of the same core article frequently pop up on different news sites because most of the content is delivered by the same sources, such as the Associated Press, and the very same core articles then often end up completely unchanged (or with only minor edits) on many of the sites.

In response to this setting, the idea of extracting more "localized" signatures, namely those that are close to stopword occurrences (hence "spots"), was born. These localized signatures exploit the observation that stopwords are frequent and uniformly distributed throughout any form of natural-language text (at least in Western languages) but remain very infrequent in typical headline-style banners or ads. SpotSigs connects each such stopword anchor (called an antecedent) with a nearby n-gram, simply a concatenation of subsequent text tokens that are themselves not stopwords, in a way very similar to classic Shingling over the entire page content. By choosing only those shingles that are connected to a stopword antecedent, however, SpotSigs tends to extract more robust signatures than plain Shingling. At the same time, it allows for more efficient and less error-prone signature extraction than many of the far more sophisticated tools for HTML layout analysis. Another nice property is how SpotSigs handles synthetic documents that do not contain any natural-language text, like the 404 pages that crawlers typically encounter: such documents yield hardly any signatures and are automatically discarded.
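
The extraction idea can be sketched in a few lines. This is my own illustrative simplification (the stopword list, chain length, and function name are assumptions, not the paper's exact parameters):

```python
# Illustrative Spot Signature extraction: each stopword occurrence (the
# antecedent) anchors a chain of the next non-stopword tokens, forming a
# localized shingle-like signature.
STOPWORDS = {"the", "a", "an", "is", "of", "to", "and", "in", "that", "it"}

def spot_signatures(text, chain_len=2):
    tokens = text.lower().split()
    sigs = set()
    for i, tok in enumerate(tokens):
        if tok in STOPWORDS:
            # collect the next chain_len non-stopword tokens after the antecedent
            chain = [t for t in tokens[i + 1:] if t not in STOPWORDS][:chain_len]
            if len(chain) == chain_len:
                sigs.add((tok,) + tuple(chain))
    return sigs

article = "the storm moved across the bay bringing heavy rain to coastal cities"
banner = "Top Stories World Sports Tech"       # headline-style text: no stopwords
assert spot_signatures(banner) == set()        # synthetic text yields no signatures
print(spot_signatures(article))
```

Because banners, ads, and synthetic pages contain almost no stopwords, they contribute almost no signatures, which is exactly what makes the signatures "localized" to the core article.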

As a second focus of the paper, SpotSigs also addresses the algorithmic challenge of the inherent quadratic complexity of considering all candidate pairs of documents in the deduplication step. Here, our entire clustering algorithm is built on the simple observation that we never need to compare two documents (or their respective signature sets) if they already differ substantially in length, at least when a set-resemblance similarity measure such as Jaccard similarity is used. This observation lets us match high-dimensional signature vectors very efficiently using a combination of classic techniques such as collection partitioning and inverted index pruning. As a rather surprising result, we found that the SpotSigs matching algorithm can even outperform linear-time similarity hashing approaches like locality-sensitive hashing in runtime, at least for reasonably high similarity thresholds and when the distribution of document (or signature) lengths in the collection is not too skewed, which nicely supports the old algorithmic adage that sorting can sometimes be preferable to hashing. Overall, for a collection of 1.6 million documents from the TREC WT10g benchmark, we achieve a remarkably fast runtime of only about 5 minutes for parsing and extracting signatures, and less than 15 seconds for the actual clustering step on top of our index structures, on a single mid-range server machine.
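
The length-based bound is easy to state: for sets A and B with |A| <= |B|, Jaccard(A, B) = |A ∩ B| / |A ∪ B| <= |A| / |B|, so pairs whose sizes differ by more than the threshold factor can be skipped outright. A simplified sketch of this pruning (my own version, without the paper's partitioning and inverted-index machinery):

```python
# Length-bound pruning for Jaccard deduplication: sort by signature-set
# size; once a later document is too long relative to the current one,
# no further pair with it can reach the threshold tau.
def jaccard(a, b):
    return len(a & b) / len(a | b)

def dedup_candidates(sig_sets, tau=0.7):
    """Yield index pairs whose set sizes admit Jaccard >= tau, verified exactly."""
    order = sorted(range(len(sig_sets)), key=lambda i: len(sig_sets[i]))
    for x, i in enumerate(order):
        for j in order[x + 1:]:                    # len(sig_sets[i]) <= len(sig_sets[j])
            if len(sig_sets[i]) < tau * len(sig_sets[j]):
                break                              # sorted order: all later docs are longer
            if jaccard(sig_sets[i], sig_sets[j]) >= tau:
                yield i, j

docs = [{1, 2, 3, 4}, {1, 2, 3, 4, 5}, {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}]
print(list(dedup_candidates(docs)))   # the small/large pairs are pruned without comparison
```

Sorting makes the pruning effective: within the sorted order, the inner loop can stop at the first document that is too long, which is where the "sorting over hashing" observation comes from.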

What I personally like about SpotSigs is that all its parts, from signature extraction through partitioning and index pruning, fit together seamlessly like seemingly random pieces of a puzzle that finally make a nice picture. For example, the partitioning approach is not only good for breaking the overall quadratic runtime down into many much smaller pieces; it also helps to smooth the skew toward shorter documents that we typically find in web collections and to produce more balanced partition sizes (the slides provide a little more detail on this). Similar themes recur throughout the work; for example, the threshold-based pruning approach appears in no fewer than three variations across the algorithm, with all steps based on the very same similarity bound we derive for the (weighted) Jaccard resemblance between two Spot Signature sets.

After the presentation at SIGIR 2008, we received some interesting comments about the inherently subjective nature of detecting near duplicates. For future work, thinking about how to personalize the signature extraction step beyond just using stopword anchors would certainly be an intriguing direction. We'd also like to improve the clustering algorithm by further investigating disk-based index structures, possibly distributing the algorithm onto multiple machines, and extending our bounding approach to more similarity metrics such as the well-known Cosine measure, which is more commonly used in IR than, for example, Jaccard.

Have a look at the paper or slides for more details.


Who is the Data Leaker? (Posted by Panagiotis Papadimitriou)

In the course of doing business, sensitive data must sometimes be handed over to supposedly trusted third parties. For example, social networking sites, such as Facebook, share users' data with social application owners (user approval is often required). Similarly, enterprises that outsource their data processing have to give data to various other companies. Data may also be shared for other purposes; e.g., a hospital may give patient records to researchers who will devise new treatments.

We call the owner of the data the distributor and the supposedly trusted third parties the agents. In a perfect world there would be no need to hand over sensitive data to parties that may unknowingly or maliciously leak it. However, in many cases we must indeed work with agents that may not be 100% trusted. So, if data is leaked we may not be certain if it came from an agent or from some other source. Our goal is to detect when the distributor's sensitive data has been leaked by agents, and if possible to identify the agent that leaked the data.

Traditionally, leakage detection is handled by watermarking; e.g., a unique code is embedded in each distributed copy. If that copy is later discovered in the hands of an unauthorized party, the leaker can be identified. Watermarks involve some modification of the original data and are very useful in many cases. However, there are cases where it is important not to alter the original distributor's data. For example, the data in a Facebook profile cannot look different to the different users who have access to it. If an outsourcer is doing our payroll, it must have the exact salary and customer identification numbers. If medical researchers will be treating patients (as opposed to simply computing statistics), they may need accurate data about the patients.

In this paper we propose unobtrusive techniques for detecting leakage of a set of objects or records. Specifically, we study the following scenario: After giving a set of objects to agents, the distributor discovers some of those same objects in an unauthorized place. (For example, the data may be found on a web site, or may be obtained through a legal discovery process.) At this point the distributor can assess the likelihood that the leaked data came from one or more agents, as opposed to having been independently gathered by other means. Using an analogy with cookies stolen from a cookie jar, if we catch Freddie with a single cookie, he can argue that a friend gave him the cookie. But if we catch Freddie with 5 cookies, it will be much harder for him to argue that his hands were not in the cookie jar. If the distributor sees "enough evidence" that an agent leaked data, he may stop doing business with him, or may initiate legal proceedings.
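
The cookie-jar intuition can be made concrete with a toy calculation. This is an illustration of the evidence-accumulation idea only, not the paper's actual guilt model, and the independence probability is an assumed parameter:

```python
# Toy "cookie jar" evidence model: if each leaked object could have been
# gathered independently (without the agent) with probability p, then
# seeing k of an agent's objects leaked leaves only p**k probability
# that the agent is innocent, assuming independent acquisitions.
def guilt_confidence(k_leaked, p_independent=0.3):
    """Confidence that the agent leaked, under the independence assumption."""
    return 1 - p_independent ** k_leaked

print(f"1 object leaked:  confidence {guilt_confidence(1):.3f}")   # weak evidence
print(f"5 objects leaked: confidence {guilt_confidence(5):.5f}")   # very strong evidence
```

Even with a generous 30% chance that any single object was obtained elsewhere, five leaked objects leave Freddie with well under a 1% chance of innocence.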

We develop a model for assessing the "guilt" of agents. Based on this model, we present algorithms for distributing objects to agents in a way that improves our chances of identifying a leaker. Finally, we also consider the option of adding "fake" objects to the distributed set. Such objects do not correspond to real entities but appear realistic to the agents. In a sense, the fake objects act as a type of watermark for the entire set, without modifying any individual members. If it turns out an agent was given one or more fake objects that were leaked, then the distributor can be more confident that agent was guilty.

You may want to check our short presentation or the full paper for details.


Matching Hierarchies Using Shared Objects (Posted by Robert Ikeda)

Objects are often organized in a hierarchy to help in managing or browsing them. For example, books in a library can be divided by subject (history, science, ...) and then by country of publication (United Kingdom, USA, ...). Web pages at a site can also be placed in a hierarchy. For instance, a French tourist site may have categories cities, history, hotels, tours; within the cities category we have pages divided by city, and then by events, maps, restaurants.

At Stanford, we have studied the problem of hierarchy matching, in particular, how to determine corresponding edges between two related hierarchies. The need to match hierarchies arises in many situations where the objects come from different systems. In our book hierarchy example above, we may want to combine the book catalogs of two different libraries; in our tourism example, we may want to build a meta-web-site that combines the resources of two or more French tourism sites.

Traditionally, hierarchy matching is done by comparing the text associated with each edge. In contrast, we explored using the placement of objects present in both hierarchies to infer how the hierarchies relate. We chose to explore this alternative approach since the text matching method is not foolproof. For instance, there could be different wordings in the facets; one book hierarchy may refer to "drugs" while the other uses the term "medications."

Suppose we are matching two book hierarchies and both hierarchies contain a particular book. We would infer from the presence of this shared book in the two hierarchies that the two corresponding paths in the hierarchies have a relationship; perhaps they share edges in common. Given many shared objects, we would have a lot of information about possible relationships to reconcile, and how to best reconcile the information is not obvious. We developed two algorithms (one rule-based, one statistics-based) that use shared objects to determine feasible facets for a second hierarchy, given a hierarchy with known facets (label-value pairs that define what objects are placed under an edge).
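
A crude illustration of the shared-object signal (this greedy matching is my own toy sketch, not either of the paper's rule-based or statistics-based algorithms, and the hierarchies here are flattened to one level):

```python
# Toy shared-object matching: count how many books each pair of categories
# across two library hierarchies shares, then match greedily by overlap.
from collections import Counter

lib_a = {"history": {"b1", "b2", "b3"}, "science": {"b4", "b5"}}
lib_b = {"hist":    {"b1", "b3", "b9"}, "sci":     {"b5", "b8"}}

overlap = Counter()
for ea, objs_a in lib_a.items():
    for eb, objs_b in lib_b.items():
        overlap[(ea, eb)] = len(objs_a & objs_b)

# Greedy one-to-one matching: highest-overlap category pairs first
matches = {}
for (ea, eb), n in overlap.most_common():
    if n > 0 and ea not in matches and eb not in matches.values():
        matches[ea] = eb
print(matches)   # {'history': 'hist', 'science': 'sci'}
```

With real multi-level hierarchies the evidence from a shared object applies to a whole path of edges, which is exactly why reconciling many such observations is non-trivial.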

We ran experiments on both real and synthetic data and compared the performance of the two algorithms. We found that given clean synthetic data (data that conformed to the properties we encoded in our rule-based algorithm), our rule-based algorithm performed consistently better than the statistics-based algorithm. However, with real data, or even with some of our synthetic data, the statistics-based algorithm proved to be more robust. For details, please see here.
