Stanford InfoBlog: Why Uncertainty in Data is Great (Posted by Anish Das Sarma)

Background: Uncertain Data

With the advent and growing popularity of several new applications (such as information extraction on the web, information integration, scientific databases, sensor data management, entity resolution), there is an increasing need to manage uncertain data in a principled fashion. Examples of uncertain data are:

Extraction: You extract structure from an HTML table/text on the Web, but due to the inherent uncertainty in the extraction process, you only have some confidence on the result. You extract the fact that John works for Google, but are not entirely sure, so associated a probability 0.8 with this fact.
Human Readings: Let’s suppose people are viewing birds in California (Christmas Bird Count: http://www.audubon.org/Bird/cbc/). George reports that he saw a bird fly past, but wasn’t sure whether it was a Crow or a Raven. He may also attach confidences with each of them: He is 75% sure it was a Crow, and associates only a 25% chance of it being a Raven.
Sensors: Sensor S1 reported the temperature of a room to be 75±5.
Inherent Data Uncertainty: From weather.com, you extract the fact that there will be rain in Stanford on Monday, but you only have a 75% confidence in this.
Data Integration/Entity Resolution: You are mapping schemas of various tables, and are unsure of whether "mailing-address" in one table corresponds to "home-address" or "office-address" in another table. We are de-duplicating a database of names, and are not sure whether "John Doe" and "J. Doe" refer to the same person.

There are many other examples of uncertainty arising in real-world scenarios.

How Do We Deal With Uncertainty

With large volumes of uncertain data being produced that needs to be subsequently queried and analyzed, there is a dire need to deal with this uncertainty in some way. At a high-level, there are two approaches to dealing with data uncertainty:

CLEAN (Approach-C): "Clean" the data as quickly as possible to get rid of the uncertainty. Once the data has been cleaned, it can be stored in and queried by any traditional DBMS, and life is good thereafter.
MANAGE (Approach-M): In contrast, we could keep the uncertainty in data around, and "manage" it in a principled fashion. This involves building DBMSs that can store such uncertain data, and process them correctly, i.e., handle the probabilities, range of values, dependencies, etc.

Let us compare the two approaches. Both Approach-C and Approach-M entail several technical challenges. Cleaning uncertain data is not a trivial process by any stretch of imagination. There has been work in the database community in cleaning uncertain data in various environments. (For one piece of work on cleaning in sensor/RFID networks, check this IQIS 2006 paper.) Once the data has been cleaned, there is no additional effort involved, as it can be processed by any off-the-shelf DBMS. In contrast, with Approach-M less upfront effort is involved. However, processing uncertain data becomes significantly more challenging. I would like to highlight the Trio project (http://infolab.stanford.edu/trio/) at Stanford that is building a system to manage uncertain data along with data lineage (also known as history or provenance). The lineage feature in Trio allows for an intuitive representation (see our VLDB 2006 paper), and efficient query processing (see our ICDE 2008 paper or a short DBClip). I would like to note that several other database groups are also studying the problem of managing uncertain data.

Why Managing Is Better Than Cleaning

Without going into technical details, I would like to describe why Approach-M is in general better than Approach-C. While Approach-C gives instant gratification by removing all "dirty data," Approach-M gives better long term results. Intuitively, Approach-C greedily eliminates all uncertainty, but Approach-M could resolve uncertainty more accurately later on because it has more information. Another way to look at it is that in Approach-C, the error involved in cleaning the data keeps compounding as we further query the certain data. But in Approach-M, since the uncertainty is explicitly modeled, we don’t have this problem.

Let us take a very simple example to see how uncertainty can be resolved using Approach-M. Consider the Christmas Bird Count described earlier, where people report bird sightings with a relational schema (Observer, Bird-ID, Bird-Name). (Suppose for the sake of this example birds are tagged with an identifying number, and the main challenge is in associating the number with the bird species.) In this extremely simple example, suppose there is just one sighting in our database by Mary, who saw Bird-1, and identified it as being either a Finch (80%) or a Toucan (20%). At this point we know that Bird-1 is definitely a Finch or a Toucan, but we are not sure which one it is; using Approach-C, we would like to create a certain database, and since Mary feels much more confident that Bird-1 is a Finch, we would enter the tuple (Mary, Bird-1, Finch) into the database and forget the information that it could have possibly been a Toucan as well.

In Approach-M, we would store Mary’s sighting as is, which could be represented as:

(Mary, Bird-1, {Finch: 0.8, Toucan: 0.2})

Why is this better? Suppose next day we get another independent observer, Susan, reports the sighting of Bird-1, and she thinks it’s either a Nightingale (70%) or Toucan (30%). The following day we get another independent observer’s sighting who says Bird-1 is either a Hummingbird (65%) or Toucan (35%). Clearly when we reconcile all these sightings, the likelihood of Bird-1 being Toucan is quite high. The method of "reconciling" these readings can be quite complicated, and is an important topic of research, but any reasonable reconciliation should indicate the probability of Bird-1 being Toucan quite high since all other sightings are conflicting. However, using Approach-C, all the three observers’ readings of Toucan would have been weeded out.

Uncertainty In Data Integration

For readers who are not convinced by the synthetic example above, here’s a very real application: data integration, which has been an important area of research for several decades.

Data integration systems offer a uniform interface to a set of data sources. As argued in our chapter to appear in a book on uncertain data, data integration systems need to model uncertainty at their core. In addition to the possibility of data being uncertain, semantics mappings and the mediated schema may be approximate. For example, in an application like Google Base that enables anyone to upload structured data, or when mapping millions of sources on the deep web, we cannot imagine specifying exact mappings. The chapter provides the theoretical foundations for modeling uncertainty in a data integration system.

At Google, we built a data integration system that incorporates this probabilistic framework, and completely automatically sets up probabilistic mediated schemas and probabilistic mappings, after which queries can be answered. We applied our system to integrate tables gathered from all over the web in multiple domains, including 50-800 data sources. Details of this work can be found in our SIGMOD 2008 paper. We observed that the system is able to produce high-quality answers with no human intervention. The answers obtained using the probabilistic framework was significantly better than any deterministic technique compared against.

Labels: anish, das sarma, integration, Trio, uncertain data, uncertainty

This entry was posted by Anish Das Sarma, on Wednesday, July 23, 2008. You can leave your response.

Anonymous | July 29, 2008 at 3:30 PM |
Propagating uncertainty is a cornerstone of Bayesian statistics, which provides a general framework for uncertainty integration. Some of your colleagues in CS at Stanford are using it for a general model of natural language processing that propagates uncertainty through a linguistic processing pipeline (see this paper).

Back in the 1980s, I used to do logic programming and knowledge representation for computational linguistics, with a theoretical focus on uncertainty propagation (though it was disjunctive or inheritance-based uncertainty).

Fast forward to the 2000s, and I'm currently working on an NIH grant, the foundation of which is high recall techniques for linking textual mentions of genes, mutations, diseases and other biological entities to databases (see this tech report, blog entry or tutorial). You just can't get high recall with state of the art data cleaning in 2008, at least in the domains we care about.

A related issue I'm working on now is a hierarchical Bayesian model of determining true annotations from multiple annotators (like your Toucan example). I discuss the general problems with the current way of measuring agreement for producing gold standard data in my last two blog entries, Good Kappa's Not Enough and Good Kappa's Not Necessary, Either.

Finally, it's not just unclean data that needs to be reasoned with, but also missing data, which presents a different set of problems. For that, you might consider multiple imputation.
Anish Das Sarma | August 1, 2008 at 1:44 PM |
You are absolutely that propagating probabilities have been considered in AI (as well as in other fields), and we are aware of this past work. However, note that the uncertain data management we (and other DB groups around the globe) are doing is different in several respects. First, we consider more general kinds of uncertainty, which also includes non-probabilistic but uncertain data, and probably at a larger scale. Second, the kinds of queries that need to be answered in a relational setting (all of SQL) goes beyond Bayesian inference. And for these queries, we need to consider new kinds of indexes and statistics, etc.

That said, I would like to point out that within the DB community itself there has been past work that uses AI-ish techniques for managing relation uncertain data (See this ICDE 2007 and VLDB 2008 paper).

Finally, note that uncertainty is only one component of the Trio project I referred to. Data lineage constitutes the other key component in Trio. We've been looking at how lineage can play a crucial role in improving the efficiency and usability in uncertain data. (Our ICDE 2008 paper shows how lineage can help in confidence computation, and this paper shows that lineage can greatly simplify data modifications and versioning.)

As for your note on my Toucan example on how to "reconcile" independent sources of uncertain data, thanks for pointing me to this past work! Along with other members of the Trio project, I've been doing some work on building a theoretical framework that allows us to combine such uncertain information. Our work is based on and extends the theory of data integration, which fundamentally relies on the notion of containment. We believe the approach we are taking is more principled. Unfortunately I can't point you to our results yet, as we haven't published them.

Thanks a lot for your comments and for pointing me to your past and current work!
Prasad | December 9, 2008 at 11:58 PM |
A very nice illustration on the significance of managing uncertain data; I have worked in the past in this area on evaluating threshold based preference queries on uncertain data; but we modeled uncertain data as ranges; For example instead of saying, I have a confidence of 80% on value X, I would say, the value is somewhere in the range [a,b] where this range could again follow an arbitrary distribution'; In the case of sensor networks, it mostly would follow a Gaussian distribution. There has been some other related work in the same lines by some DB groups in Purdue, Hong Kong University, Toronto too.

leave a response

Why Uncertainty in Data is Great (Posted by Anish Das Sarma)

Search

recent posts

Archives

Authors

Links

Admin