Stanford InfoBlog: Certain Answers From Uncertain Data (Posted by Parag Agrawal)

In his blog entry, Anish argues that maintaining uncertainty through the data management system (Approach-M) can yield benefits. Processing uncertain data correctly involves capturing dependencies (or correlations) along with probability values in the system. For example, consider a simplified weather forecasting system which predicts that it will rain in either Palo Alto or Sunnyvale, with probabilities .3 and .7 respectively, because of uncertainty with regard to wind direction. It also predicts that it will rain in Fremont with .2 probability if it rains in Palo Alto, since Fremont is downwind from Palo Alto. (The uncertainty here is perhaps with respect to wind speed.) Similarly, it will rain in Milpitas with .5 probability if it rains in Sunnyvale. Capturing correlations correctly would let us conclude that at least one of Sunnyvale or Fremont will be dry: rain reaches Fremont only if it rains in Palo Alto, and rain in Palo Alto rules out rain in Sunnyvale. While these correlations are crucial to drawing correct conclusions, end users may often prefer a final result that is simpler and more certain.
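To make the role of correlations concrete, here is a minimal sketch that enumerates the possible worlds of the weather example. It rests on assumptions that the post only implies: exactly one of Palo Alto or Sunnyvale gets rain, and a downwind city (Fremont, Milpitas) can get rain only when its upwind city does. The names `possible_worlds`, `prob`, and `marginal` are made up for illustration and do not come from any particular system.

```python
from fractions import Fraction as F


def possible_worlds():
    """Return (set of rainy cities, probability) pairs for the example.

    Assumption: rain falls in exactly one of Palo Alto / Sunnyvale, and a
    downwind city gets rain only if its upwind city does.
    """
    worlds = []
    for city, p_city, downwind, p_down in [
        ("Palo Alto", F(3, 10), "Fremont", F(2, 10)),
        ("Sunnyvale", F(7, 10), "Milpitas", F(5, 10)),
    ]:
        # World where the rain also reaches the downwind city.
        worlds.append((frozenset({city, downwind}), p_city * p_down))
        # World where the rain stays in the upwind city only.
        worlds.append((frozenset({city}), p_city * (1 - p_down)))
    return worlds


def prob(event, worlds):
    """Total probability of the worlds in which `event` holds."""
    return sum(p for rainy, p in worlds if event(rainy))


def marginal(city, worlds):
    """Probability that it rains in `city`."""
    return prob(lambda rainy: city in rainy, worlds)


WORLDS = possible_worlds()

# With correlations kept, Sunnyvale and Fremont never both get rain.
both_correlated = prob(lambda rainy: {"Sunnyvale", "Fremont"} <= rainy, WORLDS)

# Treating the two marginals as independent would wrongly give a nonzero value.
both_independent = marginal("Sunnyvale", WORLDS) * marginal("Fremont", WORLDS)

print(both_correlated, both_independent)
```

Under these assumptions there are only four possible worlds, and none of them contains both Sunnyvale and Fremont. Multiplying the two marginal probabilities, as a system that dropped the correlations would, assigns that impossible outcome a small but nonzero probability.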

Allowing the user to compute the most likely answers is a common way to provide a "simple to use" result. The result may be restricted to high-probability answers using a confidence threshold or a top-k by confidence query. This paper is part of the large body of work that addresses this problem. For the example above, a user might only want to get a travel warning when the chance of rain in a city of interest exceeds a threshold (say .5). This can be posed as a confidence threshold query with a predicate restricting the search to the user's cities of interest. Queries like this just "clean" up the result to remove some of the uncertainty, allowing the user to "zoom" into the interesting information in the result. I am interested in exploring other ways of cleaning uncertainty that may be useful to some applications.
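The two query styles mentioned above can be sketched in a few lines. This is only an illustration over a plain dictionary, not the interface of any real uncertain-data system. The function names are hypothetical, and the confidence values assume that rain reaches Fremont or Milpitas only when it rains in the corresponding upwind city, giving .3 x .2 = .06 for Fremont and .7 x .5 = .35 for Milpitas.

```python
import heapq

# Assumed per-city chance of rain for the weather example (see lead-in).
CONFIDENCE = {
    "Palo Alto": 0.3,
    "Sunnyvale": 0.7,
    "Fremont": 0.06,
    "Milpitas": 0.35,
}


def threshold_query(confidence, cities_of_interest, threshold):
    """Cities of interest whose chance of rain exceeds `threshold`."""
    return {
        city: p
        for city, p in confidence.items()
        if city in cities_of_interest and p > threshold
    }


def top_k_query(confidence, k):
    """The k cities with the highest chance of rain, most likely first."""
    return heapq.nlargest(k, confidence.items(), key=lambda item: item[1])


# A user who cares about Sunnyvale and Fremont, warned only above .5.
warnings = threshold_query(CONFIDENCE, {"Sunnyvale", "Fremont"}, 0.5)

# The two most likely rainy cities overall.
top_two = top_k_query(CONFIDENCE, 2)

print(warnings, top_two)
```

Both queries only filter or rank the existing confidences. They hide low-probability answers from the user without changing any of the underlying probabilities.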

While the techniques above return more certain answers, they don't resolve any uncertainty. However, can throwing more data at the problem improve results by actually reconciling uncertainty? Consider weather forecast information from multiple sources: each could be uncertain, and they could be mutually inconsistent or mutually reinforcing. Can careful resolution of these data sources yield better, more certain results? I am betting that the answer is "yes". This paper provides the foundation for such resolution in a principled manner.

Labels: agrawal, data integration, integration, parag, paraga, probabilistic, threshold, top-k, topk, uncertain data, uncertainty

This entry was posted by Parag Agrawal on Thursday, August 28, 2008.
