Comments on Stanford InfoBlog: Why Uncertainty in Data is Great (Posted by Anish Das Sarma)

A very nice illustration on the significance of ma...

2008-12-09T23:58:00.000-08:00

A very nice illustration of the significance of managing uncertain data. I have worked in this area in the past, on evaluating threshold-based preference queries over uncertain data, but we modeled uncertain data as ranges. For example, instead of saying that I have 80% confidence in value X, I would say that the value lies somewhere in the range [a,b], where the value can follow an arbitrary distribution over that range. In the case of sensor networks, it would mostly follow a Gaussian distribution. There has been other related work along the same lines by DB groups at Purdue, Hong Kong University, and Toronto.
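
As a rough illustration of the range model described above, the sketch below evaluates a threshold query over readings that are each known only as a range [a,b] with a Gaussian restricted to that range. Everything here is an assumption made for illustration: the names `prob_above` and `threshold_query`, the sensor values, and the choice to truncate and renormalize the Gaussian on [a,b]. It is not the method from the work the comment refers to.

```python
import math

def normal_cdf(x, mu, sigma):
    """CDF of a Normal(mu, sigma) distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def prob_above(threshold, a, b, mu, sigma):
    """P(X > threshold) for X ~ Normal(mu, sigma) truncated to [a, b].

    Assumption: the Gaussian is renormalized so all mass lies in [a, b].
    """
    if threshold <= a:
        return 1.0
    if threshold >= b:
        return 0.0
    mass = normal_cdf(b, mu, sigma) - normal_cdf(a, mu, sigma)
    tail = normal_cdf(b, mu, sigma) - normal_cdf(threshold, mu, sigma)
    return tail / mass

def threshold_query(readings, threshold, min_prob):
    """Return the ids whose value exceeds `threshold` with probability >= min_prob."""
    return [
        name
        for name, (a, b, mu, sigma) in readings.items()
        if prob_above(threshold, a, b, mu, sigma) >= min_prob
    ]

# Hypothetical sensor readings: id -> (a, b, mu, sigma)
readings = {
    "s1": (18.0, 22.0, 20.0, 1.0),
    "s2": (24.0, 30.0, 27.0, 1.5),
    "s3": (20.0, 26.0, 23.0, 1.5),
}
result = threshold_query(readings, threshold=25.0, min_prob=0.5)
print(result)
```

With these made-up numbers only `s2` qualifies: `s1` lies entirely below the threshold, and `s3` has only a small tail above it.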

You are absolutely right that propagating probabilities ...

2008-08-01T13:44:00.000-07:00

You are absolutely right that propagating probabilities has been considered in AI (as well as in other fields), and we are aware of this past work. However, note that the uncertain data management we (and other DB groups around the globe) are doing is different in several respects. First, we consider more general kinds of uncertainty, which also include non-probabilistic but uncertain data, and probably at a larger scale. Second, the kinds of queries that need to be answered in a relational setting (all of SQL) go beyond Bayesian inference. And for these queries, we need to consider new kinds of indexes, statistics, etc.

That said, I would like to point out that within the DB community itself there has been past work that uses AI-ish techniques for managing relational uncertain data (see these ICDE 2007 and VLDB 2008 papers).

Finally, note that uncertainty is only one component of the Trio project I referred to. Data lineage constitutes the other key component in Trio. We've been looking at how lineage can play a crucial role in improving the efficiency and usability of uncertain data management. (Our ICDE 2008 paper shows how lineage can help in confidence computation, and this paper shows that lineage can greatly simplify data modifications and versioning.)
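
To make the link between lineage and confidence concrete, here is a minimal sketch. It assumes that a derived tuple's lineage is a disjunction of derivations, that each derivation is a set of base tuples that must all be present, and that base tuples are independent. The function name `confidence`, the tuple ids, and the brute-force enumeration of possible worlds are all illustrative assumptions, not Trio's actual algorithm.

```python
from itertools import product

def confidence(lineage, base_conf):
    """Probability that at least one derivation in `lineage` holds.

    lineage:   list of sets of base-tuple ids (a formula in DNF)
    base_conf: id -> probability that the base tuple is present
    Assumption: base tuples are mutually independent.
    """
    ids = sorted(set().union(*lineage))
    total = 0.0
    # Enumerate every possible world over the base tuples that appear in the lineage.
    for world in product([False, True], repeat=len(ids)):
        present = {i for i, keep in zip(ids, world) if keep}
        if any(clause <= present for clause in lineage):
            p = 1.0
            for i, keep in zip(ids, world):
                p *= base_conf[i] if keep else 1.0 - base_conf[i]
            total += p
    return total

# Hypothetical example: two derivations that share base tuple t1.
base_conf = {"t1": 0.5, "t2": 0.5, "t3": 0.5}
lineage = [{"t1", "t2"}, {"t1", "t3"}]

conf = confidence(lineage, base_conf)       # 0.375
naive = 1.0 - (1.0 - 0.25) * (1.0 - 0.25)   # 0.4375, wrongly treats the derivations as independent
print(conf, naive)
```

Because both derivations depend on `t1`, combining their individual confidences as if they were independent overestimates the result. Knowing the lineage is what exposes the shared dependency. The enumeration is exponential in the number of base tuples, so it is only suitable for small examples.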

As for your note on my Toucan example on how to "reconcile" independent sources of uncertain data, thanks for pointing me to this past work! Along with other members of the Trio project, I've been doing some work on building a theoretical framework that allows us to combine such uncertain information. Our work is based on and extends the theory of data integration, which fundamentally relies on the notion of containment. We believe the approach we are taking is more principled. Unfortunately I can't point you to our results yet, as we haven't published them.

Thanks a lot for your comments and for pointing me to your past and current work!

Propagating uncertainty is a cornerstone of Bayesi...

2008-07-29T15:30:00.000-07:00

Propagating uncertainty is a cornerstone of Bayesian statistics, which provides a general framework for uncertainty integration. Some of your colleagues in CS at Stanford are using it for a general model of natural language processing that propagates uncertainty through a linguistic processing pipeline (see this paper).

Back in the 1980s, I used to do logic programming and knowledge representation for computational linguistics, with a theoretical focus on uncertainty propagation (though it was disjunctive or inheritance-based uncertainty).

Fast forward to the 2000s, and I'm currently working on an NIH grant, the foundation of which is high-recall techniques for linking textual mentions of genes, mutations, diseases and other biological entities to databases (see this tech report, blog entry or tutorial). You just can't get high recall with state-of-the-art data cleaning in 2008, at least in the domains we care about.

A related issue I'm working on now is a hierarchical Bayesian model for determining true annotations from multiple annotators (like your Toucan example). I discuss the general problems with the current way of measuring agreement for producing gold standard data in my last two blog entries, Good Kappa's Not Enough and Good Kappa's Not Necessary, Either.
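
For a sense of what inferring a true annotation from several annotators involves, the sketch below uses a far simpler stand-in than the hierarchical model mentioned above: a single application of Bayes' rule for one binary label, with annotator accuracies assumed to be known and annotators assumed to err independently. The function name `posterior_positive`, the annotator ids, and all numbers are hypothetical.

```python
def posterior_positive(votes, accuracy, prior=0.5):
    """P(true label is positive | votes), by Bayes' rule.

    votes:    annotator id -> True/False vote
    accuracy: annotator id -> probability the annotator is correct
    Assumptions: accuracies are known and annotators err independently.
    """
    like_pos = prior        # prior times likelihood of the votes if the truth is positive
    like_neg = 1.0 - prior  # same, if the truth is negative
    for annotator, vote in votes.items():
        acc = accuracy[annotator]
        like_pos *= acc if vote else 1.0 - acc
        like_neg *= 1.0 - acc if vote else acc
    return like_pos / (like_pos + like_neg)

# Hypothetical annotators: one reliable, two barely better than chance.
accuracy = {"a": 0.9, "b": 0.6, "c": 0.6}
votes = {"a": True, "b": False, "c": False}

p = posterior_positive(votes, accuracy)
print(round(p, 3))
```

Here the majority votes negative, yet the posterior favors positive (about 0.8) because the one dissenting annotator is much more reliable. A hierarchical model would additionally estimate the accuracies from the data rather than take them as given.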

Finally, it's not just unclean data that needs to be reasoned with, but also missing data, which presents a different set of problems. For that, you might consider multiple imputation.