Stanford InfoBlog: December 2008

It's vacation time in the InfoLab, so I thought I would write a follow-up post on a previous topic. In June, Hector wrote a post about attending the Berkeley Database Research Self-Assessment. A few months later (in late August), the Self-Assessment came out as the Claremont Report on Database Research (PDF available here).

There was a little bit of discussion at the time. A dbworld message went out on August 19th, and there were a few follow-ups. Alon Halevy was the first to (briefly) blog about the report. Dave Kellogg gave a good summary of the main points, as did Tasso Argyros at the Aster Data Systems Blog. There are also a number of other posts with in-depth comments, like the ebiquity blog looking for more Semantic Web discussion and Stephen Arnold discussing massive heterogeneous data. Two posts connect the Claremont Report's focus on big data to Nature's recent issue on Big Data.

I would summarise the report, but I actually think it's pretty clear. (These are my thoughts, incidentally, and may not represent the views of all of the InfoLab on such a broad report.) We are at a historic point in data management, and there are huge opportunities. Many communities would like better tools for data management, and would be happy and willing to learn them if we provided them (as long as it isn't Datalog, see Hellerstein below... "Datalog? OMG!"). But we're not providing them, or at least not really. Sam Madden's contribution actually struck a chord with me as someone who is much more often a consumer than a creator of database technology (working mostly on the web rather than core database research):

At the risk of feeding what ultimately may be a really well crafted troll, what Madden describes is what I face on a daily basis. My usual tools end up being things like awk, sort, join, ad-hoc Python and Perl scripts, SQLite with a ramdisk or otherwise in memory, and other one-off or only somewhat appropriate tools, even when my data is relational and I would be able to phrase my queries much more succinctly in a declarative language. Rather than being able to use a distributed DBMS to do parallel work, I end up using MapReduce (Hadoop), usually with some hacks to use a higher-level language (currently Dumbo, maybe I'll try Pig or ~~FQL~~ Hive again sometime soon).
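To make the "SQLite in memory" workaround concrete, here is a minimal sketch of the round trip it involves: delimited text goes in, a declarative join runs, and delimited text comes back out. Everything in it is made up for illustration (the `pages` and `links` tables, the sample rows, and the helper names `load_tsv`, `inlink_counts`, and `to_tsv`); only the standard-library `sqlite3`, `csv`, and `io` modules are assumed.

```python
import csv
import io
import sqlite3

# Hypothetical tab-delimited inputs, standing in for files on disk.
PAGES_TSV = "1\texample.org\n2\texample.com\n3\texample.net\n"
LINKS_TSV = "1\t2\n1\t3\n2\t3\n"


def load_tsv(conn, table, columns, text):
    """Create `table` and insert the rows parsed from tab-delimited text."""
    conn.execute(f"CREATE TABLE {table} ({', '.join(columns)})")
    rows = csv.reader(io.StringIO(text), delimiter="\t")
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)


def inlink_counts(conn):
    """Count inbound links per page with one declarative query."""
    query = """
        SELECT p.host, COUNT(l.src) AS inlinks
        FROM pages AS p
        LEFT JOIN links AS l ON l.dst = p.id
        GROUP BY p.id, p.host
        ORDER BY inlinks DESC, p.host
    """
    return conn.execute(query).fetchall()


def to_tsv(rows):
    """Serialise result rows back to tab-delimited text for the next tool."""
    out = io.StringIO()
    csv.writer(out, delimiter="\t", lineterminator="\n").writerows(rows)
    return out.getvalue()


# ":memory:" keeps the whole database in RAM, so nothing touches disk.
conn = sqlite3.connect(":memory:")
load_tsv(conn, "pages", ["id INTEGER", "host TEXT"], PAGES_TSV)
load_tsv(conn, "links", ["src INTEGER", "dst INTEGER"], LINKS_TSV)

result = inlink_counts(conn)
print(to_tsv(result), end="")
```

The query itself is short; most of the code is spent parsing text on the way in and serialising it on the way out, which is the overhead I keep running into.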

Is anyone seriously working to address this problem? It seems much sexier to work on new semantics (e.g., semi-structured data, streams, uncertainty, anonymity) or new ways to optimise retrieval (e.g., column stores, self-tuning). But neither of these really addresses what seems to be the massive cost of boundary crossing. This isn't to deride any of the work on new semantics or new optimisations; on the contrary, that work is extremely important for the database community to remain relevant to a wide community of potential users. But it seems to take forever to get data into a database (and exotic bulk loading tools make it complex as well), index it, and get it ready to be queried. Then it takes forever to get data back out if you are using the database for declarative data manipulation rather than having online queries be the end result. Maybe the answer is having data in XML, and then querying that data directly (but, to paraphrase JWZ, paraphrasing someone else, "Some people, when confronted with a problem, think 'I know, I'll use XML.' Now they have two problems."). Maybe the answer is that the relational database is an oddity, and that the much more common pattern is for simple, bad languages and bad data models to succeed, especially if they have simple models of computation and look like C (see for example Worse is Better, particularly "Models of Software Acceptance: How Winners Win").

Are there tools that will let me manipulate my data declaratively and efficiently, but then get out of my way when I want the data in R, or when I want to write some ad-hoc analysis? Are there any production-level tools that don't have a huge start-up cost when data goes in, and that might actually give me some indication of when the data will come out? Is everyone just using various forms of delimited text to organise massive amounts of structured, semi-structured, and unstructured data? (Except banks and retailers, anyway.)

In any case, while I personally found Madden's research direction most accurate in describing what I need from databases in my work, people presented a number of other interesting research directions. Unfortunately, they're in a variety of formats (they're all originally from the Claremont Report page), so I've munged them and then put them on Slideshare for your perusal. (Some are a little bit like reading tea leaves without seeing the actual talk, but most seem pretty clear in content.)

What do you, as a reader of the Stanford InfoBlog, think is the most important research direction below? Was something missed that is near and dear to your heart? What solutions do you use today to manipulate big and exotic data in your work?

Update 2008-12-22 23:12:00 EST: Switched link to FQL to be a link to Hive. Good catch, Jeff!

Research Directions

Rakesh Agrawal

Anastasia Ailamaki

Philip A. Bernstein

Eric A. Brewer

Michael J. Carey

Surajit Chaudhuri

AnHai Doan

Michael J. Franklin

Johannes Gehrke

Le Gruenwald

Laura M. Haas

Alon Y. Halevy

Joseph M. Hellerstein