
Vacation Post: Claremont (Berkeley) Database Research Self-Assessment Revisited (Posted by Paul Heymann)

It's vacation time in the InfoLab, so I thought I would write a follow-up post on a previous topic. In June, Hector wrote a post about attending the Berkeley Database Research Self-Assessment. A few months later (in late August), the Self-Assessment came out as the Claremont Report on Database Research (PDF available here).

There was a little bit of discussion at the time. A dbworld message went out on August 19th, and there were a few follow-ups. Alon Halevy was the first to (briefly) blog about the report; Dave Kellogg gave a good summary of the main points, as did Tasso Argyros at the Aster Data Systems Blog; and there are a number of other posts with in-depth comments, like the ebiquity blog looking for more Semantic Web discussion and Stephen Arnold discussing massive heterogeneous data. Two posts connect the Claremont Report's focus on big data to Nature's recent issue on Big Data.

I would summarise the report, but I actually think it's pretty clear. (These are my thoughts, incidentally, and may not represent the views of all of the InfoLab on such a broad report.) We are at a historic point in data management, and there are huge opportunities. Many communities would like better tools for data management, and would be happy and willing to learn them if we provided them (as long as it isn't Datalog; see Hellerstein below... "Datalog? OMG!"). But we're not providing them, or at least, not really. Sam Madden's contribution struck a chord with me as someone who is much more often a consumer than a creator of database technology (working mostly on the web rather than core database research):


At the risk of feeding what ultimately may be a really well crafted troll, what Madden describes is what I face on a daily basis. My usual tools end up being things like awk, sort, join, ad-hoc Python and perl scripts, SQLite with a ramdisk or otherwise in memory, and other one-off or only somewhat appropriate tools, even when my data is relational and I would be able to phrase my queries much more succinctly in a declarative language. Rather than being able to use a distributed DBMS to do parallel work, I end up using MapReduce (Hadoop), usually with some hacks to use a higher-level language (currently Dumbo; maybe I'll try Pig or Hive again sometime soon).
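
To give a flavor of the kind of hack I mean, here's a sketch of the sort of one-off Hadoop Streaming job I end up writing in Python (the file layout and field positions are made up for illustration), counting hits per URL from tab-delimited logs:

    #!/usr/bin/env python
    # mapper.py -- emit (url, 1) for each tab-delimited log line
    # (assumes, purely for illustration, that the URL is the third column)
    import sys

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed lines
        print "%s\t%d" % (fields[2], 1)

    #!/usr/bin/env python
    # reducer.py -- sum the counts per URL (Streaming delivers input sorted by key)
    import sys

    current_url, count = None, 0
    for line in sys.stdin:
        url, value = line.rstrip("\n").split("\t")
        if url != current_url:
            if current_url is not None:
                print "%s\t%d" % (current_url, count)
            current_url, count = url, 0
        count += int(value)
    if current_url is not None:
        print "%s\t%d" % (current_url, count)

If the logs already lived somewhere queryable, the declarative equivalent would just be SELECT url, COUNT(*) FROM logs GROUP BY url.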

Is anyone seriously working to address this problem? It seems much sexier to work on new semantics (e.g., semi-structured data, streams, uncertainty, anonymity) or new ways to optimise retrieval (e.g., column stores, self-tuning). But neither of these really addresses what seems to be the massive cost of boundary crossing. This isn't to deride any of the work on new semantics or new optimisations; on the contrary, that work is extremely important for the database community to remain relevant to a wide community of potential users. But it seems to take forever to get data into a database (and exotic bulk loading tools make it complex as well), index it, and get it ready to be queried. Then it takes forever to get data back out if you are using the database for declarative data manipulation rather than having online queries be the end result. Maybe the answer is having data in XML and then querying that data directly (but, to paraphrase JWZ paraphrasing someone else, "Some people, when confronted with a problem, think 'I know, I'll use XML.' Now they have two problems."). Maybe the answer is that the relational database is an oddity, and that the much more common pattern is for simple, bad languages and bad data models to succeed, especially if they have simple models of computation and look like C (see, for example, Worse is Better, particularly "Models of Software Acceptance: How Winners Win").

Are there tools that will let me manipulate my data declaratively and efficiently, but then get out of my way when I want the data in R or I want to write some ad-hoc analysis? Are there any production-level tools that don't have a huge start-up cost when data goes in, and that might actually give me some indication of when the data will come out? Is everyone just using various forms of delimited text to organise massive amounts of structured, semi-structured, and unstructured data? (Except banks and retailers, anyway.)
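
For what it's worth, the closest thing I have today is a pattern like the following sketch (file names and schema are made up): pull the delimited text into an in-memory SQLite database, do the relational part declaratively, and then take plain tuples (or delimited text for R) back out for the ad-hoc analysis:

    import csv
    import sqlite3

    # Hypothetical example: tab-delimited ratings data, loaded into an in-memory database.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ratings (user_id INTEGER, item_id INTEGER, rating REAL)")
    rows = ((int(u), int(i), float(r))
            for u, i, r in csv.reader(open("ratings.tsv"), delimiter="\t"))
    conn.executemany("INSERT INTO ratings VALUES (?, ?, ?)", rows)
    conn.execute("CREATE INDEX idx_item ON ratings (item_id)")

    # The relational part stays declarative...
    per_item = conn.execute("SELECT item_id, AVG(rating), COUNT(*) "
                            "FROM ratings GROUP BY item_id").fetchall()

    # ...and then the results come back out as plain tuples for ad-hoc analysis,
    # or as delimited text again for R.
    csv.writer(open("item_stats.tsv", "w"), delimiter="\t").writerows(per_item)

It works, but it is exactly the kind of start-up cost and boundary crossing I would like to stop paying.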

In any case, while I personally found Madden's research direction most accurate in describing what I need from databases in my work, there are a number of other interesting research directions that people presented. Unfortunately, they're in a variety of formats (they're all originally from the Claremont Report page), so I've munged them and put them on Slideshare for your perusal. (Some are a little bit like reading tea leaves without seeing the actual talk, but most seem pretty clear in content.)

What do you, as a reader of the Stanford InfoBlog, think is the most important research direction below? Was something missed that is near and dear to your heart? What solutions do you use today to manipulate big and exotic data in your work?

Update 2008-12-22 23:12:00 EST: Switched link to FQL to be a link to Hive. Good catch, Jeff!

Research Directions

Rakesh Agrawal
Anastasia Ailamaki
Philip A. Bernstein
Eric A. Brewer
Michael J. Carey
Surajit Chaudhuri
AnHai Doan
Michael J. Franklin
Johannes Gehrke
Le Gruenwald
Laura M. Haas
Alon Y. Halevy
Joseph M. Hellerstein
Yannis E. Ioannidis
Hank F. Korth
Donald Kossmann
Samuel Madden
Beng Chin Ooi
Raghu Ramakrishnan
Sunita Sarawagi
Michael Stonebraker
Alexander S. Szalay
Gerhard Weikum


  1. Hypercube | December 22, 2008 at 1:35 PM

    I think databases and security share a common property: everybody seems to think they're an expert in the field. Homebrewing an unscalable data management system (e.g., using a filesystem) is just as easy as homebrewing a completely insecure security system. Further, nobody realizes they have a bad system until it's too late. Homebrew security systems get compromised; the users of homebrew data management systems end up with terabytes of data in a non-standard format that can no longer be handled efficiently as flat files and that takes an enormous engineering effort to import into a real database system.

  2. Paul Heymann | December 22, 2008 at 6:50 PM

    Cube:

    I think you highlight an important point, which is that a lot of data management issues are as much human or social issues as anything else. More specifically, you seem to be implying that a lot of these issues look like process or project management.

    It's not really clear who is to blame for the sort of process problems you describe, though. It's odd, because the database vendors seem to implicitly assume that the data manipulation for an application should be built around an RDBMS from the start, and that it's bad project management not to be using a production relational data model from the beginning. But on the other hand, building a product that ignores the fact that most (at least initially small-scale) projects are not going to be built this way, with foresight for eventual data management needs, seems a little negligent (or at least not optimal).

  3. Anonymous | December 22, 2008 at 9:57 PM

    Hey Paul,

    I think you meant to reference Hive, Facebook's structured data management layer above Hadoop, not FQL, their SQL-like interface to their developer API.

    Great topic and post though; certainly looking forward to where all of this innovation leads in ten years.

    Regards,
    Jeff

  4. Paul Heymann | December 22, 2008 at 11:59 PM

    Jeff:

    Good catch, I've updated FQL to be Hive in the text above. I feel bad, Ragho even demoed Hive for me a few months ago!

    In any case, I'd like to take the opportunity to point out that there is a tiny contradiction in my comments above. As my colleague David pointed out to me, I claim all of these problems are terrible unsolved problems, and then explain how I've solved them in my own work using Hadoop and other tools. I guess the problem is that none of the current solutions seem to have that clean, efficient feeling to them, but maybe I'm (or we're all) being too picky. Hadoop in many ways seems to be happiest about taking our data with all of its warts.

    I never ended up posting a response to the whole Database Column MapReduce controversy (I, II). (Side note: Now that was a well-crafted troll...) But my main thought after reading it was that the authors (who are far wiser than I) really should read the (somewhat crazy in places) Software Acceptance Models if they haven't already.

    Specifically, and paraphrasing heavily: programmers flock to simple models of computation, simple implementations of those models, and languages that look like C (e.g., MapReduce and C++/Java in this case), and they won't use something that does not have these qualities. Simple implementations in particular mean that many people can become gurus: MapReduce is simple enough that it has now been rewritten at least three times that I know of by different companies. Each time it gets rewritten, more people become experts on the internals and understand its details and limitations.
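
    To see just how simple the model is, here's a toy, single-process sketch in Python (with a made-up word-count example); most of what a real implementation adds on top of this is partitioning, sorting, distribution, and fault tolerance:

        def map_reduce(inputs, mapper, reducer):
            # Shuffle: group every (key, value) pair emitted by the mapper by key.
            groups = {}
            for record in inputs:
                for key, value in mapper(record):
                    groups.setdefault(key, []).append(value)
            # Reduce: one call per key, over all of that key's values.
            return [output
                    for key, values in groups.items()
                    for output in reducer(key, values)]

        # The canonical word count:
        def mapper(line):
            for word in line.split():
                yield word, 1

        def reducer(word, counts):
            yield word, sum(counts)

        print map_reduce(["to be or not to be"], mapper, reducer)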

    Furthermore, as more and more programmers become invested in MapReduce, there will be more and more incentive to improve it until it becomes, maybe, 95% of what the right distributed DBMS solution would have been. This is already happening through efforts like Pig and Hive, as well as (as far as I can tell) closed efforts like Sawzall. So maybe MapReduce, with its relatively low cost of boundary crossings, simple implementation, and exploding user base, will eventually be the "database" solution that provides 95% of what we need and that we all have to use.

  5. Greg Linden | December 29, 2008 at 8:10 PM

    Excellent post, Paul.

    I have to say, a lot of this struck close to home, including the use of simple tools like sort, cut, perl, python, and sqlite for small data processing, Hadoop (or similar) for big data processing, the pain of boundary crossings in RDBMSs and XML, and the wonder that we often seem to be able to do no better than the simplest format of delimited text.

    I am not sure how I missed it in the past, but I much enjoyed Richard Gabriel's "Models of Software Acceptance: How Winners Win". Thanks for pointing that out too.

  6. Paul Heymann | December 30, 2008 at 3:09 AM

    Greg:

    Hey, thanks for the note. It's weird; I don't think a lot of researchers are aware of the Worse is Better work, even though it impacts a lot of what we do. I suppose some of the reason is that it's more of a theory: it wasn't really produced by a "science-y" researcher, and it's not really that testable. But it really does seem to provide some compelling insights into how things get popular, even more so with the even bigger network effects on the web.

    As an aside, if you haven't read Gabriel before, he's a really fascinating writer. I read a little bit of Patterns of Software (which is now available online) when I borrowed it off an old advisor's shelf a number of years ago. Beyond expanding on some of the ideas in Acceptance Models, he's not afraid to just veer off somewhere completely different whenever he thinks it's interesting. His Stanford PhD and Lucid stories are both filled with a surprising level of detail and honesty (near the end of PoS).
