
Vacation Post: Claremont (Berkeley) Database Research Self-Assessment Revisited (Posted by Paul Heymann)

It's vacation time in the InfoLab, so I thought I would write a follow-up post on a previous topic. In June, Hector wrote a post about attending the Berkeley Database Research Self-Assessment. A few months later (in late August), the Self-Assessment came out as the Claremont Report on Database Research (PDF available here).

There was a little bit of discussion at the time. A dbworld message went out on August 19th, and there were a few follow-ups. Alon Halevy was the first to (briefly) blog about the report. Dave Kellogg gave a good summary of the main points, as did Tasso Argyros at the Aster Data Systems Blog. There are also a number of other posts with in-depth comments, like the ebiquity blog looking for more Semantic Web discussion and Stephen Arnold discussing massive heterogeneous data. Two posts connect the Claremont Report's focus on big data to Nature's recent issue on Big Data.

I would summarise the report, but I actually think it's pretty clear. (These are my thoughts, incidentally, and may not represent the views of all of the InfoLab on such a broad report.) We are at a historic point in data management, and there are huge opportunities. Many communities would like better tools for data management, and would be happy and willing to learn them if we provided them (as long as it isn't Datalog, see Hellerstein below... "Datalog? OMG!"). But, we're not, or at least, not really. Sam Madden's contribution actually struck a chord with me as someone who is much more often a consumer rather than a creator of database technology (working mostly on the web rather than core database research):


At the risk of feeding what ultimately may be a really well-crafted troll, what Madden describes is what I face on a daily basis. My usual tools end up being things like awk, sort, join, ad-hoc Python and Perl scripts, SQLite with a ramdisk or otherwise in memory, and other one-off or only somewhat appropriate tools, even when my data is relational and I would be able to phrase my queries much more succinctly in a declarative language. Rather than being able to use a distributed DBMS to do parallel work, I end up using MapReduce (Hadoop), usually with some hacks to use a higher-level language (currently Dumbo, maybe I'll try Pig or Hive again sometime soon).
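To make the "SQLite in memory" pattern concrete, here is a minimal sketch of what that kind of one-off usually looks like: delimited text goes into an in-memory database, one declarative query does the join and aggregation, and plain Python tuples come back out. The table names, column names, and sample rows are made up for illustration, and real inputs would of course be files on disk rather than string constants.

```python
import csv
import io
import sqlite3

# Hypothetical tab-delimited inputs (stand-ins for real files on disk).
USERS_TSV = "1\talice\n2\tbob\n3\tcarol\n"
TAGS_TSV = "1\tpython\n1\tsql\n2\tawk\n"


def load_tsv(conn, table, columns, text):
    """Create a table and bulk-insert tab-delimited rows into it."""
    conn.execute(f"CREATE TABLE {table} ({', '.join(columns)})")
    rows = csv.reader(io.StringIO(text), delimiter="\t")
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)


def tags_per_user():
    """Count tags per user with a single declarative join."""
    conn = sqlite3.connect(":memory:")  # nothing touches the disk
    load_tsv(conn, "users", ["user_id INTEGER", "name TEXT"], USERS_TSV)
    load_tsv(conn, "tags", ["user_id INTEGER", "tag TEXT"], TAGS_TSV)
    query = """
        SELECT u.name, COUNT(t.tag) AS n_tags
        FROM users AS u
        LEFT JOIN tags AS t ON t.user_id = u.user_id
        GROUP BY u.user_id, u.name
        ORDER BY u.name
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for name, n_tags in tags_per_user():
        print(name, n_tags)
```

The query itself is short, which is the appeal; the load step before it and the fetch step after it are the boundary crossings, and they are what tends to dominate once the inputs stop fitting comfortably on one machine.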

Is anyone seriously working to address this problem? It seems much sexier to work on new semantics (e.g., semi-structured data, streams, uncertainty, anonymity) or new ways to optimise retrieval (e.g., column stores, self-tuning). But neither of these really addresses what seems to be the massive cost of boundary crossing. This isn't to deride any of the work on new semantics or new optimizations; on the contrary, that work is extremely important for the database community to remain relevant to a wide community of potential users. But it seems to take forever to get data into a database (and exotic bulk loading tools make it complex as well), index it, and get it ready to be queried. Then it takes forever to get data back out if you are using the database for declarative data manipulation rather than having online queries be the end result. Maybe the answer is having data in XML, and then querying that data directly (but, to paraphrase JWZ, paraphrasing someone else, "Some people, when confronted with a problem, think 'I know, I'll use XML.' Now they have two problems."). Maybe the answer is that the relational database is an oddity, and that the much more common pattern is for simple, bad languages and bad data models to succeed, especially if they have simple models of computation and look like C (see for example Worse is Better, particularly "Models of Software Acceptance: How Winners Win").

Are there tools that will let me manipulate my data declaratively and efficiently, but then get out of my way when I want the data in R, or I want to write some ad-hoc analysis? Are there any production-level tools that don't have a huge start-up cost when data goes in, and that might actually give me some indication of when the data will come out? Is everyone just using various forms of delimited text to organise massive amounts of structured, semi-structured, and unstructured data? (Except banks and retailers, anyway.)

In any case, while I personally found Madden's research direction most accurate in describing what I need from databases in my work, people presented a number of other interesting research directions. Unfortunately, they're in a variety of formats (they're all originally from the Claremont Report page), so I've munged them and then put them on Slideshare for your perusal. (Some are a little bit like reading tea leaves without seeing the actual talk, but most seem pretty clear in content.)

What do you, as a reader of the Stanford InfoBlog, think is the most important research direction below? Was something missed that is near and dear to your heart? What solutions do you use today to manipulate big and exotic data in your work?

Update 2008-12-22 23:12:00 EST: Switched link to FQL to be a link to Hive. Good catch, Jeff!

Research Directions

Rakesh Agrawal
Anastasia Ailamaki
Philip A. Bernstein
Eric A. Brewer
Michael J. Carey
Surajit Chaudhuri
AnHai Doan
Michael J. Franklin
Johannes Gehrke
Le Gruenwald
Laura M. Haas
Alon Y. Halevy
Joseph M. Hellerstein
Yannis E. Ioannidis
Hank F. Korth
Donald Kossmann
Samuel Madden
Beng Chin Ooi
Raghu Ramakrishnan
Sunita Sarawagi
Michael Stonebraker
Alexander S. Szalay
Gerhard Weikum


  1. Blogger Hypercube | December 22, 2008 1:35 PM |  

    I think databases and security share a common property: everybody seems to think they're an expert in the field. Homebrewing an unscalable data management system (e.g. using a filesystem) is just as easy as homebrewing a completely insecure security system. Further, nobody realizes they have a bad system until it's too late. Homebrew security systems get compromised; the users of homebrew data management systems end up with terabytes of data in a non-standard format that can no longer be handled efficiently as flat files, and that takes an enormous engineering effort to import into a real database system.

  2. Blogger Paul Heymann | December 22, 2008 6:50 PM |  

    Cube:

    I think you highlight an important point, which is that a lot of data management issues are as much human or social issues as anything else. More specifically, you seem to be implying that a lot of these issues look like process or project management.

    It's not really clear who is to blame for the sort of process problems you describe though. It's odd, because it seems like the database vendors sort of implicitly assume that the data manipulation for an application should be built around an RDBMS from the start, and that it's bad project management to not be using a production relational data model from the beginning. But on the other hand, building a product which ignores that most (at least initially small-scale) projects are not going to be built in this way, with foresight for eventual data management needs, seems a little negligent (or at least not optimal).

  3. Anonymous Jeff | December 22, 2008 9:57 PM |  

    Hey Paul,

    I think you meant to reference Hive, Facebook's structured data management layer above Hadoop, not FQL, their SQL-like interface to their developer API.

    Great topic and post though; certainly looking forward to where all of this innovation leads in ten years.

    Regards,
    Jeff

  4. Blogger Paul Heymann | December 22, 2008 11:59 PM |  

    Jeff:

    Good catch, I've updated FQL to be Hive in the text above. I feel bad, Ragho even demoed Hive for me a few months ago!

    In any case, I'd like to take the opportunity to point out that there is a tiny contradiction in my comments above. As my colleague David pointed out to me, I claim all of these problems are terrible unsolved problems, and then explain how I've solved them in my own work using Hadoop and other tools. I guess the problem is that none of the current solutions seem to have that clean, efficient feeling to them, but maybe I'm (or we're all) being too picky. Hadoop in many ways seems to be happiest about taking our data with all of its warts.

    I never ended up posting a response to the whole Database Column MapReduce controversy (I, II). (Side note: Now that was a well-crafted troll...) But my main thought after reading it was that the authors (who are far wiser than I) really should read the (somewhat crazy in places) Software Acceptance Models if they haven't already.

    Specifically, and paraphrasing heavily, programmers flock to simple models of computation, simple implementations of those models, and languages that look like C (e.g., MapReduce and C++/Java in this case) and they won't use something that does not have these qualities. Simple implementations in particular mean that many people can become gurus: MapReduce is simple enough that it has now been rewritten at least three times I know of by different companies. Each time it gets rewritten, more people become experts on the internals, and understand details and limitations.

    Furthermore, as more and more programmers become invested in MapReduce, there will be more and more incentive to improve it until it becomes, maybe, 95% of what the right distributed DBMS solution would have been. This is already happening through efforts like Pig and Hive as well as (as far as I can tell) closed efforts like Sawzall. So maybe MapReduce, with its relatively low cost of boundary crossings, simple implementation, and exploding user base will eventually be the "database" solution that provides 95% of what we need and that we all have to use.

  5. Blogger Greg Linden | December 29, 2008 8:10 PM |  

    Excellent post, Paul.

    I have to say, a lot of this struck close to home, including the use of simple tools like sort, cut, Perl, Python and SQLite for small data processing, Hadoop (or similar) for big data processing, the pain of boundary crossings in RDBMSs and XML, and the wonder that we often seem to be able to do no better than the simplest format of delimited text.

    I am not sure how I missed it in the past, but I much enjoyed Richard Gabriel's "Models of Software Acceptance: How Winners Win". Thanks for pointing that out too.

  6. Blogger Paul Heymann | December 30, 2008 3:09 AM |  

    Greg:

    Hey, thanks for the note. It's weird, I don't think a lot of researchers are aware of the Worse is Better work even though it impacts a lot of what we do. I suppose some of the reason is that it is more of a theory, it wasn't really produced by a "science-y" researcher and it's not really that testable. But it really does seem to provide some compelling insights into how things get popular, even more so with even bigger network effects on the web.

    As an aside, if you haven't read Gabriel before, he's a really fascinating writer. I read a little bit of Patterns of Software (which is now available online) when I borrowed it off an old advisor's shelf a number of years ago. Beyond an expansion of some of the ideas in Acceptance Models, he's not afraid to just veer off somewhere completely different whenever he thinks it's interesting. His Stanford PhD and Lucid stories are both filled with a surprising level of detail and honesty (near the end of PoS).

