Comments on Stanford InfoBlog: Vacation Post: Claremont (Berkeley) Database Research Self-Assessment Revisited (Posted by Paul Heymann)

2008-12-30T03:09:00.000-08:00

Greg:

Hey, thanks for the note. It's weird, I don't think a lot of researchers are aware of the Worse is Better work even though it impacts a lot of what we do. I suppose part of the reason is that it is more of a theory, it wasn't really produced by a "science-y" researcher, and it's not really that testable. But it really does seem to provide some compelling insights into how things get popular, even more so given the even bigger network effects on the web.

As an aside, if you haven't read Gabriel before, he's a really fascinating writer. I read a little bit of Patterns of Software (which is now available online) when I borrowed it off an old advisor's shelf a number of years ago. Beyond an expansion of some of the ideas in Acceptance Models, he's not afraid to just veer off somewhere completely different whenever he thinks it's interesting. His Stanford PhD and Lucid stories are both filled with a surprising level of detail and honesty (near the end of PoS).

2008-12-29T20:10:00.000-08:00

Excellent post, Paul.

I have to say, a lot of this struck close to home, including the use of simple tools like sort, cut, perl, python and sqlite for small data processing, Hadoop (or similar) for big data processing, the pain of boundary crossings in RDBMSs and XML, and the wonder that we often seem to be able to do no better than the simplest format of delimited text.
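As a rough illustration of the small-data workflow mentioned above, here is a minimal sketch that loads tab-delimited text into an in-memory SQLite table and runs one aggregate query. The sample rows, the table name `tags`, and the function name `tag_counts` are all made up for the example; nothing here comes from the original post.

```python
import csv
import io
import sqlite3

# Hypothetical tab-delimited sample data (person, tag), invented for illustration.
RAW = "alice\tpython\nbob\tsql\nalice\tsql\ncarol\tsql\n"

def tag_counts(text):
    """Load delimited rows into an in-memory SQLite table and count rows per tag."""
    # Parse the delimited text into a list of [person, tag] rows.
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tags (person TEXT, tag TEXT)")
    conn.executemany("INSERT INTO tags VALUES (?, ?)", rows)
    # Most frequent tags first; ties broken alphabetically by tag.
    result = conn.execute(
        "SELECT tag, COUNT(*) FROM tags GROUP BY tag ORDER BY COUNT(*) DESC, tag"
    ).fetchall()
    conn.close()
    return result

print(tag_counts(RAW))
```

The boundary crossing here is about as cheap as it gets: the data stays delimited text on disk and only becomes a table for the duration of the query.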

I am not sure how I missed it in the past, but I much enjoyed Richard Gabriel's "Models of Software Acceptance: How Winners Win". Thanks for pointing that out too.

2008-12-22T23:59:00.000-08:00

Jeff:

Good catch, I've updated FQL to be Hive in the text above. I feel bad, Ragho even demoed Hive for me a few months ago!

In any case, I'd like to take the opportunity to point out that there is a tiny contradiction in my comments above. As my colleague David pointed out to me, I claim all of these problems are terrible unsolved problems, and then explain how I've solved them in my own work using Hadoop and other tools. I guess the problem is that none of the current solutions seem to have that clean, efficient feeling to them, but maybe I'm (or we're all) being too picky. Hadoop in many ways seems to be happiest about taking our data with all of its warts.

I never ended up posting a response to the whole Database Column MapReduce controversy (I, II). (Side note: Now that was a well-crafted troll...) But my main thought after reading it was that the authors (who are far wiser than I) really should read the (somewhat crazy in places) Software Acceptance Models if they haven't already.

Specifically, and paraphrasing heavily, programmers flock to simple models of computation, simple implementations of those models, and languages that look like C (e.g., MapReduce and C++/Java in this case) and they won't use something that does not have these qualities. Simple implementations in particular mean that many people can become gurus: MapReduce is simple enough that it has now been rewritten at least three times I know of by different companies. Each time it gets rewritten, more people become experts on the internals, and understand details and limitations.
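To give a sense of how small the core model is, here is a toy, single-process sketch of the map, group-by-key, and reduce steps. It is not how Hadoop or any production system is implemented (there is no distribution, fault tolerance, or disk-based shuffle), and the names `map_reduce`, `word_mapper`, and `count_reducer` are hypothetical, chosen only for this example.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Toy in-memory MapReduce: map each record, group values by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        # The mapper emits zero or more (key, value) pairs per record.
        for key, value in mapper(record):
            groups[key].append(value)
    # The reducer collapses all values seen for one key into a single result.
    return {key: reducer(key, values) for key, values in groups.items()}

def word_mapper(line):
    """Emit (word, 1) for every whitespace-separated word in the line."""
    for word in line.split():
        yield word.lower(), 1

def count_reducer(key, values):
    """Sum the counts emitted for one word."""
    return sum(values)

lines = ["worse is better", "better is relative"]
counts = map_reduce(lines, word_mapper, count_reducer)
print(counts)
```

Everything a real implementation adds (partitioning, shuffling across machines, retrying failed tasks) sits around this core, which is arguably why so many people can understand and rewrite it.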

Furthermore, as more and more programmers become invested in MapReduce, there will be more and more incentive to improve it until it becomes, maybe, 95% of what the right distributed DBMS solution would have been. This is already happening through efforts like Pig and Hive as well as (as far as I can tell) closed efforts like Sawzall. So maybe MapReduce, with its relatively low cost of boundary crossings, simple implementation, and exploding user base will eventually be the "database" solution that provides 95% of what we need and that we all have to use.

2008-12-22T21:57:00.000-08:00

Hey Paul,

I think you meant to reference Hive, Facebook's structured data management layer above Hadoop, not FQL, their SQL-like interface to their developer API.

Great topic and post though; certainly looking forward to where all of this innovation leads in ten years.

Regards,
Jeff

2008-12-22T18:50:00.000-08:00

Cube:

I think you highlight an important point, which is that a lot of data management issues are as much human or social issues as anything else. More specifically, you seem to be implying that a lot of these issues look like process or project management.

It's not really clear who is to blame for the sort of process problems you describe, though. It's odd, because the database vendors seem to implicitly assume that the data manipulation for an application should be built around an RDBMS from the start, and that it's bad project management not to be using a production relational data model from the beginning. But on the other hand, building a product which ignores that most (at least initially small-scale) projects are not going to be built in this way, with foresight for eventual data management needs, seems a little negligent (or at least not optimal).

2008-12-22T13:35:00.000-08:00

I think databases and security share a common property: everybody seems to think they're an expert in the field. Homebrewing an unscalable data management system (e.g. using a filesystem) is just as easy as homebrewing a completely insecure security system. Further, nobody realizes they have a bad system until it's too late. Homebrew security systems get compromised; the users of homebrew data management systems end up with terabytes of data in a non-standard format that can no longer be handled efficiently as flat files and that takes an enormous engineering effort to import into a real database system.