
Database Research Principles Revealed (Posted by Jennifer Widom)

Last summer I was named recipient of the 2007 ACM SIGMOD Edgar F. Codd Innovations Award, an honor that came with both good news and bad news. The good news: $1000 and something new to spice up my bio. The bad news: A last-minute trip to Beijing at an inopportune time (though enjoyable in the end) to deliver a plenary talk at the conference.

Back when Hector received the Innovations Award in 1999, there was only the good news part; an invited SIGMOD conference talk for the winner was introduced with Jeff's award in 2006. The problem with this type of talk is that you're not allowed to just trot out your latest research spiel. The talk is expected to be sweeping, insightful, and (most of all) entertaining, while still remaining technical enough to avoid any hushed remarks about being an over-the-hill armchair researcher who thinks only big thoughts and no little ones.

I spent a lot of time mulling over what I could say to these conference-goers in Beijing, at least those who didn't sneak off to visit the Forbidden City instead. (Not many of them did, probably thanks to the drizzle and thick smog.) I decided to solidify some research strategies and pet peeves that I believe have influenced my entire career, with very concrete examples for technical credibility, and photographs to keep it entertaining.

Slides from the talk are available in PowerPoint and pdf. This blog post summarizes the key points.


Finding Research Ideas

There's no magic to finding research areas, at least for me. I started working in Active Databases at IBM because I was told to. I started working in Data Warehousing at Stanford because Hector came back from a company visit one day saying it was the latest hot thing and there might be some research in it. I worked in Semistructured Data as an offshoot of our Data Integration project -- the integration part made me uncomfortable so I decided to build a DBMS for our "lightweight self-describing object model" instead. Data Streams was an area I'd always felt just plain made sense, but it took years for me to convince any students to work on it. Lastly, my current work on Uncertainty and Lineage is an idea that just popped into my head one day during my morning jog. Really.

I never know where the next idea is coming from, or when it will arrive, which is actually kind of scary since I don't like to stay in areas too long. One small but interesting observation: Although I've worked in what seem to be diverse areas, the problem of Incremental View Maintenance has popped up in every single one of them.

Finding Research Topics

Once a research area has been selected, how does one find a topic within that area? Here I actually do have a strategy. If you take one of the many simple but fundamental assumptions underlying traditional database systems, and drop it, the entire kit and caboodle of data management and query processing often needs to be revisited. (I like the analogy of pulling at a loose thread in a garment, ultimately unraveling the whole thing.) Once you need to revisit the data model, query language, storage and indexing structures, query processing and optimization, concurrency control and recovery, and application and user interfaces, you've got yourself a bunch of thesis topics and a fun prototype to develop.

I followed this recipe for Semistructured Data (dropped assumption: schema declared in advance), Data Streams (dropped assumption: data resides in persistent data sets), and now Uncertain Data (dropped assumption: tuples contain exact values). Of course you don't need to revisit every aspect of data management and query processing every time, but so far there have always been plenty of topics to go around.

The Research Itself

Here comes my biggest pet peeve. If one is to follow my recipe and reconsider data management and query processing for a new kind of DBMS, it's imperative to think about all three of the critical components -- data model, query language, and system -- and in that order! We in research have a rare luxury compared to those in industry: we can mull over a data model for a long time before we move on to think about how we'll query it, and we can nail down a solid syntax and semantics for a query language before we implement it. This sequence is not only a luxury; I consider it a requirement for good research: Lay down the foundations cleanly and carefully before system-building begins. This policy has been the biggest underlying principle of my research and, I believe, the primary reason for its success (on those occasions it's been successful).

Let's look briefly at the three critical components, then talk about how to disseminate research results.

Data Model

Nailing down a new data model that "works" is hardly a trivial task. The talk (here are the PowerPoint and pdf links again) provides some concrete examples of subtleties in data stream models, where the same query can (and across current systems, does) give very different results depending on some hidden and often overlooked aspects of a stream model. In the Trio project, we debated uncertainty data models for nearly a year before settling on the one we used, and it was well worth it in the end.
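
Here's a toy of my own (not one from the talk) to convey the flavor of these hidden choices. Suppose the stream model leaves unspecified the order of tuples that share a timestamp. A row-based sliding window then gives different answers for the same logical stream:

```python
# A toy sketch (mine, not from the talk): if a stream model does not specify
# the order of tuples with equal timestamps, a row-based sliding window can
# return different answers for the same logical stream.

def row_window_sums(stream, size=2):
    """Sums over a sliding window of the last `size` tuples, by arrival order."""
    window, sums = [], []
    for ts, value in stream:
        window.append(value)
        if len(window) > size:
            window.pop(0)
        sums.append(sum(window))
    return sums

# Two physical arrivals of the same logical stream: the first two tuples
# carry equal timestamps, so the model considers the streams identical.
a = [(1, 10), (1, 20), (2, 5)]
b = [(1, 20), (1, 10), (2, 5)]

print(row_window_sums(a))  # [10, 30, 25]
print(row_window_sums(b))  # [20, 30, 15] -- same stream, different answers
```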

Query Language

Like data models, the subtleties involved in query language design are often underestimated. First, there seems to be some confusion between syntax and semantics: from a research perspective, only semantics is really interesting. For example, if we apply SQL syntax to a data stream model, or to a model for uncertain data, we certainly can't declare victory -- in these new models it's often unclear what the semantics of a syntactically obvious SQL query really are. (Here too, concrete examples are given in the talk.) For both the STREAM and Trio projects, just the task of specifying an exact semantics for SQL queries over the new model was a significant challenge.
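
Here's a small sketch of my own (again, not one of the talk's examples) of syntax outrunning semantics. SELECT SUM(price) FROM orders is syntactically obvious, but once tuples carry alternative values, possible-worlds semantics turns the single answer into a set of possible answers:

```python
from itertools import product

# A hedged illustration (mine): "SELECT SUM(price) FROM orders" over an
# uncertain relation. Each tuple lists its alternative prices; a possible
# world picks one alternative per tuple, so the query has a *set* of answers.
orders = [
    [100],       # certain tuple: price is 100
    [20, 30],    # uncertain tuple: price is 20 or 30
    [0, 50],     # uncertain tuple: contributes 0 or 50
]

possible_sums = {sum(world) for world in product(*orders)}
print(sorted(possible_sums))  # [120, 130, 170, 180] -- no single "the sum"
```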

Unfortunately, the challenges and contributions of specifying a new query language (or new semantics for an existing one) don't tend to be recognized in traditional ways. Publishing a SIGMOD or VLDB paper about a query language is near impossible. After many failed attempts to publish a paper describing the Lore query language, we finally sent it to a new journal that was desperate for submissions. The Lorel paper now has over 500 citations on Citeseer (over 1200 on Google Scholar) and was among the top-100 cited papers in all of Computer Science for a spell. The fact is that language papers are very difficult to publish, but they can have huge impact in the long run. Unfortunately that's tough to explain to a graduate student.

In another of my favorite language-related stories, I was confused about the semantics of a "competing" trigger (active database) language to the one I was designing; this was way back around 1990. I asked the obvious person running the other project (who shall remain nameless, but is very tall) what the semantics would be of a specific set of triggers in his language. His response: "Hmm, that's a tricky one. I would have to run it to find out."
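
The punchline is easy to reconstruct in miniature (a toy of my own, not either of the actual trigger languages): if two triggers fire on the same update and the language never pins down their firing order, "run it to find out" really is the only semantics on offer:

```python
# A toy sketch (not the actual 1990 trigger languages): two triggers react
# to an update of x, and the final state depends on a firing order the
# language leaves unspecified.

def t1(db):  # ON UPDATE OF x: SET y = x + 10
    db["y"] = db["x"] + 10

def t2(db):  # ON UPDATE OF x: SET y = y * 2
    db["y"] = db["y"] * 2

for order in ([t1, t2], [t2, t1]):
    db = {"x": 5, "y": 0}  # the update of x has just occurred
    for trigger in order:  # which order? the language doesn't say
        trigger(db)
    print([t.__name__ for t in order], "->", db["y"])
# [t1, t2] -> 30, but [t2, t1] -> 15
```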

The talk includes examples not only of trickiness in applying SQL to new models, but also subtleties in designing query languages for semistructured data and for data streams. It also demonstrates a guiding principle for designing query language semantics in the "modified-relational" models I tend to work with: reuse relational semantics whenever possible (which is not the same thing as reusing SQL or even relational algebra syntax); it's a clean and well-defined place to start, and can cover a lot of ground if the semantics are compartmentalized well.
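
Here's a minimal sketch of that principle for uncertain data (the relation and query are hypothetical): define the query once over ordinary relations, then lift it to an uncertain relation by evaluating it independently in every possible world:

```python
from itertools import product

# A minimal sketch of "reuse relational semantics": q is an ordinary
# relational query; the lifted semantics simply applies q in each possible
# world of the uncertain relation. The data and query are hypothetical.

def q(relation):
    """Ordinary relational selection: tuples with value > 25."""
    return frozenset(t for t in relation if t[1] > 25)

uncertain = [                      # each tuple lists its alternatives
    [("a", 10), ("a", 40)],
    [("b", 30)],
]

answers = {q(frozenset(world)) for world in product(*uncertain)}
for ans in sorted(answers, key=len):
    print(sorted(ans))
# [('b', 30)]
# [('a', 40), ('b', 30)]
```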

System

After all that thinking, debating, designing, specifying, and proving that goes into figuring out a new data model and query language, building a prototype system to realize them is a very satisfying finishing step, and critical for the full impact of new ideas.

I'll admit the model-language-system sequence isn't quite as clean a division as I've made it out to be: When building a system and trying it out, one inevitably discovers flaws in the data model and query language, and there tends to be at least a moderate feedback loop. Even then, working out (modified) foundations before committing them to code is, in my mind, rule number one.

Disseminating Research Results

I have strong feelings on this topic. First, if you've done something important, don't wait to tell others about it. There's no place for secrecy (or laziness) in research, and there's every place for being the first one with a new idea or result. Write up your work, do it well and do it soon, post it on the web and inflict it on friends.

Second, don't get discouraged by SIGMOD and VLDB rejections. Those conferences aren't the only places for important work, by a long shot. Workshops often reach the most important people in a specific area. I've always been a fan of SIGMOD Record (and more recently the CIDR conference) for disseminating ideas or results that, for whatever reason, aren't destined for a major conference.

Finally, build prototypes and make them easy to use. That means a decent interface (both human and API), and, even more importantly, setting things up so folks can try out the prototype over the web before committing to a full download and install.
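
As a sketch of how little "try it over the web" can take (my illustration; run_query is a hypothetical stand-in for a real prototype's entry point), Python's standard library alone is enough to put a demo endpoint up:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

def run_query(q: str) -> str:
    """Hypothetical stand-in for the prototype's real query entry point."""
    return f"results for: {q!r}"

class DemoHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Serve /?q=... so visitors can try queries before downloading anything.
        q = parse_qs(urlparse(self.path).query).get("q", [""])[0]
        body = run_query(q).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8000), DemoHandler).serve_forever()
```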
