
CIDR 2009 Trip Report (Posted by Steven Whang)

This collaborative blog post was written by InfoLab members who attended the recent CIDR conference held Jan. 4-7, 2009 at the Asilomar Conference Grounds. The viewpoints in the blog do not necessarily represent the entire InfoLab.

Instead of trying to cover all the interesting work presented, we focus on major trends centered on the two keynotes and the Best Paper Award. These three talks covered important research directions for the database community: user interfaces, power-sensitive systems, and new hardware for databases. We then briefly mention works by InfoLab members/alums.

1. User Interfaces

The first keynote by Jeff Heer demonstrated how people can easily collaborate on data analysis. As an example, we were shown visualizations of United States census data over the last 150 years using sense.us, a prototype web application for social visual data analysis. Users can analyze the data (e.g., adding an annotation that the sharp decrease in the number of people with military jobs in the late 1920's was due to the Great Depression) and easily share their results with others (e.g., posting a URL of their view). Hence, the key contribution of sense.us is supporting asynchronous collaboration for visualization. 

The demonstration clearly showed the importance of a good user interface for database systems. As DBMSs and query languages like SQL are used by a broader audience than just programmers, there is a need to provide easier tools for end-user data management and manipulation. While sense.us is already an excellent tool for collaboration, it remains to be seen how database systems can adopt the underlying science of human interaction. Several issues were raised by the audience, including managing groups of collaborators, preserving privacy, and exporting visualizations to other systems. 

Two other works in the conference also focused on user interfaces. A presentation by Yannis Ioannidis discussed the challenges of providing a natural language user interface for databases (e.g., a database should give back an answer like "The director's name is Woody Allen" instead of a table). A presentation by Zachary Ives demonstrated CopyCat, a tool that provides an interface for integrating data without having to design a schema; the CopyCat system "learns" the schema based on copies and pastes made by the user.

2. Power-sensitive Systems

The second keynote by James Hamilton proposed a cost and power efficient system for internet-scale services (CEMS project). The first observation made was that server efficiency is key to improving the overall data center power efficiency (nearly 60% of the power delivered to a high-scale data center is delivered to servers). Next, servers were built using low-cost, low-power client or embedded components. The resulting CEMS prototype outperforms a high-scale commercial internet service by 379% in terms of performance/joule. Hence, the prototype is a significant contribution to the recent trend of power-sensitive systems. 

In addition to the keynote on the CEMS project, which improved hardware for better power efficiency, there was discussion on the software side of improving power efficiency. Mehul Shah's presentation argued that data management software should also be optimized for power efficiency and suggested promising areas in database systems that could improve. Stavros Harizopoulos's "Gong Show" presentation went a step further and suggested that performance should be thrown away in favor of power efficiency! A presentation by Willis Lang proposed two specific techniques that trade performance for power efficiency. 

3. New Hardware for Databases

The Best Paper Award was given to "uFLIP: Understanding Flash IO Patterns" (Luc Bouganim et al.), which proposed a benchmark for flash devices. The motivation is that commercially available flash devices do not behave as the flash chips they contain due to the additional layer of block mapping, wear-leveling, and error correction. Consequently, a benchmark is necessary to understand the complex behavior of flash devices (e.g., block writes are not uniform in time) that can help algorithm and system design. The authors compared various flash devices using their uFLIP benchmark and produced helpful guidelines for algorithm development (e.g., random writes should be limited to a focused area of 4-16MB in order to perform nearly as well as sequential writes). 
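The random-write guideline can be made concrete with a small probe in the spirit of a uFLIP-style micro-benchmark. This is a hedged sketch, not the actual uFLIP code; the file name, file size, and write counts are all illustrative, and on a real flash device you would probe the raw device rather than a file:

```python
import os
import random
import time

BLOCK = 4096                  # 4 KB write granularity (illustrative)
FILE_SIZE = 64 * 2 ** 20      # 64 MB probe file (illustrative)
FOCUS = 16 * 2 ** 20          # 16 MB "focused area" from the guideline

def timed_writes(f, offsets):
    """Issue one BLOCK-sized write at each offset and return elapsed seconds."""
    buf = b"x" * BLOCK
    start = time.perf_counter()
    for off in offsets:
        f.seek(off)
        f.write(buf)
    f.flush()
    os.fsync(f.fileno())
    return time.perf_counter() - start

if __name__ == "__main__":
    n = 256  # probe writes per pattern
    with open("flash_probe.bin", "wb") as f:
        f.truncate(FILE_SIZE)
        seq = [i * BLOCK for i in range(n)]
        focused = [random.randrange(FOCUS // BLOCK) * BLOCK for _ in range(n)]
        scattered = [random.randrange(FILE_SIZE // BLOCK) * BLOCK
                     for _ in range(n)]
        print("sequential writes:      %.4fs" % timed_writes(f, seq))
        print("random within 16 MB:    %.4fs" % timed_writes(f, focused))
        print("random over whole file: %.4fs" % timed_writes(f, scattered))
    os.remove("flash_probe.bin")
```

On a flash device that follows the uFLIP guideline, the focused random pattern should land close to the sequential timing, while the fully scattered pattern should be noticeably slower.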

A natural question to ask is whether we can use an interface for flash memory that allows databases to directly access flash cards (instead of relying on black-box flash devices). In the case of disks, database systems usually access the raw disk (instead of accessing the disk through a file system) and disable the operating system's data caching in order to handle caching themselves. Moreover, database systems sometimes even have full control over processor scheduling and the mapping of page tables where both tasks are usually left to the operating system. We suspect that directly accessing flash cards is currently a difficult task compared to directly accessing disks. 

Another advance in new hardware is using multi-core processors (or chip-level multiprocessors, CMP). Ippokratis Pandis's Gong Show presentation provided various solutions for scaling databases on CMPs.

4. Works by InfoLab Members/Alums

InfoLab members Georgia Koutrika and Benjamin Bercovitz presented CourseRank, a popular course evaluation and planning social system that is used by over 9,000 students out of 14,000 at Stanford (most undergrads use CourseRank). Based on their experience with CourseRank, Georgia and Benjamin proposed various research challenges for social sites such as encouraging information discovery (using tag clouds) and enabling flexible recommendations in a declarative fashion. 

InfoLab alum Chris Olston presented a general two-phase approach for interactive querying over web-scale data. The idea is to first supply a query template in advance to the system, which can then prepare auxiliary structures (e.g., materialized views and indexes) to facilitate real-time query responses later on. As a result, interactive querying is possible for a general class of queries and data at a very large scale. 

InfoLab alum Shivnath Babu presented an integrated diagnostic tool for database and SAN administrators. Using an abstraction that ties together the execution path of queries and the SAN, it is possible to diagnose query slowdowns caused by combinations of events across the database and the SAN. 

InfoLab alum Yannis Papakonstantinou presented app2you, a web application creator that lets developers create web applications without doing database coding or design. The app2you platform represents a tradeoff point between having a wide application scope (e.g., by building applications using Java, Ajax, and SQL) and providing ease of specification (e.g., by simply copying an application template).

The following InfoLab members/alums co-authored papers, but did not give talks:

Check out other blog posts on CIDR 2009 by Joe Hellerstein, Pat Helland, and Leigh Dodds.


Vacation Post: Claremont (Berkeley) Database Research Self-Assessment Revisited (Posted by Paul Heymann)

It's vacation time in the InfoLab, so I thought I would write a follow-up post on a previous topic. In June, Hector wrote a post about attending the Berkeley Database Research Self-Assessment. A few months later (in late August), the Self-Assessment came out as the Claremont Report on Database Research (PDF available here).

There was a little bit of discussion at the time. A dbworld message went out on August 19th, and there were a few follow-ups. Alon Halevy was the first to (briefly) blog about the report; Dave Kellogg gave a good summary of the main points, as did Tasso Argyros at the Aster Data Systems Blog; and there are a number of other posts with in-depth comments, like the ebiquity blog looking for more Semantic Web discussion and Stephen Arnold discussing massive heterogeneous data. Two posts connect the Claremont Report's focus on big data to Nature's recent issue on Big Data.

I would summarise the report, but I actually think it's pretty clear. (These are my thoughts, incidentally, and may not represent the views of all of the InfoLab on such a broad report.) We are at a historic point in data management, and there are huge opportunities. Many communities would like better tools for data management, and would be happy and willing to learn them if we provided them (as long as it isn't Datalog, see Hellerstein below... "Datalog? OMG!"). But, we're not, or at least, not really. Sam Madden's contribution actually struck a chord with me as someone who is much more often a consumer rather than a creator of database technology (working mostly on the web rather than core database research):

At the risk of feeding what ultimately may be a really well crafted troll, what Madden describes is what I face on a daily basis. My usual tools end up being things like awk, sort, join, ad-hoc Python and Perl scripts, SQLite with a ramdisk or otherwise in memory, and other one-off or only somewhat appropriate tools, even when my data is relational and I could phrase my queries much more succinctly in a declarative language. Rather than being able to use a distributed DBMS to do parallel work, I end up using MapReduce (Hadoop), usually with some hacks to use a higher-level language (currently Dumbo; maybe I'll try Pig or Hive again sometime soon).

Is anyone seriously working to address this problem? It seems much sexier to work on new semantics (e.g., semi-structured data, streams, uncertainty, anonymity) or new ways to optimise retrieval (e.g., column stores, self-tuning). But neither of these really addresses what seems to be the massive cost of boundary crossing. This isn't to deride any of the work on new semantics or new optimizations; on the contrary, that work is extremely important for the database community to remain relevant to a wide community of potential users. But, it seems to take forever to get data into a database (and exotic bulk loading tools make it complex as well), index it, and get it ready to be queried. Then it takes forever to get data back out if you are using the database for declarative data manipulation rather than having online queries be the end result. Maybe the answer is having data in XML, and then querying that data directly (but, to paraphrase JWZ, paraphrasing someone else, "Some people, when confronted with a problem, think 'I know, I'll use XML.' Now they have two problems."). Maybe the answer is that the relational database is an oddity, and that the much more common pattern is for simple, bad languages and bad data models to succeed, especially if they have simple models of computation and look like C (see for example Worse is Better, particularly "Models of Software Acceptance: How Winners Win").

Are there tools that will let me manipulate my data declaratively and efficiently, but then get out of my way when I want the data in R, or I want to write some ad-hoc analysis? Are there any production level tools that don't have a huge start-up cost when data goes in, and that might actually give me some indication of when the data will come out? Is everyone just using various forms of delimited text to organise massive amounts of structured, semi-structured, and unstructured data? (Except banks and retailers anyway.)

In any case, while I personally found Madden's research direction most accurate in describing what I need from databases in my work, there are a number of interesting research directions that people presented about. Unfortunately, they're in a variety of formats (they're all originally from the Claremont Report page), so I've munged them and then put them on Slideshare for your perusal. (Some are a little bit like reading tea leaves without seeing the actual talk, but most seem pretty clear in content.)

What do you, as a reader of the Stanford InfoBlog, think is the most important research direction below? Was something missed that is near and dear to your heart? What solutions do you use today to manipulate big and exotic data in your work?

Update 2008-12-22 23:12:00 EST: Switched link to FQL to be a link to Hive. Good catch, Jeff!

Research Directions

Rakesh Agrawal

Anastasia Ailamaki

Philip A. Bernstein

Eric A. Brewer

Michael J. Carey

Surajit Chaudhuri

AnHai Doan

Michael J. Franklin

Johannes Gehrke

Le Gruenwald

Laura M. Haas

Alon Y. Halevy

Joseph M. Hellerstein

Yannis E. Ioannidis

Hank F. Korth

Donald Kossmann

Samuel Madden

Beng Chin Ooi

Raghu Ramakrishnan

Sunita Sarawagi

Michael Stonebraker

Alexander S. Szalay

Gerhard Weikum


Clustering the Tagged Web (Posted by Daniel Ramage)

I'm looking forward to presenting Clustering the Tagged Web (mirror) at WSDM 2009. The paper is joint work with Paul Heymann, Hector Garcia-Molina, and Christopher D. Manning. We examine how and when social web bookmarks like those found on del.icio.us can be used to improve unsupervised flat clustering algorithms on a web corpus. The one-line summary is that tags should be incorporated as a parallel information channel to page text and to anchor text to gain maximum benefit on broad clustering tasks.

A social web bookmark is like a regular web bookmark of a URL by some web user, except that the social web bookmark is posted to a centralized web site with a set of associated tags. A tag is (usually) a single-word annotation that describes the URL being bookmarked. Earlier this year, del.icio.us, a popular social bookmarking site, had a page up that defined a tag as:
"A tag is simply a word you use to describe a bookmark. Unlike folders, you make up tags when you need them and you can use as many as you like. The result is a better way to organize your bookmarks and a great way to discover interesting things on the Web." -- del.icio.us

In the screenshot above, the web page of Strunk and White's classic book on writing, The Elements of Style, has been tagged with "writing," "reference," "english," and "grammar" by users.

Why tags? Tags are a rich source of information about a large subset of the web's most relevant pages. We believe they are a particularly useful source of information for automatic web page clustering because, by and large, each is a human-provided annotation of page content. Additionally, new, high quality documents on the web are often tagged before they establish much of a footprint in the web's link graph [Can Social Bookmarking Improve Web Search?]. So making good use of tags promises impact when link structure and anchor text are still spotty.

Given a large set of items (web pages in our case), the goal of a flat clustering algorithm with hard assignments is to coherently group together those items into some smaller number of clusters.
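As a minimal illustration of flat clustering with hard assignments, here is a one-dimensional k-means sketch. This is illustrative only; the paper's experiments of course cluster high-dimensional text and tag vectors, not scalars:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal flat clustering with hard assignments (1-D k-means)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Hard assignment: each item belongs to exactly one cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: abs(p - centers[c]))
            clusters[nearest].append(p)
        # Re-estimate each cluster center as the mean of its members.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return clusters

data = [1.0, 1.2, 0.8, 10.0, 10.5, 9.9]
print(kmeans(data, 2))
```

The six items end up coherently grouped into two clusters, one around 1 and one around 10; the question the paper studies is how tags should enter the distance computation in the realistic, high-dimensional version of this loop.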

Why clustering? Automatic clusterings of web pages into topical groups can impact the quality of information retrieval on the tagged web in several ways, including: promoting diversity of search results, enabling improved cluster-driven search interfaces like scatter/gather, and improving upon language-model based retrieval algorithms, among others. Plus it's a fairly well understood task, but one in which a new data source like tags really should (and indeed does) have a substantial impact. Our goal in the paper is to show how that impact can best be achieved.

Some of the paper's main findings are:
  1. Staying within the standard vector space model (VSM), the K-means clustering algorithm can be modified to make effective use of tags as well as page text. We found that weighting all word dimensions in aggregate equally to (or with some constant multiple of) all the tag dimensions resulted in the best performance. In other words, if you normalize tag dimensions and word dimensions independently, then your clustering model improves. This insight may well apply to other tasks in the VSM.

  2. Our experience incorporating tagging in the VSM prompted us to consider other clustering approaches with the potential to more directly model the difference between tags and page text. One such approach is a new hidden-variable clustering model, Multi-Multinomial Latent Dirichlet Allocation (MM-LDA), similar to latent Dirichlet allocation (LDA). Like LDA, MM-LDA assumes that every document is made up of a (weighted) mixture of topics, and that each word comes from the word distribution of one of those topics. MM-LDA also assumes that each tag is analogously chosen from a per-topic tag distribution. (The figure above shows MM-LDA's plate diagram for those who are familiar with Bayesian graphical models; it can be safely ignored by those who aren't.) The upshot is that MM-LDA acts as a high-performing clustering algorithm that makes even better use of tags and page text by treating each as a set of observations linked only by a shared topic distribution. It's fairly fast and generally outperforms K-means.

  3. Anchor text - the text surrounding links to a target web page - is another kind of web-user provided page annotation, but it acts quite differently than tags, at least from the perspective of web page clustering. Treating anchor text, page text, and tags as independent information channels is again the best way to combine these three types of signal. We found that including tags improves performance even when anchor text was provided to the clustering algorithm. However, including anchor text as an additional source of annotations doesn't really improve on just clustering with words and tags.

  4. If you're trying to infer a relatively small number of clusters in a very diverse set of documents, it makes sense to utilize page text and tags jointly as described above. But if you're looking to cluster a set of pages that are already a specific subset (like pages about Programming Languages), tags often directly encode a sort of cluster membership (like "java" and "python"). So when tags are at the same level of specificity as the clusters you're trying to infer and you have enough tags to avoid sparsity issues, you might do better clustering on tags alone, ignoring page text.
We invite you to read the paper and I hope to see some of you in Barcelona!
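The independent-normalization idea in finding 1 can be sketched as follows. This is a simplified illustration under made-up inputs; the weighting constant and vectors are not from the paper:

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors pass through)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def combine(word_counts, tag_counts, tag_weight=1.0):
    """Normalize the word block and the tag block of a document vector
    independently, then concatenate. In aggregate the tag dimensions
    then carry as much weight as all the word dimensions, times an
    optional constant multiplier."""
    words = l2_normalize(word_counts)
    tags = [tag_weight * x for x in l2_normalize(tag_counts)]
    return words + tags

# A page with many word occurrences but only a few tags: after independent
# normalization, the tag channel is no longer drowned out by the word channel.
doc = combine([12, 7, 0, 3], [1, 0, 2])
print(doc)
```

Normalizing the whole concatenated vector at once would let the (typically much larger) word block dominate; normalizing each channel first is what lets the tags contribute on equal footing.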


An Often Ignored Collaboration Pitfall: Time Phase Agenda Mismatch (Posted by Andreas Paepcke)

[An earlier version of the following thoughts was posted on an internal online forum of the Council on Library and Information Resources (CLIR). The material was further discussed and developed at CLIR's symposium on Promoting Digital Scholarship: Formulating Research Challenges in the Humanities, Social Sciences, and Computation.]

The Stanford Infolab has enjoyed a multi-year string of active, cross disciplinary collaborations. We have worked closely with biodiversity researchers, physicians, and political scientists on projects of mutual interest. Several publications emerged from these collaborations, not just in the CS community, but also in the Biology literature [e.g. 1, 2, 3, 4, 5].

Stanford University, the National Science Foundation, and others attempt to encourage cross-disciplinary efforts through financial and other incentives. In our experience such collaborations are in fact highly beneficial. They are not, however, trivial to manage.

Time Phase Agenda Mismatch

Every cross-disciplinary project we have been involved in has experienced some degree of mismatch between what would be the optimal activity for each party at any given time. For example, the new computing tool that would provide optimal, immediate progress to, say, a political scientist might be of no interest to a computer scientist needing to publish, because the underlying science for the tool was developed several years ago.

Vice versa, a cutting-edge CS prototype might either be too exotic for use by a political scientist trained in more standard tools, or it might prove too brittle and incomplete for everyday use. In entering collaborative work both partners therefore need to be clear about expectations.

Note that all parties in an endeavor might well agree that long-term collaboration is the right approach. The problem lies in the day-to-day decisions about resource and time allocation. A look at the traditional process of computer science research will clarify the issue from the CS point of view.

Computer Science Workflow

Here is the required workflow for many research university computer science faculty: Propose to the National Science Foundation an important, difficult-to-solve problem, plus thoughts towards a solution. Grant in hand, compete with other faculty at the same university for student interest. Ph.D. students are the most valuable in this competition, because they will stay longer than Masters or Undergraduate students and will dig deeper.

The faculty member's responsibility towards Ph.D. students is to move them towards graduation. This task requires the identification of constituent sub-problems, whose solutions will be published in highly regarded computer science conferences or journals. Often the work will include a prototype that is stable enough for performance measurements or usability testing. Very rarely will this prototype include all the details that would be required for practical use.

In fact, forcing Ph.D. students into such 'filler' work of adding the required bells and whistles to a prototype might be considered irresponsible, because these students are already trained for this type of work and need new challenges.

Employment of Masters students can be, and often is, the answer. Two issues arise around this solution. First, the best Masters students will be looking to tackle cutting edge CS work. Being offered filler work, they will choose other projects, leaving only less talented or insufficiently trained students who then need very significant supervision.

The second downside of hiring Masters students for filler work is that the investment---currently about $75,000 per academic year at many institutions---will not move a computer science professor closer to the next grant that will be required to feed the existing Ph.D. students who usually straddle the time boundaries of at least two grants.

Where's the Payoff?

Enter the biologist, physician, historian, political scientist, or law scholar in the cross-disciplinary enterprise. Let me denote this person the 'partner'. We assume here that the common vision of a collaborative project is compelling to both participants. Both are perfectly well disposed towards the other. Let us even assume that the respective fields' jargon as well as deeper conceptual notions are mutually understood. Assume further that the CS professor will hear and understand the needs of the partner.

A novel prototype is now constructed with important input from the partner. Everyone is rightfully excited. But now the problem sets in. The CS professor and the involved students will write a CS paper, and they are then ready to move on to the next sub-problem of the collaborative project. The partner, in contrast, is eager to start using the tool, ... which breaks under even mild use and does not include all the required features.

Do note that this state of a prototype is acceptable in the context of CS publications. A perfectly honest prototype is expected to be built up to the point where the *salient* features are solid and can be measured. It is understood in the CS research community that the remainder of a prototype may be a scaffold. That state of affairs is not a scam. Taking software from prototype to product quality is extremely expensive and, again, will not lead to progress in the students' or professor's research career.

Where are we now in this scenario? The CS professor is impatient to move on to the next sub-problem within the project. The partner is disappointed. He has invested significant time explaining his problems to the CS team, and testing intermediate results. Now, when his labor's results seem near, they are not.

The know-how for the often very large remainder, the filler work, was developed in CS years ago, when the partner did not need it. Now he does, but the CS resources are not allocated. The CS professor's and the partner's agendas are out of phase, even though their long term goals match.

The take-away point is that a collaboration agreement must address this situation before work begins. Expectations must be managed and mutually understood.

Some Solutions

Our own past successes broke out of this difficulty along different paths. Admittedly, we did not plan any of these solutions in advance. In one case the possibility of a startup company was enough to make the partner's work worth-while to him. Sometimes, if CS results from the prototype promise economic interest, an existing company might license the ideas and work the prototype into a product, from which the partner can then benefit. Delays in the partner's satisfaction are naturally built into this solution.

In another case the succession of published results led to follow-on funding that included resources for the partner. The CS-typical rapid forward movement without full development of the covered terrain thereby benefited everyone: The readers of resulting publications were learning; the CS professor and partner enjoyed the satisfaction of having produced knowledge that neither could have produced alone; the funding agencies produced innovation in accordance with their mission; the CS professor's future research will be colored by the new understanding of the partner community's needs, and the partner can enjoy financial resources in addition to having gained an improved understanding of what is easy in CS, and what is hard. Future collaborations will thereby be improved as well. The disadvantage of this solution is that the partner's research community cannot see the impact on their area of expertise until much later.

Yet another model we have followed is for research staff to skip vacations and to spend the summer implementing filler portions of a prototype. This activity means that correspondingly little grand thinking is achieved during that time. But the prototype moves to full usability by the partners. Unfortunately, this solution is difficult to scale.

No matter the field of a partner, the computer science side will often need to engage in at least some 'grunt work' at some point in the project. This work needs to be of immediate, convincing benefit to the partner. CS research culture will need to learn how to accommodate these activities even though they are currently often not respected.

The Role of Funding Agencies

Some calls for funding proposals require proof that the output tangibles of the research---prototypes, data sets, and such---will be maintained and expanded after expiration of the grant. While likely motivated by the right concerns, such a requirement is usually impractical. For what can proposing research organizations promise?

A startup company is one option for a continuation promise. Unfortunately, economic feasibility can usually not be predicted in the context of advanced research projects. The promise of a startup company is therefore unrealistic at the time proposals are written.

Another promise might be the hire of full-time staff that will care for the tangibles after the grant terminates. Two problems arise with this solution. First, such staff needs to be financed over long periods of time---a commitment most funding agencies are unable and unwilling to make.

Second, a CS research organization cannot, grant after grant, staff the maintenance of ever more orphaned tangibles. Such an organization would quickly run dry of funds and of supervision resources for the students to whom educational institutions, at least, owe their focus.

Unfunded mandates in calls for grant proposals are thus not a likely answer. One possibility might be for grants to include money specifically for hardening prototypes. For example, such funds might be spent to hire the student(s) who constructed the prototype for the summer following their graduation. The advantage of this solution is that the creators of the prototype are in the best position to improve code quickly. However, salaries would likely need to be higher than what is typical for students, because first, these potential hires will be graduates at that point, and second, the work of hardening is not desirable for many (at that point former) students.

Another component to addressing the problem of time phase agenda mismatch would be for funding agencies truly to acknowledge the efforts of non-CS partners in collaborative grants. Concretely, such acknowledgment would mean that subsequent proposals by, say, a historian could realistically cite the results of an earlier collaborative effort as past achievement in the field of history. Even if the collaborative effort did not immediately lead to changes in historical inquiry, the advancement of computing methods towards use by historians must 'count' as a true contribution.


In summary, cross-disciplinary computing projects harbor immense potential for both parties. Both can be inspired just by grasping the other's mode of thought. The potential exists for moving both fields forward. Frequently, however, results of cross-disciplinary work cannot advance both disciplines equally during any given phase of a collaboration. When one party is satisfied, additional work, time, and money are often required to provide satisfactory closure for the other as well. Satisfaction will usually not be symmetric at any given time during a collaboration. Both sides must anticipate overhead work that would not be considered worthy of attention in a single-disciplinary activity. Funding agencies can play a role by (i) encouraging the hardening of tools, and (ii) creating a culture where collaboration is rewarded with favorable consideration for future funding even if the significance of outcomes is asymmetric among the participating parties.

The answer to the above complications in cross-disciplinary work is to adjust reward structures and foster the cultural adjustments that will be required across the disciplines. The potential benefits are well worth this effort.


Generic Entity Resolution with Negative Rules (Posted by Steven Whang)

Entity Resolution (ER) is the process of identifying records that refer to the same real-world entity and merging them together. For example, two companies that merge may want to combine their customer records: for a customer who dealt with both companies, they create a composite record that combines the known information.

However, the process for matching and merging records is most often application-specific, complex, and error-prone. The input records may contain ambiguous and not fully specified data, and it may be impossible to capture all the application nuances and subtleties in whatever logic is used to decide when records match and how they should be merged. Thus, the set of resolved records (after ER) may contain "errors" that may be apparent to a domain specialist. For example, we may have a customer record with an address in a country we do not do business with. Or two different company records where the expert happens to know that one company recently acquired the other, so they are now the same entity.

In our paper, we address the identification and handling of such inconsistencies using negative rules, which are predicates that take a number of records and return whether the records are consistent or not. For example, if ER mistakenly merges two person records with different genders, we can specify a binary negative rule that flags an inconsistency for any record that has both genders. The negative rules are then used with the original ER rules to find a consistent ER solution.
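The gender example can be sketched as a toy binary negative rule. The match/merge rules and the record format below are illustrative only, not the paper's actual model; a merged record here simply accumulates the attribute values of its sources:

```python
# Records are dicts mapping an attribute to the set of observed values
# (a merged record accumulates the values of its sources). Illustrative only.

def match(r1, r2):
    """Toy ER match rule: records match if they share a name value."""
    return bool(r1["name"] & r2["name"])

def merge(r1, r2):
    """Toy ER merge rule: union the value sets attribute by attribute."""
    return {k: r1[k] | r2[k] for k in r1}

def gender_rule(record):
    """Binary negative rule: a resolved record carrying both genders
    is flagged as inconsistent (returns False)."""
    return len(record["gender"]) <= 1

a = {"name": {"J. Smith"}, "gender": {"F"}}
b = {"name": {"J. Smith"}, "gender": {"M"}}
if match(a, b):
    merged = merge(a, b)
    print("consistent" if gender_rule(merged) else "inconsistent")  # -> inconsistent
```

Here the match rule merges the two records on the shared name, and the negative rule then flags the resulting record because its gender set contains both values.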

There are two reasons why we cannot simply modify the ER rules instead of using the negative rules. First, the negative rules are only used to check the consistency of the final ER result and not the consistency of the intermediate records that are created during the ER process. For example, a record that has both genders may turn out to be Female (based on more evidence) at the end of the ER process (and thus consistent). Second, the negative rules can be viewed as an effort of a second party to fix the errors made by the ER rules of the first party.

We propose a general algorithm that finds a consistent ER solution and an enhanced algorithm that is more efficient than the general algorithm, but assumes certain properties on the negative and ER rules. Our main findings are as follows:
  • Negative rules can significantly improve the accuracy of the ER process. Depending on the strategy, a domain expert may need to help resolve the inconsistencies.
  • Different combinations of negative rules may result in different accuracy and runtime results. The negative rules themselves also need to be correct and effectively pinpoint the errors of the ER rules.
  • Applying negative rules is an expensive process (at least quadratic) that should not be run on the entire dataset if the dataset is very large. A common technique in ER is to use blocking techniques (many variations exist) where the entire dataset is divided into smaller blocks, and the blocks are processed one at a time. The negative rules can be used within each block.
  • Even when those properties do not hold, the enhanced algorithm can efficiently produce results that are nearly identical to those of the general algorithm.
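The blocking point above can be sketched as follows. The block key (a zip code) and the pairwise negative rule are hypothetical, chosen only to show that the quadratic check runs within blocks rather than over the whole dataset:

```python
from itertools import combinations

def check_blocks(records, block_key, negative_rules):
    """Apply pairwise negative rules within each block only, avoiding a
    quadratic pass over the entire dataset. Illustrative names throughout."""
    blocks = {}
    for r in records:
        blocks.setdefault(block_key(r), []).append(r)
    inconsistent = []
    for block in blocks.values():
        for r1, r2 in combinations(block, 2):  # quadratic, but per block
            if any(not rule(r1, r2) for rule in negative_rules):
                inconsistent.append((r1, r2))
    return inconsistent

# Toy negative rule: two customer records in the same block must not claim
# the same account id (returns True when the pair is consistent).
records = [
    {"zip": "94305", "account": 1},
    {"zip": "94305", "account": 1},
    {"zip": "10001", "account": 1},   # different block: never compared
]
no_shared_account = lambda r1, r2: r1["account"] != r2["account"]
bad = check_blocks(records, lambda r: r["zip"], [no_shared_account])
print(len(bad))  # -> 1
```

Only the two records in the "94305" block are compared and flagged; the third record shares an account id but sits in a different block, so the quadratic rule check never sees that pair.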