
SIGIR 2008 Trip Report (Posted by Paul Heymann)

I just got back from SIGIR'08 in Singapore. I was there to give a talk on my Social Tag Prediction paper, which hopefully I will get around to writing up a full post about sometime soon. In the meantime, I am posting some of my personal reactions to the conference. This is a long post, so feel free to skip down as none of the parts rely on the others.

Kai-Fu Lee Keynote

Keynotes seem to be pretty hard to get right. One hour is a long time, and they either end up too vague or too focused on a specific area of the speaker's interest. An even bigger issue for me with industry speakers (at least when the talk is not about a research paper) is that they often seem too worried about giving away company secrets. As a result, these talks can end up as reheated summaries of what all of the researchers already know, rather than offering any new insight into the challenges industry is facing. Kai-Fu Lee's keynote mostly avoided these pitfalls, and overall I thought it gave a really good overview of the challenges faced by Google in China.

Kai-Fu Lee Keynote

His talk was more or less in the form of bullet points about Google's strategy in China, so I will take that form here as well. (Greg Linden also has a blog post about the keynote here.)
  • The number of Internet users in China is growing at 35% each year, and accelerating. This means that whoever can capture new users (who presumably have no idea what they are doing) will ultimately be the market leader in China (assuming no outside intervention).
  • New users are kind of classless. For new users, result quality in search does not matter and neither does a clean user interface; new users either do not care or cannot yet appreciate these things. Instead, Google is focusing on Internet cafes, entertainment, and music, which users tend to like (also, blended search with video, news, URLs, and other verticals, called "universal search" at Google). These new users look at a clean, empty page and wonder whether the company forgot to finish the rest.
  • Google could not get people to spell its name. They tried a number of things before finally giving up and registering "".
  • Local engineers building new local products (rather than just cursory localization of worldwide products) seems to be key to Google's plans in China. Examples include an expanded Chinese Zeitgeist, a tool for finding holiday greetings to SMS, and various emergency relief efforts.
  • An interesting fact is that Chinese is substantially more dense than other languages. This substantially changes user interface design. For example, in English Google News, each entry needs a title, a date, and a snippet. However, in Chinese, the whole summary of the story can fit in the title area, obviating the need for snippets and leading to more stories per page.
  • Kai-Fu showed a graph with no axes, which he said related to mobile usage. He said that the iPhone web usage was 50 times higher than the nearest mobile competitor, even in China where iPhones are black market. There has been some reporting of the 50 times higher stat elsewhere. (On a personal note, when using my friends' iPhones, I find typing so inconvenient that I usually end up running to Google because I know that a search for "cnn" will give me the right result and then I do not have to type the full URL. Maybe my iPhone touch typing skills will improve with time.)
  • Users love clicking and hate typing due to annoying text input tools. This leads to a Google product to help with text input as well as a focus on (ugly) directories which these users love.
  • Piracy in China is high, and Kai-Fu (I could not tell how seriously) said that it had gone down from 99% to 96%. Whether or not this is true, it allowed Kai-Fu to make a little fun of his former employer, saying that a drop from 99% to 96% meant four times the revenues in China.
  • Google thinks that freeware authors who add ad software (but not malware) to their products may be a reasonable software distribution method in China. Of course, this is not without hazards: I think Kai-Fu stated that the average Chinese computer gets reinstalled every four months.
  • China has huge broadband penetration, mostly through Internet cafes.
  • There are more Internet users in China than the US, and the growth in Chinese users is accelerating.
  • The average age of a Chinese Internet user is 25, and this number is dropping. Huge numbers of fifteen year olds are getting online.
Overall, Kai-Fu said that all of the recent work of Google China had led to a market share increase from (I think) about 15% to 25% in two years. One thing I was not sure of was whether Kai-Fu was overstating the contribution of this strategy to the 10% increase. He pointed out a few recent deals I could not really evaluate, like one with China Telecom, which made me wonder whether Google had also gotten more politically savvy in China in the past few years. In any case, this was a great talk, and it left me at least (though perhaps not seasoned China watchers) with a lot to think about.

Chinese Users Outnumber US Users

Technical Papers

There were a number of technical papers and sessions that caught my eye.

This being the Stanford InfoBlog, it seems like I should start with the Stanford papers at SIGIR this year. There were three, one by Martin, one by Mengqiu, and one by myself. Martin (of the InfoLab) presented SpotSigs (DOI), a duplicate detection technique tuned for news stories and other scenarios where we care about the main text of a page rather than its navigational elements.

Martin Presents SpotSigs

Mengqiu (of the NLP group) presented models for improving passage based retrieval (DOI), work he did while at CMU with Luo Si (now at Purdue).

Mengqiu Presents Passage Based Retrieval

My social tag prediction paper (DOI) (with Dan Ramage of the NLP group) focused on how well we can predict tags in social bookmarking systems, and techniques for doing so (including SVM and association rules).

Paul Presents Social Tag Prediction

The tag prediction and tag recommendation problems seem to be getting really popular lately. In addition to the ECML/PKDD Discovery Challenge (which I wrote about here), a number of people are working on similar problems. The other two papers (DOI, DOI) in my session looked at tag recommendation (and related problems) as well.

Real-Time Tag Recommendation

Top-k Querying Social-Tagging Networks

After my talk, I met a number of people who have also been looking at association rules for tags or tag prediction in different contexts. I wish there were a better way to find out who is working on similar problems before the work itself is completed!
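For readers curious what association-rule tag prediction looks like in practice, here is a minimal sketch. To be clear, this is not our paper's actual implementation: the toy data, the thresholds, and the restriction to single-tag antecedents are all just for illustration.

```python
from collections import Counter
from itertools import combinations

# Toy bookmarking data: each post is the set of tags a user assigned to a URL.
posts = [
    {"python", "programming", "tutorial"},
    {"python", "programming", "web"},
    {"python", "tutorial"},
    {"javascript", "programming", "web"},
    {"python", "programming"},
]

def mine_rules(posts, min_support=2, min_confidence=0.6):
    """Mine single-antecedent rules tag_a -> tag_b from co-occurrence counts."""
    tag_counts = Counter()
    pair_counts = Counter()
    for tags in posts:
        tag_counts.update(tags)
        for a, b in combinations(sorted(tags), 2):
            pair_counts[(a, b)] += 1
            pair_counts[(b, a)] += 1
    rules = {}
    for (a, b), joint in pair_counts.items():
        if joint >= min_support:
            confidence = joint / tag_counts[a]
            if confidence >= min_confidence:
                rules[(a, b)] = confidence
    return rules

def predict_tags(known_tags, rules):
    """Predict additional tags implied by the tags already on a post."""
    predictions = {}
    for (a, b), confidence in rules.items():
        if a in known_tags and b not in known_tags:
            predictions[b] = max(predictions.get(b, 0.0), confidence)
    return predictions

rules = mine_rules(posts)
print(predict_tags({"python"}, rules))  # {'programming': 0.75}
```

A real system would mine rules with multi-tag antecedents over millions of posts, but the support/confidence machinery is the same.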

Later that day, I chatted a little bit with Brian Davison. Brian has been looking at link structure of related pages on the web for many years, for example, in the context of authority propagation, trust propagation, and hypertext classification. One thing I had not realized was that he actually got interested in the link structure of related pages through his dissertation work on prefetching and caching on the web. Previously, I had known his work just in the context of link structure, for example, his work on topical locality in the web (which I think I in turn came upon via Chakrabarti's excellent Mining the Web).

Brian Davison Presents Separate and Inequal

Brian gave an excellent talk (DOI) in the morning (standing in for one of his students, who had visa issues, which seemed fairly common at this conference), though I unfortunately missed his other student's talk (DOI) later that day on their recent work in web page classification.

Classifiers Without Borders

Unfortunately, my talk session was at the same time as the question answering session, which is a topic I have grown increasingly interested in of late. The three papers there were: 1 (DOI) 2 (DOI) 3 (DOI).

Collaborative Exploratory Search

Two papers that had some buzz about them were Pickens et al.'s paper on Collaborative Exploratory Search (DOI) and Liu et al.'s paper on BrowseRank (DOI). The former paper looked at how to work in teams on web search tasks, a topic which is apparently growing increasingly popular in the HCI community these days. The latter paper looked at results ranking based on user browsing behavior from toolbars. I think the Pickens paper ended up winning the Best Paper award.

Lastly, two talks that I found personally interesting were Jonathan Elsas' talk on models for blog feed search (DOI) (slides) and another talk (maybe by Peter Bailey?) on relevance assessment (DOI).

I have been following Jonathan's work for a while now, as he always seems to be doing something interesting with structured, social data. (I finally got to meet him at the conference as well, incidentally.) His talk was mostly about ways to combine both high-level and low-level structure in blogs, but I was most excited by a somewhat unrelated fact: they were using Wikipedia for pseudo-relevance feedback (PRF). Previous work (DOI) at SIGIR 2007 had looked at this as a possibility, but it was interesting to see both more confirmation that Wikipedia is good for PRF and further mechanisms for using it in that way. Mysteriously, his talk was the third half-hour talk in a set of sessions where all of the other sessions had two talks, so he seemed to have the spotlight. In any case, the room was packed.

Jonathan Elsas Presents Blog Feed Search
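As a rough illustration of the Wikipedia-for-PRF idea, here is a minimal sketch of classic pseudo-relevance feedback: take the top-ranked documents for a query (here, a toy stand-in for top-ranked Wikipedia articles), then add the highest-weighted non-query terms to the query. The corpus, the tf-idf-style weighting, and the cutoffs are all invented for illustration and are not Elsas et al.'s model.

```python
from collections import Counter
import math

# Toy stand-in for the top-ranked Wikipedia articles for the query "jaguar".
wikipedia = {
    "Jaguar (animal)": "jaguar big cat feline predator big cat habitat",
    "Jaguar Cars": "jaguar cars british luxury vehicle manufacturer",
    "Panthera": "panthera genus big cat lion tiger jaguar",
}

def expand_query(query_terms, feedback_docs, k=3):
    """Classic PRF: add the k highest-weighted non-query terms from the
    feedback documents, discounting terms that appear in every document."""
    n_docs = len(feedback_docs)
    tf = Counter()   # total term frequency across feedback docs
    df = Counter()   # number of feedback docs containing each term
    for text in feedback_docs.values():
        terms = text.split()
        tf.update(terms)
        df.update(set(terms))
    scored = {
        t: tf[t] * math.log((n_docs + 1) / df[t])
        for t in tf
        if t not in query_terms
    }
    expansion = sorted(scored, key=scored.get, reverse=True)[:k]
    return list(query_terms) + expansion

print(expand_query(["jaguar"], wikipedia))
```

With this toy corpus, "big" and "cat" dominate the expansion because they recur in the feedback documents without appearing everywhere at once.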

The talk on relevance assessment was interesting to me because it seemed to be pushing back on a trend which has been happening for a while now. Specifically, in the past few years, there has been a gradual trend towards using extremely cheap sources of labor for creating test collections for evaluating various tasks. For example, some recent work by a former member of the InfoLab looked at Mechanical Turk for evaluating entity resolution (DOI). The relevance assessment paper looked at three types of judges:
  • Gold Judges: Created a topic for a particular collection, and are experts in that topic. For instance, a history professor comes up with the topic "items worn by Abraham Lincoln" and judges results as relevant or non-relevant.
  • Silver Judges: Did not create the topic, but are experts in it. For instance, a history professor.
  • Bronze Judges: Did not create the topic and are not experts in it. Just a random user.
What the work found was that while all three types of judges were fine for making broad distinctions between systems, using poor judges could make the distinctions between top performing systems less clear, or even reverse the ordering of top systems in some cases.

That concludes my trip report from Singapore. Apologies for any inaccuracies in the above; I have not had a chance to read many of the papers in depth yet, so most of my observations come from my vague recollections of the talks. (Do leave a comment if you have any complaints!) Also, if you were at the conference (or were reading the conference papers) and feel there was important work not discussed here, do leave a comment as well!

Addendum: Yannis pointed me to this other (very recent) paper with more Mechanical Turk evaluation results. Several other people are blogging about SIGIR'08, like Daniel Tunkelang, Pranam Kolari, and Paraic Sheridan. All photos are from the SIGIR'08 website.


Why Uncertainty in Data is Great (Posted by Anish Das Sarma)

Background: Uncertain Data

With the advent and growing popularity of several new applications (such as information extraction on the web, information integration, scientific databases, sensor data management, entity resolution), there is an increasing need to manage uncertain data in a principled fashion. Examples of uncertain data are:
  1. Extraction: You extract structure from an HTML table or text on the Web, but due to the inherent uncertainty of the extraction process, you only have some confidence in the result. You extract the fact that John works for Google, but are not entirely sure, so you associate a probability of 0.8 with this fact.
  2. Human Readings: Let's suppose people are watching birds in California (as in the Christmas Bird Count): George reports that he saw a bird fly past, but wasn't sure whether it was a Crow or a Raven. He may also attach confidences to each: he is 75% sure it was a Crow, and gives it only a 25% chance of being a Raven.
  3. Sensors: Sensor S1 reported the temperature of a room to be 75±5.
  4. Inherent Data Uncertainty: From a weather forecast, you extract the fact that there will be rain in Stanford on Monday, but you only have 75% confidence in it.
  5. Data Integration/Entity Resolution: You are mapping schemas of various tables, and are unsure of whether "mailing-address" in one table corresponds to "home-address" or "office-address" in another table. We are de-duplicating a database of names, and are not sure whether "John Doe" and "J. Doe" refer to the same person.
There are many other examples of uncertainty arising in real-world scenarios.
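To make the examples above concrete, here is one way an uncertain tuple with mutually exclusive alternatives might be represented, in the spirit of (but not identical to) Trio-style x-tuples. The representation and the `prob` helper are purely illustrative:

```python
# An "x-tuple": one uncertain tuple with mutually exclusive alternatives,
# each carrying a confidence. The alternatives' confidences sum to 1.
sighting = {
    "observer": "George",
    "alternatives": [
        ({"bird": "Crow"}, 0.75),
        ({"bird": "Raven"}, 0.25),
    ],
}

def prob(x_tuple, predicate):
    """Probability that the uncertain tuple satisfies a predicate:
    sum the confidences of the alternatives that satisfy it."""
    return sum(c for alt, c in x_tuple["alternatives"] if predicate(alt))

print(prob(sighting, lambda alt: alt["bird"] == "Crow"))  # 0.75
```

Queries over an uncertain database then manipulate these alternative sets and confidences rather than single certain values.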

How Do We Deal With Uncertainty

With large volumes of uncertain data being produced that subsequently need to be queried and analyzed, there is a dire need to deal with this uncertainty in some way. At a high level, there are two approaches to dealing with data uncertainty:
  1. CLEAN (Approach-C): "Clean" the data as quickly as possible to get rid of the uncertainty. Once the data has been cleaned, it can be stored in and queried by any traditional DBMS, and life is good thereafter.
  2. MANAGE (Approach-M): In contrast, we could keep the uncertainty in data around, and "manage" it in a principled fashion. This involves building DBMSs that can store such uncertain data, and process them correctly, i.e., handle the probabilities, range of values, dependencies, etc.
Let us compare the two approaches. Both Approach-C and Approach-M entail several technical challenges. Cleaning uncertain data is not a trivial process by any stretch of the imagination, and there has been work in the database community on cleaning uncertain data in various environments. (For one piece of work on cleaning in sensor/RFID networks, check this IQIS 2006 paper.) Once the data has been cleaned, there is no additional effort involved, as it can be processed by any off-the-shelf DBMS. In contrast, Approach-M involves less upfront effort; however, processing uncertain data becomes significantly more challenging. I would like to highlight the Trio project at Stanford, which is building a system to manage uncertain data along with data lineage (also known as history or provenance). The lineage feature in Trio allows for an intuitive representation (see our VLDB 2006 paper) and efficient query processing (see our ICDE 2008 paper or a short DBClip). I would like to note that several other database groups are also studying the problem of managing uncertain data.

Why Managing Is Better Than Cleaning

Without going into technical details, I would like to describe why Approach-M is in general better than Approach-C. While Approach-C gives instant gratification by removing all "dirty data," Approach-M gives better long term results. Intuitively, Approach-C greedily eliminates all uncertainty, but Approach-M could resolve uncertainty more accurately later on because it has more information. Another way to look at it is that in Approach-C, the error involved in cleaning the data keeps compounding as we further query the certain data. But in Approach-M, since the uncertainty is explicitly modeled, we don’t have this problem.

Let us take a very simple example to see how uncertainty can be resolved using Approach-M. Consider the Christmas Bird Count described earlier, where people report bird sightings with a relational schema (Observer, Bird-ID, Bird-Name). (Suppose for the sake of this example birds are tagged with an identifying number, and the main challenge is in associating the number with the bird species.) In this extremely simple example, suppose there is just one sighting in our database by Mary, who saw Bird-1 and identified it as being either a Finch (80%) or a Toucan (20%). At this point we know that Bird-1 is definitely a Finch or a Toucan, but we are not sure which one it is. Using Approach-C, we would create a certain database, and since Mary feels much more confident that Bird-1 is a Finch, we would enter the tuple (Mary, Bird-1, Finch) into the database and forget that it could possibly have been a Toucan as well.

In Approach-M, we would store Mary’s sighting as is, which could be represented as:
(Mary, Bird-1, {Finch: 0.8, Toucan: 0.2})
Why is this better? Suppose the next day another independent observer, Susan, reports a sighting of Bird-1, and she thinks it's either a Nightingale (70%) or a Toucan (30%). The following day a third independent observer reports that Bird-1 is either a Hummingbird (65%) or a Toucan (35%). Clearly, when we reconcile all these sightings, the likelihood of Bird-1 being a Toucan is quite high. The method of "reconciling" these readings can be quite complicated, and is an important research topic in its own right, but any reasonable reconciliation should assign Bird-1 a high probability of being a Toucan, since all the other readings conflict with one another. Using Approach-C, however, all three observers' readings of Toucan would have been weeded out.
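As a sanity check on this example, here is one deliberately naive reconciliation rule: assume the observers are independent, multiply their per-species probabilities (treating a species an observer did not mention as probability zero), and renormalize. Real reconciliation methods are far more subtle; this sketch just shows the direction Approach-M enables.

```python
def reconcile(observations):
    """Combine independent observers' distributions over species by
    multiplying per-species probabilities (unmentioned species get 0),
    then renormalizing. A deliberately naive reconciliation rule."""
    species = set().union(*observations)
    scores = {s: 1.0 for s in species}
    for obs in observations:
        for s in species:
            scores[s] *= obs.get(s, 0.0)
    total = sum(scores.values())
    if total == 0:
        raise ValueError("observations are irreconcilable under this rule")
    return {s: p / total for s, p in scores.items() if p > 0}

observations = [
    {"Finch": 0.80, "Toucan": 0.20},        # Mary
    {"Nightingale": 0.70, "Toucan": 0.30},  # Susan
    {"Hummingbird": 0.65, "Toucan": 0.35},  # third observer
]
print(reconcile(observations))  # {'Toucan': 1.0}
```

Only Toucan is mentioned by all three observers, so it is the only species with a nonzero product: precisely the reading that Approach-C would have weeded out on day one.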

Uncertainty In Data Integration

For readers who are not convinced by the synthetic example above, here’s a very real application: data integration, which has been an important area of research for several decades.

Data integration systems offer a uniform interface to a set of data sources. As argued in our chapter in a forthcoming book on uncertain data, data integration systems need to model uncertainty at their core. In addition to the data itself being uncertain, the semantic mappings and the mediated schema may be approximate. For example, in an application like Google Base, which enables anyone to upload structured data, or when mapping millions of sources on the deep web, we cannot imagine specifying exact mappings. The chapter provides the theoretical foundations for modeling uncertainty in a data integration system.

At Google, we built a data integration system that incorporates this probabilistic framework and completely automatically sets up probabilistic mediated schemas and probabilistic mappings, after which queries can be answered. We applied our system to integrate tables gathered from all over the web in multiple domains, with between 50 and 800 data sources per domain. Details of this work can be found in our SIGMOD 2008 paper. We observed that the system is able to produce high-quality answers with no human intervention, and the answers obtained using the probabilistic framework were significantly better than those of any deterministic technique we compared against.
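To give a flavor of query answering over probabilistic mappings, here is a toy sketch in which each answer is scored by the total probability of the mappings that produce it (a "by-table"-style semantics). The attribute names, data, and scoring are illustrative, not the SIGMOD paper's implementation:

```python
# Probabilistic schema mappings: each candidate mapping from a source
# attribute to a mediated-schema attribute carries a probability.
mappings = [
    ({"mailing-address": "home-address"}, 0.6),
    ({"mailing-address": "office-address"}, 0.4),
]

source_rows = [
    {"name": "John Doe", "mailing-address": "123 Palm Dr"},
]

def answer(query_attr, rows, mappings):
    """Find values for a mediated-schema attribute, scoring each answer
    by the total probability of the mappings that support it."""
    scores = {}
    for mapping, p in mappings:
        for row in rows:
            for src, mediated in mapping.items():
                if mediated == query_attr and src in row:
                    key = (row["name"], row[src])
                    scores[key] = scores.get(key, 0.0) + p
    return scores

print(answer("home-address", source_rows, mappings))
# {('John Doe', '123 Palm Dr'): 0.6}
```

A query for "home-address" returns John's mailing address with score 0.6 rather than either asserting or discarding it outright, which is exactly the behavior a deterministic mapping cannot provide.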


Making Flexible Recommendations in CourseRank (Posted by Georgia Koutrika)

Some background on CourseRank

CourseRank is a social tool for course planning that we are developing at the InfoLab. CourseRank helps Stanford students make informed decisions about classes. It displays official university information and statistics, such as bulletin course descriptions, grade distributions, and results of official course evaluations. Students can anonymously rank courses they have taken, add comments, and rank the accuracy of each others' comments. They can also shop for classes, get personalized recommendations, and organize their classes into a quarterly schedule or devise a four-year plan.

CourseRank also functions as a feedback tool for faculty and administrators, ensuring that information is as accurate as possible. Faculty can modify or add comments to their courses, and can see how their class compares to other classes.

A year after its launch, the system is already being used by more than 6,200 Stanford students, out of a total of about 14,000 students. The vast majority of CourseRank users are undergraduates, and there are only about 7,000 undergraduates at Stanford. Thus, CourseRank is already used by a very large fraction of Stanford undergraduates.

Recommendations the Classical Way

Recommendations are an integral part of the system's functionality. In order to generate recommendations for students, we initially adopted existing recommendation approaches (see Adomavicius and Tuzhilin for a good survey). However, although the initial version of CourseRank has been very popular with students (see the editorial in the Stanford student paper), we soon had first-hand experience with the limitations of classical recommendation systems.

Limitation: Inflexible, canned recommendations

Like most commercial recommendation systems, our initial version offered no choices, just a fixed list of recommended courses for the student to consider. However, users may expect different recommendations under different circumstances, and we received many requests, from students and administrators, for more flexible recommendations. For example, students want to specify the type of course they are interested in (e.g., a biology class, something that satisfies Stanford's writing requirement). They also want to request recommendations based on a peer group they select (e.g., students in their major, or freshmen only) and based on different criteria (e.g., students in their major with similar grades or with similar ratings). For example, a student may want recommendations for CS courses from CS students with similar grades (i.e., with similar performance) and for dance classes from students with similar ratings (i.e., with similar tastes). In some cases, students want to see the recommendations the system would give their best friend, not themselves.

Limitation: Hard-wired recommendations

The recommendation algorithm is typically embedded in the system code, not expressed declaratively. From the designer viewpoint, this fact makes it hard to modify the algorithm, or to experiment with different approaches. However, we (the CourseRank implementers) want to experiment with different recommendation approaches: Would approach X be more effective than approach Y in our environment? What are the right weights for blending two recommendations (e.g., one based on what students like, and another based on what courses are more critical for completing a major)? What is the best way to predict, not good courses, but likely grades a student will obtain in future courses?

To accommodate our end-users' requests and also to meet our own research and development goals, we were faced with the prospect of implementing many different recommendation approaches and their variants. Therefore, we decided we needed more flexible recommendations, i.e., recommendations that could be easily described, customized, and executed by a single engine.

The idea of Flexible Recommendations (FlexRecs)

So, we designed FlexRecs, a framework for flexible recommendations over relational data. The general idea of FlexRecs can be described as follows:
  • A given recommendation approach is expressed declaratively as a high-level workflow. Imagine that such a workflow comprises a series of interconnected operators that describe how a recommendation is computed. In a workflow, sets of objects flow from one operator to the next, and are filtered, ranked, etc. in the process. The workflow is a “high-level” description in the sense that it does not contain actual code, but rather, it describes what code (operators) to call upon. These descriptions can also handle parameters, so that end-users can at run time personalize their recommendations, e.g., by choosing the peer group of students to provide recommendations, or by weighting recommendations that are blended.
  • A designer can easily express multiple workflows for different types of recommendations, as well as new types that the designer may think of. The end user can select from them, depending on her information needs. This selection is done through a GUI, which also allows the user to enter parameters for workflows in order to get more accurate and personalized recommendations. For instance, the user may specify that her recommendations be based on what courses students in her major are taking. Such functionality is essentially similar to advanced searches: a designer builds a set of parameterized queries. End users can transparently execute these queries with parameter values and receive different results through an advanced search interface.
Our framework describes how a recommendation workflow can specifically be defined for relational data, using traditional relational operators such as select, project and join, plus new recommendation operators that generate or combine recommendations. The recommendation operators may call upon functions in a library that implement common tasks for generating recommendations, such as computing the Jaccard or Pearson similarity of two sets of objects. The system can execute a workflow by "compiling" it into a sequence of SQL calls, which are executed by a conventional DBMS. When possible, library functions are compiled into the SQL statements themselves; in other cases we can rely on external functions that are called by the SQL statements.
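Here is a toy sketch in the spirit of FlexRecs: recommendation logic expressed as a pipeline of operators over relational-style rows, rather than code baked into the application. The operator names, similarity function, and data are all invented for illustration; the real framework compiles such workflows into SQL rather than running them in application code.

```python
# Toy ratings table: (student, course, rating) rows.
ratings = [
    {"student": "ann", "course": "CS106", "rating": 5},
    {"student": "ann", "course": "CS229", "rating": 4},
    {"student": "bob", "course": "CS106", "rating": 5},
    {"student": "bob", "course": "CS143", "rating": 5},
    {"student": "eve", "course": "HIST1", "rating": 3},
]

def jaccard(a, b):
    """Set similarity, as a stand-in for a library similarity function."""
    return len(a & b) / len(a | b) if a | b else 0.0

def select(rows, pred):
    """Relational select: the kind of operator a workflow composes."""
    return [r for r in rows if pred(r)]

def recommend(rows, target):
    """Recommend courses taken by students similar to `target`,
    weighting each candidate course by peer similarity."""
    courses = {}
    for r in rows:
        courses.setdefault(r["student"], set()).add(r["course"])
    mine = courses.pop(target, set())
    scores = {}
    for peer, theirs in courses.items():
        sim = jaccard(mine, theirs)
        for c in theirs - mine:
            scores[c] = scores.get(c, 0.0) + sim
    return sorted(scores.items(), key=lambda kv: -kv[1])

# A "workflow": restrict to a peer group (here, CS raters), then recommend.
peer_rows = select(ratings, lambda r: r["course"].startswith("CS"))
print(recommend(peer_rows, "ann"))
```

Swapping the `select` predicate (a different peer group) or the similarity function changes the recommendation strategy without touching the engine, which is the flexibility the framework is after.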

We have built a flexible recommendation engine for compiling and executing FlexRecs workflows as part of CourseRank. The FlexRecs framework has indeed made it possible to define many recommendation strategies (at least all of the ones we have been able to think of to date). The new version of CourseRank, to be released September 2008, will let end-users tailor their recommendations. Although the end-users will only see choices from simple menus, and select values from sliders, they will actually be selecting what workflow to run, and with what parameters.


Database Research Principles Revealed (Posted by Jennifer Widom)

Last summer I was named recipient of the 2007 ACM SIGMOD Edgar F. Codd Innovations Award, an honor that came with both good news and bad news. The good news: $1000 and something new to spice up my bio. The bad news: A last-minute trip to Beijing at an inopportune time (though enjoyable in the end) to deliver a plenary talk at the conference.

Back when Hector received the Innovations Award in 1999, there was only the good news part; an invited SIGMOD conference talk for the winner was introduced with Jeff's award in 2006. The problem with this type of talk is that you're not allowed to just trot out your latest research spiel. The talk is expected to be sweeping, insightful, and (most of all) entertaining, while still remaining technical enough to avoid any hushed remarks about being an over-the-hill armchair researcher who thinks only big thoughts and no little ones.

I spent a lot of time mulling over what I could say to these conference-goers in Beijing, at least those who didn't sneak off to visit the Forbidden City instead. (Not many of them did, probably thanks to the drizzle and thick smog.) I decided to solidify some research strategies and pet peeves that I believe have influenced my entire career, with very concrete examples for technical credibility, and photographs to keep it entertaining.

Slides from the talk are available in PowerPoint and pdf. This blog post summarizes the key points.

Finding Research Ideas

There's no magic to finding research areas, at least for me. I started working in Active Databases at IBM because I was told to. I started working in Data Warehousing at Stanford because Hector came back from a company visit one day saying it was the latest hot thing and there might be some research in it. I worked in Semistructured Data as an offshoot of our Data Integration project -- the integration part made me uncomfortable so I decided to build a DBMS for our "lightweight self-describing object model" instead. Data Streams was an area I'd always felt just plain made sense, but it took years for me to convince any students to work on it. Lastly, my current work on Uncertainty and Lineage is an idea that just popped into my head one day during my morning jog. Really.

I never know where the next idea is coming from, or when it will arrive, which is actually kind of scary since I don't like to stay in areas too long. One small but interesting observation: Although I've worked in what seem to be diverse areas, the problem of Incremental View Maintenance has popped up in every single one of them.

Finding Research Topics

Once a research area has been selected, how does one find a topic within that area? Here I actually do have a strategy. If you take one of the many simple but fundamental assumptions underlying traditional database systems, and drop it, the entire kit and caboodle of data management and query processing often needs to be revisited. (I like the analogy of pulling at a loose thread in a garment, ultimately unraveling the whole thing.) Once you need to revisit the data model, query language, storage and indexing structures, query processing and optimization, concurrency control and recovery, and application and user interfaces, you've got yourself a bunch of thesis topics and a fun prototype to develop.

I followed this recipe for Semistructured Data (dropped assumption: schema declared in advance), Data Streams (dropped assumption: data resides in persistent data sets), and now Uncertain Data (dropped assumption: tuples contain exact values). Of course you don't need to revisit every aspect of data management and query processing every time, but so far there have always been plenty of topics to go around.

The Research Itself

Here comes my biggest pet peeve. If one is to follow my recipe and reconsider data management and query processing for a new kind of DBMS, it's imperative to think about all three of the critical components -- data model, query language, and system -- and in that order! We in research have a rare luxury, compared to those in industry, that we can mull over a data model for a long time before we move on to think about how we'll query it, and we can nail down a solid syntax and semantics for a query language before we implement it. This sequence is not only a luxury; I consider it a requirement for good research: lay down the foundations cleanly and carefully before system-building begins. This policy has been the biggest underlying principle of my research and, I believe, the primary reason for its success (on those occasions it's been successful).

Let's look briefly at the three critical components, then talk about how to disseminate research results.

Data Model

Nailing down a new data model that "works" is hardly a trivial task. The talk (here are the PowerPoint and pdf links again) provides some concrete examples of subtleties in data stream models, where the same query can (and across current systems, does) give very different results depending on some hidden and often overlooked aspects of a stream model. In the Trio project, we debated uncertainty data models for nearly a year before settling on the one we used, and it was well worth it in the end.
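As a toy illustration of how hidden model choices change results, consider "the average over a 3-element window" of the same stream under two different window models. (This example is mine, not one from the talk; the stream and window size are made up.)

```python
# The "same" query -- a windowed average of a stream -- under two
# different window models. Hidden choices like this are exactly where
# stream semantics diverge across systems.
stream = [10, 20, 30, 40, 50, 60]

def tumbling_averages(values, size):
    """Non-overlapping windows: emit one average per full window."""
    return [
        sum(values[i:i + size]) / size
        for i in range(0, len(values) - size + 1, size)
    ]

def sliding_averages(values, size):
    """Overlapping windows: emit an average at every new element."""
    return [
        sum(values[i:i + size]) / size
        for i in range(len(values) - size + 1)
    ]

print(tumbling_averages(stream, 3))  # [20.0, 50.0]
print(sliding_averages(stream, 3))   # [20.0, 30.0, 40.0, 50.0]
```

Both are perfectly reasonable readings of "average over a window of 3," yet they produce different output streams, which is why the model must be pinned down before the query language can mean anything.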

Query Language

Like data models, the subtleties involved in query language design are often underestimated. First, there seems to be some confusion between syntax and semantics: from a research perspective, only semantics is really interesting. For example, if we apply SQL syntax to a data stream model, or to a model for uncertain data, we certainly can't declare victory -- in these new models it's often unclear what the semantics of a syntactically-obvious SQL query really are. (Here too, concrete examples are given in the talk.) For both the STREAM and Trio projects, just the task of specifying an exact semantics for SQL queries over the new model was a significant challenge.

Unfortunately, the challenges and contributions of specifying a new query language (or new semantics for an existing one) don't tend to be recognized in traditional ways. Publishing a SIGMOD or VLDB paper about a query language is near impossible. After many failed attempts to publish a paper describing the Lore query language, we finally sent it to a new journal that was desperate for submissions. The Lorel paper now has over 500 citations on Citeseer (over 1200 on Google Scholar) and was among the top-100 cited papers in all of Computer Science for a spell. The fact is that language papers are very difficult to publish, but they can have huge impact in the long run. Unfortunately that's tough to explain to a graduate student.

In another of my favorite language-related stories, I was confused about the semantics of a "competing" trigger (active database) language to the one I was designing; this was way back around 1990. I asked the obvious person running the other project (who shall remain nameless, but is very tall) what the semantics would be of a specific set of triggers in his language. His response: "Hmm, that's a tricky one. I would have to run it to find out."
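The problem with "run it to find out" is easy to reproduce. Here is a toy sketch of my own (hypothetical triggers, not from either actual language): two triggers fire on the same update, and the final state depends entirely on the firing order the semantics chooses -- so without a specified order, the language has no semantics at all.

```python
# Two hypothetical triggers on the same UPDATE event.
def trigger_double(db):  # ON UPDATE: double the balance
    db["balance"] *= 2

def trigger_bonus(db):   # ON UPDATE: add a flat bonus
    db["balance"] += 100

def run_update(new_balance, firing_order):
    """Apply an update, then fire triggers in the given order."""
    db = {"balance": new_balance}
    for trig in firing_order:
        trig(db)
    return db["balance"]

print(run_update(50, [trigger_double, trigger_bonus]))  # 200
print(run_update(50, [trigger_bonus, trigger_double]))  # 300
```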

The talk includes examples not only of trickiness in applying SQL to new models, but also subtleties in designing query languages for semistructured data and for data streams. It also demonstrates a guiding principle for designing query language semantics in the "modified-relational" models I tend to work with: reuse relational semantics whenever possible (which is not the same thing as reusing SQL or even relational algebra syntax); it's a clean and well-defined place to start, and can cover a lot of ground if the semantics are compartmentalized well.
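A minimal sketch of that reuse principle, loosely inspired by the stream-to-relation approach of CQL in the STREAM project (my simplification, not the real language): a window operator maps the stream to an ordinary relation, and then the query's semantics is just plain relational aggregation over that snapshot -- nothing stream-specific needs to be reinvented.

```python
# Hypothetical stream of (timestamp, value) pairs.
stream = [(1, 10.0), (2, 20.0), (3, 30.0), (4, 40.0)]

def rows_window(stream, now, n):
    """Stream-to-relation: the n most recent tuples, as an ordinary relation."""
    return [t for t in stream if t[0] <= now][-n:]

def select_avg(relation):
    """Relation-to-relation: plain relational AVG, reused unchanged."""
    vals = [v for _, v in relation]
    return sum(vals) / len(vals) if vals else None

# Semantics of "SELECT AVG(v) FROM S [ROWS 2]" at time 4 is just relational
# AVG applied to the window snapshot -- the well-defined place to start.
print(select_avg(rows_window(stream, now=4, n=2)))  # 35.0
```

Only the window operator is new; everything inside it inherits well-understood relational semantics, which is exactly the compartmentalization the principle asks for.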

System

After all that thinking, debating, designing, specifying, and proving that goes into figuring out a new data model and query language, building a prototype system to realize them is a very satisfying finishing step, and critical for full impact of new ideas.

I'll admit the model-language-system sequence isn't quite as clean a division as I've made it out to be: When building a system and trying it out, one inevitably discovers flaws in the data model and query language, and there tends to be at least a moderate feedback loop. Even so, working out (modified) foundations before committing them to code is, in my mind, rule number one.

Disseminating Research Results

I have strong feelings on this topic. First, if you've done something important, don't wait to tell others about it. There's no place for secrecy (or laziness) in research, and there's every place for being the first one with a new idea or result. Write up your work, do it well and do it soon, post it on the web and inflict it on friends.

Second, don't get discouraged by SIGMOD and VLDB rejections. Those conferences aren't the only places for important work, by a long shot. Workshops often reach the most important people in a specific area. I've always been a fan of SIGMOD Record (and more recently the CIDR conference) for disseminating ideas or results that, for whatever reason, aren't destined for a major conference.

Finally, build prototypes and make them easy to use. That means a decent interface (both human and API), and even more importantly setting things up so folks can try out the prototype over the web before committing to a full download and install.


Should Ad Networks Bother Fighting Click Fraud? (Posted by Bobji Mungamuru)

It's an understatement to say that online advertising has become big business. Much of the growth has been due to the emergence of advertising networks such as Google, Yahoo!, and others.

Over the last few years, however, click fraud has emerged as a significant threat to the business model of the online advertising industry. (I recommend this book chapter if you want to learn more about click fraud.)

You can think of it as an analogue to a "currency crisis" – advertisers (investors) avoid buying ads (currency) because they are worried about the devaluation of ads due to rampant fraud, and it is this lack of advertiser (investor) confidence itself that causes the value of ads to fall.

The central issue is this: Suppose an advertising network (or "ad network", for short) decides that a given click-through is "fraudulent". The implication is that the ad network will not bill the advertiser for that invalid click. On the other hand, if the click is marked valid, the ad network could charge full price for it. Therefore, arguably, the ad network is "leaving money on the table" by marking clicks fraudulent. As such, why would ad networks even bother fighting fraud?

Google CEO Eric Schmidt famously claimed in July 2006 that there in fact was a "perfect economic solution" to click fraud, which is to just "let it happen". He conjectured that the auctions used to sell ads are "self-correcting", since market prices would naturally adjust for fraud. Is Dr. Schmidt right? Is it economically reasonable for ad networks to just let fraud happen?

In my paper, "Should Ad Networks Bother Fighting Click Fraud? (Yes, They Should.)", I analyze a simple economic model of the online advertising market, and conclude that Dr. Schmidt is quite wrong. In fact, from an ad network's perspective, to "let it happen" is absolutely the worst economic solution! Ad networks that develop effective algorithms for detecting fraud, and aggressively apply these algorithms, gain a significant competitive advantage in the online marketplace.
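One back-of-the-envelope way to see the intuition (my own toy numbers and risk-aversion assumption, not the model from the paper): if advertisers discount their bids for uncertainty about which billed clicks are valid, a network that aggressively filters fraud reduces that uncertainty and can sustain higher revenue per click than one that "lets it happen".

```python
def revenue_per_click(value, fraud_rate, detect_rate, uncertainty,
                      risk_penalty=0.5):
    """Toy model: network revenue per raw click.

    Advertisers bid their expected value per billed click, minus a
    risk penalty proportional to their uncertainty about validity
    (a hypothetical assumption, not the paper's actual model)."""
    billed = 1.0 - fraud_rate * detect_rate       # fraction of clicks billed
    valid_share = (1.0 - fraud_rate) / billed     # valid fraction among billed
    bid = value * valid_share - risk_penalty * uncertainty
    return bid * billed

# "Let it happen": no detection, advertisers face high uncertainty.
lazy = revenue_per_click(value=1.0, fraud_rate=0.3,
                         detect_rate=0.0, uncertainty=0.2)

# Aggressive detection: 90% of fraud filtered, low advertiser uncertainty.
vigilant = revenue_per_click(value=1.0, fraud_rate=0.3,
                             detect_rate=0.9, uncertainty=0.06)

print(round(lazy, 3), round(vigilant, 3))  # 0.6 0.678
```

With perfectly informed, risk-neutral advertisers the two revenues would coincide (Schmidt's self-correction); the gap opens precisely when advertiser confidence matters, which is the currency-crisis dynamic described above.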

In other words, Google's official response to the "let it happen" fiasco, while well-intentioned, missed the whole point: it's not out of the "goodness of their heart" that Google should fight fraud; rather, it is out of sheer economic self-interest.

I presented a shorter version of this paper at the Financial Cryptography and Data Security conference this January. The title of the paper was "Competition and Fraud in Online Advertising Markets".
