InfoLab Logo Header

Who is the Data Leaker? (Posted by Panagiotis Papadimitriou)

In the course of doing business, sometimes sensitive data must be handed over to supposedly trusted third parties. For example, social networking sites, such as Facebook, share users' data with social applications owners (a user approval is often required). Similarly, enterprises that outsource their data processing have to give data to various other companies. Data may also be shared for other purposes, e.g., a hospital may give patient records to researchers who will devise new treatments.

We call the owner of the data the distributor and the supposedly trusted third parties the agents. In a perfect world there would be no need to hand over sensitive data to parties that may unknowingly or maliciously leak it. However, in many cases we must indeed work with agents that may not be 100% trusted. So, if data is leaked we may not be certain if it came from an agent or from some other source. Our goal is to detect when the distributor's sensitive data has been leaked by agents, and if possible to identify the agent that leaked the data.

Traditionally, leakage detection is handled by watermarking, e.g., a unique code is embedded in each distributed copy. If that copy is later discovered in the hands of an unauthorized party, the leaker can be identified. Watermarks involve some modification of the original data and are very useful in many cases. However, there are cases where it is important not to alter the original distributor's data. For example, the data of a Facebook profile may not look different to different users who have access to it. If an outsourcer is doing our payroll, he must have the exact salary and customer identification numbers. If medical researchers will be treating patients (as opposed to simply computing statistics), they may need accurate data for the patients.

In this paper we propose unobtrusive techniques for detecting leakage of a set of objects or records. Specifically, we study the following scenario: After giving a set of objects to agents, the distributor discovers some of those same objects in an unauthorized place. (For example, the data may be found on a web site, or may be obtained through a legal discovery process.) At this point the distributor can assess the likelihood that the leaked data came from one or more agents, as opposed to having been independently gathered by other means. Using an analogy with cookies stolen from a cookie jar, if we catch Freddie with a single cookie, he can argue that a friend gave him the cookie. But if we catch Freddie with 5 cookies, it will be much harder for him to argue that his hands were not in the cookie jar. If the distributor sees "enough evidence'' that an agent leaked data, he may stop doing business with him, or may initiate legal proceedings.

We develop a model for assessing the "guilt" of agents. Based on this model, we present algorithms for distributing objects to agents in a way that improves our chances of identifying a leaker. Finally, we also consider the option of adding "fake" objects to the distributed set. Such objects do not correspond to real entities but appear realistic to the agents. In a sense, the fake objects act as a type of watermark for the entire set, without modifying any individual members. If it turns out an agent was given one or more fake objects that were leaked, then the distributor can be more confident that agent was guilty.

You may want to check our short presentation or the full paper for details.

Labels: , , , , , ,

Matching Hierarchies Using Shared Objects (Posted by Robert Ikeda)

Objects are often organized in a hierarchy to help in managing or browsing them. For example, books in a library can be divided by subject (history, science, ...) and then by country of publication (United Kingdom, USA, ...). Web pages at a site can also be placed in a hierarchy. For instance, a French tourist site may have categories cities, history, hotels, tours; within the cities category we have pages divided by city, and then by events, maps, restaurants.

At Stanford, we have studied the problem of hierarchy matching, in particular, how to determine corresponding edges between two related hierarchies. The need to match hierarchies arises in many situations where the objects come from different systems. In our book hierarchy example above, we may want to combine the book catalogs of two different libraries; in our tourism example, we may want to build a meta-web-site that combines the resources of two or more French tourism sites.

Traditionally, hierarchy matching is done by comparing the text associated with each edge. In contrast, we explored using the placement of objects present in both hierarchies to infer how the hierarchies relate. We chose to explore this alternative approach since the text matching method is not foolproof. For instance, there could be different wordings in the facets; one book hierarchy may refer to "drugs" while the other uses the term "medications."

Suppose we are matching two book hierarchies and both hierarchies contain a particular book. We would infer from the presence of this shared book in the two hierarchies that the two corresponding paths in the hierarchies have a relationship; perhaps they share edges in common. Given many shared objects, we would have a lot of information about possible relationships to reconcile, and how to best reconcile the information is not obvious. We developed two algorithms (one rule-based, one statistics-based) that use shared objects to determine feasible facets for a second hierarchy, given a hierarchy with known facets (label-value pairs that define what objects are placed under an edge).

We ran experiments on both real and synthetic data and compared the performances of the two algorithms. What we found was that given clean synthetic data (data that conformed to the properties we encoded in our rule-based algorithm), our rule-based algorithm performed consistently better than the statistics-based algorithm. However, with real data, or even with some of our synthetic data, the statistics-based algorithm proved to be more robust. For details, please see here.

Labels: , , , ,

SIGIR 2008 Trip Report (Posted by Paul Heymann)

I just got back from SIGIR'08 in Singapore. I was there to give a talk on my Social Tag Prediction paper, which hopefully I will get around to writing up a full post about sometime soon. In the meantime, I am posting some of my personal reactions to the conference. This is a long post, so feel free to skip down as none of the parts rely on the others.

Kai-Fu Lee Keynote

Keynotes seem to be pretty hard to get right. One hour is a long time, and they either end up being too vague or focus too much on a specific area of the speaker's interest. However, an even bigger issue for me with industry speakers (at least when the talk is not about a research paper) is that they often seem too worried about giving away secrets about their company. As a result, this means that these sorts of talks can end up as reheated summaries of what all of the researchers know, rather than giving any new insight into the challenges industry is facing. Kai-Fu Lee's keynote mostly avoided these pitfalls, and overall I thought it gave a really good overview of the challenges faced by Google in China.

Kai Fu-Lee Keynote

His talk was more or less in the form of bullet points about Google's strategy in China, so I will take more or less that form here. (Greg Linden also has a blog post about the keynote here.)
  • The number of Internet users in China is growing at 35% each year, and accelerating. This means that whoever can capture new users (who presumably have no idea what they are doing) will ultimately be the market leader in China (assuming no outside intervention).
  • New users are kind of classless. For new users, result quality in search does not matter and neither does a clean user interface. New users either do not care or cannot yet appreciate these things. Instead, Google is focusing on Internet cafes, entertainment, and music, which users tend to like (also, blended search with video, news, URLs, and other verticals, called "universal search" at Google). These new users look at a clean empty page like google.com and wonder whether the company forgot to finish the rest.
  • Google could not get people to spell their name. They tried a number of things until finally just giving up and registering "G.cn".
  • Local engineers building new local products (rather than just cursory localization of worldwide products) seems to be key to Google's plans in China. Examples include an expanded Chinese Zeitgeist, a tool for finding holiday greetings to SMS, and various emergency relief efforts.
  • An interesting fact is that Chinese is substantially more dense than other languages. This substantially changes user interface design. For example, in English Google News, each entry needs a title, a date, and a snippet. However, in Chinese, the whole summary of the story can fit in the title area, obviating the need for snippets and leading to more stories per page.
  • Kai-Fu showed a graph with no axes, which he said related to mobile usage. He said that the iPhone web usage was 50 times higher than the nearest mobile competitor, even in China where iPhones are black market. There has been some reporting of the 50 times higher stat elsewhere. (On a personal note, when using my friends' iPhones, I find typing so inconvenient that I usually end up running to Google because I know that a search for "cnn" will give me the right result and then I do not have to type the full URL. Maybe my iPhone touch typing skills will improve with time.)
  • Users love clicking and hate typing due to annoying text input tools. This leads to a Google product to help with text input as well as a focus on (ugly) directories which these users love.
  • Piracy in China is high, and Kai-Fu (I could not tell how seriously) said that it had gone down from 99% to 96%. Whether or not this is true, it allowed Kai-Fu to make a little fun of his former employer, saying that a drop from 99% to 96% meant four times the revenues in China.
  • Google thinks that freeware authors who add ad software (but not malware) to their products may be a reasonable distribution method in China for software. Of course, this is not without hazards, I think Kai-Fu stated that the average Chinese computer gets reinstalled every four months.
  • China has huge broadband penetration, mostly through Internet cafes.
  • There are more Internet users in China than the US, and the growth in Chinese users is accelerating.
  • The average age of a Chinese Internet user is 25, and this number is dropping. Huge numbers of fifteen year olds are getting online.
Overall, Kai-Fu said that all of the recent work of Google China had led to a market share increase from (I think) about 15% to 25% in two years. One thing I was not sure of was whether Kai-Fu was overstating the contribution of this strategy to the 10% increase. He pointed out a few recent deals which I could not really evaluate, like a recent deal with China Telecom which made me wonder if Google had also gotten more politically savvy in China in the past few years. In any case, this was a great talk, and left me at least (though perhaps not seasoned China watchers) with a lot to think about.

Chinese Users Outnumber US Users

Technical Papers


There were a number of technical papers and sessions and sessions that caught my eye.

This being the Stanford InfoBlog, it seems like I should start with the Stanford papers at SIGIR this year. There were three, one by Martin, one by Mengqiu, and one by myself. Martin (of the InfoLab) presented SpotSigs (DOI), which are a duplicate detection technique tuned for news stories and other scenarios where we care about the main text of a page rather than navigational elements.

Martin Presents SpotSigs

Mengqiu (of the NLP group) presented models for improving passage based retrieval (DOI), work he did while at CMU with Luo Si (now at Purdue).

Mengqiu Presents Passage Based Retrieval

My social tag prediction paper (DOI) (with Dan Ramage of the NLP group) focused on how well we can predict tags in social bookmarking systems, and techniques for doing so (including SVM and association rules).

Paul Presents Social Tag Prediction

The tag prediction and tag recommendation problems seem to be getting really popular lately. In addition to the ECML/PKDD Discovery Challenge (which I wrote about here), a number of people are working on similar work. The two (DOI) other (DOI) papers in my session looked at tag recommendation (and related problems) as well.

Real-Time Tag Recommendation

Top-k Querying Social-Tagging Networks

After my talk, I met a number of people who have also been looking at association rules for tags or tag prediction in different contexts. I wish there was a better way to find out who is working on similar work before the work itself is completed!

Later that day, I chatted a little bit with Brian Davison. Brian has been looking at link structure of related pages on the web for many years, for example, in the context of authority propagation, trust propagation, and hypertext classification. One thing I had not realized was that he actually got interested in the link structure of related pages through his dissertation work on prefetching and caching on the web. Previously, I had known his work just in the context of link structure, for example, his work on topical locality in the web (which I think I in turn came upon via Chakrabarti's excellent Mining the Web).

Brian Davison Presents Separate and Inequal

Brian gave an excellent talk (DOI) in the morning (due to one of his student's having visa issues, which seemed fairly common at this conference), though I unfortunately missed his other student's talk (DOI) later that day on their recent work in web page classification.

Classifiers Without Borders

Unfortunately, my talk session was at the same time as the question answering session, which is a topic I have grown increasingly interested in of late. The three papers there were: 1 (DOI) 2 (DOI) 3 (DOI).

Collaborative Exploratory Search

Two papers that had some buzz about them were Pickens et al.'s paper on Collaborative Exploratory Search (DOI) and Liu et al.'s paper on BrowseRank (DOI). The former paper looked at how to work in teams on web search tasks, a topic which is apparently growing increasingly popular in the HCI community these days. The latter paper looked at results ranking based on user browsing behavior from toolbars. I think the Pickens paper ended up winning the Best Paper award.

Lastly, two talks that I found personally interesting were Jonathan Elsas' talk on models for blog feed search (DOI) (slides) and another talk (maybe by Peter Bailey?) on relevance assessment (DOI).

I have been following Jonathan's work for a while now, as he always seems to be doing something interesting with structured, social data. (I finally got to meet him at the conference as well, incidentally.) His talk was mostly about ways to combine both high level and low level structure in blogs, but I was most excited by a somewhat unrelated fact, that they were using Wikipedia for pseudo relevance feedback (PRF). Previous work (DOI) at SIGIR 2007 had looked at this as a possibility, but it was interesting to see both more confirmation that Wikipedia is good for PRF and further mechanisms for using it in that way. Mysteriously, his talk was the third half hour talk in a set of sessions where all of the other sessions had two talks, so he seemed to have the spotlight. In any case, the room was packed.

Jonathan Elsas Presents Blog Feed Search

The talk on relevance assessment was interesting to me because it seemed to be pushing back on a trend which has been happening for a while now. Specifically, in the past few years, there has been a gradual trend towards using extremely cheap sources of labor for creating test collections for evaluating various tasks. For example, some recent work by a former member of the InfoLab looked at Mechanical Turk for evaluating entity resolution (DOI). The relevance assessment paper looked at three types of judges:
  • Gold Judges: Created a topic for a particular collection, and are experts in that topic. For instance, a history professor comes up with a topic "items worn by Abraham Lincoln" and judges results as relevant and non-relevant.
  • Silver Judges: Did not create the topic, but are an expert in it. For instance, a history professor.
  • Bronze Judges: Did not create the topic and are not and expert in it. Just a random user.
What the work found was that while all three types of judges were fine for making broad distinctions between systems, using poor judges could make the distinctions between top performing systems less clear, or even reverse the ordering of top systems in some cases.

That concludes my trip report from Singapore. Apologies for any inaccuracies in the above, I have not had a chance to read many of the papers above in depth yet, so most of my observations are from my vague recollections of the talks. (Do leave a comment if you have any complaints!) Also, if you were at the conference (or were reading the conference papers) and feel like there was important work not discussed here, do leave a comment as well!

Addendum: Yannis pointed me to this other (very recent) paper with more Mechanical Turk evaluation results. Several other people are blogging about SIGIR'08, like Daniel Tunkelang, Pranam Kolari, and Paraic Sheridan. All photos are from the SIGIR'08 website.

Labels: , , , , , ,

Why Uncertainty in Data is Great (Posted by Anish Das Sarma)

Background: Uncertain Data

With the advent and growing popularity of several new applications (such as information extraction on the web, information integration, scientific databases, sensor data management, entity resolution), there is an increasing need to manage uncertain data in a principled fashion. Examples of uncertain data are:
  1. Extraction: You extract structure from an HTML table/text on the Web, but due to the inherent uncertainty in the extraction process, you only have some confidence on the result. You extract the fact that John works for Google, but are not entirely sure, so associated a probability 0.8 with this fact.
  2. Human Readings: Let’s suppose people are viewing birds in California (Christmas Bird Count: http://www.audubon.org/Bird/cbc/). George reports that he saw a bird fly past, but wasn’t sure whether it was a Crow or a Raven. He may also attach confidences with each of them: He is 75% sure it was a Crow, and associates only a 25% chance of it being a Raven.
  3. Sensors: Sensor S1 reported the temperature of a room to be 75±5.
  4. Inherent Data Uncertainty: From weather.com, you extract the fact that there will be rain in Stanford on Monday, but you only have a 75% confidence in this.
  5. Data Integration/Entity Resolution: You are mapping schemas of various tables, and are unsure of whether "mailing-address" in one table corresponds to "home-address" or "office-address" in another table. We are de-duplicating a database of names, and are not sure whether "John Doe" and "J. Doe" refer to the same person.
There are many other examples of uncertainty arising in real-world scenarios.

How Do We Deal With Uncertainty

With large volumes of uncertain data being produced that needs to be subsequently queried and analyzed, there is a dire need to deal with this uncertainty in some way. At a high-level, there are two approaches to dealing with data uncertainty:
  1. CLEAN (Approach-C): "Clean" the data as quickly as possible to get rid of the uncertainty. Once the data has been cleaned, it can be stored in and queried by any traditional DBMS, and life is good thereafter.
  2. MANAGE (Approach-M): In contrast, we could keep the uncertainty in data around, and "manage" it in a principled fashion. This involves building DBMSs that can store such uncertain data, and process them correctly, i.e., handle the probabilities, range of values, dependencies, etc.
Let us compare the two approaches. Both Approach-C and Approach-M entail several technical challenges. Cleaning uncertain data is not a trivial process by any stretch of imagination. There has been work in the database community in cleaning uncertain data in various environments. (For one piece of work on cleaning in sensor/RFID networks, check this IQIS 2006 paper.) Once the data has been cleaned, there is no additional effort involved, as it can be processed by any off-the-shelf DBMS. In contrast, with Approach-M less upfront effort is involved. However, processing uncertain data becomes significantly more challenging. I would like to highlight the Trio project (http://infolab.stanford.edu/trio/) at Stanford that is building a system to manage uncertain data along with data lineage (also known as history or provenance). The lineage feature in Trio allows for an intuitive representation (see our VLDB 2006 paper), and efficient query processing (see our ICDE 2008 paper or a short DBClip). I would like to note that several other database groups are also studying the problem of managing uncertain data.

Why Managing Is Better Than Cleaning

Without going into technical details, I would like to describe why Approach-M is in general better than Approach-C. While Approach-C gives instant gratification by removing all "dirty data," Approach-M gives better long term results. Intuitively, Approach-C greedily eliminates all uncertainty, but Approach-M could resolve uncertainty more accurately later on because it has more information. Another way to look at it is that in Approach-C, the error involved in cleaning the data keeps compounding as we further query the certain data. But in Approach-M, since the uncertainty is explicitly modeled, we don’t have this problem.

Let us take a very simple example to see how uncertainty can be resolved using Approach-M. Consider the Christmas Bird Count described earlier, where people report bird sightings with a relational schema (Observer, Bird-ID, Bird-Name). (Suppose for the sake of this example birds are tagged with an identifying number, and the main challenge is in associating the number with the bird species.) In this extremely simple example, suppose there is just one sighting in our database by Mary, who saw Bird-1, and identified it as being either a Finch (80%) or a Toucan (20%). At this point we know that Bird-1 is definitely a Finch or a Toucan, but we are not sure which one it is; using Approach-C, we would like to create a certain database, and since Mary feels much more confident that Bird-1 is a Finch, we would enter the tuple (Mary, Bird-1, Finch) into the database and forget the information that it could have possibly been a Toucan as well.

In Approach-M, we would store Mary’s sighting as is, which could be represented as:
(Mary, Bird-1, {Finch: 0.8, Toucan: 0.2})
Why is this better? Suppose next day we get another independent observer, Susan, reports the sighting of Bird-1, and she thinks it’s either a Nightingale (70%) or Toucan (30%). The following day we get another independent observer’s sighting who says Bird-1 is either a Hummingbird (65%) or Toucan (35%). Clearly when we reconcile all these sightings, the likelihood of Bird-1 being Toucan is quite high. The method of "reconciling" these readings can be quite complicated, and is an important topic of research, but any reasonable reconciliation should indicate the probability of Bird-1 being Toucan quite high since all other sightings are conflicting. However, using Approach-C, all the three observers’ readings of Toucan would have been weeded out.

Uncertainty In Data Integration

For readers who are not convinced by the synthetic example above, here’s a very real application: data integration, which has been an important area of research for several decades.

Data integration systems offer a uniform interface to a set of data sources. As argued in our chapter to appear in a book on uncertain data, data integration systems need to model uncertainty at their core. In addition to the possibility of data being uncertain, semantics mappings and the mediated schema may be approximate. For example, in an application like Google Base that enables anyone to upload structured data, or when mapping millions of sources on the deep web, we cannot imagine specifying exact mappings. The chapter provides the theoretical foundations for modeling uncertainty in a data integration system.

At Google, we built a data integration system that incorporates this probabilistic framework, and completely automatically sets up probabilistic mediated schemas and probabilistic mappings, after which queries can be answered. We applied our system to integrate tables gathered from all over the web in multiple domains, including 50-800 data sources. Details of this work can be found in our SIGMOD 2008 paper. We observed that the system is able to produce high-quality answers with no human intervention. The answers obtained using the probabilistic framework was significantly better than any deterministic technique compared against.

Labels: , , , , ,

Making Flexible Recommendations in CourseRank (Posted by Georgia Koutrika)

Some background on CourseRank

CourseRank is a social tool for course planning that we are developing at the InfoLab. CourseRank helps Stanford students make informed decisions about classes. It displays official university information and statistics, such as bulletin course descriptions, grade distributions, and results of official course evaluations. Students can anonymously rank courses they have taken, add comments, and rank the accuracy of each others' comments. They can also shop for classes, get personalized recommendations, and organize their classes into a quarterly schedule or devise a four-year plan.

CourseRank also functions as a feedback tool for faculty and administrators, ensuring that information is as accurate as possible. Faculty can modify or add comments to their courses, and can see how their class compares to other classes.

A year after its launch, the system is already being used by more than 6,200 Stanford students, out of a total of about 14,000 students. The vast majority of CourseRank users are undergraduates, and there are only about 7,000 undergraduates at Stanford. Thus, CourseRank is already used by a very large fraction of Stanford undergraduates.

Recommendations the Classical Way

Recommendations comprise an integral part of the system functionality. In order to generate recommendations for students, we initially adopted existing recommendation approaches (see Adomavicius and Tuzhilin for a good survey). However, although the initial version of CourseRank has been very popular with students (see editorial in the Stanford student paper), we soon had first-hand experience with the limitations of classical recommendation systems.

Limitation: Inflexible, canned recommendations

Similarly to most commercial recommendation systems, our initial version offered no choices, just a fixed list of recommended courses for the student to consider. However, users may expect different recommendations under different circumstances, and we received many requests, from students and administrators, for more flexible recommendations. For example, students want to specify the type of course they are interested in (e.g., a biology class, something that satisfies Stanford's writing requirement). They also want to request recommendations based on a peer group they select (e.g., students in their major, or freshmen only) and based on different criteria (e.g., students in their major with similar grades or with similar ratings). For example, a student may want recommendations for CS courses from CS students with similar grades (i.e., with similar performance) and for dance classes from students with similar ratings (i.e., with similar tastes). In some cases, students want to see the recommendations the system would give their best friend, not themselves.

Limitation: Hard-wired recommendations

The recommendation algorithm is typically embedded in the system code, not expressed declaratively. From the designer viewpoint, this fact makes it hard to modify the algorithm, or to experiment with different approaches. However, we (the CourseRank implementers) want to experiment with different recommendation approaches: Would approach X be more effective than approach Y in our environment? What are the right weights for blending two recommendations (e.g., one based on what students like, and another based on what courses are more critical for completing a major)? What is the best way to predict, not good courses, but likely grades a student will obtain in future courses?

To accommodate our end-users’ requests and also to meet our own research and development goals, we were faced with the prospects of implementing many different recommendation approaches and their variants. Therefore, we decided we needed to have more flexible recommendations, i.e., recommendations that could be easily described, customized, and executed by a single engine.

The idea of Flexible Recommendations (FlexRecs)

So, we designed FlexRecs, a framework for flexible recommendations over relational data. The general idea of FlexRecs can be described as follows:
  • A given recommendation approach is expressed declaratively as a high-level workflow. Imagine that such a workflow comprises a series of interconnected operators that describe how a recommendation is computed. In a workflow, sets of objects flow from one operator to the next, and are filtered, ranked, etc. in the process. The workflow is a “high-level” description in the sense that it does not contain actual code, but rather, it describes what code (operators) to call upon. These descriptions can also handle parameters, so that end-users can at run time personalize their recommendations, e.g., by choosing the peer group of students to provide recommendations, or by weighting recommendations that are blended.
  • A designer can easily express multiple workflows for different types of recommendations, as well as new types that the designer may think of. The end user can select from them, depending on her information needs. This selection is done through a GUI, which also allows the user to enter parameters for workflows in order to get more accurate and personalized recommendations. For instance, the user may specify that her recommendations be based on what courses students in her major are taking. Such functionality is essentially similar to advanced searches: a designer builds a set of parameterized queries. End users can transparently execute these queries with parameter values and receive different results through an advanced search interface.
Our framework describes how a recommendation workflow can specifically be defined for relational data, using traditional relational operators such as select, project and join, plus new recommendation operators that generate or combine recommendations. The recommendation operators may call upon functions in a library that implement common tasks for generating recommendations, such as computing the Jaccard or Pearson similarity of two sets of objects. The system can execute a workflow by "compiling" it into a sequence of SQL calls, which are executed by a conventional DBMS. When possible, library functions are compiled into the SQL statements themselves; in other cases we can rely on external functions that are called by the SQL statements.

We have built a flexible recommendation engine for compiling and executing FlexRecs workflows as part of CourseRank. The FlexRecs framework has indeed made it possible to define many recommendation strategies (at least all of the ones we have been able to think of to date). The new version of CourseRank, to be released September 2008, will let end-users tailor their recommendations. Although the end-users will only see choices from simple menus, and select values from sliders, they will actually be selecting what workflow to run, and with what parameters.

Labels: , , , , , ,