Stanford InfoBlog: Who is the Data Leaker? (Posted by Panagiotis Papadimitriou)

In the course of doing business, sensitive data must sometimes be handed over to supposedly trusted third parties. For example, social networking sites such as Facebook share users' data with the owners of social applications (user approval is often required). Similarly, enterprises that outsource their data processing have to give data to various other companies. Data may also be shared for other purposes, e.g., a hospital may give patient records to researchers who will devise new treatments.

We call the owner of the data the distributor and the supposedly trusted third parties the agents. In a perfect world there would be no need to hand over sensitive data to parties that may unknowingly or maliciously leak it. However, in many cases we must indeed work with agents that may not be 100% trusted. So, if data is leaked, we may not be certain whether it came from an agent or from some other source. Our goal is to detect when the distributor's sensitive data has been leaked by agents and, if possible, to identify the agent that leaked the data.

Traditionally, leakage detection is handled by watermarking, e.g., a unique code is embedded in each distributed copy. If that copy is later discovered in the hands of an unauthorized party, the leaker can be identified. Watermarks involve some modification of the original data and are very useful in many cases. However, there are cases where it is important not to alter the distributor's original data. For example, the data of a Facebook profile must not look different to the different users who have access to it. If an outsourcer is doing our payroll, the outsourcer must have the exact salary and customer identification numbers. If medical researchers will be treating patients (as opposed to simply computing statistics), they may need accurate data for the patients.

In this paper we propose unobtrusive techniques for detecting leakage of a set of objects or records. Specifically, we study the following scenario: After giving a set of objects to agents, the distributor discovers some of those same objects in an unauthorized place. (For example, the data may be found on a web site, or may be obtained through a legal discovery process.) At this point the distributor can assess the likelihood that the leaked data came from one or more agents, as opposed to having been independently gathered by other means. Using an analogy with cookies stolen from a cookie jar, if we catch Freddie with a single cookie, he can argue that a friend gave him the cookie. But if we catch Freddie with 5 cookies, it will be much harder for him to argue that his hands were not in the cookie jar. If the distributor sees "enough evidence" that an agent leaked data, the distributor may stop doing business with that agent, or may initiate legal proceedings.
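To make the cookie-jar intuition concrete, here is a minimal sketch of how such a likelihood could be computed. It is an illustration under simplifying assumptions, not necessarily the exact model from the paper: each leaked object is assumed to have been obtained independently (e.g., guessed or gathered elsewhere) with probability p_guess; otherwise it is assumed to have been leaked by exactly one of the agents holding it, each equally likely. The function name and parameters are hypothetical.

```python
from math import prod


def guilt_probability(agent_objects, leaked, holders, p_guess):
    """Estimate the probability that an agent leaked at least one object.

    Illustrative assumptions (not taken from the post):
      * each leaked object was obtained independently with probability p_guess;
      * otherwise it was leaked by one of the agents holding it, chosen
        uniformly, and objects are treated independently of each other.

    agent_objects: set of objects given to the agent
    leaked:        set of objects found in the unauthorized place
    holders:       dict mapping object -> number of agents that received it
    """
    overlap = agent_objects & leaked
    # Probability that the agent is NOT the source of any overlapping object.
    p_innocent = prod(1.0 - (1.0 - p_guess) / holders[t] for t in overlap)
    return 1.0 - p_innocent


# Freddie is the only one with access to the jar; assume a 50% chance
# that any given cookie came from a friend instead.
cookies = {"c1", "c2", "c3", "c4", "c5"}
holders = {c: 1 for c in cookies}
print(guilt_probability(cookies, {"c1"}, holders, 0.5))   # 0.5
print(guilt_probability(cookies, cookies, holders, 0.5))  # 0.96875
```

Under these assumptions, one cookie leaves the question open, while five cookies make an innocent explanation unlikely. The more agents share an object, the less each of them is implicated by its appearance.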

We develop a model for assessing the "guilt" of agents. Based on this model, we present algorithms for distributing objects to agents in a way that improves our chances of identifying a leaker. Finally, we also consider the option of adding "fake" objects to the distributed set. Such objects do not correspond to real entities but appear realistic to the agents. In a sense, the fake objects act as a type of watermark for the entire set, without modifying any individual members. If it turns out that an agent was given one or more fake objects that were leaked, then the distributor can be more confident that the agent was guilty.
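The fake-object idea can be sketched in a few lines. This is a simplified illustration, not the allocation algorithm from the paper: it assumes every agent receives its own unique fake objects, and it uses obviously synthetic identifiers where a real system would need realistic-looking records. All function names are hypothetical.

```python
def add_fake_objects(allocations, fakes_per_agent=1):
    """Add unique fake objects to each agent's set of real objects.

    allocations: dict mapping agent name -> set of real objects.
    Returns (distributed, fake_owner), where fake_owner maps each fake
    object back to the only agent that received it.
    """
    distributed = {}
    fake_owner = {}
    for agent, objects in allocations.items():
        # Placeholder identifiers; real fakes would have to look realistic.
        fakes = {f"fake-{agent}-{i}" for i in range(fakes_per_agent)}
        distributed[agent] = set(objects) | fakes
        for fake in fakes:
            fake_owner[fake] = agent
    return distributed, fake_owner


def agents_with_leaked_fakes(fake_owner, leaked):
    """Count, per agent, how many of its fake objects appear in the leak."""
    counts = {}
    for obj in leaked:
        agent = fake_owner.get(obj)
        if agent is not None:
            counts[agent] = counts.get(agent, 0) + 1
    return counts


allocations = {"A": {"r1", "r2"}, "B": {"r2", "r3"}}
distributed, fake_owner = add_fake_objects(allocations)
print(agents_with_leaked_fakes(fake_owner, {"r2", "fake-B-0"}))  # {'B': 1}
```

The real record r2 was given to both agents, so on its own it does not separate them. The fake object was given only to B, so its appearance in the leak points to B without any real record having been altered.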

You may want to check our short presentation or the full paper for details.

Labels: databases, detection, leakage, panagiotis, panos, papadimitriou, research

This entry was posted by panagiotis on Thursday, August 14, 2008.
