IEEE Internet Computing, July/August 2008:
Who are you? The tradeoff between information utility and privacy

Jim Miller
Miramontes Interactive
July 2008

Privacy of personal information has been an important issue in the computer and information industries for quite a while, as expressed on the pages of this magazine and elsewhere. Its importance has grown with the increasing number of Web sites and services that ask users for personal information. If you're at all concerned about information privacy, it's been easy and tempting to take a hard-line attitude toward privacy — that is, my information is mine, and it's none of your business.

There's no doubt that personal information can be valuable to companies, but the real problem isn't that simple. People are willing to give up information about themselves not because they're stupid or because they're being tricked by evil corporations, but because it can sometimes be in their best interests to do so. From this perspective, the important questions are how users can provide or limit access to that information, what benefits they may receive in exchange for a bit of information, and how they perceive the value of those benefits. In this column, I'll talk about different versions of this trade-off between benefits and privacy and how it's evolved over time, how people deal with it, and how it could change in the future.

Version 1: I am who I say I am

The most direct way — and one of the earliest — to get information about people has been to simply ask them, usually as part of a product purchase or a person's registration on a Web site. Note the aspects of this information: it's collected once, it's descriptive yet relatively static, and it requires a certain amount of effort from the person providing it. In addition, the company receiving the information usually has the most to gain from the information exchange: it can then make numerous sales and marketing phone calls and send out emails. There's a possible secondary economic argument: if a company knows things about you, it can anticipate your needs and, in the case of advertising-driven media, charge more for advertising and thereby continue to provide a service to you for free.

From the perspective of a company trying to gather information about its customers, this isn't particularly high-quality information. People can answer casually or not at all, or even lie if they think it's in their interests to do so. Since it's only obtained at a single point in time, it can quickly become out-of-date. Plus, if the questions get too inquisitive or too personal, many people will simply say "none of your business" and quit the entire process, leaving the questioner with nothing. For users, the not infrequent abuse of this information, from spam to identity theft, has made them question whether what they get in exchange is worth giving it up.

The commercial use — and misuse — of personal information has led to corporate privacy policies that state what will be done with collected information, and to governmental laws and policies that legally restrict what information can be collected and how it may be used. Unfortunately, these policies are ultimately only as good as the people offering them, and spammers and information thieves have all but destroyed the perceived integrity of these claims for everyone. So we're at something of a standstill with this kind of static, descriptive information — people don't want to give it up, and companies have good reason to question its quality.

Note that there's a different venue where people are very open with information about themselves — blogs. People are amazingly willing to blog about their lives in considerable detail. The potential for abuse of this information is great, but blogs are generally resistant to large-scale mechanized abuse because of the lack of structure in the information. Technically, blogs are just collections of free text, which means that mining them in any significant way requires natural language processing that is probably beyond the current state of the art. However, as we'll see, another source of personal expression on the Internet has recently become popular, one in which the information has been conveniently segmented and classified, and where the information collection process has become remarkably easy and open to abuse.

Version 2: I am who others say I am

Another source of personal information comes about by aggregating references to individuals across many different sources that have been indexed by search engines. Using a search engine to pull up a school record here, a newspaper article there, and a county tax record somewhere else can reveal an awful lot about you. Much of this is public information that has been available in some form for a long time; the revolution lies in the ease with which it can now be gathered. This is a rather different situation from "I am who I say I am," given that the information is being gathered from existing sources, not requested from users specifically for use on the Internet. Further, the value of the gathered information is all in the hands of the people doing the gathering; users have little to gain or benefit from it.

Opting out of this sort of information collection is hard because it requires perpetual vigilance and a fervent desire to remain off the information grid. Plus, it's not always possible: Some of this information is required by law to be public, such as birth certificates or other governmental records. But the important point here is the power of aggregating information from many different sources, and triangulating in on who you are and what you do. The general impact of this aggregation is limited because so much of the information is unstructured: as with blogs, people are good at reading multiple Web pages and putting pieces together, but this is hard for computers to do in a general, total-population sense — so far, anyway.

Version 3: I am what I do

A quite different approach to gathering information about people is not to ask them, but to observe what they do when they use the Internet. This approach is less direct than asking for information, but it has several advantages, many of which come from the fact that it is less direct. It doesn't require any extra effort on the part of the users, it can be collected many times — not just once — and it's very difficult for people to lie about or disguise what they're doing.

Browsing

Tracking people as they move around the Web was the first version of this information-gathering approach to appear. One can argue that users receive some benefit from this kind of tracking, especially when it's done by or for the owner of the site. Tracking visitors as they move through a site is a common usability testing technique, and it's an excellent way to determine if users easily comprehend a site's design. Using cookies to do this on a live site can be just another, more automated way of collecting these kinds of data in a larger, real-world setting.

Of course, the real commercial interest in these techniques comes from advertisers who are hoping to infer people's buying interests from their browsing behavior, and offer ads and other sales devices that match those interests. They can accomplish this in a single site with cookies, and across sites with "web bugs": small images that can carry user-identifiable information between sites. Together, these techniques enable advertisers to build up much larger pictures of browsing behavior as visitors move from one site to another.
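To make the mechanism concrete, here is a minimal sketch of how a web bug plus a cookie can tie visits on unrelated sites to one visitor. It is a simulation, not real browser or ad-server code: the AdNetwork and Browser classes, the adnetwork.example domain, and the visitor-N identifiers are all assumptions made up for illustration.

```python
from collections import defaultdict

AD_DOMAIN = "adnetwork.example"  # hypothetical third-party domain


class AdNetwork:
    """Hypothetical ad network whose tiny image is embedded on many sites."""

    def __init__(self):
        self._next_id = 1
        self.visits = defaultdict(list)  # visitor id -> sites seen

    def serve_web_bug(self, cookie, embedding_site):
        # No cookie presented: treat the request as a brand-new visitor.
        if cookie is None:
            cookie = f"visitor-{self._next_id}"
            self._next_id += 1
        # Record which site the image request came from.
        self.visits[cookie].append(embedding_site)
        return cookie


class Browser:
    """Simplified browser that stores one cookie per server domain."""

    def __init__(self, cookies_enabled=True):
        self.cookies_enabled = cookies_enabled
        self.cookie_jar = {}

    def load_page(self, site, ad_network):
        # Loading the page also fetches the embedded image from the ad network.
        cookie = self.cookie_jar.get(AD_DOMAIN) if self.cookies_enabled else None
        cookie = ad_network.serve_web_bug(cookie, site)
        if self.cookies_enabled:
            self.cookie_jar[AD_DOMAIN] = cookie


network = AdNetwork()

tracked = Browser(cookies_enabled=True)
for site in ["travel.example", "weather.example", "news.example"]:
    tracked.load_page(site, network)

untracked = Browser(cookies_enabled=False)
for site in ["travel.example", "weather.example"]:
    untracked.load_page(site, network)

# One id links all three sites for the first browser; the second browser
# looks like a new, unrelated visitor on every page load.
profile = network.visits["visitor-1"]
```

In this toy model the first browser accumulates a three-site profile under a single identifier, while the browser with cookies disabled leaves only disconnected single-site records.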

This is the point at which, for many people, the information collection process starts to become, well, creepy. It's one thing to ask people for information and let them decide whether to respond. It's another to, in some invisible way, look over their shoulder and watch what they're doing. From a technical perspective, cookies and web bugs play a big role in enabling this kind of tracking, and it's possible for people to avoid this tracking by disabling cookies and images in their browsers. However, this requires that people understand the concepts of cookies and web bugs and know how to reconfigure their browsers. Readers of this magazine might find this a simple task, but I can assure you that it's not so simple for the large majority of the public. In addition, using a browser in these limited ways can significantly affect how the site behaves. Cookies can improve a user's interaction with a site by providing access to his or her past uses of the site, and these capabilities might be lost if cookies are disabled. In the extreme, a lack of cookies can prevent a site from working at all.

So, we can now see people being implicitly presented with a value proposition: leave cookies and images on and receive these benefits, but with some definable costs. This presupposes, of course, that people are even aware of these information collection issues, something that's less than certain. This combination of affairs tilts things pretty clearly in favor of the information collectors.

From a user perspective, is the answer to tracking just a matter of providing good opt-in and opt-out controls? I don't think so. I'm certainly in favor of such controls, but they only work if the people and companies administering them can be trusted. As spammers have taught us, that's not always the case. Plus, as we'll see later, the information-mining world has moved to one where information control isn't as simple as opting into or out of some general collection system.

From a technical perspective, P3P is an example of a system that is meant to incorporate people's privacy preferences into their Web browsers. With such a system, people can create profiles that state which set of privacy policies they want sites to honor during their Web browsing: "you may track me within your site, but don't allow third parties to do so." Sites then implement parallel profiles that state what they do ("we track you within our site, and we allow third parties to track you as well"). Upon entry to a site, the user's browser retrieves and compares these profiles and alerts the user to conflicts so he or she can decide whether to proceed.
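The comparison step described above can be sketched in a few lines. This is not P3P's actual policy vocabulary or file format; the dictionary keys and the find_conflicts function are hypothetical stand-ins that only illustrate the idea of matching a user's preferences against a site's declared practices.

```python
def find_conflicts(user_prefs, site_policy):
    """Return the practices a site declares that the user has not allowed.

    Both arguments map a practice name to True/False. A practice missing
    from the user's preferences is treated as not allowed (an assumption).
    """
    return sorted(
        practice
        for practice, site_does_it in site_policy.items()
        if site_does_it and not user_prefs.get(practice, False)
    )


# "You may track me within your site, but don't allow third parties to do so."
user_prefs = {"first_party_tracking": True, "third_party_tracking": False}

# "We track you within our site, and we allow third parties to track you as well."
site_policy = {"first_party_tracking": True, "third_party_tracking": True}

conflicts = find_conflicts(user_prefs, site_policy)
if conflicts:
    print("Warn the user before proceeding:", ", ".join(conflicts))
```

Note that a check like this works only from what the site declares about itself; nothing in the comparison can verify that the declaration is true.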

This seems like a good idea, but it never really came together. Its success depended on having a critical mass of users, sites, and browsers implementing the techniques, and that critical mass never appeared. Explaining privacy policies to people was difficult, and the human interfaces used to implement these policies were sometimes lacking. In addition, there was no notion of enforcement; there's nothing to stop rogue sites from claiming to follow a certain policy while in fact implementing something quite different. Work on privacy techniques continues, but a strong, broadly accepted set of technical devices for managing privacy has yet to appear.

Site authentication

Another information-gathering approach has emerged in the form of global authentication systems. As more people sign up with more Web sites, the burden of remembering user names and passwords for all those sites has become significant. This is often a mandatory activity: you have to log in to get the personalized view that's the whole point of using the site in the first place. As this authentication problem grew in magnitude, researchers and Web companies proposed a technical solution: give each user a single login, administered through a third-party authentication server, that he or she would use on many different sites. For example, when I log into a participating site, my credentials would be passed off to the authentication server, which would report back the success or failure of the credentials. If the credentials were declared valid, the site would let me in.
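As a rough sketch of that flow, the following models one authentication server shared by several sites. The AuthServer and Site classes and the example site names are assumptions for illustration; this is not the protocol of any particular product, and a real deployment would involve network calls and considerably more care.

```python
import hashlib
import hmac
import os


class AuthServer:
    """Hypothetical third-party server holding one login per user."""

    def __init__(self):
        self._users = {}  # username -> (salt, password hash)

    @staticmethod
    def _hash(password, salt):
        # Salted, iterated hash so plain passwords are never stored.
        return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)

    def register(self, username, password):
        salt = os.urandom(16)
        self._users[username] = (salt, self._hash(password, salt))

    def verify(self, username, password):
        record = self._users.get(username)
        if record is None:
            return False
        salt, expected = record
        # Constant-time comparison of the stored and presented hashes.
        return hmac.compare_digest(expected, self._hash(password, salt))


class Site:
    """A participating site that delegates login checks to the server."""

    def __init__(self, name, auth_server):
        self.name = name
        self.auth_server = auth_server

    def log_in(self, username, password):
        # Pass the credentials along and act on the reported result.
        if self.auth_server.verify(username, password):
            return f"Welcome to {self.name}, {username}"
        return None


server = AuthServer()
server.register("jim", "one-password-for-everything")

bookstore = Site("bookstore.example", server)
airline = Site("airline.example", server)

ok = bookstore.log_in("jim", "one-password-for-everything")
also_ok = airline.log_in("jim", "one-password-for-everything")
rejected = airline.log_in("jim", "wrong-password")
```

The same single login works on both sites, and every login attempt on every participating site passes through the one server.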

This is a good idea, but the devil is in the details. One of the first large-scale attempts to provide such a service was Microsoft's .NET My Services, known informally as Hailstorm. Microsoft offered this as a global authentication system, one in which the company envisioned everyone who used the Internet would have a Hailstorm ID. But Hailstorm never got off the ground. Many potential users objected to the idea that a single company would be in a position to track everyone's movement and commerce activities on the Internet, and companies resisted the notion that another company — one that was in many cases a competitor — would be standing between them and their customers and in a position to monitor these transactions.

Hailstorm ultimately faded away, but the idea has continued in a different form, most recently with decentralized identity services like OpenID. The idea is much like Hailstorm, but with the difference that anyone can implement and provide OpenID services. This solves the one-provider problem, but its success will still require critical mass adoption by both users and companies. In the meantime, most browsers have begun to address the multiple-login problem by maintaining a collection of site usernames and passwords in an encrypted form in the browser. There are obvious security issues with this approach, but it's widely available and easy for users to adopt, and it could prove to be a "good enough" solution for most people.

Contextual search

Search engines, and the analysis of the context in which a series of searches takes place, have opened up entirely new opportunities for, and questions about, information collection. I've already alluded to the interest in user tracking held by the advertising industry. This interest, and the industry's focus on user tracking, is based on a desire to present ads specifically targeted toward just the people who would find them of interest. If this can be done, advertising works out for everyone. Advertisers win, because they can focus their ads on the precise interests of a narrow group of potential customers, and presumably do a better job of conveying their intended message. Users win, because these ads are more likely to carry information in which they are actually interested. And ad networks win because they can avoid wasting time and effort on showing ads to people who aren't potential customers, and charge more for the ads that they do show.

Google has of course built a modern-day empire on this idea. Note the contrasts to previous means of collecting user information. This information — the search query — is extremely up-to-date, and can be collected many times a day. Most importantly, the information is given freely and honestly, in exchange for something of clear value to the users, who are obviously interested in what they're searching for and are often highly interested in seeing ads relevant to their search.

Note that there are two different levels of user analysis that can go on here. A single search request simply returns a set of search results and accompanying ads, which can be useful in itself. However, multiple searches can be used to disambiguate each other: I might have many reasons for searching for "Bora Bora climate," but my intent is much more clear if I follow up that query with a search for either "beachfront rentals" or "hurricane prediction." Thus, an ability to examine local search histories can produce even more predictive power, and more value for the user or the information holder. Or, perhaps, for both of them.
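A toy sketch shows how a follow-up query can disambiguate an earlier one. The intent labels, keyword sets, and the infer_intent function are invented for illustration; real search engines use far richer signals, and nothing here reflects how any actual engine works.

```python
# Hypothetical keyword lists for two possible intents behind a query.
INTENT_KEYWORDS = {
    "vacation planning": {"beachfront", "rentals", "hotel", "flights"},
    "weather research": {"hurricane", "prediction", "forecast", "storm"},
}


def infer_intent(queries):
    """Guess one intent from a list of queries, or None if it is unclear."""
    scores = {intent: 0 for intent in INTENT_KEYWORDS}
    for query in queries:
        words = set(query.lower().split())
        for intent, keywords in INTENT_KEYWORDS.items():
            # Count how many of this intent's keywords appear in the query.
            scores[intent] += len(words & keywords)
    best = max(scores.values())
    if best == 0:
        return None  # no evidence either way
    winners = [intent for intent, score in scores.items() if score == best]
    # A tie between intents is still ambiguous.
    return winners[0] if len(winners) == 1 else None


single = infer_intent(["Bora Bora climate"])
traveler = infer_intent(["Bora Bora climate", "beachfront rentals"])
researcher = infer_intent(["Bora Bora climate", "hurricane prediction"])
```

On its own, the first query matches neither keyword set and yields no guess; adding either follow-up query tips the result toward one intent.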

Of course, any kind of information that can be collected, indexed, and aggregated can be just as useful as my search history: am I getting mail, or reading RSS feeds, or watching videos from travel companies or weather researchers? Add that information in, and you may get a broader picture of what I'm interested in. In principle, this sort of aggregation, while technically challenging, should make ad presentations even more accurate, and so make users (as well as advertisers) even happier. But it's not that simple; the more information that goes into a single store and the more powerful the analysis techniques become, the more unease it's likely to inspire, and the creepiness factor returns. How much am I willing to let one company inside my head to get a 5 percent improvement in ad quality, and to enable other uses of this information that we're barely thinking about now? At what point does the added value to users decline relative to that received by the holders of the information? That experiment is under way every day, in Web browsers around the world, and, for better or worse, we're in the process of finding out. I'll continue this vein of thinking in the next installment, so stay tuned.