Opinion, Berkeley Blogs

Unintentional Revelation of Sexuality

By Chris Hoofnagle

Under 13 USC § 9(a)(2), the Department of Commerce is prohibited from "mak[ing] any publication whereby the [Census] data furnished by any particular establishment or individual under this title can be identified." Thus, the Census Bureau must protect the identities of those who participate in the enumeration.

Modern techniques of "reidentification" are making compliance with this statutory privacy mandate more difficult. Using reidentification techniques, one can take a putatively anonymous database and, with varying degrees of confidence, link individuals' identities to specific data. Carnegie Mellon Professor Latanya Sweeney, in "Uniqueness of Simple Demographics in the U.S. Population," found that using 1990 Census data, "...87% (216 million of 248 million) of the population in the United States had reported characteristics that likely made them unique based only on {5-digit ZIP, gender, date of birth}...In general, few characteristics are needed to uniquely identify a person." More recently, Philippe Golle found, using 2000 Census data, that 63% of the population was identifiable by gender, ZIP code, and full date of birth.
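To make the uniqueness idea concrete, here is a minimal sketch in Python of how one might measure what share of records is unique on a chosen set of attributes. It is not Sweeney's or Golle's actual method; the function name `fraction_unique`, the field names, and the four sample records are all invented for illustration.

```python
from collections import Counter

def fraction_unique(records, quasi_identifiers):
    """Return the share of records whose combination of the given
    attributes appears exactly once in the dataset."""
    # Build one key per record from the chosen attributes.
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    # A record is "unique" if no other record shares its key.
    unique = sum(1 for k in keys if counts[k] == 1)
    return unique / len(keys) if keys else 0.0

# Hypothetical, made-up records (not real Census data).
records = [
    {"zip": "94720", "gender": "F", "dob": "1970-01-15"},
    {"zip": "94720", "gender": "F", "dob": "1970-01-15"},
    {"zip": "94720", "gender": "M", "dob": "1982-06-30"},
    {"zip": "94110", "gender": "F", "dob": "1990-11-02"},
]

share_all = fraction_unique(records, ["zip", "gender", "dob"])
share_gender = fraction_unique(records, ["gender"])
print(share_all, share_gender)
```

In this toy data, half the records are unique on the full triple but only a quarter are unique on gender alone, which mirrors the point that combining a few ordinary attributes tends to single people out.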

Outside the Census context, researchers have reidentified many databases. Notable examples include Arvind Narayanan's work on the Netflix database and on social networks. Narayanan succinctly describes the thesis of his Ph.D. work as exposing that "...the level of anonymity that society expects—and companies claim to provide—in published databases is fundamentally unrealizable."

In the legal context, University of Colorado Law Professor Paul Ohm has argued that public policy has placed too much emphasis on anonymization, especially in light of new techniques developed by Narayanan and others.

Census reidentification is not a hypothetical risk. Title 13 is a restraint on government action; it does not prohibit third parties from attempting to reidentify Census records. Many marketing companies use Census data as a targeted-marketing tool and have enormous financial incentives to link Census data with particular households. A list of households with gay couples would be quite lucrative, and in fact, data marketers created a list of the 4,000 San Francisco gay couples who registered for a marriage license.

The Census is a powerful political tool; it has been abused in different cultures to identify unpopular subgroups. Our own experience includes the use of the Census to identify Japanese Americans during World War II.

On the other hand, the same experts who are reidentifying databases are exploring more robust forms of anonymization. Sweeney and Brad Malin have collaborated on a number of projects to improve anonymization. The Census Bureau also has sophisticated experts studying the problem, and it applies several techniques to obscure identity.

Still, this leaves individuals in a dilemma: should they comply with the mandate to participate in the Census, and in doing so risk losing control over the extent to which third parties know about their sexual orientation? Can the Census ethically collect this information? And should extra precautions be built into information collection on potentially sensitive facts about individuals?