The Risks of Data Minimization

April 2, 2015


Companies like 23andMe prefer to use big datasets instead of small, restricted ones.

Large and open data sets offer great opportunities for societal advancement. In health, more data can yield better diagnostic tools and forecasting of health outcomes, pointing to better ways to evaluate different types of health interventions. In energy, more data on water and electricity usage can help utilities and consumers make more efficient choices about how they use those resources. Retailers can use data on consumers' shopping habits to make their stores flow more efficiently or to target their marketing campaigns.

Benefits like these, and many others, often come only from serendipitous findings in large data sets. One of the clearest arguments supporting this point came from Josh New in "A Lot of Private-Sector Data Is Also Used for Public Good." In that article, New describes how companies that deal with significant amounts of data, whether open source or submitted by users, end up using those large data sets for a broader good. New specifically cites companies like 23andMe, Facebook, and LinkedIn to show how large data sets can produce serendipitous benefits.

23andMe, with permission, shared information from individuals who ordered its DNA kits with other gene-mapping and pharmaceutical companies to advance genomics research and medical treatments and to improve recruitment for clinical trials. LinkedIn uses information from its network to help college students gain important insights about their choice of college and major by analyzing the data of people who made similar decisions. Analyzing Facebook data produces a number of beneficial externalities, including gauging attitudes toward public health policies, better targeting Amber Alerts, and allowing individuals to notify friends and loved ones in the case of a geographically specific disaster.

These benefits are good examples of a larger pattern: when researchers and analysts go through large data sets, whether secured from open sources, from the habits of consumers, or from devices in the "Internet of Things" (IoT), they do not always find an answer to the specific question they are asking. They often find different, and possibly more important, answers as externalities or tangents to the original question. Those answers are possible only because they can query data sets they have already collected.

"Data minimization" will make finding answers like these more difficult. Data minimization is a policy under which institutions that collect data keep only the data relevant to a specific issue, keep it only for a short period of time, and request permission from consumers before using the data for any purpose not originally delineated.

Data minimization was addressed on page iv of the FTC staff report "Internet of Things: Privacy and Security in a Connected World." That report presented the conclusions of FTC staff and other experts from a roundtable discussion on security and privacy in the quickly growing IoT, and data minimization played a prominent role in that discussion. Specifically, the report claimed:

First, larger data stores present a more attractive target for data thieves, both outside and inside a company – and increases the potential harm to consumers from such an event.

Second, if a company collects and retains large amounts of data, there is an increased risk that the data will be used in a way that departs from consumers’ reasonable expectations.

The FTC report went on to suggest:

To minimize these risks, companies should examine their data practices and business needs and develop policies and practices that impose reasonable limits on the collection and retention of consumer data. However, recognizing the need to balance future, beneficial uses of data with privacy protection, staff’s recommendation on data minimization is a flexible one that gives companies many options. They can decide not to collect data at all; collect only the fields of data necessary to the product or service being offered; collect data that is less sensitive; or deidentify the data they collect.

While there may be some "flexibility" in the FTC staff's recommendations on IoT, following those recommendations to the letter would severely limit the possible gains that large data sets can bring. In addition, the suggestions as presented can be difficult to quantify. Collecting only the data "necessary" to the product or service won't help companies create better efficiencies or develop better products. "Less sensitive" is a relative term, and it is difficult to determine what data might be sensitive enough to trigger a need to minimize risk.

The FTC report's point about increased risk from thieves also seems shortsighted. Protecting data from deliberate attacks remains a technological challenge regardless of the amount or quality of data collected. While the scale of needed protections increases with more data collection, the scope does not. Companies need to protect data appropriately, but if a company can protect a few terabytes for weeks, it should be able to protect a few petabytes for months. Remaining at the leading edge of data security is fundamental to any company working with personal data, and holding more data for longer periods doesn't change that.

The report would also treat all data the same. Would health characteristics reported by glucose monitors or pacemakers be treated the same as climate information reported by in-home thermostats or meters reporting water usage? While selling or sharing some of this data with other organizations might create privacy issues for consumers, sharing other data would not.

Perhaps the strongest counter to the FTC report, however, came from FTC Commissioner Joshua Wright, who wrote in his dissenting statement:

Without limiting the scope of “data,” staff identifies the benefits of data minimization in terms of eliminating two scenarios: (1) the possibility that larger data stores present a more attractive target for thieves; and (2) retention of large stores of data increase the risk that data will be used in a way that deviates from consumers’ reasonable expectations. In considering the costs of data minimization, staff merely acknowledges it would potentially curtail innovative uses of data. Without providing any sense of the magnitude of the costs to consumers of foregoing this innovation or of the benefits to consumers of data minimization, and without providing any evidence demonstrating that the benefits of data minimization will outweigh its costs to consumers, staff nevertheless recommends that businesses “develop policies and practices that impose reasonable limits on the collection and retention of consumer data.”

So, with little consideration of the cost of data minimization, the lost opportunities for innovation, the definition of "sensitive," or the realities of data security, the FTC staff report leaves a lot to be desired, especially in regard to data minimization.

Around the release of the FTC's report, FTC Chairwoman Edith Ramirez said, "Data that hasn't been collected or has already been destroyed can't fall into the wrong hands." While that is correct, such data also can't fall into the right hands: the ones looking to make significant advances and innovations for the good of their companies and society.