Google today announced that it is open sourcing its differential privacy library, an internal tool the company uses to securely draw insights from data sets that contain the private and sensitive personal information of its users.
Differential privacy is a statistical approach to data science, particularly with regard to analysis, that allows someone relying on software-aided analysis to draw aggregate insights from massive data sets while protecting user privacy. It does so by mixing real user data with mathematical “noise,” as explained by Wired’s Andy Greenberg. That way, the results of any analysis cannot be used to unmask individuals or allow a malicious third party to trace any one data point back to an identifiable source.
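To make the noise-adding idea concrete, here is a minimal sketch of one standard building block, the Laplace mechanism, written in Python. This is an illustration of the general technique, not the API of Google’s library:

```python
import numpy as np

def private_count(values, epsilon=1.0):
    """Return a noisy count of the records in `values`.

    Each user contributes at most one record, so adding or removing
    any one person changes the true count by at most 1; noise drawn
    from a Laplace distribution scaled to 1/epsilon masks that change.
    """
    true_count = len(values)
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# The noisy count is still useful in aggregate, but its randomness
# hides whether any single individual was in the data set.
print(private_count(["user_a", "user_b", "user_c"], epsilon=0.5))
```

A smaller epsilon means more noise and stronger privacy; a larger one means a more accurate answer. Tuning that trade-off is the core of differentially private analysis.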
The technique is the bedrock of Apple’s approach to privacy-minded machine learning, for instance. It lets Apple extract data from iPhone users, statistically anonymize that data, and still draw useful insights that can help it improve, say, its Siri algorithms over time.
Google wants differential privacy to be accessible to data scientists in any field.
Google does the same with Chrome using what it calls Randomized Aggregatable Privacy-Preserving Ordinal Response, or RAPPOR, a differential privacy tool for analyzing and drawing insights from browser usage without making sensitive information like personal browsing histories traceable. Earlier this year, Google also open sourced a tool for its TensorFlow AI training platform, called TensorFlow Privacy, that lets researchers use differential privacy to protect user data while training AI algorithms.
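RAPPOR’s full design layers Bloom filters and two rounds of randomization, but its core building block is classic randomized response, where each report is individually deniable yet population-level rates remain recoverable. A hedged sketch of that one-bit idea, with a made-up parameter name:

```python
import random

def randomized_response(truth: bool, p: float = 0.75) -> bool:
    """Report the true bit with probability p; otherwise flip a coin.

    No single report proves anything about its sender, but across many
    reports the true rate can still be estimated.
    """
    if random.random() < p:
        return truth
    return random.random() < 0.5

# Recovering the population rate from noisy reports:
# E[reported] = p * true_rate + (1 - p) * 0.5, so
# true_rate ≈ (mean(reports) - (1 - p) / 2) / p
```

That last step is why the approach works at scale: the coin flips wash out in aggregate, leaving an unbiased estimate of the statistic the analyst actually cares about.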
But there are a number of other sectors, like healthcare and sociology, where differential privacy can be useful, Google points out. “This type of analysis can be implemented in a wide variety of ways and for many different purposes,” writes Miguel Guevara, a Google product manager in the company’s privacy and data protection office, in a blog post. “For example, if you are a health researcher, you may want to compare the average amount of time patients remain admitted across various hospitals in order to determine if there are differences in care. Differential privacy is a high-assurance, analytic means of ensuring that use cases like this are addressed in a privacy-preserving manner.”
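For a sense of how Guevara’s hospital example might look in practice, here is a rough sketch of a differentially private average, with hypothetical data and function names; it illustrates the general recipe (clamp, add noise, divide), not the interface of Google’s library:

```python
import numpy as np

def private_mean(stays_in_days, upper_bound=60.0, epsilon=1.0):
    """Differentially private mean of hospital stays (hypothetical data).

    Stays are clamped to [0, upper_bound] so one patient's record can
    shift the sum by at most upper_bound; the privacy budget is split
    between a noisy sum and a noisy count.
    """
    clamped = np.clip(stays_in_days, 0.0, upper_bound)
    noisy_sum = clamped.sum() + np.random.laplace(scale=upper_bound / (epsilon / 2))
    noisy_count = len(clamped) + np.random.laplace(scale=1.0 / (epsilon / 2))
    return noisy_sum / max(noisy_count, 1.0)

# Each hospital publishes only its noisy average, which can be compared
# across hospitals without exposing any individual patient's stay.
hospital_a = np.array([3.0, 5.5, 2.0, 14.0, 7.5])
print(private_mean(hospital_a, epsilon=0.5))
```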
In a separate interview with The Verge, Guevara says Google’s quest to develop a differentially private approach to data analysis for its own internal tools was long, difficult, and resource-intensive, much more so than the company initially expected. That’s why Google is hoping that, by open sourcing its library on GitHub, it can help organizations and individuals without the resources of a large Silicon Valley tech company approach data analysis with similarly rigorous privacy protections.
“We’ve used differentially private methods to create helpful features in our products, like how busy a business is over the course of a day or how popular a particular restaurant’s dish is in Google Maps, and improve Google Fi,” Guevara writes. “From medicine, to government, to business, and beyond, it’s our hope that these open-source tools will help produce insights that benefit everyone.”