Letting Web Users Fib Scientifically Is Key to IBM's Web Privacy Technique

(June 5, 2002) - If you're an online merchant, ask a visitor to your site a simple question, and you're likely to get a misleading answer, if you get anything at all. That's the dilemma prompting a new privacy-enhancing data mining technique being developed by Dr. Rakesh Agrawal and Dr. Ramakrishnan Srikant, researchers at IBM's Almaden (CA) Research Center.

The project addresses the Catch-22 created by Web users entering false personal data on sites to protect their privacy, and e-businesses relying on the data to develop data models and deliver customized services. "Our research institutionalizes the notion of fibbing on the Internet, and does so to preserve the overall reality behind the data," says Dr. Agrawal.

Called Privacy-Preserving Data Mining, the research relies on the concept that personal data can be protected by being randomized. By applying this technique, a retailer could generate highly accurate data models without ever seeing personal information. "The beauty of this research is that retailers and other Web businesses are able to extract the valuable demographic information they need without knowing the underlying personal consumer data," said Harriet P. Pearson, IBM's Chief Privacy Officer. "I believe we'll see technological approaches such as this playing a larger role in managing privacy issues."

How It Works
A Web user decides to enter a piece of personal data - e.g., age, salary, weight. Upon entry, that number (for example, age 30) is scrambled, or 'randomized', by IBM software: the software takes the original number and adds a random value to it, which may be positive or negative. This randomization is performed independently for every user. So a 30-year-old's age may be randomized to 42, while a 34-year-old's entry may be randomized to 28.

What doesn't change is the allowed range of the randomization, and that range is directly linked to the desired level of privacy. A larger randomization range increases the uncertainty about any individual value, and with it the personal privacy of the users. At the same time, larger randomizations can reduce the accuracy of the results ultimately produced by a data mining algorithm that uses the randomized data as input. According to Dr. Agrawal, it is clearly a trade-off. Experiments indicate only a 5-10 percent loss in accuracy, even for 100 percent randomization, after the data mining algorithm has applied corrections to the distributions.

Take the randomization of an IT manager's salary, which, for purposes of this example, may range between $50,000 and $150,000 per year. Then let's say that the Web merchant (or Web site owner) decides that the software's randomization parameter will be set to add a random value somewhere between -$30,000 and +$30,000.

Jane, who comes to the site and decides to enter her salary in exchange for personalized recommendations, has a salary of $100,000. When she enters this, the IBM software happens to pick a random value of -$15,000, so her salary is recorded as $85,000. No record is kept of her true salary to protect her privacy. Then Bob comes to the site and enters his true salary of $90,000. The software happens to pick +$25,000 for Bob and his salary is recorded as $115,000. Again no record is kept of Bob's true salary.
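The bookkeeping in this example can be sketched in a few lines of Python. This is a minimal illustration, not IBM's software: the function name `randomize`, the fixed seed, and the choice of a uniform distribution for the random value are assumptions (the article specifies only the -$30,000 to +$30,000 range).

```python
import random

SPREAD = 30_000  # randomization parameter from the example: -$30,000 to +$30,000


def randomize(true_value, spread, rng):
    """Return the value to record: the true value plus a random offset.

    Assumption: the offset is drawn uniformly from [-spread, +spread].
    """
    return true_value + rng.uniform(-spread, spread)


rng = random.Random(2002)  # fixed seed only so the sketch is repeatable

# Each visitor is randomized independently; only the result is stored.
jane_recorded = randomize(100_000, SPREAD, rng)
bob_recorded = randomize(90_000, SPREAD, rng)
print(f"Jane recorded as ${jane_recorded:,.0f}")
print(f"Bob recorded as ${bob_recorded:,.0f}")

# The article's own draws, reproduced by hand:
assert 100_000 + (-15_000) == 85_000  # Jane
assert 90_000 + 25_000 == 115_000     # Bob
```

Because the true value is never stored, anyone holding a recorded value knows only that the original lies within $30,000 of it.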

To view the effect of the randomization, look at the true salary distribution of the group of folks, in addition to Jane and Bob, who input their salary on the site, next to the randomized distribution.

Salary range        Truthful       Randomized
$50,000-60,000      1 visitor      3 visitors
$60,000-70,000      4 visitors     7 visitors
$70,000-80,000      20 visitors    12 visitors
$90,000-100,000     50 visitors    33 visitors
$100,000-110,000    10 visitors    55 visitors
$110,000-120,000    45 visitors    23 visitors
$120,000-130,000    15 visitors    10 visitors
$130,000-140,000    3 visitors     2 visitors
$140,000-150,000    2 visitors     5 visitors

Note that the randomized list shows 55 people in the $100,000-110,000 range, whereas in truth there were only 10. If this randomized data were used directly, the results would be very poor.
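The distortion is easy to reproduce. The sketch below counts visitors per $10,000 band before and after randomization. The visitor salaries are made up for illustration (they are not the figures in the table above), and the uniform random offset and the helper name `band` are assumptions.

```python
import random
from collections import Counter

WIDTH = 10_000   # band width used in the table
SPREAD = 30_000  # randomization range from the example


def band(value, width=WIDTH):
    """Lower edge of the band that contains value."""
    return int(value // width) * width


rng = random.Random(42)  # fixed seed only so the sketch is repeatable

# Made-up visitors: salaries clustered in two bands, NOT the article's data.
true_salaries = [rng.choice([95_000, 115_000]) + rng.uniform(-5_000, 5_000)
                 for _ in range(150)]
# Assumption: each offset is drawn uniformly from [-SPREAD, +SPREAD].
recorded = [s + rng.uniform(-SPREAD, SPREAD) for s in true_salaries]

true_counts = Counter(band(s) for s in true_salaries)
noisy_counts = Counter(band(r) for r in recorded)

for lo in sorted(set(true_counts) | set(noisy_counts)):
    print(f"${lo:,}-{lo + WIDTH:,}: truthful {true_counts[lo]:3d}, "
          f"randomized {noisy_counts[lo]:3d}")
```

The randomized counts spread across many more bands than the truthful ones, which is why the raw randomized data cannot be mined directly.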

Once the randomized data is in for a large number of users, the privacy-preserving data mining software takes the randomized distribution and reconstructs what the true distribution might have looked like.

The software cannot determine what Jane's or Bob's salary was. It has access only to the randomized values and the parameters of the randomization (i.e., that the random values added or subtracted came from the range -$30,000 to +$30,000), and nothing else. Based only on this information, the software reconstructs a close approximation of the true distribution. This reconstructed distribution is then used in building an accurate data mining model. Jane gets personalized recommendations by having the data mining model shipped to her client and applied locally.
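The article does not say how the reconstruction is computed. One way to recover a distribution from values blurred by known noise is an iterative Bayesian update over salary bands, sketched below. Treat it as an illustration under stated assumptions rather than IBM's method: the function name `reconstruct`, the uniform noise, the even spread of salaries within each band, the iteration count, and the sample data are all hypothetical.

```python
import random


def reconstruct(recorded, edges, spread, iterations=30):
    """Estimate the share of true values per band from randomized values.

    Assumes uniform noise on [-spread, +spread] and true values spread
    evenly inside each band; both are assumptions of this sketch.
    """
    bands = list(zip(edges[:-1], edges[1:]))
    est = [1.0 / len(bands)] * len(bands)  # start from a flat guess
    for _ in range(iterations):
        updated = [0.0] * len(bands)
        for w in recorded:
            # Weight of each band: the fraction of the band that could have
            # produced w, times the current estimate of that band's share.
            weights = [
                max(0.0, min(hi, w + spread) - max(lo, w - spread)) / (hi - lo) * share
                for (lo, hi), share in zip(bands, est)
            ]
            total = sum(weights)
            if total > 0:
                for i, wt in enumerate(weights):
                    updated[i] += wt / total
        norm = sum(updated)
        est = [u / norm for u in updated]
    return est


# Made-up data for illustration only (not the article's visitors).
rng = random.Random(7)
SPREAD = 30_000
true_salaries = [rng.choice([95_000, 115_000]) + rng.uniform(-5_000, 5_000)
                 for _ in range(2_000)]
recorded = [s + rng.uniform(-SPREAD, SPREAD) for s in true_salaries]

edges = list(range(50_000, 160_000, 10_000))  # $50,000 ... $150,000
estimate = reconstruct(recorded, edges, SPREAD)
for (lo, hi), share in zip(zip(edges, edges[1:]), estimate):
    print(f"${lo:,}-{hi:,}: estimated share {share:.1%}")
```

Note that `reconstruct` sees only the recorded values, the band edges, and the randomization range, matching the information the article says the software has. It returns band shares for the whole group and never an estimate of any one person's salary.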

According to Dr. Agrawal, the Privacy-Preserving Data Mining research has a wide range of potential applications, from medical research and building disease prediction models using randomized individual medical histories, to e-commerce and accurate promotions using randomized demographics of individual users.

Launched in early 2002, the IBM Privacy Institute is the industry's first formal technology research effort focused exclusively on developing privacy-enabling and data protection technologies for businesses. Under the direction of Dr. Michael Waidner, the Institute conducts privacy-enabling technology research in IBM's eight research laboratories around the world.

About Java News Desk
JDJ News Desk monitors the world of Java to present IT professionals with updates on technology advances, business trends, new products and standards in the Java and i-technology space.
