Letting Web Users Fib Scientifically Is Key to IBM's Web Privacy Technique

(June 5, 2002) - If you're an online merchant, ask a visitor to your site a simple question, and you're likely to get a misleading answer, if you get anything at all. That's the dilemma prompting a new privacy-enhancing data mining technique being developed by Dr. Rakesh Agrawal and Dr. Ramakrishnan Srikant, researchers at IBM's Almaden (CA) Research Center.

The project addresses the Catch-22 created by Web users entering false personal data on sites to protect their privacy, and e-businesses relying on the data to develop data models and deliver customized services. "Our research institutionalizes the notion of fibbing on the Internet, and does so to preserve the overall reality behind the data," says Dr. Agrawal.

Called Privacy-Preserving Data Mining, the research relies on the concept that personal data can be protected by being randomized. By applying this technique, a retailer could generate highly accurate data models without ever seeing personal information. "The beauty of this research is that retailers and other Web businesses are able to extract the valuable demographic information they need without knowing the underlying personal consumer data," said Harriet P. Pearson, IBM's Chief Privacy Officer. "I believe we'll see technological approaches such as this playing a larger role in managing privacy issues."

How It Works
A Web user decides to enter a piece of personal data, such as age, salary, or weight. Upon entry, that number (for example, age 30) is scrambled, or "randomized," by IBM software: the software takes the original number and adds a random value to it, which may be positive or negative. This randomization is performed independently for every user. So a 30-year-old's age may be randomized to 42, while a 34-year-old's entry may be randomized to 28.
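The randomization step described above can be sketched in a few lines of Python. This is a minimal illustration, not IBM's software: the function name `randomize`, the `spread` parameter, and the choice of a uniform distribution are assumptions, since the article says only that the random value comes from an allowed range.

```python
import random

def randomize(value, spread, rng=random):
    """Return value plus an offset drawn uniformly from [-spread, +spread].

    The uniform distribution is an assumption; the article only says the
    random value comes from an allowed range.
    """
    return value + rng.uniform(-spread, spread)

# Each user's entry gets its own independent draw.
# A spread of 12 years is a hypothetical setting chosen for illustration.
rng = random.Random()
for true_age in (30, 34):
    print(true_age, "->", round(randomize(true_age, 12, rng)))
```

Because every call draws a fresh offset, two users who enter the same value will usually be recorded with different values.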

What doesn't change is the allowed range of the randomization, and that range is directly linked to the desired level of privacy. A larger randomization range increases the uncertainty and therefore the personal privacy of the users. At the same time, larger randomizations can reduce the accuracy of the results ultimately produced by a data mining algorithm that uses the randomized data as input. According to Dr. Agrawal, it is clearly a trade-off. Experiments indicate only a 5-10 percent loss in accuracy, even for 100 percent randomization, after the data mining algorithm has applied corrections to the distributions.

Take the randomization of an IT manager's salary, which, for purposes of this example, may range between $50,000 and $150,000 per year. Then let's say that the Web merchant (or Web site owner) decides that the software's randomization parameter will be set to add a random value somewhere between -$30,000 and +$30,000.

Jane, who comes to the site and decides to enter her salary in exchange for personalized recommendations, has a salary of $100,000. When she enters this, the IBM software happens to pick a random value of -$15,000, so her salary is recorded as $85,000. To protect her privacy, no record is kept of her true salary. Then Bob comes to the site and enters his true salary of $90,000. The software happens to pick +$25,000 for Bob, and his salary is recorded as $115,000. Again, no record is kept of Bob's true salary.
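The arithmetic for Jane and Bob can be checked with a short sketch. The function name `recorded_salary` and the range check are hypothetical, added only to illustrate that the stored value is the true salary plus an offset from the allowed range.

```python
SPREAD = 30_000  # the merchant's randomization parameter from the example

def recorded_salary(true_salary, offset, spread=SPREAD):
    """Return the only value that is stored: the true salary plus the offset."""
    if abs(offset) > spread:
        raise ValueError("offset lies outside the allowed randomization range")
    return true_salary + offset

print(recorded_salary(100_000, -15_000))  # Jane: 85000
print(recorded_salary(90_000, 25_000))    # Bob: 115000
```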

To see the effect of the randomization, compare the true salary distribution of everyone who entered a salary on the site, Jane and Bob included, with the randomized distribution.

Salary range        Truthful distribution   Randomized distribution
$50,000-60,000      1 visitor               3 visitors
$60,000-70,000      4 visitors              7 visitors
$70,000-80,000      20 visitors             12 visitors
$90,000-100,000     50 visitors             33 visitors
$100,000-110,000    10 visitors             55 visitors
$110,000-120,000    45 visitors             23 visitors
$120,000-130,000    15 visitors             10 visitors
$130,000-140,000    3 visitors              2 visitors
$140,000-150,000    2 visitors              5 visitors

Note that the randomized list shows 55 people in the $100,000-110,000 range, whereas the true count is only 10. If this randomized data were used directly, the results would be very poor.

Once all the randomized data is in for a large number of users, the privacy-preserving data mining software takes the randomized distribution and reconstructs what the true distribution might have looked like.

The software cannot determine what Jane's or Bob's salary was. It has access only to the randomized values and the parameters of randomization (i.e., the random values that were added or subtracted came from the range -$30,000 to +$30,000), and nothing else. Based only on this information, the software reconstructs a close approximation of the true distribution. This reconstructed distribution is then used in building an accurate data mining model. Jane gets personalized recommendations by having the data mining model shipped to her client and applied locally.
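The article does not say how the software performs the reconstruction. One possible approach, sketched here under stated assumptions, is an iterative Bayesian re-weighting over salary bins: start from a flat guess, ask for each randomized value which bins could have produced it, and update the estimated share of each bin. The sketch assumes uniform noise in [-spread, +spread] and treats every bin as if its values sat at the bin midpoint. The function name `reconstruct`, its parameters, and the synthetic data are hypothetical.

```python
import random

def reconstruct(observed, lo, hi, n_bins, spread, iterations=50):
    """Estimate the share of true values in each bin from randomized values.

    Assumes each observed value is a true value in [lo, hi] plus noise
    drawn uniformly from [-spread, +spread]. Uses only the observed values
    and the noise range, never the true values.
    """
    width = (hi - lo) / n_bins
    mids = [lo + (k + 0.5) * width for k in range(n_bins)]
    est = [1.0 / n_bins] * n_bins  # start from a flat guess
    for _ in range(iterations):
        totals = [0.0] * n_bins
        used = 0
        for w in observed:
            # Only bins whose midpoint lies within `spread` of w could have
            # produced it; uniform noise makes them equally likely a priori.
            weights = [p if abs(w - m) <= spread else 0.0
                       for p, m in zip(est, mids)]
            norm = sum(weights)
            if norm == 0.0:
                continue  # no bin explains this observation; skip it
            used += 1
            for k, wt in enumerate(weights):
                totals[k] += wt / norm
        if used == 0:
            break
        est = [t / used for t in totals]
    return est

if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic "true" salaries, made up for illustration (not the article's data).
    true_salaries = [rng.choice([75_000, 95_000, 95_000, 115_000])
                     for _ in range(2000)]
    observed = [s + rng.uniform(-30_000, 30_000) for s in true_salaries]
    estimate = reconstruct(observed, 50_000, 150_000, 10, 30_000)
    for k, share in enumerate(estimate):
        low = 50_000 + k * 10_000
        print(f"${low:,}-{low + 10_000:,}: {share:.2f}")
```

The function receives only the randomized values and the noise range, mirroring the information the article says the software has. How closely the estimate tracks the true distribution depends on the number of users, the bin width, and the spread.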

According to Dr. Agrawal, the Privacy-Preserving Data Mining research has a wide range of potential applications, from medical research and building disease prediction models using randomized individual medical histories, to e-commerce and accurate promotions using randomized demographics of individual users.

Launched in early 2002, the IBM Privacy Institute is the industry's first formal technology research effort focused exclusively on developing privacy-enabling and data protection technologies for businesses. Under the direction of Dr. Michael Waidner, the Institute conducts privacy-enabling technology research in IBM's eight research laboratories around the world.

About Java News Desk
JDJ News Desk monitors the world of Java to present IT professionals with updates on technology advances, business trends, new products and standards in the Java and i-technology space.
