JCSE, vol. 5, no. 3, pp.183-196, 2011
DOI: 10.5626/JCSE.2011.5.3.183
Privacy Disclosure and Preservation in Learning with Multi-Relational Databases
Hongyu Guo, Herna L. Viktor, Eric Paquet
Institute for Information Technology, National Research Council of Canada, Ottawa, Canada/ School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, Canada/ Institute for Information Technology, National Research Council of Canada, School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, Canada
Abstract: There has recently been a surge of interest in relational database mining that aims to discover useful patterns across multiple interlinked
database relations. It is crucial for a learning algorithm to explore the multiple inter-connected relations so that important
attributes are not excluded when mining such relational repositories. However, from a data privacy perspective, it becomes difficult to
identify all possible relationships between attributes from the different relations when the database schema is complex. That is,
seemingly harmless attributes may be linked to confidential information, leading to data leaks when building a model. Thus, we are at
risk of disclosing unwanted knowledge when publishing the results of a data mining exercise. For instance, consider a financial database
classification task to determine whether a loan is considered high risk. Suppose that we are aware that the database contains
another confidential attribute, such as income level, that should not be divulged. One may thus choose to eliminate, or distort, the
income level from the database to prevent potential privacy leakage. However, even after distortion, a learning model built against the modified
database may still determine the income level values accurately. It follows that the database is still unsafe and may be compromised.
This paper demonstrates this potential for privacy leakage in multi-relational classification and illustrates how such potential leaks
may be detected. We propose a method to generate a ranked list of subschemas that maintain the predictive performance on the class
attribute while limiting the disclosure risk, and predictive accuracy, of confidential attributes. We demonstrate the effectiveness
of our method on a financial database and an insurance database.
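The leakage scenario described above can be illustrated with a minimal sketch. It is not the authors' method: the toy records, the field names, the `resub_accuracy` helper, and the ranking score (class accuracy minus confidential-attribute accuracy) are all assumptions made for illustration, and the accuracies are measured on the same toy rows used to build the lookup model. The sketch shows that dropping the confidential column alone does not help if another attribute predicts it, and that attribute subsets can be ranked by how well they predict the class while predicting the confidential attribute poorly.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical toy records; income_level is the confidential attribute,
# high_risk is the class attribute to be predicted.
FIELDS = ("account_type", "loan_size", "district", "income_level", "high_risk")
ROWS = [
    ("gold", "large", "north", "high", "no"),
    ("gold", "large", "south", "high", "no"),
    ("gold", "small", "north", "high", "no"),
    ("basic", "large", "north", "low", "yes"),
    ("basic", "large", "south", "low", "yes"),
    ("basic", "small", "south", "low", "no"),
    ("basic", "small", "north", "low", "no"),
    ("gold", "small", "south", "high", "no"),
]


def resub_accuracy(rows, features, label):
    """Accuracy, on the same rows, of a majority-vote lookup table
    that predicts `label` from the values of `features`."""
    f_idx = [FIELDS.index(f) for f in features]
    l_idx = FIELDS.index(label)
    votes = defaultdict(Counter)
    for row in rows:
        votes[tuple(row[i] for i in f_idx)][row[l_idx]] += 1
    # Each group of identical feature values predicts its most frequent label.
    correct = sum(c.most_common(1)[0][1] for c in votes.values())
    return correct / len(rows)


# Attributes that remain after the confidential column is removed.
PUBLIC = ("account_type", "loan_size", "district")

# Leak check: can the removed attribute be recovered from what is left?
baseline = resub_accuracy(ROWS, [], "income_level")    # majority guess only
leak = resub_accuracy(ROWS, PUBLIC, "income_level")    # using public attributes

# Rank every non-empty attribute subset: high class accuracy is good,
# high accuracy on the confidential attribute is bad (assumed score).
ranked = []
for k in range(1, len(PUBLIC) + 1):
    for subset in combinations(PUBLIC, k):
        target_acc = resub_accuracy(ROWS, subset, "high_risk")
        conf_acc = resub_accuracy(ROWS, subset, "income_level")
        ranked.append((target_acc - conf_acc, subset, target_acc, conf_acc))
ranked.sort(key=lambda item: -item[0])

if __name__ == "__main__":
    print(f"income_level accuracy: baseline={baseline}, with public attrs={leak}")
    for score, subset, target_acc, conf_acc in ranked:
        print(f"{subset}: class={target_acc} confidential={conf_acc} score={score}")
```

In this toy data `account_type` fully determines `income_level`, so every subset containing it ranks below the subsets that omit it, even though including it improves accuracy on the class attribute.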
Keywords:
Privacy-preserving data mining; multi-relational mining; relational database