Proceedings of the Nineteenth International Conference on Computational Intelligence and Security (CIS 2023)
December 1 – 4, 2023, Haikou, China

A Method for Generating Data that Preserves Privacy and Enhances Utility

Luo Xiaopeng1 and Luo Zongwei2

1Interdisciplinary Research and Application for Data Science, BNU-HKBU United International College, Zhuhai, China.

2IAFN, BNU-HKBU United International College, Zhuhai, China.

ABSTRACT

As machine learning technology advances rapidly, the demand for data is growing quickly. In many practical applications, however, efficiently capturing useful information from private data remains challenging. Moreover, personal information in the data can seriously threaten the privacy of participating users, who are the building blocks of data-driven decision-making. The spread of communication technology and data collection devices has led to mixed-type data containing both numerical and categorical features. Mixed data provide richer and more comprehensive information that helps reveal hidden patterns between features and data labels. Not all features, however, contribute equally to the classification task. Features that are poorly correlated with the labels may add little valuable information to the dataset and can even reduce the accuracy of the analysis results; such features are treated as noise that is irrelevant to the analysis task. This paper proposes a novel data synthesis method that accounts for the relevance of heterogeneous features to the data labels, even in scenarios with limited data. By enforcing strict privacy constraints through differential privacy and protecting users' private information with noise, the method generates new data, increasing the quantity and diversity of the training data while preserving its utility. We evaluate the newly generated, privacy-protected data experimentally by assessing their utility for classifiers. The experimental results show that the method preserves the utility of the original data and improves the classifiers' results.

Keywords: Mixed data, Feature selection, Feature ranking, Privacy-preserving, Data synthesis.
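
To illustrate the general idea of generating data under differential privacy with added noise, the following is a minimal Python sketch of a standard technique: a Laplace-perturbed histogram of one categorical feature, from which synthetic values are sampled. It is not the method proposed in the paper, whose details are not given in the abstract. The function names (`laplace_noise`, `noisy_histogram`, `sample_synthetic`), the privacy budget `epsilon`, and the assumption that the category domain is fixed and public are all illustrative assumptions.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponential variables with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_histogram(values, categories, epsilon, rng):
    # Assumption: `categories` is a fixed, publicly known domain,
    # not derived from the private data.
    # Adding or removing one record changes a single count by 1,
    # so the L1 sensitivity is 1 and the Laplace scale is 1 / epsilon.
    counts = Counter(values)
    scale = 1.0 / epsilon
    return {
        c: max(counts.get(c, 0) + laplace_noise(scale, rng), 0.0)
        for c in categories
    }


def sample_synthetic(noisy, n, rng):
    # Sample n synthetic values in proportion to the noisy counts.
    cats = list(noisy)
    weights = [noisy[c] for c in cats]
    if sum(weights) == 0:
        # All counts were clamped to zero: fall back to uniform sampling.
        weights = [1.0] * len(cats)
    return rng.choices(cats, weights=weights, k=n)


rng = random.Random(0)
private_values = ["a", "a", "b"]          # toy private column
domain = ["a", "b", "c"]                  # assumed public domain
noisy = noisy_histogram(private_values, domain, epsilon=1.0, rng=rng)
synthetic = sample_synthetic(noisy, n=10, rng=rng)
```

Clamping negative counts and sampling are post-processing steps, so they do not consume additional privacy budget. Extending such a sketch to mixed data would require handling numerical features and splitting the budget across features, which is not shown here.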


