k-均值问题的差分隐私算法综述

doi:10.15960/j.cnki.issn.1007-6093.2022.03.001

摘要/Abstract

摘要：

$k$-均值问题是机器学习和组合优化领域十分重要的问题。它是经典的NP-难问题, 被广泛的应用于数据挖掘、企业生产决策、图像处理、生物医疗科技等领域。随着时代的发展, 人们越来越注重于个人的隐私保护:在决策通常由人工智能算法做出的情况下, 如何保证尽可能多地从数据中挖掘更多信息，同时不泄露个人隐私。近十年来不断有专家学者研究探索带隐私保护的$k$-均值问题, 得到了许多具有理论指导意义和实际应用价值的结果, 本文主要介绍关于$k$-均值问题的差分隐私算法供读者参考。

关键词: $k$-均值问题, 差分隐私, 近似算法, 指数机制, 拉普拉斯机制

Abstract:

The $k$-means problem is a very important problem in the field of machine learning and combinatorial optimization. It is a classic NP-hard problem, which is widely used in data mining, business production decision-making, image processing, biomedical technology, and more. As people in these fields pay more and more attention to personal privacy protection, which raises a question: In the case that decisions are usually made by artificial intelligence algorithms, how to ensure that as much information as possible is extracted from the data without revealing personal privacy? In the past ten years, experts and scholars have continuously studied and explored the $k$-means problem with privacy protection, and has also obtained many results with theoretical guiding significance and practical application value. In this paper we mainly introduce these differential privacy algorithms of the $k$-means problem for readers.

Key words: $k$-means problem, differential privacy, approximate algorithm, exponential mechanism, Laplace mechanism

中图分类号:

O221.7

袁藩, 徐大川, 张冬梅. k-均值问题的差分隐私算法综述[J]. 运筹学学报, 2022, 26(3): 1-16.

Fan YUAN, Dachuan XU, Dongmei ZHANG. A survey of differential privacy algorithms for the $k$-means problem[J]. Operations Research Transactions, 2022, 26(3): 1-16.

图/表 4

图1

表1

图2

表2

参考文献 53

1	Dasgupta S. The hardness of $k$-means clustering[R]. San Diego: Department of Computer Science and Engineering, University of California, 2008: CS2008-0916.
2	AhmadianS,Norouzi-FardA,SvenssonO,et al.Better guarantees for $k$-means and euclidean $k$-median by primal-dual algorithms[J].SIAM Journal on Computing,2019,49(4):FOCS17-97-FOCS17-156.
3	张冬梅,李敏,徐大川,等.$k$-均值问题的理论与算法综述[J].中国科学: 数学,2020,50(9):1387-1404.
4	LuR,ZhuH,LiuX,et al.Toward efficient and privacy-preserving computing in big data era[J].IEEE Network,2014,28(4):46-50. doi: 10.1109/MNET.2014.6863131
5	WangT,ZhengZ,RehmaniM H,et al.Privacy preservation in big data from the communication perspective–-A survey[J].IEEE Communications Surveys Tutorials,2018,21(1):753-778.
6	Yao X, Zhou X, Ma J. Differential privacy of big data: An overview[C]//Proceedings of the 2nd IEEE International Conference on Big Data Security on Cloud, High Performance and Smart Computing and Intelligent Data and Security, 2016: 7-12.
7	JainP,GyanchandaniM,KhareN.Differential privacy: its technological prescriptive using big data[J].Journal of Big Data,2018,5(1):1-24. doi: 10.1186/s40537-017-0110-7
8	Skarkala M E, Maragoudakis M, Gritzalis S, et al. Privacy preservation by $k$-anonymization of weighted social networks[C]//Proceedings of the 4th IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, 2012: 423-428.
9	Zheleva E, Getoor L. Privacy in social networks: A survey[M]//Social Network Data Analytics, Boston: Springer, 2011: 277-306.
10	Task C, Clifton C. What should we protect? Defining differential privacy for social network analysis[M]//State of the Art Applications of Social Network Analysis. Cham: Springer, 2014: 139-161.
11	Gowtham M, Ahila S S. Privacy enhanced data communication protocol for wireless body area network[C]//Proceedings of the 4th International Conference on Advanced Computing and Communication Systems, 2017: 1-5.
12	LiM,LouW,RenK.Data security and privacy in wireless body area networks[J].IEEE Wireless communications,2010,17(1):51-58. doi: 10.1109/MWC.2010.5416350
13	Abadi M, Chu A, Goodfellow I, et al. Deep learning with differential privacy[C]//Proceedings of the 16th ACM SIGSAC Conference on Computer and Communications Security, 2016: 308-318.
14	KaissisG A,MakowskiM R,RückertD,et al.Secure, privacy-preserving and federated machine learning in medical imaging[J].Nature Machine Intelligence,2020,2(6):305-311. doi: 10.1038/s42256-020-0186-1
15	KaissisG,ZillerA,Passerat-PalmbachJ,et al.End-to-end privacy preserving deep learning on multi-institutional medical imaging[J].Nature Machine Intelligence,2021,3(6):473-484. doi: 10.1038/s42256-021-00337-8
16	Chaturvedi A, Nguy$\rm \tilde{\hat{e}}$n H L, Zakynthinou L. Differentially private decomposable submodular maximization[C]//Proceedings of the 35th AAAI Conference on Artificial Intelligence, 2021: 6984-6992.
17	Mitrovic M, Bun M, Krause A, et al. Differentially private submodular maximization: data summarization in disguise[C]//Proceedings of the 34th International Conference on Machine Learning, 2017: 2478-2487.
18	Rafiey A, Yoshida Y. Fast and private submodular and $k$-submodular functions maximization with matroid constraints[C]//Proceedings of the 37th International Conference on Machine Learning, 2020: 7887-7897.
19	Gupta A, Ligett K, McSherry F, et al. Differentially private combinatorial optimization[C]//Proceedings of 21st Annual ACM-SIAM Symposium on Discrete Algorithms, 2010: 1106-1125.
20	MatoušekJ.On approximate geometric $k$-clustering[J].Discrete and Computational Geometry,2000,24,61-84. doi: 10.1007/s004540010019
21	Dwork C. Differential privacy[C]//Proceedings of the 33rd International Colloquium on Automata, Languages, and Programming, 2006: 1-12.
22	Dwork C, McSherry F, Nissim K, et al. Calibrating noise to sensitivity in private data analysis[C]//Proceedings of the 3rd Theory of Cryptography Conference, 2006: 265-284.
23	Mcsherry F, Talwar K. Mechanism design via differential privacy[C]//Proceedings of the 48th Annual IEEE Symposium on Foundations of Computer Science, 2007: 94-103.
24	McSherry F. Privacy integrated queries: an extensible platform for privacy-preserving data analysis[C]//Proceedings of the 35th ACM SIGMOD International Conference on Management of Data, 2009: 19-30.
25	Fisher C. Over 267 million facebook users reportedly had data exposed online[EB/OL]. [2021-12-20]. https://www.engadget.com/2019/12/19/facebook-data-exposed-online/.
26	Daniel V, Kershner I. Personal data of all 6.5 million israeli voters is exposed[EB/OL]. [2021-12-20]. https://www.nytimes.com/2020/02/10/world/middleeast/israeli-voters-leak.html.
27	Bassily R, Smith A, Thakurta A. Private empirical risk minimization: Efficient algorithms and tight error bounds[C]//Proceedings of the IEEE 55th Annual Symposium on Foundations of Computer Science, 2014: 464-473.
28	Balcan M F, Dick T, Liang Y, et al. Differentially private clustering in high-dimensional euclidean spaces[C]//Proceedings of the 34th International Conference on Machine Learning, 2017: 322-331.
29	DworkC,RothA.The algorithmic foundations of differential privacy[J].Foundations and Trends in Theoretical Computer Science,2014,9(3/4):211-407.
30	Wang Y, Wang Y X, Singh A. Differentially private subspace clustering[C]//Proceedings of the 28th International Conference on Neural Information Processing Systems, 2015: 1000-1008.
31	Su D, Cao J, Li N, et al. Differentially private $k$-means clustering[C]//Proceedings of the 6th ACM Conference on Data and Application Security and Privacy, 2016: 26-37.
32	Feldman D, Xiang C, Zhu R, et al. Coresets for differentially private $k$-means clustering and applications to privacy in mobile sensor networks[C]//Proceedings of the 16th ACM/IEEE International Conference on Information Processing in Sensor Networks, 2017: 3-16.
33	Huang Z, Liu J. Optimal differentially private algorithms for $k$-means clustering[C]//Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, 2018: 395-408.
34	Nock R, Canyasse R, Boreli R, et al. $k$-variates++: more pluses in the $k$-means++[C]//Proceedings of the 33rd International Conference on Machine Learning, 2016: 145-154.
35	Jones M, Nguyen H L, Nguyen T D. Differentially private clustering via maximum coverage[C]//Proceedings of the 35th AAAI Conference on Artificial Intelligence, 2021: 11555-11563.
36	Nguyen H L, Chaturvedi A, Xu E Z. Differentially private $k$-means via exponential mechanism and max cover[C]//Proceedings of the 35th AAAI Conference on Artificial Intelligence, 2021: 9101-9108.
37	Arthur D, Vassilvitskii S. $k$-means++: the advantages of careful seeding[C]//Proceedings of the 8th Annual ACM-SIAM Symposium on Discrete Algorithms, 2007: 1027-1035.
38	Nissim K, Stemmer U, Vadhan S P. Locating a small cluster privately[C]//Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, 2016: 413-427.
39	Blum A, Dwork C, McSherry F, et al. Practical privacy: the SuLQ framework[C]//Proceedings of the 24th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, 2005: 128-138.
40	Nissim K, Raskhodnikova S, Smith A. Smooth sensitivity and sampling in private data analysis[C]//Proceedings of the 39th Annual ACM Symposium on Theory of Computing, 2007: 75-84.
41	Zhang J, Xiao X, Yang Y, et al. PrivGene: differentially private model fitting using genetic algorithms[C]//Proceedings of the 13th ACM SIGMOD International Conference on Management of Data, 2013: 665-676.
42	Ghazi B, Kumar R, Manurangsi P. Differentially private clustering: Tight approximation ratios[C]//Proceedings of the 34th Conference on Neural Information Processing Systems, 2020: 33-93.
43	Makarychev K, Makarychev Y, Razenshteyn I. Performance of johnson-lindenstrauss transform for $k$-means and $k$-medians clustering[C]//Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, 2019: 1027-1038.
44	KirszbraunM.ü ber die zusammenziehende und lipschitzsche transformationen[J].Fundamenta Mathematicae,1934,22(1):77-108.
45	Feldman D, Fiat A, Kaplan H, et al. Private coresets[C]//Proceedings of the 41st Annual ACM Symposium on Theory of Computing, 2009: 361-370.
46	Stemmer U. Locally private $k$-means clustering[C]//Proceedings of the 14th Annual ACM-SIAM Symposium on Discrete Algorithms, 2020: 548-559.
47	Nissim K, Stemmer U. Clustering algorithms for the centralized and local models[C]//Proceedings of the 29th Algorithmic Learning Theory, 2018: 619-653.
48	Stemmer U, Kaplan H. Differentially private $k$-means with constant multiplicative error[C]//Proceedings of the 31st Annual Conference on Neural Information Processing Systems, 2018: 5431-5441.
49	Hsu J, Khanna S, Roth A. Distributed private heavy hitters[C]//Proceedings of the 39th International Colloquium on Automata, Languages, and Programming, 2012: 461-472.
50	Chang A, Ghazi B, Kumar R, et al. Locally private $k$-means in one round[C]//IProceedings of the 38th International Conference on Machine Learning, 2021: 1441-1451.
51	Har-PeledS,MendelM.Fast construction of nets in low-dimensional metrics and their applications[J].SIAM Journal on Computing,2006,35(5):1148-1184. doi: 10.1137/S0097539704446281
52	Chaturvedi A, Jones M, Nguyen H L. Locally private $k$-means clustering with constant multiplicative approximation and near-optimal additive error[EB/OL]. [2021-12-19]. https://doi.org/10.48550/arXiv.2105.15007.
53	BassilyR,NissimK,StemmerU,et al.Practical locally private heavy hitters[J].Journal of Machine Learning Research,2020,21(16):1-42.

文献	乘法项误差系数	加法项误差
Nock等^[34]	$ O(\log(k)) $	$ O(n\Delta^{2}/(\varepsilon+\log n)^{2}) $
Balcan等^[28]	$ O(\log^{3}(n)) $	$ O({\rm poly}(\log(n), d, k)) $
Nguyen等^[36]	$ O(1) $	$ O({\rm poly}(\log(n), \sqrt{d}, k)) $
Ghazi等^[42]	$ (1+\alpha)\cdot\eta $	$ O( ( ({kd+ k^{O(\alpha(1))}})/{\varepsilon} )\cdot {\rm poly} \log(n)) $

文献	乘法项误差系数	加法项误差
Nissim和Stemmer ^[47]	$ O(k) $	$ O(n^{2/3+a}\cdot d^{1/3} \cdot k^{1/2}) $
Kaplan和Stemmer ^[48]	$ O(1) $	$ O(n^{2/3+a}\cdot d^{1/3} \cdot k^{2}) $
Stemmer ^[46]	$ O(1) $	$ O(n^{1/2+a}\cdot k \cdot {\rm max} \{\sqrt{d}, \sqrt{k}\}) $
Chaturvedi等^[52], 算法1	$ (1+\alpha)\cdot\eta $	$ O(n^{1/2}\cdot d^{1/2} \cdot k^{O(1/\alpha^{2})}) $
Chaturvedi等^[52], 算法2	$ O(c^{2}) $	$ O(n^{1/2}\cdot d^{1/2} \cdot k^{1+O(1/(2c^{2}-1))}) $
Chang等^[50]	$ (1+\alpha)\cdot\eta $	$ O(n^{1/2}\cdot d^{1/2}\cdot k^{O_{\alpha}(1)} \cdot {\rm poly(\log(n)/\varepsilon)}) $

[1]	张安, 陈永, 陈光亭, 陈占文, 舒巧君, 林国辉. 优先序约束的排序问题：基于最大匹配的近似算法[J]. 运筹学学报, 2022, 26(3): 57-74.
[2]	王冬, 李刚刚, 罗文昌. 工件具有任意尺寸的混合分批平行机排序问题的近似算法[J]. 运筹学学报, 2022, 26(3): 133-142.
[3]	毕春燕, 万龙, 罗文昌. 工件有到达时间及可拒绝下的同类平行机排序问题的近似算法[J]. 运筹学学报, 2022, 26(2): 73-82.
[4]	李建平, 蔡力健, 李陈筠然, 潘鹏翔. 限制性多源点偏心距增广问题[J]. 运筹学学报, 2022, 26(1): 60-68.
[5]	陈婳, 张国川. 经典一维装箱问题近似算法的研究进展[J]. 运筹学学报, 2022, 26(1): 69-84.
[6]	刘文杰, 张冬梅, 张鹏, 邹娟. 带惩罚μ-相似Bregman散度k-均值问题的初始化算法[J]. 运筹学学报, 2022, 26(1): 99-112.
[7]	剧嘉琛, 刘茜, 张昭, 周洋. 带惩罚的相同容量k-均值问题的局部搜索算法[J]. 运筹学学报, 2022, 26(1): 113-124.
[8]	李小玮, 成夏炎, 李荣珩. 带有线性惩罚的鲁棒k-种产品设施选址问题的近似算法[J]. 运筹学学报, 2021, 25(4): 31-44.
[9]	包晓光, 路超, 黄冬梅, 余炜. 混合图上最小-最大圈覆盖问题的近似算法[J]. 运筹学学报, 2021, 25(1): 107-113.
[10]	任建峰, 田晓云. 带有异常点的平方度量设施选址问题[J]. 运筹学学报, 2021, 25(1): 114-122.
[11]	张玉忠. 工件可拒绝排序问题综述[J]. 运筹学学报, 2020, 24(2): 111-130.
[12]	刘晓霞, 余山杉, 罗文昌. 工作有到达时间且拒绝工件总个数受限的单机平行分批排序问题的近似算法[J]. 运筹学学报, 2020, 24(1): 131-139.
[13]	王翼展, 张安, 陈永, 陈光亭. 堆取料机调度问题的一个近似算法[J]. 运筹学学报, 2020, 24(1): 147-154.
[14]	张国川, 陈林. 负载均衡问题[J]. 运筹学学报, 2019, 23(3): 1-14.
[15]	姜燕君, 徐大川, 张冬梅. 平方度量动态设施选址问题的近似算法[J]. 运筹学学报, 2018, 22(3): 49-58.