Anonymization and Pseudonymization of data used in Machine Learning Projects
https://www.bitkom.org/sites/default/files/2020-10/201002_lf_anonymisierung-und-pseudonymisierung-von-daten.pdf
Examples given:
- Processing of geolocation profiles (movements)
- Google’s COVID-19 Community Mobility Reports
- De-coupled pseudonyms, e.g. for manufacturers remotely monitoring machine performance at customer sites
- Speech recognition as example of federated learning
- Anonymization and pseudonymization of medical text data using Natural Language Processing
- Use of semantic anonymization of sensitive data with inference-based AI and active ontologies in the financial industry
Key words:
- Anonymization of structured data
- Approaches
- Aggregation approach
- Generalization, Microaggregation
- k-anonymity, l-diversity, t-closeness
- Mondrian algorithm, MDAV method (Maximum Distance to Average Vector)
- Randomization approach
- Synthetic approach
- (Creating a synthetic model based on original data to generate “matching” synthetic data)
- Attacks
- Was personal data of a known person used to generate the anonymous data?
- Which data in the anonymous data relates to personal data of a known person?
- Predicting attributes of a known person
- Static anonymization, Dynamic anonymization, Interactive anonymization
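The generalization idea above (coarsen quasi-identifiers until every combination occurs at least k times) can be sketched in a few lines of Python; the records, bucket width, and ZIP truncation are invented for illustration and are not from the Bitkom paper:

```python
from collections import Counter

def is_k_anonymous(records, quasi_ids, k):
    """Check that every quasi-identifier combination occurs at least k times."""
    combos = Counter(tuple(r[q] for q in quasi_ids) for r in records)
    return all(count >= k for count in combos.values())

def generalize_age(record, width=10):
    """Generalize an exact age into a range of the given width."""
    lo = (record["age"] // width) * width
    return {**record, "age": f"{lo}-{lo + width - 1}"}

records = [
    {"age": 23, "zip": "10115", "diagnosis": "flu"},
    {"age": 27, "zip": "10117", "diagnosis": "cold"},
    {"age": 25, "zip": "10119", "diagnosis": "flu"},
]
# Coarsen quasi-identifiers: bucket age, truncate ZIP code.
generalized = [{**generalize_age(r), "zip": r["zip"][:3] + "**"} for r in records]

print(is_k_anonymous(records, ["age", "zip"], 2))      # False: all combos unique
print(is_k_anonymous(generalized, ["age", "zip"], 3))  # True: one shared combo
```

Note that k-anonymity alone does not protect the sensitive attribute (here `diagnosis`); l-diversity and t-closeness add constraints on its distribution within each equivalence class.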
- Pseudonymization
- Format preserving encryption, Tokenization, Trusted third party, Pseudonymous Authentication (PAUTH), Oblivious transfer
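Tokenization, one of the pseudonymization techniques listed above, can be sketched as a lookup table held by a trusted third party; the class and method names here are hypothetical:

```python
import secrets

class TokenVault:
    """Toy tokenization vault: a trusted third party keeps the token table;
    downstream systems only ever see the tokens, never the raw values."""
    def __init__(self):
        self._to_token = {}
        self._to_value = {}

    def tokenize(self, value):
        if value not in self._to_token:
            token = secrets.token_hex(8)   # random token, no link to the value
            self._to_token[value] = token
            self._to_value[token] = value
        return self._to_token[value]

    def detokenize(self, token):
        return self._to_value[token]       # only the vault can reverse a token

vault = TokenVault()
t = vault.tokenize("jane.doe@example.org")
print(vault.detokenize(t))                          # jane.doe@example.org
print(vault.tokenize("jane.doe@example.org") == t)  # True: mapping is stable
```

Unlike format-preserving encryption, the token carries no structure of the original value; the mapping exists only inside the vault.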
- Anonymization of texts
- Ensure that free text includes no identifying terms (e.g. via organizational measures)
- Masking of identifying terms as part of post-processing
- Structuring via Natural Language Processing
- Caveat: Author might be identifiable based on writing style
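Masking identifying terms in post-processing might look like the following toy sketch; a real system would use NLP-based named-entity recognition rather than these hand-written patterns, which are invented for illustration:

```python
import re

# Toy patterns for identifying terms; production systems structure the text
# via Natural Language Processing instead of relying on regexes alone.
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "PHONE": re.compile(r"\+?\d[\d /-]{6,}\d\b"),
    "DATE":  re.compile(r"\b\d{1,2}\.\d{1,2}\.\d{4}\b"),
}

def mask(text):
    """Replace each matched identifying term with a category placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

note = "Patient reachable at jane@example.org or +49 30 1234567, admitted 03.05.2021."
print(mask(note))  # Patient reachable at [EMAIL] or [PHONE], admitted [DATE].
```

As the caveat above notes, even a fully masked text can still identify its author through writing style.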
- Anonymization of multimedia data
- Privacy via on-prem analysis and decentralization (see also: federated learning)
- Homomorphic encryption: fully homomorphic, partially homomorphic, somewhat homomorphic
- Secure multi-party computation
- Garbled circuits
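Secure multi-party computation in its simplest form, additive secret sharing, lets parties compute a joint sum without any party seeing the others' inputs; the hospital scenario and numbers below are invented for illustration:

```python
import random

PRIME = 2**61 - 1  # arithmetic over a field keeps shares uniformly random

def share(secret, n_parties):
    """Split a value into n additive shares that sum to the secret mod PRIME."""
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares

# Three hospitals each secret-share a private patient count.
counts = [120, 45, 87]
all_shares = [share(c, 3) for c in counts]

# Party i locally adds the i-th share of every input; no raw count is revealed.
partial_sums = [sum(col) % PRIME for col in zip(*all_shares)]

# Only the recombined total becomes public.
print(sum(partial_sums) % PRIME)  # 252
```

Each share in isolation is uniformly random, so a single party learns nothing; only the final sum is disclosed.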
- Privacy risks related to machine learning and controls
- Identification of persons
- De-anonymization of data (e.g. of blurred images)
- Membership inference
- Model inversion
- Defeating noise, among others
- Federated learning
- (Moving algorithms to the local data – instead of moving data to a central algorithm)
- (Local data doesn’t leave device)
- AI models as personal data
- Legal advantages of federated learning
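Federated averaging, the standard federated-learning scheme, can be illustrated with a one-parameter linear model: clients train on data that never leaves them, and only model weights travel to the server for averaging. All data and hyperparameters here are made up:

```python
def local_update(w, data, lr=0.01, epochs=5):
    """One client's training round on data that never leaves the device."""
    for _ in range(epochs):
        # Gradient of mean squared error for the model y = w * x.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

# Each client holds its own private samples of the same underlying y = 3x.
clients = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(3.0, 9.0), (4.0, 12.0)],
    [(0.5, 1.5), (5.0, 15.0)],
]

w_global = 0.0
for _round in range(50):
    # Clients train locally; only the updated weights reach the server.
    local_ws = [local_update(w_global, data) for data in clients]
    w_global = sum(local_ws) / len(local_ws)  # federated averaging

print(round(w_global, 2))  # → 3.0
```

Note that the exchanged weights can themselves leak information about local data, which is why the attacks below still matter in federated settings.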
- Attacks and controls
- Model inversion
- Querying the trained AI model to reconstruct its training data
- Membership inference
- Was a given data point used to train the model?
- Model extraction
- “Stealing” the trained model – by cloning the behaviour and predictive capabilities of a given AI model
- Adversarial examples (creating inputs that trigger unintended responses)
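A membership inference attack in its simplest form exploits that overfitted models are more confident on their training points; the toy model below memorizes its training set outright, so the effect is deliberately exaggerated for illustration:

```python
def confidence(model_train_set, point):
    """Toy overfitted model: confidence decays with distance to the nearest
    memorized training point, so exact members score 1.0."""
    nearest = min(abs(point - t) for t in model_train_set)
    return 1.0 / (1.0 + nearest)

train_set = [0.1, 0.4, 0.7, 0.9]     # private training data
candidates = [0.4, 0.9, 0.25, 0.55]  # points the attacker probes

# Membership inference: high confidence suggests the point was trained on.
THRESHOLD = 0.95
for p in candidates:
    print(p, "member?", confidence(train_set, p) > THRESHOLD)
```

The countermeasures listed next (restricting outputs, adversarial regularization, differential privacy) all work by shrinking this confidence gap between members and non-members.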
- Countermeasures
- Restrictions on outputs
- Adversarial Regularization
- Distillation
- Differential Privacy
- Cryptography
- Secure multi-party computation (MPC)
- Federated machine learning
- Differentially Private Data Synthesis (DIPS) (e.g. via copula functions, Generative Adversarial Networks)
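Differential privacy, listed among the countermeasures, is commonly implemented via the Laplace mechanism; this sketch adds calibrated noise to a counting query (the count and epsilon values are illustrative):

```python
import math
import random

def laplace_noise(scale):
    """Inverse-transform sampling of Laplace(0, scale) noise."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_count(true_count, epsilon):
    """A counting query has sensitivity 1, so Laplace noise with scale
    1/epsilon yields epsilon-differential privacy for the released answer."""
    return true_count + laplace_noise(1.0 / epsilon)

random.seed(0)
true_count = 1000
# Smaller epsilon = stronger privacy = noisier released answers.
for eps in (0.1, 1.0, 10.0):
    print(eps, round(dp_count(true_count, eps), 1))
```

The same mechanism underlies DIPS-style approaches: a generator (e.g. a copula model or GAN) is trained under a differential-privacy budget, and the synthetic records inherit that guarantee.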