Paper: Bitkom: Anonymisierung und Pseudonymisierung von Daten für Projekte des maschinellen Lernens

Anonymization and Pseudonymization of data used in Machine Learning Projects

https://www.bitkom.org/sites/default/files/2020-10/201002_lf_anonymisierung-und-pseudonymisierung-von-daten.pdf

Examples given:

  • Processing of geolocation profiles (movements)
  • Google’s COVID-19 Community Mobility Reports
  • De-coupled pseudonyms, e.g. for manufacturers remotely monitoring machine performance at customer sites
  • Speech recognition as example of federated learning
  • Anonymization and pseudonymization of medical text data using Natural Language Processing
  • Use of semantic anonymization of sensitive data with inference-based AI and active ontologies in the financial industry

Key words:

    • Anonymization of structured data
      • Approaches
        • Aggregation approach
          • Generalization, Microaggregation
          • k-anonymity, l-diversity, t-closeness
          • Mondrian algorithm, MDAV method (Maximum Distance to Average Vector)
        • Randomization approach
          • Adding noise
        • Synthetic approach
          • (Creating a synthetic model based on original data to generate “matching” synthetic data)
      • Attacks
        • Was personal data of a known person used to generate the anonymous data?
        • Which data in the anonymous data relates to personal data of a known person?
        • Predicting attributes of a known person
      • Static anonymization, Dynamic anonymization, Interactive anonymization
      • Pseudonymization
        • Format preserving encryption, Tokenization, Trusted third party, Pseudonymous Authentication (PAUTH), Oblivious transfer
      • Anonymization of texts
        • Ensure that free text includes no identifying terms (e.g. via organizational measures)
        • Masking of identifying terms as part of post-processing
        • Structuring via Natural Language Processing
        • Caveat: Author might be identifiable based on writing style
      • Anonymization of multimedia data
      • Privacy via on-prem analysis and decentralization (see also: federated learning)
        • Homomorphic encryption: fully homomorphic, partially homomorphic, somewhat homomorphic
        • Secure multi-party computation
        • Garbled circuits
      • Privacy risks related to machine learning and controls
        • Identification of persons
        • Deanonymization of data (e.g. of blurred images)
        • Membership inference
        • Model inversion
        • Defeating noise, and others
    • Federated learning
      • (Moving algorithms to the local data – instead of moving data to central algorithm)
      • (Local data doesn’t leave device)
      • AI models as personal data
      • Legal advantages of federated learning
    • Attacks and controls
      • Model inversion
        • Querying the trained AI model to reconstruct its training data
      • Membership inference
        • Was a given data point used to train the model?
      • Model extraction
        • “Stealing” the trained model – by cloning the behaviour and predictive capabilities of a given AI model
      • Adversarial examples (creating inputs that trigger unintended responses)
      • Countermeasures
        • Restrictions on outputs
        • Adversarial Regularization
        • Distillation
        • Differential Privacy
        • Cryptography
        • Secure multi-party computation (MPC)
        • Federated machine learning
        • Differential Private Data Synthesis (DIPS) (e.g. via Copula functions, Generative Adversarial Networks)
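
To make the aggregation approach from the key words above more concrete, here is a minimal Python sketch of generalization followed by a k-anonymity check. It is not taken from the Bitkom paper: the records, the choice of quasi-identifiers (`age`, `zip`) and the helper names `generalize` and `k_anonymity` are assumptions made purely for illustration.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return k: the size of the smallest group of records that share
    the same values for all quasi-identifiers (0 for an empty data set)."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values()) if groups else 0

def generalize(record, age_width=10, zip_prefix=3):
    """Generalize a record: replace the age by a range and mask the
    trailing digits of the postal code (illustrative choices only)."""
    low = (record["age"] // age_width) * age_width
    masked_zip = record["zip"][:zip_prefix] + "*" * (len(record["zip"]) - zip_prefix)
    return {**record, "age": f"{low}-{low + age_width - 1}", "zip": masked_zip}

# Hypothetical toy data, not from the paper
records = [
    {"age": 23, "zip": "10115", "diagnosis": "flu"},
    {"age": 27, "zip": "10117", "diagnosis": "asthma"},
    {"age": 34, "zip": "10245", "diagnosis": "flu"},
    {"age": 38, "zip": "10247", "diagnosis": "diabetes"},
]

generalized = [generalize(r) for r in records]

k_raw = k_anonymity(records, ["age", "zip"])
k_generalized = k_anonymity(generalized, ["age", "zip"])
```

In this toy data every raw record is unique on (`age`, `zip`), while after generalization each combination is shared by two records. Note that k-anonymity alone says nothing about the sensitive attribute; that gap is what l-diversity and t-closeness address.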

Spain: AEPD publishes Privacy-by-Design/Privacy-by-Default Guideline

Link to AEPD’s English translation: https://www.aepd.es/sites/default/files/2020-10/guia-proteccion-datos-por-defecto-en.pdf

Press release (with links to files):
https://www.aepd.es/es/prensa-y-comunicacion/notas-de-prensa/aepd-publica-guia-proteccion-datos-por-defecto

Guideline
https://www.aepd.es/sites/default/files/2020-10/guia-proteccion-datos-por-defecto.pdf

Excel sheet with measures
https://www.aepd.es/media/guias/PDpD-listado-medidas.xlsx

Quick overview of the measures in the Excel sheet
(Quick and dirty translation – please use with a grain of salt!)

  • Amount of personal data
    • Anonymous mode operation.
    • Operation without the need to create a user account.
    • Operation with different user accounts on the same device for the same data subject.
    • Operation with different user accounts on different devices for the same data subject and processing.
    • Identification through privacy-enhancing tools and technologies such as attribute-based credentials, zero-knowledge proofs, …
    • Data aggregation: in time, in space, by groups …
    • Calibration of the granularity of the data: e.g. reduce the frequency of collection of location data, measurement data, etc.
    • Generalization of the data: use ranges instead of exact ages, postal codes instead of full addresses.
    • Grading of the extent of the data collected based on the use cases
    • Alternatives and voluntariness in the contact information claimed from the user: e-mail, postal, telephone …
    • Tracking techniques used in the processing (cookies, pixel tags, fingerprinting, etc.)
    • Configuration of unique identifiers (tracking IDs), the programming of their reinitialization and the warning of activation times.
    • Device metadata collected from the device (battery consumption, O.S., versions, languages, etc.).
    • Metadata included in the media processed or generated (in documents, photos, videos, etc.)
    • Information collected about the user’s internet connection (device with which it connects, IP address, device sensor data, application used, browsing and search log, date and time stamp of web page request, etc.) and information about elements near the device (Wi-Fi access points, mobile phone service antennas, bluetooth enabled devices, etc.).
    • Information collected about user activity on the device: power on, activation of applications, use of keyboard, mouse, etc.
    • Mechanisms for staggered collection of the information necessary for the processing. Delay data collection until the stage where it is necessary.
    • Type and volume of new data inferred from automated processes such as machine learning or other artificial intelligence techniques.
    • Data enrichment and linking to external data sets
    • Activation and deactivation at will of the data collection systems (cameras, microphones, GPS, bluetooth, wifi, movement, etc.).
    • Establish a time schedule for when sensors (e.g. cameras, microphones, etc.) can be operational.
    • Incorporation of obfuscation mechanisms to avoid the processing of biometric data in photos, video, keyboard, mouse, etc.
    • Physical blockers (such as tabs to cover camera lenses, speaker blockers, etc.).
    • Use of privacy masks or pixelation in video surveillance systems.

  • Extent of processing
    • Definition and design of the processing operations to minimize the number of temporary copies of data that are generated and to minimize retention periods, transfers and communications
    • Pseudonymization according to the processing operations that may exist in each phase or stage.
    • Local and isolated processing, including the possibility of local storage.
    • Additional processing of collected metadata – log files.
    • Exercise of the rights to object, to restriction of processing, or to erasure.
    • Processing settings for profiling or automatic decisions (in the case of cookies)
    • Possibility of configuring all optional processing operations for non-essential purposes: for example, data processing to improve the service, analysis of use, personalization of ads, detection of usage patterns, etc.
    • Configuration of a secure deletion of temporary files, mainly those located outside the user’s device and outside the controller’s systems
    • Incorporation of an option to reinitialize user data to restart the relationship from scratch
    • Setting the data enrichment option
    • Consider mechanisms to audit the existence of Dark Patterns
    • Specific section for configuration options related to sensitive data
    • Help and transparency panel with examples of use and possible risks and consequences for the rights and freedoms of the user
    • Incorporation of a specific means (button or link) to return to the initial configuration with default values

  • Configuration options grouped by type of media
    • Configuration of deletion of session data after its closure.
    • Configuration of maximum time limits before being logged out of the application or devices.
    • Retention periods for user profiles.
    • Configuration of temporary copy management.
    • Control of the deletion of temporary copies.
    • Elimination of the user’s trace in the service: “right to be forgotten”.
    • Identification, within the record of files of data collected from the sections, or data within sections, that can be anonymized
    • Programming of automatic locking and erasing mechanisms.
    • Programming of automatic mechanisms for deleting outputs to printing devices.
    • Configuration of retention periods for historical data in the service: e.g. on shopping sites, last articles, last queries, etc.
    • Incorporation of generic anonymization mechanisms.

  • Data accessibility
    • Profile information of the data subject shown to the user and third parties: name, pseudonym, telephone number, etc.
    • Information of the data subject that is shown to third parties: e.g. selective disclosure of elements of the CV, medical history, etc.
    • Status information about the data subject that is accessible to third parties, e.g. in messaging applications: availability, typing a message, message received, message read, …
    • Classification and labeling of processing operations, sections of documents and / or data within sections, which can be managed through an access control policy.
    • Organization, classification and labeling of the application or service according to the sensitivity of data, sections or processing operations.
    • Possibility of defining and configuring access profiles and granular privilege assignment
    • Automatic session locks.
    • Assignment of data access profiles according to the roles of the users for each phase of the processing.
    • Design of the workspace (isolated interview areas, non-accessible physical files, non-transparent folders, screens not exposed to third parties or with privacy filters, telephone headsets, call centers, clean desk policies, etc.)
    • Information management parameters such as where the data is stored and processed, whether it is held in clear text or encrypted, the access control mechanisms implemented, and whether there are multiple copies of the data, including instances that were not securely deleted, which can be accessed by third parties.
    • Control of data storage encryption
    • Control of data communication encryption
    • Procedures for managing access to shared print / output devices where documents may be left behind by the user.
    • Where appropriate, prohibition of printing.
    • Print output deletion control
    • Portable storage device management procedures for periodic formatting
    • The retention or elimination of session information, in applications, shared systems, communications or systems provided to the employee or the end user.
    • The type and amount of metadata collected in the documentation generated by the system utilities (word processors, drawing tools, cameras and videos, etc.)
    • When sending messages, configure whether earlier threads of the conversation are included, and whether confirmation is required before sending to multiple recipients.
    • Mechanisms to avoid indexing on the Internet
    • Organizational and technical measures for the review and filtering of information to be made public.
    • Systems of anonymization and / or pseudonymization of texts to be disseminated.
    • Management parameters of the connectivity elements of the devices (Wifi, Bluetooth, NFC, etc.).
    • Alerts about the connectivity status of the devices.
    • Controls to prevent the communication of the unique identifiers of the device (Advertising-ID, IP, MAC, serial number, IMSI, IMEI, etc.)
    • Access control mechanisms to passive systems (such as contactless cards) with the incorporation of terminal authentication protocols or with physical measures to prevent electromagnetic access.
    • Accessibility controls to user content on social networks.
    • Incorporation of controls to collect affirmative and clear confirmation actions before making personal data public, so that dissemination is blocked by default.
    • Configuration of notices and reminders to data subjects about the policies established for the dissemination and communication of information.
    • Definition and configuration of access permissions on data sets (databases, file systems, image galleries, …) and elements for capturing information such as sensors (cameras, GPS, microphones, etc.) of the device and information on elements near the device (Wi-Fi access points, mobile phone service antennas, activated bluetooth devices, etc.).
    • Definition and configuration of data access permission policies between applications and libraries, as in the case of mobile phones.
    • Definition of access profiles based on privileges or other types of technological and procedural barriers that prevent the unauthorized linking of independent data sources.
    • Content registered in the logs (who, when, what, what action, for what purpose,… the data is accessed).
    • Definition of automatic alert systems for specific events.
    • Traceability of data communication between controllers, processors and sub-processors.
    • Configurable security options (apart from encryption options).
    • Allow different access settings based on different devices.
    • Configure alert systems for anomalous data access.
    • Configuration of some of the security parameters, in particular the keys, and how to balance the security / performance / functionality relationship based on the robustness desired by the user.
    • Control of the scope of distribution of the information that is distributed in the application environment (social networks, work networks, etc.).
    • Configuration of the reception of notifications when the information is being made accessible to third parties.
    • Control of the metadata incorporated in the information generated or distributed.
    • Mechanism of the “right to be forgotten” of information published on social networks or other systems.
    • Choice options regarding where personal data is stored, whether on local or remote devices and, in the latter case, other parameters such as processors or countries.
    • History of profiles and entities that have accessed your information.
    • Information about access to your data by authorized users
    • Information about the latest changes carried out and the profile that made the change
    • Access control configurability by functionalities provided.
    • Configurability of logical separation of data groups.
    • Configurability of physical separation of data groups.
    • Selective disablement or cancellation of functionalities.

  • General
    • In the event that the service is multi-device, possibility (not obligation) to apply general privacy criteria applicable to all of them and in a single action.
    • Reminders, icons and notices of all those actions that affect the privacy of information: configuration changes, access to data by third parties such as video capture, sound, position, etc.
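
One of the measures listed above, pseudonymization per processing phase, could be sketched roughly as follows. This is an illustration under stated assumptions, not something prescribed by the AEPD sheet: it swaps in keyed hashing (HMAC-SHA256) as the pseudonymization technique, and the phase names, the keys and the function `pseudonymize` are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical per-phase secrets. In a real system these would come from a
# key store and would never be hard-coded.
PHASE_KEYS = {
    "collection": b"example-key-collection",
    "analytics": b"example-key-analytics",
}

def pseudonymize(identifier: str, phase: str) -> str:
    """Derive a phase-specific pseudonym with a keyed hash (HMAC-SHA256).

    The same identifier always maps to the same pseudonym within a phase,
    but to a different pseudonym in another phase, so records cannot be
    linked across phases without access to the keys.
    """
    key = PHASE_KEYS[phase]
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()

p_collection = pseudonymize("alice@example.com", "collection")
p_analytics = pseudonymize("alice@example.com", "analytics")
```

Whoever holds a phase key can recompute the pseudonym for a known identifier, so the output remains pseudonymous rather than anonymous data, and the keys need to be protected accordingly.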

Switzerland: New Data Protection Law passed parliament

The next step is to wait and see whether a referendum is called (100-day period).
The FDPIC will make detailed statements on the new law once the referendum period has passed.

There is a good write-up in German by Noémi Ziegler at
https://datenrecht.ch/die-dsg-revision-ist-abgeschlossen/

David Rosenthal (VISCHER) has a summary at
https://www.vischer.com/know-how/blog/neues-datenschutzgesetz-das-muessen-sie-wissen-38752/

Final text (in parliament) is here:
https://www.parlament.ch/centers/eparl/curia/2017/20170059/Schluzssabstimmungstext%203%20NS%20D.pdf

The VUD published an overview here:
http://www.vud.ch/view/data/2124/vud_rohstoff_revidiertes_dsg.pdf