Facing the risks of personal data leakage, where should the line be drawn between utility and privacy?
The IPoP project of the PEPR* Cybersecurity program has recently focused on this aspect of language models, with an emphasis on the health domain. It examined two approaches:
- MLM (Masked Language Modeling) to tailor a BERT-style model
- CLM (Causal Language Modeling) to tailor a GPT-style model
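The two training objectives differ in what the model must predict. A toy sketch in plain Python can make the contrast concrete (this is an illustration only, not the project's actual code; the example sentence, mask rate, and `[MASK]` token follow BERT conventions):

```python
import random

def mlm_example(tokens, mask_rate=0.15, mask_token="[MASK]"):
    """Masked Language Modeling (BERT-style): hide random tokens;
    the model must recover them from bidirectional context."""
    random.seed(0)  # fixed seed so the sketch is reproducible
    inputs, labels = [], []
    for tok in tokens:
        if random.random() < mask_rate:
            inputs.append(mask_token)
            labels.append(tok)      # loss is computed on the hidden token
        else:
            inputs.append(tok)
            labels.append(None)     # no loss on visible tokens
    return inputs, labels

def clm_example(tokens):
    """Causal Language Modeling (GPT-style): predict each token
    from its left context only, by shifting the sequence by one."""
    return tokens[:-1], tokens[1:]

sentence = ["the", "patient", "shows", "signs", "of", "anemia"]
print(mlm_example(sentence))
print(clm_example(sentence))
```

In both cases, fine-tuning on clinical text is what creates the memorization risk the project studies: the objective rewards the model for reproducing its training data.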
It is also involved in an initiative addressing the same question: PANAME (Privacy Auditing of AI Models), alongside the CNIL, the ANSSI, and the PEReN (Pôle d’expertise de la régulation numérique, an interministerial service). The objective: to develop, within 18 months, a software library that unifies the privacy evaluation of AI models. It will, we are told, be “wholly or partly available as open source”.
The CNIL leads the project and frames its legal scope. The ANSSI brings cybersecurity expertise. IPoP provides scientific leadership. The PEReN is mainly responsible for developing the library.
Techniques to Move into Production
Against this backdrop sits a position paper that the European Data Protection Board (EDPB) issued at the end of 2024. It recalls that the GDPR applies in many cases to AI models trained on personal data, precisely because of their memorization capabilities. In this context, concluding that a model is anonymous, and thus outside the GDPR’s scope, very often requires demonstrating resistance to privacy attacks.
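One of the simplest such attacks is a loss-threshold membership inference test: models tend to assign lower loss to examples they saw during training, so unusually low loss suggests membership. A minimal sketch (the losses and threshold below are made-up illustration values, not from any of the cited work):

```python
def loss_threshold_attack(losses, threshold=0.5):
    """Toy membership-inference test: flag samples whose loss falls
    below the threshold as likely members of the training set."""
    return [loss < threshold for loss in losses]

# hypothetical per-sample losses measured on a model under audit
member_losses = [0.12, 0.08, 0.21]     # examples seen during training
nonmember_losses = [0.95, 1.30, 1.10]  # held-out examples
guesses = loss_threshold_attack(member_losses + nonmember_losses)
print(guesses)  # → [True, True, True, False, False, False]
```

A model that resists this kind of attack shows little loss gap between members and non-members; an auditing library would run many such tests, with calibrated thresholds, rather than this single heuristic.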
The work in this area is often carried out at an experimental level, the PANAME teams observe. The techniques developed, even when available as open source, require substantial development work to be used in production. There is, more broadly, no unified framework to formalize the coding of privacy tests.
In its dossier on the security of AI systems, the CNIL’s laboratory discusses privacy mainly in relation to anonymization techniques. In particular the so‑called PATE (Private Aggregation of Teacher Ensembles), in which teacher models train a student model through a majority vote on the outputs they produce.
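The core of PATE is a noisy vote: teacher counts are perturbed before the winning label is released, which limits what any single teacher (and thus any single training partition) can reveal. A minimal sketch of that aggregation step, assuming made-up vote counts and using an inverse-CDF Laplace sampler (not the reference PATE implementation):

```python
import math
import random
from collections import Counter

def pate_aggregate(teacher_votes, num_classes, epsilon=1.0, rng=None):
    """Noisy majority vote in the spirit of PATE: tally the teachers'
    labels, perturb each count with Laplace noise of scale 1/epsilon,
    and return the class with the highest noisy count."""
    rng = rng or random.Random(0)  # fixed seed so the sketch is reproducible
    counts = Counter(teacher_votes)

    def laplace(scale):
        # Laplace sample via inverse CDF of a uniform draw
        u = rng.random() - 0.5
        return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

    noisy = {c: counts.get(c, 0) + laplace(1.0 / epsilon)
             for c in range(num_classes)}
    return max(noisy, key=noisy.get)

votes = [1] * 40 + [0] * 10  # 50 hypothetical teachers, most agree on class 1
print(pate_aggregate(votes, num_classes=2, epsilon=1.0))  # → 1
```

The student model is then trained only on these noisy labels, never on the raw sensitive data the teachers saw.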
The latest work that PEReN has disclosed in the AI field concerns the use of a nearest-neighbor search method to link generated content back to original content. In the run-up to the AI Summit, the interministerial service had opened up the first software building blocks of a tool for evaluating detectors of artificial content.
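The idea behind such linking is to embed both original and generated content in a shared vector space, then return the original item closest to a generated query. A minimal sketch with hypothetical 2-D embeddings (PEReN's actual method, embedding model, and distance metric are not described here):

```python
import math

def nearest_original(query, originals):
    """Return the index of the original whose embedding is closest
    (Euclidean distance) to the query embedding, i.e. the most
    plausible source content for a generated output."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(range(len(originals)), key=lambda i: dist(query, originals[i]))

# hypothetical embeddings of three original works and one generated output
originals = [[0.0, 1.0], [1.0, 0.0], [0.7, 0.7]]
generated = [0.9, 0.1]
print(nearest_original(generated, originals))  # → 1
```

At realistic scale this exact search would be replaced by an approximate nearest-neighbor index, but the principle is the same.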
* Priority Programme for Research Equipment