Is it possible to train gigantic AIs without compromising privacy?

Estimated reading time: 3 minutes

Summary

The explosion of large language models (LLMs), which today learn from data volumes equivalent to entire libraries, has generated a common fear: that this hunger for information is incompatible with protecting people's privacy. It is worth remembering, however, that LLMs are models of language, not repositories of content: their purpose is to learn how we communicate, not to store and reproduce specific records.

Thus, the truth is that training gigantic AIs and complying with data protection laws, such as the LGPD (Brazil's General Data Protection Law), can not only go hand in hand; this partnership is also a sound way to build reliable and sustainable technology for the future.


Data Governance

The key to this compatibility lies in intelligent data governance, which begins long before the AI model even starts being trained. In practice, raw data, whether collected from the internet or from private databases, undergoes rigorous "cleaning."

In this process, as a rule, anonymization techniques remove information that could identify a person, such as names, identification numbers, and addresses. In addition, developers can use "synthetic data"—fictitious, computer-generated data that mimics the patterns of real data but has no connection to real people—ensuring that the model learns from the structure of the information rather than from the individuals behind it.
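As a rough illustration of the "cleaning" step, the sketch below redacts a few PII patterns with regular expressions. The patterns and labels are illustrative assumptions, not a production technique: real pipelines combine far more robust detection (named-entity recognition models, dictionaries, check-digit validation) to catch things a regex cannot, such as personal names.

```python
import re

# Illustrative patterns only (assumed for this example); real anonymization
# pipelines use much more robust detection than simple regexes.
PATTERNS = {
    "[EMAIL]": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "[CPF]":   re.compile(r"\b\d{3}\.\d{3}\.\d{3}-\d{2}\b"),   # Brazilian CPF format
    "[PHONE]": re.compile(r"\(\d{2}\)\s?\d{4,5}-\d{4}\b"),     # Brazilian phone format
}

def anonymize(text: str) -> str:
    """Replace each occurrence of a PII pattern with a generic label."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(label, text)
    return text

record = "Contact Maria at maria.silva@example.com or (61) 99999-1234, CPF 123.456.789-09."
print(anonymize(record))
# The personal name "Maria" survives -- precisely why real systems
# also rely on NER models, not regexes alone.
```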

Tokenization

The LLM training process does not store personal data in the way a conventional database would. One of the most important technical processes in this scenario is tokenization. Simply put, it is the act of breaking long texts into smaller pieces, or "tokens," so that the computer can understand them. It is precisely at this stage that data protection can be automated. 

Intelligent programs are able to identify tokens that correspond to sensitive information and replace them with generic labels. Thus, the model learns, for example, the structure of a phrase that contains an address, but never sees the actual address. It learns the concept, not the personal data. 
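The idea of swapping sensitive tokens for generic labels can be sketched as follows. The toy whitespace-style tokenizer and the address pattern are assumptions for illustration; real LLMs use subword tokenizers (such as BPE), and real detectors are far more sophisticated, but the principle is the same: the model sees the placeholder token, never the actual address.

```python
import re

# Hypothetical detector (assumed pattern): flags a street-address span
# and swaps it for a single generic label before training.
ADDRESS = re.compile(r"\d+\s+\w+\s+(?:Street|Avenue|Road)", re.IGNORECASE)

def tokenize(text: str) -> list[str]:
    """Toy tokenizer: keeps [LABEL] placeholders whole, then splits
    words and punctuation. Real LLMs use subword tokenizers (e.g. BPE)."""
    return re.findall(r"\[\w+\]|\w+|[^\w\s]", text)

def redact_then_tokenize(text: str) -> list[str]:
    """Replace detected addresses with a generic token, then tokenize."""
    return tokenize(ADDRESS.sub("[ADDRESS]", text))

sentence = "She lives at 42 Elm Street in Brasília."
print(redact_then_tokenize(sentence))
# The model would learn the structure "lives at [ADDRESS] in ..."
# without ever seeing the real address.
```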

In other words, although the model's result may probabilistically resemble the training data, this does not mean that personal information has been "memorized." This abstraction results in the loss of references to specific individuals.

Privacy as a priority

In summary, while the use of large-scale data for LLM training presents challenges, the combination of anonymization techniques, minimization, the use of synthetic data, the abstract nature of model learning, and the implementation of data governance aligned with the rights of data subjects demonstrates that the advancement of AI is compatible with the protection of privacy and personal data.

Ultimately, treating data protection as a priority is a competitive advantage. AI models trained with high-quality data, collected ethically and properly "cleaned," are more reliable and less prone to errors and biases. By incorporating privacy from the very beginning of a project (privacy by design), companies build public trust, which is essential for the success of any technology. It is clear, therefore, that it is perfectly possible to train AI models with trillions of parameters while simultaneously protecting people's fundamental rights.

Want to learn more about AI in the context of copyright? Click here, or read other texts by the author.

About the Author

Meet the author of this article.

  • Lawyer and Coordinator at BFBM Advogados. Professor. Author of books and articles. PhD candidate and Master's degree from UnB. Postgraduate degree (lato sensu) in Business Law from Fundação Getúlio Vargas, FGV. Postgraduate degree (lato sensu) in International Relations, UnB. Law degree from Universidade Federal do Rio Grande do Norte, UFRN. Researcher at IDP (Ethics4AI). CIPM and CDPO certified by IAPP. ECPC-B DPO certified by Maastricht University. Member of the AI Commission and the National Observatory of Cybersecurity, Artificial Intelligence and Data Protection of the OAB (Brazilian Bar Association).

