I really enjoyed the episode of The Secure Developer in which Tomasz Tunguz and Guy Podjarny discuss the challenges of securing GenAI.
If you’re a Head of Data interested in working with GenAI, this is an extremely relevant episode for you. Tomasz’s insights are relevant, timely, and align with our own observations, so I’ve written down some of my takeaways from the episode.
Tomasz talks about how much AI has changed, and the potential GenAI has to redefine the market. However, he then goes on to describe how the security risks of GenAI are troubling CISOs and can potentially cause significant harm to the career of a Head of Data. This is a particularly interesting take, as it’s the first time the Head of Data will be held responsible for privacy and security risks, and it really aligns with what we are seeing ourselves.
Before going deeper into that, let’s look at what those risks are.
The biggest security and privacy risks are associated with training and/or finetuning GenAI models, and consist of:
Yes, concerns about data loss and Data Loss Prevention are back, as several recent incidents have shown that every LLM is susceptible to exposing its training data through prompt engineering. At the same time, an LLM is only valuable when you can finetune it on your company’s proprietary data, which exposes you to the risk of data loss when you make it available through, for instance, a GenAI chatbot. It’s not unimaginable that someone uses that chatbot to prompt-engineer their way to your customer data.
Privacy regulations such as the GDPR require organisations to inform customers about the logic behind automated decision-making. This is particularly challenging for GenAI models, which are basically black boxes whose outcomes are largely determined by the data that went into training and finetuning them.
You’ll have to finetune LLMs on your proprietary cloud data to make them useful for your organisation, but the massive surge in cloud data breaches over the past quarters shows the pressing need for fine-grained data access controls as a second line of defence. As a result, data teams have to manage access at the dataset level when finetuning GenAI, but current IAM technology and workflows make this an extremely painful process. From the hundreds of conversations we’ve had over the past year, this is already a big pain point for self-service analytics, but expect it to be a topic you’ll hear a lot more about at the coffee machine in the coming years.
I’ve added this one myself, but there is also a risk of data poisoning, where attackers use prompt engineering to alter the data that goes into training or finetuning your GenAI model, with the goal of skewing the outcomes of the LLMs behind, for example, a chatbot and causing reputational harm. As such, it will be equally important to closely manage write access to your data, not just read access.
It’s clear that GenAI exposes the organisation to new privacy and security risks that are largely associated with the data that goes into training or finetuning the models. These models are basically black boxes, and their outcomes are largely determined by the data used to train or finetune them. This puts the responsibility for managing these risks with the Head of Data, who will have to work closely with the CISO in a way that closely resembles the partnership between the VP of Engineering and the CISO in securing application development. While that partnership has been thoroughly formalised in the clear policies, processes, and technology of DevSecOps, the same still has to happen for data in most organisations. DataOps is still a very nascent discipline, and only a handful of organisations have managed to evolve it into DataSecOps by integrating security into it.
Unfortunately, this exposes organisations that use GenAI to huge regulatory fines: upcoming regulations such as the EU AI Act and the NIS 2 Directive can impose fines of up to 7% and 2% of global revenue respectively in case of privacy and security breaches resulting from these models.
Therefore, Tomasz stresses how important it will be for the Head of Data to work closely with the CISO to implement the appropriate security and privacy measures when adopting GenAI. At Raito, we believe they will have to:
Integrate Data Security in MLOps/DataOps by managing data security as code in Terraform, dbt, or data contracts. If, for instance, a data engineer can specify the role an AI model uses to access a dataset in a YAML file, their dbt project, or a data contract, it will be 1% of the effort of doing it afterwards. This prevents them from having to hand out admin rights just to save time.
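To make this concrete, here is a minimal sketch of what data security as code could look like, assuming a hypothetical YAML data contract and generic SQL grants; the contract schema, dataset, and role names are purely illustrative rather than an existing standard:

```python
# Minimal sketch of "data security as code": a hypothetical YAML data contract
# declares which role a GenAI finetuning job may use on which dataset, and a
# small script turns that declaration into SQL grants.
import yaml  # pip install pyyaml

CONTRACT = """
dataset: analytics.customer_features
owner: data-platform-team
consumers:
  - role: genai_finetuning_ro     # service role used by the finetuning pipeline
    privileges: [SELECT]          # read-only: no write access to training data
  - role: analytics_engineer
    privileges: [SELECT, INSERT]
"""

def grants_from_contract(contract_yaml: str) -> list[str]:
    """Translate a declarative data contract into SQL GRANT statements."""
    contract = yaml.safe_load(contract_yaml)
    statements = []
    for consumer in contract["consumers"]:
        privileges = ", ".join(consumer["privileges"])
        statements.append(
            f"GRANT {privileges} ON {contract['dataset']} TO ROLE {consumer['role']};"
        )
    return statements

if __name__ == "__main__":
    for statement in grants_from_contract(CONTRACT):
        print(statement)
```

In practice, a script like this would run in CI, so access is provisioned the moment the contract is merged instead of being requested (or worked around with admin rights) afterwards.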
Data teams have to work closely with InfoSec to implement and operationalise security policies in the data stack used for training and finetuning GenAI. Additionally, they have to collaborate with data owners to secure datasets, approve data access requests, and regularly review data access & usage. Here we expect the Data Mesh principles of ownership and federated data governance to take hold.
To achieve data security at scale, it will be important to automate data security using metadata. If you can automate data security during the data development process, you also take away the mental burden from data engineers of defining which data security measures to take and determining who can access their datasets, which can sometimes be tricky questions to answer.
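As an illustration, the sketch below derives who can access a dataset from its catalog tags; the tag names, roles, and policy mapping are assumptions, and a real setup would plug into your data catalog and warehouse instead of hard-coded values:

```python
# Minimal sketch of metadata-driven data security: access rules are derived
# from dataset tags instead of being decided manually by the data engineer.
from dataclasses import dataclass

@dataclass
class Dataset:
    name: str
    tags: set[str]  # e.g. pulled from a data catalog

# Hypothetical mapping from sensitivity tags to the roles allowed to read them.
POLICY = {
    "pii":       {"privacy_approved"},            # PII: privacy-approved roles only
    "financial": {"finance", "privacy_approved"},
    "public":    {"all_employees"},
}

def allowed_roles(dataset: Dataset) -> set[str]:
    """Intersect the allowed roles of every tag, so the strictest tag wins."""
    role_sets = [POLICY[tag] for tag in dataset.tags if tag in POLICY]
    if not role_sets:
        return set()  # untagged data: deny by default
    return set.intersection(*role_sets)

print(allowed_roles(Dataset("customer_features", {"pii", "financial"})))
# -> {'privacy_approved'}
```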
Often neglected, but something that will become a regulatory requirement, is the regular monitoring of data access and usage patterns at the dataset level, not only for data consumers but also for the service accounts used in data pipelines and in training and finetuning GenAI models.
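One way such monitoring could work, assuming access records exported from your warehouse’s audit logs (the log format, principals, and datasets are illustrative), is to flag any principal reading a dataset it has never touched before:

```python
# Minimal sketch of dataset-level access monitoring: compare recent reads per
# principal (human or service account) against a historical baseline and flag
# anything new for review.
from collections import defaultdict

# Each record: (principal, dataset), e.g. exported from warehouse audit logs.
baseline = {
    ("svc_finetuning", "analytics.customer_features"),
    ("alice", "analytics.orders"),
}
this_week = [
    ("svc_finetuning", "analytics.customer_features"),
    ("svc_finetuning", "finance.salaries"),  # new, unexpected read
    ("alice", "analytics.orders"),
]

def unexpected_access(baseline, recent):
    """Return reads by a principal on a dataset it has never read before."""
    alerts = defaultdict(set)
    for principal, dataset in recent:
        if (principal, dataset) not in baseline:
            alerts[principal].add(dataset)
    return dict(alerts)

print(unexpected_access(baseline, this_week))
# -> {'svc_finetuning': {'finance.salaries'}}
```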
It's clear that GenAI holds the promise of increasing productivity through readily available insights into your organisation's data, but without proper controls, the privacy and security risks significantly outweigh the rewards. Because the value and risks of GenAI are so tightly knit with the data, the Head of Data will have to collaborate closely with the CISO to secure the underlying data.
Reach out to [email protected] if you want to learn more about AI Security, or check out the full episode here: https://open.spotify.com/episode/5Lkax4Xxb4LzRxoo10ro17?si=53f83b28065b45cf