Governments around the world have thankfully recognised corporate misuse of personal data and have brought in legislation to give citizens more rights over their data. GDPR, CCPA, PIPEDA, APPI and more give individuals around the world varying levels of protection and control over their personal information.
It’s this precarious data landscape in which we see AI starting to reach mainstream adoption. Many of us will be aware of the well-documented data privacy and copyright concerns reported in the press surrounding AI. But don’t be fooled into thinking that these worries are only present for the likes of OpenAI and Anthropic!
Even small and medium-sized organisations need to carefully navigate data privacy when implementing their own AI-driven tech.
Preface
We’re not privacy lawyers, and nothing in this article is intended as legal advice. However, we would like to point you to the ICO’s guidance around using artificial intelligence within the confines of GDPR.
In one section of that guidance, they rightly point out:
It is not possible to list all known security risks that might be exacerbated when you use AI to process personal data. The impact of AI on security depends on:
- the way the technology is built and deployed;
- the complexity of the organisation deploying it;
- the strength and maturity of the existing risk management capabilities; and
- the nature, scope, context and purposes of the processing of personal data by the AI system, and the risks posed to individuals as a result.
Due to the vast scope of potential use cases that AI presents, the precise way that you protect and secure user data within such a system is largely dependent on the scope, function, and construction of that system.
With this in mind, any SME exploring the use of AI and automation within their organisation needs to be aware of the following seven AI and data privacy considerations, at the very least.
1. Data Transparency Can Be Murkier Than You Think
Under GDPR, all European and British organisations now need to think more carefully about what personal data they collect, what risks they introduce by working with that data, and how to keep that data secure.
However, AI can introduce certain temptations when it comes to data processing.
AI is incredible at filtering through and making sense of large amounts of data. Many organisations hold a lot of siloed information that they desperately need to assimilate and understand, and charging AI with this task can seem like a silver bullet.
Yet there can be real data risks in lobbing chunks of personally identifiable data into the AI meat-grinder, just to see what comes out the other end!
One of the guiding tenets of GDPR is transparency. Data processors need to be honest and transparent about what data they collect, why they collect it, and how they use that data. AI adoption can present two stumbling blocks in the way of this transparency.
The first is closed source software. When a piece of software is “closed source,” both users and the wider public are unable to inspect the software’s code for themselves because it is proprietary to a given organisation. Microsoft’s Windows operating system is a good example of closed source software.
When a solution is closed source and proprietary to an external provider, it can be difficult to interrogate quite what happens to the data you put into it, where that data goes, and what it does. Could the data end up on an insecure server somewhere? Could the data be used to further train the AI model against your data subjects’ wishes? There may not be a way for you, as the average user, to tell.
We’re not accusing any AI model or software of this behaviour, of course. But without having access to the code that runs the software, organisations like yours have little way of knowing what is truly happening under the bonnet.
The second issue is that of AI’s renowned “black box problem.” A lot of deep learning systems rely on swathes of training data and inferences that have now become so complex that even their creators don’t understand why they give some of the answers that they do.
Understandably, both issues present a significant challenge for those trying to be as transparent as possible about how personal data is used.
2. Follow the Rules Around Automated Decision Making
GDPR also contains stringent rules about automated decision making.
Individuals covered by GDPR have a right to opt out of solely automated decision making - i.e., where data controllers make significant decisions about individuals purely using an automatic programme or algorithm. Individuals also have a right to ask a human to reassess any decision solely made through automation. This remains the case whether AI plays a part in that decision process or not.
Additionally, our readers in the EU should also be aware of the new EU AI Act. This effectively bans the use of AI tools to impose “social scoring” on individuals or to identify people in real time using biometric data in publicly accessible spaces.
If you are considering creating a system that makes significant decisions about people’s lives, there are a few things you should bear in mind.
Firstly, identify the bare minimum data points that a human would need in order to make that decision about an individual case. This should be the absolute maximum data that you feed into your AI decision-making solution. If you give your AI solution more information than it is likely to need, you risk overexposing individuals’ data, introducing bias into the model, and regularly overworking the tool, which carries an energy cost.
Secondly, you need to consider how your solution is going to respect the wishes of those who opt out of automated decision making. How you achieve this is going to depend heavily on what the solution does and how it works, but a way of excluding data subjects from automatic decisions should always be built in from the outset.
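To make that concrete, here’s a minimal sketch in Python of how a decision pipeline might check a data subject’s stored preference before choosing between automated processing and human review. The names are hypothetical - `model` and `human_review_queue` stand in for your own scoring component and case-management system:

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class Applicant:
    subject_id: str
    has_opted_out: bool   # stored preference: no solely automated decisions
    features: dict        # only the minimum data a human would need

def decide(applicant: Applicant, model: Any, human_review_queue: list) -> str:
    """Route a case either to the automated model or to a human reviewer."""
    if applicant.has_opted_out:
        # Respect the GDPR right not to be subject to solely automated decisions.
        human_review_queue.append(applicant.subject_id)
        return "pending_human_review"
    # Otherwise the (hypothetical) model makes the automated decision, which
    # the data subject can still ask a human to reassess later.
    return model.predict(applicant.features)
```

However you structure it, the important part is that the opt-out check happens before any automated processing takes place, not as an afterthought.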
Above all, always keep data subjects informed about the use of their data, tell them about your use of automated processing, and give them clear ways to opt out or to challenge any automated decision. Schedule in regular checks to ensure that your decision-making tools are working as they should be too - especially when your AI tools use machine learning to pick up new things and adapt their judgement over time.
Essential Reading from ICO: Rights related to automated decision making including profiling
3. Less is More: Embrace Data Minimisation
Data minimisation is where an organisation collects the bare minimum amount of personal data it needs in order to function, and it’s wise data privacy practice. After all, minimising the amount of data you hold also minimises your data exposure risk and your data storage costs.
You might also want to adopt a related concept: purpose limitation. That’s where personal data is only collected for specified, explicit, and legitimate purposes and never processed in ways incompatible with those purposes.
So where does AI come into this? Again, it might depend on what the AI is tasked with doing. For example, say you’re developing an AI solution that is designed to monitor a video feed and flag errors on an assembly line, though not to identify those responsible. It simply doesn't make sense to store vast amounts of largely repetitive video data, which may also introduce privacy concerns for workers and visitors in the vicinity. Storing that much footage would also be far outside the scope of the application.
It would respect individuals’ privacy a lot more to store and analyse video data only whilst an instigating error is taking place, with measures in place to obscure any personally identifying images of team members captured in that segment of video.
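As a rough illustration of that pattern, the sketch below keeps only a short rolling buffer of frames in memory and persists footage solely around a flagged error, after obscuring faces. The `detect_error`, `blur_faces`, and `store_clip` callables are hypothetical stand-ins for your own detection, redaction, and storage steps:

```python
from collections import deque

BUFFER_SECONDS = 5
FPS = 25

def monitor(frame_source, detect_error, blur_faces, store_clip):
    """Persist video only around flagged errors, with identities obscured."""
    buffer = deque(maxlen=BUFFER_SECONDS * FPS)  # short rolling window, never written to disk
    incident_frames = []

    for frame in frame_source:
        buffer.append(frame)
        if detect_error(frame):
            # Keep the lead-up and the incident itself - nothing more.
            incident_frames.extend(buffer)
            buffer.clear()
        elif incident_frames:
            # Incident over: redact identities, store the clip, discard the rest.
            store_clip([blur_faces(f) for f in incident_frames])
            incident_frames = []
```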
It’s also worth bearing in mind that when an AI model has a smaller amount of purposeful, clean data to work through in order to formulate a response, this can have a positive impact on the model’s performance and robustness.
4. Build in Anonymity, Build Out Bias
If personal details aren’t relevant to data processing or storage, then keeping that data completely anonymised is great data protection practice. After all, if personal data isn’t present, it can’t be breached or misused.
But anonymising data has another benefit too. When identifying characteristics (such as name, gender, ethnicity, sexuality and geography) are completely absent from a system, this greatly reduces the potential for bias towards or against certain individuals or groups. We’re all aware of how humans can bring their own biases into a process - but without careful instruction and training to the contrary, AI can introduce biases too.
In an older, well-documented case, Amazon developed an ML recruiting tool to review job applicants’ CVs and spit out the best few candidates for each role in a completely objective, neutral way. However, the tool was trained using CVs submitted to the company over a 10-year period - most of which were from male candidates due to the male-dominated nature of the tech industry. The system therefore ended up “teaching itself” to favour male candidates over female ones.
Therefore, measures need to be built into systems to eradicate bias - and to build in total anonymity where the scope of the project allows.
For example, RAIven is building a real-time, AI/ML-powered health and safety monitoring tool for a leading corporate client, which incorporates data from video streams. In order to respect anonymity, we’ve built in layers of abstraction so certain actions get flagged as potentially desirable or undesirable without feeding in any data that identifies an individual. This built-in anonymity eliminates possible privacy concerns around storing people’s physical likenesses - but it also helps to eradicate the possibility of the system picking up any biases along the way.
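A very simple version of that kind of abstraction might look like the sketch below: raw detections are reduced to anonymous event records before anything is stored or passed downstream, so no name, face, or other identifier ever enters the system. The field names here are purely illustrative, not taken from the actual implementation:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class SafetyEvent:
    """An anonymous record of an observed action - no identity data at all."""
    action: str       # e.g. "entered_exclusion_zone"
    zone: str         # coarse location, not an individual's movement history
    severity: str     # "desirable" or "undesirable"
    timestamp: datetime

def abstract_detection(raw_detection: dict) -> SafetyEvent:
    """Strip a (hypothetical) raw detection down to an anonymous event.

    Anything that could identify an individual is simply never copied across.
    """
    return SafetyEvent(
        action=raw_detection["action_label"],
        zone=raw_detection["zone_id"],
        severity="undesirable" if raw_detection["is_violation"] else "desirable",
        timestamp=datetime.now(timezone.utc),
    )
```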
Care also needs to be taken around what AI tools are allowed to infer about data subjects. Even with a few seemingly innocuous data points, a solution may be able to deduce highly personal things like gender, medical conditions, or sexual orientation, simply through its incredible pattern-matching prowess!
Bias can also be purposefully built into AI tools, as evidenced by Google Gemini well-meaningly “over-diversifying” images it generated from prompts where a level of historical accuracy was expected.
In our view, AI tools need to be constructed with the maximum amount of anonymity and with unbiased neutrality built in from the outset.
5. Keep Your Data Lean and Local
AI tools are able to receive, process, and create new data at breakneck speeds, making it essential that any organisation using AI carefully considers the practicalities of storing that data.
Keeping your data minimised, sanitised, and process-specific obviously reduces the amount of space it takes up on disk. This reduces storage costs (and environmental costs) in and of itself.
However, there’s another factor to consider here – transfer costs. Transferring data from one location to another uses energy and incurs cost. Transferring data, especially over public networks, can also introduce cyber and privacy risks.
With this in mind, aim to keep any data and computation as local as possible. Does a piece of data really need to be transferred halfway across the country to be computed and then returned? Or can the whole process happen on-site?
Also bear in mind that AI requires a lot more computational power than standard computing, so any hardware that is tasked with on-site AI computing will need to be fit for purpose.
For example, within some of the solutions we develop, we are able to plug an AI-ready computational device directly into a camera or sensor, so the data generated doesn’t need to travel through miles of cable in order to be computed. The needed computing all happens right there before the results of that computation are moved on to where they need to go. This keeps data risk and transfer costs to an absolute minimum.
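As a rough sketch of that “compute at the edge, ship only the results” idea, the loop below runs a local model against frames from an on-site device and only ever transmits a small, non-identifying summary - never the raw footage. `camera`, `local_model`, and `send_result` are hypothetical stand-ins for the sensor, the edge-deployed model, and whatever uplink carries results onwards:

```python
import json

def edge_loop(camera, local_model, send_result):
    """Run inference on-device and transmit only lightweight results."""
    for frame in camera:
        detections = local_model.infer(frame)   # the raw frame never leaves the device
        summary = {
            "device_id": "line-3-camera-1",     # illustrative identifier
            "defect_count": len(detections),
            "max_confidence": max((d["score"] for d in detections), default=0.0),
        }
        send_result(json.dumps(summary))        # a few hundred bytes, not megabytes of video
```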
6. Be Aware of AI-Specific Data Privacy Attacks
Many of us are aware of attacks on people’s private data, such as social engineering. But did you know there are AI-specific privacy attacks that can be used to uncover personally identifiable information from an AI-powered system?
In membership inference attacks, hackers probe an AI model using previously obtained personally identifying data about a target individual. Their aim is to work out whether that individual’s data was part of the AI’s training data or not. This could let hackers know whether an individual had interacted with a particular service during the time the training data was being amassed.
Another type of attack is a model inversion attack, where criminals (armed with some initial identifying data about their target/s) aim to probe an AI model to infer and extract personal information about those individuals within its dataset.
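To give a feel for how the first of these works in its simplest form, the sketch below leans on a common observation: models are often noticeably more confident on examples they were trained on. An attacker who already holds a target’s record queries the model and guesses “member” if the confidence is suspiciously high. This is a deliberately naive illustration of the idea, not a faithful reproduction of real attack tooling:

```python
def membership_inference_guess(model, known_record, threshold=0.9):
    """Guess whether `known_record` was part of the model's training data.

    `model` is a hypothetical classifier exposing predict_proba-style
    confidence scores; an unusually confident prediction hints that the
    record was a training "member".
    """
    confidence = max(model.predict_proba([known_record])[0])
    return "likely member" if confidence >= threshold else "likely non-member"
```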
However, there is an important caveat here: both of these attacks involve the criminals already having some personally identifying information about the individuals they’re targeting, and both require attackers to gain access to the AI model itself. This makes a strong case for data privacy and access control best practices.
7. Document All Data Movement, Storage, and Use
The ICO make an excellent point about recording what you do with the data under your care:
ML systems require large sets of training and testing data to be copied and imported from their original context of processing, shared and stored in a variety of formats and places, including with third parties. This can make them more difficult to keep track of and manage.
Your technical teams should record and document all movements and storing of personal data from one location to another. This will help you apply appropriate security risk controls and monitor their effectiveness. Clear audit trails are also necessary to satisfy accountability and documentation requirements.
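Even a lightweight, append-only audit log of each movement or copy of personal data goes a long way here. A minimal sketch of what one entry might capture (the field names are illustrative, not prescriptive):

```python
import json
from datetime import datetime, timezone

def log_data_movement(logfile, dataset, source, destination, purpose, contains_personal_data):
    """Append one audit record describing a movement or copy of a dataset."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "dataset": dataset,                                  # e.g. "training_set_v4"
        "source": source,                                    # where the data came from
        "destination": destination,                          # where it is going
        "purpose": purpose,                                  # why it is being moved
        "contains_personal_data": contains_personal_data,    # drives which controls apply
    }
    with open(logfile, "a") as f:
        f.write(json.dumps(entry) + "\n")
```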
You may also find it enlightening to interrogate your technical supply chains, especially those which directly interact with sensitive data and AI components.
In Conclusion
The best way to ensure the most stringent control over data privacy within an IT system is to have it custom built. This way, you have total visibility into its internal workings, you are less beholden to external supply chain fluctuations, and you’re not locked into a particular vendor’s way of doing things.