Could AI systems cause GDPR breaches?

Over the last year, IT vendors have been building increasing amounts of artificial intelligence (AI), machine learning and cognitive computing into their products. How well those solutions live up to vendor claims is outside the scope of this article. What matters is that companies are feeding these systems vast amounts of data in order to do advanced analytics. The advantage of using these systems is that they can deliver results far faster than data scientists and security analysts can.

For business units, this is good news. They want new insights into data that allow them to improve their sales and marketing focus. Security teams are also keen on these systems as they have the potential to detect the very early signs of a cyberattack. However, as with all technologies, there is always the issue of unintended consequences. The risk here comes not just from the systems themselves but from the connections they make between data, and from how those connections sit with privacy legislation.

For simplicity, this article will refer to all the systems collectively as AI, even though there are differences in the way they work.

PII and anonymous data

It is the volume of data that these systems ingest that makes them effective. Much of that data comes from very disparate sources. Some of it openly contains Personally Identifiable Information (PII) as recognised by legislation in various countries. Other sets of data are anonymous, such as the data from machine logs or Internet of Things (IoT) devices. On top of this, some companies may also use reference data, which is generally anonymous as well.

The systems then bring all of this information together to see what connections they can make between the data. In a business analytics case this may disclose previously unexpected patterns in sales data. It might deliver information that allows marketing teams to open up new markets by changing pricing or their marketing messages.

From a cybersecurity perspective, the connections are likely to be much deeper. These systems are looking for indicators of attack from different systems. They correlate logs in real-time and identify suspicious behaviour. This is not just about tracking attacks from outside the enterprise. They are looking at how employees and even systems work, what they connect to, when they connect and other data. All of this is combined to give a complete view of a particular connection.

In both cases, however, there is a risk that previously anonymous data will suddenly no longer be anonymous. The systems have the capacity to identify patterns that remove the anonymity of the data. This means that rules governing privacy need to be applied to this new data.

How dynamic IP addresses became PII

Last year, the European Court of Justice (ECJ) ruled against the German government over dynamic IP addresses. A dynamic IP address is assigned to a device when it connects to a public network or to most Internet services. The device only uses it while connected and can be assigned another dynamic IP address the next time it connects.

The case revolved around the way the German government captured dynamic IP addresses and other data. When users logged into government services, the system captured data such as their name, the time, the date and the service being used. All of these were treated as PII. A separate system captured the IP addresses used when people connected. These were logged but not protected in the same way as PII.

The ECJ ruled that, because it was possible to compare the two systems, it was possible for the German government to identify the IP address used by each user. This meant that it was then possible to track the user across multiple systems, not just those controlled by the German government.
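To make the linkage concrete, the following is a minimal sketch of how two separately held logs could be compared. It is not a description of the systems in the case: the record layouts, field names, sample values and the five-second matching window are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical records: names, addresses and times are invented.
login_log = [
    {"user": "alice", "time": datetime(2017, 3, 1, 9, 0, 5)},
    {"user": "bob", "time": datetime(2017, 3, 1, 9, 30, 0)},
]
ip_log = [
    {"ip": "203.0.113.7", "time": datetime(2017, 3, 1, 9, 0, 4)},
    {"ip": "203.0.113.99", "time": datetime(2017, 3, 1, 9, 29, 58)},
    {"ip": "203.0.113.50", "time": datetime(2017, 3, 1, 11, 0, 0)},
]


def link_ips_to_users(logins, connections, window=timedelta(seconds=5)):
    """Pair each login with any connection logged within `window` of it."""
    linked = []
    for login in logins:
        for conn in connections:
            if abs(login["time"] - conn["time"]) <= window:
                linked.append((login["user"], conn["ip"]))
    return linked


linked = link_ips_to_users(login_log, ip_log)
for user, ip in linked:
    print(user, ip)
```

Neither log identifies an IP address with a person on its own. It is the comparison that turns the "anonymous" address into data about a named user.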

What does this mean for businesses?

Across Europe, and for any organisation or business looking to trade with Europe, a new piece of legislation called the General Data Protection Regulation (GDPR) takes effect in May 2018. It is designed to create enhanced privacy for users and extends the existing definition of PII to new types of data. Breaches of GDPR can incur the most draconian penalties of any privacy legislation around the world. Fines can reach €20 million or 4% of an organisation's global annual turnover, whichever is the greater. In the worst case, a fine could be business ending.
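The "whichever is the greater" rule is easy to misread, so here is a small worked sketch of the upper limit it produces. The function name is an assumption, the figures are the €20 million and 4% stated above, and the result is a ceiling on the fine, not a prediction of what a regulator would actually impose.

```python
FIXED_CAP_EUR = 20_000_000   # the EUR 20 million figure
TURNOVER_PERCENT = 4         # the 4% of global annual turnover figure


def max_gdpr_fine(global_annual_turnover_eur):
    """Upper limit of the fine: the greater of the two figures."""
    turnover_based = global_annual_turnover_eur * TURNOVER_PERCENT / 100
    return max(FIXED_CAP_EUR, turnover_based)


# A company turning over EUR 100 million: 4% is EUR 4 million, so the
# fixed EUR 20 million figure is the greater of the two.
small = max_gdpr_fine(100_000_000)

# A company turning over EUR 2 billion: 4% is EUR 80 million, which
# exceeds the fixed figure.
large = max_gdpr_fine(2_000_000_000)

print(small, large)
```

For smaller companies the fixed figure dominates, which is why the penalty can be out of all proportion to turnover.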

As with dynamic IP addresses, it is possible for AI and related systems to remove the anonymity from data. More importantly, they could create new PII out of previously unconnected sets of data. GDPR and other privacy legislation have no flexibility to deal with this unintended consequence.

What is needed is for organisations to begin looking at what is being created by their systems. For example, if a cybersecurity system is able to identify a potential attack from a user, the data needs to be audited in order to see how it identified the user. All the data used must be validated to see if the process de-anonymised data. If that is the case then the company will need to establish processes to reclassify and protect data. It may also find that under GDPR it will have to disclose what has happened to the relevant national regulator.
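One way to picture such an audit is as a provenance check on what the system produces. The sketch below assumes, hypothetically, that each finding records which datasets it drew on and whether it names an individual, and that the organisation keeps a classification of its datasets. All dataset names, fields and the function name are invented for illustration.

```python
# Hypothetical classification of the organisation's source datasets.
dataset_class = {
    "hr_directory": "pii",
    "vpn_logs": "anonymous",
    "badge_reader": "anonymous",
    "threat_feed": "reference",
}

# Hypothetical findings from an analytics system, with their provenance.
findings = [
    {"id": 1, "identifies_person": True, "sources": ["hr_directory", "vpn_logs"]},
    {"id": 2, "identifies_person": False, "sources": ["vpn_logs", "threat_feed"]},
    {"id": 3, "identifies_person": True, "sources": ["badge_reader", "vpn_logs"]},
]


def datasets_to_review(findings, dataset_class):
    """Return non-PII datasets that fed a finding which identifies a person."""
    flagged = set()
    for finding in findings:
        if not finding["identifies_person"]:
            continue
        for source in finding["sources"]:
            if dataset_class.get(source) != "pii":
                flagged.add(source)
    return sorted(flagged)


to_review = datasets_to_review(findings, dataset_class)
print(to_review)
```

Any dataset this flags has helped to identify someone despite being classed as anonymous, so it is a candidate for reclassification and for the same protection as PII.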

Is this just speculation?

No. Researchers in the cybersecurity space have already highlighted the use of AI, cloud computing and complex analytics by cyber criminals. Data from multiple data breaches is being brought together and then mined to create extremely detailed profiles of individuals. That data is then used either in phishing attacks or as part of an identity theft scheme.

Could systems self audit?

This is an interesting question for which the answer is provisionally yes. However, a separate system is likely to be more effective, as it would remove any potential conflict within the system being audited.

Several vendors are already looking at systems that can ingest all the data a company owns to identify where PII and other sensitive data already exists. Those systems could be used to audit the output of the cybersecurity and even the business systems. This is unlikely to be in real-time and, for cybersecurity systems, real-time is where the industry and security teams want to be.
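As a rough illustration of what such a scan involves, here is a minimal sketch that looks for two kinds of identifier in free text. It does not represent any vendor's product. The patterns are deliberately simple assumptions that will miss many real cases and produce false positives, and the sample records are invented.

```python
import re

# Simplified, hypothetical patterns: real PII discovery needs far more.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def scan_for_pii(records):
    """Return (record_index, kind, matched_text) for every pattern hit."""
    hits = []
    for index, text in enumerate(records):
        for kind, pattern in PATTERNS.items():
            for match in pattern.findall(text):
                hits.append((index, kind, match))
    return hits


sample = [
    "login ok for jane.doe@example.com",
    "connection from 198.51.100.23 port 443",
    "disk usage at 71 percent",
]
hits = scan_for_pii(sample)
for hit in hits:
    print(hit)
```

A batch scan like this is straightforward to run over stored output, which is part of why after-the-fact auditing is more realistic than real-time checking.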

What does this mean?

For many companies this is an unwelcome and unexpected issue. It is the last thing that they would have expected to deal with. They are looking to buy these solutions to simplify issues, not create new ones. For the industry it presents a particular problem, as vendors will need to find ways to warn customers that they have potentially created new PII.

The latter is not a bad thing. As we bring together more and more sets of data and do advanced analytics on them, we will learn more and more about customers. Even data that has previously been anonymised can no longer be trusted to stay so.

This means that organisations need to do more to police their own systems. More importantly, they also need to do more to protect all this data. While governments argue for the need for weaker encryption, privacy laws run counter to this. There is a narrow window for many organisations, as they start to deploy these systems, to put in place the right processes to protect data and identify PII as it is created.


I have had this conversation with a number of very large IT vendors over the last six months. They have all accepted that there is a risk here but no-one has come back with a solution. It may be that they are working on solutions that they will announce but in the absence of that, this is a real risk that companies need to start addressing.

