In March, Google launched a beta for its new Data Loss Prevention (DLP) Application Programming Interface (API). That product is now in production, with Google claiming that it can identify up to 50 types of sensitive data. Among those data types are credit card numbers, names and phone numbers. Developers can now identify sensitive data and decide whether to redact, mask or tokenise it.
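To make the difference between redacting, masking and tokenising concrete, here is a minimal Python sketch. It does not call the DLP API; the function names, the `tok_` prefix and the keyed-hash approach to tokenisation are all assumptions chosen for illustration, and Google's own transformations may work differently.

```python
import hashlib
import hmac


def redact(value: str, label: str = "REDACTED") -> str:
    """Redaction: drop the value entirely and leave a placeholder."""
    return f"[{label}]"


def mask(value: str, keep_last: int = 4, mask_char: str = "*") -> str:
    """Masking: hide all but the last `keep_last` letters/digits.

    Separators such as spaces and dashes are left in place so the
    original layout stays recognisable.
    """
    total = sum(ch.isalnum() for ch in value)
    to_hide = max(total - keep_last, 0)
    out = []
    for ch in value:
        if ch.isalnum() and to_hide > 0:
            out.append(mask_char)
            to_hide -= 1
        else:
            out.append(ch)
    return "".join(out)


def tokenise(value: str, key: bytes, length: int = 16) -> str:
    """Tokenisation (illustrative): replace the value with a keyed hash.

    The same input and key always give the same token, so records can
    still be joined on the token without exposing the original value.
    """
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "tok_" + digest[:length]


if __name__ == "__main__":
    card = "4111 1111 1111 1111"  # a widely used test card number
    print(redact(card, "CREDIT_CARD"))
    print(mask(card))
    print(tokenise(card, key=b"demo-secret"))
```

Redaction loses the value completely, masking keeps enough of it for a human to recognise, and tokenisation keeps it usable as a join key. Which one fits depends on what the downstream system needs from the data.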
Scott Ellis, Product Manager, writing on the Google Cloud Platform Blog, said: “These new data de-identification capabilities help you to work with sensitive information, while reducing the risk of sensitive data being inadvertently revealed.” What makes this useful to many developers is that it can be applied to existing data as well as new data.
What is the DLP API looking for?
Google has chosen a set of common data formats to protect. Some of these are used globally and some are national layouts. There are 14 countries where Google has added support for a mix of national ID numbers, passports and other data. All of this is Personally Identifiable Information (PII) and there appear to be ways to add new types of data as required.
The API uses a mix of techniques to identify data types, including pattern matching, checksums, word lists, phrase lists, custom logic and context. Checksums are used where appropriate so that only genuine numbers get redacted. Credit and debit card numbers can vary between issuers. Most use a 16-digit number and a three-digit security code. However, American Express uses a 15-digit number and a four-digit security code. A checksum helps separate these from other long strings of numbers.
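The article does not say which checksum Google applies, but payment card numbers conventionally end in a Luhn check digit, so a Luhn test is a reasonable stand-in. The sketch below pairs a loose pattern match with that checksum; the regular expression and the function names are assumptions for illustration, not part of the DLP API.

```python
import re

# Loose candidate pattern: 13 to 19 digits, optionally separated by
# single spaces or dashes. This is an illustrative choice only.
CANDIDATE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_numbers(text: str) -> list[str]:
    """Pattern match first, then keep only candidates that pass Luhn."""
    return [m.group() for m in CANDIDATE.finditer(text) if luhn_valid(m.group())]


if __name__ == "__main__":
    sample = "Order 1234567890123456 paid with 4111 1111 1111 1111"
    print(find_card_numbers(sample))
```

The order number in the sample has the right length but fails the checksum, so only the card number is reported. That is the filtering effect described above: the pattern finds long digit strings and the checksum discards most of the ones that are not card numbers.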
Among the global data types are identifiers that span finance, healthcare, telephony, technology and email. The list includes:
- Credit card numbers
- Email address
- IBAN Code: This is an international standard for identifying bank numbers
- ICD9 Code: The International Classification of Diseases, ninth revision, used mainly in the US to assign diagnostic and procedure codes to patients.
- ICD10 Code: Similar to ICD9 and published by the World Health Organisation.
- IMEI Hardware ID
- IP Address
- MAC Address
- Phone Number: This also includes US toll-free phone numbers and can be adapted to different national phone number standards
- SWIFT Code: Uniquely identifies a particular bank and is used in money transfers
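Several of the identifiers in this list have a fixed enough shape that plain pattern matching gets most of the way there. The following sketch shows the idea for three of them. The labels and regular expressions are simplified assumptions written for this example; they are not the detectors or type names the DLP API uses.

```python
import re

# Simplified, illustrative patterns for three of the listed data types.
PATTERNS = {
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "ip_address": re.compile(
        r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}"
        r"(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"
    ),
    "mac_address": re.compile(r"\b(?:[0-9A-Fa-f]{2}[:-]){5}[0-9A-Fa-f]{2}\b"),
}


def find_sensitive(text: str) -> list[tuple[str, str]]:
    """Return (label, matched_text) pairs for every pattern hit in `text`."""
    findings = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append((label, match.group()))
    return findings


if __name__ == "__main__":
    line = "Contact alice@example.com from 192.168.0.1 (device 00:1A:2B:3C:4D:5E)"
    for label, value in find_sensitive(line):
        print(label, value)
```

Patterns like these produce false positives on their own, which is why the techniques described earlier also lean on checksums, word lists and surrounding context.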
What country specific data is being targeted?
The US gets the largest set of unique identifiers with 14 separate pieces of data listed. The data types range from Social Security Number (SSN) and tax number to several related to healthcare. Outside of the US, most countries see their national ID number and/or tax number listed but only a few get as many as five pieces of data identified. Others have just a single piece of PII called out.
One major surprise is that just five different European countries are mentioned. With the EU GDPR coming into force in less than seven months this seems a serious oversight. There is no EU standard for tax numbers, ID numbers or even passport numbers. Google had the opportunity to make life easier for companies and software teams as they worry about May 2018. Instead it seems to have picked the UK, France, Germany, Netherlands and Spain as countries to provide token support for. It will be interesting to see how quickly it updates the DLP API to support all EU countries with meaningful DLP types.
Outside of Europe, coverage is also patchy. Brazil and Mexico are the only countries in Latin America to get support. Brazil has its National Persons Register (taxpayer identification number) on the list while Mexico gets Passport Number and National Identification Number.
In Asia-Pacific there is support for Australia, China, Japan and Korea. Australia gets its Medicare Account Number and Tax File Number listed but no passport. The others get passport numbers and some other data.
Canada and India make up the remaining two countries. As expected, Canada gets several PII data types listed to support companies that work across the US/Canada border. India only gets the new Personal Permanent Account Number which is not universally used across the country.
What does this mean?
There is increasing concern about how easy it is to inadvertently leak PII outside of company data controls. The increase in Bring Your Own Device (BYOD) for end-user computing has been eclipsed by the growth in personal cloud services that users sign up for themselves. They often see nothing wrong in storing company data in those cloud services. There is also no evidence that users check whether the data they are storing contains PII before they put it in the cloud.
This creates a serious challenge for IT departments. They are responsible for tracking data no matter where it is stored. If they cannot identify it then they cannot track it. DLP helps them to prevent PII and other sensitive data being widely accessible. If it is to be moved outside the company they want to redact or protect it. Additionally, the more companies invest in data mining tools, the greater the risk of unauthorised access to sensitive data.
Google has done a lot of work between the beta in March and the product release. It has refined the data types and the DLP API has had a good workout. However, it has disappointed in not supporting a wider set of data types especially in key markets. In the EU it would have taken little effort to include support for tax numbers, passports, health record numbers and other data across the entire bloc. Instead it seems to think that developers in countries not listed will simply build their own identifiers using the API. That’s a mistake and one that Google needs to address.
Despite this, Google has now laid down a marker to those cloud companies that do not have their own DLP API. Will we see a rush of new products in the near future? Almost certainly.