How data security is managed in ML projects

Compliance with the GDPR is a vital issue for all companies, and not only because of the regulatory mandate. In a connected, global world, data security is becoming a priority. This is most evident in large enterprises, as they are the most aware of the consequences a security breach could have.

Therefore, it is not surprising that this is one of the first questions big companies ask themselves when undertaking any project. And it becomes more complex as the volume of data required grows. This is exactly what happens in Machine Learning projects.

Indeed, Machine Learning (ML) models require large amounts of information for training. In many cases this data is also quite sensitive, so it needs to be protected both at rest and while it is being processed.


Google offers a trusted cloud infrastructure, and that security extends to ML projects developed on its platform (GCP). However, how Google stores the information (encrypted, with access restricted to authorized persons) is one thing; how access to this data is managed within the project is another.

In other words, when dealing with security in ML projects, it is necessary to define the logic applied to this access: what roles will be created, and who will be able to access the data?

On the other hand, the security of the data in transit from the source to the storage location must also be taken into account. As explained above, within GCP all transport is encrypted, but until the data gets there, it should be transported using Transport Layer Security (TLS).

However, while security is usually a more or less standard issue, privacy is influenced by several factors: country, type of project, problem to be solved, etc. Therefore, each case requires a complete privacy study. This study must answer four main questions: 

  1. What data should be collected?
  2. What are the permitted uses?
  3. With whom can it be shared?
  4. What granular access control model is appropriate?
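To make the fourth question concrete, a granular access-control model can be sketched as a simple role-to-permission mapping. The roles, users and permissions below are hypothetical examples for illustration, not a prescribed scheme:

```python
# Minimal sketch of a granular access-control model.
# All role, user, and permission names are hypothetical examples.
ROLE_PERMISSIONS = {
    "data-engineer": {"read_raw", "anonymize"},
    "data-scientist": {"read_anonymized", "train_model"},
    "marketing-analyst": {"read_aggregates"},
}

USER_ROLES = {
    "ana": {"data-engineer"},
    "luis": {"data-scientist", "marketing-analyst"},
}

def can(user, permission):
    """Return True if any of the user's roles grants the permission."""
    return any(
        permission in ROLE_PERMISSIONS.get(role, set())
        for role in USER_ROLES.get(user, set())
    )

print(can("ana", "read_raw"))   # engineers may touch raw data
print(can("luis", "read_raw"))  # scientists only see anonymized data
```

Answering the fourth question then amounts to filling in these mappings deliberately, rather than granting broad access and restricting it later.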

What data should be collected for our ML solution?

We always tend to think that quality is linked to quantity, which is why we sometimes mistakenly try to collect as much data as possible. But the right thing to do is to consider which data is actually needed for the solution we want to develop.

Collecting more data than required for the ML project implies cost overruns:

  • More storage needed
  • Data anonymization from different sources required 

Thus, it is essential to identify and define what data can be collected. After that, decide which data to collect and how to obtain it. Collecting everything first and deciding later is not good practice: study the problem and think about what information the solution will need.

What uses are allowed?

The intended use of the data must be defined from the beginning, because users give permission to handle their data for a specific purpose. The data therefore cannot be used for any objective other than the one stated in the consent. Using the data in an unauthorized way in an ML project will lead to legal problems.

To clarify this issue of permitted uses, let us take a hotel company as an example. Such a company provides a dataset with all bookings. The objective is to look for metrics to support the marketing team in their campaigns. In this case, the dataset could not be used to analyze anything else. 

With whom can we share data?

Another important point regarding the privacy of the collected information is which applications and users it can be passed on to. The people who provide their data remain its sole owners, and only they should be able to access it. They always have the option to revoke our systems' access to that data.

But if only they have access, how do we analyze that information? The answer is anonymization. Anonymizing the data is the step that precedes any processing or analysis. However, to get to that point, at least one user (or group of users) with access to the data needs to be able to anonymize it.

Anonymization of data also reduces the risk of inadvertent disclosure when data is shared across countries, sectors and even departments within the same company.

For this task, there is a tool that detects and prevents the sharing of non-anonymized data: the Cloud Data Loss Prevention (DLP) API. This API detects data such as credit card numbers, phone numbers and emails, and provides functions to anonymize them. The tool even allows you to create your own detectors for your specific project.
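To illustrate what this kind of detection and masking looks like, here is a minimal local sketch using simplified regular expressions. It is a stand-in for the idea only: the patterns below are rough assumptions and far cruder than DLP's built-in infoType detectors:

```python
import re

# Local stand-in for the kind of pattern detection Cloud DLP performs.
# These regexes are simplified assumptions, not DLP's actual detectors.
INFO_TYPES = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "CREDIT_CARD_NUMBER": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def mask_pii(text, mask="[REDACTED]"):
    """Replace every detected sensitive value with a fixed mask."""
    for _name, pattern in INFO_TYPES.items():
        text = pattern.sub(mask, text)
    return text

sample = "Contact ana@example.com, card 4111 1111 1111 1111"
print(mask_pii(sample))
```

In a real project the equivalent work would be delegated to the DLP API, which handles many more info types, locales and de-identification strategies than a handful of regexes can.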

Securing access control

In terms of securing access control, Google Cloud Storage offers two approaches:

  • Using IAM permissions. Although this can work well, it is oriented more toward managing and granting roles. If other applications (such as BigQuery) need to access the data, this must be done through "service accounts".
  • Using service accounts. This option is focused on allowing external services to access data. It can be combined with IAM permissions, which define the roles assigned to each service account. If a service no longer needs access to the data, you can easily revoke its permissions. And if a service will access the data only for a short time, you can use "short-lived service account credentials", which expire automatically. This way, your data stays protected even if you forget to revoke that account's permissions.
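The idea behind short-lived credentials can be illustrated with a small local model of an expiring credential. This is only a conceptual sketch under assumed names; in practice GCP issues the token and enforces the expiry server-side:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical local model of a short-lived service-account credential.
# GCP enforces expiry on its side; this only illustrates the concept.
class ShortLivedCredential:
    def __init__(self, account, lifetime_minutes=60):
        self.account = account
        self.expires_at = datetime.now(timezone.utc) + timedelta(
            minutes=lifetime_minutes
        )

    def is_valid(self):
        """A credential grants access only until its expiry time."""
        return datetime.now(timezone.utc) < self.expires_at

cred = ShortLivedCredential(
    "etl-job@my-project.iam.gserviceaccount.com", lifetime_minutes=30
)
print(cred.is_valid())  # freshly issued, so still valid
```

The security benefit is exactly this automatic cut-off: access disappears on schedule whether or not anyone remembers to revoke it.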

Other variables to take into account

Security does not rely on IAM access alone: the "env" variables and any extra data the application needs must also be protected. It is best to keep these in GCP Secret Manager, especially the extra data required by the app.
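Reading such a value from Secret Manager typically looks like the sketch below. The project and secret names are hypothetical, and the actual call requires the google-cloud-secret-manager client library plus valid application credentials:

```python
def secret_version_name(project_id, secret_id, version="latest"):
    """Build the full resource name Secret Manager expects."""
    return f"projects/{project_id}/secrets/{secret_id}/versions/{version}"

def access_secret(project_id, secret_id, version="latest"):
    # Requires the google-cloud-secret-manager package and
    # application credentials with access to the secret.
    from google.cloud import secretmanager

    client = secretmanager.SecretManagerServiceClient()
    response = client.access_secret_version(
        request={"name": secret_version_name(project_id, secret_id, version)}
    )
    return response.payload.data.decode("utf-8")

# Hypothetical usage: access_secret("my-project", "db-password")
```

Keeping secrets behind this API, instead of in plain env files, means access to them is itself governed by IAM and auditable.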

On the other hand, it is sometimes advisable to run Vault in a private HA GKE cluster to store our private secrets. It can also be connected to our GitHub login, which uses two-factor authentication, to organize these secrets on a team basis.

We want to help you achieve your digital objectives. Let's talk!
