22 February, 2019

Security and privacy in Machine Learning projects on Google Cloud Platform

How to ensure data security and privacy on Google Cloud Platform, including anonymization processes.

One of the most important things when working on a Machine Learning project is knowing how to ensure the privacy and security of all the information involved. We will start with a global view of both security and privacy, since there are different ways of approaching them, and then discuss how to address them with a real success story.

Security

On the one hand, we have the challenge of data security. Google helps us keep information safe (see its Security Overview), but we still need to think about how that information is stored: that it is encrypted, that only those who are allowed to access it actually can, and so on. We will focus on how to restrict access, for example to specific people, grant them that access, or even assign them a role.

Normally, in a Machine Learning project, we have already chosen a place where all this information is stored, for instance Google Cloud Storage, and from a security standpoint it is essential to ensure:

That all the information is stored encrypted.
That the transport of the information is secure.

The first point concerns how the data is stored at rest. By default, Google Cloud handles this seamlessly: all data is encrypted with AES-256, using keys that either Google or the user can manage. As for the transport of data, everything should be sent over TLS; otherwise, the data is exposed in transit between the origin and Google Cloud Storage. On GCP, the client libraries and APIs use TLS, so the transport is encrypted.
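As a minimal sketch of both points, here is how an upload and a customer-managed key can look with the google-cloud-storage Python client; the project ID, bucket name, and KMS key name are illustrative, not from a real project:

```python
from google.cloud import storage

client = storage.Client(project="my-project")   # hypothetical project ID
bucket = client.bucket("raw-booking-data")      # hypothetical bucket name

# Encryption in transit: the client library talks to Cloud Storage over
# HTTPS/TLS, so the object is protected from the origin to the bucket.
blob = bucket.blob("bookings/2019-02.csv")
blob.upload_from_filename("bookings-2019-02.csv")

# Encryption at rest: Google encrypts every object with AES-256 by default.
# To manage the keys yourself, point the bucket at a Cloud KMS key (CMEK).
bucket.default_kms_key_name = (
    "projects/my-project/locations/europe-west3/"
    "keyRings/ml-data/cryptoKeys/storage-key"   # hypothetical key
)
bucket.patch()
```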
Privacy

On the other hand, we have the challenge of privacy, which receives less attention but is no less relevant. On the contrary, it is even more critical than security. Why? Because privacy requirements usually change depending on the problem or the country, whereas security is fairly similar in most cases.

In a Machine Learning project, when we talk about privacy, we focus on:

What kind of data should we collect?
What are the permitted uses?
Who would you share it with (users, apps, etc.)?
Which granular access model is the most appropriate?

What kind of data should we collect?

The easiest thing to do is to collect as much as possible, but you should collect only the information you really need. Collecting data unnecessarily is not recommended: from the very beginning you would be gathering data that is of no use to you, and that implies a double cost, because you would not only have to store it but also protect its privacy by anonymizing it before it can be used in different sources.

The key is to study the problem, look at what kind of information we can collect, and then decide what to store and how. There are cases where you have to keep all the data and, after a first analysis, decide what to retain. However, this is not an optimal way of working: without a preliminary study you will not know with certainty what the result will be, and for this kind of issue uncertainty is not a good approach.

What are the permitted uses?

The permitted uses have to be clear from the beginning of the process. For example, if a hotel company provides us with booking data so that, through metrics, we can help the marketing team run a campaign, that is a permitted use. Equally important, the user must have given permission for their data to be used for that purpose, such as market research. The uses of the data should be defined before we start working on an ML project; otherwise, we do not have permission to use the information, which could lead to legal problems.

Who would you share it with (users, apps, etc.)?

This point is important because the information belongs to the "end users", who are the real owners of that data, and they can revoke our systems' access to it. So, if they are the ones who control access to the information, how do we proceed to analyze their data? The answer is simple: by anonymizing the data, for the sole purpose of accessing and analyzing the information. For that we need only one user, or a group of users, and by doing it this way we minimize the risk when sharing information between countries, companies, or departments within a company.

To detect and share non-anonymized data, we should use the Cloud Data Loss Prevention (DLP) API, since it detects sensitive information (credit cards, phone numbers, emails, etc.) and offers a series of features that help you anonymize it. The tool also lets you define your own detectors for very specific content that must not be shared, so we can create custom detectors depending on the case.
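As an illustration of both the built-in detectors and a custom one, here is a minimal sketch with the DLP Python client; the project ID, the BOOKING_ID infoType, and its regex pattern are assumptions made up for the example:

```python
from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()
parent = "projects/my-project"  # hypothetical project ID

inspect_config = {
    # Built-in detectors for common sensitive data.
    "info_types": [
        {"name": "EMAIL_ADDRESS"},
        {"name": "PHONE_NUMBER"},
        {"name": "CREDIT_CARD_NUMBER"},
    ],
    # A custom detector for content specific to our case,
    # e.g. internal booking references (made-up name and pattern).
    "custom_info_types": [
        {
            "info_type": {"name": "BOOKING_ID"},
            "regex": {"pattern": r"BK-\d{8}"},
        }
    ],
    "include_quote": True,
}

item = {"value": "Contact jane.doe@example.com about booking BK-12345678"}

response = dlp.inspect_content(
    request={"parent": parent, "inspect_config": inspect_config, "item": item}
)
for finding in response.result.findings:
    print(finding.info_type.name, finding.quote)
```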
Once the information is anonymized and can be processed, it can be shared with apps or even with other users who are allowed to work with all that data. The most effective way to ensure this is to control access. In Google Cloud Storage you can do this in different ways:

Using IAM permissions: This solution works very well, but it is more oriented towards role management. If we need services to access the information, it is better to manage it with service accounts.

Using service accounts: This solution works when we need services to access that information, for example importing data from Cloud Storage into BigQuery. It is perfect when combined with IAM permissions, because in IAM you define a role and assign it to each service account. In addition, if a service no longer needs access to the information, you can easily revoke that access.

Using short-lived service account credentials: If you only need to allow access for a short period, we recommend short-lived credentials. They have an expiry date, so your information stays protected even if you forget to revoke the permissions of that particular service.

Example of how to apply security and privacy in a real case

We have talked about how to deal with security and privacy; now let's apply it to a real project so that you can understand it better.

To begin with, the project has an extra requirement. It is a project for a German company, so we need to use the information in a GDPR-compliant way (the information must not leave the European Union). Moreover, the company does not want the information to leave its borders, so we will have to store all this data in the data center that Google Cloud has in Germany (europe-west3).

We need to create a bucket in which to store all this data. This bucket will receive all the raw information; after the data has been de-identified, it will be stored in a second bucket, which, in the end, is the one the services will access. Both buckets will be regional (we need to retrieve the data very often) and located in europe-west3.

As mentioned above, we keep the raw data in the first bucket. We then analyze all that information using the Data Loss Prevention API, looking for sensitive information that may need to be anonymized. The raw data will be updated with a certain frequency, and you have to take this into account when using the DLP API to detect sensitive information. If you use the default infoType detectors, you may get false positives (like the project where the Canadian driver's license detector produced false positives).

Once the sensitive information is found, you can automate the de-identification of the data using Cloud Functions. We were able to use Cloud Functions because the company allowed us to use the data center in Belgium, but only for the transformations. If that is not possible, you can instead use a Compute Engine instance in europe-west3 and expose the transformation as a small Flask service.

Linking the Cloud Function to the bucket

The Cloud Function is triggered when a new element is created in the bucket where the raw data is uploaded.

De-identifying options

There are different ways of de-identifying data. In this case, the data is encrypted with a key, because the client needs to re-identify the information after the analysis.

Which data will be de-identified

The fields that have to be encrypted are the ones set in a dedicated variable (in the sketch below, the INFO_TYPES list); if we need more, we just add more elements.
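A minimal sketch of what such a function could look like, assuming a Python background function with a Cloud Storage finalize trigger; the project, bucket names, key, function name, and field list are all illustrative, not the project's actual code:

```python
# main.py - a background Cloud Function triggered when a new object is
# finalized in the raw-data bucket. Every name here is illustrative.
from google.cloud import dlp_v2, storage

PROJECT = "projects/my-project"
ANONYMIZED_BUCKET = "anonymized-booking-data"

# Fields (infoTypes) to encrypt; add more elements if more data
# has to be de-identified.
INFO_TYPES = [{"name": "EMAIL_ADDRESS"}, {"name": "PHONE_NUMBER"}]

# 32-byte AES key. In a real project, wrap it with Cloud KMS
# (kms_wrapped) instead of embedding it in the source.
CRYPTO_KEY = b"0123456789abcdef0123456789abcdef"

def deidentify_upload(event, context):
    """Runs for each new object in the raw bucket (finalize trigger)."""
    gcs = storage.Client()
    raw = gcs.bucket(event["bucket"]).blob(event["name"])
    text = raw.download_as_text()

    dlp = dlp_v2.DlpServiceClient()
    response = dlp.deidentify_content(
        request={
            "parent": PROJECT,
            "inspect_config": {"info_types": INFO_TYPES},
            "deidentify_config": {
                "info_type_transformations": {
                    "transformations": [
                        {
                            "primitive_transformation": {
                                # Deterministic encryption is reversible with
                                # the same key, so the client can re-identify
                                # the data after the analysis.
                                "crypto_deterministic_config": {
                                    "crypto_key": {
                                        "unwrapped": {"key": CRYPTO_KEY}
                                    },
                                    "surrogate_info_type": {"name": "ENC"},
                                }
                            }
                        }
                    ]
                }
            },
            "item": {"value": text},
        }
    )

    # Store the de-identified copy in the bucket the services read from.
    gcs.bucket(ANONYMIZED_BUCKET).blob(event["name"]).upload_from_string(
        response.item.value
    )
```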
Once the data is anonymized and stored in the second bucket, we can work with it securely. This transformation is essential because the complete dataset will have to be analyzed, for example to implement a custom estimator in TensorFlow, and we must ensure the protection of the original data. From this moment on, the original bucket must no longer be accessible, while the anonymized bucket is the one that the services that need this data must use.
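In practice, this lockdown is just another application of the bucket-level IAM permissions discussed earlier. A minimal sketch with the Python client, assuming an illustrative training service account and the two bucket names used above:

```python
from google.cloud import storage

SERVICE_ACCOUNT = "serviceAccount:trainer@my-project.iam.gserviceaccount.com"
client = storage.Client()

# Grant the training service read access to the anonymized bucket only.
anon = client.bucket("anonymized-booking-data")
policy = anon.get_iam_policy(requested_policy_version=3)
policy.bindings.append(
    {"role": "roles/storage.objectViewer", "members": {SERVICE_ACCOUNT}}
)
anon.set_iam_policy(policy)

# Strip the same account from the raw bucket, so the original data
# is no longer reachable from the training pipeline.
raw = client.bucket("raw-booking-data")
policy = raw.get_iam_policy(requested_policy_version=3)
policy.bindings = [
    b for b in policy.bindings if SERVICE_ACCOUNT not in b["members"]
]
raw.set_iam_policy(policy)
```

Revoking the binding on the raw bucket is the step that is easiest to forget, which is exactly where the short-lived credentials mentioned earlier act as a safety net.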