The European Union’s General Data Protection Regulation (GDPR) may have far-reaching effects on the use of machine learning. The regulation came into force on 25 May 2018, replacing the 1995 EU Data Protection Directive with the aim of harmonizing data privacy laws across EU member states.
The key changes include increased territorial scope (the regulation applies to any organization processing the data of EU data subjects, regardless of where it is headquartered), heavy penalties for non-compliance (up to 4% of annual global turnover or €20 million, whichever is greater), explicit data subject consent, breach notification, the right of access, the right to be forgotten, data portability, and privacy by design.
Of primary interest to the machine learning community is Article 22 on automated individual decision-making, including profiling. Specifically, the article states that “the data subject shall have the right not to be subject to a decision based solely on automated processing, including profiling, which produces legal effects concerning him or her or similarly significantly affects him or her”.
This is significant. In effect, it means that the automated decision-making and profiling used by many machine learning applications (e.g. recommendation, advertising, social networks, rating and assessment) are not allowed without the explicit consent of the data subject. There are two other exceptions: the processing is allowed where it is necessary for entering into or performing a contract, or where it is authorized by law. In all cases, the data controller must implement suitable measures to safeguard the data subject’s rights, freedoms and legitimate interests, including at least the right to obtain human intervention, to express the subject’s point of view and to contest the decision.
Further, the last part of the article clarifies that automated decision-making should not be based on special categories of personal data. These categories, defined in Article 9 of the GDPR, include “personal data revealing racial or ethnic origin, political opinions, religious or philosophical beliefs, or trade union membership, and the processing of genetic data, biometric data for the purpose of uniquely identifying a natural person, data concerning health or data concerning a natural person’s sex life or sexual orientation”.
Article 22 has been the subject of intense debate in AI and machine learning circles on several fronts. The first issue is whether it confers a legal right to explanation about how decisions are made by machine learning algorithms.
Certain researchers (Bryce Goodman and Seth Flaxman) have argued that the article, read in conjunction with Articles 13 to 15 on the right to access the data collected and to know the purpose of the processing, means that the “law will also effectively create a ‘right to explanation,’ whereby a user can ask for an explanation of an algorithmic decision that was made about them.” Further, Recital 71, which accompanies the article, refers to the “right to […] obtain an explanation of the decision reached after such an assessment.”
Other researchers (Sandra Wachter, Brent Mittelstadt, Luciano Floridi) are of the opinion that “the GDPR only mandates that data subjects receive meaningful, but properly limited, information (Articles 13-15) about the logic involved, as well as the significance and the envisaged consequences of automated decision-making systems, what we term a ‘right to be informed’”. They also argue that because Recital 71 is non-binding, it does not confer a legal right to explanation.
In response to the debate, the Article 29 Data Protection Working Party, the advisory body established under the 1995 Directive to address questions on the protection of individuals with regard to the processing of personal data, published the Guidelines on Automated Individual Decision-Making and Profiling in October 2017 to clarify the meaning of Article 22.
Specifically, the Guidelines state that the controller making automated decisions must:
• tell the data subject that they (the controller) are engaging in this type of activity;
• provide meaningful information about the logic involved; and
• explain the significance and envisaged consequences of the processing.
The Working Party recognizes that the growth and complexity of machine learning can make it challenging to understand how an automated decision-making process or profiling works. Certainly, techniques such as deep learning, support vector machines or random forest regression are more difficult to explain than, for example, a simple linear regression model or decision tree.
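To make that contrast concrete, the brief sketch below shows how a linear model’s rationale can be read directly from its coefficients, while a random forest offers no comparably compact summary. The synthetic dataset, feature framing and use of scikit-learn are illustrative assumptions.

```python
# A brief sketch, on an illustrative synthetic dataset, of why a linear model is
# straightforward to explain: its coefficients map one-to-one onto features,
# whereas a random forest has no comparably compact summary of its reasoning.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                        # e.g. income, age, debt_ratio (assumed)
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

linear = LinearRegression().fit(X, y)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# The linear model's rationale is its coefficients: each unit of the first feature adds ~2.0.
print("linear coefficients:", np.round(linear.coef_, 2))
# The forest is an ensemble of many trees; there is no equivalent one-line rationale.
print("number of trees in the forest:", len(forest.estimators_))
```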
However, the Working Party states that “the controller should find simple ways to tell the data subject about the rationale behind, or the criteria relied on in reaching the decision, without necessarily always attempting a complex explanation of the algorithms used or disclosure of the full algorithm. The information provided should be meaningful to the data subject.”
As such, the interpretation of the article seems to sit somewhere between the two debated rights. While a detailed and technical explanation of how the algorithm works and arrived at its decision is not required, the Guidelines clearly state that a ‘layman’ explanation is. Wachter states specifically that this would be more akin to explaining system functionality, i.e. the algorithmic methods used, rather than the rationale behind a specific decision.
The footnotes of the Guidelines further stress that complexity is no excuse for failing to provide information and refer to the importance of transparency. Specifically, they cite online advertising as an example where the technological complexity makes it difficult for the data subject to know how and what data is being collected about them. This should not be a barrier to the requirement to provide meaningful information to the data subject about how this is done.
As they stand, Article 22 and the accompanying Guidelines are set to radically change the way many companies use machine learning to provide services to EU data subjects. Designers and deployers of machine learning algorithms will need to work out how to explain potentially complex algorithmic functions to data subjects. This effort will be compounded by the fact that it is sometimes unknown why a model arrived at a specific decision. Further, some may not want to explain their models due to concerns over the protection of intellectual property rights (IPR), trade secrets and other confidential or sensitive business data. Designers will need to find a way to explain black-box models that satisfies both the GDPR and any IPR or other sensitive data concerns.
Wachter proposes one method of doing so through Counterfactual Explanations, which essentially offer a statement of how the input (which she denotes the “world”) would have to differ for a desirable outcome to occur. The method she proposes for AI and machine learning scientists to compute counterfactuals in their models is compared to adversarial perturbations. Specifically, the research states that “many optimization techniques proposed in the adversarial perturbation literature are directly applicable to this problem, making counterfactual generation efficient”.
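One way to compute such a counterfactual is a gradient-based search for the smallest change to an input that flips the model’s decision. The sketch below is a minimal illustration of that idea for a toy logistic-regression scorer; the weights, loss weighting and loan-scoring framing are illustrative assumptions, not Wachter’s implementation.

```python
# A minimal sketch of counterfactual search, in the spirit of Wachter's proposal:
# find the smallest change to the input that flips the model's decision. The toy
# model weights and the weighting `lam` are illustrative assumptions.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def counterfactual(x, w, b, target=1.0, lam=0.1, lr=0.05, steps=2000):
    """Gradient descent on (f(x') - target)^2 + lam * ||x' - x||^2."""
    x_cf = x.copy()
    for _ in range(steps):
        p = sigmoid(w @ x_cf + b)                        # model prediction f(x')
        grad_pred = 2 * (p - target) * p * (1 - p) * w   # gradient of the prediction term
        grad_dist = 2 * lam * (x_cf - x)                 # gradient of the distance term
        x_cf -= lr * (grad_pred + grad_dist)
    return x_cf

# Toy loan-scoring model: features = [income, debt_ratio] (assumed for illustration)
w, b = np.array([1.5, -2.0]), -0.5
x = np.array([0.2, 0.8])                                 # applicant currently refused
x_cf = counterfactual(x, w, b)
print("original decision:", sigmoid(w @ x + b) > 0.5)
print("counterfactual input:", np.round(x_cf, 3), "->", sigmoid(w @ x_cf + b) > 0.5)
```

The counterfactual can then be phrased in plain terms for the data subject, for example “had your income been higher and your debt ratio lower by roughly these amounts, the decision would have been favorable.”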
Other proposed explainability methods include the following (a brief usage sketch follows the list):
• Local Interpretable Model-agnostic Explanations (LIME): an explanation technique that explains the predictions of any classifier in an interpretable and faithful manner, by learning an interpretable model locally around the prediction.
• Deep Learning Important FeaTures (DeepLIFT): a method for decomposing the output prediction of a neural network on a specific input by backpropagating the contributions of all neurons in the network to every feature of the input.
• Layer-wise relevance propagation method: interprets the predictions of deep networks.
• SHapley Additive exPlanations (SHAP): identifies the class of additive feature importance methods and shows there is a unique solution in this class that adheres to desirable properties.
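As an illustration of how such a method might surface the “meaningful information about the logic involved” that the Guidelines require, the sketch below applies LIME to a single decision of a hypothetical classifier. The dataset, model and feature names are assumptions, and the `lime` and scikit-learn packages are assumed to be installed.

```python
# A brief sketch of using LIME to explain one automated decision in terms a data
# subject could follow. The applicant features and decision rule are hypothetical.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from lime.lime_tabular import LimeTabularExplainer

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                       # hypothetical applicant features
y = (X[:, 0] - 0.5 * X[:, 2] > 0).astype(int)       # hypothetical decision rule
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

explainer = LimeTabularExplainer(
    X,
    feature_names=["income", "age", "debt_ratio"],
    class_names=["refused", "approved"],
    discretize_continuous=True,
)
# Explain a single prediction by fitting an interpretable model locally around it.
explanation = explainer.explain_instance(X[0], model.predict_proba, num_features=3)
for feature, weight in explanation.as_list():
    print(f"{feature}: {weight:+.3f}")
```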
The applicable requirements and the precise interpretation of the wording will likely be settled in court. Judges will ultimately rule on the intent of the Article and how it applies in practice, based on the specific context of each case. In the EU, judicial rulings have tended to favor individual protections over the interests of large corporations.
Microsoft (Internet Explorer and abuse of a dominant position), Google (Android and abuse of a dominant position) and Facebook (an ongoing legal battle over its data collection practices and violations of privacy rights) have all come up against the EU and have often lost. Certainly for Facebook, the GDPR will require the firm to be much more transparent in the way it processes and makes use of the personal data of its EU-based users. A scenario such as the Cambridge Analytica affair, in which access to personal Facebook data was bought from a third party, would not be permitted in the EU without the explicit consent of the data subjects.
The right to explanation / information and explicit consent are not the only issues facing AI and machine learning designers. Controllers and machine learning designers will need to contend with other GDPR principles, such as those relating to fairness, legality, purpose limitation, data minimization, data retention periods, and integrity and confidentiality.
Goodman and Flaxman raise the issue of the fairness principle, which they state “will need to be balanced against algorithmic bias, in order to prevent the arbitrary discriminatory treatment of individuals and not emphasize any of the personal data categories referenced in Article 9. This right to non-discrimination seems to go against the very essence of algorithmic profiling, which is inherently discriminatory.”
The oft-cited example is racial profiling for risk assessment scoring, notably ProPublica’s research on such a system used in US criminal sentencing. This type of algorithm would not be allowed in the EU under the GDPR. Further, purging certain variables from datasets may render the algorithm less useful and strengthen the uncertainty bias. If, through such a purge, a certain group is underrepresented, there is more uncertainty associated with the decision output, especially if the algorithm is risk averse. A prediction may then be biased against this underrepresented group simply because the model has limited information about it.
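The uncertainty bias can be illustrated with a small numeric sketch: a risk-averse rule that scores groups by a lower confidence bound will penalize an underrepresented group even when its true rate matches the majority’s. The sample sizes and the true rate of 0.7 below are illustrative assumptions.

```python
# A small numeric sketch of uncertainty bias: with fewer observations, the same
# underlying rate produces a wider confidence interval, so a risk-averse score
# (the lower bound) is systematically worse for the underrepresented group.
import numpy as np

rng = np.random.default_rng(42)
true_rate = 0.7
groups = {"well-represented": 5000, "underrepresented": 50}

for name, n in groups.items():
    outcomes = rng.random(n) < true_rate        # simulated historical outcomes
    p_hat = outcomes.mean()
    std_err = np.sqrt(p_hat * (1 - p_hat) / n)
    lower_bound = p_hat - 1.96 * std_err        # risk-averse score: lower 95% bound
    print(f"{name:>18}: estimate={p_hat:.3f}, risk-averse score={lower_bound:.3f}")
```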
Purpose limitation is another principle that may have a significant adverse effect on the use of machine learning. The principle aims to limit the use of personal data to the purpose for which it was collected. As such, reuse of that data for other algorithms will not be permitted without again obtaining the data subject’s explicit consent for the new purpose. Organizations holding large swathes of data (for example, credit rating agencies or social media networks) will therefore have to tread carefully in how they make use of data collected on individuals, since much of it may have come from third parties.
Finally, the data minimization principle will also pose a challenge. Machine learning algorithms generally perform better the more data they are trained on. However, data minimization requires that the data used be adequate, relevant and limited to what is required for achieving the intended purpose. Designers will have to decide what constitutes an ‘adequate’ amount of data that remains relevant and limited to the purpose; this will be a difficult balance to strike.
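One practical, if rough, way a designer might approach that balance is to examine a learning curve and stop collecting data once validation performance plateaus. The sketch below assumes scikit-learn and a synthetic dataset; the one-percentage-point plateau threshold is an illustrative assumption.

```python
# A minimal sketch of using a learning curve to gauge an 'adequate' amount of
# training data: validation accuracy as a function of training set size.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 8), cv=5,
)
mean_val = val_scores.mean(axis=1)
for n, score in zip(sizes, mean_val):
    print(f"{n:5d} samples -> validation accuracy {score:.3f}")

# Flag the smallest size beyond which an extra tranche of data improves
# validation accuracy by less than one percentage point (illustrative cut-off).
gains = np.diff(mean_val)
plateau = np.flatnonzero(gains < 0.01)
adequate = sizes[plateau[0]] if plateau.size else sizes[-1]
print("Candidate 'adequate' training size:", adequate)
```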
Certainly, the advent of the GDPR in the EU will have a significant impact worldwide. The EU clearly takes the view that the advances brought about by digitization, automation, AI and machine learning should not be left unchecked, and that they are easily leveraged to the detriment of consumers if designers and deployers are not held to account. The protection of individuals’ rights and freedoms is paramount, even when pitted against arguments about restraining competitiveness and free markets, or claims that regulation stifles innovation and impedes technological development. Notably, the fact that the GDPR enshrines the principle of privacy and data protection by design will force radical changes in the way many algorithms are designed and deployed today in the EU, and likely more globally.