A recent survey by EY, Cambridge University, and the World Economic Forum of over 150 financial institutions found that almost 80% of established non-fintech firms believe that AI will have high or very high strategic importance over the next two years. Firms that might not use these technologies today may feel compelled to use them in the near future in order to remain competitive.
However, these technologies do not come without risks, which many firms categorise under the operational risk umbrella. As firms adopt this technology, they may expect senior directors and operational risk professionals both to manage the risks of these tools and to challenge how they are used. The Bank of England (BoE) highlights that firms are beginning to recognise the need for employees at different levels of the firm to have machine learning knowledge and skills.
In order to provide effective oversight of these tools, managers and firm leaders must understand their main sources of risk. For machine learning models, which rely on data to learn the correct predictions or outputs, problems with data quality can cause even the best-made systems to fail.
Why is data quality important for machine learning?
Machine learning models use data to learn the relationships that exist between their inputs and outputs. This means they go through a training process where they use input data to learn to create the correct outputs. A model predicting client creditworthiness may use loan applicant information to learn to predict the client’s likelihood of default. Chatbots use recorded speech and textual information to learn to produce the appropriate answers to customer questions. But what happens when we use bad data to build our models?
Bad data could be data that no longer reflects the current environment of the firm. Imagine a bank builds a chatbot to answer customer questions but doesn’t train it on any data about its current products and services. The chatbot wouldn’t be able to understand or answer questions about these topics because it never learned about them. Bad data also includes biased data: data that encodes prejudice against certain groups, which a model trained on it will learn to reproduce.
As a recent MIT article stated, ‘models learn exactly what they are taught’ (DeBrusk), meaning that even the best-built models won’t be successful if they are built on bad data. In order to avoid these data pitfalls, firms must build robust data risk management systems.
How can firms effectively manage data risk?
There are many potential methods for preventing data issues from ruining your machine learning models. We highlight three steps firms can take to limit the potential risks that data issues pose.
1. Assess the potential biases that exist in the data
Humans are biased, so data created by humans is often biased. Mitigating the impact of potentially biased data requires in-depth analysis of the assumptions and prejudices that currently exist in your firm’s data and processes. This may require independent expert review to help find these sources of bias and exclude them from your datasets.
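One simple starting point for this kind of analysis is to compare how often a favourable outcome appears for each group in the historical data. The sketch below is illustrative only: the field names ("group", "approved") and the sample records are hypothetical, and a gap between groups is a prompt for expert review rather than proof of bias.

```python
from collections import defaultdict


def outcome_rate_by_group(records, group_key, outcome_key):
    """Return the share of positive outcomes for each group in the data."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for row in records:
        group = row[group_key]
        totals[group] += 1
        if row[outcome_key]:
            positives[group] += 1
    return {group: positives[group] / totals[group] for group in totals}


def rate_ratio(rates):
    """Ratio of the lowest group rate to the highest (1.0 means equal rates)."""
    highest = max(rates.values())
    if highest == 0:
        return 1.0
    return min(rates.values()) / highest


# Hypothetical historical loan decisions; field names are assumptions.
loans = [
    {"group": "A", "approved": True},
    {"group": "A", "approved": True},
    {"group": "A", "approved": False},
    {"group": "A", "approved": True},
    {"group": "B", "approved": True},
    {"group": "B", "approved": False},
    {"group": "B", "approved": False},
    {"group": "B", "approved": False},
]

rates = outcome_rate_by_group(loans, "group", "approved")
ratio = rate_ratio(rates)
print(rates, round(ratio, 2))
```

A ratio well below 1.0 suggests the training data records very different outcomes for different groups, which is the kind of pattern a model would learn to repeat.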
2. Establish robust data governance procedures
Establishing effective data governance requires that firms follow five steps:
1. Create documented and clear objectives for machine learning models that your data are capable of meeting
Before building a new machine learning model, firms must establish clear objectives they hope to accomplish with it. Managers and engineers must ensure that their data are capable of meeting these objectives.
2. Include data quality assurance processes in your model building timelines
Project timelines must include appropriate time allocated for data assessment and cleansing. This may require building systems that can automate or minimise the time required for data cleaning processes.
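As a rough illustration of what automating part of this work might look like, the sketch below counts three common problems in a dataset: missing required fields, values outside an expected range, and exact duplicate rows. The field names, ranges, and sample records are hypothetical, and a real pipeline would need checks chosen for its own data.

```python
def quality_report(rows, required_fields, ranges):
    """Count basic quality problems in a list of dict records."""
    report = {"rows": len(rows), "missing": 0, "out_of_range": 0, "duplicates": 0}
    seen = set()
    for row in rows:
        # A required field counts as missing if absent, None, or empty.
        if any(row.get(field) in (None, "") for field in required_fields):
            report["missing"] += 1
        # Count the row once if any numeric field falls outside its range.
        for field, (low, high) in ranges.items():
            value = row.get(field)
            if isinstance(value, (int, float)) and not (low <= value <= high):
                report["out_of_range"] += 1
                break
        # Detect exact duplicates of an earlier row.
        key = tuple(sorted(row.items()))
        if key in seen:
            report["duplicates"] += 1
        seen.add(key)
    return report


# Hypothetical loan applicant records.
applicants = [
    {"id": 1, "income": 42000, "age": 34},
    {"id": 2, "income": None, "age": 51},
    {"id": 3, "income": 38000, "age": 212},
    {"id": 1, "income": 42000, "age": 34},
]

report = quality_report(
    applicants,
    required_fields=["id", "income", "age"],
    ranges={"age": (18, 120), "income": (0, 10_000_000)},
)
print(report)
```

Running a report like this at each stage of data preparation gives the project team an early, repeatable signal of how much cleansing time a dataset is likely to need.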
3. Maintain a data history trail – including which models are trained on a given dataset
Data often goes through many changes in order to prepare it for training machine learning models. Firms should keep copies of the data and preparation methods at all stages of processing. This information can be used to help firms understand issues that may emerge during the model building process.
Firms should also maintain an inventory of which models use a particular dataset. If problems are detected with a given dataset firms can quickly determine which models may require revision.
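A minimal sketch of such a history trail and inventory follows, assuming datasets small enough to fingerprint in memory. The class name DataLineage, the step descriptions, and the model names are all hypothetical. Note that a fingerprint only identifies a version of the data; the copies themselves would still need to be stored separately.

```python
import hashlib
import json


def fingerprint(rows):
    """Stable SHA-256 fingerprint of a dataset held as a list of dicts."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class DataLineage:
    """Hypothetical record of processing steps and of which models use which data."""

    def __init__(self):
        self.history = []            # ordered processing steps
        self.models_by_dataset = {}  # fingerprint -> set of model names

    def record_step(self, description, rows):
        """Log a processing step and return the resulting dataset fingerprint."""
        digest = fingerprint(rows)
        self.history.append({"step": description, "fingerprint": digest})
        return digest

    def register_model(self, model_name, dataset_fingerprint):
        """Note that a model was trained on the given dataset version."""
        self.models_by_dataset.setdefault(dataset_fingerprint, set()).add(model_name)

    def models_using(self, dataset_fingerprint):
        """List the models that may need revision if this dataset is faulty."""
        return sorted(self.models_by_dataset.get(dataset_fingerprint, set()))


raw = [{"id": 1, "income": 42000}, {"id": 2, "income": None}]
cleaned = [row for row in raw if row["income"] is not None]

lineage = DataLineage()
raw_id = lineage.record_step("raw extract", raw)
clean_id = lineage.record_step("dropped rows with missing income", cleaned)
lineage.register_model("credit_default_v1", clean_id)
lineage.register_model("affordability_v2", clean_id)

affected = lineage.models_using(clean_id)
print(affected)
```

If a problem is later found in the cleaned dataset, a lookup like models_using gives the list of models to review without searching through project documentation.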
4. Appoint a data quality controller
Firms should assign an individual as the data quality controller throughout the life of the model. This person should possess thorough understanding of the data including potential quality issues. They must impose quality requirements on all input data and have the ability to decommission the model if necessary.
5. Ensure that all data are independently reviewed by experts to guarantee quality
Firms must establish a means of impartial review in order to ensure that data management procedures work as intended. This means review by those not directly involved with building or using these models.
3. Ensure that data meets GDPR requirements
Firms must consider privacy whenever they use personal data or algorithms to make decisions about individuals. In order to remain GDPR compliant when using personal data, firms must incorporate the principles of data minimisation and fairness into their data governance model. Data minimisation means that firms must do everything possible to minimise the use of personal information. They should use anonymised data and establish procedures to use as little sensitive data as possible. The principle of fairness requires that firms be extremely cautious when using data that contains information about membership of a protected group. Personal data should not highlight protected group membership and should be used ‘in accordance with what … [the data subject] might reasonably expect’ (16, Norwegian data protection authority).
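To make data minimisation more concrete, the sketch below keeps only the fields a model is assumed to need and replaces the customer identifier with a keyed hash. The field names, the allowed-field list, and the secret key are hypothetical. A keyed hash is pseudonymisation rather than anonymisation, so the output would still count as personal data under GDPR; this is an illustration of the principle, not a compliance recipe.

```python
import hashlib
import hmac


def pseudonymise(value, secret_key):
    """Replace an identifier with a keyed SHA-256 hash (not reversible without the key)."""
    return hmac.new(secret_key, str(value).encode("utf-8"), hashlib.sha256).hexdigest()


def minimise(record, allowed_fields, id_field, secret_key):
    """Keep only the allowed fields and pseudonymise the identifier."""
    reduced = {key: value for key, value in record.items() if key in allowed_fields}
    reduced[id_field] = pseudonymise(record[id_field], secret_key)
    return reduced


# Hypothetical applicant record containing more than the model needs.
applicant = {
    "customer_id": "C-1001",
    "name": "Jane Doe",
    "ethnicity": "not used for modelling",
    "income": 42000,
    "loan_amount": 15000,
}

secret_key = b"hypothetical-secret-key"  # in practice, stored outside the code
training_row = minimise(
    applicant,
    allowed_fields={"income", "loan_amount"},
    id_field="customer_id",
    secret_key=secret_key,
)
print(sorted(training_row))
```

Listing the allowed fields explicitly, rather than listing fields to remove, means any new column added to the source data is excluded by default until someone decides the model needs it.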
Effective data risk management
Following these three steps helps firms establish a robust system for data risk management. Managers and senior directors must ensure that appropriate data governance policies are established and that relevant quality assurance roles have been assigned. With these procedures in place, firms can quickly handle any potential data issues that may arise during the life of their machine learning models.