Re-using Machine Learning Models and the “No Free Lunch” Theorem
by James Ravenscroft, Chief Technology Officer at Filament
Why re-use machine learning models?
You can get a lot of value out of training a machine learning model to solve a single use case, like predicting emotion in your customer chatbot transcripts and putting the angry ones through to real humans. However, you might be able to extract even more value out of your model by using it in more than one use case. You could use an emotion model to prioritise customer chat sessions but also to help monitor incoming email inquiries and social media channels too. A model can often be deployed across multiple channels and use cases with significantly less effort than creating a new, complete training set for each problem your business encounters. However there are some caveats that you should be aware of. In particular, the “No Free Lunch Theorem” which is concerned with the theoretical drawbacks of deploying a model across multiple use cases.
No free lunch? What has food got to do with it?
Anyone who’s familiar with Artificial Intelligence and Machine Learning should probably be familiar with the “No Free Lunch Theorem”.
The theorem, posited by David Wolpert in 1996 is based upon the adage “there’s no such thing as a free lunch”, referring to the idea that it is unusual or even impossible to get something for nothing in life. Wolpert’s theorem states that no one machine learning model can be best for all problems. Another way of thinking about this is that there are no “silver bullets” in machine learning. A lot of the time when Filament consult on an ML problem we have to try a number of approaches depending on the type and quality of data we’re given to train the system as well as the “ that our client is expecting. In traditional programming you can write an algorithm once and (assuming it is bug free) apply it to any number of new use cases. Hypothetically speaking, a good developer could build a bank accounting tool and deploy it for a bank in the UK and a bank in China with minimal changes. This is not necessarily the case in machine learning.
So what are the deeper implications of this theorem on modern machine learning use cases? According to Wolpert, there can be no guarantee that an algorithm trained on dataset A (10000 product reviews on amazon labelled with positive/negative sentiment) will work well on dataset B (1000 social media posts that mention your company, also labelled with positive/negative sentiment). You could find that it works really well or you could find that it’s a flop. There’s just no theoretical way to know! That’s scary right? Well in practice, with sensible reasoning, evaluation and by using approaches to adapt models between use cases (known as domain adaptation) maybe it’s not all that bad ...
So “No Free Lunch” means I can’t reuse my models?
Not exactly. The theorem says that there’s no correlation between your model’s performance in its intended environment and its performance in a completely new environment. However, it doesn’t rule out the possibility of be correlations if we know the nature of the new problems and data. You can think about it in terms of human expertise and specialisation. Humans learn to specialise as they work their way through the education system. A medical doctor and a veterinarian both go through extensive training in order to be able to carry out medical procedures on humans and animals respectively. A veterinarian might be excellent at operating on different species of animals. However, a veterinarian would not be able to operate on an injured human to the same level of professionalism as a medical doctor without some additional training.
A doctor knows how to operate on people but would need additional training to be able to work with animals.
Intuitively, machines operate in a similar way. A model trained on sentiment in the context of customer inquiries will learn the vocabulary that customers use to describe how happy they are with the service they received from your company. Therefore it follows that you could deploy this model across a number of channels where customers are likely to communicate with your company. On the other hand, the language that a film critic uses when reviewing a movie is quite different to that of a customer complaining about their shopping experience. This means that your customer inquiries model probably shouldn’t be re-used for sentiment analysis in product reviews without some evaluation and potentially some domain adaptation. A model trained to detect a person’s age from high-definition digital images might work well running in a phone app where the phone has a suitably high quality camera but may struggle with low quality images from webcams.
How do I know whether my model will work well for a new problem?
The first step in understanding the suitability of a model to a new domain is to try and understand how well the model performs on the new domain without any modification. We can build a small (but not too small) ground truth for our new domain and see how well the old model does at predicting the correct labels.
A good way to accelerate this process is to run your model on the new data and have humans mark the results like a teacher scores test results. As the number of possible labels in your model increases, the mental load of annotation increases. Asking the human “is this correct?” is a much easier task than making them label the data from scratch. Explosion.ai have some really good thoughts on data collection that suggest a similar viewpoint.
Once we have a dataset in the new domain we can measure accuracy, recall and precision and make a judgement on whether our old model is performing well enough that we want to simply deploy it into the new environment without modification. Of course, if we do choose to go down this path, it is still good practice to monitor your model’s performance over time and retrain it periodically on examples where classification failed.
If the model didn’t do well, we’ve got another trick up our sleeve. Domain adaptation.
What can domain adaptation do for model reuse?
On September 23rd, 1999, a $125 Million Mars lander burned up in orbit around the red planet, much to the dismay of the incredibly talented engineering teams who worked on it. The theory was sound. Diligent checks had been carried out. So why, then, had the mission gone up in flames? A review panel meeting showed that a miscommunication between teams working on the project was the cause. One set of engineers expressed force in pounds, the other team preferred to use newtons. Neither group were wrong but this small miscommunication had everything grinding to a half.
A communication block between teams working in imperial and metric measurements caused an expensive accident. Sometimes when the science is right but the language is wrong, all you need is a common dialect.
The point here is that the machine learning model that you trained for sentiment in one environment may not be that far off. It might just need “tweaking” to make it work well for the new domain. This exercise, known as “domain adaptation” is about mapping features (words) that help the classifier understand sentiment in one domain onto features (words) that help the it to understand the other domain. For example a model trained on reviews for mobile phones might learn to use negative words like “unresponsive, slow, outdated” but reviews for movies might use negative words like “cliched, dull, slow-moving”. This mapping of features is not an exact science but good mappings can be found using approaches like that of Glorot, Bordes and Bengio (2011).
Building machine learning models is a valuable but time consuming activity. It makes sense to build and reuse models where possible. However, it is important to be aware of the fact that models can become ultra-specialised at the task that they’re trained on and that some adaptation may be required to get them working in new environments. We have given some tips for evaluating a model’s performance on new tasks as well as some guidance on when re-using a model in a new environment is appropriate. We have also introduced the concept of domain adaptation, sometimes referred to as “transfer learning” which allows your existing model to learn the “language” of the new problem space.
No Free Lunch theorem can sound pretty scary when it comes to model reuse. We can’t guarantee that a model will work in a new space given what we know about its performance on an existing problem. However, using some of the skills and techniques discussed above, you can have a certain level of confidence that a model will work on a new problem and you can always “domain adapt” to make it better. Ultimately the proof that these techniques are effective lies with suppliers like IBM, Microsoft and Google. These titans of tech have developed widely used and respected general models for families of problems that are relevant across many sectors for NLP, Voice Recognition and Image Processing. Some of them are static, some trainable (although ‘trainable’ often means domain-adaptation ready) but most of them work very well with little to no modification on a wide range of use cases. The most important thing is to do some due diligence around checking that they work for you and your business.