
The magic of natural language processing where ‘Australia’ + ‘Pizza’ – ‘Italy’ = ‘McDonalds’

A challenge for any data scientist is taking data and manipulating it so that a computer can understand its meaning. Nearly all companies gather text data from their customers in some form, whether through customer feedback, product reviews, or requests for support. Extracting meaning and learning from text data can seem like an intimidating task, particularly when there are tens of thousands of lines of text to read. Natural Language Processing (NLP) is the sub-field of data science that involves programming computers to process and analyse large amounts of natural language data.

To be able to describe text data to a computer, we have to understand the information we are working with. Sentences are made up of words, but what is a word? The answer seems intuitively obvious. Consider the word ‘pizza’ – we immediately understand that pizza is a food, the word is a noun and that the word represents an object in the real world. How can we describe these concepts to a computer?

How do we describe pizza to a computer? Not like this

Traditional natural language processing systems treated words as discrete symbols. ‘Pizza’ may be represented as word258 and ‘burger’ as word523. With arbitrary encodings like these, a model has no useful information about the relationships between words. Ideally, we want the model to apply what it has learned about ‘pizza’ to ‘burger’ – e.g. both are foods and can be eaten.
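To make the problem concrete, here is a minimal sketch of discrete word encoding using one-hot vectors. The three-word vocabulary and the helper names (`one_hot`, `dot`) are invented for illustration; they are not taken from any particular NLP system.

```python
# Toy vocabulary; the index assigned to each word is arbitrary (illustrative only).
vocab = ["pizza", "burger", "submarine"]
word_to_id = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Return a vector with a single 1 at the word's arbitrary index."""
    vec = [0] * len(vocab)
    vec[word_to_id[word]] = 1
    return vec


def dot(a, b):
    """Dot product: a simple measure of overlap between two vectors."""
    return sum(x * y for x, y in zip(a, b))


# Every pair of distinct words has zero overlap, so 'pizza' looks no more
# like 'burger' than it does like 'submarine'.
pizza_burger = dot(one_hot("pizza"), one_hot("burger"))
pizza_submarine = dot(one_hot("pizza"), one_hot("submarine"))
print(pizza_burger, pizza_submarine)  # prints: 0 0
```

Because every word is equally unlike every other word, nothing the model learns about one food carries over to another.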

Humans generally learn the meaning of words from their context. If I told someone ‘I ate haggis at a restaurant’, they could reasonably infer that haggis is a food, even without knowing exactly what haggis is.

How can we teach the meaning of a word to a model?

In 2013, a team of Google researchers led by Tomas Mikolov introduced two model architectures for computing vector representations of words from very large datasets. The code for the models was released as ‘word2vec’.

Using a large body of text (all of Wikipedia, for example), vectors or ‘word embeddings’ can be learned from the individual sentences in the source data. The idea is that a neural network can be trained to predict a target word from its context (the continuous bag-of-words architecture) or, alternatively, the surrounding context from a particular word (the skip-gram architecture). As the network is trained, the weights in the hidden layer are adjusted so that its predictions improve. After the training iterations are complete, each word’s hidden-layer weights form a vector that mathematically expresses the contexts in which the word appears.
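As a rough illustration of what ‘predicting the context from a word’ means in practice, the sketch below builds the (word, context word) training pairs that a skip-gram style model would learn from. It covers only the data preparation step, not the neural network itself. The function name `skipgram_pairs` and the window size of 2 are assumptions made for this example.

```python
def skipgram_pairs(tokens, window=2):
    """Return (centre word, context word) pairs within `window` positions."""
    pairs = []
    for i, centre in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((centre, tokens[j]))
    return pairs


sentence = "i ate haggis at a restaurant".split()
pairs = skipgram_pairs(sentence, window=2)

# The words the model would be asked to predict when shown 'haggis'.
haggis_contexts = [ctx for centre, ctx in pairs if centre == "haggis"]
print(haggis_contexts)  # prints: ['i', 'ate', 'at', 'a']
```

Over millions of sentences, words that keep appearing with similar neighbours (‘ate’, ‘restaurant’) are pushed towards similar vectors.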

The original paper and the TensorFlow word2vec documentation give a more detailed explanation. The approach is very efficient – an optimised single-machine implementation can train on more than 100 billion words of text in a single day.

Adding and subtracting words to show understanding

Once trained, one very cool property of word embeddings is that they can be used to perform algebraic operations on words. First we work out the vector that results from adding and subtracting other word vectors. We then look in the vector space to find the word whose vector is closest to the result, measured by cosine distance. For example, we can run ‘King’ + ‘woman’ – ‘man’.
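The two steps can be sketched in a few lines of Python. The three-dimensional vectors below are hand-made for illustration only; real embeddings are learned from text and usually have a few hundred dimensions. The `analogy` helper is a hypothetical name, and it picks the candidate with the highest cosine similarity, which is the same as the smallest cosine distance.

```python
import math

# Hand-made toy vectors, invented purely for illustration (not learned).
vectors = {
    "king":   [0.9, -0.7, 0.1],
    "queen":  [0.8,  0.8, 0.1],
    "man":    [0.1, -0.8, 0.0],
    "woman":  [0.1,  0.8, 0.1],
    "prince": [0.7, -0.7, 0.0],
    "pizza":  [0.0,  0.1, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def analogy(positive, negative):
    """Return the word closest to sum(positive) - sum(negative)."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + v for t, v in zip(target, vectors[word])]
    for word in negative:
        target = [t - v for t, v in zip(target, vectors[word])]
    # Exclude the input words so the query cannot simply return itself.
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine_similarity(target, vectors[w]))


result = analogy(positive=["king", "woman"], negative=["man"])
print(result)  # prints: queen
```

With pre-trained embeddings the same query is usually a single library call. In gensim, for example, it is `most_similar(positive=[...], negative=[...])` on a loaded set of word vectors.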

This is really quite remarkable, as we’re asking ‘what is the female equivalent of a King?’ and the word embeddings are able to demonstrate their learned understanding of each of the words. What’s really impressive is that these representations were learned through an unsupervised process. We didn’t tell the model that ‘Queen’ is the female equivalent of ‘King’ – it learned the representation through the context of words in the source data. The vectors even capture regional differences. We can see what the Australian equivalent of Italian pizza is with ‘Australia’ + ‘pizza’ – ‘Italy’:

Looks like the fast food of choice for Australians is Maccas – our slang for McDonald’s!

Word embeddings can even express the relationship of tense between words, for example ‘biked’ + ‘today’ – ‘yesterday’:

After having lived in London, Sydney, Brisbane and now my new home Edinburgh, I wanted to see if word embeddings captured the same relationship I felt between the cities. London and Sydney were the big bustling cities on opposite sides of the world, and I much preferred the ‘big country town’ vibe of Brisbane. Given my preference, did I choose the right city to move to in the U.K.? Let’s run ‘Brisbane’ + ‘UK’ – ‘Australia’:

Looks like I moved to the right city!

What can we use word embeddings for?

Culinary and geographic assessments aside, word embeddings have real business uses. Filtering toxic content is a major problem for large websites that allow users to post comments or other content of their own. Trolls are especially rampant on news sites. In 2016 The New York Times had a team of 14 moderators working full time to manually review around 11,000 comments each day. Because the work is so labour-intensive, commenting was only made available on 10% of New York Times articles.

Source: Minh Uong/The New York Times

Twitter’s users post over 6,000 tweets per second. The service is well known for the bullying and toxic behaviour of certain groups of its users. Twitter’s CEO Jack Dorsey explained:

“Our model right now relies heavily on people reporting violations and reporting things they find suspicious, and then we can act on it,” he said. “But ultimately we want to take that reporting burden off the individual and automate a lot more of this.”

Some users deal with the trolls their own way

Quora is a website where people can post questions, and other users contribute answers and unique insights. A key challenge is to weed out insincere questions – those founded upon false premises, or that intend to make a statement rather than look for helpful answers. For example:

Sincere questions:

  • Do you think Amazon will adopt an in house approach to manufacturing similar to the Tesla or Space X business models?
  • How do modern military submarines reduce noise to achieve stealth?
  • In your experience working with Realtors, what do you wish Realtors did better?

Insincere questions:

  • Has the United States become the largest dictatorship in the world?
  • How much more political fumbling will it take for Republicans to turn on Trump?
  • Why do all the people who claim Florida has great weather go silent every time there’s a new hurricane?

Quora released a dataset of around 1.3 million questions, each labelled as sincere or insincere. I used the dataset to test the effectiveness of various NLP models.

I tried a variety of text vectorisation methods to build classifier models, including traditional techniques like TF-IDF, word embeddings learned from the source data, and pre-trained word embeddings. So far, the best results have come from combining word embeddings learned from Wikipedia with a recurrent neural network model. The full code can be viewed in a Jupyter notebook on my GitHub.
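For readers unfamiliar with TF-IDF, the sketch below shows the basic idea: a word scores highly in a document when it is frequent there but rare across the collection. This is not the code from the notebook. The three example documents are made up, and the formula is the plain unsmoothed variant (libraries such as scikit-learn use smoothed versions by default).

```python
import math
from collections import Counter

# Three made-up documents, for illustration only (not from the Quora dataset).
docs = [
    "how do submarines reduce noise",
    "how do submarines achieve stealth",
    "why do people like pizza",
]
tokenised = [doc.split() for doc in docs]
n_docs = len(tokenised)

# Document frequency: the number of documents containing each word.
doc_freq = Counter(word for tokens in tokenised for word in set(tokens))


def tfidf(tokens):
    """Map each word in one document to term frequency x inverse document frequency."""
    counts = Counter(tokens)
    return {
        word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
        for word, count in counts.items()
    }


weights = tfidf(tokenised[0])
# 'do' appears in every document, so it carries no weight;
# 'noise' appears only here, so it scores highest.
print(sorted(weights, key=weights.get, reverse=True))
```

Unlike word embeddings, these scores treat every word as an unrelated symbol, so ‘pizza’ and ‘burger’ would still share nothing.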

This post only scratches the surface of the detail and potential applications of word embeddings. There are numerous options to consider for predictive models that use word embeddings, including convolutional neural networks, recurrent neural networks (LSTM, GRU) and attention-based models.

Word embeddings are now an essential part of any data scientist’s toolkit when building natural language processing systems.