I trained an AI to write BuzzFeed clickbait headlines – you won’t believe what happened next
Gmail introduced ‘Smart Compose’ earlier this year. The feature uses artificial intelligence to learn from the emails you have sent previously and suggests words and phrases to finish sentences.
The Google AI blog describes the machine learning approach used to build Smart Compose. A key component of the model is a recurrent neural network. Conventional (feed-forward) neural networks have a major limitation: they take all of their input at once, at a fixed size (an image, for example). Their output is also fixed, usually a prediction that the input belongs to a particular class.
If we tried to use a vanilla neural network to predict the next word in a sentence, we would need to specify exactly how many previous words to look at to make our prediction. Say we built one that looked at the last three words:
‘The sky is….’
The last word of the sentence is fairly obvious, and a neural network might be able to make the correct prediction – ‘blue’. Let’s look at another example:
‘I was born in England. My first language is…’
For a human, it’s easy to see the missing word should be ‘English’. But the neural network only has the last three words to work with (‘first language is’). This isn’t enough to make a good prediction.
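To make the limitation concrete, here is a minimal sketch of what a fixed three-word window actually sees. The helper name `last_n_words` is made up for illustration, and splitting on whitespace is a simplifying assumption rather than how a real model tokenises text.

```python
def last_n_words(text, n=3):
    """Return the last n words of text: the only context a fixed-window model gets."""
    # Treat full stops as separators so "England." becomes "England".
    words = text.replace(".", " ").split()
    return words[-n:]


short_prompt = "The sky is"
long_prompt = "I was born in England. My first language is"

print(last_n_words(short_prompt))               # ['The', 'sky', 'is']
print(last_n_words(long_prompt))                # ['first', 'language', 'is']
print("England" in last_n_words(long_prompt))   # False
```

The word that decides the answer, 'England', never reaches the model. Widening the window only moves the problem further back.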
The solution is to give our neural network a memory and to process both the inputs and the outputs sequentially, one step at a time. If you want to understand how this works in more detail, I recommend Andrej Karpathy’s excellent article ‘The Unreasonable Effectiveness of Recurrent Neural Networks’. Recurrent neural networks (RNNs) have been shown to be enormously powerful in applications such as machine translation, language modelling and image captioning.
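The idea of a memory can be sketched in a few lines of plain Python. This is a toy, untrained vanilla RNN cell with random weights, not the model behind Smart Compose or textgenrnn. The vocabulary, the hidden size and all function names are assumptions chosen for illustration. It shows only that the hidden state carried from step to step still depends on a word seen five steps earlier.

```python
import math
import random


def one_hot(word, vocab):
    """Encode a word as a vector with a single 1.0 at the word's position."""
    return [1.0 if w == word else 0.0 for w in vocab]


def rnn_step(x, h, W_xh, W_hh, b):
    """One vanilla RNN update: h_new = tanh(W_xh x + W_hh h + b)."""
    h_new = []
    for i in range(len(h)):
        total = b[i]
        total += sum(W_xh[i][j] * x[j] for j in range(len(x)))
        total += sum(W_hh[i][j] * h[j] for j in range(len(h)))
        h_new.append(math.tanh(total))
    return h_new


def encode(words, vocab, W_xh, W_hh, b):
    """Feed words in one at a time, carrying the hidden state (the memory) forward."""
    h = [0.0] * len(b)
    for word in words:
        h = rnn_step(one_hot(word, vocab), h, W_xh, W_hh, b)
    return h


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# Toy vocabulary and random (untrained) weights, for illustration only.
vocab = ["England", "France", "my", "first", "language", "is"]
hidden_size = 4
rng = random.Random(0)
W_xh = random_matrix(hidden_size, len(vocab), rng)
W_hh = random_matrix(hidden_size, hidden_size, rng)
b = [0.0] * hidden_size

h_england = encode(["England", "my", "first", "language", "is"], vocab, W_xh, W_hh, b)
h_france = encode(["France", "my", "first", "language", "is"], vocab, W_xh, W_hh, b)

# Same last three words, different earlier word: the final states differ.
print(h_england != h_france)
```

A trained network learns weights that make this carried-over state useful for predicting the next word. Here the weights are random, so the state is different but not yet meaningful.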
One particularly fun application of an RNN is to build a language model that can read in text, and after training produce new text in the same style as the input. Andrej Karpathy fed the entire works of Shakespeare into an RNN model to generate new passages of text:
I wanted to reproduce Karpathy’s work, and see if an RNN could generate realistic-sounding internet article headlines. BuzzFeed uses a number of headline tricks to get people to click on their articles. It’s debatable whether these techniques count as ‘clickbait’ (the title of this post you’re reading is definitely clickbait). BuzzFeed specifically denies that their headlines are clickbait – see ‘Why BuzzFeed doesn’t do clickbait’. Either way, the headlines themselves have a very distinct structure and style.
The headline above uses a technique called ‘newsjacking’, hitting on three recent and controversial topics: ‘Donald Trump’, ‘the New York Times’ and ‘immigration’.
Here they use the ‘cliffhanger’ technique, showing the audience half of the story and inviting them to click through to find out the most interesting parts.
I acquired a file containing over 60,000 article titles scraped from BuzzFeed. Using Max Woolf’s textgenrnn Python module and a Jupyter notebook hosted on the Amazon SageMaker platform, I trained an RNN to generate new BuzzFeed headlines. The textgenrnn module is built on TensorFlow, and benefits from the GPU-accelerated computing offered by SageMaker. Here are some of the headlines generated at different stages of training:
After a single epoch of training on a small subset of the data, the headlines don’t make much sense, and there are very few correctly spelled words. The interesting part about this stage of training is that the algorithm isn’t just learning how to write headlines. It has to learn to spell and punctuate sentences from the ground up.
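A first pass like the one described above could be sketched roughly as follows. This is not the exact notebook code. The file name `headlines_subset.txt`, the subset size and the helper names are hypothetical, and I am assuming the scraped titles are available as a Python list with one headline per entry. The textgenrnn calls (`train_from_file`, `generate`) and their parameters follow the project’s README as I understand it, so check them against the version you install.

```python
import random


def write_subset(headlines, path, k, seed=42):
    """Write a reproducible random sample of k headlines to path, one per line."""
    rng = random.Random(seed)
    subset = rng.sample(headlines, min(k, len(headlines)))
    with open(path, "w", encoding="utf-8") as f:
        for headline in subset:
            f.write(headline.strip() + "\n")
    return len(subset)


def train_and_generate(path, num_epochs=1, n=5):
    """Train textgenrnn on a text file and return n generated headlines."""
    # Imported lazily so write_subset works without textgenrnn installed.
    from textgenrnn import textgenrnn

    textgen = textgenrnn()
    # header=False: assume the file has no header row to skip.
    textgen.train_from_file(path, header=False, num_epochs=num_epochs)
    return textgen.generate(n, return_as_list=True)


# Hypothetical usage, assuming `all_headlines` holds the scraped titles:
#   write_subset(all_headlines, "headlines_subset.txt", k=2000)
#   for line in train_and_generate("headlines_subset.txt", num_epochs=1):
#       print(line)
```

Starting with a small sample and a single epoch keeps the first run quick, which makes it easy to confirm the pipeline works before paying for a longer GPU session.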
One option we can tweak is ‘word level’, which allows the network to learn directly from the placement of whole words in the data set, rather than from character-by-character placement. Most machine learning models benefit from more data, so we’ll train the model on the full data set:
Some of the headlines are nonsensical – but others are almost indistinguishable from the real thing. I wanted to check that the algorithm hadn’t simply learned to copy the source data directly. I ran it a few more times and compared the results to the source data.
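A check along these lines can be sketched as below. The function names are made up, and the example strings are placeholders invented for this snippet, not real BuzzFeed headlines or real model output. Normalising case and whitespace before comparing is my own assumption about what should count as a copy.

```python
def normalise(headline):
    """Lower-case and collapse whitespace so trivial differences don't hide a copy."""
    return " ".join(headline.lower().split())


def split_novel(generated, source):
    """Split generated headlines into those absent from the source and those copied."""
    seen = {normalise(h) for h in source}
    novel = [h for h in generated if normalise(h) not in seen]
    copied = [h for h in generated if normalise(h) in seen]
    return novel, copied


# Placeholder data for illustration only.
source = ["10 Things Only Cat Owners Know", "Which Sandwich Are You?"]
generated = ["10 things only  cat owners know", "Which Cat Are You?"]

novel, copied = split_novel(generated, source)
print(novel)   # ['Which Cat Are You?']
print(copied)  # ['10 things only  cat owners know']
```

An exact-match test like this only catches verbatim copies. A headline that differs from a source headline by a single word would still count as novel.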
In the original training data, Justin Timberlake appears in 93 separate headlines:
The expression ‘awkward moments’ is used in 13 headlines, but none are about Justin Timberlake:
Amazingly, our RNN has learned from the structure of the headlines we provided. It was able to make the connection that Justin Timberlake is a person, and is capable of having awkward moments! While this is a small scale example, it shows the magic of RNNs in their ability to come up with plausible, genuinely novel outputs.
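Counts like the ones above can be produced with a simple case-insensitive substring search. The sketch below uses invented placeholder headlines rather than the real data set, and the function names are hypothetical.

```python
def count_containing(headlines, phrase):
    """Count headlines that contain phrase, ignoring case."""
    needle = phrase.lower()
    return sum(1 for h in headlines if needle in h.lower())


def count_containing_all(headlines, phrases):
    """Count headlines that contain every phrase in phrases, ignoring case."""
    needles = [p.lower() for p in phrases]
    return sum(1 for h in headlines if all(n in h.lower() for n in needles))


# Placeholder data for illustration only.
headlines = [
    "A Famous Singer Releases A New Album",
    "12 Awkward Moments Every Commuter Knows",
    "A Famous Singer Cooks Dinner",
]

print(count_containing(headlines, "famous singer"))    # 2
print(count_containing(headlines, "awkward moments"))  # 1
print(count_containing_all(headlines, ["famous singer", "awkward moments"]))  # 0
```

A combined count of zero in the training data is what suggests the generated pairing was composed by the model rather than memorised.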
RNNs and their more refined LSTM (long short-term memory) variants have powered some incredible applications like Siri, Alexa and the Google Assistant. The cutting edge of current research on RNNs is around ‘attention models’. These approaches use hierarchical models to overcome the computational costs of looking further back into the network’s memory, as well as improving the level of inductive reasoning. See Google DeepMind’s Neural Turing Machines paper for more details.
The possibilities for generative AI models are very exciting. It may be possible to reproduce an artist’s work long after they are gone – imagine a new Queen album written by an AI that has been trained on the band’s full discography. Neural style transfer algorithms (which we will look at in a future post) can already create works of art in the style of a chosen artist.
If you’re interested in reproducing the clickbait headlines (or any other source data) the full code can be found on my GitHub.