
Extracting topics from 11,000 Newsgroups posts with Python, Gensim and LDA

Machine learning and natural language processing techniques give us the ability to extract hidden topics from large volumes of text. Topic modelling is useful when dealing with collections of documents that are simply too large for an individual to read and summarise manually. Examples of large volumes of text include online reviews, news articles and email conversations within an organisation. Knowing the key topics of discussion is very useful for businesses that want to understand their customers’ problems, or what customers particularly like about the business.

Once a business has collected the data through surveys or by scraping online reviews, it needs an algorithm that can read through the documents and automatically output the topics. Since we don’t specify the topics beforehand, topic modelling is an unsupervised learning problem. A number of algorithms perform topic modelling, but in this post we’ll focus on Latent Dirichlet Allocation (LDA) using the Gensim package in Python.

How does LDA work?

There are three layers to the classification of documents in LDA. Documents are made up of a distribution of topics, and topics are represented by a distribution of words. We don’t know in advance how many topics there are, nor how many words belong to each topic. The Dirichlet distribution is commonly used in Bayesian statistics where we suspect clustering among random variables. The LDA algorithm assumes that the combinations of topics and words, as well as the combinations of documents and topics, follow Dirichlet probability distributions.
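As a small illustration (not part of the original workflow), a single draw from a Dirichlet distribution is a probability vector, which is why it is a natural way to model one document’s mix of topics:

```python
import numpy as np

rng = np.random.default_rng(42)

# A draw from a Dirichlet distribution is a probability vector: every entry
# is non-negative and the entries sum to 1. Here it plays the role of one
# document's mix over three topics. Small alpha values push most of the mass
# onto a few topics, mimicking documents that are mostly about one thing.
topic_mix = rng.dirichlet(alpha=[0.1, 0.1, 0.1])
print(topic_mix, topic_mix.sum())
```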

Extracting the topics from 11,000 Newsgroups posts

To see how LDA performs, we’re going to use the 20-Newsgroups dataset. The raw data is a 22.2 MB JSON file with discussions scraped from 20 different Newsgroups discussion boards. While we don’t specify desired labels for the topics, it’s useful in our case to use a data set that we know has an underlying structure with distinct topics of discussion. We would expect that if LDA works as intended, it would be able to separate out the topics with a similar structure to each discussion group. Parts of the code below have been adapted from Selva Prabhakaran’s post here.

Loading and cleaning the data

Let’s see what the data looks like:
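A loading step might look like the following. The filename and the column names (`content`, `target_names`) are assumptions about the scraped file’s structure; the two-record in-memory sample below stands in for the real ~22 MB JSON file so the snippet runs on its own:

```python
import io

import pandas as pd

# In practice: df = pd.read_json('newsgroups.json')  # path is a placeholder
# A tiny stand-in with the assumed record structure:
sample = io.StringIO(
    '[{"content": "From: joe@example.com\\nI rode my bike today",'
    ' "target_names": "rec.motorcycles"},'
    ' {"content": "The car engine would not start",'
    ' "target_names": "rec.autos"}]'
)
df = pd.read_json(sample)
print(df.head())
```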

We have a lot of unwanted characters from email addresses and formatting that won’t play nicely with Gensim. We can clean the text with regular expressions:
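A minimal cleaning pass might look like this; the exact patterns depend on your data, so treat them as a starting point rather than a definitive recipe:

```python
import re

def clean_text(text):
    """Strip common Newsgroups noise before tokenisation."""
    text = re.sub(r'\S*@\S*\s?', '', text)  # drop email addresses
    text = re.sub(r'\s+', ' ', text)        # collapse newlines and extra whitespace
    text = re.sub(r"'", '', text)           # remove stray single quotes
    return text.strip()

print(clean_text("From: joe@example.com\nI can't wait to   ride"))
# → From: I cant wait to ride
```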

Tokenising words

Tokenisation involves breaking a sequence of strings into pieces called tokens; punctuation is generally discarded. LDA requires that the text be tokenised for processing. There are several excellent Python libraries for text processing (NLTK, Stanford CoreNLP, TextBlob), but Gensim also provides what we need for tokenisation.

Removing stopwords

Stopwords like ‘the’, ‘is’ and ‘of’ don’t contain any useful information for our purposes of topic modelling, so they are removed from the data. We also add a few email-related words that occur often in the Newsgroups data but don’t add value to our topic clustering.


Lemmatising words

Words can appear in various inflected forms: the word ‘ride’, for example, may appear as ‘riding’, ‘rides’ or ‘rode’. The base form of the word, ‘ride’, is known as the lemma. Lemmatisation groups all of the inflected versions of a word so they can be analysed as a single item. Lemmatisation is applied before topic clustering so that we can filter out the noise of the non-base forms of words in the documents.

Creating the dictionary and the corpus

The LDA model takes three inputs:

  1. Dictionary: gensim.corpora.Dictionary creates a mapping between words and their integer IDs. With the integer ID, we can more efficiently create a mapping of each word to that word’s frequency within a document.
  2. Corpus: the frequency of each word within each document (a bag-of-words representation).
  3. Number of topics: The user defines the number of expected topics in the LDA topic model. One measure of the optimal number of topics is the ‘coherence score’ which we may look at in a future post.

Now that we have our LDA model, we can see which keywords apply to each topic and their relative weighting within that topic.

Topic modelling provides us with each word in the topics, but not a single all-encompassing word to describe the topic. The human part of the modelling involves inferring what the topic is from the list of words. With some knowledge of the data set, we can guess what each topic is about.

Visualising the topic model

While we have the results of our topic model, a much more effective way to explore the clustering is with an interactive visualisation. The pyLDAvis package can generate the chart:

The spacing and size of each of the bubbles gives us an indication of the quality of our topic model. The best topic models will have large bubbles that don’t overlap, showing a clear distinction between the topics identified. Mousing over each of the bubbles shows the keywords from each topic and their relative weighting.

Practical applications of the topic model

Ultimately we want to be able to tie the topics identified back to our original documents.

The table shows the topic each document was sorted into, along with the keywords for each of those topics. A lot of our topics are fairly easy to interpret from the keywords alone. But to gain an even deeper understanding of a topic, it helps to read a document that the clustering process has deemed ‘most representative’ of that topic.

In this post we looked at building a topic model using LDA and the Gensim package. Topic modelling is a very effective way to gain an understanding of a large text data set, and to pick out individual documents of particular interest.