The magic of natural language processing where ‘Australia’ + ‘Pizza’ – ‘Italy’ = ‘McDonalds’

A challenge for any data scientist is taking data and manipulating it so that a computer can understand its meaning. Nearly all companies gather text data from their customers in some form, through customer feedback, product reviews, or requests for support. Extracting meaning and learning from text data can seem like an intimidating task, particularly when …

Read more

Predicting customer churn with Python: Logistic regression, decision trees and random forests

Customer churn is when a company’s customers stop doing business with that company. Businesses are very keen on measuring churn because keeping an existing customer is far less expensive than acquiring a new customer. New business involves working leads through a sales funnel, using marketing and sales budgets to gain additional customers. Existing customers will …

Read more

Machine Learning from Disaster: Data exploration and feature engineering

In this post we will look at the basics of data loading, cleaning and visualisation with the Kaggle ‘Titanic: Machine Learning from Disaster’ data set. The data provides details for a number of passengers, including age, class and ticket price. The aim of the next post will be to apply tools of machine learning to predict …

Read more

Building an Artificial Neuron in Python – the Perceptron

With so much recent hype around Deep Learning, I wanted to take a closer look at neural networks and what has motivated the resurgence of algorithms that have existed since the 1950s. Neural networks, in their simplest form, work in a similar way to a biological neuron. A neuron has dendrites to receive signals, a cell …

Read more

Google DeepDream – turning a neural network inside out

Modern convolutional neural networks, like the one we built in the previous post, are trained to detect faces or other objects in images. With multiple layers in a neural network, it can be difficult to understand exactly how the neurons in the network respond to patterns in an image. One way to understand …

Read more

Building a deep neural network to read my handwriting with TensorFlow and Python

In the previous post, we looked at a high-level overview of how deep neural networks are constructed and trained. Using a convolutional neural net and the TensorFlow library, we will now consider a classic visual classification problem – recognising handwritten digits. Neural nets need to be trained on lots and lots of labelled data …

Read more

An Introductory Guide to Deep Neural Networks

Algorithms built with deep learning methods have featured extensively in recent headlines. AlphaGo, developed by the Google DeepMind team, was the first computer program to beat professional Go player Lee Sedol in a five-game match. Go has significantly more branching moves than chess, so traditional AI methods were unable to outperform human players. AlphaGo’s superhuman …

Read more

I trained an AI to write BuzzFeed clickbait headlines – you won’t believe what happened next

Gmail introduced ‘Smart Compose’ earlier this year. The feature uses artificial intelligence to learn from the emails you have sent previously and suggests words and phrases to finish sentences. The Google AI blog reveals the machine learning approach taken to build Smart Compose. One of the main parts of the model uses a recurrent neural …

Read more

How to read 1.7 billion Reddit comments with Spark and Python Part 1: Setting up a local cluster

In this post, I’ll provide a walkthrough of how to set up a Spark cluster locally and run some simple queries on a month of Reddit comment data. In the next post, we will look at scaling up the Spark cluster using Amazon EMR and S3 buckets to query ~1.7 billion Reddit comments from 2007 …

Read more

A song of nodes and edges – Network analysis in Game of Thrones

To demonstrate the concept of network analysis, I built an interactive, force-directed graph of character relationships for each of George R. R. Martin’s Game of Thrones novels using D3.js and Tableau. The Network of Thrones article by Andrew Beveridge and Jie Shan inspired the work. The source data gathered from the novels can be found on Andrew Beveridge’s …

Read more

Cryptocurrency and Coxcombs – What can Florence Nightingale tell us about Bitcoin prices?

Florence Nightingale is perhaps best known as the founder of modern nursing. One of her less well-known legacies is her use of infographics, which effectively communicated the meticulous records she kept of the death toll from wounds in the Crimean War. Nightingale developed a modified version of the pie chart, where the magnitude …

Read more

A data scientist’s job hunt visualised

Sankey diagrams show the flow through a system, where the width of the bands is proportional to the flow quantity. The diagrams are named after Captain Matthew Henry Phineas Riall Sankey. In 1898, Sankey created a diagram showing the energy efficiency of a steam engine. The widths of the bands showed the gas, electric and coal energy flowing through …

Read more