MeasureCamp #7
I've just come back from #MeasureCamp, where I attended some great talks: on hierarchical models, the process of analysis, a demo of Hadoop processing Adobe Analytics hits, web scraping with Python, and how machine learning will affect marketing in the future. Unfortunately the sad part of MeasureCamp is that you also miss some excellent content when sessions clash, but that's the nature of an ad hoc schedule. I also got to meet some excellent analytics bods, and friends old and new. Many thanks to all the organisers!
My sessions on machine learning
After finishing my presentation I discovered I would need to talk waaay too quickly to fit it all in, so I decided instead to run a session on each example I had. The presentation is now available online here, so you can see what was intended.
I got some great feedback, as well as requests for details from people who had missed the sessions, so this blog post will try to fill in some of the detail around what we covered.
Session 1: Introduction, Google Analytics Data and Random Forest Example
Introduction
Machine learning gives programs the ability to learn without being explicitly programmed for a particular dataset: they build models from input data to create useful output, commonly predictions (Arthur Samuel, via Wikipedia). There are plenty of machine learning resources, but not many that deal with web analytics in particular. The sessions were aimed at inspiring web analysts to add machine learning to their toolbox, showing two machine learning examples that detail:
- What data to extract
- How to process the data ready for the models
- Running the model
- Viewing and assessing the results
- Tips on how to put it into production
Types of Machine Learning
Machine learning models are commonly split between supervised and unsupervised learning. We deal with an example of each:
- Supervised: train the model against a training set with known outcomes. Examples include spam detection and our example today, classifying users based on what they eventually buy. The model we use is known as Random Forests.
- Unsupervised: let the model find its own structure in the data. Examples include the clustering of users that we do in the second example using the kmeans model.
Every machine learning project needs the elements below. They are not necessarily done in order, but a successful project will need to incorporate them all:

Pose the question: the most important step. We pose a question that our model needs to answer. We also review this question as we work on the project, and may modify it to fit what the data can actually do.

Data preparation: the majority of the work. It covers getting hold of the data, munging it so it fits the model, and parsing the results. I've tried to include some R functions below that will help with this, including getting the data from Google Analytics into R.

Running the model: the sexy statistics part. Whilst superstar statistics skills are helpful to get the best results, you can still get useful output when applying the model defaults, which we do today. The important thing is to understand the methods.

Assessing the results: what you'll be judged on. You will of course have a measure of how accurate the model is, but an important step is visualising this and being able to explain the model to non-technical people.

Putting it into production: the ROI and business impact. A model that just runs in your R code on your laptop may be of interest, but it is ultimately not as useful to the business as a whole if you can't recommend how to implement the model and its results in production. Here you will probably need to talk to IT about how to call your model, or even rewrite your prototype in a more production-level language.
Pitfalls Using Machine Learning in Web Analytics
There are some considerations when dealing with web analytics data in particular:

Web analytics is messy data: definitions of metrics such as unique users, sessions or pageviews can vary from website to website, so a thorough understanding of what you are working with is essential.

Most practical analysis needs robust unique user IDs: for useful, actionable output, machine learning models need data recorded against useful dimensions, and for most websites that means your users. Unfortunately the user is also the most woolly definition in web analytics, given the nature of different access points. Having a robust unique user ID is very useful, and is what made the examples in this blog post possible.

Time-series techniques are the quickest way in: if you don't have unique users, then you may want to look at time-series models instead, since a lot of web analytics data is count data over time. This is the reason I made GA Effect one of my first data apps, since it applies to most web analytics situations.

Correlated confounders: it is common for web analytics to record highly correlated metrics, e.g. PPC clicks and cost. Watch out for these in your models, as they can overweight results.

Self-reinforcing results: also be wary of applying models that will favour their own results. For example, a personalisation algorithm that places products at the top of the page will naturally get more clicks there. To get around this, consider using weighted metrics, such as a click curve for page links. Always test.

Normalise your data: make sure all metrics are on the same scale, otherwise some will dominate, e.g. pageviews and bounce rate in the same model.
The Scenario
Here is the situation the following examples are based upon. Hopefully it will be something familiar to your own case:
You are in charge of a reward scheme website, where existing customers log in to spend their points. You want users to spend as many points as they can, so that the points have a high perceived value. You capture a unique user ID on login into custom dimension 1, and use Google Analytics enhanced ecommerce to track which prizes users view and claim.
Notice this scenario involves a reliable user ID, since every user logs in to use the website. This may be tricky on your own website, so you may need to work with only a subset of your users. In my view, the data gains from reliable user identification mean it is worth designing the website to involve logged-in content as much as possible.
Random Forests
Now we get into the first example. Random Forests are a popular machine learning tool as they typically give good results: in Kaggle competitions they are often the benchmark to beat. Random Forests are based on decision trees, and decision trees are the topic of a recent interactive visualisation on machine learning that has been doing the rounds. It's really great, so check it out first, then come back here.
Back? Ok great, so now you know about decision trees.
Random Forests are a simple extension: a collection of decision trees is a Random Forest. A problem with single decision trees is that they overfit your data, so when you throw new data at one you will get misclassifications. It turns out, though, that if you aggregate many decision trees, each trained on a subset of your original data, all those slightly worse models added together make one robust model, meaning that when you throw new data at a Random Forest it is more likely to be a close fit.
If you want more detail, check out the very readable original paper by Breiman and Cutler; a tutorial on using it with R is here.
Example 1: Can we predict what prizes a user will claim from their view history?
Now we are back looking at our test scenario. We have noticed that a lot of users aren't claiming prizes despite browsing the website, and we want to see if we can encourage them to claim, so that they value the points more and spend more to earn them. We want to look at users who do claim, and see what prizes they look at before they claim. Next we will see if we can build a model to predict what a user will claim based on their view history. In production, we will use this to email prize suggestions to users who have viewed but not claimed, to see if it improves uptake.
Fetching the data
Use your favourite Google Analytics to R library. I'm using my experimental new library, googleAnalyticsR, but it doesn't matter which: the important thing is what is being fetched. In this example the user ID is being captured in custom dimension 1, and we're pulling out the product SKU code. This is transferable to other web analytics tools such as Adobe Analytics (perhaps via the RSiteCatalyst package). Note that we need two API calls to get the views and the transactions, as these can't be queried in the same call; they will be merged later.
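A sketch of the two fetches might look like the below. This is illustrative only: the View ID and date range are made up, and the exact function name and arguments will depend on your library version, so check the googleAnalyticsR documentation for the current call shape.

```r
library(googleAnalyticsR)

ga_auth()        # authenticate against the GA API
ga_id <- 123456  # hypothetical Google Analytics View ID

# Call 1: product views per user (user ID captured in custom dimension 1)
views <- google_analytics(ga_id,
                          date_range = c("2016-01-01", "2016-03-01"),
                          metrics    = "productDetailViews",
                          dimensions = c("dimension1", "productSku"),
                          max        = -1)

# Call 2: claims (transactions) per user - can't share a call with the views
claims <- google_analytics(ga_id,
                           date_range = c("2016-01-01", "2016-03-01"),
                           metrics    = "itemQuantity",
                           dimensions = c("dimension1", "productSku"),
                           max        = -1)
```

The two data frames are merged on the user ID in the transformation step that follows.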
Transforming the data
We now need to put the data into a format that will work with Random Forests: a matrix of predictors to feed into the model, plus one response column holding the desired output labels, reshaped so there is one row per user:
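One way to sketch this munging step in base R, using a tiny illustrative dataset in place of the real API results (the user and SKU names are made up): pivot the long table of views into one 0/1 column per SKU, then merge on each user's claimed prize as the response.

```r
# Illustrative data: a long table of product views and a table of claims
views <- data.frame(
  userId = c("u1", "u1", "u2", "u2", "u3"),
  sku    = c("sku_a", "sku_b", "sku_a", "sku_c", "sku_b"),
  stringsAsFactors = FALSE
)
claims <- data.frame(
  userId  = c("u1", "u2", "u3"),
  claimed = c("sku_b", "sku_c", "sku_b"),
  stringsAsFactors = FALSE
)

# One row per user, one 0/1 predictor column per viewed SKU
predictors <- as.data.frame.matrix(table(views$userId, views$sku))
predictors[predictors > 1] <- 1   # flag that a view happened, ignore repeats
predictors$userId <- rownames(predictors)

# Attach the response column: the SKU each user claimed
model_data <- merge(predictors, claims, by = "userId")
model_data$claimed <- as.factor(model_data$claimed)
```

With ~185 products this produces a wide but sparse matrix, which is fine for Random Forests.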
Running RandomForest and assessing the results
We now run the model. This can take a long time with lots of dimensions (it can be much improved by using PCA for dimension reduction, see later). We then test the model on the test data and get an accuracy figure. On my example test set I got ~70% accuracy on this initial run, which is not bad, and it is possible to get up to 90-95% with some tweaking. Anyhow, let's plot the test vs predicted product frequencies to see how it looks:
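The pattern is roughly as below. Note this is a minimal sketch using the built-in iris data as a stand-in for our user/SKU matrix (the real response would be the claimed SKU, not Species), and it assumes the randomForest package is installed.

```r
library(randomForest)

set.seed(42)

# Stand-in for our user/SKU matrix: any data frame with a factor response
train_idx <- sample(nrow(iris), 0.7 * nrow(iris))
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

# Fit with the package defaults, as in the session
fit <- randomForest(Species ~ ., data = train)

# Accuracy on the held-out test set
preds    <- predict(fit, newdata = test)
accuracy <- mean(preds == test$Species)

# Compare test vs predicted class frequencies, as in the plot below
barplot(rbind(Actual = table(test$Species), Predicted = table(preds)),
        beside = TRUE, legend.text = TRUE)
```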
This outputted the plot below. In general the ~70% accuracy predicted many products well, but with a lot of error concentrated in one large outlier. Examining the data, this product SKU was for a cash-only prize. A next step would be to look at how to deal with this product in particular, since eliminating it improves accuracy to ~85% in one swoop.
Next steps for the RandomForest
There I stop, but there are lots of next steps that could be taken to make the model applicable to the business. A non-exhaustive list:
- Run the model on more test sets
- Train the model on more data
- Try reducing the number of parameters (see PCA later)
- Examine large error outliers
- Compare with simpler models (last/first product viewed?): complicated is not always best!
- Run the model against users who have viewed but not yet claimed
- Run an email campaign with a control group and the model results for the final judgement
It is hoped the above inspired you to try it yourself.
Session 2: Kmeans, Principal Component Analysis and Summary
Example 2: Can we cluster users based on their view product behaviour?
Now we look at kmeans clustering. The questions we are trying to answer are something like this:
Do we have suitable prize categories on the website? How do our website categories compare to user behaviour?
We hope the kmeans clustering will give us data to help with decisions on how the website is organised.
For this we will use the same data as we used before for Random Forests, with some minor changes: as kmeans is an unsupervised model we will take off our product labels:
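Sketching that change with the same illustrative column names as before (the data is made up): we simply drop the ID and the response, keeping only the numeric view columns.

```r
# Illustrative user x SKU view matrix with the response column still attached
model_data <- data.frame(
  userId  = c("u1", "u2", "u3"),
  sku_a   = c(1, 1, 0),
  sku_b   = c(1, 0, 1),
  sku_c   = c(0, 1, 0),
  claimed = factor(c("sku_b", "sku_c", "sku_b"))
)

# Unsupervised: drop the label and the ID, keep only numeric predictors
kmeans_data <- model_data[, !(names(model_data) %in% c("userId", "claimed"))]
```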
A lot of this example is inspired by this nice beginner's walkthrough on kmeans with R.
Introduction to Kmeans clustering
This video tutorial on kmeans explains it well:
The above is an example in two dimensions, but kmeans can apply to many more dimensions than that; we just can't visualise them easily. In our case we have 185 products whose views will each serve as a dimension. However, problems with that many dimensions include long processing times alongside the danger of overfitting the data, so we now look at PCA.
Principal Component Analysis (PCA)
We perform Principal Component Analysis (PCA) to see if there are important products that dominate the model. This could have been applied to the previous Random Forest example as well; indeed, a final production pipeline could feed the output of one model, such as kmeans, into Random Forests. PCA rotates the data into new dimensions ranked by how much of the variance each explains, so you can keep just the top few. There is a good visualisation of this here.
The clustering will actually be performed on the top rotated dimensions we find via PCA, and we will then map these back to the original pages for the final output. This also takes care of situations such as one product being viewed in every cluster: PCA will give that dimension little weight.
The code below looks for the principal components, then gives us some output to help decide how many dimensions to keep. A rule of thumb is to look for the components that together explain roughly ~85% of the variance. For the data below this was actually 35 dimensions (down from the 185 before).
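A hedged sketch of this step using base R's prcomp, with the built-in USArrests data standing in for our 185-column user/product matrix (on the real data the ~85% cut-off landed at 35 components, not the handful you get here):

```r
# Stand-in for the user x product matrix; scale so no column dominates
pca <- prcomp(USArrests, scale. = TRUE)

# Proportion of variance explained by each rotated dimension
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)

# Rule of thumb: keep enough components for roughly 85% of the variance
n_components <- which(cum_var >= 0.85)[1]

# Scree-style plot of variance per component
plot(var_explained, type = "b",
     xlab = "Principal component", ylab = "Proportion of variance")

# The rotated scores we will feed into kmeans
pca_scores <- pca$x[, 1:n_components]
```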
The plot output from the above is below. We can see the first principal component accounts for 50% of the variance, but after that the variation flattens off.
How many clusters?
How many clusters to pick for kmeans can be a subjective decision. There are other clustering models that pick for you, but some kind of decision process will depend on what you need. There are, however, ways to help inform that decision. Running the kmeans model for an increasing number of clusters, we can look at an error measure (the within-cluster sum of squares) for each. When we plot these attempts for each cluster count, we can see where the graph levels off, and use that to help with our decision:
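This "elbow" check can be sketched as below. The scaled USArrests data again stands in for the PCA scores from the previous step; the range of k and nstart are illustrative choices.

```r
set.seed(123)

# Stand-in data: the rotated PCA scores would go here in the real workflow
data_matrix <- scale(USArrests)

# Total within-cluster sum of squares for k = 1..10
wss <- sapply(1:10, function(k) {
  kmeans(data_matrix, centers = k, nstart = 20)$tot.withinss
})

# Look for the "elbow" where the curve levels off
plot(1:10, wss, type = "b",
     xlab = "Number of clusters", ylab = "Within-cluster sum of squares")
```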
The plot for determining the number of clusters is here. See the fall between 2-4 clusters; we went with 4 for this example, although a case could be made for 6:
Assessing the clusters and visualisation
I find heatmaps are a good way to assess clustering results, since they offer a good overview of the groupings. We are basically looking to see if the clusters found are different enough to make sense. This gives the following visualisation. In an interactive RStudio or Shiny session this is zoomable for finer detail, but here we just export the image:
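A minimal sketch of the idea, fitting the final model and heatmapping the cluster centres. It uses base R's static heatmap rather than the interactive one from the session, and the scaled USArrests data again stands in for the PCA scores.

```r
set.seed(123)
data_matrix <- scale(USArrests)  # stand-in for the rotated PCA scores

# Fit the final model with our chosen number of clusters
fit <- kmeans(data_matrix, centers = 4, nstart = 20)

# Heatmap of cluster centres: one row per cluster, one column per dimension.
# Distinct row patterns suggest the clusters are genuinely different.
heatmap(fit$centers, Rowv = NA, Colv = NA,
        xlab = "Dimension", ylab = "Cluster", main = "Cluster centres")
```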
KMeans  Next Steps
The next step is to take these clusters and examine the products within them, looking for patterns. This is where your domain knowledge is needed, as all we have done here is group users together based on statistics: the "why" is not in here. When I've performed this in the past, I have tried to give a named persona to each cluster type. Examples include "Big Spenders" for those who visit the payment page a lot, "Sports Freaks" who tend to only look at sports goods, etc. Again, this will largely depend on the number of clusters you have chosen, so you may want to vary that to tweak the results you are looking for.
Recommendations then follow on how to group pages: A/B tests can be performed to check whether the clustering makes an impact.
Summary
I hope the above example workflows have inspired you to try it with your own data. Both examples can be improved, for instance we took no account of the order of product views or other metrics such as time on website, but the idea was to give you a way in to try these yourselves.
I chose kmeans and Random Forests as they are two of the most popular models, but there are lots to choose from. This diagram from a Python machine learning library, scikit-learn, offers an excellent overview of how to choose which other machine learning model you may want to use for your data:
Do please let me know of any feedback, errors or what you have done with the above, I'd love to hear from you.
Good luck!