Introduction to Machine Learning with Web Analytics: Random Forests and K-Means

MeasureCamp #7

I've just come back from #MeasureCamp, where I attended some great talks: on hierarchical models, the process of analysis, a demo of Hadoop processing Adobe Analytics hits, web scraping with Python, and how machine learning will affect marketing in the future.  Unfortunately the sad part of MeasureCamp is that you also miss some excellent content when sessions clash, but that's the nature of an ad-hoc schedule.  I also got to meet some excellent analytics bods, and friends old and new.  Many thanks to all the organisers!

My sessions on machine learning

After finishing my presentation I discovered I would need to talk waaay too quickly to fit it all in, so I decided to do a session on each example I had.  The presentation is now available online here, so you can see what was intended.

I got some great feedback, as well as requests for details from people who had missed the sessions, so this blog post will try to fill in some of the detail around what we spoke about.

Session 1: Introduction, Google Analytics Data and Random Forest Example

Introduction

Machine learning gives programs the ability to learn without being explicitly programmed for a particular dataset.  Models are built from input data to create useful output, commonly predictive analytics. (Arthur Samuel via Wikipedia)

There are plenty of machine learning resources, but not many that deal with web analytics in particular.  The sessions are aimed at inspiring web analysts to use or add machine learning to their toolbox, showing two machine learning examples that detail:
  • What data to extract
  • How to process the data ready for the models
  • Running the model
  • Viewing and assessing the results 
  • Tips on how to put into production
Machine learning isn't magic.  You may be able to make a model that uses obscure features, but a lot of intuition will be lost as a result.  It's much better to have a model that uses features you can understand, and that scales up what a domain expert (e.g. you) could do if you had the time to go through all the data.

Types of Machine Learning

Machine learning models are commonly split between supervised and unsupervised learning.  We deal with an example from each:
  • Supervised: Train the model on a training set with known outcomes.  Examples include spam detection and our example today, classifying users based on what they eventually buy.  The model we use is known as Random Forests.
  • Unsupervised: Let the model find its own groupings.  Examples include the clustering of users that we do in the second example using the k-means model.

Every machine learning project needs the below elements.  They are not necessarily done in order but a successful project will need to incorporate them all:

  • Pose the question - This is the most important step.  We pose a question that our model needs to answer.  We also review this question as we work on the project, and may modify it to fit what the data can do.
  • Data preparation - This is the majority of the work.  It covers getting hold of the data, munging it so it fits the model and parsing the results.  I've tried to include some R functions below that will help with this, including getting the data from Google Analytics into R.
  • Running the model - The sexy statistics part.  Whilst superstar statistics skills are helpful for getting the best results, you can still get useful output by applying the model defaults, which is what we do today.  The important thing is to understand the methods.
  • Assessing the results - What you’ll be judged on.  You will of course have a measure of how accurate the model is, but an important step is visualising this and being able to explain the model to non-technical people.
  • How to put it into production - the ROI and business impact.  A model that just runs in your R code on your laptop may be of interest, but it is ultimately not as useful to the business as a whole unless you can recommend how to implement the model and its results in production.  Here you will probably need to talk to IT about how to call your model, or even rewrite your prototype in a more production-ready language.

Pitfalls Using Machine Learning in Web Analytics

There are some considerations when dealing with web analytics data in particular:

  • Web analytics is messy data - definitions of various metrics, such as unique users, sessions or pageviews, can vary from website to website, so a thorough understanding of what you are working with is essential.
  • Most practical analysis needs robust unique userIds - For useful, actionable output, machine learning models need to work on data recorded against useful dimensions, and for most websites that is your users.  Unfortunately that is also the most woolly definition in web analytics, given the different ways users can access a site.  Having a robust unique userID is very useful, and is what made the examples in this blog post possible.
  • Time-series techniques are the quickest way in - If you don't have unique users, then you may want to look at time-series models instead, since web analytics data is largely count data over time.  This is the reason I made GA Effect one of my first data apps, since it applies to most web analytics situations.
  • Correlating confounders - It is common for web analytics to record highly correlated metrics, e.g. PPC clicks and cost.  Watch out for these in your models as they can overweight results.
  • Self reinforcing results - Also be wary of applying models that will favour their own results.  For example, a personalisation algo that places products at the top of the page will naturally get more clicks.  To get around this, consider using weighted metrics, such as a click curve for page links.  Always test.
  • Normalise your metrics - Make sure all metrics are on the same scale, otherwise some will dominate, e.g. pageviews and bounce rate in the same model.  See the short sketch after this list.
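
As a minimal illustration of that last point (the numbers here are made up), base R's scale() puts each metric onto a common scale:

```r
## toy example: pageviews and bounce rate live on very different scales
metrics <- data.frame(pageviews  = c(1200, 45, 300, 8000),
                      bounceRate = c(0.35, 0.80, 0.55, 0.20))

## scale() centres each column and divides by its standard deviation,
## so no single metric dominates a distance-based model
scaled_metrics <- scale(metrics)
scaled_metrics
```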

The Scenario

Here is the situation the following examples are based upon.  Hopefully it will be something familiar to your own case:

You are in charge of a reward scheme website, where existing customers log in to spend their points.  You want users to spend as many points as they can, so that the points have a high perceived value.  You capture a unique userId on login in custom dimension 1 and use Google Analytics enhanced e-commerce to track which prizes users view and claim.

Notice this scenario involves a reliable user ID, since every user logs in to use the website.  This may be tricky on your own website, so you may need to work with only a subset of your users.  In my view, the data gains you can make from reliable user identification mean I try to encourage website designs that involve logged-in content as much as possible.

    Random Forests

Now we get into the first example.  Random Forests are a popular machine learning tool as they typically give good results - in Kaggle competitions they are often the benchmark to beat.
     
Random Forests are based on decision trees, and decision trees are the topic of a recent interactive visualisation on machine learning that has been doing the rounds.  It's really great, so check it out first then come back here.

    Back? Ok great, so now you know about decision trees.

Random Forests are a simple extension: a Random Forest is a collection of decision trees.  A problem with a single decision tree is that it will overfit your data, so when you throw new data at it you will get misclassifications.  It turns out, though, that if you aggregate many decision trees, each trained on a subset of your original data, all those slightly worse models add up to one robust model, meaning that when you throw new data at a Random Forest it is more likely to be a close fit.

    If you want more detail check out the very readable original paper by Breiman and Cutler and a tutorial on using it with R is here.

    Example 1: Can we predict what prizes a user will claim from their view history?

Now we are back looking at our test scenario.  We have noticed that a lot of users aren't claiming prizes despite browsing the website, and we want to see if we can encourage them to claim prizes, so that they value the points more and spend more to get them.

We want to look at users who do claim, and see what prizes they look at before they claim.  Next we will see if we can build a model to predict what a user will claim based on their view history.  In production, we will use this to e-mail prize suggestions to users who have viewed but not claimed, to see if it improves uptake.

    Fetching the data

Use your favourite Google Analytics to R library - I'm using my experimental new library, googleAnalyticsR, but it doesn't matter which; the important thing is what is being fetched.  In this example the user ID is captured in custom dimension 1, and we're pulling out the product SKU code.  This is transferable to other web analytics tools such as Adobe Analytics (perhaps via the RSiteCatalyst package).
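
Here is a sketch of the kind of fetch, written with current googleAnalyticsR syntax; the view id, date range and the exact metric names are assumptions to adapt to your own setup:

```r
library(googleAnalyticsR)

ga_auth()  # authenticate against your Google Analytics account

view_id <- 123456  # hypothetical GA view id - replace with your own

## product views per user: userId is captured in custom dimension 1,
## the product is identified by its SKU code
views <- google_analytics(view_id,
                          date_range = c("2016-01-01", "2016-03-31"),
                          metrics    = "productDetailViews",
                          dimensions = c("dimension1", "productSku"),
                          max        = -1)

## prize claims (transactions) per user - fetched in a second call
claims <- google_analytics(view_id,
                           date_range = c("2016-01-01", "2016-03-31"),
                           metrics    = "uniquePurchases",
                           dimensions = c("dimension1", "productSku"),
                           max        = -1)
```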



    Note we needed two API calls to get the views and transactions as these can't be queried in the same call.  They will be merged later.

    Transforming the data

We now need to put the data into a format that will work with Random Forests.  We need a matrix of predictors to feed into the model and one column of response showing the desired output labels, reshaped so there is one row per user:
Here is some R code to "widen" the data into this format.  We then split the data set randomly: 75% for training, 25% for testing.
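
A sketch of that reshaping, assuming the views and claims data frames fetched above (the column names follow the earlier query and are assumptions):

```r
library(dplyr)
library(tidyr)

## one row per user, one column of view counts per product SKU
wide_views <- views %>%
  group_by(dimension1, productSku) %>%
  summarise(viewed = sum(productDetailViews)) %>%
  spread(productSku, viewed, fill = 0) %>%
  ungroup()

## the response label: the product each user claimed
response <- claims %>%
  filter(uniquePurchases > 0) %>%
  select(dimension1, claimed = productSku)

## predictors + response, dropping the userId itself
model_data <- wide_views %>%
  inner_join(response, by = "dimension1") %>%
  select(-dimension1) %>%
  mutate(claimed = as.factor(claimed))

## make the SKU column names syntactically valid for the model formula
names(model_data) <- make.names(names(model_data))

## random 75% / 25% train-test split
set.seed(1234)
train_idx <- sample(nrow(model_data), size = floor(0.75 * nrow(model_data)))
train <- model_data[train_idx, ]
test  <- model_data[-train_idx, ]
```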

    Running RandomForest and assessing the results

We now run the model - this can take a long time with lots of dimensions (it can be much improved by using PCA for dimension reduction, see later).  We then test the model on the test data and get an accuracy figure:
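
A minimal sketch of that run, using the randomForest package defaults on the train and test sets built above:

```r
library(randomForest)

## drop any claim labels that don't appear in the training data
train$claimed <- droplevels(train$claimed)

## train a Random Forest classifier with default settings
rf_model <- randomForest(claimed ~ ., data = train)

## predict the claimed product for the held-out test users
predictions <- predict(rf_model, newdata = test)

## accuracy: proportion of test users whose claim was predicted correctly
accuracy <- mean(as.character(predictions) == as.character(test$claimed))
accuracy
```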


On my example test set I got ~70% accuracy on this initial run, which is not bad, but it is possible to get up to 90-95% with some tweaking.  Anyhow, let's plot the test vs predicted product frequencies to see how it looks:
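
One way to make that comparison, sketched with base R plotting:

```r
## how often each product was actually claimed vs how often we predicted it
actual_freq    <- table(test$claimed)
predicted_freq <- table(factor(predictions, levels = levels(test$claimed)))

plot(as.numeric(actual_freq), as.numeric(predicted_freq),
     xlab = "Actual claim frequency",
     ylab = "Predicted claim frequency",
     main = "Test vs predicted product frequencies")
abline(0, 1, lty = 2)  # perfect predictions would sit on this line
```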


This outputs the plot below.  It can be seen that the ~70% accurate model predicted many products well, but a lot of the error came from one large outlier.  Examining the data, this product SKU was for a cash-only prize.  A next step would be to look at how to deal with this product in particular, since eliminating it improves accuracy to ~85% in one swoop.
     

    Next steps for the RandomForest

There I stop, but there are lots of next steps that could be taken to make the model applicable to the business.  A non-exhaustive list is:

    • Run model on more test sets
    • Train model on more data
    • Try reducing number of parameters (see PCA later)
    • Examine large error outliers 
    • Compare with simple models (last/first product viewed?) - complicated is not always best!
    • Run model against users who have viewed but not yet claimed
    • Run email campaign with control and model results for final judgement

I hope the above inspires you to try it yourself.

    Session 2: K-means, Principal Component Analysis and Summary

    Example 2: Can we cluster users based on their view product behaviour?

    Now we look at k-means clustering.  The questions we are trying to answer are something like this:

    Do we have suitable prize categories on the website? How do our website categories compare to user behaviour?

We hope the k-means clustering will give us data to help with decisions on how the website is organised.

For this we will use the same data as we used before for Random Forests, with one minor change: as k-means is an unsupervised model, we take off our product labels:
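
For example, starting from the model_data frame built earlier, we simply drop the claimed label so only the view counts remain:

```r
library(dplyr)

## k-means is unsupervised, so remove the response label and keep only view counts
kmeans_data <- model_data %>% select(-claimed)
```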

A lot of this example is inspired by this nice beginners' walk-through on K-means with R.

    Introduction to K-means clustering


    This video tutorial on k-means explains it well:



The above is an example in two dimensions, but k-means can apply to many more dimensions than that; we just can't visualise them easily.  In our case we have 185 products whose views will each serve as a dimension.  However, problems with that many dimensions include long processing times alongside a danger of over-fitting the data, so we now look at PCA.

    Principal Component Analysis (PCA)

We perform Principal Component Analysis (PCA) to see if there are important products that dominate the model - this could have been applied to the previous Random Forest example as well, and indeed a final production pipeline could feed the output of one model (such as k-means) into another (such as Random Forests).

PCA rotates the dimensions so that as much of the variance as possible is captured in as few dimensions as possible, then ranks the new components by the amount of variance they explain.  There is a good visualisation of this here.

The clustering will actually be performed on the top rotated dimensions we find via PCA, and we will then map these back to the original products for the final output.  This also takes care of situations such as one product being viewed by almost every user: because that dimension explains little variance, PCA will down-weight it.

The code below looks for the principal components, then gives us some output to help decide how many dimensions to keep.  A rule of thumb is to look for the components that together explain roughly ~85% of the variance.  For the data below this was 35 dimensions (reduced from the 185 before).
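
A sketch of that step with base R's prcomp(); it assumes the kmeans_data frame from above and that no product column has zero variance (drop any that do before scaling):

```r
## principal component analysis on the scaled product-view matrix
pca <- prcomp(kmeans_data, scale. = TRUE)

## proportion of variance explained by each component
variance <- pca$sdev^2 / sum(pca$sdev^2)

## how many components do we need to reach ~85% of the variance?
num_components <- which(cumsum(variance) >= 0.85)[1]
num_components

## scree plot of variance explained per component
plot(variance, type = "b",
     xlab = "Principal component",
     ylab = "Proportion of variance explained")
```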



The plot output from the above is below.  We can see the first principal component accounts for 50% of the variance, but the variance explained then flattens off.


    How many clusters?

How many clusters to pick for k-means can be a subjective choice.  There are other clustering models that pick for you, but some kind of decision will still depend on what you need.  There are, however, ways to help inform that decision.

Running the k-means model for an increasing number of clusters, we can look at an error measure (the total within-cluster sum of squares) for each run.  When we plot these attempts against the number of clusters, we can see where the curve levels off and use that to help with our decision:
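
A sketch of that loop, clustering on the PCA scores from the previous step:

```r
## project users onto the top principal components found earlier
pca_data <- pca$x[, 1:num_components]

## total within-cluster sum of squares for 1 to 15 clusters
wss <- sapply(1:15, function(k) {
  kmeans(pca_data, centers = k, nstart = 25)$tot.withinss
})

plot(1:15, wss, type = "b",
     xlab = "Number of clusters",
     ylab = "Total within-cluster sum of squares")
```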


The plot for determining the clusters is below - see the fall between 2 and 4 clusters.  We went with 4 for this example, although a case could be made for 6:


    Assessing the clusters and visualisation

I find heatmaps are a good way to assess clustering results, since they give a quick overview of the groupings.  We are basically looking to see if the clusters found are different enough to make sense.
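
A sketch of the final run and heatmap; it assumes the PCA objects from earlier, uses the 4 clusters chosen above, and maps the cluster centres back onto the original product columns:

```r
## run k-means with the chosen number of clusters
set.seed(1234)
clusters <- kmeans(pca_data, centers = 4, nstart = 25)

## map the cluster centres from PCA space back to the original (scaled) product columns
centres_products <- clusters$centers %*% t(pca$rotation[, 1:num_components])

## heatmap: rows = clusters, columns = products
heatmap(centres_products,
        Rowv = NA,
        xlab = "Product SKU", ylab = "Cluster",
        main = "Relative product views per cluster")
```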


    This gives the following visualisation.  In an interactive RStudio or Shiny session, this is zoomable for finer detail, but here we just exported the image:

    From the heatmap we can see that each cluster does have distinctly different product views.

    K-Means - Next Steps

The next step is to take these clusters and examine the products within them, looking for patterns.  This is where your domain knowledge is needed, as all we have done here is group users together based on statistics - the "why" is not in here.  When I've performed this in the past, I try to give a named persona to each cluster type.  Examples include "Big Spenders" for those who visit the payment page a lot, "Sport Freaks" for those who tend to only look at sports goods, etc.  Again, this will largely depend on the number of clusters you have chosen, so you may want to vary that number to tweak the results you are looking for.

Recommendations on how to group pages can then follow: A/B tests can be performed to check whether the new grouping makes an impact.

    Summary

I hope the above example workflows have inspired you to try this with your own data.  Both examples can be improved - for instance we took no account of the order of product views, or of other metrics such as time on website - but the idea was to give you a way in to try these yourself.

I chose k-means and Random Forests as they are two of the most popular models, but there are lots to choose from.  This diagram from the Python machine learning library scikit-learn offers an excellent overview of how to choose which other machine learning model you may want to use for your data:

All in all, I hope some of the mystery around machine learning has been removed, and that you can see how it can be applied to your work.  If you are interested in really getting to grips with machine learning, the Coursera course was excellent and is what set me on my way.

    Do please let me know of any feedback, errors or what you have done with the above, I'd love to hear from you.

    Good luck!
    9 responses
    Although I understand nothing I am very impressed xx
    Thanks Mum!
    I understand it and am also impressed :)
    Thanks David! (I'd like to point out to others that I am no relation to Mr. Cottrell) ;)
    The information in the blog is so dense, which is a complement. I personally love business-y style and the data is fantastic I enjoyed reading it. It’s a great read.
Thanks Volodynyr, it's comments like yours that encourage me to keep publishing. :)
    Hi Mark, Thanks for the great article! I enjoyed reading it very much. I am particularly interested in the sentence 'Time-series techniques are quickest way in', although it can deserve its own blogpost, but do you think you could give some more hints on what some concrete and usable problem definitions for such cases would be where no userID is available? GA Effect is inspiring but what more could one think of in terms of day-to-day usage of machine learning for time-series based problems based on GA data in your view? Thanks again!
Hi Arman, thanks for the comment :) Yes, time-series analysis is a big area that more qualified people than me can talk on, and it is older than ML, so there are more techniques available that, if they were new today, would perhaps come under the same umbrella. You can put time-series data into ML algos with, say, the timestamp as a row, but you may come up with similar results as you would using more traditional techniques. Examples I have used on time-series data in the past are correlations between, say, TV spend and online traffic, media mix modelling using multilinear regression, forecasting via ARIMA, seasonality identification when analysing today's trends, etc. There is still a lot that can be done without a userId, but I find having that ID gives more exciting possibilities at the moment :)