BigQuery Visualiser Shiny app now free and open sourced

A few weeks ago I tweeted a beta version of a BigQuery Visualiser Shiny app that was well received, and got some valuable feedback on how it could be improved, in particular from @felipehoffa - thanks Felipe!

Here is a screenshot of the app:

Motivation

The idea of the app is to enhance the standard BigQuery interface with plots of the data you query.  It uses ggplot2, the popular R plotting library; d3heatmap, an interface to a d3.js heatmap library; and timelyportfolio's listviewer, a nice library for viewing all the BigQuery metadata in a collapsible tree.   Other visualisations can be added fairly easily and will be over time, but if you have a request for something in particular you can raise an issue on the project's Github page.

I got into BigQuery once it started to receive exports from Google Analytics Premium. Since these exports carry unsampled raw data and include unique user IDs, it's a richer data source for analysis than the Google Analytics reporting API.

It also was a chance to create another Google API library called bigQueryR, the newest member of the googleAuthR family.  Using googleAuthR meant Shiny support, and also meant bigQueryR can be used alongside googleAnalyticsR and searchConsoleR under one shared login flow.  This is something exploited in this demo of RMarkdown, which pulls data from all three sources into a scheduled report.

Running your own BigQuery Visualiser

All set-up instructions are listed on the BigQuery Visualiser's Github project page.

You can run the Shiny app locally on your computer within RStudio; within your own company intranet if it's running Shiny Server; or publicly, like the original app, on shinyapps.io.

Feedback

Please let me know what else could be improved.

I have one pending issue on using JSON uploads for authentication, which is waiting on a bug fix in httr, the underlying library.

In particular, more of the htmlwidgets packages could be added - this wonderful R framework creates an R interface to JavaScript libraries such as d3.js, which hold some of the nicest visualisations on the web.

In this first release, I favoured plots that could apply to as many different data sets as possible.  For your own use cases you can be more restrictive about what data is requested, and so perhaps more ambitious in the plots.  If you want inspiration, timelyportfolio (he who wrote the listviewer library) has a blog where he builds lots of htmlwidgets libraries.

Enjoy!  I hope it's of use - let me know if you build something cool with it.

Introduction to Machine Learning with Web Analytics: Random Forests and K-Means

MeasureCamp #7

I've just come back from #MeasureCamp, where I attended some great talks: on hierarchical models; the process of analysis; a demo of Hadoop processing Adobe Analytics hits; web scraping with Python and how machine learning will affect marketing in the future.  Unfortunately the sad part of MeasureCamp is you also miss some excellent content when they clash, but that's the nature of an ad-hoc schedule.  I also got to meet some excellent analytics bods and friends old and new.  Many thanks to all the organisers!

My sessions on machine learning

After finishing my presentation I discovered I would need to talk waaay too quickly to fit it all in, so I decided to do a session on each example I had.  The presentation is now available online here, so you can see what was intended.

I got some great feedback, as well as requests for details from people who had missed the session, so this blog post will try to fill in some of the detail around the presentation we discussed in the sessions.

Session 1: Introduction, Google Analytics Data and Random Forest Example

Introduction

Machine learning gives programs the ability to learn without being explicitly programmed for a particular dataset.  Models are built from input data to create useful output, commonly predictions.

There are plenty of machine learning resources, but not many that deal with web analytics in particular.  The sessions are aimed at inspiring web analysts to use or add machine learning to their toolbox, showing two machine learning examples that detail:
  • What data to extract
  • How to process the data ready for the models
  • Running the model
  • Viewing and assessing the results 
  • Tips on how to put into production
Machine learning isn't magic.  You may be able to make a model that uses obscure features, but a lot of intuition will be lost as a result.  It's much better to have a model that uses features you can understand, and that scales up what a domain expert (e.g. you) could do if you had the time to go through all the data.

Types of Machine Learning

Machine learning models are commonly split between supervised and unsupervised learning.  We deal with an example from each:
  • Supervised: Train the model on data with known outcomes.  Examples include spam detection and our example today, classifying users based on what they eventually buy.  The model we use is known as Random Forests.
  • Unsupervised: Let the model find its own structure in the data.  Examples include the clustering of users that we do in the second example, using the k-means model.

Every machine learning project needs the below elements.  They are not necessarily done in order but a successful project will need to incorporate them all:

  • Pose the question - This is the most important.  We pose a question that our model needs to answer.  We also review this question and may modify it to try and fit what the data can do as we work on the project.
  • Data preparation - This is the majority of work.  It covers getting hold of the data, munging it so it fits the model and parsing the results.  I've tried to include some R functions below that will help with this, including getting the data from Google Analytics into R.
  • Running the model - The sexy statistics part.  Whilst superstar statistics skills are helpful to get the best results, you can still get useful output when applying model defaults, which is what we do today.  The important thing is to understand the methods.
  • Assessing the results - What you’ll be judged on.  You will of course have a measure of how accurate the model is, but an important step is visualising this and being able to explain the model to non-technical people.
  • How to put it into production - the ROI and business impact.  A model that just runs in your R code on your laptop may be of interest, but it is ultimately not as useful for the business as a whole unless you can recommend how to implement the model and its results in production.  Here you will probably need to talk to IT about how to call your model, or even rewrite your prototype in a more production-level language.

Pitfalls Using Machine Learning in Web Analytics

There are some considerations when dealing with web analytics data in particular:

  • Web analytics is messy data - definitions of metrics such as unique users, sessions or pageviews can vary from website to website, so a thorough understanding of what you are working with is essential.
  • Most practical analysis needs robust unique userIds - For useful actionable output, machine learning models need to work on data that record useful dimensions, and for most websites that is your users.  Unfortunately that is also the definition that is the most woolly in web analytics given the nature of different access points.  Having a robust unique userID is very useful and made the examples in this blog post possible.
  • Time-series techniques are quickest way in - If you don't have unique users, then you may want to look at time-series models instead, since web analytics is also a lot of count data over time.  This is the reason I did GA Effect as one of my first data apps, since it could apply to most situations of web analytics.
  • Correlating confounders - It is common for web analytics to record highly correlated metrics, e.g. PPC clicks and cost.  Watch out for these in your models as they can overweight results.
  • Self reinforcing results - Also be wary of applying models that will favour their own results.  For example, a personalisation algo that places products at the top of the page will naturally get more clicks.  To get around this, consider using weighted metrics, such as a click curve for page links.  Always test.
  • Normalise your data -  Make sure all metrics are on the same scale, otherwise some will dominate, e.g. pageviews and bounce rate in the same model.

The Scenario

Here is the situation the following examples are based upon.  Hopefully it will be something familiar to your own case:

You are in charge of a reward scheme website, where existing customers log in to spend their points.  You want users to spend as many points as they can, so they have high perceived value.  You capture a unique userId on login into custom dimension1 and use Google Analytics enhanced e-commerce to track which prizes users view and claim.

Notice this scenario relies on a dependable user ID, since every user logs in to use the website. This may be tricky on your own website, so you may need to work with only a subset of your users.  In my view, the data gains you can make from reliable user identification mean I try to encourage website designs that involve logged-in content as much as possible.

    Random Forests

    Now we get into the first example.  Random Forests are a popular machine learning tool as they typically give good results - in Kaggle competitions they are often the benchmark to beat.
     
    Random Forests are based on decision trees, and decision trees are the topic of a recent interactive visualisation of machine learning that has been doing the rounds.  It's really great, so check it out first then come back here.

    Back? Ok great, so now you know about decision trees.

    Random Forests are a simple extension: a collection of decision trees is a Random Forest.  A problem with decision trees is that they will overfit your data - when you throw new data at one you will get misclassifications.  It turns out, though, that if you aggregate many decision trees each built on a subset of your original data, all those slightly worse models added up make one robust model, meaning that when you throw new data at a Random Forest it is more likely to be a close fit.

    If you want more detail check out the very readable original paper by Breiman and Cutler and a tutorial on using it with R is here.

    Example 1: Can we predict what prizes a user will claim from their view history?

    Now we are back looking at our test scenario.  We have noticed that a lot of users aren't claiming prizes despite browsing the website, and we want to see if we can encourage them to claim, so they value the points more and spend more to get them.

    We want to look at users who do claim, and see what prizes they look at before they claim.  Next we will see if we can build a model to predict what a user will claim based on their view history.  In production, we could use this to e-mail prize suggestions to users who have viewed but not claimed, to see if it improves uptake.

    Fetching the data

    Use your favourite Google Analytics to R library - I'm using my experimental new library, googleAnalyticsR, but it doesn't matter which; the important thing is what is being fetched.  In this example the user ID is being captured in custom dimension 1, and we're pulling out the product SKU code.  This is transferable to other web analytics tools such as Adobe Analytics (perhaps via the RSiteCatalyst package).
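    Here is a rough sketch of the two fetches, assuming a hypothetical View ID and date range.  googleAnalyticsR's argument names have changed between versions, so adapt this to the version you have installed:

        library(googleAnalyticsR)
        ga_auth()

        view_id <- 123456   # hypothetical GA View ID

        # product views per user (custom dimension 1) and product SKU
        views <- google_analytics(view_id,
                                  date_range = c("2016-01-01", "2016-03-31"),
                                  metrics    = "productDetailViews",
                                  dimensions = c("dimension1", "productSku"),
                                  max        = -1)

        # product claims (purchases) per user and product SKU
        claims <- google_analytics(view_id,
                                   date_range = c("2016-01-01", "2016-03-31"),
                                   metrics    = "uniquePurchases",
                                   dimensions = c("dimension1", "productSku"),
                                   max        = -1)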



    Note we needed two API calls to get the views and transactions as these can't be queried in the same call.  They will be merged later.

    Transforming the data

    We now need to put the data into a format that will work with Random Forests.  We need a matrix of predictors to feed into the model, one column of response showing the desired output labels, and we split it so it is one row per user action:
    Here is some R code to "widen" the data to get this format. We then split the data set randomly 75% for training, 25% for testing.
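    A hedged sketch of that reshaping, carrying on from the hypothetical fetch above (reshape2's dcast does the widening):

        library(reshape2)

        # one row per user, one column per product SKU, cells = number of views
        wide <- dcast(views, dimension1 ~ productSku,
                      value.var = "productDetailViews",
                      fun.aggregate = sum, fill = 0)

        # the response column: the SKU each user eventually claimed
        claimed <- claims[, c("dimension1", "productSku")]
        names(claimed)[2] <- "claimedSku"

        model_data <- merge(wide, claimed, by = "dimension1")

        # random 75/25 train/test split
        set.seed(123)
        train_index <- sample(nrow(model_data), 0.75 * nrow(model_data))
        train <- model_data[train_index, ]
        test  <- model_data[-train_index, ]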

    Running RandomForest and assessing the results

    We now run the model - this can take a long time for lots of dimensions (this can be much improved using PCA for dimension reduction, see later).  We then test the model on the test data, and get an accuracy figure:
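    A minimal sketch with the randomForest package, using the hypothetical train/test split above (the original code was a little more involved):

        library(randomForest)

        # everything except the ID and the label is a predictor
        # (you may need make.names() on the SKU column names first)
        predictors <- setdiff(names(train), c("dimension1", "claimedSku"))

        rf <- randomForest(x = train[, predictors],
                           y = as.factor(train$claimedSku),
                           ntree = 500)

        # predict the claimed SKU for the held-out users and measure accuracy
        preds    <- predict(rf, test[, predictors])
        accuracy <- mean(preds == test$claimedSku)
        accuracy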


    On my example test set I got ~70% accuracy on this initial run, which is not bad, but it is possible to get up to 90-95% with some tweaking.  Anyhow, let's plot the test vs predicted product frequencies, to see how it looks:
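    Something along these lines produces the comparison plot (a sketch using ggplot2, continuing the names above):

        library(ggplot2)

        test_freq <- as.data.frame(table(sku = test$claimedSku))
        pred_freq <- as.data.frame(table(sku = preds))
        compare   <- merge(test_freq, pred_freq, by = "sku",
                           suffixes = c(".test", ".predicted"), all = TRUE)

        ggplot(compare, aes(x = Freq.test, y = Freq.predicted)) +
          geom_point() +
          geom_abline(slope = 1, intercept = 0, colour = "red") +
          labs(x = "Test frequency", y = "Predicted frequency")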


    The result is the plot below.  It can be seen that in general the ~70%-accuracy model predicted many products well, but a lot of the error came from one large outlier.  Examining the data, this product SKU was for a cash-only prize.  A next step would be to look at how to deal with this product in particular, since eliminating it improves accuracy to ~85% in one swoop.
     

    Next steps for the RandomForest

    I stop there, but there are lots of next steps that could be taken to make the model applicable to the business.  A non-exhaustive list:

    • Run model on more test sets
    • Train model on more data
    • Try reducing number of parameters (see PCA later)
    • Examine large error outliers 
    • Compare with simple models (last/first product viewed?) - complicated is not always best!
    • Run the model against users who have viewed but not claimed yet
    • Run email campaign with control and model results for final judgement

    It is hoped the above inspired you to try it yourself.

    Session 2: K-means, Principal Component Analysis and Summary

    Example 2: Can we cluster users based on their view product behaviour?

    Now we look at k-means clustering.  The questions we are trying to answer are something like this:

    Do we have suitable prize categories on the website? How do our website categories compare to user behaviour?

    We hope the k-means clustering will give us data to help with decisions on how the website is organised.

    For this we will use the same data as we used before for Random Forests, with some minor changes: as k-means is an unsupervised model we will take off our product labels:

    A lot of this example is inspired by this nice beginners walk-through on K-means with R.

    Introduction to K-means clustering


    This video tutorial on k-means explains it well:



    The above is an example with two dimensions, but k-means can apply to many more dimensions than that, we just can't visualise them easily. In our case we have 185 product views that will each serve as a dimension.  However, problems with that many dimensions include long processing time alongside dangers of over-fitting the data, so we now look at PCA.

    Principal Component Analysis (PCA)

    We perform Principal Component Analysis (PCA) to see if there are important products that dominate the model - this could have been applied to the previous Random Forest example as well, and indeed a final production model could feed the output of one model, such as k-means, into Random Forests.

    PCA rotates the data into new dimensions (principal components), then ranks those components by how much of the variance they explain.  There is a good visualisation of this here.

    The clustering we will do will actually be performed on the top rotated dimensions we find via PCA, and we will then map these back to the original pages for final output. This also takes care of situations such as if one product is always viewed in every cluster: PCA will minimize this dimension.

    The code below looks for the principal components, then gives us some output to help decide how many dimensions to keep.  A rule of thumb is to look for the components that together explain roughly ~85% of the variance.  For the data below this was actually 35 dimensions (reduced from the 185 before).
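    A sketch of that PCA step with base R's prcomp, assuming the same widened view data as before (columns with zero variance need dropping before scaling):

        # unsupervised: keep only the product view counts, no ID or label columns
        pca_data <- model_data[, setdiff(names(model_data), c("dimension1", "claimedSku"))]
        pca_data <- pca_data[, apply(pca_data, 2, var) > 0]   # drop zero-variance columns

        pca <- prcomp(pca_data, scale. = TRUE)

        # cumulative proportion of variance explained - how many components reach ~85%?
        cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
        which(cum_var >= 0.85)[1]

        plot(pca, type = "l")   # scree plot of the variance per component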



    The plot output from the above is below.  We can see the first principal component accounts for 50% of the variance, but then the variation is flattish.


    How many clusters?

    How many clusters to pick for k-means can be a subjective experience.  There are other clustering models that pick for you, but some kind of decision process will be dependent on what you need.  There are however ways to help inform that decision.

    Running the k-means modelling for an increasing number of clusters, we can look at an error measure (the within-group sum of squares) for each solution.  When we plot these attempts for each cluster iteration, we can see how the graph changes or levels off at various cluster sizes, and use that to help with our decision:
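    A common sketch of this "elbow" check, run on the principal component scores from the step above (35 components kept in my case):

        pca_scores <- pca$x[, 1:35]

        # total within-group sum of squares for k = 1..15 clusters
        wss <- sapply(1:15, function(k) {
          kmeans(pca_scores, centers = k, nstart = 20)$tot.withinss
        })

        plot(1:15, wss, type = "b",
             xlab = "Number of clusters",
             ylab = "Within-groups sum of squares")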


    The plot for determining the clusters is here - see the fall between 2-4 clusters.  We went with 4 for this example, although a case could be made for 6:


    Assessing the clusters and visualisation

    I find heatmaps are a good way to assess clustering results, since they give a good overview of the groupings.  We are basically looking to see if the clusters found are different enough to make sense.
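    A sketch of that step: fit k-means on the PCA scores, map the cluster centres back onto the (scaled) product dimensions, then plot.  d3heatmap gives the zoomable version mentioned below; base heatmap works too:

        set.seed(123)
        fit <- kmeans(pca_scores, centers = 4, nstart = 20)

        # approximate the cluster centres back in the original (scaled) product space
        centres <- fit$centers %*% t(pca$rotation[, 1:35])

        library(d3heatmap)
        d3heatmap(centres, scale = "column")   # zoomable in RStudio/Shiny

        # or, in base R:
        heatmap(centres, scale = "column")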


    This gives the following visualisation.  In an interactive RStudio or Shiny session, this is zoomable for finer detail, but here we just exported the image:

    From the heatmap we can see that each cluster does have distinctly different product views.

    K-Means - Next Steps

    The next step is to take these clusters and examine the products within them, looking for patterns.  This is where your domain knowledge is needed, as all we have done here is group things together based on statistics - the "why" is not in here.  When I've performed this in the past, I try to give a named persona to each cluster type.  Examples include "Big Spenders" for those who visit the payment page a lot, "Sport Freaks" who tend to only look at sports goods, etc.  Again, this will largely depend on the number of clusters you have chosen, so you may want to vary this to tweak the results you are looking for.

    Recommendations on how to group pages can then follow: A/B tests can be performed to check whether the clustering makes an impact.

    Summary

    I hope the above example workflows have inspired you to try it with your own data.  Both examples can be improved, for instance we took no account of the order of product views or other metrics such as time on website, but the idea was to give you a way in to try these yourselves.

    I chose k-means and Random Forests as they are two of the most popular models, but there are lots to choose from.  This diagram from a python machine learning library, scikit-learn, offers an excellent overview on how to choose which other machine learning model you may want to use for your data:

    All in all I hope some of the mystery around machine learning has been taken out, and how it can be applied to your work.  If you are interested in really getting to grips with machine learning, the Coursera course was excellent and what set me on my way.

    Do please let me know of any feedback, errors or what you have done with the above, I'd love to hear from you.

    Good luck!

    Google API Client Library for R: googleAuthR v0.1.0 now available on CRAN

    One of the problems with working with Google APIs is that quite often the hardest bit, authentication, comes right at the start.  This presents a big hurdle for those who want to work with them, it certainly delayed me.  In particular having Google authentication work with Shiny is problematic, as the token itself needs to be reactive and only applicable to the user who is authenticating.

    But no longer! googleAuthR provides helper functions to make it easy to work with Google APIs.  And it's now available on CRAN (my first CRAN package!) so you can install it easily by typing:

    > install.packages("googleAuthR")

    It should then load and you can get started by looking at the readme files on Github or typing:

    > vignette("googleAuthR")

    After my experiences making shinyga and searchConsoleR, I decided inventing the authentication wheel each time wasn't necessary, so worked on this new R package that smooths out this pain point.

    googleAuthR provides easy authentication within R or within a Shiny app for Google APIs.  It provides a function factory you can use to generate your own functions that call the API actions you need.

    At the last count there were 83 Google APIs, many of which have no R library, so hopefully this package can help with that.  Examples include the Google Prediction API, the YouTube Analytics API, the Gmail API, etc.

    Example using googleAuthR

    Here is an example of making a goo.gl R package using googleAuthR:
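    The gist is embedded in the original post; as a hedged reconstruction, the pattern with googleAuthR's function factory looks roughly like this (the goo.gl endpoint and scope are real, but treat the details as a sketch):

        library(googleAuthR)

        # the API scope the generated function will request
        options(googleAuthR.scopes.selected = "https://www.googleapis.com/auth/urlshortener")

        shorten_url <- function(url){
          f <- gar_api_generator("https://www.googleapis.com/urlshortener/v1/url",
                                 "POST",
                                 data_parse_function = function(x) x$id)
          f(the_body = list(longUrl = url))
        }

        gar_auth()                              # interactive OAuth2 dance
        shorten_url("http://markedmondson.me")  # returns the goo.gl link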

    If you then want to make this multi-user in Shiny, then you just need to use the helper functions provided:
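    Roughly, that looks like the sketch below, reusing shorten_url() from above.  The helper names here (googleAuthUI, googleAuth, with_shiny) are from a later googleAuthR release than v0.1.0, so check the current readme for the exact functions:

        library(shiny)
        library(googleAuthR)

        ui <- fluidPage(
          googleAuthUI("login"),                  # login button module
          textInput("url", "URL to shorten"),
          textOutput("short_url")
        )

        server <- function(input, output, session){
          access_token <- callModule(googleAuth, "login")   # per-user reactive token

          output$short_url <- renderText({
            req(input$url, access_token())
            # wrap the API call so it runs with this user's token
            with_shiny(shorten_url,
                       shiny_access_token = access_token(),
                       url = input$url)
          })
        }

        shinyApp(ui, server)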





    Automating Google Console search analytics data downloads with R and searchConsoleR

    Yesterday I published version 0.1 of searchConsoleR, a package that interacts with Google Search Console (formerly Google Webmaster Tools) and in particular its search analytics.

    I'm excited about the possibilities with this package, as this new improved data can now interact with the thousands of other R packages.

    If you'd like to see searchConsoleR capabilities, I have the package running an interactive demo here (very bare bones, but should demo the data well enough).

    The first application I'll talk about in this post is archiving data into a .csv file, but expect more guides to come, in particular combining this data with Google Analytics.

    Automatic search analytics data downloads

    The 90-day limit still applies to the search analytics data, so one of the first applications should be archiving that data, to enable year-on-year and month-on-month comparisons and to track the general development of your SEO rankings.

    The below R script:

    1. Downloads and installs the searchConsoleR package if it isn't installed already.
    2. Lets you set some parameters you want to download.
    3. Downloads the data via the search_analytics function.
    4. Writes it to a csv in the same folder the script is run in.
    5. The .csv file can be opened in Excel or similar.
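    In outline, the script looks something like this sketch (the site URL is a placeholder, and the exact search_analytics arguments are best checked against the package help):

        ## 1. install/load searchConsoleR
        if(!require(searchConsoleR)) install.packages("searchConsoleR")
        library(searchConsoleR)

        ## 2. parameters
        website    <- "http://www.example.com"    # placeholder
        start_date <- Sys.Date() - 93             # roughly the 90-day window
        end_date   <- Sys.Date() - 3              # search data lags a few days
        download_dimensions <- c("date", "query")

        ## authenticate (first run is interactive, then auto-refreshes)
        scr_auth()

        ## 3. download the data
        data <- search_analytics(siteURL    = website,
                                 startDate  = start_date,
                                 endDate    = end_date,
                                 dimensions = download_dimensions)

        ## 4. write it to a csv in the working directory
        write.csv(data,
                  paste0("search_analytics_", Sys.Date(), ".csv"),
                  row.names = FALSE)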

    This should give you nice juicy data.

    Considerations

    The first time, you will need to run scr_auth() yourself so you can give the package access, but afterwards it will auto-refresh the authentication each time you run the script.

    If you ever need a new user to be authenticated, run scr_auth(new_user=TRUE)

    You may want to modify the script so it appends to a file instead of making a daily dump, although I do this with a folder of .csv files and import them all into one R dataframe (which you could export again to one big .csv).
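    That folder-of-csvs import is a one-liner in spirit - a sketch, assuming the daily files sit in one directory:

        files    <- list.files("scr-data", pattern = "\\.csv$", full.names = TRUE)
        all_data <- do.call(rbind, lapply(files, read.csv, stringsAsFactors = FALSE))

        # optionally export the combined archive as one big .csv
        write.csv(all_data, "search_analytics_archive.csv", row.names = FALSE)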

    Automation

    You can now take the download script and use it in automated batch files, to run daily.

    In Windows, this can be done like this (from StackOverflow):

    • Open the scheduler: START -> All Programs -> Accessories -> System Tools -> Scheduler
    • Create a new Task
    • under tab Action, create a new action
    • choose Start Program
    • browse to Rscript.exe which should be placed e.g. here:
      "C:\Program Files\R\R-3.2.0\bin\x64\Rscript.exe"
    • input the name of your file in the parameters field
    • input the path where the script is to be found in the Start in field
    • go to the Triggers tab
    • create new trigger
    • choose that task should be done each day, month, ... repeated several times, or whatever you like

    In Linux, you can probably work it out yourself :)

    Conclusion

    Hopefully this shows how with a few lines of R you can get access to this data set.  I'll be doing more posts in the future using this package, so if you have any feedback let me know and I may be able to post about it.  If you find any bugs or features you would like, please also report an issue on the searchConsoleR issues page on Github.

    Enhance Your Google Analytics Data with R and Shiny (Free Online Dashboard Template)

    Introduction

    The aim of this post is to give you the tools to enhance your Google Analytics data with R and present it on-line using Shiny.  By following the steps below, you should have your own on-line GA dashboard, with these features:

    • Interactive trend graphs.

    • Auto-updating Google Analytics data.

    • Zoomable day-of-week heatmaps.

    • Top Level Trends via Year on Year, Month on Month and Last Month vs Month Last Year data modules.

    • A MySQL connection for data blending your own data with GA data.

    • An easy upload option to update a MySQL database.

    • Analysis of the impact of marketing events via Google's CausalImpact.

    • Detection of unusual time-points using Twitter's Anomaly Detection.

    A lot of these features are either unavailable in the normal GA reports, or only possible in Google Analytics Premium.  Under the hood, the dashboard is exporting the data via the Google Analytics Reporting API, transforming it with various R statistical packages and then publishing it on-line via Shiny.

    A live demo of the dashboard template is available on my Shinyapps.io account with dummy GA data, and all the code used is on Github here.

    Feature Detail

    Here are some details on what modules are within the dashboard.  A quick start guide on how to get the dashboard running with your own data is at the bottom.

    Trend Graph

    Most dashboards feature a trend plot, so you can quickly see how you are doing over time.  The dashboard uses the dygraphs JavaScript library, which allows you to interact with the plot to zoom, pan and shift your date window.  Plot smoothing is provided at the day, week, month and annual level.

    [Screenshot: interactive trend graph]

    Additionally, the events you upload via the MySQL upload also appear here, as well as any unusual time points detected as anomalies.  You can go into greater detail on these in the Analyse section.

    Heatmap

    Heatmaps use colour intensity to show metrics between categories.  The heatmap here is split into weeks and days of the week, so you can quickly scan to see if a particular day of the week is popular - in the plot below, Monday/Tuesday look like the best days for traffic.

    [Screenshot: day-of-week heatmap]

    The data window is set by what you select in the trend graph, and you can zoom for more detail using the mouse.

    Top Level Trends

    Quite often headlines just need a number to quickly check.  These data modules give you a quick glance into how you are doing, comparing last week to the week before, last month to the month before and last month to the same month the year before.  Between them, you should see how your data is trending, accounting for seasonal variation.

    [Screenshot: top level trend modules]

    MySQL Connection

    The code provides functions to connect to a MySQL database, which you can use to blend your data with Google Analytics, provided you have a key to link them on.  

    [Screenshot: MySQL data upload interface]

    In the demo dashboard the key used is simply the date, but this can be expanded to include linking on a userID from say a CRM database to the Google Analytics CID, Transaction IDs to off-line sales data, or extra campaign information to your campaign IDs.  An interface is also provided to let end users update the database by uploading a text file.

    CausalImpact

    In the demo dashboard, the MySQL connection is used to upload Event data, which is then used to compare with the Google Analytics data to see if the event had a statistically significant impact on your traffic.  This replicates a lot of the functionality of the GA Effect dashboard.

    [Screenshot: CausalImpact event analysis]

    The headline impact of the event is shown in the summary dashboard tab.  If it's statistically significant, the impact is shown in blue.

    [Screenshot: summary tab showing event impact]

    Anomaly Detection

    Twitter has released this R package to help detect unusual time points for use within their data streams, which is also handy for Google Analytics trend data.  

    [Screenshot: anomaly detection plot]

    The annotations on the main trend plot are indicated using this package, and you can go into more detail and tweak the results in the Analyse section.

    Making the dashboard multi-user

    In this demo I’ve taken the usual use case of an internal department looking to report on one Google Analytics property, but if you would like end users to authenticate with their own Google Analytics property, it can be combined with my shinyga() package, which provides functions that enable self-authentication, similar to my GA Effect/Rollup/Meta apps.

    In production, you can publish the dashboard behind a Shinyapps authentication login (needs a paid plan), or deploy your own Shiny Server to publish the dashboard on your company intranet.

    Quick Start

    Now you have seen the features, the below goes through the process for getting this dashboard for yourself. This guide assumes you know of R and Shiny - if you don’t then start there: http://shiny.rstudio.com/

    You don’t need to have the MySQL details ready to see the app in action, it will just lack persistent storage.

    Setup the files

    1. Clone/copy-paste the scripts in the github repository to your own RStudio project.

    2. Find your GA View ID you want to pull data from.  The quickest way to find it is to login to your Google Analytics account, go to the View then look at the URL: the number after “p” is the ID.

    3. [Optional] Get your MySQL setup with a user and IP address. See next section on how this is done using Google Cloud SQL.  You will also need to white-list the IP of where your app will sit, which will be your own Shiny Server or shinyapps.io. Add your local IP for testing too. If using shinyapps.io their IPs are: 54.204.29.251; 54.204.34.9; 54.204.36.75; 54.204.37.78.

    4. Create a file called secrets.R in the same directory as the app, with the content below filled in with your details.
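    The exact contents are in the Github repository; as an illustration only (these field names are my guesses, not necessarily what the app expects), secrets.R holds something like:

        ## secrets.R - hypothetical field names, see the repository for the real ones
        options(mysql = list(
          host     = "123.456.789.012",   # your Cloud SQL IP
          port     = 3306,
          user     = "yourusername",
          password = "yourpassword",
          dbname   = "onlinegashiny"
        ))

        gaViewId <- "123456"              # the GA View ID from step 2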

    Configuring R

        1. Make sure you can install and run all the libraries needed by the app:

        2. Run the below command locally first, to store the auth token in the same folder.  You will be prompted to login with the Google account that has access to the GA View ID you put into step 3, and get a code to paste into the R console.  This will then be uploaded with app and handle the authentication with Google Analytics when in production:

            > rga::rga.open(where="token.rga")

        3. Test the app by hitting the “Run App” button at the top right of the server.R or ui.R script in RStudio, or by running:

            > shiny::runApp()

    Using the dashboard

    1. The app should now be running locally in a browser window with your own GA data.  It can take up to 30 seconds for all the data to load first time.

    2. Deploy the instance on-line to Shinyapps.io with a free account there, or to your own Shiny Server instance.

    3. Customise your instance. If for any reason you don’t want certain features, then remove the feature in the ui.R script - the data is only called when the needed plot is viewed.

    Getting a MySQL setup through Google Cloud SQL

    If you want a MySQL database to use with the app, I use Google Cloud SQL.  Setup is simple:
    1. Go to the Google API console and create a project if you need to.

    2. Make sure you have billing turned on with your billing accounts menu top right.

    3. Go to Storage > Cloud SQL in the left hand menu.

    4. Create a New Instance.

    5. Create a new Database called “onlinegashiny”

    6. Under “Access Control” you need to put in your own IP for where you test it, as well as the IPs of the Shiny Server/shinyapps.io.  If you are using shinyapps.io the IPs are: 54.204.29.251; 54.204.34.9; 54.204.36.75; 54.204.37.78

    7. Under “IP Address” create a static IP (Charged at $0.24 a day)

    8. You should now have all the access info you need to put in the app's secrets.R for MySQL access.  The port should be the default, 3306.

    9. You can also limit the amount of data that is uploaded by the shiny.maxRequestSize option - default is 0.5 MB.

    Summary

    Hopefully the above has helped inspire what can be done with your Google Analytics data.  The focus has been on giving you tools that let you take action on your data.

    There is a lot more you can do via the thousands of R packages available, but hopefully this gives a framework you can build upon.

    I’d love to see what you build with it, so do please feel free to get in touch. :)

    My new role as Google Developer Expert for Google Analytics!

    I'm very pleased and honoured to have been accepted into the Google Developer Expert program representing Google Analytics.  I should soon have my mug listed with the other GA GDEs at the Google Developer Expert website.

    My thanks go to Simo who nominated me and Linda for helping me through the application process.

    Alongside my existing work at Wunderman, my role should include some more opportunities to get out there and show what can be done with the GA APIs, so expect me at more analytics conferences soon.

    I also will get to play with some of the new betas and hopefully be able to create more cool demo apps for users to adapt and use for their own website, mostly using R Shiny and Google App Engine.

    How I made GA Effect - creating an online statistics dashboard using R

    GA Effect is a webapp that uses Bayesian structural time-series to judge whether events happening in your Google Analytics account are statistically significant.  It's been well received on Twitter, and how to use it is detailed in this guest post on Online Behaviour, but this blog post will be about how to build your own or similar.

    Update 18th March: I've made a package that holds a lot of the functions below, shinyga.  That may be easiest to work with.

    [Screenshot: GA Effect dashboard]

    What R can do

    Now is a golden time for the R community, as it gains popularity outside of its traditional academic background and hits business.  Microsoft has recently bought Revolution Analytics, an enterprise solution of R so we can expect a lot more integration with them soon, such as the machine learning in their Azure platform.

    Meanwhile RStudio are releasing more and more packages that make it quicker and easier to create interactive graphics, with tools for connecting and reshaping data and then plotting using attractive JavaScript visualisation libraries or native interactive R plots.  GA Effect is also being hosted using ShinyApps.io, an R server solution that enables you to publish straight from your console, or you can run your own server using Shiny Server.  

    Packages Used

    For the GA Effect app, the key components were these R packages:

    Putting them together

    Web Interaction

    First off, using RStudio makes this all a lot easier as they have a lot of integration with their products.

    ShinyDashboard is a custom theme of the more general Shiny.  As detailed in the getting started guide, creating a blank dashboard webpage with shinydashboard takes 8 lines of R code.  You can test or run everything locally first before publishing to the web via the “Publish” button at the top.
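    For reference, the blank shinydashboard skeleton really is that short:

        library(shiny)
        library(shinydashboard)

        ui <- dashboardPage(
          dashboardHeader(title = "My Dashboard"),
          dashboardSidebar(),
          dashboardBody()
        )

        server <- function(input, output) {}

        shinyApp(ui, server)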

    Probably the most difficult concept to get to grips with is the reactive programming used in a Shiny app.  This is effectively how the interaction occurs, and it sets up live relationships between inputs from your UI script (always called ui.R) and outputs from your server-side script (called server.R).  These are effectively your front-end and back-end in a traditional web environment.  The Shiny package takes your R code and turns it into HTML5 and JavaScript. You can also import JavaScript of your own if you need it to cover what Shiny can’t.

    The Shiny code then creates the UI for the app, and creates reactive versions of the datatables needed for the plots.

    Google Authentication

    The Google authentication flow uses OAuth2 and could be used for any Google API in the console, such as BigQuery, Gmail, Google Drive etc.  I include the code used for the authentication dance below so you can use it in your own apps:
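    The embedded gist has the Shiny-specific version; as a minimal standalone sketch of the same OAuth2 dance using httr (the client ID and secret are placeholders from your own Google API console project):

        library(httr)

        app <- oauth_app("google",
                         key    = "YOUR_CLIENT_ID.apps.googleusercontent.com",
                         secret = "YOUR_CLIENT_SECRET")

        token <- oauth2.0_token(oauth_endpoints("google"), app,
                                scope = "https://www.googleapis.com/auth/analytics.readonly")

        # the token can then be sent with any request to that API, e.g.
        GET("https://www.googleapis.com/analytics/v3/management/accounts",
            config(token = token))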

    Fetching Google Analytics Data

    Once a user has authenticated with Google, the user token is then passed to rga() to fetch the GA data, according to which metric and segment the user has selected. 

    This is done reactively, so each time you update the options a new data fetch to the API is made.  Shiny apps are on a per user basis and work in RAM, so the data is forgotten once the app closes down.
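    In outline it looks like this (a simplified, single-user sketch using rga - GA Effect itself juggles a token per Shiny session, and the input names below are hypothetical):

        library(rga)
        rga.open(instance = "ga", where = "ga.rga")   # creates the authenticated 'ga' object

        # server.R
        gadata <- reactive({
          ga$getData(input$view_id,
                     start.date = input$dates[1],
                     end.date   = input$dates[2],
                     metrics    = paste0("ga:", input$metric),
                     dimensions = "ga:date")
        })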

    Doing the Statistics

    You can now manipulate the data however you wish.  I put it through the CausalImpact package as that was the application goal, but you have a wealth of other R packages that could be used such as machine learning, text analysis, and all the other statistical packages available in the R universe.  It really is only limited by your imagination. 

    Here is a link to the CausalImpact paper, if you really want to get in-depth with the methods used.  It includes some nice examples of predicting the impact of search campaign clicks.

    Here is how CausalImpact was implemented as a function in GA Effect:
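    In essence it wraps the fetched data into a time-series and hands it to CausalImpact - a sketch, assuming a data.frame with Date-class date and visits columns plus a user-chosen event date:

        library(CausalImpact)
        library(zoo)

        do_impact <- function(gadata, event_date){
          series <- zoo(gadata$visits, order.by = gadata$date)

          pre.period  <- c(min(gadata$date), event_date - 1)
          post.period <- c(event_date, max(gadata$date))

          CausalImpact(series, pre.period = pre.period, post.period = post.period)
        }

        # impact <- do_impact(gadata(), input$event_date)   # within the Shiny server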

    Plotting

    dygraphs is an R package that takes R input and outputs the JavaScript needed to display it in your browser, and as it's made by RStudio it is also compatible with Shiny.  It is an application of htmlwidgets, which lets you take any JavaScript library and make it compatible with R code.  Here is an example of how the main result graph was generated:
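    Roughly like this (a sketch: the reactive and output names are made up, but the dygraphs calls and CausalImpact column names are real):

        library(dygraphs)

        # server.R
        output$impact_plot <- renderDygraph({
          imp    <- impact()   # the reactive CausalImpact result
          series <- imp$series[, c("response", "point.pred",
                                   "point.pred.lower", "point.pred.upper")]

          dygraph(series, main = "Expected vs Observed") %>%
            dySeries(c("point.pred.lower", "point.pred", "point.pred.upper"),
                     label = "Expected") %>%
            dySeries("response", label = "Observed") %>%
            dyRangeSelector()
        })

        # ui.R
        dygraphOutput("impact_plot")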

    Publishing

    I’ve been testing the alpha of shinyapps.io for a year now, but it is just this month (Feb 2015) coming out of beta.  If you have an account, then publishing your app is as simple as pushing “Publish” button above your script, where it appears at a public URL.  With other paid plans, you can limit access to authenticated users only.

    Next steps

    This app only took me 3 days with my baby daughter on my lap during a sick weekend, so I’m sure you can come up with similar given time and experience.  The components are all there now to make some seriously great apps for analytics.  If you make something do please let me know!

    OSX Black Screen no Login screen but with working cursor on boot [fixed]

    I'm just posting this, to maybe help others who get the same problem.

    I had an OSX 10.10.2 update on my 2011 Macbook Air, and left the laptop open last night.  This put it in Hibernation mode which breaks the auto-installation, so when I tried to use the laptop this morning, it booted to the Apple logo, but then the screen went totally black without the option to login.  The cursor was still live though.

    The fix below will let you log in again.  It will only work in the above scenario - if it's your backlight that's broken, or something else, keep searching :)

    Before the below fix I tried:

    1. Pressing the increase brightness buttons (duh)
    2. Restarting in safe mode (doesn't complete login)
    3. Resetting SMC and PRAM (pushing CTRL+OPTION+POWER+other buttons on powerup - see here: https://discussions.apple.com/docs/DOC-3603 )
    4. Letting it boot, waiting, then pushing first letter of your username, pushing enter and typing in password (the most popular fix on the web)

    But finally, the solution was found at this forum called Jamfnation via some Google-wu:

    1. Perform a PRAM reset ( Cmd+Option+P+R ) on boot – let it chime 3 times and let go
    2. Boot into Single User Mode (hold Cmd+S immediately after powering on)
    3. Verify and mount the drive - once in Single User Mode, run the following commands:
       /sbin/fsck -fy
       /sbin/mount -uw /
    4. After the disk has mounted, delete the following files:
       rm -f /Library/Preferences/com.apple.loginwindow.plist
       rm -f /var/db/.AppleUpgrade
    5. After deleting the files, restart.

    Hope it helps if you get this far.

    E-mail open rate tracking with Google Analytics' Measurement Protocol - Demo

    Edit 4th Feb 2015 - Google have published an email tracking guide with the Measurement Protocol.  The below goes a bit beyond that showing how to link the user sessions etc.

    The Measurement Protocol was launched at the same time as Universal Analytics, but I've seen less adoption of it with clients, so this post is an attempt to show what can be done with it with a practical example.

    The demo app is available here: http://ua-post-to-push.appspot.com/

    With this demo you should be able to track the following:

    1. You have an email address from an interested customer
    2. You send them an email and they look at it, but don't click through.
    3. Three days later they open the email again at home, and click through to the offer on your website.
    4. They complete the form on the page and convert.

    Within GA, you will be able to see for that campaign 2 opens, 1 click/visit and 1 conversion for that user.  As with all email open tracking, you are dependent on the user downloading the image, which is why I include the option to upload an image and not just a pixel, as it may be more enticing to allow images in your newsletter.

    Intro

    The Measurement Protocol lets you track beyond the website, without the need of client-side JavaScript.  You construct the URL and when that URL is loaded, you see the hit in your Google Analytics account.  That's it. 

    The clever bit is that you can link user sessions together via the CID (Customer ID), so you can track the upcoming Internet of Things off-line to on-line, but also things like email opens and affiliate thank you pages.  It also works with things like enhanced e-commerce, so can be used for customer refunds or product impressions. 

    This demo looks at e-mail opens for its example, but it takes only minor modifications to track other things.  For instance, I use a similar script to record in GA when my Raspberry Pi is backing up our home computers via Time Machine.
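    To make that concrete outside the App Engine demo, here is what a bare e-mail-open hit looks like when sent from R with httr (the property ID and paths are placeholders; the parameter names come from the Measurement Protocol reference):

        library(httr)

        resp <- POST("https://www.google-analytics.com/collect",
                     body = list(
                       v   = 1,                   # protocol version
                       tid = "UA-123456-1",       # your property ID (placeholder)
                       cid = "35009a79-1a05-49d7-b876-2b884d0f825b",  # the user's CID
                       t   = "pageview",          # hit type
                       dp  = "/email/open",       # virtual page path
                       cn  = "email_campaign"     # campaign name
                     ),
                     encode = "form")

        status_code(resp)   # GA returns 200 whether or not the hit was valid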

    Demo on App Engine

    To use the Measurement Protocol in production most likely needs server-side code.  I'm running a demo on Google App Engine coded in Python, which is pretty readable so should make it fairly easy for a developer to replicate in their favourite language.  App Engine is also a good choice if you are wanting to run it in production, since it has a free tier for tracking 1000s of email opens a day, but scalability to handle millions.

    This code is available on Github here - http://github.com/MarkEdmondson1234/ga-get-to-post

    App running that code is here: http://ua-post-to-push.appspot.com/

    There are instructions on Github on how it works, but I'll run through some of the key concepts here in this post.

    What the code does

    The example has four main URLs:
    • The homepage explaining the app
    • The image URL itself, that when loaded creates the hit to GA
    • A landing page with example custom GA tracking script
    • An upload image form to change the image you would display in the e-mail.
    The URLs above are controlled server side with the code in main.py

    Homepage

    This does nothing server side aside from serving up the page.



    Image URL


    This is the main point of the app - it turns a GET request for the uploaded image into a POST with the parameters found in the URL.  It handles the different options and sends the hit to GA as a virtual pageview or event, with a unique user CID and campaign name. An example URL here is:
    http://your-appengine-id.appspot.com/main.png?cid=blah&p=1&c=email_campaign



    Landing Page


    This does little but take the cid you put in the email URL and output the CID that will be used in Google Analytics.  If this is the same CID as in the image URL and the user clicks through from the email, those sessions will be linked. You can also add the GA campaign parameters, but the server-side script ignores those - the JavaScript on the page will take care of it. An example URL here is:
    http://your-appengine-id.appspot.com/landing-page?cid=blah&utm_source=source_me&utm_medium=medium_me&utm_campaign=campaign_me


    The CID in the landing page URL is then captured and turned into an anonymous CID for GA.  This is then served up to the Universal Analytics JavaScript on the landing page, shown below.  Use the same UA code for both, else it won't work (e.g. UA-123456-1)


    Upload Image

    This just handles the image uploading and serves the image up via App Engine's blobstore.  Nothing pertinent to GA here, so see the Github code if interested.

    Summary

    It's hoped this helps sell the Measurement Protocol to more developers, as it offers a solution to a lot of the problems with digital measurement today, such as attribution of users beyond the website.  The implementation is reasonably simple, but the power is in what you send and in what situations.  Hopefully this inspires what you could do with your setup.

    There are some limitations to be aware of - the CID linking won't stitch sessions together, it just discards a user's old CID if they already had one, so you may want to look at userID or how to customise the CID for users who visit your website first before the email is sent.  The best scenario would be if a user is logged in for every session, but this may not be practical.  It may be that the value of linking sessions is so advantageous in the future, entire website strategies will be focused on getting users to ID themselves, such as via social logins.

    Always consider privacy: look for users to opt in, and make sure to use GA filters to take out any PII you may put into GA as a result.  Current policy looks to be that if the data within GA cannot be traced to an individual (e.g. a name, address or email) then you are able to record an anonymous personal ID, which could be exported and linked to PII outside of GA.  This is a bit of a shifting target, but in all cases keeping it as user-focused and not profit-focused as possible should see you through any ethical questions.






    Finding the ROI of Title tag changes using Google's CausalImpact R package

    After a conversation on Twitter about this new package, and mentioning it in my recent MeasureCamp presentation, here is a quick demo on using Google's CausalImpact applied to an SEO campaign.

    CausalImpact is a package that looks to give some statistics behind changes you may have done in a marketing campaign.  It examines the time-series of data before and after an event, and gives you some idea on whether any changes were just down to random variation, or the event actually made a difference.

    You can now test this yourself in my Shiny app that automatically pulls in your Google Analytics data so that you can apply CausalImpact to it.   This way you can A/B test changes for all your marketing channels, not just SEO.  However, if you want to try it manually yourself, keep reading.

    Considerations before getting the data

    Suffice to say, it should only be applied to time-series data (e.g. there is a date or time on the x-axis), and it helps if the event was rolled out at only one of those time points.  This may influence the choice of time unit you use, so if, say, it rolled out over a week it's probably better to use weekly data exports.  Also consider the time period you choose.  The package will use the time-series before the event to construct what it thinks should have happened versus what actually happened, so if anything unusual or any spikes occur in the test period it may affect your results.

    Metrics-wise, the example here is with visits.  You could perhaps do it with conversions or revenue, but then you may be affected by factors outside of your control (the buy button breaking, etc.), so for clean results try to take out as many confounding variables as possible.

    Example with SEO Titles

    For me though, I had an example where some title tag changes went live on one day, so I could compare the SEO traffic before and after to judge if it had any effect, and, more importantly, judge by how much the traffic had increased.

    I pulled in data with my go-to GA R import library, rga by Skardhamar.

    Setup

    First the setup: importing the libraries (installing them if you haven't already) and authenticating with the GA account you want to pull data from.
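    Something like this (rga installs from Github, and CausalImpact may too, depending on when you read this):

        # install once
        # devtools::install_github("skardhamar/rga")
        # devtools::install_github("google/CausalImpact")

        library(rga)
        library(CausalImpact)

        # opens a browser to authenticate and stores the token in the 'ga' object
        rga.open(instance = "ga", where = "ga.rga")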

    Import GA data

    I then pull in the data for the time period covering the event.  SEO Visits by date.
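    A sketch of the fetch - daily organic (SEO) visits over the period, with a placeholder View ID:

        gadata <- ga$getData("ga:123456",                 # placeholder View ID
                             start.date = "2014-01-01",
                             end.date   = "2014-09-01",
                             metrics    = "ga:visits",
                             dimensions = "ga:date",
                             filters    = "ga:medium==organic")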

    Apply CausalImpact

    In this example, the title tags got updated on the 200th day of the time-period I pulled.  I want to examine what happened the next 44 days.
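    With the daily visits in hand, CausalImpact just needs the pre and post periods as row indices:

        pre.period  <- c(1, 200)     # before the title tag change
        post.period <- c(201, 244)   # the 44 days after

        impact <- CausalImpact(gadata$visits, pre.period, post.period)
        # plot(impact) and impact$report are covered in the next sections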

    Plot the Results

    With the plot() function you get output like this:

    1. The left vertical dotted line is where the estimate on what should have happened is calculated from.
    2. The right vertical dotted line is the event itself. (SEO title tag update)
    3. The original data you pulled is the top graph.
    4. The middle graph shows the estimated impact of the event per day.
    5. The bottom graph shows the estimated impact of the event overall.

    In this example it can be seen that after 44 days there is an estimated 90,000 more SEO visits from the title tag changes. This then can be used to work out the ROI over time for that change.

    Report the results

    The $report method gives you a nice overview of the statistics in a verbose form, to help qualify your results.  Here is a sample output:

    "During the post-intervention period, the response variable had an average value of approx. 94. By contrast, in the absence of an intervention, we would have expected an average response of 74. The 95% interval of this counterfactual prediction is [67, 81]. Subtracting this prediction from the observed response yields an estimate of the causal effect the intervention had on the response variable. This effect is 20 with a 95% interval of [14, 27]. For a discussion of the significance of this effect, see below.

    Summing up the individual data points during the post-intervention period (which can only sometimes be meaningfully interpreted), the response variable had an overall value of 4.16K. By contrast, had the intervention not taken place, we would have expected a sum of 3.27K. The 95% interval of this prediction is [2.96K, 3.56K].

    The above results are given in terms of absolute numbers. In relative terms, the response variable showed an increase of +27%. The 95% interval of this percentage is [+18%, +37%].

    This means that the positive effect observed during the intervention period is statistically significant and unlikely to be due to random fluctuations. It should be noted, however, that the question of whether this increase also bears substantive significance can only be answered by comparing the absolute effect (20) to the original goal of the underlying intervention.

    The probability of obtaining this effect by chance is very small (Bayesian tail-area probability p = 0.001). This means the causal effect can be considered statistically significant."

    Next steps

    This could then be repeated for things like UX changes, TV campaigns, etc. You just need the time of the event and the right metrics or KPIs to measure them against.

    The above is just a brief intro - there is a lot more that can be done with the package, including custom models etc.  For more, see the package help file and documentation.