BigQuery Visualiser Shiny app now free and open sourced

A few weeks ago I tweeted a beta version of a BigQuery Visualiser Shiny app that was well received, and got some valuable feedback on how it could be improved, in particular from @felipehoffa - thanks Felipe!

Here is a screenshot of the app:

Motivation

The idea of the app is to enhance the standard BigQuery interface with plots of the data you query.  It uses ggplot2, a popular R plotting library; d3heatmap, a library for d3.js heatmaps; and timelyportfolio's listviewer, a nice library for viewing all the BigQuery metadata in a collapsible tree.  Other visualisations can be added fairly easily and more will appear over time, but if you have a request for something in particular you can raise an issue on the project's Github page.

I got into BigQuery once it started to receive exports from Google Analytics Premium. Since these exports carry unsampled raw data and include unique userIds, it's a richer data source for analysis than the Google Analytics reporting API.

It was also a chance to create another Google API library, bigQueryR, the newest member of the googleAuthR family.  Using googleAuthR means Shiny support, and also means bigQueryR can be used alongside googleAnalyticsR and searchConsoleR under one shared login flow.  This is something exploited in this demo of RMarkdown, which pulls data from all three sources into a scheduled report.
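As a minimal sketch of that shared flow (the scopes below are assumptions - swap in whichever APIs you actually call):

library(googleAuthR)

# set the scopes for all the APIs you want to call before authenticating
options(googleAuthR.scopes.selected = c(
  "https://www.googleapis.com/auth/bigquery",
  "https://www.googleapis.com/auth/analytics.readonly",
  "https://www.googleapis.com/auth/webmasters"
))

gar_auth()  # one login; the token is then shared by bigQueryR, googleAnalyticsR and searchConsoleR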

Running your own BigQuery Visualiser

All set-up instructions are listed on the BigQuery Visualiser's Github project page.

You can run the Shiny app locally on your computer within RStudio; within your own company intranet if it's running Shiny Server; or publicly, like the original app, on shinyapps.io.

Feedback

Please let me know what else could improve. 

I have a pending issue on using JSON uploads for authentication that is waiting on a bug fix in httr, the underlying library.

In particular, more htmlwidgets packages could be added - this wonderful R framework creates an R to d3.js interface, and holds some of the nicest visualisations on the web.

In this first release, I favoured plots that could apply to as many different data sets as possible.  For your own use cases you can be more restrictive about what data is requested, and so perhaps more ambitious in the plots.  If you want inspiration, timelyportfolio (he who wrote the listviewer library) has a blog where he makes lots of htmlwidgets libraries.

Enjoy!  I hope it's of use - let me know if you build something cool with it.

Introduction to Machine Learning with Web Analytics: Random Forests and K-Means

MeasureCamp #7

I've just come back from #MeasureCamp, where I attended some great talks: on hierarchical models; the process of analysis; a demo of Hadoop processing Adobe Analytics hits; web scraping with Python and how machine learning will affect marketing in the future.  Unfortunately the sad part of MeasureCamp is you also miss some excellent content when they clash, but that's the nature of an ad-hoc schedule.  I also got to meet some excellent analytics bods and friends old and new.  Many thanks to all the organisers!

My sessions on machine learning

After finishing my presentation I discovered I would need to talk way too quickly to fit it all in, so I decided to do a session on each example I had instead.  The presentation is now available online here, so you can see what was intended.

I got some great feedback, as well as requests for details from people who had missed the sessions, so this blog post will try to fill in some of the detail around the presentation and what we spoke about in the sessions.

Session 1: Introduction, Google Analytics Data and Random Forest Example

Introduction

Machine learning gives programs the ability to learn without being explicitly programmed for a particular dataset.  They build models from input data to create useful output, commonly predictive analytics. (Arthur Samuel, via Wikipedia)

There are plenty of machine learning resources, but not many that deal with web analytics in particular.  The sessions are aimed at inspiring web analysts to use or add machine learning to their toolbox, showing two machine learning examples that detail:
  • What data to extract
  • How to process the data ready for the models
  • Running the model
  • Viewing and assessing the results 
  • Tips on how to put into production
Machine learning isn't magic.  You may be able to make a model that uses obscure features, but a lot of intuition will be lost as a result.  It's much better to have a model that uses features you can understand, and that scales up what a domain expert (e.g. you) could do if you had the time to go through all the data.

Types of Machine Learning

Machine learning models are commonly split between supervised and unsupervised learning.  We deal with an example from each:
  • Supervised: Train the model on data with known outcomes.  Examples include spam detection and our example today, classifying users based on what they eventually buy.  The model we use is known as Random Forests.
  • Unsupervised: Let the model find its own structure in the data.  Examples include the clustering of users that we do in the second example, using the k-means model.

Every machine learning project needs the below elements.  They are not necessarily done in order but a successful project will need to incorporate them all:

  • Pose the question - This is the most important step.  We pose a question that our model needs to answer, and we review this question and may modify it to better fit what the data can do as we work through the project.
  • Data preparation - This is the majority of the work.  It covers getting hold of the data, munging it so it fits the model and parsing the results.  I've tried to include some R functions below that help with this, including getting the data from Google Analytics into R.
  • Running the model - The sexy statistics part.  Whilst superstar statistics skills are helpful for getting the best results, you can still get useful output when applying the model defaults, which is what we do today.  The important thing is to understand the methods.
  • Assessing the results - What you'll be judged on.  You will of course have a measure of how accurate the model is, but an important step is visualising this and being able to explain the model to non-technical people.
  • How to put it into production - The ROI and business impact.  A model that just runs in your R code on your laptop may be of interest, but it is ultimately not as useful to the business as a whole if you cannot recommend how to implement the model and its results in production.  Here you will probably need to talk to IT about how to call your model, or even rewrite your prototype in a more production-level language.

Pitfalls Using Machine Learning in Web Analytics

There are some considerations when dealing with web analytics data in particular:

  • Web analytics is messy data - definitions of various metrics, such as unique users, sessions or pageviews, can vary from website to website, so a thorough understanding of what you are working with is essential.
  • Most practical analysis needs robust unique userIds - For useful, actionable output, machine learning models need to work on data that records useful dimensions, and for most websites that means your users.  Unfortunately that is also the definition that is the most woolly in web analytics, given the many different access points.  Having a robust unique userID is very useful, and it is what made the examples in this blog post possible.
  • Time-series techniques are quickest way in - If you don't have unique users, then you may want to look at time-series models instead, since web analytics is also a lot of count data over time.  This is the reason I did GA Effect as one of my first data apps, since it could apply to most situations of web analytics.
  • Correlating confounders - It can be common for web analytics to be recording highly correlating metrics e.g. PPC clicks and cost.  Watch out for these in your models as they can overweight results.
  • Self reinforcing results - Also be wary of applying models that will favour their own results.  For example, a personalisation algo that places products at the top of the page will naturally get more clicks.  To get around this, consider using weighted metrics, such as a click curve for page links.  Always test.
  • Normalise your data - Make sure all metrics are scaled to comparable ranges, otherwise some will dominate, e.g. pageviews and bounce rate in the same model (a minimal sketch of this follows the list).
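For that last point, base R's scale() is usually enough - a minimal sketch, assuming a hypothetical data frame df with pageviews and bounceRate columns:

# centre and scale the metrics so neither dominates the model
df_scaled <- as.data.frame(scale(df[, c("pageviews", "bounceRate")]))
summary(df_scaled)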

The Scenario

Here is the situation the following examples are based upon.  Hopefully it will be something familiar to your own case:

You are in charge of a reward scheme website, where existing customers log in to spend their points.  You want users to spend as many points as they can, so the points have a high perceived value.  You capture a unique userId on login in custom dimension 1, and use Google Analytics enhanced e-commerce to track which prizes users view and claim.

Notice this scenario relies on a reliable user ID, since every user logs in to use the website. This may be tricky to achieve on your own website, so you may need to work with only a subset of your users.  In my view, the data gains from reliable user identification are such that I try to encourage website designs that put as much content as possible behind a login.

    Random Forests

Now we get into the first example.  Random Forests are a popular machine learning tool as they typically give good results - in Kaggle competitions they are often the benchmark to beat.
     
Random Forests are based on decision trees, and decision trees are the topic of a recent interactive visualisation of machine learning that has been doing the rounds.  It's really great, so check it out first then come back here.

    Back? Ok great, so now you know about decision trees.

Random Forests are a simple extension: a Random Forest is a collection of decision trees.  A problem with decision trees is that they overfit your data - when you throw new data at them you get misclassifications.  It turns out, though, that if you aggregate many decision trees, each built on a subset of your original data, all those slightly worse models add up to one robust model, meaning that when you throw new data at a Random Forest it is more likely to be a close fit.

    If you want more detail check out the very readable original paper by Breiman and Cutler and a tutorial on using it with R is here.

    Example 1: Can we predict what prizes a user will claim from their view history?

Now we are back looking at our test scenario.  We have noticed that a lot of users aren't claiming prizes despite browsing the website, and we want to see if we can encourage them to claim, so they value the points more and spend more to earn them.

    We want to look at users who do claim, and see what prizes they look at before they claim.  Next we will see if we can build a model to predict what a user will claim based on their view history.  In production, we will use this to e-mail users who have viewed but not claimed prize suggestions, to see if it improves uptake.

    Fetching the data

Use your favourite Google Analytics to R library - I'm using my experimental new library, googleAnalyticsR, but it doesn't matter which; the important thing is what is being fetched.  In this example the user ID is being captured in custom dimension 1, and we're pulling out the product SKU code.  This is transferable to other web analytics tools such as Adobe Analytics (perhaps via the RSiteCatalyst package).
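The exact call depends on your library and its version, but here is a sketch of the two fetches, assuming a recent googleAnalyticsR interface and a hypothetical View ID and date range:

library(googleAnalyticsR)
ga_auth()

view_id <- 123456  # hypothetical GA View ID

# product views per user (userId captured in custom dimension 1)
views <- google_analytics(view_id,
                          date_range = c("2016-01-01", "2016-03-31"),
                          metrics    = "productDetailViews",
                          dimensions = c("dimension1", "productSku"),
                          max        = -1)

# prize claims (purchases) per user
claims <- google_analytics(view_id,
                           date_range = c("2016-01-01", "2016-03-31"),
                           metrics    = "uniquePurchases",
                           dimensions = c("dimension1", "productSku"),
                           max        = -1)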



    Note we needed two API calls to get the views and transactions as these can't be queried in the same call.  They will be merged later.

    Transforming the data

We now need to put the data into a format that will work with Random Forests: a matrix of predictors to feed into the model, plus one column of responses holding the desired output labels, reshaped so each row is one observation (here, a user and the prize they claimed).
Here is some R code to "widen" the data into this format. We then split the data set randomly: 75% for training and 25% for testing.
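A sketch of that reshaping, continuing the fetch above (tidyr/dplyr are one way of doing it; the exact column names depend on your data):

library(dplyr)
library(tidyr)

# one row per user: product-view counts spread into columns, plus the claimed product as the response
wide <- views %>%
  select(dimension1, productSku, views = productDetailViews) %>%
  pivot_wider(names_from = productSku, values_from = views, values_fill = 0) %>%
  inner_join(claims %>% select(dimension1, claimed = productSku), by = "dimension1") %>%
  mutate(claimed = as.factor(claimed))

# random 75% / 25% train/test split
set.seed(1234)
train_idx <- sample(nrow(wide), 0.75 * nrow(wide))
train <- wide[train_idx, ]
test  <- wide[-train_idx, ]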

    Running RandomForest and assessing the results

We now run the model - this can take a long time with many dimensions (it can be sped up considerably by using PCA for dimension reduction; see later).  We then test the model on the test data and get an accuracy figure:
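A minimal sketch with the randomForest package, continuing from the split above:

library(randomForest)

predictors <- setdiff(names(train), c("dimension1", "claimed"))

rf <- randomForest(x = train[, predictors],
                   y = train$claimed,
                   ntree = 500)

# accuracy on the held-out 25%
pred <- predict(rf, newdata = test[, predictors])
mean(pred == test$claimed)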


On my example test set I got ~70% accuracy on this initial run, which is not bad, but it is possible to get up to 90-95% with some tweaking.  Anyhow, let's plot the test vs predicted product frequencies to see how it looks:
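One way of sketching that comparison, continuing from above: count how often each product was actually claimed in the test set versus how often the model predicted it.

library(ggplot2)

freqs <- data.frame(
  product   = levels(test$claimed),
  actual    = as.vector(table(test$claimed)),
  predicted = as.vector(table(factor(pred, levels = levels(test$claimed))))
)

ggplot(freqs, aes(actual, predicted, label = product)) +
  geom_point() +
  geom_abline(linetype = "dashed") +
  ggtitle("Claimed prizes: actual vs predicted frequency")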


This produced the plot below.  It can be seen that at ~70% accuracy the model predicted many products well, but with a lot of error for one large outlier.  Examining the data, this product SKU was for a cash-only prize.  A next step would be to look at how to deal with this product in particular, since eliminating it improves accuracy to ~85% in one swoop.
     

    Next steps for the RandomForest

Here I stop, but there are lots of next steps that could be taken to make the model applicable to the business.  A non-exhaustive list:

    • Run model on more test sets
    • Train model on more data
    • Try reducing number of parameters (see PCA later)
    • Examine large error outliers 
    • Compare with simple models (last/first product viewed?) - complicated is not always best!
• Run the model against users who have viewed but not yet claimed
    • Run email campaign with control and model results for final judgement

    It is hoped the above inspired you to try it yourself.

    Session 2: K-means, Principal Component Analysis and Summary

    Example 2: Can we cluster users based on their view product behaviour?

    Now we look at k-means clustering.  The questions we are trying to answer are something like this:

    Do we have suitable prize categories on the website? How do our website categories compare to user behaviour?

We hope the k-means clustering will give us data to inform decisions on how the website is organised.

For this we will use the same data as we used before for Random Forests, with one minor change: as k-means is an unsupervised model, we remove the product labels:

    A lot of this example is inspired by this nice beginners walk-through on K-means with R.

    Introduction to K-means clustering


    This video tutorial on k-means explains it well:



The above is an example with two dimensions, but k-means can apply to many more dimensions than that - we just can't visualise them easily. In our case we have 185 product views that will each serve as a dimension.  However, problems with that many dimensions include long processing times alongside the danger of over-fitting the data, so we now look at PCA.

    Principal Component Analysis (PCA)

We perform Principal Component Analysis (PCA) to see if there are important products that dominate the model - this could have been applied to the previous Random Forest example as well, and indeed a final production pipeline could feed the output of one model, such as k-means, into Random Forests.

PCA rotates the data into new dimensions (principal components) that concentrate as much of the variance as possible into as few dimensions as possible, then ranks them by the amount of variance they explain.  There is a good visualisation of this here.

The clustering will actually be performed on the top rotated dimensions found via PCA, and we will then map these back to the original products for the final output. This also takes care of situations such as one product being viewed in every cluster: PCA will minimise that dimension.

The code below looks for the principal components, then gives us some output to help decide how many dimensions to keep.  A rule of thumb is to keep the components that together explain roughly ~85% of the variance.  For the data below this was actually 35 dimensions (reduced from the 185 before).
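A sketch of that step using base R's prcomp(), run on the same user x product-view matrix as before (zero-variance columns would need dropping before scaling):

# principal components of the product-view matrix (response labels removed)
pca <- prcomp(wide[, predictors], scale. = TRUE)

# cumulative variance explained - look for the point where it reaches ~85%
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
plot(cum_var, type = "b",
     xlab = "Principal component", ylab = "Cumulative variance explained")

n_comp <- which(cum_var >= 0.85)[1]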



    The plot output from the above is below.  We can see the first principal component accounts for 50% of the variance, but then the variation is flattish.


    How many clusters?

How many clusters to pick for k-means can be a subjective decision.  There are other clustering models that pick the number for you, but some kind of decision process will still depend on what you need.  There are, however, ways to help inform that decision.

Running the k-means model for an increasing number of clusters, we can record an error measure - the total within-cluster sum of squares - for each attempt.  When we plot this against the number of clusters, we can see where the curve levels off and use that to help with our decision:
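A sketch of that "elbow" check, running k-means for k = 1 to 15 on the PCA scores kept above:

pca_scores <- pca$x[, 1:n_comp]

# total within-cluster sum of squares for each candidate number of clusters
wss <- sapply(1:15, function(k) {
  kmeans(pca_scores, centers = k, nstart = 25)$tot.withinss
})

plot(1:15, wss, type = "b",
     xlab = "Number of clusters", ylab = "Total within-cluster sum of squares")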


    The plot for determining the clusters is here - see the fall between 2-4 clusters.  We went with 4 for this example, although a case could be made for 6:


    Assessing the clusters and visualisation

I find heatmaps are a good way to assess clustering results, since they give a quick overview of the groupings.  We are basically looking to see if the clusters found are different enough to make sense.
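A sketch of that check using d3heatmap: fit the final k-means (four clusters here) and map the clusters back to average product views per cluster.

library(d3heatmap)

set.seed(1234)
fit <- kmeans(pca_scores, centers = 4, nstart = 25)

# average views per product for each cluster, mapped back to the original columns
cluster_profile <- aggregate(wide[, predictors],
                             by = list(cluster = fit$cluster), FUN = mean)

d3heatmap(as.matrix(cluster_profile[, -1]),
          scale  = "column",
          labRow = paste("Cluster", cluster_profile$cluster))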


    This gives the following visualisation.  In an interactive RStudio or Shiny session, this is zoomable for finer detail, but here we just exported the image:

    From the heatmap we can see that each cluster does have distinctly different product views.

    K-Means - Next Steps

The next step is to take these clusters and examine the products within them, looking for patterns.  This is where your domain knowledge is needed, as all we have done here is group products together based on statistics - the "why" is not in here.  When I've performed this in the past, I've tried to give a named persona to each cluster type.  Examples include "Big Spenders" for those who visit the payment page a lot, "Sport Freaks" for those who tend to only look at sports goods, etc.  Again, this will largely depend on the number of clusters you have chosen, so you may want to vary that to tune the results you are looking for.

Recommendations on how to group pages follow from this; A/B tests can then be performed to check whether the clustering makes an impact.

    Summary

    I hope the above example workflows have inspired you to try it with your own data.  Both examples can be improved, for instance we took no account of the order of product views or other metrics such as time on website, but the idea was to give you a way in to try these yourselves.

    I chose k-means and Random Forests as they are two of the most popular models, but there are lots to choose from.  This diagram from a python machine learning library, scikit-learn, offers an excellent overview on how to choose which other machine learning model you may want to use for your data:

All in all, I hope some of the mystery around machine learning has been taken away, and that you can see how it can be applied to your work.  If you are interested in really getting to grips with machine learning, the Coursera course is excellent and is what set me on my way.

    Do please let me know of any feedback, errors or what you have done with the above, I'd love to hear from you.

    Good luck!

    Google API Client Library for R: googleAuthR v0.1.0 now available on CRAN

One of the problems with working with Google APIs is that quite often the hardest bit - authentication - comes right at the start.  This presents a big hurdle for those who want to work with them; it certainly delayed me.  In particular, getting Google authentication to work with Shiny is problematic, as the token itself needs to be reactive and applicable only to the user who is authenticating.

But no longer! googleAuthR provides helper functions to make it easy to work with Google APIs, and it's now available on CRAN (my first CRAN package!) so you can install it easily by typing:

    > install.packages("googleAuthR")

    It should then load and you can get started by looking at the readme files on Github or typing:

    > vignette("googleAuthR")

    After my experiences making shinyga and searchConsoleR, I decided inventing the authentication wheel each time wasn't necessary, so worked on this new R package that smooths out this pain point.

googleAuthR provides easy authentication with Google APIs, within R or within a Shiny app.  It provides a function factory you can use to generate your own functions, which call the API endpoints and perform the actions you need.

    At last counting there are 83 APIs, many of which have no R library, so hopefully this library can help with that.  Examples include the Google Prediction API, YouTube analytics API, Gmail API etc. etc.

    Example using googleAuthR

    Here is an example of making a goo.gl R package using googleAuthR:
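The code was embedded as a gist in the original post; here is a sketch along the same lines, using the goo.gl URL shortener endpoint and scope (check the current googleAuthR documentation for the exact generator arguments):

library(googleAuthR)

options(googleAuthR.scopes.selected = "https://www.googleapis.com/auth/urlshortener")

shorten_url <- function(url) {
  f <- gar_api_generator("https://www.googleapis.com/urlshortener/v1/url",
                         "POST",
                         data_parse_function = function(x) x$id)
  f(the_body = list(longUrl = url))
}

gar_auth()
shorten_url("http://code.markedmondson.me")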

If you then want to make this multi-user in Shiny, you just need the helper functions provided:
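Again a sketch rather than the original gist, using googleAuthR's Shiny module helpers to reuse the shorten_url() function above with each visitor's own token:

library(shiny)
library(googleAuthR)

ui <- fluidPage(
  googleAuthUI("auth"),
  textInput("url", "URL to shorten"),
  textOutput("short_url")
)

server <- function(input, output, session) {
  access_token <- callModule(googleAuth, "auth")

  output$short_url <- renderText({
    req(input$url, access_token())
    # wrap the API call so it runs with this user's token
    with_shiny(shorten_url, shiny_access_token = access_token(), url = input$url)
  })
}

shinyApp(ui, server)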





    Automating Google Console search analytics data downloads with R and searchConsoleR

    Yesterday I published version 0.1 of searchConsoleR, a package that interacts with Google Search Console (formerly Google Webmaster Tools) and in particular its search analytics.

I'm excited about the possibilities of this package, as this new, improved data is now available in a form that can interact with the thousands of other R packages.

    If you'd like to see searchConsoleR capabilities, I have the package running an interactive demo here (very bare bones, but should demo the data well enough).

    The first application I'll talk about in this post is archiving data into a .csv file, but expect more guides to come, in particular combining this data with Google Analytics.

    Automatic search analytics data downloads

The 90-day limit still applies to the search analytics data, so one of the first applications should be archiving that data, so you can track year-on-year and month-on-month comparisons and the general development of your SEO rankings.

    The below R script:

    1. Downloads and installs the searchConsoleR package if it isn't installed already.
    2. Lets you set some parameters you want to download.
3. Downloads the data via the search_analytics function.
    4. Writes it to a csv in the same folder the script is run in.
    5. The .csv file can be opened in Excel or similar.
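The script itself was embedded in the original post; the sketch below follows the same steps, assuming the current searchConsoleR function names - swap in your own verified site URL:

if (!require(searchConsoleR)) install.packages("searchConsoleR")
library(searchConsoleR)

website    <- "http://www.example.com"   # your verified site
start_date <- Sys.Date() - 93
end_date   <- Sys.Date() - 3             # search analytics data lags a few days

scr_auth()

sc_data <- search_analytics(siteURL    = website,
                            startDate  = start_date,
                            endDate    = end_date,
                            dimensions = c("date", "query", "page"))

write.csv(sc_data,
          paste0("search_analytics_", start_date, "_", end_date, ".csv"),
          row.names = FALSE)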

    This should give you nice juicy data.

    Considerations

The first time, you will need to run scr_auth() yourself so you can give the package access, but afterwards it will auto-refresh the authentication each time you run the script.

    If you ever need a new user to be authenticated, run scr_auth(new_user=TRUE)

You may want to modify the script so it appends to one file rather than creating a daily dump, although I work with a folder of daily .csv files and import them all into one R dataframe (which you could export again to one big .csv).

    Automation

    You can now take the download script and use it in automated batch files, to run daily.

    In Windows, this can be done like this (from SO)

    • Open the scheduler: START -> All Programs -> Accessories -> System Tools -> Scheduler
    • Create a new Task
    • under tab Action, create a new action
    • choose Start Program
    • browse to Rscript.exe which should be placed e.g. here:
      "C:\Program Files\R\R-3.2.0\bin\x64\Rscript.exe"
    • input the name of your file in the parameters field
    • input the path where the script is to be found in the Start in field
    • go to the Triggers tab
    • create new trigger
    • choose that task should be done each day, month, ... repeated several times, or whatever you like

    In Linux, you can probably work it out yourself :)

    Conclusion

    Hopefully this shows how with a few lines of R you can get access to this data set.  I'll be doing more posts in the future using this package, so if you have any feedback let me know and I may be able to post about it.  If you find any bugs or features you would like, please also report an issue on the searchConsoleR issues page on Github.

    Enhance Your Google Analytics Data with R and Shiny (Free Online Dashboard Template)

    Introduction

    The aim of this post is to give you the tools to enhance your Google Analytics data with R and present it on-line using Shiny.  By following the steps below, you should have your own on-line GA dashboard, with these features:

    • Interactive trend graphs.

    • Auto-updating Google Analytics data.

    • Zoomable day-of-week heatmaps.

    • Top Level Trends via Year on Year, Month on Month and Last Month vs Month Last Year data modules.

    • A MySQL connection for data blending your own data with GA data.

    • An easy upload option to update a MySQL database.

    • Analysis of the impact of marketing events via Google's CausalImpact.

    • Detection of unusual time-points using Twitter's Anomaly Detection.

    A lot of these features are either unavailable in the normal GA reports, or only possible in Google Analytics Premium.  Under the hood, the dashboard is exporting the data via the Google Analytics Reporting API, transforming it with various R statistical packages and then publishing it on-line via Shiny.

    A live demo of the dashboard template is available on my Shinyapps.io account with dummy GA data, and all the code used is on Github here.

    Feature Detail

    Here are some details on what modules are within the dashboard.  A quick start guide on how to get the dashboard running with your own data is at the bottom.

    Trend Graph

    Most dashboards feature a trend plot, so you can quickly see how you are doing over time.  The dashboard uses dygraphs javascript library, which allows you to interact with the plot to zoom, pan and shift your date window.  Plot smoothing has been provided at the day, week, month and annual level.


    Additionally, the events you upload via the MySQL upload also appear here, as well as any unusual time points detected as anomalies.  You can go into greater detail on these in the Analyse section.

    Heatmap

Heatmaps use colour intensity to show metrics across categories.  The heatmap here is split into weeks and days of the week, so you can quickly scan to see if a particular day of the week is popular - in the below plot, Monday/Tuesday look like the best days for traffic.


    The data window is set by what you select in the trend graph, and you can zoom for more detail using the mouse.

    Top Level Trends

Quite often a headline number is all you need to check quickly.  These data modules give you a quick glance at how you are doing, comparing last week to the week before, last month to the month before, and last month to the same month the year before.  Between them, you should see how your data is trending, accounting for seasonal variation.


    MySQL Connection

    The code provides functions to connect to a MySQL database, which you can use to blend your data with Google Analytics, provided you have a key to link them on.  


    In the demo dashboard the key used is simply the date, but this can be expanded to include linking on a userID from say a CRM database to the Google Analytics CID, Transaction IDs to off-line sales data, or extra campaign information to your campaign IDs.  An interface is also provided to let end users update the database by uploading a text file.
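The connection itself is standard RMySQL - a minimal sketch, with hypothetical credential variables matching whatever you put in your secrets file:

library(RMySQL)

con <- dbConnect(MySQL(),
                 host     = mysql_host,
                 user     = mysql_user,
                 password = mysql_pass,
                 dbname   = mysql_dbname,
                 port     = mysql_port)

events <- dbReadTable(con, "events")   # hypothetical table name
dbDisconnect(con)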

    CausalImpact

    In the demo dashboard, the MySQL connection is used to upload Event data, which is then used to compare with the Google Analytics data to see if the event had a statistically significant impact on your traffic.  This replicates a lot of the functionality of the GA Effect dashboard.


The headline impact of the event is shown in the summary dashboard tab.  If it's statistically significant, the impact is shown in blue.


    Anomaly Detection

    Twitter has released this R package to help detect unusual time points for use within their data streams, which is also handy for Google Analytics trend data.  


    The annotations on the main trend plot are indicated using this package, and you can go into more detail and tweak the results in the Analyse section.

    Making the dashboard multi-user

    In this demo I’ve taken the usual use case of an internal department just looking to report on one Google Analytics property, but if you would like end users to authenticate with their own Google Analytics property, it can be combined with my shinyga() package, which provides functions which enable self authentication, similar to my GA Effect/Rollup/Meta apps.

    In production, you can publish the dashboard behind a Shinyapps authentication login (needs a paid plan), or deploy your own Shiny Server to publish the dashboard on your company intranet.

    Quick Start

Now you have seen the features, the steps below go through the process of getting this dashboard running with your own data. This guide assumes you know R and Shiny - if you don't, then start here: http://shiny.rstudio.com/

    You don’t need to have the MySQL details ready to see the app in action, it will just lack persistent storage.

    Setup the files

    1. Clone/copy-paste the scripts in the github repository to your own RStudio project.

    2. Find your GA View ID you want to pull data from.  The quickest way to find it is to login to your Google Analytics account, go to the View then look at the URL: the number after “p” is the ID.

    3. [Optional] Get your MySQL setup with a user and IP address. See next section on how this is done using Google Cloud SQL.  You will also need to white-list the IP of where your app will sit, which will be your own Shiny Server or shinyapps.io. Add your local IP for testing too. If using shinyapps.io their IPs are: 54.204.29.251; 54.204.34.9; 54.204.36.75; 54.204.37.78.

4. Create a file called secrets.R in the same directory as the app, with the content below filled in with your own details.
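The exact variable names the app expects are defined in the repository; the sketch below only illustrates the kind of content, with hypothetical names and placeholder values (keep this file out of version control):

## secrets.R - hypothetical example values
gaViewId     <- "123456"            # the View ID from step 2

mysql_host   <- "123.456.789.012"   # Cloud SQL IP address
mysql_user   <- "dashboard_user"
mysql_pass   <- "change_me"
mysql_dbname <- "onlinegashiny"
mysql_port   <- 3306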

    Configuring R

        1. Make sure you can install and run all the libraries needed by the app:
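Something along these lines should cover it (an assumed list - the definitive set is in the repository's README):

install.packages(c("shiny", "shinydashboard", "dygraphs", "d3heatmap",
                   "CausalImpact", "RMySQL", "devtools"))

devtools::install_github("skardhamar/rga")            # Google Analytics API client used by the app
devtools::install_github("twitter/AnomalyDetection")  # Twitter's anomaly detection package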

2. Run the below command locally first, to store the auth token in the same folder.  You will be prompted to log in with the Google account that has access to the GA View ID you put into secrets.R, and to paste a code into the R console.  The token will then be uploaded with the app and will handle the authentication with Google Analytics in production:

            > rga::rga.open(where="token.rga")

3. Test the app by hitting the “Run App” button at the top right of the ui.R or server.R script in RStudio, or by running:

            > shiny::runApp()

    Using the dashboard

    1. The app should now be running locally in a browser window with your own GA data.  It can take up to 30 seconds for all the data to load first time.

    2. Deploy the instance on-line to Shinyapps.io with a free account there, or to your own Shiny Server instance.

    3. Customise your instance. If for any reason you don’t want certain features, then remove the feature in the ui.R script - the data is only called when the needed plot is viewed.

    Getting a MySQL setup through Google Cloud SQL

    If you want a MySQL database to use with the app, I use Google Cloud SQL.  Setup is simple:
    1. Go to the Google API console and create a project if you need to.

    2. Make sure you have billing turned on with your billing accounts menu top right.

    3. Go to Storage > Cloud SQL in the left hand menu.

    4. Create a New Instance.

    5. Create a new Database called “onlinegashiny”

    6. Under “Access Control” you need to put in the IP of yourself where you test it, as well as the IPs of the Shiny Server/shinyapps.io.  If you are using shinyapps.io the IPs are: 54.204.29.251; 54.204.34.9; 54.204.36.75;54.204.37.78

    7. Under “IP Address” create a static IP (Charged at $0.24 a day)

8. You should now have all the access info you need to put into the app's secrets.R for MySQL access.  The port should be the default, 3306.

9. You can also limit the amount of data that can be uploaded via the shiny.maxRequestSize option - the default here is 0.5 MB (see the snippet below).
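For example, to change the upload limit (a standard Shiny option, set near the top of the app code):

# allow uploads of up to 10 MB
options(shiny.maxRequestSize = 10 * 1024^2)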

    Summary

Hopefully the above helps inspire what can be done with your Google Analytics data.  The focus has been on giving you tools that allow action to be taken on your data.

    There is a lot more you can do via the thousands of R packages available, but hopefully this gives a framework you can build upon.

    I’d love to see what you build with it, so do please feel free to get in touch. :)

    How I made GA Effect - creating an online statistics dashboard using R

GA Effect is a webapp that uses Bayesian structural time-series to judge whether events happening in your Google Analytics account are statistically significant.  It's been well received on Twitter, and how to use it is detailed in this guest post on Online Behaviour, but this post will be about how to build your own, or something similar.

    Update 18th March: I've made a package that holds a lot of the functions below, shinyga.  That may be easiest to work with.


    What R can do

    Now is a golden time for the R community, as it gains popularity outside of its traditional academic background and hits business.  Microsoft has recently bought Revolution Analytics, an enterprise solution of R so we can expect a lot more integration with them soon, such as the machine learning in their Azure platform.

    Meanwhile RStudio are releasing more and more packages that make it quicker and easier to create interactive graphics, with tools for connecting and reshaping data and then plotting using attractive JavaScript visualisation libraries or native interactive R plots.  GA Effect is also being hosted using ShinyApps.io, an R server solution that enables you to publish straight from your console, or you can run your own server using Shiny Server.  

    Packages Used

For the GA Effect app, the key components were these R packages: shinydashboard, rga, CausalImpact and dygraphs, each covered below.

    Putting them together

    Web Interaction

    First off, using RStudio makes this all a lot easier as they have a lot of integration with their products.

ShinyDashboard is a custom theme of the more general Shiny.  As detailed in the getting started guide, creating a blank dashboard webpage with shinydashboard takes eight lines of R code.  You can test or run everything locally first before publishing to the web via the “Publish” button at the top.
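For reference, those eight lines look roughly like this (adapted from the shinydashboard getting started guide):

library(shiny)
library(shinydashboard)

ui <- dashboardPage(
  dashboardHeader(title = "My dashboard"),
  dashboardSidebar(),
  dashboardBody()
)

server <- function(input, output) { }

shinyApp(ui, server)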

Probably the most difficult concept to get to grips with is reactive programming in a Shiny app.  This is effectively how the interaction occurs, setting up live relationships between inputs from your UI script (always called ui.R) and outputs from your server-side script (called server.R).  These are effectively your front end and back end in a traditional web environment.  The Shiny package takes your R code and turns it into HTML5 and JavaScript. You can also import your own JavaScript if you need to cover something Shiny can't.

    The Shiny code then creates the UI for the app, and creates reactive versions of the datatables needed for the plots.

    Google Authentication

    The Google authentication flow uses OAuth2 and could be used for any Google API in the console, such as BigQuery, Gmail, Google Drive etc.  I include the code used for the authentication dance below so you can use it in your own apps:
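The original gist is no longer embedded here, but the non-Shiny version of the dance is roughly the httr sketch below; in a Shiny app the redirect and token handling have to be made reactive, which is what shinyga (and later googleAuthR) wrap up for you.  The client ID and secret come from your Google API console project.

library(httr)

google_app <- oauth_app("google",
                        key    = "your-client-id.apps.googleusercontent.com",
                        secret = "your-client-secret")

google_token <- oauth2.0_token(
  oauth_endpoints("google"),
  google_app,
  scope = "https://www.googleapis.com/auth/analytics.readonly",
  cache = FALSE
)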

    Fetching Google Analytics Data

    Once a user has authenticated with Google, the user token is then passed to rga() to fetch the GA data, according to which metric and segment the user has selected. 

    This is done reactively, so each time you update the options a new data fetch to the API is made.  Shiny apps are on a per user basis and work in RAM, so the data is forgotten once the app closes down.

    Doing the Statistics

You can now manipulate the data however you wish.  I put it through the CausalImpact package as that was the goal of the application, but you have a wealth of other R packages to draw on, covering machine learning, text analysis and all the other statistics available in the R universe.  It really is only limited by your imagination.

    Here is a link to the CausalImpact paper, if you really want to get in-depth with the methods used.  It includes some nice examples of predicting the impact of search campaign clicks.

    Here is how CausalImpact was implemented as a function in GA Effect:
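The exact function lives in the app's code; a sketch of its shape, with hypothetical argument names, looks like this:

library(CausalImpact)
library(zoo)

run_causal_impact <- function(ga_data, event_date, metric = "sessions") {
  # ga_data: a data.frame with a date column plus the chosen metric
  series <- zoo(ga_data[[metric]], order.by = as.Date(ga_data$date))

  pre_period  <- c(start(series), event_date - 1)
  post_period <- c(event_date, end(series))

  CausalImpact(series, pre.period = pre_period, post.period = post_period)
}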

    Plotting

dygraphs is an R package that takes R input and outputs the JavaScript needed to display it in your browser, and as it's made by RStudio it is also compatible with Shiny.  It is an application of htmlwidgets, which lets you take any JavaScript library and make it compatible with R code.  Here is an example of how the main result graph was generated:
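A sketch of the kind of dygraph call involved (not the app's exact code), assuming an xts/zoo object plot_data with observed, expected, lower and upper columns:

library(dygraphs)
library(magrittr)

dygraph(plot_data, main = "Observed vs expected sessions") %>%
  dySeries(c("lower", "expected", "upper"), label = "Expected") %>%
  dySeries("observed", label = "Observed") %>%
  dyRangeSelector()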

    Publishing

I've been testing the alpha of shinyapps.io for a year now, and it is just this month (Feb 2015) coming out of beta.  If you have an account, publishing your app is as simple as pushing the “Publish” button above your script, after which it appears at a public URL.  With the paid plans, you can limit access to authenticated users only.

    Next steps

This app only took me three days, with my baby daughter on my lap during a sick weekend, so I'm sure you can come up with something similar given time and experience.  The components are all there now to make some seriously great apps for analytics.  If you make something, do please let me know!

    Finding the ROI of Title tag changes using Google's CausalImpact R package

    After a conversation on Twitter about this new package, and mentioning it in my recent MeasureCamp presentation, here is a quick demo on using Google's CausalImpact applied to an SEO campaign.

CausalImpact is a package that puts some statistics behind changes you may have made in a marketing campaign.  It examines the time series of data before and after an event, and gives you some idea of whether any changes were just down to random variation, or whether the event actually made a difference.

    You can now test this yourself in my Shiny app that automatically pulls in your Google Analytics data so that you can apply CausalImpact to it.   This way you can A/B test changes for all your marketing channels, not just SEO.  However, if you want to try it manually yourself, keep reading.

    Considerations before getting the data

Suffice to say, it should only be applied to time-series data (i.e. there is a date or time on the x-axis), and it helps if the event was rolled out at only one of those time points.  This may influence the choice of time unit, so if, say, the change rolled out over a week, it's probably better to use weekly data exports.  Also consider the time period you choose: the package uses the time series before the event to construct what it thinks should have happened versus what actually happened, so if anything unusual or any spikes occurred in that pre-event period it may affect your results.

    Metrics wise the example here is with visits.  You could perhaps do it with conversions or revenue, but then you may get affected by factors outside of your control (the buy button breaking etc.), so for clean results try to take out as many confounding variables as possible. 

    Example with SEO Titles

For me though, I had an example where some title tag changes went live on one day, so I could compare the SEO traffic before and after to judge whether the change had any effect and, more importantly, estimate how much traffic had increased.

    I pulled in data with my go-to GA R import library, rga by Skardhamar.

    Setup

I first set up by importing the libraries (installing them if you haven't got them) and authenticating the GA account I want to pull data from.
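Something like this (rga is installed from GitHub; the first run opens a browser to authenticate and caches the token in the file given to where):

if (!require(rga)) devtools::install_github("skardhamar/rga")
library(rga)

rga.open(instance = "ga", where = "ga.rga")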

    Import GA data

I then pull in the data for the time period covering the event: SEO visits by date.
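A sketch of the fetch, with a hypothetical View ID and date range, filtered to organic traffic:

gadata <- ga$getData(ids        = "ga:123456",
                     start.date = "2014-01-01",
                     end.date   = "2014-08-31",
                     metrics    = "ga:visits",
                     dimensions = "ga:date",
                     filters    = "ga:medium==organic",
                     batch      = TRUE)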

    Apply CausalImpact

In this example, the title tags got updated on the 200th day of the time period I pulled.  I want to examine what happened over the next 44 days.
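In CausalImpact terms that is a pre-period of days 1-200 and a post-period of days 201-244 - a sketch continuing from the fetch above:

library(CausalImpact)

pre.period  <- c(1, 200)
post.period <- c(201, 244)

impact <- CausalImpact(gadata$visits, pre.period, post.period)

plot(impact)               # the plot described below
summary(impact)            # headline numbers
summary(impact, "report")  # the verbose write-up quoted further down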

    Plot the Results

    With the plot() function you get output like this:

    1. The left vertical dotted line is where the estimate on what should have happened is calculated from.
    2. The right vertical dotted line is the event itself. (SEO title tag update)
    3. The original data you pulled is the top graph.
    4. The middle graph shows the estimated impact of the event per day.
    5. The bottom graph shows the estimated impact of the event overall.

In this example it can be seen that after 44 days there are an estimated 90,000 extra SEO visits from the title tag changes. This can then be used to work out the ROI over time for that change.

    Report the results

    The $report method gives you a nice overview of the statistics in a verbose form, to help qualify your results.  Here is a sample output:

    "During the post-intervention period, the response variable had an average value of approx. 94. By contrast, in the absence of an intervention, we would have expected an average response of 74. The 95% interval of this counterfactual prediction is [67, 81]. Subtracting this prediction from the observed response yields an estimate of the causal effect the intervention had on the response variable. This effect is 20 with a 95% interval of [14, 27]. For a discussion of the significance of this effect, see below.

    Summing up the individual data points during the post-intervention period (which can only sometimes be meaningfully interpreted), the response variable had an overall value of 4.16K. By contrast, had the intervention not taken place, we would have expected a sum of 3.27K. The 95% interval of this prediction is [2.96K, 3.56K].

    The above results are given in terms of absolute numbers. In relative terms, the response variable showed an increase of +27%. The 95% interval of this percentage is [+18%, +37%].

    This means that the positive effect observed during the intervention period is statistically significant and unlikely to be due to random fluctuations. It should be noted, however, that the question of whether this increase also bears substantive significance can only be answered by comparing the absolute effect (20) to the original goal of the underlying intervention.

    The probability of obtaining this effect by chance is very small (Bayesian tail-area probability p = 0.001). This means the causal effect can be considered statistically significant."

    Next steps

    This could then be repeated for things like UX changes, TV campaigns, etc. You just need the time of the event and the right metrics or KPIs to measure them against.

The above is just a brief intro; there is a lot more that can be done with the package, including custom models.  For more, see the package help file and documentation.

    Run R, RStudio and OpenCPU on Google Compute Engine [free VM image]

    File this under "what I wished was on the web whilst trying to do this myself."

edit 20th November, 2016 - now everything in this post is abstracted away and available in the googleComputeEngineR package - I would say it's a lot easier to use that.  Here is a post on getting started with it: http://code.markedmondson.me/launch-rstudio-server-google-cloud-in-two-lines-r/

    edit 30th April, 2016: I now have a new post up on how to install RStudio Server on Google Compute Engine using Docker, which is a better way to do it. 

    edit 30th Nov, 2015: Oscar explains why some users couldn't use their username

    edit 5th October: Added how to login, add users and migrated from gcutil to gcloud

    Google Compute Engine is a very scalable and quick alternative to Amazon Web Services, but a bit less evolved in the images available for users. 

If you would like a VM with R 3.0.1, RStudio Server 0.98 and OpenCPU installed, then you can click on the link below and install a pre-configured version for you to build upon.

    With this image, you have a cloud server with the most popular R / Cloud interfaces available, which you can use to apply statistics, machine learning or other R applications on web APIs.  It is a fundamental building block for a lot of my projects.

    The VM image is here. [940.39MB]

    To use, follow these steps:

    Downloading the instance and uploading to your project

    1. Create your own Google Cloud Compute project if you haven't one already.
2. Put in billing details.  Here are the prices you'll pay for running the machine. It's usually under $10 a month.
    3. Download the image from the link above (and here) and then upload it to your own project's Cloud Storage. Details here
    4. Add the uploaded image to your project with a nice name that is only lowercase, numbers or includes hyphens (-).  Details here. You can do this using gcloud and typing: 
    $ gcloud compute images create IMAGE_NAME --source-uri URI

    Creating the new Instance

    1. Now go to Google Compute Engine, and select Create New Instance
    2. Select the zone, machine type you want (i.e. you can select a 50GB RAM machine if needed for big jobs temporarily)
    3. In the dropdown for images you should be able to see the image from step 4 above.  Here is a screenshot of how it should look, I called my image "r-studio-opencpu20140628"

    Or, if you prefer using command line, you can do the steps above in one command with gcloud like this:

    $ gcloud compute instances create INSTANCE [INSTANCE ...] --image IMAGE

    Using your instance

    You should now have RStudio running on http://your-ip-address/rstudio/ and openCPU running on http://your-ip-address/ocpu/test and a welcome homepage running at the root http://your-ip-address

    To login, your Google username is an admin as you created the Google cloud project. See here for adding users to Google Cloud projects

    If you don't know your username, try this command using gcloud to see your user details:

    $ gcloud auth login

    Any users you add to Debian running on the instance will have a user in RStudio - to log into Debian and add new users, see below:

    $ ## ssh into the running instance
    $ gcloud compute ssh <your-username>@new-instance-name
    $ #### It should now tell you that you are logged into your instance #####
    $ #### Once logged in, add a user: example with jsmith
    $ sudo useradd jsmith
    $ sudo passwd jsmith
    $ ## give the new user a directory and change ownership to them
$ sudo mkdir /home/jsmith
$ sudo chown jsmith:users /home/jsmith

    Oscar in the comments below also explains why sometimes your username may not work:

    Like other comments, my username did not work.

    Rather than creating a new user, you may need to simply add a password to your user account:

$ sudo passwd

    Also, the username will be your email address with the '.' replaced with '_'. So xx.yy@gmail.com became xx_yy

    You may also want to remove my default user the image comes with:

    $ sudo userdel markedmondson

    ...and remove my folder:

    $ sudo rm -rf /home/markedmondson

    The configuration used

    If you would like to look before you leap, or prefer to install this yourself, a recipe is below. It largely cobbles together the instructions around the web supplied by these sources:

    Many thanks to them.

    It covers installation on the Debian Wheezy images available on GCE, with the necessary backports:








    How To Use R to Analyse and Plot Your Twitter Use

    Here is a little how-to if you want to use R to analyse Twitter.  This is the first of two posts: this one talks about the How, the second will talk about the Why.  

    If you follow all the code you should be able to produce plots like this:

As with all analytics projects, it's split into four different aspects: 1. getting the data; 2. transformations; 3. analysing; 4. plotting.

    All the code is available on my first public github project:

    https://github.com/MarkEdmondson1234/r-twitter-api-ggplot2

    I did this project to help answer an idea: can I tell by my Twitter when I changed jobs or moved country?

I have the feeling that the more SEO I do, the more I rely on Twitter as an information source; whereas for analytics it's more independent research, which takes place more on StackOverflow and Github. Hopefully this project can see if that is valid.

    1. Getting the data

    R makes getting tweets easy via the twitteR package.  You need to install that, register your app with Twitter, then authenticate to get access to the Twitter API.

    Another alternative to using the API is to use Twitter's data export, which will then let you go beyond the 3200 limit in the API. This gives you a csv which you can load into R using read.csv()
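A sketch of the API route, with placeholder credentials from your registered Twitter app and your own handle swapped in:

library(twitteR)

setup_twitter_oauth(consumer_key    = "xxxx",
                    consumer_secret = "xxxx",
                    access_token    = "xxxx",
                    access_secret   = "xxxx")

my_tweets <- userTimeline("yourTwitterHandle", n = 3200, includeRts = TRUE)
tweets_df <- twListToDF(my_tweets)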

    2. Transforming the data

    For my purposes, I needed to read the timestamps of the tweets, and put them into early, morning, afternoon and evening buckets, so I could then plot the data.  I also created a few aggregates of the data, to suit what I needed to plot, and these dataframes I outputted from my function in a list.

Again, as with most analytics projects, this section represents most of the work, with to-ing and fro-ing as I tweaked the data I wanted in the chart.  One tip I've picked up is to do these data transformations in a function that takes the raw data as input and outputs your processed data, as that makes it easier to repeat for different data inputs.
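A sketch of that bucketing step, continuing from the fetch above (the names tweetTD and tweetTDm follow the naming used later in this post, but the details are assumptions):

library(dplyr)
library(lubridate)

tweets_df <- tweets_df %>%
  mutate(hour      = hour(created),
         daypart   = cut(hour, breaks = c(-1, 6, 12, 18, 24),
                         labels = c("Early", "Morning", "Afternoon", "Evening")),
         yearWeek  = format(created, "%Y-%W"),
         yearMonth = format(created, "%Y-%m"))

# weekly and monthly tweet counts per day part, ready for plotting
tweetTD  <- count(tweets_df, yearWeek, daypart)
tweetTDm <- count(tweets_df, yearMonth, daypart)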

    3. Analysing the data

    This will be covered in the second post, and usually is the point of the whole exercise - it only takes about 10% of time on the project, but is the most important.

    4. Plotting the data

This part evolves as you go to and fro between steps 2 and 3, but what I ended up with were the functions below.

    theme_mark() is a custom ggplot2 theme you can use if you want the plots to look exactly the same as above, or at the very least show how to customise ggplot2 to your own fonts/colours.  It also uses choosePalette() and installFonts(). "mrMustard" is my name for the colour scheme chosen.

    I use two layers in the plot - one is the area plot to show the total time spent per Day Part, the second is a smoother line to help pick out the trend better for each Day Part.

plotTweetsDP() takes as input the tweetTD (weekly) or tweetTDm (monthly) dataframes, and plots the daypart dataframe produced by the transformations above.  The timeAxis parameter expects "yw" (yearWeek) or "ym" (yearMonth), which it uses to make the x-axis better suited to each.

    plotLinksTweets() is the same, but works on the tweetLinks dataframe.


    I hope this is of some use to someone, let me know in the comments!  Also any ideas on where to go from here - at the moment I'm working through some text mining packages to try and get something useful out of those. 

    Again the full project code is available on Github here: https://github.com/MarkEdmondson1234/r-twitter-api-ggplot2

    My Google Analytics Time Series Shiny App (Alpha)

    There are many Google Analytics dashboards like it, but this one is mine:

    My Google Analytics Time Series App

It's a bare-bones framework where I can start to publish publicly some of the R work I have been learning over the past couple of years.

    It takes advantage of an Alpha of Shinyapps, which is a public offering of R Shiny, that I love and adore. 

    At the moment the app has just been made to authenticate and show some generic output, but I plan to create a lot more interesting plots/graphs from it in the future.

    How To Use It

    1. You need a Google Analytics account.  
    2. Go to https://mark.shinyapps.io/GA_timeseries/
3. You'll see this screen.  Pardon the over-heavy legal disclaimers; I'm just covering my arse.  I have no intention of using this app to mine data, but others' GA apps might, so I would be wary of giving access to Google Analytics to other webapps, especially now it's possible to add users via the Management API.
4. Click the "GA Authentication" link.  It'll take you to the Google account screen, where you say it's ok to use the data (if it is), and copy-paste the token it then displays.
5. This token allows the app (but not me) to process your data.  Go back to the app and paste the token into the box.
6. Wait about 10 seconds, depending on how many accounts you have in your Google Analytics.
7. Sometimes you may see "Bad Request", which means the app is bad and the GA call has errored.  If you hard-reload the page (on Firefox this is SHIFT + RELOAD), you will need to re-authenticate starting from step 2 above. Sorry.
    8. You should now see a table of your GA Views on the "GA View Table" tab.  You can search and browse the table, and choose the account and profile ID you want to work with via the left hand drop downs. Example using Sanne's Copenhagenish blog:
    9. If you click on "Charts" tab in the middle, you should see some Google Charts of your Visits and PageViews. Just place holders for now.
    10. If you click on the "Forecasts" tab you should see some forecasting of your visits data.  If it doesn't show, make sure the date range to the far left covers 70 days (say 1st Dec 2013 to 20th Feb 2014). 
11. The forecast is based on Holt-Winters exponential smoothing, to try to model seasonality.  The red line is your actual data, the blue line is the model's estimate, including 70 days into the future. The green area is the margin of error at 50% confidence, and the Time axis shows the number of months.  To be improved.
12. Under the forecast model is a decomposition of the visits time series: the top graph is the actual data, the second is the trend without seasonality, the third is the 31-day seasonal component, and the fourth is the random everything else (a minimal sketch of this approach follows the list).
    13. In the last "Data Table" tab you can see the top 1000 rows of data.
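A minimal sketch of the forecasting and decomposition approach described in steps 11 and 12, assuming a plain numeric vector of daily visits:

visits_ts <- ts(visits, frequency = 31)   # treat ~31 days as one seasonal cycle

hw <- HoltWinters(visits_ts)
fc <- predict(hw, n.ahead = 70, prediction.interval = TRUE, level = 0.5)

plot(hw, fc)                 # actuals, fitted values and the 70-day forecast band
plot(decompose(visits_ts))   # observed, trend, seasonal and random components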

    That's it for now, but I'll be doing more in the future with some more exciting uses of GA data, including clustering, unsupervised learning, multinomial regression and sexy stuff like that.

    Update 24th Feb

    I've now added a bit of segmentation, with SEO and Referral data available trended, forecasted and decomposed.