I intended this blog to cover everything that wasn't code, with the code covered at https://code.markedmondson.me, but as you can see that never happened. This probably points to some imbalance in my life, and I hope to address that in 2019.
There is a lot else going on. In 2018 we moved to a dream house with a garden that gets me out a bit, and has also helped give space for some music to come back. The new house is hopefully a move that benefits all the family; even though we got it as Sanne's and my careers waxed and waned, it should be a net positive in lots of ways.
A 2018 resolution was to travel as much with the family as I do with work, and that has largely panned out and been positive, so this year I'll try the same with blogging - one post here about non-code for every post about code at the other place. The intended audience for this blog is anyone who cares, whilst the code blog aims at a more professional output.
Another habit I got back into in 2018 was reading. From the age of around 4 to 30ish I hardly did anything else aside from read, but I dropped out of the habit for some reason, perhaps just due to the amount of reading I did at work. After setting a target of one book a month and 20 mins a day (a goal that would have been child's play at age 16) I found my way out of the reading funk. It's a guaranteed way to relax, if you choose a book not too much like work. I'm on Goodreads if you would like to join me, and I hope to write about some of the new thoughts that arise from those books.
Music: I get a lot more practice mooching about in my new study, but I lose access to the music bunker, so I'm not sure how that will go. But I'm a better player now at least. I'll put anything that is half-finished on Soundcloud here.
Gardening posts?! Stranger things have happened.
Religion/Politics etc. Yes, I should do this, as I've sworn off touching it on Twitter, decrying it as a terrible medium for politics in particular. This place should be better for long-form considerations.
A few weeks ago I tweeted a beta version of a BigQuery Visualiser Shiny app that was well received, and got some valuable feedback on how it could be improved, in particular from @felipehoffa - thanks Felipe!
Here is a screenshot of the app:
Motivation
The idea of the app is to enhance the standard BigQuery interface to include plots of the data you query. It uses ggplot2, a popular R library; d3heatmap, a d3.js JavaScript library to display heatmaps; and timelyportfolio's listviewer, a nice library for viewing all the BigQuery metadata in a collapsible tree. Other visualisations can be added fairly easily and more will come over time, but if you have a request for something in particular you can raise an issue on the project's Github page.
I got into BigQuery once it started to receive exports from Google Analytics Premium. Since these exports carry unsampled raw data and include unique userIds, it's a richer data source for analysis than the Google Analytics reporting API.
It also was a chance to create another Google API library called bigQueryR, the newest member of the googleAuthR family. Using googleAuthR meant Shiny support, and also meant bigQueryR can be used alongside googleAnalyticsR and searchConsoleR under one shared login flow. This is something exploited in this demo of RMarkdown, which pulls data from all three sources into a scheduled report.
You can run the Shiny app locally on your computer within RStudio; within your own company intranet if it's running Shiny Server; or publicly, like the original app, on shinyapps.io.
Feedback
Please let me know what else could improve.
I have a pending issue on using JSON uploads for authentication that is waiting on a bug fix in httr, the underlying library.
In particular, any of the htmlwidgets packages could be added - this wonderful R library creates an R to d3.js interface, giving access to some of the nicest visualisations on the web.
In this first release, I favoured plots that could apply to as many different data sets as possible. For your own use cases you can be more restrictive about what data is requested, and so may be more ambitious in the plots. If you want inspiration, timelyportfolio (he who wrote the listviewer library) has a blog where he makes lots of htmlwidgets libraries.
Enjoy! I hope it's of use - let me know if you build something cool with it.
I've just come back from #MeasureCamp, where I attended some great talks: on hierarchical models; the process of analysis; a demo of Hadoop processing Adobe Analytics hits; web scraping with Python and how machine learning will affect marketing in the future. Unfortunately the sad part of MeasureCamp is you also miss some excellent content when they clash, but that's the nature of an ad-hoc schedule. I also got to meet some excellent analytics bods and friends old and new. Many thanks to all the organisers!
My sessions on machine learning
After finishing my presentation I discovered I would need to talk waaay too quickly to fit it all in, so I decided instead to do a session on each example I had. The presentation is now available online here, so you can see what was intended.
I got some great feedback, as well as requests from people who had missed the session for some details, so this blog post will try to fill in some detail around the presentation we spoke about in the sessions.
Session 1: Introduction, Google Analytics Data and Random Forest Example
Introduction
Machine Learning gives programs the ability to learn without being explicitly programmed for a particular dataset. Models are built from input data to create useful output, commonly predictive analytics. (Arthur Samuel via Wikipedia)
There are plenty of machine learning resources, but not many that deal with web analytics in particular. The sessions are aimed at inspiring web analysts to use or add machine learning to their toolbox, showing two machine learning examples that detail:
What data to extract
How to process the data ready for the models
Running the model
Viewing and assessing the results
Tips on how to put into production
Machine learning isn't magic. You may be able to make a model that uses obscure features, but a lot of intuition will be lost as a result. It's much better to have a model that uses features you can understand, and that scales up what a domain expert (e.g. you) could do if you had the time to go through all the data.
Types of Machine Learning
Machine learning models are commonly split between supervised and unsupervised learning. We deal with an example from each:
Supervised: Train the model on data with known outcomes. Examples include spam detection and our example today, classifying users based on what they eventually buy. The model we use is known as Random Forests.
Unsupervised: Let the model find its own results. Examples include the clustering of users that we do in the second example using the k-means model.
Every machine learning project needs the below elements. They are not necessarily done in order but a successful project will need to incorporate them all:
Pose the question - This is the most important. We pose a question that our model needs to answer. We also review this question and may modify it to try and fit what the data can do as we work on the project.
Data preparation - This is the majority of work. It covers getting hold of the data, munging it so it fits the model and parsing the results. I've tried to include some R functions below that will help with this, including getting the data from Google Analytics into R.
Running the model - The sexy statistics part. Whilst superstar statistics skills are helpful for getting the best results, you can still get useful output applying the model defaults, which is what we do today. The important thing is to understand the methods.
Assessing the results - What you’ll be judged on. You will of course have a measure of how accurate the model is, but an important step is visualising this and being able to explain the model to non-technical people.
How to put it into production - the ROI and business impact. A model that just runs in your R code on your laptop may be of interest, but it is ultimately not as useful for the business as a whole if you don't recommend how to implement the model and its results in production. Here you will probably need to talk to IT about how to call your model, or even rewrite your prototype in a more production-level language.
Pitfalls Using Machine Learning in Web Analytics
There are some considerations when dealing with web analytics data in particular:
Web analytics is messy data - definitions of various metrics, such as unique users, sessions or pageviews, can vary from website to website, so a thorough understanding of what you are working with is essential.
Most practical analysis needs robust unique userIds - For useful actionable output, machine learning models need to work on data that record useful dimensions, and for most websites that is your users. Unfortunately that is also the definition that is the most woolly in web analytics given the nature of different access points. Having a robust unique userID is very useful and made the examples in this blog post possible.
Time-series techniques are quickest way in - If you don't have unique users, then you may want to look at time-series models instead, since web analytics is also a lot of count data over time. This is the reason I did GA Effect as one of my first data apps, since it could apply to most situations of web analytics.
Correlating confounders - It can be common for web analytics to be recording highly correlating metrics e.g. PPC clicks and cost. Watch out for these in your models as they can overweight results.
Self reinforcing results - Also be wary of applying models that will favour their own results. For example, a personalisation algo that places products at the top of the page will naturally get more clicks. To get around this, consider using weighted metrics, such as a click curve for page links. Always test.
Normalise your data - Make sure all metrics are on the same scale, otherwise some will dominate, e.g. pageviews and bounce rate in the same model.
The Scenario
Here is the situation the following examples are based upon. Hopefully it will be something familiar to your own case:
You are in charge of a reward scheme website, where existing customers log in to spend their points. You want users to spend as many points as they can, so that the points have a high perceived value. You capture a unique userId on login into custom dimension1, and use Google Analytics enhanced e-commerce to track which prizes users view and claim.
Notice this scenario relies on a reliable user ID, since every user logs in to use the website. This may be tricky on your own website, so you may need to work with only a subset of your users. In my view, the data gains you can make from reliable user identification mean I try to encourage website designs that involve logged-in content as much as possible.
Random Forests
Now we get into the first example. Random Forests are a popular machine learning tool as they typically give good results - in Kaggle competitions they are often the benchmark to beat.
Random Forests are based on decision trees, and decision trees are the topic of a recent interactive visualisation on machine learning that has been doing the rounds. It's really great, so check it out first and then come back here.
Back? Ok great, so now you know about decision trees.
Random Forests are a simple extension: a collection of decision trees is a Random Forest. A problem with decision trees is that they will overfit your data - when you throw new data at them you will get misclassification. It turns out, though, that if you aggregate decision trees built on subsets of your original data, all those slightly worse models added together make one robust model, meaning that when you throw new data at a Random Forest it is more likely to be a close fit.
Example 1: Can we predict what prizes a user will claim from their view history?
Now we are back looking at our test scenario. We have noticed that a lot of users aren't claiming prizes despite browsing the website, and we want to see if we can encourage them to claim, so they value the points more and spend more to get them.
We want to look at users who do claim, and see what prizes they looked at before they claimed. Then we will see if we can build a model to predict what a user will claim based on their view history. In production, we will use this to e-mail prize suggestions to users who have viewed but not claimed, to see if it improves uptake.
Fetching the data
Use your favourite Google Analytics to R library - I'm using my experimental new library, googleAnalyticsR, but it doesn't matter which; the important thing is what is being fetched. In this example the user ID is captured in custom dimension 1, and we're pulling out the product SKU code. This is transferable to other web analytics tools such as Adobe Analytics (perhaps via the RSiteCatalyst package).
Note we needed two API calls to get the views and transactions as these can't be queried in the same call. They will be merged later.
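A minimal sketch of the two fetches, assuming googleAnalyticsR's google_analytics() call - the view ID, date range and choice of metrics are placeholders, and argument names may vary with the package version you have installed:

library(googleAnalyticsR)
ga_auth()  # authenticate with an account that can see the view

view_id <- 123456789  # placeholder: your GA View ID

# Product views per user (userId captured in custom dimension 1)
web_data_views <- google_analytics(view_id,
                                   date_range = c("2016-01-01", "2016-08-01"),
                                   metrics    = "uniquePageviews",
                                   dimensions = c("dimension1", "productSKU"),
                                   max        = -1)

# Prize claims per user - the second API call, merged with the views later
web_data_claims <- google_analytics(view_id,
                                    date_range = c("2016-01-01", "2016-08-01"),
                                    metrics    = "itemQuantity",
                                    dimensions = c("dimension1", "productSKU"),
                                    max        = -1)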
Transforming the data
We now need to put the data into a format that will work with Random Forests. We need a matrix of predictors to feed into the model, one column of response showing the desired output labels, and we split it so it is one row per user action:
Here is some R code to "widen" the data to get this format. We then split the data set randomly 75% for training, 25% for testing.
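Something along these lines, assuming dplyr and tidyr are available - claimed_lookup (one row per claim, holding the user's dimension1 and the SKU they claimed, derived from the transactions call) is a hypothetical helper:

library(dplyr)
library(tidyr)

# One row per user claim: columns are product SKUs holding view counts,
# plus a 'claimed' column with the SKU the user eventually claimed
wide_data <- web_data_views %>%
  pivot_wider(id_cols     = dimension1,
              names_from  = productSKU,
              values_from = uniquePageviews,
              values_fill = 0) %>%
  inner_join(claimed_lookup, by = "dimension1")  # claimed_lookup: hypothetical, built from web_data_claims

# Random 75% / 25% train/test split
set.seed(123)
train_idx <- sample(nrow(wide_data), size = floor(0.75 * nrow(wide_data)))
train <- wide_data[train_idx, ]
test  <- wide_data[-train_idx, ]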
Running RandomForest and assessing the results
We now run the model - this can take a long time for lots of dimensions (this can be much improved using PCA for dimension reduction, see later). We then test the model on the test data, and get an accuracy figure:
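A sketch with the randomForest package, using model defaults apart from ntree, and the column names following on from the reshaping above:

library(randomForest)

# Predictors are the per-product view counts; the response is the claimed SKU (a factor)
predictor_cols <- setdiff(names(train), c("dimension1", "claimed"))

rf_model <- randomForest(x = train[, predictor_cols],
                         y = as.factor(train$claimed),
                         ntree = 500)

# Predict on the held-out 25% and get a simple accuracy figure
prediction <- predict(rf_model, newdata = test[, predictor_cols])
accuracy   <- mean(as.character(prediction) == as.character(test$claimed))
accuracy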
On my example test set I got ~70% accuracy on this initial run, which is not bad, but it is possible to get up to 90-95% with some tweaking. Anyhow, let's plot the test vs predicted product frequencies, to see how it looks:
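One way to do this with ggplot2, counting how often each SKU appears in the test set versus the predictions (a sketch, not the exact plot code from the session):

library(ggplot2)

skus <- sort(unique(c(as.character(test$claimed), as.character(prediction))))

freq_compare <- data.frame(
  sku       = skus,
  actual    = as.integer(table(factor(test$claimed, levels = skus))),
  predicted = as.integer(table(factor(prediction,   levels = skus)))
)

ggplot(freq_compare, aes(x = actual, y = predicted)) +
  geom_point() +
  geom_abline(linetype = "dashed") +   # points on the line are predicted perfectly
  labs(title = "Prize claims per SKU: test set vs Random Forest prediction")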
This outputted the plot below. You can see that in general the ~70% accuracy predicted many products well, but a lot of the error comes from one large outlier. Examining the data, this product SKU was for a cash-only prize. A next step would be to look at how to deal with this product in particular, since eliminating it improves accuracy to ~85% in one swoop.
Next steps for the RandomForest
I stop there, but there are lots of next steps that could be taken to make the model applicable to the business. A non-exhaustive list is:
Run model on more test sets
Train model on more data
Try reducing number of parameters (see PCA later)
Examine large error outliers
Compare with simple models (last/first product viewed?) - complicated is not always best!
Run model against users who have viewed but not yet claimed
Run email campaign with control and model results for final judgement
It is hoped the above inspired you to try it yourself.
Session 2: K-means, Principal Component Analysis and Summary
Example 2: Can we cluster users based on their view product behaviour?
Now we look at k-means clustering. The questions we are trying to answer are something like this:
Do we have suitable prize categories on the website? How do our website categories compare to user behaviour?
We hope the k-means clustering will give us data to help with decisions on how the website is organised.
For this we will use the same data as we used before for Random Forests, with some minor changes: as k-means is an unsupervised model we will take off our product labels:
The above is an example with two dimensions, but k-means can apply to many more dimensions than that, we just can't visualise them easily. In our case we have 185 product views that will each serve as a dimension. However, problems with that many dimensions include long processing time alongside dangers of over-fitting the data, so we now look at PCA.
Principal Component Analysis (PCA)
We perform Principal Component Analysis (PCA) to see if there are important products that dominate the model - this could have been applied to the previous Random Forest example as well, and indeed a final production model could feed the output of one model, like k-means, into Random Forests.
The clustering we will do will actually be performed on the top rotated dimensions we find via PCA, and we will then map these back to the original pages for final output. This also takes care of situations such as if one product is always viewed in every cluster: PCA will minimize this dimension.
The code below looks for the principal components, then gives us some outputs to try and decide how many dimensions we will choose. A rule of thumb is we look for components that give us roughly ~85% of the variance. For the below data this was actually 35 dimensions (reduced from the 185 before)
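A sketch using base R's prcomp(), continuing with the user x product-view matrix from before (labels dropped for the unsupervised step):

# Drop the ID and label columns, and any zero-variance columns (scaling needs them gone)
pca_input <- wide_data[, setdiff(names(wide_data), c("dimension1", "claimed"))]
pca_input <- pca_input[, apply(pca_input, 2, var) > 0]

pca <- prcomp(pca_input, scale. = TRUE)

summary(pca)            # proportion of variance per component
plot(pca, type = "l")   # scree plot

# Keep enough components to cover roughly 85% of the variance
var_explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
n_components  <- which(var_explained >= 0.85)[1]
pca_scores    <- pca$x[, 1:n_components]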
The plot output from the above is below. We can see the first principal component accounts for 50% of the variance, but then the variation is flattish.
How many clusters?
How many clusters to pick for k-means can be a subjective experience. There are other clustering models that pick for you, but some kind of decision process will be dependent on what you need. There are however ways to help inform that decision.
Running the k-means modelling for increasing number of clusters, we can look at an error measure (sum of squares) of how many points are in each. When we plot these attempts for each cluster iteration, we can see how the graph changes or levels off at various cluster sizes, and use that to help with our decision:
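For example, an "elbow" plot of the total within-cluster sum of squares for an increasing number of clusters, run here on the PCA scores from above:

# Total within-cluster sum of squares for k = 1..10
set.seed(123)
wss <- sapply(1:10, function(k) {
  kmeans(pca_scores, centers = k, nstart = 20)$tot.withinss
})

plot(1:10, wss, type = "b",
     xlab = "Number of clusters",
     ylab = "Total within-cluster sum of squares")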
The plot for determining the clusters is here - see the fall between 2-4 clusters. We went with 4 for this example, although a case could be made for 6:
Assessing the clusters and visualisation
I find heatmaps are a good way to assess clustering results, since they offer a good way to overview groupings. We are basically looking to see if the clusters found are different enough to make sense.
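A sketch of the final clustering and a heatmap of the average product views per cluster, mapped back to the original products - the d3heatmap package and argument choices here are mine, not necessarily what was shown in the session:

library(d3heatmap)

# Final k-means run with the chosen number of clusters
set.seed(123)
clusters <- kmeans(pca_scores, centers = 4, nstart = 20)

# Average product views per cluster, back on the original product columns
cluster_profile <- aggregate(pca_input,
                             by  = list(cluster = clusters$cluster),
                             FUN = mean)

d3heatmap(as.matrix(cluster_profile[, -1]), scale = "column", colors = "Blues")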
This gives the following visualisation. In an interactive RStudio or Shiny session, this is zoomable for finer detail, but here we just exported the image:
From the heatmap we can see that each cluster does have distinctly different product views.
K-Means - Next Steps
The next step is to take these clusters and examine the products within them, looking for patterns. This is where your domain knowledge is needed, as all we have done here is group together based on statistics - the "why" is not in here. When I've performed this in the past, I try to give a named persona to each cluster type. Examples include "Big Spenders" for those who visit the payment page a lot, "Sport Freaks" who tend to only look at sports goods, etc. Again, this will largely depend on the number of clusters you have chosen, so you may want to vary that to tweak the results you are looking for.
Recommendations on how to group pages follow: A/B tests can then be performed to check whether the clustering makes an impact.
Summary
I hope the above example workflows have inspired you to try it with your own data. Both examples can be improved, for instance we took no account of the order of product views or other metrics such as time on website, but the idea was to give you a way in to try these yourselves.
I chose k-means and Random Forests as they are two of the most popular models, but there are lots to choose from. This diagram from a python machine learning library, scikit-learn, offers an excellent overview on how to choose which other machine learning model you may want to use for your data:
All in all, I hope some of the mystery around machine learning has been taken out, and that you can see how it can be applied to your work. If you are interested in really getting to grips with machine learning, the Coursera course is excellent and is what set me on my way.
Do please let me know of any feedback, errors or what you have done with the above, I'd love to hear from you.
One of the problems with working with Google APIs is that quite often the hardest bit, authentication, comes right at the start. This presents a big hurdle for those who want to work with them - it certainly delayed me. In particular, having Google authentication work with Shiny is problematic, as the token itself needs to be reactive and only applicable to the user who is authenticating.
But no longer! googleAuthR provides helper functions to make it easy to work with Google APIs. And it's now available on CRAN (my first CRAN package!) so you can install it easily by typing:
> install.packages("googleAuthR")
It should then load and you can get started by looking at the readme files on Github or typing:
> vignette("googleAuthR")
After my experiences making shinyga and searchConsoleR, I decided inventing the authentication wheel each time wasn't necessary, so worked on this new R package that smooths out this pain point.
googleAuthR provides easy authentication with Google APIs from R or within a Shiny app. It provides a function factory you can use to generate your own functions that call the API and carry out the actions you need.
At last counting there are 83 APIs, many of which have no R library, so hopefully this library can help with that. Examples include the Google Prediction API, YouTube analytics API, Gmail API etc. etc.
Example using googleAuthR
Here is an example of making a goo.gl R package using googleAuthR:
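Something like the sketch below, which mirrors the function-factory pattern in the package README - the scope option, endpoint and parsing shown are for the (since retired) goo.gl URL shortener API:

library(googleAuthR)

# Scope needed for the goo.gl URL shortener API
options(googleAuthR.scopes.selected = "https://www.googleapis.com/auth/urlshortener")

shorten_url <- function(url) {
  body <- list(longUrl = url)

  # gar_api_generator() builds a function that handles the auth and response parsing
  f <- gar_api_generator("https://www.googleapis.com/urlshortener/v1/url",
                         "POST",
                         data_parse_function = function(x) x$id)
  f(the_body = body)
}

# Usage outside Shiny: authenticate once, then call the generated function
gar_auth()
shorten_url("https://code.markedmondson.me")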
If you then want to make this multi-user in Shiny, then you just need to use the helper functions provided:
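A sketch of that pattern, assuming the googleAuth Shiny module and with_shiny() helpers from googleAuthR (the exact helper names have shifted between package versions), reusing the shorten_url() function generated above:

library(shiny)
library(googleAuthR)

ui <- fluidPage(
  googleAuthUI("gauth"),                 # per-user Google login button
  textInput("url", "URL to shorten"),
  textOutput("short_url")
)

server <- function(input, output, session) {
  # A reactive token, created when this user completes the login flow
  access_token <- callModule(googleAuth, "gauth")

  output$short_url <- renderText({
    req(input$url, access_token())
    # with_shiny() passes the per-user token into the generated function
    with_shiny(shorten_url,
               shiny_access_token = access_token(),
               url = input$url)
  })
}

shinyApp(ui, server)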
I'm excited about the possibilities with this package, as this new improved data is now available in a way that lets it interact with the thousands of other R packages.
If you'd like to see searchConsoleR capabilities, I have the package running an interactive demo here (very bare bones, but should demo the data well enough).
The first application I'll talk about in this post is archiving data into a .csv file, but expect more guides to come, in particular combining this data with Google Analytics.
Automatic search analytics data downloads
The 90-day limit still applies to the search analytics data, so one of the first applications should be archiving that data, enabling year-on-year and month-on-month comparisons and tracking the general development of your SEO rankings.
The below R script:
Downloads and installs the searchConsoleR package if it isn't installed already.
Lets you set some parameters you want to download.
Downloads the data via the search_analytics function.
Writes it to a csv in the same folder the script is run in.
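A sketch of such a script, assuming searchConsoleR's scr_auth() and search_analytics() - the site URL, date range and dimensions are placeholders to adapt:

# Download and install searchConsoleR if it isn't installed already
if (!require(searchConsoleR)) {
  install.packages("searchConsoleR")
  library(searchConsoleR)
}

## Parameters to change for your own site
website    <- "http://www.example.com"   # placeholder
start_date <- Sys.Date() - 93
end_date   <- Sys.Date() - 3             # the data lags a few days behind
dims       <- c("date", "query")

## Give the package access (interactive the first time, auto-refreshed afterwards)
scr_auth()

search_data <- search_analytics(siteURL    = website,
                                startDate  = start_date,
                                endDate    = end_date,
                                dimensions = dims)

## Write to a dated .csv in the folder the script is run in
write.csv(search_data,
          paste0("search_console_", Sys.Date(), ".csv"),
          row.names = FALSE)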
The .csv file can be opened in Excel or similar.
This should give you nice juicy data.
Considerations
The first time, you will need to run scr_auth() yourself so you can give the package access, but afterwards it will auto-refresh the authentication each time you run the script.
If you ever need a new user to be authenticated, run scr_auth(new_user=TRUE)
You may want to modify the script so it appends to one file rather than creating a daily dump, although I keep a folder of daily .csv's and import them all into one R dataframe (which you could export again as one big .csv).
Automation
You can now take the download script and use it in automated batch files, to run daily.
Open the scheduler: START -> All Programs -> Accessories -> System Tools -> Scheduler
Create a new Task
under tab Action, create a new action
choose Start Program
browse to Rscript.exe which should be placed e.g. here:
"C:\Program Files\R\R-3.2.0\bin\x64\Rscript.exe"
input the name of your file in the parameters field
input the path where the script is to be found in the Start in field
go to the Triggers tab
create new trigger
choose whether the task should run daily, monthly, repeated several times a day, or whatever you like
In Linux, you can probably work it out yourself :)
Conclusion
Hopefully this shows how, with a few lines of R, you can get access to this data set. I'll be doing more posts in the future using this package, so if you have any feedback let me know and I may be able to post about it. If you find any bugs or there are features you would like, please report an issue on the searchConsoleR issues page on Github.
The aim of this post is to give you the tools to enhance your Google Analytics data with R and present it on-line using Shiny. By following the steps below, you should have your own on-line GA dashboard, with these features:
Interactive trend graphs.
Auto-updating Google Analytics data.
Zoomable day-of-week heatmaps.
Top Level Trends via Year on Year, Month on Month and Last Month vs Month Last Year data modules.
A MySQL connection for data blending your own data with GA data.
An easy upload option to update a MySQL database.
Analysis of the impact of marketing events via Google's CausalImpact.
Detection of unusual time-points using Twitter's Anomaly Detection.
A lot of these features are either unavailable in the normal GA reports, or only possible in Google Analytics Premium. Under the hood, the dashboard is exporting the data via the Google Analytics Reporting API, transforming it with various R statistical packages and then publishing it on-line via Shiny.
Here are some details on what modules are within the dashboard. A quick start guide on how to get the dashboard running with your own data is at the bottom.
Trend Graph
Most dashboards feature a trend plot, so you can quickly see how you are doing over time. The dashboard uses the dygraphs JavaScript library, which allows you to interact with the plot to zoom, pan and shift your date window. Plot smoothing is provided at the day, week, month and annual level.
Additionally, the events you upload via the MySQL upload also appear here, as well as any unusual time points detected as anomalies. You can go into greater detail on these in the Analyse section.
Heatmap
Heatmaps use colour intensity to show metrics between categories. The heatmap here is split into weeks and days of the week, so you can quickly scan to see if a particular day of the week is popular - in the below plot, Monday and Tuesday look like the best days for traffic.
The data window is set by what you select in the trend graph, and you can zoom for more detail using the mouse.
Top Level Trends
Quite often headlines just need a number to quickly check. These data modules give you a quick glance into how you are doing, comparing last week to the week before, last month to the month before and last month to the same month the year before. Between them, you should see how your data is trending, accounting for seasonal variation.
MySQL Connection
The code provides functions to connect to a MySQL database, which you can use to blend your data with Google Analytics, provided you have a key to link them on.
In the demo dashboard the key used is simply the date, but this can be expanded to include linking on a userID from say a CRM database to the Google Analytics CID, Transaction IDs to off-line sales data, or extra campaign information to your campaign IDs. An interface is also provided to let end users update the database by uploading a text file.
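A rough sketch of that connection using RMySQL - the credentials, table name and ga_data object are placeholders (the real values live in the secrets.R file described in the Quick Start below):

library(RMySQL)

con <- dbConnect(MySQL(),
                 user     = "dashboard_user",   # placeholder credentials
                 password = "change-me",
                 host     = "173.194.XX.XX",    # static IP of the Cloud SQL instance
                 dbname   = "onlinegashiny")

events  <- dbReadTable(con, "events")           # assumed table of uploaded events
blended <- merge(ga_data, events, by = "date")  # ga_data: your GA export, keyed on date

dbDisconnect(con)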
CausalImpact
In the demo dashboard, the MySQL connection is used to upload Event data, which is then used to compare with the Google Analytics data to see if the event had a statistically significant impact on your traffic. This replicates a lot of the functionality of the GA Effect dashboard.
The headline impact of the event is shown in the summary dashboard tab. If it's statistically significant, the impact is shown in blue.
Anomaly Detection
Twitter has released this R package to help detect unusual time points for use within their data streams, which is also handy for Google Analytics trend data.
The annotations on the main trend plot are indicated using this package, and you can go into more detail and tweak the results in the Analyse section.
Making the dashboard multi-user
In this demo I’ve taken the usual use case of an internal department just looking to report on one Google Analytics property, but if you would like end users to authenticate with their own Google Analytics property, it can be combined with my shinyga() package, which provides functions which enable self authentication, similar to my GA Effect/Rollup/Meta apps.
In production, you can publish the dashboard behind a Shinyapps authentication login (needs a paid plan), or deploy your own Shiny Server to publish the dashboard on your company intranet.
Quick Start
Now you have seen the features, the below goes through the process for getting this dashboard for yourself. This guide assumes you know of R and Shiny - if you don’t then start there: http://shiny.rstudio.com/
You don’t need to have the MySQL details ready to see the app in action, it will just lack persistent storage.
Setup the files
Clone/copy-paste the scripts in the github repository to your own RStudio project.
Find your GA View ID you want to pull data from. The quickest way to find it is to login to your Google Analytics account, go to the View then look at the URL: the number after “p” is the ID.
[Optional] Get your MySQL setup with a user and IP address. See next section on how this is done using Google Cloud SQL. You will also need to white-list the IP of where your app will sit, which will be your own Shiny Server or shinyapps.io. Add your local IP for testing too. If using shinyapps.io their IPs are: 54.204.29.251; 54.204.34.9; 54.204.36.75; 54.204.37.78.
Create a file called secrets.R in the same directory as the app, with the content below filled in with your details.
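An illustrative secrets.R - the variable names here are assumptions, so match them to whatever the repository's scripts actually expect:

## secrets.R - keep this out of version control
GA_VIEW_ID <- "123456789"           # the View ID found above

MYSQL_USER     <- "dashboard_user"
MYSQL_PASSWORD <- "change-me"
MYSQL_HOST     <- "173.194.XX.XX"   # static IP of your Cloud SQL instance
MYSQL_DB       <- "onlinegashiny"
MYSQL_PORT     <- 3306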
Configuring R
1. Make sure you can install and run all the libraries needed by the app:
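For example (the exact package list is an assumption based on the features above, and some of these were GitHub-only at the time of writing):

# CRAN packages used by the dashboard
install.packages(c("shiny", "shinydashboard", "dygraphs", "RMySQL",
                   "CausalImpact", "zoo"))

# GitHub-only packages
install.packages("devtools")
devtools::install_github("skardhamar/rga")            # Google Analytics API client
devtools::install_github("twitter/AnomalyDetection")  # Twitter's anomaly detection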
2. Run the below command locally first, to store the auth token in the same folder. You will be prompted to login with the Google account that has access to the GA View ID you put into step 3, and get a code to paste into the R console. This will then be uploaded with app and handle the authentication with Google Analytics when in production:
> rga::rga.open(where="token.rga")
3. Test the app by hitting the “Run App” button at the top right of the ui.R or server.R script in RStudio, or by running:
> shiny::runApp()
Using the dashboard
The app should now be running locally in a browser window with your own GA data. It can take up to 30 seconds for all the data to load first time.
Deploy the instance on-line to Shinyapps.io with a free account there, or to your own Shiny Server instance.
Customise your instance. If for any reason you don’t want certain features, then remove the feature in the ui.R script - the data is only called when the needed plot is viewed.
Getting a MySQL setup through Google Cloud SQL
If you want a MySQL database to use with the app, I use Google Cloud SQL. Setup is simple:
Make sure you have billing turned on with your billing accounts menu top right.
Go to Storage > Cloud SQL in the left hand menu.
Create a New Instance.
Create a new Database called “onlinegashiny”
Under “Access Control” you need to put in the IP of yourself where you test it, as well as the IPs of the Shiny Server/shinyapps.io. If you are using shinyapps.io the IPs are: 54.204.29.251; 54.204.34.9; 54.204.36.75; 54.204.37.78
Under “IP Address” create a static IP (Charged at $0.24 a day)
You should now have all the access info you need to put in the app's secrets.R for MySQL access. The port should be the default, 3306.
You can also limit the amount of data that is uploaded by the shiny.maxRequestSize option - default is 0.5 MB.
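For example, somewhere in the app's global or server script (the 30 MB figure is just an illustration):

# Raise the upload limit for the text file upload - the value is in bytes
options(shiny.maxRequestSize = 30 * 1024^2)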
Summary
Hopefully the above helps inspire what can be done with your Google Analytics data. The focus has been on giving you tools that let you act on your data.
There is a lot more you can do via the thousands of R packages available, but hopefully this gives a framework you can build upon.
I’d love to see what you build with it, so do please feel free to get in touch. :)
I'm very pleased and honoured to have been accepted into the Google Developer Expert program representing Google Analytics. I should soon have my mug listed with the other GA GDEs on the Google Developer Expert website.
My thanks go to Simo who nominated me and Linda for helping me through the application process.
Alongside my existing work at Wunderman, my role should include some more opportunities to get out there and show what can be done with the GA APIs, so expect me at more analytics conferences soon.
I also will get to play with some of the new betas and hopefully be able to create more cool demo apps for users to adapt and use for their own website, mostly using R Shiny and Google App Engine.
GA Effect is a webapp that uses Bayesian structural time-series to judge whether events happening in your Google Analytics account are statistically significant. It's been well received on Twitter, and how to use it is detailed in this guest post on Online Behaviour, but this post will be about how to build your own or similar.
Update 18th March: I've made a package that holds a lot of the functions below, shinyga. That may be easiest to work with.
Meanwhile RStudio are releasing more and more packages that make it quicker and easier to create interactive graphics, with tools for connecting and reshaping data and then plotting using attractive JavaScript visualisation libraries or native interactive R plots. GA Effect is also being hosted using ShinyApps.io, an R server solution that enables you to publish straight from your console, or you can run your own server using Shiny Server.
Packages Used
For the GA Effect app, the key components were these R packages: shinydashboard for the UI, rga for fetching the Google Analytics data, CausalImpact for the statistics and dygraphs for the plots, all detailed below.
First off, using RStudio makes this all a lot easier as they have a lot of integration with their products.
shinydashboard is a custom theme of the more general Shiny. As detailed in the getting started guide, creating a blank webpage dashboard with shinydashboard takes 8 lines of R code, as sketched below. You can test or run everything locally first before publishing to the web via the “Publish” button at the top.
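Those 8 lines look roughly like this, following the shinydashboard getting-started pattern:

library(shiny)
library(shinydashboard)

ui <- dashboardPage(
  dashboardHeader(),
  dashboardSidebar(),
  dashboardBody()
)

server <- function(input, output) { }

shinyApp(ui, server)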
Probably the most difficult concept to get your head around is the reactive programming in a Shiny app. This is effectively how the interaction occurs, setting up live relationships between inputs from your UI script (always called ui.R) and outputs from your server-side script (called server.R). These are your effective front-end and back-end in a traditional web environment. The Shiny package takes your R code and turns it into HTML5 and JavaScript. You can also import JavaScript of your own if you need to cover what Shiny can't.
The Shiny code then creates the UI for the app, and creates reactive versions of the datatables needed for the plots.
Google Authentication
The Google authentication flow uses OAuth2 and could be used for any Google API in the console, such as BigQuery, Gmail, Google Drive etc. I include the code used for the authentication dance below so you can use it in your own apps:
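The exact code now lives in the shinyga package mentioned above, but a minimal httr sketch of the same OAuth2 dance looks roughly like this - the client details and the handling of the returned auth_code are placeholders:

library(httr)

# Assumed client details from a Google API console project
client_id     <- "xxxx.apps.googleusercontent.com"
client_secret <- "xxxxxxxx"
redirect_uri  <- "https://your-app.shinyapps.io/ga-effect/"

# 1. Send the user to Google's consent screen
auth_url <- modify_url("https://accounts.google.com/o/oauth2/auth",
                       query = list(response_type = "code",
                                    client_id     = client_id,
                                    redirect_uri  = redirect_uri,
                                    scope = "https://www.googleapis.com/auth/analytics.readonly"))

# 2. Google redirects back with ?code=... ; exchange it for an access token
token_response <- POST("https://accounts.google.com/o/oauth2/token",
                       body = list(code          = auth_code,   # parsed from the Shiny session URL
                                   client_id     = client_id,
                                   client_secret = client_secret,
                                   redirect_uri  = redirect_uri,
                                   grant_type    = "authorization_code"),
                       encode = "form")

access_token <- content(token_response)$access_token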
Fetching Google Analytics Data
Once a user has authenticated with Google, the user token is then passed to rga() to fetch the GA data, according to which metric and segment the user has selected.
This is done reactively, so each time you update the options a new data fetch to the API is made. Shiny apps are on a per user basis and work in RAM, so the data is forgotten once the app closes down.
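Roughly, inside server.R, it is a reactive wrapper around an rga-style getData() call - the input IDs and the ga object are placeholders for however the token exchange created them:

# ga: the rga object created once the user's token has been exchanged
ga_data <- reactive({
  req(input$metric, input$segment)

  ga$getData(ids        = input$view_id,
             start.date = input$date_range[1],
             end.date   = input$date_range[2],
             metrics    = paste0("ga:", input$metric),
             dimensions = "ga:date",
             segment    = input$segment,
             batch      = TRUE)
})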
Doing the Statistics
You can now manipulate the data however you wish. I put it through the CausalImpact package as that was the application goal, but you have a wealth of other R packages that could be used such as machine learning, text analysis, and all the other statistical packages available in the R universe. It really is only limited by your imagination.
Here is a link to the CausalImpact paper, if you really want to get in-depth with the methods used. It includes some nice examples of predicting the impact of search campaign clicks.
Here is how CausalImpact was implemented as a function in GA Effect:
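Not the exact GA Effect function, but a sketch of its shape, assuming daily sessions in a zoo time-series and an event date chosen by the user:

library(CausalImpact)
library(zoo)

run_causal_impact <- function(ga_df, event_date, pre_days = 90, post_days = 30) {

  # Daily GA metric as a time-series object
  series <- zoo(ga_df$sessions, as.Date(ga_df$date))

  event_date  <- as.Date(event_date)
  pre_period  <- c(event_date - pre_days, event_date - 1)
  post_period <- c(event_date, event_date + post_days)

  CausalImpact(series, pre.period = pre_period, post.period = post_period)
}

# impact <- run_causal_impact(ga_data(), input$event_date)
# summary(impact); plot(impact)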
Plotting
dygraphs() is an R package that takes R input and outputs the JavaScript needed to display it in your browser, and as its made by RStudio they also made it compatible with Shiny. It is an application of HTMLwidgets, which lets you take any JavaScript library and make it compatible with R code. Here is an example of how the main result graph was generated:
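A sketch of that kind of renderDygraph() call, plotting the observed series against the CausalImpact expectation - the column names follow the CausalImpact result object, while the reactive and output IDs are placeholders:

library(dygraphs)

# server.R side
output$impact_plot <- renderDygraph({
  impact <- causal_impact()   # reactive holding the CausalImpact result

  plot_data <- impact$series[, c("response", "point.pred",
                                 "point.pred.lower", "point.pred.upper")]

  dygraph(plot_data, main = "Expected vs Observed") %>%
    dySeries("response", label = "Observed") %>%
    dySeries(c("point.pred.lower", "point.pred", "point.pred.upper"),
             label = "Expected") %>%
    dyRangeSelector()
})

# ui.R side: dygraphOutput("impact_plot")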
Publishing
I’ve been testing the alpha of shinyapps.io for a year now, but it is just this month (Feb 2015) coming out of beta. If you have an account, publishing your app is as simple as pushing the “Publish” button above your script, after which it appears at a public URL. With the paid plans, you can limit access to authenticated users only.
Next steps
This app only took me 3 days with my baby daughter on my lap during a sick weekend, so I’m sure you can come up with similar given time and experience. The components are all there now to make some seriously great apps for analytics. If you make something do please let me know!
I'm just posting this, to maybe help others who get the same problem.
I had an OSX 10.10.2 update on my 2011 Macbook Air, and left the laptop open last night. This put it in Hibernation mode which breaks the auto-installation, so when I tried to use the laptop this morning, it booted to the Apple logo, but then the screen went totally black without the option to login. The cursor was still live though.
The fix below will let you log in again. It will only work in the above scenario - if it's your backlight that's broken, or something else, keep searching :)
Before the below fix I tried:
Pressing the increase brightness buttons (duh)
Restarting in safe mode (doesn't complete login)
Resetting the SMC and PRAM (pushing CTRL+OPTION+POWER+other buttons on power-up - see here: https://discussions.apple.com/docs/DOC-3603 )
Letting it boot, waiting, then pushing first letter of your username, pushing enter and typing in password (the most popular fix on the web)
But finally, the solution was found at this forum called Jamfnation via some Google-wu:
Perform a PRAM reset ( Cmd+Option+P+R ) on boot – let chime 3 times and let go
Boot to Single User Mode (hold Command+S immediately after powering on)
Verify and Mount the Drives - Once in Single User Mode, run the following commands:
/sbin/fsck -fy
/sbin/mount -uw /
After the disk has mounted, run the following commands:
The Measurement Protocol was launched at the same time as Universal Analytics, but I've seen less adoption of it with clients, so this post is an attempt to show what can be done with it with a practical example.
With this demo you should be able to track the following:
You have an email address from an interested customer
You send them an email and they look at it, but don't click through.
Three days later they open the email again at home, and click through to the offer on your website.
They complete the form on the page and convert.
Within GA, you will be able to see for that campaign 2 opens, 1 click/visit and 1 conversion for that user. As with all email open tracking, you are dependent on the user downloading the image, which is why I include the option to upload an image and not just a pixel, as it may be more enticing to allow images in your newsletter.
Intro
The Measurement Protocol lets you track beyond the website, without the need of client-side JavaScript. You construct the URL and when that URL is loaded, you see the hit in your Google Analytics account. That's it.
The clever bit is that you can link user sessions together via the CID (Customer ID), so you can track the upcoming Internet of Things off-line to on-line, but also things like email opens and affiliate thank you pages. It also works with things like enhanced e-commerce, so can be used for customer refunds or product impressions.
This demo looks at e-mail opens for its example, but only minor modifications are needed to track other things. For instance, I use a similar script to measure in GA when my Raspberry Pi is backing up our home computers via Time Machine.
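As a flavour of how simple a hit is, here is a minimal R sketch (the demo itself is server-side Python on App Engine, described below) that records an email open as an event - the tracking ID, CID and campaign values are placeholders:

library(httr)

tid <- "UA-123456-1"                            # placeholder tracking ID
cid <- "35009a79-1a05-49d7-b876-2b884d0f825b"   # placeholder client ID for this recipient

# A Measurement Protocol v1 hit, sent as a form-encoded POST to the collect endpoint
response <- POST("https://www.google-analytics.com/collect",
                 body = list(v   = 1,               # protocol version
                             tid = tid,             # tracking ID
                             cid = cid,             # client ID - links sessions together
                             t   = "event",         # hit type
                             ec  = "email",         # event category
                             ea  = "open",          # event action
                             cn  = "spring_offer",  # campaign name (placeholder)
                             cm  = "email"),        # campaign medium
                 encode = "form")

status_code(response)  # GA returns 200 even for malformed hits - use the /debug/collect endpoint to validate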
Demo on App Engine
To use the Measurement Protocol in production you will most likely need server-side code. I'm running a demo on Google App Engine coded in Python, which is pretty readable, so it should be fairly easy for a developer to replicate in their favourite language. App Engine is also a good choice if you want to run it in production, since it has a free tier capable of tracking thousands of email opens a day, with the scalability to handle millions.
There are instructions on Github on how it works, but I'll run through some of the key concepts here in this post.
What the code does
The example has four main URLs:
The homepage explaining the app
The image URL itself, that when loaded creates the hit to GA
A landing page with example custom GA tracking script
An upload image form to change the image you would display in the e-mail.
The URLs above are controlled server side with the code in main.py
Homepage
This does nothing server side aside from serving up the page.
Image URL
This is the main point of the app - it turns a GET request for the image uploaded into a POST with the parameters found in the URL. It handles the different options and sends the hit to GA as a virtual pageview or event, with a unique user CID and campaign name. An example URL here is:
Landing Page
This does little but take the cid you put in the email URL and output the CID that will be used in Google Analytics. If this is the same CID as in the image URL and the user clicks through from the email, those sessions will be linked. You can also add the GA campaign parameters, but the server-side script ignores those - the JavaScript on the page will take care of them. An example URL here is:
The CID in the landing page URL is then captured and turned into an anonymous CID for GA. This is then served up to the Universal Analytics JavaScript on the landing page, shown below. Use the same UA code for both, else it won't work (e.g. UA-123456-1)
Upload Image
This just handles the image uploading and serves the image up via App Engines blobstore. Nothing pertinent to GA here so see the Github code if interested.
Summary
It's hoped this helps sell the Measurement Protocol to more developers, as it offers a solution to a lot of the problems with digital measurement today, such as attribution of users beyond the website. The implementation is reasonably simple - the power is in what you send and in which situations. Hopefully this inspires what you could do with your setup.
There are some limitations to be aware of - the CID linking won't stitch sessions together, it just discards a user's old CID if they already had one, so you may want to look at userID or how to customise the CID for users who visit your website first before the email is sent. The best scenario would be if a user is logged in for every session, but this may not be practical. It may be that the value of linking sessions is so advantageous in the future, entire website strategies will be focused on getting users to ID themselves, such as via social logins.
Always consider privacy: look for users to opt in, and make sure to use GA filters to take out any PII you may put into GA as a result. Current policy looks to be that if the data within GA cannot be traced to an individual (e.g. a name, address or email) then you are able to record an anonymous personal ID, which could be exported and linked to PII outside of GA. This is a bit of a shifting target, but in all cases keeping it as user-focused and not profit-focused as possible should see you through any ethical questions.