Run R, RStudio and OpenCPU on Google Compute Engine [free VM image]

File this under "what I wished was on the web whilst trying to do this myself."

edit 20th November, 2016 - now everything in this post is abstracted away and available in the googleComputeEngineR package - I would say it's a lot easier to use that. Here is a post on getting started with it: http://code.markedmondson.me/launch-rstudio-server-google-cloud-in-two-lines-r/

edit 30th April, 2016: I now have a new post up on how to install RStudio Server on Google Compute Engine using Docker, which is a better way to do it. 

edit 30th Nov, 2015: Oscar explains why some users couldn't use their username

edit 5th October: Added how to login, add users and migrated from gcutil to gcloud

Google Compute Engine is a very scalable and quick alternative to Amazon Web Services, but a bit less evolved in the machine images it offers users.

If you would like a VM with R 3.0.1, RStudio Server 0.98 and OpenCPU installed, you can click on the link below and install a pre-configured image to build upon.

With this image, you have a cloud server with the most popular R / Cloud interfaces available, which you can use to apply statistics, machine learning or other R applications on web APIs.  It is a fundamental building block for a lot of my projects.

The VM image is here. [940.39MB]

To use, follow these steps:

Downloading the instance and uploading to your project

  1. Create your own Google Cloud Compute project if you don't have one already.
  2. Put in billing details.  Here are the prices you'll pay for running the machine. It's usually under $10 a month.
  3. Download the image from the link above (and here) and then upload it to your own project's Cloud Storage. Details here.
  4. Add the uploaded image to your project, with a name made up only of lowercase letters, numbers and hyphens (-).  Details here. You can do this using gcloud by typing: 
$ gcloud compute images create IMAGE_NAME --source-uri URI
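For example, if you uploaded the image tarball to a Cloud Storage bucket called my-r-images (a hypothetical name), it would look like:

$ gcloud compute images create r-studio-opencpu20140628 \
    --source-uri gs://my-r-images/r-studio-opencpu20140628.image.tar.gz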

Creating the new Instance

  1. Now go to Google Compute Engine, and select Create New Instance
  2. Select the zone and machine type you want (e.g. you can temporarily select a 50GB RAM machine if needed for big jobs)
  3. In the dropdown for images you should be able to see the image from step 4 above.  Here is a screenshot of how it should look; I called my image "r-studio-opencpu20140628"

Or, if you prefer the command line, you can do the steps above in one command with gcloud like this:

$ gcloud compute instances create INSTANCE [INSTANCE ...] --image IMAGE
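For example, a hypothetical invocation (substitute your own instance name, zone and machine type):

$ gcloud compute instances create r-server-1 \
    --image r-studio-opencpu20140628 \
    --zone europe-west1-b \
    --machine-type n1-standard-1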

Using your instance

You should now have RStudio running at http://your-ip-address/rstudio/, OpenCPU running at http://your-ip-address/ocpu/test, and a welcome homepage at the root, http://your-ip-address

To log in, use your Google username - as creator of the Google Cloud project you are an admin. See here for adding users to Google Cloud projects

If you don't know your username, try this command using gcloud to see your user details:

$ gcloud auth login
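If you are already authenticated, this will list your credentialed accounts without starting a new login flow:

$ gcloud auth list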

Any user you add to the Debian system running on the instance will also have a login for RStudio - to SSH into the instance and add new users, see below:

$ ## ssh into the running instance
$ gcloud compute ssh <your-username>@new-instance-name
$ #### It should now tell you that you are logged into your instance #####
$ #### Once logged in, add a user: example with jsmith
$ sudo useradd jsmith
$ sudo passwd jsmith
$ ## give the new user a directory and change ownership to them
$ sudo mkdir /home/jsmith
$ sudo chown jsmith:users /home/jsmith

Oscar in the comments below also explains why sometimes your username may not work:

Like other commenters, my username did not work.

Rather than creating a new user, you may need to simply add a password to your user account:

$ sudo passwd <your-username>

Also, the username will be your email address with the '.' replaced with '_'. So xx.yy@gmail.com became xx_yy

You may also want to remove my default user the image comes with:

$ sudo userdel markedmondson

...and remove my folder:

$ sudo rm -rf /home/markedmondson

The configuration used

If you would like to look before you leap, or prefer to install this yourself, a recipe is below. It largely cobbles together instructions from around the web supplied by these sources:

Many thanks to them.

It covers installation on the Debian Wheezy images available on GCE, with the necessary backports:
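In outline, the steps look something like the sketch below - the repository line and version numbers here are assumptions, so check CRAN and RStudio for the current ones:

$ ## add the CRAN repository carrying the Wheezy backports of R 3.x
$ sudo sh -c 'echo "deb http://cran.r-project.org/bin/linux/debian wheezy-cran3/" >> /etc/apt/sources.list'
$ sudo apt-get update
$ sudo apt-get install -y r-base r-base-dev
$ ## RStudio Server is distributed as a .deb - the version here is an assumption
$ sudo apt-get install -y gdebi-core
$ wget http://download2.rstudio.org/rstudio-server-0.98.507-amd64.deb
$ sudo gdebi rstudio-server-0.98.507-amd64.deb
$ ## OpenCPU - shown as the single-user R package for brevity; the full
$ ## Apache-based server install follows the OpenCPU documentation
$ sudo Rscript -e 'install.packages("opencpu")'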


How To Use R to Analyse and Plot Your Twitter Use

Here is a little how-to if you want to use R to analyse Twitter.  This is the first of two posts: this one talks about the How, the second will talk about the Why.  

If you follow all the code you should be able to produce plots like this:

As with all analytic projects, it's split into four different aspects: 1. getting the data; 2. transformations; 3. analysing; 4. plotting.

All the code is available on my first public GitHub project:

https://github.com/MarkEdmondson1234/r-twitter-api-ggplot2

I did this project to help answer a question: can I tell from my Twitter activity when I changed jobs or moved country?

I have the feeling that the more SEO I do, the more I rely on Twitter as an information source, whereas for analytics it's more independent research that takes place on StackOverflow and GitHub. Hopefully this project can test whether that is true.

1. Getting the data

R makes getting tweets easy via the twitteR package.  You need to install that, register your app with Twitter, then authenticate to get access to the Twitter API.
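A minimal sketch of that flow, using recent versions of twitteR (the credential strings are placeholders for the keys from your registered app):

library(twitteR)

## authenticate with the keys Twitter gives you for your registered app
setup_twitter_oauth(consumer_key = "YOUR_KEY",
                    consumer_secret = "YOUR_SECRET",
                    access_token = "YOUR_TOKEN",
                    access_secret = "YOUR_TOKEN_SECRET")

## fetch up to the API maximum of 3,200 of your own tweets
tweets <- userTimeline("your_screen_name", n = 3200)
tweets_df <- twListToDF(tweets)  ## convert to a data.frame for the steps below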

An alternative to the API is Twitter's own data export, which lets you go beyond the API's 3,200-tweet limit. This gives you a csv file which you can load into R using read.csv()
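Loading that export is then one line - assuming Twitter's file is called tweets.csv:

tweets_df <- read.csv("tweets.csv", stringsAsFactors = FALSE)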

2. Transforming the data

For my purposes, I needed to read the timestamps of the tweets and put them into early, morning, afternoon and evening buckets, so I could then plot the data.  I also created a few aggregates of the data to suit what I needed to plot, and output these dataframes from my function in a list.

Again, as with most analytics projects, this section represents most of the work, with plenty of to and fro as I tweaked the data I wanted in the chart.  One tip I've picked up is to do these data transformations in a function that takes the raw data as input and outputs your processed data, as that makes it easier to repeat for different data inputs.
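As an illustration, here is a sketch of that kind of transformation function - the cut points, timezone and column names are my assumptions, not the exact ones in the repo:

library(lubridate)

bucketDayParts <- function(tweets_df) {
  ## hour of day in a local timezone (assumed here to be Copenhagen)
  h <- hour(with_tz(tweets_df$created, "Europe/Copenhagen"))
  tweets_df$daypart <- cut(h, breaks = c(0, 6, 12, 18, 24),
                           labels = c("Early", "Morning", "Afternoon", "Evening"),
                           include.lowest = TRUE, right = FALSE)
  tweets_df
}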

3. Analysing the data

This will be covered in the second post, and is usually the point of the whole exercise - it takes only about 10% of the time on the project, but is the most important part.

4. Plotting the data

This part evolves as you go to and fro between steps 2-3, but what I ended up with were the functions below.

theme_mark() is a custom ggplot2 theme you can use if you want the plots to look exactly the same as above, or at the very least as a guide to customising ggplot2 with your own fonts/colours.  It also uses choosePalette() and installFonts(). "mrMustard" is my name for the chosen colour scheme.

I use two layers in the plot - one is the area plot to show the total time spent per Day Part, the second is a smoother line to help pick out the trend better for each Day Part.

plotTweetsDP() takes as input the tweetTD (weekly) or tweetTDm (monthly) dataframes, and plots the daypart dataframe produced by the transformations above.  The timeAxis parameter expects "yw" (yearWeek) or "ym" (yearMonth), which it uses to suit the x-axis to each.

plotLinksTweets() is the same, but works on the tweetLinks dataframe.
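For illustration, a much-simplified stand-in for what plotTweetsDP() does - the real version, with theme_mark() and the axis handling, is in the GitHub repo, and the column names here are assumptions:

library(ggplot2)

plotTweetsSketch <- function(tweetTD) {
  ggplot(tweetTD, aes(x = period, y = tweets)) +
    geom_area(aes(fill = daypart), position = "stack") +  ## total per Day Part
    geom_smooth(aes(colour = daypart), se = FALSE) +      ## trend per Day Part
    theme_minimal()  ## stand-in for theme_mark()
}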


I hope this is of some use to someone - let me know in the comments!  Also welcome are any ideas on where to go from here; at the moment I'm working through some text mining packages to try and get something useful out of those. 

Again the full project code is available on Github here: https://github.com/MarkEdmondson1234/r-twitter-api-ggplot2

My Google Analytics Time Series Shiny App (Alpha)

There are many Google Analytics dashboards like it, but this one is mine:

My Google Analytics Time Series App

It's a bare-bones framework where I can start to publish publicly some of the R work I have been learning over the past couple of years. 

It takes advantage of an alpha of ShinyApps, a public hosting service for R Shiny, which I love and adore. 

At the moment the app has just been made to authenticate and show some generic output, but I plan to create a lot more interesting plots/graphs from it in the future.

How To Use It

  1. You need a Google Analytics account.  
  2. Go to https://mark.shinyapps.io/GA_timeseries/
  3. You'll see this screen.  Pardon the over-heavy legal disclaimers; I'm just covering my arse.  I have no intention of using this app to mine data, but others' GA apps might, so be wary of giving Google Analytics access to other webapps, especially now it's possible to add users via the management API.
  4. Click the "GA Authentication" link.  It'll take you to the Google account screen, where you say it's OK to use the data (if it is), and copy the token it then displays.
  5. This token allows the app (but not me) to process your data.  Go back to the app and paste the token in the box.
  6. Wait about 10 seconds, depending on how many accounts you have in your Google Analytics.
  7. Sometimes you may see "Bad Request", which means the GA call has errored.  Hard reload the page (on Firefox this is SHIFT + RELOAD) and reauthenticate, starting from step 2 above. Sorry.
  8. You should now see a table of your GA Views on the "GA View Table" tab.  You can search and browse the table, and choose the account and profile ID you want to work with via the left-hand dropdowns. Example using Sanne's Copenhagenish blog:
  9. If you click on the "Charts" tab in the middle, you should see some Google Charts of your Visits and PageViews. Just placeholders for now.
  10. If you click on the "Forecasts" tab you should see some forecasting of your visits data.  If it doesn't show, make sure the date range to the far left covers 70 days (say 1st Dec 2013 to 20th Feb 2014). 
  11. The forecast is based on Holt-Winters exponential smoothing, to try and model seasonality.  The red line is your actual data, the blue line the model's fit, extended 70 days into the future. The green area is the margin of error at 50% confidence, and the time axis shows the number of months.  To be improved - see the sketch of the method after this list.
  12. Under the forecast model is a decomposition of the visits time series. The top graph is the actual data, the second the trend with seasonality removed, the third the 31-day seasonal component, and the fourth the random everything else.
  13. In the last "Data Table" tab you can see the top 1000 rows of data.
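For the curious, the forecast and decomposition boil down to base R's stats functions. A minimal sketch, assuming ga_visits is a daily visits vector pulled from the GA API and a 31-day seasonal period (the app's actual settings may differ):

visits_ts <- ts(ga_visits, frequency = 31)  ## ~monthly seasonality, per the decomposition
fit <- HoltWinters(visits_ts)               ## triple exponential smoothing
fc <- predict(fit, n.ahead = 70, prediction.interval = TRUE, level = 0.5)
plot(fit, fc)                               ## actuals, fit and 50% interval
plot(decompose(visits_ts))                  ## trend / seasonal / random graphs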

That's it for now, but I'll be doing more in the future with some more exciting uses of GA data, including clustering, unsupervised learning, multinomial regression and sexy stuff like that.

Update 24th Feb

I've now added a bit of segmentation, with SEO and Referral data available trended, forecasted and decomposed.