Churn Prediction for Digital Publishers – Part I

Churn prediction is one of the best-known applications of machine learning and data science in the Sales & Marketing fields. Churn rate, when applied to a customer base, refers to the proportion of contractual customers or subscribers who leave a supplier during a given time period. It is a possible indicator of customer dissatisfaction, cheaper and/or better offers from the competition, more successful sales and/or marketing by the competition, or reasons having to do with the customer lifecycle.

It is an important factor for any business with a subscriber-based service model, including mobile telephone networks, pay TV operators, software-as-a-service or even digital magazines and newspapers.

In the digital publishing space, reader revenue is an attractive alternative to the uncertainty of advertiser budgets and ad-blocking. Securing revenue from readers has become publishing’s latest holy grail, but the dominance of free information has made it difficult for publishers to charge for their content.

In this article we’re focusing on analyzing churn in the context of ad-based digital publications; in other words, readers simply access the information via a web or mobile application, without subscribing through a paywall/paygate (as in a classical paid content strategy). This is an interesting case because there are no contractual customers or subscribers. Think about it: how do we define a churned user for an ad-based digital publisher?

Well, if we refer to the Pirate Metrics framework (proposed by Dave McClure in the startup ecosystem), we understand that a certain user has to be retained first before the eventual churn occurs. This means that for ad-based digital publishers, a loyal reader is considered to be the one that comes back to the mobile/web application frequently.

In Google Analytics this can be easily explored by navigating to Audience Overview and segmenting users by “New” vs “Returning” criteria within a given period of time:

As you can see from this example, 25% of users are in fact returning, which means that the loyal readers for this digital publisher came back to the mobile/web application at least twice within the given time frame.

Now that we’ve identified returning users, the question becomes: what percentage of that 25% is actually churning, that is, never coming back?

Again, we can use Google Analytics, this time by exploring the Cohort Analysis report:

What this tells us is that, on average, roughly 37% of returning users churn every month (calculated from the orange slope). We now have the churn rate for our ad-based digital publisher! To put this number into perspective, at the current churn rate nearly all of the returning users will be lost by month 5.
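
To make the month 5 figure concrete, here is a minimal sketch of the arithmetic. It assumes (our assumption, not something the Cohort Analysis report states) that the 37% monthly rate stays constant and compounds from month to month; the function name is illustrative.

```python
def remaining_fraction(monthly_churn: float, months: int) -> float:
    """Fraction of a cohort still active after `months`,
    assuming a constant churn rate that compounds every month."""
    return (1.0 - monthly_churn) ** months


# Print the share of returning users left after each of the first five months.
for month in range(1, 6):
    print(month, round(remaining_fraction(0.37, month), 3))
```

Under that assumption, about 10% of the original returning users are still active after five months, so roughly 90% of them have been lost.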

How can we prevent that from happening? How can we increase the lifetime of a loyal reader?

What if we could predict when a certain reader is about to churn, empowering digital publishers to be proactive instead of reactive? Users with a high probability of churning can be offered various incentives, polled to better understand why they’re leaving, or retargeted in marketing campaigns on Facebook or the Google Display Network.

With that in mind we started digging into ways of predicting users that are about to churn and we’ve documented the entire process below.

Setting the Stage

By default, Google Analytics includes a series of reports, for example one showing the total number of users and sessions for a particular date interval.

Sample report from the Google Analytics dashboard

Most reports from Google Analytics use aggregated data, meaning they sum up the activity of all users. In 2016, the User Explorer report was added to the mix, which “lets you isolate and examine individual rather than aggregate user behavior”. This report is very valuable when it comes to understanding the behavior of individual users and personalizing their experience.

Sample User Explorer report from the Google Analytics dashboard

The free version of the Google Analytics Reporting API v4 doesn’t export any Client IDs from the User Explorer report. However, it is possible to make these available by creating a custom dimension with the same value as the Client ID, a process we have documented on our GitHub account. This allows the analytics API to export data at the Client ID, Session or Hit level, instead of returning only aggregated data.

We should clarify that the Client ID refers to a browser, not to a user account, so it doesn’t contain any personal data. It is possible to associate the Client ID with a user account (across devices); however, in this particular use case, all Client IDs refer to browsers.

The Google Analytics Reporting API can export data at the user level, session level or hit level. Each user can have multiple sessions and each session has multiple hits.

Google Analytics Data Structure
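
The user, session and hit hierarchy can be pictured as nested records. The sketch below is purely illustrative: the class and field names are our own and are not the schema returned by the Reporting API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Hit:
    hit_type: str   # e.g. "pageview" or "event" (illustrative values)
    page_path: str


@dataclass
class Session:
    session_id: str
    hits: List[Hit] = field(default_factory=list)


@dataclass
class User:
    client_id: str  # identifies a browser, not a person
    sessions: List[Session] = field(default_factory=list)

    def total_hits(self) -> int:
        """Sum the hits across all of this user's sessions."""
        return sum(len(session.hits) for session in self.sessions)


# A hypothetical user with two sessions and three hits in total.
user = User(
    client_id="1234.5678",
    sessions=[
        Session("s1", [Hit("pageview", "/home"), Hit("pageview", "/article-1")]),
        Session("s2", [Hit("pageview", "/article-2")]),
    ],
)
print(len(user.sessions), user.total_hits())
```

Aggregated reports collapse this tree into totals, while the Client ID custom dimension lets us keep one row per user.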

Exporting & Labeling the Data

The most relevant data related to a user’s history that we can obtain from the API includes:

  • Sessions (total sessions for each user, in a time interval)
  • Session duration (total session duration for each user, in a time interval)
  • Avg. session duration
  • Entrances
  • Bounces
  • Pageviews
  • Unique pageviews
  • Screen views
  • Page value
  • Exits
  • Time on page
  • Avg. time on page
  • Page load time (ms)
  • Avg. page load time (sec)
  • Days since last session
  • Count of sessions (total number of sessions for the user, independent of the selected time interval)
  • Hits (total hits for each user, in a time interval)
  • Device category (mobile, desktop or tablet)

We used Google Data Studio to export CSV files with all of those columns. We also had to do some stitching, because Google Analytics can only export 9 dimensions/metrics at a time, whereas Google Analytics 360 has no such limitation.
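
The stitching step amounts to joining several partial exports on the Client ID. Here is a minimal sketch using only the Python standard library; the column names and the inline sample data are hypothetical stand-ins for the real CSV files.

```python
import csv
import io

# Two hypothetical partial exports; in practice these would be files on disk.
EXPORT_A = """client_id,sessions,pageviews
111.aaa,5,12
222.bbb,1,1
"""
EXPORT_B = """client_id,bounces,days_since_last_session
111.aaa,1,3
222.bbb,1,40
"""


def read_by_client_id(text):
    """Index the rows of one CSV export by their client_id column."""
    return {row["client_id"]: row for row in csv.DictReader(io.StringIO(text))}


def stitch(*exports):
    """Inner-join several exports on client_id.

    Only Client IDs present in every export are kept.
    """
    tables = [read_by_client_id(text) for text in exports]
    common_ids = set(tables[0]).intersection(*tables[1:])
    stitched = []
    for client_id in sorted(common_ids):
        merged = {}
        for table in tables:
            merged.update(table[client_id])
        stitched.append(merged)
    return stitched


rows = stitch(EXPORT_A, EXPORT_B)
print(rows[0])
```

An inner join is a deliberate choice here: a user missing from one export would otherwise end up with empty feature columns.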

To be able to predict if a user is going to churn or not, we first needed to label the exported data as churned / not churned by calculating the average time between sessions of retained users.

As it turns out, the mean Avg. Days Between Sessions was 38 days for one publisher and 23 days for another. In the figures below, you can see where these values sit on the long tail (the majority of users return in the first couple of days after their initial visit). In other words, if a user has Days Since Last Session > mean(Avg. Days Between Sessions), they are labeled as churned.


Histograms of Avg. Days Between Sessions for two different publishers
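
The labeling rule can be sketched in a few lines. The sample rows and the function name below are made up for illustration; only the rule itself (Days Since Last Session greater than the mean Avg. Days Between Sessions) comes from the text.

```python
from statistics import mean

# Hypothetical rows for retained (returning) users:
# (client_id, avg_days_between_sessions, days_since_last_session)
users = [
    ("a", 2.0, 1),
    ("b", 10.0, 45),
    ("c", 60.0, 5),
    ("d", 24.0, 30),
]


def label_churn(rows):
    """Label a user as churned when their days since last session
    exceeds the mean of avg. days between sessions across all rows."""
    threshold = mean(avg_gap for _, avg_gap, _ in rows)
    labels = {cid: days_since > threshold for cid, _, days_since in rows}
    return threshold, labels


threshold, labels = label_churn(users)
print(threshold, labels)
```

With this sample the threshold is 24 days, so users "b" and "d" are labeled as churned.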

Predicting Churners

We then used logistic regression to classify users (churned vs. retained) and to detect when a user is getting close to churning. We did that by uploading the CSV file into Amazon Machine Learning; the process and the results are described below.

In the Create ML Model wizard from the Amazon ML console you can create the datasource and specify that the first line of the .csv file contains the column names. Generally, Amazon ML automatically infers the data types of attributes, distinguishing between Binary, Categorical, Numeric, and Text attributes. You can correct incorrectly inferred types.

After confirming the datasource to be used for creating a model, we needed to select the target attribute: Churn. The wizard also asks about the identifier for each row. In a practical application, you should supply an identifier for each data point (such as a customer ID) to tie the churn predictions back to the corresponding customers. Because this dataset doesn’t contain a natural ID, we chose No and proceeded to create the datasource.

As it creates the datasource, Amazon ML analyzes the data and computes basic statistics, such as the correlation of individual attributes to the target attribute, and the range and distribution of the values. It always pays to spend some time trying to understand the data you are working with.

Amazon ML automatically transforms all binary values, such as yes/no and True/False into corresponding 1/0 values.

Now that we’ve prepared the datasource, we’re ready to train the model. With Amazon ML, this can be as simple as clicking a button, pointing to the datasource, and choosing default configurations. By default, the service sets aside 30% of your training data for model evaluation and uses the other 70% to train the model.
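
For readers who want to see what happens behind the button, here is a minimal, standard-library sketch of a 70/30 split followed by logistic regression trained with batch gradient descent. It is not Amazon ML’s implementation: the synthetic data, feature meanings, learning rate and epoch count are all assumptions chosen for illustration.

```python
import math
import random


def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_score(w, b, x):
    """Churn score between 0 and 1 for one feature vector."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def train_logistic(X, y, lr=0.5, epochs=300):
    """Fit weights and bias by batch gradient descent on the log loss."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = predict_score(w, b, xi) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Synthetic users: [scaled session count, scaled days since last session].
rng = random.Random(0)
X, y = [], []
for _ in range(200):
    sessions, days_since = rng.uniform(0, 1), rng.uniform(0, 1)
    X.append([sessions, days_since])
    y.append(1 if days_since - sessions + rng.gauss(0, 0.1) > 0 else 0)

# Shuffle, then keep 70% for training and 30% for evaluation.
indices = list(range(len(X)))
rng.shuffle(indices)
cut = int(0.7 * len(indices))
train_idx, test_idx = indices[:cut], indices[cut:]

w, b = train_logistic([X[i] for i in train_idx], [y[i] for i in train_idx])
scores = [predict_score(w, b, X[i]) for i in test_idx]
accuracy = sum((s >= 0.5) == bool(y[i]) for s, i in zip(scores, test_idx)) / len(test_idx)
print(round(accuracy, 3))
```

The held-out 30% is never seen during training, which is what makes the evaluation score a fair estimate of how the model behaves on new users.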

Additionally, Amazon ML chooses default training parameters and uses a default data recipe for feature transformation prior to training. You can use the default settings, but if you carefully look at your data, supplying a custom recipe based on your insights might improve the accuracy of the model. This binary classification model actually produces a score between 0 and 1, instead of giving us a truly binary response.

As you can see, we got an initial area under the curve (AUC) of 0.844, which is considered very good for most machine learning applications. Because the model outputs a score rather than a hard label, we can further tune predictions by selecting a threshold, or cutoff score, that the service uses to produce the final prediction. By default, this threshold is set at 0.5. The Amazon ML console lets you change the threshold interactively and shows how a given setting affects the four statistics: true positives, false positives, true negatives and false negatives.
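
Both the AUC and the effect of the threshold can be reproduced by hand on a toy example. The scores and labels below are invented, and treating a score equal to the threshold as a positive prediction is our assumption; the functions are illustrative helpers, not part of Amazon ML.

```python
def auc(scores, labels):
    """Probability that a random churned user gets a higher score
    than a random retained user (ties count as half)."""
    pos = [s for s, label in zip(scores, labels) if label]
    neg = [s for s, label in zip(scores, labels) if not label]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def confusion_counts(scores, labels, threshold=0.5):
    """Count the four outcomes when scores >= threshold are predicted as churned."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for score, label in zip(scores, labels):
        predicted_churn = score >= threshold
        if predicted_churn and label:
            counts["tp"] += 1
        elif predicted_churn:
            counts["fp"] += 1
        elif label:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


# Toy model output: 1 = churned, 0 = retained.
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]

print(round(auc(scores, labels), 3))
print(confusion_counts(scores, labels, threshold=0.5))
print(confusion_counts(scores, labels, threshold=0.35))
```

Lowering the threshold from 0.5 to 0.35 catches the one churner that was missed, without adding a false positive in this toy data. On real data a lower threshold usually trades more false positives for fewer missed churners.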

Conclusion & Next Steps

In this example, we clearly made many simplifying assumptions. To realistically predict churn, you might want to take into account some of the following issues:

  • Some customers who receive retention incentives will still churn.
  • We should model the evolution of customer behavior. If reading time is dropping and the number of sessions is increasing, we are more likely to experience churn than if the trend is the opposite. A customer profile should incorporate behavior trends.
  • We didn’t assign any monetary costs in our analysis, which is rarely realistic when dealing with churn.
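
One simple way to capture a behavior trend is to add the slope of a per-user time series as a feature. The sketch below fits a least-squares line through weekly values; the weekly reading times and the function name are hypothetical.

```python
def trend_slope(values):
    """Least-squares slope of `values` against their index (0, 1, 2, ...).

    A negative slope means the metric is dropping over time.
    """
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    cov = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


# Hypothetical reading time (minutes) over four consecutive weeks.
weekly_reading_minutes = [30, 24, 18, 12]
slope = trend_slope(weekly_reading_minutes)
print(slope)
```

Here the reader loses six minutes of reading time per week, a signal the aggregated totals used in this post cannot express.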

Regardless of the added complexity, the same principles described in this post will likely apply. Hopefully, this post has inspired you to tackle churn prediction by using MorphL (available for free on GitHub) and connecting it to your Google Analytics account to start crunching that data.

Let’s Make AI Happen!


