Over all of my projects for all of my different clients - from data mining on biomechanics data to predicting user behavior in games to recommending connections in a social network - I began noticing that all my reports and work logs fell into one of four contexts, each with a consistent target audience and a consistent take-home. Formalizing those four contexts gave me this data science process.

I will use three case studies as props to walk through this process by example. The first - identity theft at a financial institution - is totally fabricated so I can idealize the numbers to clearly illustrate the goal of each step. The second - churn prediction within games on mobile devices - is used because it is the most sophisticated of my projects to date that is clearly in the public domain. The third - uplift modeling in a major bank - is chosen for its applicability to most businesses and the fantastic results achieved.


The Data Science Process

My four part data science process is:

  1. Enumerate
  2. Evaluate
  3. Integrate
  4. Grow

Enumerate all places data and decisions meet - these are your points of leverage. Evaluate each point of leverage for its business value by building a proof of concept. Integrate the worthwhile initiatives. Through developing those three steps, Grow your data science infrastructure.

Let's walk through these steps one at a time.


Enumerate: This is a business problem.


Look at all the places in your business where data and decision making intersect. Each is a possible point of leverage. Enumerate each of these possible points of leverage and, for each, document:

  • A list of all data that might support the decision being made
  • The precise list of variables you would like to know before making your decision. Be creative here and pretend that telling the future is possible through science. Variables like "the market result of each decision" or "what the customer most wants" are totally valid candidates.
  • A business problem you're trying to solve, being sure to include what metrics / KPIs you're wanting to optimize.

For each of the examples, we'll just pick one point of leverage and run with it to keep focus on the data science process rather than spending lots of time building business models. I challenge you to come back to this Enumerate step after reading through the rest of this post and map out possible points of leverage for the business that is most relevant to you.

Example: Identity Theft

Let's say our fabricated bank sees 100MM customers and has all historic data on each customer. The point of leverage we'll be discussing is handling fraud in the form of identity theft. Let's say identity theft happens to 0.1% of their clients yearly, where each case results in an average of $5k in losses. That results in a total of $500MM in losses a year.

The business problem to be solved is early detection of identity theft: the sooner you can detect an identity has been stolen, the sooner you can cut off transactions and thus mitigate losses. The bank would, at a minimum, have a complete record of its users' past transactions, so that's the data we'll run with. The variable you want to see is whether or not a given transaction is fraudulent.

Example: Churn Prediction

Our second example took place in the company that was PlayHaven and has since become UpSight. PlayHaven made a toolkit for games on mobile devices. If you were to make a game and wanted to add an announcement for an upcoming sequel at the start of your game, you'd have to go into the code of your game, make a pop-up, submit the new version of the game (pop-up included) to Apple or Google (taking weeks), and wait for users to update their game version (taking more weeks). PlayHaven gave you a way to drop in 'content units' throughout your game ahead of time. Each content unit could later be filled with an ad, announcement, or virtual good, with a layer of targeting and analytics on top of it all. Content within a content unit could be changed live through PlayHaven's dashboard, meaning you could announce your pending sequel in a matter of moments rather than months. This all enabled a higher level of sophistication in content targeting as well.

The idea is that, if you could learn when a user was about to leave a certain game, you could learn how to act to keep that user around. Other industries have shown us that it often costs less to retain a current user than to acquire a new one. More strictly speaking, predicting when a user is about to leave is known as churn prediction, while learning what action to take to keep them around is known as response modeling. This project was just the first step of that: churn prediction.

The data we have is the complete history of user activity within any PlayHaven-integrated game. The variable we want is whether a given user within a given game will be leaving that game within a week. The business objective is user retention. At the time, PlayHaven was integrated into 5k games seeing 130MM unique users logging a total of 2.5B game sessions.

Example: Uplift Modeling

Our third example is a white-papered case study of uplift modeling done on marketing campaigns for U.S. Bank. Uplift modeling can be seen as a special case of response modeling applied to increasing customer engagement or retention. U.S. Bank had millions of customers, billions of checks handled yearly, and hundreds of billions of assets. In systems that large, any shift in performance results in huge shifts in the bottom line, and the shift they were seeing was failing marketing campaigns. The data available was at least the complete transaction history for each user. The data desired is which marketing content a user wants to see. The business objective is increased revenue through better cross-selling.


Evaluate: This is a machine learning problem.


Each of the previously Enumerated points of leverage sets up the context of a machine learning problem. The list of all supporting data is the input to your model. The variables you want to know are the output of your model. The business objective defines your error function.

The questions you're answering are "How should the data be modeled?" and "What is the projected return on my business objective given the model performance?" Thus, your deliverables include the model chosen, a performance analysis in terms of the model, and the projection of performance in terms of the business.

Example: Identity Theft

Two angles to detect fraud are to 1) use supervised learning methods to extract patterns from past known cases of fraud and then use that model to look for fraudulent patterns in future data and 2) learn each user's 'normal' behavior patterns and then flag any major deviations as possibly fraudulent. Neither angle is perfect, but the two complement each other. For the sake of our example, let's say we build models for both angles and, together, fraud can be detected within 3 days at 70 percent accuracy. That would give a projected savings of up to $500MM × 0.7 = $350MM, minus the costs of implementing the model and minus the cost of those first three days of fraud. We'll make a more accurate cost analysis in the next step.
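As an illustration of the second angle, a per-user anomaly detector can be as simple as a robust z-score against the user's own transaction history. This is a fabricated sketch to match the fabricated example - the single feature (transaction amount), the threshold, and the function name are all assumptions, not a production fraud system:

```python
import numpy as np

# A minimal sketch of the second angle: flag transactions that deviate
# sharply from a user's own historical pattern. Real systems would use many
# features (merchant, location, time of day); amount alone is illustrative.

def flag_anomalies(history, new_amounts, threshold=3.0):
    """Flag new transaction amounts more than `threshold` robust z-scores
    from the user's historical median spend."""
    history = np.asarray(history, dtype=float)
    median = np.median(history)
    # Median absolute deviation: robust to occasional outliers in history.
    mad = np.median(np.abs(history - median)) or 1.0  # avoid divide-by-zero
    scores = 0.6745 * np.abs(np.asarray(new_amounts, dtype=float) - median) / mad
    return scores > threshold

# A user who usually spends $20-$60 suddenly has a $900 charge.
history = [25, 40, 31, 55, 22, 47, 38, 60, 29, 33]
print(flag_anomalies(history, [45, 900, 30]))  # [False  True False]
```

In practice the threshold trades false alarms (annoyed customers) against missed fraud, which is exactly the cost trade-off the next step has to quantify.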

Example: Churn Prediction

There was plenty of data available, but what I ended up using for the final product was just the event logs for 'opens'. That is, each time a user opened a game, we logged the tuple of (user id, game id, timestamp). I aggregated the raw open events into a time series representing the number of opens per day across two weeks. Storing this two-week time series for each user in each game they played gave me the input to the model. The model's output - the variable to be predicted - is whether or not each of those (user id, game id) pairs would churn.
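The aggregation step described above can be sketched in a few lines. The tuple fields mirror the description in the text, but the function name and schema details are illustrative, not PlayHaven's actual pipeline:

```python
from collections import Counter
from datetime import date, timedelta

# Sketch of the aggregation described above: raw (user_id, game_id, date)
# open events become one 14-element daily-count vector per (user, game) pair.

def daily_opens(events, end_date, days=14):
    """events: iterable of (user_id, game_id, date) tuples.
    Returns {(user_id, game_id): [opens on day 0, ..., opens on day days-1]}."""
    start = end_date - timedelta(days=days - 1)
    counts = Counter(
        (u, g, (d - start).days)
        for u, g, d in events
        if start <= d <= end_date
    )
    pairs = {(u, g) for u, g, _ in counts}
    return {p: [counts[(*p, i)] for i in range(days)] for p in pairs}

events = [
    ("u1", "g1", date(2014, 3, 1)),
    ("u1", "g1", date(2014, 3, 1)),
    ("u1", "g1", date(2014, 3, 5)),
    ("u2", "g1", date(2014, 3, 14)),
]
series = daily_opens(events, end_date=date(2014, 3, 14))
print(series[("u1", "g1")])  # [2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```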

It turns out the problem is significantly harder or easier depending on how precisely you define churn. (Does a user simply have to open the game once in a week to be considered actively playing? Once in a month? How about once per day? How about at least once this week and then no activity the following week?)
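To make that sensitivity concrete, here are two candidate definitions as code - toy paraphrases of the questions above, not the definitions used in the actual project - applied to the same activity series:

```python
# Toy illustration of how the churn definition changes the label. These
# definitions paraphrase the questions above; they are not the project's.

def churned_weekly(opens_by_day):
    """Churn = active in the observed two weeks (days 0-13) but no opens
    at all during the following week (days 14-20)."""
    return sum(opens_by_day[:14]) > 0 and sum(opens_by_day[14:21]) == 0

def churned_daily(opens_by_day):
    """Stricter: churn = any zero-activity day in the following week."""
    return sum(opens_by_day[:14]) > 0 and any(x == 0 for x in opens_by_day[14:21])

# Same user, two different labels: opened twice early on, then only once
# in the middle of the following week.
opens = [1, 1] + [0] * 12 + [0, 0, 1, 0, 0, 0, 0]
print(churned_weekly(opens), churned_daily(opens))  # False True
```

The stricter the definition, the more users it labels as churned, and the easier or harder the resulting prediction problem becomes.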

I ended up using a technique called Discrete Multivariate Modeling to search its space of statistical models for ones to apply to prediction. Classification rates from 75% to the low 90s were observed, depending on the churn definition. For more detail, see these slides.

This is a scenario where the business value of a data science problem is not apparent. It's 'nice' to know if someone will probably churn, but it's only valuable if you can act on that information. There are a few ways to take action on churn predictions (the most straightforward being response or uplift modeling, as previously mentioned). Since none of those were evaluated, there is no way to map this churn prediction rate to a user retention rate and, thus, no way to map this project to a business value.

Example: Uplift Modeling

The only content the white paper offers regarding the Evaluate step is marketing language and the fact that the trialed model performed well enough to Integrate.
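Since the white paper gives no modeling detail, here is a hedged sketch of one common formulation of uplift modeling: compare response rates between randomized treatment and control groups within customer segments, then target only segments with positive estimated uplift. Every name and number here is a toy assumption, not anything from U.S. Bank's actual system:

```python
import numpy as np

# Hypothetical sketch of segment-level uplift estimation. For each customer
# segment, compare response rates between customers who received a campaign
# (treatment) and those who did not (control). The difference estimates the
# uplift: the response the campaign *caused* rather than merely coincided with.

rng = np.random.default_rng(42)

n = 10_000
segment = rng.integers(0, 4, size=n)               # 4 toy customer segments
treated = rng.integers(0, 2, size=n).astype(bool)  # randomized campaign send

# Toy ground truth: segment 0 responds well to the campaign, segment 3 is
# actively annoyed by it (negative uplift).
base_rate = np.array([0.05, 0.10, 0.10, 0.15])[segment]
effect = np.array([0.08, 0.02, 0.00, -0.05])[segment] * treated
bought = rng.random(n) < base_rate + effect

uplift = np.zeros(4)
for s in range(4):
    t = bought[(segment == s) & treated].mean()    # treated response rate
    c = bought[(segment == s) & ~treated].mean()   # control response rate
    uplift[s] = t - c

# Target only segments with positive estimated uplift: fewer emails sent,
# more revenue per email -- the shape of the result reported below.
target = np.flatnonzero(uplift > 0)
print(np.round(uplift, 3), target)
```

Production uplift models work at the individual-customer level rather than coarse segments, but the targeting logic is the same.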


Integrate: This is an engineering problem.


At this point, you'll have a precisely defined business problem you're solving, a model already built and proven in a proof of concept, and a projection of the project's business value. Now it's time to decide whether or not the project is worth building into the product. If everything was thoroughly done up to this point, the decision should be clear.

Data-science-wise, everything is built and neatly packaged, making this the most straightforward step. Business-wise, this is a giant leap that requires total commitment to the data science initiative, because it will likely come at the expense of some other feature with clear business value. It's the most straightforward step, but it's also where I've seen the most projects falter.

A faltering project here can take one of two forms. The first is that Evaluate proved this particular initiative is not worth pursuing. That should be seen as a success of the process, and resources should be shifted to the next possible point of leverage. The second is that the project proves worth pursuing, but resources cannot be committed to integrating it - an unfortunate reality in the world of data science.

Example: Identity Theft

Let's say the performance of our models, when plugged into the production system, held true to our original projections, catching 70% of identity theft an average of 3 days into the fraud happening. For the fraud that was not caught, we still see an average of $5k in damages per incident. For the fraud that was caught, let's say we still see an average of $1k in damages. Let's also say there's an overhead cost of $5MM for building out the infrastructure needed to freeze accounts and for placating customers we wrongly accuse of fraud. That all adds up to $225MM in remaining costs against the original $500MM - a net savings of $275MM a year. In our idealized example, that would obviously be a huge success.
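Tracing the arithmetic with the figures above: the incident count follows from the earlier numbers ($500MM in annual losses at $5k per case), and the $225MM is the total remaining cost - uncaught losses, residual losses on caught fraud, and overhead - implying a net savings of $275MM against the $500MM baseline:

```python
# The arithmetic behind the idealized example, using only figures stated
# in the text.

total_losses = 500e6                  # annual identity-theft losses
loss_per_case = 5e3                   # average loss per incident
incidents = total_losses / loss_per_case   # 100,000 incidents per year

catch_rate = 0.70
residual_caught = 1e3                 # damages still incurred per caught incident
overhead = 5e6                        # infrastructure + wrongly-frozen accounts

cost_uncaught = (1 - catch_rate) * incidents * loss_per_case  # $150MM
cost_caught = catch_rate * incidents * residual_caught        # $70MM
remaining_cost = cost_uncaught + cost_caught + overhead       # $225MM
savings = total_losses - remaining_cost                       # $275MM

print(f"${remaining_cost/1e6:.0f}MM remaining, ${savings/1e6:.0f}MM saved")
```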

Example: Churn Prediction

Lacking a clear business value from churn prediction due to an incomplete Evaluation, the incentive to Integrate this project was not clear. Engineering resources were, at the time, not diverted to build churn prediction into the product.

Example: Uplift Modeling

The Evaluated model was Integrated, and some great results were observed. A 300% increase in cross-sell conversion produced over $1MM in added revenue from just the initial campaigns, all while sending 40% less email. Along the way, this initial integration gave them a faster and more accurate model that they were able to reuse repeatedly and apply to other parts of the business.


Grow: This is a management problem.


Each of the previous three steps has precise objectives and precise skillsets required. Understanding the demands placed on those performing each of the three steps will improve your team's ability to define its process and, thus, learn how to grow as a team. Each step needs to be able to interact with the other two directly to avoid silo-ing the whole process. For example, Evaluate isolated from Integrate will certainly pick models too complex to efficiently implement at scale. Alternatively, Integrate isolated from Evaluate will disempower the product managers, and decisions will be made that don't align with the company's best interests.


Enumerate requires people who speak the languages of both business and data. The business structure needs to be understood not as a fixed entity to work within, but as a creation itself - as a thing that can be optimized and improved. Data needs to be understood well enough to know what it's capable of. Together, data plus business gives practitioners of Enumerate the ability to look for and see the business value of opportunities to leverage data.


Evaluate requires machine learning skills: The deeper the skills, the faster high quality models can be generated within the associated niche. The broader the skills, the more types of problems can be quickly tackled. More important than the depth or breadth of experience is the ability to learn and apply new techniques. After all, new problems will always arise and the Evaluation-ers should be unafraid to tackle a new problem in a new way.


Integrate requires people who speak both math and programming. Depending on your personal skillset or the skills available in your team, you may want to teach your machine learners to write product-quality code, or you may want to come at it from the other direction and encourage your programmers to work through the math (with help from the Evaluate people, as needed) to write the models themselves.


Enumerate - Evaluate - Integrate - Grow.

Enumerate all points of leverage. Pick those that seem to have the highest potential to Evaluate. Integrate the Evaluated initiatives with high projected return. Grow your data science infrastructure.

Each step requires a specific skillset and results in specific deliverables. Take care to properly match required skills to the task at hand. Any one step half-done will stunt all succeeding steps. Step by step, piece by piece, build new ways to leverage data to make better decisions.

Allen Grimm
