Here's my personally biased stab at answering the infamous question. Search for it yourself and you'll find one third buzz about how badly it's needed, another third trying to sell you something, and a final third of words that either add little beyond intuition for the technically savvy or leave the non-technical reader more lost than before the search.
I'm going to give a short discussion of what's currently out there in terms of definitions of data science, propose and discuss my definition of it, and walk through a hypothetical example of a series of decisions made following data science methodology.
Current Definitions
Wikipedia says it is the study of the generalizable extraction of knowledge from data. The tag line of Strata, a popular data conference, is "Making data work." One of my favorites, posted by IBM, says a data scientist is "almost like a Renaissance individual who really wants to learn and bring change to an organization." Job descriptions can also help, including this one from a recent Nike posting: "Have a deep passion for applying advanced analytic approaches, an eagerness to dig into the data, and a vision for turning disparate data streams into a cohesive view of our global consumers." For the final example, a post by Revelytix says data science "is a steadily growing discipline that is powering significant change across industries and in companies of every size".
But what do all these snippets have in common? Data science obviously involves data. Data can do work. Data can also give perspective. The skillset needed to master the manipulation of data is apparently quite broad, and that mastery is capable of causing real change to businesses.
My Definition
My definition is simple.
Data is potential knowledge. Knowledge empowers better decisions. Data science is a new word applied to the old problem of using data to make better decisions.
Or, more briefly: data science is decision making from data.
Whatever your bottom line is - higher revenue, lower cost, more lives saved, equal treatment of all people in a society, lower blood pressure, etc. - you can use data to empower your decisions.
The implications of choosing a new word for an old skill, while it annoys the side of my brain that seeks abstract truth, are quite useful for business. Between computation becoming so cheap and substantial datasets becoming so ubiquitous at all layers of business, there are an order of magnitude more opportunities for using data to leverage systems. This means there is great value in empowering the creative actuaries, analysts, and engineers who are closest to the required skillset to take that last step and apply data at all points of a business. The mechanism the business world uses to enable the empowerment of old skillsets is to assign a new title.
Where are the Decisions Made?
If data holds some fundamental structure - a pattern to be learned, correlations, systemic structure, or whatever word/phrase you prefer for hidden information - then learning that pattern can be useful. The distinction I most often use when organizing a new project is where the decision to be optimized resides - in humans or in computers.
Data In, Processed For Human Consumption
People require some sort of report to ingest, and the decisions being optimized are the decisions associated with their jobs: the product manager can decide to implement a proposed feature, the CEO can determine a new company direction, the engineer can decide how to pre-cache content, etc.
The traditional paradigm of a person processing and plotting data for another person's consumption usually falls under the title of 'analytics', 'business intelligence', and/or 'reporting', while applying techniques more complicated than aggregations, graphing, and straight correlations is usually dubbed 'data mining'. In practice, though, the lines dividing these words are quite blurry.
Data In, Processed For Computer Consumption
The behavior of our computer-driven technology can be affected by the data it ingests. From things as complex as the adaptive behavior of video games to things as simple as assuming someone speaks Italian when their IP address is traced back to Italy, the more complex our technology stacks become, the more opportunities for decision automation arise.
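As a toy illustration of that last example, here's a minimal sketch of what such an automated decision could look like; the geolocate() stub and the country-to-language table are hypothetical stand-ins for a real geolocation service and real product copy.

```python
# Toy sketch of decision automation: pick a default UI language from a
# geolocated country code. Both geolocate() and the lookup table are
# hypothetical placeholders.

DEFAULT_LANGUAGE = "en"
COUNTRY_TO_LANGUAGE = {
    "IT": "it",  # an IP traced back to Italy defaults to Italian
    "FR": "fr",
    "DE": "de",
}

def geolocate(ip_address: str) -> str:
    """Placeholder: a real system would call a geo-IP service here."""
    return "IT" if ip_address.startswith("203.0.113.") else "US"

def default_language(ip_address: str) -> str:
    return COUNTRY_TO_LANGUAGE.get(geolocate(ip_address), DEFAULT_LANGUAGE)

print(default_language("203.0.113.7"))  # -> "it"
```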
Common words to describe techniques in this category include 'artificial intelligence', 'computational intelligence', 'machine learning', 'predictive analytics', and 'predictive modeling', though these techniques have even blurrier dividing lines.
It's worth noting that a technique that works well for human consumption can work quite well when applied to computer consumption, and vice versa. Partially because once some fundamental structure is found, it can be leveraged in both categories. Partially because many techniques, once successfully applied to one category, give results that can be directly translated into the language of the other. Regardless, it's always good to keep your target audience in mind when making this distinction.
Current Roles in Companies
There are three main roles that I regularly see in job postings. The first is a liaison between machine learners and business people. The bottom line here is that people who speak math usually don't speak business, and vice versa. As hard as it is to find talent in both categories, it can be even harder to figure out how to integrate them. Somewhere along the line, someone has to speak both languages for communication to flow, and it's easier to find a single liaison than to expect everyone doing the work to also be a liaison.
The second is someone to manage, manipulate, and possibly merge big datasets. Strictly speaking, this is a data engineering task. A data scientist will benefit tremendously from having solid data engineering skills, but those skills alone are not sufficient for good decision making.
The third is something I'm only recently seeing appear more frequently: the role of a data-savvy product manager or an assistant to a product manager. Companies with the luxury of enough users to thoroughly test any imaginable tweak to their product and/or image without exhausting their user base can and should use data science to look for and test possible points of improvement.
Example: Taking Action From Data
Say you have some online app. Really, the following story could be applied to most businesses, but let's stick with an app for this example.
Decision From Analytics
Say the following graph is the daily active users in your app:
You've been tracking your daily active users for a while and your popularity has consistently been going up. Now, over the past week or two, you see your popularity dropping for the first time. What do you do? Let's say you hire a data scientist.
This is the first moment of making a decision from data. You have your analytics team sending you weekly reports on all of the key performance indicators, especially the daily active users. Now you see some piece of information - stagnating growth - that warrants action.
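For concreteness, here's a minimal sketch of that analytics step - computing daily active users from raw event logs and flagging a week-over-week drop. The event data and the 10% threshold are made up for illustration.

```python
# Compute daily active users (DAU) from raw (date, user_id) events and
# flag a week-over-week drop. The events and the 10% threshold are
# hypothetical.
from collections import defaultdict
from datetime import date

events = [
    (date(2014, 5, 1), "user_1"), (date(2014, 5, 1), "user_2"),
    (date(2014, 5, 2), "user_1"),
    # ... one row per user action, straight from your app's logs
]

daily_users = defaultdict(set)
for day, user_id in events:
    daily_users[day].add(user_id)          # count each user once per day

dau = {day: len(users) for day, users in sorted(daily_users.items())}

days = sorted(dau)
last_week = [dau[d] for d in days[-7:]]
prior_week = [dau[d] for d in days[-14:-7]]
if prior_week and sum(last_week) / len(last_week) < 0.9 * sum(prior_week) / len(prior_week):
    print("DAU dropped more than 10% week over week - time to dig in.")
```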
Behavior Prediction to Enable Decisions
Your data scientist comes in and wants to dig into why users are leaving. A fair way to do this is to attempt to train a model to learn whether a given user (and all the data surrounding that user) will leave the following week. If the model can reliably connect a user profile to a probability of leaving, then there is certainly at least one quirk (but more frequently, a combination of quirks) in the data that can be addressed through action.
A framework for building and applying a model is to break up the flow into exactly those two steps: build a model, then use the model. The first step of this framework is dubbed supervised learning in machine learning.
This first step is applied by taking data from two weeks ago and attempting to connect it to the data observed one week ago. That way, the learning procedure gets to look at each user and draw out information about how a user's pattern changes over time. We will call this abstracted pattern a model. In practice, you'd want to use more than just two weeks of data to look for this kind of pattern, since you want the model to extract patterns in behavior from any arbitrary week to the following week - not specifically from two weeks ago to last week. But that's a discussion for another time.
The second step is to apply that model to this week's data: this week's data goes in as the input, the model extracts whatever patterns it perceives, and out come predictions of what will happen next week.
Let's say this all works well - the model was built smoothly in the first step and the predictions in the second step were made with confidence. You'll now have, for each of your users, a prediction of the % chance that they leave your app next week.
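To make the two steps concrete, here's a minimal sketch using scikit-learn's logistic regression. The synthetic feature matrices stand in for whatever per-user data you actually track (sessions, purchases, days since last visit, and so on); none of this is prescriptive.

```python
# Two-step sketch of the churn model. Synthetic data stands in for real
# per-user features; the feature count and model choice are arbitrary.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_users, n_features = 10_000, 5

# Step 1: build the model.
#   X_train: per-user features measured two weeks ago (one row per user).
#   y_train: whether that user left (1) or stayed (0) in the following week.
X_train = rng.normal(size=(n_users, n_features))
y_train = (X_train[:, 0] + rng.normal(size=n_users) > 1).astype(int)

model = LogisticRegression()
model.fit(X_train, y_train)

# Step 2: apply the model to this week's data.
#   Out comes, for each user, the predicted probability of leaving next week.
X_this_week = rng.normal(size=(n_users, n_features))
leave_probability = model.predict_proba(X_this_week)[:, 1]

print(f"Roughly {leave_probability.sum():.0f} users expected to leave next week.")
```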
Response Modeling
Say you happen to find that exactly 12k users are expected to leave. That's great to know, but what do you do about it? You dig into the data as an analyst looking for patterns to leverage, but nothing jumps out. A possible next step is called response modeling. Roughly, it says to try a few different treatments and see how users respond. Say you have two possible treatments - emails and gifts. An email could be a request for feedback, some pro-tips, an update on what's new or upcoming in the product, or anything else you can imagine. The gift could be a free month of service, some virtual good, or again anything you can imagine. Following good science, we'll hold out a third group and change nothing about its experience, both to give this response modeling phase a baseline and to check our predictor from the last step.
Say you run the experiment this week and have the following results:
Treatment | Sample Size | Number of Users Remaining |
---|---|---|
Control | 4,000 | 756 |
Email | 4,000 | 642 |
Gift | 4,000 | 1,108 |
We see in the control group that we can expect around 19% ( = 756 / 4000) of these at-risk users to remain without any intervention. We see in the email group that only 16% remain - that's 3 points worse! You're not going to want to send that email. The gift group did considerably better, with nearly 28% of the group remaining. Weigh the cost of all the gifts against the profit you'll make from the retained users, and if it's worth the tradeoff, take the action of giving that gift to people who are likely to leave.
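Here's that arithmetic as a quick sketch, with a rough cost/benefit check tacked on; the counts come from the table above, while the per-gift cost and per-retained-user profit are hypothetical numbers you'd replace with your own.

```python
# Retention rates per treatment from the experiment above, plus a rough
# cost/benefit check for the gift. The cost and profit figures are made up.
results = {
    "Control": (4_000, 756),
    "Email":   (4_000, 642),
    "Gift":    (4_000, 1_108),
}

for treatment, (sample_size, remaining) in results.items():
    print(f"{treatment:8s} retention: {remaining / sample_size:.1%}")

# Is the gift worth it? Compare the cost of gifting the whole group to the
# profit from the *extra* users retained relative to control.
gift_cost_per_user = 1.00          # hypothetical
profit_per_retained_user = 20.00   # hypothetical
extra_retained = results["Gift"][1] - results["Control"][1]
net = extra_retained * profit_per_retained_user - results["Gift"][0] * gift_cost_per_user
print(f"Net value of the gift treatment: ${net:,.0f}")
```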
Given enough data, you can keep running this experiment on the side for a small part of your user base as you direct whatever the best perceived action is on the rest of your user base. Things get even more exciting when you start targeting specific groups within your user base, but we'll save that discussion for another time, too.
Recap, With Buzzwords
The thing I really want to leave you with is that data empowers action. Anywhere you can connect data points, you can look for underlying structure. Anywhere there is underlying structure, you can use that structure to make decisions. Here are some types of data science initiatives currently common in the industry.
- Behavior prediction
- A/B testing
- Response modeling
- Ad optimization
- Recommender systems
- Your credit score
- Cohort analysis
- Clustering
- Time series forecasting
- Virality modeling
- Natural language processing
- Fraud / outlier detection
- Failure prediction
The list obviously goes on, but that will give the curious reader plenty to start chewing on with Google's help.