Disruptive Data Science Series: The Eight-Fold Path of Data Science



by Kaushik Kunal Das

Introduction

I recently spent an evening in the San Francisco Bay Area with the emerging community of data scientists. Our group included students and academics, practitioners with varying levels of experience, and even a smattering of venture capitalists and angel investors, meeting to discuss practical applications of data science with D J Patil, one of the gurus of this new community.

After years of practitioners, academics, and industries applying mathematical models to solve practical problems, big data technologies have created an opportunity for scientists and engineers from many different backgrounds to come together and consider how we create meaning from vast amounts of data. The Bay Area is a hotbed for data science, but its relevance and impact are global. If the Industrial Revolution strengthened the muscular and skeletal systems of the global economy, the Internet of Things is ready to do the same for the economy's brain and nervous system. Many smart devices already exist — smart energy meters, sensors on car and plane engines. The challenge comes in connecting these devices and the data they produce to accelerate insights and action. We see examples of this in the applications data scientists are already building, such as efforts to make oil drilling platforms smarter so they can catch issues early and activate appropriate shut-off procedures, preventing explosions such as the one that caused the Gulf of Mexico oil spill. The basic methodology of analytics, such as the Cross-Industry Standard Process for Data Mining (CRISP-DM), remains unchanged. What data scientists have done is build upon that foundation to take into account the increasing complexity of our problems and the capabilities of our tools. Annika Jimenez, who leads the Data Science team here at Pivotal, has talked about eight steps of value creation from data in her Disruptive Data Science white paper. In this paper, I am going to zero in on the data science practices that are part of this process of value creation. The key to a successful data science project is following an eightfold path, consisting of four phases and four differentiating factors.

Phase 1: Problem Formulation – Are you solving the right problem?

We could simply improve existing analytics processes with these new technologies, but the big opportunities will be found by harnessing the new data and capabilities at our disposal to formulate new problems. An example from the data side would be improving a consumer churn propensity model in the telecom sector by adding social graph information from Call Data Records. Such models have found that users who are close friends with someone who has switched carriers are more likely to switch as well. An example of the new capabilities can be found in the exciting world of the Internet of Things, where we can help manufacturers understand how well their products are performing and, more importantly, when and why they are not. In both cases, it is very important to include all stakeholders while formulating the problem — not just the IT department that will maintain the technology platform, but also the business users who will actually use the results of the process. Domain knowledge is crucial when formulating a problem with an impactful solution.
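The social-graph churn feature described above can be sketched in a few lines. This is a minimal illustration, not the actual telecom implementation: the function name, the record layout (caller, callee pairs), and the subscriber names are all hypothetical.

```python
from collections import defaultdict

def build_social_churn_features(call_records, churned):
    """For each subscriber, count how many of their direct contacts
    (people they have exchanged calls with) have already churned.

    call_records: iterable of (caller, callee) pairs from Call Data Records
    churned: set of subscriber ids known to have switched carriers
    """
    # Build an undirected contact graph from the call records.
    contacts = defaultdict(set)
    for caller, callee in call_records:
        contacts[caller].add(callee)
        contacts[callee].add(caller)

    # Feature: number of direct contacts who have churned.
    return {sub: sum(1 for c in neighbors if c in churned)
            for sub, neighbors in contacts.items()}

# Toy data: alice talks to bob and carol; bob also talks to dave.
calls = [("alice", "bob"), ("alice", "carol"), ("bob", "dave")]
features = build_social_churn_features(calls, churned={"carol", "dave"})
# alice has one churned contact (carol); bob has one (dave)
```

A feature like this would then be fed into the churn propensity model alongside the usual usage variables.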
Phase 2: Data Step – Do you have the right feature set?

This is the step that usually takes the most time. It encompasses a number of key questions: What data sources are available to us inside and outside the organization? What variables will we use for our analysis? The more data we have, the better off we are. Equally important is that we incorporate domain knowledge in the features that we build. In this area, data science has made a significant contribution by exploiting big data technologies' ability to refine data to increasing levels of detail and to combine structured, unstructured, and semi-structured datasets. This phase produces a set of features that are used in the subsequent analysis. For instance, incorporating the Call Data Records of a phone company in a regression model for predicting churn propensity involves creating thousands of variable candidates, called features. These features are based on factors such as whether customers use text messages or phone calls, the number of outgoing or incoming calls, and the time periods of call activity. In this step, we decide whether to combine fields and determine the level of aggregation each feature needs. High levels of aggregation can mask signals, but data at too low a level may not be statistically significant. This process of defining features determines the boundaries of our solution.

Phase 3: Modeling Step – Deploy the right algorithms to uncover causal links.

The modeling step is where we identify patterns among our features. We might need to explore and transform our feature space using Principal Component Analysis (PCA), Fourier and wavelet transforms, and other such techniques.
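To make the PCA idea concrete, here is a minimal sketch for the two-feature case, where the leading eigenvector of the 2x2 covariance matrix has a closed form. The data points are synthetic; a real feature space would have far more dimensions and call for a linear algebra library rather than this hand-rolled version.

```python
import math

def pca_first_component(points):
    """Return the first principal component (unit vector) of a set of
    2-D feature points, via the closed-form eigenvector of the 2x2
    covariance matrix."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # Largest eigenvalue from the characteristic equation.
    tr, det = sxx + syy, sxx * syy - sxy * sxy
    lam = tr / 2 + math.sqrt(max(tr * tr / 4 - det, 0.0))
    # Corresponding eigenvector (lam - syy, sxy), handling the diagonal case.
    if abs(sxy) > 1e-12:
        vx, vy = lam - syy, sxy
    else:
        vx, vy = (1.0, 0.0) if sxx >= syy else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    return vx / norm, vy / norm

# Synthetic points spread mainly along the y = x diagonal, so the first
# component should point roughly along (1, 1) / sqrt(2).
pts = [(0, 0), (1, 1.1), (2, 1.9), (3, 3.2), (4, 4.0)]
v = pca_first_component(pts)
```

Projecting the features onto the top few components like this one is what lets us shrink a sprawling feature space before modeling.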
[Sidebar: The eight-fold path at a glance
Phase 1: Problem Formulation – Make sure you formulate a problem that is relevant to the goals and pain points of the stakeholders.
Phase 2: Data Step – Build the right feature set, making full use of the volume, variety, and velocity of all available data.
Phase 3: Modeling Step – This is where you move from answering what, where, and when to answering why and what if.
Phase 4: Application – Create a framework for integrating the model with decision-making processes and taking action using the Internet of Things.
Technology Selection – Select the right platform and the right set of tools for solving the problem at hand.
Building a Narrative – Create a fact-based narrative that clearly communicates insights to stakeholders.
Iterative Approach – Perform each phase in an agile manner and iterate as required.
Creativity – Take the opportunity to innovate at every phase.]

If we have enough data to find strong correlations that could indicate causal relationships, we can use various powerful regression techniques like that old warhorse logistic regression, or newer ones like Support Vector Machines. In cases where we cannot prove strong correlations, we cluster large datasets to find
hidden patterns. In many cases we use optimizations or solve complicated inverse problems, like creating images of the earth using data from earthquakes or artificial seismic surveys. Fortunately, there is a large and ever-growing set of techniques and associated algorithms available today. If we go back to the example of the churn model, we find that logistic regression is used very widely today, as it is very easy to explain. Logistic regression does not necessarily provide the best estimate, however. In this case, we improved the explanatory power of the model by using a generalized version of logistic regression called Generalized Additive Models, which combined the variable transformation and regression steps and fit them together.

Phase 4: Application – This is where we finally solve the problem.

The potential applications of these insights are numerous: they might inform a decision support tool, or a control system that acts based on the patterns we have uncovered. In many cases, the insights serve multiple applications. For example, when leveraging the Internet of Things to make an oil drilling platform smarter so that it can detect signs of catastrophic failure and take corrective action, we need to build a dashboard for human operators along with connections to the drill controls. If you step back and look at these four phases, they mimic the process of human thought: starting with the formulation of a question, determining the ontology, creating a cognitive model, and applying the conclusions. Professor George Lakoff of UC Berkeley has a very interesting theory about this phenomenon. Of course, a mathematical model is a special type of cognitive model, in that it is defined rigorously using the language of mathematics. It is important to be careful when applying any mathematical model, and to verify that the results of the model correspond to reality.
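Before moving on to the differentiating factors, the logistic-regression churn scoring discussed in the modeling step can be sketched minimally. This fits a one-feature model by plain gradient descent on synthetic data; it is an illustration of the technique, not the production approach (which used Generalized Additive Models on thousands of features).

```python
import math

def fit_logistic(xs, ys, lr=0.5, epochs=2000):
    """Fit a one-feature logistic regression, p = sigmoid(w*x + b),
    by full-batch gradient descent on the log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            gw += (p - y) * x / n
            gb += (p - y) / n
        w -= lr * gw
        b -= lr * gb
    return w, b

# Synthetic churn data: feature = number of churned friends,
# label = whether the customer churned.
xs = [0, 0, 1, 1, 2, 2, 3, 3]
ys = [0, 0, 0, 1, 1, 1, 1, 1]
w, b = fit_logistic(xs, ys)
predict = lambda x: 1.0 / (1.0 + math.exp(-(w * x + b)))
```

The fitted coefficients are what make the model easy to explain: a positive w says, directly, that more churned friends means higher churn propensity.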
The four differentiating factors are principles that we need to keep in mind as we go through the four-phase process described above. They are:

1) Technology Selection

Numerous very powerful technologies exist for tackling various types of problems, and any single data science project might require us to use several of them. For example, you might want to use the Python library NLTK or GPText for text analysis, the fft function in the R stats package for a frequency analysis of the word counts, and the LDA function in MADlib for topic modeling. It is crucial that we select an open and flexible platform that allows us to leverage all the technologies we need, without having to move the data. The Pivotal platform is built with that in mind: we keep the data and compute in parallel, while minimizing the movement of data.

2) Creativity

There is ample room for creativity in all of these steps. Indeed, this may be the most exciting part of the job. While designing a project, look for opportunities to be creative and do something that hasn't been done before. For example, we use many signal processing techniques on the time series data produced by sensors connected to the Internet of Things. This makes it much easier to deal with problems that are often considered difficult.

3) Iterative Approach

We also keep our projects iterative, with a meeting at the completion of every phase described above, where we show our results and solicit feedback from the stakeholders, which we then incorporate. The German philosopher Edmund Husserl pointed out that we communicate using concepts based on a shared reality he called "Lebenswelt," or lifeworld. A common problem for data science projects is that we are creating a new application, one which is not yet a shared reality since it does not exist, and which is therefore difficult to convey in words. It makes sense to create a prototype as soon as possible and demo it to the stakeholders.
This elicits very productive feedback, and prevents us from spending time on things that would be difficult to adopt. The four-phase process described here is rarely linear; we often need more than one iteration to reach the most impactful solution. For example, we created a marketing mix model for an engagement in which we started modeling at the level of store groups. After the first iteration, we realized that the television effects were not strong enough at that level, and that those variables needed to be aggregated at the national level. (This would be described as "pooling" in the context of a Bayesian hierarchical model.) Making that iterative change made the models much more accurate.

4) Building a Narrative

Finally, the human element remains important in the process of building a story, a narrative that makes sense of all these steps. Whether you are applying data science internally in your organization or implementing a data science project for a customer, it is very important that you are able to explain what you have done and how it is helpful. For instance, take the smart drilling platform project. In this case, we have a compelling goal: preventing explosions such as the
one that caused the Gulf of Mexico oil spill. We must consider: what are the steps that lead us to that goal? We start with sensor data from the drill bit, data from analysis of the drill mud that is coming out, data from the drill control system, and data from other sensors and instruments on the platform. From there, we get a set of features from the data and apply advanced signal processing techniques to get rid of the noise.
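That noise-removal step can be sketched as a frequency-space low-pass filter. The sketch below uses a naive O(n^2) DFT for clarity and a synthetic signal; a real pipeline would run a parallel FFT over actual sensor data.

```python
import cmath
import math

def lowpass(signal, keep):
    """Crude frequency-space low-pass filter: forward DFT, zero every
    bin except the `keep` lowest frequencies (and their mirror bins),
    then inverse DFT back to the time domain."""
    n = len(signal)
    # Forward DFT (naive, for illustration only).
    spectrum = [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n)) for k in range(n)]
    # Zero out the high-frequency bins.
    for k in range(n):
        if min(k, n - k) > keep:
            spectrum[k] = 0
    # Inverse DFT; the imaginary parts cancel for a real input signal.
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n for t in range(n)]

# Synthetic sensor trace: a slow trend plus high-frequency jitter.
trend = [math.sin(2 * math.pi * t / 32) for t in range(32)]
noisy = [v + 0.3 * math.sin(2 * math.pi * 12 * t / 32)
         for t, v in enumerate(trend)]
smooth = lowpass(noisy, keep=2)  # recovers the slow trend
```

The filtered series is what we would then feed into the rate-of-penetration model described next.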
Perhaps we ignore the high-frequency part of the signal, treat it as noise, and use a filtering function in frequency space to remove it. Running an FFT in parallel is very fast on large datasets if you use the right technology. Now we can build regression models on our dataset, using data from several wells in one oilfield to predict the rate of penetration of the drill. This model can be used for prediction in real time, and deviations from this prediction can be treated as anomalies. From this model, we can attach the right action (e.g. raising an alarm, stopping the drill, etc.) to each anomaly using labeled data. By articulating this sort of story to stakeholders, you can communicate the goals, the steps, and the value of developing these models. This is a crucial step in communicating what you are doing and why it is useful, and in getting stakeholders' feedback. This process also helps explain results, facilitates discussions about improvements, and prompts next steps.

This is a very exciting time to be involved in data science. What I have outlined here is an emerging process for an emerging field. As data scientists work on an increasing variety of problems and innovate in small garages and large companies, this framework will evolve, and become all the more significant, sophisticated, and meaningful over time.

About Kaushik Kunal Das

Kaushik is a Senior Principal Data Scientist with Pivotal. His job is to formulate data science problems and solve them using the Pivotal Big Data Platform. He leads a team of highly accomplished data scientists working in energy, telecommunications, retail, and digital media. Kaushik has an engineering background focused on solving mathematical problems requiring large datasets. He studied engineering at the Institute of Technology of the Banaras Hindu University and the University of California at Berkeley.
He is interested in questions like: how much can a company know about its customers and customize its actions in a context-sensitive fashion? How can our living and working environments get smarter, and how do we get there?

Learn More

To learn more about our products, services and solutions, visit us at goPivotal.com.