The nature of cloud-based data science



Mike Driscoll is CEO of Metamarkets, a cloud-based analytics company he co-founded in San Francisco in 2010.





A quarterly journal, 2012, Issue 1
Reshaping the workforce with the new analytics

06 The third wave of customer analytics
30 The art and science of new analytics technology
44 Natural language processing and social media intelligence
58 Building the foundation for a data science culture

Cover: Mike Driscoll, CEO, Metamarkets
Acknowledgments

Advisory Principal & Technology Leader: Tom DeGarmo
US Thought Leadership Partner-in-Charge: Tom Craren
Strategic Marketing: Natalie Kontra, Jordana Marx

Center for Technology & Innovation
Managing Editor: Bo Parker
Editors: Vinod Baya, Alan Morrison
Contributors: Galen Gruman, Steve Hamby and Orbis Technologies, Bud Mathaisel, Uche Ogbuji, Bill Roberts, Brian Suda
Editorial Advisors: Larry Marion
Copy Editor: Lea Anne Bantsari
Transcriber: Dawn Regan

02 PwC Technology Forecast 2012 Issue 1
US studio
Design Lead: Tatiana Pechenik
Designer: Peggy Fresenburg
Illustrators: Don Bernhardt, James Millefolie
Production: Jeff Ginsburg

Online
Managing Director, Online Marketing: Jack Teuber
Designer and Producer: Scott Schmidt
Animator: Roger Sano

Reviewers: Jeff Auker, Ken Campbell, Murali Chilakapati, Oliver Halter, Matt Moore, Rick Whitney

Special thanks: Cate Corcoran (WIT Strategy), Nisha Pathak (Metamarkets), Lisa Sheeran (Sheeran/Jager Communication)

Industry perspectives
During the preparation of this publication, we benefited greatly from interviews and conversations with the following executives:
Kurt J. Bilafer, Regional Vice President, Analytics, Asia Pacific Japan, SAP
Jonathan Chihorek, Vice President, Global Supply Chain Systems, Ingram Micro
Zach Devereaux, Chief Analyst, Nexalogy Environics
Mike Driscoll, Chief Executive Officer, Metamarkets
Elissa Fink, Chief Marketing Officer, Tableau Software
Kaiser Fung, Adjunct Professor, New York University
Jonathan Newman, Senior Director, Enterprise Web & EMEA eSolutions, Ingram Micro
Ashwin Rangan, Chief Information Officer, Edwards Lifesciences
Seth Redmore, Vice President, Marketing and Product Management, Lexalytics
Vince Schiavone, Co-founder and Executive Chairman, ListenLogic
Jon Slade, Global Online and Strategic Advertising Sales Director, Financial Times
Claude Théoret, President, Nexalogy Environics
Saul Zambrano, Senior Director, Customer Energy Solutions, Pacific Gas & Electric
Kent Kushar, Chief Information Officer, E. & J. Gallo Winery
Josée Latendresse, Owner, Latendresse Groupe Conseil
Mario Leone, Chief Information Officer, Ingram Micro
Jock Mackinlay, Director, Visual Analysis, Tableau Software
The nature of cloud-based data science

Mike Driscoll of Metamarkets talks about the analytics challenges and opportunities that businesses moving to the cloud face. Interview conducted by Alan Morrison and Bo Parker.

PwC: What’s your background, and how did you end up running a data science startup?

MD: I came to Silicon Valley after studying computer science and biology for five years and trying to reverse engineer the genome network of uranium-breathing bacteria. That was my thesis work in grad school. There was lots of modeling and causal inference. If you were to knock this gene out, could you increase the uptake of the reduction of uranium from a soluble to an insoluble state? I was trying all these simulations and testing with the bugs to see whether you could achieve that.

PwC: You wanted to clean up radiation leaks at nuclear plants?

MD: Yes. The Department of Energy funded the research work I did. Then I came out here and gave up on the idea of building a biotech company, because I didn’t think there was enough commercial viability there from what I’d seen. But I did think I could take the toolkit I’d developed and apply it to all these other businesses that have data. That was the genesis of the consultancy Dataspora. As we started working with companies at Dataspora, we found this huge gap between what was possible and what companies were actually doing. Right now the real shift is that companies are moving from a very high-latency, coarse era of reporting into one where they start to have lower latency, finer granularity, and better
visibility into their operations. They realize the problem with being walking amnesiacs: knowing what happened to their customers in the last 30 days and then forgetting it every 30 days. Most businesses are just now figuring out that they have this wealth of information about their customers and how their customers interact with their products.

[Figure: the data science Venn diagram. Three circles: critical business questions, good data, and data science tools and expertise. Companies need all three capabilities to excel in creating data science value; value is created where the circles intersect.]

PwC: On its own, the new availability of data creates demand for analytics.

MD: Yes. The absolute number-one thing driving the current focus on analytics is the increase in data. What’s different now from what happened 30 years ago is that analytics is the province of people who have data to crunch. What’s causing the data growth? I’ve called it the attack of the exponentials: the exponential decline in the cost of compute, storage, and bandwidth, and the exponential increase in the number of nodes on the Internet. Suddenly the economics of computing over data have shifted so that almost all the data businesses generate is worth keeping around for analysis.

PwC: And yet, companies are still throwing data away.

MD: So many businesses keep only 60 days’ worth of data. The storage cost is so minimal! Why would you throw it away? This is the shift at the big data layer: when these companies store data, they store it in a very expensive relational database. There need to be different temperatures of data, and companies need to put different values on the data: whether it’s hot or cold, whether it’s active. Most companies have only one temperature: they either keep it hot in a database, or they don’t keep it at all.

PwC: So they could just keep it in the cloud?

MD: Absolutely.
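The "temperatures of data" idea Driscoll describes can be sketched as a simple tiering policy. This is a hedged illustration only: the function name and the 30-day and one-year thresholds are invented, not taken from the interview.

```python
from datetime import datetime, timedelta

def data_temperature(last_accessed: datetime, now: datetime) -> str:
    """Classify a record as hot, warm, or cold by how recently it was touched.
    Thresholds are hypothetical; the point is that each tier gets a different
    storage cost, rather than 'keep it hot or throw it away'."""
    age = now - last_accessed
    if age <= timedelta(days=30):
        return "hot"    # keep in the fast (expensive) operational database
    if age <= timedelta(days=365):
        return "warm"   # move to cheaper cloud object storage
    return "cold"       # archive cheaply, but never throw it away

now = datetime(2012, 1, 1)
print(data_temperature(now - timedelta(days=7), now))     # hot
print(data_temperature(now - timedelta(days=90), now))    # warm
print(data_temperature(now - timedelta(days=1000), now))  # cold
```

A policy like this is what lets "almost all the data" stay queryable at some price point instead of being deleted after 60 days.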
We’re starting to see the emergence of cloud-based databases where you say, “I don’t need to maintain my own database on the premises. I can just rent some boxes in the cloud and they can persist our customer data that way.” Metamarkets is trying to deliver DaaS: data science as a service. If a company doesn’t have analytics as a core competency, it can use a service like ours instead. There’s no reason for companies to be doing in-house a lot of the tasks they are doing. You need to pick and choose your battles.

We will see a lot of IT functions being delivered as cloud-based services. And inside those cloud-based services, you often will find an open source stack. Here at Metamarkets, we’ve drawn heavily on open source. We have Hadoop at the bottom of our stack, and at the next layer we have our own in-memory distributed database. We’re running on Amazon Web Services and have hundreds of nodes there.

Data science

PwC: How are companies that do have data science groups meeting the challenge? Take the example of an orphan drug that is proven to be safe but isn’t particularly effective for the application it was designed for. Data scientists won’t know enough about a broad range of potential biological systems for which that drug might be applicable, but the people who do have that knowledge don’t know the first thing about data science. How do you bring those two groups together?

MD: My data science Venn diagram helps illustrate how you bring those groups together. The diagram has three circles. [See above.] The first circle is data science. Data scientists are good at this. They can take data strings, perform processing, and transform them into data structures. They have great modeling skills, so they can use something like R or SAS and start to build a hypothesis that, for example, if a metric is three standard deviations above or below a specific threshold, then someone may be more likely to cancel their membership.
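The three-standard-deviation rule of thumb Driscoll mentions can be sketched in a few lines. This is a hedged illustration, not Metamarkets code: the function name, the login metric, and the sample numbers are all hypothetical.

```python
import statistics

def is_three_sigma_outlier(history, value):
    """Return True if value lies more than three standard deviations
    from the mean of the historical observations."""
    mean = statistics.mean(history)
    sd = statistics.stdev(history)
    return abs(value - mean) > 3 * sd

# Hypothetical metric: a member's weekly logins over ten stable weeks.
weekly_logins = [12, 11, 13, 12, 10, 11, 12, 13, 11, 12]

print(is_three_sigma_outlier(weekly_logins, 0))   # True: sudden silence, flag as churn risk
print(is_three_sigma_outlier(weekly_logins, 12))  # False: within the normal band
```

Note that the baseline statistics are computed from past history rather than from a window that includes the suspect value; otherwise an extreme observation inflates the standard deviation and can hide itself.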
And data scientists are great at visualization. But companies that have the tools and expertise may not be focused on a critical business question. One company is trying to build what it calls the technology genome. If you give them a list of parts in the iPhone, they can see how all those different parts are related to other parts in camcorders and laptops. They built this amazingly intricate graph of the
actual makeup. They’ve collected large amounts of data. They have PhDs from Caltech; they have Rhodes scholars; they have really brilliant people. But they don’t have any real critical business questions, like “How is this going to make me more money?”

The second circle in the diagram is critical business questions. Some companies have only the critical business questions, and many enterprises fall into this category. For instance, the CEO says, “We just released a new product and no one is buying it. Why?”

The third circle is good data. A beverage company or a retailer has lots of POS [point of sale] data, but it may not have the tools or expertise to dig in and figure out fast enough where a drink was selling and what demographics it was selling to, so that the company can react. On the other hand, some web companies or small companies have critical business questions and the tools and expertise, but because they have no customers, they don’t have any data.

PwC: Without the data, they need to do a simulation.

MD: Right. The intersection in the Venn diagram is where value is created. Think of an e-commerce company that asks, “How do we upsell people and reduce the number of abandoned shopping carts?” Well, the company has 600 million shopping cart flows that it has collected in the last six years. So the company says, “All right, data science group, build a sequential model that shows what we need to do to intervene with people who have abandoned their shopping carts and get them to complete the purchase.”

PwC: The questioning nature of business—the culture of inquiry—seems important here. Some who lack the critical business questions don’t ask enough questions to begin with.
MD: It’s interesting: a lot of businesses have this focus on real-time data, and yet it’s not helping them get answers to critical business questions. Some companies have invested a lot in real-time monitoring of their systems, and it’s expensive, harder to do, and more fragile. A friend of mine worked on the data team at a web company. With real effort, that company developed a real-time log monitoring framework where they could see, with 15-second latency, how many people were logging in every second across the ecosystem. It was hard to keep up and it was fragile. It broke down and they kept bringing it back up, and then they realized that they take very few business actions in real time. So why devote all this effort to a real-time system?

PwC: In many cases, the data is going to be fresh enough, because the nature of the business doesn’t change that fast.

MD: Real time actually means two things. The first has to do with the freshness of data. The second has to do with query speed. By query speed, I mean how long it takes, once you have a question, to get an answer to something such as, “What were your top products in Malaysia around Ramadan?”

PwC: There’s a third one also, which is the speed to knowledge. The data could be staring you in the face, and you could have incredibly insightful things in the data, but you’re sitting there with your eyes saying, “I don’t know what the message is here.”

MD: That’s right. This is about how fast you can pull the data and how fast you can actually develop an insight from it. For learning about things quickly enough after they happen, query speed is really important. This becomes a challenge at scale. One of the problems in the big data space is that databases used to be fast. You used to be able to ask a question of your inventory and get an answer in seconds. SQL was quick when the scale wasn’t large; you could have an interactive dialogue with your data.
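The kind of interactive query Driscoll has in mind ("top products in Malaysia") can be illustrated with Python's built-in sqlite3 module. The table name, schema, and rows are invented for the example; at small scale the round trip is effectively instant, which is exactly the interactivity that degrades when events number in the millions per day.

```python
import sqlite3

# Hypothetical sales table; schema and data are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, country TEXT, qty INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("tea",   "Malaysia",  120),
    ("dates", "Malaysia",  300),
    ("rice",  "Malaysia",  180),
    ("tea",   "Singapore",  90),
])

# "What was your top product in Malaysia?" as an aggregate query.
top = conn.execute(
    "SELECT product, SUM(qty) AS total FROM sales "
    "WHERE country = 'Malaysia' GROUP BY product "
    "ORDER BY total DESC LIMIT 1"
).fetchone()
print(top)  # ('dates', 300)
```

The same SQL against billions of rows is where conventional databases slow down, and where in-memory distributed engines of the sort Driscoll describes aim to restore the interactive dialogue.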
But now, because we’re collecting millions and millions of events a day, data platforms have seen real performance degradation. Lagging performance has led to degradation of insights. Companies literally are drowning in their data. In the 1970s, when the intelligence agencies first got reconnaissance satellites, there was a proliferation in the amount of photographic data they had, and they realized that it paralyzed their decision making. So to this point of speed, there are a number of dimensions here. Typically when things get big, they get slow.

PwC: Isn’t that the problem the new in-memory database appliances are intended to solve?

MD: Yes. Our Druid engine on the back end is directly competitive with those proprietary appliances. The biggest difference between those appliances and what we provide is that we’re cloud based and available on Amazon. If your data and operations are in the cloud, it does not make sense to have your analytics on some appliance. We solve the performance problem in the cloud. Our mantra is visibility and performance at scale.

Data in the cloud liberates companies from some of these physical box confines and constraints. That means your data can be used as input to other types of services. Being a cloud service really reduces friction. The coefficient of friction around data has for a long time been high, and I think we’re seeing that start to drop. It’s not just the scale or amount of data being collected, but the ease with which data can interoperate with different services, both inside your company and out. I believe that’s where tremendous value lies.
To have a deeper conversation about this subject, please contact:

Tom DeGarmo, US Technology Consulting Leader, +1 (267) 330 2658
Bill Abbott, Principal, Applied Analytics, +1 (312) 298 6889
Bo Parker, Managing Director, Center for Technology & Innovation, +1 (408) 817 5733
Oliver Halter, Principal, Applied Analytics, +1 (312) 298 6886
Robert Scott, Global Consulting Technology Leader, +1 (416) 815 5221

Comments or requests? Please visit or send e-mail to
This publication is printed on McCoy Silk. It is a Forest Stewardship Council™ (FSC®) certified stock containing 10% postconsumer waste (PCW) fiber and manufactured with 100% certified renewable energy. By using postconsumer recycled fiber in lieu of virgin fiber:

6 trees were preserved for the future
16 lbs of waterborne waste were not created
2,426 gallons of wastewater flow were saved
268 lbs of solid waste were not generated
529 lbs net of greenhouse gases were prevented
4,046,000 BTUs of energy were not consumed

Photography: Catherine Hall (cover, pages 06, 20); Gettyimages (pages 30, 44, 58)

PwC ( provides industry-focused assurance, tax and advisory services to build public trust and enhance value for its clients and their stakeholders. More than 155,000 people in 153 countries across our network share their thinking, experience and solutions to develop fresh perspectives and practical advice.

© 2012 PricewaterhouseCoopers LLP, a Delaware limited liability partnership. All rights reserved. PwC refers to the US member firm, and may sometimes refer to the PwC network. Each member firm is a separate legal entity. Please see for further details. This content is for general information purposes only, and should not be used as a substitute for consultation with professional advisors. NY-12-0340
Subtext

Culture of inquiry: A business environment focused on asking better questions, getting better answers to those questions, and using the results to inform continual improvement. A culture of inquiry infuses the skills and capabilities of data scientists into business units and compels a collaborative effort to find answers to critical business questions. It also engages the workforce at large—whether or not the workforce is formally versed in data analysis methods—in enterprise discovery efforts.

In-memory: A method of running entire databases in random access memory (RAM) without direct reliance on disk storage. In this scheme, large amounts of dynamic random access memory (DRAM) constitute the operational memory, and an indirect backup method called write-behind caching is the only disk function. Running databases or entire suites in memory speeds up queries by eliminating the need to perform disk writes and reads for immediate database operations.

Interactive visualization: The blending of a graphical user interface for data analysis with the presentation of the results, which makes possible more iterative analysis and broader use of the analytics tool.

Natural language processing (NLP): Methods of modeling and enabling machines to extract meaning and context from human speech or writing, with the goal of improving overall text analytics results. The linguistics focus of NLP complements purely statistical methods of text analytics that can range from the very simple (such as pattern matching in word-counting functions) to the more sophisticated (pattern recognition or “fuzzy” matching of various kinds).
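The "pattern matching in word-counting functions" named in the NLP entry as the simplest end of the statistical spectrum can be sketched in a few lines. This is a minimal, purely illustrative example; the function name and regular expression are assumptions, not a reference implementation.

```python
import re
from collections import Counter

def word_counts(text):
    """Lowercase word-frequency counting via simple pattern matching,
    the most basic statistical text-analytics technique."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

counts = word_counts("Big data is big. Data science needs data.")
print(counts.most_common(2))  # [('data', 3), ('big', 2)]
```

Everything beyond this—stemming, disambiguation, extracting meaning from context—is where the linguistics focus of NLP takes over from raw counting.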