Paradigm4 Research Report: Leaving Data on the Table


While Big Data enjoys widespread media coverage, not enough attention has been paid to what practitioners think — data scientists who manage and analyze massive volumes of data. We wanted to know, so Paradigm4 teamed up with Innovation Enterprise to ask over 100 data scientists for their help separating Big Data hype from reality. What we learned is that data scientists face multiple challenges achieving their company’s analytical aspirations. The upshot is that businesses are leaving data — and money — on the table.


  1. Leaving Data on the Table: Data Scientists Reveal Obstacles to Big Data Analytics
  2. Paradigm4 Data Scientist Survey
     We asked data scientists questions such as:
     • What obstacles prevent them from gaining insights into their data?
     • How many use Hadoop, and which limitations have they encountered when attempting to use Hadoop for complex analytics?
     • What data types and sources would they like to leverage more effectively?
     • Will they adopt complex analytics solutions (defined below), and how quickly?
     Their experiences should help inform businesses on what to look for as they investigate options to expand their analytics infrastructure. For insight on the issues and obstacles facing data scientists, read on.
     This survey uses the terms “complex analytics” and “basic analytics,” for which respondents were given these definitions:
     • “Complex analytics” means math functions like covariance, clustering, machine learning, principal components analysis and graph operations.
     • “Basic analytics” means business intelligence reporting such as sums, counts and aggregates.
     This distinction is important because basic analytics are “embarrassingly parallel” whereas complex analytics are not. Here’s what we mean. “Embarrassingly parallel” (sometimes referred to as “data parallel”) refers to problems that can be separated into multiple independent sub-problems that can run in parallel and do not require access to all the data at once. This is the divide-and-conquer approach used by MapReduce/Hadoop. In contrast, “non-embarrassingly parallel” problems require using and sharing all the data at once and communicating intermediate results among processes. Matrix multiplication on matrices too large to fit on one server is an example of a non-embarrassingly parallel function.
  3. The Big Takeaways
     1. We’ve all heard how hard it is to analyze massive and rapidly growing data volumes. But data scientists say variety presents a bigger challenge. They are at times leaving data out of their analyses as they wrestle with how to integrate and analyze more types of data, such as time-stamped sensor, location, image and behavioral data as well as network data.
     2. Data scientists are turning to large-scale complex analytics both for unbiased data-driven exploration and to wrest more value from their data.
     3. For complex analytics, data scientists are forced to move large volumes of data from existing data stores to dedicated mathematical and statistical computing software. This time-consuming and coding-intensive step adds no analytical value and impedes productivity.
     4. While Hadoop has garnered widespread media coverage, 76 percent of the data scientists who have used it have encountered serious limitations. Hadoop is well suited for embarrassingly parallel problems but falls short for large-scale complex analytics.
     5. Incorporating the diverse data types into analytical workflows is a major pain point for data scientists using traditional relational database software.
     6. For data scientists, Big Data means Big Stress. 39 percent say it’s made their job more stressful.
  4. Data Variety Is Proving to Be More Important Than Volume
     The overwhelming volume of corporate and organizational data continues to generate headlines, but it’s the diverse types of data that pose a bigger challenge. Nearly three-quarters of data scientists (71 percent) said Big Data had made their analytics more difficult and that data variety, not just volume, was the challenge.
     “My analytics are becoming more difficult because of the variety and types of data sources (not just the volume)”: TRUE 71%, FALSE 29%
     What is the biggest problem you face in gaining insights from your Big Data?
     • I struggle with managing new types and sources of data: 40%
     • I know how to get the answer but it takes too long (my data is too big to move to a math/analytics software package): 36%
     • I don’t know what questions to ask of my data: 24%
     • I know what I want to ask but don’t know how to get the answers: 18%
     • I know how to get the answer but my analysis runs out of memory: 17%
     Which types of data do you anticipate using in the next year?
     • Time-series: 66%
     • Business transaction: 66%
     • Geospatial / Location: 55%
     • Graph (network): 46%
     • Clickstream: 35%
     • Health records: 25%
     • Sensor: 17%
     • Image: 13%
     • Genomic: 7%
  5. What It Means: The ability to effectively use diverse data sources is proving to be a competitive differentiator in many industries.
     The trend toward hyper-personalization and precision targeting illustrates this well. Recommendations, search results and ads are becoming ever more relevant and micro-targeted as they tap more and diverse data like social networks, current location, and browsing and purchasing history. Personalized insurance offerings are augmenting sensor data about driver behavior to incorporate contextual data like time-of-day and road congestion. Precision medicine providers are gaining a more refined understanding of what works for whom by integrating molecular data with clinical, behavioral, electronic health records and environmental data. But the ability to use diverse data types poses a serious challenge. (For more on this topic, see “Big Data at Work: Dispelling the Myths, Uncovering the Opportunities,” by Thomas Davenport, Chapter 1: “Why Big Data is Important to you and your Organization.”)
  6. Data Scientists Are Turning to Complex Analytics to Analyze Their Big Data
     “The point is not to be dazzled by the volume of data, but rather to analyze it, to convert it into insights, innovations, and business value.” (Thomas Davenport, “Big Data at Work: Dispelling the Myths, Uncovering the Opportunities,” page 2)
     When will your company begin to use complex analytics on your Big Data?
     Response options: We use it now; In the next 3 years; More than 3 years down the road; No plans to use complex analytics; In the next 2 years; We plan to use it in the next year.
     Percentages shown: 59%, 1%, 4%, 4%, 16%, 15%.
  7. What It Means: The “low hanging fruit” of Big Data has been exploited.
     Many new analytical uses require significantly more powerful algorithms and computational approaches than what’s possible in Hadoop or relational databases. Data scientists increasingly need to leverage all data sources in novel ways, using tools and analytical infrastructures suitable for the task. As we have already seen in this survey, organizations are moving from simple SQL aggregates and summary statistics to next-generation analytics such as machine learning, clustering, correlation, and principal components analysis on moderately sized data sets. The move from simple to complex analytics on Big Data presages an emerging need for analytics that scale beyond single server memory limits and handle sparsity, missing values and mixed sampling frequencies appropriately. These complex analytics methods can also provide data scientists with unsupervised and assumption-free approaches, letting all the data speak for itself.
  8. Moving Big Data Poses Difficult Challenges to Data Scientists
     Data scientists face another growing challenge: conventional analytic workflows require them to move data to mathematical and statistical computing software. This workflow made sense with small or sampled data but is either woefully inefficient or breaks with even moderately large data volumes.
     • 78% of data scientists use software capable of complex analytics in addition to their data management software
     • 36% of data scientists say it takes too long to get insights from their data because it is too big to move to their analytics software
  9. What It Means: The size and diversity of today’s data sets pose a significant hurdle to doing more sophisticated analytics because so much time is lost moving data from files or from a database to analysis tools.
     This forces data scientists to make compromises, analyzing samples instead of the whole data set, leaving data and money on the table. Data scientists risk missing rare events, weak signals or important anomalies when restricted to working with samples or computing on subsets independently. (For more on this topic, see “Scaling Big Data Mining Infrastructure: The Twitter Experience,” by Twitter Engineering Manager Dmitriy Ryaboy and University of Maryland Associate Professor Jimmy Lin.) What’s needed are tools capable of conducting complex analytics over massive data volumes efficiently, without sampling and without moving the data.
  10. Hadoop Only Takes You So Far
      While the Hadoop software platform garners significant media attention, Hadoop is not a viable solution for many use cases, especially those that require complex analytics. Fewer than half of data scientists surveyed (48 percent) have used Hadoop or Spark, and of those, 76 percent cited significant limitations to its use.
      From the 76% reporting problems, what are the limitations of Hadoop / Spark?
      • It takes too much effort to program: 39%
      • It’s too slow for interactive, ad-hoc queries: 37%
      • It’s too slow for real-time analytics: 30%
      • It’s not well-suited for my analytics (not embarrassingly parallel): 22%
      35% of data scientists who tried Hadoop or Spark have stopped using it.
  11. What It Means: Hadoop was unrealistically hyped as a universal and disruptive Big Data solution.
      But even Hadoop vendors have recognized the limitations. They are adding SQL functionality to their products to accommodate data scientists’ preference for a higher-level query language instead of programming languages like Java and to address the limitations of MapReduce. (For example, Cloudera has abandoned MapReduce and is offering Impala to provide SQL on HDFS.) A growing number of complex analytics use cases are proving to be unworkable in Hadoop. First-wave Hadoop adopters like Google, Facebook and LinkedIn required a small army of developers to program and maintain Hadoop. But many organizations either don’t have the required staff or face complex analytics challenges that can’t be readily solved with Hadoop. This presents a real challenge for the Hadoop infrastructure, which has to address these shortcomings or risk being replaced.
  12. Existing relational database management systems are inadequate for analyzing the variety of data sources
      Given the growing diversification of data types and sources coupled with the limitations of existing relational databases, it’s no surprise that many data scientists are frustrated when trying to leverage these data sources in their analytical workflows.
      “I am finding it harder to fit my data into relational database tables”: TRUE 49%, FALSE 51%
  13. What It Means: Relational databases were built for storing and querying densely populated transactional data such as business purchases and customer information.
      By comparison, temporal, spatial and network data may be quite sparse (containing large numbers of missing values), have mixed sampling frequencies and have a natural order. Relational databases require predefined access patterns for each line of inquiry, an obvious non-starter for data scientists doing ad hoc data exploration.
  14. Big Data Means Big Stress for Data Scientists
      There’s another side of the Big Data story: 39 percent of data scientists say their job has become more stressful with the growth of Big Data. That’s nearly four times the number who say it’s made their job less stressful.
      • 39% of data scientists say the growth of Big Data has made their job more stressful in the last year
      • 24% say they don’t know which questions to ask of their Big Data
      Quotes from data scientists:
      • “My biggest problem is linking various data sources.”
      • “The data is just too big.”
      • “The biggest problem is putting multiple sources of data together.”
  15. What It Means: Driven in part by media hype, organizations have developed inflated expectations around the value they’ll get out of Big Data.
      Fulfilling those expectations falls on the data scientist. But outdated software approaches, better suited to traditional transactional data than to today’s diverse data sources and rapidly growing volumes, often make it impossible to fulfill these expectations. It’s a recipe for stress. Deriving business value from organizational data starts with ad hoc analysis. Tools and workflows need to enable data scientists to conduct analysis quickly and efficiently, making data scientists more productive and lowering stress levels as a result.
  16. So What?
      Data scientists play a pivotal role in helping organizations unlock the potential of their Big Data. But current software tools fall short in some areas, as indicated in the survey. Hype has exceeded reality and data scientists are forced to compromise, sometimes leaving data on the table. Choosing the right software solution is key, but don’t expect to get there by browsing vendors’ websites. The fact that so many data scientists identified shortcomings in their infrastructure suggests that the only way to tell which solution is best suited to your organization is to do a pilot project using your data and your use cases.
      About the Survey
      The Paradigm4 Data Scientist Survey was fielded by Innovation Enterprise, an independent research firm, from March 27 to April 23, 2014. The responses were generated from a survey of 111 data scientists in the U.S.
      About Paradigm4
      Paradigm4 is the creator of SciDB, a computational database management system used to solve large-scale, complex analytics challenges on Big (and Diverse) Data. Led by industry visionaries and veterans Michael Stonebraker, Marilyn Matz, Paul Brown and Bryan Lewis, Paradigm4 enables data-obsessed organizations in life sciences, e-commerce, finance, and manufacturing to answer harder questions faster. For more information, visit
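
The distinction slide 2 draws between embarrassingly parallel and non-embarrassingly parallel problems can be illustrated with a small sketch. It is not taken from the report and does not show how Hadoop or SciDB work internally. The function names and the data layout (each hypothetical worker holding one block of rows from each matrix) are assumptions made for illustration. A sum is reduced partition by partition with no data exchanged, while each worker's share of a matrix product needs every row of the second matrix, so blocks must be moved between workers first.

```python
def partitioned_sum(partitions):
    """Basic analytics: each partition is reduced on its own, then combined."""
    partial = [sum(p) for p in partitions]  # each partial sum needs only local data
    return sum(partial)


def matmul(a, b):
    """Plain nested-list matrix product, standing in for one worker's local math."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def distributed_matmul(a_row_blocks, b_row_blocks):
    """Assumed layout: worker i holds a_row_blocks[i] and b_row_blocks[i].

    Worker i's rows of C = A_i * B depend on every row of B, so each worker
    must fetch all remote blocks of B before it can finish its share.
    Returns the product and the number of block transfers this implies.
    """
    workers = len(a_row_blocks)
    # Gathering B simulates the exchange of data between workers.
    full_b = [row for block in b_row_blocks for row in block]
    transfers = workers * (len(b_row_blocks) - 1)
    c = []
    for a_block in a_row_blocks:
        c.extend(matmul(a_block, full_b))
    return c, transfers


if __name__ == "__main__":
    print(partitioned_sum([[1, 2, 3], [4, 5], [6]]))
    product, moved = distributed_matmul([[[1, 2]], [[3, 4]]], [[[5, 6]], [[7, 8]]])
    print(product, moved)
```

In this toy layout the number of block transfers grows with the square of the number of workers, which is one way to see why such problems do not fit a pure divide-and-conquer model.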