Bad Data is Polluting Big Data

A global survey of more than 300 data management professionals conducted by independent research firm Dimensional Research® showed that enterprises of all sizes face challenges on a range of key data performance management issues from stopping bad data to keeping data flows operating effectively. In particular, 87 percent of respondents report flowing bad data into their data stores while just 12 percent consider themselves good at the key aspects of data flow performance management.


  1. ‘Bad Data’ Is Polluting Big Data: Enterprises Struggle with Real-Time Control of Data Flows. A Global Survey of Big Data Professionals, June 2016. Sponsored by: [sponsor logo]
  2. Executive Summary. The big data market is still maturing, especially as it relates to data in motion, as evidenced by the lack of best practices or consistent processes for cleaning data and managing data quality. For companies that use big data to optimize current business operations or to make strategic decisions, it is critical that their big data teams have real-time visibility into and control over the data at all times. This report finds that companies leveraging big data are rarely capable of controlling their data flows. Almost 9 out of 10 companies report ‘bad data’ polluting their data stores and, shockingly, nearly three quarters indicate there is ‘bad data’ in their stores currently. The findings also reveal a chasm between the problem detection capabilities data experts have today and what they desire. This translates into a lack of real-time visibility and control of data flows, operations, quality and security.
  3. Key Findings
     • 87% state ‘bad data’ pollutes their data stores, while 74% state ‘bad data’ is currently in their data stores
     • Ensuring data quality was the most common challenge, cited by 68% of respondents, and only 34% claimed to be good at detecting divergent data
     • 77% responded that they hand code their data flows, while 53% said they have to change each pipeline at least several times a month
     • Tremendous gaps exist between the capabilities of today’s big data flow management tools and what is needed
     • Only 12% of respondents rated their performance as good or excellent across all 5 key data flow operational performance areas
     • 72% desire a single pane of glass solution to manage all data flows
     • 81% state there is a significant operational impact when they upgrade big data components
  4. METHODOLOGY AND PARTICIPANTS
  5. Goals and Methodology. Research Goal: The primary research goal was to capture how companies manage the flow of big data. The research also investigated and documented current tools’ capabilities, data quality and efforts to maintain big data pipelines and infrastructure. Methodology: Big data professionals worldwide were invited to participate in a survey on the topic of big data and ensuring data flow operations and data quality. The survey was administered electronically and participants were offered a token compensation for their participation. Participants: A total of 314 participants who manage big data operations completed the survey.
  6. Companies Represented
     • Size: 500 - 1,000: 25%; 1,000 - 5,000: 29%; 5,000 - 10,000: 16%; More than 10,000: 30%
     • Industry: Technology 18%; Financial Services 18%; Manufacturing 12%; Healthcare 10%; Education 6%; Services 6%; Government 6%; Telecommunications 5%; Energy and Utilities 5%; Transportation 5%; Retail 4%; Non-Profit 1%; Media and Advertising 1%; Hospitality and Entertainment 1%; Food and Beverage 1%; Other 2%
  7. Participant Demographics
     • Role: IT staff responsible for implementing and operating data infrastructure (e.g. database… 56%; IT manager responsible for delivering data initiatives 52%; IT executive with data initiatives in my portfolio 34%; BI or Analytics Technology Owner (e.g. data architect, head of data platform) 17%; Business stakeholder who uses data to make decisions 8%; Business analyst 6%
     • Location: United States or Canada 75%; Europe 14%; Mexico, Central America, or South America 4%; Australia or New Zealand 3%; Middle East or Africa 2%; Asia 2%
  8. DETAILED FINDINGS
  9. What challenges does your company face when managing your big data flows? Top 3 Challenges for Big Data Flows are Quality, Security and Reliable Operation. Ensuring the quality of the data (accuracy, completeness, consistency) 68%; Complying with security and data privacy policies 60%; Keeping data flow pipelines operating effectively 52%; Building pipelines for getting data into the data store 47%; Upgrading big data infrastructure components (Kafka, Hadoop, etc.) 40%; Adapting pipelines to meet new requirements 32%; We have no challenges 1%
  10. Does ‘bad data’ occasionally get into your data stores? 87% State ‘Bad Data’ Pollutes Their Data Stores. Yes 87%; No 13%
  11. Do you believe there is any ‘bad data’ in your data stores currently? 74% State ‘Bad Data’ is Currently in Their Data Stores. Yes 74%; No 26%
  12. How does your company build big data flow pipelines today? 77% of Companies Still Use Hand Coding to Build Big Data Flows. Coding with Python, Java, etc. or low-level frameworks such as Sqoop, Flume or Kafka 77%; Using ETL or data integration tools 63%; Using big data ingestion tools such as StreamSets, NiFi, etc. 27%
  13. On average, how often are changes or fixes made to a typical data flow pipeline? 53% Change Data Flow Pipelines At Least Several Times a Month. Several times a day 3%; Several times a week 19%; Several times a month 31%; Several times a quarter 26%; Several times a year 12%; Less often than several times a year 8%
  14. When data structure or semantics unexpectedly change, how big is the impact on the operation of your big data flows (failures, slowdowns, data corruption, etc.)? 85% State Unexpected Structure and Semantic Changes Have Substantial Impact on Dataflow Operations. Significant impact 31%; Moderate impact 54%; Minor impact 11%; Structure and semantic changes have no effect on our big data flows 2%; Data structure and semantic changes never occur 2%
  15. How would you assess your ability to detect each of the following issues in real-time? More Than Half of Companies Lack Real Time Information About Data Flow Quality
     • Personally identifiable information (credit card numbers, social security numbers) is being inappropriately placed in a data store: Excellent 18%, Good 33%, Average 30%, Poor 13%, None 6%
     • The values of incoming data are diverging from historical norms: Excellent 5%, Good 29%, Average 43%, Poor 20%, None 3%
     • Error rates are increasing: Excellent 7%, Good 37%, Average 38%, Poor 16%, None 1%
     • Data flow throughput is degrading or latency is growing: Excellent 7%, Good 37%, Average 37%, Poor 17%, None 1%
     • A specific data flow pipeline has stopped operating: Excellent 16%, Good 46%, Average 29%, Poor 9%, None 1%
  16. Only 12% Rated Their Performance as ‘Good’ or ‘Excellent’ Across All Five Key Data Flow Metrics. Five Key Data Flow Metrics: 1. A specific data flow pipeline has stopped operating 2. Data flow throughput is degrading or latency is growing 3. Error rates are increasing 4. The values of incoming data are diverging from historical norms 5. Identify personal information within the data flows. Number of Key Data Flow Metrics Participants Represented as ‘Good’ or ‘Excellent’. Chart segments: 1 metric, 0 metrics, all 5 metrics, 4 metrics, 3 metrics, 2 metrics. Chart values (order not matched to segments): 19%, 17%, 19%, 20%, 12%, 12%
  17. In your opinion, how valuable would it be to be able to detect each of these issues in real-time? Substantial Value In Real-Time Data Flow Detection Capabilities
     • Identify personal information within the data flows: Very valuable 40%, Valuable 35%, Average value 18%, Limited value 6%
     • The values of incoming data are diverging from historical norms: Very valuable 23%, Valuable 46%, Average value 26%, Limited value 4%
     • Error rates are increasing: Very valuable 33%, Valuable 46%, Average value 17%, Limited value 4%
     • Data flow throughput is degrading or latency is growing: Very valuable 28%, Valuable 49%, Average value 20%, Limited value 3%
     • A specific data flow pipeline has stopped operating: Very valuable 42%, Valuable 42%, Average value 14%, Limited value 3%
  18. Gap Between Current Pipeline Real-Time Visibility Capabilities and Stated Value. Issue: A specific data flow pipeline has stopped operating. Assessed value: Very valuable 42%, Valuable 42%, Average value 14%, Limited value 3% (84% valuable or very valuable). Real-time ability: Excellent 16%, Good 46%, Average 29%, Poor 9% (62% good or excellent)
  19. Chasm Between Today’s Data Flow Throughput Metrics and What is Needed. Issue: Data flow throughput is degrading or latency is growing. Assessed value: Very valuable 28%, Valuable 49%, Average value 20%, Limited value 3%, Not valuable 1% (77% valuable or very valuable). Real-time ability: Excellent 7%, Good 37%, Average 37%, Poor 17%, None 1% (44% good or excellent)
  20. Significant Gap Between Error Rate Visibility Value and Current Capabilities. Issue: Error rates are increasing. Assessed value: Very valuable 33%, Valuable 46%, Average value 17%, Limited value 4% (79% valuable or very valuable). Real-time ability: Excellent 7%, Good 37%, Average 38%, Poor 16% (44% good or excellent)
  21. Chasm Between Value of Detecting Divergent Data and Current Capabilities. Issue: The values of incoming data are diverging from historical norms. Assessed value: Very valuable 23%, Valuable 46%, Average value 26%, Limited value 4%, Not valuable 1% (69% valuable or very valuable). Real-time ability: Excellent 5%, Good 29%, Average 43%, Poor 20%, None 3% (34% good or excellent)
  22. Large Gap Between Data Privacy Value and Current Capabilities. Issue: Identify personal information within the data flows. Assessed value: Very valuable 40%, Valuable 35%, Average value 18%, Limited value 6%, Not valuable 2% (75% valuable or very valuable). Real-time ability: Excellent 18%, Good 33%, Average 30%, Poor 13%, None 6% (51% good or excellent)
  23. How valuable is it to have a single control panel for comprehensive visibility and management across all of your data flows? 72% Desire A Single Pane of Glass Solution To Manage All Data Flows. Very valuable 24%; Valuable 48%; Average value 24%; Limited value 4%
  24. Which of the following do you consider to be the most effective approach to ensuring data quality? 50% State that Data Cleansing at the Source is the Most Effective Quality Practice. Cleanse data as it flows in from the source 50%; Cleanse and update data once it is in the store 27%; Data scientists or business analysts cleanse data before using it 23%
  25. What is the operational impact of upgrading big data components (ingest technologies, message queues, data stores, search stores, etc.)? 81% State There is Significant Operational Impact to Upgrading Big Data Components. Heavy impact 17%; Moderate impact 64%; Minor impact 17%; No impact 2%
  26. For more information… About Dimensional Research: Dimensional Research provides practical marketing research to help technology companies make smarter business decisions. Our researchers are experts in technology and understand how corporate IT organizations operate. Our qualitative research services deliver a clear understanding of customer and market dynamics. For more information, visit www.dimensionalresearch.com. About StreamSets: Place holder. For more information, visit www.streamsets.com.
  27. APPENDIX
  28. Tremendous Gaps Exist Between Current Big Data Flow Management Tool Capabilities and What is Needed. Ability to Detect Area in Real-Time Compared Against Stated Value To Detect in Real-Time. Each pair below is current ability / stated value.
     • Personally identifiable information (credit card numbers, social security numbers) is being inappropriately placed in a data store: Excellent/Very valuable 18% / 40%, Good/Valuable 33% / 35%, Average/Average value 30% / 18%, Poor/Limited value 13% / 6%, None/Not valuable 6% / 2%
     • The values of incoming data are diverging from historical norms: Excellent/Very valuable 5% / 23%, Good/Valuable 29% / 46%, Average/Average value 43% / 26%, Poor/Limited value 20% / 4%, None/Not valuable 3% / 1%
     • Error rates are increasing: Excellent/Very valuable 7% / 33%, Good/Valuable 37% / 46%, Average/Average value 38% / 17%, Poor/Limited value 16% / 4%, None/Not valuable 1% / 0%
     • Data flow throughput is degrading or latency is growing: Excellent/Very valuable 7% / 28%, Good/Valuable 37% / 49%, Average/Average value 37% / 20%, Poor/Limited value 17% / 3%, None/Not valuable 1% / 1%
     • A specific data flow pipeline has stopped operating: Excellent/Very valuable 16% / 42%, Good/Valuable 46% / 42%, Average/Average value 29% / 14%, Poor/Limited value 9% / 3%, None/Not valuable 1% / 0%
  29. Which of the following approaches for ensuring data quality does your company utilize? Various Approaches To Managing Data Quality Indicates a Lack of Best Practice. Cleanse and update data once it is in the store 55%; Cleanse data as it flows in from the source 54%; Data scientists or business analysts cleanse data before using it 43%
  30. Approximately, what percentage of data flow changes and fixes are made for day-to-day maintenance and troubleshooting purposes? Many Must Perform Maintenance and Troubleshooting on Data Flows Routinely. More than 80%: 3%; 60% - 80%: 10%; 40% - 60%: 24%; 20% - 40%: 27%; Less than 20%: 36%
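
The gap slides (18 to 22) compare two summed figures per issue: the share rating their real-time ability ‘Good’ or ‘Excellent’ and the share rating detection ‘Valuable’ or ‘Very valuable’. As a rough illustration of that arithmetic, the sketch below recomputes both sums and the difference between them from the percentages on the slides. The names `ISSUES`, `top_two` and `gaps` are hypothetical and are not part of the survey; the gap in percentage points is this sketch's own framing, not a figure the report publishes.

```python
# Percentages transcribed from the slides: (Excellent, Good) for ability,
# (Very valuable, Valuable) for value. The dict layout is an assumption
# made for this illustration only.
ISSUES = {
    "Pipeline has stopped operating": {"ability": (16, 46), "value": (42, 42)},
    "Throughput degrading or latency growing": {"ability": (7, 37), "value": (28, 49)},
    "Error rates increasing": {"ability": (7, 37), "value": (33, 46)},
    "Values diverging from historical norms": {"ability": (5, 29), "value": (23, 46)},
    "Personal information in data flows": {"ability": (18, 33), "value": (40, 35)},
}


def top_two(pair):
    """Sum the two most favourable response categories."""
    return sum(pair)


def gaps(issues):
    """Return (issue, ability, value, gap) rows, largest gap first."""
    rows = []
    for name, shares in issues.items():
        ability = top_two(shares["ability"])
        value = top_two(shares["value"])
        rows.append((name, ability, value, value - ability))
    return sorted(rows, key=lambda row: row[3], reverse=True)


if __name__ == "__main__":
    for name, ability, value, gap in gaps(ISSUES):
        print(f"{name}: ability {ability}%, value {value}%, gap {gap} points")
```

For the stopped-pipeline issue this reproduces the 62% and 84% shown on slide 18. Because the inputs are rounded percentages, the differences should be read as approximate.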
