Big Data
Pitfalls
April 8, 2015
Big Data Introduction
So What is it?
● Misnomer and marketing speak
● “Unstructured” data
– Text heavy
– Without obvious/clear structure
● Comes from many places, in many styles
Where It Comes From
Building Your Data Lake
A Common Evolution
A Common Evolution
Hadoop to the Rescue!
You Have a Data Lake!
Hadoop to the Rescue
● Cross system analytics?
● Data quality confidence?
● Source of truth?
● Tool chain support?
● Giant yellow elephants?
Hadoop to the Rescue
● Cross system analytics?
● Data quality confidence?
● Source of truth?
● Tool chain support?
● Giant yellow elephants?
If any are ignored...
You Have a Data Swamp!
Don't worry, even the Jedi had a Data Swamp...
Goal is to build a Data Reservoir
Reservoirs...
● Contain data that is...
– Managed
– Transformed
– Filtered
– Secured
– Portable
– Fit for purpose
Source: Gartner
Pitfalls
Data Warehouse Models
● Traditional models don't cover semi-structured data
● Modern models are hybrids that cross the structured/semi-structured boundary
Data Vault
Data Vault
● Developed by Dan Linstedt
● Tie technical keys across structured and semi-structured data sources
● Semi-structured data can be made more structured and loaded into a relational data vault
● Tools have to support crossing sources
● More details: http://www.tdan.com/view-articles/5054/
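The key-tying idea above can be sketched in a few lines of Python. This is a minimal illustration, not Linstedt's specification: the field names (`customer_id`, `cust`) and the MD5-based surrogate key are assumptions made for the example.

```python
import hashlib
import json

def hub_key(business_key: str) -> str:
    """Surrogate key for a Data Vault hub: hash of the normalized business key."""
    return hashlib.md5(business_key.strip().upper().encode("utf-8")).hexdigest()

# Structured source: a row from a relational CRM table (hypothetical schema)
crm_row = {"customer_id": "C-1001", "name": "Acme Corp"}

# Semi-structured source: a JSON clickstream event (hypothetical schema)
event = json.loads('{"cust": "c-1001", "page": "/pricing"}')

# Both sources resolve to the same hub key, so descriptive satellites
# from either side attach to one customer hub.
print(hub_key(crm_row["customer_id"]) == hub_key(event["cust"]))  # True
```

Applying the same normalization-and-hash rule on every side is what lets tools "cross sources": the hub carries the business key once, and each source links to it.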
Anchor
Anchor
● Developed by Lars Rönnbäck
● 6th normal form data warehouse
● Have to transform semi-structured data to match the anchor model
● Provides a flexible model that data marts can be built upon
● More details: http://www.anchormodeling.com/
Textual Disambiguation
● Developed by Bill Inmon
● Breaks semi-structured data down by context
● Converts the data into a structured format, consumable by tools
● Stores data within the data warehouse – 8th/9th normal form
● White papers and more details are on Bill's website:
http://www.forestrimtech.com/
Source: http://www.slideshare.net/Roenbaeck/anchor-modeling-8140128
Working With “Unstructured” Data
● Most data tools require structure (Database schema, clear-cut data formatting)
● Business and technical knowledge required
– Business to provide the pattern – the “grammar” or syntax
– Technical to provide the “how”
Working With “Unstructured” Data
“The car is hot.”
Identifying Context
● It's a really nice car.
● Its internal temperature requires adjustment.
● It's hot to the touch.
● It's on fire.
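A rule-based sketch of this kind of disambiguation, assuming hand-written context cues — the cue words and interpretation labels below are invented for the example:

```python
# Hand-written context cues -> interpretation labels (all hypothetical)
RULES = {
    "nice": "compliment",
    "temperature": "cabin climate",
    "touch": "surface heat",
    "fire": "emergency",
}

def disambiguate(sentence: str, context: str) -> str:
    """Pick an interpretation of an ambiguous sentence from its surrounding text."""
    for cue, meaning in RULES.items():
        if cue in context.lower():
            return meaning
    return "unknown"

print(disambiguate("The car is hot.", "Please adjust the temperature"))  # cabin climate
```

Real textual disambiguation needs far richer grammars, but the division of labor is the same: the business supplies the cue table, the technical side supplies the matching machinery.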
How to Implement
● Map/Reduce code, Hive queries, data integration tools (Pentaho, Talend)
● Have to create the grammar/syntax rules for your particular business
● MDM is _not_ the solution
● Best to have a data warehouse based on subject/relationships
– Data Vault
– Anchor
– Textual Disambiguation
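To make the "Map/Reduce code" option concrete, here is the two-phase shape in plain Python — no Hadoop cluster involved; a real job would ship functions like these to Hadoop Streaming or a similar runner:

```python
from collections import defaultdict
from itertools import chain

def mapper(line):
    """Map phase: emit a (token, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield (word.strip(".,!?"), 1)

def reducer(pairs):
    """Reduce phase: sum the counts per token."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

lines = ["The car is hot.", "The car is fast."]
counts = reducer(chain.from_iterable(mapper(line) for line in lines))
print(counts["car"])  # 2
```

The business grammar/syntax rules from the earlier slides would live inside `mapper`, turning raw text into structured (key, value) records rather than bare word counts.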
Data Symbiosis
● Data in the data lake can't stand on its own
– Ties back to rest of the structured data
– Requires firm understanding of business rules/logic
● Provides richer data sets
● Difficult to do before data lakes; after adding a data lake, the problems magnify
– But so do the rewards!
Data Quality
● Not just a problem for Data Warehouses!
● Measuring “fit for purpose”
● Same rules used for data warehouses apply to big data
Principles of Data Quality
● Consistency
● Correctness
● Timeliness
● Precision
● Unambiguous
● Completeness
● Reliability
● Accuracy
● Objectivity
● Conciseness
● Usefulness
● Usability
● Relevance
● Quantity
Source: Data Quality Fundamentals, The Data Warehouse Institute
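Several of these principles can be measured directly. A minimal sketch, with invented sample records and business rules:

```python
# Invented sample records for illustration
records = [
    {"id": 1, "email": "a@example.com", "age": 34},
    {"id": 2, "email": None,            "age": 29},
    {"id": 3, "email": "c@example.com", "age": -5},
]

def completeness(rows, field):
    """Completeness: fraction of rows where the field is populated."""
    return sum(r[field] is not None for r in rows) / len(rows)

def correctness(rows, field, rule):
    """Correctness: fraction of rows whose value passes a business rule."""
    return sum(rule(r[field]) for r in rows) / len(rows)

print(round(completeness(records, "email"), 2))  # 0.67
print(round(correctness(records, "age", lambda a: a is not None and 0 <= a < 130), 2))  # 0.67
```

Scores like these, tracked per field and per source, are how "fit for purpose" stops being a slogan and becomes a number.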
Why Data Quality?
● Main way to control/tame your data problems
● Where most hidden costs lie, because quality is the hardest thing to fix
● Target upstream sources for problem solutions
How to Implement
● Data integration tools
● Custom coding (Map/Reduce, etc.)
● Data Profiling
● MDM (as central “dictionary”/“grammar” handler)
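Data profiling in particular is easy to prototype before committing to a tool. A sketch that computes a few basic statistics for one field — the field name and data below are invented:

```python
def profile(rows, field):
    """Basic profile of one field: null rate, distinct count, value range."""
    values = [row.get(field) for row in rows]
    present = [v for v in values if v is not None]
    return {
        "null_rate": 1 - len(present) / len(values),
        "distinct": len(set(present)),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
    }

rows = [{"age": 34}, {"age": 29}, {"age": None}, {"age": 29}]
print(profile(rows, "age"))
# {'null_rate': 0.25, 'distinct': 2, 'min': 29, 'max': 34}
```

Running a profile like this over every field in a new source is a cheap first pass at the quality principles listed above, and flags the columns that need real rules.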
Tooling
Does Your Tool Chain...
● Support Hadoop?
● Interface with non-traditional database solutions (i.e. not an RDBMS)?
● Allow for integration across disparate sources?
● Support data quality?
If Not...
Hadoop Ecosystem
● Bridges some of the gaps
– Hive – SQL interface to Hadoop (JDBC support)
● Provides even more power
https://hadoopecosystemtable.github.io/
Plus dozens of others... and growing
Sources
● http://en.wikipedia.org/wiki/File:Pitfall!_Coverart.png
● http://www.networkcomputing.com/big-data-defined/d/d-id/1204588
● http://www.appliedi.net/
● http://imgbuddy.com/internet-of-things-icon.asp
● http://www.smashingapps.com/, et al.
● http://www.colleenkerriganphotographs.com/p663330184/h217016CE#h217016ce
