Big Data
Pitfalls
April 8, 2015
Big Data Introduction
So What is it?
● Misnomer and marketing speak
● “Unstructured” data
– Text heavy
– Without obvious/clear structure
● Comes from many places, in many styles
Where It Comes From
Building Your Data Lake
A Common Evolution
A Common Evolution
Hadoop to the Rescue!
You Have a Data Lake!
Hadoop to the Rescue
● Cross system analytics?
● Data quality confidence?
● Source of truth?
● Tool chain support?
● Giant yellow elephants?
Hadoop to the Rescue
● Cross system analytics?
● Data quality confidence?
● Source of truth?
● Tool chain support?
● Giant yellow elephants?
If any are ignored...
You Have a Data Swamp!
Don't worry, even the Jedi had a Data Swamp...
Goal is to build a Data Reservoir
Reservoirs...
● Contain data that is...
– Managed
– Transformed
– Filtered
– Secured
– Portable
– Fit for purpose
Source: Gartner
Pitfalls
Data Warehouse Models
● Traditional models don't cover semi-structured data
● Modern models are hybrids that cross the structured/semi-structured boundary
Data Vault
Data Vault
● Developed by Dan Linstedt
● Tie technical keys across structured and semi-structured data sources
● Semi-structured data can be made more structured and loaded into a relational data vault
● Tools have to support crossing sources
● More details: http://www.tdan.com/view-articles/5054/
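The key-tying idea above can be sketched in a few lines of Python. This is a minimal illustration, not Linstedt's specification: the field names (`customer_id`, `cust`) and the MD5-based surrogate key are assumptions made for the example.

```python
import hashlib
import json

def hub_key(business_key: str) -> str:
    """Surrogate key for a Data Vault hub: hash of the normalized business key."""
    return hashlib.md5(business_key.strip().upper().encode("utf-8")).hexdigest()

# Structured source: a row from a relational CRM table (hypothetical schema)
crm_row = {"customer_id": "C-1001", "name": "Acme Corp"}

# Semi-structured source: a JSON clickstream event (hypothetical schema)
event = json.loads('{"cust": "c-1001", "page": "/pricing"}')

# Both sources resolve to the same hub key, so descriptive satellites
# from either side attach to one customer hub.
print(hub_key(crm_row["customer_id"]) == hub_key(event["cust"]))  # True
```

Applying the same normalization-and-hash rule on every side is what lets tools "cross sources": the hub carries the business key once, and each source links to it.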
Anchor
Anchor
● Developed by Lars Rönnbäck
● 6th normal form data warehouse
● Have to transform semi-structured data to match the anchor model
● Provides a flexible model that data marts can be built upon
● More details: http://www.anchormodeling.com/
Textual Disambiguation
● Developed by Bill Inmon
● Breaks semi-structured data down by context
● Converts the data into a structured format, consumable by tools
● Stores data within the data warehouse – 8th/9th normal form
● White papers and more details are on Bill's website:
http://www.forestrimtech.com/
Source: http://www.slideshare.net/Roenbaeck/anchor-modeling-8140128
Working With “Unstructured” Data
● Most data tools require structure (Database schema, clear-cut data formatting)
● Business and technical knowledge required
– Business to provide the pattern – the “grammar” or syntax
– Technical to provide the “how”
Working With “Unstructured” Data
“The car is hot.”
Identifying Context
● It's a really nice car.
● Its internal temperature requires adjustment.
● It's hot to the touch.
● It's on fire.
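A rule-based sketch of this kind of disambiguation, assuming hand-written context cues — the cue words and interpretation labels below are invented for the example:

```python
# Hand-written context cues -> interpretation labels (all hypothetical)
RULES = {
    "nice": "compliment",
    "temperature": "cabin climate",
    "touch": "surface heat",
    "fire": "emergency",
}

def disambiguate(sentence: str, context: str) -> str:
    """Pick an interpretation of an ambiguous sentence from its surrounding text."""
    for cue, meaning in RULES.items():
        if cue in context.lower():
            return meaning
    return "unknown"

print(disambiguate("The car is hot.", "Please adjust the temperature"))  # cabin climate
```

Real textual disambiguation needs far richer grammars, but the division of labor is the same: the business supplies the cue table, the technical side supplies the matching machinery.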
How to Implement
● Map/Reduce code, Hive queries, data integration tools (Pentaho, Talend)
● Have to create the grammar/syntax rules for your particular business
● MDM is _not_ the solution
● Best to have a data warehouse based on subject/relationships
– Data Vault
– Anchor
– Textual Disambiguation
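To make the "Map/Reduce code" option concrete, here is the two-phase shape in plain Python — no Hadoop cluster involved; a real job would ship functions like these to Hadoop Streaming or a similar runner:

```python
from collections import defaultdict
from itertools import chain

def mapper(line):
    """Map phase: emit a (token, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield (word.strip(".,!?"), 1)

def reducer(pairs):
    """Reduce phase: sum the counts per token."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

lines = ["The car is hot.", "The car is fast."]
counts = reducer(chain.from_iterable(mapper(line) for line in lines))
print(counts["car"])  # 2
```

The business grammar/syntax rules from the earlier slides would live inside `mapper`, turning raw text into structured (key, value) records rather than bare word counts.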
Data Symbiosis
● Data in the data lake can't stand on its own
– Ties back to rest of the structured data
– Requires firm understanding of business rules/logic
● Provides richer data sets
● Difficult to do before data lakes; after adding a data lake, the problems magnify
– But so do the rewards!
Data Quality
● Not just a problem for Data Warehouses!
● Measuring “fit for purpose”
● Same rules used for data warehouses apply to big data
Principles of Data Quality
● Consistency
● Correctness
● Timeliness
● Precision
● Unambiguous
● Completeness
● Reliability
● Accuracy
● Objectivity
● Conciseness
● Usefulness
● Usability
● Relevance
● Quantity
Source: Data Quality Fundamentals, The Data Warehouse Institute
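Several of these principles can be measured directly. A minimal sketch, with invented sample records and business rules:

```python
# Invented sample records for illustration
records = [
    {"id": 1, "email": "a@example.com", "age": 34},
    {"id": 2, "email": None,            "age": 29},
    {"id": 3, "email": "c@example.com", "age": -5},
]

def completeness(rows, field):
    """Completeness: fraction of rows where the field is populated."""
    return sum(r[field] is not None for r in rows) / len(rows)

def correctness(rows, field, rule):
    """Correctness: fraction of rows whose value passes a business rule."""
    return sum(rule(r[field]) for r in rows) / len(rows)

print(round(completeness(records, "email"), 2))  # 0.67
print(round(correctness(records, "age", lambda a: a is not None and 0 <= a < 130), 2))  # 0.67
```

Scores like these, tracked per field and per source, are how "fit for purpose" stops being a slogan and becomes a number.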
Why Data Quality?
● Main way to control/tame your data problems
● Where most hidden costs lie, because quality is the hardest thing to fix
● Target upstream sources for problem solutions
How to Implement
● Data integration tools
● Custom coding (Map/Reduce, etc.)
● Data Profiling
● MDM (as central “dictionary”/“grammar” handler)
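Data profiling in particular is easy to prototype before committing to a tool. A sketch that computes a few basic statistics for one field — the field name and data below are invented:

```python
def profile(rows, field):
    """Basic profile of one field: null rate, distinct count, value range."""
    values = [row.get(field) for row in rows]
    present = [v for v in values if v is not None]
    return {
        "null_rate": 1 - len(present) / len(values),
        "distinct": len(set(present)),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
    }

rows = [{"age": 34}, {"age": 29}, {"age": None}, {"age": 29}]
print(profile(rows, "age"))
# {'null_rate': 0.25, 'distinct': 2, 'min': 29, 'max': 34}
```

Running a profile like this over every field in a new source is a cheap first pass at the quality principles listed above, and flags the columns that need real rules.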
Tooling
Does Your Tool Chain...
● Support Hadoop?
● Interface with non-traditional database solutions (i.e. not an RDBMS)?
● Allow for integration across disparate sources?
● Support data quality?
If Not...
Hadoop Ecosystem
● Bridges some of the gaps
– Hive – SQL interface to Hadoop (JDBC support)
● Provides even more power
https://hadoopecosystemtable.github.io/
Plus dozens of others... and growing
Sources
● http://en.wikipedia.org/wiki/File:Pitfall!_Coverart.png
● http://www.networkcomputing.com/big-data-defined/d/d-id/1204588
● http://www.appliedi.net/
● http://imgbuddy.com/internet-of-things-icon.asp
● http://www.smashingapps.com/, et al.
● http://www.colleenkerriganphotographs.com/p663330184/h217016CE#h217016ce
