
Big Data Pitfalls

Big Data has been around long enough that some common issues occur whenever an organization tries to implement it and integrate it into its ecosystem. This presentation covers some of those pitfalls, which also affect traditional data warehouse/business intelligence ecosystems.


  1. Big Data Pitfalls (April 8, 2015)
  2. Big Data Introduction
  3. So What Is It? ● Misnomer and marketing speak ● “Unstructured” data – Text-heavy – Without obvious/clear structure ● Comes from many places, in many styles
  4. (image slide)
  5. Where It Comes From
  6. Building Your Data Lake
  7. A Common Evolution
  8. A Common Evolution
  9. Hadoop to the Rescue!
  10. You Have a Data Lake!
  11. Hadoop to the Rescue ● Cross-system analytics? ● Data quality confidence? ● Source of truth? ● Tool chain support? ● Giant yellow elephants?
  12. Hadoop to the Rescue ● Cross-system analytics? ● Data quality confidence? ● Source of truth? ● Tool chain support? ● Giant yellow elephants? If any are ignored...
  13. You Have a Data Swamp!
  14. Don't worry, even the Jedi had a Data Swamp...
  15. Goal is to build a Data Reservoir
  16. Reservoirs... ● Contain data that is... – Managed – Transformed – Filtered – Secured – Portable – Fit for purpose (Source: Gartner)
  17. Pitfalls
  18. Data Warehouse Models ● Traditional models don't cover semi-structured data ● Modern models are hybrids that cross the structured/semi-structured boundary
  19. Data Vault
  20. Data Vault ● Developed by Dan Linstedt ● Ties technical keys across structured and semi-structured data sources ● Semi-structured data can be made more structured and loaded into a relational Data Vault ● Tools have to support crossing sources ● More details: http://www.tdan.com/view-articles/5054/
  21. Anchor
  22. Anchor ● Developed by Lars Rönnbäck ● A 6th-normal-form data warehouse ● Semi-structured data has to be transformed to match the anchor model ● Provides a flexible model that marts can be built upon ● More details: http://www.anchormodeling.com/
  23. Textual Disambiguation ● Developed by Bill Inmon ● Breaks semi-structured data down by context ● Converts the data into a structured format consumable by tools ● Stores the data within the data warehouse – 8th/9th normal form ● White papers and more details are on Bill's website: http://www.forestrimtech.com/
  24. Source: http://www.slideshare.net/Roenbaeck/anchor-modeling-8140128
  25. Working With “Unstructured” Data ● Most data tools require structure (a database schema, clear-cut data formatting) ● Business and technical knowledge required – Business provides the pattern, “the grammar or syntax” – Technical provides the “how”
  26. Working With “Unstructured” Data: “The car is hot.”
  27. Identifying Context ● It's a really nice car ● Its internal temperature requires adjustment ● It's hot to the touch ● It's on fire
  28. (image slide)
  29. How to Implement ● Map/Reduce code, Hive queries, data integration tools (Pentaho, Talend) ● Have to create the grammar/syntax rules for the particular business ● MDM is _not_ the solution ● Best to have a data warehouse based on subjects/relationships – Data Vault – Anchor – Textual Disambiguation
  30. Data Symbiosis ● Data in the data lake can't stand on its own – It ties back to the rest of the structured data – Requires a firm understanding of business rules/logic ● Provides richer data sets ● Difficult to do before data lakes; after adding a data lake the problems magnify – But so do the rewards!
  31. Data Quality ● Not just a problem for data warehouses! ● Measuring “fit for purpose” ● The same rules used for data warehouses apply to big data
  32. Principles of Data Quality ● Consistency ● Correctness ● Timeliness ● Precision ● Unambiguous ● Completeness ● Reliability ● Accuracy ● Objectivity ● Conciseness ● Usefulness ● Usability ● Relevance ● Quantity (Source: Data Quality Fundamentals, The Data Warehouse Institute)
  33. Why Data Quality? ● The main way to control/tame your data problems ● Carries the most hidden costs, because it's the hardest to fix ● Target upstream for problem solutions
  34. How to Implement ● Data integration tools ● Custom coding (Map/Reduce, etc.) ● Data profiling ● MDM (as a central “dictionary”/“grammar” handler)
  35. Tooling
  36. Does Your Tool Chain... ● Support Hadoop? ● Interface with non-traditional database solutions (i.e. not an RDBMS)? ● Allow for integration across disparate sources? ● Support data quality?
  37. If Not...
  38. Hadoop Ecosystem ● Bridges some of the gaps – Hive: a SQL-to-Hadoop interface (JDBC support) ● Provides even more power: https://hadoopecosystemtable.github.io/ – plus dozens of others... and growing
  39. Sources ● http://en.wikipedia.org/wiki/File:Pitfall!_Coverart.png ● http://www.networkcomputing.com/big-data-defined/d/d-id/1204588 ● http://www.appliedi.net/ ● http://imgbuddy.com/internet-of-things-icon.asp ● http://www.smashingapps.com/, et al. ● http://www.colleenkerriganphotographs.com/p663330184/h217016CE#h217016ce
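Slide 20's central idea, tying technical keys across structured and semi-structured sources, can be sketched in a few lines of Python. The hash-key convention, field names, and sample records below are illustrative assumptions, not part of the deck:

```python
import hashlib
import json

def hub_key(business_key: str) -> str:
    # Derive a surrogate hash key from a normalized business key
    # (hashed keys are a common Data Vault convention; the exact
    # normalization rules here are assumptions).
    return hashlib.md5(business_key.strip().upper().encode("utf-8")).hexdigest()

# Structured source: a CRM row.
crm_row = {"customer_id": "C-1001", "name": "Acme Corp"}

# Semi-structured source: a support ticket arriving as JSON text.
ticket = json.loads('{"customer_id": "c-1001", "body": "The car is hot."}')

# Both sources land in the same hub because they share a business key,
# even though the casing differs between systems.
hub_customer = {hub_key(crm_row["customer_id"])}
sat_crm = {hub_key(crm_row["customer_id"]): {"name": crm_row["name"]}}
sat_ticket = {hub_key(ticket["customer_id"]): {"body": ticket["body"]}}

assert set(sat_crm) == set(sat_ticket) == hub_customer
```

Once both satellites hang off the same hub key, cross-system analytics reduce to joins on that key.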
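The context problem on slide 27 ("The car is hot.") lends itself to a rule-based sketch. The cue words and labels below are invented for illustration; as slide 25 notes, a real grammar would be supplied by the business side:

```python
import re

# Hypothetical cue-word "grammar": surrounding words disambiguate the sentence.
CONTEXT_RULES = [
    ({"nice", "sleek", "gorgeous"}, "desirable car"),
    ({"cabin", "thermostat", "ac"}, "interior temperature"),
    ({"touch", "hood", "engine"}, "surface temperature"),
    ({"fire", "flames", "smoke"}, "on fire"),
]

def classify(sentence: str) -> str:
    # Tokenize to lowercase words, then look for any cue-word match.
    words = set(re.findall(r"[a-z']+", sentence.lower()))
    for cues, label in CONTEXT_RULES:
        if words & cues:
            return label
    return "ambiguous"

print(classify("The car is hot."))                # no cues -> "ambiguous"
print(classify("The hood is hot to the touch."))  # -> "surface temperature"
```

The bare sentence stays ambiguous, which is exactly the slide's point: context, not the sentence itself, carries the meaning.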
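The Map/Reduce route mentioned on slide 29 can be mimicked in-process to show the shape of the computation. This is a toy sketch with no Hadoop involved; the tab-separated record format and the per-customer ticket count are assumptions for illustration:

```python
from collections import defaultdict

def map_phase(lines):
    # Mapper: emit (normalized customer key, 1) for each raw "id<TAB>text" record.
    for line in lines:
        cust, _, _text = line.partition("\t")
        yield cust.strip().upper(), 1

def reduce_phase(pairs):
    # Reducer: sum counts per key (the shuffle/group step is implicit in the dict).
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

raw = ["c-1001\tThe car is hot.", "C-1001\tStill hot!", "C-2002\tAll good."]
counts = reduce_phase(map_phase(raw))
assert counts == {"C-1001": 2, "C-2002": 1}
```

In a real cluster the same mapper/reducer pair would run distributed, but the grammar/normalization rules still have to be written per business, as the slide warns.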
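The data quality principles on slide 32 are measurable, not just aspirational. A minimal completeness check, with hypothetical rows and field names, might look like:

```python
def completeness(rows, field):
    # Fraction of rows where `field` is present and non-empty;
    # one simple operationalization of the "Completeness" principle.
    filled = sum(1 for row in rows if row.get(field) not in (None, ""))
    return filled / len(rows)

rows = [
    {"customer_id": "C-1001", "email": "ops@acme.example"},
    {"customer_id": "C-1002", "email": ""},
    {"customer_id": "C-1003"},
]
assert completeness(rows, "customer_id") == 1.0
assert abs(completeness(rows, "email") - 1 / 3) < 1e-9
```

Tracking such scores per field over time makes "fit for purpose" a number you can alert on, and points you at the upstream source that needs fixing.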
