Data Lake, Virtual Database, or Data Hub - How to Choose?

Data integration is just plain hard and there is no magic bullet. That said, three new data integration techniques do ameliorate the misery, making silo-busting possible, if not trivial. The three approaches – data lakes, virtual databases (aka federated databases), and data hubs – are a boon to organizations big enough to have separate systems, separate lines of business, and redundant acquired or COTS data stores. Each approach has its place, but how do you make the right decision about which data silo integration approach to choose and when?

This webinar describes how you can use the key concepts of data Movement, Harmonization, and Indexing to determine what you are giving up or investing in, and make the best decision for your project.


  1. Data Lake, Virtual Database, or Data Hub - How to Choose?
     Damon Feldman, Ph.D. (@damon.feldman), http://www.marklogic.com/blog/author/dfeldman/
     © COPYRIGHT 2016 MARKLOGIC CORPORATION. ALL RIGHTS RESERVED.
  2. Who am I?
     • Solutions Director at MarkLogic
     • About 8 years in the Big Data and Data Integration space
     • Previously, in OOP, JEE worlds
     • Focus on Data Hub and Customer or Person-360° systems
  3. But Why?
     • Data Silos
       • Usually work well for a single, operational purpose
       • Turn any cross-line-of-business question into a data integration effort
  4. How about EDW?
     • For a while, Enterprise Data Warehouses were the go-to solution for silos
     • One master schema to rule them
     • Data Modeler’s Dream!
     • Implementor’s Nightmare!
     • BMUF
     • Rigid and tightly coupled
  5. Data Incompatibilities
     • Three forms of data incompatibilities
     • Naming is the simplest
       • firstName vs. GIVEN_NAME
     • Structural is somewhat harder
     • Semantic differences are the most challenging
       • Status: {in cart, ordered, shipped, delivered}
       • Status: {selected, paid, complete}
     [Diagram: two structurally different person models]
     Model 1 (three tables): PERSON (PERS_ID, DOB, FNAME, LNAME); PERS_ADDR_REL (PERS_ID, ADDR_ID); ADDRESS (ADDR_ID, LINE1, CITY, ZIP, TYPE: {US, UK})
     Model 2 (one table): PERSON (PERS_ID, DOB, FNAME, LNAME, ADDR_L1, ADDR_CITY, ADDR_ZIP, ADDR_MAILING_L1, ADDR_MAILING_ZIP)
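The three kinds of incompatibility on this slide can be made concrete with a small sketch. It maps both person models into one common shape (naming and structural differences) and maps the two status vocabularies onto a shared one (semantic differences). The target field names (`first_name`, `addresses`, and so on) and the canonical status values are assumptions chosen for illustration; the slide names only the source fields and statuses, and how the two status sets really correspond is a business decision.

```python
# Hypothetical canonical status vocabulary. The slide lists the two source
# vocabularies but not how they correspond, so this mapping is an assumption.
CANONICAL_STATUS = {
    "orders_a": {"in cart": "open", "ordered": "paid",
                 "shipped": "paid", "delivered": "fulfilled"},
    "orders_b": {"selected": "open", "paid": "paid", "complete": "fulfilled"},
}


def _common_person(row, addresses):
    # Naming differences are handled here: source columns -> common names.
    return {
        "id": row["PERS_ID"],
        "dob": row["DOB"],
        "first_name": row["FNAME"],
        "last_name": row["LNAME"],
        "addresses": addresses,
    }


def person_from_three_tables(person, rels, addresses):
    """Model 1: follow PERS_ADDR_REL to collect the person's ADDRESS rows."""
    by_id = {a["ADDR_ID"]: a for a in addresses}
    linked = [by_id[r["ADDR_ID"]] for r in rels
              if r["PERS_ID"] == person["PERS_ID"]]
    return _common_person(
        person,
        [{"line1": a["LINE1"], "city": a["CITY"], "zip": a["ZIP"]} for a in linked],
    )


def person_from_one_table(row):
    """Model 2: unpack the address columns embedded in the wide PERSON row."""
    addresses = [{"line1": row["ADDR_L1"], "city": row["ADDR_CITY"],
                  "zip": row["ADDR_ZIP"]}]
    if row.get("ADDR_MAILING_L1"):
        # Model 2 has no mailing-city column, so city stays unknown.
        addresses.append({"line1": row["ADDR_MAILING_L1"], "city": None,
                          "zip": row.get("ADDR_MAILING_ZIP")})
    return _common_person(row, addresses)


def canonical_status(source, status):
    """Semantic differences: translate a source status to the shared one."""
    return CANONICAL_STATUS[source][status]
```

Once both models land in the same shape, a consumer no longer needs to know which system a person record came from.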
  6. Three New Approaches
     • Data Lakes
       • Put it all somewhere else
     • Virtual Databases (AKA Federated Databases)
       • Pretend it is somewhere else
     • Data Hubs
       • Put it all somewhere else, Harmonize, and Index it for operational use
     And a Framework to understand and choose approaches
  7. A Use Case
     Consider a customer churn use case
     → Review high-value customers
     → .. Who are at-risk customers
     → .. Particularly if they are dropping or cancelling services
     → Proactively address their trouble tickets or complaints.
     [Diagram: three sources (Customer Lifetime Value, Customer Support, Order/Change/Drop) with sample text “Need more … please upgrade …” and “Abysmal… dissatisfied …”]
  8. Data Lakes
     • Copy the data to a new infrastructure
       • Typically Hadoop, but perhaps MarkLogic or other NoSQL
       • Difficult with SQL because many sources → Load “as-is”
     • Operational Separation
     [Diagram: Support, CLV, and Orders go through a Copy Process into the DATA LAKE, which feeds BI/Analytics. Data is Moved to one place, but still in varied structures.]
  9. Virtual Database
     • Query everything in real time
     • Transparent to the caller
     • True real-time
     • Data is not Moved or Harmonized (except in memory during processing)
     [Diagram: Retain/intervene, Churn Analysis, and Reporting query the virtual database, which performs Query Conversion and Data Harmonization and runs a Query and Transform against each of Support, CLV, and Orders. Data Remains in source systems.]
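A minimal sketch of the virtual-database idea described on this slide: every request fans out to the source systems at call time, and each result is transformed into a common form only in memory. The three in-memory collections stand in for live Support, CLV, and Orders systems; their field names, the sample values, and the `customer_view` function are all hypothetical.

```python
# Stand-ins for three live source systems, each with its own shape.
# All field names and values here are invented sample data.
SUPPORT_TICKETS = [{"cust": "C1", "text": "Network upgrade needed"}]
CLV_SCORES = {"C1": {"clv_usd": 5000}}
ORDER_EVENTS = [{"CUSTOMER_ID": "C1", "ACTION": "DROP"}]


def query_support(cid):
    return [t for t in SUPPORT_TICKETS if t["cust"] == cid]


def query_clv(cid):
    return CLV_SCORES.get(cid, {})


def query_orders(cid):
    return [e for e in ORDER_EVENTS if e["CUSTOMER_ID"] == cid]


# Each federate is a (query, transform) pair: the query runs against the
# source system; the transform harmonizes its result in memory.
FEDERATES = [
    (query_support, lambda rows: {"open_tickets": [r["text"] for r in rows]}),
    (query_clv, lambda row: {"lifetime_value_usd": row.get("clv_usd")}),
    (query_orders, lambda rows: {
        "dropped_service": any(r["ACTION"] == "DROP" for r in rows)}),
]


def customer_view(cid):
    """Build one harmonized view per call; nothing is copied or persisted."""
    view = {"customer_id": cid}
    for query, transform in FEDERATES:
        view.update(transform(query(cid)))  # load lands on the source system
    return view
```

Calling `customer_view("C1")` twice queries all three sources twice, which is the trade the slide describes: always current, but the work and the load are repeated on every request.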
  10. Data Hubs
      • Copy as with a Data Lake
      • Harmonize and Index
        • Regular structures for analytics, reporting, consumption
        • Indexes atop the common structures
      [Diagram: Support, CLV, and Orders are copied into the DATA HUB and Harmonized; the hub serves BI/Analytics and Consumers. Data is Moved to one place, and also Harmonized and Indexed.]
  11. Beneath and Beyond the Terms
      The terms are useful, but vague, and don’t tell us what works for our next project
      Consider all these approaches in terms of:
      • Movement
      • Harmonization
      • Indexing
  12. Data Movement
      • Data Movement is copying data to new, physical storage so it can be accessed via new servers and processes
      • Operational Separation
      • Organizational Separation
      [Diagram labels: Orders System; Retain / Intervene; Churn Analysis; Reporting; Sales Department; IT]
  13. Data Movement and the Three Approaches
      • Data Lakes are all but defined by Movement
        • Operational and Organizational separation
      • Virtual Databases - unique in not Moving data
        • Load is pushed to the source systems
        • Backup, HA/DR, Security implemented on all source systems
      • Data Hubs also Move data
  14. Data Harmonization
      • Recall: Three forms of data incompatibility
        • Naming
        • Structural
        • Semantic
      [Diagram: two structurally different person models]
      Model 1 (three tables): PERSON (PERS_ID, DOB, FNAME, LNAME); PERS_ADDR_REL (PERS_ID, ADDR_ID); ADDRESS (ADDR_ID, LINE1, CITY, ZIP, TYPE: {US, UK})
      Model 2 (one table): PERSON (PERS_ID, DOB, FNAME, LNAME, ADDR_L1, ADDR_CITY, ADDR_ZIP, ADDR_MAILING_L1, ADDR_MAILING_ZIP)
  15. Data Harmonization
      • Harmonization is mapping into a common structure for key data elements
      • Eventually, data must be consumed, aggregated and analyzed in a common form
      Orders System
      → $1400 equipment order
      → £270/month – 36 month contract
      → Exchange Rate: 1.28
      Maintenance/trouble tickets
      → Network upgrade needed
      → Projected cost $3,000
      Customer | Expected Net Revenue
      Oren Wilkins | $4,280
      Sarah Ravnick | $17,200
      David Perez | …
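The inputs on this slide can be worked through as a small harmonization exercise: convert everything to one currency, then combine values from two systems. The sketch below rests on two assumptions that the slide does not state: the exchange rate means 1 GBP = 1.28 USD, and expected net revenue is order revenue minus projected maintenance cost. Under those assumptions the contract is worth 270 × 36 = 9,720 GBP, or 9,720 × 1.28 = 12,441.60 USD, and the net is 1,400 + 12,441.60 − 3,000 = 10,841.60 USD. It does not attempt to reproduce the figures in the slide's customer table, whose inputs are not given.

```python
from decimal import Decimal

# Assumed direction: 1 GBP = 1.28 USD (the slide says only "Exchange Rate: 1.28").
USD_PER_UNIT = {"USD": Decimal("1"), "GBP": Decimal("1.28")}


def to_usd(amount, currency):
    """Harmonize a monetary amount into USD."""
    return Decimal(amount) * USD_PER_UNIT[currency]


def expected_net_revenue(revenue_items, cost_items):
    """Assumed formula: total revenue minus total projected cost, in USD."""
    revenue = sum((to_usd(a, c) for a, c in revenue_items), Decimal("0"))
    cost = sum((to_usd(a, c) for a, c in cost_items), Decimal("0"))
    return revenue - cost


# Values from the slide: Orders System and Maintenance/trouble tickets.
equipment_order = ("1400", "USD")
contract = (Decimal("270") * 36, "GBP")   # 36-month contract at 270 GBP/month
network_upgrade = ("3000", "USD")

net = expected_net_revenue([equipment_order, contract], [network_upgrade])
print(net)
```

`Decimal` is used rather than `float` so that currency arithmetic stays exact.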
  16. Data Harmonization
      • Harmonization is the “value add” in the process
      • The earlier the better for maximum use
        • Store it
        • Index it
      • Yet BMUF fails often
      • Progressive Harmonization
      [Diagram: two source Person records (one with Fname, Lname, BIRTH, PHYSATTR, PHYSATTR; one with Given-name, Family-name, Eye-color, Demographics, DOB) are mapped into a Harmonized Person (Name, Address, DoB, Source, Eye color, Height, Credit Risk) across Iteration 1 and Iteration 2]
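Progressive Harmonization can be sketched as an envelope that stores the harmonized fields next to the untouched source record, so a later iteration can harmonize more fields without reloading anything. The source field names come from the slide's diagram, but the details are assumptions: `PHYSATTR` is modeled as a list of `{"type", "value"}` entries, `DOB` is read from the top level of the second shape, and which fields belong to which iteration is invented for illustration.

```python
def harmonize(source, iteration):
    """Map either source shape into the common Person form.

    Iteration 1 covers name and date of birth; iteration 2 adds eye color.
    (This split between iterations is an assumption.)
    """
    if "Fname" in source:  # first source shape
        out = {"name": f"{source['Fname']} {source['Lname']}",
               "dob": source["BIRTH"]}
        eye = next((a["value"] for a in source.get("PHYSATTR", [])
                    if a["type"] == "eye"), None)
    else:  # second source shape
        out = {"name": f"{source['Given-name']} {source['Family-name']}",
               "dob": source["DOB"]}
        eye = source.get("Eye-color")
    if iteration >= 2:
        out["eye_color"] = eye
    return out


def envelope(source, iteration):
    """Store harmonized data alongside the original, as-is record."""
    return {"harmonized": harmonize(source, iteration), "source": source}


def reharmonize(env, iteration):
    """A later iteration re-derives the harmonized part from the kept source."""
    return envelope(env["source"], iteration)
```

Because the source record is never discarded, moving from iteration 1 to iteration 2 is a recomputation over data already in place rather than a new integration project.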
  17. Harmonization and the Approaches
      • Data Lakes don’t Harmonize
        • Harmonization is pushed downstream, or implicit in the jobs
        • Often ETL copies from format to format (particularly in Hadoop)
      • Virtual Databases Harmonize in real time
        • Each source query and result is harmonized in memory
        • Pushes the load to the source systems
      • Data Hubs Harmonize and Persist
        • Explicit storage and management of Harmonized data
        • Governable
      [Diagram labels: Data Lake; Job 1; Job 2; Silo 1; Silo 2; Query; Data Hub]
  18. Indexing
      “Who Said Databases Weren’t a Good Idea?” - Ken Krupa, Enterprise CTO, MarkLogic
      • Indexing is a decision to make something fast
        → Finding, totaling, sorting, grouping, correlating, analyzing
        → Sometimes also accessing
      • Less obviously
        → Caching and memory use
        → Reference data usage
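"Indexing is a decision to make something fast" can be illustrated with the simplest possible index: a lookup table from field value to matching records, built once and reused. This is a generic sketch, not a description of how MarkLogic or any other product implements its indexes; the record fields and function names are hypothetical.

```python
from collections import defaultdict


def scan(records, field, value):
    """No index: every query reads every record."""
    return [r for r in records if r[field] == value]


def build_index(records, field):
    """Pay once, up front: group records by the value of one field."""
    index = defaultdict(list)
    for r in records:
        index[r[field]].append(r)
    return index


# Invented sample data.
customers = [
    {"id": "C1", "status": "at_risk"},
    {"id": "C2", "status": "ok"},
    {"id": "C3", "status": "at_risk"},
]

by_status = build_index(customers, "status")
at_risk = by_status["at_risk"]  # one dictionary lookup, no full scan
```

The index returns the same answer as the scan; the difference is that the cost is paid when data is loaded rather than on each query, which is what makes interactive and operational use practical.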
  19. Indexing Benefits
      • Advance from Batch to Operational
      • Micro-service or SOA architectures
        • find the latest address
        • A 360° summary record of a customer
      • Human Services: reviewing FSA recipients – interactive dashboard
      • “Run your business”
  20. Three Approaches Revisited – Virtual Databases
      Issues
      • Least-common-denominator Query
        • Paradox: more systems = less power
      • Coupling to source systems – schema change = broken DB
      • Weakest link problem - HA/DR, overload
      • Complexity
        • Paging, sorting, relevance, dealing with a down federate
      Benefit
      • Real Time is easy
      • May be ok for small or initial systems
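The "Complexity" bullet above can be seen even in a toy federation. The sketch below returns the first `n` results across several federates, which works only if every federate can return rows already sorted by the same key (the least-common-denominator constraint), and it has to decide what to do when one federate is down. The `FederateDown` exception and the `top_n` function are hypothetical names for illustration, not part of any product API.

```python
import heapq
import itertools


class FederateDown(Exception):
    """Raised by a federate's fetch function when its source is unreachable."""


def top_n(federates, n, key):
    """Merge pre-sorted results from each federate and keep the first n.

    federates maps a name to a zero-argument fetch function that returns
    rows already sorted by `key`. Returns (rows, names_of_down_federates).
    """
    streams, unavailable = [], []
    for name, fetch in federates.items():
        try:
            streams.append(fetch())
        except FederateDown:
            unavailable.append(name)  # partial answer instead of total failure
    merged = heapq.merge(*streams, key=key)
    return list(itertools.islice(merged, n)), unavailable
```

Returning a partial result plus the list of missing federates is only one possible policy; failing the whole query is the other obvious choice, and either way the caller now has to handle a case that a single database would not present.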
  21. Three Approaches Revisited - Data Lakes
      Issues
      • Still need to Harmonize the data
        • Typically in every batch job, ETL (PIG/HIVE) job, query, analysis
      • Risk of the “Data Swamp”
      • Batch focus
        • In-memory helps, but still batch
      • Frankenbeast workarounds create more silos, rather than solving the problem
      Benefit
      • The data is moved
      • Storage is cheap
      • One team and process to add functionality
  22. Three Approaches Revisited – Data Hubs
      Advantages
      • Most powerful solution – all of: Movement, Harmonization, Indexing
        • “Run your business”
      • Indexing builds on Harmonization
        • Harmonization is the value add, so index it!
      • Grow by regularizing, not by complicating
        • More data sources to the Harmonized form
        • Progressive Harmonization to increase the Harmonized data elements
      • HA/DR, scale, security, query power, batch efficiency, governance
      Tradeoffs
      • Dedicated hardware
      • Change detection or data push needed for real-time
  23. Data Lake vs Data Hub
      “The fact is, you don't put everything into a datastore and then go looking for something to do.” - Ted Dunning, MapR Chief Applications Architect
      Data Hubs are Operational and “Purpose-driven”
      Use case → API → Progressive Harmonization → Data Integration
      They do not merely have Harmonized data and Indexes, they are about serving Harmonized data and indexes to drive them.
  24. Value Over Time
      [Chart: ROI plotted against Time, Evolution, Range of Data, with curves for Data Lake, Data Hub, and Virtual Database]
  25. Evaluating MarkLogic with the Three Criteria
  26. MarkLogic Operational Data Hub Pattern
      Some say: “A Data Lake and EDW are better together”
      Translation: “This Data Lake is not doing a very good job, and never will”
      → MarkLogic brings database/data warehouse functions into the Data Lake, making it “Operational” and a “Data Hub” by virtue of Harmonization and Indexing
        ▪ but not by trying to build a (smaller) EDW
  27. MarkLogic for Operational Data Hubs
      • MarkLogic supports all three paradigms
        • Our product direction, consulting team, experience are focused on Data Hubs
      • MarkLogic is a database
        • Allowing an “Operational Data Hub”
        • Run your business AND observe your business
      • One place for the latest data – address, income, account status, health
      • Integrated data for 360° views
  28. MarkLogic ODH Features - Movement
      • Ingest data “as-is”
        • Native support for JSON, XML, Binary, RDF, Text, SQL, Geo
        • Data Loading tools for MPP batch ingest
        • Index latent structure in each
      • Commodity hardware, commodity disk
      • Tiered storage for cost effective storage
  29. Operational Data Hub Pattern in MarkLogic
      [Diagram: Data Flow from Source Systems to Consuming Applications in three stages]
      INGEST: Source Systems (RDBMS, Message Bus, Content Feed) are loaded into Staging as Source 1 … Source N Documents (raw, as-is data)
      HARMONIZE: Discovery and Harmonization produce Enveloped Documents (Entity 1 … Entity N) in Final (Harmonized, Indexed data)
      SERVE: Indexes, Query, and Services feed Consuming Applications (Operational Apps, Analysis/BI, Data Feeds)
  30. MarkLogic ODH Features - Harmonization
      • Best in class data Transform capabilities
        • XSLT, XQuery implemented to spec from the ground up
        • JavaScript via V8 engine
        • Triggers, data extraction from binaries, MPP processing
      • Multi-modal processing of many data formats
      • Ontology processing – RDFS, OWL
  31. MarkLogic ODH Features - Indexing
      • MarkLogic is built on the “Universal Index”
        • Text, document structure, fields, text and security in one index
      • Columnar range indexes for analysis and SQL processing
      • Triple index for RDF, SPARQL and semantic query
      • Geospatial index
      • Projection operations to expose one structure (e.g. JSON or XML) as SQL or RDF
      • Operational vs. purely analytical. You can run your business on MarkLogic
  32. Summary
      • Data Lakes and Hubs are on a continuum
        • Primarily distinguished by level of indexing
      • Virtual databases are a very different animal – and not usually in a good way
      • Within each pattern, Movement, Harmonization and Indexing are knobs to turn
        • Movement – for isolation and data access
        • Harmonization – for micro-services and value-add
        • Indexing – for speed and operational use cases
      • Consider your goals and requirements, and plan accordingly
  33. More Info
      MarkLogic Data Hub Framework (quick start): https://marklogic.github.io/marklogic-data-hub/
      MarkLogic Data Hub information: http://www.marklogic.com/solutions/operational-data-hub/
      Damon’s blog on data lakes: http://www.marklogic.com/blog/data-lakes-data-hubs-federation-one-best/
      Follow Damon on Twitter: https://twitter.com/damonfeldman
