Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Dama Day May 18 2017 Integrating Heterogeneous Data


Published on

Strategies to move from traditional data warehouse culture to the Big Data Lake

Guidance for designing a process for how the EDW can respond to the Big Data Analytics need

Understanding and maximizing the relationship of big data, Hadoop, and analytics

Steps for Using the Data Lake to “Put it all Together” - Ingest, Organize/Define/Complete, Blend and Learn, Report

Case study examples of how establishing a Data Lake can be a game-changer for today’s organization

Lessons Learned from client engagements

Published in: Data & Analytics
  • Create Your Own Odds Playing The Lottery ●●●
    Are you sure you want to  Yes  No
    Your message goes here

Dama Day May 18 2017 Integrating Heterogeneous Data

  1. 1. #DAMADay @joe_caserta @joe_caserta#DAMADay New York Chapter Integrating Heterogeneous Data DAMA Day Symposium May 18, 2017 Presented by Joe Caserta
  2. 2. #DAMADay @joe_caserta Caserta Timeline Launched Big Data practice Co-author, with Ralph Kimball, The Data Warehouse ETL Toolkit (Wiley) Data Analysis, Data Warehousing and Business Intelligence since 1996 Began consulting database programing and data modeling 25+ years hands-on experience building database solutions Founded Caserta Concepts in NYC Web log analytics solution published in Intelligent Enterprise magazine Launched Data Science, Data Interaction and Cloud practices Laser focus on extending Data Analytics with Big Data solutions 1986 2004 1996 2009 2001 2013 2012 2014 Dedicated to Data Governance Techniques on Big Data (Innovation) Awarded Top 20 Big Data Companies 2016 Top 20 Most Powerful Big Data consulting firms Launched Big Data Warehousing (BDW) Meetup NYC:4,000 Members 2016 Awarded Fastest Growing Big Data Companies 2016 Established best practices for big data ecosystem implementations
  3. 3. #DAMADay @joe_caserta About Caserta Concepts – Consulting Data Innovation and Modern Data Engineering – Award-winning company – Internationally recognized work force – Strategy, Architecture, Implementation, Governance – Innovation Partner – Strategic Consulting – Advanced Architecture – Build & Deploy • Leader in Enterprise Data Solutions – Big Data Analytics – Data Warehousing – Business Intelligence • Data Science • Cloud Computing • Data Governance
  4. 4. #DAMADay @joe_caserta Caserta Client Portfolio Retail/eCommerce & Manufacturing Finance, Healthcare & Insurance Digital Media/AdTech Education & Services
  5. 5. #DAMADay @joe_caserta Awards & Recognition Top 10 Fastest Growing Big Data Companies 2016
  6. 6. #DAMADay @joe_caserta Our Partners
  7. 7. #DAMADay @joe_caserta Aligning Heterogeneous Data Sources Awareness Consideration Purchase Service Loyalty Expansion PR Radio TV Print Outdoor Word of Mouth Direct Mail Customer Service Physical Touchpoints Digital Touchpoints Search Paid Content email Website/ Landing Pages Social Media Community Chat Social Media Call Center Offers Mailings Survey Loyalty Programs email Agents Partners Ads Website Mobile 3rd Party Sites Offers Web self-service
  8. 8. #DAMADay @joe_caserta Attribution Type Comments Single Touch Rules-Based Statistically Driven Assign the credit to the first or last exposure Assign the credit to each interaction based on business rules Assign the credit to interactions based on data-driven model Ad-Click Mailing MailingE-mail E-mailAd-Click Ad-Click 100% 33% 33% 33% 27% 49% 24% - Last touch only - Ignores bulk of customer journey - Undervalues other interactions and influencers - Subjective - Assigns arbitrary values to each interaction - Lacks analytics rigor to determine weights ü Looks at full behavior patterns ü Consider all touch points ü Can apply different models for best results ü Use data to find correlations between touch points (winning combinations) Why do we Care?
  9. 9. #DAMADay @joe_caserta Onboarding New Data Business: “I need to analyze some new data” ü IT collects requirements ü Creates normalized and/or dimensional data models ü Profiles and conforms and the data ü Sophisticated ETL programs and quality standards ü Loads it into data models ü Builds a BI semantic layer ü Creates dashboards and reports IT: “You’ll have your data in 3-6 months to see if it has value! – Onboarding new data is difficult! – Rigid Structures and Data Governance – Disconnected/removed from business
  10. 10. #DAMADay @joe_caserta The New Data Paradigm OLD WAY: • Structure Data  Ingest Data  Analyze Data • Fixed Capacity • Monolith NEW WAY: • Ingest Data  Analyze Data  Structure Data • Dynamic Capacity • Ecosystem RECIPE: • Cloud • Data Lake • Holistic Architecture & Framework
  11. 11. #DAMADay @joe_caserta Ingest Raw Data Organize, Define, Complete Munging, Blending Machine Learning Data Quality and Monitoring Metadata, ILM , Security Data Catalog Data Integration Fully Governed ( trusted) Arbitrary/Ad-hoc Queries and Reporting Big Data Ware house Data Science Workspace Data Lake – Integrated Sandbox Landing Area – Source Data in “Full Fidelity” Usage Pattern Data Governance Metadata, ILM, Security Corporate Data Pyramid (CDP)
  12. 12. #DAMADay @joe_caserta • Development local or distributed is identical • Beautiful high level API’s • Full universe of Python modules • Open source and Free • Blazing fast! Spark has become our default processing engine for a data engineering & science Why Use Spark?
  13. 13. #DAMADay @joe_caserta Cloud Component AWS Google Microsoft Scalable distributed storage S3 GCS Azure Storage Pluggable fit-for-purpose processing EMR DataProc HDInsight Compute Services EC2 GCE VMs Consistent extensible framework Spark Spark Spark Dimensional MPP Data Warehouse Redshift BigQuery Azure SQL Data Warehouse Data Streaming Kenesis PubSub Azure Stream Common Interface Jupyter DataLab Azure Notebook The Data Lake on the Cloud • Remove barriers between data ingestion and analysis • Democratize data with Just Enough Data Governance (JEDG)
  14. 14. #DAMADay @joe_caserta The Notebook is the ETL Tool
  15. 15. #DAMADay @joe_caserta Data Quality and Monitoring • BUILD a robust data quality subsystem: • Metadata and error event facts • Orchestration • Based on Data Warehouse ETL Toolkit • Each error instance of each data quality check is captured • Implemented as sub-system after ingestion • Each fact stores unique identifier of the defective source row
  16. 16. #DAMADay @joe_caserta Unifying the Customer Across Channels Customer Data Integration (CDI): Match and manage customer information from all available sources Marketing channels: DMP, Salesforce, Adobe, Social, Direct Mail, Call Center, CRM In other words… We need to figure out how to LINK people across systems!
  17. 17. #DAMADay @joe_caserta Mastering Master Data is Still MDM Standardize Match Survivorship Validate
  18. 18. #DAMADay @joe_caserta Standardization and Matching Cleanse and Parse: • Names • Resolve nicknames • Create deterministic hash, phonetic representation • Addresses • Emails • Phone Numbers Matching: Join based on combinations of cleansed and standardized data to create match results: Spark map operations: • Data cleansing, transformation, and standardization – Address Parsing: usaddress, postal-address, etc – Name Hashing: fuzzy, etc – Genderization: sexmachine, etc
  19. 19. #DAMADay @joe_caserta Mastering Unmanageable Source Data Reveal • Wait for the customer to “reveal” themselves • Create link between anonymous self and known profile Vector • May need behavioral statistical profiling • Compare use vectors Rebuild • Recluster all prior activities • Rebuild the Graph
  20. 20. #DAMADay @joe_caserta The Matching Process The matching process output gives us the relationships between customers: Great, but it’s not very useable, you need to traverse the dataset to find out 1234 and 1235 are the same person (and this is a trivial case) And we need to cluster and identify our survivors (vertex) xid yid match_type 1234 4849 phone 4849 5499 email 5499 1235 address 4849 7788 cookie 5499 7788 cookie 4849 1234 phone
  21. 21. #DAMADay @joe_caserta Graph to the Rescue 1234 4849 5499 7788 We just need to import our edges into a graph and “dump” out communities Don’t think table… think Graph! These matches are actually communities 1235
  22. 22. #DAMADay @joe_caserta Connected Components algorithm labels each connected component of the graph with the ID of its lowest-numbered vertex This lowest number vertex can serve as our “survivor” (not field survivorship) Connected Components xid yid 1234 4849 1234 5499 1234 1235 1234 7788 1234 7788 1234 1234
  23. 23. #DAMADay @joe_caserta Identity Resolution Process
  24. 24. #DAMADay @joe_caserta The BDW is still Dimensional
  25. 25. #DAMADay @joe_caserta Use Graph for Data Lineage
  26. 26. #DAMADay @joe_caserta Sample Solution Architecture
  27. 27. #DAMADay @joe_caserta Joe Caserta President, Caserta Concepts @joe_Caserta • Award-winning company • Transformative Data Strategies • Modern Data Engineering • Advanced Architecture • Innovation Partner • Strategic Consulting • Advanced Technical Design • Build & Deploy Solutions • BDW Meetup • New York City • 3,000+ members • Knowledge sharing Data is not important, it’s what you do with it that’s important! Thank You