Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

DxD big data and data quality as a service svcc oct 2015

686 views

Published on

Big Data and Data Quality as a Service SVCC 2015

Published in: Data & Analytics
  • Be the first to comment

DxD big data and data quality as a service svcc oct 2015

  1. 1. “Yes, you can plug Data Quality as a Service (DQaaS) into Big Data” October 4th, 2015 Master Data Management for big data October 4th, 2015
  2. 2. © Data by Design, LLC ⃝ www.dataXdesign.com© Data by Design, LLC ⃝ www.dataXdesign.com  Big data is here to stay and expanding rapidly  The 4th “V” of big data  How your data architecture is growing  Big data, and perhaps a big mess!  Data quality as a Service for your data lake  Tools of the trade (Microsoft MDS + Profisee’s Maestro)  Plugging DQaaS into your Big Data lakes
  3. 3. © Data by Design, LLC ⃝ www.dataXdesign.com© Data by Design, LLC ⃝ www.dataXdesign.com  USA (21 years) and France (14 years)  Database/Data Architecture – RDBMS’s:  Oracle, PostGres, MySQL, DB2, ……  Microsoft SQL Server/Analysis Services – Master Data Management  MDS  Maestro  Oracle  IBM (initiate) – Big Data:  Hadoop, ParAccel, NoSQL  Database talent pool: – Top database and data architects – Acclaimed Authors – Speakers at many events and conferences  Database Tools: P&T Tool - highly graphical for Sybase, Oracle and MS SQL Server  Database Education & Training  Partnerships  Microsoft  Profisee  DesignMind
  4. 4. © Data by Design, LLC ⃝ www.dataXdesign.com© Data by Design, LLC ⃝ www.dataXdesign.com
  5. 5. © Data by Design, LLC ⃝ www.dataXdesign.com© Data by Design, LLC ⃝ www.dataXdesign.com Big Data’s Rapid Expansion 5 Digital Data (created and replicated)  Reached 4 zettabytes at the end of 2013  That’s 50% more than in 2012  And, 4 times more than in 2010  Will hit 50 ZB’s by 2020! Source: IDC
  6. 6. © Data by Design, LLC ⃝ www.dataXdesign.com© Data by Design, LLC ⃝ www.dataXdesign.com Impact of bad data $3,100,000,000,000 IBM’s Estimate of Annual Cost of Bad Data to US Economy (IBM BDH) 15% Surveyed Executives Trusting Overall Data (IDC) 27% Surveyed Executives Sure of Data Accuracy (IBM)
  7. 7. © Data by Design, LLC ⃝ www.dataXdesign.com© Data by Design, LLC ⃝ www.dataXdesign.com You will be (or, are already) dealing with.. 7 Volume Velocity Variety Veracity  High-Volumes of data you need to access  High-Velocity of streaming data pouring in  High-Variety of information assets (structured, semi-structured, unstructured)  AND, you need to get to this data to enable enhanced decision making, insight discovery and process optimization Oh, and it better be good data (have Veracity) (source: IBM/Diginome)
  8. 8. © Data by Design, LLC ⃝ www.dataXdesign.com© Data by Design, LLC ⃝ www.dataXdesign.com Are you doing the right thing?  Hadoop (HDFS solutions) lends itself to problems that can be solved through distributed strategies coupled with advanced analytics.  Other problems just need a horizontally scalable solution (via MPP) with current mainstream analytics/database (like ParAccel, Teradata, PDW…)  AND, attack the quality of the data !!!!! (veracity) Understand the problem first, Next, apply the proper architecture, and finally, choose the proper tools!
  9. 9. © Data by Design, LLC ⃝ www.dataXdesign.com© Data by Design, LLC ⃝ www.dataXdesign.com Security/AccessFramework OperationsFramework-Monitoring/Alerts/Workflow DataIntegrationandAcquisitionFramework Analytics Put into different perspectives and trends (forecasting) Operational What we did Reactive Data Mining What other things might exist or are affecting what we did (why did it happen) Predictive Modeling, Simulation & Optimization See what is possible (next) and what is the best way to do it (prescriptive) Proactive Front Office Back Office Other Internal External Social/ Other Data Services/Tools/Data Visualization Islands of BI (non-IT) Data Warehouse/Marts (Aggregated/Dimensional) Operational Data Store (ODS) (Detail/Transactional/taxonomies) SAS/SPSS QlikView Dashboards/Light Analytics Application Bundled Reporting Business Objects OLAP “Big Data” (structured, semi & un-structured data) “Big Data” (structured, semi & un-structured data) Variety Governance – Quality - Certification
  10. 10. © Data by Design, LLC ⃝ www.dataXdesign.com© Data by Design, LLC ⃝ www.dataXdesign.com Data Pipeline 10 Data Acquisition Data Storage Data Analysis HDFS Commands Flume (logs) Scribe (RT stream logs) Sqoop (as needed) Many others HDFS (Hadoop) Hbase (Big Table) Dryad Others MapReduce (Hadoop) Pig (data analysis/pig latin/data flow) Hive (DW for Hadoop/HiveQL) Cascading (complex MR workflows) Shark (HiveQL on Spark) Spark (In-mem/cluster computing) Flume Few others Kafka (producer/consumer model) Kestrel (distributed message queue) Storm (RT computation) Trident (Operations on top of Storm) S4 (distributed stream computing) Spark Streaming (RT Spark) Emerging: Hybrid Computational Models SummingBird, Lambdoop, others. [eliminates MapReduce, all processing paradigms supported] VolumeVelocityBoth Results Value Governance – Quality – Certification
  11. 11. © Data by Design, LLC ⃝ www.dataXdesign.com© Data by Design, LLC ⃝ www.dataXdesign.com Would you drink this? 11 NO, but it likely could have been prevented (or cleaned up during acquisition or earlier)
  12. 12. © Data by Design, LLC ⃝ www.dataXdesign.com© Data by Design, LLC ⃝ www.dataXdesign.com Big Data Platform Adisturbing pattern has emerged in big data Universe of External and Internal Data – 100’s of sources, dozens of formats, no control of content All new data flows to the big data platform Unidentified Records are just ignored Zero data governance Nothing is fixing bad data in the data lakes (perhaps on query?) How do you identity what is good data versus bad data?
  13. 13. © Data by Design, LLC ⃝ www.dataXdesign.com© Data by Design, LLC ⃝ www.dataXdesign.com Data Quality as a Service Big Data Platform So, let’s add in data quality as a service! Universe of External Data – 100’s of sources, dozens of formats, no control of content All new data flows to the big data platform Unidentified/Unseen records flow to DQaaS DQaaS Fuzzy Matches, Users Map, Workflows Occur, Knowledge is built Cleansed data flows back to Big Data Platform
  14. 14. © Data by Design, LLC ⃝ www.dataXdesign.com© Data by Design, LLC ⃝ www.dataXdesign.com Abig data effort we finished recently (with DQaaS) DownstreamData Analytics/Surfacing Internal Enterprise Data - Master Data (e.g. Customers, Products, SKU’s) - Data Warehouse (Dimensional, Facts) BIG DATA Data Staging/Data Acquisition Data Quality as a Service (BUS) External Transactional Data (streams) External 3rd Party Data Internal Transactional Data (streams) Internal “Other” Data BIG DATA Data Delivery Platform Hives Sqoop Cloudera Cloudera Hives Sqoop RAW RAW RAW RAW eMSTR STG STG STG STG eSTG Delivery Delivery Delivery Delivery Delivery Masters Matching Cleansing Enriching
  15. 15. © Data by Design, LLC ⃝ www.dataXdesign.com© Data by Design, LLC ⃝ www.dataXdesign.com Ingest, master, and deliver (the new data pipeline) RAW STAGE eMaster DIFF Not Mastered Yet, or not seen before Already Mastered Data Delivery Mastered (“conformed”) Not Mastered Yet Data Quality as a Service (BUS) Masters Matching Cleansing Enriching Workflow Data Stewardship Via Subscription Views Via Staged Data Tables
  16. 16. © Data by Design, LLC ⃝ www.dataXdesign.com© Data by Design, LLC ⃝ www.dataXdesign.com Things we can do easily on big data platform Existing Data (Mastered Already + Mastering Results) New Data We saw this before (use the master results) We haven’t seen this yet (master it)Needs to be mastered DIFF
  17. 17. © Data by Design, LLC ⃝ www.dataXdesign.com© Data by Design, LLC ⃝ www.dataXdesign.com Filtering the Data Lake 17  Matching Strategies  Survivorship  Dedupe  Harmonization  Golden Records  Taxonomies  Cleansing  Standardization  Defaults Data Quality as a Service (BUS) Masters Matching Cleansing Enriching
  18. 18. © Data by Design, LLC ⃝ www.dataXdesign.com© Data by Design, LLC ⃝ www.dataXdesign.com SQL Server Master Data Services  Master Data Management Platform on SQL Server  Model & Rules – Managed Schema  Security and Access  Bulk data loads & consumption – table access  Hierarchy Management  Deployment, management, versioning  Application-level transaction management
  19. 19. © Data by Design, LLC ⃝ www.dataXdesign.com© Data by Design, LLC ⃝ www.dataXdesign.com Entities Model Extended Attributes User-defined Metadata Sub Entities Collections Derived Hierarchies Explicit Hierarchies Domain Based Attributes Attributes Attribute Groups Business Rules Name (mandatory) Code (mandatory) Free-form Attributes - text type - numeric type - date type - link type File Attributes - files - documents - images Master List domains (types) Like color, ISO Customer Segmentation, States, Provinces, Countries, so on. Members of the Model (physical data entries) For a specific business need (example “Customer Master”) - Version - Version Lock 1:N groups many 1:N may have Transactions Annotations Hierarchies Subscription Views
  20. 20. © Data by Design, LLC ⃝ www.dataXdesign.com MDS Excel Add-in MDS Web App MDS Web App MDS Web Service MDS Staging Tables IIS SQL Server SQL Server Master Data Services
  21. 21. © Data by Design, LLC ⃝ www.dataXdesign.com© Data by Design, LLC ⃝ www.dataXdesign.com  User Experience – Stewardship, Access, Manage  Workflow – Initiate, Approve, Contribute, Calculate  Golden Record Management – Matching, Survivorship  Data Quality – Verification, Address, Person, Email  Application Integration – MDM, CRM, Federation  MDM Programmability – Web Objects, Web Services Profisee Maestro (Empowering MDM)
  22. 22. © Data by Design, LLC ⃝ www.dataXdesign.com Custom Apps Workflow Forms Web Parts Maestro Desktop MDS Excel Add-in MDS Web App MS Dynamics Salesforce Maestro Maestro Maestro Web App MDS Web App MDS Web ServiceMaestro Web Service MDS Staging Tables Maestro SDK/API Connectors IIS SQL Server Batch Integration Real-time Integration MSMQ
  23. 23. © Data by Design, LLC ⃝ www.dataXdesign.com© Data by Design, LLC ⃝ www.dataXdesign.com Maestro What we can show you today! BIG DATA Data Staging/Data Acquisition Data Quality as a Service (BUS) External Transactional Data (streams) BIG DATA Data Delivery Platform Cloudera Ingest Enterprise Master Data Masters Matching Cleansing Enriching Raw Customer Data Customer Data that needs to be “mastered” ClouderaAny Any Mastered (“conformed”) Customer Data Enterprise Customer Data Maestro MDS Data Steward
  24. 24. © Data by Design, LLC ⃝ www.dataXdesign.com© Data by Design, LLC ⃝ www.dataXdesign.com Staging, matching, cleansing, publishing (MDS & Maestro) MDS Demo (5 mins) Maestro Demo (10 mins)
  25. 25. © Data by Design, LLC ⃝ www.dataXdesign.com© Data by Design, LLC ⃝ www.dataXdesign.com Great options, even better opportunities  Understand your processing and data requirements!  Strive for high quality data that is relevant to your most important business drivers/needs!  Work within a consistent framework that provides you the needed performance, access, compliance, and quality your company demands!  Plug in data quality (DQaaS) as early as you can in the big data food chain (starting at acquisition (ingest) time)
  26. 26. © Data by Design, LLC ⃝ www.dataXdesign.com© Data by Design, LLC ⃝ www.dataXdesign.com  Top data experts in the industry  USA and European offices  Acclaimed Authors  Presenters at major conferences info@dataXdesign.com Data Consulting Services
  27. 27. © Data by Design, LLC ⃝ www.dataXdesign.com© Data by Design, LLC ⃝ www.dataXdesign.com
  28. 28. © Data by Design, LLC ⃝ www.dataXdesign.com© Data by Design, LLC ⃝ www.dataXdesign.com No User Interface We know how to fix the data, but we don’t have a place to fix it. Cosmic Data Volumes More data than ever before – but is it consistent? Is the noise growing faster than the signal? Multiple Stakeholders Departments, Subsidiaries, Regulatory bodies, view the world in different ways. Data Quality is Poor Duplicates exist even within systems and data values are missing or just plain wrong. Externally Sourced Data They’re talking about my things on the web, but they aren’t speaking my language. Multiple Systems Even my own systems don’t use the same names and don’t have the same attributes. How do I trust that my analytics are driving correct decisions? TheCaseforMasterDataManagement
  29. 29. © Data by Design, LLC ⃝ www.dataXdesign.com Query without a data map SELECT customers who complained on Facebook more than twice FROM Giant Hot Mess of Data WHERE product is in this giant list and flag = current BUT NOT when starts with XRB0 AND ALSO include these other products from this acquisitions list when these four conditions match but never when the country of manufacture is Sweden ... Goes on for 16 pages...are you following this?... Query with a data map SELECT customers who complained on Facebook more than twice FROM Giant Hot Mess of Data JOIN Map on known keys WHERE product is a current reporting product
  30. 30. © Data by Design, LLC ⃝ www.dataXdesign.com Name Code Source Add1.. Customer # Master XYZ Corporation Master-6001 329 Main St South C5321 Master-6001 XYZ c 6001 EXT2 329 Main Street S Master-6001 XYZ Corporation 6005 ERP 1 3229 Main St Master-6001 Xyz Corp 6009 CRM 329 Main Street So C5321 Master-6001 Profisee Master-6003 2520 Northwinds C5400 Master-6003 Profisee 6003 CRM 2520 Northwinds C5400 Master-6003 Master Customer Golden Records Bind other Candidate Records New Candidate Records are Address Corrected for Sure Matching Candidate Records are added to their “Master or Golden” Record Group Golden Records may have attributes from the candidate records or new attributes altogether
  31. 31. © Data by Design, LLC ⃝ www.dataXdesign.com 31 Profisee Maestro: Reference Architecture Company Feed Industry Feed Ratings Feed Reference Data Flat Files XML EMR1 EMR2 Credentialing1 Credentialing2 Labs1 Labs2 Datawarehouse Flat Files XML ERP1 ERP2 CRM1 CRM2 SCM1 SCM2 DW BI GL HR

×