Modern Data Architecture for a Data Lake with Informatica and Hortonworks Data Platform

How do you turn data from many different sources into actionable insights and manufacture those insights into innovative information-based products and services?

Industry leaders are accomplishing this by adding Hadoop as a critical component of their modern data architecture to build a data lake. A data lake collects and stores data from a wide variety of channels, including social media, clickstream data, server logs, customer transactions and interactions, videos, and sensor data from equipment in the field. It scales cost-effectively to collect and retain massive amounts of data over time and to convert that data into actionable information that can transform your business.

Join Hortonworks and Informatica as we discuss:

- What is a data lake?
- The modern data architecture for a data lake
- How Hadoop fits into the modern data architecture
- Innovative use-cases for a data lake

Published in: Technology

  1. © Hortonworks Inc. 2013. Modern Data Architecture … and the Data Lake. John Haddad, Senior Director Product Marketing, Informatica; Jim Walker, Director Product Marketing, Hortonworks.
  2. Your Presenters. John Haddad: Senior Director Product Marketing, Informatica; over 25 years’ experience developing and marketing enterprise applications; enjoys art, science, and the great outdoors. Jim Walker: Director Product Marketing, Hortonworks; over 20 years in data management as a developer and a marketer; amateur photographer.
  3. Today’s Topics: Introduction; Drivers for the Modern Data Architecture (MDA); Apache Hadoop in the MDA; Informatica’s role in the MDA; Q&A.
  4. Enterprise Data Architecture. [Diagram labels: Applications (Business Analytics, Custom Applications, Packaged Applications); Data Systems (Repositories: RDBMS, EDW, MPP); Data Sources (OLTP, POS systems; Traditional Sources: RDBMS, OLTP, OLAP)]
  5. Traditional Approach: Under Pressure. [Diagram labels as on slide 4, plus New Sources (sentiment, clickstream, geo, sensor, …)] Figures from IDC: 2.8 ZB in 2012; 85% from new data types; 15x machine data by 2020; 40 ZB by 2020.
  6. Modern Data Architecture Enabled. [Diagram labels as on slide 5, without the IDC figures, plus Operational Tools (Manage & Monitor) and Dev & Data Tools (Build & Test)]
  7. Hadoop Powers Modern Data Architecture. Apache Hadoop is an open source project governed by the Apache Software Foundation (ASF) that allows you to gain insight from massive amounts of structured and unstructured data quickly and without significant investment. Hadoop clusters provide scale-out storage and distributed data processing on commodity hardware. [Diagram: a Hadoop cluster of compute & storage nodes]
  8. Drivers for Hadoop Adoption. Driving efficiency (Modern Data Architecture): Hadoop has a central role in next-generation data architectures while integrating with existing data systems. Driving opportunity (Business Applications): use Hadoop to extract insights that enable new customer value and competitive edge. Big data sets, from existing to emerging: traditional, server log, clickstream, sentiment/social, machine/sensor, geo-locations.
  9. Opportunity in types of data. (1) Sentiment: understand how your customers feel about your brand and products, right now. (2) Clickstream: capture and analyze website visitors’ data trails and optimize your website. (3) Sensor/Machine: discover patterns in data streaming automatically from remote sensors and machines. (4) Geographic: analyze location-based data to manage operations where they occur. (5) Server Logs: research logs to diagnose process failures and prevent security breaches. (6) Unstructured (text, video, pictures, etc.): understand patterns in files across millions of web pages, emails, and documents. [Slide graphic label: Value]
  10. Efficiency in Modern Data Architecture. Drive efficiency via modern data architecture; store data once and access it in many ways; often referred to as a data lake or data repository; infrastructure platform driven; IT-oriented, TCO based. [Diagram labels as on slide 5, without the IDC figures]
  11. Engineered for Interoperability. [Diagram labels: Applications; Data Systems; Traditional Repos; Dev & Data Tools; Operational Tools; Viewpoint; Microsoft Applications; Data Sources; Data Integration; Traditional Sources (RDBMS, OLTP, OLAP); New Sources (sentiment, clickstream, geo, sensor, …)]
  12. Requirements for Hadoop Adoption: 3 requirements for Hadoop’s role in the modern data architecture. Integrated: interoperable with existing data center investments. Key Services: platform, operational and data services essential for the enterprise. Skills: leverage your existing skills in development, operations, and analytics.
  13. Today’s Topics: Introduction; Drivers for the Modern Data Architecture (MDA); Apache Hadoop’s role in the MDA; Informatica’s role in the MDA; Q&A.
  14. Hortonworks & Informatica Reference Architecture. [Diagram labels: Visual Development Environment; Enterprise Repositories (EDW, MDM); Load (Data Virtualization, Batch, CEP); Interface (Hive, JDBC, HDFS API); Ambari, MapReduce, YARN, HDFS; Data Refinement on Hive (HiveQL and UDFs): Profile, Parse, ETL, Cleanse, Match; Load via HDFS API (Batch, Replicate, Stream, Archive); Source Data (JMS queues, servers & mainframe, files, databases, sensor data, social)]
  15. Data Lake Processes. Sources: mobile apps; transactions, OLTP, OLAP; social media, web logs; machine device, scientific; documents and emails. Steps: (1) Load or archive batch data. (2) Replicate changed data & schemas. (3) Stream real-time data. (4) Mask sensitive data. (5) Access customer “golden record”. (6) Refine & curate data. (7) Move results to EDW. (8) Explore & validate data. (9) Govern & enrich with metadata. (10) Correlate real-time events with historical patterns & trends. (11) Subscribe to datasets. Other diagram labels: Visualization & Analytics; Data Integration Hub; Data Virtualization; MDM; Enterprise Data Warehouse.
  16. Telco Call Detail Record (CDR) Use-Case
  17. Use-Case: CDR Processing. Each job picks up a number of files containing text CDRs (Call Detail Records). The first task is to merge partial call records: some records may be partial, e.g. multiple records for a single call due to a dropped line or switching cell towers; partial records need to be merged, and the total call time needs to be added to the duration of the merged record; partial records for a single call may reside in multiple files or be included in different jobs; incomplete partial records need to be reprocessed by consecutive jobs. The second task is to sort all processed CDRs by calling number.
  18. Input CDR File Example. [Screenshot callouts: these 3 numbers uniquely identify a call; partial calls start with 1 and end with 0; some partial records are incomplete; processed completed records are sorted by caller]
  19. Output CDR Files. Completed calls: partial records are merged into a single completed record, and duration times are added to the merged records. Partial calls: partial records will be reprocessed.
  20. Logical Design. [Diagram callouts: separate partial records from completed records; partial records only; completed records only; separate incomplete and complete partial records; select incomplete partial records; aggregate all completed and partial-completed records]
  21. Viewing Data at Design Time
  22. More Data at Design Time
  23. Constructing Logical Expressions
  24. More Logic
  25. Check Outcomes
  26. Choose Where to Process
  27. Hadoop Execution Plan
  28. Monitor Processing
  29. Results in HDFS
  30. CDR Pipeline. Scenario: filter records by date, city and province; aggregate and summarize records by a composite key. [Diagram steps: read files; city code lookup; filter by collection date; filter by province ID; group; summarize by key; sort records by key; write report]
  31. Design Environment
  32. Adding Transactional Source. Scenario: report website use (Facebook, Twitter, etc.) by age and by postal code. [Diagram steps: read WAP log records; get MSISDN and URL fields; look up age and postal code by MSISDN; count URL frequency; calculate percentages]
  33. Connecting to Relational Source
  34. Result. Easily combine big data sources with transactional data. Example: report website use (Facebook, Twitter, etc.) by age and by region. [Diagram labels: look-up of age and region by MSISDN; CRM; EDW; log files, HDFS]
  35. Requirements for Hadoop Adoption: 3 requirements for Hadoop’s role in the modern data architecture. Integrated: interoperable with existing data center investments. Key Services: platform, operational and data services essential for the enterprise. Skills: leverage your existing skills in development, operations, and analytics.
  36. Next Steps. Learn more about Informatica and Hadoop: http://www.informatica.com/us/vision/harnessing-big-data/hadoop/ Get started on Hadoop with Hortonworks Sandbox: http://hortonworks.com/products/hortonworks-sandbox/ Follow us: @hortonworks, @informatica
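
The CDR merge described on slides 17 to 19 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Informatica mapping shown in the deck: the record layout (`caller`, `callee`, `call_id`, `more`, `duration`) is hypothetical, and the `more` flag is one reading of the slide's note that partial calls start with 1 and end with 0.

```python
from collections import defaultdict
from typing import NamedTuple


class Cdr(NamedTuple):
    """Hypothetical CDR layout; the deck does not show the real field list."""
    caller: str    # calling number
    callee: str    # called number
    call_id: str   # third field of the call key (assumed)
    more: int      # assumed flag: 1 = more partial records follow, 0 = final record
    duration: int  # call time in seconds


def merge_cdrs(records):
    """Merge partial records per call; return (completed, pending).

    A call counts as complete once a record with more == 0 has been seen.
    Records of calls that are still incomplete are returned unchanged in
    `pending` so that a later job can reprocess them.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[(rec.caller, rec.callee, rec.call_id)].append(rec)

    completed, pending = [], []
    for key, parts in groups.items():
        if any(p.more == 0 for p in parts):
            # Add the durations of all partial records to the merged record.
            total = sum(p.duration for p in parts)
            completed.append(Cdr(*key, 0, total))
        else:
            pending.extend(parts)

    # Second task from the slide: sort processed CDRs by calling number.
    completed.sort(key=lambda r: r.caller)
    return completed, pending
```

In a follow-up job, the `pending` records would be fed back in together with the newly arrived files, which mirrors the slide's point that partial records for one call can span files and jobs.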

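The CDR pipeline scenario on slide 30 (filter by date, city and province, then summarize by a composite key) could look roughly like the following in plain Python. The field names (`city_code`, `collection_date`, `duration`, `province_id`, `name`) and the choice of `(collection_date, city name)` as the composite key are assumptions made for this sketch; the deck does not specify them.

```python
from collections import defaultdict


def summarize_cdrs(records, city_lookup, collection_date, province_id):
    """Filter CDRs by date and province, then total duration per composite key.

    `records` is an iterable of dicts and `city_lookup` maps a city code to a
    dict with `name` and `province_id` (all hypothetical layouts).
    """
    totals = defaultdict(int)
    for rec in records:
        city = city_lookup.get(rec["city_code"])  # city code lookup
        if city is None:
            continue  # unknown city code: skip the record
        if rec["collection_date"] != collection_date:
            continue  # filter by collection date
        if city["province_id"] != province_id:
            continue  # filter by province ID
        # Group and summarize by the composite key.
        totals[(rec["collection_date"], city["name"])] += rec["duration"]
    # Sort by key so the report is written in a stable order.
    return sorted(totals.items())
```

The function returns a list of `(key, total_duration)` pairs, which stands in for the "write report" step at the end of the slide's pipeline.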
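
The steps on slide 32 (read WAP log records, look up age and postal code by MSISDN, count URL frequency, calculate percentages) amount to a join followed by a grouped count. A minimal sketch, assuming log records arrive as `(msisdn, url)` pairs and the subscriber data is a dict keyed by MSISDN; both layouts are hypothetical, and counting by host name is a simplification chosen here:

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse


def site_usage(log_records, subscribers):
    """Return {(age, postal_code): {site: percent_of_hits}}.

    `log_records` yields (msisdn, url) pairs; `subscribers` maps an MSISDN to
    a dict with `age` and `postal_code` (assumed layouts).
    """
    counts = defaultdict(Counter)
    for msisdn, url in log_records:
        sub = subscribers.get(msisdn)  # lookup by MSISDN
        if sub is None:
            continue  # no matching subscriber record: skip
        site = urlparse(url).netloc  # count by host, e.g. "facebook.com"
        counts[(sub["age"], sub["postal_code"])][site] += 1

    report = {}
    for group, sites in counts.items():
        total = sum(sites.values())
        report[group] = {site: 100.0 * n / total for site, n in sites.items()}
    return report
```

In the deck this lookup runs against a relational source (CRM or EDW) while the logs sit in HDFS; the dict used here only stands in for that relational table.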