MapR Enterprise Data Hub Webinar w/ Mike Ferguson



Data volumes have grown explosively in recent years, and that data is being generated from sources that are increasingly complex and varied. Harnessing and refining value from this data requires a new approach, as data extraction, transformation, and loading (ETL) is becoming increasingly costly and difficult to scale.

Organizations are looking to leverage Hadoop as an enterprise data hub—also called a “data lake” or “data reservoir”—as a key component of their data architecture to augment their data warehouse, ETL and analytical systems in order to maximize their existing investments, reduce costs, and unlock new business value from their data.

In this webinar, you will learn:

Real-world examples that illustrate why Hadoop is the best low-cost data hub, data lake, or data landing zone (staging area) option for ETL processing
Proof points that demonstrate advantages of Hadoop and its ability to scale to manage increasing data volumes and support exploratory big data analytics
Proven best practices for a cost-effective, reliable way to implement a data management platform for your entire big data analytical ecosystem
Hidden issues to be aware of in deploying your data hub/data lake

Published in: Technology, Business

MapR Enterprise Data Hub Webinar w/ Mike Ferguson

  1. Best Practices for Using Hadoop as an Enterprise Data Hub. Mike Ferguson – Intelligent Business Strategies; Steve Wooledge – MapR. June 18, 2014. © 2014 MapR Technologies
  2. About Mike Ferguson: Mike Ferguson is Managing Director of Intelligent Business Strategies Limited. As an analyst and consultant he specialises in business intelligence, data management and enterprise business integration. With over 32 years of IT experience, Mike has consulted for dozens of companies, spoken at events all over the world and written numerous articles. Formerly he was a principal and co-founder of Codd and Date Europe Limited – the inventors of the Relational Model, a Chief Architect at Teradata on the Teradata DBMS and European Managing Director of DataBase Associates. Twitter: @mikeferguson1 Tel/Fax (+44)1625 520700
  3. The Hadoop Data Refinery and Enterprise Data Hub. Mike Ferguson, Managing Director, Intelligent Business Strategies. June 2014
  4. Topics
     - Data warehousing and the evolution of ETL processing
     - New data and new analytical workloads
     - Big data use cases driving business agendas
     - The unprecedented demand for customer insight
     - Challenges with new big data sources
     - Beyond the data warehouse – new platforms for new analytical workloads
     - The role of Hadoop in the modern analytical ecosystem
     - Introducing the Hadoop enterprise data hub and data refinery
     - Simplifying access to new big data insight using SQL on Hadoop
     - Integrating Hadoop into your analytical ecosystem
  5. 5. 5 For Many Years The Traditional Data Warehouse and BI Environment Has Been Used For Analysis & Reporting Operational systems web P o r t a l Employees Partners Customers BI Tools Platform Data Integration/DQ Reports & analytics Data warehouse & data marts DW
  6. 6. 6 The Evolution of Data Integration in Data Warehousing – From Hand Coded to ETL to ELT Hand coded ETL programs DW Hand coded programs ETL Servers DW ETL Servers ELT processing Generated SQL ELT processing DWEvolution of Data Warehousing MPP RDBMS systems
  7. 7. 7 Sales Product line n Product line 4 Product line 3 Product line 2 Product/ service line 1 Marketing Service Credit Verification HR Finance Planning Procurement SupplyChain Suppliers Front Office BackOffice Operations Customers New Data Sources Have Emerged Inside And Outside The Enterprise That Business Now Wants To Analyse E.g. RFID tag sensor networks weather data Data volume Data variety Number of sources Data volume Data velocity
  8. Popular Big Data Analytic Applications – Web Data
     - Clickstream analytics
       - Site navigation behaviour (session) analysis
         - Paths to buy, paths to abandonment, what else they looked at
         - Improve customer experience and conversion
         - Associate clicks with customers & prospects
     - Social network influencer analysis
       - Graph analytics for influencer behavioural impact analysis
       - ‘Target the influencer’ marketing campaign effectiveness
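The session analysis mentioned on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the 30-minute inactivity timeout, the `(visitor_id, timestamp, page)` tuple layout, the page names and the `sessionize` and `classify` helpers are all hypothetical and do not come from the webinar.

```python
from collections import defaultdict

SESSION_GAP_SECONDS = 30 * 60  # assumed inactivity timeout, not from the slides


def sessionize(clicks):
    """Group (visitor_id, timestamp_seconds, page) clicks into per-visitor sessions."""
    by_visitor = defaultdict(list)
    for visitor_id, ts, page in clicks:
        by_visitor[visitor_id].append((ts, page))

    sessions = []
    for visitor_id, events in by_visitor.items():
        events.sort()  # order each visitor's clicks by time
        current = [events[0][1]]
        last_ts = events[0][0]
        for ts, page in events[1:]:
            if ts - last_ts > SESSION_GAP_SECONDS:
                # A long gap closes the current session and starts a new one.
                sessions.append((visitor_id, current))
                current = []
            current.append(page)
            last_ts = ts
        sessions.append((visitor_id, current))
    return sessions


def classify(path, buy_page="/checkout/complete"):
    """Label a session path as a path to buy or a path to abandonment."""
    return "path to buy" if buy_page in path else "path to abandonment"


clicks = [
    ("v1", 0, "/home"),
    ("v1", 60, "/product/42"),
    ("v1", 120, "/checkout/complete"),
    ("v2", 0, "/home"),
    ("v2", 30, "/product/42"),
    ("v1", 10_000, "/home"),
]
sessions = sessionize(clicks)
for visitor_id, path in sessions:
    print(visitor_id, path, classify(path))
```

At web scale the same grouping would typically run as a distributed job over web log files rather than in one Python process.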
  9. Popular Big Data Analytic Applications – Sensor Data For Improving Process Efficiency and Optimisation
     - Sustainability analytics e.g. energy optimisation
     - Supply/distribution chain optimisation
     - Asset management and field service optimisation
     - Manufacturing production line optimisation
     - Location based advertising (mobile phones)
     - Grid health monitoring
       - Electricity, water, mobile phone cell network…
     - Smart metering (collect data every 15 minutes)
     - Fraud
     - Healthcare – ITC vital signs, fit bits,…
     - Traffic optimisation
     What are you prepared to instrument? E.g. RFID tag
  10. Popular Big Data Analytic Applications – Unstructured Data
      - Case management
      - Fault management and field service optimisation
      - “Voice of the customer”
      - Sentiment analytics
      - Competitor analysis
      - Media coverage analysis
      - Improve pharma drug trials
      Unstructured content is hard to analyse. How much is TEXT worth to your business?
  11. Big Data Analytics - Industry Use Case Examples
      Industry: Use Case Examples
      - Financial Services: Improved risk decisions, KYC customer insight, auto programmatic trading, 360 view of financial crime, pre-trade decision support, real-time trade & corp action tagging for compliance and RT P&L, grow security services outsourcing, Reference Data Exchange
      - Utilities: Smart meter data analysis, pricing elasticity analysis, customer loyalty, sustainability, asset management
      - Telecommunications: Customer churn, network optimization analysis from device, sensor and GPS inputs, monetization of GPS and data
      - Manufacturing: Sensor data for next generation ‘smart’ products, production line optimisation, improved customer service and improved field service, distribution chain optimization, asset management
      - Insurance: “How you drive” insurance (sensors to reduce risk), broker document analysis (risk assessment)
      - Government: Smart cities (e.g. transportation optimisation), anti-terrorism, law enforcement
      - Logistics: Distribution optimisation, route optimisation,
  12. More Data Is Required To Get A Deeper Understanding of Customers
      - We now need
        - Transaction data
        - Data from touch points you own
        - Data from the touch points you don’t own
        - Interaction data
          - Need to look at inbound interactions vs outbound interactions
          - Social interactions
        - Master data
        - Professional data e.g. profiles on LinkedIn
        - Internal and external event data
        - Competition data…
      - Then use analytics to understand and predict desire and propensity e.g. propensity to churn
  13. 13. 13 Top Priorities - Improving Customer Experience Via Time Series Analysis of All Customer Interactions OMNI channel – analyse all customer interactions across all channels identity data behavioural data social data Customer “DNA”
  14. Customer Experience Management - Understanding Customer On-Line Behaviour is Mission Critical to Retention and Growth
      - Important new data sources for analysis for customer ‘DNA’
        - Clickstream data from web logs
        - Sentiment and social network influencer data
      [Diagram labels: identity data, behavioural data, social data, Customer “DNA”; new competitors, more choice, voice of the customer, on the move, easy to find. On the web the customer is king.]
  15. 15. 15 Today Both Structured And Multi-Structured Data Are Needed For Deeper Insight Multi- structured data Click stream web log data Customer interaction data Social interaction data Sensor data Rich media data (video, audio) External content Documents Internal web content Seismic data (oil & gas) Structured data OLTP system data Data warehouse data Personal data stores e.g. Excel, Access Often un-modelled and may not be well understood Often a schema is defined and data is well understood Data characteristics are changing - Companies must deal with volume, variety and velocity
  16. Big Data Analytics Challenges Include The Analysis of Unstructured, Semi-structured and Structured Data. This makes analysis more complex, with new analytics and visualisations needed. [Diagram labels: JSON data, text data, image data] JSON example: {"firstName": "Wayne", "lastName": "Rooney", "age": 25, "address": {"streetAddress": "21 Sir Matt Busby Way", "city": "Manchester", "country": "England", "postalCode": "M1 6DY"}, "phoneNumbers": [{"type": "home", "number": "0161-123-1234"}, {"type": "mobile", "number": "07779-123234"}]}
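To make concrete why nested, semi-structured records like the JSON example on this slide complicate analysis, here is a small sketch that parses the record and flattens it into dotted column names that a tabular tool could consume. The `flatten` helper and the dotted naming scheme are illustrative assumptions, not something the webinar prescribes.

```python
import json

# The JSON record from the slide, with straight quotes so it parses.
record_text = """
{
  "firstName": "Wayne",
  "lastName": "Rooney",
  "age": 25,
  "address": {
    "streetAddress": "21 Sir Matt Busby Way",
    "city": "Manchester",
    "country": "England",
    "postalCode": "M1 6DY"
  },
  "phoneNumbers": [
    {"type": "home", "number": "0161-123-1234"},
    {"type": "mobile", "number": "07779-123234"}
  ]
}
"""


def flatten(value, prefix=""):
    """Flatten nested dicts and lists into a single dict keyed by dotted paths."""
    rows = {}
    if isinstance(value, dict):
        for key, child in value.items():
            rows.update(flatten(child, f"{prefix}{key}."))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            rows.update(flatten(child, f"{prefix}{index}."))
    else:
        rows[prefix.rstrip(".")] = value
    return rows


flat = flatten(json.loads(record_text))
for column, value in flat.items():
    print(column, "=", value)
```

The repeating `phoneNumbers` group has no fixed column count, which is one reason such data does not map cleanly onto a predefined relational schema.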
  17. 17. 17 Increased Data and Analytical Complexity Has Created A Need For A New Role – The Data Scientist Image source: Wikipedia Data Science is the process of investigative / exploratory analysis of multi-structured data to discover and produce new business insights Image source:
  18. 18. 18 People In Different Roles In The Analytical Landscape Need To Work Together To Deliver Business Value Exploratory analysis Predictive / statistical model producer Business Analyst Business Manager / Operations worker / Customer Data Scientist Model consumer Data visualisation Information Producer • Build reports • Build and publish dashboards Information consumer Decision maker Action taker Strategic Business Objective Priority KPI Current KPI Value What is +1% worth? KPI Target Executive Accountable Business Initiatives (projects) Budget Allocation Action Plan 1 $$$ Project Project Project £ x Million 2 3 4 Business Strategy – strategic objectives and targets including sustainability targets sandbox
  19. 19. 19 Data Science Produces New Insights For Business Analysts Who Produce Actionable BI For Front Office Decision Makers Business Analyst Marketing Manager / Marketing, Sales and Service workers Data Scientist Data Quality Forecasting Segmentation Models Customer Lifetime Value Social Network Strategy Creation Performance & Effectiveness Reporting Direct Mail Understand Customer Behavior & Navigation Marketing Performance & Reporting Campaign Planning Financial Planning Creative Materials Marketing Attribution Operations Management Channel Efficiency Sentiment & Influence Dynamic Content Re-marketing Web Call Center Live Event Broadcast Media Mobile/ SMS Social Email Industry Specific Big Data Analytics Traditional DW/BI Workflow & Approvals New insights Actionable BI
  20. Big Data Analytics Has Taken Us Beyond The Traditional DW – New Big Data Analytical Workloads
      1. Analysis of data in motion
      2. Complex analysis of structured data
      3. Exploratory analysis of un-modeled multi-structured data
      4. Graph analysis e.g. social networks
      5. Accelerating ETL and analytical processing of un-modeled data to enrich data in a data warehouse or analytical appliance
      6. The storage and re-processing of archived data
  21. 21. 21 The Changing Landscape – We Now Have Different Platforms Optimised For Different Analytical Workloads Big Data workloads result in multiple platforms now being needed for analytical processing Streaming data Hadoop data store Data Warehouse RDBMS NoSQL DBMS EDW DW & marts NoSQL DB e.g. graph DB Advanced Analytic (multi-structured data) mart DW Appliance Advanced Analytics (structured data) Analytical RDBMS Graph analysis Investigative analysis, Data refinery Traditional query, reporting & analysis Real-time stream processing & decision m’gmt Data mining, model development
  22. 22. 22 Hadoop Is A Key Platform In Big Data Analytics – Data Can Be Accessed Via Multiple APIs Java MapReduce APIs to HDFS, HBase, Cascading file file file file file file file file file file file file file file webHDFS (An HTTP interface to HDFS has REST APIs) HDFS file file file file YARN PIG latin scripts SQL Vendor SQL on Hadoop engine MapReduce Application index indexIndex partition SQL BI Tools & Applications Storm Application YARN Tez or SparkMapReduce HBase HDFS API
  23. 23. 23 Defacto Standard APIs Allow Hadoop Components To Be Replaced e.g. Faster, More Secure File System Than HDFS Java MapReduce APIs to HDFS, HBase, Cascading webHDFS (An HTTP interface to HDFS has REST APIs) file file file file file file file file file file file file file file file file file file Vendor Specific File System (e.g. ) YARN HDFS API PIG latin scripts index indexIndex partition Storm Application YARN MapReduce HBase MapReduce Application SQL Vendor SQL on Hadoop engine SQL BI Tools & Applications Tez or Spark
  24. Apache Hadoop Components
      Component: Description
      - Hadoop HDFS: A distributed file system that partitions files across multiple machines for high-throughput access to application data. The HDFS API allows vendors to replace HDFS with an alternative
      - Hadoop YARN: A framework for job scheduling and cluster resource management
      - Hadoop MapReduce: A programming framework for distributed batch processing of large data sets distributed across multiple servers
      - Avro: A serialization system that creates & reads files in a format containing both JSON data definitions & the data itself for dynamic interpretation of the data by applications
      - Hive: A data warehouse system for Hadoop that facilitates data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive provides a mechanism to project structure onto this data and query it using a SQL-like language called HiveQL. HiveQL programs are converted into MapReduce programs
      - HBase: An open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable
      - Pig: A high-level data-flow language for expressing Map/Reduce programs for processing and analysing large HDFS distributed data sets
      - Mahout: A scalable machine learning and data mining library
      - Oozie: A service for running and scheduling workflows of Hadoop jobs (including Map-Reduce, Pig, Hive, and Sqoop jobs)
      - Spark: A general purpose engine for large scale in-memory data processing. It supports analytical applications that wish to make use of stream processing, SQL access to columnar data and analytics on distributed in-memory data
      - Zookeeper: A high-performance coordination service for distributed applications
  25. The Role of Hadoop - Data Is Arriving Faster Than We Can Consume It – How Good Is Your Filter? [Diagram labels: DATA, FILTER, Enterprise, Enterprise systems]
  26. New Requirement – The Managed Hadoop Enterprise Data Hub [Diagram labels: XML, JSON, web logs, other data; load data into Hadoop; data reservoir (raw data); data refinery; ELT workflow: parse & prepare data in Hadoop (MapReduce), transform & cleanse data in Hadoop (MapReduce), discover data in Hadoop; sandboxes; new high value insights (pub/sub); EDW, graph DBMS, DW appliance; contains clean, high value data]
  27. What’s In An Enterprise Data Hub?
      - A managed data reservoir (raw data)
        - Organised capture of multi-structured data
        - Includes real-time data capture
        - May include operational reporting
      - A governed data refinery
        - Data integration and cleansing at scale
        - Analytical sandboxes to discover high value data
      - Published, protected and secure high value insights
      - Long-term storage of archived data from data warehouses
  28. Real-time Data Capture – E.g. MapR Allows Web Log Data To Be Directly Streamed/Stored in Hadoop. MapR Direct Access NFS allows web log files to be stored directly on the Hadoop file system so that click stream is captured in real time. Mount command shown on the slide: # mount localhost:/mapr /mapr [Diagram labels: web servers, web log files, Direct Access NFS, MapR Distribution for Hadoop, HDFS]
  29. High Volume Data Capture - Column Family Databases
      - Suitable for fast capture of large amounts of sparse, volatile data
        - Very fast capture and can hold vast amounts of data
        - Billions of rows containing thousands or millions of columns
      - Provide column-centric storage; wide de-normalised big tables can also help simplify operational reporting if used with SQL-on-Hadoop e.g. SQL access to HBase
      - Allow you to
        - Group together related columns into column families
        - Design column families to optimize the most common queries
        - Retrieve columnar data for multiple entities by iterating through a column family
        - Shard rows in a column family and distribute across many servers
        - Create indexes and secondary indexes
        - Support schema variance - columns in a column family can vary for every row
  30. NoSQL Column Family Databases - HBase. Row 1 → Column A = value, Column B = value, Column C = value; Row 2 → Column X = value, Column Y = value, Column Z = value. HBase storage architecture: an HMaster and several HRegionServers; regions (partitions) are created automatically as tables grow. HBase allows applications to directly read and write data
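Because the cluster is exposed as an ordinary NFS mount, a web server can in principle append log lines with plain file I/O. The sketch below illustrates that idea only: it writes to a temporary directory as a stand-in for a directory under the `/mapr` mount point, and the `weblogs` directory, the `access.log` file name and the `append_log_line` helper are hypothetical.

```python
import os
import tempfile


def append_log_line(log_dir, line):
    """Append one web log line to access.log under log_dir and return the file path."""
    os.makedirs(log_dir, exist_ok=True)
    path = os.path.join(log_dir, "access.log")
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(line + "\n")
    return path


# Stand-in for a directory on the NFS-mounted cluster file system (assumption):
# on a real node this would be a path somewhere under the /mapr mount point.
log_dir = os.path.join(tempfile.mkdtemp(), "weblogs")

append_log_line(log_dir, '203.0.113.7 - - "GET /home HTTP/1.1" 200')
log_path = append_log_line(log_dir, '203.0.113.7 - - "GET /product/42 HTTP/1.1" 200')
print("wrote", log_path)
```

The point of the design is that no Hadoop-specific client code appears in the writer; anything that can write a file can land data in the hub.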
  31. 31. 31 Column Families Can Be Stored In Different Files And Queries Will Only Retrieve The Column Family Needed Source: Data Access for Highly-Scalable Solutions : Using SQL, NoSQL, and Polyglot Persistence, McMurtry, Oakley, Sharp, Subramanian, Zhang Portfolio.* means all columns in the Portfolio column family Data about a customer and their stock purchases are partitioned vertically by column family Column family data can also be compressed
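The column family ideas on the last few slides can be modelled with nested dictionaries. This is a toy in-memory sketch, not the HBase API: the `ColumnFamilyTable` class, its method names and the sample customers are hypothetical, and only the `Portfolio` family name is taken from the slide.

```python
class ColumnFamilyTable:
    """Toy model: row key -> column family -> {column: value}."""

    def __init__(self, families):
        self.families = set(families)
        self.rows = {}

    def put(self, row_key, family, column, value):
        """Write one cell; columns inside a family may differ from row to row."""
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        row = self.rows.setdefault(row_key, {})
        row.setdefault(family, {})[column] = value

    def get_family(self, row_key, family):
        """Read only the requested family for one row (like selecting Portfolio.*)."""
        return dict(self.rows.get(row_key, {}).get(family, {}))

    def scan_family(self, family):
        """Iterate through one column family across all rows, in row-key order."""
        for row_key in sorted(self.rows):
            columns = self.rows[row_key].get(family)
            if columns:
                yield row_key, dict(columns)


table = ColumnFamilyTable(["Profile", "Portfolio"])
table.put("cust1", "Profile", "name", "A. Customer")
table.put("cust1", "Portfolio", "ACME", 100)
table.put("cust1", "Portfolio", "INIT", 25)
table.put("cust2", "Profile", "name", "B. Customer")
table.put("cust2", "Portfolio", "ACME", 40)
table.put("cust2", "Portfolio", "ZETA", 7)

portfolios = list(table.scan_family("Portfolio"))
print(portfolios)
```

Note that `cust1` and `cust2` hold different columns within `Portfolio`, which is the schema variance the slides describe, and that a scan of `Portfolio` never touches the `Profile` data.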
  32. 32. 32 Fast Data Capture – MapR-DB Is A High Speed Version of HBase Built Into The MapR Data Platform HBase API Source: MapR
  33. 33. 33 Enterprise Data Hub – We Need A Data Refinery To Process And Clean Complex Data Image source:
  34. 34. 34 Evolution of Big Data Integration Is Following The Same Cycle as it Did in Data Warehousing Hand coded ETL programs Hadoop Hand coded programs ETL Servers Hadoop ETL Servers ELT processing Generated MapReduce ELT processing HadoopEvolution of Big Data Integration
  35. Scaling ETL In A Data Refinery By Generating Pig, Hive or 3GL MapReduce Code for In-Hadoop ELT Processing
      [Diagram labels: data cleansing and integration tool; Extract, Parse, Clean, Transform, Analyse, Load; insights]
      - Option 1: ETL tool generates HQL or converts generated SQL to HQL
      - Option 2: ETL tool generates Pig Latin (compiler converts every transform to a MapReduce job)
      - Option 3: ETL tool generates 3GL MapReduce code
      Note: Generating native MapReduce code instead of HiveQL or Pig Latin would likely perform faster because there is no need to translate into MapReduce. Also, HiveQL is a subset of SQL, so check how ETL tools generating HiveQL do complex transformations; HiveQL on its own may not be enough (e.g. Hive UDFs?)
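As a rough picture of what a generated parse, clean and transform step boils down to, the sketch below simulates the map, shuffle and reduce phases in plain Python. It is not Hadoop code and not the output of any ETL tool: the three-field log layout, the rule that only status 200 lines are kept, and the function names are assumptions made for illustration.

```python
from collections import defaultdict


def map_parse_and_clean(line):
    """Map step: parse one 'customer page status' line, drop bad or failed records."""
    parts = line.strip().split()
    if len(parts) != 3:
        return  # malformed line: cleansing rule drops it
    customer, page, status = parts
    if status != "200":
        return  # keep successful page views only (assumed rule)
    yield customer.lower(), 1  # normalise the key, emit (key, value)


def reduce_count(key, values):
    """Reduce step: total the page views for one customer."""
    return key, sum(values)


def run_job(lines):
    """Simulate map -> shuffle (group by key) -> reduce in a single process."""
    shuffled = defaultdict(list)
    for line in lines:
        for key, value in map_parse_and_clean(line):
            shuffled[key].append(value)
    return dict(reduce_count(key, values) for key, values in sorted(shuffled.items()))


raw_lines = [
    "C1 /home 200",
    "c1 /buy 200",
    "bad line",
    "C2 /home 404",
]
page_views = run_job(raw_lines)
print(page_views)
```

In a HiveQL or Pig Latin pipeline the same logic would be written declaratively and compiled into jobs of this shape, which is the translation step the note on the slide refers to.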
  36. 36. 36 Need to Parse & Extract From Multi-Structured Data While Integrating Data In A Big Data Environment E-mail (semi-structured) Text (unstructured) ExtractParse TransformLoad …
  37. 37. 37 Sandboxes In The Data Refinery - Data Science Teams Need To Conduct Exploratory Analysis on Multi-Structured Data Click stream web log data Customer interaction data Social interaction data (e.g. Twitter, Facebook) Sensor data Rich media data (video, audio) External web content Documents Internal web content Seismic data (oil & gas) Investigative / Exploratory Analysis C R U D Asset Customer Product MDM System EDW mart new business insights sandbox Multi-structured data Historical Data archived DW datamaster data Data Scientists
  38. 38. 38 In-Hadoop Analytics In A Data Refinery – Example Technologies !  Hadoop MapReduce, Tez or Spark analytic applications with custom analytics •  Pig, Java, Python, Scala, Cascading….. !  Hadoop MapReduce, Tez or Spark analytic applications using pre-built Hadoop analytics e.g. Mahout, Spark MLlib •  Several analytical algorithms for use in analysis !  Revolution Analytics RevoScaleR !  SAS Analytics and In-Memory Statistics for Hadoop !  … many more Analytical tools Data management tools
  39. 39. 39 In-Hadoop Analytics: - Mahout Supports A Number Of Analytic Techniques !  Collaborative Filtering !  User and Item based recommenders !  K-Means and Fuzzy K-Means clustering !  Mean Shift clustering !  Dirichlet process clustering !  Latent Dirichlet Allocation !  Singular value decomposition !  Parallel Frequent Pattern mining !  Complementary Naive Bayes classifier !  Random forest decision tree based classifier Now runs on Spark as well as MapReduce
  40. 40. 40 Expediting The Data Refinery Process On Hadoop With Automated Analysis – From ETL to Analytical Workflows Parse & Prepare Data in Hadoop (MapReduce) Transform & Cleanse Data in Hadoop (MapReduce) Discover data in Hadoop ELT work -flow other data Raw data Load data into Hadoop Data Refinery EDW Graph DBMS DW appliance Automated Invocation of Custom Built & Pre-built Analytics on Hadoop contains clean, high value data New high value Insights (pub/sub)
  41. High Value Insights Produced In A Hadoop Data Hub Can Be Brought Into A DW to Enrich What We Already Know. E.g. deriving insight from huge volumes of social web content on sites like Twitter, Facebook, Digg, MySpace, TripAdvisor, LinkedIn… for sentiment analytics; hundreds of terabytes up to petabytes. [Diagram labels: cloud data, HDFS, extract, Map/Reduce data transformation and analytics applications, transform e.g. Pig, IBM JAQL, new insights, DI, DW, operational systems]
  42. 42. 42 Making New Insights Available To Business Analysts Via SQL Access To Big Data - Options SQL SQL access to big data in Hadoop SQL DW data virtualisation server SQL access to big data via data virtualisation SQL Analytical RDBMS SQL access to big data in an analytical RDBMS streaming data SQL SQL access to streaming data in motion
  43. 43. 43 Self-Service BI BI Tool(s) e.g, Visual Discovery tools Business Analyst or ‘budding’ Data Scientist personal & office data Predictive models community Publish / Share Consume / Enhance / Re-publish Transaction systems DW SQL Access to Hadoop Is Needed To Allow Hadoop Data To Be Accessed By Users With Self-Service BI Tools collaborate HDFS / Hbase/ Hive e.g. Hive interface
  44. 44. 44 SQL access to Big Data? Key Questions That May Influence If SQL Access to Big Data Is A Good Choice or What SQL Option to Take What kind of analysis? Text analysis, Graph analysis, Machine Learning, reporting What kind of data type(s) do you need to analyse? - structured, unstructured, semi- structured, What kind of data volumes do you want to analyse? Is the data at rest or is it real- time streaming data in motion? What analytical functions can you invoke on big data from SQL? Join with other data in another data store? How many concurrent users? Performance and scalability of complex queries and analytical functions (need parallelism) Is the requirement for interactive, exploratory, or real-time analysis? Data Analytical Workload
  45. SQL On Hadoop Initiatives
      Key questions:
      - What analytic functions are provided?
      - How can analytic functions be extended?
      - Can you join to data outside of Hadoop?
      - Are these SQL on Hadoop options suitable for reporting and analysis, interactive discovery, exploratory analysis or all of these?
      Vendor: SQL on Hadoop Initiative
      - AMPlab (UC Berkeley): Shark (Forked Hive at V0.9) or SparkSQL
      - Apache Hadoop: Hive
      - Actian: Vortex (Actian Vector on Hadoop data nodes)
      - CitusDB: CitusDB (uses external tables)
      - Cloudera: Impala / Parquet
      - Concurrent: Lingual (SQL on Cascading)
      - Hadapt: Schemaless SQL
      - Hortonworks: Stinger / ORC (Hive 13)
      - HP: Vertica on Hadoop
      - IBM: BigSQL (SQL on HDFS & HBase)
      - InfiniDB: InfiniDB on Apache Hadoop
      - Jethro Data: JethroData
      - MapR: Apache Drill
      - Microsoft: Hive 13
      - Pivotal: HawQ (uses external tables via PXF)
      - Teradata: SQL-H
      - Splice Machine: Splice Machine (SQL Engine on HBase)
      - Phoenix (SQL engine on HBase)
      - Attivio: Active Intelligence Engine (SQL access to search indexes on Hadoop data)
  46. SQL on Hadoop – Apache Drill Can Access HDFS And HBase Data. Apache Drill does not use MapReduce. [Diagram labels: business analyst or data scientist; BI tool(s) e.g. visual discovery tools; data scientist; analytic application; SQL; Drill; HDFS; HBase; MongoDB/Cassandra; MapR Distribution for Hadoop; sensors, XML, JSON; data entering HBase]
  47. 47. 47 Apache Drill Distributed Query Processing – A Storage Independent Drillbit MPP Architecture Each drillbit is capable of receiving queries from applications and BI tools - there is no master in this architecture Multiple drillbits are involved in parallel query processing on distributed data Supports Apache HDFS, Apache HBase, MapR-FS, MapR-DB, Amazon S3
  48. 48. 48 SQL on Hadoop Example – Apache Drill Supports Query of Self-Describing Data Without a Schema JSON Source: MapR
  49. SQL on Hadoop – What Should The Schema Look Like? Star schema? Snowflake schema? De-normalised schema? Other?
  50. Hadoop Storage Is Independent of Any SQL Engine Accessing HDFS - Multiple SQL Engines Can Coexist On The Same Data
      [Diagram labels: HDFS files; YARN; Batch (MapReduce), Interactive (Tez), On-line (HBase), Streaming (Storm,..), Graph (Giraph), In-memory (Spark), HPC MPI (OpenMPI), Other (Search,.); several SQL engines]
      Storage is independent of any SQL engine. Key points about Hadoop:
      - It is possible to have MULTIPLE SQL engines on the same data
      - Different SQL engines run on different Hadoop frameworks (M/R, Tez, Spark) or on no framework at all i.e. directly access HDFS or HBase data
  51. 51. 51 Relational DBMS / Hadoop Integration – Several Vendors Have Integrated RDBMS with Hadoop to Run Analytics Relational DBMS External Polymorphic table function(s) HDFS / Hbase/ Hive SQL, XQuery RDBMS optimizer handles transparent access to external analytical platforms on behalf of the user RDBMS and Hadoop could be deployed on the same hardware cluster (preferred) or on different hardware clusters Allows join across data in a single RDBMS and Hadoop
  52. 52. 52 Relational DBMS / Hadoop Integration Example - HP Vertica and MapR Source: MapR
  53. 53. 53 Self-Service BI Self-service Data Discovery & Visualisation or Dashboard Server Business analyst Data Virtualization and Optimization personal & office data Predictive models Transaction systems Data Management Tools (ETL, DQ, etc.) DW Self-Service Access To Big Data Via Data Virtualization BUT what about optimization? Can the data virtualisation server push down analytics to underlying platforms to make them do the work?
  54. New Insights Can Be Added Into A Data Warehouse To Enrich What You Already Know. E.g. deriving insight from social web sites for sentiment analytics. [Diagram labels: social, web logs, web, cloud; ELT; sandbox; data scientists; new insights; operational systems; DI; DW]
  55. Alternatively New Insights In Hadoop Can Be Integrated With A DW Using Data Virtualization To Provide Enriched Information. E.g. deriving insight from social web sites for sentiment analytics. [Diagram labels: social, web logs, web, cloud; sandbox; data scientists; new insights; OLTP systems; DI; DW; data virtualisation; SQL on Hadoop]
  56. Using Hadoop As A Data Archive Means Data Can Be Kept On-line, Analysed And Still Integrated With Data In The DW [Diagram labels: OLTP systems; DI; DW; data virtualisation; SQL on Hadoop; archived data; archive unused or data > n years]
  57. Big Data Governance – Data Sources, Sandboxes, People, Data Access Security, Results Lineage… [Diagram labels: SQL on Hadoop, graph DBMS, MPP analytical RDBMS, DW, social graph data, unstructured / semi-structured content, RDBMS, files, clickstream, web logs; “governance” is repeated across every part of the diagram]
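The archiving rule on this slide (move data that is unused, or older than n years, out of the DW) can be expressed as a simple predicate. This is a sketch under assumptions: the three-year age limit, the 365-day definition of "unused", the partition-level granularity and the `should_archive` name are all hypothetical choices, not values given in the webinar.

```python
from datetime import date


def should_archive(partition_date, last_queried, today, max_age_years=3, unused_days=365):
    """Return True if a DW partition is older than n years or has gone unused."""
    too_old = (today - partition_date).days > max_age_years * 365
    unused = last_queried is None or (today - last_queried).days > unused_days
    return too_old or unused


today = date(2014, 6, 18)
decisions = {
    "old but recently queried": should_archive(date(2010, 1, 1), date(2014, 6, 1), today),
    "recent and in use": should_archive(date(2014, 1, 1), date(2014, 6, 1), today),
    "recent but never queried": should_archive(date(2014, 1, 1), None, today),
}
for label, archive in decisions.items():
    print(label, "->", "archive to Hadoop" if archive else "keep in DW")
```

Partitions selected this way would stay queryable through SQL on Hadoop, which is what lets archived data remain on-line.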
  58. 58. 58 Issues: Siloed Analytics - Different Tools to Manage and Integrate Data For Each Type of Analytical Data Store Analytical tools Data management tools EDW mart Structured data CRM ERP SCM Silo DW & marts Streaming data (markets, sensors Analytical models Silo Analytical tools/apps Data management tools Multi-structured data Silo DW Appliance Advanced Analytics (structured data) Data management tools Structured data CRM ERP SCM Analytical tools Silo Analytical tools/apps Data management tools NoSQL DB e.g. graph DB Silo Multi-structured & structured data
  59. 59. 59 EDW MDM SystemDW & marts NoSQL DB e.g. graph DB Advanced Analytic (multi-structured data) mart DW Appliance Advanced Analytics (structured data) Need to Manage The Supply of Consistent Data Across The Entire Analytical Ecosystem Common Enterprise Information Management Tool Suite Stream processing C R U D Prod Asset Cust actions feedssensors XML,% JSON% RDBMS Files office docssocial Cloud clickstream% Web logs web services New New New New New New New New NewNew New New C R U D Prod Asset Cust New data types need to be supported by EIM tool suites
  60. 60. 60 BI tools platform & data visualisation tools Search based BI tools Custom MapReduce applications Map Reduce BI tools Graph Analytics tools A New Architecture for Analytics - The Intelligent Business Strategies Extended Analytical Ecosystem Enterprise Information Management Tool Suite feedssensors XML,% JSON% RDBMS Files office docssocial Cloud clickstream% Web logs web services Event processing C R U D Prod Asset Cust EDW MDM SystemDW & marts NoSQL DB e.g. graph DB Advanced Analytics (multi-structured data) mart DW Appliance Advanced Analytics (structured data) actions Filtered data Data Virtualisation and optimization
  61. Conclusions
      - Business demand for new, more complex, high volume data is driving the need for new analytical workloads beyond the data warehouse
      - Hadoop is a low cost analytical platform capable of supporting new analytical workloads on multi-structured data
      - A key role for Hadoop is as a data hub and data refinery
      - The data refinery process requires data integration and cleansing to scale to handle the volume, variety and velocity of complex multi-structured data
      - Data scientists analyse big data as part of the data refining process to produce new insights that can be added to what you already know
      - Hadoop is part of an extended analytical ecosystem with data management tools supplying consistent data across all data stores
      - Data scientists, business analysts and information consumers need to work together to deliver new insight for competitive advantage
  62. Best Practices for Production Success
  64. MapR: Best Product for Customer Success. Top ranked; exponential growth; 500+ customers; 3X bookings Q1 ‘13 – Q1 ‘14; 80% of accounts expand 3X; 90% software licenses; <1% lifetime churn; >$1B in incremental revenue generated by 1 customer
  65. Architecture Matters for Success [Diagram label: FOUNDATION]
  66. Architecture Matters for Success [Diagram labels: FOUNDATION: high availability & data protection, high performance, multi-tenancy, operational & analytical workloads, open standards for integration; NEW APPLICATIONS, SLAs, TRUSTED INFORMATION, LOWER TCO]
  67. The Power of the Open Source Community [Diagram labels: Management; MapR Data Platform; APACHE HADOOP AND OSS ECOSYSTEM; Security; YARN; EXECUTION ENGINES: MapReduce v1 & v2, Tez*, Spark; Batch: Pig, Cascading, Spark; Streaming: Spark Streaming, Storm*; NoSQL & Search: HBase, Solr, Accumulo*; ML, Graph: Mahout, MLLib, GraphX; SQL: Hive, Impala, Shark, Drill*; Provisioning & coordination: Juju, Savannah*, Whirr, ZooKeeper; DATA GOVERNANCE AND OPERATIONS; Workflow & Data Governance: Oozie, Falcon*; Data Integration & Access: Sqoop, Flume, HttpFS, Hue; Sentry*, Knox*] * Certification/support planned for 2014
  68. MapR Distribution for Hadoop [Diagram: the same MapR Data Platform and Apache Hadoop/OSS ecosystem component stack as the previous slide] * Certification/support planned for 2014
      - Enterprise-grade: High availability; Data protection; Disaster recovery
      - Interoperability: Standard file access; Standard database access; Pluggable services; Broad developer support
      - Security: Enterprise security authorization; Wire-level authentication; Data governance
      - Operational: Ability to support predictive analytics, real-time database operations, and high arrival rate data
      - Multi-tenancy: Ability to logically divide a cluster to support different use cases, job types, user groups, and administrators
      - Performance: 2X to 7X higher performance; Consistent, low latency
  69. Hadoop + Data Warehouse Architecture: Improve data services to customers without increasing enterprise architecture costs (FORTUNE 500 TELCO)
      OBJECTIVES
      - Provide cloud, security, managed services, data center, & comms
      - Report on customer usage, profiles, billing, and sales metrics
      - Improve service: Measure service quality and repair metrics
      - Reduce customer churn – identify and address IP network hotspots
      CHALLENGES
      - Cost of ETL & DW storage for growing IP and clickstream data; >3 months
      - Reliability & cost of Hadoop alternatives limited ETL & storage offload
      SOLUTION
      - MapR for data staging, ETL, and storage at 1/10th the cost
      - MapR provided smallest datacenter footprint with best DR solution
      - Enterprise-grade: NFS file management, consistent snapshots & mirroring
      - Data warehouse for mission-critical reporting and analysis
      Business Impact: Hadoop + Data Warehouse = New, Deeper Insights for the Business
      - Increased scale to handle network IP and clickstream data
      - Freed up processing on DW to maintain reporting SLAs to business
      - Unlocked new insights into network usage and customer preferences
  70. Q&A – Engage with us!
      - @mikeferguson1 – Intelligent Business Strategies
      - @swooledge – MapR Technologies
      - Learn more about Hadoop in your architecture:
      - Upcoming webinar series
        - 6/26 Talend – ETL in/for Hadoop
        - 7/09 Syncsort – comScore & mainframe optimization
        - 7/17 Rick van der Lans – SQL-on-Hadoop
        - 7/23 Skytree – machine learning & analytics
        - 7/30 Appfluent – DW usage monitoring & optimization
        - 8/14 Tableau – data exploration & analysis on Hadoop
      - Contact / follow us