
Getting It Right Exactly Once: Principles for Streaming Architectures


Darryl Smith, Chief Data Platform Architect and Distinguished Engineer, Dell Technologies. September 2016, Strata+Hadoop World, NY

Published in: Data & Analytics


  1. Getting It Right Exactly Once: Principles for Streaming Architectures. Darryl Smith, Chief Data Platform Architect and Distinguished Engineer, Dell Technologies. September 2016 | Strata+Hadoop World, NY
  2. Getting Started
     • I’m Darryl Smith: Chief Data Platform Architect and Distinguished Engineer, Dell Technologies
     • Agenda
       - Real-Time and the Need for Streaming
       - Adding Real-Time and Streaming to the Data Lake
       - Results, Plans, Lessons Learned
       - Demonstration
  3. Trickle, Flood, or Torrent… Streaming is about continuous data motion, more than speed or volume.
  4. The Conversation Around Streaming: Website and Mobile Application Logs; Internet of Things Sensors
  5. The Enterprise Reality. Batch > Real-Time > Streaming. Enterprise Opportunities. Immediate Business Advantage. Website and Mobile Application Logs. Internet of Things Sensors.
  6. The Enterprise Streaming Play: Moving from batch to real-time streams avoids surges, normalizes compute, and drives value.
  7. Real Time and the Need for Streaming
  8. Analytics Vision: Drive DellEMC towards a Predictive Enterprise via intelligent data, driving agility and increasing revenue and productivity, resulting in a competitive advantage.
  9. Becoming an Analytical Enterprise
     • Need to use new data for competitive advantage: Volume, Variety and Velocity
     • Leverage near-real-time and streaming data sets to optimize predictions: make faster, better decisions
     • Cost-effectively scale to improve query and load performance
     • Put the data in the hands of the business
     DRIVE COMPETITIVE ADVANTAGE | COST-EFFECTIVELY SCALE | DATA ACCESS BY BUSINESS | NEAR REAL-TIME ANALYTICS
  10. Scoping the Business Objectives
      • Problem Statement: Teams do not have access to maintenance renewal quotes in the timeframes or at the degree of quality they need for Tech Refresh and Renewal sales.
      • Desired Outcome: Implement a cost-effective, real-time solution that improves productivity and gives confidence to produce desired outcomes efficiently.
  11. Business Drivers. To realize this vision: implement CALM solution phases and optimize business processes.
      Current reality -> Vision for the future
      • High touch tactical execution -> Low touch self service
      • Date driven processes -> Business value driven processes
      • Inefficiencies & lost productivity -> Increased productivity
      • Siloed data / limited views -> Single view of data / data scoring
      • Variable data quality -> Data quality & confidence
  12. The Need for “CALM”: Customer Asset Lifecycle Management. For enterprise sales, who need accurate and timely customer information, CALM is a real-time application providing up-to-the-moment customer 360° dashboards. Diagram labels: Install Base; Pricing; Device Config; Contacts; Contracts; Analytics; Component Data; Offers; Scorecard
  13. Data Lake Architecture. Diagram labels: DATA PLATFORM; VMWARE VCLOUD SUITE; EXECUTION; PROCESS; GREENPLUM DB; SPRING XD; PIVOTAL HD; Gemfire; HADOOP; INGESTION; DATA GOVERNANCE; Cassandra; PostgreSQL; MemSQL; HDFS ON ISILON; HADOOP ON SCALEIO; VCE VBLOCK/VxRACK | XTREMIO | DATA DOMAIN; ANALYTICS TOOLBOX; Network; Web; Sensor; Supplier; Social Media; Market; STRUCTURED; UNSTRUCTURED; CRM; PLM; ERP; APPLICATIONS; Apache Ranger; Attivio; Collibra; Real-Time; Micro-Batch; Batch
  14. Business Data Lake Offerings
      • Data Ingestion: small to big data (high-throughput); structured and unstructured data from any source; streams and batches; secure, multi-tenant, configurable framework
      • Real-Time Analytics: tap into streams for in-memory analytics; real-time data insights and decisions
      • Services: data ingestion to the data lake; data lake APIs; data alerting
      Diagram labels: Unstructured; Structured
  15. Adding Real Time and Streaming to the Data Lake
  16. Seeking a Fast Database: a complement to the business data lake. O P C M
  17. HammerDB Platform Benchmarks. HammerDB workload testing followed the standard practices of EMC’s Oracle and SQL Server DBA teams.
      • Workload definition: a mix of 5 transactions:
        - New order: receive a new order from a customer: 45%
        - Payment: update the customer balance to record a payment: 43%
        - Delivery: deliver orders asynchronously: 4%
        - Order status: retrieve the status of the customer’s most recent order: 4%
        - Stock level: return the status of the warehouse’s inventory: 4%
      • Testing scenario:
        - 100 warehouses, 8 vUsers. Database creation and initial data loading.
        - Timed testing: 20 minutes per testing session.
        - Number of virtual users scaled from 1 to 44 across testing sessions.
      • No changes were made to system or database configuration while the tests ran.
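The transaction mix on this slide can be sketched as a weighted sampler. This is only an illustration of the stated percentages, not HammerDB's own driver; the names `TRANSACTION_MIX` and `sample_transactions` are hypothetical.

```python
import random
from collections import Counter

# Transaction mix from the slide, in percent (the five weights sum to 100).
TRANSACTION_MIX = {
    "new_order": 45,
    "payment": 43,
    "delivery": 4,
    "order_status": 4,
    "stock_level": 4,
}


def sample_transactions(n, seed=0):
    """Draw n transaction types with probabilities proportional to TRANSACTION_MIX."""
    rng = random.Random(seed)  # seeded so a run is repeatable
    names = list(TRANSACTION_MIX)
    weights = list(TRANSACTION_MIX.values())
    return rng.choices(names, weights=weights, k=n)


if __name__ == "__main__":
    draws = sample_transactions(10_000)
    # Print the observed share of each transaction type.
    for name, count in Counter(draws).most_common():
        print(f"{name:13s} {count / len(draws):.1%}")
```

With a large enough sample, the observed shares should approach the percentages listed on the slide.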
  18. HammerDB Workload Testing. Each test ran on 16 vCPU x 32 GB RAM:
      • RedHat 6.4 with Oracle 11g R2
      • Windows Core 2012 R2 with SQL Server 2012 Ent Ed.
      • RedHat 6.4 with PostgreSQL 9.3.3
  19. HammerDB Workload - Results
  20. PostgreSQL vs In-Memory DB. We picked the top 5 queries run by different business functions. Presented here are the 3 queries whose response times did not meet the SLA.
      Query               PostgreSQL      MemSQL
      Opportunity (5K)    5 seconds       200 ms
      Sales Order (170K)  1-1.5 minutes   6 seconds
      Territory (60K)     60 seconds      5 seconds
  21. Business Data Lake – Ingestion to Fulfillment. Diagram labels: Raw Data; Summary Data; DATA GOVERNOR; Consumers; Predictive/Prescriptive Analytics; Processed Data; Analytical Data; GREENPLUM DATABASE; HADOOP; RAW Data; INGEST MANAGER; SPRING XD; SPARK; SQOOP; Execution Tier; CASSANDRA; GEMFIRE; MEMSQL; POSTGRESQL; Real-Time Tap
  22. Here Are the Data Flows We Built: Low Velocity; Batch; Real-Time
  23. Data Flow Patterns – Low Velocity. Diagram labels: Analytical [BATCH]; Ingestion; Data Service; JDBC; Application; Presentation [SPEED/SERVING]; GREENPLUM DATABASE; PIVOTAL HD; POSTGRESQL; MEMSQL; Raw Data; One-Time; CASSANDRA; GEMFIRE
  24. Data Flow Patterns – Batch. Diagram labels: Analytical [BATCH]; Ingestion; Data Service; JDBC; Application; GREENPLUM DATABASE; PIVOTAL HD; Batch; Presentation [SPEED/SERVING]; POSTGRESQL; MEMSQL; CASSANDRA; GEMFIRE
  25. Data Flow Patterns – Real Time. Diagram labels: Real-time; Initial Load; Analytical [BATCH]; Ingestion; Data Service; JDBC; Application; GREENPLUM DATABASE; PIVOTAL HD; Presentation [SPEED/SERVING]; POSTGRESQL; MEMSQL; CASSANDRA; GEMFIRE
  26. Nothing Closer to Real Time Than Streaming
      • Let’s look at the leading edge
      • Apache Kafka
      • Messaging semantics: at most once; at least once; exactly once
  27. At most once (diagram: 000 ? 01 02 03 04)
  28. At least once (diagram: 01 02 03 04 000 ?)
  29. Exactly once (diagram: 000 01 02 03 04 01)
  30. Understanding Streaming Semantics
      • At most once: message pulled once; may or may not be received; no duplicates; possible missing data
      • At least once: message pulled one or more times and processed each time; receipt guaranteed; likely duplicates; no missing data
      • Exactly once: message pulled one or more times but processed once; receipt guaranteed; no duplicates; no missing data
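The three semantics can be illustrated with a small in-memory simulation. This is a sketch under stated assumptions, not Kafka's or MemSQL's implementation: the function `consume` and its parameters are hypothetical, the consumer crashes exactly once, and the `seen` set stands in for a deduplication record that a real system would have to persist atomically with the side effect.

```python
def consume(messages, mode, crash_at):
    """Replay `messages` through a consumer that crashes once at index `crash_at`.

    mode is "at_most_once", "at_least_once", or "exactly_once".
    Returns the list of messages whose effects were applied downstream.
    """
    applied = []     # side effects visible downstream
    committed = 0    # offset the consumer resumes from after a restart
    seen = set()     # offsets already processed (used by exactly_once only)
    crashed = False
    offset = 0
    while offset < len(messages):
        crash_now = offset == crash_at and not crashed
        if mode == "at_most_once":
            committed = offset + 1           # commit BEFORE processing
            if crash_now:
                crashed = True
                offset = committed           # restart: this message is lost
                continue
            applied.append(messages[offset])
        else:
            # at_least_once always applies; exactly_once skips known offsets.
            if mode == "at_least_once" or offset not in seen:
                applied.append(messages[offset])
                seen.add(offset)
            if crash_now:
                crashed = True
                offset = committed           # restart before commit: replay
                continue
            committed = offset + 1           # commit AFTER processing
        offset += 1
    return applied


msgs = ["01", "02", "03", "04"]
print(consume(msgs, "at_most_once", crash_at=1))   # ['01', '03', '04']
print(consume(msgs, "at_least_once", crash_at=1))  # ['01', '02', '02', '03', '04']
print(consume(msgs, "exactly_once", crash_at=1))   # ['01', '02', '03', '04']
```

The only difference between the first two modes is whether the offset is committed before or after processing; exactly-once adds deduplication on top of the at-least-once replay.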
  31. Rendering in Real Time
      • Picking the right business intelligence layer: Tableau; Custom Application (CF, D3, Docker); Additional Third Party Solutions
  32. Results, Plans, Lessons Learned
  33. Business Benefits
      • DATA QUERYING: Down from 4 hours per quarter to less than 1 minute per year
      • SIMPLIFIED PROVISIONING: Reduced number of tables/report required
      • DATA GOVERNANCE: Provides one version of the truth
      • TIME TO MARKET: Reduced number of tables/report required
      • TOOL AGNOSTIC: Business logic in the DB, not the tool, provides increased flexibility
  34. Use Case: Customer Account Profile. Streamlined analytics environment to gain a holistic customer view. Diagram labels: Service Request; Contracts; Installed Base; Bookings; Billings; EMC DATA LAKE; BDL SERVICES; DATA WORKSPACES; DATA INGESTION; Prof Services; 23 BUSINESS MANAGED WORKSPACES
  35. Customer Asset Lifecycle Management Platform Roadmap. Phase 1: Foundational Capabilities/Discovery. Phase 2: Scale Platform / Automate. Future Phases: Global Standard tool integrations, advanced analytics. Diagram labels: BAaaS/Tableau; Scalable Platform; Integrated Platform; GBS Renewals; Inside Sales; Additional Business groups; Oct 2015; 2016; TBD; Aug 2015; BDL Platform Enablement; Collaboration; Acceleration; In-Memory Capabilities (POC); We are here
  36. Business Data Lake Plans. Data Services Roadmap. Security: planned integration into custom BDL security API for managing Role Based Access Control (RBAC) to the underlying data.
  37. Lessons Learned – Key Takeaways
      • EDUCATE: Educate the business. Use examples of business impact.
      • ASSESS: Assess in-house big data skills. Ensure plan to support the organization for 3-5 years.
      • INFRASTRUCTURE: Choose the best possible infrastructure. Make sure your Big Data technology platform can evolve.
      • JOURNEY: Remember it is a journey. Look for small wins as well as big wins.
  38. Lessons Learned: Analytics and Data. Sourcing the right skills, working with a different philosophy, and some new tools will help you meet your analytical goals.
      • TRANSFORM YOUR PEOPLE: Data science in the organization, IT or both? Helping business units take initiative
      • CHANGE YOUR PROCESSES: New philosophy to running analytics projects; How and when to share data
      • ADAPT YOUR TECHNOLOGY: Steadily refine toolsets based on needed analysis; Identify to infrastructure layers
  39. Demonstration
  40. Demo Agenda: Showcase exactly-once semantics from Kafka
      1: Data set of 200,000 transactions summing to zero
      2: CREATE TABLE and CREATE PIPELINE
      3: Push to Kafka and confirm exactly-once
      4: Validate resiliency and confirm exactly-once
  41. Step 1: Data Source
      • Start with a data set of 200,000 transactions representing money/goods that sum to zero
  42. 200,000 transactions: transaction number; increase / decrease; amount
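A data set with this shape could be generated as follows. This is a minimal sketch, not the demo's actual generator: it assumes the amount column is signed (so that summing it yields zero), pairs every increase with an equal decrease, and uses a hypothetical amount range and the hypothetical name `make_transactions`.

```python
import random


def make_transactions(n=200_000, seed=42):
    """Build n rows of (transaction_number, direction, amount) summing to zero.

    Assumption: amounts are signed integers, so SUM(amount) is exact and
    each "increase" is cancelled by the "decrease" that follows it.
    """
    if n % 2:
        raise ValueError("n must be even so every increase has a matching decrease")
    rng = random.Random(seed)
    rows = []
    for i in range(0, n, 2):
        amount = rng.randint(1, 10_000)  # hypothetical range, whole units
        rows.append((i + 1, "increase", amount))
        rows.append((i + 2, "decrease", -amount))
    return rows


rows = make_transactions()
print(len(rows), sum(amount for _, _, amount in rows))  # 200000 0
```

Integer amounts avoid floating-point rounding, so any nonzero total after loading points to a lost or duplicated row rather than arithmetic noise.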
  43. Step 2: CREATE TABLE and CREATE PIPELINE
      • Create a table and a pipeline in MemSQL that subscribes to the Kafka topic
  44. CREATE TABLE, CREATE PIPELINE
  45. Step 3: Push to Kafka
      • Push that data set to Kafka
      • Validate exactly-once delivery by querying MemSQL:
        - show tables;
        - show pipelines;
        - select sum(amount) from transactions; (should be 0 in the demo)
        - select count(*) from transactions; (should be 200,000 in the demo)
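The sum-and-count check can be tried locally. The sketch below uses Python's built-in `sqlite3` as a stand-in for MemSQL (an assumption made purely so the example is self-contained); the table layout and the helper names `make_db` and `validate` are hypothetical. It shows how a duplicate or a lost row shifts the two numbers.

```python
import sqlite3


def make_db(rows):
    """Load rows into an in-memory table shaped like the demo's transactions."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE transactions (txn_id INTEGER, direction TEXT, amount INTEGER)"
    )
    conn.executemany("INSERT INTO transactions VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def validate(conn, expected_count):
    """Return (sum, count, ok); ok means sum is 0 and count matches."""
    total, count = conn.execute(
        "SELECT COALESCE(SUM(amount), 0), COUNT(*) FROM transactions"
    ).fetchone()
    return total, count, (total == 0 and count == expected_count)


rows = [(1, "increase", 50), (2, "decrease", -50),
        (3, "increase", 20), (4, "decrease", -20)]
print(validate(make_db(rows), 4))              # (0, 4, True): clean load
print(validate(make_db(rows + [rows[0]]), 4))  # (50, 5, False): duplicate delivery
print(validate(make_db(rows[:-1]), 4))         # (20, 3, False): lost message
```

The two numbers together are a strong signal rather than a proof: a duplicate and a loss of equal amount would cancel out in both.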
  46. [no text on slide]
  47. Step 4: Resiliency
      • Induce failures to show resiliency during exactly-once workflows
        a. randomly_fail_batches.py
        b. Restart Kafka and show error count
        c. Continue and validate exactly-once semantics
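One common way to stay exactly-once under batch failures is to commit the loaded rows and the consumed offset in the same database transaction, so a failed batch leaves no trace and is simply retried. The sketch below illustrates that idea with `sqlite3`; it is not the demo's `randomly_fail_batches.py` (whose contents are not shown in the deck), and the function name, table names, and failure rate are all hypothetical.

```python
import random
import sqlite3


def load_with_failures(messages, batch_size=3, fail_rate=0.3, seed=1):
    """Load messages in batches, randomly failing some; return (conn, error_count)."""
    rng = random.Random(seed)
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE transactions (amount INTEGER)")
    conn.execute("CREATE TABLE offsets (next_offset INTEGER)")
    conn.execute("INSERT INTO offsets VALUES (0)")
    conn.commit()

    errors = 0
    while True:
        # Resume from the last offset that was durably committed.
        (start,) = conn.execute("SELECT next_offset FROM offsets").fetchone()
        if start >= len(messages):
            break
        batch = messages[start:start + batch_size]
        try:
            with conn:  # one transaction: rows and offset commit or roll back together
                conn.executemany(
                    "INSERT INTO transactions VALUES (?)", [(m,) for m in batch]
                )
                if rng.random() < fail_rate:
                    raise RuntimeError("simulated batch failure")
                conn.execute(
                    "UPDATE offsets SET next_offset = ?", (start + len(batch),)
                )
        except RuntimeError:
            errors += 1  # inserts were rolled back; the loop retries this batch
    return conn, errors


messages = [5, -5, 7, -7, 1, -1, 2, -2]
conn, errors = load_with_failures(messages)
total, count = conn.execute(
    "SELECT SUM(amount), COUNT(*) FROM transactions"
).fetchone()
print(f"errors={errors} count={count} sum={total}")
```

However many batches fail, the final count and sum should match the source data, because a failed batch's partial inserts are rolled back along with its offset update.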
  48. [no text on slide]
  49. Errors | Total Transactions | Sum
  50. The mission is clear: We’re moving from batch to real-time with streaming
  51. Thank You. Darryl Smith, Chief Data Platform Architect and Distinguished Engineer, Dell Technologies
