Apache Spark: Empowering the Real-Time Data-Driven Enterprise - StreamAnalytix webinar


Apache Spark is one of the most popular Big Data frameworks today. It is fast becoming the de facto technology choice for stream processing, real-time analytics, data science and machine learning applications at scale. It has moved well beyond the early-adopter phase, is supported by a vibrant open source community and is enjoying accelerated adoption in enterprises.

Join our guest speaker Mike Gualtieri, VP & Principal Analyst at Forrester Research, and Anand Venugopal, Product Head at StreamAnalytix, for a discussion of the trends and directions defining the growing importance of Apache Spark for stream processing, machine learning and other advanced data analytics applications.

Published in: Data & Analytics

  1. 1. WEBINAR: Apache Spark Empowering the Real-Time Data Driven Enterprise. October 13, 2017. Mike Gualtieri, VP & Principal Analyst, Forrester (Twitter: mgualtieri) • Anand Venugopal, Product Head & AVP, StreamAnalytix (Twitter: streamanalytix)
  2. 2. Our Agenda • Business Value of Streaming Analytics • Use Cases / Architecture • Streaming Analytics Platform Criteria • Spark as a Streaming Technology • Introducing StreamAnalytix - Visual Spark Studio • Success Stories and Demo • Q & A
  3. 3. About Impetus • Mission critical technology solutions since 1996 • Fortune 500: Big Data clients • 1700 people; US, India, global reach • Unique mix of Big Data products and services
  4. 4. The Real-Time Enterprise with Apache Spark. Mike Gualtieri, VP & Principal Analyst. Twitter: @mgualtieri | LinkedIn: mgualtieri
  5. 5. #Priority
  6. 6. Data and analytics decision-makers are driven by business priorities. Chart labels, in slide order: Better leverage big data and analytics in business… • Create a comprehensive strategy for addressing digital… • Create a comprehensive digital marketing strategy • Better comply with regulations and requirements • Improve differentiation in the market • Increase influence and brand reach in the market • Address rising customer expectations • Improve our ability to innovate • Reduce costs • Improve our products / services • Improve the experience of our customers. Chart values, in slide order: 52%, 53%, 53%, 54%, 58%, 64%, 64%, 65%, 66%, 73%, 75%. Base: 3,005 global data and analytics decision-makers. Source: Global Business Technographics Data And Analytics Online Survey 2016. © 2017 Forrester Research, Inc. Reproduction Prohibited
  7. 7. Most firms struggle to analyze data and make insights actionable in real-time
  8. 8. Real-time means business time
  9. 9. #Business
  10. 10. Is this customer thinking about moving to a rival firm right now?
  11. 11. What offers should you make to your customer if they are eCommerce’ing right now?
  12. 12. How can you warn other drivers that the road is slippery to avoid a crash right now?
  13. 13. What are movers and shakers saying about equities that we cover right now?
  14. 14. How can you prevent this dude from fleecing you right now?
  15. 15. How do you detect customer SLA problems right now?
  16. 16. How can IoT data be used to predict machine failure right now?
  17. 17. #Analytics
  18. 18. Only the analytical enterprise can compete and win in the age of the customer. [Diagram labels: Ideate, Model, Detect, Adapt • Machine Learning • Streaming Analytics • Descriptive Analytics • Prescriptive Analytics • (Real-time Analytics) • (Batch Analytics)]
  19. 19. #Data
  20. 20. Enterprises have plenty of data from both internal and external sources. Using your best estimate, what is the size of all data stored within your company? 10-49 Terabytes: 5% • 50-99 Terabytes: 12% • 100-500 Terabytes: 54% • Greater than 500 Terabytes: 29%. What % of the data available is from internal business applications (ERP and business applications) versus external sources (social, IoT)? Internal business data: 49% • External source data: 51%. Source: Forrester Research, September 2015. Base: 100 US Managers and above currently using Hadoop for processing and analyzing data.
  21. 21. Data is like a drop of rain
  22. 22. It forms instantaneously in a cloud…
  23. 23. ...and travels far before it makes a ripple
  24. 24. #Real-time
  25. 25. #
  26. 26. All data originates in real-time!
  27. 27. But, analytics to gain insights is usually done much, much later
  28. 28. #WhyWait
  29. 29. Insights are perishable
  30. 30. Enterprises must act on a range of perishable insights to get value from data and analytics. Insight types: Real-time Insights • Operational Insights • Performance Insights • Strategic Insights. Examples: Insight: Shopping for furniture, Action: Recommend cleaning supplies • Insight: Profit lower than goal, Action: Optimize price • Insight: Demand forecast strong, Action: Increase inventory • Insight: Furniture demand high, Action: Expand product line. Time to Act and Perishability scales, as listed on the slide: Sub-second to seconds, Seconds to hours, Days to weeks, Weeks to years • Sub-second to seconds, Seconds to hours, Hours to weeks, Weeks to years
  31. 31. Most analytics operations are too slow. [Chart: Business Value (Positive to Negative) against Time To Action. Labels: Data originated • Analytics performed • Insights gleaned • Action taken • Outdated insights • Impotent or harmful actions • Decision made • Poor decision]
  32. 32. You must compress analytics time-to-insight to maximize the value of data. [Chart: Business Value (Positive to Negative) against Time to Action, with the label The Real-time Enterprise]
  33. 33. IoT applications must act on a range of perishable insights to get value from big data. Insight types: Real-time Insights • Strategic Insights • Operational Insights • Performance Insights. Analytics styles: Streaming analytics • Batch analytics. Time to Act and Perishability scales, as listed on the slide: Sub-second to seconds, Seconds to hours, Days to weeks, Weeks to years • Sub-second to seconds, Seconds to hours, Hours to weeks, Weeks to years
  34. 34. #Applications
  35. 35. The opportunity to become real-time is high, but enterprises must redesign applications
  36. 36. Modern applications infuse analytics to respond in real-time and become smarter. [Diagram labels: Streaming Data • Application Interface • App Logic • Context • Actions • Real-time Context • Programmed Logic • Learned Logic • Machine Learning • Learning • External Actions, to other data sources or applications • External Context, from other data sources or applications • Applications]
  37. 37. Streaming is essential technology to identify and act on perishable insights
  38. 38. #Streaming
  39. 39. Streaming analytics lets applications sense, think, and act in real-time. Source: Forrester Research
  40. 40. Streaming analytics is very different from plain vanilla stream ingestion. Source: Forrester Research
  41. 41. Streaming analytics solutions must be scalable and have a rich set of stateful analytical operators. Architecture: • Workload scalability • Workload latency • Fault tolerance • Operational management. Stream/event Handling: • Event sequencing • Enrichment. Analytical Operators: • Transformation • Correlation • Time windows • Complex event processing. Applications Development: • Development tools • Data connectors • Business solution accelerators • Community innovation
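
As a rough illustration of the "time windows" operator named on the slide above, the sketch below counts events per key in fixed, non-overlapping (tumbling) windows. It is plain Python, not Spark or StreamAnalytix code; the event tuples, the `window_seconds` parameter and the function name are assumptions made for this example.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per (window_start, key) for fixed-size, non-overlapping windows.

    `events` is an iterable of (timestamp_seconds, key) tuples.
    """
    counts = defaultdict(int)
    for timestamp, key in events:
        # Align each timestamp to the start of the window that contains it.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


# Hypothetical click events: (seconds since start, customer id).
sample_events = [(1, "a"), (4, "a"), (7, "b"), (12, "a"), (19, "b"), (21, "b")]
window_counts = tumbling_window_counts(sample_events, window_seconds=10)
for (start, key), n in sorted(window_counts.items()):
    print(f"window [{start}, {start + 10}) key={key} count={n}")
```

The running `counts` dictionary is what makes the operator stateful: each new event updates state kept from earlier events rather than being handled in isolation.
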
  42. 42. #Solutions
  43. 43. Ability to ingest structured and unstructured data from multiple sources in real-time. [Graphic labels: Historical • Transactions • Customer data • Security]
  44. 44. Scale to handle any volume & velocity of data
  45. 45. Process and analyse in real-time
  46. 46. Provide fault-tolerance for mission-critical applications
  47. 47. Provide tools that make it easy to manage and monitor the platform and its interaction with technology components
  48. 48. Offer tools for business users to visualize insights from real-time data
  49. 49. Capture perishable events and insights at low latency
  50. 50. Offer sophisticated stateful and stateless analytics
  51. 51. Leverage existing skills to make it easy for developers to develop, test and deploy applications
  52. 52. #
  53. 53. Hadoop is designed for volume
  54. 54. Spark is designed for speed
  55. 55. Spark and Hadoop often coexist in the same cluster
  56. 56. Hadoop and Spark are friends, but…
  57. 57. …Spark is where developers go to create real-time enterprises
  58. 58. 58,000x Spark is designed to process in-memory datasets, but can spool to disk if necessary
  59. 59. Spark’s directed acyclic graph (DAG) engine optimizes parallelization to dramatically reduce intermediary data movement
  60. 60. Spark doesn’t need Hadoop; it just needs great compute and great storage
  61. 61. Spark includes capabilities for streaming analytics and machine learning!
  62. 62. #Opportunity
  63. 63. Unify batch and streaming analytics to create your real-time enterprise. [Diagram labels: Ideate, Model, Detect, Adapt • Machine Learning • Streaming Analytics • Descriptive Analytics • Prescriptive Analytics • (Real-time Analytics) • (Batch Analytics)]
  64. 64. #Time
  65. 65. Stop wasting it
  66. 66. Use it to your advantage
  67. 67. Thank you Mike Gualtieri Twitter: @mgualtieri
  68. 68. Real-Time Stream Processing and Machine Learning Platform ENABLING THE REAL TIME ENTERPRISE
  69. 69. “Impetus has the opportunity to make StreamAnalytix the de facto tooling standard for Spark and future streaming engines…” Impetus Technologies covers open source bases without the headaches. Take your pick. Impetus’ StreamAnalytix supports Apache Storm and Apache Spark and is architecturally positioned to support other open source streaming analytics software such as Apache Flink. StreamAnalytix also embeds EsperTech to provide advanced streaming analytics capabilities such as complex event processing. What also shines about the StreamAnalytix solution is that it includes enterprise-grade visual tooling for both development and deployment of streaming applications. StreamAnalytix tooling also unifies streaming and batch by supporting arbitrary Spark jobs such as machine learning. A Strong Performer in The Forrester Wave™: Streaming Analytics, Q3 2017
  70. 70. ENABLING THE REAL TIME ENTERPRISE: (1) Real-Time Streaming Data Analytics (2) Makes Spark Easy (Visual Spark Studio)
  71. 71. StreamAnalytix is a platform to build real-time apps • Not so real-time: SENSE, ANALYZE, ACT in hours / days • Near real-time / real-time: SENSE, ANALYZE, ACT in sec / ms
  72. 72. Wherever you are, we can make you faster • Slow processing jobs: HADOOP-MR OR OTHER NON-BIG DATA TECH • Faster due to in-memory: SPARK BATCH JOBS • Faster due to micro batch: SPARK STREAMING JOBS • Fastest: EVENT STREAM PROCESSING
  73. 73. Use Cases: Real-time Streaming Data Analytics • Real-time C360 and Churn • Fraud and Anomaly Detection • IoT and Log Analytics • Next Best Offer or Action • Predictive Maintenance • Cyber Security • Real-time Call Center Analytics
  74. 74. Stack: Real-time Streaming Data Analytics • Learning / Training -> Real-time + Batch • PMML, H2O, Python on Spark • Kafka, Storm, Esper • Scoring -> Real-time + Batch • Spark Streaming, SparkML, MLlib
  75. 75. (1) Real-Time Streaming Data Analytics (2) Makes Spark Easy (Visual Spark Studio)
  76. 76. Shortage of Spark talent and the urgent need for it • Spark projects are increasing • Need to get done quickly, with budget controls • But, there is a big barrier: Talent - both quality and quantity • Deep Spark / Scala skills are hard to find • Big gap between Spark prototype app vs. production grade, scalable, stable apps that don’t need a lot of baby-sitting. IMPACT • SLOW • DIFFICULT • COSTLY • RISK RIDDEN • SPARK PROJECTS
  77. 77. Is the Real-time Enterprise possible? With Spark use-cases taking too long to deliver?
  78. 78. Is the real-time enterprise possible? SOLUTION • More people? (They don’t exist yet; it just gets more messy and costly) • Ditch Spark and buy proprietary platforms? ($$$$ - That’s going backwards) • Just bite the bullet, and delay the project? (Oops!) • Hire outsourcing companies? (Do they really have more skilled people?)
  79. 79. Is the real-time enterprise possible? SOLUTION • Get the right tools • Make existing people and teams much more productive
  80. 80. The right Spark tool or platform does this: • Maintain • Deploy • Develop + Debug • Monitor + Tune Apps • Ingest • Analytics / ML • ETL • Visual IDE • Scale • Performance
  81. 81. Visual Spark Studio (Data360) • Visual Spark IDE: Drag and Drop • Analytics: Feature extract, ML, Time windows • Transform / Enrich: Filter, Blend, Lookup • Streaming, Batch + Oozie Workflow • Load: HDFS, HBase, Hive, Any NoSQL • View: Real-time Dashboards • Ingest: Tables, Files, Kafka, APIs
  82. 82. User Configurable Real-time dashboard
  83. 83. Monitoring Spark pipelines
  84. 84. Deployment diagram • User • Load Balancer with sticky session • (1) StreamAnalytix Web Server (CentOS / RHEL 6.x or above) • StreamAnalytix Web Container (Tomcat) • (2) Secured communication via Kerberos • (3) Standalone Spark cluster or Spark over YARN • (4) StreamAnalytix leverages Zookeeper for configuration management • MySQL / Postgres • RabbitMQ • Hadoop Cluster
  85. 85. Overview • Local Mode + StreamAnalytix Spark portion + All dependencies = One Binary • Desktop or Single VM • Full Cluster • Identical user experience for building and managing Spark jobs
  86. 86. Go to “” to view demo and download Visual Spark Studio
  87. 87. Success Stories
  88. 88. Why improve? …when you can transform your business
  89. 89. Transforming the Business - means…. • Creating a real-time enterprise • Dramatic non-linear increase in performance / cost trade off • Net new capabilities or revenue streams – that were previously not possible
  90. 90. Top airline boosts customer digital experience • Funnels all app data to enterprise bus and into StreamAnalytix • Couldn’t handle the volume and velocity of data earlier • Analytical capacity went from 3 days to 3 months • Ability to correlate events and see patterns across a larger time window • Customer experience issues proactively resolved in real-time • Foundation laid for real-time ML, predictive and prescriptive analytics
  91. 91. High Level Solution Architecture. StreamAnalytix Pipeline Overview [diagram labels: Multiple Apps • Multiple Services • Raw JSON Data • Service data • All Services data • Kafka (Data Ingestion) • StreamAnalytix Spark Pipeline on YARN: Parsing, Filtering, Emitting • ElasticSearch • UI Data Diagnostic Tool: Data Querying, Data Search, Query Results • User]. Solution Highlights • Input data velocity ~7K /sec • Contributing to ~5 TB /day • ES data retention of 30 days • Custom built Web UI for queries • StreamAnalytix implementation providing easy onboarding of additional services and application logs. Benefits • Diagnostic ability on a larger range of data • SLAs unaffected, similar and better • Improved searching with custom Web UI • Scalable architecture • Supporting even larger data sets
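
A back-of-envelope check of the highlights on the slide above, as a small sketch. It assumes that "~7K /sec" means events per second, that "~5 TB /day" is the raw volume of that same stream in decimal terabytes, and that the 30-day retention applies to all of it; index overhead and replication are ignored. None of the derived numbers appear in the deck.

```python
EVENTS_PER_SECOND = 7_000        # "~7K /sec" from the slide (assumed to be events)
BYTES_PER_DAY = 5 * 10**12       # "~5 TB /day", assuming decimal terabytes
RETENTION_DAYS = 30              # "ES data retention of 30 days"
SECONDS_PER_DAY = 24 * 60 * 60

# Events arriving per day at the stated rate.
events_per_day = EVENTS_PER_SECOND * SECONDS_PER_DAY

# Average raw size of one event implied by the two slide figures.
avg_event_bytes = BYTES_PER_DAY / events_per_day

# Raw data held at any time under the stated retention window.
retained_bytes = BYTES_PER_DAY * RETENTION_DAYS

print(f"events per day:     {events_per_day:,}")
print(f"average event size: {avg_event_bytes / 1000:.1f} KB")
print(f"retained raw data:  {retained_bytes / 10**12:.0f} TB")
```

Under these assumptions the stream carries roughly 600 million events a day at about 8 KB each, and a 30-day window holds on the order of 150 TB of raw data.
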
  92. 92. Major bank - insider threat detection: 5X boost • 5X performance gain from the same hardware • New solution based on StreamAnalytix costs less • Can onboard 5 times more application traffic for detecting threats
  93. 93. Data Ingestion Processing and Enrichment Data Sink and Persistence Data pipeline – high level processing stages
  94. 94. Pharmacy business processing giant • Spark based real-time CDC (change data capture) and flow management • Sense-change, Ingest, Transform, Load • 100s of source tables, with data from a large number of pharmacies • Plus some important real-time ETL / Analytics use cases • Attunity -> Kafka -> StreamAnalytix / Spark -> HDFS, Hive • 2 mission critical data pipelines delivered in 1 day, 2 days • “I could hire a 3 person team instead of a 10 person team”
  95. 95. Problem Statement • Oracle based transactions -> merge to -> Hive reporting tables in seconds. ACHIEVEMENT • Spark pipelines for this task built and deployed in 2 days • Partner integration with Attunity for CDC • Consume Oracle multi-table CDC events in real-time • Capture and reconcile changes into Hive tables • De-normalize data while landing into Hive
  96. 96. StreamAnalytix Solution: a complete CDC solution has 3 parts. Each aspect of the solution is modelled as a StreamAnalytix pipeline • Data Ingestion and Staging: Stream data from Attunity Replicate for multiple tables from Kafka and store raw data into HDFS • Data De-normalization: Join transactional data with data at rest and store de-normalized data on HDFS • Incremental Updates in Hive: Merge previously processed transactional data with new incremental updates • Workflow: Modelled as a StreamAnalytix Oozie workflow to automate execution of the Spark pipelines that perform data de-normalization and incremental updates to Hive
  97. 97. Pipeline #1 - Data ingestion and staging (Streaming) • Data ingestion via Attunity ‘Channel’: Reads the data from the Attunity target Kafka. This channel is configured to read data feeds as well as metadata from a separate topic • Data enrichment: Enriches incoming data with metadata information and event timestamp • HDFS: Stores CDC data on HDFS in the landing area using the OOB HDFS emitter. HDFS files are rotated based on time and size configuration
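
The enrichment and file-rotation steps of Pipeline #1 can be pictured with the sketch below. It is plain Python and does not use the StreamAnalytix channel or HDFS emitter; the event fields, the metadata layout, the `RotationPolicy` class and its thresholds are all hypothetical names chosen for the example.

```python
import time


def enrich(event, table_metadata, now=None):
    """Return a copy of a CDC event with table metadata and an event timestamp added."""
    enriched = dict(event)
    enriched["metadata"] = table_metadata.get(event["table"], {})
    enriched["event_ts"] = time.time() if now is None else now
    return enriched


class RotationPolicy:
    """Decide when to close the current output file, by age or by size."""

    def __init__(self, max_age_seconds, max_bytes):
        self.max_age_seconds = max_age_seconds
        self.max_bytes = max_bytes

    def should_rotate(self, opened_at, bytes_written, now):
        too_old = now - opened_at >= self.max_age_seconds
        too_big = bytes_written >= self.max_bytes
        return too_old or too_big


# Hypothetical metadata, as it might arrive on the separate metadata topic.
metadata = {"orders": {"primary_key": "order_id"}}
enriched_event = enrich({"table": "orders", "op": "INSERT", "order_id": 7}, metadata, now=1000.0)
policy = RotationPolicy(max_age_seconds=300, max_bytes=128 * 1024 * 1024)
print(enriched_event)
print(policy.should_rotate(opened_at=1000.0, bytes_written=10, now=1100.0))
```

Rotating on whichever limit is hit first keeps landing files from growing without bound during bursts while still closing files promptly when traffic is light.
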
  98. 98. Pipeline #2 - Data de-normalization (Batch) • HDFS data channel: Ingests incremental data from the staging location written by previous runs of Pipeline #1 • Reads reference data (data at rest) from a fixed HDFS location • Performs an outer join to merge incremental and static data • Stores de-normalized data to an HDFS directory
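
The join step of Pipeline #2 might look like the following in miniature. The slide does not say which side of the outer join is preserved; this sketch assumes a left outer join that keeps every incremental row. The table contents and column names are invented for illustration, and the code is plain Python rather than a Spark job.

```python
def left_outer_join(incremental_rows, reference_rows, key):
    """Join each incremental row with its reference row, keeping unmatched rows."""
    reference_by_key = {row[key]: row for row in reference_rows}
    joined = []
    for row in incremental_rows:
        # Unmatched rows get no reference columns instead of being dropped.
        match = reference_by_key.get(row[key], {})
        # Reference columns go first so incremental values win on name clashes.
        joined.append({**match, **row})
    return joined


# Hypothetical tables: prescription changes joined with pharmacy reference data.
incremental = [
    {"pharmacy_id": 1, "rx_id": 100, "op": "INSERT"},
    {"pharmacy_id": 9, "rx_id": 101, "op": "UPDATE"},
]
reference = [{"pharmacy_id": 1, "city": "Austin"}]
denormalized = left_outer_join(incremental, reference, key="pharmacy_id")
print(denormalized)
```
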
  99. 99. Pipeline #3 - Incremental updates in Hive (Batch) • Hive SQL query to load a managed table from the HDFS incremental data generated by Pipeline #2 • Reconciliation step: Hive “merge into” SQL performs insert, update and delete operations based on the operation recorded in the incremental data • Clean up step: runs a drop table command on the managed table to clean up processed data, so that it doesn’t get repeatedly processed
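
The reconciliation step above applies each change record according to its operation type. The sketch below shows that logic on an in-memory table; it is not Hive's "merge into" statement, and the record layout (`op`, `row`) and key name are assumptions made for the example.

```python
def reconcile(target, changes, key):
    """Apply INSERT / UPDATE / DELETE change records to a table keyed by `key`.

    `target` maps key value -> row; `changes` are applied in order.
    """
    merged = dict(target)
    for change in changes:
        row_key = change["row"][key]
        if change["op"] == "DELETE":
            merged.pop(row_key, None)  # deleting a missing row is a no-op
        elif change["op"] in ("INSERT", "UPDATE"):
            merged[row_key] = change["row"]
        else:
            raise ValueError(f"unknown operation: {change['op']}")
    return merged


# Hypothetical reporting table and one batch of incremental changes.
reporting = {1: {"id": 1, "status": "new"}, 2: {"id": 2, "status": "new"}}
batch = [
    {"op": "UPDATE", "row": {"id": 1, "status": "shipped"}},
    {"op": "DELETE", "row": {"id": 2}},
    {"op": "INSERT", "row": {"id": 3, "status": "new"}},
]
reconciled = reconcile(reporting, batch, key="id")
print(reconciled)
```

Applying a batch twice gives the same result as applying it once here, which is one reason the slide's clean up step matters less for correctness than for avoiding repeated work.
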
  100. 100. Workflow: Oozie Coordinator Job. Oozie orchestration flow created using StreamAnalytix webstudio; it orchestrates pipeline #2 and pipeline #3 into a single Oozie flow that can be scheduled as shown here
  101. 101. Hosted call center adds new premium product / revenue source • “After a long time we now have a new offering we can go sell proudly to our customers” - Product Manager • Net new capability for real-time inspection and diagnostics of call quality and customer experience at the contact center • Dramatically improves end-user service for their B2B customers
  102. 102. Hosted call center: challenges solved • Individual events scattered in different media servers • Needed to filter a lot of noise in the data at the source itself • Tech support took too long to correlate and solve issues • Call center manager had no real-time view on IVR operations • Needed a variety of call center metrics in real-time
  103. 103. Hosted call center solution. [Network diagram. Legend: IP = Packet, C = Circuit. Callers: Internet Caller (Chat, VOIP, E-mail, Collaboration, Video) • Wireless Caller (Live Call, IVR, Voice Mail) • Telephone Caller (Live Call, IVR, Voice Mail). Networks: Public Internet • Circuit Networks • GATEWAYS • ACD • Legacy Call Centers. Servers: Core Servers (Routing, Admin, Stats, Logging) • Agent Servers (Agent Interaction) • Connection Servers (IVR, Voice, Chat, Video, Message) • Dialing Servers (Predictive Engine, Campaign Manager). Users: ADMINISTRATOR / SUPERVISOR (Administration, Monitoring, Service Creation, Recording, Reports) • PC AGENT - SOFTPHONE • PC AGENT - IP PHONE • HYBRID AGENT • PHONE AGENTS]
  104. 104. Hosted call center solution
  105. 105. Hosted call center solution
  106. 106. Hosted call center solution
  107. 107. Tier 1 Telco deploys new “agent monitoring system” • 8000+ agent desktops monitored for unethical behaviour in real-time • Secures customer information • Ensures top quality service • Net new capability they couldn’t get earlier at any reasonable price point
  108. 108. Desktop Analytics. Key Business Metrics: • Average Handling Time • First Call Resolution • Sales Close Rate • Disconnect Save Rate. Benefits: • 1yr benefit is $5.41M in the form of Call Volume Reduction • 30 sec AHT reduction for Tech • 15 sec AHT reduction for Sales
  109. 109. Desktop analytics: desktop data pipeline. Stages: Call Center Agent Machine • Desktop Raw data processing • App activity aggregation • Event activity aggregation • System data enrich and persist • App and Event data enrich and persist • Big Data Platform. Steps: • Consume raw ACD events • Parse and split the bulk JSON message into individual events • Process data for App, Event, System events • Aggregate data: mini batching, data sequencing, enrich data with agent hierarchy, aggregate data • Persist data into Hive, HBase, Elastic
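
The "parse and split" and "aggregate" steps of the desktop data pipeline could be sketched as below. The bulk message layout, the field names and the agent hierarchy mapping are invented for this example; the deck does not show the real message format, and this is plain Python rather than a Spark pipeline.

```python
import json
from collections import defaultdict


def split_bulk_message(bulk_json):
    """Parse one bulk JSON message and return its individual event records."""
    payload = json.loads(bulk_json)
    return payload["events"]


def aggregate_app_seconds(events, agent_hierarchy):
    """Sum active seconds per (team, agent, app), enriching each event with its team."""
    totals = defaultdict(int)
    for event in events:
        team = agent_hierarchy.get(event["agent"], "unknown")
        totals[(team, event["agent"], event["app"])] += event["seconds"]
    return dict(totals)


# Hypothetical bulk message; the field names are invented for this example.
bulk = json.dumps({"events": [
    {"agent": "a1", "app": "crm", "seconds": 30},
    {"agent": "a1", "app": "crm", "seconds": 15},
    {"agent": "a2", "app": "mail", "seconds": 5},
]})
hierarchy = {"a1": "tech"}
app_totals = aggregate_app_seconds(split_bulk_message(bulk), hierarchy)
print(app_totals)
```

Aggregating before persisting is consistent with the deck's data volume figures, where aggregated desktop records are fewer than raw ones.
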
  110. 110. Desktop analytics - data volume. Columns: Source System, Data type, No. of Agents, Records/Day. Pilot: Desktop Data, Raw, 9, 69461 • Desktop Data, Aggregated, 9, 45428 • Call Data, Raw, 7000, 900000 • Call Data, Aggregated, 7000, 900000. GA: Desktop Data, Raw, 7000, 60M • Desktop Data, Aggregated, 7000, 20M • Call Data, Raw, 7000, 900000 • Call Data, Aggregated, 7000, 900000
  111. 111. Thank you. Questions? © 2017 Impetus Technologies Email: Twitter : @StreamAnalytix