Apache Spark is one of the most popular Big Data frameworks today. It is fast becoming the de facto technology choice for stream processing, real-time analytics, data science and machine learning applications at scale. It has moved well beyond the early-adopter phase, is supported by a vibrant open source community and is enjoying accelerated adoption in enterprises.
Join guest speaker Mike Gualtieri, VP & Principal Analyst at Forrester Research, and Anand Venugopal, Product Head at StreamAnalytix, for a discussion of the trends and directions defining the growing importance of Apache Spark for stream processing, machine learning and other advanced data analytics applications.
Apache Spark Empowering the Real-Time Data Driven Enterprise - StreamAnalytix webinar
1. WEBINAR
Apache Spark Empowering the Real-Time Data Driven Enterprise
October 13, 2017
Mike Gualtieri, VP & Principal Analyst, Forrester (Twitter: mgualtieri)
Anand Venugopal, Product Head & AVP, StreamAnalytix (Twitter: streamanalytix)
2. Our Agenda
• Business Value of Streaming Analytics
• Use Cases / Architecture
• Streaming Analytics Platform Criteria
• Spark as a Streaming Technology
• Introducing StreamAnalytix - Visual Spark Studio
• Success Stories and Demo
• Q & A
3. About Impetus
• Mission critical technology solutions since 1996
• Fortune 500: Big Data clients
• 1700 people; US, India, global reach
• Unique mix of Big Data products and services
4. The Real-Time Enterprise with Apache Spark
Mike Gualtieri, VP & Principal Analyst
Twitter: @mgualtieri | Linkedin: mgualtieri
69. A Strong Performer in The Forrester Wave™: Streaming Analytics, Q3 2017
“Impetus has the opportunity to make StreamAnalytix the de facto tooling standard for Spark and future streaming engines…”
Impetus Technologies covers open source bases without the headaches. Take your pick. Impetus’ StreamAnalytix supports Apache Storm and Apache Spark and is architecturally positioned to support other open source streaming analytics software such as Apache Flink.
StreamAnalytix also embeds EsperTech to provide advanced streaming analytics capabilities such as complex event processing.
What also shines about the StreamAnalytix solution is that it includes enterprise-grade visual tooling for both development and deployment of streaming applications.
StreamAnalytix tooling also unifies streaming and batch by supporting arbitrary Spark jobs such as machine learning.
70. ENABLING THE REAL TIME ENTERPRISE
1: Real-Time Streaming Data Analytics
2: Makes Spark Easy (Visual Spark Studio)
71. StreamAnalytix is a platform to build real-time apps
• Not so real-time: SENSE → ANALYZE → ACT takes hours/days
• Near real-time / real-time: SENSE → ANALYZE → ACT takes sec/ms
72. Wherever you are – we can make you faster
• Slow processing jobs: HADOOP-MR or other non-Big Data tech
• SPARK BATCH JOBS: faster due to in-memory
• SPARK STREAMING JOBS: faster due to micro batch
• EVENT STREAM PROCESSING: fastest
73. Use Cases
• Real-time C360 and Churn
• Fraud and Anomaly Detection
• IoT and Log Analytics
• Next Best Offer or Action
• Predictive Maintenance
• Cyber Security
• Real-time Call Center Analytics
74. Stack
Learning / Training: Real-time + Batch
PMML, H2O, Python – on Spark
Kafka, Storm, Esper
Scoring: Real-time + Batch
Spark Streaming, SparkML, MLlib
76. Shortage of Spark talent and the urgent need for it
• Spark projects are increasing
• Need to get done quickly, with budget controls
• But, there is a big barrier: Talent - both quality and quantity
• Deep Spark / Scala skills are hard to find
• Big gap between Spark prototype app vs. production grade, scalable, stable apps that don’t need a lot of baby-sitting
IMPACT
• S…LLL...O..OO...WW
• DIFFICULT
• COSTLY
• RISK RIDDEN
• SPARK PROJECTS
77. Is the Real-time Enterprise possible?
With Spark use-cases taking too long to deliver?
78. Is the real-time enterprise possible?
SOLUTION
• More people? (They don’t exist yet – just gets more messy and costly)
• Ditch Spark and buy proprietary platforms? ($$$$ - That’s going backwards)
• Just bite the bullet, and delay the project? (Oops!)
• Hire outsourcing companies? (Do they really have more skilled people?)
79. Is the real-time enterprise possible?
SOLUTION
• Get the right tools
• Make existing people and teams much more productive
80. The right Spark tool or platform – does this…
• Develop + Debug, Deploy, Monitor + Tune, Maintain
• Apps: Ingest, ETL, Analytics/ML
• Visual IDE
• Scale
• Performance
81. Visual Spark Studio
Data360
• Visual Spark IDE – Drag and Drop
• Analytics – Feature extract, ML, Time windows
• Transform / Enrich – Filter, Blend, Lookup
• Streaming, Batch + Oozie Workflow
• Load – HDFS, HBase, Hive, Any NoSQL
• View – Real-time Dashboards
• Ingest – Tables, Files, Kafka, APIs
87. Deployment diagram
Components: User; Load Balancer (with sticky session); StreamAnalytix Web Container (Tomcat); Hadoop Cluster; MySQL/Postgres; RabbitMQ
(1) StreamAnalytix Web Server (CentOS / RHEL 6.x or above)
(2) Secured communication via Kerberos
(3) Standalone Spark cluster or Spark over YARN
(4) StreamAnalytix leverages Zookeeper for configuration management
89. Overview
• Local Mode (Desktop or Single VM)
• Full Cluster
• + StreamAnalytix Spark portion + All dependencies = One Binary
• Identical user experience for building and managing Spark jobs
93. Transforming the Business - means….
• Creating a real-time enterprise
• Dramatic non-linear increase in performance / cost trade off
• Net new capabilities or revenue streams – that were previously not possible
94. Top airline boosts customer digital experience
• Funnels all app data to enterprise bus and into StreamAnalytix
• Couldn’t handle the volume and velocity of data earlier
• Analytical capacity went from 3 days to 3 months
• Ability to correlate events and see patterns across a larger time window
• Customer experience issues proactively resolved in real-time
• Foundation laid for real-time ML, predictive and prescriptive analytics
95. High Level Solution Architecture
StreamAnalytix Pipeline Overview
• Data Ingestion: raw JSON data from multiple apps and multiple services → Kafka
• StreamAnalytix Spark Pipeline (on YARN): Parsing → Filtering → Emitting
• Data sets: X Service data, All Services data
• Data Querying / Data Search: User → UI Data Diagnostic Tool → Query Results
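The Parsing, Filtering and Emitting stages named on this slide can be pictured with a minimal plain-Python sketch. This is not StreamAnalytix or Spark code: the field names (`service`, `level`, `msg`), the filter rule and the list standing in for the search index are all hypothetical, chosen only for illustration.

```python
import json

def parse(raw_lines):
    """Parse raw JSON strings, skipping lines that are not valid JSON."""
    for line in raw_lines:
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue  # malformed input is simply dropped in this sketch

def keep_service(events, service):
    """Filter: keep only events for one (hypothetical) service name."""
    return [e for e in events
            if isinstance(e, dict) and e.get("service") == service]

def emit(events, sink):
    """Emit: append events to a list that stands in for a search index."""
    sink.extend(events)
    return sink

raw = [
    '{"service": "booking", "level": "ERROR", "msg": "timeout"}',
    'not json',
    '{"service": "checkin", "level": "INFO", "msg": "ok"}',
]
indexed = emit(keep_service(parse(raw), "booking"), [])
print(indexed)
```

In a real deployment each stage would run as a distributed operator; the sketch only shows the order of the steps.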
Highlights
• Input data velocity ~7K /sec
• Contributing to ~5 TB /day
• ES Data retention of 30 days
• Custom built Web UI for queries
• StreamAnalytix implementation providing easy onboarding of additional services and application logs
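As a rough sanity check on the first two highlights, the average event size they imply can be worked out directly. The sketch assumes that "~7K /sec" means events per second sustained over a full day and that a terabyte is 10^12 bytes; neither assumption is stated on the slide.

```python
EVENTS_PER_SECOND = 7_000        # "~7K /sec" from the highlights
BYTES_PER_DAY = 5 * 10**12       # "~5 TB /day", assuming decimal terabytes
SECONDS_PER_DAY = 24 * 60 * 60

# Events arriving in one day at a constant rate.
events_per_day = EVENTS_PER_SECOND * SECONDS_PER_DAY
# Average size of one event implied by the daily volume.
avg_event_bytes = BYTES_PER_DAY / events_per_day

print(f"{events_per_day:,} events/day")
print(f"~{avg_event_bytes / 1000:.1f} KB per event on average")
```

Under these assumptions the two figures are consistent with events of roughly 8 KB each.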
Benefits
• Diagnostic ability on a larger range of data
• SLAs unaffected: similar or better
• Improved searching with custom Web UI
• Scalable architecture
• Supporting even larger data sets
Solution
ElasticSearch
96. Major bank - insider threat detection: 5X boost
• 5X performance gain from the same hardware
• New solution based on StreamAnalytix – costs less
• Can onboard 5 times more application traffic for detecting threats
98. Pharmacy business processing giant
• Spark based real-time CDC and flow management
• Sense-change, Ingest, Transform, Load
• 100s of source tables – data from a large number of pharmacies
• Plus some important real-time ETL / Analytics use cases
• Attunity → Kafka → StreamAnalytix / Spark → HDFS, Hive
• 2 mission critical data pipelines delivered in 1 day, 2 days
• “I could hire a 3 person team instead of a 10 person team”
99. Problem Statement
• Oracle based transactions merge to Hive reporting tables in seconds
ACHIEVEMENT
• Spark pipelines for this task built and deployed in 2 days
• Partner Integration with Attunity for CDC
• Consume Oracle multi-table CDC events in real-time
• Capture and reconcile changes into Hive tables
• De-normalize data while landing into Hive
100. StreamAnalytix Solution
A complete CDC solution has 3 parts. Each aspect of the solution is modelled as a StreamAnalytix pipeline.
• Data Ingestion and Staging: Stream data from Attunity Replicate for multiple tables from Kafka and store raw data into HDFS
• Data De-normalization: Join transactional data with data at rest and store de-normalized data on HDFS
• Incremental Updates in Hive: Merge previously processed transactional data with new incremental updates
Workflow: Modelled as a StreamAnalytix Oozie workflow to automate execution of Spark pipelines that perform data de-normalization and incremental updates to Hive
101. Pipeline #1 - Data ingestion and staging (Streaming)
• Data ingestion via Attunity ‘Channel’: Reads the data from Attunity target Kafka. This channel is configured to read data feeds as well as metadata from a separate topic
• Data enrichment: Enriches incoming data with metadata information and event timestamp
• HDFS: Stores CDC data on HDFS in landing area using OOB HDFS emitter. HDFS files are rotated based on time and size configuration
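The enrichment and the time-and-size file rotation described for Pipeline #1 can be sketched in plain Python. This is an illustration only, not the StreamAnalytix HDFS emitter: the class `RotatingSink`, the record fields and the thresholds are hypothetical, and "files" are in-memory lists rather than HDFS files.

```python
import json

class RotatingSink:
    """Collects records into 'files' and rotates on size or age."""

    def __init__(self, max_bytes, max_age_s):
        self.max_bytes = max_bytes
        self.max_age_s = max_age_s
        self.closed_files = []    # completed files
        self._current = []        # lines of the file currently open
        self._size = 0            # bytes written to the open file
        self._opened_at = None    # time at which the open file was started

    def write(self, record, now):
        line = json.dumps(record)
        n_bytes = len(line.encode("utf-8"))
        too_old = (self._opened_at is not None
                   and now - self._opened_at >= self.max_age_s)
        too_big = self._size + n_bytes > self.max_bytes
        if self._current and (too_old or too_big):
            self.rotate()
        if self._opened_at is None:
            self._opened_at = now
        self._current.append(line)
        self._size += n_bytes

    def rotate(self):
        """Close the open file (if it has content) and start a new one."""
        if self._current:
            self.closed_files.append(self._current)
        self._current, self._size, self._opened_at = [], 0, None

def enrich(event, metadata, event_ts):
    """Attach table metadata and an event timestamp to a CDC record."""
    return {**event, "meta": metadata.get(event["table"], {}), "event_ts": event_ts}

metadata = {"orders": {"pk": "id"}}   # hypothetical content of the metadata topic
sink = RotatingSink(max_bytes=10_000, max_age_s=60)
cdc_events = [
    (0, {"table": "orders", "op": "I", "id": 1}),
    (30, {"table": "orders", "op": "U", "id": 1}),
    (90, {"table": "orders", "op": "D", "id": 1}),
]
for ts, event in cdc_events:
    sink.write(enrich(event, metadata, ts), now=ts)
sink.rotate()                          # flush the last open file
print([len(f) for f in sink.closed_files])
```

The clock is passed in explicitly so that rotation is deterministic; a production emitter would typically use wall-clock time.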
102. Pipeline #2 - Data de-normalization (Batch)
• HDFS data channel: Ingests incremental data written to the staging location by previous runs of Pipeline #1
• Reads reference data (data at rest) from a fixed HDFS location
• Performs outer join to merge incremental and static data
• Stores de-normalized data to HDFS directory
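The join step of Pipeline #2 can be sketched as follows. The slide does not say which kind of outer join is used; this sketch assumes a left outer join that keeps every incremental row, and the column names (`pharmacy_id`, `pharmacy_name`, `rx_id`) are hypothetical.

```python
def left_outer_join(incremental, reference, key):
    """Join each incremental row with its reference row, if one exists."""
    ref_by_key = {r[key]: r for r in reference}
    joined = []
    for row in incremental:
        ref = ref_by_key.get(row[key], {})
        # Reference columns are added; incremental values win on conflicts.
        joined.append({**ref, **row})
    return joined

incremental = [
    {"pharmacy_id": 1, "rx_id": 100, "op": "I"},
    {"pharmacy_id": 9, "rx_id": 101, "op": "U"},   # no matching reference row
]
reference = [{"pharmacy_id": 1, "pharmacy_name": "Main St"}]

denormalized = left_outer_join(incremental, reference, "pharmacy_id")
print(denormalized)
```

Rows without a reference match are kept with only their own columns, which is what distinguishes an outer join from an inner join here.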
103. Pipeline #3 - Incremental updates in Hive (Batch)
• Hive SQL query to load a managed table from the HDFS incremental data generated from Pipeline #2
• Reconciliation step: Hive “merge into” SQL performs insert, update and delete operations based on the operation in the incremental data
• Clean up step: runs a drop table command on the managed table to clean up processed data, so that it doesn’t get repeatedly processed
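The reconciliation and clean-up logic of Pipeline #3 can be illustrated with a small in-memory sketch. It mimics the effect of a merge (insert, update, delete driven by an operation column) but is not Hive SQL; the operation codes `I`, `U`, `D` and the row layout are assumptions.

```python
def reconcile(target, changes, key="id"):
    """Apply insert/update/delete changes, in order, to a keyed table."""
    table = {row[key]: row for row in target}
    for change in changes:
        op, row = change["op"], change["row"]
        if op == "D":
            table.pop(row[key], None)            # delete if present
        elif op in ("I", "U"):
            # Insert a new row or overwrite the changed columns.
            table[row[key]] = {**table.get(row[key], {}), **row}
        else:
            raise ValueError(f"unknown operation: {op}")
    return sorted(table.values(), key=lambda r: r[key])

target = [{"id": 1, "qty": 5}, {"id": 2, "qty": 7}]
changes = [
    {"op": "U", "row": {"id": 1, "qty": 6}},
    {"op": "D", "row": {"id": 2}},
    {"op": "I", "row": {"id": 3, "qty": 1}},
]
merged = reconcile(target, changes)
changes.clear()   # clean-up: processed changes must not be applied twice
print(merged)
```

Clearing the processed changes plays the role of the drop table step: a second run starts from an empty set of incremental data.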
104. Workflow: Oozie Coordinator Job
Oozie orchestration flow created using StreamAnalytix webstudio – it orchestrates pipeline #2 & pipeline #3 into a single Oozie flow that can be scheduled as shown here
105. Hosted call center adds new premium product / revenue source
“After a long time we now have a new offering we can go sell proudly to our customers” - Product Manager
• Net new capability for real-time inspection and diagnostics of call quality and customer experience at the contact center
• Dramatically improves end-user service for their B2B customers
106. Hosted call center
Challenges solved
• Individual events scattered in different media servers
• Needed to filter a lot of noise in the data at the source itself
• Tech support took too long to correlate and solve issues
• Call Center manager had no real-time view on IVR operations
• Needed a variety of call center metrics in real-time
107. Hosted call center solution
[Architecture diagram. Legend: IP = Packet, C = Circuit]
• Callers: Internet Caller (Chat, VOIP, E-mail, Collaboration, Video); Wireless Caller (Live Call, IVR, Voice Mail); Telephone Caller (Live Call, IVR, Voice Mail)
• Servers: Core Servers (Routing, Admin, Stats, Logging); Agent Servers (Agent Interaction); Connection Servers (IVR, Voice, Chat, Video, Message); Dialing Servers (Predictive Engine, Campaign Manager)
• Networks: Public Internet; GATEWAYS; Circuit Networks; ACD; Legacy Call Centers
• ADMINISTRATOR/SUPERVISOR: Administration, Monitoring, Service Creation, Recording Reports
• Agents: PC AGENT - SOFTPHONE; PC AGENT – IP PHONE; HYBRID AGENT; PHONE AGENTS
111. Tier 1 Telco deploys new “agent monitoring system”
• 8000+ agent desktops monitored for unethical behaviour in real-time
• Secures customer information
• Ensures top quality service
• Net new capability they couldn’t get earlier at any reasonable price point
112. Desktop Analytics
Key Business Metrics:
• Average Handling Time
• First Call Resolution
• Sales Close Rate
• Disconnect Save Rate
1yr benefit is $5.41M in the form of Call Volume Reduction
30 sec AHT reduction for Tech
15 sec AHT reduction for Sales
113. Desktop analytics – desktop data pipeline
Call Center Agent Machine → Big Data Platform
Stages: Desktop raw data processing; App activity aggregation; Event activity aggregation; System data enrich and persist; App and Event data enrich and persist
• Consume raw ACD events
• Parse and split the bulk JSON message into individual events
• Process data for App, Event and System events
• Aggregate data: mini batching, data sequencing, enriching data with agent hierarchy, aggregating data
• Persist data into Hive, HBase, Elastic
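The split, enrich and aggregate steps listed for the desktop data pipeline can be sketched in plain Python. The sketch assumes that a bulk message is a JSON array of events and that enrichment is a lookup from agent to team; the field names (`agent`, `app`, `seconds`) and the hierarchy are hypothetical.

```python
import json
from collections import defaultdict

def split_bulk(message):
    """Split one bulk JSON message (assumed to be an array) into events."""
    return json.loads(message)

def aggregate_app_time(events, hierarchy):
    """Sum seconds per (team, agent, app); team comes from the hierarchy."""
    totals = defaultdict(int)
    for e in events:
        team = hierarchy.get(e["agent"], "unknown")   # enrichment lookup
        totals[(team, e["agent"], e["app"])] += e["seconds"]
    return dict(totals)

bulk = json.dumps([
    {"agent": "a1", "app": "crm", "seconds": 40},
    {"agent": "a1", "app": "crm", "seconds": 20},
    {"agent": "a2", "app": "browser", "seconds": 15},
])
hierarchy = {"a1": "tech"}   # hypothetical agent-to-team hierarchy

result = aggregate_app_time(split_bulk(bulk), hierarchy)
print(result)
```

Each call to `aggregate_app_time` corresponds to one mini batch; persisting the result to a store is left out of the sketch.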
114. Desktop analytics - data volume
Pilot
Source System | Data Type | No. of Agents | Records/Day
Desktop Data | Raw | 9 | 69461
Desktop Data | Aggregated | 9 | 45428
Call Data | Raw | 7000 | 900000
Call Data | Aggregated | 7000 | 900000
GA
Source System | Data Type | No. of Agents | Records/Day
Desktop Data | Raw | 7000 | 60M
Desktop Data | Aggregated | 7000 | 20M
Call Data | Raw | 7000 | 900000
Call Data | Aggregated | 7000 | 900000
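One way to read the pilot and GA desktop figures together is to project the pilot volumes linearly to the GA agent count. The sketch below assumes that records per day scale in proportion to the number of agents, which the slide does not claim; the numbers are copied from the slide.

```python
# Figures copied from the desktop data rows on this slide.
PILOT_AGENTS = 9
GA_AGENTS = 7000
PILOT_RAW_PER_DAY = 69_461
PILOT_AGG_PER_DAY = 45_428

def project_linearly(pilot_records, pilot_agents, target_agents):
    """Scale a daily record count in proportion to the number of agents."""
    return pilot_records / pilot_agents * target_agents

projected_raw = project_linearly(PILOT_RAW_PER_DAY, PILOT_AGENTS, GA_AGENTS)
projected_agg = project_linearly(PILOT_AGG_PER_DAY, PILOT_AGENTS, GA_AGENTS)

print(f"projected raw records/day:        {projected_raw / 1e6:.1f}M (slide: 60M)")
print(f"projected aggregated records/day: {projected_agg / 1e6:.1f}M (slide: 20M)")
```

The raw projection comes out at about 54M records per day, in the same range as the 60M reported for GA. The aggregated projection is about 35M against a reported 20M, which suggests that aggregated volume does not grow linearly with the number of agents.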