Apache spark empowering the real time data driven enterprise - StreamAnalytix webinar

WEBINAR
Apache Spark Empowering the Real-Time
Data Driven Enterprise
October 13, 2017
Mike Gualtieri, VP & Principal Analyst, Forrester (Twitter: mgualtieri)
Anand Venugopal, Product Head & AVP, StreamAnalytix (Twitter: streamanalytix)

Our Agenda
• Business Value of Streaming Analytics
• Use Cases / Architecture
• Streaming Analytics Platform Criteria
• Spark as a Streaming Technology
• Introducing StreamAnalytix - Visual Spark Studio
• Success Stories and Demo
• Q & A

About Impetus
• Mission critical technology solutions since 1996
• Fortune 500: Big Data clients
• 1700 people; US, India, global reach
• Unique mix of Big Data products and services

The Real-Time Enterprise with Apache Spark
Mike Gualtieri, VP & Principal Analyst
Twitter: @mgualtieri | LinkedIn: mgualtieri

© 2017 Forrester Research, Inc. Reproduction Prohibited
Data and analytics decision-makers are driven by business priorities
[Bar chart: share of respondents citing each business priority, in the order extracted from the chart]
• Better leverage big data and analytics in business…: 52%
• Create a comprehensive strategy for addressing digital…: 53%
• Create a comprehensive digital marketing strategy: 53%
• Better comply with regulations and requirements: 54%
• Improve differentiation in the market: 58%
• Increase influence and brand reach in the market: 64%
• Address rising customer expectations: 64%
• Improve our ability to innovate: 65%
• Reduce costs: 66%
• Improve our products / services: 73%
• Improve the experience of our customers: 75%
Base: 3,005 global data and analytics decision-makers
Source: Global Business Technographics Data And Analytics Online Survey 2016

Most firms struggle to analyze data and make
insights actionable in real-time

Real-time means business time

Is this customer thinking about moving to a
rival firm right now?

What offers should you make to your customer
if they are eCommerce’ing right now?

How can you warn other drivers that the road is
slippery to avoid a crash right now?

What are movers and shakers saying about
equities that we cover right now?

How can you prevent this dude from fleecing
you right now?

How do you detect customer SLA problems
right now?

How can IoT data be used to predict machine
failure right now?

Only the analytical enterprise can compete and win in the age of the customer
[Diagram: Ideate, Model, Detect, Adapt cycle. Labels: Machine Learning, Streaming Analytics (Real-time Analytics), Descriptive Analytics, Prescriptive Analytics, (Batch Analytics)]

Enterprises have plenty of data from both internal and external sources
Using your best estimate, what is the size of all data stored within your company?
• 10-49 terabytes: 5%
• 50-99 terabytes: 12%
• 100-500 terabytes: 54%
• Greater than 500 terabytes: 29%
What % of the data available is from internal business applications (ERP and business applications) versus external sources (social, IoT)?
• Internal business data: 49%
• External source data: 51%
Source: Forrester Research, September 2015
Base: 100 US Managers and above currently using Hadoop for processing and analyzing data.

It forms instantaneously in a cloud…

...and travels far before it makes a ripple

All data originates in real-time!

But, analytics to gain insights is usually done
much, much later

Enterprises must act on a range of perishable insights to get value from data and analytics
• Real-time Insights (time to act: sub-second to seconds; perishability: sub-second to seconds)
  Insight: Shopping for furniture. Action: Recommend cleaning supplies
• Operational Insights (time to act: seconds to hours; perishability: seconds to hours)
  Insight: Profit lower than goal. Action: Optimize price
• Performance Insights (time to act: days to weeks; perishability: hours to weeks)
  Insight: Demand forecast strong. Action: Increase inventory
• Strategic Insights (time to act: weeks to years; perishability: weeks to years)
  Insight: Furniture demand high. Action: Expand product line

Most analytics operations are too slow
[Chart: business value (positive to negative) against time to action. Stages: Data originated, Analytics performed, Insights gleaned, Decision made, Action taken. Late outcomes: Outdated insights, Poor decision, Impotent or harmful actions]

You must compress analytics time-to-insight to maximize the value of data
[Chart: business value (positive to negative) against time to action, with "The Real-time Enterprise" marked]

IoT applications must act on a range of perishable insights to get value from big data
• Insight types: Real-time, Operational, Performance, Strategic
• Time to act: sub-second to seconds; seconds to hours; days to weeks; weeks to years
• Perishability: sub-second to seconds; seconds to hours; hours to weeks; weeks to years
• Analytics styles shown: Streaming analytics, Batch analytics

The opportunity to become real-time is high, but
enterprises must redesign applications

Modern applications infuse analytics to respond in real-time and become smarter
[Diagram: Applications. Labels: Streaming Data, Application Interface, App Logic (Programmed Logic, Learned Logic), Machine Learning, Learning, Context, Real-time Context, External Context (from other data sources or applications), Actions, External Actions (to other data sources or applications)]

Streaming is essential technology to identify
and act on perishable insights

Streaming analytics lets applications sense, think, and act
in real-time
Source: Forrester Research

Streaming analytics is very different from plain vanilla
stream ingestion
Source: Forrester Research

Streaming analytics solutions must be scalable and have a rich set of stateful analytical operators
Architecture
• Workload scalability
• Workload latency
• Fault tolerance
• Operational management
Stream/event Handling
• Event sequencing
• Enrichment
Analytical Operators
• Transformation
• Correlation
• Time windows
• Complex event processing
Applications Development
• Development tools
• Data connectors
• Business solution accelerators
• Community innovation
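
To make "time windows" and "stateful" concrete, here is a minimal sketch in plain Python, not Spark and not StreamAnalytix code. It counts events per key in fixed, non-overlapping (tumbling) windows. The event shape (timestamp, key) and the function name are assumptions made for illustration.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per key in fixed, non-overlapping time windows.

    events: iterable of (timestamp_seconds, key) pairs (hypothetical shape).
    Returns a dict mapping (window_start, key) to a count.
    """
    counts = defaultdict(int)  # state the operator carries across events
    for timestamp, key in events:
        # Every timestamp falls into exactly one window.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


events = [(0, "login"), (3, "login"), (9, "error"), (12, "login"), (14, "error")]
result = tumbling_window_counts(events, window_seconds=10)
for (window_start, key), count in sorted(result.items()):
    print(window_start, key, count)
```

The running dictionary is the "state". A production engine would also have to checkpoint that state and handle late or out-of-order events, which this sketch ignores.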

[Graphic: data sources. Labels: Historical, Transactions, Customer data, Security]
Ability to ingest structured and unstructured
data from multiple sources in real-time

Scale to handle any volume & velocity of data

Process and analyse in real-time

Provide fault-tolerance for mission-critical
applications

Provide tools that make it easy to manage and
monitor the platform and its interaction with
technology components

Offer tools for business users to visualize
insights from real-time data

Capture perishable events and insights
at low latency

Offer sophisticated stateful and stateless
analytics

Leverage existing skills to make it easy for
developers to develop, test and deploy
applications

Spark and Hadoop often coexist in the same cluster

Hadoop and Spark are friends, but…

…Spark is where developers go to create
real-time enterprises

58,000x
Spark is designed to process in-memory
datasets, but can spool to disk if necessary

Spark’s directed acyclic graph (DAG) engine
optimizes parallelization to dramatically reduce
intermediary data movement

Spark doesn’t need Hadoop; it just needs great compute
and great storage

Spark includes capabilities for streaming analytics and
machine learning!

Unify batch and streaming analytics to create your real-time enterprise
[Diagram: Ideate, Model, Detect, Adapt cycle. Labels: Machine Learning, Streaming Analytics (Real-time Analytics), Descriptive Analytics, Prescriptive Analytics, (Batch Analytics)]

Thank you
Mike Gualtieri
mgualtieri@forrester.com
Twitter: @mgualtieri

Real-Time Stream Processing and Machine Learning Platform
ENABLING THE REAL TIME ENTERPRISE

A Strong Performer in The Forrester Wave™: Streaming Analytics, Q3 2017
“Impetus has the opportunity to make StreamAnalytix the de facto tooling standard for Spark and future streaming engines…”
Impetus Technologies covers open source bases without the headaches. Take your pick. Impetus’ StreamAnalytix supports Apache Storm and Apache Spark and is architecturally positioned to support other open source streaming analytics software such as Apache Flink.
StreamAnalytix also embeds EsperTech to provide advanced streaming analytics capabilities such as complex event processing.
What also shines about the StreamAnalytix solution is that it includes enterprise-grade visual tooling for both development and deployment of streaming applications.
StreamAnalytix tooling also unifies streaming and batch by supporting arbitrary Spark jobs such as machine learning.

1. Real-Time Streaming Data Analytics
2. Makes Spark Easy (Visual Spark Studio)

StreamAnalytix is a platform to build real-time apps
• Not so real-time: SENSE, ANALYZE, ACT in hours/days
• Near real-time / real-time: SENSE, ANALYZE, ACT in sec/ms

Wherever you are – we can make you faster
• Slow processing jobs: Hadoop-MR or other non-big data tech
• Faster due to in-memory: Spark batch jobs
• Faster due to micro batch: Spark Streaming jobs
• Fastest: event stream processing

Use Cases: Real-time Streaming Data Analytics
• Real-time C360 and Churn
• Fraud and Anomaly Detection
• IoT and Log Analytics
• Next Best Offer or Action
• Predictive Maintenance
• Cyber Security
• Real-time Call Center Analytics

Stack: Real-time Streaming Data Analytics
Learning / Training → Real-time + Batch
PMML, H2O, Python – on Spark
Kafka, Storm, Esper
Scoring → Real-time + Batch
Spark Streaming, SparkML, MLlib

1. Real-Time Streaming Data Analytics
2. Makes Spark Easy (Visual Spark Studio)

Shortage of Spark talent and the urgent need for it
• Spark projects are increasing
• Need to get done quickly, with budget controls
• But, there is a big barrier: Talent - both quality and quantity
• Deep Spark / Scala skills are hard to find
• Big gap between Spark prototype app vs. production grade,
scalable, stable apps that don’t need a lot of baby-sitting
IMPACT: SPARK PROJECTS
• S…LLL...O..OO...WW
• DIFFICULT
• COSTLY
• RISK RIDDEN

Is the real-time enterprise possible?
With Spark use-cases taking too long to deliver?

Is the real-time enterprise possible?
SOLUTION
• More people? (They don’t exist yet – just gets more messy and costly)
• Ditch Spark and buy proprietary platforms? ($$$$ - That’s going backwards)
• Just bite the bullet, and delay the project? (Oops!)
• Hire outsourcing companies? (Do they really have more skilled people?)

Is the real-time enterprise possible?
SOLUTION
• Get the right tools
• Make existing people and teams much more productive

The right Spark tool or platform – does this…
• Lifecycle: Develop + Debug, Deploy, Monitor + Tune, Maintain
• Apps: Ingest, ETL, Analytics/ML
• Visual IDE
• Scale
• Performance

Visual Spark Studio
Data360
• Visual Spark IDE – Drag and Drop
• Ingest – Tables, Files, Kafka, APIs
• Transform / Enrich – Filter, Blend, Lookup
• Analytics – Feature extract, ML, Time windows
• Load – HDFS, HBase, Hive, Any NoSQL
• View – Real-time Dashboards
• Streaming, Batch + Oozie Workflow

User Configurable
Real-time dashboard

Deployment diagram
Components: User, Load Balancer (with sticky session), StreamAnalytix Web Container (Tomcat), MySQL/Postgres, RabbitMQ, Hadoop Cluster
1. StreamAnalytix Web Server (CentOS / RHEL 6.x or above)
2. Secured communication via Kerberos
3. Standalone Spark cluster or Spark over YARN
4. StreamAnalytix leverages Zookeeper for configuration management

Overview
• Local Mode (Desktop or Single VM): StreamAnalytix Spark portion + All dependencies = One Binary
• Full Cluster
Identical user experience for building and managing Spark jobs

Go to
“StreamAnalytix.com”
to view demo
and download
Visual Spark Studio

Why improve?
…when you can transform your business

Transforming the Business - means….
• Creating a real-time enterprise
• Dramatic non-linear increase in performance / cost trade off
• Net new capabilities or revenue streams – that were previously not possible

Top airline boosts customer digital experience
• Funnels all app data to enterprise bus and into StreamAnalytix
• Couldn’t handle the volume and velocity of data earlier
• Analytical capacity went from 3 days to 3 months
• Ability to correlate events and see patterns across a larger time window
• Customer experience issues proactively resolved in real-time
• Foundation laid for real-time ML, predictive and prescriptive analytics

High Level Solution Architecture: StreamAnalytix Pipeline Overview
[Diagram labels: User; Raw JSON Data (multiple apps, multiple services); Kafka (Data Ingestion); StreamAnalytix Spark Pipeline on YARN (Parsing, Filtering, Emitting); X Service data; All Services data; UI Data Diagnostic Tool (Data Querying, Data Search, Query Results)]
Highlights
• Input data velocity ~7K /sec
• Contributing to ~5 TB /day
• ES Data retention of 30 days
• Custom built Web UI for queries
• StreamAnalytix implementation providing
easy onboarding of additional services
and application logs
Benefits
• Diagnostic ability on a larger range of data
• SLAs unaffected: similar or better
• Improved searching with custom Web UI
• Scalable architecture
• Supporting even larger data sets
Solution
ElasticSearch
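
As a rough sanity check on the highlights above, the quoted rates imply an average event size and a retained volume. This is back-of-the-envelope arithmetic under two assumptions that the slide does not state: "TB" means decimal terabytes, and the 5 TB/day figure is the raw input volume rather than the size of the ElasticSearch index.

```python
# Figures quoted on the slide
EVENTS_PER_SECOND = 7_000
BYTES_PER_DAY = 5 * 10**12   # ~5 TB/day, assuming decimal terabytes
RETENTION_DAYS = 30

SECONDS_PER_DAY = 24 * 60 * 60

events_per_day = EVENTS_PER_SECOND * SECONDS_PER_DAY
avg_event_bytes = BYTES_PER_DAY / events_per_day
retained_bytes = BYTES_PER_DAY * RETENTION_DAYS

print(f"events per day:     {events_per_day:,}")
print(f"average event size: {avg_event_bytes / 1000:.1f} KB")
print(f"retained raw data:  {retained_bytes / 10**12:.0f} TB over {RETENTION_DAYS} days")
```

Under those assumptions, each JSON event averages roughly 8 KB, and 30 days of retention corresponds to about 150 TB of raw input. The actual index footprint depends on mappings, replication, and compression.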

Major bank - insider threat detection: 5X boost
• 5X performance gain from the same hardware
• New solution based on StreamAnalytix – costs less
• Can onboard 5 times more application traffic for detecting threats

Data pipeline – high level processing stages
Data Ingestion → Processing and Enrichment → Data Sink and Persistence

Pharmacy business processing giant
• Spark based real-time CDC and flow management
• Sense-change, Ingest, Transform, Load
• 100s of source tables – data from a large number of pharmacies
• Plus some important real-time ETL / Analytics use cases
• Attunity → Kafka → StreamAnalytix / Spark → HDFS, Hive
• 2 mission critical data pipelines delivered in 1 day, 2 days
• “I could hire a 3 person team instead of a 10 person team”

Problem Statement
• Oracle based transactions → merge to → Hive reporting tables in seconds
ACHIEVEMENT
• Spark pipelines for this task built and deployed in 2 days
• Partner Integration with Attunity for CDC
• Consume Oracle multi-table CDC events in real-time
• Capture and reconcile changes into Hive tables
• De-normalize data while landing into Hive

StreamAnalytix Solution
A complete CDC solution has 3 parts. Each aspect of the solution is modelled as a StreamAnalytix pipeline.
• Data Ingestion and Staging: Stream data from Attunity Replicate for multiple tables from Kafka and store raw data into HDFS
• Data De-normalization: Join transactional data with data at rest and store de-normalized data on HDFS
• Incremental Updates in Hive: Merge previously processed transactional data with new incremental updates
Workflow: Modelled as a StreamAnalytix Oozie workflow to automate execution of Spark pipelines that perform data de-normalization and incremental updates to Hive

Pipeline #1 - Data ingestion and staging (Streaming)
• Data ingestion via Attunity ‘Channel’: Reads the data from Attunity target Kafka. This channel is configured to read data feeds as well as metadata from a separate topic
• Data enrichment: Enriches incoming data with metadata information and event timestamp
• HDFS: Stores CDC data on HDFS in landing area using OOB HDFS emitter. HDFS files are rotated based on time and size configuration
Pipeline #2 - Data de-normalization (Batch)
• HDFS data channel: Ingests incremental data from previous runs of the staging location (Pipeline #1)
• Reads reference data (data at rest) from a fixed HDFS location
• Performs outer join to merge incremental and static data
• Stores de-normalized data to HDFS directory
Pipeline #3 - Incremental updates in Hive (Batch)
• Input: output of Pipeline #2
• Hive SQL query to load a managed table from the HDFS incremental data generated from Pipeline #2
• Reconciliation step - Hive “merge into” SQL, performs insert, update and delete operation based on the operation in incremental data
• Clean up step - runs a drop table command on the managed table to clean up processed data – so that it doesn’t get repeatedly processed
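
The reconciliation logic (insert, update, or delete depending on the operation carried in each incremental row) can be sketched in plain Python. This models the semantics only. It is not the Hive “merge into” statement itself, and the operation codes "I", "U", "D" and the column names are assumptions.

```python
def apply_cdc(target, changes, key="id"):
    """Apply ordered change events to a keyed table and return the new table.

    target:  dict mapping key value to row dict
    changes: list of row dicts, each with an 'op' of 'I', 'U' or 'D'
    """
    result = dict(target)  # leave the caller's table untouched
    for change in changes:
        op = change["op"]
        row = {name: value for name, value in change.items() if name != "op"}
        if op == "D":
            result.pop(row[key], None)  # delete if present
        elif op in ("I", "U"):
            result[row[key]] = row      # insert a new row or overwrite an old one
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return result


table = {1: {"id": 1, "qty": 5}, 2: {"id": 2, "qty": 7}}
changes = [
    {"op": "U", "id": 1, "qty": 6},
    {"op": "D", "id": 2},
    {"op": "I", "id": 3, "qty": 1},
]
merged = apply_cdc(table, changes)
print(merged)
```

Changes are applied in order, so several events for the same key resolve to the latest one. A set-based merge would need to deduplicate to the latest event per key before merging.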

Workflow: Oozie Coordinator Job
Oozie orchestration flow created using StreamAnalytix webstudio –
it orchestrates pipeline #2 & pipeline #3 into a single Oozie flow that
can be scheduled as shown here

Hosted call center adds new premium product / revenue source
“After a long time we now have a new offering we can go sell proudly to our customers”
- Product Manager
• Net new capability for real-time inspection and diagnostics of call quality and customer experience at the contact center
• Dramatically improves end-user service for their B2B customers

Hosted call center
Challenges solved
• Individual events scattered in different media servers
• Needed to filter a lot of noise in the data at the source itself
• Tech support took too long to correlate and solve issues
• Call Center manager had no real-time view on IVR operations
• Needed a variety of call center metrics in real-time

Hosted call center solution
[Architecture diagram. Legend: IP = Packet, C = Circuit]
• Callers: Internet Caller (Chat, VOIP, E-mail, Collaboration, Video); Wireless Caller (Live Call, IVR, Voice Mail); Telephone Caller (Live Call, IVR, Voice Mail)
• Networks: Public Internet, Circuit Networks, Gateways, ACD, Legacy Call Centers
• Servers: Core Servers (Routing, Admin, Stats, Logging); Agent Servers (Agent Interaction); Connection Servers (IVR, Voice, Chat, Video, Message); Dialing Servers (Predictive Engine, Campaign Manager)
• Administrator/Supervisor: Administration, Monitoring, Service Creation, Recording Reports
• Agents: PC Agent - Softphone, PC Agent – IP Phone, Hybrid Agent, Phone Agents

Tier 1 Telco deploys new “agent monitoring system”
• 8000+ agent desktops monitored for unethical behaviour in real-time
• Secures customer information
• Ensures top quality service
• Net new capability they couldn’t get earlier at any reasonable price point

Desktop Analytics
Key Business Metrics :
• Average Handling Time
• First Call Resolution
• Sales Close Rate
• Disconnect Save Rate
1yr benefit is $5.41M
in the form of Call Volume Reduction
30 sec AHT reduction for Tech
15 sec AHT reduction for Sales

Desktop analytics – desktop data pipeline
[Pipeline diagram: Call Center Agent Machine → Big Data Platform. Stages: Desktop raw data processing; App activity aggregation; Event activity aggregation; System data enrich and persist; App and Event data enrich and persist]
• Consume raw ACD events
• Parse and split the bulk JSON message into individual
• Data process for App, Event, System events
• Aggregate data: mini batching, data sequencing, enrich data with agent hierarchy, aggregate data
• Persist data into Hive, HBase, Elastic
Desktop analytics - data volume

Pilot
Source System   Data Type    No. of Agents   Records/Day
Desktop Data    Raw          9               69461
Desktop Data    Aggregated   9               45428
Call Data       Raw          7000            900000
Call Data       Aggregated   7000            900000

GA
Source System   Data Type    No. of Agents   Records/Day
Desktop Data    Raw          7000            60M
Desktop Data    Aggregated   7000            20M
Call Data       Raw          7000            900000
Call Data       Aggregated   7000            900000
