Apache Spark – The New Enterprise Backbone for ETL, Batch Processing and Real-time Streaming

©2018 Impetus Technologies, Inc. All rights reserved.
You are prohibited from making a copy or modification of, or from redistributing,
rebroadcasting, or re-encoding of this content without the prior written consent of
Impetus Technologies.
This presentation may include images from other products and services. These
images are used for illustrative purposes only. Unless explicitly stated there is no
implied endorsement or sponsorship of these products by Impetus Technologies. All
copyrights and trademarks are property of their respective owners.

Apache Spark – The New Enterprise Backbone for ETL,
Batch Processing and Real-time Streaming
May 10, 2018
WEBINAR

Agenda
Enterprise Context
Apache Spark Basics (and user concerns)
Apache Spark Details (APIs, Functionality for Ingest, ETL, Analytics)
Demo: Visual Spark! IoT, Ingest, Data Quality, ETL, ML (On-prem + Cloud)
Live Q & A

Speakers
PUNIT SHAH
Solution Architect, StreamAnalytix
ANAND VENUGOPAL
AVP and Head of StreamAnalytix

It’s a role play!
Anand Venugopal “AV”
Key Influencer, Enterprise Data
Satisfied with the current setup
Prefers traditional vendors
Open to learning about and considering new
technologies
Punit Shah
Apache Spark user and believer
Understands enterprise needs and legacy products
Up to date and hands-on with the latest in Apache
Spark
Likes to build it for real and show it rather than talk
about it

Head of Enterprise Data Platforms at Next-gen Bank

Big Data Solutions Architect
Just finished an Apache Spark project
Data platform for cyber security at a major bank

Vendor and technology selection, evaluation, POCs
Data storage and data processing
Ingest, integration, wrangling, predictive analytics, machine learning
Head of Enterprise Data Platforms

6 vendor products
Matika - Big_data_edition
Allend
Fakta
Rakkle - Streams
SOS - Analytics
Rakkle - Big_data_appliance

More overlapping vendors and
products for similar tasks in other
groups / departments

3 years and a few million $

We are a 24x7 operation
Nothing can go down
Enterprise vendors are proven
This is no open source game!

Customer 360 / Churn
Predictive Maintenance
Fraud and Security
Personalized Recommendation Engine
Real-time Dashboards
The business stalls for a long time, then suddenly wants results
Integrated data silos, single source of truth
Ubiquitous, fast, self-service access to the data
“Big data enabled” use-cases

Open source, especially Apache Spark, is becoming the de facto choice
Widely deployed in Fortune 500 enterprises
We see nearly 100% usage in our customer base

Apache Spark - Distributed in-memory computation framework
Originally created to massively speed up ML jobs on Hadoop (30X)
Versatile!
Micro-batch, Hi-speed Batch, Interactive, Iterative, Graph, Streaming
Sits on Hadoop and/or Cloud

Fault Tolerant
Exactly Once Semantics
Back Pressure and Dynamic Scaling
Performance and throughput are elastic
Is Apache Spark Enterprise ready?
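
Speaker note: exactly-once results usually come from three ingredients working together: a source that can be replayed, progress that is committed only after a successful write, and a sink where rewriting the same batch has no extra effect. The sketch below illustrates that idea in plain Python. It does not use the Spark API, and `IdempotentSink`, `run` and the simulated crash are hypothetical names made up for the illustration.

```python
class IdempotentSink:
    """Stores output keyed by batch id, so rewriting a batch has no extra effect."""

    def __init__(self):
        self.batches = {}

    def write(self, batch_id, rows):
        self.batches[batch_id] = list(rows)

    def all_rows(self):
        return [row for b in sorted(self.batches) for row in self.batches[b]]


def run(source, sink, checkpoint, batch_size, crash_after_write_of=None):
    """Process `source` in micro-batches, resuming from checkpoint['next_offset']."""
    while checkpoint["next_offset"] < len(source):
        start = checkpoint["next_offset"]
        end = min(start + batch_size, len(source))
        batch_id = start // batch_size
        sink.write(batch_id, [x * 10 for x in source[start:end]])
        if batch_id == crash_after_write_of:
            raise RuntimeError("simulated failure before checkpoint commit")
        checkpoint["next_offset"] = end  # commit progress only after the write


source = [1, 2, 3, 4, 5]            # replayable source: can be re-read by offset
sink, checkpoint = IdempotentSink(), {"next_offset": 0}
try:
    run(source, sink, checkpoint, batch_size=2, crash_after_write_of=1)
except RuntimeError:
    pass                            # batch 1 was written but never committed
run(source, sink, checkpoint, batch_size=2)  # restart: batch 1 is replayed
print(sink.all_rows())
```

After the restart, batch 1 is written a second time, but because the sink is keyed by batch id each input record still appears in the output once.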

Major US Airline – 3 nodes: 4 TB / day: Ingested, Indexed, Rapid Query – CX use case
Major US Bank – 4 nodes: ~200 million records / day – Complex event processing
Tier 1 US Telco – 4 nodes: ~100 million records / day – Contact Center analytics
Larger deployments of 20, 50, 100+ nodes – All stable over years
Is Apache Spark Enterprise ready?
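
Speaker note: the daily volumes quoted for these deployments can be turned into per-second rates with a back-of-the-envelope calculation. The sketch assumes load is spread evenly over 24 hours and evenly across nodes, and that 1 TB is 10^12 bytes. Real traffic is bursty, so peak rates would be higher than these averages.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def per_second(daily_total, nodes):
    """Return (cluster rate, per-node rate) per second, assuming even load."""
    cluster = daily_total / SECONDS_PER_DAY
    return cluster, cluster / nodes


airline_bytes = per_second(4 * 10**12, nodes=3)   # 4 TB / day on 3 nodes
bank_records = per_second(200 * 10**6, nodes=4)   # ~200 million records / day
telco_records = per_second(100 * 10**6, nodes=4)  # ~100 million records / day

print(f"Airline: {airline_bytes[0] / 1e6:.1f} MB/s cluster, "
      f"{airline_bytes[1] / 1e6:.1f} MB/s per node")
print(f"Bank: {bank_records[0]:.0f} records/s cluster, "
      f"{bank_records[1]:.0f} records/s per node")
print(f"Telco: {telco_records[0]:.0f} records/s cluster, "
      f"{telco_records[1]:.0f} records/s per node")
```

Under those assumptions the averages come to roughly 46 MB/s for the airline cluster and about 2,300 and 1,200 records per second for the bank and telco clusters.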

Data Challenges to Implement Any Use Case
Establish Big Data Lake
Ingest – Batch and Streaming sources
Data Quality
Transformation
Blend & Enrich
Analytics – Rules, Statistical, Predictive, Prescriptive
Loading – Various target data stores
Visualization
Secure "Self-Service" Data Access
Governance

End to End Data Processing with Apache Spark
Establish Big Data Lake
Ingest – Batch and Streaming sources
Data Quality - Cleanse
Transformation
Blend & Enrich
Analytics – Rules, Statistical, Predictive, Prescriptive
Loading – Various target data stores
Visualization
Secure "Self-Service" Data Access
Governance
Data 360

Data Processing Task: Apache Spark API
Ingest
File System and Databases: HDFS, S3, Hive, RDBMS, ORC, Parquet (with partitioning support), TextFile, CSV, JSON and more
Streaming Sources: Kafka, RabbitMQ, JMS, AWS IoT Hub, Azure Event Hub and more
Other Sources: Redis, Couchbase, Apache Ignite, Elastic, Sqoop

Cleanse (Data Quality)
Filter with expressions
Deduplication
Time-based filtering using the watermark feature
Select queries with out-of-the-box comparison operators over columns, such as gt, lt, where
DataFrame APIs such as drop, fill, distinct
Column-based filtering such as isNaN, isNull, like, etc.
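
Speaker note: the deduplication and watermark items above combine naturally. A watermark trails the latest event time seen by a fixed lateness allowance; events older than the watermark are dropped, and duplicate-tracking state older than the watermark can be discarded. The plain-Python sketch below illustrates the idea only. It is not the Spark API, `dedupe_with_watermark` is a hypothetical helper, the sample events are made up, and it advances the watermark per event for simplicity, whereas a micro-batch engine would advance it between batches.

```python
def dedupe_with_watermark(events, max_lateness):
    """Yield events, dropping late events and duplicates by id.

    events: iterable of (event_id, event_time) in arrival order.
    The watermark trails the largest event_time seen so far by max_lateness.
    """
    seen = {}        # event_id -> event_time, kept only while above the watermark
    max_time = None
    for event_id, event_time in events:
        max_time = event_time if max_time is None else max(max_time, event_time)
        watermark = max_time - max_lateness
        if event_time < watermark:
            continue                     # too late: filtered by time
        if event_id in seen:
            continue                     # duplicate within the watermark window
        seen[event_id] = event_time
        # Evict state that can no longer match an acceptable event.
        seen = {k: t for k, t in seen.items() if t >= watermark}
        yield event_id, event_time


events = [("a", 10), ("b", 12), ("a", 10), ("c", 30), ("b", 12), ("d", 29)]
kept = list(dedupe_with_watermark(events, max_lateness=5))
print(kept)
```

The second `("a", 10)` is dropped as a duplicate, while the second `("b", 12)` is dropped as late because the watermark has already advanced to 25.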

Blend
Stream - Data at rest
Stream - Stream joins (Spark 2.3)
Data at rest
Joins – CrossJoins, InnerJoin, Conditional Joins, Broadcast Join and more
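
Speaker note: the idea behind a broadcast join is that when one side is small, a read-only copy of it is given to every worker, so the large side never has to be shuffled. The plain-Python sketch below shows that idea with a dict lookup per partition. It is not the Spark API; `broadcast_inner_join` and the connected-car sample rows are hypothetical, and it assumes join keys are unique in the small table.

```python
def broadcast_inner_join(partitions, small_table, key):
    """Inner-join each partition of a large dataset against a small lookup table.

    The small table is turned into a dict once and reused for every partition,
    which mirrors shipping one read-only copy to each worker.
    """
    lookup = {row[key]: row for row in small_table}
    joined = []
    for partition in partitions:        # on a cluster these would run in parallel
        for row in partition:
            match = lookup.get(row[key])
            if match is not None:       # inner join: unmatched rows are dropped
                joined.append({**row, **match})
    return joined


trips = [
    [{"car_id": 1, "speed": 62}, {"car_id": 2, "speed": 48}],
    [{"car_id": 3, "speed": 75}, {"car_id": 1, "speed": 58}],
]
cars = [{"car_id": 1, "model": "sedan"}, {"car_id": 2, "model": "truck"}]

result = broadcast_inner_join(trips, cars, key="car_id")
for row in result:
    print(row)
```

The row for `car_id` 3 has no match in the small table, so an inner join leaves it out.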

Transform
Core API Functions
SQL Functions
UDFs
Aggregations & Group functions, State-based functions
Custom functions using foreach & foreachPartition
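
Speaker note: a state-based function differs from a one-off aggregation in that it carries per-key state from one micro-batch to the next. The plain-Python sketch below keeps a running count and sum per key and derives an average from them. It is an illustration of the concept, not the Spark API; `update_state` and the sample readings are hypothetical.

```python
from collections import defaultdict


def update_state(state, batch):
    """Fold one micro-batch of (key, value) pairs into per-key (count, sum)."""
    for key, value in batch:
        count, total = state[key]
        state[key] = (count + 1, total + value)
    return state


state = defaultdict(lambda: (0, 0.0))
batches = [
    [("car-1", 60.0), ("car-2", 40.0)],   # first micro-batch
    [("car-1", 80.0)],                    # second micro-batch
]
for batch in batches:
    update_state(state, batch)

averages = {key: total / count for key, (count, total) in state.items()}
print(averages)
```

Keeping count and sum, rather than the average itself, is what lets each new batch update the result without revisiting earlier data.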

Analytics
Feature Extraction – TF-IDF, Word2Vec, CountVectorizer, FeatureHasher
Feature Transformers – OneHotEncoder, Binarizer, PCA, IndexToString, Interaction, SQLTransformer, StopWordsRemover, VectorAssembler and more
Feature Selectors – VectorSlicer, RFormula, ChiSqSelector
ML models – ClassificationModel, RegressionModel, RandomForestRegressionModel
Dataset APIs – cube
Third-party integrations – H2O, Notebook and more
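
Speaker note: as a reminder of what TF-IDF computes, the plain-Python sketch below weights each term by its count in a document times a smoothed inverse document frequency, log((N + 1) / (df + 1)), where N is the number of documents and df the number of documents containing the term. This is a concept illustration, not the Spark ML API; `tf_idf`, the whitespace tokenizer and the sample documents are assumptions made for the example.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document, using raw counts for TF."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((n_docs + 1) / (df + 1)) for t, df in doc_freq.items()}
    return [
        {t: tf * idf[t] for t, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]


docs = ["engine fault engine", "brake fault", "engine ok"]
weights = tf_idf(docs)
for doc, w in zip(docs, weights):
    print(doc, w)
```

Terms that appear in fewer documents, such as "brake" here, receive a higher weight than terms spread across many documents.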

Load
Custom Sinks – Foreach Sink
File – ORC, JSON, CSV, Parquet with other compression options
Hive and RDBMS
NoSQL Databases – HBase, Cassandra, AWS DynamoDB and more
Indexing Stores – Elastic, Solr
In-Memory Distributed Caching – Redis, Ignite, Couchbase and more

Enterprise Grade Hand Coded Apache Spark??
Different programming model – will take a lot of re-training
Scalable platform and applications
Monitoring, DevOps challenges (Debugging and diagnostics at scale?)
Version management of Spark pipelines
Promoting from Dev to Test to Production
Multi-tenancy
Manual Apache Spark coding strategy doesn’t scale

Demo: A Visual IDE for Apache Spark
• ETL and Predictive Analytics
• Connected Car IoT Use Case

RECAP:
Apache Spark – the New Enterprise backbone for ETL, Batch and Real-time Streaming
Too many point-solution vendors is a problem
Apache Spark - Great candidate for consolidating all data prep and compute workloads
Increase RoI of big data lake investment and save further costs
Recommended approach - Visual Enterprise Grade Spark
Provided by StreamAnalytix from Impetus Technologies Inc.
Ingest, Cleanse, Blend, Transform, Analyze, Load, Visualize – All on one UI

Poll and Feedback – Please Respond
Do you agree that Apache Spark is a strong candidate to be the enterprise data processing backbone,
as described in this webinar?
Would you be interested in a deeper dive into StreamAnalytix, a visual platform for Apache Spark, as
shown in this webinar?
Webinar rating and feedback

Thank You
Questions?
Visit www.StreamAnalytix.com for a download OR a cloud based trial
Contact us at inquiry@streamanalytix.com for a proof of concept
Meet us at the Spark Summit and DataWorks Summit in June
