©2018 Impetus Technologies, Inc. All rights reserved.
You are prohibited from making a copy or modification of, or from redistributing,
rebroadcasting, or re-encoding of this content without the prior written consent of
Impetus Technologies.
This presentation may include images from other products and services. These
images are used for illustrative purposes only. Unless explicitly stated there is no
implied endorsement or sponsorship of these products by Impetus Technologies. All
copyrights and trademarks are property of their respective owners.
Apache Spark – The New Enterprise Backbone for ETL,
Batch Processing and Real-time Streaming
May 10, 2018
WEBINAR
Agenda
Enterprise Context
Apache Spark Basics (and user concerns)
Apache Spark Details (APIs, Functionality for Ingest, ETL, Analytics)
Demo: Visual Spark! IoT, Ingest, Data Quality, ETL, ML (On-prem + Cloud)
Live Q & A
Speakers
PUNIT SHAH
Solution Architect, StreamAnalytix
ANAND VENUGOPAL
AVP and Head of StreamAnalytix
It’s a role play!
Anand Venugopal “AV”
Key Influencer, Enterprise Data
Satisfied with the current setup
Prefers traditional vendors
Open to learning about and considering new
technologies
Punit Shah
Apache Spark user and believer
Understands enterprise needs and legacy products
Up to date and hands-on with the latest in Apache
Spark
Likes to build it for real and show it rather than talk
about it
Head of Enterprise Data Platforms at Next-gen Bank
Big Data Solutions Architect
Just finished an Apache Spark project
Data platform for cyber security at a major bank
Vendor and technology selection, evaluation, POCs
Data storage and data processing
Ingest, integration, wrangling, predictive analytics, machine learning
Head of Enterprise Data Platforms
Head of Enterprise Data Platforms
6 vendor products
Matika - Big_data_edition
Allend
Fakta
Rakkle - Streams
SOS - Analytics
Rakkle - Big_data_appliance
Head of Enterprise Data Platforms
More overlapping vendors and
products for similar tasks in other
groups / departments
Head of Enterprise Data Platforms
3 years and a few million $
Head of Enterprise Data Platforms
We are a 24x7 operation
Nothing can go down
Enterprise vendors are proven
This is no open source game!
Customer 360 / Churn
Predictive Maintenance
Fraud and Security
Personalized Recommendation Engine
Real-time Dashboards
The business stalls for a long time, then suddenly wants results
Integrated data silos, single source of truth
Ubiquitous, fast, self-service access to the data
“Big data enabled” use-cases
Head of Enterprise Data Platforms
Open source, especially Apache Spark, is becoming the de facto choice
Widely deployed in Fortune 500 enterprises
We see nearly 100% usage in our customer base
Big Data Solutions Architect
Apache Spark - Distributed in-memory computation framework
Originally created to massively speed up ML jobs on Hadoop (30X)
Versatile!
Big Data Solutions Architect
Workloads: Micro-batch, Hi-speed Batch, Interactive, Iterative, Graph, Streaming
Sits on Hadoop and/or Cloud
Fault Tolerant
Exactly Once Semantics
Back Pressure and Dynamic Scaling
Performance and throughput are elastic
Is Apache Spark Enterprise ready?
Big Data Solutions Architect
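The fault-tolerance and exactly-once claims above can be illustrated with a small sketch. It assumes the usual recipe of a replayable source, a checkpointed position, and an idempotent sink. It is plain Python, not Spark code: `IdempotentSink`, `process`, and the sample records are hypothetical names invented for this illustration.

```python
from typing import Dict, List, Set, Tuple

Record = Tuple[str, int]  # (key, amount); hypothetical record shape


class IdempotentSink:
    """Hypothetical target store that ignores a batch it has already applied."""

    def __init__(self) -> None:
        self.rows: Dict[str, int] = {}
        self.applied_batches: Set[int] = set()

    def write(self, batch_id: int, records: List[Record]) -> None:
        if batch_id in self.applied_batches:
            return  # replayed batch: already applied, so do nothing
        for key, amount in records:
            self.rows[key] = self.rows.get(key, 0) + amount
        self.applied_batches.add(batch_id)


def process(batches: List[List[Record]], sink: IdempotentSink, checkpoint: int) -> int:
    """Write every batch from the checkpointed position on; return the new checkpoint."""
    for batch_id in range(checkpoint, len(batches)):
        sink.write(batch_id, batches[batch_id])
        checkpoint = batch_id + 1  # a real engine would persist this durably
    return checkpoint


batches = [[("a", 1), ("b", 2)], [("a", 3)]]
sink = IdempotentSink()

# Simulated crash: batch 0 reached the sink, but the checkpoint was never saved.
sink.write(0, batches[0])

# Restart from the last saved checkpoint (0): batch 0 is replayed, batch 1 is new.
checkpoint = process(batches, sink, checkpoint=0)
print(sink.rows, checkpoint)
```

Although batch 0 is delivered twice, key "a" ends at 4 rather than 5, because the replay is recognized by its batch id and skipped.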
Major US Airline – 3 nodes: 4 TB / day: Ingested, Indexed, Rapid Query – CX use case
Major US Bank – 4 nodes: ~200 million records / day – Complex event processing
Tier 1 US Telco – 4 nodes: ~100 million records / day – Contact Center analytics
Larger deployments of 20, 50, and 100+ nodes – All stable over years
Is Apache Spark Enterprise ready?
Big Data Solutions Architect
Data Challenges to Implement Any Use Case
Establish Big Data Lake
Ingest – Batch and Streaming sources
Data Quality
Transformation
Blend & Enrich
Analytics – Rules, Statistical, Predictive, Prescriptive
Loading – Various target data stores
Visualization
Secure "Self-Service" Data Access
Governance
Head of Enterprise Data Platforms
End to End Data Processing with Apache Spark
Establish Big Data Lake
Ingest – Batch and Streaming sources
Data Quality - Cleanse
Transformation
Blend & Enrich
Analytics – Rules, Statistical, Predictive, Prescriptive
Loading – Various target data stores
Visualization
Secure "Self-Service" Data Access
Governance
Data 360
Big Data Solutions Architect
Data Processing Task: Ingest
Apache Spark API:
File Systems and Databases: HDFS, S3, Hive, RDBMS, ORC, Parquet (with partitioning support), TextFile, CSV, JSON and more
Streaming Sources: Kafka, RabbitMQ, JMS, AWS IoT, Azure Event Hubs and more
Other Sources: Redis, Couchbase, Apache Ignite, Elastic, Sqoop
Data Processing Task: Cleanse (Data Quality)
Apache Spark API:
Filter with expressions
Deduplication
Time-based filtering using the watermark feature
Select queries with out-of-the-box comparison operators over columns, such as gt, lt, where
DataFrame APIs such as drop, fill, distinct
Column-based filtering such as isNaN, isNull, like, etc.
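The cleansing steps listed above (null filtering, watermark-based time filtering, deduplication) can be sketched conceptually. In Spark's Structured Streaming the corresponding calls are `withWatermark` and `dropDuplicates`. The code below is not Spark code: it is a plain-Python illustration of the semantics, and `Event`, `cleanse`, and the sample data are assumptions made for the example. Spark advances its watermark between micro-batches rather than per event as done here.

```python
from typing import Iterable, List, NamedTuple, Optional, Set


class Event(NamedTuple):
    """Hypothetical sensor reading; field names are illustrative only."""
    id: str
    ts: int                 # event time, in seconds
    value: Optional[float]


def cleanse(events: Iterable[Event], max_delay: int) -> List[Event]:
    """Drop null readings, events older than the watermark, and duplicate ids."""
    max_event_time: Optional[int] = None
    seen_ids: Set[str] = set()
    kept: List[Event] = []
    for event in events:
        if event.value is None:
            continue  # data-quality filter: missing value
        if max_event_time is None or event.ts > max_event_time:
            max_event_time = event.ts
        watermark = max_event_time - max_delay
        if event.ts < watermark:
            continue  # too late: older than the watermark
        if event.id in seen_ids:
            continue  # duplicate of an event already kept
        seen_ids.add(event.id)
        kept.append(event)
    return kept


events = [
    Event("e1", 100, 1.0),
    Event("e2", 130, None),   # null value
    Event("e3", 160, 2.5),
    Event("e1", 155, 1.0),    # duplicate id
    Event("e4", 120, 3.0),    # late: watermark is 160 - 10 = 150
    Event("e5", 158, 4.0),
]
kept_ids = [e.id for e in cleanse(events, max_delay=10)]
print(kept_ids)
```

With a 10-second allowed delay, only e1, e3, and e5 survive: e2 has no value, the second e1 is a duplicate, and e4 arrives behind the watermark.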
Data Processing Task: Blend
Apache Spark API:
Stream - Data at rest
Stream - Stream joins (Spark 2.3)
Data at rest
Joins – Cross joins, inner joins, conditional joins, broadcast joins and more
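The idea behind a broadcast join from the list above can be sketched as follows: the small table is copied to every partition of the large dataset and probed as a hash map, so the large side never has to be shuffled. In Spark this is requested with the `broadcast` hint from the SQL functions. The code below is a plain-Python illustration only: `broadcast_join` and the connected-car sample rows are hypothetical.

```python
from typing import Dict, List

Row = Dict[str, object]


def broadcast_join(partitions: List[List[Row]], small_table: List[Row], key: str) -> List[Row]:
    """Inner-join each partition against a small lookup table held fully in memory."""
    # Built once; in a cluster this map would be shipped to every worker.
    lookup = {row[key]: row for row in small_table}
    joined: List[Row] = []
    for partition in partitions:          # each partition is processed independently
        for row in partition:
            match = lookup.get(row[key])
            if match is not None:         # inner join: unmatched rows are dropped
                joined.append({**row, **match})
    return joined


# Hypothetical data: car telemetry split over two partitions, plus a small dimension table.
telemetry = [
    [{"car_id": "c1", "speed": 80}, {"car_id": "c2", "speed": 95}],
    [{"car_id": "c3", "speed": 60}, {"car_id": "c1", "speed": 70}],
]
models = [{"car_id": "c1", "model": "sedan"}, {"car_id": "c2", "model": "suv"}]

joined = broadcast_join(telemetry, models, key="car_id")
for row in joined:
    print(row)
```

The approach only pays off when one side is small enough to fit in each worker's memory; otherwise a shuffle-based join is the usual fallback.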
Data Processing Task: Transform
Apache Spark API:
Core API functions
SQL functions
UDFs
Aggregations and group functions, state-based functions
Custom functions using foreach and foreachPartition
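The aggregation and state-based functions mentioned above can be illustrated with a running per-key aggregate that is carried from one micro-batch to the next. In Spark, grouped aggregations are expressed with `groupBy` and `agg`, and arbitrary keyed state with operators such as `mapGroupsWithState`. This sketch is plain Python, not Spark code, and `update_state`, `averages`, and the sample speeds are assumptions for the example.

```python
from typing import Dict, List, Tuple

State = Dict[str, Tuple[int, float]]  # key -> (count, running total)


def update_state(state: State, batch: List[Tuple[str, float]]) -> State:
    """Fold one micro-batch of (key, value) pairs into the running state."""
    for key, value in batch:
        count, total = state.get(key, (0, 0.0))
        state[key] = (count + 1, total + value)
    return state


def averages(state: State) -> Dict[str, float]:
    """Derive the current per-key average from the state."""
    return {key: total / count for key, (count, total) in state.items()}


# Hypothetical speed readings per car, arriving in two micro-batches.
micro_batches = [
    [("c1", 80.0), ("c2", 90.0)],
    [("c1", 100.0)],
]

state: State = {}
for batch in micro_batches:
    state = update_state(state, batch)

result = averages(state)
print(result)
```

Keeping only a count and a total per key means the state stays small no matter how many events have been seen.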
Data Processing Task: Analytics
Apache Spark API:
Feature Extraction – TF-IDF, Word2Vec, CountVectorizer, FeatureHasher
Feature Transformers – OneHotEncoder, Binarizer, PCA, IndexToString, Interaction, SQLTransformer, StopWordsRemover, VectorAssembler and more
Feature Selectors – VectorSlicer, RFormula, ChiSqSelector
ML models: ClassificationModel, RegressionModel, RandomForestRegressionModel
Dataset APIs – Cube
Third-party integrations – H2O, Notebook and more
Data Processing Task: Load
Apache Spark API:
Custom Sinks – Foreach sink
File – ORC, JSON, CSV, Parquet with other compression options
Hive and RDBMS
NoSQL Databases – HBase, Cassandra, AWS DynamoDB and more
Indexing Stores – Elastic, Solr
In-Memory Distributed Caching – Redis, Ignite, Couchbase and more
Enterprise-Grade Hand-Coded Apache Spark?
Different programming model – will take a lot of re-training
Scalable platform and applications
Monitoring, DevOps challenges (debugging and diagnostics at scale?)
Version management of Spark pipelines
Promoting from Dev to Test to Production
Multi-tenancy
Manual Apache Spark coding strategy doesn’t scale
Head of Enterprise Data Platforms
Demo: A Visual IDE for Apache Spark
• ETL and Predictive Analytics
• Connected Car IoT Use Case
RECAP:
Apache Spark – the New Enterprise backbone for ETL, Batch and Real-time Streaming
Too many point-solution vendors is a problem
Apache Spark - Great candidate for consolidating all data prep and compute workloads
Increase RoI of big data lake investment and save further costs
Recommended approach - Visual Enterprise Grade Spark
Provided by StreamAnalytix from Impetus Technologies Inc.
Ingest, Cleanse, Blend, Transform, Analyze, Load, Visualize – All on one UI
Poll and Feedback – Please Respond
Do you agree that Apache Spark is a strong candidate to be the enterprise data processing backbone, as described in this webinar?
Would you be interested in a deeper dive into StreamAnalytix, a visual platform for Apache Spark, as shown in this webinar?
Webinar rating and feedback
Thank You
Questions?
Visit www.StreamAnalytix.com for a download or a cloud-based trial
Contact us at inquiry@streamanalytix.com for a proof of concept
Meet us at the Spark Summit and DataWorks Summit in June
