Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics

Big Data 2.0
HOW SPARK TECHNOLOGIES ARE RESHAPING THE
WORLD OF BIG DATA ANALYTICS
Presented By: Lillian Pierson, P.E.

Today’s webinar
Apache Spark: Journey from “Hadoop ecosystem component” to “Big
Data platform”
The story of how Spark began
Is Spark a data engineering or data science platform?
Who is using Spark and for what?
Got Spark skills? Here’s why you should

Apache Spark
JOURNEY FROM “HADOOP ECOSYSTEM
COMPONENT” TO “BIG DATA PLATFORM”

Why in-memory applications?
“In-memory computing appliances are … faster than the traditional Hadoop system because in-memory appliances don’t use MapReduce… By storing data in memory, in-memory appliances are able to bypass the time-consuming disk accesses that are required as part of the map and reduce operations that comprise the MapReduce process. In-memory data storage, processing, and analysis is fast enough to generate data analytics in real-time, derived from streaming data sources.”
Excerpt from my book: Big Data/Hadoop for Dummies

From Hadoop ecosystem
component…
HDFS
MapReduce 2.0
YARN

From Hadoop ecosystem
component…
HDFS
Spark
MapReduce 2.0
YARN

To big data platform
HDFS
MapReduce 2.0
Spark
YARN

To big data platform
Spark-as-a-Service

Spark’s 4 submodules
Spark SQL
MLlib
GraphX
Streaming

Spark SQL module
DataFrames
Spark SQL
◦ SQL
Hive
◦ HiveQL
◦ Spark Processing Engine

MLlib module
Data analysis
Statistics
Machine learning

GraphX module
Graph data storage and processing
GraphX
◦ In-memory graph data processing
HDFS
◦ Graph data storage

Streaming module
Continuously streaming data
Discretized Streams (DStreams)
Micro-batch processing

DStreams and micro-batch architecture
Diagram: a DStream shown as a sequence of RDDs (RDD @ time 1, RDD @ time 2, RDD @ time 3)
Source: http://www.slideshare.net/skpabba/hadoop-and-spark
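
To make the micro-batch idea concrete, here is a minimal sketch in plain Python. It does not use Spark or reproduce Spark's implementation; the function name `discretize`, the sample `stream`, and the one-second `batch_interval` are all hypothetical and chosen only for illustration. It shows how a continuous stream of timestamped records can be cut into small, time-ordered batches, each standing in for one "RDD @ time N" box.

```python
from collections import defaultdict


def discretize(records, batch_interval):
    """Group (timestamp, value) records into consecutive micro-batches.

    Batch k covers the half-open window
    [k * batch_interval, (k + 1) * batch_interval).
    """
    batches = defaultdict(list)
    for timestamp, value in records:
        batch_index = int(timestamp // batch_interval)
        batches[batch_index].append(value)
    # Return the batches ordered by time as (batch_index, values) pairs.
    return sorted(batches.items())


# Hypothetical stream: (seconds since start, event) pairs.
stream = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d"), (2.7, "e")]
micro_batches = discretize(stream, batch_interval=1.0)

for index, values in micro_batches:
    print(f"batch @ time {index + 1}: {values}")
```

In this sketch, intervals that receive no records are simply omitted. Each resulting batch can then be processed with ordinary batch logic.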

Basic Spark Architecture
Diagram layers:
◦ Processing modules: Spark SQL, MLlib, GraphX, Streaming
◦ Single Abstraction Layer
◦ Spark Core Libraries
◦ Resource Manager (YARN)
◦ Data Storage Layer (HDFS)
◦ Physical Hardware

Changes with Spark 2.0
Spark 1.0
◦ RDD API
Spark 1.3
◦ RDD API
◦ DataFrame API
Spark 1.6
◦ RDD API
◦ DataFrame API
◦ Dataset API
Spark 2.0
◦ Dataset API
◦ DataFrame API
◦ RDD API

Spark 1.0: RDD API
Spark 2.0: Dataset API, DataFrame API, RDD API

Structured Stream Processing
DataFrame API
Dataset API

Taking things from the
beginning…
2009
Mesos
UC Berkeley
Interactive, iterative parallel processing (in-memory)
◦ Machine learning requirements
Integrates with Hadoop ecosystem
Dr. Ion Stoica
Computer Science Professor
UC Berkeley

Databricks… the cutting edge
of Spark
Delivers Apache Spark-as-a-Service
Most popular solution for deploying Spark on
the cloud
Dr. Ion Stoica
Executive Chairman, Databricks

Databricks… the cutting edge
of Spark
Spark on an as-needed basis
Automates
◦ Cluster building and configuration
◦ Security
◦ Process monitoring
◦ Resource monitoring
Notebooks
◦ For data analysis and machine learning using Python, R, and Scala
Data visualization capabilities
◦ Data visualization and dashboard design options

Is Spark a data
engineering or data
science platform?
DATA ENGINEERING COMPONENTS AND
TECHNOLOGIES
DATA SCIENCE COMPONENTS AND TECHNOLOGIES

Spark’s data engineering
elements
Automate cluster sizing and configuration requirements
Data Storage: HDFS
Resource Management:
◦ Spark Standalone
◦ Apache Mesos
◦ Hadoop YARN

Spark’s data engineering
elements
Spark Streaming submodule: reuse the same code you use for batch
processing, but get real-time results!
◦ Integrates with big data sources, such as:
◦ HDFS
◦ Flume
◦ Kafka
◦ Twitter
◦ ZeroMQ
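
The "reuse the same code" point can be sketched in plain Python, without any Spark dependency. Everything here is an illustrative assumption rather than Spark's API: `count_words` is a hypothetical piece of batch logic, and the sample lines are made up. The sketch applies one function to a complete dataset, then applies the same function to each micro-batch as it arrives.

```python
from collections import Counter


def count_words(lines):
    """Batch logic: count word occurrences in an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


# Batch use: apply the function once to a complete, static dataset.
historical_lines = ["spark is fast", "spark is general"]
batch_counts = count_words(historical_lines)

# Streaming use: apply the same function to each micro-batch as it arrives,
# folding the per-batch results into a running total.
incoming_micro_batches = [["spark is fast"], ["spark is general"]]
running_counts = Counter()
for micro_batch in incoming_micro_batches:
    running_counts += count_words(micro_batch)

print(batch_counts == running_counts)
```

Because the streaming path only wraps the batch function in a loop over micro-batches, the analysis logic is written once and both paths give the same totals.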

Doing data science with Spark
Useful for machine learning and analysis of big data
Build big data analytics products
Programmable in Python, R, Scala, and SQL
Submodules:
◦ SQL and DataFrames
◦ MLlib for machine learning
◦ GraphX for in-memory big (graph) data computations

Doing data science with Spark
Spark integrates with the following data sources and formats:
◦ Hive, Avro, Parquet, CSV, JSON, JDBC, and HBase
◦ BI tools: Tableau, Qlik, Zoomdata, etc. (through JDBC)

Who is using
Spark and for
what?
AUTOMATIC LABS
LENDUP
SELLPOINTS
FINDIFY

Automatic Labs on Databricks
Making cars smarter with real-time analytics
Connect to, and make smart use of, your car’s data

Automatic Labs on Databricks
Automatic apps do things like:
◦ Decoding engine problems
◦ Locating parked cars
◦ Crash detection and response
◦ Low fuel warnings, etc.
Automatic is using Spark to make cars smarter with real-time analytics
During product development, Automatic needs to query, explore, and
visualize large amounts of data, QUICKLY. By moving this work over to
Spark, Automatic was able to:
◦ Validate products in days, not weeks
◦ Complete complex queries in minutes
◦ Free up 1 full-time data scientist
◦ Save $10K/month on infrastructure costs

LendUp on
Databricks
Improving the lending
process and experience
“Moving up the LendUp
Ladder means earning
access to more money, at
better rates, for longer
periods of time” - LendUp

LendUp on Databricks
LendUp uses Spark for:
◦ Feature engineering at scale
◦ Fast model building and testing
By using Spark to do this work, LendUp is able to:
◦ Build more accurate models, faster
◦ Offer more lines of credit
◦ Develop new products more quickly
◦ Increase in-house productivity of data science team

sellpoints on Databricks
Increasing ROI on ad spend
Sellpoints offers services in:
◦ Identifying qualified shoppers
◦ Driving traffic
◦ Increasing sales conversion
By moving to Databricks, sellpoints was able to:
◦ Productize a new predictive analytics offering, improving ad spend ROI
threefold compared to competitive offerings.
◦ Reduce the time and effort required to deliver actionable insights to the
business team while lowering costs.
◦ Improve productivity of the engineering and data science team by
eliminating the time spent on DevOps and maintaining open source
software.

Findify on Databricks
Improving shopping experience for ecommerce customers
Uses machine learning to continually improve search accuracy
By moving to Databricks, Findify was able to:
◦ Focus on development instead of infrastructure, allowing them to complete
feature development projects faster and reduce customer frustration
with delayed analytics.
◦ Focus on building innovative features, because the managed Spark platform
eliminated time spent on DevOps and infrastructure issues.

Got Spark skills?
Here’s why you
should
IMPACT ON SALARY
TRAINING ISSUES AND OPPORTUNITIES

How much do Spark skills pay?
2015 Data Science Salary Survey, by O’Reilly
Annual Salary Increase
◦ Spark Skills: $11,000
◦ Scala Programming: $4,000
◦ Basic Exploratory Analysis (>4 hr/wk): $4,600
◦ D3.js Skills: $8,000

Getting training and
experience in Spark
Sale: $149.50, until March 30 only
Discount code: ‘SPRING50’

Getting training and
experience in Spark
Get hands-on training in the following areas:
◦ Using RDDs
◦ Writing applications using Scala
◦ Spark SQL
◦ Spark Streaming
◦ Machine Learning in Spark (MLlib)
◦ Spark GraphX
◦ Spark Project Implementation

Why Data Science From Simplilearn
Key Features
◦ 40 hours of real-life industry project experience
◦ 25 hours of high-quality e-learning
◦ 48 hours of live instructor-led online sessions
◦ Visualize and optimize data effectively using the built-in tools in R, SAS, and Excel
◦ Get proficient in using R, SAS, and Excel to model data and predict solutions to business problems
◦ Master the concepts of statistical analysis, like linear and logistic regression, cluster analysis, and forecasting

OUR JOURNEY SO FAR
Simplilearn: World’s Largest Certification Training Destination
One of the largest collections of accredited certification training in the world.
Course categories:
◦ Project Management
◦ Digital Marketing
◦ Big Data & Analytics
◦ Business Productivity Tools
◦ Quality Management
◦ Virtualization and Cloud Computing
◦ IT Security
◦ Financial Management
◦ CompTIA Certification
◦ IT Hardware and N/W
◦ ERP
◦ IT Services and Architecture
◦ Agile and Scrum Certification
◦ OS and Database
◦ Web and App Programming
Timeline years: 2010, 2015, 2010, 2016
