SlideShare a Scribd company logo
1 of 76
Robert Hryniewicz
Data Evangelist
@RobHryniewicz
Apache Spark
Crash Course - DataWorks Summit – Berlin 2018
2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Data
3 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Data Sources
 Internet of Things (IoT)
– Wind Turbines, Oil Rigs
– Beacons, Wearables
– Smart Cars
 User Generated Content (Social, Web & Mobile)
– Twitter, Facebook, Snapchat
– Clickstream
– Paypal, Venmo
4 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
0.00
10.00
20.00
30.00
40.00
50.00
60.00
70.00
2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020
Data Growth in Zeta Bytes (ZB)
50+ ZB in 2021
5 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
The “Big Data” Problem
 A single machine cannot process or even store all the data!
Problem
Solution
 Distribute data over large clusters
Difficulty
 How to split work across machines?
 Moving data over network is expensive
 Must consider data & network locality
 How to deal with failures?
 How to deal with slow nodes?
6 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache Spark
7 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
What Is Apache Spark?
 Apache open source project
originally developed at AMPLab
(University of California Berkeley)
 Unified, general data processing
engine that operates across varied
data workloads and platforms
8 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Why Apache Spark?
 Elegant Developer APIs
– Single environment for data munging, data wrangling, and Machine Learning (ML)
 In-memory computation model – Fast!
– Effective for iterative computations and ML
 Machine Learning
– Implementation of distributed ML algorithms
– Pipeline API (Spark MLlib)
– External libraries via open & commercial projects (H2Os Sparkling Water)
9 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark SQL
Structured Data
Spark Streaming
Real-time
Spark MLlib
Machine Learning
GraphX
Graph Analysis
10 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark SQL
11 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark SQL
Structured Data
Spark Streaming
Near Real-time
Spark MLlib
Machine Learning
GraphX
Graph Analysis
12 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
More Flexible Better Storage and Performance///
13 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark SQL Overview
 Spark module for structured data processing (e.g. ORC, Parquet, Avro, MySQL)
 Two ways to manipulate data:
– DataFrame/Dataset API
– SQL query
14 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
SparkSession
 Main entry point for Spark functionality
 Allows programming with DataFrame and Dataset APIs
 Represented as spark and auto-initialized in a notebook type env. (Zeppelin or Jupyter)
What is it?
15 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
DataFrames
 Distributed collection of data organized into named
columns
 Conceptually equivalent to a table in relational DB or
a data frame in R/Python
 API available in Scala, Java, Python, and R
Col1 Col2 … … ColN
DataFrame
Column
Row
Data is described as a DataFrame
with rows, columns, and a schema
16 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Sources
CSV
Avro
HIVE
Spark SQL
Col1 Col2 … … ColN
DataFrame
Column
Row
JSON
17 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Create a DataFrame
val path = "examples/flights.json"
val flights = spark.read.json(path)
Example
18 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Register a Temporary View (SQL API)
Example
flights.createOrReplaceTempView("flightsView")
19 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Two API Examples: DataFrame and SQL APIs
flights.select("Origin", "Dest", "DepDelay”)
.filter($"DepDelay" > 15).show(5)
Results
+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
| IAD| TPA| 19|
| IND| BWI| 34|
| IND| JAX| 25|
| IND| LAS| 67|
| IND| MCO| 94|
+------+----+--------+
SELECT Origin, Dest, DepDelay
FROM flightsView
WHERE DepDelay > 15 LIMIT 5
SQL API
DataFrame API
20 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark Streaming
21 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark SQL
Structured Data
Spark Streaming
Real-time
Spark MLlib
Machine Learning
GraphX
Graph Analysis
22 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
What is Stream Processing?
Batch Processing
• Ability to process and analyze data at-rest (stored data)
• Request-based, bulk evaluation and short-lived processing
• Enabler for Retrospective, Reactive and On-demand Analytics
Stream Processing
• Ability to ingest, process and analyze data in-motion in real- or near-real-time
• Event or micro-batch driven, continuous evaluation and long-lived processing
• Enabler for real-time Prospective, Proactive and Predictive Analytics for Next Best
Action
Stream Processing + Batch Processing = All Data Analytics
real-time (now) historical (past)
23 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Next Generation Analytics
Iterative & Exploratory
Data is the structure
Traditional Analytics
Structured & Repeatable
Structure built to store data
23
Modern Data Applications approach to Insights
Start with hypothesis
Test against selected data
Data leads the way
Explore all data, identify correlations
Analyze after landing… Analyze in motion…
24 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark Streaming
 Extension of Spark Core API
 Stream processing of live data streams
– Scalable
– High-throughput
– Fault-tolerant
Overview
ZeroMQ
MQTT
No longer
supported
in
Spark 2.x
25 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark Streaming
26 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark Streaming
Discretized Streams (DStreams)
 High-level abstraction representing continuous stream of data
 Internally represented as a sequence of RDDs
 Operation applied on a DStream translates to operations on the underlying RDDs
27 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark Streaming
Example: flatMap operation
28 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark Streaming
 Apply transformations over a sliding window of data, e.g. rolling average
Window Operations
29 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Challenges in Streaming Data
 Consistency
 Fault tolerance
 Out-of-order data
30 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Structured Streaming
 High-Level APIs - DataFrames, Datasets and SQL. Same in streaming and in batch
 Event-time Processing - Native support for working w/ out -of-order and late data
 End-to-end Exactly Once - Transactional both in processing and output
31 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Structured Streaming: Basics
32 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Structured Streaming: Model
33 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Handling late arriving data
34 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark MLlib
35 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark SQL
Structured Data
Spark Streaming
Near Real-time
Spark MLlib
Machine Learning
GraphX
Graph Analysis
36 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark ML Pipeline
 fit() is for training
 transform() is for prediction
Input
DataFrame
(TRAIN)
Input
DataFrame
(TEST)
Output
Dataframe
(PREDICTIONS)
Pipeline
Pipeline Model
fit()
transform()
Train
Predict
37 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark ML Pipeline
Feature
transform
1
Feature
transform
2
Combine
features
Linear
Regression
Input
DataFrame
Input
DataFrame
Output
DataFrame
Pipeline
Pipeline Model
Train
Predict
Export Model
38 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Sample Spark ML Pipeline
indexer = …
parser = …
hashingTF = …
vecAssembler = …
rf = RandomForestClassifier(numTrees=100)
pipe = Pipeline(stages=[indexer, parser, hashingTF, vecAssembler, rf])
model = pipe.fit(trainData) # Train model
results = model.transform(testData) # Test model
39 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Exporting ML Models - PMML
 Predictive Model Markup Language (PMML)
–> XML-based predictive model interchange format
 Supported models
–K-Means
–Linear Regression
–Ridge Regression
–Lasso
–SVM
–Binary
40 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark GraphX
41 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark SQL
Structured Data
Spark Streaming
Near Real-time
Spark MLlib
Machine Learning
GraphX
Graph Analysis
42 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
 Page Rank
 Topic Modeling (LDA)
 Community Detection
Source: ampcamp.berkeley.edu
43 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
44 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
GraphX Algorithms
 PageRank
 Connected components
 Label propagation
 SVD++
 Strongly connected components
 Triangle count
45 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Sample GraphX Code in Scala
graph = Graph(vertices, edges)
messages = spark.textFile("hdfs://...")
graph2 = graph.joinVertices(messages) {
(id, vertex, msg) => ...
}
46 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache Zeppelin
47 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
What’s Apache Zeppelin?
Web-based notebook
that enables interactive
data analytics.
You can make beautiful
data-driven, interactive
and collaborative
documents with SQL,
Python, Scala and more
48 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache Zeppelin with HDP 2.6+
• Data exploration and discovery
• Visualization
• Interactive snippet-at-a-time
experience
• “Modern Data Science Studio”
Features
• Ad-hoc experimentation
• Deeply integrated with
Spark + Hadoop
• Supports multiple
language backends
• Incubating at Apache
Use Case
Web-based Notebook for interactive analytics
49 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
50 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
How does Zeppelin work?
Notebook
Author
Collaborators/
Report viewers
Zeppelin
Cluster
Spark | Hive | HBase
Any of 30+ back ends
51 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Big Data Lifecycle
Collect
ETL /
Process
Analysis
Report
Data
Product
Business user
Customer
Data ScientistData Engineer
52 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Zeppelin Multitenancy
53 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Livy
 Livy is the open source REST interface for interacting with Apache Spark from anywhere
 Installed as Spark Ambari Service
Livy Client
HTTP HTTP (RPC)
Spark Interactive Session
SparkContext
Spark Batch Session
SparkContext
Livy Server
54 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Security Across Zeppelin-Livy-Spark
Shiro
Ispark Group Interpreter
SPNego: Kerberos Kerberos
Livy APIs
Spark on YARN
Zeppelin
Driver
LDAP
Livy Server
55 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Reasons to Integrate with Livy
 Bring Sessions to Apache Zeppelin
– Isolation
– Session sharing
 Enable efficient cluster resource utilization
– Default Spark interpreter keeps YARN/Spark job running forever
– Livy interpreter recycled after 60 minutes of inactivity
(controlled by livy.server.session.timeout)
 To Identity Propagation
– Send user identity from Zeppelin > Livy > Spark on YARN
56 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Livy Server
SparkSession Sharing
Session-2
Session-1
SparkSession-1
SparkContext
SparkSession-2
SparkContext
Client 1
Client 2
Client 3
Session-1
Session-1
Session-2
57 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache Zeppelin + Livy End-to-End Security
Ispark Group Interpreter
SPNego: Kerberos Kerberos/RPC
Livy APIs
Spark on YARN
Zeppelin
LDAP
Livy Server
Job runs as
Tommy Callahan
Tommy Callahan
58 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
HDP Basics
59 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
 Zeppelin  Interactive notebook
 Spark
 YARN  Resource Management
 HDFS  Distributed Storage Layer (4M files)
– Future: Ozone object storeYARN
Scala
Java
Python
R
APIs
Spark Core Engine
Spark
SQL
Spark
Streaming
MLlib GraphX
1 ° ° ° ° ° ° ° ° °
° ° ° ° ° ° ° ° ° °
°
N
HDFS
63 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Hortonworks Data Platform
64 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Sample Architecture
68 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
69 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Managed Dataflow
SOURCES
REGIONAL
INFRASTRUCTURE
CORE
INFRASTRUCTURE
72 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
High-Level Overview
IoT Edge
(single node)
IoT Edge
(single node)
IoT Devices
IoT Devices
NiFi Hub Data Broker
Column
DB
Data
Store
Live Dashboard
Data Center
(on prem/cloud)
HDFS/Ozone HBase/Cassandra
73 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark 2.x & HDP 2.x
What’s New?
74 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
What’s New
 Future HDP / Spark 2.3
– Spark Structured Streaming latency in single-digit milliseconds in continuous mode in stream
processing (instead of 100ms we’d normally see with micro batching)
– stream-to-stream joins
– PySpark boost by improving performance with pandas UDFs
– runs on Kubernetes clusters by providing native support for Apache Spark applications
 HDP 2.6.4 / Spark 2.2
– Structured Streaming GA
– Yahoo! Benchmark: 65M rec/s
– ORC feature & performance improvements  Parquet Parity
 HDP 2.6.3 / Spark 2.1
– Spark SQL Ranger integration for row and column security
– DataSet API GA
– GraphX GA
75 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
76 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark 2.3
77 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark SQL
Structured Data
Spark Streaming
Near Real-time
Spark MLlib
Machine Learning
GraphX
Graph Analysis
78 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
DSX + HDP
79 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
DSX
HDP
Data Science Experience (DSX) Local
Enterprise Data Science platform for teams
Livy REST interface
Hortonworks Data Platform (HDP)
Enterprise compute (Spark/Hive) & storage
(HDFS/Ozone)
80 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Lab
81 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Hortonworks Community Connection
• Full Q&A Platform (like StackOverflow)
• Knowledge Base Articles
• Code Samples and Repositories
community.hortonworks.com
82 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Community Engagement
community.hortonworks.com
20k+
Registered Users
45k+
Answers
100k+
Technical Assets
83 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Future of Data Meetups
Robert Hryniewicz
@RobHryniewicz
Thanks!

More Related Content

What's hot

Major advancements in Apache Hive towards full support of SQL compliance
Major advancements in Apache Hive towards full support of SQL complianceMajor advancements in Apache Hive towards full support of SQL compliance
Major advancements in Apache Hive towards full support of SQL complianceDataWorks Summit/Hadoop Summit
 
Big Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and Parquet
Big Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and ParquetBig Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and Parquet
Big Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and ParquetDataWorks Summit
 
Transactional operations in Apache Hive: present and future
Transactional operations in Apache Hive: present and futureTransactional operations in Apache Hive: present and future
Transactional operations in Apache Hive: present and futureDataWorks Summit
 
Bringing complex event processing to Spark streaming
Bringing complex event processing to Spark streamingBringing complex event processing to Spark streaming
Bringing complex event processing to Spark streamingDataWorks Summit
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesDataWorks Summit
 
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinIntro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinAlex Zeltov
 
YARN Ready: Apache Spark
YARN Ready: Apache Spark YARN Ready: Apache Spark
YARN Ready: Apache Spark Hortonworks
 
Interactive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using DruidInteractive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using DruidDataWorks Summit/Hadoop Summit
 
The Future of Apache Ambari
The Future of Apache AmbariThe Future of Apache Ambari
The Future of Apache AmbariDataWorks Summit
 
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...DataWorks Summit/Hadoop Summit
 
Enabling Apache Zeppelin and Spark for Data Science in the Enterprise
Enabling Apache Zeppelin and Spark for Data Science in the EnterpriseEnabling Apache Zeppelin and Spark for Data Science in the Enterprise
Enabling Apache Zeppelin and Spark for Data Science in the EnterpriseDataWorks Summit/Hadoop Summit
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem DataWorks Summit/Hadoop Summit
 
SAM—streaming analytics made easy
SAM—streaming analytics made easySAM—streaming analytics made easy
SAM—streaming analytics made easyDataWorks Summit
 
Zero ETL analytics with LLAP in Azure HDInsight
Zero ETL analytics with LLAP in Azure HDInsightZero ETL analytics with LLAP in Azure HDInsight
Zero ETL analytics with LLAP in Azure HDInsightDataWorks Summit
 
Realizing the promise of portable data processing with Apache Beam
Realizing the promise of portable data processing with Apache BeamRealizing the promise of portable data processing with Apache Beam
Realizing the promise of portable data processing with Apache BeamDataWorks Summit
 
Why Kubernetes as a container orchestrator is a right choice for running spar...
Why Kubernetes as a container orchestrator is a right choice for running spar...Why Kubernetes as a container orchestrator is a right choice for running spar...
Why Kubernetes as a container orchestrator is a right choice for running spar...DataWorks Summit
 
LLAP: Building Cloud First BI
LLAP: Building Cloud First BILLAP: Building Cloud First BI
LLAP: Building Cloud First BIDataWorks Summit
 
Unified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache BeamUnified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache BeamDataWorks Summit/Hadoop Summit
 

What's hot (20)

Major advancements in Apache Hive towards full support of SQL compliance
Major advancements in Apache Hive towards full support of SQL complianceMajor advancements in Apache Hive towards full support of SQL compliance
Major advancements in Apache Hive towards full support of SQL compliance
 
Big Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and Parquet
Big Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and ParquetBig Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and Parquet
Big Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and Parquet
 
Transactional operations in Apache Hive: present and future
Transactional operations in Apache Hive: present and futureTransactional operations in Apache Hive: present and future
Transactional operations in Apache Hive: present and future
 
Bringing complex event processing to Spark streaming
Bringing complex event processing to Spark streamingBringing complex event processing to Spark streaming
Bringing complex event processing to Spark streaming
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
 
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinIntro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
 
YARN Ready: Apache Spark
YARN Ready: Apache Spark YARN Ready: Apache Spark
YARN Ready: Apache Spark
 
Interactive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using DruidInteractive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using Druid
 
The Future of Apache Ambari
The Future of Apache AmbariThe Future of Apache Ambari
The Future of Apache Ambari
 
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
Hadoop 3 in a Nutshell
Hadoop 3 in a NutshellHadoop 3 in a Nutshell
Hadoop 3 in a Nutshell
 
Enabling Apache Zeppelin and Spark for Data Science in the Enterprise
Enabling Apache Zeppelin and Spark for Data Science in the EnterpriseEnabling Apache Zeppelin and Spark for Data Science in the Enterprise
Enabling Apache Zeppelin and Spark for Data Science in the Enterprise
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem
 
SAM—streaming analytics made easy
SAM—streaming analytics made easySAM—streaming analytics made easy
SAM—streaming analytics made easy
 
Zero ETL analytics with LLAP in Azure HDInsight
Zero ETL analytics with LLAP in Azure HDInsightZero ETL analytics with LLAP in Azure HDInsight
Zero ETL analytics with LLAP in Azure HDInsight
 
Realizing the promise of portable data processing with Apache Beam
Realizing the promise of portable data processing with Apache BeamRealizing the promise of portable data processing with Apache Beam
Realizing the promise of portable data processing with Apache Beam
 
Why Kubernetes as a container orchestrator is a right choice for running spar...
Why Kubernetes as a container orchestrator is a right choice for running spar...Why Kubernetes as a container orchestrator is a right choice for running spar...
Why Kubernetes as a container orchestrator is a right choice for running spar...
 
LLAP: Building Cloud First BI
LLAP: Building Cloud First BILLAP: Building Cloud First BI
LLAP: Building Cloud First BI
 
Unified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache BeamUnified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache Beam
 

Similar to Apache Spark Crash Course

Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin
Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin
Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin DataWorks Summit/Hadoop Summit
 
Intro to Spark with Zeppelin
Intro to Spark with ZeppelinIntro to Spark with Zeppelin
Intro to Spark with ZeppelinHortonworks
 
Intro to Spark with Zeppelin Crash Course Hadoop Summit SJ
Intro to Spark with Zeppelin Crash Course Hadoop Summit SJIntro to Spark with Zeppelin Crash Course Hadoop Summit SJ
Intro to Spark with Zeppelin Crash Course Hadoop Summit SJDaniel Madrigal
 
Apache Spark: Lightning Fast Cluster Computing
Apache Spark: Lightning Fast Cluster ComputingApache Spark: Lightning Fast Cluster Computing
Apache Spark: Lightning Fast Cluster ComputingAll Things Open
 
Paris FOD Meetup #5 Hortonworks Presentation
Paris FOD Meetup #5 Hortonworks PresentationParis FOD Meetup #5 Hortonworks Presentation
Paris FOD Meetup #5 Hortonworks PresentationAbdelkrim Hadjidj
 
Spark Summit EU talk by Steve Loughran
Spark Summit EU talk by Steve LoughranSpark Summit EU talk by Steve Loughran
Spark Summit EU talk by Steve LoughranSpark Summit
 
Big Data Everywhere Chicago: SQL on Hadoop
Big Data Everywhere Chicago: SQL on Hadoop Big Data Everywhere Chicago: SQL on Hadoop
Big Data Everywhere Chicago: SQL on Hadoop BigDataEverywhere
 
Storm Demo Talk - Denver Apr 2015
Storm Demo Talk - Denver Apr 2015Storm Demo Talk - Denver Apr 2015
Storm Demo Talk - Denver Apr 2015Mac Moore
 
Storm Demo Talk - Colorado Springs May 2015
Storm Demo Talk - Colorado Springs May 2015Storm Demo Talk - Colorado Springs May 2015
Storm Demo Talk - Colorado Springs May 2015Mac Moore
 
Log Analytics Optimization
Log Analytics OptimizationLog Analytics Optimization
Log Analytics OptimizationHortonworks
 
Log Analytics Optimization
Log Analytics OptimizationLog Analytics Optimization
Log Analytics OptimizationIsheeta Sanghi
 
Enabling the Real Time Analytical Enterprise
Enabling the Real Time Analytical EnterpriseEnabling the Real Time Analytical Enterprise
Enabling the Real Time Analytical EnterpriseHortonworks
 
Fast Access to Your Data - Avro, JSON, ORC, and Parquet
Fast Access to Your Data - Avro, JSON, ORC, and ParquetFast Access to Your Data - Avro, JSON, ORC, and Parquet
Fast Access to Your Data - Avro, JSON, ORC, and ParquetOwen O'Malley
 
SAM - Streaming Analytics Made Easy
SAM - Streaming Analytics Made EasySAM - Streaming Analytics Made Easy
SAM - Streaming Analytics Made EasyDataWorks Summit
 
Unlocking insights in streaming data
Unlocking insights in streaming dataUnlocking insights in streaming data
Unlocking insights in streaming dataCarolyn Duby
 

Similar to Apache Spark Crash Course (20)

Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin
Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin
Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin
 
Apache Spark Crash Course
Apache Spark Crash CourseApache Spark Crash Course
Apache Spark Crash Course
 
Intro to Spark with Zeppelin
Intro to Spark with ZeppelinIntro to Spark with Zeppelin
Intro to Spark with Zeppelin
 
Apache Spark Crash Course
Apache Spark Crash CourseApache Spark Crash Course
Apache Spark Crash Course
 
#HSTokyo16 Apache Spark Crash Course
#HSTokyo16 Apache Spark Crash Course #HSTokyo16 Apache Spark Crash Course
#HSTokyo16 Apache Spark Crash Course
 
Intro to Spark with Zeppelin Crash Course Hadoop Summit SJ
Intro to Spark with Zeppelin Crash Course Hadoop Summit SJIntro to Spark with Zeppelin Crash Course Hadoop Summit SJ
Intro to Spark with Zeppelin Crash Course Hadoop Summit SJ
 
Intro to Spark & Zeppelin - Crash Course - HS16SJ
Intro to Spark & Zeppelin - Crash Course - HS16SJIntro to Spark & Zeppelin - Crash Course - HS16SJ
Intro to Spark & Zeppelin - Crash Course - HS16SJ
 
Apache Spark: Lightning Fast Cluster Computing
Apache Spark: Lightning Fast Cluster ComputingApache Spark: Lightning Fast Cluster Computing
Apache Spark: Lightning Fast Cluster Computing
 
Paris FOD Meetup #5 Hortonworks Presentation
Paris FOD Meetup #5 Hortonworks PresentationParis FOD Meetup #5 Hortonworks Presentation
Paris FOD Meetup #5 Hortonworks Presentation
 
Spark Summit EU talk by Steve Loughran
Spark Summit EU talk by Steve LoughranSpark Summit EU talk by Steve Loughran
Spark Summit EU talk by Steve Loughran
 
Big Data Everywhere Chicago: SQL on Hadoop
Big Data Everywhere Chicago: SQL on Hadoop Big Data Everywhere Chicago: SQL on Hadoop
Big Data Everywhere Chicago: SQL on Hadoop
 
Storm Demo Talk - Denver Apr 2015
Storm Demo Talk - Denver Apr 2015Storm Demo Talk - Denver Apr 2015
Storm Demo Talk - Denver Apr 2015
 
Storm Demo Talk - Colorado Springs May 2015
Storm Demo Talk - Colorado Springs May 2015Storm Demo Talk - Colorado Springs May 2015
Storm Demo Talk - Colorado Springs May 2015
 
Log Analytics Optimization
Log Analytics OptimizationLog Analytics Optimization
Log Analytics Optimization
 
Log Analytics Optimization
Log Analytics OptimizationLog Analytics Optimization
Log Analytics Optimization
 
Enabling the Real Time Analytical Enterprise
Enabling the Real Time Analytical EnterpriseEnabling the Real Time Analytical Enterprise
Enabling the Real Time Analytical Enterprise
 
Fast Access to Your Data - Avro, JSON, ORC, and Parquet
Fast Access to Your Data - Avro, JSON, ORC, and ParquetFast Access to Your Data - Avro, JSON, ORC, and Parquet
Fast Access to Your Data - Avro, JSON, ORC, and Parquet
 
Streaming analytics manager
Streaming analytics managerStreaming analytics manager
Streaming analytics manager
 
SAM - Streaming Analytics Made Easy
SAM - Streaming Analytics Made EasySAM - Streaming Analytics Made Easy
SAM - Streaming Analytics Made Easy
 
Unlocking insights in streaming data
Unlocking insights in streaming dataUnlocking insights in streaming data
Unlocking insights in streaming data
 

More from DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetHyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetEnjoy Anytime
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 

Recently uploaded (20)

Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetHyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 

Apache Spark Crash Course

  • 1. Robert Hryniewicz Data Evangelist @RobHryniewicz Apache Spark Crash Course - DataWorks Summit – Berlin 2018
  • 2. 2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Data
  • 3. 3 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Data Sources  Internet of Things (IoT) – Wind Turbines, Oil Rigs – Beacons, Wearables – Smart Cars  User Generated Content (Social, Web & Mobile) – Twitter, Facebook, Snapchat – Clickstream – Paypal, Venmo
  • 4. 4 © Hortonworks Inc. 2011 – 2016. All Rights Reserved 0.00 10.00 20.00 30.00 40.00 50.00 60.00 70.00 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 Data Growth in Zeta Bytes (ZB) 50+ ZB in 2021
  • 5. 5 © Hortonworks Inc. 2011 – 2016. All Rights Reserved The “Big Data” Problem  A single machine cannot process or even store all the data! Problem Solution  Distribute data over large clusters Difficulty  How to split work across machines?  Moving data over network is expensive  Must consider data & network locality  How to deal with failures?  How to deal with slow nodes?
  • 6. 6 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache Spark
  • 7. 7 © Hortonworks Inc. 2011 – 2016. All Rights Reserved What Is Apache Spark?  Apache open source project originally developed at AMPLab (University of California Berkeley)  Unified, general data processing engine that operates across varied data workloads and platforms
  • 8. 8 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Why Apache Spark?  Elegant Developer APIs – Single environment for data munging, data wrangling, and Machine Learning (ML)  In-memory computation model – Fast! – Effective for iterative computations and ML  Machine Learning – Implementation of distributed ML algorithms – Pipeline API (Spark MLlib) – External libraries via open & commercial projects (H2Os Sparkling Water)
  • 9. 9 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Spark SQL Structured Data Spark Streaming Real-time Spark MLlib Machine Learning GraphX Graph Analysis
  • 10. 10 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Spark SQL
  • 11. 11 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Spark SQL Structured Data Spark Streaming Near Real-time Spark MLlib Machine Learning GraphX Graph Analysis
  • 12. 12 © Hortonworks Inc. 2011 – 2016. All Rights Reserved More Flexible Better Storage and Performance///
  • 13. 13 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Spark SQL Overview  Spark module for structured data processing (e.g. ORC, Parquet, Avro, MySQL)  Two ways to manipulate data: – DataFrame/Dataset API – SQL query
  • 14. 14 © Hortonworks Inc. 2011 – 2016. All Rights Reserved SparkSession  Main entry point for Spark functionality  Allows programming with DataFrame and Dataset APIs  Represented as spark and auto-initialized in a notebook type env. (Zeppelin or Jupyter) What is it?
  • 15. 15 © Hortonworks Inc. 2011 – 2016. All Rights Reserved DataFrames  Distributed collection of data organized into named columns  Conceptually equivalent to a table in relational DB or a data frame in R/Python  API available in Scala, Java, Python, and R Col1 Col2 … … ColN DataFrame Column Row Data is described as a DataFrame with rows, columns, and a schema
  • 16. 16 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Sources CSV Avro HIVE Spark SQL Col1 Col2 … … ColN DataFrame Column Row JSON
  • 17. 17 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Create a DataFrame val path = "examples/flights.json" val flights = spark.read.json(path) Example
  • 18. 18 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Register a Temporary View (SQL API) Example flights.createOrReplaceTempView("flightsView")
  • 19. 19 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Two API Examples: DataFrame and SQL APIs flights.select("Origin", "Dest", "DepDelay”) .filter($"DepDelay" > 15).show(5) Results +------+----+--------+ |Origin|Dest|DepDelay| +------+----+--------+ | IAD| TPA| 19| | IND| BWI| 34| | IND| JAX| 25| | IND| LAS| 67| | IND| MCO| 94| +------+----+--------+ SELECT Origin, Dest, DepDelay FROM flightsView WHERE DepDelay > 15 LIMIT 5 SQL API DataFrame API
  • 20. 20 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Spark Streaming
  • 21. 21 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Spark SQL Structured Data Spark Streaming Real-time Spark MLlib Machine Learning GraphX Graph Analysis
  • 22. 22 © Hortonworks Inc. 2011 – 2016. All Rights Reserved What is Stream Processing? Batch Processing • Ability to process and analyze data at-rest (stored data) • Request-based, bulk evaluation and short-lived processing • Enabler for Retrospective, Reactive and On-demand Analytics Stream Processing • Ability to ingest, process and analyze data in-motion in real- or near-real-time • Event or micro-batch driven, continuous evaluation and long-lived processing • Enabler for real-time Prospective, Proactive and Predictive Analytics for Next Best Action Stream Processing + Batch Processing = All Data Analytics real-time (now) historical (past)
  • 23. 23 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Next Generation Analytics Iterative & Exploratory Data is the structure Traditional Analytics Structured & Repeatable Structure built to store data 23 Modern Data Applications approach to Insights Start with hypothesis Test against selected data Data leads the way Explore all data, identify correlations Analyze after landing… Analyze in motion…
  • 24. 24 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Spark Streaming  Extension of Spark Core API  Stream processing of live data streams – Scalable – High-throughput – Fault-tolerant Overview ZeroMQ MQTT No longer supported in Spark 2.x
  • 25. 25 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Spark Streaming
  • 26. 26 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Spark Streaming Discretized Streams (DStreams)  High-level abstraction representing continuous stream of data  Internally represented as a sequence of RDDs  Operation applied on a DStream translates to operations on the underlying RDDs
  • 27. 27 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Spark Streaming Example: flatMap operation
  • 28. 28 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Spark Streaming  Apply transformations over a sliding window of data, e.g. rolling average Window Operations
  • 29. 29 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Challenges in Streaming Data  Consistency  Fault tolerance  Out-of-order data
  • 30. 30 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Structured Streaming  High-Level APIs - DataFrames, Datasets and SQL. Same in streaming and in batch  Event-time Processing - Native support for working w/ out -of-order and late data  End-to-end Exactly Once - Transactional both in processing and output
  • 31. 31 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Structured Streaming: Basics
  • 32. 32 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Structured Streaming: Model
  • 33. 33 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Handling late arriving data
  • 34. 34 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Spark MLlib
  • 35. 35 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Spark SQL Structured Data Spark Streaming Near Real-time Spark MLlib Machine Learning GraphX Graph Analysis
  • 36. 36 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Spark ML Pipeline  fit() is for training  transform() is for prediction Input DataFrame (TRAIN) Input DataFrame (TEST) Output Dataframe (PREDICTIONS) Pipeline Pipeline Model fit() transform() Train Predict
  • 37. 37 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Spark ML Pipeline Feature transform 1 Feature transform 2 Combine features Linear Regression Input DataFrame Input DataFrame Output DataFrame Pipeline Pipeline Model Train Predict Export Model
  • 38. 38 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Sample Spark ML Pipeline indexer = … parser = … hashingTF = … vecAssembler = … rf = RandomForestClassifier(numTrees=100) pipe = Pipeline(stages=[indexer, parser, hashingTF, vecAssembler, rf]) model = pipe.fit(trainData) # Train model results = model.transform(testData) # Test model
  • 39. 39 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Exporting ML Models - PMML  Predictive Model Markup Language (PMML) –> XML-based predictive model interchange format  Supported models –K-Means –Linear Regression –Ridge Regression –Lasso –SVM –Binary
  • 40. 40 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Spark GraphX
  • 41. 41 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Spark SQL Structured Data Spark Streaming Near Real-time Spark MLlib Machine Learning GraphX Graph Analysis
  • 42. 42 © Hortonworks Inc. 2011 – 2016. All Rights Reserved  Page Rank  Topic Modeling (LDA)  Community Detection Source: ampcamp.berkeley.edu
  • 43. 43 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
  • 44. 44 © Hortonworks Inc. 2011 – 2016. All Rights Reserved GraphX Algorithms  PageRank  Connected components  Label propagation  SVD++  Strongly connected components  Triangle count
  • 45. 45 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Sample GraphX Code in Scala graph = Graph(vertices, edges) messages = spark.textFile("hdfs://...") graph2 = graph.joinVertices(messages) { (id, vertex, msg) => ... }
  • 46. 46 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache Zeppelin
  • 47. 47 © Hortonworks Inc. 2011 – 2016. All Rights Reserved What’s Apache Zeppelin? Web-based notebook that enables interactive data analytics. You can make beautiful data-driven, interactive and collaborative documents with SQL, Python, Scala and more
  • 48. 48 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache Zeppelin with HDP 2.6+ • Data exploration and discovery • Visualization • Interactive snippet-at-a-time experience • “Modern Data Science Studio” Features • Ad-hoc experimentation • Deeply integrated with Spark + Hadoop • Supports multiple language backends • Incubating at Apache Use Case Web-based Notebook for interactive analytics
  • 49. 49 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
  • 50. 50 © Hortonworks Inc. 2011 – 2016. All Rights Reserved How does Zeppelin work? Notebook Author Collaborators/ Report viewers Zeppelin Cluster Spark | Hive | HBase Any of 30+ back ends
  • 51. 51 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Big Data Lifecycle Collect ETL / Process Analysis Report Data Product Business user Customer Data ScientistData Engineer
  • 52. 52 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Zeppelin Multitenancy
  • 53. 53 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Livy  Livy is the open source REST interface for interacting with Apache Spark from anywhere  Installed as Spark Ambari Service Livy Client HTTP HTTP (RPC) Spark Interactive Session SparkContext Spark Batch Session SparkContext Livy Server
  • 54. 54 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Security Across Zeppelin-Livy-Spark Shiro Ispark Group Interpreter SPNego: Kerberos Kerberos Livy APIs Spark on YARN Zeppelin Driver LDAP Livy Server
  • 55. 55 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Reasons to Integrate with Livy  Bring Sessions to Apache Zeppelin – Isolation – Session sharing  Enable efficient cluster resource utilization – Default Spark interpreter keeps YARN/Spark job running forever – Livy interpreter recycled after 60 minutes of inactivity (controlled by livy.server.session.timeout)  To Identity Propagation – Send user identity from Zeppelin > Livy > Spark on YARN
  • 56. 56 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Livy Server SparkSession Sharing Session-2 Session-1 SparkSession-1 SparkContext SparkSession-2 SparkContext Client 1 Client 2 Client 3 Session-1 Session-1 Session-2
  • 57. 57 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache Zeppelin + Livy End-to-End Security Ispark Group Interpreter SPNego: Kerberos Kerberos/RPC Livy APIs Spark on YARN Zeppelin LDAP Livy Server Job runs as Tommy Callahan Tommy Callahan
  • 58. 58 © Hortonworks Inc. 2011 – 2016. All Rights Reserved HDP Basics
  • 59. 59 © Hortonworks Inc. 2011 – 2016. All Rights Reserved  Zeppelin  Interactive notebook  Spark  YARN  Resource Management  HDFS  Distributed Storage Layer (4M files) – Future: Ozone object storeYARN Scala Java Python R APIs Spark Core Engine Spark SQL Spark Streaming MLlib GraphX 1 ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° N HDFS
  • 60. 63 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Hortonworks Data Platform
  • 61. 64 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Sample Architecture
  • 62. 68 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
  • 63. 69 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Managed Dataflow SOURCES REGIONAL INFRASTRUCTURE CORE INFRASTRUCTURE
  • 64. 72 © Hortonworks Inc. 2011 – 2016. All Rights Reserved High-Level Overview IoT Edge (single node) IoT Edge (single node) IoT Devices IoT Devices NiFi Hub Data Broker Column DB Data Store Live Dashboard Data Center (on prem/cloud) HDFS/Ozone HBase/Cassandra
  • 65. 73 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Spark 2.x & HDP 2.x What’s New?
  • 66. 74 © Hortonworks Inc. 2011 – 2016. All Rights Reserved What’s New  Future HDP / Spark 2.3 – Spark Structured Streaming latency in single-digit milliseconds in continuous mode in stream processing (instead of 100ms we’d normally see with micro batching) – stream-to-stream joins – PySpark boost by improving performance with pandas UDFs – runs on Kubernetes clusters by providing native support for Apache Spark applications  HDP 2.6.4 / Spark 2.2 – Structured Streaming GA – Yahoo! Benchmark: 65M rec/s – ORC feature & performance improvements  Parquet Parity  HDP 2.6.3 / Spark 2.1 – Spark SQL Ranger integration for row and column security – DataSet API GA – GraphX GA
  • 67. 75 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
  • 68. 76 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Spark 2.3
  • 69. 77 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Spark SQL Structured Data Spark Streaming Near Real-time Spark MLlib Machine Learning GraphX Graph Analysis
  • 70. 78 © Hortonworks Inc. 2011 – 2016. All Rights Reserved DSX + HDP
  • 71. 79 © Hortonworks Inc. 2011 – 2016. All Rights Reserved DSX HDP Data Science Experience (DSX) Local Enterprise Data Science platform for teams Livy REST interface Hortonworks Data Platform (HDP) Enterprise compute (Spark/Hive) & storage (HDFS/Ozone)
  • 72. 80 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Lab
  • 73. 81 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Hortonworks Community Connection • Full Q&A Platform (like StackOverflow) • Knowledge Base Articles • Code Samples and Repositories community.hortonworks.com
  • 74. 82 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Community Engagement community.hortonworks.com 20k+ Registered Users 45k+ Answers 100k+ Technical Assets
  • 75. 83 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Future of Data Meetups

Editor's Notes

  1. 1ZB = 1B * 1 TB