Introduction to Apache Spark
www.mammothdata.com | @mammothdataco
The Leader in Big Data Consulting
● BI/Data Strategy
○ Development of a business intelligence/data architecture strategy.
● Installation
○ Installation of Hadoop or relevant technology.
● Data Consolidation
○ Load data from diverse sources into a single scalable repository.
● Streaming
○ Mammoth will write ingestion and/or analytics that operate on the data as it comes in, and design dashboards, feeds, or computer-driven decision-making processes to derive insights and make decisions.
● Visualization Tools
○ Mammoth will set up a visualization tool (e.g., Tableau, Pentaho), create initial reports, and train the employees who will analyze the data.
Mammoth Data, based in downtown Durham (right above Toast)
● Lead Consultant on all things DevOps and Spark
● @carsondial
Me!
● Apache Spark™ is a fast and general engine for large-scale data processing
● Not all that helpful, is it?
What Is Apache Spark?!
● Framework for massively parallel computing on a cluster
● Harnessing power of cheap memory
● Directed Acyclic Graph (DAG) computing engine
● It goes very fast!
● Apache Project (spark.apache.org)
What Is Apache Spark?! No, But Really…
● Performance
● Developer productivity
Why Spark?
● GraySort benchmark (100 TB)
● Hadoop - 72 minutes / 2100 nodes / datacentre
● Spark - 23 minutes / 206 nodes / AWS
● HDFS versus Memory
Performance!
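The figures on this slide can be put side by side with a little arithmetic. This is a rough sketch using only the numbers quoted above; "node-minutes" is an illustrative measure introduced here, not an official benchmark metric, and it ignores hardware differences between the two clusters.

```python
# Figures as quoted on the slide (GraySort, 100 TB).
hadoop_minutes, hadoop_nodes = 72, 2100
spark_minutes, spark_nodes = 23, 206

# How much faster, and on how many fewer machines?
speedup = hadoop_minutes / spark_minutes
node_ratio = hadoop_nodes / spark_nodes

# "Node-minutes" (nodes x minutes) is an illustrative measure of the
# total cluster time consumed by each run.
hadoop_node_minutes = hadoop_minutes * hadoop_nodes
spark_node_minutes = spark_minutes * spark_nodes
resource_ratio = hadoop_node_minutes / spark_node_minutes

print(f"Spark finished {speedup:.1f}x faster")
print(f"on {node_ratio:.1f}x fewer nodes")
print(f"using roughly {resource_ratio:.0f}x fewer node-minutes")
```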
● First-class support for Scala, Java, Python, and R!
● Data Science friendly
Developers!
Word Count: Hadoop
from pyspark import SparkContext

logFile = "hdfs:///input"
sc = SparkContext("spark://spark-m:7077", "WordCount")
textFile = sc.textFile(logFile)
# Split lines into words, pair each word with 1, then sum the 1s per word
wordCounts = (textFile.flatMap(lambda line: line.split())
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b))
wordCounts.saveAsTextFile("hdfs:///output")
Word Count: Spark
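To see what the three chained steps compute without a cluster, here is a plain-Python analogue that runs on one machine. It is only a sketch: the function name word_count and the sample lines are made up for illustration, and nothing here uses the Spark API.

```python
from collections import defaultdict
from itertools import chain


def word_count(lines):
    """Local, single-machine analogue of flatMap / map / reduceByKey."""
    # flatMap: split every line into words and flatten into one sequence
    words = chain.from_iterable(line.split() for line in lines)
    # map: pair each word with an initial count of 1
    pairs = ((word, 1) for word in words)
    # reduceByKey: combine the values of equal keys with a + b
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] = counts[word] + n
    return dict(counts)


sample_lines = ["to be or not to be", "that is the question"]  # made-up input
print(word_count(sample_lines))
```

Spark performs the same logical steps, but spreads the lines across the cluster and merges the per-key sums between machines.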
● Spark Streaming
● GraphX (graph algorithms)
● MLlib (machine learning)
● DataFrames (data access)
Spark: Batteries Included
● Analytics (batch / streaming)
● Machine Learning
● ETL (Extract - Transform - Load)
● …and many more!
Applications
● RDD = Resilient Distributed Dataset
● Immutable, Fault-tolerant
● Operated on in parallel
● Can be created manually or from external sources
RDDs – The Building Block
● Transformations
● Actions
● Transformations are lazy
● Actions trigger evaluation of the pending transformations in the pipeline, then perform the action itself
RDDs – The Building Block
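The lazy/eager split can be illustrated in plain Python, whose built-in map and filter are also lazy. This is an analogy rather than Spark code; the log list and the double function are made up to show when work actually happens.

```python
log = []  # records when work actually happens


def double(x):
    log.append(f"double({x})")
    return x * 2


numbers = [1, 2, 3, 4]

# "Transformations": build the pipeline lazily; nothing is computed yet.
doubled = map(double, numbers)
large = filter(lambda x: x > 4, doubled)
calls_before_action = len(log)

# "Action": consuming the pipeline forces every pending step to run.
result = list(large)
calls_after_action = len(log)

print(calls_before_action, calls_after_action, result)
```

No call to double is logged until the list is requested, which mirrors how an RDD pipeline only runs once an action asks for a result.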
● map()
● filter()
● pipe()
● sample()
● …and more!
RDDs – Example Transformations
● reduce()
● count()
● take()
● saveAsTextFile()
● …and yes, more
RDDs – Example Actions
from pyspark import SparkContext

logFile = "hdfs:///input"
sc = SparkContext("spark://spark-m:7077", "WordCount")
textFile = sc.textFile(logFile)
# flatMap, map and reduceByKey are transformations: lazy, nothing runs yet
wordCounts = (textFile.flatMap(lambda line: line.split())
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b))
# saveAsTextFile is the action that triggers the whole pipeline
wordCounts.saveAsTextFile("hdfs:///output")
Word Count: Spark
● cache() / persist()
● The first time an action computes the RDD, the result is kept in memory for later actions to reuse
● Different levels of persistence available
RDDs – cache()
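The effect of caching can be sketched with a toy class. LazyDataset is a hypothetical stand-in written for this example and is not part of any Spark API; it only counts how often the underlying data has to be recomputed.

```python
class LazyDataset:
    """Hypothetical toy stand-in for an RDD; not part of any Spark API."""

    def __init__(self, compute):
        self._compute = compute      # function that produces the data
        self._cache_enabled = False
        self._cached = None
        self.evaluations = 0         # how many times the data was recomputed

    def cache(self):
        # Only marks the dataset; nothing is computed until an action runs.
        self._cache_enabled = True
        return self

    def _materialize(self):
        if self._cached is not None:
            return self._cached
        self.evaluations += 1
        data = list(self._compute())
        if self._cache_enabled:
            self._cached = data
        return data

    # Two example "actions"
    def count(self):
        return len(self._materialize())

    def take(self, n):
        return self._materialize()[:n]


def squares():
    return (x * x for x in range(5))


uncached = LazyDataset(squares)
uncached.count()
uncached.take(2)

cached = LazyDataset(squares).cache()
cached.count()
cached.take(2)

print(uncached.evaluations, cached.evaluations)
```

Without caching, each action recomputes the data from scratch; with caching, the first action pays the cost and later ones reuse the stored result.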
● Micro-batches (DStreams of RDDs)
● Access to other parts of Spark (MLlib, GraphX, DataFrames)
● Fault-tolerant
● Connectors to Kafka, Flume, Kinesis, ZeroMQ
● (we’ll come back to this)
Streaming
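The micro-batch idea can be sketched in plain Python: chop an incoming stream into small batches and run ordinary batch logic on each one. The function micro_batches and the event list are made up for illustration. Spark Streaming cuts batches by time interval, whereas this sketch uses a fixed record count only to stay deterministic.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Group an incoming stream into small lists, one list per batch."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


events = ["click", "view", "click", "buy", "view"]  # made-up events

# Each batch is processed with ordinary batch logic (here: counting clicks).
clicks_per_batch = [batch.count("click") for batch in micro_batches(events, 2)]
print(clicks_per_batch)
```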
● Spark SQL
● Support for JSON, Cassandra, SQL databases, etc.
● Easier syntax than RDDs
● DataFrames ‘borrowed’ from Python/R
● Catalyst query planner
DataFrames
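import org.apache.spark.SparkContext

// Master and app name come from the spark-submit / spark-shell settings
val sc = new SparkContext()
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// Load a JSON file (one object per line) into a DataFrame; schema is inferred
val df = sqlContext.read.json("people.json")
df.show()
df.filter(df("age") >= 35).show()  // people aged 35 or older
df.groupBy("age").count().show()   // number of people per age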
Dataframes: Example
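For readers without a Spark shell to hand, the filter and groupBy steps correspond roughly to the following plain-Python operations on a list of records. The records are made up to stand in for people.json, whose actual contents are not shown in the slides.

```python
from collections import Counter

# Made-up records standing in for the contents of people.json
people = [
    {"name": "Ann", "age": 35},
    {"name": "Bob", "age": 28},
    {"name": "Cy", "age": 35},
    {"name": "Dee", "age": 41},
]

# df.filter(df("age") >= 35): keep rows whose age is at least 35
aged_35_plus = [p for p in people if p["age"] >= 35]

# df.groupBy("age").count(): number of rows per distinct age
count_by_age = Counter(p["age"] for p in people)

print([p["name"] for p in aged_35_plus])
print(dict(count_by_age))
```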
● Optimizing query planning for Spark
● Takes DataFrame operations and ‘compiles’ them down to RDD operations
● Often faster than writing RDD code manually
● Use DataFrames whenever possible (v1.4+)
Dataframes: Catalyst
● Standalone
● YARN (Hadoop ecosystem)
● Mesos (Hipster ecosystem)
Deploying Spark
● Spark-Shell
● Zeppelin
Demos
● Spark Streaming is not ‘pure’ streaming
● Low latency requirements - use Storm
● Still immature in some ways
● Come to my All Things Open talk to learn more!
Spark for Everything?
● http://www.meetup.com/Triangle-Apache-Spark-Meetup/
● Next meeting likely to be in late October
Triangle Apache Spark Meetup Group
● spark.apache.org
● databricks.com
● zeppelin.incubator.apache.org
● mammothdata.com/white-papers/spark-a-modern-tool-for-big-data-applications
Links
● Questions for you! (for a $15 DigitalOcean voucher)
1. What is an RDD?
2. What’s the difference between a transformation and an action?
3. When wouldn’t you use Spark Streaming?
Questions?

More Related Content

What's hot

Datalab 101 (Hadoop, Spark, ElasticSearch) par Jonathan Winandy - Paris Spark...
Datalab 101 (Hadoop, Spark, ElasticSearch) par Jonathan Winandy - Paris Spark...Datalab 101 (Hadoop, Spark, ElasticSearch) par Jonathan Winandy - Paris Spark...
Datalab 101 (Hadoop, Spark, ElasticSearch) par Jonathan Winandy - Paris Spark...
Modern Data Stack France
 
Machine Learning with Apache Spark - HackNY Masters
Machine Learning with Apache Spark - HackNY Masters Machine Learning with Apache Spark - HackNY Masters
Machine Learning with Apache Spark - HackNY Masters
Evan Casey
 
Time series database by Harshil Ambagade
Time series database by Harshil AmbagadeTime series database by Harshil Ambagade
Time series database by Harshil Ambagade
Sigmoid
 
10 big data analytics tools to watch out for in 2019
10 big data analytics tools to watch out for in 201910 big data analytics tools to watch out for in 2019
10 big data analytics tools to watch out for in 2019
JanBask Training
 
Is Spark Replacing Hadoop
Is Spark Replacing HadoopIs Spark Replacing Hadoop
Is Spark Replacing Hadoop
MapR Technologies
 
Philly Code Camp 2013 Mark Kromer Big Data with SQL Server
Philly Code Camp 2013 Mark Kromer Big Data with SQL ServerPhilly Code Camp 2013 Mark Kromer Big Data with SQL Server
Philly Code Camp 2013 Mark Kromer Big Data with SQL Server
Mark Kromer
 
Hugfr SPARK & RIAK -20160114_hug_france
Hugfr  SPARK & RIAK -20160114_hug_franceHugfr  SPARK & RIAK -20160114_hug_france
Hugfr SPARK & RIAK -20160114_hug_france
Modern Data Stack France
 
Spark Application for Time Series Analysis
Spark Application for Time Series AnalysisSpark Application for Time Series Analysis
Spark Application for Time Series Analysis
MapR Technologies
 
Big data real time architectures
Big data real time architecturesBig data real time architectures
Big data real time architectures
Daniel Marcous
 
Apache Hadoop and Spark: Introduction and Use Cases for Data Analysis
Apache Hadoop and Spark: Introduction and Use Cases for Data AnalysisApache Hadoop and Spark: Introduction and Use Cases for Data Analysis
Apache Hadoop and Spark: Introduction and Use Cases for Data Analysis
Trieu Nguyen
 
Practical Distributed Machine Learning Pipelines on Hadoop
Practical Distributed Machine Learning Pipelines on HadoopPractical Distributed Machine Learning Pipelines on Hadoop
Practical Distributed Machine Learning Pipelines on Hadoop
DataWorks Summit
 
Apache Big_Data Europe event: "Integrators at work! Real-life applications of...
Apache Big_Data Europe event: "Integrators at work! Real-life applications of...Apache Big_Data Europe event: "Integrators at work! Real-life applications of...
Apache Big_Data Europe event: "Integrators at work! Real-life applications of...
Hajira Jabeen
 
The ABC of Big Data
The ABC of Big DataThe ABC of Big Data
The ABC of Big Data
André Faria Gomes
 
Big Data A La Carte Menu
Big Data A La Carte MenuBig Data A La Carte Menu
Big Data A La Carte Menu
Venkatesh Balakumar
 
Apache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackApache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science Stack
Wes McKinney
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
DataWorks Summit
 
Next Generation Data Platforms - Deon Thomas
Next Generation Data Platforms - Deon ThomasNext Generation Data Platforms - Deon Thomas
Next Generation Data Platforms - Deon Thomas
Thoughtworks
 
Lightweight Collection and Storage of Software Repository Data with DataRover
Lightweight Collection and Storage of  Software Repository Data with DataRoverLightweight Collection and Storage of  Software Repository Data with DataRover
Lightweight Collection and Storage of Software Repository Data with DataRover
Christoph Matthies
 
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dan Lynn
 
Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...
Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...
Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...
Databricks
 

What's hot (20)

Datalab 101 (Hadoop, Spark, ElasticSearch) par Jonathan Winandy - Paris Spark...
Datalab 101 (Hadoop, Spark, ElasticSearch) par Jonathan Winandy - Paris Spark...Datalab 101 (Hadoop, Spark, ElasticSearch) par Jonathan Winandy - Paris Spark...
Datalab 101 (Hadoop, Spark, ElasticSearch) par Jonathan Winandy - Paris Spark...
 
Machine Learning with Apache Spark - HackNY Masters
Machine Learning with Apache Spark - HackNY Masters Machine Learning with Apache Spark - HackNY Masters
Machine Learning with Apache Spark - HackNY Masters
 
Time series database by Harshil Ambagade
Time series database by Harshil AmbagadeTime series database by Harshil Ambagade
Time series database by Harshil Ambagade
 
10 big data analytics tools to watch out for in 2019
10 big data analytics tools to watch out for in 201910 big data analytics tools to watch out for in 2019
10 big data analytics tools to watch out for in 2019
 
Is Spark Replacing Hadoop
Is Spark Replacing HadoopIs Spark Replacing Hadoop
Is Spark Replacing Hadoop
 
Philly Code Camp 2013 Mark Kromer Big Data with SQL Server
Philly Code Camp 2013 Mark Kromer Big Data with SQL ServerPhilly Code Camp 2013 Mark Kromer Big Data with SQL Server
Philly Code Camp 2013 Mark Kromer Big Data with SQL Server
 
Hugfr SPARK & RIAK -20160114_hug_france
Hugfr  SPARK & RIAK -20160114_hug_franceHugfr  SPARK & RIAK -20160114_hug_france
Hugfr SPARK & RIAK -20160114_hug_france
 
Spark Application for Time Series Analysis
Spark Application for Time Series AnalysisSpark Application for Time Series Analysis
Spark Application for Time Series Analysis
 
Big data real time architectures
Big data real time architecturesBig data real time architectures
Big data real time architectures
 
Apache Hadoop and Spark: Introduction and Use Cases for Data Analysis
Apache Hadoop and Spark: Introduction and Use Cases for Data AnalysisApache Hadoop and Spark: Introduction and Use Cases for Data Analysis
Apache Hadoop and Spark: Introduction and Use Cases for Data Analysis
 
Practical Distributed Machine Learning Pipelines on Hadoop
Practical Distributed Machine Learning Pipelines on HadoopPractical Distributed Machine Learning Pipelines on Hadoop
Practical Distributed Machine Learning Pipelines on Hadoop
 
Apache Big_Data Europe event: "Integrators at work! Real-life applications of...
Apache Big_Data Europe event: "Integrators at work! Real-life applications of...Apache Big_Data Europe event: "Integrators at work! Real-life applications of...
Apache Big_Data Europe event: "Integrators at work! Real-life applications of...
 
The ABC of Big Data
The ABC of Big DataThe ABC of Big Data
The ABC of Big Data
 
Big Data A La Carte Menu
Big Data A La Carte MenuBig Data A La Carte Menu
Big Data A La Carte Menu
 
Apache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackApache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science Stack
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Next Generation Data Platforms - Deon Thomas
Next Generation Data Platforms - Deon ThomasNext Generation Data Platforms - Deon Thomas
Next Generation Data Platforms - Deon Thomas
 
Lightweight Collection and Storage of Software Repository Data with DataRover
Lightweight Collection and Storage of  Software Repository Data with DataRoverLightweight Collection and Storage of  Software Repository Data with DataRover
Lightweight Collection and Storage of Software Repository Data with DataRover
 
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
 
Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...
Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...
Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...
 

Viewers also liked

Hamad
HamadHamad
Lääkärien riittävyys 2014
Lääkärien riittävyys 2014Lääkärien riittävyys 2014
Lääkärien riittävyys 2014
laakariliitto
 
Mike Stawski: Who are These People? A Profile of Patients and their Families ...
Mike Stawski: Who are These People? A Profile of Patients and their Families ...Mike Stawski: Who are These People? A Profile of Patients and their Families ...
Mike Stawski: Who are These People? A Profile of Patients and their Families ...
Beitissie1
 
Pedro Encarnação: “LUDI”- Technology and Play for Children with Disabilities
Pedro Encarnação: “LUDI”- Technology and Play for Children with DisabilitiesPedro Encarnação: “LUDI”- Technology and Play for Children with Disabilities
Pedro Encarnação: “LUDI”- Technology and Play for Children with Disabilities
Beitissie1
 
Mohsin hakim
Mohsin hakimMohsin hakim
Mohsin hakim
Mohsin Hakim
 
Scanned from a Xerox multifunction device
Scanned from a Xerox multifunction deviceScanned from a Xerox multifunction device
Scanned from a Xerox multifunction deviceKarim Taha
 

Viewers also liked (6)

Hamad
HamadHamad
Hamad
 
Lääkärien riittävyys 2014
Lääkärien riittävyys 2014Lääkärien riittävyys 2014
Lääkärien riittävyys 2014
 
Mike Stawski: Who are These People? A Profile of Patients and their Families ...
Mike Stawski: Who are These People? A Profile of Patients and their Families ...Mike Stawski: Who are These People? A Profile of Patients and their Families ...
Mike Stawski: Who are These People? A Profile of Patients and their Families ...
 
Pedro Encarnação: “LUDI”- Technology and Play for Children with Disabilities
Pedro Encarnação: “LUDI”- Technology and Play for Children with DisabilitiesPedro Encarnação: “LUDI”- Technology and Play for Children with Disabilities
Pedro Encarnação: “LUDI”- Technology and Play for Children with Disabilities
 
Mohsin hakim
Mohsin hakimMohsin hakim
Mohsin hakim
 
Scanned from a Xerox multifunction device
Scanned from a Xerox multifunction deviceScanned from a Xerox multifunction device
Scanned from a Xerox multifunction device
 

Similar to Introduction To Spark - Durham LUG 20150916

Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
Mammoth Data
 
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Anant Corporation
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
Amir Sedighi
 
Developing Enterprise Consciousness: Building Modern Open Data Platforms
Developing Enterprise Consciousness: Building Modern Open Data PlatformsDeveloping Enterprise Consciousness: Building Modern Open Data Platforms
Developing Enterprise Consciousness: Building Modern Open Data Platforms
ScyllaDB
 
A fast introduction to PySpark with a quick look at Arrow based UDFs
A fast introduction to PySpark with a quick look at Arrow based UDFsA fast introduction to PySpark with a quick look at Arrow based UDFs
A fast introduction to PySpark with a quick look at Arrow based UDFs
Holden Karau
 
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in productionScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
Chetan Khatri
 
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
BigDataEverywhere
 
Lightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkLightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache Spark
Manish Gupta
 
Machine Learning with H2O, Spark, and Python at Strata 2015
Machine Learning with H2O, Spark, and Python at Strata 2015Machine Learning with H2O, Spark, and Python at Strata 2015
Machine Learning with H2O, Spark, and Python at Strata 2015
Sri Ambati
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
Databricks
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
Debraj GuhaThakurta
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
Debraj GuhaThakurta
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
C4Media
 
Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016
Holden Karau
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
Vienna Data Science Group
 
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
spinningmatt
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
Databricks
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
MapR Technologies
 
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced AnalyticsETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
Miklos Christine
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Paco Nathan
 

Similar to Introduction To Spark - Durham LUG 20150916 (20)

Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
 
Developing Enterprise Consciousness: Building Modern Open Data Platforms
Developing Enterprise Consciousness: Building Modern Open Data PlatformsDeveloping Enterprise Consciousness: Building Modern Open Data Platforms
Developing Enterprise Consciousness: Building Modern Open Data Platforms
 
A fast introduction to PySpark with a quick look at Arrow based UDFs
A fast introduction to PySpark with a quick look at Arrow based UDFsA fast introduction to PySpark with a quick look at Arrow based UDFs
A fast introduction to PySpark with a quick look at Arrow based UDFs
 
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in productionScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
 
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
 
Lightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkLightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache Spark
 
Machine Learning with H2O, Spark, and Python at Strata 2015
Machine Learning with H2O, Spark, and Python at Strata 2015Machine Learning with H2O, Spark, and Python at Strata 2015
Machine Learning with H2O, Spark, and Python at Strata 2015
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
 
Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
 
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced AnalyticsETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
 

Recently uploaded

Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
Neo4j
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
Matthew Sinclair
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Nexer Digital
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization

Introduction To Spark - Durham LUG 20150916

  • 2. Mammoth Data, based in downtown Durham (right above Toast)
    www.mammothdata.com | @mammothdataco
    The Leader in Big Data Consulting
    ● BI/Data Strategy
      ○ Development of a business intelligence/data architecture strategy.
    ● Installation
      ○ Installation of Hadoop or relevant technology.
    ● Data Consolidation
      ○ Load data from diverse sources into a single scalable repository.
    ● Streaming
      ○ Mammoth will write ingestion and/or analytics which operate on the data as it comes in, as well as design dashboards, feeds, or computer-driven decision-making processes to derive insights and make decisions.
    ● Visualization Tools
      ○ Mammoth will set up a visualization tool (e.g., Tableau, Pentaho). We will also create initial reports and provide training to the employees who will analyze the data.
  • 3. Me!
    ● Lead Consultant on all things DevOps and Spark
    ● @carsondial
  • 4. What Is Apache Spark?!
    ● Apache Spark™ is a fast and general engine for large-scale data processing
    ● Not all that helpful, is it?
  • 5. What Is Apache Spark?! No, But Really…
    ● Framework for massively parallel computing (cluster)
    ● Harnessing the power of cheap memory
    ● Directed Acyclic Graph (DAG) computing engine
    ● It goes very fast!
    ● Apache Project (spark.apache.org)
  • 6. Why Spark?
    ● Performance
    ● Developer productivity
  • 7. Performance!
    ● GraySort benchmark (100 TB)
    ● Hadoop - 72 minutes / 2100 nodes / datacentre
    ● Spark - 23 minutes / 206 nodes / AWS
    ● HDFS versus Memory
  • 8. Developers!
    ● First-class support for Scala, Java, Python, and R!
    ● Data Science friendly
  • 10. Word Count: Spark

        from pyspark import SparkContext

        # HDFS path of the input text
        logFile = "hdfs:///input"
        sc = SparkContext("spark://spark-m:7077", "WordCount")
        textFile = sc.textFile(logFile)
        # Split lines into words, pair each word with 1, then sum the 1s per word
        wordCounts = (textFile.flatMap(lambda line: line.split())
                              .map(lambda word: (word, 1))
                              .reduceByKey(lambda a, b: a + b))
        wordCounts.saveAsTextFile("hdfs:///output")
  • 11. Spark: Batteries Included
    ● Spark Streaming
    ● GraphX (graph algorithms)
    ● MLlib (machine learning)
    ● DataFrames (data access)
  • 12. Applications
    ● Analytics (batch / streaming)
    ● Machine Learning
    ● ETL (Extract - Transform - Load)
    ● …and many more!
  • 13. RDDs – The Building Block
    ● RDD = Resilient Distributed Dataset
    ● Immutable, Fault-tolerant
    ● Operated on in parallel
    ● Can be created manually or from external sources
  • 14. RDDs – The Building Block
    ● Transformations
    ● Actions
    ● Transformations are lazy
    ● Actions evaluate the pending transformations in the pipeline and then perform the action itself
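The lazy/eager split on slide 14 can be illustrated without a cluster. What follows is a minimal plain-Python sketch, not Spark's implementation: `LazyDataset` is a hypothetical class invented for this example, whose `map` and `filter` only record the work to do, while `collect` and `count` actually run the recorded pipeline.

```python
class LazyDataset:
    """Hypothetical stand-in for an RDD: transformations are recorded, not run."""

    def __init__(self, source, steps=()):
        self._source = source   # any re-iterable collection
        self._steps = steps     # tuple of ("map" | "filter", function) pairs

    def map(self, fn):
        # Transformation: returns a new dataset and does no work yet
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        # Transformation: also only recorded
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def _run(self):
        # Push each element through every recorded step, in order
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    def collect(self):
        # Action: evaluates the pipeline and returns the results
        return list(self._run())

    def count(self):
        # Action: evaluates the pipeline and counts the results
        return sum(1 for _ in self._run())


calls = []  # records when the mapped function really executes


def double(x):
    calls.append(x)
    return x * 2


ds = LazyDataset([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)
calls_before_action = len(calls)  # no action yet, so double() has not run
result = ds.collect()             # the action triggers the whole pipeline
```

Because each transformation returns a new object rather than modifying the old one, the sketch also mirrors the immutability mentioned on slide 13.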
  • 15. RDDs – Example Transformations
    ● map()
    ● filter()
    ● pipe()
    ● sample()
    ● …and more!
  • 16. RDDs – Example Actions
    ● reduce()
    ● count()
    ● take()
    ● saveAsTextFile()
    ● …and yes, more
  • 17. Word Count: Spark

        from pyspark import SparkContext

        logFile = "hdfs:///input"
        sc = SparkContext("spark://spark-m:7077", "WordCount")
        textFile = sc.textFile(logFile)
        # flatMap, map, and reduceByKey are lazy transformations
        wordCounts = (textFile.flatMap(lambda line: line.split())
                              .map(lambda word: (word, 1))
                              .reduceByKey(lambda a, b: a + b))
        # saveAsTextFile is the action that triggers the computation
        wordCounts.saveAsTextFile("hdfs:///output")
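To see what each stage of the word count on slide 17 does to the data, here is a single-machine sketch in plain Python. It only mimics the three steps on an in-memory list; `word_count` and `sample_lines` are hypothetical names made up for this illustration, and nothing here touches Spark or HDFS.

```python
from collections import defaultdict
from itertools import chain


def word_count(lines):
    # flatMap step: split every line and flatten into one stream of words
    words = chain.from_iterable(line.split() for line in lines)
    # map step: pair each word with the number 1
    pairs = ((word, 1) for word in words)
    # reduceByKey step: combine the values of equal keys with a + b
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] = counts[word] + n
    return dict(counts)


sample_lines = ["to be or not", "to be"]  # made-up input
counts = word_count(sample_lines)
```

In Spark the same three steps run in parallel across partitions, which is why the combining function passed to `reduceByKey` should be associative and commutative.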
  • 18. RDDs – cache()
    ● cache() / persist()
    ● The first time an action is performed, keep the result in memory
    ● Different levels of persistence available
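The idea behind slide 18 is compute once, reuse afterwards. A rough plain-Python sketch of that behaviour follows; `CachedDataset` and its `evaluations` counter are hypothetical and exist only for this illustration. It keeps results in memory only, whereas Spark offers several persistence levels.

```python
class CachedDataset:
    """Hypothetical sketch: compute on the first action, reuse afterwards."""

    def __init__(self, compute):
        self._compute = compute   # zero-argument function producing the data
        self._cached = None
        self._is_cached = False
        self.evaluations = 0      # how many times the data was recomputed

    def collect(self):
        # Action: runs the computation only if nothing is cached yet
        if not self._is_cached:
            self.evaluations += 1
            self._cached = list(self._compute())
            self._is_cached = True
        return list(self._cached)

    def count(self):
        # Second action: served from the cached result
        return len(self.collect())


squares = CachedDataset(lambda: (n * n for n in range(5)))
first = squares.collect()   # triggers the computation
total = squares.count()     # reuses the in-memory result
```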
  • 19. Streaming
    ● Micro-batches (DStreams of RDDs)
    ● Access to other parts of Spark (MLlib, GraphX, DataFrames)
    ● Fault-tolerant
    ● Connectors to Kafka, Flume, Kinesis, ZeroMQ
    ● (we’ll come back to this)
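Micro-batching means cutting a continuous stream into small fixed-length intervals and running an ordinary batch job on each one. The sketch below shows that idea in plain Python, under the assumption that events arrive as `(timestamp, value)` pairs sorted by timestamp. The `micro_batches` function and the sample events are hypothetical, and this is not how Spark Streaming is implemented internally.

```python
from collections import Counter


def micro_batches(events, batch_seconds):
    """Group (timestamp, value) events into consecutive fixed-length intervals.

    Assumes events are sorted by timestamp. Intervals with no events
    produce an empty batch.
    """
    batch, window_end = [], None
    for ts, value in events:
        if window_end is None:
            # The first interval starts at the first event's timestamp
            window_end = ts + batch_seconds
        while ts >= window_end:
            # The current interval is over: emit it and open the next one
            yield batch
            batch, window_end = [], window_end + batch_seconds
        batch.append(value)
    if batch:
        yield batch


# Made-up event stream: (seconds since start, payload)
events = [(0.0, "a"), (0.4, "b"), (1.1, "a"), (3.2, "c")]
batches = list(micro_batches(events, 1.0))
# Each batch is then processed like any small batch job, here a count per value
batch_counts = [dict(Counter(batch)) for batch in batches]
```

The batch interval sets a lower bound on latency, which is the trade-off behind the later slide's remark that Spark Streaming is not 'pure' streaming.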
  • 20. DataFrames
    ● Spark SQL
    ● Support for JSON, Cassandra, SQL databases, etc.
    ● Easier syntax than RDDs
    ● DataFrames ‘borrowed’ from Python/R
    ● Catalyst query planner
  • 21. DataFrames: Example

        import org.apache.spark.SparkContext

        val sc = new SparkContext()
        val sqlContext = new org.apache.spark.sql.SQLContext(sc)
        // Load a JSON file into a DataFrame
        val df = sqlContext.read.json("people.json")
        df.show()
        // Keep only the rows where age is at least 35
        df.filter(df("age") >= 35).show()
        // Count the rows for each distinct age
        df.groupBy("age").count().show()
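For readers without a Spark installation, the filter and group-by on slide 21 correspond roughly to the plain-Python operations below. The `people` records are invented for this illustration (the contents of `people.json` are not shown in the deck), and the sketch says nothing about how Spark executes these queries.

```python
from collections import Counter

# Hypothetical records standing in for the rows of people.json
people = [
    {"name": "Ana", "age": 35},
    {"name": "Ben", "age": 28},
    {"name": "Cho", "age": 41},
    {"name": "Dev", "age": 35},
]

# Rough equivalent of df.filter(df("age") >= 35)
at_least_35 = [person for person in people if person["age"] >= 35]

# Rough equivalent of df.groupBy("age").count()
count_by_age = Counter(person["age"] for person in people)
```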
  • 22. DataFrames: Catalyst
    ● Optimizing query planner for Spark
    ● Takes DataFrame operations and ‘compiles’ them down to RDD operations
    ● Often faster than writing RDD code manually
    ● Use DataFrames whenever possible (v1.4+)
  • 24. Deploying Spark
    ● Standalone
    ● YARN (Hadoop ecosystem)
    ● Mesos (Hipster ecosystem)
  • 25. Demos
    ● Spark-Shell
    ● Zeppelin
  • 26. Spark for Everything?
    ● Spark Streaming is not ‘pure’ streaming
    ● Low latency requirements - use Storm
    ● Still immature in some ways
    ● Come to my All Things Open talk to learn more!
  • 27. Triangle Apache Spark Meetup Group
    ● http://www.meetup.com/Triangle-Apache-Spark-Meetup/
    ● Next meeting likely to be in late October
  • 28. Links
    ● spark.apache.org
    ● databricks.com
    ● zeppelin.incubator.apache.org
    ● mammothdata.com/white-papers/spark-a-modern-tool-for-big-data-applications
  • 29. Questions?
    ● Questions for you! (for a $15 DigitalOcean voucher)
      1. What is an RDD?
      2. What’s the difference between a transformation and an action?
      3. When wouldn’t you use Spark Streaming?