SlideShare a Scribd company logo
Introduction to Apache Spark
www.mammothdata.com | @mammothdataco
The Leader in Big Data Consulting
● BI/Data Strategy
○ Development of a business intelligence/ data architecture strategy.
● Installation
○ Installation of Hadoop or relevant technology.
● Data Consolidation
○ Load data from diverse sources into a single scalable repository.
● Streaming - Mammoth will write ingestion and/or analytics which operate on the data as it comes in as well as design dashboards,
feeds or computer-driven decision making processes to derive insights and make decisions.
● Visualization Tools
○ Mammoth will set up visualization tool (ex: Tableau, Pentaho, etc…) We will also create initial reports and provide training to
necessary employees who will analyze the data.
Mammoth Data, based in downtown Durham (right above Toast)
www.mammothdata.com | @mammothdataco
●Lead Consultant on all things DevOps and Spark
●@carsondial
Me!
www.mammothdata.com | @mammothdataco
●Apache Spark™ is a fast and general engine for large-scale data
processing
What Is Apache Spark?!
www.mammothdata.com | @mammothdataco
● Framework for massive parallel computing (cluster)
● Harnessing power of cheap memory
● Direct Acyclic Graph (DAG) computing engine
● It goes very fast!
● Apache Project (spark.apache.org)
What Is Apache Spark?! No, But Really…
www.mammothdata.com | @mammothdataco
● Began at UC Berkeley in 2009
● Apache project since 2013
● Top-level Apache project since 2014
● Creators formed databricks.com
History
www.mammothdata.com | @mammothdataco
● Performance
● Developer productivity
Why Spark?
www.mammothdata.com | @mammothdataco
● Graysort benchmark (100TB)
● Hadoop - 72 minutes / 2100 nodes / datacentre
● Spark - 23 minutes / 206 nodes / AWS
● HDFS versus Memory
Performance!
www.mammothdata.com | @mammothdataco
●First class support for Scala, Java, Python, and R!
●Data Science friendly
Developers!
www.mammothdata.com | @mammothdataco
Word Count: Hadoop
www.mammothdata.com | @mammothdataco
from pyspark import SparkContext
logFile = "hdfs:///input"
sc = SparkContext("spark://spark-m:7077", "WordCount")
textFile = sc.textFile(logFile)
wordCounts = textFile.flatMap(lambda line: line.split()).map(lambda word: (word,
1)).reduceByKey(lambda a, b: a+b)
wordCounts.saveAsTextFile("hdfs:///output")
Word Count: Spark
www.mammothdata.com | @mammothdataco
●Spark Streaming
●GraphX (graph algorithms)
●MLLib (machine learning)
●Dataframes (data access)
Spark: Batteries Included
www.mammothdata.com | @mammothdataco
● Analytics (batch / streaming)
● Machine Learning
● ETL (Extract - Transform - Load)
● …and many more!
Applications
www.mammothdata.com | @mammothdataco
●RDD = Resilient Distributed Dataset
●Immutable, Fault-tolerant
●Operated on in parallel
●Can be created manually or from external sources
RDDs – The Building Block
www.mammothdata.com | @mammothdataco
●Transformations
●Actions
●Transformations are lazy
●Actions evaluate transformations in pipeline as well as
performing action
RDDs – The Building Block
www.mammothdata.com | @mammothdataco
● map()
● filter()
● pipe()
● sample()
● …and more!
RDDs – Example Transformations
www.mammothdata.com | @mammothdataco
●reduce()
●count()
●take()
●saveAsTextFile()
●…and yes, more
RDDs – Example Actions
www.mammothdata.com | @mammothdataco
from pyspark import SparkContext
logFile = "hdfs:///input"
sc = SparkContext("spark://spark-m:7077", "WordCount")
textFile = sc.textFile(logFile)
wordCounts = textFile.flatMap(lambda line: line.split()).map(lambda word: (word,
1)).reduceByKey(lambda a, b: a+b)
wordCounts.saveAsTextFile("hdfs:///output")
Word Count: Spark
www.mammothdata.com | @mammothdataco
●cache() / persist()
●When an action is performed for the first time - keep the result in
memory
●Different levels of persistence available
RDDs – cache()
www.mammothdata.com | @mammothdataco
●Micro-batches (DStreams of RDDs)
●Access to other parts of Spark (MLLib, GraphX, Dataframes)
●Fault-tolerant
●Connectors to Kafka, Flume, Kinesis, ZeroMQ
●(we’ll come back to this)
Streaming
www.mammothdata.com | @mammothdataco
●Spark SQL
●Support for JSON, Cassandra, SQL databases, etc.
●Easier syntax than RDDs
●Dataframes ‘borrowed’ from Python/R
●Catalyst query planner
Dataframes
www.mammothdata.com | @mammothdataco
val sc = new SparkContext()
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.json("people.json")
df.show()
df.filter(df("age") >= 35).show()
df.groupBy("age").count().show()
Dataframes: Example
www.mammothdata.com | @mammothdataco
●Optimizing query planning for Spark
●Takes Dataframe operations and ‘compiles’ them down to RDD
operations
●Often faster than writing RDD code manually
●Use Dataframes whenever possible (v1.4+)
Dataframes: Catalyst
www.mammothdata.com | @mammothdataco
Dataframes: Catalyst
www.mammothdata.com | @mammothdataco
●Machine Learning
●Includes algorithm implementations for Bayes, k-means
clustering, ALS, word2vec, random forests, etc.
●Matrix operations (dense / sparse), dimensionality reduction, etc.
●And basic stats too!
MLLib
www.mammothdata.com | @mammothdataco
●Common interface between different ML solutions
●(still in progress, but production-ready as of 1.5)
●Pipelines : ML as Dataframes : RDDs
MLLib - Pipelines
www.mammothdata.com | @mammothdataco
●Graph processing algorithms
●Operations on vertices and edges
●Includes PageRank algorithm
●Can be combined with Streaming/Dataframes/MLLib
GraphX
www.mammothdata.com | @mammothdataco
●Standalone
●YARN (Hadoop ecosystem)
●Mesos (Hipster ecosystem)
Deploying Spark
www.mammothdata.com | @mammothdataco
●Traditional (write code, submit to cluster)
●REPL / Shell (write code interactively, backed by cluster)
●Interactive Notebooks (iPython/Zeppelin)
Using Spark
www.mammothdata.com | @mammothdataco
●Log / diary approach to data science
●Type code into a web page
●Visualizations built-in
Interactive Notebooks
www.mammothdata.com | @mammothdataco
●iPython / Juypter - most popular
●Zeppelin - built for Spark
Interactive Notebooks
www.mammothdata.com | @mammothdataco
Interactive Notebooks
www.mammothdata.com | @mammothdataco
●spark.apache.org
●databricks.com
●zeppelin.incubator.apache.org
●mammothdata.com/white-papers/spark-a-modern-tool-for-big-
data-applications
Links
www.mammothdata.com | @mammothdataco
Questions?

More Related Content

What's hot

Alpine academy apache spark series #1 introduction to cluster computing wit...
Alpine academy apache spark series #1   introduction to cluster computing wit...Alpine academy apache spark series #1   introduction to cluster computing wit...
Alpine academy apache spark series #1 introduction to cluster computing wit...
Holden Karau
 
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Robert "Chip" Senkbeil
 
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
Brian O'Neill
 
Structuring Spark: DataFrames, Datasets, and Streaming
Structuring Spark: DataFrames, Datasets, and StreamingStructuring Spark: DataFrames, Datasets, and Streaming
Structuring Spark: DataFrames, Datasets, and Streaming
Databricks
 
Lambda architecture with Spark
Lambda architecture with SparkLambda architecture with Spark
Lambda architecture with Spark
Vincent GALOPIN
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Rahul Jain
 
Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra
Real-Time Anomaly Detection  with Spark MLlib, Akka and  CassandraReal-Time Anomaly Detection  with Spark MLlib, Akka and  Cassandra
Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra
Natalino Busa
 
Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...
Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...
Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...
Helena Edelson
 
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced AnalyticsETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
Miklos Christine
 
Spark Under the Hood - Meetup @ Data Science London
Spark Under the Hood - Meetup @ Data Science LondonSpark Under the Hood - Meetup @ Data Science London
Spark Under the Hood - Meetup @ Data Science London
Databricks
 
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Josef A. Habdank
 
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...
Databricks
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaStreaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and Akka
Helena Edelson
 
Lambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale MLLambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale ML
huguk
 
Stanford CS347 Guest Lecture: Apache Spark
Stanford CS347 Guest Lecture: Apache SparkStanford CS347 Guest Lecture: Apache Spark
Stanford CS347 Guest Lecture: Apache Spark
Reynold Xin
 
Spark Meetup at Uber
Spark Meetup at UberSpark Meetup at Uber
Spark Meetup at Uber
Databricks
 
Spark Summit EU talk by Yaroslav Nedashkovsky and Andy Starzhinsky
Spark Summit EU talk by Yaroslav Nedashkovsky and Andy StarzhinskySpark Summit EU talk by Yaroslav Nedashkovsky and Andy Starzhinsky
Spark Summit EU talk by Yaroslav Nedashkovsky and Andy Starzhinsky
Spark Summit
 
Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015
Databricks
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkDatabricks
 
Implementing the Lambda Architecture efficiently with Apache Spark
Implementing the Lambda Architecture efficiently with Apache SparkImplementing the Lambda Architecture efficiently with Apache Spark
Implementing the Lambda Architecture efficiently with Apache Spark
DataWorks Summit
 

What's hot (20)

Alpine academy apache spark series #1 introduction to cluster computing wit...
Alpine academy apache spark series #1   introduction to cluster computing wit...Alpine academy apache spark series #1   introduction to cluster computing wit...
Alpine academy apache spark series #1 introduction to cluster computing wit...
 
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
 
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
 
Structuring Spark: DataFrames, Datasets, and Streaming
Structuring Spark: DataFrames, Datasets, and StreamingStructuring Spark: DataFrames, Datasets, and Streaming
Structuring Spark: DataFrames, Datasets, and Streaming
 
Lambda architecture with Spark
Lambda architecture with SparkLambda architecture with Spark
Lambda architecture with Spark
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra
Real-Time Anomaly Detection  with Spark MLlib, Akka and  CassandraReal-Time Anomaly Detection  with Spark MLlib, Akka and  Cassandra
Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra
 
Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...
Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...
Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...
 
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced AnalyticsETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
 
Spark Under the Hood - Meetup @ Data Science London
Spark Under the Hood - Meetup @ Data Science LondonSpark Under the Hood - Meetup @ Data Science London
Spark Under the Hood - Meetup @ Data Science London
 
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
 
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaStreaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and Akka
 
Lambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale MLLambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale ML
 
Stanford CS347 Guest Lecture: Apache Spark
Stanford CS347 Guest Lecture: Apache SparkStanford CS347 Guest Lecture: Apache Spark
Stanford CS347 Guest Lecture: Apache Spark
 
Spark Meetup at Uber
Spark Meetup at UberSpark Meetup at Uber
Spark Meetup at Uber
 
Spark Summit EU talk by Yaroslav Nedashkovsky and Andy Starzhinsky
Spark Summit EU talk by Yaroslav Nedashkovsky and Andy StarzhinskySpark Summit EU talk by Yaroslav Nedashkovsky and Andy Starzhinsky
Spark Summit EU talk by Yaroslav Nedashkovsky and Andy Starzhinsky
 
Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
 
Implementing the Lambda Architecture efficiently with Apache Spark
Implementing the Lambda Architecture efficiently with Apache SparkImplementing the Lambda Architecture efficiently with Apache Spark
Implementing the Lambda Architecture efficiently with Apache Spark
 

Viewers also liked

Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, ScalaLambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Helena Edelson
 
Sa introduction to big data pipelining with cassandra & spark west mins...
Sa introduction to big data pipelining with cassandra & spark   west mins...Sa introduction to big data pipelining with cassandra & spark   west mins...
Sa introduction to big data pipelining with cassandra & spark west mins...
Simon Ambridge
 
How to deploy Apache Spark 
to Mesos/DCOS
How to deploy Apache Spark 
to Mesos/DCOSHow to deploy Apache Spark 
to Mesos/DCOS
How to deploy Apache Spark 
to Mesos/DCOS
Legacy Typesafe (now Lightbend)
 
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Helena Edelson
 
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo LeeData Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
Spark Summit
 
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time Personalization
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time PersonalizationUsing Spark, Kafka, Cassandra and Akka on Mesos for Real-Time Personalization
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time Personalization
Patrick Di Loreto
 
Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015
Evan Chan
 
H2O - the optimized HTTP server
H2O - the optimized HTTP serverH2O - the optimized HTTP server
H2O - the optimized HTTP server
Kazuho Oku
 
Container Orchestration Wars
Container Orchestration WarsContainer Orchestration Wars
Container Orchestration Wars
Karl Isenberg
 
Linux 4.x Tracing Tools: Using BPF Superpowers
Linux 4.x Tracing Tools: Using BPF SuperpowersLinux 4.x Tracing Tools: Using BPF Superpowers
Linux 4.x Tracing Tools: Using BPF Superpowers
Brendan Gregg
 

Viewers also liked (10)

Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, ScalaLambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
 
Sa introduction to big data pipelining with cassandra & spark west mins...
Sa introduction to big data pipelining with cassandra & spark   west mins...Sa introduction to big data pipelining with cassandra & spark   west mins...
Sa introduction to big data pipelining with cassandra & spark west mins...
 
How to deploy Apache Spark 
to Mesos/DCOS
How to deploy Apache Spark 
to Mesos/DCOSHow to deploy Apache Spark 
to Mesos/DCOS
How to deploy Apache Spark 
to Mesos/DCOS
 
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
 
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo LeeData Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
 
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time Personalization
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time PersonalizationUsing Spark, Kafka, Cassandra and Akka on Mesos for Real-Time Personalization
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time Personalization
 
Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015
 
H2O - the optimized HTTP server
H2O - the optimized HTTP serverH2O - the optimized HTTP server
H2O - the optimized HTTP server
 
Container Orchestration Wars
Container Orchestration WarsContainer Orchestration Wars
Container Orchestration Wars
 
Linux 4.x Tracing Tools: Using BPF Superpowers
Linux 4.x Tracing Tools: Using BPF SuperpowersLinux 4.x Tracing Tools: Using BPF Superpowers
Linux 4.x Tracing Tools: Using BPF Superpowers
 

Similar to Intro to Apache Spark

Introduction To Spark - Durham LUG 20150916
Introduction To Spark - Durham LUG 20150916Introduction To Spark - Durham LUG 20150916
Introduction To Spark - Durham LUG 20150916
Ian Pointer
 
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Anant Corporation
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
Amir Sedighi
 
Lightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkLightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache Spark
Manish Gupta
 
Developing Enterprise Consciousness: Building Modern Open Data Platforms
Developing Enterprise Consciousness: Building Modern Open Data PlatformsDeveloping Enterprise Consciousness: Building Modern Open Data Platforms
Developing Enterprise Consciousness: Building Modern Open Data Platforms
ScyllaDB
 
A fast introduction to PySpark with a quick look at Arrow based UDFs
A fast introduction to PySpark with a quick look at Arrow based UDFsA fast introduction to PySpark with a quick look at Arrow based UDFs
A fast introduction to PySpark with a quick look at Arrow based UDFs
Holden Karau
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
Debraj GuhaThakurta
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
Debraj GuhaThakurta
 
Machine Learning with H2O, Spark, and Python at Strata 2015
Machine Learning with H2O, Spark, and Python at Strata 2015Machine Learning with H2O, Spark, and Python at Strata 2015
Machine Learning with H2O, Spark, and Python at Strata 2015
Sri Ambati
 
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowPyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
Chetan Khatri
 
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in productionScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
Chetan Khatri
 
Introduction to Spark with Python
Introduction to Spark with PythonIntroduction to Spark with Python
Introduction to Spark with Python
Gokhan Atil
 
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Databricks
 
Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016
Holden Karau
 
spark interview questions & answers acadgild blogs
 spark interview questions & answers acadgild blogs spark interview questions & answers acadgild blogs
spark interview questions & answers acadgild blogs
prateek kumar
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study Notes
Richard Kuo
 
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
BigDataEverywhere
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
MapR Technologies
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
Databricks
 
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKSCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
zmhassan
 

Similar to Intro to Apache Spark (20)

Introduction To Spark - Durham LUG 20150916
Introduction To Spark - Durham LUG 20150916Introduction To Spark - Durham LUG 20150916
Introduction To Spark - Durham LUG 20150916
 
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
 
Lightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkLightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache Spark
 
Developing Enterprise Consciousness: Building Modern Open Data Platforms
Developing Enterprise Consciousness: Building Modern Open Data PlatformsDeveloping Enterprise Consciousness: Building Modern Open Data Platforms
Developing Enterprise Consciousness: Building Modern Open Data Platforms
 
A fast introduction to PySpark with a quick look at Arrow based UDFs
A fast introduction to PySpark with a quick look at Arrow based UDFsA fast introduction to PySpark with a quick look at Arrow based UDFs
A fast introduction to PySpark with a quick look at Arrow based UDFs
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
 
Machine Learning with H2O, Spark, and Python at Strata 2015
Machine Learning with H2O, Spark, and Python at Strata 2015Machine Learning with H2O, Spark, and Python at Strata 2015
Machine Learning with H2O, Spark, and Python at Strata 2015
 
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowPyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
 
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in productionScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
 
Introduction to Spark with Python
Introduction to Spark with PythonIntroduction to Spark with Python
Introduction to Spark with Python
 
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 
Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016
 
spark interview questions & answers acadgild blogs
 spark interview questions & answers acadgild blogs spark interview questions & answers acadgild blogs
spark interview questions & answers acadgild blogs
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study Notes
 
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
 
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKSCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
 

More from Mammoth Data

Intro to Apache Spark - Lab
Intro to Apache Spark - LabIntro to Apache Spark - Lab
Intro to Apache Spark - Lab
Mammoth Data
 
A Modern Data Architecture for Risk Management... For Financial Services
A Modern Data Architecture for Risk Management... For Financial ServicesA Modern Data Architecture for Risk Management... For Financial Services
A Modern Data Architecture for Risk Management... For Financial Services
Mammoth Data
 
2015 Red Hat Summit - Open Source in Financial Services
2015 Red Hat Summit - Open Source in Financial Services2015 Red Hat Summit - Open Source in Financial Services
2015 Red Hat Summit - Open Source in Financial Services
Mammoth Data
 
How To Run A Successful BI Project with Hadoop
How To Run A Successful BI Project with HadoopHow To Run A Successful BI Project with Hadoop
How To Run A Successful BI Project with Hadoop
Mammoth Data
 
Cloud Worst Practices
Cloud Worst PracticesCloud Worst Practices
Cloud Worst Practices
Mammoth Data
 
A Gentle Introduction To Storm And Kafka
A Gentle Introduction To Storm And KafkaA Gentle Introduction To Storm And Kafka
A Gentle Introduction To Storm And Kafka
Mammoth Data
 
Become Data Driven With Hadoop as-a-Service
Become Data Driven With Hadoop as-a-ServiceBecome Data Driven With Hadoop as-a-Service
Become Data Driven With Hadoop as-a-Service
Mammoth Data
 

More from Mammoth Data (7)

Intro to Apache Spark - Lab
Intro to Apache Spark - LabIntro to Apache Spark - Lab
Intro to Apache Spark - Lab
 
A Modern Data Architecture for Risk Management... For Financial Services
A Modern Data Architecture for Risk Management... For Financial ServicesA Modern Data Architecture for Risk Management... For Financial Services
A Modern Data Architecture for Risk Management... For Financial Services
 
2015 Red Hat Summit - Open Source in Financial Services
2015 Red Hat Summit - Open Source in Financial Services2015 Red Hat Summit - Open Source in Financial Services
2015 Red Hat Summit - Open Source in Financial Services
 
How To Run A Successful BI Project with Hadoop
How To Run A Successful BI Project with HadoopHow To Run A Successful BI Project with Hadoop
How To Run A Successful BI Project with Hadoop
 
Cloud Worst Practices
Cloud Worst PracticesCloud Worst Practices
Cloud Worst Practices
 
A Gentle Introduction To Storm And Kafka
A Gentle Introduction To Storm And KafkaA Gentle Introduction To Storm And Kafka
A Gentle Introduction To Storm And Kafka
 
Become Data Driven With Hadoop as-a-Service
Become Data Driven With Hadoop as-a-ServiceBecome Data Driven With Hadoop as-a-Service
Become Data Driven With Hadoop as-a-Service
 

Recently uploaded

Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
CatarinaPereira64715
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
Bhaskar Mitra
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
Product School
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Product School
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 

Recently uploaded (20)

Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 

Intro to Apache Spark