Intro to Spark
Kyle Burke - IgnitionOne
Data Science Engineer
March 24, 2016
https://www.linkedin.com/in/kyleburke
Today’s Topics
• Why Spark?
• Spark Basics
• Spark Under the Hood.
• Quick Tour of Spark Core, SQL, and Streaming.
• Tips from the Trenches.
• Setting up Spark Locally.
• Ways to Learn More.
Who am I?
• Background mostly in Data Warehousing with
some app and web development work.
• Currently a data engineer/data scientist with
IgnitionOne.
• Began using Spark last year.
• Currently using Spark to read data from a Kafka
stream and load it into Redshift/Cassandra.
Why Spark?
• You find yourself writing code to parallelize data and then
having to resync the results.
• Your database is overloaded and you want to offload some of
the workload.
• You’re being asked to perform both batch and streaming
operations with your data.
• You’ve got a bunch of data sitting in files that you’d like to
analyze.
• You’d like to make yourself more marketable.
Spark Basics Overview
• SparkConf – Contains config information about your app.
• SparkContext – Contains config information about the
cluster. It is the driver, which defines jobs and constructs the DAG that
outlines work on the cluster.
• Resilient Distributed Dataset (RDD) – Can be
thought of as a distributed collection.
• SQLContext – Entry point into Spark SQL functionality. You only
need a SparkContext to create one.
• DataFrame – Can be thought of as a distributed collection
of rows with named columns, similar to a database table. (A sketch of how
these pieces fit together follows this list.)
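A minimal sketch of how these objects relate (Scala, Spark 1.x APIs; the file name events.json is a placeholder):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
val conf = new SparkConf().setAppName("BasicsOverview") // app-level config
val sc = new SparkContext(conf) // the driver's connection to the cluster
val sqlContext = new SQLContext(sc) // created from just a SparkContext
val rdd = sc.parallelize(Seq(1, 2, 3)) // RDD: a distributed collection
val df = sqlContext.read.json("events.json") // DataFrame: rows with named columns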
Spark Core
• First you’ll need to create a SparkConf and SparkContext.
val conf = new SparkConf().setAppName("HelloWorld")
val sc = new SparkContext(conf)
• Using the SparkContext, you can read in data from Hadoop-compatible
and local file systems.
val clicks_raw = sc.textFile(path_to_clicks)
val ga_clicks = clicks_raw.filter(s => s.contains("Georgia")) // This is a transformation
val ga_clicks_cnt = ga_clicks.count // This is an action
• The map function lets you perform an operation on each
row of an RDD.
• Lazy evaluation means that no data processing occurs until an
action happens (see the sketch below).
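A hedged illustration of map and lazy evaluation, reusing clicks_raw from above (the comma-separated layout with a state in the third field is an assumption for the example):
val states = clicks_raw.map(line => line.split(",")(2)) // transformation: applied per row, still lazy
val state_counts = states.countByValue() // action: only now is the file actually read and processed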
Spark SQL
• Allows DataFrames to be registered as temporary tables (examples below use PySpark).
rawbids = sqlContext.read.parquet(parquet_directory)
rawbids.registerTempTable("bids")
• Tables can be queried using SQL or HiveQL.
sqlContext.sql("SELECT url, insert_date, insert_hr from bids")
• Supports user-defined functions (UDFs).
import urllib
from pyspark.sql.types import StringType
sqlContext.registerFunction("urlDecode", lambda s: urllib.unquote(s), StringType())
bids_urls = sqlContext.sql("SELECT urlDecode(url) from bids")
• First-class support for complex data types (e.g., the nested structures
typically found in JSON; see the sketch below).
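A small sketch of querying nested types (Scala; the JSON layout with a user struct and a urls array is hypothetical):
// e.g. {"user": {"geo": {"state": "GA"}}, "urls": ["a.com", "b.com"]}
val events = sqlContext.read.json(events_path)
events.registerTempTable("events")
sqlContext.sql("SELECT user.geo.state, size(urls) FROM events") // dot notation reaches into structs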
Spark Core Advanced Topics
Spark Streaming
• StreamingContext – context for the cluster to create and manage streams.
• DStream – A sequence of RDDs. Formally, a discretized stream.
//File Stream Example
import org.apache.spark.streaming.{Minutes, StreamingContext}
val ssc = new StreamingContext(conf, Minutes(1))
val impressionStream = ssc.textFileStream(path_to_directory)
impressionStream.foreachRDD((rdd, time) => {
  //normal rdd processing goes here
})
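Nothing runs until the streaming context is started; a minimal closing step:
ssc.start() // begin consuming the stream
ssc.awaitTermination() // block until the application is stopped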
Tips
• Use mapPartitions if you’ve got expensive objects to instantiate (like
parsers), so they’re created once per partition rather than once per row.
import au.com.bytecode.opencsv.CSVParser // opencsv parser, assumed from the original example
def partitionLines(lines: Iterator[String]) = {
  val parser = new CSVParser('\t') // instantiated once per partition
  lines.map(parser.parseLine(_).size)
}
rdd.mapPartitions(partitionLines)
• Cache RDDs if you’re going to reuse them.
rdd.cache() // equivalent to rdd.persist(StorageLevel.MEMORY_ONLY)
• Partition files to improve read performance (see the read-side sketch below).
all_bids.write
  .mode("append")
  .partitionBy("insert_date", "insert_hr")
  .json(stage_path)
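The payoff comes at read time: filtering on a partition column lets Spark skip entire directories (partition pruning). A hedged sketch reusing stage_path from above:
val bids = sqlContext.read.json(stage_path) // partition columns are discovered from the directory layout
bids.filter(bids("insert_date") === "2016-03-24").count() // scans only the matching insert_date directories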
Tips (Cont’d)
• Save DataFrames to JSON/Parquet.
• CSV is more cumbersome to deal with, but the spark-csv package helps
(see the sketch after this list).
• Avro data conversions seem buggy.
• Parquet is the format where the most effort is going into
performance optimizations.
• The Spark History Server is helpful for troubleshooting.
– Start it by running "$SPARK_HOME/sbin/start-history-server.sh"
– By default you can access it on port 18080.
• Hive external tables.
• Check out spark-packages.org.
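A hedged example of reading CSV with the spark-csv package (Scala; the package coordinates and file name are illustrative for the Spark 1.x line):
// launch with: spark-shell --packages com.databricks:spark-csv_2.10:1.4.0
val clicks = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true") // first line contains column names
  .option("inferSchema", "true") // sample the data to guess column types
  .load("clicks.csv")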
Spark Local Setup
Step: Download the tgz and place it in a Spark folder.
>> mkdir Spark
>> mv spark-1.6.1.tgz Spark/
Step: Untar the Spark tgz file.
>> tar -xvf spark-1.6.1.tgz
Step: cd into the extracted folder.
>> cd spark-1.6.1
Step: Give Maven extra memory.
>> export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
Step: Build Spark.
>> mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver -DskipTests clean package
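Once the build completes, a quick smoke test from the same folder confirms the shell starts:
>> ./bin/spark-shell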
Ways To Learn More
• edX course: Intro to Spark.
• Spark Summit – Previous conferences are
available to view for free.
• Big Data University – IBM’s training.
Editor’s Notes
1. DAG – Directed Acyclic Graph. When the user runs an action (like collect), the graph is submitted to the DAG scheduler, which divides the operator graph into (map and reduce) stages. A stage is comprised of tasks based on partitions of the input data. The DAG scheduler pipelines operators together to optimize the graph; for example, many map operators can be scheduled into a single stage. This optimization is key to Spark’s performance. The final result of the DAG scheduler is a set of stages, which are passed on to the task scheduler. The task scheduler launches tasks via the cluster manager (Spark Standalone/YARN/Mesos); it doesn’t know about dependencies among stages. The worker executes the tasks; a new JVM is started per job, and the worker knows only about the code passed to it.
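To see this lineage yourself, toDebugString on an RDD prints the DAG Spark will execute, with stage boundaries visible (a quick illustrative snippet; README.md is just a convenient local file):
val words = sc.textFile("README.md").flatMap(_.split(" "))
val counts = words.map((_, 1)).reduceByKey(_ + _)
println(counts.toDebugString) // prints the RDD lineage, indented at shuffle/stage boundaries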