1. Intro to Spark
Kyle Burke - IgnitionOne
Data Science Engineer
March 24, 2016
https://www.linkedin.com/in/kyleburke
2. Today’s Topics
• Why Spark?
• Spark Basics
• Spark Under the Hood
• Quick Tour of Spark Core, SQL, and Streaming
• Tips from the Trenches
• Setting up Spark Locally
• Ways to Learn More
3. Who am I?
• Background mostly in Data Warehousing with
some app and web development work.
• Currently a data engineer/data scientist with
IgnitionOne.
• Began using Spark last year.
• Currently using Spark to read data from a Kafka
stream and load it to Redshift/Cassandra.
4. Why Spark?
• You find yourself writing code to parallelize work and then
having to resynchronize the results.
• Your database is overloaded and you want to offload some of
the workload.
• You’re being asked to perform both batch and streaming
operations on your data.
• You’ve got a bunch of data sitting in files that you’d like to
analyze.
• You’d like to make yourself more marketable.
5. Spark Basics Overview
• Spark Conf – Contains config information about your app.
• Spark Context – Main entry point to the cluster. The
driver uses it to define jobs and construct the DAG that
outlines work on the cluster.
• Resilient Distributed Dataset (RDD) – Can be
thought of as a distributed collection.
• SQL Context – Entry point into Spark SQL functionality. Only
a Spark Context is needed to create one.
• DataFrame – Can be thought of as a distributed collection
of rows with named columns, similar to a database table.
6. Spark Core
• First you’ll need to create a SparkConf and SparkContext.
val conf = new SparkConf().setAppName("HelloWorld")
val sc = new SparkContext(conf)
• Using the SparkContext, you can read in data from Hadoop
compatible and local file systems.
val clicks_raw = sc.textFile(path_to_clicks)
val ga_clicks = clicks_raw.filter(s => s.contains("Georgia")) // transformation
val ga_clicks_cnt = ga_clicks.count // This is an action
• The map function applies an operation to each
element of an RDD.
• Lazy evaluation means that no data processing occurs until an
action happens.
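The transformation/action distinction above can be sketched as follows; clicks_raw is the RDD read in earlier, and the field position is a hypothetical assumption about the data layout:

```scala
// map is a transformation: nothing executes yet, Spark only extends the DAG.
val states = clicks_raw.map(line => line.split(",")(2)) // assume state is the 3rd field

// collect is an action: only now does Spark read the file, run the map,
// and return the results to the driver.
val localStates = states.collect()
```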
7. Spark SQL
• Allows DataFrames to be registered as temporary tables.
rawbids = sqlContext.read.parquet(parquet_directory)
rawbids.registerTempTable("bids")
• Tables can be queried using SQL or HiveQL.
sqlContext.sql("SELECT url, insert_date, insert_hr FROM bids")
• Supports user-defined functions (UDFs).
import urllib
from pyspark.sql.types import StringType
sqlContext.registerFunction("urlDecode", lambda s: urllib.unquote(s), StringType())
bids_urls = sqlContext.sql("SELECT urlDecode(url) FROM bids")
• First-class support for complex data types (e.g., the nested
structures typically found in JSON).
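As a sketch of that complex-type support, assuming a hypothetical geo struct column on the bids table registered above:

```scala
// Nested struct fields can be addressed with dot notation directly in SQL.
val cityBids = sqlContext.sql(
  "SELECT geo.city, COUNT(*) AS cnt FROM bids GROUP BY geo.city")

// The DataFrame API accepts the same dotted paths.
val cities = rawbids.select("geo.city")
```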
9. Spark Streaming
• Streaming Context – Context for the cluster to create and manage streams.
• DStream – A sequence of RDDs; formally, a discretized stream.
// File stream example
val ssc = new StreamingContext(conf, Minutes(1))
val impressionStream = ssc.textFileStream(path_to_directory)
impressionStream.foreachRDD((rdd, time) => {
  // normal RDD processing goes here
})
ssc.start()
ssc.awaitTermination()
10. Tips
• Use mapPartitions if you’ve got expensive objects to instantiate.
def partitionLines(lines: Iterator[String]) = {
  val parser = new CSVParser('\t') // built once per partition, not once per line
  lines.map(parser.parseLine(_).size)
}
rdd.mapPartitions(partitionLines)
• Cache RDDs you’re going to reuse.
rdd.cache() // equivalent to rdd.persist(StorageLevel.MEMORY_ONLY)
• Partition files to improve read performance
all_bids.write
.mode("append")
.partitionBy("insert_date","insert_hr")
.json(stage_path)
11. Tips (Cont’d)
• Save DataFrames to JSON/Parquet.
• CSV is more cumbersome to deal with, but the spark-csv package helps.
• Avro data conversions seem buggy.
• Parquet is the format receiving the most effort on
performance optimizations.
• Spark History Server is helpful for troubleshooting.
– Start it by running $SPARK_HOME/sbin/start-history-server.sh
– By default you can access it on port 18080.
• Hive external tables
• Check out spark-packages.org
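The Hive external tables tip can be sketched as follows; the table name, schema, and location are hypothetical, and this assumes sqlContext is a HiveContext:

```scala
// An external table registers metadata over files already sitting in
// HDFS/S3, so they can be queried in place without loading or copying.
sqlContext.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS bids_archive (
    url STRING,
    insert_date STRING,
    insert_hr STRING
  )
  STORED AS PARQUET
  LOCATION '/data/bids/archive'
""")
```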
12. Spark Local Setup
Step Shell Command
Download and place tgz in Spark
folder
>> mkdir Spark
>> mv spark-1.6.1.tgz Spark/
Untar spark tgz file >> tar -xvf spark-1.6.1.tgz
cd to extracted folder >> cd spark-1.6.1
Give Maven extra memory >> export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
Build Spark >> mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver -DskipTests clean package
13. Ways To Learn More
• edX Course: Intro to Spark
• Spark Summit – Previous conferences are
available to view for free.
• Big Data University – IBM’s training.
Editor's Notes
DAG – Directed Acyclic Graph
When the user runs an action (like collect), the graph is submitted to the DAG scheduler, which divides the operator graph into (map and reduce) stages.
A stage is comprised of tasks based on partitions of the input data. The DAG scheduler pipelines operators together to optimize the graph; for example, many map operators can be scheduled in a single stage. This optimization is key to Spark’s performance. The final result of the DAG scheduler is a set of stages.
The stages are passed on to the task scheduler, which launches tasks via the cluster manager (Spark Standalone/YARN/Mesos). The task scheduler doesn’t know about dependencies among stages.
The worker executes the tasks. A new JVM is started per job. The worker knows only about the code that is passed to it.
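The pipelining described above can be illustrated with a sketch; the RDD and field names are hypothetical:

```scala
// map and filter are narrow transformations: the DAG scheduler pipelines
// them into a single stage, applied element-by-element in one pass.
val stage1 = clicks_raw
  .map(line => (line.split(",")(0), 1))
  .filter { case (state, _) => state.nonEmpty }

// reduceByKey requires a shuffle, so it starts a new stage.
val counts = stage1.reduceByKey(_ + _)

counts.collect() // the action submits the two-stage DAG for execution
```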