Alpine Academy Apache Spark series #1: introduction to cluster computing with Python & a wee bit of Scala. This is the first talk in the series and is aimed at the intro level; the next one will cover MLlib & ML.
1. Alpine Academy - Spark
Lightning fast cluster computing with Python
and just a wee bit of Scala
2. Who am I?
Holden
I prefer she/her for pronouns
Co-author of the Learning Spark book
Software Engineer at IBM’s Spark Technology Center
@holdenkarau
http://www.slideshare.net/hkarau
https://www.linkedin.com/in/holdenkarau
3. What we are going to explore together!
What is Spark?
Spark’s primary distributed collection
Word count
Coffee break!
How PySpark works
Using libraries with Spark
Spark SQL / DataFrames (time permitting)
4. What is Spark?
General purpose distributed system
With a really nice API
Apache project (one of the most active)
Much faster than Hadoop Map/Reduce
5. The different pieces of Spark
Apache Spark (core)
SQL & DataFrames
Streaming
Language APIs: Scala, Java, Python, & R
Graph tools: Bagel & GraphX
Spark ML & MLlib
Community packages
7. Some pages to keep open for the exercises
http://bit.ly/sparkDocs
http://bit.ly/sparkPyDocs OR http://bit.ly/sparkScalaDoc
http://bit.ly/PySparkIntroExamples
http://bit.ly/learningSparkExamples
OR the full links:
http://spark.apache.org/docs/latest/api/python/index.html
http://spark.apache.org/docs/latest/
https://github.com/holdenk/intro-to-pyspark-demos
8. Starting the shell
./bin/pyspark OR ./bin/spark-shell
[Lots of output]
SparkContext available as sc, SQLContext available as sqlContext.
>>>
9. Reducing log level
cp ./conf/log4j.properties.template ./conf/log4j.properties
Then set
log4j.rootCategory=ERROR, console
10. SparkContext: entry to the world
Can be used to create RDDs from many input sources
Native collections, local & remote FS
Any Hadoop Data Source
Also creates counters & accumulators
Automatically created in the shells (called sc)
Specify master & app name when creating
Master can be local[*], spark://..., yarn, etc.
App name should be human readable and make sense
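Outside the shells we create it ourselves; a minimal sketch (the master and app name here are just example values):
from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("local[*]").setAppName("IntroToSpark")
sc = SparkContext(conf=conf)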
11. RDDs: Spark's primary abstraction
RDD (Resilient Distributed Dataset)
Recomputed on node failure
Distributed across the cluster
Lazily evaluated (transformations & actions)
13. Word count
lines = sc.textFile(src)
words = lines.flatMap(lambda x: x.split(" "))
word_count = (words.map(lambda x: (x, 1))
              .reduceByKey(lambda x, y: x + y))
word_count.saveAsTextFile(output)
No data is read or processed until the saveAsTextFile line runs: it is an "action", which forces Spark to evaluate the RDD.
14. Word count - in Scala
val lines = sc.textFile(src)
val words = lines.flatMap(_.split(" "))
val word_count = words.map(word => (word, 1)).reduceByKey(_ + _)
word_count.saveAsTextFile(output)
15. Some common transformations & actions
Transformations (lazy)
map
filter
flatMap
reduceByKey
join
cogroup
Actions (eager)
count
reduce
collect
take
saveAsTextFile
saveAsHadoopFile
countByValue
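A tiny sketch of the difference, using a made-up collection:
nums = sc.parallelize([1, 2, 2, 3])
doubled = nums.map(lambda x: x * 2)  # transformation: nothing runs yet
evens = doubled.filter(lambda x: x % 2 == 0)  # still lazy
print(evens.count())  # action: now the work actually happens
print(nums.countByValue())  # action: counts each distinct value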
17. Let's find the lines with the word "Spark"
Get started in Python:
import os
src = "file:///" + os.environ['SPARK_HOME'] + "/README.md"
lines = sc.textFile(src)
Get started in Scala:
val src = "file:///" + sys.env("SPARK_HOME") + "/README.md"
val lines = sc.textFile(src)
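One possible solution in Python, building on the lines RDD above:
spark_lines = lines.filter(lambda line: "Spark" in line)
print(spark_lines.count())  # action: how many lines mention Spark
print(spark_lines.take(3))  # peek at the first few matches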
20. Combined with previous example
Do you notice anything funky?
We read the data in twice :(
cache/persist/checkpoint to the rescue!
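A sketch of the fix, assuming both computations share the lines RDD:
lines = sc.textFile(src)
lines.cache()  # materialized the first time an action computes it
print(lines.filter(lambda l: "Spark" in l).count())  # reads the file once
print(lines.flatMap(lambda l: l.split(" ")).count())  # served from the cache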
21. Let's use toDebugString
un-cached:
>>> print word_count.toDebugString()
(2) PythonRDD[17] at RDD at PythonRDD.scala:43 []
| MapPartitionsRDD[14] at mapPartitions at PythonRDD.scala:346 []
| ShuffledRDD[13] at partitionBy at NativeMethodAccessorImpl.java:-2 []
+-(2) PairwiseRDD[12] at reduceByKey at <stdin>:3 []
| PythonRDD[11] at reduceByKey at <stdin>:3 []
| MapPartitionsRDD[10] at textFile at NativeMethodAccessorImpl.java:-2 []
| file:////home/holden/repos/spark/README.md HadoopRDD[9] at textFile at NativeMethodAccessorImpl.java:-2 []
22. Let's use toDebugString
cached:
>>> print word_count.toDebugString()
(2) PythonRDD[8] at RDD at PythonRDD.scala:43 []
| MapPartitionsRDD[5] at mapPartitions at PythonRDD.scala:346 []
| ShuffledRDD[4] at partitionBy at NativeMethodAccessorImpl.java:-2 []
+-(2) PairwiseRDD[3] at reduceByKey at <stdin>:3 []
| PythonRDD[2] at reduceByKey at <stdin>:3 []
| MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:-2 []
| CachedPartitions: 2; MemorySize: 2.7 KB; ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B
| file:////home/holden/repos/spark/README.md HadoopRDD[0] at textFile at NativeMethodAccessorImpl.java:-2 []
24. Why lazy evaluation?
Allows pipelining operations
Fewer passes over our data, extra happiness
Can skip materializing intermediate results which are really, really big*
Figuring out where our code fails becomes a little trickier
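A sketch of what pipelining buys us: the map and filter below run together in a single pass once the action fires, and the intermediate RDD is never materialized:
result = (lines.map(lambda l: l.strip())
               .filter(lambda l: l.startswith("#"))
               .count())  # one pass over the data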
25. So what happens when we run this code?
[Diagram: a driver plus several workers, with storage (HDFS / Cassandra / etc.) behind them]
26. So what happens when we run this code?
[Diagram: the driver ships our function to each worker]
27. So what happens when we run this code?
[Diagram: each worker reads its portion of the data from HDFS / Cassandra / etc.]
28. So what happens when we run this code?
[Diagram: each worker caches its partition and sends its counts back to the driver]
29. Spark in Scala, how does PySpark work?
Py4J + pickling + magic
This can be kind of slow sometimes
RDDs are generally RDDs of pickled objects
Spark SQL (and DataFrames) avoid some of this
30. So what does that look like?
[Diagram: the driver talks to the JVM over py4j; each worker's JVM talks to its Python processes over pipes]
31. Using other libraries
Built-ins
Just import!*
*Except for Hive: compile with -Phive & then import
spark-packages
--packages
Generic Python
Pre-install on workers (pssh, puppet, etc.)
Add it with --py-files
sc.addPyFile
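A sketch of what those options look like in practice (the package coordinates and file names are just examples):
./bin/pyspark --packages com.databricks:spark-csv_2.10:1.3.0
./bin/spark-submit --py-files mylib.zip my_job.py
Or from a running driver:
sc.addPyFile("mylib.zip")  # hypothetical zip of our helper code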
32. So let's take "DataFrames" out for a spin
Useful for structured data
Support schema inference on JSON
Many operations done without* pickling
Integrated into ML!
Accessed through SQLContext
Not the same feature set as Pandas or R DataFrames
33. Loading data
df = sqlContext.read.load(
    "files/testweet.json",  # from learning-spark-examples
    format="json")
# Built in: json, parquet, etc.
# More formats (csv, etc.) at http://spark-packages.org/
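Once loaded we can take a quick look:
df.printSchema()  # schema inferred from the JSON
df.show()  # peek at the first rows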
34. DataFrames aren't quite as lazy...
Keep track of schema information
Loading JSON data involves looking at the data
Before, if we tried to load non-existent data it wouldn't fail right away; now it fails right away
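A quick sketch of that difference (the path is deliberately bogus):
rdd = sc.textFile("no/such/file")  # lazy: no error until an action runs
df = sqlContext.read.load("no/such/file", format="json")  # fails right here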
37. DataFrames to RDDs & vice versa
map lets us work per-row:
df.map(lambda row: row.text)
Converting back:
infer the schema
or specify the schema
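A sketch of the round trip, assuming df has a text column:
texts = df.map(lambda row: row.text)  # now a plain RDD of strings
rows = texts.map(lambda t: (t, len(t)))
df2 = sqlContext.createDataFrame(rows, ["text", "length"])  # schema inferred from names & data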
38. Or we can make a UDF
from pyspark.sql.types import IntegerType

def function(x):
    # Some magic; must return an int to match the type below
    return len(x)

sqlContext.registerFunction("name", function, IntegerType())
Or in Scala:
def func(a: String): Int = ??? // Magic
sqlContext.udf.register("name", func _)
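Once registered, the function is callable from SQL (the table name here is hypothetical):
df.registerTempTable("tweets")
sqlContext.sql("SELECT name(text) FROM tweets").show()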
39. More exercise funtimes :)
Let's load a sample tweet
Write a UDF to compute the length of the tweet
Select the length of the tweet
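One possible solution sketch, reusing the sample file and the UDF pattern from before (tweetLength is a name we're making up):
from pyspark.sql.types import IntegerType
df = sqlContext.read.load("files/testweet.json", format="json")
sqlContext.registerFunction("tweetLength", lambda text: len(text), IntegerType())
df.registerTempTable("tweets")
sqlContext.sql("SELECT tweetLength(text) FROM tweets").show()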
40. Additional Resources
Programming guide (along with JavaDoc, PyDoc,
ScalaDoc, etc.)
http://spark.apache.org/docs/latest/
Books
Videos
Our next meetup!
Spark Office Hours
follow me on twitter for future ones - https://twitter.com/holdenkarau
fill out this survey to choose the next date - http://bit.ly/spOffice1
41. Books
Learning Spark
Fast Data Processing with Spark (out of date 1st edition)
Fast Data Processing with Spark (2nd edition)
Advanced Analytics with Spark
Coming soon: Spark in Action
42. Spark Videos
Apache Spark YouTube channel
My YouTube Spark videos - http://bit.ly/1MsvUKo
Spark Summit 2014 training
Paco’s Introduction to Apache Spark
We can examine how RDDs work in practice with the traditional word count example. If you've taken another intro to big data class, or just worked with MapReduce, you'll notice that this is a lot less code than we normally have to write.