This document provides an overview of Apache Spark, including why it was created, how it works, and how to get started with it. Some key points:

- Spark was initially developed at UC Berkeley in 2009, growing out of a class project aimed at testing cluster management systems such as Mesos. It was open sourced in 2010 and became a top-level Apache project in 2014.
- Spark is faster than Hadoop for machine learning workloads because it keeps data in memory between jobs rather than writing intermediate results to disk, and it has a much smaller codebase.
- The basic unit of data in Spark is the resilient distributed dataset (RDD), an immutable collection of records partitioned across the machines of a cluster. RDDs support two kinds of operations, transformations and actions (see the sketch after this list).
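The sketch below illustrates the RDD model described above: transformations are declared lazily, an explicit cache keeps the data in memory between jobs, and actions trigger the actual computation. It is a minimal example in Scala, assuming only a local Spark runtime with `spark-core` on the classpath; the application name, the `local[*]` master, and the sample data are illustrative choices rather than anything specified by the document.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddSketch {
  def main(args: Array[String]): Unit = {
    // Local SparkContext for illustration; "local[*]" runs on all local cores.
    val conf = new SparkConf().setAppName("rdd-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Create an RDD from an in-memory collection, distributed across partitions.
    val numbers = sc.parallelize(1 to 1000)

    // Transformations are lazy: they describe new RDDs without computing them yet.
    val squares = numbers.map(n => n * n)
    val evens   = squares.filter(_ % 2 == 0)

    // cache() asks Spark to keep the computed partitions in memory across jobs,
    // which is what makes iterative workloads such as machine learning fast.
    evens.cache()

    // Actions trigger computation and return results to the driver.
    println(evens.count())              // first job: computes and caches the RDD
    println(evens.take(5).mkString(","))// second job: reads from the in-memory cache

    sc.stop()
  }
}
```

Running the example twice over the same cached RDD shows the practical effect of keeping data in memory: only the first action pays the cost of computing the lineage of transformations.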