Tulsa techfest Spark Core Aug 5th 2016

Remove Duplicates
Basic Spark Functionality

Spark Core
• Spark Core is the base engine for large-scale
parallel and distributed data processing. It is
responsible for:
• memory management and fault recovery
• scheduling, distributing and monitoring jobs
on a cluster
• interacting with storage systems

Spark Core
• Spark introduces the concept of an RDD (Resilient
Distributed Dataset)
• an immutable, fault-tolerant, distributed collection of objects
that can be operated on in parallel.
• contains any type of object and is created by loading an
external dataset or distributing a collection from the driver
program.
• RDDs support two types of operations:
• Transformations are operations (such as map, filter, join, union,
and so on) that are performed on an RDD and which yield a
new RDD containing the result.
• Actions are operations (such as reduce, count, first, and so
on) that return a value after running a computation on an RDD.
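
The split between transformations and actions can be sketched as follows. This is a minimal illustration, not code from the talk: the object name, the numbers and the local[2] master are assumptions chosen for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddOpsSketch {
  // Returns (number of squares, sum of squares, first square).
  def run(sc: SparkContext): (Long, Int, Int) = {
    // Create an RDD by distributing a collection from the driver program.
    val numbers = sc.parallelize(1 to 10)
    // Transformations: each one yields a new RDD, nothing is computed yet.
    val evens   = numbers.filter(_ % 2 == 0)
    val squares = evens.map(n => n * n)
    // Actions: each one runs a job and returns a value to the driver.
    (squares.count(), squares.reduce(_ + _), squares.first())
  }

  def main(args: Array[String]): Unit = {
    // local[2] is an assumption so the sketch runs without a cluster.
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("rdd-ops-sketch"))
    try println(run(sc)) // prints (5,220,4)
    finally sc.stop()
  }
}
```

The filter and map calls only describe the computation; the three actions at the end are what trigger the work.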

Spark DataFrames
• DataFrames API is inspired by data frames in R and Python
(Pandas), but designed from the ground-up to support
modern big data and data science applications:
• Ability to scale from kilobytes of data on a single laptop to
petabytes on a large cluster
• Support for a wide array of data formats and storage systems
• State-of-the-art optimization and code generation through the
Spark SQL Catalyst optimizer
• Seamless integration with all big data tooling and
infrastructure via Spark
• APIs for Python, Java, Scala, and R (in development via
SparkR)

val college = sc.textFile("/Users/marksmith/TulsaTechFest2016/colleges.csv")
college: org.apache.spark.rdd.RDD[String]
val cNoDups = college.distinct
cNoDups: org.apache.spark.rdd.RDD[String]
college: RDD
“as,df,asf”
“qw,e,qw”
“mb,kg,o”
“as,df,asf”
“q3,e,qw”
“mb,kg,o”
“as,df,asf”
“qw,e,qw”
“mb,k2,o”
cNoDups: RDD
“as,df,asf”
“qw,e,qw”
“mb,kg,o”
“q3,e,qw”
“mb,k2,o”
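
The toy rows above can be run through distinct in local mode. A small sketch, assuming a local[2] master and an illustrative object name; the nine rows are the ones shown on the slide.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DistinctSketch {
  // The nine toy rows from the slide.
  val rows: Seq[String] = Seq(
    "as,df,asf", "qw,e,qw", "mb,kg,o",
    "as,df,asf", "q3,e,qw", "mb,kg,o",
    "as,df,asf", "qw,e,qw", "mb,k2,o")

  // Returns (row count before distinct, the distinct rows as a set).
  def run(sc: SparkContext): (Long, Set[String]) = {
    val college = sc.parallelize(rows)
    // distinct compares whole lines, so only exact duplicates collapse.
    val cNoDups = college.distinct()
    (college.count(), cNoDups.collect().toSet)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("distinct-sketch"))
    try {
      val (total, unique) = run(sc)
      println(s"rows=$total distinct=${unique.size}") // prints rows=9 distinct=5
    } finally sc.stop()
  }
}
```

The result is collected into a Set because distinct shuffles the data, so the order of the surviving rows is not guaranteed.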

val cRows = college.map(x => x.split(",",-1))
cRows: org.apache.spark.rdd.RDD[Array[String]]
val cKeyRows = cRows.map(x => "%s_%s_%s_%s".format(x(0),x(1),x(2),x(3)) -> x )
cKeyRows: org.apache.spark.rdd.RDD[(String, Array[String])]
college: RDD
“as,df,asf”
“qw,e,qw”
“mb,kg,o”
“as,df,asf”
“q3,e,qw”
“mb,kg,o”
“as,df,asf”
“qw,e,qw”
“mb,k2,o”
cRows: RDD
Array(as,df,asf)
Array(qw,e,qw)
Array(mb,kg,o)
Array(as,df,asf)
Array(q3,e,qw)
Array(mb,kg,o)
Array(as,df,asf)
Array(qw,e,qw)
Array(mb,k2,o)
cKeyRows: RDD
key->Array(as,df,asf)
key->Array(qw,e,qw)
key->Array(mb,kg,o)
key->Array(q3,e,qw)
key->Array(mb,kg,o)
key->Array(qw,e,qw)
key->Array(mb,k2,o)

val cGrouped = cKeyRows
  .groupBy(x => x._1)
  .map(x => (x._1, x._2.to[scala.collection.mutable.ArrayBuffer]))
cGrouped: org.apache.spark.rdd.RDD[(String, scala.collection.mutable.ArrayBuffer[(String, Array[String])])]
val cDups = cGrouped.filter(x => x._2.length > 1)
cKeyRows: RDD
key->Array(qw,e,qw)
key->Array(mb,kg,o)
key->Array(q3,e,qw)
key->Array(mb,kg,o)
key->Array(qw,e,qw)
key->Array(mb,k2,o)
cGrouped: RDD
key->Array(as,df,asf)
Array(as,df,asf)
key->Array(mb,kg,o)
Array(mb,kg,o)
key->Array(mb,k2,o)
key->Array(qw,e,qw)
Array(qw,e,qw)
key->Array(q3,e,qw)

cDups: org.apache.spark.rdd.RDD[(String, scala.collection.mutable.ArrayBuffer[(String, Array[String])])]
val cNoDups = cGrouped.map(x => x._2(0)._2)
cNoDups: org.apache.spark.rdd.RDD[Array[String]]
cGrouped: RDD
key->Array(as,df,asf)
Array(as,df,asf)
key->Array(mb,kg,o)
Array(mb,kg,o)
key->Array(mb,k2,o)
key->Array(qw,e,qw)
Array(qw,e,qw)
key->Array(q3,e,qw)
cNoDups: RDD
“as,df,asf”
“qw,e,qw”
“mb,kg,o”
“q3,e,qw”
“mb,k2,o”
cDups: RDD
key->Array(as,df,asf)
Array(as,df,asf)
key->Array(mb,kg,o)
Array(mb,kg,o)
key->Array(qw,e,qw)
Array(qw,e,qw)
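
The key, group, filter and take-first steps in the diagrams can be traced on the toy rows with plain Scala collections. This is a local analogue rather than RDD code, and it assumes the key is built from the three toy columns (the slides use the first four columns of the real file).

```scala
object KeyedDedupSketch {
  // The nine toy rows from the slides.
  val college: Seq[String] = Seq(
    "as,df,asf", "qw,e,qw", "mb,kg,o",
    "as,df,asf", "q3,e,qw", "mb,kg,o",
    "as,df,asf", "qw,e,qw", "mb,k2,o")

  // Split each line into columns; limit -1 keeps trailing empty fields.
  val cRows: Seq[Array[String]] = college.map(_.split(",", -1))

  // Pair every row with a key built from its columns (assumption: three columns).
  val cKeyRows: Seq[(String, Array[String])] =
    cRows.map(x => "%s_%s_%s".format(x(0), x(1), x(2)) -> x)

  // All rows that share a key end up in the same group.
  val cGrouped: Map[String, Seq[(String, Array[String])]] = cKeyRows.groupBy(_._1)

  // Groups with more than one row are the duplicates.
  val cDups: Map[String, Seq[(String, Array[String])]] =
    cGrouped.filter(_._2.length > 1)

  // Keeping the first row of every group removes the duplicates.
  val cNoDups: Seq[Array[String]] = cGrouped.values.map(_.head._2).toSeq

  def main(args: Array[String]): Unit =
    // prints groups=5 dupGroups=3 noDups=5
    println(s"groups=${cGrouped.size} dupGroups=${cDups.size} noDups=${cNoDups.size}")
}
```

Unlike distinct, this approach lets the key cover only some of the columns, so rows that differ in a non-key column still count as duplicates.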

Previously the RDD was the primary interface, but the Spark DataFrames API is
now considered to be the primary interaction point of Spark. RDDs are still
available if needed.

What is partitioning in Apache Spark?
Partitioning is the main mechanism for using all of your hardware resources while
executing a job.
More partitions = more parallelism
So check the number of task slots in your hardware, i.e. how many tasks each
executor can run at once. Each partition is processed by one task, and the partitions
are spread across the executors.
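
A rough sketch of how partition counts can be inspected and changed. The object name, the data, the slice counts and the local[4] master are assumptions for illustration only.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionSketch {
  // Returns (initial partition count, count after repartition, rows per initial partition).
  def run(sc: SparkContext): (Int, Int, Seq[Int]) = {
    // Ask for 4 partitions explicitly when distributing the collection.
    val rdd = sc.parallelize(1 to 100, numSlices = 4)
    // repartition shuffles the data into a new number of partitions.
    val more = rdd.repartition(8)
    // glom turns every partition into one array, which exposes its size.
    val sizes = rdd.glom().map(_.length).collect().toSeq
    (rdd.partitions.length, more.partitions.length, sizes)
  }

  def main(args: Array[String]): Unit = {
    // local[4] gives four task slots on one machine.
    val sc = new SparkContext(
      new SparkConf().setMaster("local[4]").setAppName("partition-sketch"))
    try {
      println(s"default parallelism: ${sc.defaultParallelism}")
      val (before, after, sizes) = run(sc)
      // prints before=4 after=8 sizes=25,25,25,25
      println(s"before=$before after=$after sizes=${sizes.mkString(",")}")
    } finally sc.stop()
  }
}
```

With four task slots, the four initial partitions can be processed at the same time; the eight partitions after repartition are processed in two waves.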

DataFrames
• A DataFrame has a column structure, and each
record is a row.
• Statistics can be computed naturally, because a DataFrame works
much like a SQL table or a Python/R data frame.
• With an RDD, processing only the last 7 days of data means
Spark has to go through the entire dataset. With a DataFrame
that has a time column, Spark can filter on that column, and
with a suitable data layout it won’t even read the data that is
older than 7 days.
• Easier to program.
• Better performance and more compact storage in the executor heap.

How do DataFrames
read less data?
• A DataFrame read can skip whole partitions of
the data.
• Using Parquet
• Skipping data using statistics (i.e. min, max)
• Using partitioning (i.e. year = 2015/month = 06…)
• Pushing predicates into storage systems.
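
The partitioning and predicate bullets above can be sketched with a small partitioned Parquet dataset. Everything here is illustrative: the event data, the column names and the temp directory are hypothetical, and SQLContext is used to match the Spark 1.x style of the demos (its constructor is deprecated in newer Spark versions).

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object PartitionPruningSketch {
  // Returns the number of rows that survive both filters.
  def run(sqlContext: SQLContext): Long = {
    import sqlContext.implicits._
    // Hypothetical event data; the column names are assumptions.
    val events = Seq(
      (2015, 6, "a", 10.0),
      (2015, 7, "b", 20.0),
      (2014, 6, "c", 30.0)
    ).toDF("year", "month", "id", "amount")

    val path = Files.createTempDirectory("events-sketch").resolve("events").toString
    // Lays the files out in directories such as year=2015/month=6/.
    events.write.partitionBy("year", "month").parquet(path)

    val june2015 = sqlContext.read.parquet(path)
      // Filters on partition columns let Spark skip whole directories.
      .filter($"year" === 2015 && $"month" === 6)
      // Filters on ordinary columns can be pushed down to the Parquet reader.
      .filter($"amount" > 5.0)
    // The physical plan shows which filters were pushed down; its text varies by version.
    june2015.explain()
    june2015.count()
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("pruning-sketch"))
    try println(run(new SQLContext(sc))) // prints 1 after the plan
    finally sc.stop()
  }
}
```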

What is Parquet
• Parquet should be the source format for any operation or ETL. If the
data is in a different format, the preferred approach is to convert the
source to Parquet first and then process it.
• If a dataset is in JSON or a comma-separated file, first run an ETL step
to convert it to Parquet.
• It limits I/O: a query scans only the columns it needs.
• Parquet uses a columnar layout, so it compresses better and
saves space.
• Parquet stores the values of each column together, one column after
another. So if a table has 3 columns and a SQL query touches only 2 of
them, the data for the 3rd column is never read.
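
The convert-then-query approach can be sketched as follows. The CSV content, the column names and the paths are hypothetical, and the file is parsed by hand with split so that the sketch needs no CSV package.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object CsvToParquetSketch {
  // Returns the (name, score) pairs read back from Parquet, sorted by name.
  def run(sc: SparkContext, sqlContext: SQLContext): Seq[(String, Int)] = {
    import sqlContext.implicits._
    val dir = Files.createTempDirectory("parquet-sketch")
    // A tiny hypothetical CSV file with three columns: name, state, score.
    val csvPath = dir.resolve("scores.csv")
    Files.write(csvPath, "a,IL,520\nb,OK,640\nc,OK,580\n".getBytes("UTF-8"))

    // ETL step: parse the text file and give the columns names and types.
    val scores = sc.textFile(csvPath.toUri.toString)
      .map(_.split(",", -1))
      .map(f => (f(0), f(1), f(2).toInt))
      .toDF("name", "state", "score")

    // Convert once to Parquet; later jobs read this copy instead of the CSV.
    val parquetPath = dir.resolve("scores.parquet").toString
    scores.write.parquet(parquetPath)

    // Selecting two of the three columns means the "state" column is not needed.
    sqlContext.read.parquet(parquetPath)
      .select("name", "score")
      .collect()
      .map(r => (r.getString(0), r.getInt(1)))
      .sortBy(_._1)
      .toSeq
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("csv-to-parquet-sketch"))
    try println(run(sc, new SQLContext(sc)))
    finally sc.stop()
  }
}
```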

val college = sc.textFile("/Users/marksmith/TulsaTechFest2016/CollegeScoreCard.csv")
college.count
res2: Long = 7805
val collegeNoDups = college.distinct
collegeNoDups.count
res3: Long = 7805
val college = sc.textFile("/Users/marksmith/TulsaTechFest2016/colleges.csv")
college: org.apache.spark.rdd.RDD[String] = /Users/marksmith/TulsaTechFest2016/colleges.csv MapPartitionsRDD[17] at textFile at <console>:27
val cNoDups = college.distinct
cNoDups: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[20] at distinct at <console>:29
cNoDups.count
res7: Long = 7805
college.count
res8: Long = 9000
val cRows = college.map(x => x.split(",",-1))
cRows: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[21] at map at <console>:29
val cKeyRows = cRows.map(x => "%s_%s_%s_%s".format(x(0),x(1),x(2),x(3)) -> x )
cKeyRows: org.apache.spark.rdd.RDD[(String, Array[String])] = MapPartitionsRDD[22] at map at <console>:31
cKeyRows.take(2)
res11: Array[(String, Array[String])] = Array((UNITID_OPEID_opeid6_INSTNM,Array(
val cGrouped = cKeyRows.groupBy(x => x._1).map(x => (x._1,x._2.to[scala.collection.mutable.ArrayBuffer]))
cGrouped: org.apache.spark.rdd.RDD[(String, scala.collection.mutable.ArrayBuffer[(String, Array[String])])] = MapPartitionsRDD[27] at map at <console>:33
val cDups = cGrouped.filter(x => x._2.length > 1)
cDups: org.apache.spark.rdd.RDD[(String, scala.collection.mutable.ArrayBuffer[(String, Array[String])])] = MapPartitionsRDD[28] at filter at <console>:35
cDups.count
res12: Long = 1195
val cNoDups = cGrouped.map(x => x._2(0))
cNoDups: org.apache.spark.rdd.RDD[(String, Array[String])] = MapPartitionsRDD[29] at map at <console>:35
cNoDups.count
res13: Long = 7805
val cNoDups = cGrouped.map(x => x._2(0)._2)
cNoDups: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[30] at map at <console>:35
cNoDups.take(5)
16/08/04 16:44:24 ERROR Executor: Managed memory leak detected; size = 41227428 bytes, TID = 28
res16: Array[Array[String]] = Array(Array(145725, 00169100, 001691, Illinois Institute of Technology, Chicago, IL, www.iit.edu, npc.collegeboard.org/student/app/iit, 0, 3, 2, 11, 0, 0, 0, 0, 0, 0, 0, 0, 0,
NULL, 520, 640, 640, 740, 520, 640, 580, 690, 580, 25, 30, 24, 31, 26, 33, NULL, NULL, 28, 28, 30, NULL, 1252, 1252, 0, 0, 0.2026, 0, 0, 0, 0.1225, 0, 0, 0.4526, 0.0245
Demo RDD Code
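
As a possible variation on the demo, one row per key can also be kept with reduceByKey instead of groupBy. This is a swapped-in technique, not what the demo ran; it avoids building a buffer of every row for each key. The toy rows, the three-column key and the object name are assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ReduceByKeySketch {
  // The nine toy rows from the earlier slides.
  val rows: Seq[String] = Seq(
    "as,df,asf", "qw,e,qw", "mb,kg,o",
    "as,df,asf", "q3,e,qw", "mb,kg,o",
    "as,df,asf", "qw,e,qw", "mb,k2,o")

  // Returns the deduplicated rows, joined back into lines.
  def run(sc: SparkContext): Set[String] = {
    val cKeyRows = sc.parallelize(rows)
      .map(_.split(",", -1))
      .map(x => "%s_%s_%s".format(x(0), x(1), x(2)) -> x)
    // Keep one row per key; which duplicate survives is not guaranteed.
    val cNoDups = cKeyRows.reduceByKey((kept, _) => kept).values
    cNoDups.map(_.mkString(",")).collect().toSet
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("reduce-by-key-sketch"))
    try println(run(sc).size) // prints 5
    finally sc.stop()
  }
}
```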

import org.apache.spark.sql.SQLContext
val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("/Users/marksmith/TulsaTechFest2016/colleges.csv")
df: org.apache.spark.sql.DataFrame = [UNITID: int, OPEID: int, opeid6: int, INSTNM: string, CITY: string, STABBR: string, INSTURL: string, NPCURL: string, HCM2: int, PREDDEG: int, CONTROL: int, LOCALE: string, HBCU: string, PBI: string, ANNHI:
string, TRIBAL: string, AANAPII: string, HSI: string, NANTI: string, MENONLY: string, WOMENONLY: string, RELAFFIL: string, SATVR25: string, SATVR75: string, SATMT25: string, SATMT75: string, SATWR25: string, SATWR75: string, SATVRMID: string,
SATMTMID: string, SATWRMID: string, ACTCM25: string, ACTCM75: string, ACTEN25: string, ACTEN75: string, ACTMT25: string, ACTMT75: string, ACTWR25: string, ACTWR75: string, ACTCMMID: string, ACTENMID: string, ACTMTMID: string,
ACTWRMID: string, SAT_AVG: string, SAT_AVG_ALL: string, PCIP01: string, PCIP03: stri...
val dfd = df.distinct
dfd.count
res0: Long = 7804
df.count
res1: Long = 8998
val dfdd = df.dropDuplicates(Array("UNITID", "OPEID", "opeid6", "INSTNM"))
dfdd.count
res2: Long = 7804
val dfCnt = df.groupBy("UNITID", "OPEID", "opeid6", "INSTNM").agg(count("UNITID").alias("cnt"))
dfCnt: org.apache.spark.sql.DataFrame = [UNITID: int, OPEID: int, opeid6: int, INSTNM: string, cnt: bigint]
dfCnt.show
+--------+-------+------+--------------------+---+
|  UNITID|  OPEID|opeid6|              INSTNM|cnt|
+--------+-------+------+--------------------+---+
|10236801| 104703|  1047|Troy University-P...|  2|
|11339705|3467309| 34673|Marinello Schools...|  2|
|  135276| 558500|  5585|Lively Technical ...|  2|
|  145682| 675300|  6753|Illinois Central ...|  2|
|  151111| 181300|  1813|Indiana Universit...|  1|
df.registerTempTable("colleges")
val dfCnt2 = sqlContext.sql("select UNITID, OPEID, opeid6, INSTNM, count(UNITID) as cnt from colleges group by UNITID, OPEID, opeid6, INSTNM")
dfCnt2: org.apache.spark.sql.DataFrame = [UNITID: int, OPEID: int, opeid6: int, INSTNM: string, cnt: bigint]
dfCnt2.show
+--------+-------+------+--------------------+---+
|  UNITID|  OPEID|opeid6|              INSTNM|cnt|
+--------+-------+------+--------------------+---+
|10236801| 104703|  1047|Troy University-P...|  2|
|11339705|3467309| 34673|Marinello Schools...|  2|
|  135276| 558500|  5585|Lively Technical ...|  2|
|  145682| 675300|  6753|Illinois Central ...|  2|
|  151111| 181300|  1813|Indiana Universit...|  1|
|  156921| 696100|  6961|Jefferson Communi...|  1|
Demo DataFrame Code
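
The three DataFrame approaches from the demo (distinct, dropDuplicates on key columns, and groupBy with a count) can be compared on a tiny hypothetical table. The rows and the object name are made up for illustration; only the column names echo the demo.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.count

object DataFrameDedupSketch {
  // Returns (rows after distinct, rows after dropDuplicates, keys seen more than once).
  def run(sqlContext: SQLContext): (Long, Long, Long) = {
    import sqlContext.implicits._
    // Hypothetical rows, not taken from the college data set.
    val df = Seq(
      (1, "Alpha College", "OK"),
      (1, "Alpha College", "OK"),   // exact duplicate of the row above
      (2, "Beta Institute", "IL"),
      (2, "Beta Institute", "TX"),  // same key, different non-key column
      (3, "Gamma University", "KS")
    ).toDF("UNITID", "INSTNM", "STABBR")

    // distinct compares whole rows, so only the exact duplicate is removed.
    val exact = df.distinct().count()
    // dropDuplicates compares only the listed columns.
    val byKey = df.dropDuplicates(Seq("UNITID", "INSTNM")).count()
    // Counting per key and keeping cnt > 1 lists the duplicated keys.
    val dupKeys = df.groupBy("UNITID", "INSTNM")
      .agg(count("UNITID").alias("cnt"))
      .filter($"cnt" > 1)
      .count()
    (exact, byKey, dupKeys)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("df-dedup-sketch"))
    try println(run(new SQLContext(sc))) // prints (4,3,2)
    finally sc.stop()
  }
}
```

In the demo both distinct and dropDuplicates returned 7804 rows, which suggests the duplicates in that file were exact copies; the two counts diverge only when rows share a key but differ elsewhere, as in this sketch.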
