This document provides an overview of basic usage of the Apache Spark framework for data analysis. It describes what Spark is, how to install it, and how to use it from Scala, Python, and R. It also explains the key concepts of RDDs (Resilient Distributed Datasets), transformations, and actions. Transformations like filter, map, and join return new RDDs, while actions like reduce, collect, count, and first return results to the driver program. The document provides examples of common transformations and actions in Spark.
Basic Usage Of Spark
Author(s): Nagavarunkumar Kolla
Table of Contents
1. Introduction to Data Analysis with Spark
1.1 What Is Apache Spark?
1.2 Important Notes about Spark
2. Spark Installation Steps
2.1 How to start scala, python & R
3. Technical Details
3.1 Transformations
a. cartesian (otherDataset)
b. partitions
c. cogroup (otherDataset, [numTasks])
d. distinct([numTasks])
e. filter(func) & filterWith(func)
f. sortBy() & sortByKey()
g. flatMap(func)
h. join (otherDataset, [numTasks])
i. sum(), max() & min()
j. pipe(command, [envVars])
3.2 Actions
a. collect() & collectAsMap()
b. count(), countByKey() & countByValue()
c. first()
d. foreach(func)
e. keys()
f. reduce(func)
g. saveAsTextFile(path)
h. stats()
i. take(n) & takeOrdered(n, [ordering])
j. top()
3.3 Miscellaneous
a. Zip & CombineByKey
b. toDebugString()
c. ++ Operator
d. values & variance
4. DataFrames
a. Defining a Function
b. Defining a Function with Aggregate
5. References and Links
1. Introduction to Data Analysis with Spark:
1.1 What Is Apache Spark?
Apache Spark is a cluster computing platform designed to be fast and general purpose.
Spark is designed to be highly accessible, offering simple APIs in Python, Java, Scala, and SQL,
and rich built-in libraries.
1.2 Important Notes about Spark:
1. Spark provides two ways to create RDDs: loading an external dataset and parallelizing a
collection in your driver program.
2. The simplest way to create an RDD is to take an existing collection in your program and pass it
to SparkContext's parallelize() method.
3. An RDD (Resilient Distributed Dataset) supports two types of operations: Transformations
& Actions (see the sketch below).
Transformations are operations on RDDs that return a new RDD, such as map() and
filter().
Actions are operations that return a result to the driver program or write it to storage,
such as count() and first().
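A minimal sketch of these ideas (run in spark-shell, where sc is already provided; the data is illustrative):
val data = sc.parallelize(List(1, 2, 3, 4, 5)) // create an RDD from a local collection
val evens = data.filter(_ % 2 == 0) // transformation: returns a new RDD, nothing is computed yet
evens.count() // action: triggers the computation and returns 2 to the driver
evens.first() // action: returns the first element, 2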
2. Spark Installation Steps
1. Install Java, Python, R
sudo apt-get install openjdk-7-jdk
sudo apt-get install python
sudo apt-get install r-base
2. Install Hadoop-2.6.0, Hive-1.2.1
3. Install scala-2.10.4
Download the file from the link below
http://www.scala-lang.org/files/archive/scala-2.10.4.tgz
Copy the tar file into your work dir & extract it
Update the ~/.bashrc file with the SCALA_HOME path
4. Install spark-1.4.1
Download the file from the link below
https://spark.apache.org/downloads.html
Copy the tar file into your work dir & extract it
Update the ~/.bashrc file with the SPARK_HOME path (sample lines below)
5. Reopen the terminal to reflect the changes
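The ~/.bashrc entries might look like the following (the install directories are assumptions; point them at wherever you extracted the archives):
export SCALA_HOME=/home/hadoop/workshop/scala-2.10.4
export SPARK_HOME=/home/hadoop/workshop/spark-1.4.1-bin-hadoop2.6
export PATH=$PATH:$SCALA_HOME/bin:$SPARK_HOME/bin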
2.1 How to start scala, python & R
1. "scala" is the command to start the "scala"
"python" is the command to start the "python"
"R" is the command to start the "R"
Verify that all of them are installed correctly
2. spark-shell, pyspark, sparkR, spark-sql these are the main commands to
work with spark
execute "spark-shell" command, it will move to "scala" prompt
execute "pyspark" command, it will move to "python" prompt
execute "sparkR" command, it will move to "R" prompt
3. Technical Details
3.1 Transformations
All transformations in Spark are lazy, in that they do not compute their results right away.
The transformations are only computed when an action requires a result to be returned to the
driver program.
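A small sketch of this laziness (the values are illustrative):
val nums = sc.parallelize(1 to 5)
val doubled = nums.map(_ * 2) // transformation: only recorded, not computed yet
doubled.collect // action: the map is executed now
res: Array[Int] = Array(2, 4, 6, 8, 10)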
The following lists some of the common transformations supported by Spark
cartesian (otherDataset): When called on datasets of types T and U, returns a dataset of (T,
U) pairs (all pairs of elements).
Example:
val x = sc.parallelize(List(1,2,3,4,5))
val y = sc.parallelize(List(6,7,8,9,10))
x.cartesian(y).collect
res: Array[(Int, Int)] = Array((1,6), (1,7), (1,8), (1,9), (1,10), (2,6), (2,7), (2,8), (2,9), (2,10), (3,6),
(3,7), (3,8), (3,9), (3,10), (4,6), (5,6), (4,7), (5,7), (4,8), (5,8), (4,9), (4,10), (5,9), (5,10))
partitions: Returns the array of partitions of this RDD. An RDD (Resilient Distributed Dataset), the basic abstraction in Spark,
represents an immutable, partitioned collection of elements that can be operated on in parallel; partitions.length gives the
number of partitions, and coalesce can be used to reduce it.
Example:
val y = sc.parallelize(1 to 10, 10)
y.partitions.length
res: Int = 10
val z = y.coalesce(2, false)
z.partitions.length
res: Int = 2
cogroup (otherDataset, [numTasks]): When called on datasets of type (K, V) and (K, W),
returns a dataset of (K, (Iterable<V>, Iterable<W>)) tuples. This operation is also called
groupWith.
Example:
val a = sc.parallelize(List(1, 2, 1, 3), 1)
val b = a.map((_, "b"))
val c = a.map((_, "c"))
val d = a.map((_, "d"))
b.cogroup(c).collect
res: Array[(Int, (Iterable[String], Iterable[String]))] = Array((1,(CompactBuffer(b,
b),CompactBuffer(c, c))), (3,(CompactBuffer(b),CompactBuffer(c))),
(2,(CompactBuffer(b),CompactBuffer(c))))
b.cogroup(c, d).collect
res: Array[(Int, (Iterable[String], Iterable[String], Iterable[String]))] = Array((1,(CompactBuffer(b,
b),CompactBuffer(c, c),CompactBuffer(d, d))),
(3,(CompactBuffer(b),CompactBuffer(c),CompactBuffer(d))),
(2,(CompactBuffer(b),CompactBuffer(c),CompactBuffer(d))))
distinct([numTasks]): Return a new dataset that contains the distinct elements of the source
dataset
Example 1:
val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog", "Gnu", "Rat"), 2)
c.distinct.collect
res: Array[String] = Array(Dog, Gnu, Cat, Rat)
Example 2:
val a = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10))
a.distinct(2).partitions.length
res: Int = 2
a.distinct(3).partitions.length
res: Int = 3
filter(func) & filterWith(func): Return a new dataset formed by selecting those elements of the
source on which func returns true.
Example 1:
val a = sc.parallelize(1 to 10, 3)
val b = a.filter(_ % 2 == 0)
b.collect
res: Array[Int] = Array(2, 4, 6, 8, 10)
Example 2:
val a = sc.parallelize(1 to 9, 3)
val b = a.filterWith(i => i)((x,i) => x % 2 == 0 || i % 2 == 0)
b.collect
res: Array[Int] = Array(1, 2, 3, 4, 6, 7, 8, 9)
val a = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10), 5)
a.filterWith(x=> x)((a, b) => b == 0).collect
res: Array[Int] = Array(1, 2)
a.filterWith(x=> x)((a, b) => a % (b+1) == 0).collect
res: Array[Int] = Array(1, 2, 4, 6, 8, 10)
a.filterWith(x=> x.toString)((a, b) => b == "2").collect
res: Array[Int] = Array(5, 6)
sortBy() & sortByKey([ascending], [numTasks]): When called on a dataset of (K, V) pairs
where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or
descending order, as specified in the boolean ascending argument.
Example 1:
val y = sc.parallelize(Array(5, 7, 1, 3, 2, 1))
y.sortBy(c => c, true).collect
res: Array[Int] = Array(1, 1, 2, 3, 5, 7)
y.sortBy(c => c, false).collect
res: Array[Int] = Array(7, 5, 3, 2, 1, 1)
val z = sc.parallelize(Array(("H", 10), ("A", 26), ("Z", 1), ("L", 5)))
z.sortBy(c => c._1, true).collect
res: Array[(String, Int)] = Array((A,26), (H,10), (L,5),(Z,1))
z.sortBy(c => c._2, true).collect
res: Array[(String, Int)] = Array((Z,1), (L,5), (H,10),(A,26))
Example 2:
val a = sc.parallelize(List("dog", "cat", "owl", "gnu", "ant"), 2)
val b = sc.parallelize(1 to a.count.toInt, 2)
val c = a.zip(b)
c.sortByKey(true).collect
res: Array[(String, Int)] = Array((ant,5), (cat,2), (dog,1), (gnu,4), (owl,3))
c.sortByKey(false).collect
res: Array[(String, Int)] = Array((owl,3), (gnu,4), (dog,1), (cat,2), (ant,5))
val a = sc.parallelize(1 to 100, 5)
val b = a.cartesian(a)
val c = sc.parallelize(b.takeSample(true, 5, 13), 2)
val d = c.sortByKey(false)
d.collect
res: Array[(Int, Int)] = Array((96,9), (84,76), (59,59), (53,65), (52,4))
flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items (so
func should return a Seq rather than a single item).
Example:
val a = sc.parallelize(1 to 10, 5)
a.flatMap(1 to _).collect
res: Array[Int] = Array(1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4,
5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5,
6, 7, 8, 9, 10)
sc.parallelize(List(1, 2, 3), 2).flatMap(x => List(x, x, x)).collect
res: Array[Int] = Array(1, 1, 1, 2, 2, 2, 3, 3, 3)
The program below generates a random number of copies (from 0 to 9) of
the items in the list.
val x = sc.parallelize(1 to 10, 3)
x.flatMap(List.fill(scala.util.Random.nextInt(10))(_)).collect
res: Array[Int] = Array(1, 2, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7,
7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 10)
join (otherDataset, [numTasks]): When called on datasets of type (K, V) and (K, W), returns a dataset
of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through
leftOuterJoin, rightOuterJoin, and fullOuterJoin.
Example:
val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)
val b = a.keyBy(_.length)
val c = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3)
val d = c.keyBy(_.length)
b.join(d).collect
res: Array[(Int, (String, String))] = Array((6,(salmon,salmon)), (6,(salmon,rabbit)), (6,
(salmon,turkey)), (6,(salmon,salmon)), (6,(salmon,rabbit)), (6,(salmon,turkey)), (3,
(dog,dog)), (3,(dog,cat)), (3,(dog,gnu)), (3,(dog,bee)), (3,(rat,dog)), (3,(rat,cat)), (3,
(rat,gnu)), (3,(rat,bee)))
pipe(command, [envVars]): Pipe each partition of the RDD through a shell command, e.g. a Perl or
bash script. RDD elements are written to the process's stdin and lines output to its stdout are returned
as an RDD of strings.
Example:
val a = sc.parallelize(1 to 9, 3)
a.pipe("head -n 1").collect
res: Array[String] = Array(1, 4, 7)
sum(), max() & min(): Return the sum of the elements of a numeric RDD, and the maximum and minimum elements
according to their natural ordering (for pair RDDs, tuples are compared field by field, as in Example 3).
Example 1:
val x = sc.parallelize(List(1.0, 2.0, 3.0, 5.0, 20.0, 19.02, 19.29, 11.09, 21.0), 2)
x.sum
res: Double = 101.39999999999999
Example 2:
val y = sc.parallelize(10 to 30)
y.max
res: Int = 30
Example 3:
val a = sc.parallelize(List((10, "dog"), (3, "tiger"), (9,
"lion"), (18, "cat")))
a.min
res: (Int, String) = (3,tiger)
3.2 Actions
By default, each transformed RDD may be recomputed each time you run an action on
it. However, you may also persist an RDD in memory using the persist (or cache) method, in
which case Spark will keep the elements around on the cluster for much faster access the next
time you query it.
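For example (a sketch; the dataset is illustrative):
val nums = sc.parallelize(1 to 1000000)
nums.cache() // mark the RDD to be persisted in memory
nums.count() // first action: computes the RDD and caches it
nums.count() // later actions reuse the cached data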
The following lists some of the common actions supported by Spark.
collect() & collectAsMap(): Return all the elements of the dataset as an array at the driver
program. This is usually useful after a filter or other operation that returns a sufficiently small
subset of the data.
Example 1:
val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog", "Gnu", "Rat"), 2)
c.collect
res: Array[String] = Array(Gnu, Cat, Rat, Dog, Gnu, Rat)
Example 2:
val a = sc.parallelize(List(1, 2, 1, 3), 1)
val b = a.zip(a)
b.collect
res: Array[(Int, Int)] = Array((1,1), (2,2), (1,1), (3,3))
b.collectAsMap
res: scala.collection.Map[Int,Int] = Map(2 -> 2, 1 -> 1, 3 -> 3)
count(): Return the number of elements in the dataset.
Example:
val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog"), 2)
c.count
res: Long = 4
countByKey() & countByValue(): countByKey is only available on RDDs of type (K, V) and returns a map
of (K, Long) pairs with the count of each key; countByValue returns a map of each distinct value and its count.
Example 1:
val c = sc.parallelize(List((3, "Gnu"), (3, "Yak"), (5, "Mouse"), (3, "Dog")), 2)
c.countByKey
res: scala.collection.Map[Int,Long] = Map(3 -> 3, 5 -> 1)
Example 2:
val b = sc.parallelize(List(1,2,3,4,5,6,7,8,2,4,2,1,1,1,1,1))
b.countByValue
res: scala.collection.Map[Int,Long] = Map(5 -> 1, 1 -> 6, 6 -> 1, 2 -> 3, 7 -> 1, 3 -> 1, 8 -> 1, 4 ->
2)
first(): Return the first element of the dataset (similar to take(1)).
Example 1:
val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog"), 2)
c.first
res: String = Gnu
foreach(func): Run a function func on each element of the dataset. This is usually done for side
effects such as updating an accumulator variable or interacting with external storage systems.
Example:
val c = sc.parallelize(List("cat", "dog", "tiger", "lion", "gnu", "crocodile", "ant", "whale", "dolphin",
"spider"), 3)
c.foreach(x => println(x + "s are yummy"))
res:
lions are yummy
gnus are yummy
crocodiles are yummy
ants are yummy
whales are yummy
dolphins are yummy
spiders are yummy
keys(): Only available on RDDs of key-value pairs; returns an RDD containing the key of each
tuple.
Example:
val a = sc.parallelize(List("dog", "tiger", "lion", "cat",
"panther", "eagle"), 2)
val b = a.map(x => (x.length, x))
b.keys.collect
res: Array[Int] = Array(3, 5, 4, 3, 7, 5)
reduce(func): Aggregate the elements of the dataset using a function func (which takes two
arguments and returns one). The function should be commutative and associative so that it can
be computed correctly in parallel
Example:
val a = sc.parallelize(1 to 10, 3)
a.reduce(_ + _)
res: Int = 55
saveAsTextFile(path): Write the elements of the dataset as a text file (or set of text files) in a
given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark
will call toString on each element to convert it to a line of text in the file.
Example:
val a = sc.parallelize(1 to 10000, 3)
a.saveAsTextFile("mydata_a")
import org.apache.hadoop.io.compress.GzipCodec
a.saveAsTextFile("mydata_b", classOf[GzipCodec])
val x = sc.textFile("mydata_b")
x.count
res: Long = 10000
val x = sc.parallelize(List(1,2,3,4,5,6,6,7,9,8,10,21), 3)
x.saveAsTextFile("hdfs://localhost:8020/user/hadoop/test");
val sp = sc.textFile("hdfs://localhost:8020/user/hadoop/sp_data")
sp.flatMap(_.split(" ")).saveAsTextFile("hdfs://localhost:8020/user/hadoop/sp_x")
val v = sc.parallelize(Array(("owl",3), ("gnu",4), ("dog",1), ("cat",2), ("ant",5)), 2)
v.saveAsSequenceFile("hd_seq_file")
val x = sc.parallelize(1 to 100, 3)
x.saveAsObjectFile("objFile")
val y = sc.objectFile[Array[Int]]("objFile")
y.collect
res: Array[Int] = Array(67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85,
86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43,
44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 1, 2, 3, 4,
5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
32, 33)
stats(): Returns a StatCounter that gathers the count, mean, standard deviation, maximum, and minimum
of the elements of a numeric RDD in a single pass.
Example:
val x = sc.parallelize(List(1.0, 2.0, 3.0, 5.0, 20.0, 19.02,
19.29, 11.09, 21.0), 2)
x.stats
res: org.apache.spark.util.StatCounter = (count: 9, mean: 11.266667, stdev: 8.126859, max:
21.000000, min: 1.000000)
take(n): Return an array with the first n elements of the dataset. Note that this is currently not
executed in parallel. Instead, the driver program computes all the elements
Example 1:
val b = sc.parallelize(List("dog", "cat", "ape", "salmon","gnu"), 2)
b.take(2)
res: Array[String] = Array(dog, cat)
Example 2:
val b = sc.parallelize(1 to 10000, 5000)
b.take(100)
res: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22,
23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47,
48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72,
73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97,
98, 99, 100)
takeOrdered(n, [ordering]): Return the first n elements of the RDD using either their natural
order or a custom comparator
Example 1:
val b = sc.parallelize(List("dog", "cat", "ape", "salmon","gnu"), 2)
b.takeOrdered(2)
res: Array[String] = Array(ape, cat)
top(num: Int): Returns the top K elements from this RDD as defined by the specified implicit
Ordering[T].
Example:
val c = sc.parallelize(Array(6, 9, 4, 7, 5, 8), 2)
c.top(2)
res: Array[Int] = Array(9, 8)
3.3 Miscellaneous
The following lists some of the common miscellaneous operations supported by Spark.
Zip : Zips this RDD with another one, returning key-value pairs with the first element in each
RDD, second element in each RDD, etc.
Example:
val a = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3)
val b = sc.parallelize(List(1,1,2,2,2,1,2,2,2), 3)
val c = b.zip(a)
c.collect
res: Array[(Int, String)] = Array((1,dog), (1,cat), (2,gnu), (2,salmon), (2,rabbit), (1,turkey),
(2,wolf), (2,bear), (2,bee))
CombineByKey: Combines the elements for each key using custom functions (a createCombiner, a mergeValue,
and a mergeCombiners function); the simplified variant used below hash-partitions the resulting RDD using the default parallelism level.
Example:
val d = c.combineByKey(List(_), (x:List[String], y:String) => y :: x, (x:List[String], y:List[String]) =>
x ::: y)
d.collect
res: Array[(Int, List[String])] = Array((1,List(cat, dog, turkey)), (2,List(gnu, rabbit, salmon, bee,
bear, wolf)))
toDebugString(): Returns a human-readable description of this RDD and its recursive dependencies
(the lineage graph); the indentation of the tree shows where shuffle boundaries occur.
Example:
val a = sc.parallelize(1 to 9, 3)
val b = sc.parallelize(1 to 3, 3)
val c = a.subtract(b)
c.toDebugString
res: String =
(3) MapPartitionsRDD[146] at subtract at <console>:26 []
| SubtractedRDD[145] at subtract at <console>:26 []
+-(3) MapPartitionsRDD[143] at subtract at <console>:26 []
| | ParallelCollectionRDD[141] at parallelize at <console>:22 []
+-(3) MapPartitionsRDD[144] at subtract at <console>:26 []
| ParallelCollectionRDD[142] at parallelize at <console>:22 []
++ Operator: An alias for union; returns the union of this RDD and the RDD supplied as its argument.
Example 1:
val a = sc.parallelize(1 to 3, 1)
val b = sc.parallelize(5 to 7, 1)
(a ++ b).collect
res: Array[Int] = Array(1, 2, 3, 5, 6, 7)
values & variance: values returns an RDD with the value of each key-value pair; variance computes the
variance of a numeric RDD (sampleVariance uses the unbiased sample estimator).
Example 1:
val a = sc.parallelize(List("dog", "tiger", "lion", "cat",
"panther", "eagle"), 2)
val b = a.map(x => (x.length, x))
b.values.collect
res: Array[String] = Array(dog, tiger, lion, cat, panther, eagle)
Example 2:
val a = sc.parallelize(List(9.1, 1.0, 1.2, 2.1, 1.3, 5.0, 2.0, 2.1, 7.4, 7.5, 7.6, 8.8, 10.0, 8.9, 5.5), 3)
a.variance
res: Double = 10.605333333333332
val x = sc.parallelize(List(1.0, 2.0, 3.0, 5.0, 20.0, 19.02, 19.29, 11.09, 21.0), 2)
x.variance
res: Double = 66.04584444444443
x.sampleVariance
res: Double = 74.30157499999999
4. DataFrames
A DataFrame is a distributed collection of data organized into named columns. The
DataFrame API is available in Scala, Java, Python, and R.
Examples For Scala:
-:Scala:-
command:- spark-shell
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
// sc is an existing SparkContext, created automatically by spark-shell
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.json("file:/home/hadoop/workshop/spark-1.4.1-bin-hadoop2.6/examples/src/main/resources/people.json")
// Show the content of the DataFrame
df.show()
// Print the schema in a tree format
df.printSchema()
// Select only the "name" column
df.select("name").show()
// Select everybody, but increment the age by 1
df.select(df("name"), df("age") + 1).show()
// Select people older than 21
df.filter(df("age") > 21).show()
// Count people by age
df.groupBy("age").count().show()
// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._
// you can use custom classes that implement the Product interface.
Step 1:
case class Person(name: String, age: Int)
Step 2:
val people = sc.textFile("file:/home/hadoop/workshop/spark-1.4.1-bin-hadoop2.6/examples/src/main/resources/people.txt")
.map(_.split(","))
.map(p => Person(p(0), p(1).trim.toInt)).toDF()
people.registerTempTable("people")
Step 3:
val results = sqlContext.sql("SELECT * FROM people")
Step 4:
results.map(t => "Name: " + t(0)).collect().foreach(println)
Example For Python:
-:Python:-
command:- pyspark
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.json("file:/home/hadoop/workshop/spark-1.4.1-bin-hadoop2.6/examples/src/main/resources/people.json")
# Displays the content of the DataFrame to stdout
df.show()
# Print the schema in a tree format
df.printSchema()
# Select only the "name" column
df.select("name").show()
# Select everybody, but increment the age by 1
df.select(df['name'], df['age'] + 1).show()
# Select people older than 21
df.filter(df['age'] > 21).show()
# Count people by age
df.groupBy("age").count().show()
# Create a DataFrame from an existing RDD of records (people; a sketch of building it is shown below)
Step 1:
schemaPeople = sqlContext.createDataFrame(people)
Step 2:
schemaPeople.registerTempTable("people")
Step 3:
peoples = sqlContext.sql("SELECT * FROM people")
for person in peoples.collect(): print(person)
Step 4:
teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
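Note: the people RDD used in Step 1 is not defined above. One way to build it from the same people.txt file (a sketch mirroring the Scala example; the field names are assumptions taken from that example) is:
from pyspark.sql import Row
lines = sc.textFile("file:/home/hadoop/workshop/spark-1.4.1-bin-hadoop2.6/examples/src/main/resources/people.txt")
parts = lines.map(lambda l: l.split(","))
people = parts.map(lambda p: Row(name=p[0], age=int(p[1].strip())))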
Examples for R:
-:R:-
command:- sparkR
# sqlContext is created automatically by the sparkR shell (otherwise: sqlContext <- sparkRSQL.init(sc))
df <- jsonFile(sqlContext, "file path")
# Displays the content of the DataFrame to stdout
showDF(df)
# Print the schema in a tree format
printSchema(df)
# Select only the "name" column
showDF(select(df, "name"))
# Select everybody, but increment the age by 1
showDF(select(df, df$name, df$age + 1))
# Select people older than 21
showDF(where(df, df$age > 21))
# Count people by age
showDF(count(groupBy(df, "age")))
Examples For Hive:
-:Hive:-
One use of Spark SQL is to execute SQL queries written using either a basic SQL syntax or
HiveQL
Spark SQL can also be used to read data from an existing Hive installation
command:- spark-shell
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("CREATE DATABASE IF NOT EXISTS varun")
sqlContext.sql("CREATE TABLE IF NOT EXISTS varun.src (key INT, value STRING)")
sqlContext.sql("LOAD DATA LOCAL INPATH '/home/hadoop/workshop/spark-1.4.1-bin-
hadoop2.6/examples/src/main/resources/kv1.txt' INTO TABL varun.src")
sqlContext.sql("FROM varun.src SELECT key, value").collect().foreach(println)
sqlContext.sql("SELECT key, value FROM varun.src").collect().foreach(println)
sqlContext.sql("SELECT key, value FROM varun.src where key > 100").collect().foreach(println)
19. sqlContext.sql("SELECT key, count(key) FROM varun.src group by
key").collect().foreach(println)
4.1 Defining a Function: This function operates on distributed data and works row by row
(unless you're creating a user-defined aggregation function)
Example:
val pairRDD = sc.parallelize(List( ("cat",2), ("cat", 5), ("mouse", 4),("cat", 12), ("dog", 12),
("mouse", 2)), 2)
def myfunc(index: Int, iter: Iterator[(String, Int)]): Iterator[String] = {
iter.toList.map(x => "[partID:" + index + ", val: " + x + "]").iterator
}
pairRDD.mapPartitionsWithIndex(myfunc).collect
res: Array[String] = Array([partID:0, val: (cat,2)], [partID:0, val: (cat,5)], [partID:0, val: (mouse,4)],
[partID:1, val: (cat,12)], [partID:1, val: (dog,12)], [partID:1, val: (mouse,2)])
pairRDD.aggregateByKey(0)(math.max(_, _), _ + _).collect
res: Array[(String, Int)] = Array((dog,12), (cat,17), (mouse,6))
pairRDD.aggregateByKey(100)(math.max(_, _), _ + _).collect
res: Array[(String, Int)] = Array((dog,100), (cat,200), (mouse,200))
4.2 Defining a Function with Aggregate:
Example:
val nums = sc.parallelize(List(1,2,3,4,5,6), 2)
val chars = sc.parallelize(List("a","b","c","d","e","f"),2)
def myfunc(index: Int, iter: Iterator[(Int)]) : Iterator[String] = {
iter.toList.map(x => "[partID:" + index + ", val: " + x + "]").iterator
}
nums.foreach(println)
chars.foreach(println)
nums.mapPartitionsWithIndex(myfunc).collect
chars.mapPartitionsWithIndex(myfunc).collect
nums.aggregate(1)(math.max(_, _), _ + _)
reduce of partition 0 will be max(1, 1, 2, 3) = 3
reduce of partition 1 will be max(1, 4, 5, 6) = 6
final reduce across partitions will be 1 + 3 + 6 = 10
nums.aggregate(5)(math.max(_, _), _ + _)
reduce of partition 0 will be max(5, 1, 2, 3) = 5
reduce of partition 1 will be max(5, 4, 5, 6) = 6
final reduce across partitions will be 5 + 5 + 6 = 16
chars.aggregate("")(_ + _, _+_)
res: String = defabc
chars.aggregate("x")(_ + _, _+_)
res: String = xxabcxdef
5. References and Links:
https://spark.apache.org/examples.html
https://spark.apache.org/docs/latest/