Spark Core

When learning Apache Spark, where should a person begin? What are the key fundamentals when learning Apache Spark? Resilient Distributed Datasets, Spark Drivers and Context, Transformations, Actions.

  1. Introducing Spark Core
  2. Agenda • Assumptions • Why Spark? • What you need to know to begin
  3. Assumptions (one or more of the following) • You want to learn Apache Spark, but need to know where to begin • You need to know the fundamentals of Spark in order to progress in your learning • You need to evaluate whether Spark is an appropriate fit for your use cases or career growth
  4. In a Nutshell, Why Spark? • An engine for efficient large-scale processing; it’s faster than Hadoop MapReduce • Spark can complement your existing Hadoop investments such as HDFS and Hive • A rich ecosystem including support for SQL, Machine Learning, Streaming, and multiple language APIs such as Scala, Python, and Java
  5. Introduction • Ok, so where should I start?
  6. Spark Essentials • To begin, you need to know: • Resilient Distributed Datasets (RDDs) • Transformations • Actions • Spark Driver Programs and SparkContext
  7. Resilient Distributed Datasets (RDDs) • RDDs are Spark’s primary abstraction for data interaction (lazy, in memory) • An RDD is an immutable, distributed collection of elements split into partitions • There are multiple types of RDDs • RDDs can be created from external data sets such as Hadoop InputFormats and text files on a variety of file systems, or from existing RDDs via Spark transformations (see the sketch below)
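To make the creation paths concrete, here is a minimal Scala sketch in the Spark 1.x SparkContext style these slides describe; the object name, app name, local master URL, and the input path data.txt are assumptions for illustration:

    import org.apache.spark.{SparkConf, SparkContext}

    object RddCreation {
      def main(args: Array[String]): Unit = {
        // Local SparkContext for experimentation; app name and master URL are assumptions.
        val conf = new SparkConf().setAppName("rdd-creation").setMaster("local[*]")
        val sc = new SparkContext(conf)

        // From an external source: one element per line of the (hypothetical) file.
        val lines = sc.textFile("data.txt")

        // From an in-memory collection.
        val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))

        // From an existing RDD via a transformation (lazy: nothing has run yet).
        val lineLengths = lines.map(_.length)

        sc.stop()
      }
    }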
  8. Transformations • RDD functions which return pointers to new RDDs (remember: lazy) • map, flatMap, filter, etc.
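A short sketch of chaining these transformations, assuming an existing SparkContext named sc (as in the spark-shell) and the same hypothetical data.txt; each call below only builds a new RDD lazily:

    val lines  = sc.textFile("data.txt")         // hypothetical input path
    val words  = lines.flatMap(_.split("\\s+"))  // one element per word
    val upper  = words.map(_.toUpperCase)        // transform each element
    val longer = upper.filter(_.length > 3)      // keep only matching elements
    // Nothing has been read or computed yet -- that happens when an action runs.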
  9. Actions • RDD functions which return values to the driver • reduce, collect, count, etc.
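And a sketch of actions forcing evaluation, again assuming a SparkContext sc; unlike transformations, each call here returns a plain value to the driver:

    val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))
    nums.count()                   // Long: 5
    nums.collect()                 // Array(1, 2, 3, 4, 5), materialized on the driver
    nums.reduce(_ + _)             // Int: 15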
  10. Spark RDDs, Transformations, Actions Diagram [Diagram: data is loaded from an external source (e.g. textFile) into RDDs, flows through transformations, and actions return output value(s) to the driver, e.g. count, collect → 5, ['a','b','c']]
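Read as code, the diagram’s flow might look like this sketch (hypothetical data.txt, existing sc):

    val lines = sc.textFile("data.txt")          // load from an external source
    val words = lines.flatMap(_.split("\\s+"))   // transformation: returns a new RDD
    words.count()                                // action: returns a count, e.g. 5
    words.take(3)                                // action: e.g. Array(a, b, c)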
  11. Spark Driver Programs and Context • The Spark driver is a program that declares transformations and actions on RDDs of data • The driver submits the serialized RDD graph to the master; the master creates tasks and delegates them to the workers for execution • Workers are where the tasks actually execute
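A minimal self-contained driver program sketch in the SparkContext style the slides use; the object name, app name, local master URL, and input path are all assumptions:

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCountDriver {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("word-count-driver")
          .setMaster("local[*]")  // on a real cluster, the master is set at submit time

        val sc = new SparkContext(conf)

        // Declare transformations (lazy) ...
        val counts = sc.textFile("data.txt")     // hypothetical input path
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

        // ... and an action, which triggers execution: the driver submits the
        // RDD graph for scheduling, and tasks run on the workers.
        counts.collect().foreach(println)

        sc.stop()
      }
    }

Run via spark-submit on a real cluster, the collect() action is what triggers the driver to submit the job for scheduling, as the slide describes.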
  12. Driver Program and SparkContext (image borrowed from http://spark.apache.org/docs/latest/cluster-overview.html)
  13. References • For course information and discount coupons, visit http://www.supergloo.com/ • Learning Spark Book Summary: http://www.amazon.com/Learning-Spark-Summary-Lightning-Fast-Deconstructed-ebook/dp/B019HS7USA/
  14. Next Steps
