SlideShare a Scribd company logo
An Introduction
A big data processing tool built with Scala and runs on JVM
BigData
It was first mentioned by NASA researchers Michael Cox and
David Ellsworth in 1997.
Definition: data sets that are so large or complex that
traditional data processing applications are inadequate to
deal with them. Challenges include analysis, capture, data
curation, search, sharing, storage, transfer, visualization,
querying, updating, information privacy, and real-time
capabilities. (Wikipedia)
BigDataProcessing
The rise of Big Data required faster tools for processing
data.
Can traditional databases handle Big Data? What are the
limitations?
What are the solutions?
WhatisSpark?
A big data processing framework built around speed,
generality, ease of use, and accessibility.
PropertiesofSPARK
Speed: Run programs up to 100x faster than Hadoop MapReduce
in memory, or 10x faster on disk.
Ease of Use: Write applications quickly in Java, Scala,
Python, and R.
Generality: Combines SQL, streaming, and complex analytics.
Accessibility: Spark runs on Hadoop, Mesos, standalone, or
in the cloud. It can access diverse data sources including
HDFS, Cassandra, HBase, and S3.
SparkFeatures
❏ Works directly on memory for speed-up
❏ Supports MapReduce
❏ Lazy evaluation of big data queries for optimization. (It
does exactly what you tell it to do)
❏ Operators and API that allow for easy interaction through
Scala, R, and Python.
Sparknuggets
Spark Streaming - process real-time data using basic
abstraction called Resilient Distributed Datasets (RDDs).
Spark SQL - Allows for ad-hoc querying using JDBC API
Spark MLlib - provides optimized machine learning libraries
for regression, classification, and clustering tasks.
Spark GraphX - an extension to RDD for graph-parallel
computation.
RDD
A fundamental data structure in Spark. An immutable
collection of objects that support in-memory processing.
Two ways of creating RDDs:
❏ Parallelize existing data collection in your driver
program
❏ Importing directly from HDFS, HBASE or any other
Hadoop Input Format.
Map/Reduce
A programming model for processing and generating large
amounts of data with a parallel, distributed algorithm on a
cluster.
ASimpleWordCountprogram
text_file = spark.textFile("hdfs://...")
text_file.flatMap(lambda line:
line.split())
.map(lambda word: (word, 1))
.reduceByKey(lambda a, b: a+b)
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
public class WordCount extends
Configured implements Tool{
public int run(String[] args)
throws Exception
{
//creating a JobConf
object and assigning a job name for
identification purposes
JobConf conf = new
JobConf(getConf(), WordCount.class);
conf.setJobName("WordCount");
//Setting configuration
object with the Data Type of output
Key and Value
conf.setOutputKeyClass(Text.class);
References
http://spark.apache.org/
http://www.winshuttle.com/big-data-timeline/
https://www.infoq.com/articles/apache-spark-introduction
http://kickstarthadoop.blogspot.tw/2011/04/word-count-hadoop
-map-reduce-example.html
An Introduction to Apache Spark

More Related Content

What's hot

Hadoop white papers
Hadoop white papersHadoop white papers
Hadoop white papers
Muthu Natarajan
 
عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟
datastack
 
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureOtimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Luan Moreno Medeiros Maciel
 
Geek Night - Functional Data Processing using Spark and Scala
Geek Night - Functional Data Processing using Spark and ScalaGeek Night - Functional Data Processing using Spark and Scala
Geek Night - Functional Data Processing using Spark and Scala
Atif Akhtar
 
Big data architecture on cloud computing infrastructure
Big data architecture on cloud computing infrastructureBig data architecture on cloud computing infrastructure
Big data architecture on cloud computing infrastructure
datastack
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
Prashanth Yennampelli
 
Hadoop vs spark
Hadoop vs sparkHadoop vs spark
Hadoop vs spark
amarkayam
 
Hadoop map reduce
Hadoop map reduceHadoop map reduce
Hadoop map reduce
VijayMohan Vasu
 
Hadoop Tutorial For Beginners
Hadoop Tutorial For BeginnersHadoop Tutorial For Beginners
Hadoop Tutorial For Beginners
Dataflair Web Services Pvt Ltd
 
Learning How to Learn Hadoop
Learning How to Learn HadoopLearning How to Learn Hadoop
Learning How to Learn HadoopSilicon Halton
 
Apache hadoop introduction and architecture
Apache hadoop  introduction and architectureApache hadoop  introduction and architecture
Apache hadoop introduction and architecture
Harikrishnan K
 
Big data hadoop rdbms
Big data hadoop rdbmsBig data hadoop rdbms
Big data hadoop rdbms
Arjen de Vries
 
Intro to Big Data Hadoop
Intro to Big Data HadoopIntro to Big Data Hadoop
Intro to Big Data Hadoop
Apache Apex
 
Comparison - RDBMS vs Hadoop vs Apache
Comparison - RDBMS vs Hadoop vs ApacheComparison - RDBMS vs Hadoop vs Apache
Comparison - RDBMS vs Hadoop vs Apache
SandeepTaksande
 
Hadoop & Complex Systems Research
Hadoop & Complex Systems ResearchHadoop & Complex Systems Research
Hadoop & Complex Systems Research
Dr. Mirko Kämpf
 
Intro to bigdata on gcp (1)
Intro to bigdata on gcp (1)Intro to bigdata on gcp (1)
Intro to bigdata on gcp (1)
SahilRaina21
 
Hd insight overview
Hd insight overviewHd insight overview
Hd insight overview
vhrocca
 
Hadoop
HadoopHadoop
Bi with apache hadoop(en)
Bi with apache hadoop(en)Bi with apache hadoop(en)
Bi with apache hadoop(en)
Alexander Alten
 
Hd insight essentials quick view
Hd insight essentials quick viewHd insight essentials quick view
Hd insight essentials quick view
Rajesh Nadipalli
 

What's hot (20)

Hadoop white papers
Hadoop white papersHadoop white papers
Hadoop white papers
 
عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟
 
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureOtimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
 
Geek Night - Functional Data Processing using Spark and Scala
Geek Night - Functional Data Processing using Spark and ScalaGeek Night - Functional Data Processing using Spark and Scala
Geek Night - Functional Data Processing using Spark and Scala
 
Big data architecture on cloud computing infrastructure
Big data architecture on cloud computing infrastructureBig data architecture on cloud computing infrastructure
Big data architecture on cloud computing infrastructure
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
 
Hadoop vs spark
Hadoop vs sparkHadoop vs spark
Hadoop vs spark
 
Hadoop map reduce
Hadoop map reduceHadoop map reduce
Hadoop map reduce
 
Hadoop Tutorial For Beginners
Hadoop Tutorial For BeginnersHadoop Tutorial For Beginners
Hadoop Tutorial For Beginners
 
Learning How to Learn Hadoop
Learning How to Learn HadoopLearning How to Learn Hadoop
Learning How to Learn Hadoop
 
Apache hadoop introduction and architecture
Apache hadoop  introduction and architectureApache hadoop  introduction and architecture
Apache hadoop introduction and architecture
 
Big data hadoop rdbms
Big data hadoop rdbmsBig data hadoop rdbms
Big data hadoop rdbms
 
Intro to Big Data Hadoop
Intro to Big Data HadoopIntro to Big Data Hadoop
Intro to Big Data Hadoop
 
Comparison - RDBMS vs Hadoop vs Apache
Comparison - RDBMS vs Hadoop vs ApacheComparison - RDBMS vs Hadoop vs Apache
Comparison - RDBMS vs Hadoop vs Apache
 
Hadoop & Complex Systems Research
Hadoop & Complex Systems ResearchHadoop & Complex Systems Research
Hadoop & Complex Systems Research
 
Intro to bigdata on gcp (1)
Intro to bigdata on gcp (1)Intro to bigdata on gcp (1)
Intro to bigdata on gcp (1)
 
Hd insight overview
Hd insight overviewHd insight overview
Hd insight overview
 
Hadoop
HadoopHadoop
Hadoop
 
Bi with apache hadoop(en)
Bi with apache hadoop(en)Bi with apache hadoop(en)
Bi with apache hadoop(en)
 
Hd insight essentials quick view
Hd insight essentials quick viewHd insight essentials quick view
Hd insight essentials quick view
 

Similar to An Introduction to Apache Spark

APACHE SPARK.pptx
APACHE SPARK.pptxAPACHE SPARK.pptx
APACHE SPARK.pptx
DeepaThirumurugan
 
Cassandra Lunch #89: Semi-Structured Data in Cassandra
Cassandra Lunch #89: Semi-Structured Data in CassandraCassandra Lunch #89: Semi-Structured Data in Cassandra
Cassandra Lunch #89: Semi-Structured Data in Cassandra
Anant Corporation
 
In Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkIn Memory Analytics with Apache Spark
In Memory Analytics with Apache Spark
Venkata Naga Ravi
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
Joud Khattab
 
Big data overview
Big data overviewBig data overview
Big data overview
beCloudReady
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
Venkata Naga Ravi
 
Big Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computingBig Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computing
Animesh Chaturvedi
 
SPARK ARCHITECTURE
SPARK ARCHITECTURESPARK ARCHITECTURE
SPARK ARCHITECTURE
GauravBiswas9
 
Unit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxUnit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptx
Rahul Borate
 
Apachespark 160612140708
Apachespark 160612140708Apachespark 160612140708
Apachespark 160612140708
Srikrishna k
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introduction
sudhakara st
 
Apache Spark PDF
Apache Spark PDFApache Spark PDF
Apache Spark PDF
Naresh Rupareliya
 
Lighting up Big Data Analytics with Apache Spark in Azure
Lighting up Big Data Analytics with Apache Spark in AzureLighting up Big Data Analytics with Apache Spark in Azure
Lighting up Big Data Analytics with Apache Spark in Azure
Jen Stirrup
 
Why Spark over Hadoop?
Why Spark over Hadoop?Why Spark over Hadoop?
Why Spark over Hadoop?
Prwatech Institution
 
Apache Spark Introduction.pdf
Apache Spark Introduction.pdfApache Spark Introduction.pdf
Apache Spark Introduction.pdf
MaheshPandit16
 
Low latency access of bigdata using spark and shark
Low latency access of bigdata using spark and sharkLow latency access of bigdata using spark and shark
Low latency access of bigdata using spark and sharkPradeep Kumar G.S
 
Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the Surface
Josi Aranda
 
Spark For Faster Batch Processing
Spark For Faster Batch ProcessingSpark For Faster Batch Processing
Spark For Faster Batch Processing
Edureka!
 
RDBMS vs Hadoop vs Spark
RDBMS vs Hadoop vs SparkRDBMS vs Hadoop vs Spark
RDBMS vs Hadoop vs Spark
Laxmi8
 
Spark 101
Spark 101Spark 101

Similar to An Introduction to Apache Spark (20)

APACHE SPARK.pptx
APACHE SPARK.pptxAPACHE SPARK.pptx
APACHE SPARK.pptx
 
Cassandra Lunch #89: Semi-Structured Data in Cassandra
Cassandra Lunch #89: Semi-Structured Data in CassandraCassandra Lunch #89: Semi-Structured Data in Cassandra
Cassandra Lunch #89: Semi-Structured Data in Cassandra
 
In Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkIn Memory Analytics with Apache Spark
In Memory Analytics with Apache Spark
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
 
Big data overview
Big data overviewBig data overview
Big data overview
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Big Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computingBig Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computing
 
SPARK ARCHITECTURE
SPARK ARCHITECTURESPARK ARCHITECTURE
SPARK ARCHITECTURE
 
Unit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxUnit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptx
 
Apachespark 160612140708
Apachespark 160612140708Apachespark 160612140708
Apachespark 160612140708
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introduction
 
Apache Spark PDF
Apache Spark PDFApache Spark PDF
Apache Spark PDF
 
Lighting up Big Data Analytics with Apache Spark in Azure
Lighting up Big Data Analytics with Apache Spark in AzureLighting up Big Data Analytics with Apache Spark in Azure
Lighting up Big Data Analytics with Apache Spark in Azure
 
Why Spark over Hadoop?
Why Spark over Hadoop?Why Spark over Hadoop?
Why Spark over Hadoop?
 
Apache Spark Introduction.pdf
Apache Spark Introduction.pdfApache Spark Introduction.pdf
Apache Spark Introduction.pdf
 
Low latency access of bigdata using spark and shark
Low latency access of bigdata using spark and sharkLow latency access of bigdata using spark and shark
Low latency access of bigdata using spark and shark
 
Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the Surface
 
Spark For Faster Batch Processing
Spark For Faster Batch ProcessingSpark For Faster Batch Processing
Spark For Faster Batch Processing
 
RDBMS vs Hadoop vs Spark
RDBMS vs Hadoop vs SparkRDBMS vs Hadoop vs Spark
RDBMS vs Hadoop vs Spark
 
Spark 101
Spark 101Spark 101
Spark 101
 

More from Elvis Saravia

The Future of Brain-Powered Learning
The Future of Brain-Powered Learning The Future of Brain-Powered Learning
The Future of Brain-Powered Learning
Elvis Saravia
 
Introduction to Fundamentals of RNNs
Introduction to Fundamentals of RNNsIntroduction to Fundamentals of RNNs
Introduction to Fundamentals of RNNs
Elvis Saravia
 
Text mining lab (summer 2017) - Word Vector Representation
Text mining lab (summer 2017) - Word Vector RepresentationText mining lab (summer 2017) - Word Vector Representation
Text mining lab (summer 2017) - Word Vector Representation
Elvis Saravia
 
Thesis oral defense 2015 elvis saravia
Thesis oral defense 2015  elvis saraviaThesis oral defense 2015  elvis saravia
Thesis oral defense 2015 elvis saravia
Elvis Saravia
 
The Neurochemistry of Music
The Neurochemistry of MusicThe Neurochemistry of Music
The Neurochemistry of Music
Elvis Saravia
 
NewSQL - The Future of Databases?
NewSQL - The Future of Databases?NewSQL - The Future of Databases?
NewSQL - The Future of Databases?
Elvis Saravia
 
Crowdsource Delivery System - Improving traditional delivery systems
Crowdsource Delivery System - Improving traditional delivery systemsCrowdsource Delivery System - Improving traditional delivery systems
Crowdsource Delivery System - Improving traditional delivery systems
Elvis Saravia
 
Relational Databases - Benefits and Challenges
Relational Databases - Benefits and ChallengesRelational Databases - Benefits and Challenges
Relational Databases - Benefits and Challenges
Elvis Saravia
 
Subconscious Crowdsourcing: A Feasible Data Collection Mechanism for Mental D...
Subconscious Crowdsourcing: A Feasible Data Collection Mechanism for Mental D...Subconscious Crowdsourcing: A Feasible Data Collection Mechanism for Mental D...
Subconscious Crowdsourcing: A Feasible Data Collection Mechanism for Mental D...
Elvis Saravia
 

More from Elvis Saravia (9)

The Future of Brain-Powered Learning
The Future of Brain-Powered Learning The Future of Brain-Powered Learning
The Future of Brain-Powered Learning
 
Introduction to Fundamentals of RNNs
Introduction to Fundamentals of RNNsIntroduction to Fundamentals of RNNs
Introduction to Fundamentals of RNNs
 
Text mining lab (summer 2017) - Word Vector Representation
Text mining lab (summer 2017) - Word Vector RepresentationText mining lab (summer 2017) - Word Vector Representation
Text mining lab (summer 2017) - Word Vector Representation
 
Thesis oral defense 2015 elvis saravia
Thesis oral defense 2015  elvis saraviaThesis oral defense 2015  elvis saravia
Thesis oral defense 2015 elvis saravia
 
The Neurochemistry of Music
The Neurochemistry of MusicThe Neurochemistry of Music
The Neurochemistry of Music
 
NewSQL - The Future of Databases?
NewSQL - The Future of Databases?NewSQL - The Future of Databases?
NewSQL - The Future of Databases?
 
Crowdsource Delivery System - Improving traditional delivery systems
Crowdsource Delivery System - Improving traditional delivery systemsCrowdsource Delivery System - Improving traditional delivery systems
Crowdsource Delivery System - Improving traditional delivery systems
 
Relational Databases - Benefits and Challenges
Relational Databases - Benefits and ChallengesRelational Databases - Benefits and Challenges
Relational Databases - Benefits and Challenges
 
Subconscious Crowdsourcing: A Feasible Data Collection Mechanism for Mental D...
Subconscious Crowdsourcing: A Feasible Data Collection Mechanism for Mental D...Subconscious Crowdsourcing: A Feasible Data Collection Mechanism for Mental D...
Subconscious Crowdsourcing: A Feasible Data Collection Mechanism for Mental D...
 

Recently uploaded

Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Subhajit Sahu
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
vcaxypu
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
benishzehra469
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
oz8q3jxlp
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
jerlynmaetalle
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
ewymefz
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
ewymefz
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
Subhajit Sahu
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
Tiktokethiodaily
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
nscud
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Linda486226
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
axoqas
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
NABLAS株式会社
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
balafet
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
ewymefz
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
slg6lamcq
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
enxupq
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
haila53
 

Recently uploaded (20)

Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
 

An Introduction to Apache Spark

  • 1. An Introduction A big data processing tool built with Scala and runs on JVM
  • 2. BigData It was first mentioned by NASA researchers Michael Cox and David Ellsworth in 1997. Definition: data sets that are so large or complex that traditional data processing applications are inadequate to deal with them. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, querying, updating, information privacy, and real-time capabilities. (Wikipedia)
  • 3. BigDataProcessing The rise of Big Data required faster tools for processing data. Can traditional databases handle Big Data? What are the limitations? What are the solutions?
  • 4. WhatisSpark? A big data processing framework built around speed, generality, ease of use, and accessibility.
  • 5. PropertiesofSPARK Speed: Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. Ease of Use: Write applications quickly in Java, Scala, Python, and R. Generality: Combines SQL, streaming, and complex analytics. Accessibility: Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.
  • 6. SparkFeatures ❏ Works directly on memory for speed-up ❏ Supports MapReduce ❏ Lazy evaluation of big data queries for optimization. (It does exactly what you tell it to do) ❏ Operators and API that allow for easy interaction through Scala, R, and Python.
  • 7. Sparknuggets Spark Streaming - process real-time data using basic abstraction called Resilient Distributed Datasets (RDDs). Spark SQL - Allows for ad-hoc querying using JDBC API Spark MLlib - provides optimized machine learning libraries for regression, classification, and clustering tasks. Spark GraphX - an extension to RDD for graph-parallel computation.
  • 8. RDD A fundamental data structure in Spark. An immutable collection of objects that support in-memory processing. Two ways of creating RDDs: ❏ Parallelize existing data collection in your driver program ❏ Importing directly from HDFS, HBASE or any other Hadoop Input Format.
  • 9. Map/Reduce A programming model for processing and generating large amounts of data with a parallel, distributed algorithm on a cluster.
  • 10. ASimpleWordCountprogram text_file = spark.textFile("hdfs://...") text_file.flatMap(lambda line: line.split()) .map(lambda word: (word, 1)) .reduceByKey(lambda a, b: a+b) import org.apache.hadoop.io.*; import org.apache.hadoop.mapred.*; import org.apache.hadoop.util.*; public class WordCount extends Configured implements Tool{ public int run(String[] args) throws Exception { //creating a JobConf object and assigning a job name for identification purposes JobConf conf = new JobConf(getConf(), WordCount.class); conf.setJobName("WordCount"); //Setting configuration object with the Data Type of output Key and Value conf.setOutputKeyClass(Text.class);