Architecture
[Diagram: components shown are HDFS, YARN, MapReduce, Spark, Spark SQL, Spark Streaming, MLlib and GraphX.]
Standalone apps
- Scala
- Java
- Python
Deployment
spark-submit
- a .jar file (for Java/Scala) or a set of .py or .zip files (for Python)
Development
Source: http://tech.blog.box.com/2014/07/evaluating-apache-spark-and-twitter-scalding/
http://spark-summit.org/wp-content/uploads/2013/10/Zaharia-spark-summit-2013-matei.pdf
Wordcount in Spark
#Python
# `spark` is the SparkContext
file = spark.textFile("hdfs://...")
counts = file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")
#Scala
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
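To make the three steps concrete without a cluster, here is a minimal plain-Python sketch of what flatMap, map and reduceByKey compute on a small in-memory list. The helper name `word_count` and the sample lines are illustrative assumptions, not part of the Spark API; Spark itself evaluates the same steps lazily and in parallel across partitions.

```python
from collections import defaultdict


def word_count(lines):
    """Plain-Python model of the flatMap -> map -> reduceByKey pipeline."""
    # flatMap: each line yields zero or more words, flattened into one list
    words = [word for line in lines for word in line.split(" ")]
    # map: pair every word with an initial count of 1
    pairs = [(word, 1) for word in words]
    # reduceByKey: merge the values of equal keys with (a, b) -> a + b
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


# Hypothetical sample input standing in for the lines of an HDFS file
sample = ["to be or", "not to be"]
result = word_count(sample)
print(result)
```

The model keeps the whole dataset in one list, so it says nothing about partitioning or shuffling; it only shows the shape of the data after each transformation.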
Interactive Live Demo I on Spark REPL
cd /home/grf/Downloads/spark-1.0.2-bin-hadoop1/bin
./spark-shell
val f = sc.textFile("README.md")
val wc = f.flatMap(l => l.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
wc.saveAsTextFile("wc_result.txt")
wc.toDebugString
Interactive Live Demo II on Spark REPL
cd /home/grf/Downloads/spark-1.0.2-bin-hadoop1/bin
./spark-shell
val rm = sc.textFile("README.md")
val rm_wc = rm.flatMap(l => l.split(" ")).filter(_ == "Spark").map(word => (word, 1)).reduceByKey(_ + _)
rm_wc.collect()
val cl = sc.textFile("CHANGES.txt")
val cl_wc = cl.flatMap(l => l.split(" ")).filter(_ == "Spark").map(word => (word, 1)).reduceByKey(_ + _)
cl_wc.collect()
rm_wc.join(cl_wc).collect()
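The last line of the demo joins two pair RDDs by key. As a rough plain-Python model of what such an inner join returns, each key present on both sides produces a (key, (left value, right value)) record. The function name `join` and the counts below are made up for illustration; the real numbers depend on the files shipped with the Spark distribution.

```python
def join(left, right):
    """Plain-Python model of an inner join on (key, value) pairs."""
    # Index the right side by key; a key may carry several values
    right_by_key = {}
    for key, value in right:
        right_by_key.setdefault(key, []).append(value)
    # Emit (key, (left_value, right_value)) for every matching pair
    return [
        (key, (lv, rv))
        for key, lv in left
        for rv in right_by_key.get(key, [])
    ]


# Hypothetical counts of the word "Spark" in the two files
rm_wc = [("Spark", 14)]
cl_wc = [("Spark", 2)]
joined = join(rm_wc, cl_wc)
print(joined)  # [('Spark', (14, 2))]
```

Keys that appear on only one side are dropped, which is why both inputs are filtered to the same word before joining.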
Running on EMR
#Starting the cluster
/opt/elastic-mapreduce-cli/elastic-mapreduce --create --alive --name "Paul's Spark/Shark Cluster" \
  --bootstrap-action s3://elasticmapreduce/samples/spark/0.8.1/install-spark-shark.sh \
  --bootstrap-name "Install Spark/Shark" --instance-type m1.xlarge --instance-count 10
Spark 1.0.0 is also available, on both YARN and Spark platforms (not yet properly tested).
ssh hadoop@<FQDN> -i /opt/rnd_eu.pem
cd /home/hadoop/spark
./bin/spark-shell
./bin/spark-submit
./bin/pyspark
Monitoring:
<FQDN>:8080
alter table rtb_transactions add if not exists partition (dt='${DATE}');
INSERT OVERWRITE TABLE
rtb_transactions_export PARTITION (dt ='${DATE}', cd)
SELECT
ChannelId,
RequestId,
Time,
CookieId,
<...>
FROM
rtb_transactions t
JOIN
placements p ON (t.PlacementId = p.PlacementId)
WHERE
dt = '${DATE}'
and (p.AgencyId=107
or p.AgencyId=136
or p.AgencyId=590);
HiveQL version: 35 lines
val tr = sc.textFile("path_to_rtb_transactions").
  map(_.split("\t")).
  map(r => (r(11), r))
val pl = sc.textFile("path_to_placements").
  map(_.split("\t")).
  filter(c => Set(107, 136, 590).contains(c(9).trim.toInt)).
  map(r => (r(0), r))
pl.join(tr).map(tuple => "%s".format(tuple._2._2.mkString("\t"))).
  coalesce(1).
  saveAsTextFile("path_to_rtb_transactions_sampled")
Spark version: 12 lines
Need to understand the internals!
Goal: Find number of distinct names per “first letter”
sc.textFile("hdfs:/names")
  .map(name => (name.charAt(0), name))
  .groupByKey()
  .mapValues(names => names.toSet.size)
  .collect()
[Diagram: HadoopRDD -> map() -> groupBy() -> mapValues() -> collect()]
Stage 1
  input: ala, ana, pet
  map(): (A, ana), (A, ala), (P, pet)
shuffle
Stage 2
  groupBy(): (A, (ana, ala)), (P, (pet))
  mapValues(): (A, 2), (P, 1)
  collect(): (a, 2), (p, 1)
Need to understand the internals!
Goal: Find number of distinct names per “first letter”
[Diagram: HadoopRDD -> map() -> reduceByKey() -> collect()]
Stage 1
  input: ala, ana, pet
  map(): (A, 1), (A, 1), (P, 1)
  reduceByKey(): (A, 2), (P, 1)
  collect(): (a, 2), (p, 1)
sc.textFile("hdfs:/names")
  .distinct(numPartitions = 3)
  .map(name => (name.charAt(0), 1))
  .reduceByKey(_ + _)
  .collect()
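No shuffle of whole name groups! reduceByKey() combines counts on the map side, so only small partial counts cross the network.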
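The difference between the two jobs can be sketched with a toy plain-Python model that counts how many records would have to move between partitions under each approach. The partition layout and both function names are assumptions for illustration, and the model ignores the shuffle that distinct() itself performs; it only contrasts shipping every (letter, name) record with shipping pre-combined (letter, count) records.

```python
from collections import defaultdict

# Hypothetical layout: the three names from the slides spread over two partitions
partitions = [["ala", "ana"], ["pet"]]


def group_by_key_job(parts):
    """groupByKey-style: every (letter, name) record crosses the shuffle."""
    shuffled = [(name[0], name) for part in parts for name in part]
    groups = defaultdict(set)
    for letter, name in shuffled:
        groups[letter].add(name)
    result = {letter: len(names) for letter, names in groups.items()}
    return result, len(shuffled)


def reduce_by_key_job(parts):
    """reduceByKey-style: counts are combined inside each partition first."""
    shuffled = []
    for part in parts:
        local = defaultdict(int)
        # Names are assumed to be distinct already, as after distinct()
        for name in set(part):
            local[name[0]] += 1
        # One small (letter, count) record per key per partition
        shuffled.extend(local.items())
    result = defaultdict(int)
    for letter, n in shuffled:
        result[letter] += n
    return dict(result), len(shuffled)


grouped, grouped_records = group_by_key_job(partitions)
reduced, reduced_records = reduce_by_key_job(partitions)
print(grouped, grouped_records)
print(reduced, reduced_records)
```

Both jobs give the same answer, but the second moves at most one record per key per partition, and each record is a small count rather than a full name.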
Key points:
Handles batch, interactive and real-time processing within a single framework
Native integration with Python, Scala and Java
Programming at a higher level of abstraction
More general: MR is just one set of supported constructs (??)

 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 

SPARK Architecture, Programming and Deployment

  • 1. Architecture [diagram: HDFS, YARN, MapReduce and Spark; the Spark stack comprises Spark SQL, Spark Streaming, MLlib and GraphX]
  • 2. Development
       Standalone apps: Scala, Java, Python
       Deployment
       spark-submit: a .jar (for Java/Scala) or a set of .py or .zip files (for Python)
  • 5. Wordcount in Spark
       # Python (here `spark` is the SparkContext, named `sc` in the shell)
       file = spark.textFile("hdfs://...")
       counts = file.flatMap(lambda line: line.split(" ")) \
                    .map(lambda word: (word, 1)) \
                    .reduceByKey(lambda a, b: a + b)
       counts.saveAsTextFile("hdfs://...")
       // Scala
       val file = spark.textFile("hdfs://...")
       val counts = file.flatMap(line => line.split(" "))
                        .map(word => (word, 1))
                        .reduceByKey(_ + _)
       counts.saveAsTextFile("hdfs://...")
  • 6. Interactive Live Demo I on Spark REPL
       cd /home/grf/Downloads/spark-1.0.2-bin-hadoop1/bin
       ./spark-shell
       val f = sc.textFile("README.md")
       val wc = f.flatMap(l => l.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
       wc.saveAsTextFile("wc_result.txt")
       wc.toDebugString
  • 7. Interactive Live Demo II on Spark REPL
       cd /home/grf/Downloads/spark-1.0.2-bin-hadoop1/bin
       ./spark-shell
       val rm = sc.textFile("README.md")
       val rm_wc = rm.flatMap(l => l.split(" ")).filter(_ == "Spark").map(word => (word, 1)).reduceByKey(_ + _)
       rm_wc.collect()
       val cl = sc.textFile("CHANGES.txt")
       val cl_wc = cl.flatMap(l => l.split(" ")).filter(_ == "Spark").map(word => (word, 1)).reduceByKey(_ + _)
       cl_wc.collect()
       rm_wc.join(cl_wc).collect()
  • 8. Running on EMR
       # Starting the cluster
       /opt/elastic-mapreduce-cli/elastic-mapreduce --create --alive \
         --name "Paul's Spark/Shark Cluster" \
         --bootstrap-action s3://elasticmapreduce/samples/spark/0.8.1/install-spark-shark.sh \
         --bootstrap-name "Install Spark/Shark" \
         --instance-type m1.xlarge --instance-count 10
       Spark 1.0.0 is available on YARN and on SPARK platforms (not yet properly tested).
       ssh hadoop@<FQDN> -i /opt/rnd_eu.pem
       cd /home/hadoop/spark
       ./bin/spark-shell
       ./bin/spark-submit
       ./bin/pyspark
       Monitoring: <FQDN>:8080
  • 9. alter table rtb_transactions add if not exists partition (dt='${DATE}');
       INSERT OVERWRITE TABLE rtb_transactions_export PARTITION (dt='${DATE}', cd)
       SELECT ChannelId, RequestId, Time, CookieId, <...>
       FROM rtb_transactions t
       JOIN placements p ON (t.PlacementId = p.PlacementId)
       WHERE dt = '${DATE}'
         and (p.AgencyId=107 or p.AgencyId=136 or p.AgencyId=590);
       35 lines
       val tr = sc.textFile("path_to_rtb_transactions").
         map(_.split("\t")).
         map(r => (r(11), r))
       val pl = sc.textFile("path_to_placements").
         map(_.split("\t")).
         filter(c => Set(107, 136, 590).contains(c(9).trim.toInt)).
         map(r => (r(0), r))
       pl.join(tr).map(tuple => "%s".format(tuple._2._2.mkString("\t"))).
         coalesce(1).
         saveAsTextFile("path_to_rtb_transactions_sampled")
       12 lines
  • 10. Need to understand the internals!
       Goal: Find number of distinct names per "first letter"
       sc.textFile("hdfs:/names")
         .map(name => (name.charAt(0), name))
         .groupByKey()
         .mapValues(names => names.toSet.size)
         .collect()
       [diagram: HadoopRDD -> map() -> groupBy() -> mapValues() -> collect(), split by a shuffle into Stage 1 and Stage 2]
       [data: ala, ana, pet -> (A, ana), (A, ala), (P, pet) -> (A, (ana, ala)), (P, (pet)) -> (A, 2), (P, 1) -> (a, 2), (p, 1)]
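  • 11. Need to understand the internals!
       Goal: Find number of distinct names per "first letter"
       sc.textFile("hdfs:/names")
         .distinct(numPartitions = 3)
         .map(name => (name.charAt(0), 1))
         .reduceByKey(_ + _)
         .collect()
       [diagram: HadoopRDD -> map() -> reduceByKey() -> collect(), Stage 1]
       [data: ala, ana, pet -> (A, 1), (A, 1), (P, 1) -> (A, 2), (P, 1) -> (a, 2), (p, 1)]
       Less data shuffled! distinct() and reduceByKey() still shuffle, but reduceByKey() adds up the counts within each partition first, so no per-letter list of names is ever built.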
  • 12. Key points:
       Handles batch, interactive and real-time processing within a single framework
       Native integration with Python, Scala and Java
       Programming at a higher level of abstraction
       More general: MapReduce (MR) is just one set of supported constructs (??)
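
The two "distinct names per first letter" pipelines on slides 10 and 11 should give the same answer. As a rough check that needs no Spark install, here is a minimal plain-Python sketch of the logic behind each one. The function names and the duplicate "ana" in the sample input are assumptions added for illustration; nothing here uses the Spark API, and the sketch says nothing about how Spark schedules stages or shuffles.

```python
from collections import defaultdict


def distinct_names_per_letter_grouped(names):
    """Mirrors slide 10: keep every name per letter, then count the distinct ones."""
    groups = defaultdict(list)
    for name in names:
        groups[name[0]].append(name)  # map to (letter, name), then groupByKey
    # mapValues(names => names.toSet.size)
    return {letter: len(set(group)) for letter, group in groups.items()}


def distinct_names_per_letter_reduced(names):
    """Mirrors slide 11: de-duplicate first, then sum a 1 per remaining name."""
    counts = defaultdict(int)
    for name in set(names):   # distinct()
        counts[name[0]] += 1  # map to (letter, 1), then reduceByKey(_ + _)
    return dict(counts)


# Sample input from the slides, plus one repeated name (an added assumption)
names = ["ala", "ana", "pet", "ana"]
print(distinct_names_per_letter_grouped(names))
print(distinct_names_per_letter_reduced(names))
```

Both functions map "a" to 2 and "p" to 1 for this input. The first holds a full list of names per letter in memory before counting, while the second only ever holds one integer per letter, which is the property that makes the reduceByKey formulation cheaper on a cluster.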