Spark in Action
from parallel processing perspective
Natawut Nupairoj (Ajarn Peng)
natawut.n@chula.ac.th
http://natawutn.wordpress.com
@natawutn
“I want people to see that age is no obstacle to writing code. Abroad, people are still coding when their hair has gone gray.”
One of the Code Mania#11 organizers
= Parallel Programming Reborn
Very Small Big Data Example
Storage Requirements for 90 days = 39,000,000,000 events (6.5TB)
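As a rough sanity check, the per-event size and the ingest rate implied by these figures can be worked out directly. The sketch below assumes decimal terabytes (1 TB = 10^12 bytes) and a uniform event rate across the 90 days; neither assumption is stated on the slide, and the object name is only illustrative.

```scala
object StorageEstimate {
  val events: Long = 39000000000L   // 39 billion events in the period (from the slide)
  val days: Int = 90                // retention period (from the slide)
  val totalBytes: Double = 6.5e12   // 6.5 TB, ASSUMING 1 TB = 10^12 bytes

  // Average size of one stored event.
  val bytesPerEvent: Double = totalBytes / events
  // Average ingest rate, ASSUMING events arrive uniformly over time.
  val eventsPerDay: Double = events.toDouble / days
  val eventsPerSecond: Double = eventsPerDay / 86400

  def main(args: Array[String]): Unit = {
    println(f"bytes per event   : $bytesPerEvent%.1f")
    println(f"events per day    : $eventsPerDay%.0f")
    println(f"events per second : $eventsPerSecond%.0f")
  }
}
```

Under these assumptions the numbers come to roughly 167 bytes per event and about 5,000 events per second, which is why the slide can call this a "very small" Big Data example.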
#HowDidWeGetHere
MapReduce
Source: https://www.mssqltips.com/sqlservertip/3222/big-data-basics--part-5--introduction-to-mapreduce/
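The MapReduce model named on this slide can be sketched on a single machine as three steps: map each record to key/value pairs, shuffle the pairs so that equal keys end up together, then reduce each group to one value. The code below is only an illustration with plain Scala collections, not the Hadoop API; the object name, function names, and input lines are made up for the example.

```scala
object MapReduceSketch {
  // Map phase: every input line is turned into (word, 1) pairs.
  def mapPhase(lines: Seq[String]): Seq[(String, Int)] =
    lines.flatMap(_.split("\\s+")).filter(_.nonEmpty).map(word => (word, 1))

  // Shuffle phase: pairs with the same key are brought together.
  def shufflePhase(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (key, kvs) => (key, kvs.map(_._2)) }

  // Reduce phase: the values of each key are combined into a single result.
  def reducePhase(grouped: Map[String, Seq[Int]]): Map[String, Int] =
    grouped.map { case (key, values) => (key, values.sum) }

  def wordCount(lines: Seq[String]): Map[String, Int] =
    reducePhase(shufflePhase(mapPhase(lines)))

  def main(args: Array[String]): Unit = {
    val input = Seq("big data is big", "spark is fast") // made-up sample input
    wordCount(input).toSeq.sortBy(_._1).foreach(println)
  }
}
```

In a real cluster the map and reduce functions run on many nodes and the shuffle moves data over the network; here all three phases are ordinary in-memory transformations.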
Hadoop Shortcomings
× Too simplified
× Not interactive
× Pessimistic
Along comes Apache Spark!
• Open-source Big Data framework from UC Berkeley
• In-Memory Analytics
• > 800 contributors
• Largest Cluster = 8,000+ nodes (Tencent)
Source: M. Zaharia, “New Directions for Spark in 2015”, Spark Summit East 2015, 18 March 2015.
Let’s get to Action!
Action#1 – Wifi Radius Log
• Size = 336.6 MB
• Sample Data
2016-01-01T00:00:08.000+07:00 acs1121-cen59-1 CSCOacs_Passed_Authentications 0137798608 4 0 2016-01-01 00:00:08.198 +07:00 0113947311 5200 NOTICE Passed-Authentication: Authentication succeeded, ACSVersion=acs-5.6.0.22-B.225, ConfigVe...
2016-01-01T00:00:13.000+07:00 acs1121-cen59-1 CSCOacs_RADIUS_Accounting 0137798618 2 0 2016-01-01 00:00:13.644 +07:00 0113947536 3000 NOTICE Radius-Accounting: RADIUS Accounting start request, ACSV...
Action#1 – Wifi Radius Log
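// Run in spark-shell, where the SparkContext is available as sc.
val loglines = sc.textFile("wifi-radius.log")
// Keep only the successful authentications.
val successlogin = loglines.filter(line => line.contains("Authentication succeeded"))
// The first space-separated field is the ISO 8601 timestamp.
val logintime = successlogin.map(line => line.split(" ")(0))
// Characters 11-12 of the timestamp are the hour; emit (hour, 1).
val logintimepair = logintime.map(ltime => (ltime.substring(11, 13), 1))
// Sum the counts per hour, then sort by hour.
val loginByHour = logintimepair.reduceByKey((x, y) => x + y).sortByKey()
loginByHour.collect().foreach(println)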
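The per-line logic of this pipeline can be checked without a cluster. The following sketch uses plain Scala collections as a stand-in for the RDD; the object and function names are hypothetical, the first two sample lines are abbreviated from the slide, and the third is made up to produce a second hour bucket.

```scala
object LoginByHour {
  // Returns the two-digit hour of a successful login, or None for any other line.
  def loginHour(line: String): Option[String] =
    if (line.contains("Authentication succeeded")) {
      val timestamp = line.split(" ")(0) // e.g. 2016-01-01T00:00:08.000+07:00
      // Guard against short or malformed first fields instead of throwing.
      if (timestamp.length >= 13) Some(timestamp.substring(11, 13)) else None
    } else None

  // Local equivalent of map -> reduceByKey -> sortByKey.
  def countByHour(lines: Seq[String]): Seq[(String, Int)] =
    lines.flatMap(line => loginHour(line).toList)
      .groupBy(identity)
      .map { case (hour, hits) => (hour, hits.size) }
      .toSeq
      .sortBy(_._1)

  def main(args: Array[String]): Unit = {
    val sample = Seq(
      "2016-01-01T00:00:08.000+07:00 acs1121-cen59-1 NOTICE Passed-Authentication: Authentication succeeded,",
      "2016-01-01T00:00:13.000+07:00 acs1121-cen59-1 NOTICE Radius-Accounting: RADIUS Accounting start request,",
      "2016-01-01T09:15:00.000+07:00 acs1121-cen59-1 NOTICE Passed-Authentication: Authentication succeeded," // made up
    )
    countByHour(sample).foreach(println)
  }
}
```

Unlike the slide's version, this sketch skips lines whose first field is too short rather than failing on `substring`; whether the real log ever contains such lines is not known from the slides.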
For Spark: RDD is everything!
Action#2 – Thai Monitoring Corpus
• Size = 40 x 200 MB
• Sample Data
{"ngram": 8, "totallen": 39, "wordseg": "}อยาก}ไป~นอน}ที่~นี่}<s>}วิว}สวย}<s>}รอ}ชม}ค่า}<s>", "text": "อยากไปนอนที่นี่ วิวสวยยยย <img class=\"img-in-emotion\" title=\"อมยิ้ม02\" alt=\"อมยิ้ม02\" src=\"http://ptcdn.info/emoticons/smiley/smiley02.png\"/><br />\nรอชมค่าาาา", "timestamp": 1392286781000, "id": 21935913, "faillen": 0, "refid": "31650001"}
{"ngram": 6, "totallen": 26, "wordseg": "ตาม}มา~เที่ยว}ด้วย~กัน}ค่ะ}คุณ}วา}<s>}", …
Action#2 – Thai Monitoring Corpus
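import org.apache.spark.sql.functions.from_unixtime

val data = sqlContext.read.json("gs://tmc-data/*")
// timestamp is in milliseconds; from_unixtime expects seconds.
val d = data.withColumn("date",
  from_unixtime(data("timestamp") / 1000, "yyyy-MM-dd"))
// Matches a <f...>...</f...> (or <F...>) tag pair together with its content.
val failre = "<[Ff][^>]+>[^<]+</[Ff][^>]+>".r
// Printable ASCII except space (U+0021 to U+007F).
val nonthaire = "[\u0021-\u007F]+".r
def prepareText(x: Any): String = {
  var text = x.asInstanceOf[String]
  text = text.replace("\n", " ")   // line breaks
  text = text.replace("<s>", " ")  // drop <s> markers
  text = text.replace("~", "")     // remove ~ joiners
  text = text.replace("}", " ")    // } delimits words
  text = failre.replaceAllIn(text, " ")
  text = nonthaire.replaceAllIn(text, " ")
  text
}
val dTmp = d.select(d("date"), d("wordseg"))
// .rdd makes the DataFrame-to-RDD step explicit, so the pair-RDD methods below apply.
val dataMap = dTmp.rdd.map(item => (item(0), prepareText(item(1))))
val baseGramRDD = dataMap.flatMapValues(v => v.split("( )+"))
// countByValue is an action: despite the name, the result is a local Map on the driver.
val baseGramCountRDD = baseGramRDD.countByValue()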
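The cleaning and counting steps can be tried locally on the two sample `wordseg` values from the previous slide. This is a sketch with plain Scala collections standing in for the RDD; the object and function names are hypothetical, the date string is only an example value, and the sketch drops empty tokens, which the pipeline above does not do.

```scala
object ThaiTokens {
  private val failTag = "<[Ff][^>]+>[^<]+</[Ff][^>]+>".r
  // Same range as the slide's pattern (U+0021 to U+007F), written with regex hex escapes.
  private val nonThai = "[\\x21-\\x7F]+".r

  def prepareText(raw: String): String = {
    val unmarked = raw
      .replace("\n", " ")
      .replace("<s>", " ")
      .replace("~", "")
      .replace("}", " ")
    nonThai.replaceAllIn(failTag.replaceAllIn(unmarked, " "), " ")
  }

  // Splits the cleaned text on runs of spaces; empty tokens are dropped (sketch-only choice).
  def tokens(raw: String): Seq[String] =
    prepareText(raw).split(" +").toSeq.filter(_.nonEmpty)

  // Local equivalent of flatMapValues followed by countByValue on (date, token) pairs.
  def countTokens(rows: Seq[(String, String)]): Map[(String, String), Int] =
    rows.flatMap { case (date, wordseg) => tokens(wordseg).map(token => (date, token)) }
      .groupBy(identity)
      .map { case (key, hits) => (key, hits.size) }

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      ("2014-02-13", "}อยาก}ไป~นอน}ที่~นี่}<s>}วิว}สวย}<s>}รอ}ชม}ค่า}<s>"),
      ("2014-02-13", "ตาม}มา~เที่ยว}ด้วย~กัน}ค่ะ}คุณ}วา}<s>}")
    )
    rows.foreach { case (_, wordseg) => println(tokens(wordseg).mkString(" | ")) }
    println(countTokens(rows).size)
  }
}
```

For these two samples the token counts agree with the records' `ngram` fields (8 and 6). A string that starts with a delimiter would otherwise yield a leading empty token from `split`, which is the reason for the `nonEmpty` filter here.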
Source: P. Zecevic and M. Bonaci, Spark in Action, Manning Publications, Summer 2016 (est.)
Why I Like Spark
• RDD: simple but elegant design
• Very consistent abstraction
• Extensible with minimal overhead
• A feature added to the base RDD benefits every other module
Big Data + Spark = #TheDreamCouple
