SlideShare a Scribd company logo
1 of 14
Download to read offline
Apache Spark
Nidhin
Problem
- Data is growing faster than processing
speeds
- Only solution is to parallelize on large
clusters
What is Spark
- open source cluster computing framework
- ‘successor’ of Hadoop
- developed in AMPLab in UC Berkeley
History
- 2002: MapReduce @Google
- 2004: MapReduce paper
- 2006: Hadoop @ Yahoo
- 2009: Amazon EMR
- 2010 Spark paper
- 2014: Apache Spark (top level)
Hadoop vs Spark
- only way to reuse data, is to write to disk
- Logistic Regression: 3 sec vs 76 sec
- K-Means: 33 sec vs 106 sec
- interactively explore data
Spark Components
- Spark Core
- Spark SQL
- Spark Streaming
- Mlib
- GraphX
Resilient Distributed Dataset (RDD)
- fundamental unit of data in spark
- immutable resilient distributed collection
- can be in disk or memory
- knows how it was created
Analyzing Github Event Data
- what are the most common words used in
the github commits (public repos)
Data
- file for every hour in 2015
- 2015-01-01-00.json.gz
- 2015-01-01-01.json.gz
- …
- 2015-07-23-11.json.gz
Data
- file content: each line is a json
- {...."type":"PushEvent"....}
- {....”type”:”CommitCommentEvent”:...”body”:”my
super awesome commit”...}
Approach (each file)
- read the file
- filter lines that are commit events
- extract the actual commit message
- split the commit message to words
- get frequency of word
- combine results
- sort in descending order
Code
https://github.
com/npatta01/spark_metis_investigation
Conclusion
- Spark is awesome ?
Resources
- Moocs
- Introduction to Big Data using Apache Spark
- Scalable Machine Learning
- Papers:
- Spark - Lightning-Fast Cluster Computing
- Spark SQL: Relational Data Processing in Spark
- Resilient Distributed Datasets: A Fault-Toleran
Abstraction for In-Memory Cluster Computing

More Related Content

What's hot

Introduction to Spark R with R studio - Mr. Pragith
Introduction to Spark R with R studio - Mr. Pragith Introduction to Spark R with R studio - Mr. Pragith
Introduction to Spark R with R studio - Mr. Pragith Sigmoid
 
Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Databricks
 
Spark Under the Hood - Meetup @ Data Science London
Spark Under the Hood - Meetup @ Data Science LondonSpark Under the Hood - Meetup @ Data Science London
Spark Under the Hood - Meetup @ Data Science LondonDatabricks
 
Adding Complex Data to Spark Stack by Tug Grall
Adding Complex Data to Spark Stack by Tug GrallAdding Complex Data to Spark Stack by Tug Grall
Adding Complex Data to Spark Stack by Tug GrallSpark Summit
 
PySparkの勘所(20170630 sapporo db analytics showcase)
PySparkの勘所(20170630 sapporo db analytics showcase) PySparkの勘所(20170630 sapporo db analytics showcase)
PySparkの勘所(20170630 sapporo db analytics showcase) Ryuji Tamagawa
 
HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...
HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...
HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...Modern Data Stack France
 
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...Spark Summit
 
SparkR-Advance Analytic for Big Data
SparkR-Advance Analytic for Big DataSparkR-Advance Analytic for Big Data
SparkR-Advance Analytic for Big Datasamuel shamiri
 
20171012 found IT #9 PySparkの勘所
20171012 found  IT #9 PySparkの勘所20171012 found  IT #9 PySparkの勘所
20171012 found IT #9 PySparkの勘所Ryuji Tamagawa
 
Introduction to Big Data and hadoop
Introduction to Big Data and hadoopIntroduction to Big Data and hadoop
Introduction to Big Data and hadoopSandeep Patil
 
Spark and Cassandra: An Amazing Apache Love Story by Patrick McFadin
Spark and Cassandra: An Amazing Apache Love Story by Patrick McFadinSpark and Cassandra: An Amazing Apache Love Story by Patrick McFadin
Spark and Cassandra: An Amazing Apache Love Story by Patrick McFadinSpark Summit
 
Hunk - Unlocking the Power of Big Data
Hunk - Unlocking the Power of Big DataHunk - Unlocking the Power of Big Data
Hunk - Unlocking the Power of Big DataSplunk
 
Bringing HPC Algorithms to Big Data Platforms: Spark Summit East talk by Niko...
Bringing HPC Algorithms to Big Data Platforms: Spark Summit East talk by Niko...Bringing HPC Algorithms to Big Data Platforms: Spark Summit East talk by Niko...
Bringing HPC Algorithms to Big Data Platforms: Spark Summit East talk by Niko...Spark Summit
 
Operational Tips for Deploying Spark
Operational Tips for Deploying SparkOperational Tips for Deploying Spark
Operational Tips for Deploying SparkDatabricks
 
20170210 sapporotechbar7
20170210 sapporotechbar720170210 sapporotechbar7
20170210 sapporotechbar7Ryuji Tamagawa
 
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...Spark Summit
 
Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...
Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...
Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...Databricks
 
Big data ecosystem
Big data ecosystemBig data ecosystem
Big data ecosystemSlideCentral
 
Spark Summit EU 2015: Revolutionizing Big Data in the Enterprise with Spark
Spark Summit EU 2015: Revolutionizing Big Data in the Enterprise with SparkSpark Summit EU 2015: Revolutionizing Big Data in the Enterprise with Spark
Spark Summit EU 2015: Revolutionizing Big Data in the Enterprise with SparkDatabricks
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersDatabricks
 

What's hot (20)

Introduction to Spark R with R studio - Mr. Pragith
Introduction to Spark R with R studio - Mr. Pragith Introduction to Spark R with R studio - Mr. Pragith
Introduction to Spark R with R studio - Mr. Pragith
 
Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)
 
Spark Under the Hood - Meetup @ Data Science London
Spark Under the Hood - Meetup @ Data Science LondonSpark Under the Hood - Meetup @ Data Science London
Spark Under the Hood - Meetup @ Data Science London
 
Adding Complex Data to Spark Stack by Tug Grall
Adding Complex Data to Spark Stack by Tug GrallAdding Complex Data to Spark Stack by Tug Grall
Adding Complex Data to Spark Stack by Tug Grall
 
PySparkの勘所(20170630 sapporo db analytics showcase)
PySparkの勘所(20170630 sapporo db analytics showcase) PySparkの勘所(20170630 sapporo db analytics showcase)
PySparkの勘所(20170630 sapporo db analytics showcase)
 
HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...
HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...
HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...
 
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...
 
SparkR-Advance Analytic for Big Data
SparkR-Advance Analytic for Big DataSparkR-Advance Analytic for Big Data
SparkR-Advance Analytic for Big Data
 
20171012 found IT #9 PySparkの勘所
20171012 found  IT #9 PySparkの勘所20171012 found  IT #9 PySparkの勘所
20171012 found IT #9 PySparkの勘所
 
Introduction to Big Data and hadoop
Introduction to Big Data and hadoopIntroduction to Big Data and hadoop
Introduction to Big Data and hadoop
 
Spark and Cassandra: An Amazing Apache Love Story by Patrick McFadin
Spark and Cassandra: An Amazing Apache Love Story by Patrick McFadinSpark and Cassandra: An Amazing Apache Love Story by Patrick McFadin
Spark and Cassandra: An Amazing Apache Love Story by Patrick McFadin
 
Hunk - Unlocking the Power of Big Data
Hunk - Unlocking the Power of Big DataHunk - Unlocking the Power of Big Data
Hunk - Unlocking the Power of Big Data
 
Bringing HPC Algorithms to Big Data Platforms: Spark Summit East talk by Niko...
Bringing HPC Algorithms to Big Data Platforms: Spark Summit East talk by Niko...Bringing HPC Algorithms to Big Data Platforms: Spark Summit East talk by Niko...
Bringing HPC Algorithms to Big Data Platforms: Spark Summit East talk by Niko...
 
Operational Tips for Deploying Spark
Operational Tips for Deploying SparkOperational Tips for Deploying Spark
Operational Tips for Deploying Spark
 
20170210 sapporotechbar7
20170210 sapporotechbar720170210 sapporotechbar7
20170210 sapporotechbar7
 
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
 
Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...
Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...
Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...
 
Big data ecosystem
Big data ecosystemBig data ecosystem
Big data ecosystem
 
Spark Summit EU 2015: Revolutionizing Big Data in the Enterprise with Spark
Spark Summit EU 2015: Revolutionizing Big Data in the Enterprise with SparkSpark Summit EU 2015: Revolutionizing Big Data in the Enterprise with Spark
Spark Summit EU 2015: Revolutionizing Big Data in the Enterprise with Spark
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production users
 

Viewers also liked

Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkRahul Jain
 
Spark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting GuideSpark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting GuideIBM
 
Media evaluation activity 3
Media evaluation activity 3Media evaluation activity 3
Media evaluation activity 3lambykins
 
Warna sdp sem4
Warna sdp sem4Warna sdp sem4
Warna sdp sem4ruwalfindi
 
Top 8 business executive resume samples
Top 8 business executive resume samplesTop 8 business executive resume samples
Top 8 business executive resume samplescorejom
 
Farmacología cardiovascular y del aparato respiratorio cuestionario
Farmacología cardiovascular y del aparato respiratorio cuestionarioFarmacología cardiovascular y del aparato respiratorio cuestionario
Farmacología cardiovascular y del aparato respiratorio cuestionarioÁlvaro Miguel Carranza Montalvo
 
Mobilizing Action to End Violence Against Children: Lessons from around the w...
Mobilizing Action to End Violence Against Children: Lessons from around the w...Mobilizing Action to End Violence Against Children: Lessons from around the w...
Mobilizing Action to End Violence Against Children: Lessons from around the w...BASPCAN
 
ở đâu dịch vụ giúp việc văn phòng uy tín hcm
ở đâu dịch vụ giúp việc văn phòng uy tín hcmở đâu dịch vụ giúp việc văn phòng uy tín hcm
ở đâu dịch vụ giúp việc văn phòng uy tín hcmsharda531
 
Top 8 software product manager resume samples
Top 8 software product manager resume samplesTop 8 software product manager resume samples
Top 8 software product manager resume samplescorejom
 
Kysely espoolaisesta veneilyharrastuksesta tulokset
Kysely espoolaisesta veneilyharrastuksesta  tuloksetKysely espoolaisesta veneilyharrastuksesta  tulokset
Kysely espoolaisesta veneilyharrastuksesta tuloksetSailyta-veneileva-Espoo
 
7 Steps to A Prezi
7 Steps to A Prezi7 Steps to A Prezi
7 Steps to A PreziKcomAcademy
 
Care proceedings under the Public Law Outline: The role for pre-proceedings
Care proceedings under the Public Law Outline: The role for pre-proceedingsCare proceedings under the Public Law Outline: The role for pre-proceedings
Care proceedings under the Public Law Outline: The role for pre-proceedingsBASPCAN
 
Get Your Board to Say "Yes" to a BSIMM Assessment
Get Your Board to Say "Yes" to a BSIMM AssessmentGet Your Board to Say "Yes" to a BSIMM Assessment
Get Your Board to Say "Yes" to a BSIMM AssessmentCigital
 
Tugas selamat riady algoritma
Tugas selamat riady algoritmaTugas selamat riady algoritma
Tugas selamat riady algoritmaSelamatriady
 
Evaluation of Caring Dads Safer Children
Evaluation of Caring Dads Safer ChildrenEvaluation of Caring Dads Safer Children
Evaluation of Caring Dads Safer ChildrenBASPCAN
 
The other Kinship Carers
The other Kinship CarersThe other Kinship Carers
The other Kinship CarersBASPCAN
 

Viewers also liked (20)

Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Spark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting GuideSpark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting Guide
 
Gaurav Resume
Gaurav ResumeGaurav Resume
Gaurav Resume
 
Car Accident Attorney in Sacramento
Car Accident Attorney in SacramentoCar Accident Attorney in Sacramento
Car Accident Attorney in Sacramento
 
Media evaluation activity 3
Media evaluation activity 3Media evaluation activity 3
Media evaluation activity 3
 
Warna sdp sem4
Warna sdp sem4Warna sdp sem4
Warna sdp sem4
 
Top 8 business executive resume samples
Top 8 business executive resume samplesTop 8 business executive resume samples
Top 8 business executive resume samples
 
Farmacología cardiovascular y del aparato respiratorio cuestionario
Farmacología cardiovascular y del aparato respiratorio cuestionarioFarmacología cardiovascular y del aparato respiratorio cuestionario
Farmacología cardiovascular y del aparato respiratorio cuestionario
 
Mobilizing Action to End Violence Against Children: Lessons from around the w...
Mobilizing Action to End Violence Against Children: Lessons from around the w...Mobilizing Action to End Violence Against Children: Lessons from around the w...
Mobilizing Action to End Violence Against Children: Lessons from around the w...
 
ở đâu dịch vụ giúp việc văn phòng uy tín hcm
ở đâu dịch vụ giúp việc văn phòng uy tín hcmở đâu dịch vụ giúp việc văn phòng uy tín hcm
ở đâu dịch vụ giúp việc văn phòng uy tín hcm
 
Top 8 software product manager resume samples
Top 8 software product manager resume samplesTop 8 software product manager resume samples
Top 8 software product manager resume samples
 
Kysely espoolaisesta veneilyharrastuksesta tulokset
Kysely espoolaisesta veneilyharrastuksesta  tuloksetKysely espoolaisesta veneilyharrastuksesta  tulokset
Kysely espoolaisesta veneilyharrastuksesta tulokset
 
7 Steps to A Prezi
7 Steps to A Prezi7 Steps to A Prezi
7 Steps to A Prezi
 
Project weka
Project wekaProject weka
Project weka
 
Care proceedings under the Public Law Outline: The role for pre-proceedings
Care proceedings under the Public Law Outline: The role for pre-proceedingsCare proceedings under the Public Law Outline: The role for pre-proceedings
Care proceedings under the Public Law Outline: The role for pre-proceedings
 
Get Your Board to Say "Yes" to a BSIMM Assessment
Get Your Board to Say "Yes" to a BSIMM AssessmentGet Your Board to Say "Yes" to a BSIMM Assessment
Get Your Board to Say "Yes" to a BSIMM Assessment
 
How I Made My Game No Fun
How I Made My Game No FunHow I Made My Game No Fun
How I Made My Game No Fun
 
Tugas selamat riady algoritma
Tugas selamat riady algoritmaTugas selamat riady algoritma
Tugas selamat riady algoritma
 
Evaluation of Caring Dads Safer Children
Evaluation of Caring Dads Safer ChildrenEvaluation of Caring Dads Safer Children
Evaluation of Caring Dads Safer Children
 
The other Kinship Carers
The other Kinship CarersThe other Kinship Carers
The other Kinship Carers
 

Similar to Beginner Apache Spark Presentation

Spark For Faster Batch Processing
Spark For Faster Batch ProcessingSpark For Faster Batch Processing
Spark For Faster Batch ProcessingEdureka!
 
In Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkIn Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkVenkata Naga Ravi
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsDatabricks
 
Spark performance tuning - Maksud Ibrahimov
Spark performance tuning - Maksud IbrahimovSpark performance tuning - Maksud Ibrahimov
Spark performance tuning - Maksud IbrahimovMaksud Ibrahimov
 
Evolution of spark framework for simplifying data analysis.
Evolution of spark framework for simplifying data analysis.Evolution of spark framework for simplifying data analysis.
Evolution of spark framework for simplifying data analysis.Anirudh Gangwar
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoMapR Technologies
 
Spark SQL | Apache Spark
Spark SQL | Apache SparkSpark SQL | Apache Spark
Spark SQL | Apache SparkEdureka!
 
Big Data Processing With Spark
Big Data Processing With SparkBig Data Processing With Spark
Big Data Processing With SparkEdureka!
 
Apache Spark Introduction.pdf
Apache Spark Introduction.pdfApache Spark Introduction.pdf
Apache Spark Introduction.pdfMaheshPandit16
 
Low latency access of bigdata using spark and shark
Low latency access of bigdata using spark and sharkLow latency access of bigdata using spark and shark
Low latency access of bigdata using spark and sharkPradeep Kumar G.S
 
Apache spark installation [autosaved]
Apache spark installation [autosaved]Apache spark installation [autosaved]
Apache spark installation [autosaved]Shweta Patnaik
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Djamel Zouaoui
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache SparkAmir Sedighi
 

Similar to Beginner Apache Spark Presentation (20)

Spark For Faster Batch Processing
Spark For Faster Batch ProcessingSpark For Faster Batch Processing
Spark For Faster Batch Processing
 
Apache Spark PDF
Apache Spark PDFApache Spark PDF
Apache Spark PDF
 
In Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkIn Memory Analytics with Apache Spark
In Memory Analytics with Apache Spark
 
Devops Spark Streaming
Devops Spark StreamingDevops Spark Streaming
Devops Spark Streaming
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
 
Spark 101
Spark 101Spark 101
Spark 101
 
Spark performance tuning - Maksud Ibrahimov
Spark performance tuning - Maksud IbrahimovSpark performance tuning - Maksud Ibrahimov
Spark performance tuning - Maksud Ibrahimov
 
Evolution of spark framework for simplifying data analysis.
Evolution of spark framework for simplifying data analysis.Evolution of spark framework for simplifying data analysis.
Evolution of spark framework for simplifying data analysis.
 
spark_v1_2
spark_v1_2spark_v1_2
spark_v1_2
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
 
Spark SQL | Apache Spark
Spark SQL | Apache SparkSpark SQL | Apache Spark
Spark SQL | Apache Spark
 
Big Data Processing With Spark
Big Data Processing With SparkBig Data Processing With Spark
Big Data Processing With Spark
 
Bds session 13 14
Bds session 13 14Bds session 13 14
Bds session 13 14
 
Apache Spark Introduction.pdf
Apache Spark Introduction.pdfApache Spark Introduction.pdf
Apache Spark Introduction.pdf
 
Low latency access of bigdata using spark and shark
Low latency access of bigdata using spark and sharkLow latency access of bigdata using spark and shark
Low latency access of bigdata using spark and shark
 
Apache spark installation [autosaved]
Apache spark installation [autosaved]Apache spark installation [autosaved]
Apache spark installation [autosaved]
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 
Big data with java
Big data with javaBig data with java
Big data with java
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
 
SparkPaper
SparkPaperSparkPaper
SparkPaper
 

Recently uploaded

How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024Timothy Spann
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...ssuserf63bd7
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...GQ Research
 

Recently uploaded (20)

How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
 

Beginner Apache Spark Presentation

  • 2. Problem - Data is growing faster than processing speeds - Only solution is to parallelize on large clusters
  • 3. What is Spark - open source cluster computing framework - ‘successor’ of Hadoop - developed in AMPLab in UC Berkeley
  • 4. History - 2002: MapReduce @Google - 2004: MapReduce paper - 2006: Hadoop @ Yahoo - 2009: Amazon EMR - 2010 Spark paper - 2014: Apache Spark (top level)
  • 5. Hadoop vs Spark - only way to reuse data, is to write to disk - Logistic Regression: 3 sec vs 76 sec - K-Means: 33 sec vs 106 sec - interactively explore data
  • 6. Spark Components - Spark Core - Spark SQL - Spark Streaming - Mlib - GraphX
  • 7. Resilient Distributed Dataset (RDD) - fundamental unit of data in spark - immutable resilient distributed collection - can be in disk or memory - knows how it was created
  • 8. Analyzing Github Event Data - what are the most common words used in the github commits (public repos)
  • 9. Data - file for every hour in 2015 - 2015-01-01-00.json.gz - 2015-01-01-01.json.gz - … - 2015-07-23-11.json.gz
  • 10. Data - file content: each line is a json - {...."type":"PushEvent"....} - {....”type”:”CommitCommentEvent”:...”body”:”my super awesome commit”...}
  • 11. Approach (each file) - read the file - filter lines that are commit events - extract the actual commit message - split the commit message to words - get frequency of word - combine results - sort in descending order
  • 14. Resources - Moocs - Introduction to Big Data using Apache Spark - Scalable Machine Learning - Papers: - Spark - Lightning-Fast Cluster Computing - Spark SQL: Relational Data Processing in Spark - Resilient Distributed Datasets: A Fault-Toleran Abstraction for In-Memory Cluster Computing