SlideShare a Scribd company logo
Up and Running with
Achilles Heel for Hadoop
 Hadoop is not fast enough *apparently* for things like ML .
 Need to Read again from disk after each MR job.
{ MR1 => HDFS =>MR2 => HDFS =>MR3 }
 MR , Let’s admit is a bit too complicated.
 The problem with giant codebase.
{Hadoop : 1.7 Million LOC}
{Spark : .35 Million LOC}
Why Spark?
https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
A brief History of Spark Timeline
 UC Berkeley : The home of innovation.
 2009 : Started as a simple class project.
 The UCB folks wanted to create a Cluster Management system  Mesos
 They needed something to test on top of Mesos .  Voila Spark
 2010 : Open sourced under BSD licence
 Feb 2014 : Became Apache Top Level project
 Nov 2014 : New world record in Large scale sorting
https://soundcloud.com/oreilly-radar/apache-sparks-journey-from-academia-to-industry
Spark made wise choices
The Spark Stack
Spark Concepts
 In memory Processing
{Processors : 64 bit ~~ Up to 1 TB RAM}
{Fact: RAM will always be faster than disk}
{Idea : Compress data, do processing }
{Remember : Data is distributed across various machines too}
 Resilient Distributed Datasets
http://www.gridgain.com/in-memory-computing-in-plain-english/
Resilient This is Sparta and we don’t give up on data without a fight.
Distributed A part of data is everywhere.
Dataset Meh!
A bit more on RDD
Basic unit of data in Spark
 RDDs are immutable
// int a=0;
// final int b =0;
 There are two main categories of operations on RDD
a) Transformation => Lazy evaluation.
=> Creates a new RDD from the existing RDD.
b) Actions => Return values
=> Write to disk
Eg : My mom asks me to buy Grocery items
Setting Up
Download “Prebuilt for Hadoop 2.4 and later”
 Build from source with Maven or sbt.
./bin/pyspark
http://spark.apache.org/downloads.html
Talk is Cheap! Show me the code
Pyspark shell is REPL.
 Creating an RDD
a) From data in memory.
b) From File.
c) From another RDD
rdd = sc.parallelize(“ChennaiPy”) // from string
nums = [1,2,3]
rdd_nums = sc.parallelize(nums) // from list
rdd_shakespeare= sc.textFile(“shakespeare.txt”) // from file
Transformations
Less Dramatic than this . But beautiful nevertheless.
 Classic Example 1 : Map
a) Beauty in this case comes from Lambda Expressions .
nums = [1,2,3,4,5,6]
rdd_nums = sc.parallelize(nums) // Creating our RDD
new_rdd = rdd_nums.map(lambda x : x**2) // You’ve got squares
print new_rdd.colect() // Finally some action
Did you say 80 Operations?
http://nbviewer.ipython.org/github/jkthompson/pyspark-
pictures/blob/master/pyspark-pictures.ipynb
Use case : Log Analysis
Demo Time
Thank You 

More Related Content

What's hot

Performant data processing with PySpark, SparkR and DataFrame API
Performant data processing with PySpark, SparkR and DataFrame APIPerformant data processing with PySpark, SparkR and DataFrame API
Performant data processing with PySpark, SparkR and DataFrame API
Ryuji Tamagawa
 
Rapid Prototyping in PySpark Streaming: The Thermodynamics of Docker Containe...
Rapid Prototyping in PySpark Streaming: The Thermodynamics of Docker Containe...Rapid Prototyping in PySpark Streaming: The Thermodynamics of Docker Containe...
Rapid Prototyping in PySpark Streaming: The Thermodynamics of Docker Containe...
Richard Seymour
 
PySpark Cassandra - Amsterdam Spark Meetup
PySpark Cassandra - Amsterdam Spark MeetupPySpark Cassandra - Amsterdam Spark Meetup
PySpark Cassandra - Amsterdam Spark Meetup
Frens Jan Rumph
 
Keeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETLKeeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETL
Databricks
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
Mostafa
 
Sparkling pandas Letting Pandas Roam - PyData Seattle 2015
Sparkling pandas Letting Pandas Roam - PyData Seattle 2015Sparkling pandas Letting Pandas Roam - PyData Seattle 2015
Sparkling pandas Letting Pandas Roam - PyData Seattle 2015
Holden Karau
 
Beneath RDD in Apache Spark by Jacek Laskowski
Beneath RDD in Apache Spark by Jacek LaskowskiBeneath RDD in Apache Spark by Jacek Laskowski
Beneath RDD in Apache Spark by Jacek Laskowski
Spark Summit
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
Benjamin Bengfort
 
Project Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare MetalProject Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare Metal
Databricks
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Anastasios Skarlatidis
 
Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDs
Dean Chen
 
Introduction to apache spark
Introduction to apache spark Introduction to apache spark
Introduction to apache spark
Aakashdata
 
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and PythonApache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Christian Perone
 
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on TutorialsSparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Databricks
 
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
Spark Summit
 
Debugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauDebugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden Karau
Spark Summit
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
Robert Sanders
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Mohamed hedi Abidi
 
Operational Tips for Deploying Spark
Operational Tips for Deploying SparkOperational Tips for Deploying Spark
Operational Tips for Deploying Spark
Databricks
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
DataArt
 

What's hot (20)

Performant data processing with PySpark, SparkR and DataFrame API
Performant data processing with PySpark, SparkR and DataFrame APIPerformant data processing with PySpark, SparkR and DataFrame API
Performant data processing with PySpark, SparkR and DataFrame API
 
Rapid Prototyping in PySpark Streaming: The Thermodynamics of Docker Containe...
Rapid Prototyping in PySpark Streaming: The Thermodynamics of Docker Containe...Rapid Prototyping in PySpark Streaming: The Thermodynamics of Docker Containe...
Rapid Prototyping in PySpark Streaming: The Thermodynamics of Docker Containe...
 
PySpark Cassandra - Amsterdam Spark Meetup
PySpark Cassandra - Amsterdam Spark MeetupPySpark Cassandra - Amsterdam Spark Meetup
PySpark Cassandra - Amsterdam Spark Meetup
 
Keeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETLKeeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETL
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
 
Sparkling pandas Letting Pandas Roam - PyData Seattle 2015
Sparkling pandas Letting Pandas Roam - PyData Seattle 2015Sparkling pandas Letting Pandas Roam - PyData Seattle 2015
Sparkling pandas Letting Pandas Roam - PyData Seattle 2015
 
Beneath RDD in Apache Spark by Jacek Laskowski
Beneath RDD in Apache Spark by Jacek LaskowskiBeneath RDD in Apache Spark by Jacek Laskowski
Beneath RDD in Apache Spark by Jacek Laskowski
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
Project Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare MetalProject Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare Metal
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDs
 
Introduction to apache spark
Introduction to apache spark Introduction to apache spark
Introduction to apache spark
 
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and PythonApache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
 
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on TutorialsSparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
 
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
 
Debugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauDebugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden Karau
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Operational Tips for Deploying Spark
Operational Tips for Deploying SparkOperational Tips for Deploying Spark
Operational Tips for Deploying Spark
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
 

Viewers also liked

Scala 101
Scala 101Scala 101
Scala 101
Andrey Myatlyuk
 
Introduction to Spark with Scala
Introduction to Spark with ScalaIntroduction to Spark with Scala
Introduction to Spark with Scala
Himanshu Gupta
 
Apache Spark & Scala
Apache Spark & ScalaApache Spark & Scala
Apache Spark & Scala
Edureka!
 
Introduction to AMQP Messaging with RabbitMQ
Introduction to AMQP Messaging with RabbitMQIntroduction to AMQP Messaging with RabbitMQ
Introduction to AMQP Messaging with RabbitMQ
Dmitriy Samovskiy
 
Introduction to PySpark
Introduction to PySparkIntroduction to PySpark
Introduction to PySpark
Russell Jurney
 
A Brief Intro to Scala
A Brief Intro to ScalaA Brief Intro to Scala
A Brief Intro to Scala
Tim Underwood
 
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
mumrah
 
Dive into Spark Streaming
Dive into Spark StreamingDive into Spark Streaming
Dive into Spark Streaming
Gerard Maas
 
Apache Kafka 0.8 basic training - Verisign
Apache Kafka 0.8 basic training - VerisignApache Kafka 0.8 basic training - Verisign
Apache Kafka 0.8 basic training - Verisign
Michael Noll
 
Introduction to Functional Programming with Scala
Introduction to Functional Programming with ScalaIntroduction to Functional Programming with Scala
Introduction to Functional Programming with Scala
pramode_ce
 

Viewers also liked (10)

Scala 101
Scala 101Scala 101
Scala 101
 
Introduction to Spark with Scala
Introduction to Spark with ScalaIntroduction to Spark with Scala
Introduction to Spark with Scala
 
Apache Spark & Scala
Apache Spark & ScalaApache Spark & Scala
Apache Spark & Scala
 
Introduction to AMQP Messaging with RabbitMQ
Introduction to AMQP Messaging with RabbitMQIntroduction to AMQP Messaging with RabbitMQ
Introduction to AMQP Messaging with RabbitMQ
 
Introduction to PySpark
Introduction to PySparkIntroduction to PySpark
Introduction to PySpark
 
A Brief Intro to Scala
A Brief Intro to ScalaA Brief Intro to Scala
A Brief Intro to Scala
 
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
 
Dive into Spark Streaming
Dive into Spark StreamingDive into Spark Streaming
Dive into Spark Streaming
 
Apache Kafka 0.8 basic training - Verisign
Apache Kafka 0.8 basic training - VerisignApache Kafka 0.8 basic training - Verisign
Apache Kafka 0.8 basic training - Verisign
 
Introduction to Functional Programming with Scala
Introduction to Functional Programming with ScalaIntroduction to Functional Programming with Scala
Introduction to Functional Programming with Scala
 

Similar to Up and running with pyspark

Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
Djamel Zouaoui
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
Ned Shawa
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Vincent Poncet
 
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos LinardosApache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
Euangelos Linardos
 
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student Slides
Databricks
 
Bds session 13 14
Bds session 13 14Bds session 13 14
Bds session 13 14
Infinity Tech Solutions
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production users
Databricks
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
clairvoyantllc
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25th
Sneha Challa
 
Scaling Big Data with Hadoop and Mesos
Scaling Big Data with Hadoop and MesosScaling Big Data with Hadoop and Mesos
Scaling Big Data with Hadoop and Mesos
Discover Pinterest
 
Beginner Apache Spark Presentation
Beginner Apache Spark PresentationBeginner Apache Spark Presentation
Beginner Apache Spark Presentation
Nidhin Pattaniyil
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
C4Media
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
Databricks
 
Apache spark basics
Apache spark basicsApache spark basics
Apache spark basics
sparrowAnalytics.com
 
Introduction to Apache Spark Developer Training
Introduction to Apache Spark Developer TrainingIntroduction to Apache Spark Developer Training
Introduction to Apache Spark Developer Training
Cloudera, Inc.
 
Data Science
Data ScienceData Science
Data Science
Subhajit75
 
SparkNotes
SparkNotesSparkNotes
SparkNotes
Demet Aksoy
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Rahul Jain
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
MapR Technologies
 
Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2
Olalekan Fuad Elesin
 

Similar to Up and running with pyspark (20)

Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos LinardosApache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
 
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student Slides
 
Bds session 13 14
Bds session 13 14Bds session 13 14
Bds session 13 14
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production users
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25th
 
Scaling Big Data with Hadoop and Mesos
Scaling Big Data with Hadoop and MesosScaling Big Data with Hadoop and Mesos
Scaling Big Data with Hadoop and Mesos
 
Beginner Apache Spark Presentation
Beginner Apache Spark PresentationBeginner Apache Spark Presentation
Beginner Apache Spark Presentation
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
Apache spark basics
Apache spark basicsApache spark basics
Apache spark basics
 
Introduction to Apache Spark Developer Training
Introduction to Apache Spark Developer TrainingIntroduction to Apache Spark Developer Training
Introduction to Apache Spark Developer Training
 
Data Science
Data ScienceData Science
Data Science
 
SparkNotes
SparkNotesSparkNotes
SparkNotes
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
 
Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2
 

More from Krishna Sangeeth KS

Bringing back Vangogh
Bringing back VangoghBringing back Vangogh
Bringing back Vangogh
Krishna Sangeeth KS
 
All things py
All things pyAll things py
All things py
Krishna Sangeeth KS
 
Intro to BigData , Hadoop and Mapreduce
Intro to BigData , Hadoop and MapreduceIntro to BigData , Hadoop and Mapreduce
Intro to BigData , Hadoop and Mapreduce
Krishna Sangeeth KS
 
Solving graph problems using networkX
Solving graph problems using networkXSolving graph problems using networkX
Solving graph problems using networkX
Krishna Sangeeth KS
 
Lvc
LvcLvc
Round 1 gnosis
Round 1 gnosisRound 1 gnosis
Round 1 gnosis
Krishna Sangeeth KS
 
Lvc
LvcLvc
Anti Counterfeit System using QR codes and Various other applications
Anti Counterfeit System using QR codes and Various other applicationsAnti Counterfeit System using QR codes and Various other applications
Anti Counterfeit System using QR codes and Various other applications
Krishna Sangeeth KS
 
Automatic web monitoring & retrieval system
Automatic web monitoring & retrieval systemAutomatic web monitoring & retrieval system
Automatic web monitoring & retrieval system
Krishna Sangeeth KS
 
Written round
Written roundWritten round
Written round
Krishna Sangeeth KS
 
Supertheme
SuperthemeSupertheme
Gnosis quiz 2k10 dry1
Gnosis quiz 2k10 dry1Gnosis quiz 2k10 dry1
Gnosis quiz 2k10 dry1
Krishna Sangeeth KS
 
Dry 2
Dry 2Dry 2
Choice round
Choice roundChoice round
Choice round
Krishna Sangeeth KS
 
Gnosis quiz 2k10 dry1
Gnosis quiz 2k10 dry1Gnosis quiz 2k10 dry1
Gnosis quiz 2k10 dry1
Krishna Sangeeth KS
 
Gnosis Quiz 2k10 Prelims
Gnosis Quiz  2k10 PrelimsGnosis Quiz  2k10 Prelims
Gnosis Quiz 2k10 Prelims
Krishna Sangeeth KS
 

More from Krishna Sangeeth KS (17)

Bringing back Vangogh
Bringing back VangoghBringing back Vangogh
Bringing back Vangogh
 
All things py
All things pyAll things py
All things py
 
Intro to BigData , Hadoop and Mapreduce
Intro to BigData , Hadoop and MapreduceIntro to BigData , Hadoop and Mapreduce
Intro to BigData , Hadoop and Mapreduce
 
Solving graph problems using networkX
Solving graph problems using networkXSolving graph problems using networkX
Solving graph problems using networkX
 
Lvc
LvcLvc
Lvc
 
Round 1 gnosis
Round 1 gnosisRound 1 gnosis
Round 1 gnosis
 
Lvc
LvcLvc
Lvc
 
Anti Counterfeit System using QR codes and Various other applications
Anti Counterfeit System using QR codes and Various other applicationsAnti Counterfeit System using QR codes and Various other applications
Anti Counterfeit System using QR codes and Various other applications
 
Automatic web monitoring & retrieval system
Automatic web monitoring & retrieval systemAutomatic web monitoring & retrieval system
Automatic web monitoring & retrieval system
 
Written round
Written roundWritten round
Written round
 
Visual connect
Visual connectVisual connect
Visual connect
 
Supertheme
SuperthemeSupertheme
Supertheme
 
Gnosis quiz 2k10 dry1
Gnosis quiz 2k10 dry1Gnosis quiz 2k10 dry1
Gnosis quiz 2k10 dry1
 
Dry 2
Dry 2Dry 2
Dry 2
 
Choice round
Choice roundChoice round
Choice round
 
Gnosis quiz 2k10 dry1
Gnosis quiz 2k10 dry1Gnosis quiz 2k10 dry1
Gnosis quiz 2k10 dry1
 
Gnosis Quiz 2k10 Prelims
Gnosis Quiz  2k10 PrelimsGnosis Quiz  2k10 Prelims
Gnosis Quiz 2k10 Prelims
 

Recently uploaded

Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems S.M.S.A.
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
tolgahangng
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Malak Abu Hammad
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
名前 です男
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
Neo4j
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
Neo4j
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
Zilliz
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Speck&Tech
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
panagenda
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
SOFTTECHHUB
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
panagenda
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
SOFTTECHHUB
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
Neo4j
 
Mariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceXMariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceX
Mariano Tinti
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
Neo4j
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
Neo4j
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 

Recently uploaded (20)

Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
 
Mariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceXMariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceX
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 

Up and running with pyspark

  • 2. Achilles Heel for Hadoop  Hadoop is not fast enough *apparently* for things like ML .  Need to Read again from disk after each MR job. { MR1 => HDFS =>MR2 => HDFS =>MR3 }  MR , Let’s admit is a bit too complicated.  The problem with giant codebase. {Hadoop : 1.7 Million LOC} {Spark : .35 Million LOC}
  • 4. A brief History of Spark Timeline  UC Berkeley : The home of innovation.  2009 : Started as a simple class project.  The UCB folks wanted to create a Cluster Management system  Mesos  They needed something to test on top of Mesos .  Voila Spark  2010 : Open sourced under BSD licence  Feb 2014 : Became Apache Top Level project  Nov 2014 : New world record in Large scale sorting https://soundcloud.com/oreilly-radar/apache-sparks-journey-from-academia-to-industry
  • 5. Spark made wise choices
  • 7. Spark Concepts  In memory Processing {Processors : 64 bit ~~ Up to 1 TB RAM} {Fact: RAM will always be faster than disk} {Idea : Compress data, do processing } {Remember : Data is distributed across various machines too}  Resilient Distributed Datasets http://www.gridgain.com/in-memory-computing-in-plain-english/ Resilient This is Sparta and we don’t give up on data without a fight. Distributed A part of data is everywhere. Dataset Meh!
  • 8. A bit more on RDD Basic unit of data in Spark  RDDs are immutable // int a=0; // final int b =0;  There are two main categories of operations on RDD a) Transformation => Lazy evaluation. => Creates a new RDD from the existing RDD. b) Actions => Return values => Write to disk Eg : My mom asks me to buy Grocery items
  • 9. Setting Up Download “Prebuilt for Hadoop 2.4 and later”  Build from source with Maven or sbt. ./bin/pyspark http://spark.apache.org/downloads.html
  • 10. Talk is Cheap! Show me the code Pyspark shell is REPL.  Creating an RDD a) From data in memory. b) From File. c) From another RDD rdd = sc.parallelize(“ChennaiPy”) // from string nums = [1,2,3] rdd_nums = sc.parallelize(nums) // from list rdd_shakespeare= sc.textFile(“shakespeare.txt”) // from file
  • 11. Transformations Less Dramatic than this . But beautiful nevertheless.  Classic Example 1 : Map a) Beauty in this case comes from Lambda Expressions . nums = [1,2,3,4,5,6] rdd_nums = sc.parallelize(nums) // Creating our RDD new_rdd = rdd_nums.map(lambda x : x**2) // You’ve got squares print new_rdd.colect() // Finally some action
  • 12. Did you say 80 Operations? http://nbviewer.ipython.org/github/jkthompson/pyspark- pictures/blob/master/pyspark-pictures.ipynb
  • 13. Use case : Log Analysis

Editor's Notes

  1. This is just a test
  2. This is just a test