Is Spark the right choice for Data
Analysis ?
Ahmed Kamal, Big Data Engineer
http://ahmedkamal.me
Resources ?
●“Advanced Analytics with Spark”, a practical book !
●“The thing I like most about this book is its focus on
examples, which are all drawn from real applications on real-
world data sets.” - Matei Zaharia, CTO at Databricks.
●It is all about developing data applications using Spark
Data Applications, like what ?
●Build a model to detect credit card fraud using thousands of
features and billions of transactions.
●Intelligently recommend millions of products to millions of
users.
●Estimate financial risk through simulations of portfolios
including millions of instruments.
●Easily manipulate data from thousands of human genomes
to detect genetic associations with disease.
Doing something useful with data
●Often, “doing something useful” = Placing a schema over it
and using SQL to answer questions like
●“of the gazillion users who made it to the third page in
our registration process, how many are over 25?”
●The field of how to structure a data warehouse and
organize information to make answering these kinds of
questions easy is a rich one.
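As a minimal sketch of "placing a schema over it and using SQL", the question above can be posed against a small in-memory table. The table name, column names, and rows are hypothetical, invented only for illustration, and SQLite stands in for whatever warehouse would hold the real data.

```python
import sqlite3

# Hypothetical schema and rows, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE registration_events (user_id INTEGER, age INTEGER, max_page INTEGER)"
)
conn.executemany(
    "INSERT INTO registration_events VALUES (?, ?, ?)",
    [(1, 31, 3), (2, 22, 3), (3, 40, 2), (4, 27, 4), (5, 25, 3)],
)

# "Of the users who made it to the third page, how many are over 25?"
(over_25,) = conn.execute(
    "SELECT COUNT(*) FROM registration_events WHERE max_page >= 3 AND age > 25"
).fetchone()
print(over_25)
```

Once the schema exists, the question is a single declarative query; the hard part at scale is the warehouse design mentioned above, not the query itself.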
A new superpower !
●When people say that we live in an age of “big data,” they
mean that we have tools for collecting, storing, and
processing information at a scale previously unheard of.
●There is a gap between having access to these tools and all
this data, and doing something useful with it.
Doing extra useful things
●Requirements :
a- Flexible programming model
b- Rich functionality in machine learning and statistics
●Existing Tools :
R, Python (PyData stack) and Octave
Pros : Little effort, easy to use
Cons : Viable only for small data sets, and too complex to
redesign to work over clusters of computers.
Why is it difficult ?
●Some algorithms (like machine learning algos) would have
wide data dependencies.
• Data are partitioned across nodes.
• Network transfer is much sloooower than memory
accesses.
●What about the probability of failures ?
●Summary : We need a programming paradigm that is
sensitive to the characteristics of the underlying system,
that encourages good choices, and that makes it easy to
write parallel code.
High-Performance Computing
●Use Case : processing a large file full of DNA sequencing
reads in parallel
●1- Manually split the file into smaller files
●2- Submit a job for each file split to the scheduler
●3- Continuously monitor the jobs to resubmit any that fail
●All-to-all operations like sorting the full data set would require
streaming through a single node or resorting to MPI.
●Relatively low level of abstraction and difficulty of use, in
addition to the high cost.
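The manual workflow above can be sketched on a single machine to show how much bookkeeping it pushes onto the user. This is only an illustration: a thread pool stands in for the cluster scheduler, in-memory lists stand in for file splits, and the function names (`split_reads`, `count_gc`, `run_with_retries`) and the sample reads are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def split_reads(reads, chunk_size):
    """Step 1: split the input into smaller chunks (stand-ins for file splits)."""
    return [reads[i:i + chunk_size] for i in range(0, len(reads), chunk_size)]

def count_gc(chunk):
    """The per-chunk job: count G and C bases in each read of the chunk."""
    return sum(read.count("G") + read.count("C") for read in chunk)

def run_with_retries(job, chunks, max_attempts=3):
    """Steps 2 and 3: submit one job per chunk and resubmit any that fail."""
    results = [None] * len(chunks)
    pending = list(range(len(chunks)))
    for _ in range(max_attempts):
        if not pending:
            break
        # Leaving the "with" block waits for every submitted job to finish.
        with ThreadPoolExecutor(max_workers=4) as pool:
            futures = {i: pool.submit(job, chunks[i]) for i in pending}
        still_failing = []
        for i, future in futures.items():
            if future.exception() is None:
                results[i] = future.result()
            else:
                still_failing.append(i)
        pending = still_failing
    if pending:
        raise RuntimeError(f"chunks {pending} failed after {max_attempts} attempts")
    return results

reads = ["GATTACA", "CCGGTT", "ATATAT", "GGGCCC", "TTTT"]
chunks = split_reads(reads, chunk_size=2)
total_gc = sum(run_with_retries(count_gc, chunks))
print(total_gc)
```

Even in this toy form, splitting, submission, and failure handling are all hand-written, and an all-to-all step such as a global sort has no natural place in it.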
The 3 truths about data science
●Successful data preprocessing is a must for successful
analysis.
–Large data sets require special treatment.
–Feature engineering should be given more time than the
algorithms themselves. (A model for fraud detection can
use IP location info, login times, click logs.)
–How would you convert features into vectors suitable for ML
algorithms?
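One common answer to the last question, shown here as a rough sketch, is to one-hot encode categorical fields and append numeric fields as floats. The field names (`ip_country`, `login_hour`, `clicks`) and the sample events are hypothetical, loosely modeled on the fraud-detection features mentioned above.

```python
def build_vocabulary(values):
    """Map each distinct categorical value to a fixed column index."""
    return {value: index for index, value in enumerate(sorted(set(values)))}

def to_vector(event, country_index):
    """Encode one event as a flat list of floats: one-hot country + numeric fields."""
    one_hot = [0.0] * len(country_index)
    one_hot[country_index[event["ip_country"]]] = 1.0
    return one_hot + [float(event["login_hour"]), float(event["clicks"])]

# Hypothetical events, invented for illustration only.
events = [
    {"ip_country": "EG", "login_hour": 9, "clicks": 14},
    {"ip_country": "US", "login_hour": 3, "clicks": 250},
    {"ip_country": "EG", "login_hour": 22, "clicks": 7},
]
country_index = build_vocabulary(e["ip_country"] for e in events)
vectors = [to_vector(e, country_index) for e in events]
print(vectors[0])
```

Real pipelines usually also scale the numeric columns and handle values unseen at training time; both are left out to keep the sketch short.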
The 3 truths about data science
●Iteration is the key.
–Famous optimization techniques like gradient descent
require repeated scans over the input until convergence.
–You can't get it right on the first try.
(Features/Algo/Test)
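To make "repeated scans over the input" concrete, here is a small sketch of batch gradient descent fitting a straight line by least squares. The data points, learning rate, and stopping tolerance are illustrative assumptions; the point is that every iteration reads the whole input again, which is why keeping data in memory matters.

```python
def fit_line(points, learning_rate=0.05, tolerance=1e-9, max_iterations=10_000):
    """Fit y = w * x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(points)
    for iteration in range(1, max_iterations + 1):
        # One full scan over the input per iteration.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in points) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in points) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
        if abs(grad_w) < tolerance and abs(grad_b) < tolerance:
            return w, b, iteration
    return w, b, max_iterations

points = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # lies on y = 2x + 1
w, b, scans = fit_line(points)
print(round(w, 4), round(b, 4), scans)
```

On four points the hundreds of scans are free; on billions of records read from disk each time, they dominate the cost.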
Analytics between lab and factory
A framework that makes modeling easy
but is also a good fit for production systems is a huge
win.
Apache Spark In Points
●Spark builds on what Hadoop shines at (linear
scalability, fault tolerance)
●Spark supports DAGs (directed acyclic graphs of operators)
●It complements these capabilities with a rich set of
transformations.
●In-memory processing. (Suitable for iterations)
Apache Spark In Points
●The most important bottleneck that Spark addresses is
analyst productivity. (R, HDFS, MR, etc.)
●Spark is better at being an operational system than most
exploratory systems and better for data exploration than
the technologies commonly used in operational systems.
●Runs on top of the JVM – good integration with the Hadoop
ecosystem
Spark From the other side !
●Still young compared to MapReduce
●Its main components need a lot of work to become mature
(stream processing, SQL, machine learning, and
graph processing)
–MLlib’s pipelines and transformer API model is in progress
–Its statistics and modeling functionality comes nowhere near that of
single-machine languages like R
–Its SQL functionality is rich, but still lags far behind that of Hive.
Spark Programming Model
●It starts with one or more data sets residing in distributed
persistent storage (like HDFS)
●Writing a Spark program typically consists of a few related
steps:
–Defining a set of transformations on input data sets.
–Invoking actions that output the transformed data sets to persistent
storage or return results to the driver’s local memory.
–Running local computations that operate on the results computed in a
distributed fashion. These can help you decide what transformations
and actions to undertake next.
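The three steps can be illustrated with a toy, single-machine stand-in. The `ToyDataset` class below is hypothetical and is not Spark's API; it only mimics the shape of the model: transformations are recorded without running, actions trigger the computation and return results, and local code then works on those results.

```python
class ToyDataset:
    """A toy, single-machine stand-in for a distributed data set (not Spark's API)."""

    def __init__(self, source, steps=()):
        self._source = source  # callable that yields the input records
        self._steps = steps    # recorded transformations, not yet executed

    # Transformations: only record what should happen and return a new data set.
    def map(self, fn):
        return ToyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        return ToyDataset(self._source, self._steps + (("filter", predicate),))

    def _run(self):
        records = self._source()
        for kind, fn in self._steps:
            # Inside a method, map/filter here are Python's built-ins.
            records = map(fn, records) if kind == "map" else filter(fn, records)
        return records

    # Actions: trigger the computation and hand results back to the caller.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())

# Step 1: define transformations (nothing is read or computed yet).
lines = ToyDataset(lambda: iter(["3", "x", "10", "7"]))
numbers = lines.filter(str.isdigit).map(int)
big = numbers.filter(lambda n: n > 5)

# Step 2: invoke actions that return results to the "driver".
total = numbers.count()
values = big.collect()

# Step 3: run local computations on the returned results.
share_big = len(values) / total
print(total, values, share_big)
```

The result of the local step is what guides the next round of transformations and actions, which is the exploratory loop the slide describes.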
Why should you consider Scala ?
●Spark already has wrappers for other languages (Java, Python)
●Scala reduces performance overhead. (Running a different
language on top of the JVM adds overhead.)
●It gives you access to the latest and greatest.
●It will help you understand the Spark philosophy.
–If you know how to use Spark in Scala, even if you primarily
use it from other languages, you’ll have a better
understanding of the system and will be in a better position
to “think in Spark.”
If you are immune to boredom,
there is literally nothing you cannot
accomplish.
—David Foster Wallace
Data Science's First Step
●Data cleansing is the first step in any data science project.
●Many clever analyses have been undone because the data
analyzed had fundamental quality problems or biases.
●It is dull work that you have to do before you can get to
the really cool machine learning algorithm that you’ve been
dying to apply to a new problem.
Our First Real Problem !
●Name : Record Linkage
●Description :
–We have a large collection of records from one or more
source systems.
–It is likely that some of the records refer to the same
underlying entity, such as a customer or a patient.
–Each of the entities has a number of attributes, such as a
name or address.
The Challenge
●Challenge :
–The values of these attributes aren’t perfect
–Values might have different formatting, or typos, or missing
information.
–It is easy for a human to understand and identify at a
glance, but is difficult for a computer to learn.
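A rough sketch of why this is hard for a computer: exact equality fails on formatting differences and typos, so records have to be compared with a tolerant similarity score. The example below uses Python's standard `difflib.SequenceMatcher`; the record fields, the sample records, and the simple averaging rule are assumptions made for illustration, not the method used later in the talk.

```python
from difflib import SequenceMatcher

def normalize(value):
    """Lower-case and collapse whitespace; treat missing values as empty."""
    return " ".join((value or "").lower().split())

def field_similarity(a, b):
    """Similarity in [0, 1] between two field values, tolerant of typos."""
    a, b = normalize(a), normalize(b)
    if not a or not b:
        return 0.0  # missing information gives no evidence of a match
    return SequenceMatcher(None, a, b).ratio()

def record_similarity(r1, r2, fields=("name", "address")):
    """Average the per-field similarities of two records."""
    return sum(field_similarity(r1.get(f), r2.get(f)) for f in fields) / len(fields)

# Hypothetical records: a and b differ by formatting and a typo, c is unrelated.
a = {"name": "Jonathan  Smith", "address": "12 Nile St."}
b = {"name": "jonathan smyth", "address": "12 Nile St"}
c = {"name": "Mona Adel", "address": None}

print(record_similarity(a, b) > record_similarity(a, c))
```

Choosing the threshold above which two records count as the same entity is the part that has to be learned from data.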
Steps we are going to take
●Bringing Data from the Cluster to the Client
●Shipping Code from the Client to the Cluster
●Structuring Data with Tuples and Case Classes
●Getting some numbers regarding our data.
The End
Thanks a lot :)

More Related Content

What's hot

Evolution of big data
Evolution of big dataEvolution of big data
Evolution of big data
ShilpaKrishna6
 
Distributed machine learning
Distributed machine learningDistributed machine learning
Distributed machine learning
Stanley Wang
 
Production ready big ml workflows from zero to hero daniel marcous @ waze
Production ready big ml workflows from zero to hero daniel marcous @ wazeProduction ready big ml workflows from zero to hero daniel marcous @ waze
Production ready big ml workflows from zero to hero daniel marcous @ waze
Ido Shilon
 
Jean-François Puget, Distinguished Engineer, Machine Learning and Optimizatio...
Jean-François Puget, Distinguished Engineer, Machine Learning and Optimizatio...Jean-François Puget, Distinguished Engineer, Machine Learning and Optimizatio...
Jean-François Puget, Distinguished Engineer, Machine Learning and Optimizatio...
MLconf
 
Cheat sheets for data scientists
Cheat sheets for data scientistsCheat sheets for data scientists
Cheat sheets for data scientistsAjay Ohri
 
Big data and data science overview
Big data and data science overviewBig data and data science overview
Big data and data science overview
Colleen Farrelly
 
IEEE 2014 JAVA DATA MINING PROJECTS Scalable keyword search on large rdf data
IEEE 2014 JAVA DATA MINING PROJECTS Scalable keyword search on large rdf dataIEEE 2014 JAVA DATA MINING PROJECTS Scalable keyword search on large rdf data
IEEE 2014 JAVA DATA MINING PROJECTS Scalable keyword search on large rdf data
IEEEFINALYEARSTUDENTPROJECTS
 
Data Scientist Salary, Skills, Jobs And Resume | Data Scientist Career | Data...
Data Scientist Salary, Skills, Jobs And Resume | Data Scientist Career | Data...Data Scientist Salary, Skills, Jobs And Resume | Data Scientist Career | Data...
Data Scientist Salary, Skills, Jobs And Resume | Data Scientist Career | Data...
Simplilearn
 
Helping data scientists escape the seduction of the sandbox - Krish Swamy, We...
Helping data scientists escape the seduction of the sandbox - Krish Swamy, We...Helping data scientists escape the seduction of the sandbox - Krish Swamy, We...
Helping data scientists escape the seduction of the sandbox - Krish Swamy, We...
Sri Ambati
 
Big Data Analytics: From SQL to Machine Learning and Graph Analysis
Big Data Analytics: From SQL to Machine Learning and Graph AnalysisBig Data Analytics: From SQL to Machine Learning and Graph Analysis
Big Data Analytics: From SQL to Machine Learning and Graph Analysis
Yuanyuan Tian
 
Data Science Lifecycle
Data Science LifecycleData Science Lifecycle
Data Science Lifecycle
SwapnilDahake2
 
Data Science Salon: Kaggle 1st Place in 30 minutes: Putting AutoML to Work wi...
Data Science Salon: Kaggle 1st Place in 30 minutes: Putting AutoML to Work wi...Data Science Salon: Kaggle 1st Place in 30 minutes: Putting AutoML to Work wi...
Data Science Salon: Kaggle 1st Place in 30 minutes: Putting AutoML to Work wi...
Formulatedby
 
Are we reaching a Data Science Singularity? How Cognitive Computing is emergi...
Are we reaching a Data Science Singularity? How Cognitive Computing is emergi...Are we reaching a Data Science Singularity? How Cognitive Computing is emergi...
Are we reaching a Data Science Singularity? How Cognitive Computing is emergi...
Big Data Spain
 
Different Career Paths in Data Science
Different Career Paths in Data ScienceDifferent Career Paths in Data Science
Different Career Paths in Data Science
Roger Huang
 
Topic modeling using big data analytics
Topic modeling using big data analytics Topic modeling using big data analytics
Topic modeling using big data analytics
Farheen Nilofer
 
Maoye resume 2017_1_v10_short
Maoye resume 2017_1_v10_shortMaoye resume 2017_1_v10_short
Maoye resume 2017_1_v10_short
Mao Ye
 
Data science
Data scienceData science
Data science
GitanshuSharma1
 
Big Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely headingBig Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely heading
Paco Nathan
 
Data Scientist vs Data Analyst vs Data Engineer - Role & Responsibility, Skil...
Data Scientist vs Data Analyst vs Data Engineer - Role & Responsibility, Skil...Data Scientist vs Data Analyst vs Data Engineer - Role & Responsibility, Skil...
Data Scientist vs Data Analyst vs Data Engineer - Role & Responsibility, Skil...
Simplilearn
 
Strata parallel m-ml-ops_sept_2017
Strata parallel m-ml-ops_sept_2017Strata parallel m-ml-ops_sept_2017
Strata parallel m-ml-ops_sept_2017
Nisha Talagala
 

What's hot (20)

Evolution of big data
Evolution of big dataEvolution of big data
Evolution of big data
 
Distributed machine learning
Distributed machine learningDistributed machine learning
Distributed machine learning
 
Production ready big ml workflows from zero to hero daniel marcous @ waze
Production ready big ml workflows from zero to hero daniel marcous @ wazeProduction ready big ml workflows from zero to hero daniel marcous @ waze
Production ready big ml workflows from zero to hero daniel marcous @ waze
 
Jean-François Puget, Distinguished Engineer, Machine Learning and Optimizatio...
Jean-François Puget, Distinguished Engineer, Machine Learning and Optimizatio...Jean-François Puget, Distinguished Engineer, Machine Learning and Optimizatio...
Jean-François Puget, Distinguished Engineer, Machine Learning and Optimizatio...
 
Cheat sheets for data scientists
Cheat sheets for data scientistsCheat sheets for data scientists
Cheat sheets for data scientists
 
Big data and data science overview
Big data and data science overviewBig data and data science overview
Big data and data science overview
 
IEEE 2014 JAVA DATA MINING PROJECTS Scalable keyword search on large rdf data
IEEE 2014 JAVA DATA MINING PROJECTS Scalable keyword search on large rdf dataIEEE 2014 JAVA DATA MINING PROJECTS Scalable keyword search on large rdf data
IEEE 2014 JAVA DATA MINING PROJECTS Scalable keyword search on large rdf data
 
Data Scientist Salary, Skills, Jobs And Resume | Data Scientist Career | Data...
Data Scientist Salary, Skills, Jobs And Resume | Data Scientist Career | Data...Data Scientist Salary, Skills, Jobs And Resume | Data Scientist Career | Data...
Data Scientist Salary, Skills, Jobs And Resume | Data Scientist Career | Data...
 
Helping data scientists escape the seduction of the sandbox - Krish Swamy, We...
Helping data scientists escape the seduction of the sandbox - Krish Swamy, We...Helping data scientists escape the seduction of the sandbox - Krish Swamy, We...
Helping data scientists escape the seduction of the sandbox - Krish Swamy, We...
 
Big Data Analytics: From SQL to Machine Learning and Graph Analysis
Big Data Analytics: From SQL to Machine Learning and Graph AnalysisBig Data Analytics: From SQL to Machine Learning and Graph Analysis
Big Data Analytics: From SQL to Machine Learning and Graph Analysis
 
Data Science Lifecycle
Data Science LifecycleData Science Lifecycle
Data Science Lifecycle
 
Data Science Salon: Kaggle 1st Place in 30 minutes: Putting AutoML to Work wi...
Data Science Salon: Kaggle 1st Place in 30 minutes: Putting AutoML to Work wi...Data Science Salon: Kaggle 1st Place in 30 minutes: Putting AutoML to Work wi...
Data Science Salon: Kaggle 1st Place in 30 minutes: Putting AutoML to Work wi...
 
Are we reaching a Data Science Singularity? How Cognitive Computing is emergi...
Are we reaching a Data Science Singularity? How Cognitive Computing is emergi...Are we reaching a Data Science Singularity? How Cognitive Computing is emergi...
Are we reaching a Data Science Singularity? How Cognitive Computing is emergi...
 
Different Career Paths in Data Science
Different Career Paths in Data ScienceDifferent Career Paths in Data Science
Different Career Paths in Data Science
 
Topic modeling using big data analytics
Topic modeling using big data analytics Topic modeling using big data analytics
Topic modeling using big data analytics
 
Maoye resume 2017_1_v10_short
Maoye resume 2017_1_v10_shortMaoye resume 2017_1_v10_short
Maoye resume 2017_1_v10_short
 
Data science
Data scienceData science
Data science
 
Big Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely headingBig Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely heading
 
Data Scientist vs Data Analyst vs Data Engineer - Role & Responsibility, Skil...
Data Scientist vs Data Analyst vs Data Engineer - Role & Responsibility, Skil...Data Scientist vs Data Analyst vs Data Engineer - Role & Responsibility, Skil...
Data Scientist vs Data Analyst vs Data Engineer - Role & Responsibility, Skil...
 
Strata parallel m-ml-ops_sept_2017
Strata parallel m-ml-ops_sept_2017Strata parallel m-ml-ops_sept_2017
Strata parallel m-ml-ops_sept_2017
 

Viewers also liked

Dia 25 abril
Dia 25 abrilDia 25 abril
Dia 25 abril
iscarlos nathiveira
 
2013_07 - Saturn Monthly Review
2013_07 - Saturn Monthly Review2013_07 - Saturn Monthly Review
2013_07 - Saturn Monthly ReviewAwangku Kasman
 
Entorno De Las Aplicaciones Web
Entorno De Las Aplicaciones Web Entorno De Las Aplicaciones Web
Entorno De Las Aplicaciones Web Mireya Marquez
 
Apakah saudara sudah tahu berapabanyakuangygberedardiduniaburuh
Apakah saudara sudah tahu berapabanyakuangygberedardiduniaburuhApakah saudara sudah tahu berapabanyakuangygberedardiduniaburuh
Apakah saudara sudah tahu berapabanyakuangygberedardiduniaburuh
henry jaya teddy
 
แบบฟอร์มศาลาบริการ-ศาลาสีม่วง
แบบฟอร์มศาลาบริการ-ศาลาสีม่วงแบบฟอร์มศาลาบริการ-ศาลาสีม่วง
แบบฟอร์มศาลาบริการ-ศาลาสีม่วง
YoYoo Showbiz
 
Techknow
TechknowTechknow
Techknow
TechknowInc
 
Тренинг по новым медиа
Тренинг по новым медиаТренинг по новым медиа
Тренинг по новым медиа
Sergey
 
3 l'usurpateur d'irsmun
3   l'usurpateur d'irsmun3   l'usurpateur d'irsmun
3 l'usurpateur d'irsmun
Wolfen Dugondor
 

Viewers also liked (11)

Dia 25 abril
Dia 25 abrilDia 25 abril
Dia 25 abril
 
CONMEMORACÍON DE TODOS LOS DIFUNTOS
CONMEMORACÍON DE TODOS LOS DIFUNTOSCONMEMORACÍON DE TODOS LOS DIFUNTOS
CONMEMORACÍON DE TODOS LOS DIFUNTOS
 
2013_07 - Saturn Monthly Review
2013_07 - Saturn Monthly Review2013_07 - Saturn Monthly Review
2013_07 - Saturn Monthly Review
 
Entorno De Las Aplicaciones Web
Entorno De Las Aplicaciones Web Entorno De Las Aplicaciones Web
Entorno De Las Aplicaciones Web
 
Apakah saudara sudah tahu berapabanyakuangygberedardiduniaburuh
Apakah saudara sudah tahu berapabanyakuangygberedardiduniaburuhApakah saudara sudah tahu berapabanyakuangygberedardiduniaburuh
Apakah saudara sudah tahu berapabanyakuangygberedardiduniaburuh
 
แบบฟอร์มศาลาบริการ-ศาลาสีม่วง
แบบฟอร์มศาลาบริการ-ศาลาสีม่วงแบบฟอร์มศาลาบริการ-ศาลาสีม่วง
แบบฟอร์มศาลาบริการ-ศาลาสีม่วง
 
Techknow
TechknowTechknow
Techknow
 
ateeq
ateeqateeq
ateeq
 
Тренинг по новым медиа
Тренинг по новым медиаТренинг по новым медиа
Тренинг по новым медиа
 
AFZAAL AHMAD
AFZAAL AHMADAFZAAL AHMAD
AFZAAL AHMAD
 
3 l'usurpateur d'irsmun
3   l'usurpateur d'irsmun3   l'usurpateur d'irsmun
3 l'usurpateur d'irsmun
 

Similar to Is Spark the right choice for data analysis ?

Introduction To Data Science with Apache Spark
Introduction To Data Science with Apache Spark Introduction To Data Science with Apache Spark
Introduction To Data Science with Apache Spark
ZaranTech LLC
 
Cloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your DataCloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera, Inc.
 
Machine learning at scale challenges and solutions
Machine learning at scale challenges and solutionsMachine learning at scale challenges and solutions
Machine learning at scale challenges and solutions
Stavros Kontopoulos
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
Home
 
Analyzing Big data in R and Scala using Apache Spark 17-7-19
Analyzing Big data in R and Scala using Apache Spark  17-7-19Analyzing Big data in R and Scala using Apache Spark  17-7-19
Analyzing Big data in R and Scala using Apache Spark 17-7-19
Ahmed Elsayed
 
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
Ilkay Altintas, Ph.D.
 
Bitkom Cray presentation - on HPC affecting big data analytics in FS
Bitkom Cray presentation - on HPC affecting big data analytics in FSBitkom Cray presentation - on HPC affecting big data analytics in FS
Bitkom Cray presentation - on HPC affecting big data analytics in FS
Philip Filleul
 
From Pipelines to Refineries: Scaling Big Data Applications
From Pipelines to Refineries: Scaling Big Data ApplicationsFrom Pipelines to Refineries: Scaling Big Data Applications
From Pipelines to Refineries: Scaling Big Data Applications
Databricks
 
Lambda Architecture and open source technology stack for real time big data
Lambda Architecture and open source technology stack for real time big dataLambda Architecture and open source technology stack for real time big data
Lambda Architecture and open source technology stack for real time big data
Trieu Nguyen
 
Tutorial4
Tutorial4Tutorial4
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
VMware Tanzu
 
Machine Learning with Spark
Machine Learning with SparkMachine Learning with Spark
Machine Learning with Spark
elephantscale
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Paco Nathan
 
Spark Driven Big Data Analytics
Spark Driven Big Data AnalyticsSpark Driven Big Data Analytics
Spark Driven Big Data Analytics
inoshg
 
Started with-apache-spark
Started with-apache-sparkStarted with-apache-spark
Started with-apache-spark
Happiest Minds Technologies
 
Python + MPP Database = Large Scale AI/ML Projects in Production Faster
Python + MPP Database = Large Scale AI/ML Projects in Production FasterPython + MPP Database = Large Scale AI/ML Projects in Production Faster
Python + MPP Database = Large Scale AI/ML Projects in Production Faster
Paige_Roberts
 
Challenges in Large Scale Machine Learning
Challenges in Large Scale  Machine LearningChallenges in Large Scale  Machine Learning
Challenges in Large Scale Machine Learning
Sudarsun Santhiappan
 
Real time data processing frameworks
Real time data processing frameworksReal time data processing frameworks
Real time data processing frameworks
IJDKP
 

Similar to Is Spark the right choice for data analysis ? (20)

Introduction To Data Science with Apache Spark
Introduction To Data Science with Apache Spark Introduction To Data Science with Apache Spark
Introduction To Data Science with Apache Spark
 
Cloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your DataCloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your Data
 
Machine learning at scale challenges and solutions
Machine learning at scale challenges and solutionsMachine learning at scale challenges and solutions
Machine learning at scale challenges and solutions
 
Spark
SparkSpark
Spark
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
 
Analyzing Big data in R and Scala using Apache Spark 17-7-19
Analyzing Big data in R and Scala using Apache Spark  17-7-19Analyzing Big data in R and Scala using Apache Spark  17-7-19
Analyzing Big data in R and Scala using Apache Spark 17-7-19
 
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
 
Bitkom Cray presentation - on HPC affecting big data analytics in FS
Bitkom Cray presentation - on HPC affecting big data analytics in FSBitkom Cray presentation - on HPC affecting big data analytics in FS
Bitkom Cray presentation - on HPC affecting big data analytics in FS
 
From Pipelines to Refineries: Scaling Big Data Applications
From Pipelines to Refineries: Scaling Big Data ApplicationsFrom Pipelines to Refineries: Scaling Big Data Applications
From Pipelines to Refineries: Scaling Big Data Applications
 
Lambda Architecture and open source technology stack for real time big data
Lambda Architecture and open source technology stack for real time big dataLambda Architecture and open source technology stack for real time big data
Lambda Architecture and open source technology stack for real time big data
 
Tutorial4
Tutorial4Tutorial4
Tutorial4
 
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
 
Machine Learning with Spark
Machine Learning with SparkMachine Learning with Spark
Machine Learning with Spark
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
 
Spark Driven Big Data Analytics
Spark Driven Big Data AnalyticsSpark Driven Big Data Analytics
Spark Driven Big Data Analytics
 
Started with-apache-spark
Started with-apache-sparkStarted with-apache-spark
Started with-apache-spark
 
INFO491FinalPaper
INFO491FinalPaperINFO491FinalPaper
INFO491FinalPaper
 
Python + MPP Database = Large Scale AI/ML Projects in Production Faster
Python + MPP Database = Large Scale AI/ML Projects in Production FasterPython + MPP Database = Large Scale AI/ML Projects in Production Faster
Python + MPP Database = Large Scale AI/ML Projects in Production Faster
 
Challenges in Large Scale Machine Learning
Challenges in Large Scale  Machine LearningChallenges in Large Scale  Machine Learning
Challenges in Large Scale Machine Learning
 
Real time data processing frameworks
Real time data processing frameworksReal time data processing frameworks
Real time data processing frameworks
 

Recently uploaded

【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
NABLAS株式会社
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
ewymefz
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
Opendatabay
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Linda486226
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
Oppotus
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
Tiktokethiodaily
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
ewymefz
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
MaleehaSheikh2
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Subhajit Sahu
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
AbhimanyuSinha9
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
axoqas
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
vcaxypu
 
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
correoyaya
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
ewymefz
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
ArpitMalhotra16
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
NABLAS株式会社
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
nscud
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
ewymefz
 
Jpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization Sample
James Polillo
 

Recently uploaded (20)

【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单

Is Spark the right choice for data analysis ?

  • 1. Is Spark the right choice for Data Analysis ? Ahmed Kamal, Big Data Engineer http://ahmedkamal.me
  • 2. Resources ? ●“Advanced Analytics with Spark”, a practical book ! ●“The thing I like most about this book is its focus on examples, which are all drawn from real applications on real-world data sets.” - Matei Zaharia, CTO at Databricks. ●It is all about developing data applications using Spark
  • 3. Data Applications, like what ? ●Build a model to detect credit card fraud using thousands of features and billions of transactions. ●Intelligently recommend millions of products to millions of users. ●Estimate financial risk through simulations of portfolios including millions of instruments. ●Easily manipulate data from thousands of human genomes to detect genetic associations with disease.
  • 4. Doing something useful with data ●Often, “doing something useful” = Placing a schema over it and using SQL to answer questions like ●“of the gazillion users who made it to the third page in our registration process, how many are over 25?” ●The field of how to structure a data warehouse and organize information to make answering these kinds of questions easy is a rich one.
  • 5. A new superpower ! ●When people say that we live in an age of “big data,” they mean that we have tools for collecting, storing, and processing information at a scale previously unheard of. ●There is a gap between having access to these tools and all this data, and doing something useful with it.
  • 6. Doing extra useful things ●Requirements : a- Flexible programming model b- Rich functionality in machine learning and statistics ●Existing Tools : R, Python (PyData stack) and Octave Pros : Little effort, easy to use Cons : Viable only for small data sets; too complex to redesign for running on clusters of computers.
  • 7. Why is it difficult ? ●Some algorithms (such as machine learning algorithms) have wide data dependencies. • Data are partitioned across nodes. • Network transfer is much slower than memory access. ●What about the probability of failures ? ●Summary : We need a programming paradigm that is sensitive to the characteristics of the underlying system, encourages good choices, and makes it easy to write parallel code.
  • 8. High Performance Computing ●Use Case : processing a large file full of DNA sequencing reads in parallel ●1- Manually splitting the file into smaller files ●2- Submitting a job for each file split to the scheduler ●3- Continuously monitoring the jobs to resubmit any that fail ●All-to-all operations like sorting the full data set require either streaming all the data through one node or resorting to MPI. ●Relatively low level of abstraction and difficult to use, in addition to the high cost.
  • 9. The 3 truths about data science ●Successful data preprocessing is a must for successful analysis. –Large data sets require special treatment –Feature engineering should be given more time than the algorithms themselves. (A model for fraud detection can use IP location info, login times, click logs) –How would you convert features into vectors suitable for ML algorithms ?
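The last question on this slide, turning raw features into vectors, can be illustrated with a minimal Python sketch. The field names (`ip_country`, `login_hour`, `clicks`), the `COUNTRIES` vocabulary, and the `to_vector` helper are all hypothetical choices for illustration; the slides do not prescribe an encoding.

```python
from typing import Dict, List

# Assumed closed vocabulary for the categorical feature (illustrative only).
COUNTRIES = ["EG", "US", "DE"]


def to_vector(event: Dict[str, object]) -> List[float]:
    """Turn one raw event into a fixed-length numeric vector."""
    # One-hot encode the categorical IP country.
    one_hot = [1.0 if event["ip_country"] == c else 0.0 for c in COUNTRIES]
    # Scale the login hour (0-23) into the range [0, 1).
    hour = float(event["login_hour"]) / 24.0
    # Keep the click count as a plain numeric feature.
    clicks = float(event["clicks"])
    return one_hot + [hour, clicks]


vector = to_vector({"ip_country": "US", "login_hour": 6, "clicks": 12})
print(vector)  # [0.0, 1.0, 0.0, 0.25, 12.0]
```

Every event maps to a vector of the same length, which is what most ML algorithms expect as input.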
  • 10. The 3 truths about data science ●Iteration is the key. –Well-known optimization techniques like Gradient Descent require repeated scans over the input until convergence –You can't get it right on the first try. (Features/Algo/Test)
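To make "repeated scans over the input until convergence" concrete, here is a small single-machine sketch of batch gradient descent fitting a line. The function name `fit_line`, the learning rate, and the tolerance are assumptions chosen for this toy example; each loop iteration reads the whole input once, which is why keeping the data in memory pays off.

```python
from typing import List, Tuple


def fit_line(points: List[Tuple[float, float]], lr: float = 0.05,
             tol: float = 1e-9, max_iters: int = 100_000) -> Tuple[float, float, int]:
    """Fit y = w*x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(points)
    for i in range(1, max_iters + 1):
        # One full scan over the input per iteration.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in points) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in points) / n
        w -= lr * grad_w
        b -= lr * grad_b
        # Stop once both gradient components are close to zero.
        if abs(grad_w) < tol and abs(grad_b) < tol:
            return w, b, i
    return w, b, max_iters


# Points that lie exactly on y = 2x + 1.
w, b, scans = fit_line([(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)])
print(round(w, 4), round(b, 4), scans)
```

Even this four-point problem needs hundreds of passes over the data before the gradient falls below the tolerance.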
  • 11. Analytics between lab and factory A framework that makes modeling easy but is also a good fit for production systems is a huge win.
  • 12. Apache Spark In Points ●Spark builds on what Hadoop shines at (linear scalability, fault tolerance) ●Spark supports a DAG (Directed Acyclic Graph) of operators ●It complements these capabilities with a rich set of transformations. ●In-memory processing. (Suitable for iterations)
  • 13. Apache Spark In Points ●The most important bottleneck that Spark addresses is analyst productivity. (R, HDFS, MapReduce, etc.) ●Spark is better at being an operational system than most exploratory systems and better for data exploration than the technologies commonly used in operational systems. ●Runs on top of the JVM – good integration with the Hadoop ecosystem
  • 14. Spark From the other side ! ●Still young compared to MapReduce ●Its main components need a lot of work to be mature enough (stream processing, SQL, machine learning, and graph processing) –MLlib’s pipelines and transformer API model is in progress –Its statistics and modeling functionality comes nowhere near that of single-machine languages like R –Its SQL functionality is rich, but still lags far behind that of Hive.
  • 15. Spark Programming Model ●It starts with one or a few data sets residing in distributed persistent storage (like HDFS) ●Writing a Spark program typically consists of a few related steps: –Defining a set of transformations on input data sets. –Invoking actions that output the transformed data sets to persistent storage or return results to the driver’s local memory. –Running local computations that operate on the results computed in a distributed fashion. These can help you decide what transformations and actions to undertake next.
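The three steps above can be mimicked on one machine with a toy class. This is an analogy only: `LazyDataset` and its methods are hypothetical and are not the Spark API. The point is that transformations merely record work, and nothing executes until an action asks for a result.

```python
from typing import Callable, Iterable, List


class LazyDataset:
    """Toy stand-in for a distributed data set; NOT the Spark API."""

    def __init__(self, source: Callable[[], Iterable]):
        self._source = source  # re-evaluated on every action

    def map(self, fn: Callable) -> "LazyDataset":
        # Transformation: records the step, computes nothing yet.
        return LazyDataset(lambda: (fn(x) for x in self._source()))

    def filter(self, pred: Callable) -> "LazyDataset":
        # Transformation: also lazy.
        return LazyDataset(lambda: (x for x in self._source() if pred(x)))

    def collect(self) -> List:
        # Action: runs the whole chain and returns results to the caller.
        return list(self._source())

    def count(self) -> int:
        # Action: runs the whole chain again and counts the results.
        return sum(1 for _ in self._source())


calls = []  # records every time the parse step actually runs


def parse(line: str) -> int:
    calls.append(line)
    return int(line)


# Step 1: define transformations on the input.
lines = LazyDataset(lambda: ["1", "2", "3", "4"])
evens = lines.map(parse).filter(lambda n: n % 2 == 0)
calls_before_action = len(calls)  # still 0: nothing has executed

# Step 2: invoke actions that bring results back locally.
result = evens.collect()
total = evens.count()

# Step 3: run a local computation on the returned results.
largest = max(result)
print(calls_before_action, result, total, largest, len(calls))
```

In this toy, each action recomputes the chain from the source, so `parse` runs once per input line per action. Spark gives you explicit control over the equivalent cost by letting you cache an intermediate data set in memory.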
  • 16. Why should you consider Scala ? ●Spark already has wrappers for other languages (Java, Python) ●It reduces performance overhead. (The cost of running a different language on top of the JVM) ●It gives you access to the latest and greatest. ●It will help you understand the Spark philosophy. –If you know how to use Spark in Scala, even if you primarily use it from other languages, you’ll have a better understanding of the system and will be in a better position to “think in Spark.”
  • 17. If you are immune to boredom, there is literally nothing you cannot accomplish. —David Foster Wallace
  • 18. Data Science's First Step ●Data cleansing is the first step in any data science project. ●Many clever analyses have been undone because the data analyzed had fundamental quality problems or biases. ●It is dull work that you have to do before you can get to the really cool machine learning algorithm that you’ve been dying to apply to a new problem.
  • 19. Our First Real Problem ! ●Name : Record Linkage ●Description : –We have a large collection of records from one or more source systems –It is likely that some of the records refer to the same underlying entity, such as a customer or a patient. –Each of the entities has a number of attributes, such as a name or address
  • 20. The Challenge ●Challenge : –The values of these attributes aren’t perfect –Values might have different formatting, typos, or missing information. –Duplicates are easy for a human to identify at a glance, but difficult for a computer to learn to detect.
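A rough sketch of why this is hard for a computer: two values must be normalized and then compared approximately, and a missing value yields no evidence at all. The helpers `normalize` and `field_similarity` are hypothetical, and `difflib.SequenceMatcher` from the Python standard library stands in for whatever string metric a real linkage system would use.

```python
from difflib import SequenceMatcher
from typing import Optional


def normalize(value: Optional[str]) -> Optional[str]:
    """Lower-case, drop punctuation, and collapse whitespace; None if missing."""
    if value is None or not value.strip():
        return None
    cleaned = value.lower().replace(".", "").replace(",", " ")
    return " ".join(cleaned.split())


def field_similarity(a: Optional[str], b: Optional[str]) -> Optional[float]:
    """Similarity in [0, 1], or None when either value is missing."""
    a, b = normalize(a), normalize(b)
    if a is None or b is None:
        return None  # missing information: no evidence either way
    return SequenceMatcher(None, a, b).ratio()


same_address = field_similarity("123 Main St.", "123  main st")  # formatting only
typo_name = field_similarity("John Smith", "Jon Smith")          # one-letter typo
missing = field_similarity("John Smith", None)                   # missing value
print(same_address, typo_name, missing)
```

Formatting differences disappear after normalization, a typo lowers the score slightly, and a missing value has to be handled as its own case rather than as a score of zero.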
  • 21. Steps we are going to take ●Bringing Data from the Cluster to the Client ●Shipping Code from the Client to the Cluster ●Structuring Data with Tuples and Case Classes ●Getting some numbers regarding our data.
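The last two steps, structuring records and getting some numbers, might look like the following on a single machine. The slides use Scala tuples and case classes; a frozen Python dataclass is used here as a loose analogue. The column layout, the `?` marker for missing values, and the names `MatchData` and `parse` are assumptions made for this illustration, not details taken from the slides.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class MatchData:
    """One compared pair of records (hypothetical layout)."""
    id1: int
    id2: int
    scores: Tuple[Optional[float], ...]
    matched: bool


def parse(line: str) -> MatchData:
    parts = line.strip().split(",")
    # Assumed convention: "?" marks a missing comparison score.
    scores = tuple(None if p == "?" else float(p) for p in parts[2:-1])
    return MatchData(int(parts[0]), int(parts[1]), scores,
                     parts[-1].upper() == "TRUE")


raw_lines = [
    "1,2,0.9,?,1,TRUE",
    "3,4,0.2,0.1,0,FALSE",
    "5,6,?,0.8,1,TRUE",
]
records = [parse(line) for line in raw_lines]

# "Getting some numbers": how many pairs are matches, and how many
# records are missing the first score?
match_counts = Counter(r.matched for r in records)
missing_first = sum(1 for r in records if r.scores[0] is None)
print(match_counts[True], match_counts[False], missing_first)
```

Named fields make later code read as `r.matched` or `r.scores[0]` instead of positional indexing into a raw tuple, which is the main motivation for moving from tuples to a structured type.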