Exploratory Data Analysis
in Spark with Jupyter
https://github.com/phatak-dev/Statistical-Data-Exploration-Using-Spark-2.0
● Madhukara Phatak
● Team Lead at Tellius
● Work in Hadoop, Spark, ML
and Scala
● www.madhukaraphatak.com
Agenda
● Introduction to EDA
● EDA on Big Data
● EDA with Notebooks
● Five Point Summary
● Pyspark and EDA
● Histograms
● Outlier detection
● Correlation
Introduction to EDA
What’s EDA
● Exploratory data analysis (EDA) is an approach to
analyzing data sets to summarize their main
characteristics, often with visual methods
● Uses statistical methods to analyse different aspects of
the data
● Puts a lot of importance on visualisation
● Some of the EDA techniques are
○ Histograms
○ Correlations etc
Why EDA?
● EDA helps data scientists understand the distribution
of the data before it is fed to downstream
algorithms
● EDA also helps to understand the correlation between
the different variables gathered during data collection
● Visualising the data also helps us see patterns in the
data which can inform later parts of the analysis
● The interactivity of EDA helps in exploring different
assumptions
EDA in Hadoop Era
● Typical EDA is an interactive and highly
experimental process
● The first generation Hadoop systems were mostly built
for batch processing and did not offer many tools for
interactivity
● So data scientists typically took a sample of the
data and ran EDA using traditional tools like R / Python
etc
Limitations of Sample EDA
● Running EDA on a sample requires a sampling
technique that produces data representative of the
distribution of the full data
● That’s hard to achieve for multi-dimensional data, which
is most real-world data
● Samples sometimes create issues for skewed distributions
Ex : Payment type in NYC taxi data
● So though a sample works for most cases, it’s not
the most accurate
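The skew problem can be made concrete with a small calculation. The sketch below is plain Python, not Spark code, and assumes a simple random sample drawn with replacement; the 0.1% share for a rare payment type is a made-up figure, not taken from the NYC taxi data.

```python
def prob_category_missing(category_share: float, sample_size: int) -> float:
    """Probability that a simple random sample (with replacement) contains
    no row of a category that makes up `category_share` of the full data."""
    return (1.0 - category_share) ** sample_size


# Hypothetical share: a payment type seen in 0.1% of trips (made-up figure).
rare_share = 0.001
for n in (100, 1_000, 10_000):
    print(n, round(prob_category_missing(rare_share, n), 4))
```

For small samples the rare category is likely to be absent altogether, which is one way a sample can misrepresent a skewed column.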
EDA in Spark Era
Interactive Analysis in Spark
● Spark has been built for interactive data analysis from day one
● Below are some of the features that are good for interactive
analysis
○ Interactive spark-shell
○ Local mode for low latency
○ Caching for speed up
○ Dataframe abstraction to support structured data
analysis
○ Support for Python
EDA on Notebooks
● Spark shell is good for one-liners
● It’s not a great interface for writing long interactive
queries
● It also doesn’t support good visualisation options,
which are important for EDA
● So notebook systems are an alternative to spark shell
which keep the interactivity of the shell and add other
advanced features
● So a notebook interface is good for EDA
Jupyter Notebook
Introduction to Notebook System
● A notebook is an interactive web interface primarily used
for exploratory programming
● Notebooks are spiritual successors to the interactive shells
found in languages like Python, Scala etc
● Notebook systems typically support multiple language
backends using kernels or interpreters
● An interpreter is the language runtime which is responsible
for the actual interpretation of code
● Ex : IPython, Zeppelin, Jupyter
Introduction to Jupyter
● Jupyter is one of the notebook systems, which evolved
from the IPython shell and notebook system
● Primarily built for Python-based analysis
● Now supports multiple languages like Python, R, Scala
etc
● Also has good support for big data frameworks like
Spark, Flink
● http://jupyter.org/
Five Point Summary
● The five number summary is one of the basic data
exploration techniques, which shows how the values of
dataset columns are distributed
● It calculates the values below for a column
○ Min - Minimum value of the column
○ First Quartile - Value below which 25% of the data falls
○ Median - Middle value
○ Third Quartile - Value below which 75% of the data falls
○ Max - Maximum value of the column
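The five values listed above can be sketched in plain Python for a small in-memory list. This is only an illustration of the definition, not the Spark code used in the talk; the function name and the sample values are hypothetical, and the "inclusive" quartile method is one of several conventions.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of at least two numbers."""
    ordered = sorted(values)
    # quantiles(n=4) returns the three cut points Q1, median and Q3.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]


# Hypothetical life-expectancy values, for illustration only.
life_expectancy = [52.0, 58.5, 61.0, 66.2, 70.1, 72.4, 75.0, 78.3, 81.9]
print(five_number_summary(life_expectancy))
```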
Five Point Summary in Spark
● In Spark, we can use the describe method on a dataframe to
get this summary for a given column
● In our example, we'll be using life expectancy data and
generating the five point summary
● Ex : SummaryExample.scala
● From the results we can observe that
○ The quartiles and the median are missing
○ Spark gives stddev, which is not part of the original
definition
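To make the difference from the five point summary visible, the sketch below computes in plain Python the kind of statistics describe reports for a numeric column (count, mean, stddev, min, max). It is not Spark code; the function name is hypothetical, and treating stddev as the sample standard deviation is an assumption about Spark's behaviour.

```python
import statistics


def describe_like(values):
    """Compute count, mean, stddev, min and max for a list of numbers.

    stddev is the sample standard deviation (assumed to match Spark).
    Note that no quartiles or median are part of this output.
    """
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "stddev": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }


# Hypothetical life-expectancy values, for illustration only.
print(describe_like([52.0, 61.0, 70.1, 75.0, 81.9]))
```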
Approximate Quantiles
● Quantiles are costly to calculate on large data as they
require sorting, which can result in skewed computation
● So by default Spark skips them in the describe function
● Spark 2.0 introduced a new method, approxQuantile,
on the stat functions of a dataframe
● This allows us to calculate the different quantiles
in reasonable time, with a threshold for accuracy
● Ex : SummaryExample.scala
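The accuracy threshold is a relative error on the rank of the returned value. Spark's documentation for approxQuantile states that, for N rows, probability p and error err, the exact rank of the returned value lies between floor((p - err) * N) and ceil((p + err) * N). The sketch below only evaluates that bound in plain Python; the function name is hypothetical, and clamping to [0, N] is an addition for readability.

```python
import math


def acceptable_rank_range(n_rows: int, probability: float, relative_error: float):
    """Rank interval allowed for an approximate quantile.

    Implements floor((p - err) * N) <= rank <= ceil((p + err) * N),
    clamped to the valid range [0, N] (the clamping is an assumption).
    """
    lower = math.floor((probability - relative_error) * n_rows)
    upper = math.ceil((probability + relative_error) * n_rows)
    return max(lower, 0), min(upper, n_rows)


# Median of one million rows with a 1% relative error.
print(acceptable_rank_range(1_000_000, 0.5, 0.01))
```

A relative error of 0 asks for the exact quantile, which is the expensive case the slide warns about.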
Visualizing Five Point Summary
● In the earlier examples, we calculated the five point
summary
● By just looking at the numbers, it’s difficult to
understand how the data is distributed
● It’s always good to visualize the numbers to
understand the distribution
● A box plot is a good way to visualize these numbers
● But how to visualize in Scala?
Scala and Visualisation Libraries
● Scala is often the language of choice for developing Spark
applications
● Scala gives rich language primitives to build robust
scalable systems
● But when it comes to EDA, ecosystem support for
visualization and other tools is not great in Scala
● Even though there are efforts like plot.ly or Vegas, they
are not as mature as pyplot or similar ones
● So Scala may not be a great language of choice for EDA
EDA and PySpark
Pyspark
● Pyspark is a Python interface for the Spark APIs
● With the Dataframe and Dataset API, performance is on par
with the Scala equivalent
● One of the advantages of pyspark over Scala is its
seamless ability to convert between Spark & pandas
dataframes
● Converting to pandas lets us use the myriad of Python
ecosystem tools for visualization
● But what about the memory limitation of pandas?
EDA with Pyspark
● If we directly use a pandas dataframe for EDA, we will be
limited by data size
● So the trick is to calculate all the values using the Spark
APIs and then convert only the result to pandas
● Then use visualization libraries like pyplot, seaborn etc to
visualize the results in Jupyter
● This combo of pyspark and Python libraries enables us
to do interactive and high quality EDA on Spark
Pyspark Boxplot
● In our example, we will first calculate the five point summary
using pyspark code
● Then convert the result to a pandas dataframe to extract
the values
● Render the box plot using the matplotlib.pyplot library
● One of the challenges is that we need to draw using
precomputed results rather than the actual data itself
● This needs an understanding of the lower level API
● Ex : EDA on Life Expectancy Data
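One way to draw from precomputed results is matplotlib's Axes.bxp, which accepts a list of dictionaries describing each box instead of the raw data. The sketch below only builds such a dictionary in plain Python; whether the talk's notebook uses bxp is not stated, the helper name and the numbers are hypothetical, and the whiskers are simply set to min and max.

```python
def box_stats_from_summary(minimum, q1, median, q3, maximum, label="value"):
    """Build one box description from a precomputed five point summary.

    The key names follow the dictionaries accepted by matplotlib's Axes.bxp.
    Whiskers span min..max here and no outlier points (fliers) are drawn.
    """
    return {
        "label": label,
        "whislo": minimum,
        "q1": q1,
        "med": median,
        "q3": q3,
        "whishi": maximum,
        "fliers": [],
    }


# Hypothetical summary values, e.g. as returned by a Spark job.
stats = [box_stats_from_summary(52.0, 61.0, 70.1, 75.0, 81.9,
                                label="life expectancy")]
print(stats)

# With matplotlib installed, the box could then be rendered with:
#   import matplotlib.pyplot as plt
#   fig, ax = plt.subplots()
#   ax.bxp(stats, showfliers=False)
#   plt.show()
```

Only five numbers per column travel from Spark to the notebook, so the size of the full dataset does not affect the plotting step.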
Outlier Detection
Outlier Detection using IQR
● One of the use cases of the five point summary is
to find outliers in the data
● The idea is that any values significantly outside the
IQR (interquartile range) are typically treated as outliers
● IQR = Q3 - Q1
● One common rule flags as outliers the values which fall
outside the range Q1 - 1.5*IQR to Q3 + 1.5*IQR
● Ex : OutliersWithIQR.scala
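The rule above can be sketched in a few lines of plain Python. This is not the content of OutliersWithIQR.scala; the function names are hypothetical, and Q1 and Q3 are assumed to have been computed already (for example with approxQuantile).

```python
def iqr_bounds(q1, q3, k=1.5):
    """Return (lower, upper) = (Q1 - k*IQR, Q3 + k*IQR) with IQR = Q3 - Q1."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, q1, q3, k=1.5):
    """Keep only the values that fall outside the IQR-based bounds."""
    lower, upper = iqr_bounds(q1, q3, k)
    return [v for v in values if v < lower or v > upper]


# Hypothetical quartiles and values, for illustration only.
print(iqr_bounds(61.0, 75.0))
print(find_outliers([35.0, 52.0, 70.1, 81.9, 99.0], 61.0, 75.0))
```

On a Spark dataframe the same two bounds would typically be used in a filter expression, so that only the outlying rows are collected.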
Histogram
● A histogram is an accurate representation of the
distribution of numerical data
● It is a kind of bar graph
● To construct a histogram, the first step is to "bin" the
range of values—that is, divide the entire range of
values into a series of intervals—and then count how
many values fall into each interval
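The binning step can be sketched in plain Python as follows. Equal-width buckets are assumed, the function name is hypothetical, and this is an illustration of the idea rather than Spark's implementation.

```python
def histogram_counts(values, bucket_count):
    """Split [min, max] into equal-width buckets and count values per bucket.

    Returns (boundaries, counts) where boundaries has bucket_count + 1 entries.
    """
    low, high = min(values), max(values)
    width = (high - low) / bucket_count
    boundaries = [low + i * width for i in range(bucket_count)] + [high]
    counts = [0] * bucket_count
    for v in values:
        # The last bucket is closed on the right so the maximum is counted.
        index = min(int((v - low) / width), bucket_count - 1) if width else 0
        counts[index] += 1
    return boundaries, counts


# Hypothetical values, for illustration only.
print(histogram_counts([52.0, 58.5, 61.0, 66.2, 70.1, 72.4, 75.0, 78.3, 81.9], 3))
```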
Histogram API
● Dataframe doesn’t have a direct histogram method, but
RDD does, on DoubleRDD
● The histogram API takes the number of buckets and returns two
things
○ Start values for each bucket
○ Number of elements in each bucket
● We can use the pyplot bar chart API to draw the histogram
using these results
● Ex : EDA on Life Expectancy Data
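The two results can be turned into bar chart inputs as sketched below. The sketch assumes the boundaries list has one more entry than the counts list, which is the shape PySpark's RDD.histogram returns; the helper name and the numbers are hypothetical.

```python
def bars_from_histogram(boundaries, counts):
    """Turn bucket boundaries and counts into (left_edges, widths, heights)."""
    if len(boundaries) != len(counts) + 1:
        raise ValueError("expected one more boundary than counts")
    left_edges = list(boundaries[:-1])
    widths = [right - left
              for left, right in zip(boundaries[:-1], boundaries[1:])]
    return left_edges, widths, list(counts)


# Made-up values in the shape of a histogram result.
boundaries, counts = [50.0, 60.0, 70.0, 80.0], [3, 7, 5]
left_edges, widths, heights = bars_from_histogram(boundaries, counts)
print(left_edges, widths, heights)

# With matplotlib installed, the bars could then be drawn with:
#   import matplotlib.pyplot as plt
#   plt.bar(left_edges, heights, width=widths, align="edge")
#   plt.show()
```

Using align="edge" places each bar at the start value of its bucket, so the bars line up with the bucket boundaries.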

History and Application of LLM Leveraging Big Data
Jongwook Woo
 

Recently uploaded (20)

Why_are_we_hypnotizing_ourselves-_ATeggin-1.pdf
Why_are_we_hypnotizing_ourselves-_ATeggin-1.pdfWhy_are_we_hypnotizing_ourselves-_ATeggin-1.pdf
Why_are_we_hypnotizing_ourselves-_ATeggin-1.pdf
 
Premium Girls Call Navi Mumbai 🎈🔥9920725232 🔥💋🎈 Provide Best And Top Girl Ser...
Premium Girls Call Navi Mumbai 🎈🔥9920725232 🔥💋🎈 Provide Best And Top Girl Ser...Premium Girls Call Navi Mumbai 🎈🔥9920725232 🔥💋🎈 Provide Best And Top Girl Ser...
Premium Girls Call Navi Mumbai 🎈🔥9920725232 🔥💋🎈 Provide Best And Top Girl Ser...
 
Aws MLOps Interview Questions with answers
Aws MLOps Interview Questions  with answersAws MLOps Interview Questions  with answers
Aws MLOps Interview Questions with answers
 
AWS re:Invent 2023 - Deep dive into Amazon Aurora and its innovations DAT408
AWS re:Invent 2023 - Deep dive into Amazon Aurora and its innovations DAT408AWS re:Invent 2023 - Deep dive into Amazon Aurora and its innovations DAT408
AWS re:Invent 2023 - Deep dive into Amazon Aurora and its innovations DAT408
 
Premium Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service A...
Premium Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service A...Premium Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service A...
Premium Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service A...
 
VVIP Girls Call Noida 9873940964 Provide Best And Top Girl Service And No1 in...
VVIP Girls Call Noida 9873940964 Provide Best And Top Girl Service And No1 in...VVIP Girls Call Noida 9873940964 Provide Best And Top Girl Service And No1 in...
VVIP Girls Call Noida 9873940964 Provide Best And Top Girl Service And No1 in...
 
Towards an Analysis-Ready, Cloud-Optimised service for FAIR fusion data
Towards an Analysis-Ready, Cloud-Optimised service for FAIR fusion dataTowards an Analysis-Ready, Cloud-Optimised service for FAIR fusion data
Towards an Analysis-Ready, Cloud-Optimised service for FAIR fusion data
 
Celonis Busniess Analyst Virtual Internship.pptx
Celonis Busniess Analyst Virtual Internship.pptxCelonis Busniess Analyst Virtual Internship.pptx
Celonis Busniess Analyst Virtual Internship.pptx
 
UNITEC Institute of Technology diploma
UNITEC Institute of Technology diplomaUNITEC Institute of Technology diploma
UNITEC Institute of Technology diploma
 
FINAL PROJECT WORK PORTFOLIO MANAGEMENT (2) hhh (1) (2) (5) (1) (1).pdf
FINAL PROJECT WORK PORTFOLIO MANAGEMENT (2)  hhh (1) (2) (5) (1) (1).pdfFINAL PROJECT WORK PORTFOLIO MANAGEMENT (2)  hhh (1) (2) (5) (1) (1).pdf
FINAL PROJECT WORK PORTFOLIO MANAGEMENT (2) hhh (1) (2) (5) (1) (1).pdf
 
CMO MRM_May 2024 WITH BREAKDOWN AND IMPROVEMENTDATA.pdf
CMO MRM_May 2024 WITH BREAKDOWN AND IMPROVEMENTDATA.pdfCMO MRM_May 2024 WITH BREAKDOWN AND IMPROVEMENTDATA.pdf
CMO MRM_May 2024 WITH BREAKDOWN AND IMPROVEMENTDATA.pdf
 
Training on CSPro and step by steps.pptx
Training on CSPro and step by steps.pptxTraining on CSPro and step by steps.pptx
Training on CSPro and step by steps.pptx
 
Celebrity Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service...
Celebrity Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service...Celebrity Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service...
Celebrity Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service...
 
PTT of AI Bots, Avatar, business continuity software.
PTT of AI Bots, Avatar, business continuity software.PTT of AI Bots, Avatar, business continuity software.
PTT of AI Bots, Avatar, business continuity software.
 
Communication-Skills-An-Essential-Toolkit.pptx
Communication-Skills-An-Essential-Toolkit.pptxCommunication-Skills-An-Essential-Toolkit.pptx
Communication-Skills-An-Essential-Toolkit.pptx
 
Solution Manual for First Course in Abstract Algebra A, 8th Edition by John B...
Solution Manual for First Course in Abstract Algebra A, 8th Edition by John B...Solution Manual for First Course in Abstract Algebra A, 8th Edition by John B...
Solution Manual for First Course in Abstract Algebra A, 8th Edition by John B...
 
VIP Kolkata Girls Call Kolkata 0X0000000X Doorstep High-Profile Girl Service ...
VIP Kolkata Girls Call Kolkata 0X0000000X Doorstep High-Profile Girl Service ...VIP Kolkata Girls Call Kolkata 0X0000000X Doorstep High-Profile Girl Service ...
VIP Kolkata Girls Call Kolkata 0X0000000X Doorstep High-Profile Girl Service ...
 
VIP Kanpur Girls Call Kanpur 0X0000000X Doorstep High-Profile Girl Service Ca...
VIP Kanpur Girls Call Kanpur 0X0000000X Doorstep High-Profile Girl Service Ca...VIP Kanpur Girls Call Kanpur 0X0000000X Doorstep High-Profile Girl Service Ca...
VIP Kanpur Girls Call Kanpur 0X0000000X Doorstep High-Profile Girl Service Ca...
 
Celebrity Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servi...
Celebrity Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servi...Celebrity Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servi...
Celebrity Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servi...
 
History and Application of LLM Leveraging Big Data
History and Application of LLM Leveraging Big DataHistory and Application of LLM Leveraging Big Data
History and Application of LLM Leveraging Big Data
 

Exploratory Data Analysis in Spark

  • 1. Exploratory Data Analysis in Spark with Jupyter https://github.com/phatak-dev/Statistical-Data-Exploration-Using-Spark-2.0
  • 2. ● Madhukara Phatak ● Team Lead at Tellius ● Work in Hadoop, Spark, ML and Scala ● www.madhukaraphatak.com
  • 3. Agenda ● Introduction to EDA ● EDA on Big Data ● EDA with Notebooks ● Five Point Summary ● Pyspark and EDA ● Histograms ● Outlier detection ● Correlation
  • 5. What’s EDA ● Exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods ● Uses statistical methods to analyse different aspects of the data ● Puts a lot of importance on visualisation ● Some of the EDA techniques are ○ Histograms ○ Correlations etc
  • 6. Why EDA? ● EDA helps data scientists understand the distribution of the data before it is fed to downstream algorithms ● EDA also helps to understand the correlation between different variables collected as part of the data collection ● Visualising the data also helps us see patterns in the data, which can inform the later parts of the analysis ● The interactivity of EDA helps in exploring different assumptions
  • 8. EDA in Hadoop ERA ● Typical EDA is an interactive and highly experimental process ● The first-generation Hadoop systems were mostly built for batch processing and did not offer many tools for interactivity ● So data scientists typically took a sample of the data and ran EDA using traditional tools like R, Python etc
  • 9. Limitation of Sample EDA ● Running EDA on a sample requires a sampling technique that produces data representative of the distribution of the full data ● That is hard to achieve for multi-dimensional data, which is what most real-world data is ● Samples sometimes create issues for skewed distributions. Ex : Payment type in NYC taxi data ● So though a sample works for most cases, it is not the most accurate
  • 11. Interactive Analysis in Spark ● Spark is built for interactive data analysis from day one ● Below are some of the features that are good for interactive analysis ○ Interactive spark-shell ○ Local mode for low latency ○ Caching for speed-up ○ Dataframe abstraction to support structured data analysis ○ Support for Python
  • 12. EDA on Notebooks ● Spark shell is good for one-liners ● It’s not a great interface for writing long interactive queries ● It also doesn’t support good visualisation options, which are important for EDA ● So notebook systems are an alternative to the spark shell that keep the interactivity of the shell and add other advanced features ● So a notebook interface is good for EDA
  • 14. Introduction to Notebook System ● A notebook is an interactive web interface primarily used for exploratory programming ● Notebooks are spiritual successors to the interactive shells found in languages like Python, Scala etc ● Notebook systems typically support multiple language backends using kernels or interpreters ● An interpreter is the language runtime responsible for the actual interpretation of code ● Ex : IPython, Zeppelin, Jupyter
  • 15. Introduction to Jupyter ● Jupyter is one of the notebook systems, which evolved from the IPython shell and notebook system ● Primarily built for Python-based analysis ● Now supports multiple languages like Python, R, Scala etc ● Also has good support for big data frameworks like Spark, Flink ● http://jupyter.org/
  • 17. Five Point Summary ● The five number summary is one of the basic data exploration techniques, where we find how the values of dataset columns are distributed ● It calculates the below values for a column ○ Min - Minimum value of the column ○ First Quartile - The value below which 25% of the data falls ○ Median - Middle value ○ Third Quartile - The value below which 75% of the data falls ○ Max - Maximum value
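As a minimal sketch of the five values listed above, the following pure-Python function computes them for a small in-memory list. The function name and the sample values are illustrative assumptions and do not come from the deck. Quartile conventions differ between tools; this sketch uses the "inclusive" method of Python's `statistics.quantiles`, so the results may differ slightly from what Spark or pandas report.

```python
import statistics


def five_number_summary(values):
    """Return (min, q1, median, q3, max) for a list with at least two numbers."""
    ordered = sorted(values)
    # quantiles(n=4) returns the three cut points: Q1, Q2 (the median) and Q3.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]


# Made-up sample values, used only to exercise the function.
ages = [52, 61, 68, 70, 71, 73, 75, 79, 82]
summary = five_number_summary(ages)
print(summary)
```

On a large dataset the same five numbers would be computed by the cluster rather than in local memory; the local version is only meant to make the definitions concrete.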
  • 18. Five Point Summary in Spark ● In Spark, we can use the describe method on a dataframe to get this summary for a given column ● In our example, we'll be using life expectancy data and generating the five point summary ● Ex : SummaryExample.scala ● From the results we can observe that ○ The quartiles and the median are missing ○ Spark gives stddev, which is not in the original definition
  • 19. Approximate Quantiles ● Quantiles are costly to calculate on large data as they require sorting and can result in skewed computation ● So by default Spark skips them in the describe function ● Spark 2.1 has introduced a new method, approxQuantile, on the stat functions of dataframe ● This allows us to calculate these quantiles in reasonable time, with a threshold for accuracy ● Ex : SummaryExample.scala
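The accuracy threshold mentioned above can be made concrete with a small sketch. In PySpark the call takes the form `df.approxQuantile(col, probabilities, relativeError)`. As far as the Spark documentation describes it, for N rows, a requested probability p and a relative error err, the returned value x satisfies floor((p - err) * N) <= rank(x) <= ceil((p + err) * N). The helper below (a hypothetical name, not part of Spark) only computes that rank window in plain Python; clamping the window to the range 0..N is an added assumption.

```python
import math


def acceptable_rank_range(n_rows, probability, relative_error):
    """Rank window in which an approximate quantile may fall.

    Follows floor((p - err) * N) <= rank <= ceil((p + err) * N),
    clamped to [0, N] (the clamping is an assumption of this sketch).
    """
    lowest = math.floor((probability - relative_error) * n_rows)
    highest = math.ceil((probability + relative_error) * n_rows)
    return max(lowest, 0), min(highest, n_rows)


# Median of 1000 rows with a deliberately large error of 12.5%.
rank_lo, rank_hi = acceptable_rank_range(1000, 0.5, 0.125)
print(rank_lo, rank_hi)

# A relative error of 0 asks for the exact quantile, which is the costly case.
exact_lo, exact_hi = acceptable_rank_range(1000, 0.5, 0.0)
```

A smaller error narrows the window and costs more to compute, which is the trade-off the slide refers to.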
  • 20. Visualizing Five Point Summary ● In earlier examples, we have calculated the five point summary ● By just looking at the numbers, it’s difficult to understand how the data is distributed ● It’s always good to visualize the numbers to understand the distribution ● A box plot is a good way to visualize these numbers ● But how to visualize in Scala?
  • 21. Scala and Visualisation Libraries ● Scala is often the language of choice to develop Spark applications ● Scala gives rich language primitives to build robust, scalable systems ● But when it comes to EDA, ecosystem support for visualization and other tools is not great in Scala ● Even though there are efforts like plot.ly or Vegas, they are not as mature as pyplot or similar ones ● So Scala may not be a great language of choice for EDA
  • 23. Pyspark ● Pyspark is a Python interface for Spark APIs ● With the Dataframe and Dataset API, performance is on par with the Scala equivalent ● One of the advantages of pyspark over Scala is its seamless ability to convert between Spark and pandas dataframes ● Converting to pandas lets us use the myriad Python ecosystem tools for visualization ● But what about the memory limitations of pandas?
  • 24. EDA with Pyspark ● If we directly use a pandas dataframe for EDA, we will be limited by data size ● So the trick is to calculate all the values using Spark APIs and then convert only the result to pandas ● Then use visualization libraries like pyplot, seaborn etc to visualize the results in Jupyter ● This combination of pyspark and Python libraries enables us to do interactive and high-quality EDA on Spark
  • 25. Pyspark Boxplot ● In our example, we will first calculate the five point summary using pyspark code ● Then convert the result to a pandas dataframe to extract the values ● Render the box plot using the matplotlib.pyplot library ● One of the challenges is that we need to draw using precomputed results rather than the actual data itself ● This needs an understanding of the lower-level API ● Ex : EDA on Life Expectancy Data
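One possible way to draw from precomputed results is matplotlib's `Axes.bxp`, which accepts a list of dictionaries of box statistics instead of raw data. Whether the deck's notebook uses this call is not stated, so the sketch below is an assumption. It only packs a five-number summary into that dictionary shape; the helper name and the numbers are hypothetical, and placing the whiskers at the minimum and maximum is a simplification.

```python
def boxplot_stats(label, minimum, q1, median, q3, maximum):
    """Pack a precomputed five-number summary into the dict shape used by Axes.bxp."""
    return {
        "label": label,
        "whislo": minimum,  # lower whisker end (here: the minimum, an assumption)
        "q1": q1,           # bottom edge of the box
        "med": median,      # line inside the box
        "q3": q3,           # top edge of the box
        "whishi": maximum,  # upper whisker end (here: the maximum, an assumption)
        "fliers": [],       # no individual outlier points are available
    }


# Made-up summary values standing in for the result computed by Spark.
stats = boxplot_stats("life expectancy", 52, 68.0, 71.0, 75.0, 82)

# With matplotlib installed, the box could then be drawn roughly like this:
#   import matplotlib.pyplot as plt
#   fig, ax = plt.subplots()
#   ax.bxp([stats], showfliers=False)
#   plt.show()
```

Only the handful of summary numbers leaves the cluster, so the size of the plotted data stays independent of the size of the dataset.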
  • 27. Outlier Detection using IQR ● One of the use cases of the five point summary is finding outliers in data ● The idea is that any values significantly outside the IQR (interquartile range) are typically treated as outliers ● IQR = Q3 - Q1 ● One common rule treats values outside the range Q1 - 1.5*IQR to Q3 + 1.5*IQR as outliers ● Ex : OutliersWithIQR.scala
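The 1.5*IQR rule above can be sketched in a few lines of plain Python. The function names and sample values are illustrative assumptions and are not taken from OutliersWithIQR.scala. The quartiles are passed in as already computed, for example by an approximate quantile call.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) bounds Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, q1, q3, k=1.5):
    """Return the values that fall strictly outside the IQR fences."""
    lower, upper = iqr_fences(q1, q3, k)
    return [v for v in values if v < lower or v > upper]


# Made-up quartiles and observations.
lower, upper = iqr_fences(68.0, 75.0)
outliers = find_outliers([40, 52, 68, 71, 75, 82, 90], 68.0, 75.0)
print(lower, upper, outliers)
```

On a Spark dataframe the same two bounds could be used in a filter expression, so that only the outlying rows are brought back to the driver.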
  • 29. Histogram ● A histogram is an accurate representation of the distribution of numerical data ● It is a kind of bar graph ● To construct a histogram, the first step is to "bin" the range of values—that is, divide the entire range of values into a series of intervals—and then count how many values fall into each interval
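The binning step described above can be sketched in plain Python as follows. The function name is hypothetical. The sketch uses equal-width buckets and closes the last bucket on the right, so that the maximum value is counted; that convention is an assumption chosen for this example.

```python
def histogram(values, num_buckets):
    """Split [min, max] into equal-width buckets and count the values in each.

    Returns (boundaries, counts), where boundaries has num_buckets + 1 entries.
    """
    lo, hi = min(values), max(values)
    if lo == hi:
        # All values are identical: a single degenerate bucket.
        return [lo, hi], [len(values)]
    width = (hi - lo) / num_buckets
    boundaries = [lo + i * width for i in range(num_buckets)] + [hi]
    counts = [0] * num_buckets
    for v in values:
        # The maximum would land in bucket num_buckets, so clamp it into the last one.
        index = min(int((v - lo) / width), num_buckets - 1)
        counts[index] += 1
    return boundaries, counts


boundaries, counts = histogram([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 3)
print(boundaries, counts)
```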
  • 30. Histogram API ● Dataframe doesn’t have a direct histogram method, but RDD does, on DoubleRDD ● The histogram API takes the number of buckets and returns two things ○ Start values for each bucket ○ Number of elements in each bucket ● We can use the pyplot bar chart API to draw a histogram from these results ● Ex : EDA on Life Expectancy Data
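A small sketch of turning such a result into bar chart inputs is shown below. It assumes the result is a pair of lists in which the boundary list has one more entry than the count list, which is how the RDD histogram call appears to return it. The helper name and the numbers are made up for illustration.

```python
def bar_inputs(boundaries, counts):
    """Convert (boundaries, counts) into left edges, widths and heights for a bar chart."""
    if len(boundaries) != len(counts) + 1:
        raise ValueError("expected one more boundary than counts")
    lefts = boundaries[:-1]
    # Width of each bar is the distance between consecutive boundaries.
    widths = [hi - lo for lo, hi in zip(boundaries, boundaries[1:])]
    return lefts, widths, list(counts)


# Made-up histogram result with three buckets.
lefts, widths, heights = bar_inputs([40.0, 55.0, 70.0, 85.0], [12, 57, 31])

# With matplotlib installed, the bars could be drawn roughly like this:
#   import matplotlib.pyplot as plt
#   plt.bar(lefts, heights, width=widths, align="edge")
#   plt.show()
```

Using `align="edge"` makes each bar start at its bucket's left boundary, so the bars line up with the bucket ranges instead of being centred on the start values.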