Exploratory Data Analysis
in Spark with Jupyter
https://github.com/phatak-dev/Statistical-Data-Exploration-Using-Spark-2.0
● Madhukara Phatak
● Team Lead at Tellius
● Work in Hadoop, Spark, ML
and Scala
● www.madhukaraphatak.com
Agenda
● Introduction to EDA
● EDA on Big Data
● EDA with Notebooks
● Five Point Summary
● Pyspark and EDA
● Histograms
● Outlier detection
● Correlation
Introduction to EDA
What’s EDA
● Exploratory data analysis (EDA) is an approach to
analyzing data sets to summarize their main
characteristics, often with visual methods
● Uses statistical methods to analyse different aspects of
the data
● Puts a lot of importance on visualisation
● Some of the EDA techniques are
○ Histograms
○ Correlations etc.
Why EDA?
● EDA helps data scientists understand the distribution
of the data before it is fed to downstream
algorithms
● EDA also helps to understand the correlation between
different variables collected as part of the data collection
● Visualising the data also helps us to see the different
patterns in the data, which can inform later parts of
the analysis
● The interactivity of EDA helps in exploring various
assumptions
EDA in Hadoop Era
EDA in Hadoop Era
● EDA is typically an interactive and highly
experimental process
● The first generation Hadoop systems were mostly built
for batch processing and didn't offer many tools for
interactivity
● So data scientists typically took a sample of the
data and ran EDA using traditional tools like R / Python
etc.
Limitation of Sample EDA
● Running EDA on a sample requires sampling
techniques that produce a sample representing the
distribution of the full data
● That's hard to achieve for multi-dimensional data, which
most real-world data is
● Samples sometimes create issues for skewed distributions
Ex : Payment type in NYC taxi data
● So though sampling works in most cases, it's not
the most accurate approach
EDA in Spark Era
Interactive Analysis in Spark
● Spark is built for interactive data analysis from day one
● Below are some of the features that make it good for
interactive analysis
○ Interactive spark-shell
○ Local mode for low latency
○ Caching for Speed up
○ Dataframe abstraction to support structured data
analysis
○ Support for Python
EDA on Notebooks
● The Spark shell is good for one-liners
● It's not a great interface for writing long interactive
queries
● It also doesn't support good visualisation options,
which are important for EDA
● So notebook systems are an alternative to the spark shell
which keep the interactivity of the shell while adding
other advanced features
● So a notebook interface is good for EDA
Jupyter Notebook
Introduction to Notebook System
● A notebook is an interactive web interface primarily used
for exploratory programming
● They are spiritual successors to the interactive shells found
in languages like Python, Scala etc.
● Notebook systems typically support multiple language
backends using kernels or interpreters
● An interpreter is the language runtime responsible for
the actual interpretation of code
● Ex : IPython, Zeppelin, Jupyter
Introduction to Jupyter
● Jupyter is one of the notebook systems; it evolved
from the IPython shell and notebook system
● Primarily built for Python-based analysis
● Now supports multiple languages like Python, R, Scala
etc.
● Also has good support for big data frameworks like
Spark and Flink
● http://jupyter.org/
Five Point Summary
Five Point Summary
● The five number summary is one of the basic data
exploration techniques, where we find how the values of
a dataset column are distributed
● It calculates the below values for a column
○ Min - Minimum value of the column
○ First Quartile - The 25th percentile
○ Median - Middle value
○ Third Quartile - The 75th percentile
○ Max - Maximum value
Five Point Summary in Spark
● In Spark, we can use the describe method on a dataframe to
get this summary for a given column (sketched below)
● In our example, we'll be using life expectancy data and
generating the five point summary
● Ex : SummaryExample.scala
● From the results we can observe that
○ The quartiles and median are missing
○ Spark gives stddev, which is not part of the original
definition
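A minimal pyspark sketch of the same idea (the talk's example, SummaryExample.scala, is in Scala); the file path and the column name "life_expectancy" are assumptions about the life expectancy dataset:

from pyspark.sql import SparkSession

# Local session for interactive exploration
spark = SparkSession.builder.master("local[*]").appName("eda").getOrCreate()
df = spark.read.csv("life_expectancy.csv", header=True, inferSchema=True)

# describe computes count, mean, stddev, min and max for the given column
df.describe("life_expectancy").show()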
Approximate Quantiles
● Quantiles are costly to calculate on large data as they
require sorting, which is expensive and can suffer from skew
● So by default Spark skips them in the describe function
● Spark 2.1 has introduced a new method, approxQuantile,
on the stat functions of a dataframe (sketched below)
● This allows us to calculate these quantiles in
reasonable time, with a threshold for accuracy
● Ex : SummaryExample.scala
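A sketch of approxQuantile on the dataframe's stat functions, reusing the df from the earlier sketch; the column name is an assumption, and the last argument is the relative error (0.0 gives exact results, larger values trade accuracy for speed):

# Returns Q1, median and Q3 with up to 1% relative error
q1, median, q3 = df.stat.approxQuantile("life_expectancy", [0.25, 0.5, 0.75], 0.01)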
Visualizing Five Point Summary
● In earlier examples, we have calculated the five point
summary
● By just looking at the numbers, it’s difficult to
understand how the data is distributed
● It's always good to visualize the numbers to
understand the distribution
● A box plot is a good way to visualize these numbers
● But how do we visualize it in Scala?
Scala and Visualisation Libraries
● Scala is often the language of choice for developing
Spark applications
● Scala gives rich language primitives to build robust,
scalable systems
● But when it comes to EDA, the ecosystem support for
visualization and other tools is not great in Scala
● Even though there are efforts like plot.ly or Vegas, they
are not as mature as pyplot or similar libraries
● So Scala may not be a great language of choice for EDA
EDA and PySpark
Pyspark
● Pyspark is a Python interface for the Spark APIs
● With the Dataframe and Dataset APIs, performance is on par
with the Scala equivalent
● One of the advantages of pyspark over Scala is its
seamless ability to convert between Spark and pandas
dataframes
● Converting to pandas lets us use the myriad of Python
ecosystem tools for visualization
● But what about the memory limitations of pandas?
EDA with Pyspark
● If we directly use a pandas dataframe for EDA we will be
limited by the data size
● So the trick is to calculate all the values using Spark
APIs and then convert only the results to pandas
● Then use visualization libraries like pyplot, seaborn etc.
to visualize the results on Jupyter, as shown below
● This combo of pyspark and Python libraries enables us
to do interactive and high quality EDA on Spark
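A sketch of this pattern, reusing the df from the earlier sketch: aggregate with Spark, bring only the small result to pandas, then plot; the column names "country" and "life_expectancy" are assumptions about the dataset:

import matplotlib.pyplot as plt

# Heavy lifting stays in Spark; only the aggregated rows reach pandas
avg_by_country = (df.groupBy("country")
                    .avg("life_expectancy")
                    .toPandas())

# Spark names the aggregated column "avg(life_expectancy)"
avg_by_country.plot(x="country", y="avg(life_expectancy)", kind="bar")
plt.show()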
Pyspark Boxplot
● In our example, we will first calculate the five point summary
using pyspark code
● Then convert the result to a pandas dataframe to extract
the values
● Render the box plot with the matplotlib.pyplot library, as
sketched below
● One of the challenges is that we need to draw using
precomputed results rather than the actual data itself
● This needs an understanding of the lower level API
● Ex : EDA on Life Expectancy Data
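A minimal sketch of drawing a box plot from precomputed results using matplotlib's lower level bxp API, reusing the df from the earlier sketches; the column name is an assumption:

import matplotlib.pyplot as plt

quantiles = df.approxQuantile("life_expectancy", [0.25, 0.5, 0.75], 0.01)
row = df.selectExpr("min(life_expectancy) as mn",
                    "max(life_expectancy) as mx").first()

stats = [{
    "label": "life_expectancy",
    "whislo": row["mn"],     # lower whisker (here simply the minimum)
    "q1": quantiles[0],      # first quartile
    "med": quantiles[1],     # median
    "q3": quantiles[2],      # third quartile
    "whishi": row["mx"],     # upper whisker (here simply the maximum)
    "fliers": []             # no individual outlier points
}]

fig, ax = plt.subplots()
ax.bxp(stats, showfliers=False)   # bxp draws boxes from precomputed stats
plt.show()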
Outlier Detection
Outlier Detection using IQR
● One of the use cases for calculating the five point summary
is to find outliers in the data
● The idea is that any values which fall significantly outside the
IQR, the interquartile range, are typically flagged as outliers
● IQR = Q3 - Q1
● One common formula flags as outliers the values which fall
outside the range Q1 - 1.5*IQR to Q3 + 1.5*IQR (sketched below)
Ex : OutliersWithIQR.scala
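A pyspark sketch of the same rule (the talk's example, OutliersWithIQR.scala, is in Scala), reusing the df from the earlier sketches; the column name is an assumption:

q1, q3 = df.approxQuantile("life_expectancy", [0.25, 0.75], 0.01)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only the rows that fall outside the [lower, upper] fence
outliers = df.filter((df["life_expectancy"] < lower) |
                     (df["life_expectancy"] > upper))
outliers.show()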
Histogram
Histogram
● A histogram is an accurate representation of the
distribution of numerical data
● It is a kind of bar graph
● To construct a histogram, the first step is to "bin" the
range of values—that is, divide the entire range of
values into a series of intervals—and then count how
many values fall into each interval
Histogram API
● Dataframe doesn't have a direct histogram method, but
the RDD API does, on DoubleRDD
● The histogram API takes the number of buckets and returns two
things
○ The start values of each bucket (the bucket boundaries)
○ The number of elements in each bucket
● We can use the pyplot bar chart API to draw a histogram
from these results, as sketched below
● Ex : EDA on Life Expectancy Data
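A minimal sketch, reusing the df from the earlier sketches: compute the buckets with the RDD histogram API and draw them with pyplot's bar chart; the column name is an assumption:

import matplotlib.pyplot as plt

# Convert the column to an RDD of doubles; histogram() is defined on it
values = df.select("life_expectancy").rdd.map(lambda row: float(row[0]))
buckets, counts = values.histogram(10)   # 10 evenly spaced buckets

# buckets has 11 boundaries for 10 buckets; use the left edges for the bars
widths = [buckets[i + 1] - buckets[i] for i in range(len(counts))]
plt.bar(buckets[:-1], counts, width=widths, align="edge")
plt.xlabel("life_expectancy")
plt.ylabel("count")
plt.show()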
