Exploratory Data Analysis in Spark


Introduction to EDA in Spark using Jupyter

Published in: Data & Analytics

  1. 1. Exploratory Data Analysis in Spark with Jupyter
  2. 2. ● Madhukara Phatak ● Team Lead at Tellius ● Works with Hadoop, Spark, ML and Scala
  3. 3. Agenda ● Introduction to EDA ● EDA on Big Data ● EDA with Notebooks ● Five Point Summary ● Pyspark and EDA ● Histograms ● Outlier detection ● Correlation
  4. 4. Introduction to EDA
  5. 5. What’s EDA ● Exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods ● Uses statistical methods to analyse different aspects of the data ● Puts a lot of importance on visualisation ● Some EDA techniques are ○ Histograms ○ Correlations etc
  6. 6. Why EDA? ● EDA helps data scientists understand the distribution of the data before it is fed to downstream algorithms ● EDA also helps in understanding the correlation between the different variables gathered during data collection ● Visualising the data helps us see patterns that can inform later parts of the analysis ● The interactivity of EDA helps in exploring different assumptions
  7. 7. EDA in the Hadoop Era
  8. 8. EDA in the Hadoop Era ● Typical EDA is an interactive and highly experimental process ● The first generation Hadoop systems were mostly built for batch processing and don't offer many tools for interactivity ● So data scientists typically took a sample of the data and ran EDA using traditional tools like R, Python etc
  9. 9. Limitations of Sample EDA ● Running EDA on a sample requires a sampling technique that produces data representing the distribution of the full data ● That is hard to achieve for multi-dimensional data, which is most real-world data ● Sampling sometimes creates issues for skewed distributions, e.g. payment type in the NYC taxi data ● So though sampling works for most cases, it’s not the most accurate approach
  10. 10. EDA in the Spark Era
  11. 11. Interactive Analysis in Spark ● Spark has been built for interactive data analysis from day one ● Below are some of the features that are good for interactive analysis ○ Interactive spark-shell ○ Local mode for low latency ○ Caching for speed-up ○ Dataframe abstraction to support structured data analysis ○ Support for Python
  12. 12. EDA on Notebooks ● Spark shell is good for one-liners ● It’s not a great interface for writing long interactive queries ● It also doesn’t support good visualisation options, which are important for EDA ● Notebook systems are an alternative to the spark shell that keep the interactivity of the shell and add other advanced features ● So a notebook interface is good for EDA
  13. 13. Jupyter Notebook
  14. 14. Introduction to Notebook Systems ● A notebook is an interactive web interface primarily used for exploratory programming ● Notebooks are spiritual successors to the interactive shells found in languages like Python, Scala etc ● Notebook systems typically support multiple language backends using kernels or interpreters ● An interpreter is a language runtime which is responsible for the actual interpretation of code ● Ex : IPython, Zeppelin, Jupyter
  15. 15. Introduction to Jupyter ● Jupyter is one of the notebook systems, which evolved from the IPython shell and notebook system ● Primarily built for Python-based analysis ● Now supports multiple languages like Python, R, Scala etc ● Also has good support for big data frameworks like Spark and Flink
  16. 16. Five Point Summary
  17. 17. Five Point Summary ● The five number summary is one of the basic data exploration techniques, where we find how the values of a dataset column are distributed ● It calculates the values below for a column ○ Min - Minimum value of the column ○ First Quartile - The value below which 25% of the data falls ○ Median - Middle value ○ Third Quartile - The value below which 75% of the data falls ○ Max - Maximum value
  18. 18. Five Point Summary in Spark ● In Spark, we can use the describe method on a dataframe to get this summary for a given column ● In our example, we'll be using life expectancy data and generating the five point summary ● Ex : SummaryExample.scala ● From the results we can observe that ○ The quartiles and the median are missing ○ Spark gives stddev, which is not part of the original definition
  19. 19. Approximate Quantiles ● Quantiles are costly to calculate on large data as they require sorting and can result in skewed calculation ● So by default Spark skips them in the describe function ● Spark 2.1 introduced a new method, approxQuantile, on the stat functions of dataframe ● This allows us to calculate these quantiles in reasonable time, with a threshold for accuracy ● Ex : SummaryExample.scala
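To make the five values concrete, here is a minimal plain-Python sketch of the five number summary on a small in-memory list. It is not the deck's SummaryExample.scala; the function name `five_number_summary` and the sample values are hypothetical, and it computes exact quartiles with the standard library rather than Spark's approximate ones. On a Spark dataframe, the quartiles would instead come from a call along the lines of `df.approxQuantile(column, [0.25, 0.5, 0.75], relativeError)`.

```python
import statistics


def five_number_summary(values):
    """Return (min, q1, median, q3, max) for a list of at least two numbers."""
    ordered = sorted(values)
    # n=4 asks for the three cut points that split the data into quarters.
    # method="inclusive" treats the list as the whole population, so the
    # cut points always lie between the observed minimum and maximum.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]


# Hypothetical life expectancy values, used only for illustration.
life_expectancy = [52.0, 61.5, 64.0, 68.5, 71.0, 73.5, 76.0, 79.5, 82.0]
summary = five_number_summary(life_expectancy)
print(summary)
```

Exact quartiles need the full sorted data, which is why this approach only suits data that fits in memory; that is the cost the approximate method is meant to avoid.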
  20. 20. Visualizing Five Point Summary ● In earlier examples, we calculated the five point summary ● By just looking at the numbers, it’s difficult to understand how the data is distributed ● It’s always good to visualize the numbers to understand the distribution ● A box plot is a good way to visualize these numbers ● But how to visualize in Scala?
  21. 21. Scala and Visualisation Libraries ● Scala is often the language of choice for developing Spark applications ● Scala gives rich language primitives to build robust scalable systems ● But when it comes to EDA, ecosystem support for visualization and other tools is not great in Scala ● Even though there are efforts like or Vegas, they are not as mature as pyplot or similar libraries ● So Scala may not be a great language of choice for EDA
  22. 22. EDA and PySpark
  23. 23. Pyspark ● Pyspark is a Python interface for Spark APIs ● With the Dataframe and Dataset API, performance is on par with the Scala equivalent ● One of the advantages of pyspark over Scala is its seamless ability to convert between Spark and pandas dataframes ● Converting to pandas lets us use the myriad of Python ecosystem tools for visualization ● But what about the memory limitations of pandas?
  24. 24. EDA with Pyspark ● If we directly use a pandas dataframe for EDA we will be limited by data size ● So the trick is to calculate all the values using Spark APIs and then convert only the result to pandas ● Then use visualization libraries like pyplot, seaborn etc to visualize the results in Jupyter ● This combination of pyspark and Python libraries enables us to do interactive and high quality EDA on Spark
  25. 25. Pyspark Boxplot ● In our example, we will first calculate the five point summary using pyspark code ● Then convert the result to a pandas dataframe to extract the values ● Render the box plot using the matplotlib.pyplot library ● One of the challenges is that we need to draw using precomputed results rather than the actual data itself ● This needs an understanding of the lower level API ● Ex : EDA on Life Expectancy Data
  26. 26. Outlier Detection
  27. 27. Outlier Detection using IQR ● One of the use cases for the five point summary is finding outliers in data ● The idea is that any value significantly outside the interquartile range (IQR) is typically flagged as an outlier ● IQR = Q3 - Q1 ● One common rule treats values outside the range Q1 - 1.5*IQR to Q3 + 1.5*IQR as outliers ● Ex : OutliersWithIQR.scala
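The IQR rule on this slide can be sketched in a few lines of plain Python. This is an illustration on a small in-memory list, not the deck's OutliersWithIQR.scala; the function name `iqr_outliers`, the multiplier parameter `k`, and the sample readings are all hypothetical. With Spark, Q1 and Q3 would be computed on the cluster and only the two fence values would be needed to filter the dataframe.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return (lower_fence, upper_fence, outliers) using the k * IQR rule."""
    # Exact quartiles from the standard library; the median is not needed.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    # Anything strictly outside the fences is flagged as an outlier.
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return lower_fence, upper_fence, outliers


# Hypothetical readings with one obviously extreme value.
readings = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]
lower, upper, outliers = iqr_outliers(readings)
print(lower, upper, outliers)
```

The multiplier 1.5 is the conventional choice from the slide; a larger `k` flags only more extreme values.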
  28. 28. Histogram
  29. 29. Histogram ● A histogram is an accurate representation of the distribution of numerical data ● It is a kind of bar graph ● To construct a histogram, the first step is to "bin" the range of values—that is, divide the entire range of values into a series of intervals—and then count how many values fall into each interval
  30. 30. Histogram API ● Dataframe doesn’t have a direct histogram method, but RDD does have one on DoubleRDD ● The histogram API takes the number of buckets and returns two things ○ Start values for each bucket ○ Number of elements in each bucket ● We can use the pyplot bar chart API to draw a histogram using these results ● Ex : EDA on Life Expectancy Data
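As a rough illustration of what such a histogram call produces, the sketch below bins an in-memory list into evenly spaced buckets and returns the bucket boundaries together with the per-bucket counts. It is plain Python written for this note, not Spark's implementation; the function name `histogram`, its parameters, and the sample values are hypothetical, and it assumes a non-empty list and a positive bucket count.

```python
def histogram(values, bucket_count):
    """Return (boundaries, counts) for evenly spaced buckets over the data range.

    boundaries has bucket_count + 1 entries; counts has bucket_count entries.
    """
    lo, hi = min(values), max(values)
    if lo == hi:
        # All values are identical, so a single bucket holds everything.
        return [lo, hi], [len(values)]
    width = (hi - lo) / bucket_count
    boundaries = [lo + i * width for i in range(bucket_count)] + [hi]
    counts = [0] * bucket_count
    for v in values:
        # The last bucket is closed on the right so the maximum is counted.
        index = min(int((v - lo) / width), bucket_count - 1)
        counts[index] += 1
    return boundaries, counts


# Hypothetical values, binned into three buckets.
values = [1, 2, 2, 3, 5, 6, 8, 9, 9, 10]
boundaries, counts = histogram(values, 3)
print(boundaries, counts)
```

The first `bucket_count` boundaries are the bucket start values, so they can serve as the x positions and `counts` as the heights in a bar chart call such as matplotlib's `pyplot.bar`.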