Amit Kapoor
@amitkaps
Visualising
Big Data
Visualise a Million
Data Points
x <- rnorm(1000000, mean=0, sd=2)
y <- rnorm(1000000, mean=0, sd=2)
xy <- data.frame(x,y)
Same order as the
Number of Pixels
on my MacBook Air
1440 x 900
Data
Data Sample
Sampling can be
effective (with
overweighting
unusual values)
Requires multiple
plots or careful
tuning of parameters
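A minimal sketch of sampling with overweighting of unusual values, in base R. It assumes that "unusual" means far from the centre of the point cloud; the weight formula, the sample size and the smaller n are illustrative choices, not taken from the talk.

```r
set.seed(42)
n  <- 1e5                      # smaller than the deck's 1e6 to keep it quick
x  <- rnorm(n, mean = 0, sd = 2)
y  <- rnorm(n, mean = 0, sd = 2)
xy <- data.frame(x, y)

# Distance from the centre stands in for "unusualness" (assumption)
d <- sqrt(xy$x^2 + xy$y^2)

# Every point gets a baseline weight; far-away points get more
w <- 1 + d^2

idx_plain    <- sample(n, size = 2000)            # uniform sample
idx_weighted <- sample(n, size = 2000, prob = w)  # overweights unusual points

# Share of sampled points lying beyond 3 sd (radius 6) in each sample
far_plain    <- mean(d[idx_plain] > 6)
far_weighted <- mean(d[idx_weighted] > 6)
```

Plotting `xy[idx_weighted, ]` keeps the tails visible, but the picture no longer reflects the true density, which is one reason several plots or careful tuning are needed.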
Data Sample
Model
Models are great as
they scale nicely.
But visualisation is
still required, because
“I don’t know what I
don’t know.”
Data Sample
Model
Binning
Binning can solve a
lot of these
challenges
“Bin - Summarize -
Smooth: A framework
for visualising big data” -
Hadley Wickham (2013)
“Visualising big data
is the process of creating
generalized histograms”
Approach
BIN : fixed-width bins, bin index = floor((x - origin) / width)
SUMMARIZE : summary stats = count, mean, stdev
SMOOTH : smoothing e.g. kernel mean, regression
VISUALISE : visualise using standard plots
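The bin and summarise steps can be sketched in base R without any package. This is only an illustration of the idea: the variable `z`, the bin `width` and the data frame name `binned` are assumptions made for the example.

```r
set.seed(1)
x <- rnorm(1e5, mean = 0, sd = 2)
z <- 3 * x + rnorm(1e5)        # hypothetical second variable to summarise

# BIN: fixed-width bins, index = floor((x - origin) / width)
origin <- min(x)
width  <- 0.5
bin_id <- floor((x - origin) / width)

# SUMMARIZE: count, mean and standard deviation of z within each bin
ids    <- sort(unique(bin_id))
binned <- data.frame(
  x_mid = origin + (ids + 0.5) * width,          # bin centre
  count = as.vector(tapply(z, bin_id, length)),
  mean  = as.vector(tapply(z, bin_id, mean)),
  sd    = as.vector(tapply(z, bin_id, sd))       # NA for single-point bins
)
```

The condensed table has one row per occupied bin, so a standard call such as `plot(binned$x_mid, binned$count, type = "h")` draws it quickly regardless of how many raw points went in.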
Bigvis Package in R
Aim: To plot 100 million points in under 5 seconds.
Approach:
- Plotting using standard R libraries
- Processing done in (fast) compiled C++ code, using
Rcpp package
- Outlier removal in big data
- Smoothing to highlight trends & suppress noise
Diamonds dataset
ggplot(diamonds) + aes(carat, price) +
  geom_point(alpha = 0.2, colour = "orange")
50k observations e.g. price, carat of diamonds
Condense (bin + summarise)
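library(bigvis)
library(ggplot2)
Nbin <- 20  # target number of bins per variable
# bin both variables, then condense to one row per occupied 2-d bin
BinData <- with(diamonds, condense(
  bin(carat, find_width(carat, Nbin)),
  bin(price, find_width(price, Nbin))))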
Plotting the Condense
p <- ggplot(BinData) + aes(carat,
price, fill=.count) + geom_tile()
20 bins per variable, summarised by count
Both Points & Condensed
q <- p + geom_point(data = diamonds,
aes(fill = NULL), alpha = 0.2, colour
= "orange")
20 bins per variable, summarised by count, with the raw points overlaid
Movies dataset
ggplot(movies) + aes(length, rating) +
  geom_point(alpha = 0.2, colour = "orange")
130k observations e.g. length, rating of movies on IMDB
Let us see the outliers
  title                                            length  rating
1 Matrjoschka                                        5700     8.5
2 The Cure for Insomnia                              5220     5.9
3 The Longest Most Meaningless Movie in the World    2880     7.3
4 The Hazards of Helen                               1428     6.6
5 ****                                               1100     6.9
Condense (bin + summarise)
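library(bigvis)
library(ggplot2)
Nbin <- 1e4  # target number of bins per variable
# bin both variables, then condense to one row per occupied 2-d bin
BinData <- with(movies, condense(
  bin(length, find_width(length, Nbin)),
  bin(rating, find_width(rating, Nbin))))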
Condensed Plot
p <- ggplot(BinData) + aes(length,
rating, fill=.count) + geom_tile()
10,000 bins per variable, summarised by count
Remove Outliers
p %+% peel(BinData)  # %+% swaps the plot's data for the peeled version
10,000 bins per variable, summarised by count, with 1% outliers peeled off
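What peeling off the sparsest 1% amounts to can be sketched in base R. This illustrates the idea only (drop the lowest-count bins until about 99% of the observations remain); it is not bigvis's implementation, and the helper name `peel_sketch` and the toy counts are hypothetical.

```r
# Hypothetical helper: keep the densest bins that together hold `prop` of the data
peel_sketch <- function(binned, prop = 0.99) {
  ord <- order(binned$count, decreasing = TRUE)
  cum <- cumsum(binned$count[ord]) / sum(binned$count)
  # keep bins up to and including the first one that reaches `prop`
  n_keep <- which(cum >= prop)[1]
  binned[sort(ord[seq_len(n_keep)]), , drop = FALSE]
}

# Toy condensed data: a dense core plus two near-empty outlying bins
binned <- data.frame(
  length = c(60, 90, 120, 150, 1100, 5220),
  count  = c(300, 500, 150, 47, 2, 1)
)
peeled <- peel_sketch(binned, prop = 0.99)
```

Because the rule works on bin counts rather than raw rows, it costs the same whether the bins summarise a thousand observations or a hundred million.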
Smoothing
smoothBinData <- smooth(peel(BinData), h = c(20, 1))
autoplot(smoothBinData)
Create bins = 20, summarize count, peel 1% outlier & smooth
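Kernel-mean smoothing of bin summaries can be sketched in one dimension with base R. The helper `smooth_sketch`, the Gaussian kernel and the synthetic counts are assumptions for illustration; the bandwidth `h` plays the same role as the `h` argument above but is not claimed to match bigvis's internals.

```r
# Hypothetical helper: Gaussian kernel-weighted mean of bin summaries
smooth_sketch <- function(x_mid, value, h) {
  vapply(x_mid, function(x0) {
    w <- dnorm((x_mid - x0) / h)   # weights fall off with distance from x0
    sum(w * value) / sum(w)
  }, numeric(1))
}

set.seed(7)
x_mid  <- seq(0, 10, by = 0.25)                  # bin centres
signal <- 100 * exp(-(x_mid - 5)^2 / 4)          # underlying trend
noisy  <- signal + rnorm(length(x_mid), sd = 10) # noisy per-bin summary

smoothed <- smooth_sketch(x_mid, noisy, h = 0.5)
```

A larger `h` averages over more neighbouring bins, which suppresses more noise at the cost of blurring real features.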
Big Data Visualisation
● Approach: Bin - Summarize - Smooth - Visualise
● “Interactively” plot nearly 100 million data points in-memory for EDA in R
● Can be extended to in-database processing, e.g. for binning
● Can be parallelised, e.g. summarising by count or mean
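Why counts and means parallelise well can be shown with a small base R sketch: per-bin counts and sums are additive, so each chunk (or database partition) can be summarised separately and the results merged. The chunking scheme, the `origin` and `width` values and the helper names are assumptions for the example.

```r
set.seed(3)
x <- rnorm(1e5, mean = 0, sd = 2)
z <- x^2                        # hypothetical variable to average per bin

origin <- -12
width  <- 1
bin_of <- function(v) floor((v - origin) / width)

# Per-chunk summary: count and running total of z for every occupied bin
summarise_chunk <- function(xc, zc) {
  rs <- rowsum(cbind(count = rep(1, length(zc)), total = zc), group = bin_of(xc))
  data.frame(bin   = as.numeric(rownames(rs)),
             count = unname(rs[, "count"]),
             total = unname(rs[, "total"]))
}

# Four chunks processed independently; lapply could be swapped for a
# parallel map, or each chunk for a GROUP BY query in a database
chunks <- split(seq_along(x), rep(1:4, length.out = length(x)))
parts  <- lapply(chunks, function(i) summarise_chunk(x[i], z[i]))

# Merge: add up counts and totals per bin, then derive the mean
merged <- aggregate(cbind(count, total) ~ bin,
                    data = do.call(rbind, parts), FUN = sum)
merged$mean <- merged$total / merged$count

# Reference: the same summary computed in a single pass
direct <- summarise_chunk(x, z)
```

The merged result matches the single-pass summary, which is what makes the approach safe to distribute. Summaries such as the median do not combine this way and need a different strategy.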
Amit Kapoor
@amitkaps
amitkaps.com
narrativeviz.com
Data
Visual
Story
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxNLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 

Visualising Big Data

  • 2. Visualise Million Data Points x <- rnorm(1000000, mean=0, sd=2) y <- rnorm(1000000, mean=0, sd=2) xy <- data.frame(x,y) Same order as the Number of Pixels on my MacBook Air 1400 x 900 Data
  • 3. Data Sample Sampling can be effective (with overweighting unusual values) Require multiple plots or careful tuning parameters
  • 4. Data Sample Model Models are great as they scale nicely. But, visualisation is required as “I don’t know, what I don’t know.”
  • 5. Data Sample Model Binning Binning can solve a lot of these challenges “Bin - Summarize - Smooth: A framework for visualising big data” - Hadley Wickham (2013)
  • 6.
  • 7. “Visualising big data is the process of creating generalized histograms”
  • 8. Approach BIN : fixed size bins = (x-origin)/width SUMMARIZE : summary stats = count, mean, stdev SMOOTH : smoothing e.g. kernel mean, regression VISUALISE : visualise using standard plots
  • 9. Bigvis Package in R Aim: To plot 100 million points in under 5 seconds. Approach: - Plotting using standard R libraries - Processing done in (fast) compiled C++ code, using Rcpp package - Outlier removal in big data - Smoothing to highlight trends & suppress noise
  • 10. Diamonds dataset ggplot(diamonds) + aes(carat, price) + geom_point(alpha = 0.2, colour = "orange") 50k observations e.g. price, carat of diamonds
  • 11. Condense (bin + summarise) library(bigvis) library(ggplot2) Nbin <- 20 BinData <- with(diamonds, condense( bin(carat, find_width(carat, Nbin)), bin(price, find_width(price, Nbin))))
  • 12. Plotting the Condensed Data p <- ggplot(BinData) + aes(carat, price, fill=.count) + geom_tile() 20 bins, summarized by count
  • 13. Both Points & Condensed q <- p + geom_point(data = diamonds, aes(fill = NULL), alpha = 0.2, colour = "orange") 20 bins, summarized by count, with the base data added on top
  • 14. Movies dataset ggplot(movies) + aes(length, rating) + geom_point(alpha = 0.2, colour = "orange") 130k observations e.g. length, rating of movies on IMDB
  • 15. Let us see the outliers (title, length, rating): 1. Matrjoschka, 5700, 8.5; 2. The Cure for Insomnia, 5220, 5.9; 3. The Longest Most Meaningless Movie in the World, 2880, 7.3; 4. The Hazards of Helen, 1428, 6.6; 5. ****, 1100, 6.9
  • 16. Condense (bin + summarise) library(bigvis) library(ggplot2) Nbin <- 1e4 BinData <- with(movies, condense( bin(length, find_width(length, Nbin)), bin(rating, find_width(rating, Nbin))))
  • 17. Condensed Plot p <- ggplot(BinData) + aes(length, rating, fill=.count) + geom_tile() 10000 bins, summarized by count
  • 18. Remove Outliers autoplot(peel(BinData)) 10000 bins, summarized by count, peeling off 1% as outliers
  • 19. Smoothing smoothBinData <- smooth(peel(BinData), h = c(20, 1)) autoplot(smoothBinData) Create bins = 20, summarize count, peel 1% outlier & smooth
  • 20. Big Data Visualisation ● Approach: Bin - Summarize - Smooth - Visualise ● “Interactively” plot nearly 100 million data points in-memory for EDA in R ● Can be extended to in-database processing, e.g. for binning ● Can be parallelised, e.g. summarizing on count, mean
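
The bin, summarize, smooth and visualise steps from slide 8 can be sketched in base R for a single variable, using the simulated data from slide 2. This is only an illustration under stated assumptions, not how bigvis is implemented: the helper names (bin_index, condense_counts, smooth_counts), the choice of 100 bins and the bandwidth h = 0.5 are all hypothetical, and the smoother is a plain Gaussian kernel mean over the bin counts.

```r
# Base-R sketch of bin -> summarize -> smooth -> visualise for one variable.
# All helper names below are hypothetical; none of them come from bigvis.

set.seed(42)
x <- rnorm(1e6, mean = 0, sd = 2)  # one million points, as on slide 2

# BIN: fixed-width bins, index = floor((x - origin) / width) + 1
bin_index <- function(x, origin, width) {
  floor((x - origin) / width) + 1
}

# SUMMARIZE: count the observations that fall in each bin
condense_counts <- function(x, nbin = 100) {
  origin <- min(x)
  width  <- (max(x) - origin) / nbin
  idx    <- pmin(bin_index(x, origin, width), nbin)  # keep max(x) in the last bin
  data.frame(
    mid   = origin + (seq_len(nbin) - 0.5) * width,  # bin midpoints
    count = tabulate(idx, nbins = nbin)
  )
}

# SMOOTH: Gaussian kernel-weighted mean of the counts, bandwidth h (assumed)
smooth_counts <- function(binned, h) {
  vapply(binned$mid, function(m) {
    w <- dnorm(binned$mid, mean = m, sd = h)
    sum(w * binned$count) / sum(w)
  }, numeric(1))
}

binned <- condense_counts(x, nbin = 100)
binned$smooth <- smooth_counts(binned, h = 0.5)

# VISUALISE: 100 bars and a smooth curve instead of a million points
if (interactive()) {
  plot(binned$mid, binned$count, type = "h", xlab = "x", ylab = "count")
  lines(binned$mid, binned$smooth, col = "orange")
}
```

After binning, the smoothing and plotting steps work on 100 rows rather than a million, which is why the approach scales: only the binning pass touches every observation.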