ANALYST’S
NIGHTMARE OR
LAUNDERING MASSIVE
SPREADSHEETS
An example of how analysis that overlooks data quality issues may go
completely wrong
By Feyzi Bagirov and Tanya Yarmola
Agenda
■ About us
■ Dirty Data types
■ FitBit dataset insights (pre-impute)
■ FitBit dataset insights (post-impute)
■ Q&A
About us
■ Vice President in
Model Governance
and Review at
JP Morgan
■ Faculty of Analytics at
Harrisburg University
of Science and
Technology
■ Data Science Advisor
at Metadata.io
According to Gartner, Excel is still the
most popular BI tool in the world
■ More and more powerful tools are available on the market
■ The spreadsheet, however, lives on:
– Excel is the most widely used analytics
tool in the world
Dirty Data
■ Significant quantities of data are stored and passed around in spreadsheet
formats
■ Analysis is also frequently performed without leaving Excel.
■ This aggravates data quality issues:
– duplicates and nulls are overlooked
– copy-pastes and manual imputations create additional errors
– VLOOKUP returns only the first match, so duplicate keys go unnoticed
■ When the data turn out to be less clean than you hoped, serious errors
occur and propagate through the spreadsheet work cycle.
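The duplicate-key problem can be made visible outside the spreadsheet. The sketch below uses pandas and two small made-up tables (the column names `Id`, `HeightCm` and `Steps` are illustrative assumptions, not taken from the slides). It contrasts a first-match lookup with a merge that refuses to run when the lookup table has duplicate keys.

```python
import pandas as pd

# Hypothetical lookup table: customer 2 appears twice with conflicting values.
heights = pd.DataFrame({"Id": [1, 2, 2], "HeightCm": [170, 165, 182]})
steps = pd.DataFrame({"Id": [1, 2], "Steps": [9000, 12000]})

# First-match behaviour (similar to VLOOKUP): one row per key is kept silently.
first_match = steps.merge(heights.drop_duplicates("Id", keep="first"), on="Id")

# Surface the problem instead: list every key that occurs more than once.
dup_keys = heights.loc[heights["Id"].duplicated(keep=False), "Id"].unique().tolist()

# validate= makes pandas raise if the lookup side is not unique per key.
try:
    steps.merge(heights, on="Id", validate="many_to_one")
    merge_ok = True
except pd.errors.MergeError:
    merge_ok = False
```

Here the first-match lookup quietly assigns 165 to customer 2, while the validated merge fails loudly and `dup_keys` names the offending Id.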
According to IDG, cleaning and organizing
data takes up to 60% of the data scientists’
time
Common types of dirty data
■ Missing data
– Missing Completely At Random (MCAR)
– Missing At Random (MAR)
– Missing Not At Random (MNAR)
■ Duplicates
■ Outliers
■ Multiple comma-separated (or not) values that are stored in one column (common
symptom)
■ Column headers are values, not variable names
Handling Dirty Data
■ You can handle dirty data on two levels:
– Database level/manual clean inside the database – not efficient, does not scale
well
– Application level – recommended way, whenever possible
Ø Identify the commonly occurring problems with your data and the tasks to fix them
Ø Once you have identified the most common cleanup tasks for your data, create scripts that
you will run on every new dataset.
Ø Whenever a new type of error appears in a new dataset, add the code that fixes it to your
scripts.
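One possible shape for such a script is a list of small cleaning steps applied in order, so a new fix is added by appending one function. This is a minimal sketch: the step names and the toy data are hypothetical and do not come from the talk.

```python
import pandas as pd

def strip_column_names(df):
    # Remove stray whitespace that often sneaks into spreadsheet headers.
    return df.rename(columns=lambda c: c.strip())

def drop_exact_duplicates(df):
    # Remove rows that are identical in every column.
    return df.drop_duplicates()

# Append a new function here whenever a new kind of error shows up.
CLEANING_STEPS = [strip_column_names, drop_exact_duplicates]

def clean(df, steps=CLEANING_STEPS):
    """Apply every cleaning step in order and return a fresh index."""
    for step in steps:
        df = step(df)
    return df.reset_index(drop=True)

raw = pd.DataFrame({" Id ": [1, 1, 2], "Steps": [10, 10, 5]})
tidy = clean(raw)
```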
Concept of tidy data
■ “Tidy Data” by Hadley Wickham, “Journal of Statistical Software”, Aug 2014 [1]
■ Principles of tidy data:
– Observations as rows
– Variables as columns
– One type of observational unit per table (if a table that is supposed to contain
characteristics of people also contains information about their pets, it holds more
than one observational unit).
[1] https://www.jstatsoft.org/article/view/v059i10/v59i10.pdf
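As an illustration of the "column headers are values" symptom, the sketch below reshapes a small wide table into tidy form with `DataFrame.melt`. The table and its column names are made up for the example.

```python
import pandas as pd

# Untidy: the date, which is a value, is stored in the column headers.
wide = pd.DataFrame({
    "Id": [1, 2],
    "2016-04-12": [9000, 4000],
    "2016-04-13": [11000, 5000],
})

# Tidy: one row per (Id, Date) observation, one column per variable.
tidy = wide.melt(id_vars="Id", var_name="Date", value_name="Steps")
tidy["Date"] = pd.to_datetime(tidy["Date"])
```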
Objectives
■ To provide a simple example that illustrates how data quality issues may visibly
affect results of an analysis
■ To estimate customer’s height based on average stride length and see whether
results belong to expected ranges
Tools
• A publicly available FitBit dataset [1] that contains records on 33
customers, with
• minute-by-minute records on steps and intensities
• daily distances travelled (FitBit estimate)
• Data quality issues were introduced for illustration purposes – this
also allows comparison with the original.
[1] https://zenodo.org/record/53894/files/mturkfitbit_export_4.12.16-5.12.16.zip
Data
Data
Quick and dirty height calculation
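The original code for this slide is not preserved in the transcript. The sketch below shows one way such a calculation could look: average stride length is distance divided by steps, and height is stride divided by a fixed ratio. The ratio 0.413 is a commonly quoted rule of thumb and is an assumption here, as are the function and parameter names.

```python
# Assumed rule of thumb: walking stride length is roughly 0.413 x height.
STRIDE_TO_HEIGHT_RATIO = 0.413

def estimate_height_cm(total_distance_km, total_steps,
                       ratio=STRIDE_TO_HEIGHT_RATIO):
    """Estimate height from total distance and total step count."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    stride_cm = total_distance_km * 100_000 / total_steps  # km -> cm per step
    return stride_cm / ratio

# Example: 7 km covered in 10,000 steps gives a 70 cm stride.
height = estimate_height_cm(7.0, 10_000)
```

Because the step count sits in the denominator, inflated step values (duplicates, outliers) shrink the stride and therefore the estimated height, while missing steps inflate it.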
Quick and Dirty Calculation Results
Let’s take a closer look at the data to see if
we can correct for outlier mistakes
Initial observations
■ minuteSteps and minuteIntensities have different numbers of records - there may be
duplicates.
■ Most values for Steps and Intensities are zeroes.
■ There are Nulls in minuteSteps
■ Numbers of unique user Ids are different.
■ Id in minuteSteps is an object datatype.
■ Max number of Steps per minute is 500 - this is over 8 steps per second - seems too
high, potential outlier issue
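Observations like these can be collected with a small profiling helper. The sketch below runs on toy stand-ins for `minuteSteps` and `minuteIntensities`; the column names `Id`, `Steps` and `Intensity` and all values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the two minute-level tables.
minute_steps = pd.DataFrame({
    "Id": ["1503960366", "1503960366", "x_1624580081", "1624580081"],
    "Steps": [0, 0, np.nan, 500],
})
minute_intensities = pd.DataFrame({
    "Id": [1503960366, 1624580081, 1624580081],
    "Intensity": [0, 0, 3],
})

def profile(df, value_col):
    """Summarise row count, duplicates, nulls, Ids and value range."""
    return {
        "rows": len(df),
        "duplicate_rows": int(df.duplicated().sum()),
        "nulls": int(df[value_col].isna().sum()),
        "unique_ids": int(df["Id"].nunique()),
        "id_dtype": str(df["Id"].dtype),  # a text dtype hints at malformed Ids
        "share_zero": float((df[value_col] == 0).mean()),
        "max": float(df[value_col].max()),
    }

steps_profile = profile(minute_steps, "Steps")
intensity_profile = profile(minute_intensities, "Intensity")
```

Comparing the two dictionaries side by side shows differing row counts, differing numbers of unique Ids, nulls, and a suspicious maximum.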
Daily Distances observations
More observations
• Number of unique Ids matches minuteIntensities
• SedentaryActiveDistance is mostly zero – exclusion should be OK
Analysis with Data Checks
• Ids are a mix of integers and strange strings
• Should convert all to integers to match other datasets
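The transcript does not show what the strange strings look like, so the sketch below assumes the numeric Id is still embedded in the text and can be pulled out with a regular expression. Values that contain no digits become missing rather than raising an error.

```python
import pandas as pd

# Illustrative Ids: clean integers mixed with padded and prefixed strings.
ids = pd.Series(
    [1503960366, "1624580081", " 1644430081 ", "id:1844505072", "unknown"],
    dtype="object",
)

# Pull out the first run of digits; rows without digits become NaN.
digits = ids.astype(str).str.extract(r"(\d+)", expand=False)

# Nullable integer dtype keeps unparseable Ids as <NA> instead of floats.
clean_ids = pd.to_numeric(digits, errors="coerce").astype("Int64")
```

Rows that end up as `<NA>` should be inspected by hand before they are dropped or matched to another table.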
Analysis with Data Checks(cont’d)
Analysis with Data Checks(cont’d)
Nulls and outliers
• There are Nulls in minuteSteps
• Max number of Steps per minute is 500 - this is over 8 steps per
second - seems too high, potential outlier issue
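A simple way to prepare for imputation is to turn implausible values into nulls so that both problems are handled by the same step. The cut-off of 250 steps per minute below is an arbitrary assumption chosen for illustration, not a value from the talk.

```python
import numpy as np
import pandas as pd

# Assumed plausibility limit; adjust to the population being studied.
MAX_PLAUSIBLE_STEPS_PER_MIN = 250

steps = pd.Series([0, 110, np.nan, 500, 95], name="Steps")

# 500 steps per minute is more than 8 steps per second.
steps_per_second = steps.max() / 60

# Replace outliers with NaN so they are imputed together with the nulls.
is_outlier = steps > MAX_PLAUSIBLE_STEPS_PER_MIN
steps_masked = steps.mask(is_outlier)
n_missing = int(steps_masked.isna().sum())
```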
Missing Values - Imputations
Imputation is used when the data analysis technique is not robust to
missing values. It can be done in several ways; multiple imputation is
generally recommended and is a relatively standard method:
- Single imputation
- Multiple imputation
Single Imputations
■ Mean substitution - replacing a missing value with the mean of that variable over all other
cases. This does not change the sample mean for that variable, but it attenuates any
correlations involving the imputed variable, because there is no guaranteed
relationship between the imputed and measured variables.
■ Interpolation – a method of constructing new data points within the range of a
discrete set of known data points.
■ Partial deletion (listwise/casewise deletion) - the most common way of dealing
with missing data is listwise deletion (complete-case analysis), in which all cases with
missing values are deleted. If the data are MCAR, this will not add any bias, but it
will decrease the power of the analysis (smaller sample size).
■ Pairwise deletion – deleting a case when it is missing a variable required for a
particular analysis, but including that case in analysis for which all required
variables are present. The main advantage of this method is that it is
straightforward and easy to implement.
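The difference between the two deletion strategies is easy to see on a toy table. In the sketch below (made-up column names and values), listwise deletion keeps only fully observed rows, while the pairwise count shows how many rows are available for each pair of variables. pandas' `DataFrame.corr` excludes missing values pair by pair, which corresponds to pairwise deletion.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Steps": [100, np.nan, 80, 120],
    "Intensity": [2, 1, np.nan, 3],
    "Distance": [0.08, 0.05, 0.06, 0.09],
})

# Listwise deletion: drop every row that has any missing value.
listwise = df.dropna()

# Pairwise deletion: number of rows usable for each pair of columns.
observed = df.notna().astype(int)
pairwise_n = observed.T @ observed

# corr() skips missing values per pair of columns.
pairwise_corr = df.corr()
```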
Single Imputations (cont’d)
■ Hot-deck – a missing value is imputed from a randomly selected similar record.
■ Cold deck – selects donors from another dataset. Due to the advances in
computation power, more sophisticated methods have superseded the original
random and sorted hot deck imputation techniques
■ Regression imputation - available information from complete and incomplete cases is
used to predict the value of a specific variable. Fitted values
from the regression model are then used to impute the missing values. It has the
opposite problem of mean imputation: the imputed data do not have an error term
included in their estimation, so the estimates fit perfectly along the regression line
without any residual variance. This overstates relationships and
suggests greater precision in the imputed values than is warranted, because no
uncertainty about those values is conveyed.
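The missing residual variance can be demonstrated on synthetic data. The sketch below is an assumption-laden illustration, not the authors' code: it generates a linear relationship with noise, removes about 30% of `y` completely at random, and imputes from a straight-line fit with `numpy.polyfit`.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)   # true relationship plus noise

# Remove roughly 30% of y completely at random (MCAR).
missing = rng.random(200) < 0.3

# Fit a line on the complete cases only.
slope, intercept = np.polyfit(x[~missing], y[~missing], deg=1)

# Deterministic regression imputation: use the fitted value as-is.
y_imputed = y.copy()
y_imputed[missing] = slope * x[missing] + intercept

# Imputed points sit exactly on the line, so their residuals are all zero.
residuals = y_imputed[missing] - (slope * x[missing] + intercept)
```

Adding a random draw from the residual distribution to each fitted value (stochastic regression imputation) is one common way to restore the lost variance.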
Single Imputations (cont’d)
Multiple Imputations
■ Multiple imputation was developed by Rubin (1987) to deal with the problem of increased
noise due to imputation. There are several methods of multiple imputation.
■ The primary method is Multiple Imputation by Chained Equations (MICE). It should be
implemented only when the missing data follow the missing at random mechanism.
Multiple Imputations (cont’d)
■ Advantages of Multiple Imputation:
– An advantage over single imputation is that MI is flexible and can be used in cases
where the data are MCAR or MAR, and, given an explicit model of the missingness, even
when the data are MNAR.
– By imputing multiple times, multiple imputation accounts for the
uncertainty and range of values that the true value could have taken.
– Not difficult to implement
■ Disadvantages of Multiple Imputation:
– Can be computationally expensive, and the gain does not always justify the cost.
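For reference, the way multiple imputation carries uncertainty into the final estimate is usually summarised by Rubin's combining rules. The notation below is introduced here and does not appear in the slides: the analysis is run on each of the $m$ imputed datasets, giving estimates $\hat{Q}_i$ with variances $U_i$, which are then pooled.

```latex
\bar{Q} = \frac{1}{m}\sum_{i=1}^{m}\hat{Q}_i
\qquad
\bar{U} = \frac{1}{m}\sum_{i=1}^{m}U_i
\qquad
B = \frac{1}{m-1}\sum_{i=1}^{m}\bigl(\hat{Q}_i-\bar{Q}\bigr)^2
\qquad
T = \bar{U} + \Bigl(1+\frac{1}{m}\Bigr)B
```

The between-imputation variance $B$ is the part that any single imputation leaves out, which is why single-imputation results tend to look more precise than they are.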
Steps distributions per intensity
Single imputations - Impute nulls and outliers
using different methods:
1. mean value
2. interpolate between existing values
3. draw from the distribution of existing
values (per customer)
Single imputation - Impute using mean
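The code from this slide is not preserved in the transcript. A minimal pandas sketch of mean imputation is given below; computing the mean per customer (`Id`) rather than over the whole table is an assumption, and the data are made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Id": [1, 1, 1, 2, 2, 2],
    "Steps": [10, np.nan, 30, 100, 120, np.nan],
})

# Replace each missing value with the mean of that customer's observed steps.
customer_mean = df.groupby("Id")["Steps"].transform("mean")
df["StepsMean"] = df["Steps"].fillna(customer_mean)
```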
Single imputation - impute using
interpolation
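A comparable sketch for interpolation follows, again on made-up data and grouped per customer by assumption. `Series.interpolate` with its default linear method fills each gap from its neighbouring values, treating rows as equally spaced.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Id": [1, 1, 1, 2, 2, 2],
    "Steps": [10, np.nan, 30, 100, np.nan, 140],
})

# Interpolate within each customer so values never leak across Ids.
df["StepsInterp"] = df.groupby("Id")["Steps"].transform(
    lambda s: s.interpolate()
)
```

The rows must be sorted by time within each customer for the neighbouring values to be meaningful.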
Impute using transform with random
choice (hot-deck)
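The slide title suggests a `groupby(...).transform` combined with a random choice. The sketch below is one way to write that; the helper name `hot_deck`, the seed and the data are hypothetical. Each missing value is replaced by a value drawn at random from the same customer's observed values.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed so the draw is reproducible

def hot_deck(s):
    """Fill missing values with random draws from the observed values of s."""
    donors = s.dropna().to_numpy()
    out = s.copy()
    n_missing = int(s.isna().sum())
    if n_missing and len(donors):
        out[s.isna()] = rng.choice(donors, size=n_missing)
    return out

df = pd.DataFrame({
    "Id": [1, 1, 1, 2, 2, 2],
    "Steps": [10, np.nan, 30, 100, 120, np.nan],
})

# Donors are restricted to the same customer.
df["StepsHotDeck"] = df.groupby("Id")["Steps"].transform(hot_deck)
```

Unlike mean imputation, this keeps the imputed values within the set of values actually observed for that customer, which preserves the shape of the per-customer distribution.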
Calculate height function
Calculate height for different imputation
versions and compare results
Q&A
Thanks!
Feyzi Bagirov, feyzi.bagirov@metadata.io, @FeyziBagirov
Tanya Yarmola, tanya.yarmola@jpmorgan.com, @TanyaYarmola

Zilliz
 
Mariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceXMariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceX
Mariano Tinti
 
UI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentationUI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentation
Wouter Lemaire
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
Zilliz
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
panagenda
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
Matthew Sinclair
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
IndexBug
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
Octavian Nadolu
 
WeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation TechniquesWeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation Techniques
Postman
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
Jason Packer
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
danishmna97
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
saastr
 
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Speck&Tech
 
June Patch Tuesday
June Patch TuesdayJune Patch Tuesday
June Patch Tuesday
Ivanti
 
OpenID AuthZEN Interop Read Out - Authorization
OpenID AuthZEN Interop Read Out - AuthorizationOpenID AuthZEN Interop Read Out - Authorization
OpenID AuthZEN Interop Read Out - Authorization
David Brossard
 
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Jeffrey Haguewood
 

Recently uploaded (20)

Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
 
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing InstancesEnergy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
 
Mariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceXMariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceX
 
UI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentationUI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentation
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
 
WeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation TechniquesWeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation Techniques
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
 
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
 
June Patch Tuesday
June Patch TuesdayJune Patch Tuesday
June Patch Tuesday
 
OpenID AuthZEN Interop Read Out - Authorization
OpenID AuthZEN Interop Read Out - AuthorizationOpenID AuthZEN Interop Read Out - Authorization
OpenID AuthZEN Interop Read Out - Authorization
 
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
 

Analyst’s Nightmare or Laundering Massive Spreadsheets

  • 1. ANALYST’S NIGHTMARE OR LAUNDERING MASSIVE SPREADSHEETS An example of how analysis that overlooks data quality issues may go completely wrong By Feyzi Bagirov and Tanya Yarmola
  • 2. Agenda ■ About us ■ Dirty Data types ■ Fit Bit dataset insights (pre-impute) ■ Fit Bit dataset insights (post-impute) ■ Q&A
  • 3. About us ■ Vice President in Model Governance and Review at JP Morgan ■ Faculty of Analytics at Harrisburg University of Science and Technology ■ Data Science Advisor at Metadata.io
  • 4. According to Gartner, Excel is still the most popular BI tool in the world ■ More and more powerful tools are available on the market ■ The spreadsheet, however, lives on: – Excel is the most widely used analytics tool in the world
  • 5.
  • 6. Dirty Data ■ Significant quantities of data are stored and passed around in spreadsheet formats ■ Analysis is also frequently performed without leaving Excel. ■ This aggravates data quality issues: – duplicates and nulls are overlooked – copy-pastes and manual imputations create additional errors – VLOOKUPs do not take duplicates into account ■ When the data turns out to be less clean than you hoped, serious errors occur and propagate through the spreadsheet work cycle.
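The VLOOKUP point can be reproduced outside Excel. The following is a minimal pandas sketch; the tables, column names and values are made up for illustration and are not taken from the deck. It contrasts a first-match lookup (what an exact-match VLOOKUP returns) with a plain join, and shows one way to make duplicate keys fail loudly.

```python
import pandas as pd

# Hypothetical data: one activity row per customer, plus a lookup table
# in which customer 2 was accidentally entered twice.
activity = pd.DataFrame({"Id": [1, 2, 3], "Steps": [4000, 7000, 5500]})
lookup = pd.DataFrame({"Id": [1, 2, 2, 3], "HeightCm": [170, 165, 181, 158]})

# VLOOKUP-style behaviour: only the first match per key is used, silently.
first_match = activity.merge(
    lookup.drop_duplicates(subset="Id", keep="first"), on="Id", how="left"
)

# A plain join keeps every match instead, so the row count grows.
all_matches = activity.merge(lookup, on="Id", how="left")

# Asking pandas to validate the key turns the silent problem into an error.
try:
    activity.merge(lookup, on="Id", how="left", validate="many_to_one")
    duplicate_keys_found = False
except pd.errors.MergeError:
    duplicate_keys_found = True
```

Either outcome (a silently chosen first match or an inflated row count) distorts later aggregates, which is why the key check is worth running before any lookup.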
  • 7. According to IDG, cleaning and organizing data takes up to 60% of the data scientists’ time
  • 8. Common types of dirty data ■ Missing data – Missing Completely At Random (MCAR) – Missing At Random (MAR) – Missing Not At Random (MNAR) ■ Duplicates ■ Outliers ■ Multiple comma-separated (or not) values that are stored in one column (common symptom) ■ Column headers are values, not variable names
  • 9. Handling Dirty Data ■ You can handle dirty data on two levels: – Database level/manual clean inside the database – not efficient, does not scale well – Application level – recommended way, whenever possible Ø Identify the commonly occurring problems with your data and the tasks to fix them Ø Once you have identified the most common cleanup tasks for your data, create scripts that you will run on every new dataset. Ø Whenever a new dataset contains a new type of error, add the code to fix it to your scripts.
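One possible shape for such application-level scripts is a list of small cleaning functions applied in order, so that a fix for a new type of error is just one more entry in the list. This is only a sketch; the function names and the sample frame are hypothetical.

```python
import pandas as pd

def strip_column_names(df):
    # Remove stray whitespace from headers, e.g. " Id " -> "Id".
    return df.rename(columns=lambda c: str(c).strip())

def drop_exact_duplicates(df):
    # Remove rows that repeat an earlier row in every column.
    return df.drop_duplicates()

# Add a new function here whenever a new dataset shows a new kind of error.
CLEANING_STEPS = [strip_column_names, drop_exact_duplicates]

def clean(df, steps=CLEANING_STEPS):
    for step in steps:
        df = step(df)
    return df.reset_index(drop=True)

raw = pd.DataFrame({" Id ": [1, 1, 2], "Steps": [10, 10, 20]})
cleaned = clean(raw)
```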
  • 10. Concept of tidy data ■ “Tidy Data” by Hadley Wickham, Journal of Statistical Software, Aug 2014 [1] ■ Principles of tidy data: – Observations as rows – Variables as columns – One type of observational unit per table (if a table that is supposed to contain characteristics of people also contains information about their pets, it holds more than one observational unit). [1] https://www.jstatsoft.org/article/view/v059i10/v59i10.pdf
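As an illustration of the first two principles, a table whose column headers are values (here, day names) can be reshaped so that each row is one observation. The sketch below uses pandas and an invented table.

```python
import pandas as pd

# Hypothetical untidy table: "Mon" and "Tue" are values of a Day variable,
# not variable names.
wide = pd.DataFrame({"Id": [1, 2], "Mon": [4000, 7000], "Tue": [4500, 6500]})

# One row per (customer, day) observation; one column per variable.
tidy = wide.melt(id_vars="Id", var_name="Day", value_name="Steps")
```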
  • 11. Objectives ■ To provide a simple example that illustrates how data quality issues may visibly affect results of an analysis ■ To estimate customer’s height based on average stride length and see whether results belong to expected ranges
  • 12. Tools
  • 13. Data ■ A publicly available FitBit dataset [1] that contains records on 33 customers with – minute-by-minute records on steps and intensities – daily distances travelled (FitBit estimate) ■ Data quality issues were introduced for illustration purposes – this also allows comparison with the original. [1] https://zenodo.org/record/53894/files/mturkfitbit_export_4.12.16-5.12.16.zip
  • 14. Data
  • 15. Quick and dirty height calculation
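The transcript does not preserve the calculation itself, so the following is only a sketch of how such an estimate could look. It assumes that average stride length is total distance divided by total steps, and that stride length is about 0.413 times height (an often-quoted rule of thumb, not a figure from the deck). The column names, units and numbers are hypothetical.

```python
import pandas as pd

STRIDE_TO_HEIGHT = 0.413  # assumed ratio of stride length to height

# Hypothetical daily totals for two customers (distance assumed to be in km).
daily = pd.DataFrame({
    "Id": [1, 1, 2, 2],
    "TotalSteps": [10000, 8000, 6000, 9000],
    "TotalDistanceKm": [7.0, 5.6, 4.5, 6.75],
})

totals = daily.groupby("Id")[["TotalSteps", "TotalDistanceKm"]].sum()

# Average stride in metres, then height in centimetres.
stride_m = totals["TotalDistanceKm"] * 1000 / totals["TotalSteps"]
height_cm = (stride_m / STRIDE_TO_HEIGHT * 100).round(1)
```

Because the estimate is a simple ratio, any duplicated, missing or inflated step counts feed straight into the denominator, which is what makes it a convenient test of data quality.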
  • 16. Quick and Dirty Calculation Results
  • 17. Let’s take a closer look at the data to see if we can correct for outlier mistakes
  • 18. Initial observations ■ minuteSteps and minuteIntensities have different numbers of records - there may be duplicates. ■ Most values for Steps and Intensities are zeroes. ■ There are Nulls in minuteSteps ■ Numbers of unique user Ids are different. ■ Id in minuteSteps is an object datatype. ■ Max number of Steps per minute is 500 - this is over 8 steps per second - seems too high, potential outlier issue
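Checks of this kind can be collected into a small report before any analysis starts. The sketch below uses two tiny invented frames standing in for minuteSteps and minuteIntensities; the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the minute-level tables.
steps = pd.DataFrame({
    "Id": ["1", "1", "2", "2", "2", "x3"],
    "Steps": [0, 0, 12, np.nan, 500, 0],
})
intensities = pd.DataFrame({"Id": [1, 1, 2, 2], "Intensity": [0, 0, 1, 0]})

report = {
    # Different row counts hint at duplicates or missing records.
    "rows": (len(steps), len(intensities)),
    # Different numbers of unique Ids hint at malformed identifiers.
    "unique_ids": (steps["Id"].nunique(), intensities["Id"].nunique()),
    "id_dtype": str(steps["Id"].dtype),
    "null_steps": int(steps["Steps"].isna().sum()),
    "share_zero_steps": float((steps["Steps"] == 0).mean()),
    # 500 steps per minute is more than 8 steps per second.
    "max_steps_per_second": float(steps["Steps"].max() / 60),
}
```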
  • 19. Daily Distances observations More observations • Number of unique Ids matches minuteIntensities • SedentaryActiveDistance is mostly zero – exclusion should be OK
  • 20. Analysis with Data Checks • Ids are mix of integers and strange strings • Should convert all to integers to match other datasets
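The transcript does not show what the strange strings look like, so the sketch below assumes they are numeric Ids with stray characters around them. It keeps the digits, converts them to integers, and lists whatever could not be parsed rather than guessing. All Ids shown are invented.

```python
import pandas as pd

# Hypothetical Id column mixing integers with malformed strings.
ids = pd.Series([1001, "1001 ", "id_1002", "n/a"], dtype="object")

# Keep the first run of digits; entries without digits become missing.
digits = ids.astype(str).str.extract(r"(\d+)", expand=False)
clean_ids = pd.to_numeric(digits, errors="coerce").astype("Int64")

# Anything that could not be converted is kept aside for manual review.
unparsed = ids[clean_ids.isna()]
```

The nullable `Int64` dtype keeps the Ids as integers even when some entries are missing, so they can still be matched against the other tables.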
  • 21. Analysis with Data Checks(cont’d)
  • 22. Analysis with Data Checks(cont’d)
  • 23. Nulls and outliers • There are Nulls in minuteSteps • Max number of Steps per minute is 500 - this is over 8 steps per second - seems too high, potential outlier issue
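One way to treat implausible values together with the nulls is to mask them as missing, so that a single imputation step can handle both. The cut-off below is an assumption chosen for illustration, not a value from the deck.

```python
import numpy as np
import pandas as pd

MAX_PLAUSIBLE_STEPS = 250  # assumed cut-off, roughly 4 steps per second

steps = pd.Series([0, 95, 500, np.nan, 110], name="Steps")

is_outlier = steps > MAX_PLAUSIBLE_STEPS
steps_masked = steps.mask(is_outlier)  # outliers become NaN, like the nulls
```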
  • 24. Missing Values - Imputations Imputation is used when the data analysis technique is not robust to missing values. It can be done in several ways: - Single imputation - Multiple imputation. Multiple imputation is recommended and is a relatively standard method.
  • 25. Single Imputations ■ Mean substitution - replacing a missing value with the mean of that variable over all other cases. This does not change the sample mean for that variable, but it attenuates any correlations involving the imputed variable, because there is no guaranteed relationship between the imputed and measured variables. ■ Interpolation – a method of constructing new data points within the range of a discrete set of known data points.
  • 26. Single Imputations (cont’d) ■ Partial deletion (listwise/casewise deletion) - the most common way of dealing with missing data is listwise deletion (complete-case analysis), in which all cases with a missing value are deleted. If the data are MCAR, this will not add any bias, but it will decrease the power of the analysis (smaller sample size). ■ Pairwise deletion – deleting a case when it is missing a variable required for a particular analysis, but including that case in analyses for which all required variables are present. The main advantage of this method is that it is straightforward and easy to implement.
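The difference between the two deletion strategies shows up in how many rows each analysis gets to use. A small pandas sketch with invented values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Steps":     [10, 20, np.nan, 40, 50],
    "Intensity": [1, 2, 3, np.nan, 5],
    "Distance":  [0.1, 0.2, 0.3, 0.4, np.nan],
})

# Listwise deletion: a row is dropped if any variable is missing.
listwise = df.dropna()
n_listwise = len(listwise)

# Pairwise deletion: for one particular analysis (Steps vs Intensity),
# only rows missing one of those two variables are dropped.
pair = df[["Steps", "Intensity"]].dropna()
n_pairwise = len(pair)

# DataFrame.corr excludes missing values pair by pair, so each entry of
# the matrix may be based on a different subset of rows.
pairwise_corr = df.corr()
```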
  • 27. Single Imputations (cont’d) ■ Hot-deck – a missing value is imputed from a randomly selected similar record. ■ Cold-deck – selects donors from another dataset. Due to advances in computing power, more sophisticated methods have superseded the original random and sorted hot-deck imputation techniques. ■ Regression imputation - available information from complete and incomplete cases is used to fit a regression model that predicts the value of a specific variable; fitted values from the model are then used to impute the missing values. It has the opposite problem of mean imputation: the imputed data have no error term included in their estimation, so the estimates fit perfectly along the regression line without any residual variance. This causes relationships to be over-identified and suggests greater precision in the imputed values than is warranted, because no uncertainty about those values is carried along.
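The "no residual variance" problem of regression imputation can be seen directly in a small example. The sketch below fits a straight line with `numpy.polyfit` on invented Steps and Intensity values; it is a stand-in for whatever model one would actually use.

```python
import numpy as np
import pandas as pd

# Hypothetical data: Steps is missing for the last two rows.
df = pd.DataFrame({
    "Intensity": [0, 1, 2, 3, 1, 2],
    "Steps": [0.0, 40.0, 85.0, 115.0, np.nan, np.nan],
})

# Fit Steps ~ Intensity on the complete cases only.
observed = df.dropna()
slope, intercept = np.polyfit(observed["Intensity"], observed["Steps"], deg=1)

# Impute the missing values with the fitted line.
fitted = intercept + slope * df["Intensity"]
df["StepsImputed"] = df["Steps"].fillna(fitted)

# Imputed points sit exactly on the line: their residuals are zero,
# unlike the observed points, which scatter around it.
imputed_residuals = (df["StepsImputed"] - fitted)[df["Steps"].isna()]
observed_residuals = (df["Steps"] - fitted).dropna()
```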
  • 28. Multiple Imputations ■ Multiple imputation was developed by Rubin (1987) to deal with the problem of increased noise due to imputation. There are several methods of multiple imputation. ■ The primary method is Multiple Imputation by Chained Equations (MICE), which should be implemented only when the missing data follow the missing at random mechanism.
  • 29. Multiple Imputations (cont’d) ■ Advantages of Multiple Imputation: – An advantage over single imputation is that MI is flexible and can be used in cases where the data is MCAR, MAR, and even MNAR. – By imputing multiple times, multiple imputation accounts for the uncertainty and range of values that the true value could have taken. – Not difficult to implement ■ Disadvantages of Multiple Imputation: – Can be computationally expensive, and the extra cost is not always worth it.
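To make "imputing multiple times" concrete, the sketch below creates several completed datasets and pools the estimate of the mean with Rubin's rules: the pooled estimate is the average of the per-dataset estimates, and the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. The imputation step here is a plain random draw from the observed values, chosen only to keep the example short; it is not MICE, and the data are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

observed = np.array([90.0, 100.0, 110.0, 95.0, 105.0])
n_missing = 3   # number of missing values to fill in
m = 20          # number of imputed datasets

estimates, variances = [], []
for _ in range(m):
    # Simplified imputation step: draw donors from the observed values.
    draws = rng.choice(observed, size=n_missing, replace=True)
    completed = np.concatenate([observed, draws])
    estimates.append(completed.mean())
    # Estimated variance of the mean within this completed dataset.
    variances.append(completed.var(ddof=1) / len(completed))

# Rubin's rules for pooling.
q_bar = np.mean(estimates)              # pooled estimate
within = np.mean(variances)             # within-imputation variance
between = np.var(estimates, ddof=1)     # between-imputation variance
total_var = within + (1 + 1 / m) * between
```

The between-imputation term is what a single imputation cannot provide: it measures how much the answer moves when the missing values are filled in differently.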
  • 30. Steps distributions per intensity Single imputations - Impute nulls and outliers using different methods: 1. mean value 2. interpolate between existing values 3. draw from the distribution of existing values (per customer)
  • 31. Single imputation - Impute using mean
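The code on this slide is not preserved in the transcript. A minimal pandas sketch of mean imputation follows, assuming the mean is taken per customer; the data are invented.

```python
import numpy as np
import pandas as pd

steps = pd.DataFrame({
    "Id": [1, 1, 1, 2, 2, 2],
    "Steps": [10.0, np.nan, 30.0, 100.0, 120.0, np.nan],
})

# Mean of the observed values for each customer, aligned to every row.
customer_mean = steps.groupby("Id")["Steps"].transform("mean")

# Fill only the missing entries; observed values are left untouched.
steps["StepsMean"] = steps["Steps"].fillna(customer_mean)
```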
  • 32. Single imputation - impute using interpolation
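The slide's code is not in the transcript either; the sketch below shows linear interpolation in pandas on invented data. Grouping by customer is an assumption made here so that values are never interpolated across two different customers.

```python
import numpy as np
import pandas as pd

steps = pd.DataFrame({
    "Id": [1, 1, 1, 2, 2, 2, 2],
    "Steps": [10.0, np.nan, 30.0, 100.0, np.nan, np.nan, 130.0],
})

# Linear interpolation between neighbouring observed values,
# done separately within each customer's records.
steps["StepsInterp"] = (
    steps.groupby("Id")["Steps"].transform(lambda s: s.interpolate(method="linear"))
)
```

This treats the rows as equally spaced in time, which suits minute-by-minute records only if they are sorted and no minutes are missing altogether.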
  • 33. Impute using transform with random choice (hot-deck)
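A sketch of what a `transform` with random choice could look like is given below. It assumes that "similar record" means another record of the same customer; the helper name `hot_deck`, the seed and the data are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

def hot_deck(values):
    # Donors are the observed values within the same group.
    donors = values.dropna().to_numpy()
    out = values.copy()
    missing = out.isna()
    if len(donors) and missing.any():
        # Each missing entry gets a randomly chosen donor value.
        out[missing] = rng.choice(donors, size=int(missing.sum()))
    return out

steps = pd.DataFrame({
    "Id": [1, 1, 1, 1, 2, 2, 2],
    "Steps": [10.0, np.nan, 30.0, np.nan, 100.0, 120.0, np.nan],
})

steps["StepsHotDeck"] = steps.groupby("Id")["Steps"].transform(hot_deck)
```

Unlike mean imputation, the filled-in values are real observed values, so the spread of each customer's distribution is preserved rather than pulled toward the centre.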
  • 35. Calculate height for different imputation versions and compare results
  • 36.
  • 37. Q&A
  • 38. Thanks! Feyzi Bagirov, feyzi.bagirov@metadata.io, @FeyziBagirov Tanya Yarmola, tanya.yarmola@jpmorgan.com, @TanyaYarmola