SlideShare a Scribd company logo
1 of 55
Download to read offline
Data Analytics
A (Short) Tour
Venkatesh-Prasad Ranganath
http://about.me/rvprasad
Click to edit Master title style
Is it Analytics or Analysis?
Analytics uses analysis to recommend actions
or make decisions.
Why Data Analysis?
Confirm a hypothesis
Confirmatory
Explore the data
Exploratory (EDA)
Word of Caution – Case of Killer Potatoes?
This is figure 1.5 in the book “Exploring Data” by Ronald K Pearson.
Word of Caution – Case of Killer Potatoes?
This is figure 1.6 in the book “Exploring Data” by Ronald K Pearson.
Typical Data Analytics Work Flow
1. Identify Issue
2. Data Collection, Storage, Representation, and Access
3. Data Cleansing
4. Data Transformation
5. Data Analysis (Processing)
6. Result Validation
7. Result Presentation (Visual Validation)
8. Recommend Action / Make Decision
Data Collection – Approaches
Observation Monitoring
Interviews Surveys
Data Collection – Comparing Approaches
Observation Interviews Surveys Monitoring
Technique Shadowing Conversation Questionnaire Logging
Interactive No Yes No No
Simple No No Yes Yes
Automatable No No Yes Yes
Scalable No No Yes Yes
Data Size Small Small Medium Huge
Data Format Flexible Flexible Rigid Rigid
Data Type Qualitative Qualitative Qualitative Quantitative
Real Time Analysis No No No Yes
Expensive Yes Yes No No
Data Collection – Comparing Approaches
Observation Interviews Surveys Monitoring
What to capture? Flexible Flexible Fixed Fixed
How to capture? Flexible Flexible Fixed Fixed
Human Subjects Yes Yes Yes No
Transcription Yes Yes Yes/No No
SnR High High High Low
Involves NLP Unlikely Unlikely Likely Likely
Kind of Analysis Confirmatory Confirmatory Confirmatory Exploratory
Kind of Techniques Statistical Testing Statistical Testing Statistical Testing Machine Learning
Data Storage – Choices
• Flat Files
• Databases
• Streaming Data (but there is no storage)
Data Storage – Flat Files
• Simple
• Common / Universal
• Inexpensive
• Independent of specific technology
• Compression friendly
• Very few choices
• Plain text, CSV, XML, and JSON
• Well established
• Low level data access APIs
• No support for automatic scale out /
parallel access
• Unoptimized data access
• Indices
• Columnar storage
Data Storage – Databases
• High level data access API
• Support for automatic scale out /
parallel access
• Optimized data access
• Indices
• Columnar storage
• Well established
• Complex
• Niche / Requires experts
• Optimization
• Distribution
• Expensive
• Dependent on specific technology
• DB controlled compression
• Lots of choices
• SQL, MySQL, PostgreSQL, Maria, Raven,
Couch, Redis, Neo4j, ….
Data Storage – Streaming
• Well, there is not storage 
• Novel
• Many streaming data sources
• Breaks traditional data analysis
algorithms
• No access to the entire data set
• Too many unknowns
• Expertise
• Cost
• Best practices
• Accuracy
• Benefits
• Deficiencies
• Ease of use
Data Storage – Algorithms and Necessity
• Offline
• Online
• Streaming
• Real-time
• Flat Files
• Databases
• Streaming Data
• Do we need fast?
• How fast is quick enough?
• How often do we need fast?
• Is it worth the cost?
• Is it worth the loss of accuracy?
Data Representation – Structured
• Easy to process
• One time schema setup cost
• Common schema types
• CSV, XML, JSON, …
• You can cook up your schema
• Eases data exploration & analysis
• Off-the-shelf techniques to handle
data
• Requires very little expertise
• Ideal with automatic data collection
• Ideal for storing quantitative data
• Rigid
• Changing schema can be hard
• Upfront cost to define the schema
Data Representation – Unstructured
• Flexible
• Off-the-shelf techniques to preprocess
data but requires expertise
• Ideal for manual data collection
• Requires lots of preprocessing
• Complicates data exploration and
analysis
• Requires domain expertise
• Extracting data semantics is hard
• Requires schema recovery *
Data Access – Security
• Who has access to what parts of the data?
• What is the access control policy?
• How do we enforce these policies?
• What techniques do we employ to enforce these policies?
• How do we ensure the policies have been enforced?
Data Access – Privacy
• Who has access to what parts of the data?
• Who has access to what aspects of the data?
• How do you ensure the privacy of the source?
• What are the access control and anonymization policies?
• How do we enforce these policies?
• What techniques do we employ to enforce these policies?
• How do we ensure the policies have been enforced?
• How strong is the anonymization policy?
• Is it possible to recover the anonymized information? If so, how hard?
Data Scale
• Nominal
• Male, Female
• Equality operation
• Ordinal
• Very satisfied, satisfied, dissatisfied, and very dissatisfied
• Inequalities operations
• Interval
• Temperature, dates
• Addition and subtraction operations
• Ratio
• Mass, length, duration
• Multiplication and division operations
Typical Data Analytics Work Flow
1. Identify Issue
2. Data Collection, Storage, Representation, and Access
3. Data Cleansing
4. Data Transformation
5. Data Analysis (Processing)
6. Result Validation
7. Result Presentation (Visual Validation)
8. Recommend Action / Make Decision
Data Cleansing
Let’s get our hands dirty!!
The data set is from UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/).
Data Cleansing – Common Issues
• Missing values
• Extra values
• Incorrect format
• Encoding
• File corruption
• Incorrect units
• Too much data
• Outliers
• Inliers
Typical Data Analytics Work Flow
1. Identify Issue
2. Data Collection, Storage, Representation, and Access
3. Data Cleansing
4. Data Transformation
5. Data Analysis (Processing)
6. Result Validation
7. Result Presentation (Visual Validation)
8. Recommend Action / Make Decision
Data Transformation (Feature Engineering)
• Analyze specific aspects of the data
• Coarsening data
• Discretization
• Changing Scale
• Normalization
Data Transformation (Feature Engineering)
• Analyze specific aspects of the data
• Coarsening data
• Discretization
• Changing Scale
• Normalization
BMI BMI Categories
< 18.5 Underweight
18.5 – 24.9 Normal Weight
25 – 29.9 Overweight
> 30 Obesity
Data Transformation (Feature Engineering)
• Analyze specific aspects of the data
• Coarsening data
• Discretization
• Changing Scale
• Normalization
Actual Weight Normalized
78 0.285
88 0.322
62 0.227
45 0.164
Data Transformation (Feature Engineering)
• Analyze relations between features of the data
• Synthesize new features
• Relating existing features
• Combining existing features
Data Transformation
Let’s get out hands dirty!!
Data Transformation (Feature Engineering)
Keep in mind the following:
• Scales
• What the permitted operations?
• Data Collection
• What is the trade-offs in data collection?
• Parsimony
• Can we get away with simple scales?
Typical Data Analytics Work Flow
1. Identify Issue
2. Data Collection, Storage, Representation, and Access
3. Data Cleansing
4. Data Transformation
5. Data Analysis (Processing)
6. Result Validation
7. Result Presentation (Visual Validation)
8. Recommend Action / Make Decision
Data Analysis
• Features
• Attributes of each datum
• Labels
• Expert’s input about datum
• Data sets
• Training
• Validation
• Test
• Work flow
• Model building (training)
• Model tuning and selection (validation)
• Error reporting (test)
Data Analysis – Models
The figure is from the book “Modern Multivariate Statistical Techniques” by Alan Julian Izenman.
Typical Data Analytics Work Flow
1. Identify Issue
2. Data Collection, Storage, Representation, and Access
3. Data Cleansing
4. Data Transformation
5. Data Analysis (Processing)
6. Result Validation
7. Result Presentation (Visual Validation)
8. Recommend Action / Make Decision
Result Validation – Approaches
• Expert Inputs
• Cross Validation
• K-fold cross validation
• 5x2 cross validation
• Bootstrapping
Result Validation – Basic Terms
Classification
Actuals
X Y
X True X (tx) False Y (fy) p = tx + fy
Y False X (fx) True Y (ty) n = fx + ty
p’ = tx + fx N = p + n
Consider a 2-class classification problem.
Result Validation – Basic Terms
Now, consider X as positive evidence and Y as negative evidence.
Classification
Actuals
X Y
X True Positive (tp) False Negative (fn) p = tp + fn
Y False Positive (fp) True Negative (tn) n = fp + tn
p’ = tp + fp N = p + n
Result Validation – Measures
error = (fp + fn) / N
accuracy = (tp + tn) / N
tp-rate = tp / p
fp-rate = fp / n
sensitivity = tp / p = tp-rate
specificity = tn / n = 1 – fp-rate
precision = tp / p’
recall = tp / p = tp-rate
Classification
Actuals
X Y
X True Positive (tp) False Negative (fn) p = tp + fn
Y False Positive (fp) True Negative (tn) n = fp + tn
p’ = tp + fp N = p + n
Result Validation – ROC (Receiver Operating Characteristics)
The figure is from the Wikipedia page about “Receiver Operating Characteristics”.
Result Validation – Class Confusion Matrix
A B
A True A (ta) False A (fa)
B False A (fa) True B (tb)
A B C D
A True A (ta) False B (fb) False C (fc) False D (fd)
B False A (fa) True B (tb) False C (fc) False D (fd)
C False A (fa) False B (fb) True C (tc) False D (fd)
D False A (fa) False B (fb) False C (fc) True D (td)
2 class problem 4 class problem
Result Validation – Bias and Variance
The figure is from http://scott.fortmann-roe.com/docs/BiasVariance.html.
Result Validation – Underfitting & Overfitting
The figure is from the book “Modern Multivariate Statistical Techniques” by Alan Julian Izenman.
Result Validation
Let’s get out hands dirty!!
Typical Data Analytics Work Flow
1. Identify Issue
2. Data Collection, Storage, Representation, and Access
3. Data Cleansing
4. Data Transformation
5. Data Analysis (Processing)
6. Result Validation
7. Result Presentation (Visual Validation)
8. Recommend Action / Make Decision
Result Presentation (Visual Validation)
• Numbers
• Central tendencies – mode, median, and mean
• Dispersion – range, standard deviation
• Five number summary
• min, 1st quartile, median (mean), 3rd quartile, max
• Margin of error
• (Confidence) Interval
• Tables
• Charts
Result Presentation – Box-and-Whisker Plots
The figure is from the Wikipedia page about “Box Plot”.
Result Presentation – Histograms
The figure is from the book “Data Visualization: A Successful Design Process” by Andy Kirk.
Result Presentation – Line Graphs
The figure is from the book “Data Visualization: A Successful Design Process” by Andy Kirk.
Result Presentation – Scatter Plots
The figure is from the book “Data Visualization: A Successful Design Process” by Andy Kirk.
Result Presentation – Heatmaps
The figure is from the book “Data Visualization: A Successful Design Process” by Andy Kirk.
Result Presentation – Bubble Chart
The figure is from the book “Data Visualization: A Successful Design Process” by Andy Kirk.
Result Presentation – Word Cloud
The figure is from the book “Data Visualization: A Successful Design Process” by Andy Kirk.
Typical Data Analytics Work Flow
1. Identify Issue
2. Data Collection, Storage, Representation, and Access
3. Data Cleansing
4. Data Transformation
5. Data Analysis (Processing)
6. Result Validation
7. Result Presentation (Visual Validation)
8. Recommend Action / Make Decision
Now, are you thinking..
• What about identifying the issue/question?
• Know where you are going before you start
• What about recommending action / making the decision?
• Information and knowledge aren’t the same
• Are data X tasks really that important and hard?
• Garbage in, Garbage out
• Aren’t data analysis techniques the most important?
• Smart data (structures) and dumb code works better than the other way
around
Some Personal Observations
• Domain Knowledge is crucial
• Optimizing analysis
• Improving relevance of results
• Always prefer
• Simple solutions over complex solutions
• Fast solutions over slow solutions
• Most correct solutions over fully correct solutions (even no solution!!)
• If you hear “I want it now”, then say “Really? Please Explain”
• Visualization helps only if the results are relevant
• Limited data sets can only get you so far
• Security and privacy are more important than you think
Got any stories from the trenches?

More Related Content

Similar to Data analytics, a (short) tour

Data analytcis-first-steps
Data analytcis-first-stepsData analytcis-first-steps
Data analytcis-first-stepsShesha R
 
Mini datathon - Bengaluru
Mini datathon - BengaluruMini datathon - Bengaluru
Mini datathon - BengaluruKunal Jain
 
Large Scale Data Mining using Genetics-Based Machine Learning
Large Scale Data Mining using Genetics-Based Machine LearningLarge Scale Data Mining using Genetics-Based Machine Learning
Large Scale Data Mining using Genetics-Based Machine Learningjaumebp
 
Understanding your Data - Data Analytics Lifecycle and Machine Learning
Understanding your Data - Data Analytics Lifecycle and Machine LearningUnderstanding your Data - Data Analytics Lifecycle and Machine Learning
Understanding your Data - Data Analytics Lifecycle and Machine LearningAbzetdin Adamov
 
Data Quality
Data QualityData Quality
Data Qualityjerdeb
 
Clinical Data Classification of alzheimer's disease
Clinical Data Classification of alzheimer's diseaseClinical Data Classification of alzheimer's disease
Clinical Data Classification of alzheimer's diseaseGeorge Kalangi
 
Cork big data_analytics-de_puy_synthes_datsci_cit
Cork big data_analytics-de_puy_synthes_datsci_citCork big data_analytics-de_puy_synthes_datsci_cit
Cork big data_analytics-de_puy_synthes_datsci_citDana Brophy
 
Mastering the 80% of Analytics: What Data Scientists Really Do
Mastering the 80% of Analytics: What Data Scientists Really DoMastering the 80% of Analytics: What Data Scientists Really Do
Mastering the 80% of Analytics: What Data Scientists Really DoAvrio Analytics
 
Measuring Data Quality with DataOps
Measuring Data Quality with DataOpsMeasuring Data Quality with DataOps
Measuring Data Quality with DataOpsSteven Ensslen
 
Saksham Sarode - Building Effective test Data Management in Distributed Envir...
Saksham Sarode - Building Effective test Data Management in Distributed Envir...Saksham Sarode - Building Effective test Data Management in Distributed Envir...
Saksham Sarode - Building Effective test Data Management in Distributed Envir...TEST Huddle
 
Data extraction, cleanup &amp; transformation tools 29.1.16
Data extraction, cleanup &amp; transformation tools 29.1.16Data extraction, cleanup &amp; transformation tools 29.1.16
Data extraction, cleanup &amp; transformation tools 29.1.16Dhilsath Fathima
 
Coping with Data for WHOI JP Students
Coping with Data for WHOI JP StudentsCoping with Data for WHOI JP Students
Coping with Data for WHOI JP StudentsCarly Strasser
 
Royal society of chemistry activities to develop a data repository for chemis...
Royal society of chemistry activities to develop a data repository for chemis...Royal society of chemistry activities to develop a data repository for chemis...
Royal society of chemistry activities to develop a data repository for chemis...Ken Karapetyan
 
CSU-ACADIS_dataManagement101-20120217
CSU-ACADIS_dataManagement101-20120217CSU-ACADIS_dataManagement101-20120217
CSU-ACADIS_dataManagement101-20120217lyarmey
 
Data mining Basics and complete description onword
Data mining Basics and complete description onwordData mining Basics and complete description onword
Data mining Basics and complete description onwordSulman Ahmed
 
Data preprocessing ppt1
Data preprocessing ppt1Data preprocessing ppt1
Data preprocessing ppt1meenas06
 
Data preprocessing using Machine Learning
Data  preprocessing using Machine Learning Data  preprocessing using Machine Learning
Data preprocessing using Machine Learning Gopal Sakarkar
 

Similar to Data analytics, a (short) tour (20)

Data analytcis-first-steps
Data analytcis-first-stepsData analytcis-first-steps
Data analytcis-first-steps
 
Learning from data
Learning from dataLearning from data
Learning from data
 
Mini datathon - Bengaluru
Mini datathon - BengaluruMini datathon - Bengaluru
Mini datathon - Bengaluru
 
Large Scale Data Mining using Genetics-Based Machine Learning
Large Scale Data Mining using Genetics-Based Machine LearningLarge Scale Data Mining using Genetics-Based Machine Learning
Large Scale Data Mining using Genetics-Based Machine Learning
 
Understanding your Data - Data Analytics Lifecycle and Machine Learning
Understanding your Data - Data Analytics Lifecycle and Machine LearningUnderstanding your Data - Data Analytics Lifecycle and Machine Learning
Understanding your Data - Data Analytics Lifecycle and Machine Learning
 
Data Quality
Data QualityData Quality
Data Quality
 
Clinical Data Classification of alzheimer's disease
Clinical Data Classification of alzheimer's diseaseClinical Data Classification of alzheimer's disease
Clinical Data Classification of alzheimer's disease
 
Cork big data_analytics-de_puy_synthes_datsci_cit
Cork big data_analytics-de_puy_synthes_datsci_citCork big data_analytics-de_puy_synthes_datsci_cit
Cork big data_analytics-de_puy_synthes_datsci_cit
 
Mastering the 80% of Analytics: What Data Scientists Really Do
Mastering the 80% of Analytics: What Data Scientists Really DoMastering the 80% of Analytics: What Data Scientists Really Do
Mastering the 80% of Analytics: What Data Scientists Really Do
 
Measuring Data Quality with DataOps
Measuring Data Quality with DataOpsMeasuring Data Quality with DataOps
Measuring Data Quality with DataOps
 
Saksham Sarode - Building Effective test Data Management in Distributed Envir...
Saksham Sarode - Building Effective test Data Management in Distributed Envir...Saksham Sarode - Building Effective test Data Management in Distributed Envir...
Saksham Sarode - Building Effective test Data Management in Distributed Envir...
 
Data extraction, cleanup &amp; transformation tools 29.1.16
Data extraction, cleanup &amp; transformation tools 29.1.16Data extraction, cleanup &amp; transformation tools 29.1.16
Data extraction, cleanup &amp; transformation tools 29.1.16
 
BAS 250 Lecture 2
BAS 250 Lecture 2BAS 250 Lecture 2
BAS 250 Lecture 2
 
Coping with Data for WHOI JP Students
Coping with Data for WHOI JP StudentsCoping with Data for WHOI JP Students
Coping with Data for WHOI JP Students
 
Royal society of chemistry activities to develop a data repository for chemis...
Royal society of chemistry activities to develop a data repository for chemis...Royal society of chemistry activities to develop a data repository for chemis...
Royal society of chemistry activities to develop a data repository for chemis...
 
Royal society of chemistry activities to develop a data repository for chemis...
Royal society of chemistry activities to develop a data repository for chemis...Royal society of chemistry activities to develop a data repository for chemis...
Royal society of chemistry activities to develop a data repository for chemis...
 
CSU-ACADIS_dataManagement101-20120217
CSU-ACADIS_dataManagement101-20120217CSU-ACADIS_dataManagement101-20120217
CSU-ACADIS_dataManagement101-20120217
 
Data mining Basics and complete description onword
Data mining Basics and complete description onwordData mining Basics and complete description onword
Data mining Basics and complete description onword
 
Data preprocessing ppt1
Data preprocessing ppt1Data preprocessing ppt1
Data preprocessing ppt1
 
Data preprocessing using Machine Learning
Data  preprocessing using Machine Learning Data  preprocessing using Machine Learning
Data preprocessing using Machine Learning
 

More from Venkatesh Prasad Ranganath

SeMA: A Design Methodology for Building Secure Android Apps
SeMA: A Design Methodology for Building Secure Android AppsSeMA: A Design Methodology for Building Secure Android Apps
SeMA: A Design Methodology for Building Secure Android AppsVenkatesh Prasad Ranganath
 
Are free Android app security analysis tools effective in detecting known vul...
Are free Android app security analysis tools effective in detecting known vul...Are free Android app security analysis tools effective in detecting known vul...
Are free Android app security analysis tools effective in detecting known vul...Venkatesh Prasad Ranganath
 
Benchpress: Analyzing Android App Vulnerability Benchmark Suites
Benchpress:  Analyzing Android App Vulnerability Benchmark SuitesBenchpress:  Analyzing Android App Vulnerability Benchmark Suites
Benchpress: Analyzing Android App Vulnerability Benchmark SuitesVenkatesh Prasad Ranganath
 
Behavior Driven Development [10] - Software Testing Techniques (CIS640)
Behavior Driven Development [10] - Software Testing Techniques (CIS640)Behavior Driven Development [10] - Software Testing Techniques (CIS640)
Behavior Driven Development [10] - Software Testing Techniques (CIS640)Venkatesh Prasad Ranganath
 
Code Coverage [9] - Software Testing Techniques (CIS640)
Code Coverage [9] - Software Testing Techniques (CIS640)Code Coverage [9] - Software Testing Techniques (CIS640)
Code Coverage [9] - Software Testing Techniques (CIS640)Venkatesh Prasad Ranganath
 
Equivalence Class Testing [8] - Software Testing Techniques (CIS640)
Equivalence Class Testing [8] - Software Testing Techniques (CIS640)Equivalence Class Testing [8] - Software Testing Techniques (CIS640)
Equivalence Class Testing [8] - Software Testing Techniques (CIS640)Venkatesh Prasad Ranganath
 
Boundary Value Testing [7] - Software Testing Techniques (CIS640)
Boundary Value Testing [7] - Software Testing Techniques (CIS640)Boundary Value Testing [7] - Software Testing Techniques (CIS640)
Boundary Value Testing [7] - Software Testing Techniques (CIS640)Venkatesh Prasad Ranganath
 
Property Based Testing [5] - Software Testing Techniques (CIS640)
Property Based Testing [5] - Software Testing Techniques (CIS640)Property Based Testing [5] - Software Testing Techniques (CIS640)
Property Based Testing [5] - Software Testing Techniques (CIS640)Venkatesh Prasad Ranganath
 
Intro to Python3 [2] - Software Testing Techniques (CIS640)
Intro to Python3 [2] - Software Testing Techniques (CIS640)Intro to Python3 [2] - Software Testing Techniques (CIS640)
Intro to Python3 [2] - Software Testing Techniques (CIS640)Venkatesh Prasad Ranganath
 
Unit testing [4] - Software Testing Techniques (CIS640)
Unit testing [4] - Software Testing Techniques (CIS640)Unit testing [4] - Software Testing Techniques (CIS640)
Unit testing [4] - Software Testing Techniques (CIS640)Venkatesh Prasad Ranganath
 
Testing concepts [3] - Software Testing Techniques (CIS640)
Testing concepts [3] - Software Testing Techniques (CIS640)Testing concepts [3] - Software Testing Techniques (CIS640)
Testing concepts [3] - Software Testing Techniques (CIS640)Venkatesh Prasad Ranganath
 
Introduction [1] - Software Testing Techniques (CIS640)
Introduction [1] - Software Testing Techniques (CIS640)Introduction [1] - Software Testing Techniques (CIS640)
Introduction [1] - Software Testing Techniques (CIS640)Venkatesh Prasad Ranganath
 
Compatibility Testing using Patterns-based Trace Comparison
Compatibility Testing using Patterns-based Trace ComparisonCompatibility Testing using Patterns-based Trace Comparison
Compatibility Testing using Patterns-based Trace ComparisonVenkatesh Prasad Ranganath
 

More from Venkatesh Prasad Ranganath (15)

SeMA: A Design Methodology for Building Secure Android Apps
SeMA: A Design Methodology for Building Secure Android AppsSeMA: A Design Methodology for Building Secure Android Apps
SeMA: A Design Methodology for Building Secure Android Apps
 
Are free Android app security analysis tools effective in detecting known vul...
Are free Android app security analysis tools effective in detecting known vul...Are free Android app security analysis tools effective in detecting known vul...
Are free Android app security analysis tools effective in detecting known vul...
 
Benchpress: Analyzing Android App Vulnerability Benchmark Suites
Benchpress:  Analyzing Android App Vulnerability Benchmark SuitesBenchpress:  Analyzing Android App Vulnerability Benchmark Suites
Benchpress: Analyzing Android App Vulnerability Benchmark Suites
 
Why do Users kill HPC Jobs?
Why do Users kill HPC Jobs?Why do Users kill HPC Jobs?
Why do Users kill HPC Jobs?
 
Behavior Driven Development [10] - Software Testing Techniques (CIS640)
Behavior Driven Development [10] - Software Testing Techniques (CIS640)Behavior Driven Development [10] - Software Testing Techniques (CIS640)
Behavior Driven Development [10] - Software Testing Techniques (CIS640)
 
Code Coverage [9] - Software Testing Techniques (CIS640)
Code Coverage [9] - Software Testing Techniques (CIS640)Code Coverage [9] - Software Testing Techniques (CIS640)
Code Coverage [9] - Software Testing Techniques (CIS640)
 
Equivalence Class Testing [8] - Software Testing Techniques (CIS640)
Equivalence Class Testing [8] - Software Testing Techniques (CIS640)Equivalence Class Testing [8] - Software Testing Techniques (CIS640)
Equivalence Class Testing [8] - Software Testing Techniques (CIS640)
 
Boundary Value Testing [7] - Software Testing Techniques (CIS640)
Boundary Value Testing [7] - Software Testing Techniques (CIS640)Boundary Value Testing [7] - Software Testing Techniques (CIS640)
Boundary Value Testing [7] - Software Testing Techniques (CIS640)
 
Property Based Testing [5] - Software Testing Techniques (CIS640)
Property Based Testing [5] - Software Testing Techniques (CIS640)Property Based Testing [5] - Software Testing Techniques (CIS640)
Property Based Testing [5] - Software Testing Techniques (CIS640)
 
Intro to Python3 [2] - Software Testing Techniques (CIS640)
Intro to Python3 [2] - Software Testing Techniques (CIS640)Intro to Python3 [2] - Software Testing Techniques (CIS640)
Intro to Python3 [2] - Software Testing Techniques (CIS640)
 
Unit testing [4] - Software Testing Techniques (CIS640)
Unit testing [4] - Software Testing Techniques (CIS640)Unit testing [4] - Software Testing Techniques (CIS640)
Unit testing [4] - Software Testing Techniques (CIS640)
 
Testing concepts [3] - Software Testing Techniques (CIS640)
Testing concepts [3] - Software Testing Techniques (CIS640)Testing concepts [3] - Software Testing Techniques (CIS640)
Testing concepts [3] - Software Testing Techniques (CIS640)
 
Introduction [1] - Software Testing Techniques (CIS640)
Introduction [1] - Software Testing Techniques (CIS640)Introduction [1] - Software Testing Techniques (CIS640)
Introduction [1] - Software Testing Techniques (CIS640)
 
Compatibility Testing using Patterns-based Trace Comparison
Compatibility Testing using Patterns-based Trace ComparisonCompatibility Testing using Patterns-based Trace Comparison
Compatibility Testing using Patterns-based Trace Comparison
 
Pattern-based Features
Pattern-based FeaturesPattern-based Features
Pattern-based Features
 

Recently uploaded

Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Celine George
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxiammrhaywood
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...Marc Dusseiller Dusjagr
 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTiammrhaywood
 
Gas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxGas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxDr.Ibrahim Hassaan
 
Final demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxFinal demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxAvyJaneVismanos
 
Blooming Together_ Growing a Community Garden Worksheet.docx
Blooming Together_ Growing a Community Garden Worksheet.docxBlooming Together_ Growing a Community Garden Worksheet.docx
Blooming Together_ Growing a Community Garden Worksheet.docxUnboundStockton
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPCeline George
 
Painted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of IndiaPainted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of IndiaVirag Sontakke
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxOH TEIK BIN
 
Meghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media ComponentMeghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media ComponentInMediaRes1
 
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdfFraming an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdfUjwalaBharambe
 
Pharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfPharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfMahmoud M. Sallam
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...JhezDiaz1
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxmanuelaromero2013
 
AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.arsicmarija21
 

Recently uploaded (20)

Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
 
OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...
 
Gas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxGas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptx
 
Final demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxFinal demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptx
 
ESSENTIAL of (CS/IT/IS) class 06 (database)
ESSENTIAL of (CS/IT/IS) class 06 (database)ESSENTIAL of (CS/IT/IS) class 06 (database)
ESSENTIAL of (CS/IT/IS) class 06 (database)
 
Blooming Together_ Growing a Community Garden Worksheet.docx
Blooming Together_ Growing a Community Garden Worksheet.docxBlooming Together_ Growing a Community Garden Worksheet.docx
Blooming Together_ Growing a Community Garden Worksheet.docx
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERP
 
Painted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of IndiaPainted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of India
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptx
 
Meghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media ComponentMeghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media Component
 
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdfFraming an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
 
Pharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfPharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdf
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
 
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptx
 
AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.
 

Data analytics, a (short) tour

  • 1. Data Analytics A (Short) Tour Venkatesh-Prasad Ranganath http://about.me/rvprasad Click to edit Master title style
  • 2. Is it Analytics or Analysis? Analytics uses analysis to recommend actions or make decisions.
  • 3. Why Data Analysis? Confirm a hypothesis Confirmatory Explore the data Exploratory (EDA)
  • 4. Word of Caution – Case of Killer Potatoes? This is figure 1.5 in the book “Exploring Data” by Ronald K Pearson.
  • 5. Word of Caution – Case of Killer Potatoes? This is figure 1.6 in the book “Exploring Data” by Ronald K Pearson.
  • 6. Typical Data Analytics Work Flow 1. Identify Issue 2. Data Collection, Storage, Representation, and Access 3. Data Cleansing 4. Data Transformation 5. Data Analysis (Processing) 6. Result Validation 7. Result Presentation (Visual Validation) 8. Recommend Action / Make Decision
  • 7. Data Collection – Approaches Observation Monitoring Interviews Surveys
  • 8. Data Collection – Comparing Approaches Observation Interviews Surveys Monitoring Technique Shadowing Conversation Questionnaire Logging Interactive No Yes No No Simple No No Yes Yes Automatable No No Yes Yes Scalable No No Yes Yes Data Size Small Small Medium Huge Data Format Flexible Flexible Rigid Rigid Data Type Qualitative Qualitative Qualitative Quantitative Real Time Analysis No No No Yes Expensive Yes Yes No No
  • 9. Data Collection – Comparing Approaches Observation Interviews Surveys Monitoring What to capture? Flexible Flexible Fixed Fixed How to capture? Flexible Flexible Fixed Fixed Human Subjects Yes Yes Yes No Transcription Yes Yes Yes/No No SnR High High High Low Involves NLP Unlikely Unlikely Likely Likely Kind of Analysis Confirmatory Confirmatory Confirmatory Exploratory Kind of Techniques Statistical Testing Statistical Testing Statistical Testing Machine Learning
  • 10. Data Storage – Choices • Flat Files • Databases • Streaming Data (but there is no storage)
  • 11. Data Storage – Flat Files • Simple • Common / Universal • Inexpensive • Independent of specific technology • Compression friendly • Very few choices • Plain text, CSV, XML, and JSON • Well established • Low level data access APIs • No support for automatic scale out / parallel access • Unoptimized data access • Indices • Columnar storage
  • 12. Data Storage – Databases • High level data access API • Support for automatic scale out / parallel access • Optimized data access • Indices • Columnar storage • Well established • Complex • Niche / Requires experts • Optimization • Distribution • Expensive • Dependent on specific technology • DB controlled compression • Lots of choices • SQL, MySQL, PostgreSQL, Maria, Raven, Couch, Redis, Neo4j, ….
  • 13. Data Storage – Streaming • Well, there is not storage  • Novel • Many streaming data sources • Breaks traditional data analysis algorithms • No access to the entire data set • Too many unknowns • Expertise • Cost • Best practices • Accuracy • Benefits • Deficiencies • Ease of use
  • 14. Data Storage – Algorithms and Necessity • Offline • Online • Streaming • Real-time • Flat Files • Databases • Streaming Data • Do we need fast? • How fast is quick enough? • How often do we need fast? • Is it worth the cost? • Is it worth the loss of accuracy?
  • 15. Data Representation – Structured • Easy to process • One time schema setup cost • Common schema types • CSV, XML, JSON, … • You can cook up your schema • Eases data exploration & analysis • Off-the-shelf techniques to handle data • Requires very little expertise • Ideal with automatic data collection • Ideal for storing quantitative data • Rigid • Changing schema can be hard • Upfront cost to define the schema
  • 16. Data Representation – Unstructured • Flexible • Off-the-shelf techniques to preprocess data but requires expertise • Ideal for manual data collection • Requires lots of preprocessing • Complicates data exploration and analysis • Requires domain expertise • Extracting data semantics is hard • Requires schema recovery *
  • 17. Data Access – Security • Who has access to what parts of the data? • What is the access control policy? • How do we enforce these policies? • What techniques do we employ to enforce these policies? • How do we ensure the policies have been enforced?
  • 18. Data Access – Privacy • Who has access to what parts of the data? • Who has access to what aspects of the data? • How do you ensure the privacy of the source? • What are the access control and anonymization policies? • How do we enforce these policies? • What techniques do we employ to enforce these policies? • How do we ensure the policies have been enforced? • How strong is the anonymization policy? • Is it possible to recover the anonymized information? If so, how hard?
  • 19. Data Scale • Nominal • Male, Female • Equality operation • Ordinal • Very satisfied, satisfied, dissatisfied, and very dissatisfied • Inequalities operations • Interval • Temperature, dates • Addition and subtraction operations • Ratio • Mass, length, duration • Multiplication and division operations
  • 20. Typical Data Analytics Work Flow 1. Identify Issue 2. Data Collection, Storage, Representation, and Access 3. Data Cleansing 4. Data Transformation 5. Data Analysis (Processing) 6. Result Validation 7. Result Presentation (Visual Validation) 8. Recommend Action / Make Decision
  • 21. Data Cleansing Let’s get our hands dirty!! The data set is from UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/).
  • 22. Data Cleansing – Common Issues • Missing values • Extra values • Incorrect format • Encoding • File corruption • Incorrect units • Too much data • Outliers • Inliers
  • 23. Typical Data Analytics Work Flow 1. Identify Issue 2. Data Collection, Storage, Representation, and Access 3. Data Cleansing 4. Data Transformation 5. Data Analysis (Processing) 6. Result Validation 7. Result Presentation (Visual Validation) 8. Recommend Action / Make Decision
  • 24. Data Transformation (Feature Engineering) • Analyze specific aspects of the data • Coarsening data • Discretization • Changing Scale • Normalization
  • 25. Data Transformation (Feature Engineering) • Analyze specific aspects of the data • Coarsening data • Discretization • Changing Scale • Normalization BMI BMI Categories < 18.5 Underweight 18.5 – 24.9 Normal Weight 25 – 29.9 Overweight > 30 Obesity
  • 26. Data Transformation (Feature Engineering) • Analyze specific aspects of the data • Coarsening data • Discretization • Changing Scale • Normalization Actual Weight Normalized 78 0.285 88 0.322 62 0.227 45 0.164
  • 27. Data Transformation (Feature Engineering) • Analyze relations between features of the data • Synthesize new features • Relating existing features • Combining existing features
  • 28. Data Transformation Let’s get out hands dirty!!
  • 29. Data Transformation (Feature Engineering) Keep in mind the following: • Scales • What the permitted operations? • Data Collection • What is the trade-offs in data collection? • Parsimony • Can we get away with simple scales?
  • 30. Typical Data Analytics Work Flow 1. Identify Issue 2. Data Collection, Storage, Representation, and Access 3. Data Cleansing 4. Data Transformation 5. Data Analysis (Processing) 6. Result Validation 7. Result Presentation (Visual Validation) 8. Recommend Action / Make Decision
  • 31. Data Analysis • Features • Attributes of each datum • Labels • Expert’s input about datum • Data sets • Training • Validation • Test • Work flow • Model building (training) • Model tuning and selection (validation) • Error reporting (test)
  • 32. Data Analysis – Models The figure is from the book “Modern Multivariate Statistical Techniques” by Alan Julian Izenman.
  • 33. Typical Data Analytics Work Flow 1. Identify Issue 2. Data Collection, Storage, Representation, and Access 3. Data Cleansing 4. Data Transformation 5. Data Analysis (Processing) 6. Result Validation 7. Result Presentation (Visual Validation) 8. Recommend Action / Make Decision
  • 34. Result Validation – Approaches • Expert Inputs • Cross Validation • K-fold cross validation • 5x2 cross validation • Bootstrapping
  • 35. Result Validation – Basic Terms Classification Actuals X Y X True X (tx) False Y (fy) p = tx + fy Y False X (fx) True Y (ty) n = fx + ty p’ = tx + fx N = p + n Consider a 2-class classification problem.
  • 36. Result Validation – Basic Terms Now, consider X as positive evidence and Y as negative evidence. Classification Actuals X Y X True Positive (tp) False Negative (fn) p = tp + fn Y False Positive (fp) True Negative (tn) n = fp + tn p’ = tp + fp N = p + n
  • 37. Result Validation – Measures error = (fp + fn) / N accuracy = (tp + tn) / N tp-rate = tp / p fp-rate = fp / n sensitivity = tp / p = tp-rate specificity = tn / n = 1 – fp-rate precision = tp / p’ recall = tp / p = tp-rate Classification Actuals X Y X True Positive (tp) False Negative (fn) p = tp + fn Y False Positive (fp) True Negative (tn) n = fp + tn p’ = tp + fp N = p + n
  • 38. Result Validation – ROC (Receiver Operating Characteristics) The figure is from the Wikipedia page about “Receiver Operating Characteristics”.
  • 39. Result Validation – Class Confusion Matrix A B A True A (ta) False A (fa) B False A (fa) True B (tb) A B C D A True A (ta) False B (fb) False C (fc) False D (fd) B False A (fa) True B (tb) False C (fc) False D (fd) C False A (fa) False B (fb) True C (tc) False D (fd) D False A (fa) False B (fb) False C (fc) True D (td) 2 class problem 4 class problem
  • 40. Result Validation – Bias and Variance The figure is from http://scott.fortmann-roe.com/docs/BiasVariance.html.
  • 41. Result Validation – Underfitting & Overfitting The figure is from the book “Modern Multivariate Statistical Techniques” by Alan Julian Izenman.
  • 42. Result Validation Let’s get out hands dirty!!
  • 43. Typical Data Analytics Work Flow 1. Identify Issue 2. Data Collection, Storage, Representation, and Access 3. Data Cleansing 4. Data Transformation 5. Data Analysis (Processing) 6. Result Validation 7. Result Presentation (Visual Validation) 8. Recommend Action / Make Decision
  • 44. Result Presentation (Visual Validation) • Numbers • Central tendencies – mode, median, and mean • Dispersion – range, standard deviation • Five number summary • min, 1st quartile, median (mean), 3rd quartile, max • Margin of error • (Confidence) Interval • Tables • Charts
  • 45. Result Presentation – Box-and-Whisker Plots The figure is from the Wikipedia page about “Box Plot”.
  • 46. Result Presentation – Histograms The figure is from the book “Data Visualization: A Successful Design Process” by Andy Kirk.
  • 47. Result Presentation – Line Graphs The figure is from the book “Data Visualization: A Successful Design Process” by Andy Kirk.
  • 48. Result Presentation – Scatter Plots The figure is from the book “Data Visualization: A Successful Design Process” by Andy Kirk.
  • 49. Result Presentation – Heatmaps The figure is from the book “Data Visualization: A Successful Design Process” by Andy Kirk.
  • 50. Result Presentation – Bubble Chart The figure is from the book “Data Visualization: A Successful Design Process” by Andy Kirk.
  • 51. Result Presentation – Word Cloud The figure is from the book “Data Visualization: A Successful Design Process” by Andy Kirk.
  • 52. Typical Data Analytics Work Flow 1. Identify Issue 2. Data Collection, Storage, Representation, and Access 3. Data Cleansing 4. Data Transformation 5. Data Analysis (Processing) 6. Result Validation 7. Result Presentation (Visual Validation) 8. Recommend Action / Make Decision
  • 53. Now, are you thinking.. • What about identifying the issue/question? • Know where you are going before you start • What about recommending action / making the decision? • Information and knowledge aren’t the same • Are data X tasks really that important and hard? • Garbage in, Garbage out • Aren’t data analysis techniques the most important? • Smart data (structures) and dumb code works better than the other way around
  • 54. Some Personal Observations • Domain Knowledge is crucial • Optimizing analysis • Improving relevance of results • Always prefer • Simple solutions over complex solutions • Fast solutions over slow solutions • Most correct solutions over fully correct solutions (even no solution!!) • If you hear “I want it now”, then say “Really? Please Explain” • Visualization helps only if the results are relevant • Limited data sets can only get you so far • Security and privacy are more important than you think
  • 55. Got any stories from the trenches?