Data Analytics
A (Short) Tour
Venkatesh-Prasad Ranganath
http://about.me/rvprasad
Click to edit Master title style
Is it Analytics or Analysis?
Analytics uses analysis to recommend actions
or make decisions.
Why Data Analysis?
Confirm a hypothesis
Confirmatory
Explore the data
Exploratory (EDA)
Word of Caution – Case of Killer Potatoes?
This is figure 1.5 in the book “Exploring Data” by Ronald K Pearson.
Word of Caution – Case of Killer Potatoes?
This is figure 1.6 in the book “Exploring Data” by Ronald K Pearson.
Typical Data Analytics Work Flow
1. Identify Issue
2. Data Collection, Storage, Representation, and Access
3. Data Cleansi...
Data Collection – Approaches
Observation Monitoring
Interviews Surveys
Data Collection – Comparing Approaches
Observation Interviews Surveys Monitoring
Technique Shadowing Conversation Question...
Data Collection – Comparing Approaches
Observation Interviews Surveys Monitoring
What to capture? Flexible Flexible Fixed ...
Data Storage – Choices
• Flat Files
• Databases
• Streaming Data (but there is no storage)
Data Storage – Flat Files
• Simple
• Common / Universal
• Inexpensive
• Independent of specific technology
• Compression f...
Data Storage – Databases
• High level data access API
• Support for automatic scale out /
parallel access
• Optimized data...
Data Storage – Streaming
• Well, there is not storage 
• Novel
• Many streaming data sources
• Breaks traditional data an...
Data Storage – Algorithms and Necessity
• Offline
• Online
• Streaming
• Real-time
• Flat Files
• Databases
• Streaming Da...
Data Representation – Structured
• Easy to process
• One time schema setup cost
• Common schema types
• CSV, XML, JSON, …
...
Data Representation – Unstructured
• Flexible
• Off-the-shelf techniques to preprocess
data but requires expertise
• Ideal...
Data Access – Security
• Who has access to what parts of the data?
• What is the access control policy?
• How do we enforc...
Data Access – Privacy
• Who has access to what parts of the data?
• Who has access to what aspects of the data?
• How do y...
Data Scale
• Nominal
• Male, Female
• Equality operation
• Ordinal
• Very satisfied, satisfied, dissatisfied, and very dis...
Typical Data Analytics Work Flow
1. Identify Issue
2. Data Collection, Storage, Representation, and Access
3. Data Cleansi...
Data Cleansing
Let’s get our hands dirty!!
The data set is from UCI Machine Learning Repository (http://archive.ics.uci.ed...
Data Cleansing – Common Issues
• Missing values
• Extra values
• Incorrect format
• Encoding
• File corruption
• Incorrect...
Typical Data Analytics Work Flow
1. Identify Issue
2. Data Collection, Storage, Representation, and Access
3. Data Cleansi...
Data Transformation (Feature Engineering)
• Analyze specific aspects of the data
• Coarsening data
• Discretization
• Chan...
Data Transformation (Feature Engineering)
• Analyze specific aspects of the data
• Coarsening data
• Discretization
• Chan...
Data Transformation (Feature Engineering)
• Analyze specific aspects of the data
• Coarsening data
• Discretization
• Chan...
Data Transformation (Feature Engineering)
• Analyze relations between features of the data
• Synthesize new features
• Rel...
Data Transformation
Let’s get out hands dirty!!
Data Transformation (Feature Engineering)
Keep in mind the following:
• Scales
• What the permitted operations?
• Data Col...
Typical Data Analytics Work Flow
1. Identify Issue
2. Data Collection, Storage, Representation, and Access
3. Data Cleansi...
Data Analysis
• Features
• Attributes of each datum
• Labels
• Expert’s input about datum
• Data sets
• Training
• Validat...
Data Analysis – Models
The figure is from the book “Modern Multivariate Statistical Techniques” by Alan Julian Izenman.
Typical Data Analytics Work Flow
1. Identify Issue
2. Data Collection, Storage, Representation, and Access
3. Data Cleansi...
Result Validation – Approaches
• Expert Inputs
• Cross Validation
• K-fold cross validation
• 5x2 cross validation
• Boots...
Result Validation – Basic Terms
Classification
Actuals
X Y
X True X (tx) False Y (fy) p = tx + fy
Y False X (fx) True Y (t...
Result Validation – Basic Terms
Now, consider X as positive evidence and Y as negative evidence.
Classification
Actuals
X ...
Result Validation – Measures
error = (fp + fn) / N
accuracy = (tp + tn) / N
tp-rate = tp / p
fp-rate = fp / n
sensitivity ...
Result Validation – ROC (Receiver Operating Characteristics)
The figure is from the Wikipedia page about “Receiver Operati...
Result Validation – Class Confusion Matrix
A B
A True A (ta) False A (fa)
B False A (fa) True B (tb)
A B C D
A True A (ta)...
Result Validation – Bias and Variance
The figure is from http://scott.fortmann-roe.com/docs/BiasVariance.html.
Result Validation – Underfitting & Overfitting
The figure is from the book “Modern Multivariate Statistical Techniques” by...
Result Validation
Let’s get out hands dirty!!
Typical Data Analytics Work Flow
1. Identify Issue
2. Data Collection, Storage, Representation, and Access
3. Data Cleansi...
Result Presentation (Visual Validation)
• Numbers
• Central tendencies – mode, median, and mean
• Dispersion – range, stan...
Result Presentation – Box-and-Whisker Plots
The figure is from the Wikipedia page about “Box Plot”.
Result Presentation – Histograms
The figure is from the book “Data Visualization: A Successful Design Process” by Andy Kir...
Result Presentation – Line Graphs
The figure is from the book “Data Visualization: A Successful Design Process” by Andy Ki...
Result Presentation – Scatter Plots
The figure is from the book “Data Visualization: A Successful Design Process” by Andy ...
Result Presentation – Heatmaps
The figure is from the book “Data Visualization: A Successful Design Process” by Andy Kirk.
Result Presentation – Bubble Chart
The figure is from the book “Data Visualization: A Successful Design Process” by Andy K...
Result Presentation – Word Cloud
The figure is from the book “Data Visualization: A Successful Design Process” by Andy Kir...
Typical Data Analytics Work Flow
1. Identify Issue
2. Data Collection, Storage, Representation, and Access
3. Data Cleansi...
Now, are you thinking..
• What about identifying the issue/question?
• Know where you are going before you start
• What ab...
Some Personal Observations
• Domain Knowledge is crucial
• Optimizing analysis
• Improving relevance of results
• Always p...
Got any stories from the trenches?
Upcoming SlideShare
Loading in...5
×

Data analytics, a (short) tour

928

Published on

An introduction to data analytics that focuses on various tasks involved in a typical data analytics workflow.

Published in: Education, Technology
0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
928
On Slideshare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
35
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide

Data analytics, a (short) tour

  1. 1. Data Analytics A (Short) Tour Venkatesh-Prasad Ranganath http://about.me/rvprasad Click to edit Master title style
  2. 2. Is it Analytics or Analysis? Analytics uses analysis to recommend actions or make decisions.
  3. 3. Why Data Analysis? Confirm a hypothesis Confirmatory Explore the data Exploratory (EDA)
  4. 4. Word of Caution – Case of Killer Potatoes? This is figure 1.5 in the book “Exploring Data” by Ronald K Pearson.
  5. 5. Word of Caution – Case of Killer Potatoes? This is figure 1.6 in the book “Exploring Data” by Ronald K Pearson.
  6. 6. Typical Data Analytics Work Flow 1. Identify Issue 2. Data Collection, Storage, Representation, and Access 3. Data Cleansing 4. Data Transformation 5. Data Analysis (Processing) 6. Result Validation 7. Result Presentation (Visual Validation) 8. Recommend Action / Make Decision
  7. 7. Data Collection – Approaches Observation Monitoring Interviews Surveys
  8. 8. Data Collection – Comparing Approaches Observation Interviews Surveys Monitoring Technique Shadowing Conversation Questionnaire Logging Interactive No Yes No No Simple No No Yes Yes Automatable No No Yes Yes Scalable No No Yes Yes Data Size Small Small Medium Huge Data Format Flexible Flexible Rigid Rigid Data Type Qualitative Qualitative Qualitative Quantitative Real Time Analysis No No No Yes Expensive Yes Yes No No
  9. 9. Data Collection – Comparing Approaches Observation Interviews Surveys Monitoring What to capture? Flexible Flexible Fixed Fixed How to capture? Flexible Flexible Fixed Fixed Human Subjects Yes Yes Yes No Transcription Yes Yes Yes/No No SnR High High High Low Involves NLP Unlikely Unlikely Likely Likely Kind of Analysis Confirmatory Confirmatory Confirmatory Exploratory Kind of Techniques Statistical Testing Statistical Testing Statistical Testing Machine Learning
  10. 10. Data Storage – Choices • Flat Files • Databases • Streaming Data (but there is no storage)
  11. 11. Data Storage – Flat Files • Simple • Common / Universal • Inexpensive • Independent of specific technology • Compression friendly • Very few choices • Plain text, CSV, XML, and JSON • Well established • Low level data access APIs • No support for automatic scale out / parallel access • Unoptimized data access • Indices • Columnar storage
  12. 12. Data Storage – Databases • High level data access API • Support for automatic scale out / parallel access • Optimized data access • Indices • Columnar storage • Well established • Complex • Niche / Requires experts • Optimization • Distribution • Expensive • Dependent on specific technology • DB controlled compression • Lots of choices • SQL, MySQL, PostgreSQL, Maria, Raven, Couch, Redis, Neo4j, ….
  13. 13. Data Storage – Streaming • Well, there is not storage  • Novel • Many streaming data sources • Breaks traditional data analysis algorithms • No access to the entire data set • Too many unknowns • Expertise • Cost • Best practices • Accuracy • Benefits • Deficiencies • Ease of use
  14. 14. Data Storage – Algorithms and Necessity • Offline • Online • Streaming • Real-time • Flat Files • Databases • Streaming Data • Do we need fast? • How fast is quick enough? • How often do we need fast? • Is it worth the cost? • Is it worth the loss of accuracy?
  15. 15. Data Representation – Structured • Easy to process • One time schema setup cost • Common schema types • CSV, XML, JSON, … • You can cook up your schema • Eases data exploration & analysis • Off-the-shelf techniques to handle data • Requires very little expertise • Ideal with automatic data collection • Ideal for storing quantitative data • Rigid • Changing schema can be hard • Upfront cost to define the schema
  16. 16. Data Representation – Unstructured • Flexible • Off-the-shelf techniques to preprocess data but requires expertise • Ideal for manual data collection • Requires lots of preprocessing • Complicates data exploration and analysis • Requires domain expertise • Extracting data semantics is hard • Requires schema recovery *
  17. 17. Data Access – Security • Who has access to what parts of the data? • What is the access control policy? • How do we enforce these policies? • What techniques do we employ to enforce these policies? • How do we ensure the policies have been enforced?
  18. 18. Data Access – Privacy • Who has access to what parts of the data? • Who has access to what aspects of the data? • How do you ensure the privacy of the source? • What are the access control and anonymization policies? • How do we enforce these policies? • What techniques do we employ to enforce these policies? • How do we ensure the policies have been enforced? • How strong is the anonymization policy? • Is it possible to recover the anonymized information? If so, how hard?
  19. 19. Data Scale • Nominal • Male, Female • Equality operation • Ordinal • Very satisfied, satisfied, dissatisfied, and very dissatisfied • Inequalities operations • Interval • Temperature, dates • Addition and subtraction operations • Ratio • Mass, length, duration • Multiplication and division operations
  20. 20. Typical Data Analytics Work Flow 1. Identify Issue 2. Data Collection, Storage, Representation, and Access 3. Data Cleansing 4. Data Transformation 5. Data Analysis (Processing) 6. Result Validation 7. Result Presentation (Visual Validation) 8. Recommend Action / Make Decision
  21. 21. Data Cleansing Let’s get our hands dirty!! The data set is from UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/).
  22. 22. Data Cleansing – Common Issues • Missing values • Extra values • Incorrect format • Encoding • File corruption • Incorrect units • Too much data • Outliers • Inliers
  23. 23. Typical Data Analytics Work Flow 1. Identify Issue 2. Data Collection, Storage, Representation, and Access 3. Data Cleansing 4. Data Transformation 5. Data Analysis (Processing) 6. Result Validation 7. Result Presentation (Visual Validation) 8. Recommend Action / Make Decision
  24. 24. Data Transformation (Feature Engineering) • Analyze specific aspects of the data • Coarsening data • Discretization • Changing Scale • Normalization
  25. 25. Data Transformation (Feature Engineering) • Analyze specific aspects of the data • Coarsening data • Discretization • Changing Scale • Normalization BMI BMI Categories < 18.5 Underweight 18.5 – 24.9 Normal Weight 25 – 29.9 Overweight > 30 Obesity
  26. 26. Data Transformation (Feature Engineering) • Analyze specific aspects of the data • Coarsening data • Discretization • Changing Scale • Normalization Actual Weight Normalized 78 0.285 88 0.322 62 0.227 45 0.164
  27. 27. Data Transformation (Feature Engineering) • Analyze relations between features of the data • Synthesize new features • Relating existing features • Combining existing features
  28. 28. Data Transformation Let’s get out hands dirty!!
  29. 29. Data Transformation (Feature Engineering) Keep in mind the following: • Scales • What the permitted operations? • Data Collection • What is the trade-offs in data collection? • Parsimony • Can we get away with simple scales?
  30. 30. Typical Data Analytics Work Flow 1. Identify Issue 2. Data Collection, Storage, Representation, and Access 3. Data Cleansing 4. Data Transformation 5. Data Analysis (Processing) 6. Result Validation 7. Result Presentation (Visual Validation) 8. Recommend Action / Make Decision
  31. 31. Data Analysis • Features • Attributes of each datum • Labels • Expert’s input about datum • Data sets • Training • Validation • Test • Work flow • Model building (training) • Model tuning and selection (validation) • Error reporting (test)
  32. 32. Data Analysis – Models The figure is from the book “Modern Multivariate Statistical Techniques” by Alan Julian Izenman.
  33. 33. Typical Data Analytics Work Flow 1. Identify Issue 2. Data Collection, Storage, Representation, and Access 3. Data Cleansing 4. Data Transformation 5. Data Analysis (Processing) 6. Result Validation 7. Result Presentation (Visual Validation) 8. Recommend Action / Make Decision
  34. 34. Result Validation – Approaches • Expert Inputs • Cross Validation • K-fold cross validation • 5x2 cross validation • Bootstrapping
  35. 35. Result Validation – Basic Terms Classification Actuals X Y X True X (tx) False Y (fy) p = tx + fy Y False X (fx) True Y (ty) n = fx + ty p’ = tx + fx N = p + n Consider a 2-class classification problem.
  36. 36. Result Validation – Basic Terms Now, consider X as positive evidence and Y as negative evidence. Classification Actuals X Y X True Positive (tp) False Negative (fn) p = tp + fn Y False Positive (fp) True Negative (tn) n = fp + tn p’ = tp + fp N = p + n
  37. 37. Result Validation – Measures error = (fp + fn) / N accuracy = (tp + tn) / N tp-rate = tp / p fp-rate = fp / n sensitivity = tp / p = tp-rate specificity = tn / n = 1 – fp-rate precision = tp / p’ recall = tp / p = tp-rate Classification Actuals X Y X True Positive (tp) False Negative (fn) p = tp + fn Y False Positive (fp) True Negative (tn) n = fp + tn p’ = tp + fp N = p + n
  38. 38. Result Validation – ROC (Receiver Operating Characteristics) The figure is from the Wikipedia page about “Receiver Operating Characteristics”.
  39. 39. Result Validation – Class Confusion Matrix A B A True A (ta) False A (fa) B False A (fa) True B (tb) A B C D A True A (ta) False B (fb) False C (fc) False D (fd) B False A (fa) True B (tb) False C (fc) False D (fd) C False A (fa) False B (fb) True C (tc) False D (fd) D False A (fa) False B (fb) False C (fc) True D (td) 2 class problem 4 class problem
  40. 40. Result Validation – Bias and Variance The figure is from http://scott.fortmann-roe.com/docs/BiasVariance.html.
  41. 41. Result Validation – Underfitting & Overfitting The figure is from the book “Modern Multivariate Statistical Techniques” by Alan Julian Izenman.
  42. 42. Result Validation Let’s get out hands dirty!!
  43. 43. Typical Data Analytics Work Flow 1. Identify Issue 2. Data Collection, Storage, Representation, and Access 3. Data Cleansing 4. Data Transformation 5. Data Analysis (Processing) 6. Result Validation 7. Result Presentation (Visual Validation) 8. Recommend Action / Make Decision
  44. 44. Result Presentation (Visual Validation) • Numbers • Central tendencies – mode, median, and mean • Dispersion – range, standard deviation • Five number summary • min, 1st quartile, median (mean), 3rd quartile, max • Margin of error • (Confidence) Interval • Tables • Charts
  45. 45. Result Presentation – Box-and-Whisker Plots The figure is from the Wikipedia page about “Box Plot”.
  46. 46. Result Presentation – Histograms The figure is from the book “Data Visualization: A Successful Design Process” by Andy Kirk.
  47. 47. Result Presentation – Line Graphs The figure is from the book “Data Visualization: A Successful Design Process” by Andy Kirk.
  48. 48. Result Presentation – Scatter Plots The figure is from the book “Data Visualization: A Successful Design Process” by Andy Kirk.
  49. 49. Result Presentation – Heatmaps The figure is from the book “Data Visualization: A Successful Design Process” by Andy Kirk.
  50. 50. Result Presentation – Bubble Chart The figure is from the book “Data Visualization: A Successful Design Process” by Andy Kirk.
  51. 51. Result Presentation – Word Cloud The figure is from the book “Data Visualization: A Successful Design Process” by Andy Kirk.
  52. 52. Typical Data Analytics Work Flow 1. Identify Issue 2. Data Collection, Storage, Representation, and Access 3. Data Cleansing 4. Data Transformation 5. Data Analysis (Processing) 6. Result Validation 7. Result Presentation (Visual Validation) 8. Recommend Action / Make Decision
  53. 53. Now, are you thinking.. • What about identifying the issue/question? • Know where you are going before you start • What about recommending action / making the decision? • Information and knowledge aren’t the same • Are data X tasks really that important and hard? • Garbage in, Garbage out • Aren’t data analysis techniques the most important? • Smart data (structures) and dumb code works better than the other way around
  54. 54. Some Personal Observations • Domain Knowledge is crucial • Optimizing analysis • Improving relevance of results • Always prefer • Simple solutions over complex solutions • Fast solutions over slow solutions • Most correct solutions over fully correct solutions (even no solution!!) • If you hear “I want it now”, then say “Really? Please Explain” • Visualization helps only if the results are relevant • Limited data sets can only get you so far • Security and privacy are more important than you think
  55. 55. Got any stories from the trenches?
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×