
Guerrilla Analytics: Tactics for Coping with Data Science Reality

This presentation introduces the Guerrilla Analytics Principles: straightforward rules of thumb for doing data science despite disruptions and constraints. It includes several practical examples of how to implement those principles and some ideas for future research.

Published in: Data & Analytics

  1. Guerrilla Analytics: Tactics for Coping with Data Science Reality. Enda Ridge, PhD. 23 February 2015. #GuerrillaAnalytics. Copyright Enda Ridge 2015.
  2. What we are told about Data Science: “Data is the new science. Big data holds the answers.” “the sexy job in the next 10 years will be statisticians” “Data Scientist: The Sexiest Job of the 21st Century” “Information is the oil of the 21st century, and analytics is the combustion engine.” Links on the slide: http://www.gapminder.org/ http://www.statistics.com/data-science-quotes/ https://github.com/mbostock/d3/wiki/Gallery
  3. The Data Science Reality. “Hi, we need an update on the insurance policy classification work. It’s going to the Head of Underwriting this afternoon.” “Um. Which work? Jo and I are trying two different approaches. And Jo’s on holidays. I’ll check my mailbox and send you my spreadsheet from last week.” “Just need the change in uplift since last week.” “Err... the policy population changed with the extra system extract on Tuesday. And we added a bunch of business rules to accommodate that... so we can’t go back to the earlier numbers.”
  4. My Journey: Mechanical Engineer; PhD Computer Science (“Design of Experiments for the Tuning of Algorithms”); Boutique Consultancy; Forensic Data Analytics Senior Manager. Labels on the slide: Constraints; Computation takes time!; Dynamic; Repeatable; Reproducible; Dynamic; Constrained; Dynamic; Constrained; Reproduce; Test; Audit.
  5. What is Data Science? Data, Analytics, Insight.
  6. Common Misconception. Shearer C., The CRISP-DM model: the new blueprint for data mining, J Data Warehousing (2000); 5:13-22.
  7. Project Reality – Dynamic: Data, People, Understanding, Rules, Code.
  8. Project Reality – Constraints: Time, People, Technology, Data.
  9. Project Reality – Transparency: Explainable, Testable, Reproducible, Repeatable.
  10. Guerrilla Analytics. Data: Extraction, Receipt, Loading. Analytics: Transform, Algorithms, Consolidate. Insight: Reporting, Work Products. Disruptions.
  11. Guerrilla Analytics Principles: Maintaining Data Provenance mitigates the effect of disruptions on your work.
  12. Guerrilla Analytics Principles. Principle 1: Space is cheap, confusion is expensive. Principle 2: Prefer simple, visual project structures over heavily documented and project-specific rules. Principle 3: Prefer automation with program code over manual graphical methods. Principle 5: Version control changes to data and program code. Etc.
  13. Guerrilla Analytics. Data: Extraction, Receipt, Loading. Analytics: Transform, Algorithms, Consolidate. Insight: Reporting, Work Products. Disruptions.
  14. Data Receipt. Guerrilla Analytics Environment: Lost data; Multiple copies of data; No supporting information; Local copies of data; Renamed data.
  15. Data Receipt. Guerrilla Analytics Approach: Have 1 data location; Data unique identifiers; Data log; Keep supporting material near its data.
  16. Data Load. Files: Crazy-name spreadsheet 1; Crazy-name spreadsheet 2; Crazy-name spreadsheet 3; FNU810A; A_very_long_named_file_v0.2.1.pdf. Analytics Environment: User_markups; Customer_Table; Finance_Report_v1.0. Guerrilla Environment: Renamed files; Scattered inconsistent locations; Multiple versions of files; Replacements of files.
  17. Data Load. Files: Crazy-name spreadsheet 1; Crazy-name spreadsheet 2; Crazy-name spreadsheet 3; FNU810A; A_very_long_named_file_v0.2.1.pdf. Analytics Environment: Crazy-name spreadsheet 1; Crazy-name spreadsheet 2; Crazy-name spreadsheet 3; FNU810A; A_very_long_named_file_v0.2.1.pdf. Guerrilla Analytics Approach: One-to-one mapping from files to datasets; Keep crazy names; Minimize prep work.
  18. Guerrilla Analytics. Data: Extraction, Receipt, Loading. Analytics: Transform, Algorithms, Consolidate. Insight: Reporting, Work Products. Disruptions.
  19. Guerrilla Analytics Environment: Multiple languages; Many code files; Variety of outputs; Data manipulation on file system; Data manipulation in analytics environment; Combinations of tools.
  20. Analytics: Code. Guerrilla Analytics Environment, folder WP_024: Rates cleaned.SQL; Rates_by_city_v1_FINAL.R; Rates_by_city_v2.R; MAP_POSTCODES.SQL. Guerrilla Analytics Approach, folder WP_024: 010_MAP_POSTCODES.SQL; 030_Rates cleaned.SQL; 050_Rates_by_cityv2.R.
  21. Analytics: Data.
      First table:
      ID | Addr_1 | City
      A | 10 Main St | London
      C | 5 Junct | London
      B | 54 Shop Rd | Dublin
      B | 123 Middle Str. | Galway
      Second table:
      ID | Addr_1 | City
      A | 10 MAIN STREET | London
      B | 54 SHOP ROAD | Dublin
      C | 5 JUNCTION | London
      ... | ... | ...
  22. Analytics: Data.
      First table:
      ID | Addr_1 | City
      A | 10 Main St | London
      C | 5 Junct | London
      B | 54 Shop Rd | Dublin
      B | 123 Middle Str. | Galway
      Second table:
      ID | Addr_1 | Addr_1_cln | City | IS_IN_SCOPE
      A | 10 Main St | 10 MAIN STREET | London | YES
      C | 5 Junct | 5 JUNCTION | London | YES
      B | 54 Shop Rd | 54 SHOP ROAD | Dublin | YES
      B | 123 Middle Str. | 123 MIDDLE STREET | Galway | NO
  23. Guerrilla Analytics. Data: Extraction, Receipt, Loading. Analytics: Transform, Algorithms, Consolidate. Insight: Reporting, Work Products. Disruptions.
  24. Reporting – what is a report?
  25. Reporting – Guerrilla Environment
  26. Reporting – Guerrilla Analytics approach. Labels on the slide: 1, 2, 5. Select min/max of transaction_time: WP_030. Select min/max of customer_age: WP_035. Purchases by type: WP_042.
  27. Guerrilla Analytics. Data: Extraction, Receipt, Loading. Analytics: Transform, Algorithms, Consolidate. Insight: Reporting, Work Products. Disruptions.
  28. Why consolidate? Raw; Duplicates; Customers; Clean_Cust; Deduped; New_dupes; Work Product.
  29. Why consolidate? Raw; Duplicates; Customers; Clean_Cust; Deduped; New_dupes; Duplicates_02; Customers; Duplicates; Deduped; Clean_cust; New_dupes; Work Product.
  30. Consolidating with a Build. Deduped; Clean_cust; New_dupes; Duplicates_02; Duplicates; Customers; Dupes_latest; Cust_Latest. Raw Latest Clean Rules Interface. Version Controlled Code.
  31. Open Questions. Workflows Testing ‘Big Data’ Engineering
  32. Keep in Touch! @Enda_Ridge; GuerrillaAnalytics@gmail.com; www.guerrilla-analytics.net
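
The Data Receipt approach on slide 15 (one data location, unique data identifiers, a data log, supporting material kept near its data) could be sketched roughly as follows. This is only an illustration under assumptions: the `D001`-style identifier format, the `data_log.csv` file name, the `README.txt` note, and the function names are all hypothetical and do not come from the slides.

```python
import csv
import shutil
from datetime import date
from pathlib import Path


def next_data_uid(data_root: Path) -> str:
    """Return the next unused identifier (D001, D002, ...). Hypothetical format."""
    existing = [p for p in data_root.glob("D[0-9][0-9][0-9]") if p.is_dir()]
    return f"D{len(existing) + 1:03d}"


def receive_data(source: Path, data_root: Path, received_from: str, note: str) -> str:
    """Copy a delivered file into the single data location and log its receipt."""
    data_root.mkdir(parents=True, exist_ok=True)
    uid = next_data_uid(data_root)
    target_dir = data_root / uid
    target_dir.mkdir()

    # Keep the delivered file name unchanged so it can be traced to its source.
    shutil.copy2(source, target_dir / source.name)

    # Keep supporting material next to the data it describes.
    (target_dir / "README.txt").write_text(note + "\n", encoding="utf-8")

    # Append one row per delivery to the data log.
    log_path = data_root / "data_log.csv"
    is_new_log = not log_path.exists()
    with log_path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        if is_new_log:
            writer.writerow(["uid", "file_name", "received_from", "received_on", "note"])
        writer.writerow([uid, source.name, received_from, date.today().isoformat(), note])
    return uid
```

A replacement delivery of the same file would get a new identifier and its own folder rather than overwriting the earlier copy, in the spirit of "space is cheap, confusion is expensive".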

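
Slide 20 shows code files in a work product folder renamed with numeric prefixes (010_, 030_, 050_). One plausible reading is that the prefix records execution order; the slide does not say so explicitly, so treat that as an assumption. Under it, a small helper (the name `ordered_scripts` is hypothetical) could list the files in run order:

```python
import re
from pathlib import Path

# A numeric prefix followed by an underscore, e.g. "010_MAP_POSTCODES.SQL".
PREFIX = re.compile(r"^(\d+)_")


def ordered_scripts(work_product_dir: Path) -> list[str]:
    """Return the numbered code files in a work product folder, in prefix order."""
    numbered = []
    for path in work_product_dir.iterdir():
        match = PREFIX.match(path.name)
        if path.is_file() and match:
            numbered.append((int(match.group(1)), path.name))
    # Sort on the integer prefix so that 30 comes before 100.
    return [name for _, name in sorted(numbered)]
```

Files without a numeric prefix are skipped, so stray notes or outputs in the folder do not end up in the run order.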
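
The second table on slide 22 keeps the raw `Addr_1` value, adds a cleaned copy in `Addr_1_cln`, and flags scope in `IS_IN_SCOPE` instead of dropping rows. A minimal sketch of that pattern follows. The abbreviation map and the city-based scope rule are assumptions chosen to reproduce the slide's example rows; the slides do not state the actual cleaning or scoping rules.

```python
# Hypothetical abbreviation map, chosen to match the slide's example rows.
ABBREVIATIONS = {"ST": "STREET", "STR": "STREET", "RD": "ROAD", "JUNCT": "JUNCTION"}


def clean_address(raw: str) -> str:
    """Upper-case an address, drop full stops, and expand known abbreviations."""
    words = raw.upper().replace(".", "").split()
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)


def add_clean_columns(rows, in_scope_cities):
    """Return copies of the rows with Addr_1_cln and IS_IN_SCOPE added.

    The raw Addr_1 value is left untouched and no rows are removed.
    """
    result = []
    for row in rows:
        new_row = dict(row)
        new_row["Addr_1_cln"] = clean_address(row["Addr_1"])
        # Assumed scope rule: a row is in scope if its city is in the given set.
        new_row["IS_IN_SCOPE"] = "YES" if row["City"] in in_scope_cities else "NO"
        result.append(new_row)
    return result


raw_rows = [
    {"ID": "A", "Addr_1": "10 Main St", "City": "London"},
    {"ID": "C", "Addr_1": "5 Junct", "City": "London"},
    {"ID": "B", "Addr_1": "54 Shop Rd", "City": "Dublin"},
    {"ID": "B", "Addr_1": "123 Middle Str.", "City": "Galway"},
]
cleaned_rows = add_clean_columns(raw_rows, in_scope_cities={"London", "Dublin"})
```

Because the raw column and every row survive, a later change to the cleaning rules or to the scope can be applied and compared against the original data.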