
Data Science Patterns: Preparing Data for Agile Data Science (BrightTalk Webinar)

Are you a data scientist working on a project with constantly changing requirements, flawed and changing data, and other disruptions? Guerrilla Analytics can help.

The key to a high performing Guerrilla Analytics team is its ability to recognise common data preparation patterns and quickly implement them in flexible, defensive data sets.

You will learn about:
* Guerrilla Analytics: a brief introduction to what it is and why you need it for your agile data science ambitions
* Data Science Patterns: what they are and how they enable agile data science
* A walk-through of some common patterns used in real projects

Published in: Data & Analytics

  1. Data Science Patterns: Preparing Data for Agile Data Science
  2. What You Will Learn
     • Why you must identify and mitigate disruptions in projects
     • What Data Science patterns are and how to use them effectively
     How this will help you
     • Data Scientists: you need to ‘think in patterns’
     • Developers: you will productionise these patterns
     • Managers and Directors: you need this capability in a high performing team
     Copyright Enda Ridge 2015 #GuerrillaAnalytics http://guerrilla-analytics.net
  3. What I’ve Learned
     Timeline: 2004, 2008, 2010, 2012, 2015
     • PhD ‘Design of Experiments for Tuning Algorithms’
     • Boutique Consultancy
     • Forensic Data Analytics
     • Senior Manager Professional Services
     • Head of Algorithms
     No matter the industry, teams were always plagued by the same problem: time was wasted preparing data and revisiting data instead of delivering real Data Science value.
  4. Teams Need ‘Guerrilla Analytics’
     Data: • Extraction • Receipt • Loading
     Analytics: • Transform • Algorithms • Consolidate
     Insight: • Reporting • Work Products
     Disruptions
  5. Solution: Maintain Data Provenance
     Data, Code, Business Domain
  6. Agile Data Preparation Capability
     Agility
     1. Simple Conventions
     2. Supporting Tools
     3. Recognize & Implement Patterns
  7. Data: What It Looks Like, What It Should Look Like
  8. What Raw Data Looks Like
     Relational Data
       Customer
         firstName  lastName  age  addressID
         John       Smith     25   340
         Jane       Doe       36   158
       Address
         addressID  Street Address  City      State  postCode
         340        21 2nd Street   New York  NY     10021
         341        Main Street     Boston    MA     34041
     JSON
       {
         "firstName": "John",
         "lastName": "Smith",
         "age": 25,
         "address": {
           "streetAddress": "21 2nd Street",
           "city": "New York",
           "state": "NY",
           "postalCode": "10021"
         },
         "phoneNumber": [
           { "type": "home", "number": "212 555-1234" },
           { "type": "fax", "number": "646 555-4567" }
         ]
       }
  9. What Raw Data Looks Like: Machine Data
     123.123.123.123 - - [26/Apr/2000:00:23:48 -0400] "GET /pics/wpaper.gif HTTP/1.0" 200 6248 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"
     123.123.123.123 - - [26/Apr/2000:00:23:47 -0400] "GET /asctortf/ HTTP/1.0" 200 8130 "http://search.netscape.com/Computers/Data_Formats/Document/Text/RTF" "Mozilla/4.05 (Macintosh; I; PPC)"
     123.123.123.123 - - [26/Apr/2000:00:23:48 -0400] "GET /pics/5star2000.gif HTTP/1.0" 200 4005 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"
     123.123.123.123 - - [26/Apr/2000:00:23:50 -0400] "GET /pics/5star.gif HTTP/1.0" 200 1031 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"
     123.123.123.123 - - [26/Apr/2000:00:23:51 -0400] "GET /pics/a2hlogo.jpg HTTP/1.0" 200 4282 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"
  10. Data Scientists Need Data To Look Like This
      Artist        Track           Week  Date        Rank
      2 Pac         Baby Don’t Cry  1     2000-02-26  87
      2 Pac         Baby Don’t Cry  2     2000-03-02  82
      2 Pac         Baby Don’t Cry  3     2000-03-11  72
      2 Pac         Baby Don’t Cry  4     2000-03-18  77
      2 Pac         Baby Don’t Cry  5     2000-03-25  87
      2 Pac         Baby Don’t Cry  6     2000-04-01  94
      2 Pac         Baby Don’t Cry  7     2000-04-08  99
      3 Doors Down  Kryptonite      1     2000-04-08  68
      3 Doors Down  Kryptonite      2     2000-04-15  67
      3 Doors Down  Kryptonite      3     2000-04-22  66
      • One row per observation
      • One variable per column
      ‘Tidy Data’, H. Wickham, Journal of Statistical Software 2014
  11. Data Scientists Need Data To Look Like This
      (Table as on slide 10.)
      • Easier to describe relationships between variables (columns) than between rows
      Example: 2000-04-01 is the 6th week
      ‘Tidy Data’, H. Wickham, Journal of Statistical Software 2014
  12. Data Scientists Need Data To Look Like This
      (Table as on slide 10.)
      • Easier to describe relationships between variables (columns) than between rows
      • Easier to do comparisons between groups of observations than between groups of columns
      Examples: min, max, first, Nth, average, median
      ‘Tidy Data’, H. Wickham, Journal of Statistical Software 2014
  13. What Data Scientists Need Data To Look Like
      (Table as on slide 10.)
      • Variables organized by role
      • Experiment design (fixed) on left
      • Measurements on right
      • De-normalised inefficiencies are OK!
      ‘Tidy Data’, H. Wickham, Journal of Statistical Software 2014
  14. Patterns
      Architecture, Software, Data Science?
      Patterns are “Recurring solutions to common problems”
  15. Patterns: ‘Recurring solutions to common problems’
      Joining Data: Collecting, Unique ID, Map rename, Fuzzy join, Stacking
      Transformation: Duplicates, Outliers, Sampling
      Tidying Data: Sort, Filter, Derived variables, Aggregations, Pivot and unpivot, Roll and unroll, Previous/Next N, Split-Apply-Combine
      Pattern Matching: Regular Expressions
  16. Joining Patterns
      Collecting, Unique ID, Map rename, Fuzzy join, Stacking
  17. Joining Pattern: Collecting Datasets
      Situation: log files
      Schema: 2015-10-01.log, 2015-10-02.log, 2015-10-03.log, 2015-10-04.log, 2015-10-05.log, 2015-10-06.log, 2015-10-07.log, …
      Capability
      • Pull datasets by name (if you have a convention)
      • Pull datasets by content
      • Index and search
  18. Joining Pattern: Collecting Datasets
      Situation: log files
      Schema: 2015-10-01.log, 2015-10-02.log, 2015-10-03.log, 2015-10-04.log, 2015-10-05.log, 2015-10-06.log, 2015-10-07.log, …
      Capability
      • Pull datasets by name (if you have a convention)
      • Pull datasets by content
      • Index and search
      Benefit
      • Sampling: test and train
      • Experimenting: factors
      • Exploring: what’s in there?
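
The ‘pull datasets by name’ capability on slides 17 and 18 could be sketched roughly as follows. This assumes the daily logs follow the YYYY-MM-DD.log naming convention shown on the slide; the function name `collect_by_date` and the extra file `notes.txt` are illustrative and not part of the talk.

```python
import re
from datetime import date

# Assumed convention (from the slide): one log file per day, named YYYY-MM-DD.log.
LOG_NAME = re.compile(r"^(\d{4})-(\d{2})-(\d{2})\.log$")


def collect_by_date(filenames, start, end):
    """Return the file names whose embedded date falls within [start, end]."""
    selected = []
    for name in filenames:
        match = LOG_NAME.match(name)
        if match is None:
            continue  # ignore files that do not follow the convention
        year, month, day = (int(part) for part in match.groups())
        if start <= date(year, month, day) <= end:
            selected.append(name)
    return sorted(selected)


files = ["2015-10-01.log", "2015-10-02.log", "2015-10-03.log", "notes.txt"]
picked = collect_by_date(files, date(2015, 10, 2), date(2015, 10, 3))
print(picked)
```

Because the date is part of the name, a sample (for example a train and test split by day) can be selected without opening any file.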
  19. Joining Pattern: Unique IDs
      Situation: data refreshes
      WEEK 1
        Day_ID  date        Amt      Act
        3477    2014-03-16  150,000  SETTLE
        4598    2014-03-17  45,000   AMEND
        …       …           …        …
      WEEK 3.5
        Day_ID  date        Amt      Act
        3477    2014-03-16  150,000  SETTLE
        4598    2014-03-17  45,001   AMEND
        …       …           …        …
      Capability
      Need to uniquely identify records, even when IDs exist in the data
      • Hash functions turn large amounts of data into a ‘unique’ string
      • MD5(Guerrilla Analytics) = 3b04a8085df05752e24c095f036c44f3
      • MD5(guerrilla analytics) = 8f1438b18748981180e10b8c1365e4d9
  20. Joining Pattern: Unique IDs
      WEEK 1
        Day_ID  date        Amt      Act     Hash_id
        3477    2014-03-16  150,000  SETTLE  244072c9f78f59f7ca0ca93426db98da
        4598    2014-03-17  45,000   AMEND   613b4ddfc4db2436e8b8deda26bc3c25
        …       …           …        …
      WEEK 3.5
        Day_ID  date        Amt      Act     Hash_id
        3477    2014-03-16  150,000  SETTLE  244072c9f78f59f7ca0ca93426db98da
        4598    2014-03-17  45,001   AMEND   03a0d5e5646bfe60fce679a87ef4cd34
        …       …           …        …
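
A minimal sketch of the hash ID idea on slides 19 and 20, using Python's standard `hashlib`. How the slides serialise a row before hashing is not stated, so the separator and the helper name `row_hash` are assumptions, and the digests produced here will not match the Hash_id values on the slide.

```python
import hashlib


def row_hash(row, columns):
    """Hash the chosen column values of a record into one MD5 hex digest."""
    # Assumed serialisation: values joined with a separator unlikely to occur
    # in the data, so that ("ab", "c") and ("a", "bc") hash differently.
    joined = "\x1f".join(str(row[col]) for col in columns)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


columns = ["Day_ID", "date", "Amt", "Act"]
week_1 = {"Day_ID": 4598, "date": "2014-03-17", "Amt": 45000, "Act": "AMEND"}
week_3 = {"Day_ID": 4598, "date": "2014-03-17", "Amt": 45001, "Act": "AMEND"}

# Same Day_ID in both refreshes, but the changed amount yields a different hash.
changed = row_hash(week_1, columns) != row_hash(week_3, columns)
print(changed)
```

Comparing hashes between refreshes flags records whose content changed even though the supplied ID stayed the same.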
  21. Joining Pattern: Map rename
      Situation: lots of renaming
      Original column names
        Day_ID  Cust        Amt      Act
        3477    2014-03-16  150,000  SETTLE
        4598    2014-03-17  45,000   AMEND
        …       …           …        …
      Renamed columns
        id      customer    amount   event
        3477    2014-03-16  150,000  SETTLE
        4598    2014-03-17  45,000   AMEND
        …       …           …        …
  22. Joining Pattern: Map rename
      Situation: lots of renaming
      Original column names
        Day_ID  Cust        Amt      Act
        3477    2014-03-16  150,000  SETTLE
        4598    2014-03-17  45,000   AMEND
        …       …           …        …
      Renamed columns
        id      customer    amount   event
        3477    2014-03-16  150,000  SETTLE
        4598    2014-03-17  45,000   AMEND
        …       …           …        …
      Pattern
        dataset  from    to
        trades   Day_ID  id
        trades   Amt     amount
        …        …       …
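
One way the mapping table on slide 22 might drive the renaming is sketched below. Only the two mapping rows visible on the slide are used, so the other columns are left unchanged; the helper name `rename_columns` is hypothetical.

```python
# Mapping table in the shape shown on the slide: (dataset, from, to).
# Only the two rows visible on the slide are included here.
RENAMES = [
    ("trades", "Day_ID", "id"),
    ("trades", "Amt", "amount"),
]


def rename_columns(dataset, row, renames):
    """Rename a record's keys using every mapping row for the given dataset."""
    lookup = {old: new for ds, old, new in renames if ds == dataset}
    # Columns without a mapping row keep their original name.
    return {lookup.get(key, key): value for key, value in row.items()}


row = {"Day_ID": 3477, "Cust": "2014-03-16", "Amt": "150,000", "Act": "SETTLE"}
renamed = rename_columns("trades", row, RENAMES)
print(renamed)
```

Keeping the renames as data rather than code means a new dataset or a changed column name only needs a new mapping row.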
  23. Transformation Patterns
      Duplicates, Outliers, Sampling
  24. Transformation Pattern: Duplicates
      Situation: repeated data. Can’t decide what to remove
      Artist        Track           Week  Date        Rank
      2 Pac         Baby Don’t Cry  1     2000-02-26  87
      2 Pac         Baby Don’t Cry  2     2000-03-02  82
      2 Pac         Baby Don’t Cry  2     2000-03-02  82
      2 Pac         Baby Don’t Cry  3     2000-03-11  72
      2 Pac         Baby Don’t Cry  4     2000-03-18  77
      2 Pac         Baby Don’t Cry  5     2000-03-25  87
      2 Pac         Baby Don’t Cry  6     2000-04-01  94
      2 Pac         Baby Don’t Cry  7     2000-04-08  99
      3 Doors Down  Kryptonite      1     2000-03-02  82
      Capability
      • Tag repeating records
      • Hold out and review in critical applications
      • Tag records that repeat across arbitrary columns
  25. Transformation Pattern: Duplicates
      Artist        Track           Week  Date        Rank  Dupe_Full_id
      2 Pac         Baby Don’t Cry  1     2000-02-26  87    1
      2 Pac         Baby Don’t Cry  2     2000-03-02  82    2
      2 Pac         Baby Don’t Cry  2     2000-03-02  82    2
      2 Pac         Baby Don’t Cry  3     2000-03-11  72    3
      2 Pac         Baby Don’t Cry  4     2000-03-18  77    4
      2 Pac         Baby Don’t Cry  5     2000-03-25  87    5
      2 Pac         Baby Don’t Cry  6     2000-04-01  94    6
      2 Pac         Baby Don’t Cry  7     2000-04-08  99    7
      3 Doors Down  Kryptonite      1     2000-03-02  82    8
      Give duplicate groups an ID. Don’t delete!
  26. Transformation Pattern: Duplicates
      Artist        Track           Week  Date        Rank  Dupe_Full_id  Dupe_rank_date
      2 Pac         Baby Don’t Cry  1     2000-02-26  87    1             1
      2 Pac         Baby Don’t Cry  2     2000-03-02  82    2             2
      2 Pac         Baby Don’t Cry  2     2000-03-02  82    2             2
      2 Pac         Baby Don’t Cry  3     2000-03-11  72    3             3
      2 Pac         Baby Don’t Cry  4     2000-03-18  77    4             4
      2 Pac         Baby Don’t Cry  5     2000-03-25  87    5             5
      2 Pac         Baby Don’t Cry  6     2000-04-01  94    6             6
      2 Pac         Baby Don’t Cry  7     2000-04-08  99    7             7
      3 Doors Down  Kryptonite      1     2000-03-02  82    8             2
      Give multiple duplicate groups their own IDs.
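
The tagging approach on slides 24 to 26 could look something like this sketch. It uses a four-row subset of the slide's table, so the group IDs are numbered differently from the slide, and the helper name `tag_duplicates` is an assumption.

```python
def tag_duplicates(rows, columns, tag):
    """Add a group ID under `tag`; rows sharing values in `columns` get the same ID."""
    group_ids = {}
    tagged = []
    for row in rows:
        key = tuple(row[col] for col in columns)
        # IDs are assigned in order of first appearance, starting at 1.
        group_id = group_ids.setdefault(key, len(group_ids) + 1)
        tagged.append({**row, tag: group_id})  # nothing is deleted, only tagged
    return tagged


rows = [
    {"Artist": "2 Pac", "Track": "Baby Don't Cry", "Week": 1, "Date": "2000-02-26", "Rank": 87},
    {"Artist": "2 Pac", "Track": "Baby Don't Cry", "Week": 2, "Date": "2000-03-02", "Rank": 82},
    {"Artist": "2 Pac", "Track": "Baby Don't Cry", "Week": 2, "Date": "2000-03-02", "Rank": 82},
    {"Artist": "3 Doors Down", "Track": "Kryptonite", "Week": 1, "Date": "2000-03-02", "Rank": 82},
]

all_columns = ["Artist", "Track", "Week", "Date", "Rank"]
tagged = tag_duplicates(rows, all_columns, "Dupe_Full_id")
tagged = tag_duplicates(tagged, ["Rank", "Date"], "Dupe_rank_date")
for row in tagged:
    print(row["Artist"], row["Dupe_Full_id"], row["Dupe_rank_date"])
```

Each definition of "duplicate" gets its own tag column, so the decision about which records to drop can be reviewed later rather than baked into the data.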
  27. Pattern Matching
      Regular Expressions
  28. Pattern Matching
      Situation: getting content from large amounts of text
      123.123.123.123 - - [26/Apr/2000:00:23:48 -0400] "GET /pics/wpaper.gif HTTP/1.0" 200 6248 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"
      Capability: Find and extract arbitrary groups of text
      Extracted columns: ip | datetime | verb | target | return | Etc | etc
      Extracted values: 123.123.123.123 | 26/Apr/2000:00:23:48 -0400 | GET | /pics/wpaper.gif | HTTP/1.0 | 200 | http://www.jafsoft.com/asctortf/
  29. Pattern Matching: Regular Expressions
      Situation: getting content from large amounts of text

      123.123.123.123 - - [26/Apr/2000:00:23:48 -0400] "GET /pics/wpaper.gif HTTP/1.0" 200 6248 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"

      ip               datetime                    verb  target                     return  etc.
      123.123.123.123  26/Apr/2000:00:23:48 -0400  GET   /pics/wpaper.gif HTTP/1.0  200     http://www.jafsoft.com/asctortf/

      From beginning of line, give me: 1 to 3 digits, immediately followed by a dot, immediately followed by 1 to 3 digits ... etc., up until I encounter the first “ - -”

      Regular Expression:
      /^(\S+) \S+ \S+ \[([^\]]+)\] "([A-Z]+)[^"]*" \d+ \d+ "[^"]*" "([^"]*)"$/m
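As a rough illustration of the extraction described on this slide, the sketch below parses the same log line with Python's standard `re` module. The pattern is a variant written for this example, not the slide's exact expression: the group names (`ip`, `datetime`, `verb`, `target`, `status`, `size`, `referrer`, `agent`) and the helper `parse_log_line` are assumptions chosen to mirror the column headings.

```python
import re

# The sample log line from the slide.
LOG_LINE = (
    '123.123.123.123 - - [26/Apr/2000:00:23:48 -0400] '
    '"GET /pics/wpaper.gif HTTP/1.0" 200 6248 '
    '"http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"'
)

# Named groups make each extracted field self-describing.
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ '                      # client address, two ignored fields
    r'\[(?P<datetime>[^\]]+)\] '                  # timestamp inside square brackets
    r'"(?P<verb>[A-Z]+) (?P<target>\S+) [^"]*" '  # request line: verb, target, protocol
    r'(?P<status>\d+) (?P<size>\d+) '             # return code and response size
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"$'  # referrer and user agent
)


def parse_log_line(line):
    """Return a dict of extracted fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


fields = parse_log_line(LOG_LINE)
```

Returning `None` for non-matching lines lets the caller hold those lines out for review instead of silently losing them.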
  30. Tidying Data Patterns
      Sort
      Filter
      Derived variables
      Aggregations
      Pivot and unpivot
      Roll and unroll
      Previous/Next N
      Split-Apply-Combine
  31. Tidying Data Pattern: Split-apply-combine

      Artist        Track           Week  Date        Rank
      2 Pac         Baby Don’t Cry  1     2000-02-26  87
      2 Pac         Baby Don’t Cry  2     2000-03-02  82
      2 Pac         Baby Don’t Cry  3     2000-03-11  72
      2 Pac         Baby Don’t Cry  4     2000-03-18  77
      2 Pac         Baby Don’t Cry  5     2000-03-25  87
      2 Pac         Baby Don’t Cry  6     2000-04-01  94
      2 Pac         Baby Don’t Cry  7     2000-04-08  99
      3 Doors Down  Kryptonite      1     2000-04-08  68
      3 Doors Down  Kryptonite      2     2000-04-15  67
      3 Doors Down  Kryptonite      3     2000-04-22  66

      Capability: Apply arbitrary functions to arbitrary groups
      Example: What was each artist’s lowest rank per month (i.e., their best track)?
  32. Split-apply-combine: SPLIT

      Artist        Track           Week  Date        Rank
      2 Pac         Baby Don’t Cry  1     2000-02-26  87
      2 Pac         Baby Don’t Cry  2     2000-03-02  82
      2 Pac         Baby Don’t Cry  3     2000-03-11  72
      2 Pac         Baby Don’t Cry  4     2000-03-18  77
      2 Pac         Baby Don’t Cry  5     2000-03-25  87
      2 Pac         Baby Don’t Cry  6     2000-04-01  94
      2 Pac         Baby Don’t Cry  7     2000-04-08  99
      3 Doors Down  Kryptonite      1     2000-04-08  68
      3 Doors Down  Kryptonite      2     2000-04-15  67
      3 Doors Down  Kryptonite      3     2000-04-22  66
  33. Split-apply-combine: APPLY

      Artist        Track           Week  Date        Rank
      2 Pac         Baby Don’t Cry  1     2000-02-26  87
      2 Pac         Baby Don’t Cry  2     2000-03-02  82
      2 Pac         Baby Don’t Cry  3     2000-03-11  72
      2 Pac         Baby Don’t Cry  4     2000-03-18  77
      2 Pac         Baby Don’t Cry  5     2000-03-25  87
      2 Pac         Baby Don’t Cry  6     2000-04-01  94
      2 Pac         Baby Don’t Cry  7     2000-04-08  99
      3 Doors Down  Kryptonite      1     2000-04-08  68
      3 Doors Down  Kryptonite      2     2000-04-15  67
      3 Doors Down  Kryptonite      3     2000-04-22  66
  34. Split-apply-combine: COMBINE

      Artist        Track           Week  Date        Rank
      2 Pac         Baby Don’t Cry  1     2000-02-26  87
      2 Pac         Baby Don’t Cry  2     2000-03-02  82
      2 Pac         Baby Don’t Cry  3     2000-03-11  72
      2 Pac         Baby Don’t Cry  4     2000-03-18  77
      2 Pac         Baby Don’t Cry  5     2000-03-25  87
      2 Pac         Baby Don’t Cry  6     2000-04-01  94
      2 Pac         Baby Don’t Cry  7     2000-04-08  99
      3 Doors Down  Kryptonite      1     2000-04-08  68
      3 Doors Down  Kryptonite      2     2000-04-15  67
      3 Doors Down  Kryptonite      3     2000-04-22  66

      Artist        Track           Week  Date        Rank
      2 Pac         Baby Don’t Cry  1     2000-02-26  87
      2 Pac         Baby Don’t Cry  3     2000-03-11  72
      2 Pac         Baby Don’t Cry  6     2000-04-01  94
      3 Doors Down  Kryptonite      3     2000-04-22  66
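The three split-apply-combine steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the deck's own code: the helper names `split`, `apply_lowest_rank` and `combine` are hypothetical, and "month" is taken to be the first seven characters of the ISO date.

```python
from collections import defaultdict

# Rows copied from the slide: (artist, track, week, date, rank).
ROWS = [
    ("2 Pac", "Baby Don't Cry", 1, "2000-02-26", 87),
    ("2 Pac", "Baby Don't Cry", 2, "2000-03-02", 82),
    ("2 Pac", "Baby Don't Cry", 3, "2000-03-11", 72),
    ("2 Pac", "Baby Don't Cry", 4, "2000-03-18", 77),
    ("2 Pac", "Baby Don't Cry", 5, "2000-03-25", 87),
    ("2 Pac", "Baby Don't Cry", 6, "2000-04-01", 94),
    ("2 Pac", "Baby Don't Cry", 7, "2000-04-08", 99),
    ("3 Doors Down", "Kryptonite", 1, "2000-04-08", 68),
    ("3 Doors Down", "Kryptonite", 2, "2000-04-15", 67),
    ("3 Doors Down", "Kryptonite", 3, "2000-04-22", 66),
]


def split(rows):
    """SPLIT: group rows by (artist, month), where month is 'YYYY-MM'."""
    groups = defaultdict(list)
    for row in rows:
        artist, date = row[0], row[3]
        groups[(artist, date[:7])].append(row)
    return groups


def apply_lowest_rank(group_rows):
    """APPLY: pick the row with the lowest rank within one group."""
    return min(group_rows, key=lambda row: row[4])


def combine(groups):
    """COMBINE: collect one result row per group into a single table."""
    return [apply_lowest_rank(rows) for rows in groups.values()]


best_per_month = combine(split(ROWS))
```

Swapping `apply_lowest_rank` for another function, or changing the grouping key in `split`, leaves the rest of the pattern unchanged, which is the point of the capability named on the slide.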
  35. Tidying Data Pattern: Unroll (and roll up)
      Situation: data on one line

      customer_id  session_id  basket
      34567        12          45;67;235;9920fD
      1232134      2           1345t;456234t

      Capability: Get data into a Tidy format
  36. Tidying Data Pattern: Unroll (and roll up)
      Situation: data on one line

      customer_id  session_id  basket
      34567        12          45;67;235;9920fD
      1232134      2           1345t;456234t

      Capability

      customer_id  session_id  basket            basket_item  item_order
      34567        12          45;67;235;9920fD  45           1
      34567        12          45;67;235;9920fD  67           2
      34567        12          45;67;235;9920fD  235          3
      34567        12          45;67;235;9920fD  99           4

      SELECT customer_id, session_id,
             unnest(string_to_array(basket, ';')) AS basket_item
      FROM TheTable
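For readers without a database to hand, the unroll step and its inverse can be sketched in Python. This is an illustration only: `unroll` and `roll_up` are hypothetical helper names, and the item order is assumed to be the item's position in the delimited string.

```python
# The two rows from the slide.
SESSIONS = [
    {"customer_id": 34567, "session_id": 12, "basket": "45;67;235;9920fD"},
    {"customer_id": 1232134, "session_id": 2, "basket": "1345t;456234t"},
]


def unroll(rows, column="basket", separator=";"):
    """Emit one output row per delimited item, keeping the original columns."""
    unrolled = []
    for row in rows:
        # Splitting returns each item exactly as stored, e.g. '9920fD'.
        for order, item in enumerate(row[column].split(separator), start=1):
            unrolled.append({**row, "basket_item": item, "item_order": order})
    return unrolled


def roll_up(rows, separator=";"):
    """Inverse step: rebuild one delimited string per (customer, session)."""
    baskets = {}
    for row in sorted(rows, key=lambda r: r["item_order"]):
        key = (row["customer_id"], row["session_id"])
        baskets.setdefault(key, []).append(row["basket_item"])
    return {key: separator.join(items) for key, items in baskets.items()}


items = unroll(SESSIONS)
rebuilt = roll_up(items)
```

Keeping the original `basket` column on every unrolled row preserves provenance: each item can always be traced back to the raw string it came from.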
  37. Tidying Data Pattern: Nth item
      Situation: items have an order

      customer_id  session_id  access_point         access
      34567        12          45;67;235;99;235;99  45
      34567        12          45;67;235;99;235;99  67
      34567        12          45;67;235;99;235;99  235
      34567        12          45;67;235;99;235;99  99
      34567        12          45;67;235;99;235;99  235
      34567        12          45;67;235;99;235;99  99
  38. Tidying Data Pattern: Nth item
      Situation: items have an order

      customer_id  session_id  access_point         access  order
      34567        12          45;67;235;99;235;99  45      1
      34567        12          45;67;235;99;235;99  67      2
      34567        12          45;67;235;99;235;99  235     3
      34567        12          45;67;235;99;235;99  99      4
      34567        12          45;67;235;99;235;99  235     5
      34567        12          45;67;235;99;235;99  99      6

      SELECT row_number() OVER (PARTITION BY customer_id, session_id) AS "order"
      FROM TheTable
  39. Tidying Data Pattern: Nth item
      Situation: items have an order
      Can I see when users are flipping between access points?

      customer_id  session_id  access_point         access  order
      34567        12          45;67;235;99;235;99  45      1
      34567        12          45;67;235;99;235;99  67      2
      34567        12          45;67;235;99;235;99  235     3
      34567        12          45;67;235;99;235;99  99      4
      34567        12          45;67;235;99;235;99  235     5
      34567        12          45;67;235;99;235;99  99      6

      Gap=1
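One way to answer the flipping question is sketched below in Python. It rests on an assumption about what "Gap=1" means: a user is treated as flipping when an access point reappears with exactly one different access point in between. The helper names `with_order` and `find_flips` are hypothetical.

```python
# The access sequence from the slide for customer 34567, session 12.
ACCESS_POINTS = "45;67;235;99;235;99"


def with_order(access_points, separator=";"):
    """Number each access in sequence, starting at 1 (the 'order' column)."""
    return list(enumerate(access_points.split(separator), start=1))


def find_flips(ordered):
    """Return the order numbers where the user returns to the access point
    visited two steps earlier, with a different one in between (gap of 1)."""
    flips = []
    for index in range(2, len(ordered)):
        order, current = ordered[index]
        previous = ordered[index - 1][1]
        two_back = ordered[index - 2][1]
        if current == two_back and current != previous:
            flips.append(order)
    return flips


ordered_access = with_order(ACCESS_POINTS)
flip_orders = find_flips(ordered_access)
```

Once every item carries its order number, looking back or ahead by N positions becomes a simple index calculation, which is what makes the Nth item pattern a building block for questions like this one.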
  40. Chaining of Patterns
      1. Sort
      2. Unroll
      3. Nth item
      4. Split-Apply-Combine
      With high pattern maturity, the focus is no longer on the details of the ‘standard’ pattern. Complex, evolving code is easier to maintain.
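A small sketch of what chaining might look like in Python, with each pattern as a function that takes rows and returns rows. All function names here (`sort_rows`, `unroll`, `nth_item`, `count_per_customer`, `chain`) are hypothetical, and the final step uses a simple per-customer item count as a stand-in for split-apply-combine.

```python
from collections import Counter

# The two session rows from the unroll slide, deliberately out of order.
SESSIONS = [
    {"customer_id": 1232134, "session_id": 2, "basket": "1345t;456234t"},
    {"customer_id": 34567, "session_id": 12, "basket": "45;67;235;9920fD"},
]


def sort_rows(rows):
    """Pattern 1, Sort: order rows by customer, then session."""
    return sorted(rows, key=lambda r: (r["customer_id"], r["session_id"]))


def unroll(rows):
    """Pattern 2, Unroll: one row per basket item, numbered within its basket."""
    return [
        {"customer_id": r["customer_id"], "session_id": r["session_id"],
         "item": item, "order": order}
        for r in rows
        for order, item in enumerate(r["basket"].split(";"), start=1)
    ]


def nth_item(rows, n):
    """Pattern 3, Nth item: keep only the n-th item of each basket."""
    return [r for r in rows if r["order"] == n]


def count_per_customer(rows):
    """Pattern 4, Split-Apply-Combine: count items per customer."""
    return dict(Counter(r["customer_id"] for r in rows))


def chain(rows, *steps):
    """Apply each pattern in turn, feeding its output to the next."""
    for step in steps:
        rows = step(rows)
    return rows


items = chain(SESSIONS, sort_rows, unroll)
first_items = nth_item(items, 1)
items_per_customer = count_per_customer(items)
```

Because every step has the same shape, steps can be added, removed, or reordered without rewriting the others.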
  41. Summing up
      • Guerrilla Analytics requires agile teams
      • Data Science Patterns are recurring solutions to data preparation problems
      • Capability to recognize and implement patterns is key for high performance
      • Pattern groups:
        • Join
        • Transform
        • Pattern matching
        • Tidying
        • Chaining

      Agility
      3. Recognize & Implement Patterns
      2. Supporting Tools
      1. Simple Conventions
  42. Find out more
      @Enda_Ridge
      http://guerrilla-analytics.net
