Presentation at Data Harvest 2014

Published in: Education, Technology, Design

1. @PaulBradshaw. Leanpub.com/u/paulbradshaw. Birmingham City University, City University London. Online Journalism Blog, HelpMeInvestigate. Saturday, 10 May 14
2. Show of hands. Who has...
   - Calculated a proportion
   - Used a function like SUM
   - Used pivot tables
   - Used a function like VLOOKUP
3. PART ONE: BASICS
4. (No text on this slide.)
5. Download this data: Donations or EIB loans
   - https://pefonline.electoralcommission.org.uk/search/searchintro.aspx
   - http://www.eib.org/projects/loans/list/
6. Speed: keyboard shortcuts for checking the data
   - Make a copy, work on that
   - Use CTRL+arrow keys to skip to edges of data
   - Clean first few rows to create a single heading row
   - Remove grand total row
   - Remove empty rows (OpenRefine)
7.
          Column A   Column B     Column C
   Row 1  Numbers    Strings      Calculations
   Row 2  10         John Smith   =10+20+30
   Row 3  20         Kate Brown   =A2+A3+A4
   Row 4  30         Mike Moore   =SUM(A2:A4)
   Row 5  N/A        Kim Smith    =COUNT(A:A)
   Row 6  50                      =COUNTA(B:B)
8. Two types of datasets: aggregate and granular
   - Granular data has a row for every payment, person, crime etc.
   - Aggregate data has rows for total crimes, payments, etc.
   - Granular is always better: you can calculate your own aggregates
9. Granular: pivot tables. To aggregate data:
   - put the focus in Rows
   - numbers (money, crimes) in Values
10. (No text on this slide.)
11. Using functions - and arguments
   - = indicates this is a formula
   - SUM is the function to be applied
   - ( contains the ingredients for that formula
   - D2:D300 this is a range (array) of cells*
   - , separates each ingredient
   - ) ends the list of ingredients
12. More speed: use column ranges
   - =SUM(D:D) ignores any text/empty cells
   - =MAX(D:D)
   - =MIN(D:D)
   - =AVERAGE(D:D)
13. Sense-checking: misleading averages
   - =AVERAGE(D:D)
   - =MEDIAN(D:D)
   - =MODE(D:D) - for ‘most common’: useful for ordinal ratings which shouldn’t be averaged.
14. Combining functions to quickly make numbers meaningful
   - =MAX(D:D)/SUM(D:D) - how much of the total is accounted for by the biggest value?
   - =SUM(D35:D64)/SUM(D:D) - what proportion from one entity?
   - =SUM(D:D)/365 - how much per day? (for annual data)
15. Stories you can report quickly
   - Org spending £X per day
   - Company receives X% of spending
   - Org spent £X on Y
16. (No text on this slide.)
17. Data health warning! Remember the context: e.g. spending over £500, inflation
18. PART TWO: CHECKING
19. (No text on this slide.)
20. COUNT functions: Checking data coverage
   - =COUNT(D:D)
   - =COUNTA(D:D)
   - =COUNTBLANK(D2:D15000) - have to use a specific range, or blank cells underneath the table are counted
   - =COUNTIF(D:D, "Other")
21. IF functions: Drill down further
   - =COUNTIF(D:D, "Individual")
   - =COUNTIFS(D:D, "Individual", B:B, "<10000")
   - =SUMIF(D:D, "<10000")
   - =IF(This, then that, otherwise this)
22. COUNTIF: Use wildcards - and spaces
   - =COUNTIF(D:D, "*hire*")
   - =COUNTIF(D:D, "Scottish*")
   - =COUNTIF(D:D, "* hire*")
23. (No text on this slide.)
24. COUNTIF: Test free text data
   - =COUNTIF(D2, "*adidas*")
   - =COUNTIF(D3, "*adidas*")
   - =COUNTIF(D4, "*adidas*")
   - ...
   - Then sort to bring the 1s to the top
25. THE BLACK CROSS DOUBLE CLICK
26. (No text on this slide.)
27. PART THREE: CLEANING
28. (No text on this slide.)
29. Cleaning text: TRIM, SEARCH, SUBSTITUTE
   - =TRIM(D2)
   - =SUBSTITUTE(D2, " ", "") (Target cell, what you want to substitute, what you want to replace it with)
   - =SEARCH("Wales", A2) gives the position of the first match
30. Cleaning text: UPPER, LOWER, PROPER (example: mr SMITH)
   - =UPPER(D2) = MR SMITH
   - =LOWER(D2) = mr smith
   - =PROPER(D2) = Mr Smith
31. Cleaning text: LEFT, RIGHT, MID
   - =LEFT(E2,3) = first 3 characters in E2
   - =RIGHT(E2,3) = last 3 characters in E2
   - =MID(E2,10,3) = the 3 characters in E2 starting from position 10
32. Combine with LEN
   - =LEN(E2) = how many characters in E2
   - =LEFT(E2,LEN(E2)-3) = Length of E2 - 3. Grab that many characters, i.e.
     - If E2 is 5 characters, it will grab the first 2 (5-3=2)
     - If E2 is 7 characters, it will grab the first 4 (7-3=4)
33. Combine with SEARCH
   - =SEARCH(" ",E2) = which position is the first space
   - =LEFT(E2,SEARCH(" ",E2)) = Grab all characters up to (and including) that space
34. Combine with SEARCH
   - =SEARCH(" ",E2) = which position is the first space
   - =LEFT(E2,SEARCH(" ",E2)) = Grab all characters up to (and including) that space
   - =TRIM(LEFT(E2,SEARCH(" ",E2)))
35. Finding errors: ISERROR, ISNA, ISBLANK
   - =ISERROR(D2) = TRUE or FALSE
   - See also: ISNUMBER, ISTEXT, ISNONTEXT, ISLOGICAL, ISEVEN, ISODD
   - ISERR (all errors but #N/A)
36. PART FOUR: ADDING
37. (No text on this slide.)
38. Save time typing search URLs
39. Generate URL
   - ="https://www.duedil.com/beta/search/companies?name="&B2
40. Generate URL
   - ="https://www.duedil.com/beta/search/companies?name="&B2
   - ="https://www.duedil.com/beta/search/companies?name="&SUBSTITUTE(B2," ","%20")
41. Merging data: VLOOKUP
   - =VLOOKUP(What you’re looking for, what range contains a match & what you want back, which column you want back, nearest match?)
   - =VLOOKUP(D2,Sheet1!D:E,2,FALSE)
42. Convert dates to years: TEXT functions
   - =TEXT(D2, "dddd")
   - =YEAR(D2)
   - =MONTH(D2) = 1
   - =TEXT(D2, "mmmm") = ‘January’
   - =TEXT(D2, "mmm") = ‘Jan’
   - If not formatted as date, use LEFT
43. Convert amounts to categories: nested IF functions
   - =IF(B2>2500,"High","Low")
44. Convert amounts to categories: nested IF functions
   - =IF(B2>2500,"High","Low")
   - =IF(B2>2500,"High",IF(B2<1000,"Low","Mid"))
45. IF can’t use wildcards, so combine it with COUNTIF
   - =IF(COUNTIF(B2, "*dropped*"), "Dropped", "Not dropped")
46. 1. Save time. 2. Check your data. 3. Clean your data. 4. Add to your data. 5. Feel clever. But don’t be too clever.
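
To sanity-check a few of the cleaning and categorising formulas outside a spreadsheet, here is a minimal Python sketch. It is not part of the presentation: the function names (first_word, drop_last, categorise, search_url) and the sample company name are hypothetical, and Python's string handling only approximates the spreadsheet's (for example, SEARCH returns an error when there is no match, whereas this sketch falls back to the whole text).

```python
def first_word(text: str) -> str:
    """Rough equivalent of =TRIM(LEFT(E2,SEARCH(" ",E2)))."""
    position = text.find(" ")  # 0-based index; -1 when there is no space
    if position == -1:
        # Assumption: return the whole text where the spreadsheet would show an error.
        return text.strip()
    # Take everything up to and including the first space, then trim it.
    return text[: position + 1].strip()


def drop_last(text: str, n: int = 3) -> str:
    """Rough equivalent of =LEFT(E2,LEN(E2)-3): keep all but the last n characters."""
    keep = len(text) - n
    # Assumption: return an empty string when the text is n characters or shorter.
    return text[:keep] if keep > 0 else ""


def categorise(amount: float) -> str:
    """Rough equivalent of =IF(B2>2500,"High",IF(B2<1000,"Low","Mid"))."""
    if amount > 2500:
        return "High"
    if amount < 1000:
        return "Low"
    return "Mid"


def search_url(name: str) -> str:
    """Rough equivalent of the Generate URL formula with SUBSTITUTE(B2," ","%20")."""
    base = "https://www.duedil.com/beta/search/companies?name="
    return base + name.replace(" ", "%20")


if __name__ == "__main__":
    print(first_word("Mr Smith"))
    print(drop_last("ABCDE"))
    print(categorise(1800))
    print(search_url("Acme Holdings Ltd"))  # hypothetical company name
```

Note that the boundary values 1000 and 2500 both fall into "Mid", exactly as in the nested IF on the slide, because the comparisons are strict.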