Finding stories in spreadsheets

Presentation at Data Harvest 2014

Published in: Education, Technology, Design

  1. @PaulBradshaw. Leanpub.com/u/paulbradshaw. Birmingham City University, City University London. Online Journalism Blog, HelpMeInvestigate. Saturday, 10 May 14
  2. Show of hands. Who has...
     - Calculated a proportion
     - Used a function like SUM
     - Used pivot tables
     - Used a function like VLOOKUP
  3. PART ONE: BASICS
  4. (no text on this slide)
  5. Download this data: donations or EIB loans
     - https://pefonline.electoralcommission.org.uk/search/searchintro.aspx
     - http://www.eib.org/projects/loans/list/
  6. Speed: keyboard shortcuts for checking the data
     - Make a copy, work on that
     - Use CTRL+arrow keys to skip to edges of data
     - Clean first few rows to create single heading row
     - Remove grand total row
     - Remove empty rows (Open Refine)
  7.         Column A   Column B     Column C
     Row 1   Numbers    Strings      Calculations
     Row 2   10         John Smith   =10+20+30
     Row 3   20         Kate Brown   =A2+A3+A4
     Row 4   30         Mike Moore   =SUM(A2:A4)
     Row 5   N/A        Kim Smith    =COUNT(A:A)
     Row 6   50                      =COUNTA(B:B)
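The difference between SUM, COUNT and COUNTA on that table can be sketched in Python. This is only an illustration: the columns are assumed to be plain lists with the header row left out, None stands in for an empty cell, and the helper names are hypothetical stand-ins that mimic, rather than reproduce, spreadsheet behaviour.

```python
# Columns A and B from the table, header row excluded.
# None stands in for an empty cell; "N/A" is text, not a number.
column_a = [10, 20, 30, "N/A", 50]
column_b = ["John Smith", "Kate Brown", "Mike Moore", "Kim Smith", None]


def is_number(cell):
    """True for numeric cells only (booleans are deliberately excluded)."""
    return isinstance(cell, (int, float)) and not isinstance(cell, bool)


def sheet_sum(cells):
    """Like SUM: add the numeric cells, skip text and empty cells."""
    return sum(c for c in cells if is_number(c))


def sheet_count(cells):
    """Like COUNT: how many cells hold numbers."""
    return sum(1 for c in cells if is_number(c))


def sheet_counta(cells):
    """Like COUNTA: how many cells are not empty."""
    return sum(1 for c in cells if c is not None)


print(sheet_sum(column_a[:3]))   # the equivalent of =SUM(A2:A4)
print(sheet_count(column_a))     # numbers only, so "N/A" is not counted
print(sheet_counta(column_b))    # anything non-empty, text included
```

In the spreadsheet itself, =COUNTA(B:B) would also count the heading cell, because a whole-column range includes row 1.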
  8. Two types of datasets: aggregate and granular
     - Granular data has a row for every payment, person, crime etc.
     - Aggregate data has rows for total crimes, payments, etc.
     - Granular is always better: you can calculate your own aggregates
  9. Granular: pivot tables. To aggregate the data:
     - Put the focus (the thing you want totals for) in Rows
     - Put the numbers (money, crimes) in Values
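What a pivot table does with granular data can be sketched in a few lines of Python. The payment records, field names and the pivot_sum helper below are all invented for illustration; the point is only that one field plays the role of Rows and another the role of Values.

```python
from collections import defaultdict

# Hypothetical granular data: one row per payment (names and amounts invented).
payments = [
    {"supplier": "Acme Ltd", "amount": 1200.0},
    {"supplier": "Bright Co", "amount": 300.0},
    {"supplier": "Acme Ltd", "amount": 800.0},
    {"supplier": "Bright Co", "amount": 700.0},
    {"supplier": "Cedar plc", "amount": 50.0},
]


def pivot_sum(rows, row_field, value_field):
    """Group by row_field (the pivot table's Rows) and sum value_field (its Values)."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[row_field]] += row[value_field]
    return dict(totals)


totals = pivot_sum(payments, "supplier", "amount")

# Largest total first, as you would sort a pivot table to find the story.
for supplier, total in sorted(totals.items(), key=lambda item: item[1], reverse=True):
    print(supplier, total)
```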
  10. (no text on this slide)
  11. Using functions and arguments, e.g. =SUM(D2:D300)
      - = indicates this is a formula
      - SUM is the function to be applied
      - ( opens the list of ingredients for that formula
      - D2:D300 is a range (array) of cells*
      - , separates each ingredient
      - ) ends the list of ingredients
  12. More speed: use column ranges
      - =SUM(D:D) ignores any text/empty cells
      - =MAX(D:D)
      - =MIN(D:D)
      - =AVERAGE(D:D)
  13. Sense-checking: misleading averages
      - =AVERAGE(D:D)
      - =MEDIAN(D:D)
      - =MODE(D:D) for ‘most common’: useful for ordinal ratings, which shouldn’t be averaged
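A small worked example shows why the three averages need checking against each other. The payment values below are invented; Python's standard statistics module is used as a stand-in for the spreadsheet functions.

```python
import statistics

# Hypothetical payments: one very large value distorts the mean.
payments = [100, 120, 120, 150, 10000]

mean_value = statistics.mean(payments)      # roughly what =AVERAGE(D:D) gives
median_value = statistics.median(payments)  # roughly what =MEDIAN(D:D) gives
mode_value = statistics.mode(payments)      # roughly what =MODE(D:D) gives

print(mean_value, median_value, mode_value)
```

Here the mean is 2098 while the median and mode are both 120, so "the average payment" would mislead if it were reported as the mean alone.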
  14. Combining functions to quickly make numbers meaningful
      - =MAX(D:D)/SUM(D:D): how much of the total is accounted for by the biggest value?
      - =SUM(D35:D64)/SUM(D:D): what proportion comes from one entity?
      - =SUM(D:D)/365: how much per day? (for annual data)
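The same three calculations, sketched in Python with invented annual figures. The slice amounts[1:3] is a hypothetical stand-in for the rows that belong to one entity (the D35:D64 range in the formula).

```python
# Hypothetical annual payments, one value per row (figures invented).
amounts = [50000, 30000, 15000, 5000]

total = sum(amounts)

# Like =MAX(D:D)/SUM(D:D): share of the total taken by the biggest value.
largest_share = max(amounts) / total

# Like =SUM(D35:D64)/SUM(D:D): share taken by one entity's block of rows.
entity_share = sum(amounts[1:3]) / total

# Like =SUM(D:D)/365: spending per day, assuming the data covers one year.
per_day = total / 365

print(f"{largest_share:.0%} of the total is the biggest payment")
print(f"{entity_share:.0%} comes from one entity")
print(f"{per_day:.2f} per day")
```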
  15. Stories you can report quickly
      - Org spending £X per day
      - Company receives X% of spending
      - Org spent £X on Y
  16. (no text on this slide)
  17. Data health warning! Remember the context: e.g. spending over £500, inflation
  18. PART TWO: CHECKING
  19. (no text on this slide)
  20. COUNT functions: checking data coverage
      - =COUNT(D:D)
      - =COUNTA(D:D)
      - =COUNTBLANK(D2:D15000): you have to use a specific range, or the blank cells underneath the table are counted too
      - =COUNTIF(D:D, "Other")
  21. IF functions: drill down further
      - =COUNTIF(D:D, "Individual")
      - =COUNTIFS(D:D, "Individual", B:B, "<10000")
      - =SUMIF(D:D, "<10000")
      - =IF(this, then that, otherwise this)
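The conditional counts and sums can be pictured as filters over rows. The sketch below uses invented donation records and hypothetical field names; note that Python's == is case-sensitive, whereas COUNTIF in Excel ignores case.

```python
# Hypothetical donation records (values invented for illustration).
donations = [
    {"donor_type": "Individual", "amount": 5000},
    {"donor_type": "Company", "amount": 25000},
    {"donor_type": "Individual", "amount": 12000},
    {"donor_type": "Individual", "amount": 750},
]

# Like COUNTIF: count rows meeting one condition.
individuals = sum(1 for d in donations if d["donor_type"] == "Individual")

# Like COUNTIFS: count rows meeting every condition at once.
small_individuals = sum(
    1 for d in donations
    if d["donor_type"] == "Individual" and d["amount"] < 10000
)

# Like SUMIF: add up only the values that meet the condition.
small_total = sum(d["amount"] for d in donations if d["amount"] < 10000)

print(individuals, small_individuals, small_total)
```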
  22. COUNTIF: use wildcards, and spaces
      - =COUNTIF(D:D, "*hire*")
      - =COUNTIF(D:D, "Scottish*")
      - =COUNTIF(D:D, "* hire*")
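Why the extra space matters can be seen in a short Python sketch. It uses the standard fnmatch module, whose * wildcard behaves like the spreadsheet one for these patterns; the descriptions and the countif helper are invented for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical free-text descriptions (invented for illustration).
descriptions = [
    "Car hire",
    "Shire council grant",
    "Scottish Water",
    "Hired equipment",
    "Venue hire",
]


def countif(cells, pattern):
    """Count cells matching a * wildcard pattern, ignoring case.

    Assumption: only * is used. fnmatch also treats ? and [ ] as special,
    which is not identical to spreadsheet wildcard rules.
    """
    pattern = pattern.lower()
    return sum(1 for cell in cells if fnmatchcase(cell.lower(), pattern))


anything_with_hire = countif(descriptions, "*hire*")   # also catches "Shire" and "Hired"
starts_with_scottish = countif(descriptions, "Scottish*")
hire_as_a_word = countif(descriptions, "* hire*")      # the space rules out "Shire"

print(anything_with_hire, starts_with_scottish, hire_as_a_word)
```

The pattern with the leading space still misses "Hired equipment", where the word starts the cell, so a space is a quick filter rather than a complete one.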
  23. (no text on this slide)
  24. COUNTIF: test free text data
      - =COUNTIF(D2, "*adidas*")
      - =COUNTIF(D3, "*adidas*")
      - =COUNTIF(D4, "*adidas*")
      - ...
      - Then sort to bring the 1s to the top
  25. THE BLACK CROSS: DOUBLE CLICK
  26. (no text on this slide)
  27. PART THREE: CLEANING
  28. (no text on this slide)
  29. Cleaning text: TRIM, SEARCH, SUBSTITUTE
      - =TRIM(D2)
      - =SUBSTITUTE(D2, " ", "") (target cell, what you want to substitute, what you want to replace it with)
      - =SEARCH("Wales", A2) gives the position of the first match
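Rough Python equivalents make the behaviour of the three functions concrete. The helper names are hypothetical and the sample strings are invented; the helpers approximate the spreadsheet functions rather than replicate every edge case.

```python
def trim(text):
    """Like TRIM: drop leading/trailing spaces and collapse repeated spaces.

    Note: str.split() also collapses tabs and newlines, which TRIM does not.
    """
    return " ".join(text.split())


def substitute(text, old, new):
    """Like SUBSTITUTE: replace every occurrence of old with new."""
    return text.replace(old, new)


def search(find_text, within_text):
    """Like SEARCH: 1-based position of the first match, ignoring case.

    Returns None when there is no match (the spreadsheet shows an error instead).
    """
    index = within_text.lower().find(find_text.lower())
    return index + 1 if index != -1 else None


print(trim("  South   Wales  "))
print(substitute("CF10 3AT", " ", ""))
print(search("Wales", "South Wales"))
```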
  30. Cleaning text: UPPER, LOWER, PROPER (D2 contains ‘mr SMITH’)
      - =UPPER(D2) = MR SMITH
      - =LOWER(D2) = mr smith
      - =PROPER(D2) = Mr Smith
  31. Cleaning text: LEFT, RIGHT, MID
      - =LEFT(E2,3) = first 3 characters in E2
      - =RIGHT(E2,3) = last 3 characters in E2
      - =MID(E2,10,3) = the 3 characters in E2 starting from position 10
  32. Combine with LEN
      - =LEN(E2) = how many characters in E2
      - =LEFT(E2,LEN(E2)-3) = take the length of E2, subtract 3, and grab that many characters from the left, i.e.
        - If E2 is 5 characters, it will grab the first 2 (5-3=2)
        - If E2 is 7 characters, it will grab the first 4 (7-3=4)
  33. Combine with SEARCH
      - =SEARCH(" ",E2) = the position of the first space
      - =LEFT(E2,SEARCH(" ",E2)) = grab all characters up to (and including) that space
  34. Combine with SEARCH
      - =SEARCH(" ",E2) = the position of the first space
      - =LEFT(E2,SEARCH(" ",E2)) = grab all characters up to (and including) that space
      - =TRIM(LEFT(E2,SEARCH(" ",E2))) = the same, with the trailing space removed
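The way LEFT, LEN and SEARCH combine can be traced step by step in Python. The helper functions are hypothetical approximations that assume positions start at 1 and counts are not negative; the sample text is just a name with one space in it.

```python
def left(text, count):
    """Like LEFT: the first count characters."""
    return text[:max(count, 0)]


def right(text, count):
    """Like RIGHT: the last count characters."""
    return text[-count:] if count > 0 else ""


def mid(text, start, count):
    """Like MID: count characters from 1-based position start (assumes start >= 1)."""
    return text[start - 1:start - 1 + count]


name = "Paul Bradshaw"

first_space = name.find(" ") + 1                 # like =SEARCH(" ",E2)
up_to_space = left(name, first_space)            # like =LEFT(E2,SEARCH(" ",E2))
first_word = up_to_space.strip()                 # like wrapping the result in TRIM
without_last_three = left(name, len(name) - 3)   # like =LEFT(E2,LEN(E2)-3)

print(first_space, repr(up_to_space), repr(first_word), without_last_three)
```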
  35. Finding errors: ISERROR, ISNA, ISBLANK
      - =ISERROR(D2) = TRUE or FALSE
      - See also: ISNUMBER, ISTEXT, ISNONTEXT, ISLOGICAL, ISEVEN, ISODD, ISERR (all errors but #N/A)
  36. PART FOUR: ADDING
  37. (no text on this slide)
  38. Save time typing search URLs
  39. Generate URL
      - ="https://www.duedil.com/beta/search/companies?name="&B2
  40. Generate URL
      - ="https://www.duedil.com/beta/search/companies?name="&B2
      - ="https://www.duedil.com/beta/search/companies?name="&SUBSTITUTE(B2," ","%20")
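The URL-building step can be sketched in Python as well. The base URL is the one on the slide; the company names and the search_url helper are invented. Where the spreadsheet formula only swaps spaces for %20, urllib.parse.quote also escapes other awkward characters such as &.

```python
from urllib.parse import quote

# Base search URL taken from the slide.
BASE_URL = "https://www.duedil.com/beta/search/companies?name="


def search_url(company_name):
    """Build a search URL, percent-encoding spaces and other unsafe characters."""
    return BASE_URL + quote(company_name, safe="")


# Hypothetical company names standing in for column B.
for company in ["Acme Widgets Ltd", "Marks & Spencer"]:
    print(search_url(company))
```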
  41. Merging data: VLOOKUP
      - =VLOOKUP(what you’re looking for, the range that contains a match and what you want back, which column you want back, nearest match?)
      - =VLOOKUP(D2,Sheet1!D:E,2,FALSE)
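An exact-match VLOOKUP (the FALSE in the last argument) amounts to scanning the first column of a table and returning another column from the first matching row. A minimal sketch, with an invented lookup table and a hypothetical vlookup helper:

```python
# Hypothetical lookup sheet: the first column is the key, the second is
# the value we want to bring back (codes and names invented).
sheet1 = [
    ("A001", "Acme Ltd"),
    ("B002", "Bright Co"),
    ("C003", "Cedar plc"),
]


def vlookup(lookup_value, table, column_index):
    """Exact-match lookup on the first column; column_index is 1-based.

    Returns None when nothing matches (the spreadsheet shows #N/A instead).
    """
    for row in table:
        if row[0] == lookup_value:
            return row[column_index - 1]
    return None


print(vlookup("B002", sheet1, 2))
print(vlookup("Z999", sheet1, 2))
```

As with the spreadsheet function, only the first matching row is returned, so duplicate keys in the lookup sheet are silently ignored.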
  42. Convert dates to years: TEXT functions
      - =TEXT(D2, "dddd") = the day of the week
      - =YEAR(D2)
      - =MONTH(D2) = 1
      - =TEXT(D2, "mmmm") = ‘January’
      - =TEXT(D2, "mmm") = ‘Jan’
      - If not formatted as a date, use LEFT
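For comparison, the same parts of a date can be pulled out with Python's datetime module. The date below is an arbitrary example, and the day and month names assume the default (English) locale.

```python
from datetime import date

# Arbitrary example date standing in for cell D2.
d = date(2014, 1, 10)

weekday_name = d.strftime("%A")   # like =TEXT(D2, "dddd")
year = d.year                     # like =YEAR(D2)
month_number = d.month            # like =MONTH(D2)
month_name = d.strftime("%B")     # like =TEXT(D2, "mmmm")
month_abbr = d.strftime("%b")     # like =TEXT(D2, "mmm")

print(weekday_name, year, month_number, month_name, month_abbr)
```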
  43. Convert amounts to categories: nested IF functions
      - =IF(B2>2500,"High","Low")
  44. Convert amounts to categories: nested IF functions
      - =IF(B2>2500,"High","Low")
      - =IF(B2>2500,"High",IF(B2<1000,"Low","Mid"))
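The nested IF reads more easily as a chain of checks. This sketch mirrors the thresholds in the formula; the function name categorise is hypothetical.

```python
def categorise(amount):
    """Mirror =IF(B2>2500,"High",IF(B2<1000,"Low","Mid"))."""
    if amount > 2500:
        return "High"
    if amount < 1000:
        return "Low"
    return "Mid"


for amount in [3000, 2500, 1000, 999]:
    print(amount, categorise(amount))
```

Values of exactly 2500 and exactly 1000 both fall into "Mid", because the formula uses strict greater-than and less-than comparisons.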
  45. Can’t use a wildcard in IF. Combine with COUNTIF
      - =IF(COUNTIF(B2, "*dropped*"), "Dropped", "Not dropped")
  46. 1. Save time. 2. Check your data. 3. Clean your data. 4. Add to your data. 5. Feel clever. But don’t be too clever.
  47. Thank you. Leanpub.com/u/spreadsheetstories @paulbradshaw
