Introduction to data cleaning with spreadsheets


Presented at School of Data training conducted in collaboration with the Open Data PH Taskforce in the Philippines, May 2014.

  1. An introduction to data cleaning with spreadsheets. Anders Pedersen (@anpe), School of Data
  2. Spreadsheets: The beginning of each and every data story • Which were the top growth sectors in this quarter? • How did crime in the capital region in 2013 compare with 2012? • Is there a housing bubble waiting around the corner?
  3. It is time for journalists themselves to tame this beast called the spreadsheet!
  4. Spreadsheets: Excel or Google Docs
  5. Some basic terminology • Data is organized in rows and columns (rows go across the page, columns go top down) • Each field holding data is called a cell • Rows are numbered • Columns are referred to by letters • Each cell is identified by its column and row, which together form a specific code (e.g. A1 is the top-left cell)
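
The column-letter and row-number code described above can be unpacked in a few lines. This is only an illustrative sketch: the helper name `parse_cell` is made up for this example, and spreadsheets do this for you behind the scenes.

```python
import re

def parse_cell(ref):
    """Split an A1-style cell code into (column_number, row_number), both 1-based."""
    match = re.fullmatch(r"([A-Za-z]+)([0-9]+)", ref)
    if match is None:
        raise ValueError(f"not an A1-style reference: {ref!r}")
    letters, digits = match.groups()
    column = 0
    # Column letters work like a base-26 number: A=1 ... Z=26, AA=27, ...
    for letter in letters.upper():
        column = column * 26 + (ord(letter) - ord("A") + 1)
    return column, int(digits)

print(parse_cell("A1"))  # the top-left cell
print(parse_cell("H2"))  # column H, second row
```
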
  6. Some key features to explore today • Sorting and filtering • Basic formulas • Pivot tables Tricky bits: - don’t include summary rows (such as totals) in the data you feed to a pivot table - pivot tables may not update when you change your data
  7. Data sources for exercise • Education: Secondary school enrollment for 2012 from 2012-enrollment-data-secondary
  8. Sorting - finding the best and the worst • The 10 best-paid sectors • The 10 oldest cities • The 10 poorest countries • … • If Excel is a toolbox for journalists, sorting is the hammer!
  9. How to sort • 1) Select all your data • 2) In the Data tab, go to Sort range
  10. Sorting... • 3) Check the “Data has header row” check box • 4) Select the column you want to sort by
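
For readers who want to see the same "top 10" idea outside a spreadsheet, here is a minimal sketch in Python. The rows and the column names `school` and `enrollment` are invented for illustration and are not taken from the exercise file; `top_n` is a hypothetical helper.

```python
# Invented sample rows; the real enrollment file will have different columns.
rows = [
    {"school": "School A", "enrollment": 1200},
    {"school": "School B", "enrollment": 450},
    {"school": "School C", "enrollment": 2310},
    {"school": "School D", "enrollment": 980},
]

def top_n(rows, column, n=10, largest_first=True):
    """Sort the rows by one column and keep the first n, like sorting a range."""
    ordered = sorted(rows, key=lambda row: row[column], reverse=largest_first)
    return ordered[:n]

largest = top_n(rows, "enrollment", n=3)
smallest = top_n(rows, "enrollment", n=3, largest_first=False)
print([row["school"] for row in largest])
print([row["school"] for row in smallest])
```

Keeping each record together as one row while sorting mirrors why the slides ask you to select all your data first: sorting a single column on its own would detach the values from their labels.
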
  11. Filtering - getting a better sense of your data • 1) Turn on filtering via the Data tab (Data → Filter)
  12. Filtering... • 2) Filter options now appear at the top of each column
  13. Filtering... • 3) Now click on the blue triangular arrow
  14. Filtering... • 4) Select the values you wish to filter by
  15. Filtering... • 5) A green arrow will now appear at the top of the column
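
The filtering steps above amount to two operations: listing the distinct values in a column, and keeping only the rows that match the values you ticked. A rough sketch, assuming invented rows with a hypothetical `region` column and made-up helper names:

```python
# Invented sample rows; column names are assumptions for illustration only.
rows = [
    {"region": "Region I", "school": "School A", "enrollment": 1200},
    {"region": "Region II", "school": "School B", "enrollment": 450},
    {"region": "Region I", "school": "School C", "enrollment": 2310},
    {"region": "Region III", "school": "School D", "enrollment": 980},
]

def distinct_values(rows, column):
    """The choices a filter drop-down would offer for this column."""
    return sorted({row[column] for row in rows})

def filter_rows(rows, column, keep):
    """Keep only the rows whose value in `column` is one of `keep`."""
    keep = set(keep)
    return [row for row in rows if row[column] in keep]

print(distinct_values(rows, "region"))
region_one = filter_rows(rows, "region", ["Region I"])
print([row["school"] for row in region_one])
```
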
  16. Moving forward! • Sorting and filtering - check! • Basic formulas • Pivot tables
  17. Basic formulas • Let us now try to sum up some of the values in the dataset… • What it is good for: doing your own analysis and checking whether calculations by your colleagues are right
  18. Basic formulas • Go to column H: In the second row (cell H2), type “=sum(f2:g2)” (or simply “=f2+g2”)
  19. Basic formulas • We now have a sum • Now try calculating the average for the same cells instead: “=average(f2:g2)”
  20. Basic formulas • You can also copy your calculations across cells; the cell references adjust automatically to each row
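
The sum and average formulas, copied down a column, can be sketched like this. The two value columns stand in for columns F and G in the slides; the names `male` and `female` and all numbers are invented for illustration.

```python
from statistics import mean

# Invented sample rows; "male" and "female" stand in for columns F and G.
rows = [
    {"school": "School A", "male": 610, "female": 590},
    {"school": "School B", "male": 230, "female": 220},
]

# Looping over the rows plays the role of copying the formula down column H.
for row in rows:
    values = [row["male"], row["female"]]
    row["total"] = sum(values)     # the sum typed into H2
    row["average"] = mean(values)  # like =average(f2:g2)

for row in rows:
    print(row["school"], row["total"], row["average"])
```

Recomputing a total this way and comparing it with a figure someone else reported is one simple way to check a colleague's calculation.
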
  21. Now only Pivot tables to go • Sorting and filtering - check! • Basic formulas - check! • Pivot tables
  22. Pivot tables • Finding stories inside datasets • Particularly well suited to organised datasets with clear categories and sub-categories
  23. Pivot tables • Select the full area of the dataset • Go to Data → Pivot table report
  24. Pivot tables • Pivot tables allow you to work with rows, columns, values and filters • We start by dropping a column header into Rows • Then we drop one of our value columns into Values
  25. Basic formulas • We now have a nice summary of the budget for each department
  26. Filtering pivot tables • We can now go ahead and filter the pivot table • Add the column you wish to filter by
  27. Filtering pivot tables • Then select one or more categories within that column that you wish to keep
  28. Pivot tables • Finally, we can add several value columns to the pivot table
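
What a pivot table does with Rows, Values and a filter can be approximated as "group the rows by one column, then add up the value columns within each group". The sketch below assumes invented data and hypothetical names (`pivot`, `keep`, and the columns `region`, `type`, `male`, `female`); it is not how any spreadsheet implements pivot tables internally.

```python
from collections import defaultdict

# Invented sample rows; column names are assumptions for illustration only.
rows = [
    {"region": "Region I", "type": "Public", "male": 610, "female": 590},
    {"region": "Region I", "type": "Private", "male": 120, "female": 140},
    {"region": "Region II", "type": "Public", "male": 430, "female": 410},
    {"region": "Region II", "type": "Private", "male": 90, "female": 95},
]

def pivot(rows, row_field, value_fields, keep=None):
    """Sum each value column per distinct value of row_field.

    keep is an optional (column, allowed_values) pair acting as the filter.
    """
    totals = defaultdict(lambda: dict.fromkeys(value_fields, 0))
    for row in rows:
        if keep is not None and row[keep[0]] not in keep[1]:
            continue  # row is filtered out
        for field in value_fields:
            totals[row[row_field]][field] += row[field]
    return dict(totals)

by_region = pivot(rows, "region", ["male", "female"])
public_only = pivot(rows, "region", ["male", "female"], keep=("type", {"Public"}))
print(by_region)
print(public_only)
```

Because the summary is recomputed from the rows every time `pivot` is called, a stray "total" row left in the data would be counted a second time, which is the reason for keeping summaries out of the source range.
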
  29. Exercises • Find the sectors of the national budget that grew the most in percentage terms • Identify the budget lines that had the biggest absolute increase in the budget • Generate a pivot table based on the national budget comparing 2014 and 2013 in specific sectors
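
As a hint for the first two exercises, percentage growth and absolute increase are different calculations and can point to different answers. The figures below are invented purely to show the arithmetic and are not real budget data; the sector names and column labels are assumptions.

```python
# Invented figures for illustration only; not real budget data.
budget = [
    {"sector": "Education", "2013": 200.0, "2014": 250.0},
    {"sector": "Health", "2013": 100.0, "2014": 140.0},
    {"sector": "Defense", "2013": 400.0, "2014": 420.0},
]

for line in budget:
    # Absolute increase: new value minus old value.
    line["absolute_change"] = line["2014"] - line["2013"]
    # Percentage growth: the change relative to the old value.
    line["percent_change"] = 100 * line["absolute_change"] / line["2013"]

fastest_growing = max(budget, key=lambda line: line["percent_change"])
biggest_increase = max(budget, key=lambda line: line["absolute_change"])
print(fastest_growing["sector"], fastest_growing["percent_change"])
print(biggest_increase["sector"], biggest_increase["absolute_change"])
```

In this made-up example the sector with the highest percentage growth is not the one with the largest absolute increase, which is why the exercises ask for both.
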