City Journalism - Magazines MA - week 8 - Data journalism
Presentation Transcript
Data Journalism
Online Journalism - Magazines MA
City University, February 16 2012
What is data journalism?
The key thing here is to learn how to solve your own problems. Asking a tutor should be your last resort - they will not be there for the rest of your life!
1. Coming up with a question
You need to find a data source. But where?
Spend 15 minutes mapping out potential data sources related to your field. They might be commercial or governmental; they might need collecting or already be compiled somewhere. For example, if your field was cycling there will be:
● transport data
● crime data
● health data (encouraging people to cycle as part of a healthy lifestyle, for example)
● environmental data (pollution)
● community data (things being shared online by cyclists)
Also take a look at the examples at http://delicious.com/paulb/foieg
2. Use advanced search techniques to find data for a journalistic question
There are lots of different ways to search, not just typing things into Google.
You can limit by file type, domain, site and use Boolean limits.
● Limit by filetype:
  ○ filetype:xls will restrict results to Excel spreadsheets;
  ○ filetype:csv to comma separated values spreadsheets;
  ○ filetype:doc to Word documents - often used for internal documents
  ○ filetype:pdf to PDFs - often used for official reports
● Limit by domain:
  ○ site:gov.uk will restrict results to UK government websites
  ○ .ac.uk to UK educational establishments (not all of them reputable) - the US equivalent is .edu
  ○ .org.uk to (mostly) nonprofit organisations - again, this is not guaranteed. You can also try .org although this will include results from other countries.
  ○ .mod.uk - the Ministry of Defence
  ○ .nhs.uk - NHS sites
  ○ .dh.gov.uk - Department of Health
  ○ .police.uk - police websites, including British Transport Police, the Met
● Limit by website:
  ○ site:bolton.gov.uk will further limit results to just one website, rather than all local authority websites.
  ○ Likewise site:city.ac.uk would only return results from City University's website
● You can limit your search further by using quotation marks so that only pages containing the exact phrase are returned, e.g. "annual report"
● You can also expand it by using Boolean operators like OR, e.g.
Then put it all together, e.g.
deaths in police custody filetype:xls site:gov.uk
Try other operators such as:
● + before a search term to ensure it is in the pages themselves, e.g. +custody
● phrases in quotes, e.g. "deaths in custody"
● The * wildcard, e.g. "deaths in * custody"
● The ~ operator for synonyms, e.g. ~deaths
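If you find yourself typing the same kind of query again and again, the "put it all together" step can be sketched in a few lines of Python. This is only an illustration: the function name build_query and its parameters are made up for this example, and it simply glues the operators described above into one string that you could paste into a search box.

```python
def build_query(terms, phrase=None, filetype=None, site=None, any_of=None):
    """Glue search terms and operators into a single query string.

    All parameter names are hypothetical; they mirror the operators above.
    """
    parts = []
    if terms:
        parts.append(terms)                   # plain keywords
    if phrase:
        parts.append(f'"{phrase}"')           # exact phrase in quotation marks
    if any_of:
        parts.append(" OR ".join(any_of))     # Boolean OR between alternatives
    if filetype:
        parts.append(f"filetype:{filetype}")  # limit by file type
    if site:
        parts.append(f"site:{site}")          # limit by domain or website
    return " ".join(parts)


query = build_query("deaths in police custody", filetype="xls", site="gov.uk")
print(query)
```

Swapping site="gov.uk" for site="police.uk", or filetype="xls" for filetype="pdf", gives you a quick way to run the same question against different kinds of source.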
3. Making sense of the data
Chances are that the data you've found will raise further questions.
There may be:
● jargon that you need to understand,
● codes that need translating,
● holes in the data,
● contextual data needed: the populations of different regions; data for previous years; etc.
● questions about how it was gathered - the methodology
Use your journalistic skills to answer those questions.
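One common reason for needing contextual data such as population is to turn raw counts into rates, so that large and small regions can be compared fairly. A minimal sketch, with a made-up function name and entirely hypothetical figures:

```python
def per_100k(count, population):
    """Convert a raw count into a rate per 100,000 people."""
    return count * 100_000 / population


# Hypothetical figures, for illustration only
regions = {
    "Region A": {"incidents": 12, "population": 300_000},
    "Region B": {"incidents": 20, "population": 1_000_000},
}

for name, figures in regions.items():
    rate = per_100k(figures["incidents"], figures["population"])
    print(name, rate)
```

In this invented example Region B has more incidents in total, but Region A has the higher rate once population is taken into account.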
Spreadsheet skills
You can also use some spreadsheet techniques to put the data into a form that is going to be easier to interrogate - for example try the following:
● Split addresses so that the postcode is in a separate column (Data > Text to Columns in Excel, or =SPLIT in Google Docs) - or separate forename and surname.
● Count how many times a value appears (=COUNTIF), or how many values are above a certain number.
● Work out the total using =SUM(D:D) if your numbers are in column D, for example.
● Work out the amount per day by using =SUM(D:D)/30 for a 30 day month, etc.
● Work out a median average by using a formula like =MEDIAN(D:D). Compare that with other types of average like =AVERAGE(D:D) or =MODE(D:D)
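If you ever need to do the same calculations outside a spreadsheet, here is a rough Python equivalent of the formulas above. The numbers are invented stand-ins for column D, and the address split assumes the postcode is the last comma-separated part of the address, which will not always be true of real data.

```python
from statistics import mean, median, mode

# Hypothetical values standing in for column D of a spreadsheet
amounts = [120, 80, 80, 200, 150, 80, 310]

total = sum(amounts)                            # like =SUM(D:D)
per_day = total / 30                            # like =SUM(D:D)/30
median_value = median(amounts)                  # like =MEDIAN(D:D)
mean_value = mean(amounts)                      # like =AVERAGE(D:D)
mode_value = mode(amounts)                      # like =MODE(D:D)
count_80 = amounts.count(80)                    # like =COUNTIF(D:D, 80)
above_100 = sum(1 for a in amounts if a > 100)  # like =COUNTIF(D:D, ">100")

# Split an address so the postcode sits on its own
# (assumes the postcode is the final comma-separated part)
address = "10 Downing Street, London, SW1A 2AA"
street, postcode = address.rsplit(", ", 1)

print(total, per_day, median_value, mean_value, mode_value)
print(count_80, above_100)
print(street, "|", postcode)
```

Notice how far apart the three averages are for the same numbers: one large value (310) pulls the mean up, while the median and mode stay lower. That is exactly why it is worth comparing them before you quote "the average".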
4. Basic visualisations
Find a transcript of a politician's - or two politicians' - speeches and visualise them using Wordle.com, Tagxedo or ManyEyes. (The advanced search techniques mentioned above may help.)
You can either compare one politician's speeches on a particular issue before and after taking office - or one politician's speech with that of his or her replacement.
Spend some time tweaking the visualisation:
● Are similar words treated differently, e.g. "patient" and "patients" or "choice" and "options"? Should you combine the counts to clarify the emphases? What are the ethical issues of doing so?
● Should you reduce your sample to the top 10 or 20 words or phrases to make it clearer?
● Can you customise the words included (try copying into a text editor first), colour scheme, arrangement, fonts, etc. to greater effect?
● Is a word cloud best - or should you use a bar chart based on word counts?
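The questions above (combining similar words, cutting down to the top 10 or 20, building a bar chart from word counts) all start from a simple word count. A minimal sketch of how that might look in Python follows; the function name top_words, the merge table and the sample sentence are all invented for illustration.

```python
import re
from collections import Counter


def top_words(text, n=10, merge=None, stopwords=frozenset()):
    """Return the n most frequent words in text as (word, count) pairs.

    merge maps variant spellings to one form, e.g. {"patients": "patient"};
    stopwords is a set of words to leave out of the count altogether.
    """
    merge = merge or {}
    words = re.findall(r"[a-z']+", text.lower())  # crude split into words
    words = [merge.get(w, w) for w in words]      # combine chosen variants
    counts = Counter(w for w in words if w not in stopwords)
    return counts.most_common(n)


# Made-up sentence, not a real speech
speech = ("Patients want choice. Every patient deserves choice, "
          "and patients deserve options.")

print(top_words(speech, n=2, merge={"patients": "patient"},
                stopwords={"and", "every"}))
```

Because the merge table is written out explicitly, you keep a record of every editorial decision to combine words, which is useful when you come to justify those decisions.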
Advanced tutorial 1 - GDoc webscraper
Follow the tutorials tagged importHTML on Excel Notes: http://excelnotes.posterous.com/tag/importhtml
...and importXML on the Online Journalism Blog - http://onlinejournalismblog.com/tag/importxml (start from the bottom)
For a really live scraper, see instructions on how to grab XML from Backtweets or RSS from a Twitter search in this tutorial: http://www.brelson.com/2009/11/using-google-spreadsheets-to-extract-twitter-data/
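To get a feel for what a formula like importHTML is doing behind the scenes, here is a rough sketch of the same idea in Python, using only the standard library. It is not how Google Docs implements it: the class and function names are invented, the sample page is made up, and the sketch does not cope with tables nested inside other tables.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the cell text of every <table> on a page, row by row."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per table found
        self._row = None   # row currently being read, if any
        self._cell = None  # text fragments of the current cell, if any

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def import_html_table(html, index=1):
    """Return the index-th table on the page (counting from 1) as rows."""
    parser = TableParser()
    parser.feed(html)
    return parser.tables[index - 1]


# Made-up page, for illustration only
page = """
<html><body>
<table>
  <tr><th>Area</th><th>Count</th></tr>
  <tr><td>Example A</td><td>3</td></tr>
  <tr><td>Example B</td><td>7</td></tr>
</table>
</body></html>
"""

for row in import_html_table(page):
    print(row)
```

To run this against a live page you would first download the HTML, for example with urllib.request.urlopen(url).read().decode(), and pass the result in as the html argument.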
Advanced tutorial 2 - interrogating data
Follow the tutorial at http://excelnotes.posterous.com/tag/filters
And the one at http://excelnotes.posterous.com/tag/sumifs
Or if you want to play with Google Refine, search for "Getting Started With Local Council Spending Data" or go to http://blog.ouseful.info/2011/01/28/getting-started-with-local-council-spending-data/
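Filters and SUMIFS both boil down to the same idea: keep only the rows that meet your conditions, then look at or add up what is left. A minimal sketch of that idea in Python, using invented spending rows and a made-up helper called sum_ifs:

```python
# Hypothetical spending records, for illustration only
rows = [
    {"supplier": "Acme Ltd", "department": "Transport", "amount": 1200.0},
    {"supplier": "Bolt Co", "department": "Transport", "amount": 300.0},
    {"supplier": "Acme Ltd", "department": "Health", "amount": 450.0},
    {"supplier": "Acme Ltd", "department": "Transport", "amount": 800.0},
]


def sum_ifs(rows, value_field, **criteria):
    """Add up value_field for rows matching every criterion, like =SUMIFS."""
    return sum(
        row[value_field]
        for row in rows
        if all(row[field] == wanted for field, wanted in criteria.items())
    )


# Like applying a filter: keep only payments over 500
large_payments = [row for row in rows if row["amount"] > 500]

# Like =SUMIFS with two conditions: Transport spending with Acme Ltd
acme_transport = sum_ifs(rows, "amount",
                         department="Transport", supplier="Acme Ltd")

print(len(large_payments), acme_transport)
```

Adding another keyword argument to sum_ifs is the equivalent of adding another criteria range and criterion pair to the spreadsheet formula.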
Advanced tutorial 3 - Scraper tools
Data can come in all sorts of forms. Based on the data you found already, try one or more of the following:
● Using a PDF conversion service to get to the data within - a list here: http://helpmeinvestigate.posterous.com/tag/pdfs - also: http://www.pdftoexcelonline.com/
● Grabbing tables from a database search: try the Firefox plugin Outwit Hub (free version stores 100 results; buy a licence for more)