Data Visualization in the Newsroom


Published in: Education, Technology

  1. 1. Data visualization in the newsroom {“presented by”: “carl v. lewis”, “for”: “the florida times-union”, “slides”: “”, “email”: “”}
  2. 2. What is data visualization? • Data itself is the story; standalone narrative. • Interactive, communicative, visual. • Ranges from simple (charts) to complex (database-driven applications). • Both a technique and a format. • Both entertaining and factual. • See: “The Many Words for Visualization”
  3. 3. The history of data journalism • Grew out of the CAR (computer-assisted reporting) tradition • John Snow’s 1854 cholera map • Has coincided with the era of “Big Data”
  4. 4. On the emergence of the field of data journalism: • "When information was scarce, most of our efforts were devoted to hunting and gathering. Now that information is abundant, processing is more important." –Philip Meyer, UNC Chapel Hill
  5. 5. On the growing importance of data-driven journalism: • “Journalists need to be data-savvy . . . Data-driven journalism is the future.” –Sir Tim Berners-Lee. • “The explosion of Web-based tools and ways of sifting through and sharing data has created something approaching a revolution, and the potential benefits for journalism are only just beginning to reveal themselves.” –Mathew Ingram
  6. 6. What data journalism is not: • Simply incorporating public data into your textual narrative • Infographics • Illustration • Resource-intensive • Just about numbers and programming • Just about making data flashy
  7. 7. What data journalism is: • Visual • Often evergreen • Transparent – direct access to primary source • Credible • Engaging • A good business model
  8. 8. Hans Rosling
  9. 9. Democratization of data journalism • Free and open-source tools (Google Drive, JavaScript libraries, etc.). • Open Data laws. • “Anyone can do it. Data journalism is the new punk.” –Simon Rogers, The Guardian
  10. 10. The job of the data journalist • Part statistician, part journalist, part programmer. • “We’re statisticians. We don’t program.” • “We’re programmers. We don’t report.” • “We’re journalists. We don’t code.”
  11. 11. Notable examples of data visualization • “Mapping America: Every City, Every Block,” • “Where Does My Money Go?”, Open Knowledge Foundation. • “Illinois school report cards,” Chicago Tribune • “We Feel Fine,” Jonathan Harris • “Top Secret America,” The Washington Post
  12. 12. News organizations to follow for innovative data projects
  13. 13. What are your favorite visualizations?
  14. 14. When to use data visualization: • Showing change over time • Comparing discrete values • Showing connections and flows • Showing hierarchy • Browsing large databases
  15. 15. When not to use data visualization: • When text or multimedia tells the story better • When you have very few data points • When there is no statistical significance • When a map is not a map • When a table would do
  16. 16. Process of data journalism 1. Research – Think of a topic and research the factors. 2. Find the data – Locate and retrieve relevant public data. 3. Analysis and evaluation – Crunch numbers, look for trends or inconsistencies. 4. Visualize – Display the data in an appropriate manner.
  17. 17. I. Mining public data: Research and retrieval
  18. 18. Research 1. Think of a topic – what factors influence it? 2. What public data might shed light on those factors? 3. Seek out the data.
  19. 19. Locating public data • Thousands of public “data dumps” by government bodies and nonprofits. • Most commonly in delimited spreadsheet format (look for .csv, .xls), sometimes in XML and JSON. • For geographic data, look for .kml or .shp. • Can be found directly at the source or by search engine keyword.
  20. 20. Search tips for data retrieval • If you don’t know which source to look to for your data, an initial Web search might help. • After your keywords, type “filetype:XLS”, “filetype:CSV”, or whatever the extension is of the data you’re seeking, and you’ll see only files of that type from across the Web. • If you get no results, try broadening your search term to locate sources that cover the general discipline (e.g., instead of “malaria deaths,” try “public health data”).
  21. 21. Locating public data • Federal sources:,,,,, (full federal list by topic/agency here). • Data catalogs such,,, are good places to find non-
  22. 22. Florida public data sources • Florida’s “Sunshine” law requires all state agencies to provide open access to public records, including data. • Chapter 119 of the Florida Statutes mandates that “any records made or received by any public agency in the course of its official business are available for inspection, unless specifically exempted by the Florida Legislature.”
  23. 23. Florida public data sources • Dozens of useful open data sources maintained by Florida government agencies,, • Full list of state-maintained databases by topic here. • A few state-maintained databases worth mentioning: the Division of Elections’ campaign finance data, the DOE’s test score reports and the Department of Law Enforcement’s arrest and officer reports.
  24. 24. Florida public data sources • A number of advocacy groups also maintain useful, downloadable statewide databases: •, which focuses on public employee payroll data. •, which provides demographic data (.csv) and geographic polygons (.shp) for new district boundaries. • Florida Housing Data Clearinghouse, which provides regularly updated property values, housing data (.xls). (For even more, see my semi-exhaustive list with descriptions here). nt.aspx?id=235
  25. 25. Georgia public data sources • Although Georgia has no law requiring all government agencies to make public data accessible online, many do anyway. • In 2008, the Transparency in Government Act expanded the public data site,, to include all three branches of government, regional education service agencies, local boards of education, and transactions made by the General Assembly.
  26. 26. Georgia public data sources • A comprehensive list of downloadable databases from state agencies in Georgia can be found here. • The State Ethics Committee has made all campaign finance reports, lobbyist reports and campaign contributions available in downloadable spreadsheets. • OASIS provides a set of web-based tools to browse the Georgia Department of Public Health’s Data Warehouse, and download the data yourself if you wish.
  27. 27. Locating geographic data • Most geographic data is available as TIGER/Line Shapefile packages (archives containing .shp, .dbf, .prj, .xml, .shx) from the U.S. Census Bureau. • Google also hosts a directory of .kml files for most geographic boundaries here. • Alternatively, Florida and Georgia GIS data can be found at, Geoplan
  28. 28. What to look for • Most numeric spreadsheet data comes either as a comma-separated values (.csv) or Microsoft Excel (.xls) file. Example of .csv structure: "Name","Date","Address","Zip","State","Country" • XML (eXtensible Markup Language) stores data hierarchically for the Web, and is good for building news applications because of its broad interoperability. Example: <menu id="file" value="File"><popup><menuitem value="New" onclick="CreateNewDoc()" /><menuitem value="Open" onclick="OpenDoc()" /><menuitem value="Close" onclick="CloseDoc()" /></popup></menu> • JSON (JavaScript Object Notation) – Similar to XML in structure, but has a “lighter” punctuation, based on JavaScript conventions. May eventually replace XML as standard. Example: {"menu": {"id": "file", "value": "File", "popup": {"menuitem": [{"value": "New", "onclick": "CreateNewDoc()"}, {"value": "Open", "onclick": "OpenDoc()"}, {"value": "Close", "onclick": "CloseDoc()"}]}}}
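As a rough illustration of the three formats on this slide, the sketch below reads a small CSV, JSON and XML sample using only Python's standard library. The sample strings are made up for the example (the XML and JSON are shortened versions of the menu snippets above); nothing here comes from a real dataset.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical sample data, invented for illustration only.
csv_text = '"Name","Zip","State"\n"City Hall","32202","FL"\n'
json_text = '{"menu": {"id": "file", "popup": {"menuitem": [{"value": "New"}, {"value": "Open"}]}}}'
xml_text = '<menu id="file"><popup><menuitem value="New" /><menuitem value="Open" /></popup></menu>'

# CSV: the first row supplies the field names, each later row becomes a dict.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# JSON: parsed straight into nested dicts and lists.
menu = json.loads(json_text)["menu"]
json_items = [item["value"] for item in menu["popup"]["menuitem"]]

# XML: parsed into a tree; values live in element attributes here.
root = ET.fromstring(xml_text)
xml_items = [element.get("value") for element in root.iter("menuitem")]

print(rows[0]["Zip"], json_items, xml_items)
```

The same menu structure comes out of both the JSON and the XML sample, which is the sense in which the two formats are "similar in structure."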
  29. 29. Scraping other sources • Scrape data from an HTML table with a simple Google spreadsheet formula: =ImportHtml("http://the-url-goes-here", "table", 0) • For a database of HTML tables, try Haystax. • For PDFs, try CometDocs. • Scrape webpages by running or creating a Python script at ScraperWiki.
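For readers who would rather script the table scrape than use a spreadsheet formula, here is a minimal sketch using only Python's built-in `html.parser`. The `TableParser` class and the sample HTML are hypothetical, written for this example; a real page would first have to be downloaded, and messy markup may call for a dedicated scraping library instead.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collects the text of every <td>/<th> cell, grouped by <tr> row.

    Simplification: rows from all tables on the page end up in one list.
    """

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up HTML standing in for a downloaded page.
sample_html = (
    "<table><tr><th>County</th><th>Turnout</th></tr>"
    "<tr><td>Duval</td><td>61.2</td></tr></table>"
)

parser = TableParser()
parser.feed(sample_html)
print(parser.rows)
```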
  30. 30. APIs for data retrieval • APIs (application programming interfaces) are how many websites and services share content with one another. • Allows a computer system to fetch, interpret and use data created on another system, even if it used a different programming language or structure. • Examples: Twitter Search API, Google Maps API, NYTimes Campaign Finance API. • Usually returns data as XML, JSON or .txt. • Often requires use of an API key.
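The general shape of an API call (build a URL with query parameters and a key, then parse the JSON that comes back) can be sketched as follows. The endpoint, the `api_key` parameter name and the `results` field are all assumptions made up for illustration; every real API documents its own URL, parameters and response layout. No network request is made here; a canned response stands in for the server's reply.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint, not a real service.
BASE_URL = "https://api.example.com/v1/contributions"


def build_request_url(params, api_key):
    """Append the query parameters and the API key to the base URL."""
    query = dict(params, api_key=api_key)
    return BASE_URL + "?" + urlencode(query)


def parse_response(body):
    """Decode a JSON response body and return its list of records."""
    payload = json.loads(body)
    return payload.get("results", [])


url = build_request_url({"state": "FL", "cycle": 2012}, "YOUR_API_KEY")

# Canned response body, invented for the example.
sample_body = '{"results": [{"candidate": "Example", "total": 1250.0}]}'
records = parse_response(sample_body)

print(url)
print(records[0]["total"])
```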
  31. 31. II. Analyzing and refining public data
  32. 32. Manipulating datasets • Data is rarely ready for analysis and visualization out-of-the-box (hence “raw data”). • Spreadsheet applications are the most common and easiest way to work with data (Excel, Google Spreadsheets). • Allow for complex calculations, formulas, sorting. • Compatible with a variety of file formats (.xls, .ods, .csv, .txt, .tsv). • Scripts may also be written to automate bulk manipulation (Python). • R Project (
  33. 33. Data analysis • To figure out what your data says, you’ll need to crunch the numbers. • Statistical significance is the litmus test. • Skewed or normal distribution? Why? • Outliers? If so, error or unexplained factor?
  34. 34. Benchmarks for analysis • Mean (μ) is the simplest to calculate, but susceptible to errors caused by outliers. • Median is usually a better metric in determining a conclusion, especially with a skewed distribution. • If mean=mode, no skewness. • Standard deviation (σ) measures how widely the values in a data set are spread around the mean. • Z-score = how many standard deviations a value is away from the mean and, thus, its likelihood of being an outlier. [Figure labels: standard deviation, mean, z-score]
  35. 35. Calculating values in Excel • Mean: =AVERAGE(A1:A27) • Median: =MEDIAN(A1:A27) • Standard deviation: =STDEV(A1:A27) • Z-score of a given value: Subtract the mean of the dataset from the value. Divide the result by the standard deviation.
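The same four benchmarks can be computed outside a spreadsheet. The sketch below uses Python's standard `statistics` module on a made-up list of values; `statistics.stdev` is the sample standard deviation, which is also what Excel's STDEV computes. The `z_score` helper is a name introduced here for illustration.

```python
import statistics

# Made-up values for illustration.
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(values)
median = statistics.median(values)
stdev = statistics.stdev(values)  # sample standard deviation


def z_score(value, data):
    """How many standard deviations `value` lies from the mean of `data`."""
    return (value - statistics.mean(data)) / statistics.stdev(data)


print(mean, median, round(stdev, 3), round(z_score(9, values), 3))
```

Here the mean (5) sits above the median (4.5) because the one large value, 9, pulls it upward, which is the outlier sensitivity the slide warns about.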
  36. 36. Other commonly used Excel formulas • CONCATENATE to merge multiple columns. • MID to split columns. • Percent change to display relative change over time: =(new_value-original_value)/ABS(original_value) • See this guide of helpful Excel tricks for data journalists, compiled by Mary-Jo Webster of St. Paul Pioneer Press:
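The percent-change formula translates directly into a small function. This is a sketch with a hypothetical function name; the only addition to the spreadsheet formula is an explicit error when the original value is zero, where the result is undefined.

```python
def percent_change(original_value, new_value):
    """Relative change from original_value to new_value, as a fraction."""
    if original_value == 0:
        raise ValueError("percent change is undefined when the original value is 0")
    return (new_value - original_value) / abs(original_value)


# Made-up figures: a budget that grows from 200 to 250.
print(percent_change(200, 250))
```

Dividing by the absolute value keeps the sign meaningful when the starting figure is negative: moving from -50 to -25 comes out as a positive change.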
  37. 37. Refining and cleaning data • Sometimes Excel and Google Spreadsheets aren’t enough, especially when working with large datasets. • Google Refine – free tool that lets you explore, power sort and process data. • Useful for finding and fixing errors and inconsistencies, “power tool for working with messy data.” • Facets to sort data • Cleaning with clusters • Shan Carter’s Mr. Data Converter to convert spreadsheets to a more web-friendly format.
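To give a feel for what "cleaning with clusters" does, the sketch below groups spellings that collapse to the same normalized key. It is loosely modeled on the key-collision ("fingerprint") idea and is a simplified stand-in, not Google Refine's actual implementation; the function names and sample values are invented for the example.

```python
import re
from collections import defaultdict


def fingerprint(value):
    """Normalize a string: trim, lowercase, drop punctuation, sort unique words."""
    cleaned = re.sub(r"[^\w\s]", "", value.strip().lower())
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def cluster(values):
    """Group values that share a fingerprint; keep groups with differing spellings."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]


# Made-up messy entries for one column.
names = ["Jacksonville, FL", "jacksonville fl", "FL Jacksonville", "Tampa, FL"]
clusters = cluster(names)
print(clusters)
```

Each cluster is a candidate for merging into one canonical spelling; a person should still review the groups, since different entities can share a fingerprint.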
  38. 38. Other data analysis tips and tricks • Put field names in the first row. • Put geographic data in the first columns. • When you have two different datasets, a good tool to merge them is Google Fusion Tables (make sure they share a common attribute). • Never round until the end of calculations. Round to two decimal places for visualization purposes. • Cut and paste calculations into a new column as values only. • Know the principal data types (integer, real, string, boolean), and make sure numeric data is classified as either integer (whole numbers only) or real (any value).
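Two of the tips above, merging on a shared attribute and rounding only at the end, can be sketched in a few lines. The `merge_on_key` helper and the county figures are hypothetical, made up for illustration; this is a plain in-memory join, not how Fusion Tables works internally.

```python
def merge_on_key(left_rows, right_rows, key):
    """Join two lists of dicts on a shared field, keeping only matching rows."""
    right_by_key = {row[key]: row for row in right_rows}
    merged = []
    for row in left_rows:
        match = right_by_key.get(row[key])
        if match is not None:
            merged.append({**row, **match})
    return merged


# Made-up datasets that share the "county" attribute.
population = [{"county": "Duval", "population": 1000},
              {"county": "Clay", "population": 3000}]
votes = [{"county": "Duval", "votes": 667},
         {"county": "Clay", "votes": 1000}]

merged = merge_on_key(population, votes, "county")

# Keep full precision while calculating; round only the displayed value.
for row in merged:
    row["turnout_rate"] = row["votes"] / row["population"]
display = [round(row["turnout_rate"] * 100, 2) for row in merged]
print(display)
```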
  39. 39. III. Visualizing your data
  40. 40. Planning your visualization • Identify your key message • Choose the best data series to illustrate your point • Consider the number of points in the data • Think about complementary/supporting datasets you can incorporate, e.g. sanitation with poverty. • Plan for user interaction, i.e. visual feedback. • Make numerical changes to raw data to enhance your point, e.g. absolute values vs. percent change • Brainstorm potential technologies • Consult experts on the topic to back up your interpretation of the data
  41. 41. Choosing the right type of visualization • Change of single variable over time: line chart. • Comparison of single variable among multiple classes: bar chart. • Two variables: scatter plot, bubble chart. • Hierarchical data: treemap, bubbletree. • Area charts for area only • Makeup of whole: pie chart. • Distribution: histograms, box-and-whisker plots. • Geographic data (point, polygon, choropleth and symbol maps). • Records: searchable database. • Chronological data: timeline, sparklines. • Other possibilities: matrices, heatmap, games, slopegraphs, stepper graphics,
  42. 42. Visualization design principles • Typography: clear, consistent, not distracting. • Use bold, mix of serif/sans-serif to provide emphasis. • Don’t set type at an angle. • Color: Let color correspond to variable, design for accessibility, choose from same side of color wheel, consider cultural associations but avoid thematic palettes. Use Adobe Kuler. • Visual overload, emotional design, skeuomorphism. [Figure captions: No white type on black background; No angled type]
  43. 43. Visualization design principles • Some guidelines for graphical integrity, according to Edward Tufte in The Visual Display of Quantitative Information: 1. Representation of numbers should be directly proportional to the numerical quantities represented. 2. Clear, detailed labeling throughout. 3. Show data variation, not design variation. 4. Avoid excessive and unnecessary use of graphical effects. [Figure caption: What Edward Tufte calls “the worst visualization ever published.”]
  44. 44. Visualization design principles • Design for the eye • User should be able to discern key message visually. • Design for interaction • Highlighting and details on demand (example) • User-driven content selection (example)
  45. 45. Visualization design principles
  46. 46. Visualization design principles [Figure labels: Awful; Bad, but better]
  47. 47. Visualization design principles [Figure labels: Awful, but better; Not bad; Awful]
  48. 48. Visualization design principles • What’s wrong with this infographic?
  49. 49. “Four Ways to Slice Obama’s Budget Proposal” • From • What makes this visualization effective? How does it approach color, complexity, interactivity and typography? How does it avoid visual overload?
  50. 50. Wireframing/prototyping • Follow a structured grid system (e.g., 12-column, 960px grid – see and Subtraction). • Very selectively, you can break the grid to emphasize a certain visual element. • Sketch out/prototype your wireframe on paper first (print templates such as this)
  51. 51. Selecting tools/technologies • A wealth of free, open-source data visualization tools and libraries exist to shorten development times • Examples: Google Visualization API, Google Fusion Tables, Highcharts.js, CartoDB, d3.js, Tableau Public. • For everything else, HTML5 + CSS + JavaScript
  52. 52. IV. Building a Web app
  53. 53. Web app anatomy • Three components of a Web app: 1. HTML (structure) 2. CSS (styles) 3. JavaScript (interactivity)
  54. 54. Parts of an HTML file • An HTML file is made up of: 1. Doctype declaration 2. Head <head> 3. CSS/JavaScript references 4. Title <title> 5. Body <body> 6. A div container 7. Divs (IDs and classes)
  55. 55. Parts of a CSS file • A CSS file is made up of: 1. Container ID 2. Default paragraph (p) style 3. Default h1, h2, etc. styles 4. Default body style 5. Styles for all divs
  56. 56. V. Maps
  57. 57. Maps 101 • Interactive maps combine geocoded data – points or polygons – along with metadata and/or numeric data. • KML (Keyhole Markup Language) is quickly becoming a popular file format, but Shapefile ( is still the most widely available. • Geographic data can either be geocoded, downloaded from the Web, or custom-drawn. • Good purveyor of news maps: The Texas Tribune.
  58. 58. Mapping services and libraries • Google Fusion Tables – Quick, versatile and classic maps that integrate seamlessly with the Google Maps JavaScript API. • CartoDB – A newer open-source tool much like Fusion Tables, but with a better-looking out-of-the-box experience. • Leaflet – An open-source, client-side mapping library with an API that allows you to achieve a number of advanced features. Plays nicely with Fusion Tables and CartoDB-hosted maps. Part of CloudMade suite.
  59. 59. Handy desktop mapping software • QGIS – Free program that supports almost every conceivable map file type, and allows you to add or manipulate vector data, which can then be exported as a KML or Shapefile package. • TileMill – A map creation and styling software; ideal for those with little programming experience. UTF-grid-enabled tilesets only.
  60. 60. Primary map types • Choropleth – Colors for each geometry correspond to numeric values of a given variable. • Point – Locations on a map displayed by geocoded markers. • Less frequently: proportional maps and geo maps. [Figure captions: Choropleth map of Georgia voter turnout; Point map of Jacksonville polling locations]
  61. 61. Tips and tricks • If you have street address data, you can use BatchGeocode to convert the addresses to lat-long coordinates. • For choropleth maps: • Include no more than five fill colors or “buckets.” • Don’t define an equidistant color ramp; use ColorBrewer instead. • Use MarkerClusterer when there are too many points for certain zoom levels. [Figure captions: Using ColorBrewer to define an accurate, accessible color ramp; Using MarkerClusterer to cluster points at further zoom levels.]
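One way to hold a choropleth to five buckets is to cut the data at its quintiles and give each bucket one color from a five-step ramp. This is only a sketch: the slides do not name a classification method, so quantile breaks are an assumption here, and the hex codes were copied by hand as ColorBrewer's five-class "Blues" scheme and should be checked against ColorBrewer itself. The function names and turnout values are invented for the example.

```python
import bisect
import statistics

# Five-step sequential ramp, lightest to darkest (assumed to match
# ColorBrewer's 5-class "Blues"; verify before publishing).
PALETTE = ["#eff3ff", "#bdd7e7", "#6baed6", "#3182bd", "#08519c"]


def quantile_breaks(values, buckets=5):
    """Return the cut points that split `values` into equal-count buckets."""
    return statistics.quantiles(values, n=buckets, method="inclusive")


def color_for(value, breaks):
    """Pick the palette color for the bucket that `value` falls into."""
    return PALETTE[bisect.bisect_right(breaks, value)]


# Made-up turnout percentages, one per county.
turnout = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
breaks = quantile_breaks(turnout)
colors = [color_for(value, breaks) for value in turnout]
print(breaks)
```

With quantile breaks each color covers roughly the same number of areas, so a few extreme values cannot leave most of the map in a single shade.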
  62. 62. Tips and tricks • To convert Shapefiles so they can be imported into Fusion Tables, either use Shape to Fusion, or export them as KML from CartoDB. • Before using the embed tool in Fusion Tables or CartoDB, make sure the map is centered where you want it. • Ensure your map is set to “Public.” [Figure captions: Export a Shapefile as KML in CartoDB; Making your map public in Fusion Tables]
  63. 63. VI. Charts
  64. 64. Charts • Basic building block of visualization • Simple, but also easy to mess up. • Should always be interactive. • Should always include data source. • Should always include a legend. • Unless necessary, only show labels on mouseover.
  65. 65. Interactive charting tools • Out-of-the-box: Google Drive charts, • More advanced: Google Code Playground. • Most agile: Highcharts.js. • Most extendible: Tableau Public [Figure caption: A combo chart made using Highcharts.js]
  66. 66. Charting best practices • Color: Pick a palette of no more than 3-4 colors from the same side of the color wheel. • Increments: Use natural increments like (0,2,4,6...) instead of, say, (0,3,6,9...) • Scale: Don’t plot two unrelated series with one scale on left and one on right. • Style: Flat and simple. No 3D effects, shadows, narrow bars or distracting shading. [Figure captions: Don’t plot two different variables on same scale; Bars too narrow; Distracting shading; Misleading 3D effects; Pointless shadows] Source: The Wall Street Journal Guide to Information Graphics, Dona M. Wong.
  67. 67. Charting best practices • Always set the baseline to zero. • Always order starting with the greatest value • Use broken bars sparingly • No more than five slices on pie charts; no “donut” pie charts. • No more than 3-4 lines on a line chart [Figure captions: Wrong order; Right order; Wrong baseline; Right baseline; No donut-pies] Source: The Wall Street Journal Guide to Information Graphics, Dona M. Wong.
  68. 68. VII. Programming and beyond
  69. 69. Utilizing JavaScript/HTML5 libraries • Together, JavaScript, HTML5 and jQuery have expanded the boundaries of data visualization • An abundance of open-source libraries and packages means less programming is required to produce unique, interactive visualizations. • Examples: Timeline.js, Bubbletree.js, Raphael.js, ProPublica tools
  70. 70. The HTML5 revolution • Adobe Edge for HTML5 development; end of Flash’s reign • Platform-agnostic, mobile-first movement • Forking resources and packages off GitHub
  71. 71. Pushing the limits • RaphaelJS for easier manipulation of serialized vector graphics • Other boundary-pushing data visualization projects: Processing!, Gephi, d3.js, IBM’s Many Eyes. [Figure caption: A network map produced using D3.js]
  72. 72. Helpful resources and communities • Blogs/,,,, • Books: The Data Journalism Handbook, O’Reilly Media. FlowingData Guide to Visualization, Nathan Yau. The Wall Street Journal Guide to Information Graphics, Dona M. Wong. • Communities:, Hacks/Hackers, NICAR. [Figure caption: Free data journalism handbook from O’Reilly Media]
  73. 73. For slides and list of links,