Introductory lesson about data journalism within science journalism and science communication during the International School of Science Journalism 2014 in Erice (June 10th, 2014).
When data journalism meets science | Erice, June 10th, 2014
1. ALESSIO CIMARELLI
Data scientist at Dataninja
jenkin@dataninja.it | @jenkin27
dtnj.it/erice14
International School of Science Journalism
The Digital World (Erice, June 10th, 2014)
2. aka jenkin
PAST
Master's degree in Physics at the University of Rome "La Sapienza"
Master's in Science Communication at the International School for
Advanced Studies (SISSA-ISAS) in Trieste
Press officer at the European Laboratory for Non-Linear Spectroscopy
(LENS) in Florence
PRESENT
Freelance data journalist, web developer, open data activist, citizen
scientist, ...
6. As topic
Stories from the frontier of scientific research and human knowledge.
A key role in the relationship between science and society.
Science journalists can be watchdogs against false science and scientific
fraud.
7. As method
It would be evident in , because the workflow is
similar to police inquiries or scientific research.
Much information from different sources, accountability problems,
hypotheses and proofs, trial-and-error cycles, and so on.
Not only a story, but also a discovery itself...
8. A word in a buzzwords era
when his investigation
is ultimately based on (or driven by) digital data, he acquires such prefix.
If a journalist wants to tell the world, and the world is now made of digital
and quantitative information, he has to acquire skills in the management
and interpretation of data, or he will miss an opportunity.
9. Teamwork and multidisciplinary
Nose for news, public interest, intuition based on context knowledge
Analytical mind, mathematical and statistical skills, intuition based on
science of numbers
10. Teamwork and multidisciplinary
Problem solving, hi-tech knowledge in hardware and software, nerd (or
geek, if you prefer) mood
Artistic sensibility and intuition, knowledge in User Experience theory and
techniques
11. Miners, dustmen, researchers, and story tellers
Public search engines or deep web? Official 5-star open data or web
spiders and screen scrapers? Monitor and keyboard, smartphone and
touch, or boots and mud?
Data should be read by machines and not by humans! Datasets could
hide errors, inconsistencies, lies... or show only a part of a story.
12. Miners, dustmen, researchers, and story tellers
Normalizations and comparisons, filtering, grouping, aggregation,
correlations, ...
How to represent numbers and relations among numbers? Yes, with
arabic numerals, but pictures are worth a thousand words... as long as
you keep in mind that there are facts behind the numbers, and
(copyright of The Guardian).
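The filtering, grouping and aggregation steps named on this slide can be sketched in a few lines of Python. This is only an illustration: the records, the region names and the helper functions `filter_by_year` and `total_by_region` are invented for the example, not taken from any real dataset.

```python
from collections import defaultdict

# Hypothetical records: (region, year, number of events). All values are invented.
records = [
    ("North", 2012, 12),
    ("North", 2013, 18),
    ("South", 2012, 7),
    ("South", 2013, 5),
    ("Centre", 2013, 9),
]

def filter_by_year(rows, year):
    """Filtering: keep only the rows that belong to one year."""
    return [row for row in rows if row[1] == year]

def total_by_region(rows):
    """Grouping and aggregation: sum the counts within each region."""
    totals = defaultdict(int)
    for region, _year, count in rows:
        totals[region] += count
    return dict(totals)

rows_2013 = filter_by_year(records, 2013)
totals = total_by_region(records)

print(rows_2013)
print(totals)
```

A pivot table in a spreadsheet does the same job; the script form is mainly useful when the steps have to be repeated or documented.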
14. In method
You run into a dataset and sense a possible news story...
OR
... you have an interest, an idea, a thesis, so you are looking for data.
Having quantitative data about a phenomenon means that somewhere
there is a you have to understand, test,
verify... and interpret!
Data themselves can suggest new ways for your investigation or even
falsify some hypotheses or assumptions.
Common sense, intellectual honesty, professional ethics
15. Some random examples
New Scientist Apps
tornadoes
warmingworld
exoplanets
planck
sealevel
The Telegraph map of wind farm
Sorting algorithms
Meteorites
Earth Journalism Network
16. by Global Editors Network
Health
American Way of Birth, Costliest in the World (NYT)
Inside the Government's Drug Data (ProPublica)
Which Emergency Room Will See You the Fastest? (ProPublica)
Environment
New York floods (ProPublica)
Breathless and Burdened (Center for Public Integrity)
When Italy is shaking (La Stampa)
Italy, a delicate land (La Stampa)
Astronomy
Kepler’s Tally of Planets (NYT)
Energy
Biomassa (Planbureau voor de Leefomgeving)
18. Hard sciences and social sciences
OK, LHC petabytes are not for journalists, and neither are statistical data from
epidemiological surveys.
But , or (open)
, why not?
If you are not specialized in a specific topic or if you lack knowledge
of the framework, you can ask an expert you trust.
You can also use numbers not in an investigation, but to tell a complex
story using infographics and interactive visualizations.
19. Bibliographies, social networks of scientists, infrastructures
Science is a human activity and an industry (almost) like any other.
How are the European funds invested in scientific research? Where are
the centers specialized in the treatment of specific diseases? Why are some
well-known monitoring technologies not used in some countries?
20. Sensor-based journalism
Cheap electronics and sensors
+
open hardware
+
free information sharing
=
data from stakeholders other than scientists
It's early, but promising:
Swiss Make Open Data Camps
Japan Geigermap at-a-glance
Citizen Science & Sensors
21. If you have data, it's better if you know how to deal with them.
If you think you may find some data, it's better if you use them.
If someone uses data, it's better if you can check their claims.
Playing with data is fun!
23. Some examples
Public administration
International organizations
NGOs
Civic activists
Press offices
Leaks
Social networks
Journalistic sources
Single journalists
Ourselves...
24. Data made public and reusable
Data.gov (USA)
Data.gov.uk (UK)
Open Data Hub (Italy)
OpenIR (Indonesia)
...
25. Remember the buzzword era?
Data from big science experiments (Atlas, Human Brain Project, ...)
Social networks (Facebook, Twitter, but also eBay, Amazon, ...)
Maybe it's not for journalists, but it's a hot topic...
Google Earth Engine
26. For machines, not for humans
The keyword is !
A well-formed table represents a structured data set. A list of Facebook
comments, the articles of a newspaper, or a recorded speech are not structured
data (and so are not machine-readable).
27. It all depends on the format
If we have Gladstone Gander as our best friend:
spreadsheets (xls, xlsx, ods, csv, tsv);
not-so-common good formats (xml, sql, json, shp, kml, ...).
If we are not so lucky:
tables or lists in web pages (html);
simple tables in well-made pdfs (pdf).
If we have Murphy as our worst enemy:
scanned images, even if in a pdf wrapper (png, jpg, pdf);
digital data behind complex search engines.
And what if we have the best data ever, but under a closed license?
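For the lucky formats, a few lines of Python are usually enough to load the data. The sketch below reads the same tiny table from CSV and from JSON using only the standard library. The inline text stands in for real files, and the city names and population figures are invented placeholders.

```python
import csv
import io
import json

# Hypothetical content standing in for real files; the figures are invented.
csv_text = "city,population\nRome,2800000\nTrieste,200000\n"
json_text = (
    '[{"city": "Rome", "population": 2800000},'
    ' {"city": "Trieste", "population": 200000}]'
)

def rows_from_csv(text):
    """Parse CSV text into a list of dicts; every value arrives as a string."""
    return list(csv.DictReader(io.StringIO(text)))

def rows_from_json(text):
    """Parse JSON text; numbers keep their numeric type."""
    return json.loads(text)

csv_rows = rows_from_csv(csv_text)
json_rows = rows_from_json(json_text)

# CSV carries no type information, so numbers must be converted explicitly.
csv_population = int(csv_rows[0]["population"])
json_population = json_rows[0]["population"]

print(csv_rows[0], csv_population)
print(json_rows[0], json_population)
```

The difference in types between the two formats is one reason why the same dataset can behave differently depending on how it was published.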
28. Well-formed data sets
Numbers are numbers, strings are strings and not numbers, datetimes
must always have a single format (e.g. yyyy/mm/dd), localization is
important, no gender values in a names column or similar mix-ups, and every
element should be identified by a Unique Identifier (ID).
Data types a computer understands:
integers (with sign, zero included),
floating-point numbers (with sign),
datetime,
characters and strings (case sensitive),
null value (the strange case of a value that states "I'm not a value").
And simple comparisons are strict equalities, even for strings!
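A minimal sketch of what these rules mean in practice, assuming cells arrive as raw text and dates use the single yyyy/mm/dd format mentioned above. The helper `parse_cell` is a hypothetical name introduced for this example.

```python
from datetime import date, datetime

def parse_cell(text):
    """Turn a raw cell into None, int, float, date, or str (in that order)."""
    text = text.strip()
    if text == "":
        return None  # the null value: "I'm not a value"
    for convert in (int, float):
        try:
            return convert(text)
        except ValueError:
            pass
    try:
        # Assumes the single agreed date format yyyy/mm/dd.
        return datetime.strptime(text, "%Y/%m/%d").date()
    except ValueError:
        return text  # anything else stays a string

print(parse_cell("42"))          # an integer
print(parse_cell("3.14"))        # a floating-point number
print(parse_cell("2014/06/10"))  # a date
print(parse_cell("1,5"))         # a decimal comma is not recognized: still a string

# Comparisons are strict: case and type both matter.
print("Rome" == "rome")
print("10" == 10)
```

The "1,5" case shows why localization matters: a number written with a decimal comma is silently kept as text unless it is handled explicitly.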
29. Aggregation, average, normalization, relative difference, distribution, ...
A single rule: correlation does not imply causation!
Spurious correlations: http://www.tylervigen.com/
Correlated: http://www.correlated.org/
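The operations in this slide's title are small formulas. The sketch below shows normalization per inhabitant, relative difference, and the Pearson correlation coefficient in plain Python. The function names and all numbers are invented for illustration.

```python
import math

def per_capita(count, population, per=100_000):
    """Normalization: express a raw count as a rate per `per` inhabitants."""
    return count / population * per

def relative_difference(new, old):
    """Relative difference of `new` with respect to `old`, as a fraction."""
    return (new - old) / old

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Invented yearly series that happen to grow together.
series_a = [1, 2, 3, 4]
series_b = [2, 4, 6, 8]

print(per_capita(50, 1_000_000))    # events per 100,000 inhabitants
print(relative_difference(120, 100))
# A coefficient close to 1 says the series move together,
# not that one causes the other.
print(pearson(series_a, series_b))
```

Two cities with the same raw count can have very different rates once population is taken into account, which is why normalization usually comes before any comparison.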
31. With great power comes great responsibility
The basic idea is quite simple: you have quantities expressed in numbers
and geometric objects defined by dimensions (e.g. the radius of a circle), so you
just have to decide how to connect your quantities to visual dimensions.
There are several (un)common charts and endless combinations: scatter
plots, lines, bars, areas, pies, donuts, bubble charts, treemaps, word
clouds, alluvial diagrams, dendrograms, networks, streamgraphs,
gauges, chord diagrams, motion charts, parallel coordinates, sankey
diagrams, maps, choropleths, ...
On d3js.org there is an endless gallery of examples!
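As an example of connecting a quantity to a visual dimension, consider a bubble chart. If the radius grows linearly with the value, the area (which is what the eye compares) grows with its square and exaggerates differences. A common choice, sketched here under that assumption, is to make the area proportional to the value. The function name and the default maximum radius are hypothetical.

```python
import math

def bubble_radius(value, max_value, max_radius=40.0):
    """Radius such that the circle's area is proportional to `value`.

    The largest value gets `max_radius`; the others scale with the square root.
    """
    if value < 0 or max_value <= 0:
        raise ValueError("value must be >= 0 and max_value must be > 0")
    return max_radius * math.sqrt(value / max_value)

# A value four times smaller gets half the radius, hence a quarter of the area.
big = bubble_radius(100, 100)
small = bubble_radius(25, 100)
print(big, small)
print((small / big) ** 2)  # ratio of the two areas
```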
32. Building a simple dataset or a large and complex database focused on a
topic of public interest leads to a valuable product: the database itself,
intended as a collection of (linked) data plus metadata.
Can a public frontend to such a database, designed for citizens, journalists,
and stakeholders, be considered a journalistic outcome? If journalism is a
public good, it can be a service, not only a product...
33. Scraping
"Copy & Paste" combo
Data Miner for Chrome browser
IMPORTXML() Google Spreadsheet function
Tabula for simple pdfs
Python (or other languages) scripts and libraries
Cleaning
Filters and "Find & Replace" tools in spreadsheets
Open Refine
Analysis
Pivot tables and simple charts in spreadsheets
Dedicated software (e.g. open-source QtiPlot or QGIS)
Viz
Datawrapper, RAW, Google Fusion Tables, Tableau, CartoDB, infogr.am, easel.ly, Timelinejs, Timemapper, StoryMap, d3js, ...
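To give an idea of what a Python scraping script looks like, here is a minimal sketch that extracts a table from a web page using only the standard library's `html.parser`. The HTML snippet, its column names and its numbers are invented, and the class name `TableParser` is hypothetical. A real script would first download the page and would often rely on a dedicated library.

```python
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # row being built, or None outside <tr>
        self._cell = None  # text pieces of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None

# Invented page fragment standing in for a downloaded web page.
html = """
<table>
  <tr><th>Country</th><th>Notifications</th></tr>
  <tr><td>Italy</td><td>12</td></tr>
  <tr><td>France</td><td>8</td></tr>
</table>
"""

parser = TableParser()
parser.feed(html)
table = parser.rows
print(table)
```

The cells come out as strings, so the cleaning step that follows still has to convert numbers and dates.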
34. Tina Casagrand, "Data journalism for science journalists", The Open Notebook (2014)
Paul Bradshaw, "Scraping for Journalists", Leanpub (2014)
John Mair, Richard Lance Keeble, "Data Journalism", abramis (2014)
Paul Bradshaw, "Data Journalism Heist"
Claire Miller, "Getting Started with Data Journalism", Leanpub (2013)
Nathan Yau, "Data Points", Wiley (2013)
Simon Rogers, "Facts are Sacred", Faber & Faber (2013)
Jonathan Gray, "The Data Journalism Handbook", O'Reilly (2012)
Nathan Yau, "Visualize This", Wiley (2011)
36. Hacking + Marathon = Hackathon
ESPAD (European students and drugs): http://www.espad.org/en/
RASFF (EU food safety): http://ec.europa.eu/food/food/rapidalert/
37. http://ec.europa.eu/food/food/rapidalert/
The Rapid Alert System for Food and Feed (RASFF) was put in place to
provide food and feed control authorities with an effective tool to
exchange information about measures taken responding to serious risks
detected in relation to food or feed. This exchange of information helps
Member States to act more rapidly and in a coordinated manner in
response to a health threat caused by food or feed.
dtnj.it/rasff2013
38. http://www.espad.org/en/
This is the report from the fifth data-collection wave of the European
School Survey Project on Alcohol and Other Drugs (ESPAD). It is based on
data from more than 100,000 European students. Over the years about
500,000 European students have answered the ESPAD questionnaire. A
total of 36 countries and regions have contributed data to the
2011 ESPAD Database. The list of substances includes cigarettes, alcohol, cannabis,
other illicit drugs, and tranquillisers and sedatives without prescription.
dtnj.it/espad2011