ALESSIO CIMARELLI 
Data scientist at Dataninja 
jenkin@dataninja.it | @jenkin27 
dtnj.it/erice14 
International School of Science Journalism 
The Digital World (Erice, June 10th, 2014)
aka jenkin 
PAST 
Master Degree in Physics at the University of Rome "La Sapienza" 
Master in Science Communication at the International School for 
Advanced Studies (SISSA-ISAS) in Trieste 
Press officer at the European Laboratory for Non-Linear Spectroscopy 
(LENS) in Florence 
PRESENT 
Freelance data journalist, web developer, open data activist, citizen 
scientist, ...
Data journalism & data visualization made in Italy
You know very well how it works... :)
As topic 
Stories about the edge of scientific research and human knowledge. 
Key role in the relationship between science and society. 
Science journalists can be watchdogs against false science and scientific 
fraud.
As method 
It would be evident in , because the workflow is 
similar to police inquiries or scientific research. 
A lot of information from different sources, accountability problems, 
hypotheses and proofs, trial-and-error cycles, and so on. 
Not only a story, but also a discovery itself...
A word in a buzzwords era 
when his investigation 
is ultimately based on (or driven by) digital data, he acquires that prefix. 
If a journalist wants to describe the world, and the world is now made of 
digital and quantitative information, he has to acquire skills in managing 
and interpreting data, or he will miss an opportunity.
Teamwork and multidisciplinarity 
Nose for news, public interest, intuition based on context knowledge 
Analytical mind, mathematical and statistical skills, intuition based on 
the science of numbers
Teamwork and multidisciplinarity 
Problem solving, hi-tech knowledge of hardware and software, nerd (or 
geek, if you prefer) attitude 
Artistic sensibility and intuition, knowledge in User Experience theory and 
techniques
Miners, dustmen, researchers, and story tellers 
Public search engines or deep web? Official 5-star open data or web 
spiders and screen scrapers? Monitor and keyboard, smartphone and 
touch, or boots and mud? 
Data should be read by machines and not by humans! Datasets can 
hide errors, inconsistencies, lies... or show only part of a story.
Miners, dustmen, researchers, and story tellers 
Normalizations and comparisons, filtering, grouping, aggregation, 
correlations, ... 
How to represent numbers and relations among numbers? Yes, with 
Arabic numerals, but a picture is worth a thousand words... as long as 
you keep in mind that there are facts behind the numbers, and 
facts are sacred (copyright of The Guardian).
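A minimal sketch of why normalization matters before any comparison. The region names and figures are invented purely for illustration, and `per_100k` is a hypothetical helper, not something from the slides.

```python
# Invented figures, for illustration only.
regions = [
    {"name": "Region A", "cases": 1200, "population": 4_000_000},
    {"name": "Region B", "cases": 300, "population": 500_000},
]

def per_100k(count, population):
    """Normalize a raw count to a rate per 100,000 inhabitants."""
    return count / population * 100_000

# Raw counts say Region A is "worse"; rates tell a different story.
rates = {r["name"]: per_100k(r["cases"], r["population"]) for r in regions}
ranked_by_count = sorted(regions, key=lambda r: r["cases"], reverse=True)
ranked_by_rate = sorted(rates, key=rates.get, reverse=True)

for name in ranked_by_rate:
    print(f"{name}: {rates[name]:.1f} cases per 100,000 inhabitants")
```

Ranking by raw count and ranking by rate put the two regions in opposite order, which is exactly the kind of fact behind the numbers that a chart can hide or reveal.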
In method 
You run into a dataset and sense a possible news story... 
OR 
... you have an interest, an idea, a thesis, so you are looking for data. 
Having quantitative data about a phenomenon means that somewhere 
there is a you have to understand, test, 
verify... and interpret! 
Data themselves can suggest new directions for your investigation or even 
falsify some hypotheses or assumptions. 
Common sense, intellectual honesty, professional ethics
Some random examples 
New Scientist Apps: tornadoes, warmingworld, exoplanets, planck, sealevel 
The Telegraph map of wind farm 
Sorting algorithms 
Meteorites 
Earth Journalism Network
by Global Editors Network 
Health 
American Way of Birth, Costliest in the World (NYT) 
Inside the Government's Drug Data (ProPublica) 
Which Emergency Room Will See You the Fastest? (ProPublica) 
Environment 
New York floods (ProPublica) 
Breathless and Burdened (Center for Public Integrity) 
When Italy is shaking (La Stampa) 
Italy, a delicate land (La Stampa) 
Astronomy 
Kepler’s Tally of Planets (NYT) 
Energy 
Biomassa (Planbureau voor de Leefomgeving)
Research data, science world, citizen science
Hard sciences and social sciences 
OK, neither LHC petabytes nor statistical data from epidemiological 
surveys are for journalists. 
But , or (open) 
, why not? 
If you are not specialized in a specific topic or if you lack knowledge 
of the context, you can ask an expert you trust. 
You can also use numbers not in an investigation, but to tell a complex 
story using infographics and interactive visualizations.
Bibliographies, social networks of scientists, infrastructures 
Science is a human activity and an industry (almost) like any other. 
How are the European funds invested in scientific research? Where are 
the centers specialized in the treatment of specific diseases? Why are some 
well-known monitoring technologies not used in some countries?
Sensor-based journalism 
Cheap electronics and sensors 
+ 
open hardware 
+ 
free information sharing 
= 
data from stakeholders other than scientists 
It's early, but promising: 
Swiss Make Open Data Camps 
Japan Geigermap at-a-glance 
Citizen Science & Sensors
If you have data, it's better if you know how to deal with them. 
If you think you may find some data, it's better if you use them. 
If someone uses data, it's better if you can check their claims. 
Playing with data is fun!
Welcome to the jungle!
Some examples 
Public administration 
International organizations 
NGOs 
Civic activists 
Press offices 
Leaks 
Social networks 
Journalistic sources 
Single journalists 
Ourselves...
Data made public and reusable 
Data.gov (USA) 
Data.gov.uk (UK) 
Open Data Hub (Italy) 
OpenIR (Indonesia) 
...
Remember the buzzword era? 
Data from big science experiments (Atlas, Human Brain Project, ...) 
Social networks (Facebook, Twitter, but also eBay, Amazon, ...) 
Maybe it's not for journalists, but it's a hot topic... 
Google Earth Engine
For machine, not for human 
The keyword is ! 
A well-formed table represents a structured data set. A list of Facebook 
comments, the articles of a newspaper, or a recorded speech are not structured 
data (and so are not machine-readable).
It all depends on the format 
If we have Gladstone Gander as best friend: 
spreadsheet (xls, xlsx, ods, csv, tsv); 
not-so-common good formats (xml, sql, json, shp, kml, ...). 
If we are not so lucky: 
tables or lists in web pages (html); 
simple tables in well-done pdfs (pdf). 
If we have Murphy as worst enemy: 
scanned images, even if in a pdf wrapper (png, jpg, pdf); 
digital data behind complex search engines. 
And what if we have the best data ever, but under a closed license?
Well-formed data sets 
Numbers are numbers, strings are strings and not numbers, datetimes 
must always have a single format (e.g. yyyy/mm/dd), localization is 
important, no gender values in a names column or similar mix-ups, and every 
element should be identified by a Unique Identifier (ID). 
Data types a computer understands: 
integers (with sign, zero included), 
floating-point numbers (with sign), 
datetime, 
characters and strings (case sensitive), 
null value (the strange case of a value that states "I'm not a value"). 
And simple comparisons are strict equalities, even for strings!
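The rules above can be sketched in a few lines of Python using only the standard library. The sample table, its column names, and the two date formats are assumptions made up for this illustration; in particular, reading `10-06-2014` as day-month-year is an assumption you would have to verify with the data source.

```python
import csv
import io
from datetime import datetime

# Invented raw export: numbers stored as text, mixed date formats, an empty cell.
RAW = """id,city,visitors,date
1,Rome,1200,2014/06/10
2,rome,,10-06-2014
3,Erice,85,2014/06/11
"""

# Assumed formats for this sample only (the second one is read as day-month-year).
DATE_FORMATS = ("%Y/%m/%d", "%d-%m-%Y")

def parse_date(text):
    """Try each known format and return a single canonical date type."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None  # unknown format: stay explicit rather than guess

def clean(row):
    """Convert every column of a raw text row to its proper data type."""
    return {
        "id": int(row["id"]),
        "city": row["city"],
        # An empty cell becomes a null value, not zero.
        "visitors": int(row["visitors"]) if row["visitors"] else None,
        "date": parse_date(row["date"]),
    }

rows = [clean(r) for r in csv.DictReader(io.StringIO(RAW))]

# Strict equality: "Rome" and "rome" are different strings for a machine.
same_city_strict = rows[0]["city"] == rows[1]["city"]
same_city_folded = rows[0]["city"].casefold() == rows[1]["city"].casefold()
# The text "1200" is not the number 1200 either.
text_equals_number = "1200" == 1200
```

Whether two differently cased names really refer to the same city is an editorial decision; the code only makes the choice explicit.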
Aggregation, average, normalization, relative difference, distribution, ... 
A single rule: correlation does not imply causation! 
Spurious correlations: http://www.tylervigen.com/ 
Correlated: http://www.correlated.org/
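As a small illustration of the rule, here is a sketch that computes the Pearson correlation coefficient by hand for two invented series that have nothing to do with each other except that both grow over time. The series names and values are made up for this example.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Invented yearly figures: both simply increase year after year.
ice_cream_sales = [10, 12, 15, 19, 24, 30]
broadband_lines = [100, 130, 150, 200, 230, 310]

r = pearson(ice_cream_sales, broadband_lines)
print(f"Pearson r = {r:.3f}")
```

The coefficient comes out close to 1, yet neither series causes the other: a shared trend over time is enough to produce a strong correlation.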
At a glance
With great power comes great responsibility 
The basic idea is quite simple: you have quantities expressed in numbers 
and geometric objects defined by dimensions (e.g. the radius of a circle), so you 
just have to decide how to connect your quantities to visual dimensions. 
There are several (un)common charts and endless combinations: scatter 
plots, lines, bars, areas, pies, donuts, bubble charts, treemaps, word 
clouds, alluvial diagrams, dendrograms, networks, streamgraphs, 
gauges, chord diagrams, motion charts, parallel coordinates, Sankey 
diagrams, maps, choropleths, ... 
In the d3js.org gallery there is an endless list of examples!
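Connecting a quantity to a visual dimension is itself a choice with consequences. As a hedged sketch: one common convention for bubble charts is to make the circle's area, not its radius, proportional to the value, because readers tend to compare areas. The function name, the `max_radius` parameter, and the values below are assumptions for illustration only.

```python
from math import pi, sqrt

def bubble_radius(value, max_value, max_radius=50.0):
    """Radius whose circle AREA is proportional to value.

    The largest value gets max_radius; since area grows with the square
    of the radius, the radius must grow with the square root of the value.
    """
    if value < 0 or max_value <= 0:
        raise ValueError("value must be >= 0 and max_value must be > 0")
    return max_radius * sqrt(value / max_value)

# Invented values: A is four times B.
values = {"A": 100, "B": 25}
largest = max(values.values())
radii = {name: bubble_radius(v, largest) for name, v in values.items()}
areas = {name: pi * r ** 2 for name, r in radii.items()}

# If the radius were scaled linearly instead, A's area would be 16 times B's.
area_ratio = areas["A"] / areas["B"]
```

With this mapping a value four times larger is drawn with four times the area, so the picture does not exaggerate the difference.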
Building a simple dataset or a large and complex database focused on a 
topic of public interest leads to a valuable product: the database itself, 
intended as a collection of (linked) data plus metadata. 
Can a public frontend to such a database, designed for citizens, journalists, 
and stakeholders, be considered a journalistic outcome? If journalism is a 
public good, it can be a service, not only a product...
Scraping 
"Copy & Paste" combo 
Data Miner for Chrome browser 
IMPORTXML() Google Spreadsheet function 
Tabula for simple pdfs 
Python (or other languages) scripts and libraries 
Cleaning 
Filters and "Find & Replace" tools in spreadsheets 
Open Refine 
Analysis 
Pivot tables and simple charts in spreadsheets 
Dedicated software (e.g. open-source QtiPlot or QGIS) 
Viz 
Datawrapper, RAW, Google Fusion Tables, Tableau, CartoDB, 
infogr.am, easel.ly, Timelinejs, Timemapper, StoryMap, d3js, ...
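To give an idea of what "Python scripts" for scraping can look like, here is a minimal sketch that turns a simple HTML table into a list of records using only the standard library. The `TableParser` class and the HTML snippet (with its invented figures) are assumptions for illustration; a real page would first have to be downloaded and is usually messier.

```python
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Collect the text of every table cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

# Invented page fragment, for illustration only.
HTML = """
<table>
  <tr><th>Country</th><th>Alerts</th></tr>
  <tr><td>Italy</td><td>12</td></tr>
  <tr><td>France</td><td>7</td></tr>
</table>
"""

parser = TableParser()
parser.feed(HTML)
header, *body = parser.rows
records = [dict(zip(header, row)) for row in body]
```

Note that every scraped value is still a string ("12", not 12): converting types is part of the cleaning step that follows.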
Tina Casagrand, "Data journalism for science journalists", The Open Notebook (2014) 
Paul Bradshaw, "Scraping for Journalists", Leanpub (2014) 
John Mair, Richard Lance Keeble, "Data Journalism", abramis (2014) 
Paul Bradshaw, "Data Journalism Heist" 
Claire Miller, "Getting Started with Data Journalism", Leanpub (2013) 
Nathan Yau, "Data Points", Wiley (2013) 
Simon Rogers, "Facts are Sacred", Faber & Faber (2013) 
Jonathan Gray, "The Data Journalism Handbook", O'Reilly (2012) 
Nathan Yau, "Visualize This", Wiley (2011)
Alessio "jenkin" Cimarelli 
jenkin@dataninja.it 
@jenkin27 
Dataninja 
www.dataninja.it 
school.dataninja.it 
dataninja.it/newsletter 
Q&A 
school.dataninja.it/qa 
SWIM 
sciencewritersinitaly.wordpress.com
Hacking + Marathon = Hackathon 
ESPAD (European students and drugs): http://www.espad.org/en/ 
RASFF (EU food safety): http://ec.europa.eu/food/food/rapidalert/
http://ec.europa.eu/food/food/rapidalert/ 
The Rapid Alert System for Food and Feed (RASFF) was put in place to 
provide food and feed control authorities with an effective tool to 
exchange information about measures taken responding to serious risks 
detected in relation to food or feed. This exchange of information helps 
Member States to act more rapidly and in a coordinated manner in 
response to a health threat caused by food or feed. 
dtnj.it/rasff2013
http://www.espad.org/en/ 
This is the report from the fifth data-collection wave of the European 
School Survey Project on Alcohol and Other Drugs (ESPAD). It is based on 
data from more than 100,000 European students. Over the years about 
500,000 European students have answered the ESPAD questionnaire. A 
total of 36 countries and regions have contributed data to the 
2011 ESPAD Database. The list of drugs includes cigarettes, alcohol, cannabis, 
other illicit drugs, and tranquillizers and sedatives without prescription. 
dtnj.it/espad2011

When data journalism meets science | Erice, June 10th, 2014
