Creating Open Data

CHIS: Open Data and Linked Open Data
7: Creating Open Data
Vittorio Scarano
vitsca@dia.unisa.it
Dipartimento di Informatica
Università di Salerno (Italy)

•  Acquisition of data
•  by using data provided by others (and collecting them)
•  by generating new data (survey, observations)
•  Extraction of data
•  conversion from original format into something usable
for further analysis and processing
•  Cleaning and transforming
•  sanitizing data, but also improving it (disaggregation
and enrichment)
Data Pipeline - 1

•  Analysis of data
•  to answer particular questions whose answers are not
easily recognizable in the data
•  Presentation and visualization of data
•  to make arguments clearer and more effective
•  dependent on the audience
Some of these topics will be treated in more detail
Data Pipeline - 2

•  Acquisition of Data
•  Extraction of Data
•  Cleaning and transforming Data
•  Analysis of Data
•  Presentation and visualization of Data
Data Pipeline

Data Pipeline

•  Qualitative data
•  description of a quality
•  can be experienced and observed but not measured
•  Quantitative data
•  expressed by numbers and can be measured
•  discrete data (integer)
•  continuous data (floating point)
•  Categorical data
•  describe a category which the item belongs to
Acquisition of data: which kind?

•  Data for humans
•  able to understand natural language :-)
•  unstructured data
•  often found in documents (PDFs): not machine-readable
•  Data for computers
•  structured and in a machine-readable format
•  CSV (comma-separated values), for example
•  spreadsheets for simple use
Acquisition of data: for whom?
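As a minimal sketch of what "machine-readable" means in practice, the snippet below parses a small CSV text with Python's standard csv module. The column names and values are invented for illustration and do not come from the slides.

```python
import csv
import io

# Hypothetical sample data: a tiny CSV with a header row.
CSV_TEXT = """city,year,population
A,2015,100
B,2015,200
"""

def read_rows(text):
    """Parse CSV text into a list of dictionaries, one per data row."""
    return list(csv.DictReader(io.StringIO(text)))

rows = read_rows(CSV_TEXT)
# Every value is read as a string; numeric columns must be converted explicitly.
total = sum(int(row["population"]) for row in rows)
print(rows[0]["city"], total)
```

The same table inside a PDF would need an extraction step before any of this could run.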

•  Data provenance
•  Good documentation is needed
•  to preserve the “chain of custody” to identify the owner and
the processing that took place
•  Some automatic tools
•  Open Refine (formerly known as Google Refine)
•  If some custom procedures (programs) are used, they
should be available open source on repositories
•  GitHub, Sourceforge etc.
Acquisition of data: where from?

•  Finding data that has already been released
•  take care that the terms of the open license are followed
•  Getting hold of more data
•  new data from official sources, via Freedom of
Information Act (FOIA) requests
•  data that comes from scraping websites
•  Collecting data
•  gathering data and entering it into a spreadsheet
Acquisition of Data: the sources

•  Government
•  some open data sections on their website
•  often more at central government than local
•  national institute of statistics, etc
•  Organizations
•  often offer interesting data (World Bank, World Health
Organisation)
•  Science
•  projects and institutions (NASA, etc.)
The kind of sources

Data Pipeline

•  Unstructured data is often released as PDF
•  If the PDF is a scanned version of paper documents,
there is not much to do :-(
•  Otherwise, there are many converters, but they often
make a mess of tables
•  Tabula software (MIT open license) is very useful
•  versions for Windows, Mac, Linux available
Extracting data from PDF

•  Sometimes tables are in HTML pages
•  Of course, you could
•  copy and paste the data into a spreadsheet
•  messy, requires a lot of cleaning
•  learn HTML, read the data out of the table markup and retype it
•  Or.. you can use Google Sheets, whose IMPORTHTML function
imports a table into a spreadsheet very simply
•  and then you can export it as you wish! :-)
Scraping the web
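For readers who prefer a script to a spreadsheet, the same idea (pulling the cells of an HTML table into rows) can be sketched with Python's standard html.parser module. This is an illustrative alternative, not the method shown in the slides; the TableExtractor class and the HTML fragment are hypothetical, and nested tables are not handled.

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect the cell text of every <tr> row found in an HTML document."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row being built, or None outside <tr>
        self._cell = None  # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only text inside a cell is kept; whitespace between tags is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

# Hypothetical page fragment, used only for illustration.
HTML = """
<table>
  <tr><th>Country</th><th>Value</th></tr>
  <tr><td>Italy</td><td> 10 </td></tr>
  <tr><td>France</td><td>20</td></tr>
</table>
"""

parser = TableExtractor()
parser.feed(HTML)
print(parser.rows)
```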

•  The first cell, where the IMPORTHTML function was
used, keeps the formula rather than a value, which means that..
•  … the whole dataset is re-read EVERY TIME
•  you cannot modify, filter or analyze it!
•  Easy workaround:
•  export it as a CSV file
•  then re-import it into a new sheet in Google Sheets
A warning
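The export-and-re-import workaround amounts to taking a static snapshot of the data. A rough Python equivalent, assuming the imported table is already available as a list of rows (the values here are invented), is to write it to CSV text and read it back as an independent copy:

```python
import csv
import io

# Hypothetical rows, e.g. as obtained from an imported HTML table.
imported_rows = [["Country", "Value"], ["Italy", "10"], ["France", "20"]]

def to_csv_text(rows):
    """Serialize rows to CSV text (the 'export' step)."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()

def from_csv_text(text):
    """Parse CSV text back into a fresh list of rows (the 're-import' step)."""
    return list(csv.reader(io.StringIO(text)))

snapshot = to_csv_text(imported_rows)
static_rows = from_csv_text(snapshot)

# The copy is no longer tied to the source and can be edited freely.
static_rows[1][1] = "11"
print(static_rows[1], imported_rows[1])
```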

Data Pipeline

•  Easy and familiar tool
•  far more powerful and useful than the average
user thinks
•  Dates back to the early days of personal computing
•  VisiCalc, Lotus 1-2-3, ..
•  Many opportunities available:
•  Google Spreadsheets
•  OpenOffice / LibreOffice
•  Microsoft Excel
“The” tool: spreadsheet!

A quick comparison

•  Sorting and filtering allow you to “know” your
dataset
•  .. to understand what kind of information it
contains
•  .. and understand how it can contribute to
knowledge
•  But before that, we must “clean up the data”
How to “know” your dataset

•  Formatting does not come along with data
•  Whitespace and new lines
•  Blank cells
•  Numbers that are NOT numbers
•  Data in inconvenient places
•  .. and many others!
Some common mistakes

•  None of the formatting is useful
•  Select all the cells (CTRL+A)
•  Use Format and then “Clear formatting”
Eliminate Formatting

•  Important to make the data readable for
processing
•  Additional blanks, or newlines create problems
•  For example: the first item has a newline
Whitespace and linebreaks

•  TRIM(): removes leading and trailing blanks
•  CLEAN(): removes non-printable characters
•  From a column B, it is possible to create a new
column C with the “cleaned data”
•  then copy column C and paste it as “values only” into a third
column D
•  .. and only then can you get rid of the first two
columns B and C and work only with the “cleaned” D
column
Some useful functions
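Outside a spreadsheet, the effect of TRIM() and CLEAN() can be approximated in a few lines of Python. This is a sketch of the usual spreadsheet behaviour (TRIM also collapses repeated inner spaces; CLEAN drops control characters with codes 0-31), not an exact reimplementation of any particular product; the sample column is invented.

```python
import re

def clean(text):
    """Drop non-printable ASCII control characters (codes 0-31), like CLEAN()."""
    return "".join(ch for ch in text if ord(ch) >= 32)

def trim(text):
    """Strip leading/trailing spaces and collapse runs of spaces, like TRIM()."""
    return re.sub(r" +", " ", text).strip(" ")

# Hypothetical "column B": the first item carries a trailing newline.
raw_column = ["  Salerno \n", "Napoli", " Reggio   Emilia"]

# "Column C/D": the cleaned values, stored as plain strings.
cleaned_column = [trim(clean(value)) for value in raw_column]
print(cleaned_column)  # ['Salerno', 'Napoli', 'Reggio Emilia']
```

Note that removing a newline joins the text on either side of it without adding a space, which is also what CLEAN() does.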

•  Empty cells are often present and create a lot of
problems
•  Useful functions are COUNTBLANK, ISBLANK
•  The filter mechanism is also useful
•  it can show how many cells are empty
•  Be careful when replacing empty cells
•  the replacement should make clear that there is no data, not
that the value is 0
Empty cells
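The distinction between "no data" and "zero" can be made explicit in code. The sketch below mimics ISBLANK and COUNTBLANK in Python; the helper names, the "n/a" marker and the sample column are assumptions chosen for illustration. Unlike a spreadsheet's ISBLANK, this version also treats whitespace-only cells as blank.

```python
def is_blank(value):
    """True for missing cells: None, or a string containing only whitespace."""
    return value is None or (isinstance(value, str) and value.strip() == "")

def count_blank(cells):
    """Number of blank cells in a column, in the spirit of COUNTBLANK."""
    return sum(1 for cell in cells if is_blank(cell))

def fill_blanks(cells, marker="n/a"):
    """Replace blanks with an explicit 'no data' marker, never with 0."""
    return [marker if is_blank(cell) else cell for cell in cells]

# Hypothetical column: note that "0" is a real value, not a missing one.
column = ["12", "", None, "0", "  ", "7"]

print(count_blank(column))  # 3
print(fill_blanks(column))  # ['12', 'n/a', 'n/a', '0', 'n/a', '7']
```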

•  National formatting:
•  in Italian, the decimal separator is the comma
•  so 3,14 is NOT Pi in a non-Italian spreadsheet: it is a string!
•  while 3,141 is three thousand, one hundred forty-one!
•  there, the comma separates the thousands
•  Stray blanks between digits
•  the cell is not a number: it is a string
•  Wrong numbers mean that we cannot compute
•  sum, average, min, max, etc.
Numbers that are not numbers
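A small conversion function shows how such strings can be turned back into numbers. This is a minimal sketch that assumes the whole column follows Italian conventions ("." for thousands, "," for decimals); the function name is hypothetical, and mixed-convention columns would need more care.

```python
def parse_italian_number(text):
    """Convert a string such as '1.234,56' or '3 141' to a float.

    Assumes Italian conventions: '.' groups thousands, ',' marks decimals.
    Raises ValueError if the cleaned-up text is still not a number.
    """
    # Remove ordinary and non-breaking spaces that crept in between digits.
    compact = text.replace(" ", "").replace("\u00a0", "")
    # Drop thousands separators, then switch the decimal comma to a point.
    return float(compact.replace(".", "").replace(",", "."))

print(parse_italian_number("3,14"))      # 3.14
print(parse_italian_number("1.234,56"))  # 1234.56
print(parse_italian_number("3 141"))     # 3141.0

# Once converted, sum, average, min and max work again.
values = [parse_italian_number(v) for v in ["3,14", "1.000", "2 500"]]
print(max(values), min(values))          # 2500.0 3.14
```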

Make room for the new rows

Data Pipeline

•  Useful for summarizing tables
•  without creating new tables
•  without creating new columns
•  without writing formulas
•  Of course, a pivot table is only a tool
•  data analysis is a very complex topic
•  and we are just “scratching the
surface” of it!
Pivot table

Let’s start from a simple table

Create a pivot table (with the data)

Select “Group”, “Col”, “Values”

You can choose different summaries
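What a pivot table computes can be sketched in plain Python: values are grouped by a row key and a column key, and each group is reduced with a summary function. The records and the pivot helper below are hypothetical; swapping the summary function corresponds to choosing a different summary in the spreadsheet.

```python
from collections import defaultdict

# Hypothetical records: (group, column, value).
records = [
    ("North", "January", 10),
    ("North", "February", 20),
    ("South", "January", 5),
    ("North", "January", 7),
]

def pivot(rows, summarize=sum):
    """Group values by (row key, column key) and summarize each group."""
    buckets = defaultdict(list)
    for row_key, col_key, value in rows:
        buckets[(row_key, col_key)].append(value)
    table = defaultdict(dict)
    for (row_key, col_key), values in buckets.items():
        table[row_key][col_key] = summarize(values)
    return {key: dict(cols) for key, cols in table.items()}

print(pivot(records))                 # totals per group and month
print(pivot(records, summarize=len))  # number of records instead of totals
```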

•  The columns are sorted in alphabetical order
•  “April” before “February” :-(
•  Solutions
1.  use a number to indicate the month: 1, 2, .. , 12
•  but a bare number is not very informative in the table
2.  use a string that retains the alphabetical order, such as
•  “01-January”, “02-February”, etc.
•  In this way we have both order and information
in the column headers
A visualization problem
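The ordering problem and the second solution are easy to reproduce. In this short sketch (the ordered_label helper is a hypothetical name), plain month names sort alphabetically, while zero-padded prefixes restore calendar order and keep the name readable:

```python
MONTH_NAMES = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

def ordered_label(name):
    """Prefix a month name with its zero-padded number, e.g. '02-February'."""
    return f"{MONTH_NAMES.index(name) + 1:02d}-{name}"

months = ["February", "April", "January", "March"]

# Plain names: alphabetical order puts April first.
print(sorted(months))
# ['April', 'February', 'January', 'March']

# Prefixed labels: alphabetical order now matches calendar order.
labels = sorted(ordered_label(m) for m in months)
print(labels)
# ['01-January', '02-February', '03-March', '04-April']
```

The zero padding matters: without it, "10-October" would sort before "2-February".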

Change the values (Find&Replace)

Data Pipeline

Difficult… always difficult!

•  Communicating information that can be complex
visually, and in the right way
•  Spreadsheets often offer many “exotic” ways of
defining charts
•  often not useful for conveying information
•  A running example on how to improve a chart
Building graphs and charts

Get rid of 3D!

No background

No tick marks

Larger text

No decimal point!

$ on the axis!

Legend on top

Easier to read! And no “red” sign

Less relevance to budget

Pattern in time

Change in scale to see differences

No legend

Easy printout

Only show variance!

Only show percentage!

Different relationships to be explained with charts
1.  Time-series
2.  Ranking
3.  Part-to-whole
4.  Deviation
5.  Distribution
6.  Correlation
7.  Geospatial
Relationships with charts

Pie-chart! (highly debated..!)

The only acceptable pie-chart?

.. and this, have the same avg=55k!
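That very different datasets can share the same average is easy to check numerically. The sketch below uses invented figures in thousands (the data behind the 55k average on the slide is not available here) and compares the mean with the median and the spread:

```python
from statistics import mean, median, pstdev

# Hypothetical values in thousands; both lists average exactly 55.
uniform = [55, 55, 55, 55, 55]
skewed = [10, 20, 30, 40, 175]

for name, data in (("uniform", uniform), ("skewed", skewed)):
    # The mean alone hides how the values are distributed.
    print(name, mean(data), median(data), round(pstdev(data), 1))
```

The mean is identical, while the median (55 versus 30) and the standard deviation reveal how differently the values are distributed.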

•  “Cum hoc ergo propter hoc” (“with this, therefore because of this”)
•  A correlation between two variables does not imply
that one causes the other
•  A well-known logical fallacy
•  epidemiological studies showed that women taking
combined hormone replacement therapy (HRT) had a
lower-than-average incidence of coronary heart disease
(CHD), suggesting that HRT was protective against CHD
•  but women undertaking HRT were more likely to be from higher
socio-economic groups (ABC1), with better-than-average
diet and exercise regimens: hence fewer cases of CHD
“Correlation does not mean causation”
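A toy computation makes the role of a hidden common cause concrete. In this sketch, with entirely invented numbers, two variables are both derived from a third one (the confounder) plus small perturbations; they end up strongly correlated although neither influences the other. The pearson function is a plain implementation of the Pearson correlation coefficient.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical hidden factor (think: socio-economic status).
confounder = [1, 2, 3, 4, 5, 6]

# Both variables depend on the confounder, not on each other.
x = [2 * z + d for z, d in zip(confounder, [0, 1, 0, -1, 0, 1])]
y = [3 * z + d for z, d in zip(confounder, [1, 0, -1, 0, 1, 0])]

r = pearson(x, y)
print(round(r, 2))  # strong correlation, yet x does not cause y
```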

Some hilarious examples - 3
More available at http://www.tylervigen.com/spurious-correlations

The best chart ever! Minard (1869)

•  Part of the material comes with license CC
•  picture “Bath time” by archer10 (CC-A-SA 2.0)
•  Bibliography:
•  “Data wrangling handbook”, OKF,
https://media.readthedocs.org/pdf/datapatterns/latest/datapatterns.pdf
•  School of Data, OKF, http://schoolofdata.org/courses/
•  “Telling compelling stories with Numbers”, Stephen Few,
Perceptual Edge,
http://www.actuate.com/download/acd2012/Telling-Compelling-Stories-with-Numbers.pdf
•  “Show Me the Numbers: Designing Tables and Graphs to
Enlighten”, Second Edition, Stephen Few, Analytics Press, 2012
•  Choosing a good chart, Andrew Abela:
http://img.labnol.org/di/choosing_a_good_chart2.pdf
Reading list and credits

•  Part of the work was funded by the
ROUTE-TO-PA H2020 project
•  www.routetopa.eu for more info
Acknowledgments
The project has received funding from the
European Union’s Horizon 2020 research
and innovation programme under grant
agreement No 645860.

•  Author: Vittorio Scarano,
ROUTE-TO-PA project
•  vitsca@dia.unisa.it
•  License: This Work is licensed
with Creative Commons
Attribution-ShareAlike 4.0
International (CC BY-SA 4.0)
•  https://creativecommons.org/
licenses/by-sa/4.0/
•  Available on SlideShare
License
