The document discusses open data and the data pipeline process. It covers the steps of acquiring data from various sources, extracting and cleaning the data, analyzing it, and presenting it visually. Specific topics covered include acquiring qualitative and quantitative data, extracting data from PDFs and websites, cleaning issues like formatting, whitespace, empty cells and incorrect numbers, analyzing data using pivot tables, and best practices for visualizing data in charts and graphs. The overall data pipeline process of acquiring, preparing, analyzing and presenting open data is presented.
Creating Open Data
1. CHIS: Open Data and Linked Open Data
7: Creating Open Data
Vittorio Scarano
vitsca@dia.unisa.it
Dipartimento di Informatica
Università di Salerno (Italy)
1 CHIS: Open Data and Linked Open Data
2. • Acquisition of data
• by using data provided by others (and collecting them)
• by generating new data (survey, observations)
• Extraction of data
• conversion from original format into something usable
for further analysis and processing
• Cleaning and transforming
• sanitizing data, but also improving it (disaggregation
and enrichment)
Data Pipeline - 1
3. • Analysis of data
• to answer particular questions that are not easily
recognizable in data
• Presentation and visualization of data
• to make more clear and effective arguments
• dependent on the audience
Some of these topics will be treated in more detail
Data Pipeline - 2
4. • Acquisition of Data
• Extraction of Data
• Cleaning and transforming Data
• Analysis of Data
• Presentation and visualization of Data
Data Pipeline
6. • Qualitative data
• description of a quality
• can be experienced and observed but not measured
• Quantitative data
• expressed by numbers and can be measured
• discrete data (integer)
• continuous data (floating point)
• Categorical data
• describe a category which the item belongs to
Acquisition of data: which kind?
7. • Data for humans
• able to understand natural language :-)
• unstructured data
• often found in documents (PDFs): not machine-readable
• Data for computers
• structured and in a machine-readable format
• CSV (comma-separated values), for example
• spreadsheets for simple use
Acquisition of data: for whom?
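As a minimal sketch of what "machine-readable" means in practice, the following Python snippet parses a small CSV with the standard csv module. The column names and the values are invented for illustration only.

```python
import csv
import io

# Hypothetical CSV content; in practice it would be read from a file.
raw = "city,population\nSalerno,130000\nNapoli,910000\n"

# DictReader uses the first line as column names.
rows = list(csv.DictReader(io.StringIO(raw)))

# Every value is read as text, so numeric columns must be converted explicitly.
total = sum(int(row["population"]) for row in rows)
print(rows[0]["city"], total)
```

The same content inside a PDF would first have to be extracted before any such processing is possible.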
8. • Data provenance
• Good documentation is needed
• to preserve the “chain of custody” to identify the owner and
the processing that took place
• Some automatic tools
• Open Refine (formerly known as Google Refine)
• If some custom procedures (programs) are used, they
should be available open source on repositories
• GitHub, Sourceforge etc.
Acquisition of data: where from?
9. • Finding data that has already been released
• care in ensuring that the Open license is followed
• Getting hold of more data
• new data from official sources, via Freedom of
Information Act (FOIA) requests
• data that comes from scraping websites
• Collecting data
• gathering data and entering it into a spreadsheet
Acquisition of Data: the sources
10. • Government
• some open data sections on their website
• often more at central government than local
• national institute of statistics, etc
• Organizations
• often offer interesting data (World Bank, World Health
Organisation)
• Science
• projects and institutions (NASA, etc.)
The kind of sources
11. • Acquisition of Data
• Extraction of Data
• Cleaning and transforming Data
• Analysis of Data
• Presentation and visualization of Data
Data Pipeline
12. • Unstructured data is often released as PDF
• If the PDF is a scanned version of a document,
there is not much to do :-(
• Otherwise, there are many converters, but they are
sometimes messy with tables
• Tabula software (MIT open license) is very useful
• versions for Windows, Mac, Linux available
Extracting data from PDF
16. • Sometimes tables are in HTML pages
• Of course, you could
• copy and paste the data into a spreadsheet
• messy, requires a lot of cleaning
• learn HTML and see the data from the table and rewrite it
• Or you can use Google Sheets, which has a very
simple method to import a table into a spreadsheet
• and then you can export it as you wish! :-)
Scraping the web
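For readers who prefer a script to a spreadsheet, here is a minimal sketch of the same idea in Python, using only the standard library. It is not the method shown in the slides: the TableParser class and the HTML fragment are hypothetical, and real pages with nested tables or merged cells would need more care.

```python
import csv
import io
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # row being built, or None outside <tr>
        self._cell = None  # text pieces of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical page fragment, invented for illustration.
html = """
<table>
  <tr><th>Country</th><th>Year</th></tr>
  <tr><td>Italy</td><td>2015</td></tr>
  <tr><td>France</td><td>2016</td></tr>
</table>
"""

parser = TableParser()
parser.feed(html)

# Write the extracted rows as CSV text.
out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(parser.rows)
csv_text = out.getvalue()
print(csv_text)
```

The result is plain CSV, so it can be imported into any spreadsheet for cleaning.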
23. • The first cell, where the IMPORTHTML function was
used, still contains the formula, which means that..
• … the whole dataset is read EVERY TIME
• you cannot modify it, filter or analyze it!
• Easy way to get rid of it:
• export it as a CSV
• then re-import it into a new sheet in Google Docs
• :-)
A warning
24. • Acquisition of Data
• Extraction of Data
• Cleaning and transforming Data
• Analysis of Data
• Presentation and visualization of Data
Data Pipeline
25. • Easy and familiar tool
• far more powerful and useful than average
users think
• Dates back to the early days of personal computing
• VisiCalc, Lotus 1-2-3, ..
• Many opportunities available:
• Google Spreadsheets
• Open/Libre Office
• Microsoft Excel
“The” tool: spreadsheet!
33. • Sorting and filtering allow you to “know” your
dataset
• .. to understand what kind of information it
contains
• .. and understand how it can contribute to
knowledge
• But before that, we must “clean up the data”
How to “know” your dataset
35. • Formatting does not come along with data
• Whitespace and new lines
• Blank cells
• Numbers that are NOT numbers
• Data in inconvenient places
• .. and many others!
Some common mistakes
36. • All the formatting is not useful
• Select all the cells (CTRL+A)
• Use Format and then “Clear formatting”
Eliminate Formatting
39. • Important to make the data readable for
processing
• Additional blanks or newlines create problems
• For example: the first item has a newline
Whitespace and linebreaks
43. • TRIM(): removes leading and trailing blanks
• CLEAN(): removes non-printable characters
• From a column B, it is possible to create a new
column C with the “cleaned data”
• then copy column C and paste it as “values only” into a
third column D
• .. and only then can you get rid of the first two
columns B and C and work only with the “cleaned” D
column
Some useful functions
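Outside a spreadsheet, the same clean-up can be sketched in a few lines of Python. The helpers trim and clean below are hypothetical names that only approximate the spreadsheet functions; the sample values are invented.

```python
def clean(text):
    """Drop non-printable characters (newlines, tabs, ...), roughly like CLEAN()."""
    return "".join(ch for ch in text if ch.isprintable())


def trim(text):
    """Remove leading and trailing blanks, like TRIM() as described above."""
    return text.strip(" ")


# Hypothetical "column B" with stray blanks, a newline and a tab.
column_b = ["  Salerno\n", "Napoli  ", "\tRoma"]

# "Column D": the cleaned values, independent of the original column.
column_d = [trim(clean(value)) for value in column_b]
print(column_d)
```

Note that dropping a newline in the middle of a value joins the two words around it, so it may be preferable to replace it with a blank first.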
48. • Empty cells are often present and create a lot of
problems
• Useful functions are COUNTBLANK, ISBLANK
• The filter mechanism is also useful
• it can be used to check the number of empty cells
• Be careful when replacing empty cells
• the message should be clear that there is no data, not
that the value is 0
Empty cells
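The difference between "no data" and "0" can be seen in a small Python sketch. The values are invented, and the helper is_blank is a hypothetical name; unlike the spreadsheet's ISBLANK, it is assumed here to treat whitespace-only cells as blank too.

```python
# Hypothetical column read from a CSV: every cell is a string.
cells = ["12", "", "7", "   ", "30"]


def is_blank(value):
    """True for empty cells and for cells containing only whitespace."""
    return value.strip() == ""


# Rough equivalent of COUNTBLANK.
blank_count = sum(is_blank(v) for v in cells)

# Keep missing values as None rather than silently turning them into 0.
values = [None if is_blank(v) else int(v) for v in cells]

present = [v for v in values if v is not None]
mean_present = sum(present) / len(present)              # ignores missing data
mean_if_zero = sum(v or 0 for v in values) / len(values)  # treats missing as 0

print(blank_count, mean_present, mean_if_zero)
```

The two means differ, which is why replacing empty cells with 0 changes the message carried by the data.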
49. • National formatting:
• in Italian, the decimal separator is the comma
• so 3,14 is NOT Pi in a non-Italian spreadsheet: it is a string!
• while 3,141 is three thousand one hundred forty-one!
• there, the comma separates the thousands
• Stray blanks between digits
• the value is not a number: it is a string
• Wrong numbers mean that we cannot compute
• sum, average, min, max, etc.
Numbers that are not numbers
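A minimal sketch of how such strings could be turned into real numbers in Python follows. The function parse_italian_number is a hypothetical helper, and it assumes that every input uses the Italian convention (comma for decimals, dot for thousands); it would give wrong results on English-formatted input.

```python
def parse_italian_number(text):
    """Convert a string such as '1.234,5' or '3 141' to a float."""
    compact = "".join(text.split())      # drop blanks between digits
    compact = compact.replace(".", "")   # '.' is the thousands separator
    compact = compact.replace(",", ".")  # ',' is the decimal separator
    return float(compact)


for raw in ["3,14", "1.234,5", "3 141"]:
    print(repr(raw), "->", parse_italian_number(raw))
```

Once the values are floats, sum, average, min and max can be computed as usual.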
57. • Acquisition of Data
• Extraction of Data
• Cleaning and transforming Data
• Analysis of Data
• Presentation and visualization of Data
Data Pipeline
58. • Useful for summarizing tables
• without creating new tables
• without creating new columns
• without writing formulas
• Of course, a pivot table is only a tool
• data analysis is a very complex topic
• and we are just “scratching the
surface” of it!
Pivot table
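To make clear what a pivot table computes, here is a minimal sketch in Python that sums one field grouped by two others. The records and the field names (region, month, amount) are invented for illustration; a spreadsheet does the same work without any code.

```python
from collections import defaultdict

# Hypothetical records, one per row of the original table.
records = [
    {"region": "North", "month": "01-January", "amount": 10},
    {"region": "North", "month": "02-February", "amount": 5},
    {"region": "South", "month": "01-January", "amount": 7},
    {"region": "North", "month": "01-January", "amount": 3},
]

# pivot[row_label][column_label] holds the sum of "amount".
pivot = defaultdict(lambda: defaultdict(int))
for record in records:
    pivot[record["region"]][record["month"]] += record["amount"]

# Print one line per region, with one value per month (0 when there is no data).
months = sorted({record["month"] for record in records})
for region in sorted(pivot):
    print(region, [pivot[region][month] for month in months])
```

Rows become regions, columns become months, and each cell is a summary of all matching records.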
66. • The columns are sorted in alphabetical order
• “April” comes before “February” :-(
• Solutions
1. use a number to indicate the month: 1, 2, .., 12
• but a number alone is not very informative in the table
2. use a string that retains the alphabetical order, such as
• “01-January”, “02-February”, etc.
• In this way we have both order and information
in the column headers
A visualization problem
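The ordering problem and the second solution can be reproduced with a short Python sketch; only four months are used to keep the example small.

```python
months = ["January", "February", "March", "April"]

# Plain alphabetical sorting loses the calendar order.
alphabetical = sorted(months)
print(alphabetical)

# Zero-padded prefixes keep the calendar order under alphabetical sorting.
labels = [f"{number:02d}-{name}" for number, name in enumerate(months, start=1)]

# Even after shuffling, sorting the labels restores the calendar order.
shuffled = sorted(labels, reverse=True)
restored = sorted(shuffled)
print(restored)
```

The zero padding matters: without it, "10" would sort before "2".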
70. • Acquisition of Data
• Extraction of Data
• Cleaning and transforming Data
• Analysis of Data
• Presentation and visualization of Data
Data Pipeline
72. • Communicating complex information visually,
in the right way
• Spreadsheets often offer many “exotic” ways of
defining charts
• often not useful for conveying information
• A running example on how to improve a chart
Building graphs and charts
90. Different relationships to be explained with charts
1. Time-series
2. Ranking
3. Part-to-whole
4. Deviation
5. Distribution
6. Correlation
7. Geospatial
Relationships with charts
106. • “Post hoc ergo propter hoc”
• A correlation between two variables does not imply
that one causes the other
• Known logical fallacy
• epidemiological studies showed that women taking
combined hormone replacement therapy (HRT) had a
lower-than-average incidence of coronary heart disease
(CHD), which suggested that HRT was protective against CHD
• but women undertaking HRT were more likely to be from higher
socio-economic groups (ABC1), with better-than-average
diet and exercise regimens, which may explain the lower
incidence of CHD
“Correlation does not mean causation”
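The effect of such a hidden factor can be shown with a small worked example in Python. All the counts below are invented for illustration and are not taken from the studies: within each socio-economic group the CHD rate is the same with or without HRT, yet the overall figures make HRT look protective.

```python
# Hypothetical counts: (number of women, CHD cases) per group.
counts = {
    ("high", "HRT"): (80, 8),
    ("high", "no HRT"): (20, 2),
    ("low", "HRT"): (20, 6),
    ("low", "no HRT"): (80, 24),
}


def rate(groups):
    """CHD rate over the union of the given (status, treatment) groups."""
    women = sum(counts[group][0] for group in groups)
    cases = sum(counts[group][1] for group in groups)
    return cases / women


# Overall comparison, ignoring socio-economic status.
overall_hrt = rate([("high", "HRT"), ("low", "HRT")])
overall_no_hrt = rate([("high", "no HRT"), ("low", "no HRT")])
print("overall:", overall_hrt, overall_no_hrt)

# Comparison within each socio-economic group.
for status in ("high", "low"):
    print(status, rate([(status, "HRT")]), rate([(status, "no HRT")]))
```

In this invented dataset the overall difference comes entirely from who takes HRT, not from HRT itself.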
118. • Part of the material comes with license CC
• picture “Bath time” by archer10 (CC-A-SA 2.0)
• Bibliography:
• “Data wrangling handbook”, OKF, https://media.readthedocs.org/pdf/datapatterns/latest/datapatterns.pdf
• School of Data, OKF, http://schoolofdata.org/courses/
• “Telling compelling stories with Numbers”, Stephen Few, Perceptual Edge, http://www.actuate.com/download/acd2012/Telling-Compelling-Stories-with-Numbers.pdf
• “Show Me the Numbers: Designing Tables and Graphs to Enlighten”, Second Edition, Stephen Few, Analytics Press, 2012
• “Choosing a good chart”, Andrew Abela, http://img.labnol.org/di/choosing_a_good_chart2.pdf
Reading list and credits
119. • Part of the work was funded by the
ROUTE-TO-PA H2020 project
• www.routetopa.eu for more info
Acknowledgments
The project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 645860.
120. • Author: Vittorio Scarano,
ROUTE-TO-PA project
• vitsca@dia.unisa.it
• License: This Work is licensed with Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
• https://creativecommons.org/licenses/by-sa/4.0/
• Available on SlideShare
License