Visualising the Australian open data and research data landscape
LAND AND WATER
Jonathan Yu, Simon Cox, Benjamin Leighton, Hendra Wijaya, Qifeng Bai
C3DIS Melbourne, May 2018
Context
“Big data”
Focus: transparency Focus: reproducibility
and access
Research questions
Understand the current state of the data landscape in
Australia and measure the complexity.
Survey Questions:
• How much data is available?
• How varied?
• Are they interoperable? Metadata and keywords used
Implications
OKFN Global Open Data Index survey 2016
May 2017
http://global.survey.okfn.org/
https://www.cio.com.au/article/618858/australia-tops-global-open-data-index/
What about quality of life indicators…
What about environmental impacts?
What about research data?
http://www.opendata500.com/au/
Qualitative approach
Doesn’t report on volumes, formats, trends
Usage side
Data ecosystem in Australia
Slide courtesy of Paul Box (CSIRO)
Diagram labels:
• Information ‘prosumer’ community: Citizens, Research, Industry, Government
• Existing climate (and other) prediction information infrastructure, e.g. ACCESS, CCIA
• Academies and research centres (CRCs, NESP hubs, CoEs)
• ECOcloud
• Research Infrastructures
• Government Data Infrastructures
… Need a more holistic picture than what is promoted in the mainstream
CSIRO Knowledge Network
http://kn.csiro.au
URIs/HTTP identifiers for everything…
Open Data Search across Gov + Research
Indexing:
• 137,441 datasets
• 174,809 geo-features
In development:
• Python and R libraries for in-situ analysis
Screenshots: web portal, dataset sources list, search example
Open Data Survey Methodology
1. Query – based on Knowledge Network indexes for format types,
and keywords/tags/subjects against known metadata formats –
CKAN, GeoNetwork, Socrata, CSIRO DAP
2. Fetch file sizes - via web crawler over every resource URL
3. Query special cases - size and file resources – CSIRO DAP, known
THREDDS servers (CSIRO, NCI, AODN/IMOS, TPAC, TERN OzFlux)
4. Aggregate and visualise – source, formats, format types*,
keywords
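Step 1 can be pictured with a small sketch. It assumes a record shaped like the JSON that CKAN's `package_show` action returns (a `resources` list with `format` and `url`, and a `tags` list with `name`). The function name `summarise_package` and the sample record are made up for illustration, and the Knowledge Network indexes actually used in the survey are not shown.

```python
from collections import Counter


def summarise_package(package):
    """Collect formats, resource URLs and keywords from one CKAN-style record."""
    formats = Counter()
    urls = []
    for resource in package.get("resources", []):
        # Normalise the free-text format label; empty labels become "unknown".
        label = (resource.get("format") or "").strip().lower() or "unknown"
        formats[label] += 1
        if resource.get("url"):
            urls.append(resource["url"])
    keywords = [tag["name"] for tag in package.get("tags", []) if tag.get("name")]
    return {"formats": formats, "urls": urls, "keywords": keywords}


# Made-up record for illustration only (not taken from any real portal).
sample_package = {
    "name": "example-dataset",
    "resources": [
        {"format": "CSV", "url": "https://example.org/a.csv"},
        {"format": "csv ", "url": "https://example.org/b.csv"},
        {"format": "", "url": "https://example.org/c"},
    ],
    "tags": [{"name": "Oceans"}, {"name": "Water Temperature"}],
}

summary = summarise_package(sample_package)
print(summary["formats"])
print(summary["keywords"])
```

The collected URLs are what a crawler in step 2 would visit, and the format labels and keywords feed the aggregation in step 4.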
Repositories considered
Columns: Open Gov | Open Research | CKAN | THREDDS | Socrata | WS / Other | File size info
• data-gov-au, data-nsw-gov-au, data-qld-gov-au, data-sa-gov-au, data-wa-gov-au, nsw-oeh, nsw-seed, www-data-vic-gov-au: Yes, Some, Yes, Contained some, Yes
• AODN, CSIRO THREDDS, NCI, TERN ozflux, TPAC: Yes, Yes, Contained some, Yes
• CSIRO DAP: Yes, Contained some, Yes
• ALA, AURIN, TERN: Yes, Yes
• GA: Yes, Yes
• data.act, data.melbourne: Yes, Yes
Results
(October 2017)
Datasets/collections: 125,819
Files crawled: 10,083,968 (~946 TB)
File Formats: 1,548 unique formats
Keywords found: 667,965 (25,000 unique)
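The file counts and total volume come from visiting each resource URL. A minimal sketch of one way to do that is an HTTP HEAD request that reads the `Content-Length` header. This is an assumption about the crawler, not a description of it; the function names are hypothetical, and servers that omit the header are simply counted as unknown.

```python
import urllib.request


def parse_content_length(headers):
    """Return the size in bytes from a header mapping, or None if absent or invalid."""
    value = headers.get("Content-Length")
    if value is None:
        return None
    try:
        size = int(value)
    except ValueError:
        return None
    return size if size >= 0 else None


def fetch_size(url, timeout=10):
    """Issue a HEAD request and return the advertised size (None if unknown)."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return parse_content_length(response.headers)
    except OSError:  # URLError, HTTPError and socket timeouts are OSError subclasses
        return None


def total_bytes(sizes):
    """Sum the known sizes and report how many resources had no size."""
    known = [size for size in sizes if size is not None]
    return sum(known), len(sizes) - len(known)


# Offline illustration with made-up sizes; fetch_size() is not called here.
print(total_bytes([2048, None, 1024]))
```

Reporting the number of unknown sizes alongside the total keeps the gap visible when a repository does not advertise file sizes.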
Counts - # of datasets per source
Frequency - # datasets per source published over time
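Both charts reduce to simple counting once each dataset has a source and a publication timestamp. The sketch below assumes records are `(source, created)` pairs with ISO 8601 timestamps (CKAN's `metadata_created` field has this shape); the records themselves are invented for illustration.

```python
from collections import Counter
from datetime import datetime


def datasets_per_source(records):
    """Count datasets for each source."""
    return Counter(source for source, _ in records)


def datasets_per_month(records):
    """Count datasets for each (source, YYYY-MM) pair."""
    counts = Counter()
    for source, created in records:
        month = datetime.fromisoformat(created).strftime("%Y-%m")
        counts[(source, month)] += 1
    return counts


# Invented records for illustration only.
records = [
    ("data-gov-au", "2017-09-14T02:11:05.123456"),
    ("data-gov-au", "2017-09-30T23:59:59"),
    ("data-gov-au", "2017-10-01T00:00:00"),
    ("CSIRO DAP", "2017-10-03T04:12:45"),
]

print(datasets_per_source(records))
print(datasets_per_month(records))
```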
Variation by format and volume (counts of datasets)
API – Web services and APIs
code – Software code
z – Zipped or compressed files
db – Database dumps
doc – documents
gis – Geospatial files
media – Images, videos and other media
otr – Other
str – Structured data (XML, JSON, netCDF)
csv – Tabular e.g. CSV and Excel files
web – Web documents e.g. HTML
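Grouping the many raw format labels into these format types can be sketched as a lookup table. Only the examples named in the legend (CSV and Excel, XML, JSON, netCDF, HTML) come from the slides; every other assignment below is an assumption made for illustration, and anything unrecognised falls back to `otr`.

```python
# Format type -> raw format labels. Entries beyond the legend's own examples
# are illustrative assumptions, not the survey's actual mapping.
FORMAT_TYPES = {
    "API": {"api", "wms", "wfs"},
    "code": {"py", "r"},
    "z": {"zip", "gz", "tar"},
    "db": {"sql"},
    "doc": {"pdf", "doc", "docx"},
    "gis": {"shp", "kml", "geotiff"},
    "media": {"jpg", "png", "mp4"},
    "str": {"xml", "json", "netcdf"},
    "csv": {"csv", "xls", "xlsx"},
    "web": {"html", "htm"},
}

# Invert the table once so each lookup is a single dictionary access.
LABEL_TO_TYPE = {
    label: format_type
    for format_type, labels in FORMAT_TYPES.items()
    for label in labels
}


def classify(raw_format):
    """Map a raw format label to a format type, defaulting to 'otr' (Other)."""
    return LABEL_TO_TYPE.get(raw_format.strip().lower(), "otr")


print(classify("CSV"), classify(" netCDF "), classify("HTML"), classify("weird"))
```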
Variation by format and volume (counts of datasets): observations
• Quite a bit of CSVs
• Look here for GIS data
• Structured data prolific
• Lots of APIs!
• Compressed files popular too
Volume - How much data? Open Gov Data
Chart: aggregated size by repository source (gigabytes)
https://bl.ocks.org/jyucsiro/227dcf04d0b30af3f09435f4d9ef9739
Volume - How much data? Open Research data + Open Gov Data (terabytes)
• Open Gov Data – 1.7 TB
• CSIRO DAP – 672.3 TB (includes astronomy data)
• Total THREDDS – 272.9 TB
Volume - How much data? Open Research data + Open Gov Data, astronomy data removed (terabytes)
• Open Gov Data – 1.7 TB
• Total THREDDS – 272.9 TB
• CSIRO DAP* – 74.1 TB (* removed astronomy data)
Open research data vs Open Gov Data (factor of 700)
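The rounded figures on these slides can be cross-checked with a little arithmetic. The sketch below uses only the three quoted volumes. Their sum matches the ~946 TB reported in the results. The ratio computed from these rounded values is in the hundreds rather than exactly 700, so the quoted factor may rest on unrounded or differently scoped totals that are not given here.

```python
# Volumes in terabytes, as quoted on the slides (rounded values).
volumes_tb = {"Open Gov Data": 1.7, "CSIRO DAP": 672.3, "Total THREDDS": 272.9}
csiro_dap_without_astronomy_tb = 74.1

total_tb = sum(volumes_tb.values())
gov_tb = volumes_tb["Open Gov Data"]

# Research data = everything that is not Open Gov Data.
research_tb = total_tb - gov_tb
ratio_all = research_tb / gov_tb

# Same ratio with the astronomy data taken out of CSIRO DAP.
research_no_astro_tb = csiro_dap_without_astronomy_tb + volumes_tb["Total THREDDS"]
ratio_no_astro = research_no_astro_tb / gov_tb

print(f"total: {total_tb:.1f} TB")
print(f"research / gov: {ratio_all:.0f}x")
print(f"research / gov (no astronomy): {ratio_no_astro:.0f}x")
```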
Variation by format and volume (bytes)
Format types as in the legend above (API, code, z, db, doc, gis, media, otr, str, csv, web)
Size axis: 0, 1 MB, 100 MB, 1 GB, 50 GB, 100 GB, 500 GB, 1 TB, >100 TB
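One way to place an aggregated byte total on a size axis like this is to treat each tick as the lower edge of a bin. That reading of the axis is an assumption, as is the use of decimal units (1 MB = 10^6 bytes); the sketch reuses the tick labels as bin names.

```python
from bisect import bisect_right

MB, GB, TB = 10**6, 10**9, 10**12  # decimal units, assumed

# Lower edge of each bin, in bytes, paired with the tick label from the axis.
EDGES = [0, 1 * MB, 100 * MB, 1 * GB, 50 * GB, 100 * GB, 500 * GB, 1 * TB, 100 * TB]
LABELS = ["0", "1mb", "100mb", "1gb", "50gb", "100gb", "500gb", "1tb", ">100tb"]


def size_bin(num_bytes):
    """Return the label of the bin whose lower edge is the largest one <= num_bytes."""
    if num_bytes < 0:
        raise ValueError("size must be non-negative")
    return LABELS[bisect_right(EDGES, num_bytes) - 1]


print(size_bin(0), size_bin(5 * MB), size_bin(75 * GB), size_bin(250 * TB))
```

With this convention a total of exactly 100 TB already lands in the last bin, which is a simplification of the ">100tb" label.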
Variation – Keywords…
Peeking into the content
Keywords found: 667,965 (25,000 unique)
Top 5 CKAN tags (chart, count axis 0 to 20,000):
• Earth Sciences
• Oceans
• GA Publication
• Ocean Temperature
• Water Temperature
Top 5 GeoNetwork subjects (chart, count axis 0 to 400):
• environment
• EARTH SCIENCES
• National Computational Infrastructure (NCI)
• climatologyMeteorologyAtmosphere
• ATMOSPHERIC SCIENCES
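The two top-5 lists show the same subject written with different casing ("Earth Sciences" and "EARTH SCIENCES"). A minimal sketch of counting keywords after a light normalisation follows. Whether the survey normalised keywords this way is not stated, so treat the `normalise` step as an assumption; the sample keywords are taken from the lists above with invented repetition.

```python
from collections import Counter


def normalise(keyword):
    """Collapse runs of whitespace and ignore case."""
    return " ".join(keyword.split()).casefold()


def top_keywords(keywords, n=5):
    """Return the n most frequent keywords after normalisation."""
    counts = Counter(normalise(k) for k in keywords if k.strip())
    return counts.most_common(n)


sample_keywords = ["Earth Sciences", "EARTH SCIENCES", "Oceans", "earth  sciences", ""]
print(top_keywords(sample_keywords, n=2))
```

Normalisation of this kind only merges spelling variants. It does not relate genuinely different terms, which is where controlled vocabularies would help.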
Variation - Keyword co-occurrences
Theme = ‘environment’, n>10
Missing link due to ‘weak semantics’
Theme = ‘environment’, n>50
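A co-occurrence graph of this kind can be built by counting, for every pair of keywords, how many datasets carry both. The sketch below reads "n>10" as a minimum pair count, which is an assumption; the function names and the toy datasets are hypothetical, and the toy example uses a threshold of 1 so that it produces output.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(datasets):
    """Count how many datasets contain each unordered keyword pair."""
    pairs = Counter()
    for keywords in datasets:
        # sorted(set(...)) removes duplicates and gives each pair a stable order.
        for a, b in combinations(sorted(set(keywords)), 2):
            pairs[(a, b)] += 1
    return pairs


def edges_for_theme(pairs, theme, min_count):
    """Keep pairs that involve the theme keyword and occur more than min_count times."""
    return {pair: n for pair, n in pairs.items() if theme in pair and n > min_count}


# Toy keyword lists, one per dataset.
datasets = [
    ["environment", "water"],
    ["environment", "water", "soil"],
    ["water", "soil"],
]

pairs = cooccurrence_counts(datasets)
print(edges_for_theme(pairs, "environment", min_count=1))
```

Raising the threshold thins the graph, which is the difference between the n>10 and n>50 views.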
Variation - Keyword correlations (coefficient >= .7)
Keyword correlations – Open Gov (coefficient >= .9)
Keyword correlations – Research data (coefficient >= .9)
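The slides do not say which correlation coefficient was used. As one plausible reading, the sketch below computes the Pearson correlation between binary presence vectors (1 if a dataset carries the keyword, 0 otherwise) and keeps the pairs at or above a threshold. The choice of Pearson, the function names and the toy data are all assumptions.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def correlated_keywords(datasets, threshold=0.7):
    """Return keyword pairs whose presence vectors correlate at or above the threshold."""
    vocabulary = sorted({k for keywords in datasets for k in keywords})
    presence = {k: [1 if k in keywords else 0 for keywords in datasets] for k in vocabulary}
    result = {}
    for a, b in combinations(vocabulary, 2):
        r = pearson(presence[a], presence[b])
        if r >= threshold:
            result[(a, b)] = r
    return result


# Toy keyword sets, one per dataset.
datasets = [{"ocean", "temperature"}, {"ocean", "temperature"}, {"soil"}, {"soil"}]
print(correlated_keywords(datasets, threshold=0.7))
```

For a vocabulary of around 25,000 unique keywords the pairwise loop is large, so a real run would likely restrict the vocabulary to frequent keywords first.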
Open Data Ecosystem
• Open Gov: data.vic.gov.au, NSW OEH, …
• Open Research: ALA, AURIN, …
Open Data content signatures?
Keyword correlations – data.vic.gov.au (p >= .6)
Keyword correlations – AURIN (p >= .6)
Keyword correlations – ALA (p >= .6)
Keyword correlations – NSW OEH (p >= .6)
Keyword correlations – CSIRO DAP (p >= .6)
Variation - Keyword co-occurrences and correlations
Entirely automated from metadata catalogues.
Intent was to survey but this correlation data could have interesting applications…
• “Content signatures” – summary, compare
• Keyword suggestions
• Trends over time
• (Semi-)automated metadata generation
Better if keywords were stronger semantic tags (e.g. controlled vocabularies)…
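The "content signature" idea is not defined further on the slides. One hypothetical way to make it concrete is to summarise a repository by the relative frequency of its most common keywords and compare two repositories by cosine similarity. The sketch below takes that approach; the function names, the `top_n` cut-off and the toy keyword lists are all assumptions.

```python
from collections import Counter
from math import sqrt


def signature(keywords, top_n=20):
    """Relative frequency of the top_n most common keywords (case-insensitive)."""
    top = Counter(k.casefold() for k in keywords).most_common(top_n)
    total = sum(count for _, count in top)
    return {keyword: count / total for keyword, count in top}


def cosine_similarity(sig_a, sig_b):
    """Cosine similarity of two signatures; 0.0 if either is empty."""
    dot = sum(weight * sig_b.get(keyword, 0.0) for keyword, weight in sig_a.items())
    norm_a = sqrt(sum(w * w for w in sig_a.values()))
    norm_b = sqrt(sum(w * w for w in sig_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy keyword lists standing in for two repositories.
repo_a = signature(["water", "Water", "soil"])
repo_b = signature(["water", "soil", "soil"])
print(round(cosine_similarity(repo_a, repo_b), 3))
```

Because the comparison depends on exact keyword matches, it inherits the weak-semantics problem noted above; controlled vocabularies would make such signatures more comparable across catalogues.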
Reflections on visualisations
Some great tooling in the data analytics space, particularly R, for producing visualisations and exploring data
Opportunities
• Regular updates
• UI elements (like sliders) to filter and select ranges, e.g. keyword analysis, subset organisations, date ranges
Reflections on survey and future work
First cut quantitative survey across open data ecosystem –
government and research.
Methodology could be equally applied to internal data repositories.
Provider-centric perspective – we need usage/reuse perspectives!
Future work
• Allow better filtering (e.g. on source, clusters) and address gaps
• Unify qualitative and quantitative – provider and user perspectives
• Annual national “State-of-Open-Data” analysis valuable?
Summary & Take Home Messages
Australia is ranked high on the open data scale but we lack a consistent approach to quantify the data landscape
Presented a proof-of-concept, 1st-cut quantitative survey of open data in Australia
“Open Data” needs to include Open Research Data
• Shown volume and velocity beyond just the Aus Gov Data
Future work is needed to gain better insight into the variety of data being published, and tools are needed for understanding how the data is actually used, as that is not well understood.
Jonathan Yu
Research Data Scientist
Jonathan.Yu@csiro.au
@jonyu6
@jyucsiro
Thank you
Link to data visualisations:
https://gist.github.com/jyucsiro/3bea05a89770b9651e97f3f57d7eb332
# datasets per source timeseries
# datasets per source timeseries 2015 - current
# datasets per source timeseries 2015 – current middle tier
How varied? (Formats) – Open Gov Data
Counts of resources aggregated by format type
Variation by format and volume (bytes) – exclude CSIRO DAP
Format types as in the legend above (API, code, z, db, doc, gis, media, otr, str, csv, web)
Size axis: 0, 1 MB, 100 MB, 1 GB, 50 GB, 100 GB, 250 GB, 500 GB, 1 TB
Observations:
• Large volumes of structured data in THREDDS
• Large volumes of compressed files in these Open Gov Data
• Documents can amount to large volumes too!
Variation - Format co-occurrences
Variation - Formats correlation


Editor's Notes

  • #8 “Open Data” seems to be emphasised as Open Gov Data, whereas it ignores the research infrastructures in Australia and data collected by NCRIS, universities, research institutions (e.g. CSIRO), etc., which are often open research data.
    1. Data ecosystems are complex and analogous to natural ecosystems
       - Individual > populations > communities within ecosystems
       - Diverse, dynamic and adaptive
       - Each ecosystem has its own distinct ‘context’ set of conditions – combination of constituent parts – culture, behaviours, interactions, system installed base
    2. Data flows through this ecosystem like food webs, e.g. biodiversity data
       - Data collected by many agencies
       - Consumed, aggregated and processed to support others in the ecosystem
    Consumers bear high cost - aggregation and integration, as data is diverse
    Fragile supply chains
    Need data governance – PC NDC
    Agreements including standards around data are necessary to enable data to flow
    Rnet – academic and research internet provider (telecoms networks layers) – role in connecting research sensors in Research IoT