SlideShare a Scribd company logo
CLIO INFRA
Data analysis in Dataverse & visualization
of datasets on historical maps
Dataverse Community meeting 2015,
June 11, Harvard University
Vyacheslav Tykhonov Richard Zijdeman Jerry de Vries
International Institute of Social History
Introduction
The International Institute of Social History (IISH) collects information in the field of social
history and makes it available to the public. The IISH is one of the major scientific heritage
institutions in the Netherlands. In the field of socio-economic historical research IISH plays a
prominent role.
CLIO INFRA is Digital Research Infrastructure for the Arts and Humanities developed to
integrate a number of databases (hubs) consisting of data on global social, economic and
institutional indicators over the past five centuries, with special attention for the past 200
years. Clio Infra developed by IISH.
Our mission
Clio Infra is the bridge between Data Collections and Research Datasets.
“Old school” researchers time:
Collecting datasets in CSV, Access or Excel and other formats mostly on his own without real collaboration
between all researchers in the same field. Very closed model.
Digital Humanities era:
Young researchers use various computer tools to collect, analyze and combine research datasets they’ve
got from older generation of researchers. They can contribute and help each other and build communities.
The future:
The goal is to build tools to get datasets from “old school” researchers, standardize and store data in the
Clouds in order provide access to Open Data for researchers all over the world and made possible
collaboration between them.
Clio Infra Collaboration platform
Clio Infra functionality based on the Dataverse solution:
- teams collaboratively can curate, share and analyze research datasets
- teams members can share the responsibility to collect data on specific variables (for
example, countries) and inform each other about changes and additions
- dataset version control system is able to track changes in datasets
- other researchers can download their own copy of the data if dataset is published as Open
Data
Dataverse is flexible metadata store (repository) that connected with Research datasets
storage by our Data Processing Engine (DPE)
Added Value for future Researchers
The benefits of data sharing can be classified in terms of Metadata and Data access and
sharing (Collection tools) and Statistical Analysis and Data Mining (Research tools):
● access to a specific case study, citing and finding data
● access to the universe of data from Dataverse network that can organize and display
them for browsing and searching
● data filtering: researchers with proper authorization can obtain the subset of data
provided by data collector
● data analysis to run descriptive statistics and graphics, visualization, plotting on
historical maps
● Data APIs to export data for further analysis by popular statistical packages (STATA,
SPSS, R, iPython Notebook) and advanced data mining tools that will be developed in
the future (always up-to-date solution)
Collaboration possibilities
Descriptive metadata:
- dataset file
- documentation
- standardized tables with codes can be part of dataset
Sharing and collaboration capabilities:
- requesting unique API token by every researcher - user of Dataverse
- exchanging API token between researchers and granting permissions to work on the
same datasets as a team using data analysis tools
Collaboration Data Workflow
- Draft Datasets are visible only for owners
- with API tokens researchers can get access to interactive dashboard to get some
insights about the data stored in the Dataverse
- every dataset converted to dataframe
- dashboard can provide access to all variables from the dataframe and visualize them
on charts, graphs, historical maps and treemaps
After dataset is prepared to go public, it can be published in Collabs:
- guest users can download the copy of dataset
- team members with permissions and authorized API token can contribute to dataset
Data Analysis in Clio Infra
● Data Processing Engine has python core and developed for nlgis.nl (Netherlands
Geographic Information System) and distributed as widget
● interactive data exploration dashboard based on D3 library
● every dataset from Dataverse is available as Data API (json) and can be connected to
any statistical package
● filtering the data on specific years and variables based on dataframes (pandas, python
for data analysis)
● data quality check is the quick visual tool to apply Benford’s law for all values from
specific dataset
● dataframes can be open by researchers in various statistical packages like iPython
notebook, R Studio, SPSS, STATA, Mathlab, etc
Data Processing Engine (DPE) specification
● can split values from any dataset in number of categories specified by researcher (8 by
default)
● algorithm to categorize data values in proper categories can be selected manually
(percentile by default)
● can define maximum possible categories for specific dataset if there is no way to get
categories number specified by user of the system (for example, if there are 2-3
categories of data values)
● data ranges should be defined to get possibility to visualize data on some chart or map
in the right scale
● colors can be specified by user (Color Brewing, see http://colorbrewer2.org)
● legend generated and attached to all visualizations automatically
● values with missing data shown as 'no data' regions on map
● all data values delivered by Data API to make the data analysis platform independent
and communicate with other systems or statistical packages
API Service (Data API)
Data API provided by Data Processing Engine is the most important functionality for the well
equipped digital infrastructure:
● easy way to analyze data in popular statistical packages (STATA, SPSS, Excel)
● use common data science programming languages like Python, R to perform more
advanced research using external Data Science libraries
● analyze data with toolboxes like Wolfram|Alpha and other Discovery Platforms (added
value for the future)
● suitable for other researchers and developers to use advanced technique and data
mining tools that aren’t developed yet
Example of output from Data API
● every dataset ingested by DPE available as Data API with unique handler
● API can be filtered by variables extracted from the content of data file
Example:
/api/data?&handle=F16UDU:30:31&countrycode=USA&year=1880&categories=8&datarange=calculate
"United States of America": {
"code": "F16UDU30_31",
"color": "#FF7F00",
"countrycode": "USA",
"id": "2085",
"indicator": "Total Urban Population",
"intcode": "840",
"r": 923.99,
"region": "W. Offshoots",
"units": "x 1000",
"value": 14264.0,
"year": 1880
}
}
Data visualization and plotting data on historical maps
● Data Processing Engine (DPE) is the core of data visualization process and connected
to geoservice
● data attributes like scales and colors calculated by DPE on the fly based on the input of
researcher (for example, number of categories to split data) and the part of Data API
● histograms, cross tabulations, enhanced descriptive statistics based on pandas
dataframes
● visualization of datasets on historical maps will be available to plot data on maps for
last 500 years
Historical maps services: Geoservice and Geocoder
Delivered by Webmapper.nl and integrated with common Clio Infra infrastructure.
Geoservice basic requirements:
- should provide actual GeoAPI with historical polygons for all countries and regions
based on standardized codes
- available as geojson/topojson for online visualization
- QGIS toolbox should be supported to upload new maps and update old maps as shape
files
Geocoder should be able to standardize all geographical locations in different datasets:
- USSR and Soviet Union should be recognized as one country with the same PID
- Germany before and after 1990
- Indonesia before and after 1999
GeoAPI example
Geoservice can provide polygons for specific years on the national or world level rendered
as topojson or geojson.
GeoAPI:
/api/maps?world=on&year=1962
Polygons for all countries will be delivered as topojson:
arcs":[[1782,2186]]}]}},"arcs":[[[8387,6231],[0,5],[1,1],[1,-1],[2,0],[2,-1],[3,-4],[1,
-3],[0,-1],[-1,-5],[0,-1],[-1,2],[0,1],[-2,2],[-3,1],[-1,0],[-1,0],[-1,3],[0,1]],
[[8390,6247],[1,1],[0,1],[2,1],[1,0],[1,-2],[-1,-5],[-1,0],[-1,1],[-1,1],[-1,1],[0,1]],
[[8391,6204],[0,2],[-1,1],[-1,-1],[0,1],[0,1],[1,3],[1,0],[2,-6],[0,-1],[0,-1],[0,-1],
[-1,1],[-1,1]],[[8364,6093],[0,2],[2,5],[1,0],[1,-2],[-1,-6],[-1,-3],[-1,0],[-1,0],[0,1],
[0,3]],[[5941,6575],[0,-1],[-1,0],[-1,1],[0,1],[-1,0],[-1,0],[-1,0],[-1,0],[0,-1],[-1,-1],
[0,-2],[0,-1],[0,-2],[0,-1],[-1,-3],[-3,-2],[-4,-4],[-1,-1],[-1,0],[-2,-1],[-2,0],[-1,0],[-1,
-1],[-1,-2],[-5,1],[-1,0],[-1,-1],[-1,-1],[-1,0],[-2,1],[-4,3],[-1,1],[-1,2],[-2,8],[-1,4],
[-1,9],[0,1],[0,2],[0,1],[1,0],[0,-1],[0,-1],[1,-1],[0,-1],[1,0]
Integration of Dataverse, DPE, geoservice and geocoder in 6 steps
1. Data API communicates with geocoder to find standardized codes for all locations in
dataset
2. the same geocode with associated polygons is coming from geoservice
3. data and geoservice matching in the frontend by any geospatial javascript library
4. attributes like colors and scales filling the map polygons
5. legend with scales, values and colors coming from DPE to make map look complete
6. notes, sources and other metadata extracting from dataverse API to provide clear
explanation of generated historical map
Live demo for Labor Conflicts in 100 years
Dataset Quality Check
Benford’s law to do test of the quality of data
Linked Edit Rules: a methodology to publish, link, combine and execute edit rules on the
Web as Linked Data to verify consistency of statistical datasets and recognize wrong filled
data values (for example, characters in values where numbers expected)
Chart visualization to get overview of missing data values
Benford’s law in action on real dataset
Datasets combining and aggregation in data exploration tools
● tools that can predict the same variables disambiguated in different datasets (Year and
Jaar)
● geocoding services can standardize different regions (Netherlands → NL, Amsterdam
→ AMS)
● all possible relationship paths can be ranked from the “best” (100%) to the “worst”
(0%), for example, value “05” for variable “Month” can be recognized as “May”
● standardized datasets can be used as “reference” data for other datasets from other
researcher groups and depot services (CHIA, Harvard DataVerse Network, MIT,
DANS)
Clio Infra Infrastructure is suitable for different kinds of datasets
Quantitative datasets that store quantity data values (numbers):
- data measured (length, speed, height, age)
- example: current clio-infra datasets
Qualitative datasets store quality observations (descriptions):
- data can be observed (colors, textures, professions, groups)
- example: HISCO, world strikes dataset
Data Visualization is different for different kinds of datasets:
- quantity can be plotted on charts, graphs, historical maps
- quality (hierarchy) can be visualized on treemaps and maps
Example: Interactive data exploration based on API token
Summary: data exploration and analysis will extend Dataverse functionality
● data quality check during upload/update of dataset
● automatic ingestion and recognition of years and locations in datasets
● integrated with geocoder and geoservice to get polygons for maps
● interactive dashboard to do visual exploration of variables from dataset (graph /
histogram / scatterplot)
● data processing engine to plot data on interactive historical maps (if dataset has
geospatial data)
● correlation for continuous variables
● regression analysis for estimating the relationships among variables
● building treemaps for qualitative data analysis (hierarchical data coming soon)
Thank you!
Any suggestions?

More Related Content

What's hot

ArchAIDE Project Introduction
ArchAIDE Project IntroductionArchAIDE Project Introduction
ArchAIDE Project Introduction
ArchAIDE Project
 
2014 july use_r
2014 july use_r2014 july use_r
2014 july use_r
Yasmin Lucero
 
Seven50 Sparc Overview
Seven50 Sparc OverviewSeven50 Sparc Overview
Seven50 Sparc Overview
Roar Media
 
Graphalytics: A big data benchmark for graph-processing platforms
Graphalytics: A big data benchmark for graph-processing platformsGraphalytics: A big data benchmark for graph-processing platforms
Graphalytics: A big data benchmark for graph-processing platforms
Graph-TA
 
OpenCube Workshop at eGov2015 & ePart2015 dual conference
OpenCube Workshop at eGov2015 & ePart2015 dual conferenceOpenCube Workshop at eGov2015 & ePart2015 dual conference
OpenCube Workshop at eGov2015 & ePart2015 dual conference
Evangelos Kalampokis
 
A Survey on Data Mapping Strategy for data stored in the storage cloud 111
A Survey on Data Mapping Strategy for data stored in the storage cloud  111A Survey on Data Mapping Strategy for data stored in the storage cloud  111
A Survey on Data Mapping Strategy for data stored in the storage cloud 111
NavNeet KuMar
 

What's hot (6)

ArchAIDE Project Introduction
ArchAIDE Project IntroductionArchAIDE Project Introduction
ArchAIDE Project Introduction
 
2014 july use_r
2014 july use_r2014 july use_r
2014 july use_r
 
Seven50 Sparc Overview
Seven50 Sparc OverviewSeven50 Sparc Overview
Seven50 Sparc Overview
 
Graphalytics: A big data benchmark for graph-processing platforms
Graphalytics: A big data benchmark for graph-processing platformsGraphalytics: A big data benchmark for graph-processing platforms
Graphalytics: A big data benchmark for graph-processing platforms
 
OpenCube Workshop at eGov2015 & ePart2015 dual conference
OpenCube Workshop at eGov2015 & ePart2015 dual conferenceOpenCube Workshop at eGov2015 & ePart2015 dual conference
OpenCube Workshop at eGov2015 & ePart2015 dual conference
 
A Survey on Data Mapping Strategy for data stored in the storage cloud 111
A Survey on Data Mapping Strategy for data stored in the storage cloud  111A Survey on Data Mapping Strategy for data stored in the storage cloud  111
A Survey on Data Mapping Strategy for data stored in the storage cloud 111
 

Viewers also liked

Bridging research and collections
Bridging research and collectionsBridging research and collections
Bridging research and collections
vty
 
HiTIME project
HiTIME projectHiTIME project
HiTIME project
vty
 
The recovery of netherlands geographic information system (nlgis 2)
The recovery of netherlands geographic information system (nlgis 2)The recovery of netherlands geographic information system (nlgis 2)
The recovery of netherlands geographic information system (nlgis 2)
vty
 
API economy
API economyAPI economy
API economy
vty
 
FAIR Dataverse
FAIR DataverseFAIR Dataverse
FAIR Dataverse
vty
 
Clio infra Collabs data analysis tools
Clio infra Collabs data analysis toolsClio infra Collabs data analysis tools
Clio infra Collabs data analysis tools
vty
 
Dataverse opportunities
Dataverse opportunitiesDataverse opportunities
Dataverse opportunities
vty
 

Viewers also liked (7)

Bridging research and collections
Bridging research and collectionsBridging research and collections
Bridging research and collections
 
HiTIME project
HiTIME projectHiTIME project
HiTIME project
 
The recovery of netherlands geographic information system (nlgis 2)
The recovery of netherlands geographic information system (nlgis 2)The recovery of netherlands geographic information system (nlgis 2)
The recovery of netherlands geographic information system (nlgis 2)
 
API economy
API economyAPI economy
API economy
 
FAIR Dataverse
FAIR DataverseFAIR Dataverse
FAIR Dataverse
 
Clio infra Collabs data analysis tools
Clio infra Collabs data analysis toolsClio infra Collabs data analysis tools
Clio infra Collabs data analysis tools
 
Dataverse opportunities
Dataverse opportunitiesDataverse opportunities
Dataverse opportunities
 

Similar to Data analysis in dataverse & visualization of datasets on historical maps

TEAMS 6, 7 and 8
TEAMS 6, 7 and 8TEAMS 6, 7 and 8
TEAMS 6, 7 and 8
plan4all
 
Geohosting
GeohostingGeohosting
Geohosting
Karel Charvat
 
ENGAGE Workshop at OpenDataWeek2013
ENGAGE Workshop at OpenDataWeek2013ENGAGE Workshop at OpenDataWeek2013
ENGAGE Workshop at OpenDataWeek2013
Valerie BRASSE
 
Gsoc proposal
Gsoc proposalGsoc proposal
Gsoc proposal
AyushBansal122
 
Inspire hack 2017-linked-data
Inspire hack 2017-linked-dataInspire hack 2017-linked-data
Inspire hack 2017-linked-data
Raul Palma
 
Team 05 linked data generation
Team 05 linked data generationTeam 05 linked data generation
Team 05 linked data generation
plan4all
 
Gsoc proposal 2021 polaris
Gsoc proposal 2021 polarisGsoc proposal 2021 polaris
Gsoc proposal 2021 polaris
AyushBansal122
 
EUDAT data architecture and interoperability aspects – Daan Broeder
EUDAT data architecture and interoperability aspects – Daan BroederEUDAT data architecture and interoperability aspects – Daan Broeder
EUDAT data architecture and interoperability aspects – Daan Broeder
OpenAIRE
 
A Comprehensive Guide to Data Science Technologies.pdf
A Comprehensive Guide to Data Science Technologies.pdfA Comprehensive Guide to Data Science Technologies.pdf
A Comprehensive Guide to Data Science Technologies.pdf
GeethaPratyusha
 
Executable papers
Executable papersExecutable papers
Executable papers
Anita de Waard
 
Towards a Community-driven Data Science Body of Knowledge – Data Management S...
Towards a Community-driven Data Science Body of Knowledge – Data Management S...Towards a Community-driven Data Science Body of Knowledge – Data Management S...
Towards a Community-driven Data Science Body of Knowledge – Data Management S...
Research Data Alliance
 
Geospatial metadata and spatial data workshop: 19 June 2014
Geospatial metadata and spatial data workshop: 19 June 2014Geospatial metadata and spatial data workshop: 19 June 2014
Geospatial metadata and spatial data workshop: 19 June 2014
EDINA, University of Edinburgh
 
General concepts: DDI
General concepts: DDIGeneral concepts: DDI
General concepts: DDI
Arhiv družboslovnih podatkov
 
Tag.bio: Self Service Data Mesh Platform
Tag.bio: Self Service Data Mesh PlatformTag.bio: Self Service Data Mesh Platform
Tag.bio: Self Service Data Mesh Platform
Sanjay Padhi, Ph.D
 
DSpace-CRIS: a CRIS enhanced repository platform
DSpace-CRIS: a CRIS enhanced repository platformDSpace-CRIS: a CRIS enhanced repository platform
DSpace-CRIS: a CRIS enhanced repository platform
Andrea Bollini
 
SPatially Explicit Data Discovery, Extraction and Evaluation Services (SPEDDE...
SPatially Explicit Data Discovery, Extraction and Evaluation Services (SPEDDE...SPatially Explicit Data Discovery, Extraction and Evaluation Services (SPEDDE...
SPatially Explicit Data Discovery, Extraction and Evaluation Services (SPEDDE...
aceas13tern
 
KIT-601 Lecture Notes-UNIT-5.pdf Frame Works and Visualization
KIT-601 Lecture Notes-UNIT-5.pdf Frame Works and VisualizationKIT-601 Lecture Notes-UNIT-5.pdf Frame Works and Visualization
KIT-601 Lecture Notes-UNIT-5.pdf Frame Works and Visualization
Dr. Radhey Shyam
 
Elise Smith & Chris Wild - Public Participatory GIS
Elise Smith & Chris Wild - Public Participatory GISElise Smith & Chris Wild - Public Participatory GIS
Elise Smith & Chris Wild - Public Participatory GIS
National Digital Forum
 
BDE SC6-ws-05/12/2016 technology part - SWC
BDE SC6-ws-05/12/2016 technology part - SWCBDE SC6-ws-05/12/2016 technology part - SWC
BDE SC6-ws-05/12/2016 technology part - SWC
BigData_Europe
 
Towards Digital Twin standards following an open source approach
Towards Digital Twin standards following an open source approachTowards Digital Twin standards following an open source approach
Towards Digital Twin standards following an open source approach
FIWARE
 

Similar to Data analysis in dataverse & visualization of datasets on historical maps (20)

TEAMS 6, 7 and 8
TEAMS 6, 7 and 8TEAMS 6, 7 and 8
TEAMS 6, 7 and 8
 
Geohosting
GeohostingGeohosting
Geohosting
 
ENGAGE Workshop at OpenDataWeek2013
ENGAGE Workshop at OpenDataWeek2013ENGAGE Workshop at OpenDataWeek2013
ENGAGE Workshop at OpenDataWeek2013
 
Gsoc proposal
Gsoc proposalGsoc proposal
Gsoc proposal
 
Inspire hack 2017-linked-data
Inspire hack 2017-linked-dataInspire hack 2017-linked-data
Inspire hack 2017-linked-data
 
Team 05 linked data generation
Team 05 linked data generationTeam 05 linked data generation
Team 05 linked data generation
 
Gsoc proposal 2021 polaris
Gsoc proposal 2021 polarisGsoc proposal 2021 polaris
Gsoc proposal 2021 polaris
 
EUDAT data architecture and interoperability aspects – Daan Broeder
EUDAT data architecture and interoperability aspects – Daan BroederEUDAT data architecture and interoperability aspects – Daan Broeder
EUDAT data architecture and interoperability aspects – Daan Broeder
 
A Comprehensive Guide to Data Science Technologies.pdf
A Comprehensive Guide to Data Science Technologies.pdfA Comprehensive Guide to Data Science Technologies.pdf
A Comprehensive Guide to Data Science Technologies.pdf
 
Executable papers
Executable papersExecutable papers
Executable papers
 
Towards a Community-driven Data Science Body of Knowledge – Data Management S...
Towards a Community-driven Data Science Body of Knowledge – Data Management S...Towards a Community-driven Data Science Body of Knowledge – Data Management S...
Towards a Community-driven Data Science Body of Knowledge – Data Management S...
 
Geospatial metadata and spatial data workshop: 19 June 2014
Geospatial metadata and spatial data workshop: 19 June 2014Geospatial metadata and spatial data workshop: 19 June 2014
Geospatial metadata and spatial data workshop: 19 June 2014
 
General concepts: DDI
General concepts: DDIGeneral concepts: DDI
General concepts: DDI
 
Tag.bio: Self Service Data Mesh Platform
Tag.bio: Self Service Data Mesh PlatformTag.bio: Self Service Data Mesh Platform
Tag.bio: Self Service Data Mesh Platform
 
DSpace-CRIS: a CRIS enhanced repository platform
DSpace-CRIS: a CRIS enhanced repository platformDSpace-CRIS: a CRIS enhanced repository platform
DSpace-CRIS: a CRIS enhanced repository platform
 
SPatially Explicit Data Discovery, Extraction and Evaluation Services (SPEDDE...
SPatially Explicit Data Discovery, Extraction and Evaluation Services (SPEDDE...SPatially Explicit Data Discovery, Extraction and Evaluation Services (SPEDDE...
SPatially Explicit Data Discovery, Extraction and Evaluation Services (SPEDDE...
 
KIT-601 Lecture Notes-UNIT-5.pdf Frame Works and Visualization
KIT-601 Lecture Notes-UNIT-5.pdf Frame Works and VisualizationKIT-601 Lecture Notes-UNIT-5.pdf Frame Works and Visualization
KIT-601 Lecture Notes-UNIT-5.pdf Frame Works and Visualization
 
Elise Smith & Chris Wild - Public Participatory GIS
Elise Smith & Chris Wild - Public Participatory GISElise Smith & Chris Wild - Public Participatory GIS
Elise Smith & Chris Wild - Public Participatory GIS
 
BDE SC6-ws-05/12/2016 technology part - SWC
BDE SC6-ws-05/12/2016 technology part - SWCBDE SC6-ws-05/12/2016 technology part - SWC
BDE SC6-ws-05/12/2016 technology part - SWC
 
Towards Digital Twin standards following an open source approach
Towards Digital Twin standards following an open source approachTowards Digital Twin standards following an open source approach
Towards Digital Twin standards following an open source approach
 

More from vty

Decentralised identifiers and knowledge graphs
Decentralised identifiers and knowledge graphs Decentralised identifiers and knowledge graphs
Decentralised identifiers and knowledge graphs
vty
 
Decentralisation and knowledge graphs
Decentralisation and knowledge graphs Decentralisation and knowledge graphs
Decentralisation and knowledge graphs
vty
 
Decentralised identifiers for CLARIAH infrastructure
Decentralised identifiers for CLARIAH infrastructure Decentralised identifiers for CLARIAH infrastructure
Decentralised identifiers for CLARIAH infrastructure
vty
 
Dataverse repository for research data in the COVID-19 Museum
Dataverse repository for research data  in the COVID-19 MuseumDataverse repository for research data  in the COVID-19 Museum
Dataverse repository for research data in the COVID-19 Museum
vty
 
Metaverse for Dataverse
Metaverse for DataverseMetaverse for Dataverse
Metaverse for Dataverse
vty
 
Flexibility in Metadata Schemes and Standardisation: the Case of CMDI and DAN...
Flexibility in Metadata Schemes and Standardisation: the Case of CMDI and DAN...Flexibility in Metadata Schemes and Standardisation: the Case of CMDI and DAN...
Flexibility in Metadata Schemes and Standardisation: the Case of CMDI and DAN...
vty
 
External CV support in Dataverse 5.7
External CV support in Dataverse 5.7External CV support in Dataverse 5.7
External CV support in Dataverse 5.7
vty
 
Building COVID-19 Knowledge Graph at CoronaWhy
Building COVID-19 Knowledge Graph at CoronaWhyBuilding COVID-19 Knowledge Graph at CoronaWhy
Building COVID-19 Knowledge Graph at CoronaWhy
vty
 
CLARIN CMDI use case and flexible metadata schemes
CLARIN CMDI use case and flexible metadata schemes CLARIN CMDI use case and flexible metadata schemes
CLARIN CMDI use case and flexible metadata schemes
vty
 
Flexible metadata schemes for research data repositories - CLARIN Conference'21
Flexible metadata schemes for research data repositories - CLARIN Conference'21Flexible metadata schemes for research data repositories - CLARIN Conference'21
Flexible metadata schemes for research data repositories - CLARIN Conference'21
vty
 
Controlled vocabularies and ontologies in Dataverse data repository
Controlled vocabularies and ontologies in Dataverse data repositoryControlled vocabularies and ontologies in Dataverse data repository
Controlled vocabularies and ontologies in Dataverse data repository
vty
 
Automated CI/CD testing, installation and deployment of Dataverse infrastruct...
Automated CI/CD testing, installation and deployment of Dataverse infrastruct...Automated CI/CD testing, installation and deployment of Dataverse infrastruct...
Automated CI/CD testing, installation and deployment of Dataverse infrastruct...
vty
 
Fighting COVID-19 with Artificial Intelligence
Fighting COVID-19 with Artificial IntelligenceFighting COVID-19 with Artificial Intelligence
Fighting COVID-19 with Artificial Intelligence
vty
 
Building COVID-19 Museum as Open Science Project
Building COVID-19 Museum as Open Science ProjectBuilding COVID-19 Museum as Open Science Project
Building COVID-19 Museum as Open Science Project
vty
 
External controlled vocabularies support in Dataverse
External controlled vocabularies support in DataverseExternal controlled vocabularies support in Dataverse
External controlled vocabularies support in Dataverse
vty
 
Setting up Dataverse repository for research data
Setting up Dataverse repository for research dataSetting up Dataverse repository for research data
Setting up Dataverse repository for research data
vty
 
Clariah Tech Day: Controlled Vocabularies and Ontologies in Dataverse
Clariah Tech Day: Controlled Vocabularies and Ontologies in DataverseClariah Tech Day: Controlled Vocabularies and Ontologies in Dataverse
Clariah Tech Day: Controlled Vocabularies and Ontologies in Dataverse
vty
 
5 years of Dataverse evolution
5 years of Dataverse evolution 5 years of Dataverse evolution
5 years of Dataverse evolution
vty
 
Ontologies, controlled vocabularies and Dataverse
Ontologies, controlled vocabularies and DataverseOntologies, controlled vocabularies and Dataverse
Ontologies, controlled vocabularies and Dataverse
vty
 
CLARIN CMDI support in Dataverse
CLARIN CMDI support in Dataverse CLARIN CMDI support in Dataverse
CLARIN CMDI support in Dataverse
vty
 

More from vty (20)

Decentralised identifiers and knowledge graphs
Decentralised identifiers and knowledge graphs Decentralised identifiers and knowledge graphs
Decentralised identifiers and knowledge graphs
 
Decentralisation and knowledge graphs
Decentralisation and knowledge graphs Decentralisation and knowledge graphs
Decentralisation and knowledge graphs
 
Decentralised identifiers for CLARIAH infrastructure
Decentralised identifiers for CLARIAH infrastructure Decentralised identifiers for CLARIAH infrastructure
Decentralised identifiers for CLARIAH infrastructure
 
Dataverse repository for research data in the COVID-19 Museum
Dataverse repository for research data  in the COVID-19 MuseumDataverse repository for research data  in the COVID-19 Museum
Dataverse repository for research data in the COVID-19 Museum
 
Metaverse for Dataverse
Metaverse for DataverseMetaverse for Dataverse
Metaverse for Dataverse
 
Flexibility in Metadata Schemes and Standardisation: the Case of CMDI and DAN...
Flexibility in Metadata Schemes and Standardisation: the Case of CMDI and DAN...Flexibility in Metadata Schemes and Standardisation: the Case of CMDI and DAN...
Flexibility in Metadata Schemes and Standardisation: the Case of CMDI and DAN...
 
External CV support in Dataverse 5.7
External CV support in Dataverse 5.7External CV support in Dataverse 5.7
External CV support in Dataverse 5.7
 
Building COVID-19 Knowledge Graph at CoronaWhy
Building COVID-19 Knowledge Graph at CoronaWhyBuilding COVID-19 Knowledge Graph at CoronaWhy
Building COVID-19 Knowledge Graph at CoronaWhy
 
CLARIN CMDI use case and flexible metadata schemes
CLARIN CMDI use case and flexible metadata schemes CLARIN CMDI use case and flexible metadata schemes
CLARIN CMDI use case and flexible metadata schemes
 
Flexible metadata schemes for research data repositories - CLARIN Conference'21
Flexible metadata schemes for research data repositories - CLARIN Conference'21Flexible metadata schemes for research data repositories - CLARIN Conference'21
Flexible metadata schemes for research data repositories - CLARIN Conference'21
 
Controlled vocabularies and ontologies in Dataverse data repository
Controlled vocabularies and ontologies in Dataverse data repositoryControlled vocabularies and ontologies in Dataverse data repository
Controlled vocabularies and ontologies in Dataverse data repository
 
Automated CI/CD testing, installation and deployment of Dataverse infrastruct...
Automated CI/CD testing, installation and deployment of Dataverse infrastruct...Automated CI/CD testing, installation and deployment of Dataverse infrastruct...
Automated CI/CD testing, installation and deployment of Dataverse infrastruct...
 
Fighting COVID-19 with Artificial Intelligence
Fighting COVID-19 with Artificial IntelligenceFighting COVID-19 with Artificial Intelligence
Fighting COVID-19 with Artificial Intelligence
 
Building COVID-19 Museum as Open Science Project
Building COVID-19 Museum as Open Science ProjectBuilding COVID-19 Museum as Open Science Project
Building COVID-19 Museum as Open Science Project
 
External controlled vocabularies support in Dataverse
External controlled vocabularies support in DataverseExternal controlled vocabularies support in Dataverse
External controlled vocabularies support in Dataverse
 
Setting up Dataverse repository for research data
Setting up Dataverse repository for research dataSetting up Dataverse repository for research data
Setting up Dataverse repository for research data
 
Clariah Tech Day: Controlled Vocabularies and Ontologies in Dataverse
Clariah Tech Day: Controlled Vocabularies and Ontologies in DataverseClariah Tech Day: Controlled Vocabularies and Ontologies in Dataverse
Clariah Tech Day: Controlled Vocabularies and Ontologies in Dataverse
 
5 years of Dataverse evolution
5 years of Dataverse evolution 5 years of Dataverse evolution
5 years of Dataverse evolution
 
Ontologies, controlled vocabularies and Dataverse
Ontologies, controlled vocabularies and DataverseOntologies, controlled vocabularies and Dataverse
Ontologies, controlled vocabularies and Dataverse
 
CLARIN CMDI support in Dataverse
CLARIN CMDI support in Dataverse CLARIN CMDI support in Dataverse
CLARIN CMDI support in Dataverse
 

Recently uploaded

Applied Science: Thermodynamics, Laws & Methodology.pdf
Applied Science: Thermodynamics, Laws & Methodology.pdfApplied Science: Thermodynamics, Laws & Methodology.pdf
Applied Science: Thermodynamics, Laws & Methodology.pdf
University of Hertfordshire
 
Katherine Romanak - Geologic CO2 Storage.pdf
Katherine Romanak - Geologic CO2 Storage.pdfKatherine Romanak - Geologic CO2 Storage.pdf
Katherine Romanak - Geologic CO2 Storage.pdf
Texas Alliance of Groundwater Districts
 
Sharlene Leurig - Enabling Onsite Water Use with Net Zero Water
Sharlene Leurig - Enabling Onsite Water Use with Net Zero WaterSharlene Leurig - Enabling Onsite Water Use with Net Zero Water
Sharlene Leurig - Enabling Onsite Water Use with Net Zero Water
Texas Alliance of Groundwater Districts
 
Compexometric titration/Chelatorphy titration/chelating titration
Compexometric titration/Chelatorphy titration/chelating titrationCompexometric titration/Chelatorphy titration/chelating titration
Compexometric titration/Chelatorphy titration/chelating titration
Vandana Devesh Sharma
 
aziz sancar nobel prize winner: from mardin to nobel
aziz sancar nobel prize winner: from mardin to nobelaziz sancar nobel prize winner: from mardin to nobel
aziz sancar nobel prize winner: from mardin to nobel
İsa Badur
 
ESA/ACT Science Coffee: Diego Blas - Gravitational wave detection with orbita...
ESA/ACT Science Coffee: Diego Blas - Gravitational wave detection with orbita...ESA/ACT Science Coffee: Diego Blas - Gravitational wave detection with orbita...
ESA/ACT Science Coffee: Diego Blas - Gravitational wave detection with orbita...
Advanced-Concepts-Team
 
Micronuclei test.M.sc.zoology.fisheries.
Micronuclei test.M.sc.zoology.fisheries.Micronuclei test.M.sc.zoology.fisheries.
Micronuclei test.M.sc.zoology.fisheries.
Aditi Bajpai
 
GBSN - Biochemistry (Unit 6) Chemistry of Proteins
GBSN - Biochemistry (Unit 6) Chemistry of ProteinsGBSN - Biochemistry (Unit 6) Chemistry of Proteins
GBSN - Biochemistry (Unit 6) Chemistry of Proteins
Areesha Ahmad
 
快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样
快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样
快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样
hozt8xgk
 
SAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdfSAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdf
KrushnaDarade1
 
The binding of cosmological structures by massless topological defects
The binding of cosmological structures by massless topological defectsThe binding of cosmological structures by massless topological defects
The binding of cosmological structures by massless topological defects
Sérgio Sacani
 
11.1 Role of physical biological in deterioration of grains.pdf
11.1 Role of physical biological in deterioration of grains.pdf11.1 Role of physical biological in deterioration of grains.pdf
11.1 Role of physical biological in deterioration of grains.pdf
PirithiRaju
 
Authoring a personal GPT for your research and practice: How we created the Q...
Authoring a personal GPT for your research and practice: How we created the Q...Authoring a personal GPT for your research and practice: How we created the Q...
Authoring a personal GPT for your research and practice: How we created the Q...
Leonel Morgado
 
ESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptxESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptx
PRIYANKA PATEL
 
Immersive Learning That Works: Research Grounding and Paths Forward
Immersive Learning That Works: Research Grounding and Paths ForwardImmersive Learning That Works: Research Grounding and Paths Forward
Immersive Learning That Works: Research Grounding and Paths Forward
Leonel Morgado
 
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
Abdul Wali Khan University Mardan,kP,Pakistan
 
Bob Reedy - Nitrate in Texas Groundwater.pdf
Bob Reedy - Nitrate in Texas Groundwater.pdfBob Reedy - Nitrate in Texas Groundwater.pdf
Bob Reedy - Nitrate in Texas Groundwater.pdf
Texas Alliance of Groundwater Districts
 
Eukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptxEukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptx
RitabrataSarkar3
 
8.Isolation of pure cultures and preservation of cultures.pdf
8.Isolation of pure cultures and preservation of cultures.pdf8.Isolation of pure cultures and preservation of cultures.pdf
8.Isolation of pure cultures and preservation of cultures.pdf
by6843629
 
20240520 Planning a Circuit Simulator in JavaScript.pptx
20240520 Planning a Circuit Simulator in JavaScript.pptx20240520 Planning a Circuit Simulator in JavaScript.pptx
20240520 Planning a Circuit Simulator in JavaScript.pptx
Sharon Liu
 

Recently uploaded (20)

Applied Science: Thermodynamics, Laws & Methodology.pdf
Applied Science: Thermodynamics, Laws & Methodology.pdfApplied Science: Thermodynamics, Laws & Methodology.pdf
Applied Science: Thermodynamics, Laws & Methodology.pdf
 
Katherine Romanak - Geologic CO2 Storage.pdf
Katherine Romanak - Geologic CO2 Storage.pdfKatherine Romanak - Geologic CO2 Storage.pdf
Katherine Romanak - Geologic CO2 Storage.pdf
 
Sharlene Leurig - Enabling Onsite Water Use with Net Zero Water
Sharlene Leurig - Enabling Onsite Water Use with Net Zero WaterSharlene Leurig - Enabling Onsite Water Use with Net Zero Water
Sharlene Leurig - Enabling Onsite Water Use with Net Zero Water
 
Compexometric titration/Chelatorphy titration/chelating titration
Compexometric titration/Chelatorphy titration/chelating titrationCompexometric titration/Chelatorphy titration/chelating titration
Compexometric titration/Chelatorphy titration/chelating titration
 
aziz sancar nobel prize winner: from mardin to nobel
aziz sancar nobel prize winner: from mardin to nobelaziz sancar nobel prize winner: from mardin to nobel
aziz sancar nobel prize winner: from mardin to nobel
 
ESA/ACT Science Coffee: Diego Blas - Gravitational wave detection with orbita...
ESA/ACT Science Coffee: Diego Blas - Gravitational wave detection with orbita...ESA/ACT Science Coffee: Diego Blas - Gravitational wave detection with orbita...
ESA/ACT Science Coffee: Diego Blas - Gravitational wave detection with orbita...
 
Micronuclei test.M.sc.zoology.fisheries.
Micronuclei test.M.sc.zoology.fisheries.Micronuclei test.M.sc.zoology.fisheries.
Micronuclei test.M.sc.zoology.fisheries.
 
GBSN - Biochemistry (Unit 6) Chemistry of Proteins
GBSN - Biochemistry (Unit 6) Chemistry of ProteinsGBSN - Biochemistry (Unit 6) Chemistry of Proteins
GBSN - Biochemistry (Unit 6) Chemistry of Proteins
 
快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样
快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样
快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样
 
SAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdfSAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdf
 
The binding of cosmological structures by massless topological defects
The binding of cosmological structures by massless topological defectsThe binding of cosmological structures by massless topological defects
The binding of cosmological structures by massless topological defects
 
11.1 Role of physical biological in deterioration of grains.pdf
11.1 Role of physical biological in deterioration of grains.pdf11.1 Role of physical biological in deterioration of grains.pdf
11.1 Role of physical biological in deterioration of grains.pdf
 
Authoring a personal GPT for your research and practice: How we created the Q...
Authoring a personal GPT for your research and practice: How we created the Q...Authoring a personal GPT for your research and practice: How we created the Q...
Authoring a personal GPT for your research and practice: How we created the Q...
 
ESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptxESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptx
 
Immersive Learning That Works: Research Grounding and Paths Forward
Immersive Learning That Works: Research Grounding and Paths ForwardImmersive Learning That Works: Research Grounding and Paths Forward
Immersive Learning That Works: Research Grounding and Paths Forward
 
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
 
Bob Reedy - Nitrate in Texas Groundwater.pdf
Bob Reedy - Nitrate in Texas Groundwater.pdfBob Reedy - Nitrate in Texas Groundwater.pdf
Bob Reedy - Nitrate in Texas Groundwater.pdf
 
Eukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptxEukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptx
 
8.Isolation of pure cultures and preservation of cultures.pdf
8.Isolation of pure cultures and preservation of cultures.pdf8.Isolation of pure cultures and preservation of cultures.pdf
8.Isolation of pure cultures and preservation of cultures.pdf
 
20240520 Planning a Circuit Simulator in JavaScript.pptx
20240520 Planning a Circuit Simulator in JavaScript.pptx20240520 Planning a Circuit Simulator in JavaScript.pptx
20240520 Planning a Circuit Simulator in JavaScript.pptx
 

Data analysis in dataverse & visualization of datasets on historical maps

  • 1. CLIO INFRA Data analysis in Dataverse & visualization of datasets on historical maps Dataverse Community meeting 2015, June 11, Harvard University Vyacheslav Tykhonov Richard Zijdeman Jerry de Vries International Institute of Social History
  • 2. Introduction The International Institute of Social History (IISH) collects information in the field of social history and makes it available to the public. The IISH is one of the major scientific heritage institutions in the Netherlands. In the field of socio-economic historical research IISH plays a prominent role. CLIO INFRA is Digital Research Infrastructure for the Arts and Humanities developed to integrate a number of databases (hubs) consisting of data on global social, economic and institutional indicators over the past five centuries, with special attention for the past 200 years. Clio Infra developed by IISH.
  • 3. Our mission Clio Infra is the bridge between Data Collections and Research Datasets. “Old school” researchers time: Collecting datasets in CSV, Access or Excel and other formats mostly on his own without real collaboration between all researchers in the same field. Very closed model. Digital Humanities era: Young researchers use various computer tools to collect, analyze and combine research datasets they’ve got from older generation of researchers. They can contribute and help each other and build communities. The future: The goal is to build tools to get datasets from “old school” researchers, standardize and store data in the Clouds in order provide access to Open Data for researchers all over the world and made possible collaboration between them.
  • 4. Clio Infra Collaboration platform Clio Infra functionality based on the Dataverse solution: - teams collaboratively can curate, share and analyze research datasets - teams members can share the responsibility to collect data on specific variables (for example, countries) and inform each other about changes and additions - dataset version control system is able to track changes in datasets - other researchers can download their own copy of the data if dataset is published as Open Data Dataverse is flexible metadata store (repository) that connected with Research datasets storage by our Data Processing Engine (DPE)
  • 5. Added Value for future Researchers The benefits of data sharing can be classified in terms of Metadata and Data access and sharing (Collection tools) and Statistical Analysis and Data Mining (Research tools): ● access to a specific case study, citing and finding data ● access to the universe of data from Dataverse network that can organize and display them for browsing and searching ● data filtering: researchers with proper authorization can obtain the subset of data provided by data collector ● data analysis to run descriptive statistics and graphics, visualization, plotting on historical maps ● Data APIs to export data for further analysis by popular statistical packages (STATA, SPSS, R, iPython Notebook) and advanced data mining tools that will be developed in the future (always up-to-date solution)
  • 6. Collaboration possibilities Descriptive metadata: - dataset file - documentation - standardized tables with codes can be part of dataset Sharing and collaboration capabilities: - requesting unique API token by every researcher - user of Dataverse - exchanging API token between researchers and granting permissions to work on the same datasets as a team using data analysis tools
  • 7. Collaboration Data Workflow - Draft Datasets are visible only for owners - with API tokens researchers can get access to interactive dashboard to get some insights about the data stored in the Dataverse - every dataset converted to dataframe - dashboard can provide access to all variables from the dataframe and visualize them on charts, graphs, historical maps and treemaps After dataset is prepared to go public, it can be published in Collabs: - guest users can download the copy of dataset - team members with permissions and authorized API token can contribute to dataset
  • 8. Data Analysis in Clio Infra ● Data Processing Engine has python core and developed for nlgis.nl (Netherlands Geographic Information System) and distributed as widget ● interactive data exploration dashboard based on D3 library ● every dataset from Dataverse is available as Data API (json) and can be connected to any statistical package ● filtering the data on specific years and variables based on dataframes (pandas, python for data analysis) ● data quality check is the quick visual tool to apply Benford’s law for all values from specific dataset ● dataframes can be open by researchers in various statistical packages like iPython notebook, R Studio, SPSS, STATA, Mathlab, etc
  • 9. Data Processing Engine (DPE) specification ● can split values from any dataset in number of categories specified by researcher (8 by default) ● algorithm to categorize data values in proper categories can be selected manually (percentile by default) ● can define maximum possible categories for specific dataset if there is no way to get categories number specified by user of the system (for example, if there are 2-3 categories of data values) ● data ranges should be defined to get possibility to visualize data on some chart or map in the right scale ● colors can be specified by user (Color Brewing, see http://colorbrewer2.org) ● legend generated and attached to all visualizations automatically ● values with missing data shown as 'no data' regions on map ● all data values delivered by Data API to make the data analysis platform independent and communicate with other systems or statistical packages
  • 10. API Service (Data API) Data API provided by Data Processing Engine is the most important functionality for the well equipped digital infrastructure: ● easy way to analyze data in popular statistical packages (STATA, SPSS, Excel) ● use common data science programming languages like Python, R to perform more advanced research using external Data Science libraries ● analyze data with toolboxes like Wolfram|Alpha and other Discovery Platforms (added value for the future) ● suitable for other researchers and developers to use advanced technique and data mining tools that aren’t developed yet
  • 11. Example of output from Data API ● every dataset ingested by DPE available as Data API with unique handler ● API can be filtered by variables extracted from the content of data file Example: /api/data?&handle=F16UDU:30:31&countrycode=USA&year=1880&categories=8&datarange=calculate "United States of America": { "code": "F16UDU30_31", "color": "#FF7F00", "countrycode": "USA", "id": "2085", "indicator": "Total Urban Population", "intcode": "840", "r": 923.99, "region": "W. Offshoots", "units": "x 1000", "value": 14264.0, "year": 1880 } }
  • 12. Data visualization and plotting data on historical maps ● Data Processing Engine (DPE) is the core of data visualization process and connected to geoservice ● data attributes like scales and colors calculated by DPE on the fly based on the input of researcher (for example, number of categories to split data) and the part of Data API ● histograms, cross tabulations, enhanced descriptive statistics based on pandas dataframes ● visualization of datasets on historical maps will be available to plot data on maps for last 500 years
  • 13. Historical maps services: Geoservice and Geocoder Delivered by Webmapper.nl and integrated with common Clio Infra infrastructure. Geoservice basic requirements: - should provide actual GeoAPI with historical polygons for all countries and regions based on standardized codes - available as geojson/topojson for online visualization - QGIS toolbox should be supported to upload new maps and update old maps as shape files Geocoder should be able to standardize all geographical locations in different datasets: - USSR and Soviet Union should be recognized as one country with the same PID - Germany before and after 1990 - Indonesia before and after 1999
  • 14. GeoAPI example Geoservice can provide polygons for specific years on the national or world level rendered as topojson or geojson. GeoAPI: /api/maps?world=on&year=1962 Polygons for all countries will be delivered as topojson: arcs":[[1782,2186]]}]}},"arcs":[[[8387,6231],[0,5],[1,1],[1,-1],[2,0],[2,-1],[3,-4],[1, -3],[0,-1],[-1,-5],[0,-1],[-1,2],[0,1],[-2,2],[-3,1],[-1,0],[-1,0],[-1,3],[0,1]], [[8390,6247],[1,1],[0,1],[2,1],[1,0],[1,-2],[-1,-5],[-1,0],[-1,1],[-1,1],[-1,1],[0,1]], [[8391,6204],[0,2],[-1,1],[-1,-1],[0,1],[0,1],[1,3],[1,0],[2,-6],[0,-1],[0,-1],[0,-1], [-1,1],[-1,1]],[[8364,6093],[0,2],[2,5],[1,0],[1,-2],[-1,-6],[-1,-3],[-1,0],[-1,0],[0,1], [0,3]],[[5941,6575],[0,-1],[-1,0],[-1,1],[0,1],[-1,0],[-1,0],[-1,0],[-1,0],[0,-1],[-1,-1], [0,-2],[0,-1],[0,-2],[0,-1],[-1,-3],[-3,-2],[-4,-4],[-1,-1],[-1,0],[-2,-1],[-2,0],[-1,0],[-1, -1],[-1,-2],[-5,1],[-1,0],[-1,-1],[-1,-1],[-1,0],[-2,1],[-4,3],[-1,1],[-1,2],[-2,8],[-1,4], [-1,9],[0,1],[0,2],[0,1],[1,0],[0,-1],[0,-1],[1,-1],[0,-1],[1,0]
  • 15. Integration of Dataverse, DPE, geoservice and geocoder in 6 steps 1. Data API communicates with geocoder to find standardized codes for all locations in dataset 2. the same geocode with associated polygons is coming from geoservice 3. data and geoservice matching in the frontend by any geospatial javascript library 4. attributes like colors and scales filling the map polygons 5. legend with scales, values and colors coming from DPE to make map look complete 6. notes, sources and other metadata extracting from dataverse API to provide clear explanation of generated historical map Live demo for Labor Conflicts in 100 years
  • 16. Dataset Quality Check Benford’s law to do test of the quality of data Linked Edit Rules: a methodology to publish, link, combine and execute edit rules on the Web as Linked Data to verify consistency of statistical datasets and recognize wrong filled data values (for example, characters in values where numbers expected) Chart visualization to get overview of missing data values
  • 17. Benford’s law in action on real dataset
  • 18. Datasets combining and aggregation in data exploration tools ● tools that can predict the same variables disambiguated in different datasets (Year and Jaar) ● geocoding services can standardize different regions (Netherlands → NL, Amsterdam → AMS) ● all possible relationship paths can be ranked from the “best” (100%) to the “worst” (0%), for example, value “05” for variable “Month” can be recognized as “May” ● standardized datasets can be used as “reference” data for other datasets from other researcher groups and depot services (CHIA, Harvard DataVerse Network, MIT, DANS)
  • 19. Clio Infra Infrastructure is suitable for different kinds of datasets Quantitative datasets that store quantity data values (numbers): - data measured (length, speed, height, age) - example: current clio-infra datasets Qualitative datasets store quality observations (descriptions): - data can be observed (colors, textures, professions, groups) - example: HISCO, world strikes dataset Data Visualization is different for different kinds of datasets: - quantity can be plotted on charts, graphs, historical maps - quality (hierarchy) can be visualized on treemaps and maps
  • 20. Example: Interactive data exploration based on API token
  • 21. Summary: data exploration and analysis will extend Dataverse functionality ● data quality check during upload/update of dataset ● automatic ingestion and recognition of years and locations in datasets ● integrated with geocoder and geoservice to get polygons for maps ● interactive dashboard to do visual exploration of variables from dataset (graph / histogram / scatterplot) ● data processing engine to plot data on interactive historical maps (if dataset has geospatial data) ● correlation for continuous variables ● regression analysis for estimating the relationships among variables ● building treemaps for qualitative data analysis (hierarchical data coming soon)