Efficient & effective data management for research projects
ILRI's Data Management Platform
Carlos Quiros
June, 2015
Contents
• Back in 2011
• Current status
• How we did it
• Example of a process
• CKAN
• Key decisions made
• Technology and skills required
Back in 2011
Survey design
• Too many
• No common indicators
• Different (<>) variables
• Different (<>) calculations
Survey implementation
• Too many tools
• No protocols
• Poor field data cleaning
• No standard process
Storage
• In files
• Too many formats
• Too many versions
• Messy data cleaning
• No accountability
Availability & accessibility
• Nothing
Now
Survey design
• Too many
• Common indicators
• Same (=) variables
• Same (=) calculations
Storage
• Server database
• No formats
• One version
• Central cleaning
• Accountability
Availability & accessibility
• CKAN
• OData
Survey implementation
• 2 tools (ODK, CSPro)
• Protocols
• Field data cleaning
• Standard process
• Standard tools
How we went about it
Storage
• Server database
• How to integrate ODK and CSPro?
• How to make it easy for scientists?
• How to manage user decentralization?
• Increase accountability?
Availability and accessibility
• What to use? CKAN, Dataverse, etc.
→ CKAN
• How to extend it to serve our purpose?
• How to integrate it with a server database?
• How to manage our metadata and vocabularies?
• How to do this?
• Data interoperability? RDF, OData, GData, etc.?
→ OData
• How to do it?
Survey implementation
• Support only two tools
• Wrote protocols
• Wrote field data cleaning applications
• Wrote policies and implementation plans
• Wrote standard processes and tools for processing the data
Survey design (ongoing)
• Worked closely with teams
• Created a central place for all the surveys
• Separated surveys into modules
• Worked on common indicators
• Management supports this process
Example of a process
Detailed breakdown of ILRI’s RMD workflow with ODK
• Start
• Draft tool (.doc)
• Consultation
• Final tool (.doc)
• Coding (.doc → .xls)
• Testing & review (.xls)
• Uploaded to Formhub test account
• Testing & review (ODK Collect)
• OK?
• Field deployment
• Uploaded to Formhub project account
• Data collection
• Upload data to Formhub
• End of data collection
• Create MySQL schema with ODKToMySQL → MySQL schema in server
• Convert data to JSON with FormhubToJSON → data in JSON format
• Upload JSON into MySQL schema with JSONToMySQL
• Initialize META in schema → metadata for portal
• Data cleaning from server using MySQL for Excel
• Sharing in Data Portal
Who (codes): RMG Staff, Project Team Member
S = Scientist input / usage
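The step that loads JSON survey data into a relational schema can be illustrated with a minimal sketch. It is not how ODKToMySQL or JSONToMySQL work internally: the record layout, the table names and the column names are all assumptions made for illustration, and SQLite (from the Python standard library) stands in for the MySQL server so that the example runs anywhere.

```python
import json
import sqlite3

# Hypothetical survey submission, shaped like one JSON record exported
# from a form server. The field names are illustrative assumptions.
SUBMISSION = json.loads("""
{
  "household_id": "HH-001",
  "village": "Example village",
  "livestock": [
    {"species": "cattle", "head_count": 4},
    {"species": "goat", "head_count": 7}
  ]
}
""")


def create_schema(conn):
    """Create one table per survey module (hypothetical layout)."""
    conn.executescript("""
        CREATE TABLE household (
            household_id TEXT PRIMARY KEY,
            village      TEXT NOT NULL
        );
        CREATE TABLE livestock (
            household_id TEXT NOT NULL REFERENCES household(household_id),
            species      TEXT NOT NULL,
            head_count   INTEGER NOT NULL,
            PRIMARY KEY (household_id, species)
        );
    """)


def load_submission(conn, record):
    """Insert one submission: the parent row first, then its repeat group."""
    with conn:  # commits on success, rolls back if an insert fails
        conn.execute(
            "INSERT INTO household VALUES (?, ?)",
            (record["household_id"], record["village"]),
        )
        conn.executemany(
            "INSERT INTO livestock VALUES (?, ?, ?)",
            [
                (record["household_id"], row["species"], row["head_count"])
                for row in record["livestock"]
            ],
        )


conn = sqlite3.connect(":memory:")
create_schema(conn)
load_submission(conn, SUBMISSION)

total = conn.execute(
    "SELECT SUM(head_count) FROM livestock WHERE household_id = ?",
    ("HH-001",),
).fetchone()[0]
print(total)
```

Once the data sits in tables like these, cleaning and analysis can run as queries against a single central copy instead of against files in many formats and versions.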
ILRI’s data portal (CKAN) – http://data.ilri.org/portal/
• CKAN?
• The Open Knowledge Foundation
• Most widely deployed data portal software
• USA data portal
• UK data portal
• EU data portal
• Open Africa
• What do you get out of the box?
• Create datasets with minimum metadata
• Name, Abstract, Author, Date
• Tags into controlled vocabulary
• Powerful search engine
• Public / private access to datasets
• Able to attach resources (files) to a dataset
• Data interoperability through powerful API and RDF
• Arrange datasets into organizations and topics
• What can you do by creating extensions?
• Add new vocabularies (e.g., Language, Countries, etc.)
• Add new metadata fields
• Visualize different kinds of data (e.g., maps)
• Change theme (colors, logos, fonts, etc.)
• Create data hubs by harvesting other CKANs
• Whatever else you want…
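The API mentioned above can be sketched with CKAN's Action API, whose `package_search` action returns datasets matching a query. The portal address below is a placeholder, not ILRI's actual API endpoint, and the sample response is a hand-written stand-in that follows the `success` / `result` / `results` layout of a CKAN reply, so the example runs without network access.

```python
from urllib.parse import urlencode


def build_search_url(portal_url, query, rows=10):
    """Build a CKAN Action API URL that searches datasets for `query`."""
    params = urlencode({"q": query, "rows": rows})
    return f"{portal_url.rstrip('/')}/api/3/action/package_search?{params}"


def dataset_titles(response):
    """Return the dataset titles from a decoded package_search response."""
    if not response.get("success"):
        raise ValueError("CKAN reported a failed request")
    return [d.get("title") or d["name"] for d in response["result"]["results"]]


# Hand-written stand-in for a decoded JSON response (illustrative data only).
SAMPLE_RESPONSE = {
    "success": True,
    "result": {
        "count": 2,
        "results": [
            {"name": "household-survey", "title": "Household survey"},
            {"name": "market-prices", "title": ""},
        ],
    },
}

# "portal.example.org" is a placeholder host, not a real portal.
url = build_search_url("http://portal.example.org/", "livestock survey", rows=5)
titles = dataset_titles(SAMPLE_RESPONSE)
print(url)
print(titles)
```

In real use the URL would be fetched with an HTTP client and the JSON body decoded before being passed to `dataset_titles`.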
Key decisions made
• Use open source for all RDM
Pros:
• Bigger pool of tools
• Flexible
• Innovation
Cons:
• Complex skill set
• Learning curve
• Relational Database Management System (RDBMS)
Pros:
• Central place
• Auditing
Cons:
• DB management skill set
• Scientists have no idea how to work with an RDBMS
• CKAN
Pros:
• There is nothing better out there
• Flexible and extensible
Cons:
• Programming in several languages is required
• Learning curve
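The auditing advantage listed for the relational database can be illustrated with a trigger that records every change to a value. This is a sketch under stated assumptions, not ILRI's implementation: the table and column names are hypothetical, and SQLite stands in for MySQL so that the example is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical survey table holding one value that may be cleaned later.
    CREATE TABLE household (
        household_id TEXT PRIMARY KEY,
        herd_size    INTEGER NOT NULL
    );

    -- Every change to herd_size is copied here by the trigger below.
    CREATE TABLE audit_log (
        audit_id     INTEGER PRIMARY KEY AUTOINCREMENT,
        household_id TEXT NOT NULL,
        old_value    INTEGER,
        new_value    INTEGER,
        changed_at   TEXT DEFAULT CURRENT_TIMESTAMP
    );

    CREATE TRIGGER household_audit
    AFTER UPDATE OF herd_size ON household
    FOR EACH ROW
    BEGIN
        INSERT INTO audit_log (household_id, old_value, new_value)
        VALUES (OLD.household_id, OLD.herd_size, NEW.herd_size);
    END;
""")

# A data-entry error (400 instead of 40) is corrected during cleaning.
conn.execute("INSERT INTO household VALUES ('HH-001', 400)")
conn.execute("UPDATE household SET herd_size = 40 WHERE household_id = 'HH-001'")

audit_rows = conn.execute(
    "SELECT household_id, old_value, new_value FROM audit_log"
).fetchall()
print(audit_rows)
```

This sketch records what changed and when, but not who made the change; on a multi-user database server the trigger could also store the connected account.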
Technology and skills required
• Server
• Linux (Ubuntu server) [Linux administration]
• http://www.ubuntu.com/download/server
• Database server
• MySQL – An open source database system [DB administration, SQL]
• http://www.mysql.com/
• Data processing software [Linux, C++, Python]
• ODK – A toolset for collecting data on mobile devices.
• https://opendatakit.org/
• CSPro – Software for creating data entry applications.
• https://www.census.gov/population/international/software/cspro/
• Formhub – A software tool that collects ODK data.
• https://github.com/SEL-Columbia/formhub
• ODK Tools – A toolbox for processing ODK survey data into MySQL databases.
• https://github.com/ilri/odktools
• META – A toolbox for managing research data in MySQL databases.
• https://github.com/ilri/meta
• CSProTools – A toolbox for processing CSPro survey data into MySQL databases.
• https://github.com/ilri/csprotools
• Data sharing and interoperability
• CKAN – The open source data portal software. [Linux, Python, WebDev]
• http://ckan.org/
• http://docs.ckan.org/en/latest/maintaining/installing/index.html
• http://docs.ckan.org/en/latest/extensions/index.html
• OData – Allows the creation and consumption of queryable and interoperable data resources in a simple and standard way. [Linux, Java, WebDev]
• http://www.odata.org/
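What "queryable" means for OData can be shown by building a query URL from the standard system query options `$filter`, `$select` and `$top`. The service root, entity set and property names below are hypothetical; only the query options themselves come from the OData URL conventions.

```python
from urllib.parse import quote, urlencode


def build_odata_url(service_root, entity_set, filter_expr=None, select=None, top=None):
    """Build an OData query URL from a few standard system query options."""
    options = {}
    if filter_expr:
        options["$filter"] = filter_expr
    if select:
        options["$select"] = ",".join(select)
    if top is not None:
        options["$top"] = str(top)

    url = f"{service_root.rstrip('/')}/{quote(entity_set)}"
    if options:
        # Keep "$", "," and "'" readable; spaces become %20.
        url += "?" + urlencode(options, quote_via=quote, safe="$,'")
    return url


# Hypothetical service root, entity set and property names.
odata_url = build_odata_url(
    "http://odata.example.org/service.svc",
    "Households",
    filter_expr="herd_size gt 10",
    select=["household_id", "herd_size"],
    top=5,
)
print(odata_url)
```

Because the filtering and column selection travel in the URL, any OData-aware client can request just the rows and fields it needs without custom code on the server side.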
Thank you
Visit us @
http://data.ilri.org/
