Data quality challenges in the Canadensys network of occurrence records: examples, tools, and solutions

TDWG 2013 talk on data quality challenges in the Canadensys network of occurrence records: examples, tools, and solutions.
Authors: Christian Gendreau, David P. Shorthouse, Peter Desmet

Transcript

  • 1. Data quality challenges in the Canadensys network of occurrence records: examples, tools, and solutions. Christian Gendreau, David Shorthouse & Peter Desmet
  • 2. Game plan • Introduction to Canadensys • Data quality @ Canadensys • Canadensys processing solutions • Numbers from Canadensys • Hopes and expectations
  • 3. A Network Of people and collections
  • 4. Canadensys Headquarters Université de Montréal Biodiversity Centre
  • 5. data.canadensys.net/vascan  
  • 6. data.canadensys.net/ipt  
  • 7. data.canadensys.net/explorer  
  • 8. Data quality related activities From an aggregator perspective
  • 9. During data entry • Help to avoid typographical errors • Help to convert verbatim data (Actor: data entry person)
  • 10. Before publication • Detect file character encoding issues • Detect duplicate or missing IDs (Actor: data publisher. Previous activity: data entry)
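The two pre-publication checks on slide 10 can be sketched in a few lines of Python. This is only an illustration, not the Canadensys tooling: the function names `check_ids` and `is_valid_utf8` are hypothetical, and UTF-8 is assumed as the expected file encoding.

```python
from collections import Counter


def check_ids(ids):
    """Return (positions of missing IDs, sorted list of duplicated IDs)."""
    cleaned = [None if v is None else str(v).strip() for v in ids]
    # An ID counts as missing when it is None or blank after trimming.
    missing = [i for i, v in enumerate(cleaned) if not v]
    counts = Counter(v for v in cleaned if v)
    duplicates = sorted(k for k, n in counts.items() if n > 1)
    return missing, duplicates


def is_valid_utf8(raw: bytes) -> bool:
    """Assumption: the publisher expects UTF-8; any decode error is an issue."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

For example, `check_ids(["a1", "a2", "", "a1", None])` reports positions 2 and 4 as missing and `a1` as duplicated, while the Latin-1 bytes of "Montréal" fail the UTF-8 check.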
  • 11. During aggregation • Process data: validation, cleaning • Produce structured reports: quality control (Actor: data aggregator. Previous activity: before publication)
  • 12. After aggregation • Allow and facilitate community feedback • Help data publishers to integrate corrections (Actor: users and community. Previous activity: aggregation)
  • 13. Canadensys tools during data entry data.canadensys.net/tools
  • 14. Why do we process data? • Enrich our Explorer, http://data.canadensys.net • Provide structured reports to data providers • Help identify records that need re-examination • Help to improve data entry procedures
  • 15. Data processing
  • 16. Processing solutions: Narwhals to the rescue (Narwhal image: public domain)
  • 17. The narwhal-processor approach • Single-field processing to allow complex processing (combined fields) • Processors with a common interface ease integration and usage • Collaboration https://github.com/Canadensys/narwhal-processor
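The idea of single-field processors behind one common interface can be illustrated as follows. This is a hypothetical Python sketch, not the narwhal-processor API (which is a separate Java library): `Processor`, `Result`, `process_record` and the two example processors are invented names for illustration.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Result:
    raw: str                      # value as provided
    value: Optional[str]          # processed value, None if processing failed
    error: Optional[str] = None   # short reason, usable in a structured report


class Processor(ABC):
    """Hypothetical common interface: one field in, one Result out."""

    @abstractmethod
    def process(self, raw: str) -> Result:
        ...


class CollapseWhitespace(Processor):
    def process(self, raw: str) -> Result:
        return Result(raw, " ".join(raw.split()))


class FourDigitYear(Processor):
    def process(self, raw: str) -> Result:
        text = raw.strip()
        if len(text) == 4 and text.isdigit():
            return Result(raw, text)
        return Result(raw, None, "not a 4-digit year")


def process_record(record: Dict[str, str],
                   processors: Dict[str, Processor]) -> Dict[str, Result]:
    """Apply the matching processor to each field that has one."""
    return {field: processors[field].process(value)
            for field, value in record.items() if field in processors}
```

Because every processor returns the same `Result` shape, a caller can combine several single-field results into a multi-field check without knowing how each field was processed.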
  • 18. Data usability before processing [Bar chart, % of non-null clean verbatim data: country text 92%, state/province text 60%, coordinates 96%, dates 44%]
  • 19. Data usability after processing • 7% of provided country text (Example: USA → ISO 3166-2:US, United States)
  • 20. Data usability after processing • 7% of provided country text • 16% of provided state/province text (Example: Qué → ISO 3166-2 CA-QC, Quebec)
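The country and state/province examples on slides 19 and 20 amount to matching a folded form of the verbatim text against a list of known variants. A minimal Python sketch, assuming tiny hand-written lookup tables (hypothetical; a real aggregator would rely on a maintained external resource, as the later slides argue):

```python
import unicodedata


def fold(text: str) -> str:
    """Strip accents, surrounding spaces and periods, and ignore case."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return no_accents.strip().strip(".").casefold()


# Illustrative variant lists only: (code, preferred English name).
COUNTRIES = {
    "usa": ("US", "United States"),
    "us": ("US", "United States"),
    "united states": ("US", "United States"),
    "canada": ("CA", "Canada"),
}
PROVINCES = {
    "que": ("CA-QC", "Quebec"),
    "qc": ("CA-QC", "Quebec"),
    "quebec": ("CA-QC", "Quebec"),
}


def normalize(text: str, table: dict):
    """Return (code, name) for a known variant, or None when unrecognized."""
    return table.get(fold(text))
```

Here `normalize("Qué", PROVINCES)` and `normalize("Québec", PROVINCES)` both resolve to the ISO 3166-2 code `CA-QC`, and an unknown value returns `None` so it can be reported rather than silently changed.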
  • 21. Data usability after processing • 7% of provided country text • 16% of provided state/province text • 4% of provided coordinates (Example: 45° 32' 25" N, 129° 40' 31" W → 45.5402778, -129.6752778)
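The coordinate example is a degrees/minutes/seconds to decimal degrees conversion: decimal = degrees + minutes/60 + seconds/3600, negated for the southern and western hemispheres. A small Python sketch, assuming only the simple "D° M' S" H" layout shown on the slide (the function name and regular expression are illustrative, not the Canadensys implementation):

```python
import re

# Degrees are required; minutes and seconds are optional; hemisphere comes last.
DMS = re.compile(
    r'''^\s*(\d+)\s*°\s*(?:(\d+)\s*['′]\s*)?(?:(\d+(?:\.\d+)?)\s*["″]\s*)?([NSEW])\s*$'''
)


def dms_to_decimal(text: str) -> float:
    """Convert e.g. 45° 32' 25" N to decimal degrees, rounded to 7 places."""
    match = DMS.match(text)
    if match is None:
        raise ValueError(f"unrecognized coordinate: {text!r}")
    degrees, minutes, seconds, hemisphere = match.groups()
    value = int(degrees) + int(minutes or 0) / 60 + float(seconds or 0) / 3600
    if hemisphere in "SW":  # south and west are negative
        value = -value
    return round(value, 7)
```

Applied to the slide's values, `dms_to_decimal('45° 32\' 25" N')` gives 45.5402778 and `dms_to_decimal('129° 40\' 31" W')` gives -129.6752778.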
  • 22. Data usability after processing • 7% of provided country text • 16% of provided state/province text • 4% of provided coordinates • 42% of provided dates (Example: 2008 VI 13 → 2008-06-13)
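The date example uses a Roman numeral for the month, a convention common on specimen labels. A minimal Python sketch of that one case, assuming a "year, Roman month, day" order (the function name is hypothetical; real verbatim dates come in many more layouts):

```python
import re
from datetime import date

ROMAN_MONTHS = {
    "I": 1, "II": 2, "III": 3, "IV": 4, "V": 5, "VI": 6,
    "VII": 7, "VIII": 8, "IX": 9, "X": 10, "XI": 11, "XII": 12,
}


def parse_roman_month_date(text: str):
    """Return an ISO 8601 date string, or None when the text cannot be parsed."""
    match = re.fullmatch(r"\s*(\d{4})[\s./-]+([IVX]+)[\s./-]+(\d{1,2})\s*", text.upper())
    if match is None:
        return None
    year, roman, day = match.groups()
    month = ROMAN_MONTHS.get(roman)
    if month is None:
        return None
    try:
        # date() rejects impossible days such as 30 February.
        return date(int(year), month, int(day)).isoformat()
    except ValueError:
        return None
```

With this sketch, `parse_roman_month_date("2008 VI 13")` returns `"2008-06-13"`, while an impossible date returns `None` so the record can be flagged for re-examination.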
  • 23. Data usability including processed data [Stacked bar chart, % of non-null provided data, clean verbatim + recovered by processing: country text 92% + 7%, state/province text 60% + 16%, coordinates 96% + 4%, dates 44% + 42%]
  • 24. Projects with data quality tools • Atlas of Living Australia • GBIF Norway, GBIF Spain, National Biodiversity Network, BioVeL … • GBIF libraries • Most nodes have their own data quality routine
  • 25. Hopes and expectations
  • 26. We do not want to • Maintain taxonomic authority files • Maintain country, province and city lists
  • 27. We prefer to • Efficiently use specialized resources/services • Provide reports, quality indices
  • 28. Help from Semantic Web • Data in other languages (French, Spanish, …) should not be flagged as errors • Misspellings should be shared as a common resource (e.g. SKOS) • Understand historical data (e.g. collected in USSR in 1980)
  • 29. Reporting and log • Darwin Core annotations for processed data • Shared vocabulary for structured reports and quality indices
  • 30. Summary • Tools available for sharing • Use, review, contribute • Opportunity for broad coordination and increased efficiencies
  • 31. Thanks Anne Bruneau, Institut de recherche en biologie végétale and Département de Sciences Biologiques, Université de Montréal
  • 32. Contact http://www.canadensys.net http://github.com/Canadensys @Canadensys (Photo: Gulo gulo, Larry Master, www.masterimages.org)
  • 33. Multi-field processing
      DwC field         | Raw data      | Processed data
      verbatimLatitude  | 45°30′N       | 45.5
      verbatimLongitude | 73°34′W       | -73.5666667
      country           | Canada        | Canada
      stateProvince     | QC            | Quebec
      municipality      | Montreal City | Montreal
  • 34. Multi-field processing 1. Get information on coordinates 45.5, -73.5666667 2. Compare with processed data 3. Assert that these coordinates are in Montréal
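The three steps on slide 34 can be sketched as a consistency check between processed coordinates and processed place names. The Python below is an illustration under strong assumptions: the bounding box for Montreal is a rough, hand-picked rectangle used only for this example, and `PLACES` stands in for whatever gazetteer or reverse-geocoding service a real aggregator would query.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    min_lat: float
    max_lat: float
    min_lon: float
    max_lon: float

    def contains(self, lat: float, lon: float) -> bool:
        return (self.min_lat <= lat <= self.max_lat
                and self.min_lon <= lon <= self.max_lon)


# ASSUMPTION: approximate illustrative extent, not authoritative boundaries.
PLACES = {
    ("Canada", "Quebec", "Montreal"): BoundingBox(45.40, 45.71, -73.98, -73.47),
}


def coordinates_match_place(lat, lon, country, state_province, municipality):
    """True/False when the place is known, None when it cannot be checked."""
    box = PLACES.get((country, state_province, municipality))
    if box is None:
        return None
    return box.contains(lat, lon)
```

Using the processed values from slide 33, `coordinates_match_place(45.5, -73.5666667, "Canada", "Quebec", "Montreal")` returns `True`. Returning `None` for unknown places keeps "could not check" distinct from "inconsistent" in a quality report.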