Data quality challenges in the
Canadensys network of
occurrence records: examples,
tools, and solutions

TDWG 2013 talk on data quality challenges in the Canadensys network of occurrence records: examples, tools, and solutions.
Authors: Christian Gendreau, David P. Shorthouse, Peter Desmet


  1. Data quality challenges in the Canadensys network of occurrence records: examples, tools, and solutions
     Christian Gendreau, David Shorthouse & Peter Desmet
  2. Game plan
     • Introduction to Canadensys
     • Data quality @ Canadensys
     • Canadensys processing solutions
     • Numbers from Canadensys
     • Hopes and expectations
  3. A network of people and collections
  4. Canadensys headquarters: Université de Montréal Biodiversity Centre
  5. data.canadensys.net/vascan
  6. data.canadensys.net/ipt
  7. data.canadensys.net/explorer
  8. Data quality related activities, from an aggregator perspective
  9. During data entry
     • Help to avoid typographical errors
     • Help to convert verbatim data
     Actor: data entry person
  10. Before publication
      • Detect file character encoding issues
      • Detect duplicate or missing IDs
      Actor: data publisher
      Previous activity: data entry
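The two pre-publication checks on slide 10 can be sketched in a few lines of Python. This is only an illustration and not the Canadensys tooling: the function names are hypothetical, UTF-8 is assumed as the expected encoding, and `occurrenceID` is assumed to be the identifier column.

```python
import csv
import io
from collections import Counter


def decodes_cleanly(raw: bytes, encoding: str = "utf-8") -> bool:
    """Return True if the raw bytes decode without error in the given encoding."""
    try:
        raw.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


def check_ids(rows, id_field="occurrenceID"):
    """Return (row numbers with a missing ID, sorted list of duplicated IDs)."""
    missing = []
    counts = Counter()
    for number, row in enumerate(rows, start=1):
        value = (row.get(id_field) or "").strip()
        if not value:
            missing.append(number)
        else:
            counts[value] += 1
    duplicates = sorted(value for value, n in counts.items() if n > 1)
    return missing, duplicates


# Hypothetical sample file: one duplicated ID (A1) and one missing ID (row 3).
sample = "occurrenceID,country\nA1,Canada\nA1,Canada\n,Canada\nA2,Canada\n"
rows = list(csv.DictReader(io.StringIO(sample)))
missing, duplicates = check_ids(rows)
print(missing, duplicates)

# Latin-1 bytes for an accented name are not valid UTF-8.
print(decodes_cleanly("Québec".encode("latin-1")))
```

A failed decode only shows that the file is not in the expected encoding; it does not identify which encoding was actually used.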
  11. During aggregation
      • Process data: validation, cleaning
      • Produce structured reports: quality control
      Actor: data aggregator
      Previous activity: before publication
  12. After aggregation
      • Allow and facilitate community feedback
      • Help data publisher to integrate corrections
      Actor: users and community
      Previous activity: aggregation
  13. Canadensys tools during data entry
      data.canadensys.net/tools
  14. Why do we process data?
      • Enrich our Explorer, http://data.canadensys.net
      • Provide structured reports to data providers
      • Help identify records that need re-examination
      • Help to improve data entry procedure
  15. Data processing
  16. Processing solutions: Narwhals to the rescue
      (Narwhal image: public domain)
  17. The narwhal-processor approach
      • Single field processing to allow complex processing (combined fields)
      • Processors with common interface ease integration and usage
      • Collaboration
      https://github.com/Canadensys/narwhal-processor
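The idea on slide 17, single-field processors that share one interface so they can be combined, might look roughly like the Python sketch below. Every class and function name here is hypothetical and none is taken from the narwhal-processor code base; the sketch only shows the design idea.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


@dataclass
class Result:
    raw: str                     # value as provided
    value: Optional[str]         # processed value, or None if processing failed
    error: Optional[str] = None  # short reason when value is None


class Processor(ABC):
    """Common interface: one raw field value in, one Result out."""

    @abstractmethod
    def process(self, raw: str) -> Result:
        ...


class CollapseWhitespace(Processor):
    def process(self, raw: str) -> Result:
        cleaned = " ".join(raw.split())
        if not cleaned:
            return Result(raw, None, "empty value")
        return Result(raw, cleaned)


class Lookup(Processor):
    """Map known spellings to a preferred value (case-insensitive)."""

    def __init__(self, table):
        self.table = {key.lower(): value for key, value in table.items()}

    def process(self, raw: str) -> Result:
        key = " ".join(raw.split()).lower()
        if key in self.table:
            return Result(raw, self.table[key])
        return Result(raw, None, "unrecognized value")


def process_record(record, processors):
    """Apply one processor per field; callers can combine the results."""
    return {
        field: processor.process(record.get(field, ""))
        for field, processor in processors.items()
    }


processors = {
    "stateProvince": Lookup({"QC": "Quebec", "Quebec": "Quebec"}),
    "municipality": CollapseWhitespace(),
}
results = process_record(
    {"stateProvince": " qc ", "municipality": "Montreal   City"}, processors
)
print(results["stateProvince"].value, "|", results["municipality"].value)
```

Because every processor returns the same `Result` shape, a report step can treat all fields uniformly.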
  18. Data usability before processing
      [Bar chart: % of non-null clean verbatim data for country text, state/province text, coordinates, and dates. Value labels on the bars: 96%, 92%, 60%, 44%.]
  19. Data usability after processing
      • 7% of provided country text
      Example: USA → ISO 3166-2:US, United States
  20. Data usability after processing
      • 7% of provided country text
      • 16% of provided state/province text
      Example: Qué → ISO 3166-2 CA-QC, Quebec
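The country and state/province examples on slides 19 and 20 amount to matching free text against a controlled list after light normalization. A minimal sketch follows, assuming a tiny hand-written lookup table; the table contents and function names are illustrative only, and a real pipeline would rely on an external authority list.

```python
import unicodedata


def normalize_key(text: str) -> str:
    """Strip accents, periods, and surrounding whitespace, then lowercase."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return no_accents.replace(".", "").strip().lower()


# Hypothetical lookup tables: normalized spelling -> (code, preferred name).
COUNTRIES = {
    "usa": ("US", "United States"),
    "united states": ("US", "United States"),
    "canada": ("CA", "Canada"),
}
PROVINCES = {
    "que": ("CA-QC", "Quebec"),
    "qc": ("CA-QC", "Quebec"),
    "quebec": ("CA-QC", "Quebec"),
}


def lookup(text: str, table):
    """Return (code, name) for a recognized spelling, else None."""
    return table.get(normalize_key(text))


print(lookup("USA", COUNTRIES))
print(lookup("Qué", PROVINCES))
print(lookup("Atlantis", COUNTRIES))
```

Unrecognized values return `None` rather than a guess, so they can be reported back to the data provider.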
  21. Data usability after processing
      • 7% of provided country text
      • 16% of provided state/province text
      • 4% of provided coordinates
      Example: 45° 32' 25" N, 129° 40' 31" W → 45.5402778, -129.6752778
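The coordinate example on slide 21 is a degrees-minutes-seconds to decimal-degrees conversion: degrees + minutes/60 + seconds/3600, negated for the southern and western hemispheres. A minimal sketch, assuming the simple input pattern shown on the slide (the regular expression and function names are my own and cover far fewer formats than a production parser would):

```python
import re

_NUM = r"(\d+(?:\.\d+)?)"
_DMS = re.compile(
    r"^\s*" + _NUM + r"\s*°\s*"          # degrees
    + r"(?:" + _NUM + r"\s*['′]\s*)?"    # optional minutes
    + r'(?:' + _NUM + r'\s*["″]\s*)?'    # optional seconds
    + r"([NSEW])\s*$"                    # hemisphere letter
)


def dms_to_decimal(text: str):
    """Convert one DMS coordinate to decimal degrees, or None if unparsable."""
    match = _DMS.match(text)
    if match is None:
        return None
    degrees, minutes, seconds, hemisphere = match.groups()
    value = float(degrees) + float(minutes or 0) / 60 + float(seconds or 0) / 3600
    if hemisphere in "SW":
        value = -value
    return round(value, 7)


def parse_pair(text: str):
    """Split 'lat, lon' on the comma and convert both parts."""
    lat_text, lon_text = text.split(",")
    return dms_to_decimal(lat_text), dms_to_decimal(lon_text)


print(parse_pair("45° 32' 25\" N, 129° 40' 31\" W"))
```

For the slide's input this gives 45 + 32/60 + 25/3600 ≈ 45.5402778 and -(129 + 40/60 + 31/3600) ≈ -129.6752778, matching the processed values shown.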
  22. Data usability after processing
      • 7% of provided country text
      • 16% of provided state/province text
      • 4% of provided coordinates
      • 42% of provided dates
      Example: 2008 VI 13 → 2008-06-13
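The date example on slide 22 uses a Roman-numeral month. One way to handle that single pattern is sketched below; the accepted separators and the function name are assumptions, and real verbatim dates come in many more shapes.

```python
import re
from datetime import date

ROMAN_MONTHS = {
    "I": 1, "II": 2, "III": 3, "IV": 4, "V": 5, "VI": 6,
    "VII": 7, "VIII": 8, "IX": 9, "X": 10, "XI": 11, "XII": 12,
}

# Year, Roman-numeral month, day, separated by spaces, dots, dashes, or slashes.
_PATTERN = re.compile(
    r"^\s*(\d{4})[\s.\-/]+([IVX]+)[\s.\-/]+(\d{1,2})\s*$", re.IGNORECASE
)


def parse_roman_month_date(text: str):
    """Return an ISO 8601 date string, or None if the text does not parse."""
    match = _PATTERN.match(text)
    if match is None:
        return None
    year, roman, day = match.groups()
    month = ROMAN_MONTHS.get(roman.upper())
    if month is None:
        return None
    try:
        return date(int(year), month, int(day)).isoformat()
    except ValueError:  # for example, day 30 in February
        return None


print(parse_roman_month_date("2008 VI 13"))
```

Building a `datetime.date` before formatting means impossible dates are rejected instead of being passed through as well-formed strings.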
  23. Data usability including processed data
      [Stacked bar chart: % of non-null provided data for country text, state/province text, coordinates, and dates. Clean verbatim labels: 96%, 92%, 60%, 44%; labels added by processing: 4%, 7%, 16%, 42%.]
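The stacked chart on slide 23 can be read as simple addition per field. The pairing of the "clean" percentages with fields below is an inference, not something the transcript states: the bullets on slides 19 to 22 tie 7%, 16%, 4%, and 42% to specific fields, and if both series share the same base, only one pairing keeps every stack at or below 100%.

```python
# Percent of non-null provided values usable as-is (pairing with fields is inferred).
clean = {
    "country text": 92,
    "state/province text": 60,
    "coordinates": 96,
    "dates": 44,
}
# Percent of non-null provided values made usable by processing (slides 19 to 22).
recovered = {
    "country text": 7,
    "state/province text": 16,
    "coordinates": 4,
    "dates": 42,
}

usable = {field: clean[field] + recovered[field] for field in clean}
for field, total in usable.items():
    print(f"{field}: {clean[field]}% + {recovered[field]}% = {total}%")
```

Under that reading, dates gain the most from processing (44% to 86%), while coordinates were already almost entirely usable.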
  24. Projects with data quality tools
      • Atlas of Living Australia
      • GBIF Norway, GBIF Spain, National Biodiversity Network, BioVeL …
      • GBIF libraries
      • Most nodes have their own data quality routine
  25. Hopes and expectations
  26. We do not want to
      • Maintain taxonomic authority files
      • Maintain country, province and city lists
  27. We prefer to
      • Efficiently use specialized resources/services
      • Provide report, quality indices
  28. Help from Semantic Web
      • Data in other languages (French, Spanish, …) should not be flagged as error
      • Misspellings should be shared as a common resource (e.g. SKOS)
      • Understand historical data (e.g. collected in USSR in 1980)
  29. Reporting and log
      • Darwin Core annotations for processed data
      • Shared vocabulary for structured reports and quality indices
  30. Summary
      • Tools available for sharing
      • Use, review, contribute
      • Opportunity for broad coordination and increased efficiencies
  31. Thanks
      Anne Bruneau, Institut de recherche en biologie végétale and Département de Sciences Biologiques, Université de Montréal
  32. Contact
      http://www.canadensys.net
      http://github.com/Canadensys
      @Canadensys
      Gulo gulo, Larry Master (www.masterimages.org)
  33. Multi-field processing
      DwC field           Raw data        Processed data
      verbatimLatitude    45°30′N         45.5
      verbatimLongitude   73°34′W         -73.5666667
      country             Canada          Canada
      stateProvince       QC              Quebec
      municipality        Montreal City   Montreal
  34. Multi-field processing
      1. Get information on coordinates: 45.5, -73.5666667
      2. Compare with processed data
      3. Assert that these coordinates are in Montréal
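The three steps on slide 34 can be sketched as a cross-check between processed coordinates and processed place names. The slides do not say how "information on coordinates" is obtained, so the sketch substitutes a tiny in-memory gazetteer with a bounding box. The box values for Montréal are rough illustrative numbers rather than authoritative boundaries, and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    min_lat: float
    max_lat: float
    min_lon: float
    max_lon: float

    def contains(self, lat: float, lon: float) -> bool:
        return (self.min_lat <= lat <= self.max_lat
                and self.min_lon <= lon <= self.max_lon)


# Hypothetical gazetteer entry; the extent is approximate and for illustration only.
GAZETTEER = {
    ("Canada", "Quebec", "Montreal"): BoundingBox(45.40, 45.71, -73.98, -73.47),
}


def coordinates_match_place(record):
    """True/False if the place is known, None if it cannot be checked."""
    key = (record["country"], record["stateProvince"], record["municipality"])
    box = GAZETTEER.get(key)
    if box is None:
        return None
    return box.contains(record["decimalLatitude"], record["decimalLongitude"])


processed = {
    "country": "Canada",
    "stateProvince": "Quebec",
    "municipality": "Montreal",
    "decimalLatitude": 45.5,
    "decimalLongitude": -73.5666667,
}
print(coordinates_match_place(processed))
```

The check depends on single-field processing having run first: the raw values "QC" and "Montreal City" would not match the gazetteer key, while the processed values do.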
