Data quality challenges in the Canadensys network of occurrence records: examples, tools, and solutions

TDWG 2013 talk on data quality challenges in the Canadensys network of occurrence records: examples, tools, and solutions.
Authors: Christian Gendreau, David P. Shorthouse, Peter Desmet

Published in: Technology, Business

  1. Data quality challenges in the Canadensys network of occurrence records: examples, tools, and solutions. Christian Gendreau, David Shorthouse & Peter Desmet
  2. Game plan
     • Introduction to Canadensys
     • Data quality @ Canadensys
     • Canadensys processing solutions
     • Numbers from Canadensys
     • Hopes and expectations
  3. A network of people and collections
  4. Canadensys headquarters: Université de Montréal Biodiversity Centre
  5. data.canadensys.net/vascan
  6. data.canadensys.net/ipt
  7. data.canadensys.net/explorer
  8. Data quality related activities, from an aggregator perspective
  9. During data entry
     • Help to avoid typographical errors
     • Help to convert verbatim data
     Actor: data entry person
  10. Before publication
      • Detect file character encoding issues
      • Detect duplicate or missing IDs
      Actor: data publisher. Previous activity: data entry
  11. During aggregation
      • Process data: validation, cleaning
      • Produce structured reports: quality control
      Actor: data aggregator. Previous activity: before publication
  12. After aggregation
      • Allow and facilitate community feedback
      • Help data publishers to integrate corrections
      Actor: users and community. Previous activity: aggregation
  13. Canadensys tools during data entry: data.canadensys.net/tools
  14. Why do we process data?
      • Enrich our Explorer, http://data.canadensys.net
      • Provide structured reports to data providers
      • Help identify records that need re-examination
      • Help to improve data entry procedures
  15. Data processing
  16. Processing solutions: narwhals to the rescue (narwhal image: public domain)
  17. The narwhal-processor approach
      • Single-field processing as the building block for complex processing (combined fields)
      • Processors with a common interface ease integration and usage
      • Collaboration: https://github.com/Canadensys/narwhal-processor
  18. Data usability before processing
      Bar chart: % of non-null clean verbatim data
      • country text: 92%
      • state/province text: 60%
      • coordinates: 96%
      • dates: 44%
  19. Data usability after processing
      • 7% of provided country text
      Example: USA → ISO 3166-2:US, United States
  20. Data usability after processing
      • 7% of provided country text
      • 16% of provided state/province text
      Example: Qué → ISO 3166-2 CA-QC, Quebec
  21. Data usability after processing
      • 7% of provided country text
      • 16% of provided state/province text
      • 4% of provided coordinates
      Example: 45° 32' 25" N, 129° 40' 31" W → 45.5402778, -129.6752778
  22. Data usability after processing
      • 7% of provided country text
      • 16% of provided state/province text
      • 4% of provided coordinates
      • 42% of provided dates
      Example: 2008 VI 13 → 2008-06-13
  23. Data usability including processed data
      Stacked bar chart: % of non-null provided data (clean verbatim + processed)
      • country text: 92% + 7%
      • state/province text: 60% + 16%
      • coordinates: 96% + 4%
      • dates: 44% + 42%
  24. Projects with data quality tools
      • Atlas of Living Australia
      • GBIF Norway, GBIF Spain, National Biodiversity Network, BioVeL …
      • GBIF libraries
      • Most nodes have their own data quality routine
  25. Hopes and expectations
  26. We do not want to
      • Maintain taxonomic authority files
      • Maintain country, province and city lists
  27. We prefer to
      • Efficiently use specialized resources/services
      • Provide reports and quality indices
  28. Help from the Semantic Web
      • Data in other languages (French, Spanish, …) should not be flagged as errors
      • Misspellings should be shared as a common resource (e.g. SKOS)
      • Understand historical data (e.g. collected in the USSR in 1980)
  29. Reporting and logs
      • Darwin Core annotations for processed data
      • Shared vocabulary for structured reports and quality indices
  30. Summary
      • Tools available for sharing
      • Use, review, contribute
      • Opportunity for broad coordination and increased efficiencies
  31. Thanks: Anne Bruneau, Institut de recherche en biologie végétale and Département de Sciences Biologiques, Université de Montréal
  32. Contact
      • http://www.canadensys.net
      • http://github.com/Canadensys
      • @Canadensys
      Photo: Gulo gulo, Larry Master (www.masterimages.org)
  33. Multi-field processing

      DwC field            Raw data         Processed data
      verbatimLatitude     45°30′N          45.5
      verbatimLongitude    73°34′W          -73.5666667
      country              Canada           Canada
      stateProvince        QC               Quebec
      municipality         Montreal City    Montreal
  34. Multi-field processing
      1. Get information on coordinates (45.5, -73.5666667)
      2. Compare with processed data
      3. Assert that these coordinates are in Montréal
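
The multi-field check on slides 33 and 34 can be sketched as below. This is a minimal illustration, not the narwhal-processor implementation: the `BoundingBox` class, the `MUNICIPALITY_BOXES` table, the function name, and the rough box around Montreal are all assumptions made for the example. A real check would more likely query a gazetteer or geocoding service.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BoundingBox:
    """Axis-aligned latitude/longitude box in decimal degrees."""
    min_lat: float
    max_lat: float
    min_lon: float
    max_lon: float

    def contains(self, lat: float, lon: float) -> bool:
        return (self.min_lat <= lat <= self.max_lat
                and self.min_lon <= lon <= self.max_lon)


# Hypothetical, hand-drawn box that roughly covers Montreal (assumption).
MUNICIPALITY_BOXES = {
    "Montreal": BoundingBox(45.40, 45.71, -73.98, -73.47),
}


def coordinates_match_municipality(lat: float, lon: float,
                                   municipality: str) -> Optional[bool]:
    """Compare processed coordinates with the processed municipality.

    Returns True or False when the municipality is known,
    and None when there is nothing to compare against.
    """
    box = MUNICIPALITY_BOXES.get(municipality)
    if box is None:
        return None
    return box.contains(lat, lon)


# Processed values from slide 33: coordinates and municipality agree.
in_montreal = coordinates_match_municipality(45.5, -73.5666667, "Montreal")
```

Returning `None` for an unknown municipality keeps "cannot check" separate from "checked and inconsistent", which matters if the result feeds a structured quality report.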

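Slide 17 describes single-field processors that share a common interface, and slides 19 to 22 give example conversions. The sketch below shows one way such an interface could look in Python. The class names, the `Result` type, and the lookup table are hypothetical; narwhal-processor has its own API, which is not reproduced here. The patterns cover only the formats shown on the slides.

```python
import re
from abc import ABC, abstractmethod
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Result:
    """Outcome of processing one raw field value."""
    raw: str
    value: Optional[object] = None
    error: Optional[str] = None


class FieldProcessor(ABC):
    """Hypothetical common interface: one raw string in, one Result out."""

    @abstractmethod
    def process(self, raw: str) -> Result:
        ...


class DmsProcessor(FieldProcessor):
    """Convert one degrees/minutes/seconds coordinate to decimal degrees."""

    _PATTERN = re.compile(
        r"^\s*(\d+)\s*°"                    # degrees
        r"(?:\s*(\d+)\s*['′])?"             # optional minutes
        r'(?:\s*(\d+(?:\.\d+)?)\s*["″])?'   # optional seconds
        r"\s*([NSEW])\s*$"                  # hemisphere letter
    )

    def process(self, raw: str) -> Result:
        match = self._PATTERN.match(raw)
        if match is None:
            return Result(raw, error="unrecognized coordinate")
        degrees, minutes, seconds, hemisphere = match.groups()
        value = int(degrees) + int(minutes or 0) / 60 + float(seconds or 0) / 3600
        if hemisphere in "SW":  # south and west are negative
            value = -value
        return Result(raw, value=round(value, 7))


class RomanMonthDateProcessor(FieldProcessor):
    """Convert 'YYYY <Roman-numeral month> DD' to an ISO 8601 date string."""

    _MONTHS = {name: number for number, name in enumerate(
        ["I", "II", "III", "IV", "V", "VI",
         "VII", "VIII", "IX", "X", "XI", "XII"], start=1)}
    _PATTERN = re.compile(r"^\s*(\d{4})\s+([IVX]+)\s+(\d{1,2})\s*$", re.IGNORECASE)

    def process(self, raw: str) -> Result:
        match = self._PATTERN.match(raw)
        month = self._MONTHS.get(match.group(2).upper()) if match else None
        if month is None:
            return Result(raw, error="unrecognized date")
        try:
            parsed = date(int(match.group(1)), month, int(match.group(3)))
        except ValueError as exc:  # e.g. day 31 in a 30-day month
            return Result(raw, error=str(exc))
        return Result(raw, value=parsed.isoformat())


class LookupProcessor(FieldProcessor):
    """Map known spellings to a controlled value, ignoring case."""

    def __init__(self, table: dict):
        self._table = {key.casefold(): value for key, value in table.items()}

    def process(self, raw: str) -> Result:
        value = self._table.get(raw.strip().casefold())
        if value is None:
            return Result(raw, error="unknown value")
        return Result(raw, value=value)


# Because every processor exposes process(), a record can be handled uniformly.
processors = {
    "verbatimLatitude": DmsProcessor(),
    "verbatimEventDate": RomanMonthDateProcessor(),
    "country": LookupProcessor({"USA": "United States"}),  # illustrative table
}
record = {"verbatimLatitude": "45° 32' 25\" N",
          "verbatimEventDate": "2008 VI 13",
          "country": "USA"}
processed = {field: processors[field].process(raw) for field, raw in record.items()}
```

Keeping the raw value and an error message in each `Result` lets unprocessable values be reported back to the data publisher rather than silently dropped.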
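
The pre-publication checks on slide 10 (character encoding issues, duplicate or missing IDs) could be approximated as follows. This is a sketch under stated assumptions: the data is a CSV file with a header row, the identifier column is the Darwin Core term `occurrenceID`, and "encoding issue" is reduced to "the bytes do not decode as UTF-8". The function names are made up for the example and do not correspond to a Canadensys tool.

```python
import csv
import io
from collections import Counter


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the raw file bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def check_ids(csv_text: str, id_column: str = "occurrenceID"):
    """Return (record numbers with a missing ID, sorted duplicated IDs).

    Record numbers start at 1 for the first row after the header.
    """
    missing = []
    counts = Counter()
    reader = csv.DictReader(io.StringIO(csv_text))
    for record_number, row in enumerate(reader, start=1):
        identifier = (row.get(id_column) or "").strip()
        if identifier:
            counts[identifier] += 1
        else:
            missing.append(record_number)
    duplicates = sorted(i for i, n in counts.items() if n > 1)
    return missing, duplicates


# Tiny made-up sample: record 2 has no ID and "A1" appears twice.
sample = "occurrenceID,country\nA1,Canada\n,Canada\nA1,USA\nB2,Canada\n"
missing, duplicates = check_ids(sample)
```

A file can decode as UTF-8 and still contain mojibake from an earlier wrong conversion, so this check only catches the simplest class of encoding problem.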