Crawling for EarthCube
Ruth Duerr, Luis Lopez, Abeve Tayachow, Erik Mingo
Outline
• Very brief NSIDC intro
• Why crawl?
• Libre crawler architecture
• Questions for the community
NSIDC: An overview
NSIDC affiliations and sponsorship
• Cooperative Institute for Research in Environmental Sciences
• University of Colorado Boulder
Main sponsors:
• National Science Foundation
• NASA
• National Oceanic and Atmospheric Administration
The National Snow and Ice Data Center…
• Provides tools for data access
• Researches the cryosphere and data science
• Educates the public about the cryosphere
• Supports data users
• Manages and distributes scientific data
• Supports local and traditional knowledge
Outline
• Very brief NSIDC intro
• Why crawl?
• Libre crawler architecture
• Questions for the community
Why not let Google do it?
• What's their incentive?
• The schema.org route for data has extreme limitations
Ways to build a comprehensive catalog
• Ask folks to register their data and services
• Build your catalog by hand
• Automate discovery of data and services
Preparing Data for Ingest, presented 10/27/09 by R. Duerr
LID590DCL Foundations of Data Curation
What if...
Advertising your data so that everyone could
find them were as simple as...
1 - Filling out a web form
2 - Saving it to your website
3 - Adding its link to your site
Well... It can be!
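A minimal sketch of how a crawler might pick up step 3, the link added to a site. It assumes, as one plausible reading of the slide, that the saved file is an OpenSearch description document advertised through a link carrying the OpenSearch media type; the slide itself does not name the format. The names `LinkFinder` and `find_description_links` are hypothetical and are not part of any crawler described in this talk.

```python
from html.parser import HTMLParser

# Media type registered for OpenSearch description documents.
OPENSEARCH_TYPE = "application/opensearchdescription+xml"


class LinkFinder(HTMLParser):
    """Collect href values of <link> or <a> tags that advertise a description document."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("link", "a"):
            return
        attributes = dict(attrs)
        # Attribute values can be None when written without a value.
        media_type = (attributes.get("type") or "").lower()
        href = attributes.get("href")
        if media_type == OPENSEARCH_TYPE and href:
            self.found.append(href)


def find_description_links(html_text):
    """Return the advertised description-document links found in one HTML page."""
    finder = LinkFinder()
    finder.feed(html_text)
    return finder.found
```

A crawler built along these lines would still need to resolve relative links against the page URL and fetch each document before cataloging it.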
Why not let Google do it?
Outline
• Very brief NSIDC intro
• Why crawl?
• Libre crawler architecture
• Questions for the community
Crawler Big Picture
BCube Crawler
BCube Broker
CINERGI
Crawler Architecture
Things we are going to search for
• OpenSearch
• OAI-PMH, ESIP Data and service cast feeds
• THREDDS catalogs
• Web-enabled folders
• WADL/WSDL
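As a rough illustration of how a crawler could tell these formats apart once it has fetched a document, the sketch below matches the XML root element and namespace against the signatures published in each specification. It is an assumption about one possible approach, not the BCube crawler's actual logic; the function name `classify` is hypothetical. ESIP casts (Atom-based feeds) and web-enabled folders (HTML listings) need other checks and are left out, and only WSDL 1.1 is covered.

```python
import xml.etree.ElementTree as ET

# (namespace URI, root element name) -> format label.
SIGNATURES = {
    ("http://a9.com/-/spec/opensearch/1.1/", "OpenSearchDescription"): "OpenSearch",
    ("http://www.openarchives.org/OAI/2.0/", "OAI-PMH"): "OAI-PMH",
    ("http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0", "catalog"): "THREDDS catalog",
    ("http://wadl.dev.java.net/2009/02", "application"): "WADL",
    ("http://schemas.xmlsoap.org/wsdl/", "definitions"): "WSDL",
}


def classify(xml_text):
    """Return a format label for an XML document, or None if it is unrecognized."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return None  # not well-formed XML, so none of the formats above
    tag = root.tag
    # ElementTree writes namespaced tags as "{namespace}localname".
    if tag.startswith("{"):
        namespace, _, local_name = tag[1:].partition("}")
    else:
        namespace, local_name = "", tag
    return SIGNATURES.get((namespace, local_name))
```

Keeping the signatures in a lookup table means that answers to "what else should we look for?" can be added as new entries without touching the parsing code.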
But what else should we look for?
Questions/Comments

AHM 2014: Crawling for EarthCube
