
PANGAEA: Providing access to geoscientific data using Apache Lucene Java


  1. 1. PANGAEA - Providing access to geoscientific data using Apache Lucene Java Uwe Schindler PANGAEA / SD DataSolutions GmbH, uschindler@pangaea.de
  2. 2. My Background My main focus is on the development of Lucene Java. I implemented fast numerical search and maintain the new attribute-based text analysis API. I studied Physics at the University of Erlangen-Nuremberg and work as a consultant and software architect for PANGAEA (Publishing Network for Geoscientific & Environmental Data) in Bremen, Germany, where I implemented the portal's geospatial retrieval functions with Lucene Java. I give talks about Lucene at various international conferences such as ApacheCon EU/US, Lucene Eurocon, and Berlin Buzzwords, and at various local meetups. I am a committer and PMC member of Apache Lucene and Solr.
  3. 3. since 1993 Information system for earth system science data hosted by AWI & MARUM 2001 Mandate of the International Council for Science (ICSU): World Data Center for Marine Environmental Sciences (WDC-MARE) 2007 Mandate of the World Meteorological Organisation (WMO): World Radiation Monitoring Center (WRMC) 2010 (certification in progress) Mandate of the World Meteorological Organisation (WMO): Data Collection and Processing Center (DCPC) About PANGAEA
  4. 4. Nuclear Radiation Tokyo, Japan WDC Co-ordination Offices Washington DC, USA Beijing, China Meteorology Asheville NC, USA Beijing, China Obninsk, Russia Oceanography Obninsk, Russia Silver Spring MD, USA Tianjin, China Paleoclimatology Boulder CO, USA Marine Geology and Geophysics Boulder CO, USA Moscow, Russia Remotely Sensed Land Data Sioux Falls SD, USA Renewable Resources and Environment Beijing, China Recent Crustal Movements Ondrejov, Czech Republic Airglow Mitaka, Japan Astronomy Beijing, China Atmospheric Trace Gases Oak Ridge TN, USA Aurora Tokyo, Japan Cosmic Rays Toyokawa, Japan Geology Beijing, China Human Interactions in the Environment Palisades NY, USA Ionosphere Tokyo, Japan Earth Tides Brussels, Belgium Geomagnetism Copenhagen, Denmark Edinburgh, UK Kyoto, Japan Colaba, India Glaciology Boulder CO, USA Cambridge, UK Lanzhou, China Marine Environmental Sciences Bremen, Germany, (2001) Rotation of the Earth Obninsk, Russia Washington DC, USA Satellite Information Greenbelt MD, USA Rockets and Satellites Obninsk, Russia Seismology Denver CO, USA Beijing, China Solar Radio Emission Nagano, Japan Space Science Beijing, China Space Science Satellites Kanagawa, Japan Solar Activity Meudon, France Soils Wageningen, The Netherlands Sunspot Index Brussels, Belgium Solar Terrestrial Physics Boulder CO, USA Didcot Oxon, UK Moscow, Russia Haymarket, Australia Solid Earth Geophysics Beijing, China Boulder CO, USA Moscow, Russia Network of World Data Centers Geophysical Year 1957
  5. 5. Why do we need Data Libraries? - Good scientific practice - Needed for verification of scientific work - Good availability of data for large scale and complex scientific approaches - than reproduction
  6. 6. Geosciences before 1900 Turin papyrus, ~1160 BC William Smith, 1815 Glomar Challenger, 1875
  7. 7. ENIAC, 1944 Technical Improvements Magnetometer
  8. 8. Development of the global climate The last 1300 years Thousands of years before present
  9. 9. Information increase in empirical sciences [Chart, 1970 to 2010: Publications, Data ?]
  10. 10. Archiving and publication of scientific data Data acquisition Quality assurance Long-term availability and access
  11. 11. Long term archive. Open access & non-restricted data (Creative Commons license). Data accepted from individual scientists, institutes, and science projects. Long term funding for basic operation (hardware, software, system management & organisation). Long term preservation of data: technical (security, migration of media) and usability (preserving the integrity & semantics of data sets).
  12. 12. Contents
  13. 13. Data Types in PANGAEA [Figures: panels of IRD (grav/10 cm3), Sand (%), CaCO3 (%), TOC (%), Radio (%/sand) and Smect (%/clay) plotted against Age (kyr) for cores PS1389-3, PS1390-3, PS1431-1, PS1640-1 and PS1648-1; map of grain size classes and geochemistry, source: Baltic Sea Research Institute, Warnemünde] Profiles => doi:10.1594/PANGAEA.701299 Time series => doi:10.1594/PANGAEA.323487 Sea bed photos => doi:10.1594/PANGAEA.319877 Distributed samples => doi:10.1594/PANGAEA.51749 Complex data => doi:10.1594/PANGAEA.108079 Air photos => doi:10.1594/PANGAEA.323540 Audio record => doi:10.1594/PANGAEA.339110
  14. 14. Statistics (9/2010) Total number of data sets ~ 1 million Data items ~ 8 billion [Chart categories: unclassified, Sediment, Water, Corals, Atmosphere, Ice]
  15. 15. Now the technical details :-)
  16. 16. PANGAEA - Architecture Sybase ASE Middleware Webserver Editorial system PANGAEA search engine Harddisk + tape (silo) RDB Apache Lucene Google Maps / Earth
  17. 17. Indexing contents from relational database with dynamic updates [Diagram: Data Set, Staffs, Projects, Data Series, Events, Update Log, XML Data Set Description (Metadata)]
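The slide only names the building blocks (relational tables, an update log, an XML metadata description per data set), so the following is a minimal sketch of how update-log-driven re-indexing could work, not PANGAEA's actual implementation. The class `UpdateLogIndexerSketch`, the `LogEntry` fields, and the `loadMetadata` callback are all hypothetical, and a plain `HashMap` stands in for the search index.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical sketch of update-log-driven re-indexing; no name here comes from PANGAEA. */
final class UpdateLogIndexerSketch {

    /** One row of an assumed update log: a growing sequence number and the affected data set. */
    static final class LogEntry {
        final long seq;
        final String dataSetId;

        LogEntry(long seq, String dataSetId) {
            this.seq = seq;
            this.dataSetId = dataSetId;
        }
    }

    // Stand-in for the search index: data set id -> metadata document.
    private final Map<String, String> index = new HashMap<>();
    // Highest log sequence number that has already been applied.
    private long lastSeq = 0;

    /**
     * Rebuilds the document of every data set mentioned in log entries newer than
     * lastSeq and returns how many distinct documents were rebuilt.
     */
    int applyUpdates(List<LogEntry> log, Function<String, String> loadMetadata) {
        Set<String> changed = new LinkedHashSet<>();
        long maxSeq = lastSeq;
        for (LogEntry e : log) {
            if (e.seq > lastSeq) {
                changed.add(e.dataSetId); // the set removes repeated ids
                maxSeq = Math.max(maxSeq, e.seq);
            }
        }
        for (String id : changed) {
            index.put(id, loadMetadata.apply(id)); // replace the old document
        }
        lastSeq = maxSeq;
        return changed.size();
    }

    /** Returns the indexed document for the id, or null if none exists. */
    String get(String id) {
        return index.get(id);
    }
}
```

Collecting the changed ids in a set first means a data set that was edited several times between two indexing runs is rebuilt only once.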
  18. 18. Indexed Information Textual metadata: citation (authors, title), abstract, measurement parameters, methods, associated projects, comments, documentation (including field info for all XML schema element types) Fulltext data set contents Geographical information: latitude/longitude/BBOX/track, dates, geological age, depth/elevation [NumericField/NumericRangeQuery] Soon: Fulltext of attached external documentation
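A bounding-box (BBOX) search over numeric latitude/longitude fields amounts to two inclusive range conditions that must both hold. The sketch below shows only that logic in plain Java with a linear scan; it does not use Lucene's API, and the `BoundingBoxSketch` and `DataSet` names and the sample coordinates are assumptions made for illustration. It also assumes the box does not cross the antimeridian.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java illustration of a latitude/longitude bounding-box filter (not Lucene API). */
final class BoundingBoxSketch {

    /** Hypothetical minimal record: an identifier plus one coordinate pair in degrees. */
    static final class DataSet {
        final String id;
        final double lat;
        final double lon;

        DataSet(String id, double lat, double lon) {
            this.id = id;
            this.lat = lat;
            this.lon = lon;
        }
    }

    /** Returns true if value lies in the inclusive range [min, max]. */
    static boolean inRange(double value, double min, double max) {
        return value >= min && value <= max;
    }

    /** Returns the ids of all data sets whose point lies inside the box, in input order. */
    static List<String> filter(List<DataSet> all,
                               double minLat, double maxLat,
                               double minLon, double maxLon) {
        List<String> hits = new ArrayList<>();
        for (DataSet d : all) {
            // Both range conditions must match, like two combined range queries.
            if (inRange(d.lat, minLat, maxLat) && inRange(d.lon, minLon, maxLon)) {
                hits.add(d.id);
            }
        }
        return hits;
    }
}
```

A search index would answer each range from precomputed index terms rather than by scanning every record; the scan here is only meant to make the matching rule explicit.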
  19. 19. Geo-Retrieval with Lucene
  20. 20. Using scored queries with KML regions as filters
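One way to picture a region used as a filter on scored results: rank the hits by score, then keep only those whose coordinates fall inside the region's outline. The sketch below assumes the region is a simple polygon given as latitude/longitude pairs and uses the standard ray-casting point-in-polygon test on a flat plane. That choice of algorithm, the `RegionFilterSketch` and `Hit` names, and the sample data are assumptions; the slides do not say how the filter is implemented. Polygons that cross the antimeridian or enclose a pole are not handled.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch: scored hits filtered by a polygonal region (planar ray casting). */
final class RegionFilterSketch {

    /** Hypothetical scored search hit with one coordinate pair in degrees. */
    static final class Hit {
        final String id;
        final double score;
        final double lat;
        final double lon;

        Hit(String id, double score, double lat, double lon) {
            this.id = id;
            this.score = score;
            this.lat = lat;
            this.lon = lon;
        }
    }

    /** Ray-casting test; polygon is an array of {lat, lon} vertices, closed implicitly. */
    static boolean contains(double[][] polygon, double lat, double lon) {
        boolean inside = false;
        for (int i = 0, j = polygon.length - 1; i < polygon.length; j = i++) {
            double latI = polygon[i][0], lonI = polygon[i][1];
            double latJ = polygon[j][0], lonJ = polygon[j][1];
            // Does the edge (j -> i) straddle the point's latitude?
            boolean straddles = (latI > lat) != (latJ > lat);
            if (straddles) {
                // Longitude at which the edge crosses that latitude.
                double crossingLon = (lonJ - lonI) * (lat - latI) / (latJ - latI) + lonI;
                if (lon < crossingLon) {
                    inside = !inside; // each crossing to the east flips the state
                }
            }
        }
        return inside;
    }

    /** Returns ids of hits inside the region, ordered by descending score. */
    static List<String> filterByRegion(List<Hit> hits, double[][] region) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        List<String> ids = new ArrayList<>();
        for (Hit h : sorted) {
            if (contains(region, h.lat, h.lon)) {
                ids.add(h.id);
            }
        }
        return ids;
    }
}
```

The filter changes which hits are returned but not their scores, so the relevance ranking from the text query is preserved within the region.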
  21. 21. Apache Lucene as fast Key-Value Store Lucene is used for almost every query on the web client: keyword terms are indexed for quick retrieval of data sets Example: Lookup of data sets related to publications using DOI PANGAEA is hit by hundreds of DOI lookup queries per second from scientific publishers:
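Used this way, an index on an untokenized keyword field behaves like a map from an exact key to the matching documents. The sketch below models the DOI lookup with a plain `HashMap` rather than Lucene. The `DoiLookupSketch` class, its normalization rule (trim, lowercase, strip a leading `doi:`), and the publication DOI used in the usage note are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of an exact-match key-value lookup from publication DOI to data sets. */
final class DoiLookupSketch {

    // Normalized publication DOI -> DOIs of the related data sets.
    private final Map<String, List<String>> index = new HashMap<>();

    /** Registers one data set as related to the given publication. */
    void add(String publicationDoi, String dataSetDoi) {
        index.computeIfAbsent(normalize(publicationDoi), k -> new ArrayList<>())
             .add(dataSetDoi);
    }

    /** Returns the related data set DOIs, or an empty list if the key is unknown. */
    List<String> lookup(String publicationDoi) {
        return index.getOrDefault(normalize(publicationDoi), Collections.emptyList());
    }

    /** Assumed normalization: DOIs are compared case-insensitively, without a "doi:" prefix. */
    static String normalize(String doi) {
        String s = doi.trim().toLowerCase(Locale.ROOT);
        return s.startsWith("doi:") ? s.substring(4) : s;
    }
}
```

For example, after `store.add("10.1000/Example.123", "doi:10.1594/PANGAEA.701299")`, where the publication DOI and its link to that data set are made up, `store.lookup("DOI:10.1000/example.123")` returns a one-element list with the data set DOI. Normalizing the key on both the write and the read path is what makes the exact-match lookup tolerant of such spelling variants.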
  23. 23. PRESENTATION Live
  24. 24. Contact Uwe Schindler PANGAEA - Publishing Network for Geoscientific & Environmental Data MARUM, Leobener Str., 28359 Bremen, Germany uschindler@pangaea.de SD DataSolutions GmbH Wätjenstr. 49, 28213 Bremen, Germany uschindler@sd-datasolutions.de
  25. 25. Thank you! Learn more about Apache Lucene at www.lucidimagination.com
