PANGAEA - Providing access to geoscientific data using Apache Lucene Java

  1. PANGAEA - Providing access to geoscientific data using Apache Lucene Java. Uwe Schindler, PANGAEA / SD DataSolutions GmbH, uschindler@pangaea.de
  2. My Background: My main focus is on the development of Lucene Java. I implemented fast numerical search and maintain the new attribute-based text analysis API. I studied physics at the University of Erlangen-Nuremberg and work as a consultant and software architect for PANGAEA (Publishing Network for Geoscientific & Environmental Data) in Bremen, Germany, where I implemented the portal's geospatial retrieval functions with Lucene Java. I give talks about Lucene at international conferences such as ApacheCon EU/US, Lucene Eurocon, and Berlin Buzzwords, and at various local meetups. I am a committer and PMC member of Apache Lucene and Solr.
  3. About PANGAEA
     - Since 1993: information system for earth system science data, hosted by AWI & MARUM
     - 2001: mandate of the International Council for Science (ICSU): World Data Center for Marine Environmental Sciences (WDC-MARE)
     - 2007: mandate of the World Meteorological Organisation (WMO): World Radiation Monitoring Center (WRMC)
     - 2010 (certification in progress): mandate of the World Meteorological Organisation (WMO): Data Collection and Processing Center (DCPC)
  4. Network of World Data Centers (Geophysical Year 1957)
     - Nuclear Radiation: Tokyo, Japan
     - WDC Co-ordination Offices: Washington DC, USA; Beijing, China
     - Meteorology: Asheville NC, USA; Beijing, China; Obninsk, Russia
     - Oceanography: Obninsk, Russia; Silver Spring MD, USA; Tianjin, China
     - Paleoclimatology: Boulder CO, USA
     - Marine Geology and Geophysics: Boulder CO, USA; Moscow, Russia
     - Remotely Sensed Land Data: Sioux Falls SD, USA
     - Renewable Resources and Environment: Beijing, China
     - Recent Crustal Movements: Ondrejov, Czech Republic
     - Airglow: Mitaka, Japan
     - Astronomy: Beijing, China
     - Atmospheric Trace Gases: Oak Ridge TN, USA
     - Aurora: Tokyo, Japan
     - Cosmic Rays: Toyokawa, Japan
     - Geology: Beijing, China
     - Human Interactions in the Environment: Palisades NY, USA
     - Ionosphere: Tokyo, Japan
     - Earth Tides: Brussels, Belgium
     - Geomagnetism: Copenhagen, Denmark; Edinburgh, UK; Kyoto, Japan; Colaba, India
     - Glaciology: Boulder CO, USA; Cambridge, UK; Lanzhou, China
     - Marine Environmental Sciences: Bremen, Germany (2001)
     - Rotation of the Earth: Obninsk, Russia; Washington DC, USA
     - Satellite Information: Greenbelt MD, USA
     - Rockets and Satellites: Obninsk, Russia
     - Seismology: Denver CO, USA; Beijing, China
     - Solar Radio Emission: Nagano, Japan
     - Space Science: Beijing, China
     - Space Science Satellites: Kanagawa, Japan
     - Solar Activity: Meudon, France
     - Soils: Wageningen, The Netherlands
     - Sunspot Index: Brussels, Belgium
     - Solar Terrestrial Physics: Boulder CO, USA; Didcot Oxon, UK; Moscow, Russia; Haymarket, Australia
     - Solid Earth Geophysics: Beijing, China; Boulder CO, USA; Moscow, Russia
  5. Why do we need Data Libraries?
     - Good scientific practice
     - Needed for verification of scientific work
     - Good availability of data for large-scale and complex scientific approaches
     - than reproduction
  6. Geosciences before 1900: Turin papyrus, ~1160 BC; William Smith, 1815; Glomar Challenger, 1875
  7. Technical Improvements: ENIAC, 1944; magnetometer
  8. Development of the global climate [figure panels: "The last 1300 years"; "Thousands of years before present"]
  9. Information increase in empirical sciences [figure: "Publications" and "Data" plotted over 1970 to 2010]
  10. Archiving and publication of scientific data: data acquisition; quality assurance; long-term availability and access
  11. Long-term archive
      - Open access & non-restricted data
        o Creative Commons license
      - Data accepted from individual scientists, institutes, and science projects
      - Long-term funding for basic operation
        o hardware, software, system management & organisation
      - Long-term preservation of data
        o Technical: security, migration of media
        o Usability: preserving the integrity & semantics of data sets
  12. Contents
  13. Data Types in PANGAEA [figures: down-core profiles of IRD (grav/10 cm3), Sand (%), CaCO3 (%), TOC (%), Radio (%/sand) and Smect (%/clay) against age (kyr) for cores PS1389-3, PS1390-3, PS1431-1, PS1640-1 and PS1648-1; map of grain size classes and geochemistry, source: Baltic Sea Research Institute, Warnemünde]
      - Profiles => doi:10.1594/PANGAEA.701299
      - Time series => doi:10.1594/PANGAEA.323487
      - Sea bed photos => doi:10.1594/PANGAEA.319877
      - Distributed samples => doi:10.1594/PANGAEA.51749
      - Complex data => doi:10.1594/PANGAEA.108079
      - Air photos => doi:10.1594/PANGAEA.323540
      - Audio record => doi:10.1594/PANGAEA.339110
  14. Statistics (9/2010): total number of data sets ~1 million; data items ~8 billion [chart categories: unclassified, Sediment, Water, Corals, Atmosphere, Ice]
  15. Now the technical details :-)
  16. PANGAEA - Architecture [diagram components: Sybase ASE (RDB); hard disk + tape (silo); middleware; web server; editorial system; PANGAEA search engine; Apache Lucene; Google Maps / Earth]
  17. Indexing contents from relational database with dynamic updates [diagram labels: Data Set; Staffs; Projects; Data Series; Events; Update Log; XML Data Set Description (Metadata)]
  18. Indexed Information
      - Textual metadata: citation (authors, title), abstract, measurement parameters, methods, associated projects, comments, documentation (including field info for all XML schema element types)
      - Full text of data set contents
      - Geographical information: latitude/longitude/BBOX/track, dates, geological age, depth/elevation [NumericField/NumericRangeQuery]
      - Soon: full text of attached external documentation
  19. Geo-Retrieval with Lucene
  20. Using scored queries with KML regions as filters
  21. Apache Lucene as fast Key-Value Store: Lucene is used for almost every query on the web-client; of keyword terms indexed for quick retrieval of data sets. Example: lookup of data sets related to publications using DOI. PANGAEA is hit by hundreds of DOI lookup queries per second from scientific publishers.
  22. Apache Lucene as fast Key-Value Store: Lucene is used for almost every query on the web-client; of keyword terms indexed for quick retrieval of data sets. Example: lookup of data sets related to publications using DOI. PANGAEA is hit by hundreds of DOI lookup queries per second from scientific publishers.
  23. Live presentation
  24. Contact: Uwe Schindler
      - PANGAEA - Publishing Network for Geoscientific & Environmental Data, MARUM, Leobener Str., 28359 Bremen, Germany, uschindler@pangaea.de
      - SD DataSolutions GmbH, Wätjenstr. 49, 28213 Bremen, Germany, uschindler@sd-datasolutions.de
  25. Thank you! Know more about Apache Lucene at www.lucidimagination.com
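
Slide 18 lists latitude/longitude/BBOX among the fields handled with NumericField and NumericRangeQuery. As a rough illustration of what a bounding-box lookup over numeric fields does, the plain-Java sketch below keeps points sorted by latitude and then filters the candidates by longitude. It is neither Lucene code nor PANGAEA's implementation: the class BBoxIndexSketch, its methods, and the sample identifiers and coordinates are hypothetical, a TreeMap stands in for the numeric index, and boxes that cross the date line are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Hypothetical stand-in for a numeric range index; this is not Lucene's API. */
final class BBoxIndexSketch {

    /** One indexed data set: an identifier plus a single point location. */
    private static final class Entry {
        final String id;
        final double lon;

        Entry(String id, double lon) {
            this.id = id;
            this.lon = lon;
        }
    }

    // Sorted by latitude, so a latitude range can be read without scanning every entry.
    private final TreeMap<Double, List<Entry>> byLatitude = new TreeMap<>();

    void add(String id, double lat, double lon) {
        byLatitude.computeIfAbsent(lat, k -> new ArrayList<>()).add(new Entry(id, lon));
    }

    /** Returns the ids inside the inclusive bounding box, ordered by latitude. */
    List<String> search(double minLat, double maxLat, double minLon, double maxLon) {
        List<String> hits = new ArrayList<>();
        // Step 1: range lookup on latitude. Step 2: filter the candidates on longitude.
        for (List<Entry> bucket : byLatitude.subMap(minLat, true, maxLat, true).values()) {
            for (Entry e : bucket) {
                if (e.lon >= minLon && e.lon <= maxLon) {
                    hits.add(e.id);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        BBoxIndexSketch index = new BBoxIndexSketch();
        index.add("ds-baltic", 54.5, 12.0);    // made-up sample points
        index.add("ds-arctic", 80.0, 10.0);
        index.add("ds-atlantic", 54.7, -20.0);
        // prints [ds-baltic]
        System.out.println(index.search(54.0, 55.5, 11.0, 15.0));
    }
}
```

Sorting on one axis and filtering on the other keeps the sketch short; a real numeric index would restrict both axes through the index itself.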
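
Slides 21 and 22 describe using the index as a key-value store, with DOI lookups from publishers as the example. The sketch below shows only the general shape of such an exact-match lookup in plain Java. A HashMap stands in for the keyword index, and the class DoiLookupSketch, its methods, and the publication DOI in the usage example are hypothetical. It is not how PANGAEA or Lucene implement the lookup. The only outside fact relied on is that DOI names are case-insensitive, which is why keys are normalized before they are stored or queried.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical exact-match lookup from a publication DOI to related data set DOIs. */
final class DoiLookupSketch {

    private final Map<String, List<String>> dataSetsByPublication = new HashMap<>();

    /** Drops an optional "doi:" prefix and lower-cases the rest, since DOI names are case-insensitive. */
    static String normalize(String doi) {
        String s = doi.trim();
        if (s.regionMatches(true, 0, "doi:", 0, 4)) {
            s = s.substring(4);
        }
        return s.toLowerCase(Locale.ROOT);
    }

    /** Records that a data set belongs to a publication. */
    void link(String publicationDoi, String dataSetDoi) {
        dataSetsByPublication
                .computeIfAbsent(normalize(publicationDoi), k -> new ArrayList<>())
                .add(dataSetDoi);
    }

    /** Returns the linked data set DOIs, or an empty list for an unknown publication. */
    List<String> lookup(String publicationDoi) {
        return dataSetsByPublication.getOrDefault(normalize(publicationDoi), List.of());
    }

    public static void main(String[] args) {
        DoiLookupSketch store = new DoiLookupSketch();
        // The publication DOI is invented for illustration; the pairing is not a real citation.
        store.link("doi:10.1234/Example.Paper", "doi:10.1594/PANGAEA.701299");
        System.out.println(store.lookup("10.1234/EXAMPLE.PAPER"));
    }
}
```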
