Hadoop architecture discussion of the Global Biodiversity Information Facility (GBIF) by Oliver Meyn for Toronto Hadoop Users Group (THUG) on 2015-11-27.
2. Hi, I’m Oliver Meyn!
• Working with Hadoop and HBase since 2009
• Java and SQL since 1999
• Cloudera Certified HBase specialist (CCSHB)
• Engineering & Computer Science by training
• Lived in Copenhagen the last 5 years, working for GBIF (gbif.org)
• https://elephant.tech
3.
4. 2004 - Global Biodiversity Information Facility is formed
• “Occurrence Records” - what, where, when, who
• XML protocols exist for sharing data (federated search)
• GBIF is formed and builds first prototype index giving access to 60M records, on MySQL
• queries decoupled from publishers, regular crawling means frequent updates
• no maps or analytical tools (search and downloads only)
5. 2007 - Global launch of “Data Portal”
• index is up to 120M records, maps, charts, search and downloads
• need to limit downloads
• MySQL locking means “rollovers” are introduced; weeks turn into months
• the wheels start to come off
6. 2009 - Enter the Elephant
• start using MapReduce to do the rollovers
• Sqoop into HDFS, MR to process, Sqoop back out to MySQL
• "How to kill a MySQL server without even trying"
• delimiters are the devil: Avro to the rescue (see the sketch below)
• 10 nodes: 8GB RAM, 4 cores, 2 disks each
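
Why “delimiters are the devil”: occurrence records are full of free-text fields (localities, collector remarks) containing tabs and newlines that silently corrupt delimited dumps moving between MySQL and HDFS, while Avro carries its schema and encodes values in binary. A minimal sketch of writing rows as Avro follows; the schema and field names are illustrative assumptions, not GBIF's actual occurrence schema:

  // Hedged sketch: write occurrence rows as an Avro data file rather than a
  // delimited text file, so embedded tabs/newlines cannot break parsing.
  // Schema and field names are illustrative assumptions.
  import java.io.File;
  import org.apache.avro.Schema;
  import org.apache.avro.file.DataFileWriter;
  import org.apache.avro.generic.GenericData;
  import org.apache.avro.generic.GenericDatumWriter;
  import org.apache.avro.generic.GenericRecord;

  public class OccurrenceAvroWriter {
    private static final Schema SCHEMA = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Occurrence\",\"fields\":["
        + "{\"name\":\"id\",\"type\":\"long\"},"
        + "{\"name\":\"scientificName\",\"type\":[\"null\",\"string\"]},"
        + "{\"name\":\"locality\",\"type\":[\"null\",\"string\"]}]}");

    public static void main(String[] args) throws Exception {
      try (DataFileWriter<GenericRecord> writer =
               new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(SCHEMA))) {
        writer.create(SCHEMA, new File("occurrences.avro"));
        GenericRecord rec = new GenericData.Record(SCHEMA);
        rec.put("id", 123456L);
        rec.put("scientificName", "Anopheles gambiae");
        // embedded whitespace is harmless: Avro is binary, not delimiter based
        rec.put("locality", "field notes with\ttabs and\nnewlines");
        writer.append(rec);
      }
    }
  }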
7. 2013 - Relaunch of gbif.org on full Hadoop stack
• HBase, Hive, HDFS, MapReduce, Oozie, Solr
• ~400M records in 2013
• gbif.org runs from Java web services that use our public API
• unlimited downloads, much better search
• new hardware: 12x 24 core, 64GB RAM, 12x 1TB disks (Dell R720)
8.
9.
10. Maps
• precalculated on write, 16 zoom levels, stored in HBase as a sort of rollup
• key contains rollup dimensions (e.g. country, kingdom, species), zoom level, and tile coordinates
• single cell per row holding an Avro file which provides layers for decade ranges and Basis of Record, plus a 256x256 array holding counts (pixels) - see the sketch below
• rendered on the fly by a custom tile renderer that can turn HBase cells into PNGs
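
A rough sketch of reading one such precomputed tile back: the row key packs the rollup dimension, zoom level and tile x/y, and the single cell holds the Avro-encoded layers and 256x256 count grid. The table name, column family/qualifier and key layout below are assumptions for illustration, not GBIF's exact schema:

  // Hedged sketch: fetch one precalculated map tile from HBase.
  // Table name, column family/qualifier and key format are assumed.
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.Table;
  import org.apache.hadoop.hbase.util.Bytes;

  public class TileFetch {
    // e.g. fetchTile("TAXON:212", 6, 23, 41) for a species rollup at zoom 6
    public static byte[] fetchTile(String dimension, int zoom, long x, long y) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      try (Connection conn = ConnectionFactory.createConnection(conf);
           Table tiles = conn.getTable(TableName.valueOf("maps"))) {
        // row key = rollup dimension | zoom | tile coordinates
        byte[] rowKey = Bytes.toBytes(dimension + "|" + zoom + "|" + x + "|" + y);
        Result result = tiles.get(new Get(rowKey));
        // single cell per row: an Avro blob with per-layer (decade range,
        // Basis of Record) 256x256 pixel counts, decoded by the tile renderer
        return result.getValue(Bytes.toBytes("d"), Bytes.toBytes("tile"));
      }
    }
  }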
11. Crawling / Processing
• crawling coordinated in ZooKeeper
• RabbitMQ passing JSON messages
• custom Java crawlers for each protocol listening for "start crawl" messages (see the sketch below)
• multiple services to clean species names, lat/lng, dates
• downstream listeners update counts (HBase), Solr, and maps (HBase)
[Architecture diagram: Crawlers → Persist → Normalize → Interpret → Index, stages broadcast over RabbitMQ, writing to HBase and SolrCloud]
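
A minimal sketch of the messaging side: a crawler subscribing to "start crawl" JSON messages on RabbitMQ. The host, queue name and message fields are assumptions; the real crawlers are per-protocol and coordinated through ZooKeeper:

  // Hedged sketch: a crawler consuming "start crawl" JSON messages.
  // Host, queue name and JSON fields are illustrative assumptions.
  import com.rabbitmq.client.AMQP;
  import com.rabbitmq.client.Channel;
  import com.rabbitmq.client.Connection;
  import com.rabbitmq.client.ConnectionFactory;
  import com.rabbitmq.client.DefaultConsumer;
  import com.rabbitmq.client.Envelope;

  public class CrawlListener {
    public static void main(String[] args) throws Exception {
      ConnectionFactory factory = new ConnectionFactory();
      factory.setHost("rabbitmq.example.org");                        // assumed host
      Connection conn = factory.newConnection();
      final Channel channel = conn.createChannel();
      channel.queueDeclare("crawl.start", true, false, false, null);  // assumed queue name

      channel.basicConsume("crawl.start", false, new DefaultConsumer(channel) {
        @Override
        public void handleDelivery(String tag, Envelope env,
                                   AMQP.BasicProperties props, byte[] body)
            throws java.io.IOException {
          // e.g. {"datasetKey":"...","protocol":"dwca"}
          String json = new String(body, "UTF-8");
          System.out.println("starting crawl: " + json);
          // ... hand off to the protocol-specific crawler here ...
          channel.basicAck(env.getDeliveryTag(), false);
        }
      });
    }
  }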
12. Occurrence Record Search
• Solr stores search fields and record ID
• originally a single Solr instance
• moving towards facets in SolrCloud v5 (v5 is not part of CDH5) - see the query sketch below
• facets enable “map my search”?
• SolrCloud memory tuning is not trivial
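
As a rough illustration of the faceted-search direction, a SolrJ sketch follows; the collection name, field names and ZooKeeper quorum are assumptions, not GBIF's actual Solr schema:

  // Hedged sketch: a faceted occurrence search against SolrCloud via SolrJ 5.x.
  // Collection, field names and ZooKeeper hosts are illustrative assumptions.
  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.CloudSolrClient;
  import org.apache.solr.client.solrj.response.FacetField;
  import org.apache.solr.client.solrj.response.QueryResponse;

  public class OccurrenceSearch {
    public static void main(String[] args) throws Exception {
      try (CloudSolrClient solr = new CloudSolrClient("zk1:2181,zk2:2181/solr")) {
        solr.setDefaultCollection("occurrence");
        SolrQuery q = new SolrQuery("scientific_name:\"Anopheles gambiae\"");
        q.setRows(0);                         // facet counts only, no documents
        q.setFacet(true);
        q.addFacetField("country", "basis_of_record");
        QueryResponse resp = solr.query(q);
        for (FacetField ff : resp.getFacetFields()) {
          System.out.println(ff.getName() + ": " + ff.getValues());
        }
      }
    }
  }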
13.
14. Downloads
• vast majority are < 200k records (“small”)
• for small downloads, do a Solr query for IDs, then a multithreaded get from HBase (see the sketch below)
• for big downloads, use Hive to do a full scan of an HDFS dump of the HBase table
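
A hedged sketch of the “small download” path: take the record IDs the Solr query returned and fetch the full records from HBase in parallel batches. The table name, batch size and thread count are assumptions:

  // Hedged sketch: parallel HBase multi-gets for a "small" download, using
  // the record IDs returned by Solr. Table name, batch size and thread
  // count are illustrative assumptions.
  import java.util.ArrayList;
  import java.util.List;
  import java.util.concurrent.Callable;
  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;
  import java.util.concurrent.Future;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.Table;
  import org.apache.hadoop.hbase.util.Bytes;

  public class SmallDownload {
    public static List<Result> fetch(final Connection conn, List<Long> solrIds) throws Exception {
      ExecutorService pool = Executors.newFixedThreadPool(8);   // assumed thread count
      List<Future<Result[]>> futures = new ArrayList<Future<Result[]>>();
      int batchSize = 1000;                                     // assumed batch size
      for (int i = 0; i < solrIds.size(); i += batchSize) {
        final List<Long> chunk =
            solrIds.subList(i, Math.min(i + batchSize, solrIds.size()));
        futures.add(pool.submit(new Callable<Result[]>() {
          @Override
          public Result[] call() throws Exception {
            List<Get> gets = new ArrayList<Get>();
            for (Long id : chunk) {
              gets.add(new Get(Bytes.toBytes(id)));             // keyed by occurrence ID
            }
            try (Table table = conn.getTable(TableName.valueOf("occurrence"))) {
              return table.get(gets);
            }
          }
        }));
      }
      List<Result> all = new ArrayList<Result>();
      for (Future<Result[]> f : futures) {
        for (Result r : f.get()) {
          all.add(r);
        }
      }
      pool.shutdown();
      return all;
    }
  }

Big downloads skip this path entirely and run as Hive queries over an HDFS dump of the same HBase table.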
18. Pain points
• Inconsistency across stores
• Rebuilding counts and Solr index
• Migration from MR1 to YARN
• Many moving parts
• DoS ourselves with big crawls
• This stuff is not trivial
(photo: flickr @Graham Wise)
- slow, unreliable servers, can never be sure you have the full data
all species
visualizing data is the fastest way to spot errors
Anopheles gambiae
- every taxon has a map
- XML protocols still exist, being overtaken by DwC-A, the Darwin Core Archive (controlled vocabulary, zipped tab-delimited file)
- disable deep paging
- note the Download button
several tables used as a cube rollup
detail page is a direct HBase call
remember downloads button?
started as full scans of HBase for everything
reprocess historical snapshots with the latest cleaning routines, then do monster unions to produce CSVs
the limiting factor is the web service (ws) lookups - we have to take the distinct species and lat/lng to do the lookups, then join back
R for graphs
Paraponera clavata - bullet ant
- shade fucking Guava (relocate it to avoid classpath conflicts with Hadoop's bundled version)
- don’t try to follow the latest fad - get stable first