System Architecture of GBIF
Oliver Meyn
https://elephant.tech
Hi, I’m Oliver Meyn!
• Working with Hadoop and HBase since 2009
• Java and SQL since 1999
• Cloudera Certified HBase specialist (CCSHB)
• Engineering & Computer Science by training
• Lived in Copenhagen the last 5 years, working for GBIF (gbif.org)
• https://elephant.tech
2004 - Global Biodiversity Information Facility is formed
• “Occurrence Records” - what, where, when, who
• XML protocols exist for sharing data (federated search)
• GBIF is formed and builds first prototype index giving access to 60M records, on MySQL
• queries decoupled from publishers, regular crawling means frequent updates
• no maps or analytical tools (search and downloads only)
2007 - Global launch of “Data Portal”
• index is up to 120M records, maps, charts, search and downloads
• need to limit downloads
• MySQL locks mean rollovers introduced, weeks turn to months
• the wheels start to come off
2009 - Enter the Elephant
• start using MapReduce to do the rollovers
• Sqoop into HDFS, MR to process, Sqoop back out to MySQL
• "How to kill a MySQL server without even trying"
• delimiters are the devil: Avro to the rescue
• 10x 8GB, 4 core, 2 disk
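The "delimiters are the devil" point can be illustrated with a minimal sketch. It assumes a tab-delimited export and a made-up record; the field names are hypothetical, and JSON stands in for Avro here only because it ships with Python. The idea is the same: a self-describing record format keeps field boundaries out of the data.

```python
import json

# Hypothetical occurrence record; the locality text happens to contain a tab.
FIELDS = ["id", "locality", "collector"]
record = {"id": 1, "locality": "Lake shore\tnorth bank", "collector": "A. Smith"}


def to_tsv(rec):
    """Naive tab-delimited export, as a plain text dump might produce."""
    return "\t".join(str(rec[f]) for f in FIELDS)


def from_tsv(line):
    """Naive import: split on the delimiter and hope the data has none."""
    return line.split("\t")


def to_structured(rec):
    """Structured export (JSON as a stand-in for Avro)."""
    return json.dumps(rec)


def from_structured(line):
    return json.loads(line)


columns = from_tsv(to_tsv(record))
print(len(FIELDS), "fields written,", len(columns), "columns read back")
print(from_structured(to_structured(record)) == record)
```

The delimited round trip yields one column more than was written, so every field after the locality shifts; the structured round trip returns the record unchanged.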
2013 - Relaunch of gbif.org on full Hadoop stack
• HBase, Hive, HDFS, MapReduce, Oozie, Solr
• ~400M records in 2013
• gbif.org runs from Java ws that use our public API
• unlimited downloads, much better search
• new hardware: 12x 24 core, 64GB, 12x 1TB (Dell R720)
Maps
• precalculated on write, 16 zoom levels, in HBase, as a sort of rollup
• key contains rollup dimensions (e.g. country, kingdom, species), zoom level, and tile coordinates
• single cell per row which has an Avro file which provides layers for decade ranges and Basis Of Record, and 256x256 array holding counts (pixels)
• rendered on the fly by a custom tile renderer that can turn HBase cells into PNGs
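The precalculated tile rollup described above can be sketched roughly as follows. This is an illustration under assumptions, not GBIF's code: it assumes Web Mercator tiling (the slides do not name the projection), a hypothetical row key layout of dimension:value:zoom:x:y, and a sparse in-memory dict in place of the HBase cell holding an Avro-encoded 256x256 array. The decade and Basis Of Record layers are left out.

```python
import math
from collections import defaultdict

TILE_SIZE = 256   # pixels per tile side (from the slide)
ZOOM_LEVELS = 16  # zoom levels 0..15 (from the slide)


def tile_and_pixel(lat, lng, zoom):
    """Map a coordinate to (tile_x, tile_y, pixel_x, pixel_y) at one zoom level.

    Assumes Web Mercator tiling; this is an assumption, not from the slides.
    """
    n = 2 ** zoom  # number of tiles per axis at this zoom
    x = (lng + 180.0) / 360.0 * n
    y = (1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n
    # Clamp so the poles and the antimeridian land on the edge tiles.
    tile_x = max(0, min(int(x), n - 1))
    tile_y = max(0, min(int(y), n - 1))
    pixel_x = max(0, min(int((x - tile_x) * TILE_SIZE), TILE_SIZE - 1))
    pixel_y = max(0, min(int((y - tile_y) * TILE_SIZE), TILE_SIZE - 1))
    return tile_x, tile_y, pixel_x, pixel_y


def row_key(dimension, value, zoom, tile_x, tile_y):
    """Hypothetical key layout: rollup dimension, zoom level, tile coordinates."""
    return f"{dimension}:{value}:{zoom}:{tile_x}:{tile_y}"


def new_tiles():
    """Sparse stand-in for the store: row key -> {(pixel_x, pixel_y): count}."""
    return defaultdict(lambda: defaultdict(int))


def add_record(tiles, dimensions, lat, lng):
    """On write, bump one pixel count per rollup dimension and zoom level."""
    for name, value in dimensions.items():
        for zoom in range(ZOOM_LEVELS):
            tx, ty, px, py = tile_and_pixel(lat, lng, zoom)
            tiles[row_key(name, value, zoom, tx, ty)][(px, py)] += 1


tiles = new_tiles()
add_record(tiles, {"country": "DK", "kingdom": "Animalia"}, 55.68, 12.57)
print(len(tiles), "tile rows touched by one record")
```

Because every write fans out to all zoom levels and dimensions, serving a tile is a single key lookup followed by rendering the counts, which is what makes on-the-fly rendering cheap.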
Crawling / Processing
• crawling coordinated in ZooKeeper
• RabbitMQ passing JSON messages
• custom Java crawlers for each protocol listening for "start crawl" messages
• multiple services to clean species names, lat/lng, dates
• downstream listeners update counts (HBase), Solr, and maps (HBase)
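A rough sketch of the "start crawl" message flow, under stated assumptions: a standard-library queue stands in for RabbitMQ, the message fields (type, datasetKey, protocol) are hypothetical, and the protocol names are placeholders rather than values taken from the slides. The real crawlers are Java services; Python is used here only to keep the example short.

```python
import json
import queue


def make_start_crawl(dataset_key, protocol):
    """Build a hypothetical JSON 'start crawl' message."""
    return json.dumps(
        {"type": "start crawl", "datasetKey": dataset_key, "protocol": protocol}
    )


class Crawler:
    """One crawler per protocol; it ignores messages meant for other protocols."""

    def __init__(self, protocol):
        self.protocol = protocol
        self.crawled = []

    def on_message(self, body):
        msg = json.loads(body)
        if msg.get("type") != "start crawl" or msg.get("protocol") != self.protocol:
            return False
        # A real crawler would now fetch records from the publisher.
        self.crawled.append(msg["datasetKey"])
        return True


def broadcast(bus, crawlers):
    """Drain the queue and offer every message to every listener."""
    while not bus.empty():
        body = bus.get()
        for crawler in crawlers:
            crawler.on_message(body)


bus = queue.Queue()  # stand-in for a RabbitMQ exchange
bus.put(make_start_crawl("dataset-1", "protocol-a"))
bus.put(make_start_crawl("dataset-2", "protocol-b"))
bus.put(make_start_crawl("dataset-3", "protocol-a"))

crawlers = [Crawler("protocol-a"), Crawler("protocol-b")]
broadcast(bus, crawlers)
for crawler in crawlers:
    print(crawler.protocol, crawler.crawled)
```

Keeping the messages as plain JSON means new listeners, such as the downstream count, Solr, and map updaters, can be added without changing the sender.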
[Diagram: Crawlers, Persist, Normalize, Interpret, Index; broadcast (RabbitMQ); HBase; SolrCloud]
Occurrence Record Search
• Solr stores search fields and record ID
• originally single Solr instance
• moving towards facets in SolrCloud v5 (v5 not part of CDH5)
• facets enable “map my search” ?
• SolrCloud memory tuning not trivial
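To make the search bullets concrete, here is a toy illustration of an index that holds only search fields plus the record ID, and of what a facet query returns. It is a plain in-memory list, not Solr, and the documents and field names are invented for the example.

```python
from collections import Counter

# Toy index: search fields plus record ID only; full records live elsewhere.
INDEX = [
    {"id": 1, "country": "DK", "kingdom": "Animalia"},
    {"id": 2, "country": "DK", "kingdom": "Plantae"},
    {"id": 3, "country": "SE", "kingdom": "Animalia"},
]


def search(index, facet_field, **filters):
    """Return matching record IDs and per-value counts for one facet field."""
    hits = [d for d in index if all(d.get(k) == v for k, v in filters.items())]
    ids = [d["id"] for d in hits]
    facets = dict(Counter(d[facet_field] for d in hits))
    return ids, facets


ids, facets = search(INDEX, "country", kingdom="Animalia")
print(ids, facets)
```

The IDs are what a follow-up lookup in the record store would use; the facet counts are the kind of aggregate that a feature like "map my search" could build on.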
Downloads
• vast majority are < 200k records (“small”)
• for small downloads do Solr query for IDs, then multithreaded get from HBase
• for big downloads use Hive to do full scan of HDFS dump of HBase table
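The small versus big split above might look roughly like this. It is a sketch under assumptions: the 200k threshold comes from the slide, while the function names, the worker count, and the dictionary standing in for HBase are all hypothetical. The big path (a Hive scan over an HDFS dump) is only named, not implemented.

```python
from concurrent.futures import ThreadPoolExecutor

SMALL_DOWNLOAD_LIMIT = 200_000  # "small" means fewer than 200k records (from the slide)


def plan_download(match_count):
    """Pick the download path from the number of matching records."""
    return "small" if match_count < SMALL_DOWNLOAD_LIMIT else "big"


def fetch_small(ids, get_record, workers=8):
    """Fetch records by ID in parallel; results keep the order of the IDs."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(get_record, ids))


# Stand-ins for the real services (hypothetical data).
fake_store = {i: {"id": i, "species": f"species-{i}"} for i in range(1, 6)}
ids_from_search = [3, 1, 5]  # what the ID query would return

if plan_download(len(ids_from_search)) == "small":
    records = fetch_small(ids_from_search, fake_store.__getitem__)
else:
    records = []  # the big path would hand off to a batch job instead

print([r["id"] for r in records])
```

Point lookups in parallel suit small result sets because each get is cheap; past some size a sequential scan of a dump is less work than hundreds of thousands of random reads.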
[Diagram: layers Web Layer, GBIF API, Hadoop Layer; components Registry, Checklist Bank, Occurrence, GeoCode, Messaging API, Metrics, Maps, Java, Drupal, Varnish]
Pain points
• Inconsistency across stores
• Rebuilding counts and Solr index
• Migration from MR1 to YARN
• Many moving parts
• DoS ourselves with big crawls
• This stuff is not trivial
Photo: flickr @Graham Wise
Thanks!
Oliver Meyn
oliver@elephant.tech
https://elephant.tech

System Architecture of GBIF

  • 1. System Architecture of GBIF Oliver Meyn https://elephant.tech
  • 2. Hi, I’m Oliver Meyn!
      • Working with Hadoop and HBase since 2009
      • Java and SQL since 1999
      • Cloudera Certified HBase specialist (CCSHB)
      • Engineering & Computer Science by training
      • Lived in Copenhagen the last 5 years, working for GBIF (gbif.org)
      • https://elephant.tech
  • 3.
  • 4. 2004 - Global Biodiversity Information Facility is formed
      • “Occurrence Records” - what, where, when, who
      • XML protocols exist for sharing data (federated search)
      • GBIF is formed and builds first prototype index giving access to 60M records, on MySQL
      • queries decoupled from publishers, regular crawling means frequent updates
      • no maps or analytical tools (search and downloads only)
  • 5. 2007 - Global launch of “Data Portal”
      • index is up to 120M records, maps, charts, search and downloads
      • need to limit downloads
      • MySQL locks mean rollovers introduced, weeks turn to months
      • the wheels start to come off
  • 6. 2009 - Enter the Elephant
      • start using MapReduce to do the rollovers
      • Sqoop into HDFS, MR to process, Sqoop back out to MySQL
      • "How to kill a MySQL server without even trying"
      • delimiters are the devil: Avro to the rescue
      • 10x 8GB, 4 core, 2 disk
  • 7. 2013 - Relaunch of gbif.org on full Hadoop stack
      • HBase, Hive, HDFS, MapReduce, Oozie, Solr
      • ~400M records in 2013
      • gbif.org runs from Java ws that use our public API
      • unlimited downloads, much better search
      • new hardware: 12x 24 core, 64GB, 12x 1TB (Dell R720)
  • 8.
  • 9.
  • 10. Maps
      • precalculated on write, 16 zoom levels, in HBase, as a sort of rollup
      • key contains rollup dimensions (e.g. country, kingdom, species), zoom level, and tile coordinates
      • single cell per row which has an Avro file which provides layers for decade ranges and Basis Of Record, and 256x256 array holding counts (pixels)
      • rendered on the fly by a custom tile renderer that can turn HBase cells into PNGs
  • 11. Crawling / Processing
      • crawling coordinated in Zookeeper
      • RabbitMQ passing JSON messages
      • custom Java crawlers for each protocol listening for "start crawl" messages
      • multiple services to clean species names, lat/lng, dates
      • downstream listeners update counts (HBase), Solr, and maps (HBase)
      • [Diagram labels: HBase, SOLR Cloud, Crawlers, Persist, Normalize, Interpret, Index, broadcast (RabbitMQ)]
  • 12. Occurrence Record Search
      • Solr stores search fields and record ID
      • originally single Solr instance
      • moving towards facets in SolrCloud v5 (v5 not part of CDH5)
      • facets enable “map my search”?
      • SolrCloud memory tuning not trivial
  • 13.
  • 14. Downloads
      • vast majority are < 200k records (“small”)
      • for small downloads do Solr query for IDs, then multithreaded get from HBase
      • for big downloads use Hive to do full scan of HDFS dump of HBase table
  • 15.
  • 18. Pain points
      • Inconsistency across stores
      • Rebuilding counts and Solr index
      • Migration from MR1 to YARN
      • Many moving parts
      • DoS ourselves with big crawls
      • This stuff is not trivial
      • [Photo credit: flickr @Graham Wise]
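The Maps slide (10) describes tiles that are precalculated on write and keyed by rollup dimension, zoom level, and tile coordinates. A minimal Python sketch of that idea follows. The slides do not name the projection or the key layout, so the Web Mercator tiling, the `dimension:value:zoom:x:y` key format, and the function names are all assumptions made for illustration. An in-memory dict stands in for the HBase table, and the Avro layers for decade ranges and Basis Of Record are left out.

```python
import math

TILE_SIZE = 256    # pixels per tile side, per the slide
ZOOM_LEVELS = 16   # zoom levels 0..15, per the slide


def tile_and_pixel(lat, lng, zoom):
    """Map a coordinate to (tile_x, tile_y, pixel_x, pixel_y).

    Assumption: standard Web Mercator "slippy map" tiling; the slides
    do not say which projection the real system uses.
    """
    if not 0 <= zoom < ZOOM_LEVELS:
        raise ValueError("zoom out of range")
    lat = max(min(lat, 85.05112878), -85.05112878)  # clamp to Mercator limits
    # Fractional position in the world, each in [0, 1]
    fx = (lng + 180.0) / 360.0
    fy = (1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0
    world_px = (2 ** zoom) * TILE_SIZE
    px = min(int(fx * world_px), world_px - 1)
    py = min(int(fy * world_px), world_px - 1)
    return px // TILE_SIZE, py // TILE_SIZE, px % TILE_SIZE, py % TILE_SIZE


def row_key(dimension, value, zoom, tile_x, tile_y):
    """Hypothetical row key layout: dimension:value:zoom:x:y."""
    return f"{dimension}:{value}:{zoom}:{tile_x}:{tile_y}"


def record_occurrence(tiles, dims, lat, lng):
    """Add one occurrence to every rollup dimension at every zoom level."""
    for dimension, value in dims.items():
        for zoom in range(ZOOM_LEVELS):
            tx, ty, px, py = tile_and_pixel(lat, lng, zoom)
            counts = tiles.setdefault(row_key(dimension, value, zoom, tx, ty), {})
            counts[(px, py)] = counts.get((px, py), 0) + 1
```

Under these assumptions a single record touches one row per dimension per zoom level, so writes are expensive, but a renderer only has to read one row to draw a tile.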

Editor's Notes

  1. Anopheles gambiae
  2. - slow, unreliable servers, can never be sure you have the full data
  3. all species - visualizing data is the fastest way to spot errors
  4. Anopheles gambiae - every taxon has a map
  5. - XML protocols still exist, being overtaken by DwC-A (Darwin Core Archive: controlled vocabulary, zipped tab file)
  6. - disable deep paging - note the Download button
  7. several tables used as a cube rollup - detail page is a direct HBase call
  8. remember downloads button? started as full scans of HBase for everything
  9. reprocess historical snapshots to the latest cleaning routines, then do monster unions to produce CSVs - the limiting factor is the ws lookups: we have to take distinct species and lat/lng to do lookups, then join back - R for graphs
  10. Paraponera clavata - bullet ant - shade fucking guava - don’t try to follow the latest fad - get stable first
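
The download routing from the Downloads slide (14; see also note 8) can be sketched in a few lines of Python. Only the 200k threshold and the two paths come from the slides. `search_ids` and `fetch_record` are hypothetical stand-ins for the Solr query and the HBase get, and the worker count is an arbitrary choice; this is a sketch, not the GBIF implementation.

```python
from concurrent.futures import ThreadPoolExecutor

SMALL_DOWNLOAD_LIMIT = 200_000  # "small" threshold from the slide


def choose_strategy(record_count):
    """Return which path a download of this size would take."""
    return "solr+hbase" if record_count < SMALL_DOWNLOAD_LIMIT else "hive"


def small_download(search_ids, fetch_record, query, workers=8):
    """Look up matching IDs, then fetch the records concurrently.

    search_ids and fetch_record are hypothetical callables standing in
    for the Solr query and the HBase get. executor.map returns results
    in the same order as the IDs.
    """
    ids = search_ids(query)
    with ThreadPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(fetch_record, ids))
```

Threads suit the small path because each get is I/O bound; the big path avoids per-record gets altogether by scanning a dump.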