Hadoop architecture discussion of the Global Biodiversity Information Facility (GBIF) by Oliver Meyn for Toronto Hadoop Users Group (THUG) on 2015-11-27.
2. Hi, I’m Oliver Meyn!
• Working with Hadoop and HBase since 2009
• Java and SQL since 1999
• Cloudera Certified HBase specialist (CCSHB)
• Engineering & Computer Science by training
• Lived in Copenhagen the last 5 years, working for GBIF (gbif.org)
• https://elephant.tech
3.
4. 2004 - Global Biodiversity Information Facility is formed
• “Occurrence Records” - what, where, when, who
• XML protocols exist for sharing data (federated search)
• GBIF is formed and builds first prototype index giving access to 60M records, on MySQL
• queries decoupled from publishers, regular crawling means frequent updates
• no maps or analytical tools (search and downloads only)
5. 2007 - Global launch of “Data Portal”
• index is up to 120M records, maps, charts, search and downloads
• need to limit downloads
• MySQL locking means “rollovers” are introduced; weeks turn into months
• the wheels start to come off
6. 2009 - Enter the Elephant
• start using MapReduce to do the rollovers
• Sqoop into HDFS, MR to process, Sqoop back out to MySQL
• "How to kill a MySQL server without even trying"
• delimiters are the devil: Avro to the rescue (see the sketch below)
• 10 nodes: 8GB RAM, 4 cores, 2 disks each
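
Why “delimiters are the devil”: occurrence records are full of free-text fields (localities, collector remarks) containing tabs and newlines that silently corrupt delimited dumps moving between MySQL and HDFS, while Avro carries its schema and encodes values in binary. A minimal sketch of writing rows as Avro follows; the schema and field names are illustrative assumptions, not GBIF's actual occurrence schema:

  // Hedged sketch: write occurrence rows as an Avro data file rather than a
  // delimited text file, so embedded tabs/newlines cannot break parsing.
  // Schema and field names are illustrative assumptions.
  import java.io.File;
  import org.apache.avro.Schema;
  import org.apache.avro.file.DataFileWriter;
  import org.apache.avro.generic.GenericData;
  import org.apache.avro.generic.GenericDatumWriter;
  import org.apache.avro.generic.GenericRecord;

  public class OccurrenceAvroWriter {
    private static final Schema SCHEMA = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Occurrence\",\"fields\":["
        + "{\"name\":\"id\",\"type\":\"long\"},"
        + "{\"name\":\"scientificName\",\"type\":[\"null\",\"string\"]},"
        + "{\"name\":\"locality\",\"type\":[\"null\",\"string\"]}]}");

    public static void main(String[] args) throws Exception {
      try (DataFileWriter<GenericRecord> writer =
               new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(SCHEMA))) {
        writer.create(SCHEMA, new File("occurrences.avro"));
        GenericRecord rec = new GenericData.Record(SCHEMA);
        rec.put("id", 123456L);
        rec.put("scientificName", "Anopheles gambiae");
        // embedded whitespace is harmless: Avro is binary, not delimiter based
        rec.put("locality", "field notes with\ttabs and\nnewlines");
        writer.append(rec);
      }
    }
  }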
7. 2013 - Relaunch of gbif.org on full Hadoop stack
• HBase, Hive, HDFS, MapReduce, Oozie, Solr
• ~400M records in 2013
• gbif.org runs from Java web services that use our public API
• unlimited downloads, much better search
• new hardware: 12x 24 core, 64GB RAM, 12x 1TB disks (Dell R720)
8.
9.
10. Maps
• precalculated on write, 16 zoom levels, stored in HBase as a sort of rollup
• key contains rollup dimensions (e.g. country, kingdom, species), zoom level, and tile coordinates
• single cell per row holding an Avro file which provides layers for decade ranges and Basis of Record, plus a 256x256 array holding counts (pixels) - see the sketch below
• rendered on the fly by a custom tile renderer that can turn HBase cells into PNGs
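
A rough sketch of reading one such precomputed tile back: the row key packs the rollup dimension, zoom level and tile x/y, and the single cell holds the Avro-encoded layers and 256x256 count grid. The table name, column family/qualifier and key layout below are assumptions for illustration, not GBIF's exact schema:

  // Hedged sketch: fetch one precalculated map tile from HBase.
  // Table name, column family/qualifier and key format are assumed.
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.Table;
  import org.apache.hadoop.hbase.util.Bytes;

  public class TileFetch {
    // e.g. fetchTile("TAXON:212", 6, 23, 41) for a species rollup at zoom 6
    public static byte[] fetchTile(String dimension, int zoom, long x, long y) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      try (Connection conn = ConnectionFactory.createConnection(conf);
           Table tiles = conn.getTable(TableName.valueOf("maps"))) {
        // row key = rollup dimension | zoom | tile coordinates
        byte[] rowKey = Bytes.toBytes(dimension + "|" + zoom + "|" + x + "|" + y);
        Result result = tiles.get(new Get(rowKey));
        // single cell per row: an Avro blob with per-layer (decade range,
        // Basis of Record) 256x256 pixel counts, decoded by the tile renderer
        return result.getValue(Bytes.toBytes("d"), Bytes.toBytes("tile"));
      }
    }
  }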
11. Crawling / Processing
• crawling coordinated in ZooKeeper
• RabbitMQ passing JSON messages
• custom Java crawlers for each protocol listening for "start crawl" messages (see the sketch below)
• multiple services to clean species names, lat/lng, dates
• downstream listeners update counts (HBase), Solr, and maps (HBase)
[Architecture diagram: Crawlers → Persist → Normalize → Interpret → Index, stages broadcast over RabbitMQ, writing to HBase and SolrCloud]
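
A minimal sketch of the messaging side: a crawler subscribing to "start crawl" JSON messages on RabbitMQ. The host, queue name and message fields are assumptions; the real crawlers are per-protocol and coordinated through ZooKeeper:

  // Hedged sketch: a crawler consuming "start crawl" JSON messages.
  // Host, queue name and JSON fields are illustrative assumptions.
  import com.rabbitmq.client.AMQP;
  import com.rabbitmq.client.Channel;
  import com.rabbitmq.client.Connection;
  import com.rabbitmq.client.ConnectionFactory;
  import com.rabbitmq.client.DefaultConsumer;
  import com.rabbitmq.client.Envelope;

  public class CrawlListener {
    public static void main(String[] args) throws Exception {
      ConnectionFactory factory = new ConnectionFactory();
      factory.setHost("rabbitmq.example.org");                        // assumed host
      Connection conn = factory.newConnection();
      final Channel channel = conn.createChannel();
      channel.queueDeclare("crawl.start", true, false, false, null);  // assumed queue name

      channel.basicConsume("crawl.start", false, new DefaultConsumer(channel) {
        @Override
        public void handleDelivery(String tag, Envelope env,
                                   AMQP.BasicProperties props, byte[] body)
            throws java.io.IOException {
          // e.g. {"datasetKey":"...","protocol":"dwca"}
          String json = new String(body, "UTF-8");
          System.out.println("starting crawl: " + json);
          // ... hand off to the protocol-specific crawler here ...
          channel.basicAck(env.getDeliveryTag(), false);
        }
      });
    }
  }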
12. Occurrence Record Search
• Solr stores search fields and record ID
• originally a single Solr instance
• moving towards facets in SolrCloud v5 (v5 is not part of CDH5) - see the query sketch below
• facets enable “map my search”?
• SolrCloud memory tuning is not trivial
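
As a rough illustration of the faceted-search direction, a SolrJ sketch follows; the collection name, field names and ZooKeeper quorum are assumptions, not GBIF's actual Solr schema:

  // Hedged sketch: a faceted occurrence search against SolrCloud via SolrJ 5.x.
  // Collection, field names and ZooKeeper hosts are illustrative assumptions.
  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.CloudSolrClient;
  import org.apache.solr.client.solrj.response.FacetField;
  import org.apache.solr.client.solrj.response.QueryResponse;

  public class OccurrenceSearch {
    public static void main(String[] args) throws Exception {
      try (CloudSolrClient solr = new CloudSolrClient("zk1:2181,zk2:2181/solr")) {
        solr.setDefaultCollection("occurrence");
        SolrQuery q = new SolrQuery("scientific_name:\"Anopheles gambiae\"");
        q.setRows(0);                         // facet counts only, no documents
        q.setFacet(true);
        q.addFacetField("country", "basis_of_record");
        QueryResponse resp = solr.query(q);
        for (FacetField ff : resp.getFacetFields()) {
          System.out.println(ff.getName() + ": " + ff.getValues());
        }
      }
    }
  }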
13.
14. Downloads
• vast majority are < 200k records (“small”)
• for small downloads, do a Solr query for IDs, then a multithreaded get from HBase (see the sketch below)
• for big downloads, use Hive to do a full scan of an HDFS dump of the HBase table
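
A hedged sketch of the “small download” path: take the record IDs the Solr query returned and fetch the full records from HBase in parallel batches. The table name, batch size and thread count are assumptions:

  // Hedged sketch: parallel HBase multi-gets for a "small" download, using
  // the record IDs returned by Solr. Table name, batch size and thread
  // count are illustrative assumptions.
  import java.util.ArrayList;
  import java.util.List;
  import java.util.concurrent.Callable;
  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;
  import java.util.concurrent.Future;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.Table;
  import org.apache.hadoop.hbase.util.Bytes;

  public class SmallDownload {
    public static List<Result> fetch(final Connection conn, List<Long> solrIds) throws Exception {
      ExecutorService pool = Executors.newFixedThreadPool(8);   // assumed thread count
      List<Future<Result[]>> futures = new ArrayList<Future<Result[]>>();
      int batchSize = 1000;                                     // assumed batch size
      for (int i = 0; i < solrIds.size(); i += batchSize) {
        final List<Long> chunk =
            solrIds.subList(i, Math.min(i + batchSize, solrIds.size()));
        futures.add(pool.submit(new Callable<Result[]>() {
          @Override
          public Result[] call() throws Exception {
            List<Get> gets = new ArrayList<Get>();
            for (Long id : chunk) {
              gets.add(new Get(Bytes.toBytes(id)));             // keyed by occurrence ID
            }
            try (Table table = conn.getTable(TableName.valueOf("occurrence"))) {
              return table.get(gets);
            }
          }
        }));
      }
      List<Result> all = new ArrayList<Result>();
      for (Future<Result[]> f : futures) {
        for (Result r : f.get()) {
          all.add(r);
        }
      }
      pool.shutdown();
      return all;
    }
  }

Big downloads skip this path entirely and run as Hive queries over an HDFS dump of the same HBase table.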
18. Pain points
• Inconsistency across stores
• Rebuilding counts and Solr index
• Migration from MR1 to YARN
• Many moving parts
• DoS ourselves with big crawls
• This stuff is not trivial
(photo: flickr @Graham Wise)
- slow, unreliable servers, can never be sure you have the full data
all species
visualizing data is the fastest way to spot errors
Anopheles gambiae
- every taxon has a map
- XML protocols still exist, being overtaken by DwC-A, the Darwin Core Archive (controlled vocabulary, zipped tab-delimited file)
- disable deep paging
- note the Download button
several tables used as a cube rollup
detail page is a direct HBase call
remember downloads button?
started as full scans of HBase for everything
reprocess historical snapshots with the latest cleaning routines, then do monster unions to produce CSVs
the limiting factor is the web service (ws) lookups - we have to take the distinct species and lat/lng to do the lookups, then join back
R for graphs
Paraponera clavata - bullet ant
- shade fucking Guava (relocate it to avoid classpath conflicts with Hadoop's bundled version)
- don’t try to follow the latest fad - get stable first