HBase Global Indexing to
support large-scale data
ingestion @ Uber
May 21, 2019
Danny Chen
dannyc@uber.com
● Engineering Manager on Hadoop
Data Platform team
● Leading Data Ingestion team
● Previously worked @ on the
storage team (Manhattan)
● Enjoy playing basketball, biking,
and spending time w/my kids.
Uber Apache Hadoop
Platform Team Mission
Build products to support reliable,
scalable, easy-to-use, compliant, and
efficient data transfer (both ingestion
& dispersal) as well as data storage
leveraging the Apache Hadoop
ecosystem.
Apache Hadoop is either a registered trademark or trademark of the Apache Software Foundation in the United States and/or other
countries. No endorsement by The Apache Software Foundation is implied by the use of this mark.
Overview
● High-Level Ingestion & Dispersal
introduction
● Different types of workloads
● Need for Global Index
● How Global Index Works
● Generating Global Indexes with
HFiles
● Throttling HBase Access
● Next Steps
High Level
Ingestion/Dispersal
Introduction
Hadoop Data Ecosystem at Uber
Apache
Hadoop
Data
Lake
Schemaless
Analytical
Processing
Apache Kafka, Cassandra, Spark, and HDFS logos are either registered trademarks or trademarks of the Apache Software Foundation in the
United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.
Data
Ingestion
Data
Dispersal
Hadoop Data Ecosystem at Uber
Different
Types of
Workloads
Bootstrap
● One time only at beginning of lifecycle
● Large amounts of data
● Millions of QPS throughput
● Need to finish in a matter of hours
● NoSQL stores cannot keep up
Incremental
● Dominates lifecycle of Hive table ingestion
● Incremental upstream changes from Kafka
or other data sources.
● 1,000s of QPS per dataset
● Reasonable throughput requirements for
NoSQL stores
Cell vs Row Changes
Need for Global
Index
Requirements for Global Index
● Large amounts of historical data ingested in short
amount of time
● Append only vs Append-plus-update
● Data layout and partitioning
● Bookkeeping for data layout
● Strong consistency
● High Throughput
● Horizontally scalable
● Required a NoSQL store
● Decision was to use HBase
● Trade Availability for Consistency
● Automatic Rebalancing of HBase tables via region splitting
● Global view of dataset via master/slave architecture
VS
How Global Index
Works
Generating Global
Indexes
Batch and One Time Index Upload
Data Model For Global Index
Spark & RDD Transformations for
index generation
HFile Upload Process
HFile Index Job Tuning
● Explicitly register classes with Kryo Serialization
● Reduce 3 shuffle stages to one
● Proper HFile Size
● Proper Partition Count and Size
● 13 TB of index data with 54 billion index entries
○ 2 hours to generate indexes
○ 10 min to load
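As a back-of-envelope check of the numbers above (assuming the 2-hour figure covers all 54 billion entries):

```python
# Rough index-generation rate implied by the bullets above:
# 54 billion index entries generated in 2 hours.
ENTRIES = 54_000_000_000
GEN_SECONDS = 2 * 60 * 60

entries_per_second = ENTRIES / GEN_SECONDS
print(f"{entries_per_second:,.0f} entries/sec")  # 7,500,000 entries/sec
```

This is the "millions of QPS" scale that rules out serving the bootstrap phase through online NoSQL writes.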
Throttling HBase
Access
The need for throttling HBase Access
Horizontal Scalability & Throttling
Next Steps
Next Steps
● Handle non-append-only
data during bootstrap
● Explore other indexing
solutions
Useful Links
https://github.com/uber/marmaray
https://github.com/uber/hudi
https://eng.uber.com/data-partitioning-global-indexing/
https://eng.uber.com/uber-big-data-platform/
https://eng.uber.com/marmaray-hadoop-ingestion-open-source/
Other Dataworks Summit Talks
Marmaray: Uber’s Open-sourced Generic Hadoop Data
Ingestion and Dispersal Framework
Wednesday at 11 am
Attribution
Kaushik
Devarajaiah
Nishith
Agarwal
Jing
Li
Positions available: Seattle, Palo Alto & San
Francisco
email : hadoop-platform-jobs@uber.com
We are hiring!
Thank you
Proprietary © 2018 Uber Technologies, Inc. All rights reserved. No part of this
document may be reproduced or utilized in any form or by any means,
electronic or mechanical, including photocopying, recording, or by any
information storage or retrieval systems, without permission in writing from
Uber. This document is intended only for the use of the individual or entity to
whom it is addressed. All recipients of this document are notified that the
information contained herein includes proprietary information of Uber, and
recipient may not make use of, disseminate, or in any way disclose this
document or any of the enclosed information to any person other than
employees of addressee to the extent necessary for consultations with
authorized personnel of Uber.
Questions: email ospo@uber.com
Follow our Facebook page: www.facebook.com/uberopensource


Editor's Notes

  • #5 Lots of effort went into making a completely self-serve onboarding process. Analytical users with little technical knowledge of Spark, Hadoop, Hive, etc. can still take advantage of our platform. Our assertion is that when relevant data is discoverable in the appropriate data stores for analytical purposes, there can be substantial gains in efficiency and value for your business. Marmaray is critical for ensuring data lands in the appropriate data store. Familiarity with the suite of tools in our Hadoop ecosystem supports many potential use cases for extracting insights from raw data.
  • #7 This completes the Hadoop ecosystem of tools at Uber and the original vision of the Data Processing Platform: Heatpipe/Watchtower produce quality schematized data; Marmaray ingests the data; the Workflow Management System orchestrates jobs to run analytics and generate derived datasets, or to build models using Michelangelo; Marmaray then disperses the data to stores with low-latency semantics. What sets it apart: it is a generic ingestion framework, not tightly coupled to any specific source or sink (product teams focus on those).
  • #13 Dividing ingestion into bootstrap and incremental phases allows us to choose a KV store that can scale for incremental-phase indexing but not necessarily for bootstrapping of data.
  • #15 HBase automatically rebalances tables within a cluster by splitting key ranges when a region gets too large; it can also load-balance by moving new regions to other servers. The master/slave architecture enables a global view of how a dataset is spread across the cluster, which we use to customize dataset-specific throughput to our HBase cluster.
  • #17 During incremental ingestion we work in mini-batches. It is the job of the Work Unit Calculator to provide the required level of throttling.
  • #18 We work in mini-batches. It is the job of the Work Unit Calculator to provide the required level of throttling.
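The Work Unit Calculator is part of Marmaray; as a minimal sketch of the mini-batch idea (names are illustrative, not the actual API):

```python
def work_units(pending, max_batch_size):
    """Split a backlog of pending records into capped mini-batches.

    Capping each ingestion run bounds the index-lookup load a single
    dataset can place on HBase (illustrative sketch, not Marmaray's API).
    """
    return [pending[i:i + max_batch_size]
            for i in range(0, len(pending), max_batch_size)]

batches = work_units(list(range(10)), 4)
print([len(b) for b in batches])  # [4, 4, 2]
```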
  • #21 Our Big Data ecosystem's model of indexes stored in HBase contains entities (shown in green) that help identify the files that need to be updated for a given record in an append-plus-update dataset. The layout of index entries in HFiles lets us sort based on key and column.
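As a toy illustration of what such an index lookup does (the field names here are hypothetical, not the actual schema):

```python
# Hypothetical global index: record key -> location of the current copy.
index = {
    "rider-001": {"partition": "2019/05/20", "file_id": "f1"},
    "rider-002": {"partition": "2019/05/21", "file_id": "f2"},
}

def files_to_update(index, incoming_keys):
    """For each incoming record key, find the existing file (if any)
    that holds the previous version and therefore must be updated;
    keys with no entry are brand-new inserts."""
    return {k: index.get(k) for k in incoming_keys}

hits = files_to_update(index, ["rider-002", "rider-999"])
# rider-002 updates file f2; rider-999 maps to None (a new insert).
```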
  • #22 This is for the one-time upload case. The flatMapToPair transformation in Apache Spark does not preserve the ordering of entries, so a partition-isolated sort is performed. The partitioning is left unchanged to ensure each partition still corresponds to a non-overlapping key range.
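A minimal Python sketch of the invariant (in Spark this is a per-partition sort, e.g. via mapPartitions, rather than a global sort):

```python
def sort_within_partitions(partitions):
    """Sort entries inside each partition without moving entries
    across partitions, so each partition keeps its non-overlapping
    key range -- the property HFile writing relies on."""
    return [sorted(part) for part in partitions]

# Two partitions with disjoint key ranges, unordered internally.
parts = [[(b"k2", "v"), (b"k1", "v")], [(b"k9", "v"), (b"k5", "v")]]
sorted_parts = sort_within_partitions(parts)
# Each partition is now key-sorted; partition boundaries are unchanged.
```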
  • #23 HFiles are written to the cluster where HBase is hosted to ensure HBase region servers have access to them during the upload process. HFile upload can be severely affected by region splitting, so we avoid splitting HFiles during the load: the HBase table is presplit into as many regions as there are HFiles, so each HFile fits within a separate region with non-overlapping keys. Splitting based on HFile size severely impacts upload time; with presplitting, the load takes about 10 minutes even for tens of TB.
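Deriving the split points is straightforward once the HFiles are sorted and non-overlapping; a sketch, assuming each HFile's first key is known:

```python
def region_split_keys(hfile_first_keys):
    """Presplit keys for the HBase table: one region per HFile.

    Splitting at the first key of every HFile except the lowest one
    yields non-overlapping regions, each receiving exactly one HFile,
    so no HFile is split during the bulk load (illustrative sketch).
    """
    return sorted(hfile_first_keys)[1:]

splits = region_split_keys([b"a000", b"m000", b"t000"])
# Regions: (-inf, m000), [m000, t000), [t000, +inf) -- one per HFile.
```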
  • #24 HFiles are written to the cluster where HBase is hosted to ensure HBase region servers have access to them during the upload process.
  • #26 Three Apache Spark jobs, each corresponding to a different dataset, access their respective HBase index tables, creating load on the HBase region servers hosting these tables.
  • #27 For a single dataset using the global index, adding more servers to the HBase cluster correlates linearly with a QPS increase, even though the dataset's QPSFraction remains constant.
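This is the property the throttle relies on; a sketch of the capacity calculation (QPSFraction is the per-dataset share; the per-server QPS is an assumed benchmark number, not a measured figure from the talk):

```python
def dataset_qps_budget(num_region_servers, qps_per_server, qps_fraction):
    """QPS a single dataset may issue against the HBase cluster.

    Cluster capacity scales linearly with region servers; each
    dataset's budget is a fixed fraction of that capacity, so adding
    servers raises every dataset's budget without changing fractions.
    """
    return num_region_servers * qps_per_server * qps_fraction

# Growing the cluster from 10 to 15 servers at a constant 20% share:
before = dataset_qps_budget(10, 1_000, 0.2)  # 2000.0
after = dataset_qps_budget(15, 1_000, 0.2)   # 3000.0
```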
  • #29 Explore other indexing solutions, possibly merging the bootstrap and incremental indexing solutions for easier maintenance.