
HBase Global Indexing to support large-scale data ingestion at Uber

Data serves as the platform for decision-making at Uber. To facilitate data-driven decisions, many datasets at Uber are ingested into a Hadoop data lake and exposed for querying via Hive. Analytical queries joining various datasets are run to better understand business data at Uber.

Data ingestion, in its most basic form, is about organizing data to balance efficient reading and writing of new data. Organizing data for efficient reading means factoring in query patterns and partitioning the data so that read amplification stays low. Organizing data for efficient writing means factoring in the nature of the input data: whether it is append-only or updatable.

At Uber we ingest terabytes of data for many critical tables, such as trips, that are updatable. These tables are a fundamental part of Uber's data-driven solutions and act as the source of truth for analytical use cases across the entire company. Datasets such as trips constantly receive updates in addition to inserts. Ingesting such datasets requires a critical component that keeps bookkeeping information about the data layout and annotates each incoming change with the location in HDFS where that data should be written. This component is called Global Indexing. Without it, all records are treated as inserts and re-written to HDFS instead of being updated, duplicating data and breaking both data correctness and user queries. Global Indexing is key to scaling our jobs, which now handle more than 500 billion writes a day in our current ingestion systems, and it must provide strong consistency and high throughput for index writes and reads.
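The tagging step described above can be sketched as follows. This is an illustrative model only, not Uber's implementation: a plain dict stands in for the HBase-backed index, and all names are hypothetical.

```python
# Sketch of the Global Indexing idea: the index maps each record key to the
# HDFS file that currently holds the record, so every incoming change can be
# tagged as an insert (no entry) or an update (entry found).

def tag_changes(index, changes):
    """index: key -> HDFS file path; changes: iterable of (key, payload)."""
    tagged = []
    for key, payload in changes:
        location = index.get(key)
        if location is None:
            # No index entry: a brand-new record, written as an insert.
            tagged.append((key, payload, "INSERT", None))
        else:
            # Entry found: rewrite the record at its known HDFS location.
            tagged.append((key, payload, "UPDATE", location))
    return tagged

index = {"trip-1": "/data/trips/part-0001"}
tagged = tag_changes(index, [("trip-1", {"fare": 12.5}),
                             ("trip-2", {"fare": 7.0})])
```

Without the index lookup, both changes above would fall into the insert branch, which is exactly the duplication failure mode the abstract describes.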

At Uber, we chose HBase as the backing store for the Global Indexing component. In this talk, we will discuss data at Uber and expound on why we built the global index using Apache HBase and how this helps us scale our cluster usage. We'll detail why we chose HBase over other storage systems; how and why we devised a creative solution that loads HFiles directly into the backend, circumventing the normal write path, when bootstrapping our ingestion tables to avoid QPS constraints; and other lessons we learned bringing this system into production at the scale of data that Uber encounters daily.



  1. HBase Global Indexing to support large-scale data ingestion @ Uber May 21, 2019
  2. Danny Chen ● Engineering Manager on Hadoop Data Platform team ● Leading Data Ingestion team ● Previously worked @ on storage team (Manhattan) ● Enjoys playing basketball, biking, and spending time w/ my kids.
  3. Uber Apache Hadoop Platform Team Mission: Build products to support reliable, scalable, easy-to-use, compliant, and efficient data transfer (both ingestion & dispersal) as well as data storage, leveraging the Apache Hadoop ecosystem. Apache Hadoop is either a registered trademark or trademark of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of this mark.
  4. Overview ● High-level ingestion & dispersal introduction ● Different types of workloads ● Need for Global Index ● How Global Index works ● Generating global indexes with HFiles ● Throttling HBase access ● Next steps
  5. High-Level Ingestion/Dispersal Introduction
  6. Hadoop Data Ecosystem at Uber Apache Hadoop Data Lake Schemaless Analytical Processing Apache Kafka, Cassandra, Spark, and HDFS logos are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks. Data Ingestion Data Dispersal
  7. Hadoop Data Ecosystem at Uber
  8. Different Types of Workloads
  9. Bootstrap ● One time only, at the beginning of the table's lifecycle ● Large amounts of data ● Millions of QPS of throughput ● Needs to finish in a matter of hours ● NoSQL stores cannot keep up
  10. Incremental ● Dominates the lifecycle of Hive table ingestion ● Incremental upstream changes from Kafka or other data sources ● 1000s of QPS per dataset ● Reasonable throughput requirements for NoSQL stores
  11. Cell vs Row Changes
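The slide title contrasts the two change granularities an updatable dataset can emit. A minimal sketch of the distinction, with hypothetical field names (the source does not show Uber's actual record schema): a row-level change carries the full record and replaces the old one outright, while a cell-level change carries only the modified columns and must be merged onto the existing row.

```python
def apply_row_change(current_row, new_row):
    # Row-level change: the incoming record is the complete new row and
    # simply replaces whatever was stored before.
    return dict(new_row)

def apply_cell_change(current_row, changed_cells):
    # Cell-level change: only the changed columns arrive, so they must be
    # merged onto the existing row to reconstruct the full record.
    merged = dict(current_row)
    merged.update(changed_cells)
    return merged

row = {"trip_id": "t1", "fare": 10.0, "status": "ongoing"}
```

Cell-level changes are cheaper on the wire but require reading the current row first; row-level changes are self-contained.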
  12. Need for Global Index
  13. Requirements for Global Index ● Large amounts of historical data ingested in a short amount of time ● Append-only vs append-plus-update ● Data layout and partitioning ● Bookkeeping for data layout ● Strong consistency ● High throughput ● Horizontally scalable ● Required a NoSQL store
  14. ● Decision was to use HBase ● Trade availability for consistency ● Automatic rebalancing of HBase tables via region splitting ● Global view of dataset via master/slave architecture VS
  15. How Global Index Works
  16. Generating Global Indexes
  17. Batch and One-Time Index Upload
  18. Data Model for Global Index
  19. Spark & RDD Transformations for Index Generation
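Slide 19's title points at the transformation pipeline without showing it. A plain-Python sketch of the shape such a pipeline plausibly takes (the actual job uses Spark RDDs, and all names here are hypothetical): extract one index entry per record, sort by row key since HFiles must be written in key order, then split at the HBase region boundaries so each partition can become one bulk-loadable HFile.

```python
import bisect

def build_index_entries(records):
    # Map step: one (row_key, hdfs_location) index entry per ingested record.
    entries = [(r["row_key"], r["hdfs_file"]) for r in records]
    # Sort step: HFiles must be written in strict row-key order.
    entries.sort(key=lambda kv: kv[0])
    return entries

def partition_by_region(entries, region_split_keys):
    # Partition step: split the sorted entries at the region boundaries so
    # each partition maps onto exactly one HBase region.
    parts = [[] for _ in range(len(region_split_keys) + 1)]
    for key, location in entries:
        parts[bisect.bisect_right(region_split_keys, key)].append((key, location))
    return parts

records = [{"row_key": "c", "hdfs_file": "f1"},
           {"row_key": "a", "hdfs_file": "f2"},
           {"row_key": "m", "hdfs_file": "f3"}]
entries = build_index_entries(records)
parts = partition_by_region(entries, region_split_keys=["b", "k"])
```

In Spark, the sort and partition steps would typically collapse into a single shuffle keyed on region boundaries, which is one way to read the "reduce 3 shuffle stages to one" tuning point on slide 21.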
  20. HFile Upload Process
  21. HFile Index Job Tuning ● Explicitly register classes with Kryo serialization ● Reduce 3 shuffle stages to one ● Proper HFile size ● Proper partition count sizing ● 13 TB of index data with 54 billion indexes ○ 2 hours to generate indexes ○ 10 min to load
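The "proper HFile size" and "proper partition count" bullets are linked: if one Spark partition writes one HFile, the partition count falls out of the total index size divided by the target HFile size. A sketch of that arithmetic; the 13 TB figure is from the slide, but the 10 GB target HFile size is an assumption for illustration, not a number from the talk.

```python
import math

def partition_count(total_index_bytes, target_hfile_bytes):
    # One partition -> one HFile, so the count is total size over target
    # HFile size, rounded up so no HFile exceeds the target.
    return max(1, math.ceil(total_index_bytes / target_hfile_bytes))

# Hypothetical sizing: 13 TB of index data with a 10 GB target HFile size.
n = partition_count(13 * 1024**4, 10 * 1024**3)
```

Too few partitions produce oversized HFiles (slow loads, region splits right after bulk load); too many produce a flood of tiny files.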
  22. Throttling HBase Access
  23. The Need for Throttling HBase Access
  24. Horizontal Scalability & Throttling
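The slides do not show the throttling mechanism, so here is one common way such a limiter is built: a token bucket that caps aggregate QPS against HBase, with each index read or write consuming a token. This is a generic sketch under that assumption, not Uber's implementation.

```python
import time

class TokenBucket:
    """Caps request rate: tokens refill at `rate` per second up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = rate          # tokens added per second (steady-state QPS cap)
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def try_acquire(self, n=1):
        # Refill based on elapsed time, then spend tokens if enough remain.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False

bucket = TokenBucket(rate=0, capacity=2)  # rate=0 makes the refill inert for demo
```

Callers that fail `try_acquire` back off and retry, which keeps the bootstrap and incremental jobs from overwhelming the HBase region servers.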
  25. Next Steps
  26. Next Steps ● Handle non-append-only data during bootstrap ● Explore other indexing solutions
  27. Useful Links open-source/
  28. Other DataWorks Summit Talks Marmaray: Uber's Open-sourced Generic Hadoop Data Ingestion and Dispersal Framework Wednesday at 11 am
  29. Attribution Kaushik Devarajaiah Nishith Agarwal Jing Li
  30. Positions available: Seattle, Palo Alto & San Francisco email: We are hiring!
  31. Thank you Proprietary © 2018 Uber Technologies, Inc. All rights reserved. No part of this document may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage or retrieval systems, without permission in writing from Uber. This document is intended only for the use of the individual or entity to whom it is addressed. All recipients of this document are notified that the information contained herein includes proprietary information of Uber, and recipient may not make use of, disseminate, or in any way disclose this document or any of the enclosed information to any person other than employees of addressee to the extent necessary for consultations with authorized personnel of Uber. Questions: email Follow our Facebook page: