Strictly Confidential 1
Exploring Alluxio for daily tasks
Our first glance into the Alluxio world
Jiawei Zhang
Yichuan Huang
Grace Lu
Wenlong Xiong
Content
● Background
○ Data warehousing at Robinhood
○ Our Presto use cases
● The limitations we ran into
● How Alluxio helps
○ Visualization queries
○ Data analysis
● Our performance test results
01 Background
1. How we build our data lake
2. Our daily traffic
3. Perspective
How we build our data lake
Our daily traffic
● Ad-hoc queries
○ Mainly from visualizations
○ Performing simple filters and aggregations on a large range of data
○ Always come with a limit on the result set
○ Covering data ranging from one day to one month
○ Concentrating on a small number of datasets
○ Repeated many times
Our daily traffic
● Data analysis queries
○ From scheduled jobs, services, or data scientists running them manually
○ Complex queries that contain many stages
○ Running many queries on the same set of data
Our daily traffic
● Data report generation
○ On-demand from data scientists
○ Aggregating the most recent data
○ Running daily or weekly
○ Covering data from the previous day to the previous week
Perspective
● We always read data from the last several days, from a few datasets
● S3 reads are slow and unstable
● What if we had a cache for just the most commonly used data?
● Requirements:
○ Able to integrate with Presto/HMS and cache input data based on query
○ Able to flush cached data based on access frequency
○ Scalable
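The caching requirements above can be illustrated with a small sketch. This is not Alluxio's eviction policy or API; it is a minimal read-through cache in Python that flushes the least frequently accessed entry when full, under the assumption that "access frequency" means a simple per-key read count. The class name FrequencyCache, the fake_s3_read helper, and the dataset keys are all hypothetical and exist only for illustration.

```python
from collections import Counter
from typing import Callable, Dict


class FrequencyCache:
    """Read-through cache that evicts the least frequently accessed key."""

    def __init__(self, capacity: int, fetch: Callable[[str], bytes]) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.fetch = fetch              # slow path, e.g. a read from S3
        self.data: Dict[str, bytes] = {}
        self.hits: Counter = Counter()  # access count per cached key
        self.cold_reads = 0             # reads that went to the slow store
        self.warm_reads = 0             # reads served from the cache

    def read(self, key: str) -> bytes:
        if key in self.data:
            self.warm_reads += 1
        else:
            self.cold_reads += 1
            if len(self.data) >= self.capacity:
                # Flush the cached key with the fewest accesses so far.
                victim = min(self.data, key=lambda k: self.hits[k])
                del self.data[victim]
                del self.hits[victim]
            self.data[key] = self.fetch(key)
        self.hits[key] += 1
        return self.data[key]


def fake_s3_read(key: str) -> bytes:
    # Hypothetical stand-in for the slow remote read.
    return f"contents of {key}".encode()


# A workload shaped like the traffic described earlier: a few recent
# datasets read repeatedly, plus an occasional one-off read.
cache = FrequencyCache(capacity=2, fetch=fake_s3_read)
workload = [
    "orders/day-1", "orders/day-1", "users/day-1",
    "orders/day-1", "logs/old", "orders/day-1",
]
for key in workload:
    cache.read(key)

print(cache.cold_reads, cache.warm_reads)  # 3 3
print(sorted(cache.data))                  # ['logs/old', 'orders/day-1']
```

In this toy workload the frequently read dataset stays cached while the rarely read one is flushed, which is the behavior the requirements ask for. A real deployment would also need to bound the cache by bytes rather than entry count and to age out stale counts.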
02 How Alluxio Helps
1. Our Alluxio setup
2. Standalone vs. Co-located
3. Performance improvements
Our Alluxio setup
Standalone vs. Co-located
Co-located
Pros:
● Lower cost
● Better utilization of VM resources
● Skip the hop between VPCs
Cons:
● Auto-scaling and graceful shutdown must be handled for both services
● Might impact bandwidth between workers
● Extra CPU usage
Standalone
Pros:
● Won’t be affected by Presto cluster changes
● Can be extended to other uses, e.g. Spark, Flink, etc.
Cons:
● Higher cost
● Data moves over the network instead of being read locally
Performance Improvements
● Warm data:
○ 30-50% read speed increase
■ Increased read throughput
■ Reduced total rows scanned
○ Overall ~30% improvement on data-intensive queries
○ Eliminated issues caused by unstable connectivity to data source
Technical challenges we faced
● Reading cold data
● Mounting a huge schema
● Handling large tables
Thank you

Exploring Alluxio for Daily Tasks at Robinhood
