Exploring Alluxio for Daily Tasks at Robinhood

Exploring Alluxio for daily tasks
Our first glance into the Alluxio world
Jiawei Zhang
Yichuan Huang
Grace Lu
Wenlong Xiong

Content
● Background
○ Data warehousing at Robinhood
○ Our Presto use cases
● The limitations we ran into
● How Alluxio helps
○ Visualization queries
○ Data analysis
● Our performance test results

01 Background
1. How we build our data lake
2. Our daily traffic
3. Perspective

How we build our data lake

Our daily traffic
● Ad-hoc queries
○ Mainly from visualizations
○ Performing simple filters and aggregations over a large range of data
○ Always including a limit on the result set
○ Covering data ranging from a day to a month
○ Concentrating on a small number of datasets
○ Repeated many times

Our daily traffic
● Data analysis queries
○ From scheduled jobs, services, or run manually by data scientists
○ Complex queries that contain many stages
○ Running many queries against the same set of data

Our daily traffic
● Data report generation
○ On-demand from data scientists
○ Aggregating the most recent data
○ Running daily or weekly
○ Covering data from the previous day to the previous week

Perspective
● We always read data from the last several days, from a small number of datasets.
● S3 reads are slow and their speed is unstable.
● What if we had a cache for just the most commonly used data?
● Requirements:
○ Able to integrate with Presto/HMS and cache input data based on the query
○ Able to evict cached data based on access frequency
○ Scalable
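
To make the frequency-based eviction requirement concrete, here is a minimal Python sketch of a cache that evicts the least frequently accessed entry when it is full. It is a toy illustration only, not how Alluxio implements eviction. The class `FrequencyCache`, the function `read_from_source`, and the dataset keys are all hypothetical names invented for this example.

```python
from collections import Counter


class FrequencyCache:
    """Toy cache that evicts the least frequently accessed key when full."""

    def __init__(self, capacity):
        self.capacity = capacity   # maximum number of cached entries
        self.data = {}             # key -> cached value
        self.hits = Counter()      # key -> number of accesses seen so far

    def get(self, key, load_from_source):
        """Return the cached value, loading from the slow source on a miss."""
        self.hits[key] += 1
        if key in self.data:
            return self.data[key]
        value = load_from_source(key)
        if len(self.data) >= self.capacity:
            # Evict the cached key with the lowest access count.
            coldest = min(self.data, key=lambda k: self.hits[k])
            del self.data[coldest]
        self.data[key] = value
        return value


source_reads = []  # records every read that reaches the slow source


def read_from_source(key):
    """Hypothetical stand-in for a slow read from the underlying store."""
    source_reads.append(key)
    return f"data for {key}"


cache = FrequencyCache(capacity=2)
accesses = ["trades/recent", "trades/recent", "orders/recent",
            "users/old", "trades/recent"]
for key in accesses:
    cache.get(key, read_from_source)

print(source_reads)
```

In this run the frequently read `trades/recent` key stays cached while the less popular `orders/recent` key is evicted to make room, so only three of the five accesses reach the slow source.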

02 How Alluxio Helps
1. Our Alluxio setup
2. Standalone vs. Co-located
3. Performance improvement

Our Alluxio setup

Standalone vs. Co-located
Co-located
Pros:
● Lower cost
● Better utilization of VM resources
● Skips the hop between VPCs
Cons:
● Auto-scaling and graceful shutdown must be handled for both services
● Might impact bandwidth between workers
● Extra CPU usage
Standalone
Pros:
● Won’t be affected by Presto cluster changes
● Can be extended to other uses, e.g. Spark, Flink, etc.
Cons:
● Higher cost
● Network transfer instead of data locality
Performance Improvements
● Warm data:
○ 30-50% read speed increase
■ Increased read throughput
■ Reduced total rows scanned
○ Overall ~30% improvement on data-intensive queries
○ Eliminated issues caused by unstable connectivity to the data source
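
When reading percentages like these, it helps to separate "read speed increase" from "time saved", since the two differ for the same measurement. The Python sketch below shows both calculations. The timings are hypothetical values chosen for illustration, not measurements from this talk, and the function names are invented for this example.

```python
def read_speed_increase(cold_seconds, warm_seconds):
    """Percent increase in read speed when the same data is read faster."""
    return (cold_seconds / warm_seconds - 1.0) * 100.0


def time_reduction(cold_seconds, warm_seconds):
    """Percent reduction in wall-clock time for the same read."""
    return (1.0 - warm_seconds / cold_seconds) * 100.0


# Hypothetical timings for one query: uncached (cold) vs. cached (warm).
cold, warm = 15.0, 10.0

speed_gain = read_speed_increase(cold, warm)
time_saved = time_reduction(cold, warm)

print(f"read speed increase: {speed_gain:.1f}%")
print(f"time reduction: {time_saved:.1f}%")
```

With these illustrative numbers, a 50% increase in read speed corresponds to roughly a one-third reduction in read time, so it is worth stating which of the two a reported figure refers to.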

Technical challenges we faced
● Reading cold data
● Mounting huge schemas
● Handling large tables

Thank you
