Hybrid collaborative tiered storage with Alluxio
Thai Bui
Data Engineer @ Bazaarvoice
Bazaarvoice
● Founded in 2005 in Austin, TX
● Digital marketing SaaS platforms for ratings and reviews
○ Display & syndicate reviews from brands to retailer websites
○ Reporting & analytics on consumers, reviews, products, etc.
● 2,600 client websites
● 5.4 billion product page views each month
● 900 million unique shoppers each month
Reporting & analytics on S3
When you have 100s of TB of data on S3
● Just listing the files is slow
● Download speed in EC2 is limited (50-150Mb/s per node)
● No concept of cache
● No concept of data locality
AWS S3 : The Need For Speed
● Add tiered storage to S3
○ Hot, warm, cold storage (fastest, fast, and not so fast)
○ Metadata cache
○ Data cache
● Keep data local
○ In the same machine, not via the Ethernet cable
● Compatible with existing services
○ Hadoop, Spark, Hive, Presto, etc.
● Adaptive & highly configurable
○ Symlink for S3
Overview
[Architecture diagram: App1 (Spark) and App2 read through Alluxio; hot & warm data sits on the ZFS-backed local tier, cold data stays in S3; the Hive Metastore tracks which partitions live where]
● Alluxio
○ Distributed data storage
○ Hadoop compatible
○ By AMPLab
● ZFS
○ OS-level file system
○ Volume manager
○ By Sun Microsystems
● Both are open-source
Alluxio : The tiered-storage layer
● Support for native filesystem and Hadoop filesystem
● Distributed and can be installed on every node
○ Provides data locality
● Mount S3, HDFS, etc. to Alluxio
○ Think symlink. No data movement (see the mount sketch below).
● Use the Hive metastore to partition data into hot/warm and cold regions
○ Acts as a remote tiered-storage layer
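A minimal sketch of that "symlink"-style mount, assuming the standard alluxio fs mount CLI; the bucket, path, scheme, and credentials below are placeholders, not our real setup:

# Mount an S3 prefix into the Alluxio namespace; nothing is copied at mount time
alluxio fs mount \
  --option aws.accessKeyId=<ACCESS_KEY> \
  --option aws.secretKey=<SECRET_KEY> \
  /warehouse/reviews s3a://example-bucket/warehouse/reviews

# Blocks are pulled into the Alluxio tiers lazily, on first read
alluxio fs ls /warehouse/reviews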
ZFS : The acceleration layer
● Both a filesystem & a volume manager
○ Mirrored writes to 2 SSDs -> 2x read speed (pool sketch below)
● Works at the Linux kernel-space
○ Works with RAM to accelerate read/write
○ Auto promote/demote blocks from RAM to other storage
○ Used with local NVMe SSD if data is not in RAM
○ Acts as a local tiered-storage layer
● Extremely reliable
○ Automatic block checksum & repair
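A rough sketch of the kind of mirrored pool behind the numbers on the next slide; the NVMe device names and tunables are illustrative, not our exact build:

# Mirror the two local NVMe SSDs; ZFS can read from both sides of the mirror
zpool create -f alluxio mirror /dev/nvme0n1 /dev/nvme1n1

# Mount the pool where the Alluxio worker keeps its storage directory
zfs set mountpoint=/alluxio/fs alluxio

# Typical tuning for large sequential scans (illustrative, adjust to taste)
zfs set compression=lz4 alluxio
zfs set atime=off alluxio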
ZFS + NVMe: Micro benchmark
i3.4xlarge, up to 10 Gbit network, 2 x 1.9 TB NVMe SSD
● Baseline w/ EBS
○ 135 MB/s write (dd if=/dev/zero of=/tmp/test1.img bs=1G count=1 oflag=dsync)
○ 157 MB/s read (dd if=/tmp/test1.img of=/dev/zero bs=8k)
● ZFS + 2 mirrored NVMe SSD
○ 820 MB/s write (dd if=/dev/zero of=/alluxio/fs/test1.img bs=1G count=1)
○ 1.7 GB/s read (dd if=/alluxio/fs/test1.img of=/dev/zero bs=1G count=1)
● 4x write, 10x read compared to EBS
● 10-15x compared to S3
With ZFS
[Diagram: Alluxio runs in user space behind the native/Hadoop filesystem API; ZFS runs in kernel space and promotes/demotes blocks between RAM (hot) and the NVMe SSDs (warm); worker config sketch below]
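One way to wire the two layers together is to give the Alluxio worker a single on-disk tier on the ZFS mount and let the ZFS ARC handle the RAM caching; a sketch assuming the standard alluxio-site.properties keys, with a placeholder quota:

# Point the worker's only tier at the ZFS mirror; ZFS promotes hot blocks to RAM
cat >> conf/alluxio-site.properties <<'EOF'
alluxio.worker.tieredstore.levels=1
alluxio.worker.tieredstore.level0.alias=SSD
alluxio.worker.tieredstore.level0.dirs.path=/alluxio/fs
alluxio.worker.tieredstore.level0.dirs.quota=1TB
EOF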
With Hive
[Diagram: the Hive Metastore points partitions from the last 30 days at Alluxio (hot & warm) and partitions older than 30 days at S3 (cold); partition DDL sketch below]
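A sketch of what the hot/warm vs. cold split looks like at the metastore level; the table, partition, bucket, and host names are made up for illustration:

hive -e "
  -- Recent partitions point at Alluxio (hot & warm), so scans stay local
  ALTER TABLE reviews PARTITION (ds='2017-06-30')
    SET LOCATION 'alluxio://alluxio-master:19998/warehouse/reviews/ds=2017-06-30';

  -- Partitions older than 30 days are flipped back to plain S3 (cold)
  ALTER TABLE reviews PARTITION (ds='2017-05-01')
    SET LOCATION 's3a://example-bucket/warehouse/reviews/ds=2017-05-01';
"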
CPU/IO Monitoring
Tiered storage Monitoring
Alluxio Monitoring
Hive Monitoring & Performance
● Scanning 200 GB of data in tiered storage, 500M rows, select *
● Scanning 5 GB of data in tiered storage, 350M rows, fewer projections
● Scanning 35 GB of data in S3, 1.6B rows, count distinct
● Metadata/split calculation ops took ~60s, with the majority of the time spent scanning S3
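The three scans above roughly correspond to queries of the following shape; the table, columns, and date ranges are hypothetical stand-ins for the real workload:

hive -e "
  -- ~200 GB / 500M rows from the tiered store, every column projected
  SELECT * FROM reviews WHERE ds BETWEEN '2017-06-01' AND '2017-06-30';

  -- ~5 GB / 350M rows from the tiered store, only a few columns projected
  SELECT product_id, rating FROM reviews WHERE ds BETWEEN '2017-06-01' AND '2017-06-30';

  -- ~35 GB / 1.6B rows straight from S3, count distinct
  SELECT COUNT(DISTINCT shopper_id) FROM pageviews WHERE ds < '2017-06-01';
"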
Result
● 5-10X read improvement in Hive
○ Workers can short-circuit and read directly from ZFS instead of S3
○ Move compute to the data
● Easy to debug, with a feedback loop; collaborative
○ Data publishers + data analysts/scientists
● Good for iterating over the same data set multiple times
○ Machine learning
○ Exploratory analysis
● Gives us control over S3
○ More recent data should be faster to access
Questions?
