
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio


Alluxio Bay Area Meetup March 14th
Join the Alluxio Meetup group: https://www.meetup.com/Alluxio
Alluxio Community slack: https://www.alluxio.org/slack


Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio

  1. 1. Alluxio 2.0 & Near Real-time Analytics with Spark & Alluxio @VipShop Alluxio Bay Area Meetup @alluxio alluxio.org/slack info@alluxio.com
  2. 2. Special thanks to AICamp and ODSC for co-hosting! Alluxio Bay Area Meetup @alluxio alluxio.org/slack info@alluxio.com
  3. 3. Alluxio 2.0.0-preview 03/14 Alluxio Meetup
  4. 4. ● Release Manager for Alluxio 2.0.0 ● Contributor since Tachyon 0.4 (2012) ● Founding Engineer @ Alluxio About Me Calvin Jia
  5. 5. Alluxio Overview • Open source, distributed storage system • Commonly used for data analytics such as OLAP on Hadoop • Deployed at Huya, Two Sigma, Tencent, and many others • Largest deployments of over 1000 nodes • Interfaces: Java File API, HDFS Interface, S3 Interface, REST API, POSIX Interface • Under store drivers: HDFS, Swift, S3, NFS
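Because Alluxio exposes an HDFS-compatible interface, existing Hadoop-API code can talk to it just by switching the URI scheme. A minimal Scala sketch, assuming the Alluxio client jar is on the classpath and a master running at alluxio-master:19998 (both names illustrative):

    import java.net.URI
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Resolve the alluxio:// scheme through the standard Hadoop FileSystem API.
    val fs = FileSystem.get(new URI("alluxio://alluxio-master:19998/"), new Configuration())

    // List a directory exactly as you would against hdfs://.
    fs.listStatus(new Path("/datasets")).foreach(s => println(s.getPath))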
  6. 6. Agenda Alluxio 2.0 Motivation1 Architectural Innovations2 Release Roadmap3
  7. 7. Alluxio 2.0 Motivations
  8. 8. Why 2.0 • Alluxio 1.x target use cases are largely addressed • Three major types of feedback from users • Want to support POSIX-based workloads, especially ML • Want better options for data management • Want to scale to larger clusters
  9. 9. Use Cases Alluxio 1.x • Burst compute into cloud with data on-prem • Enable object stores for data analytics platforms • Accelerate OLAP on Hadoop Example • As a data scientist, I want to be able to spin up my own elastic compute cluster that can easily and efficiently access my data stores New in Alluxio 2.x • Enable ML/DL frameworks on object stores • Data lifecycle management and data migration Examples • As a data scientist, I want to run my existing simulations on larger datasets stored in S3. • As a data infrastructure engineer, I want to automatically tier data between Alluxio and the under store.
  10. 10. ML/DL Workloads • Alluxio 1.x focuses primarily on Hadoop-based workloads, i.e. OLAP on Hadoop • Alluxio 2.x will continue to excel for these workloads • New emphasis on ML frameworks such as Tensorflow • These workloads primarily access the same data sets Alluxio is already serving • Challenges include a new API and different file characteristics, such as file access patterns and file sizes
  11. 11. Data Management • Finer grained control over Alluxio replication • Automated and scalable async persistence • Distributed data loading • Mechanism for cross-mount data operations
  12. 12. Scaling • Namespace scaling - scale to 1 billion files • Cluster scaling - scale to 3000 worker nodes • Client scaling - scale to 30,000 concurrent clients
  13. 13. Architectural Innovations
  14. 14. Architectural Innovations in 2.0 • Off heap metadata storage (namespace scaling) • gRPC transport layer (cluster and client scaling) • Improved POSIX API (new workloads) • Job Service (enable data management) • Embedded Journal and Internal Leader Election (better integration with object stores, fewer external dependencies)
  15. 15. Off Heap Metadata Storage • Uses an embedded RocksDB to store the inode tree • Internal cache for frequently used inodes • Performance is comparable to the previous on-heap option when the working set fits in the cache
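A hedged alluxio-site.properties sketch of how the off-heap metastore is enabled (property names as documented for the 2.0 preview; verify against the release docs, and the local path is illustrative):

    # Store the inode tree in an embedded RocksDB instead of on the JVM heap.
    alluxio.master.metastore=ROCKS
    # Local directory backing the RocksDB metastore (illustrative path).
    alluxio.master.metastore.dir=/opt/alluxio/metastore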
  16. 16. gRPC Transport Layer • Switch from Thrift (metadata) + Netty (data) transport to a consolidated gRPC based transport • Connection multiplexing to reduce the number of connections from # of application threads to # of applications • Threading model enables the master to serve concurrent requests without being limited by internal threadpool size or open file descriptors on the master
  17. 17. Improved POSIX API • Alluxio FUSE based POSIX API • Limitations remain, such as no random writes and files cannot be read until they are fully written • Validated against Tensorflow’s image recognition and recommendation workloads • Taking suggestions for other POSIX-based workloads!
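For reference, the FUSE endpoint is exposed by mounting an Alluxio path onto a local directory; a sketch of the usual invocation (script location, mount point and arguments are per the Alluxio docs of this era; double-check against the 2.0 documentation):

    # Mount the Alluxio namespace root onto a local directory via FUSE.
    $ integration/fuse/bin/alluxio-fuse mount /mnt/alluxio /
    # Frameworks such as Tensorflow can then read and write /mnt/alluxio/... as if they were local files.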
  18. 18. Job Service • New process which serves as a lightweight computation framework for Alluxio-specific tasks • Enables replication factor control without user input • Enables faster loading/persisting of data in a distributed manner • Allows users to do cross-mount operations • Async-through persistence is handled automatically
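As a rough illustration of the CLI surface backed by the job service (command names and flags as of the 2.0 preview; the path is made up, so verify against the docs before relying on them):

    # Let the job service keep between 1 and 3 replicas of a file in Alluxio.
    $ bin/alluxio fs setReplication --min 1 --max 3 /data/clicks/part-00000
    # Ask for a file to be persisted to the under store; the copy runs on the job workers.
    $ bin/alluxio fs persist /data/clicks/part-00000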
  19. 19. Embedded Journal and Internal Leader Election • New journaling service reliant only on Alluxio master processes • No longer need an external distributed storage to store the journal • Greatly benefits environments without a distributed file system • Uses Raft as the consensus algorithm • Consensus is used for journal integrity • Consensus can also be used for leader election in high availability mode
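A hedged alluxio-site.properties sketch for a three-master embedded-journal deployment (property names per the 2.0 preview docs; hostnames are illustrative):

    # Use the Raft-based embedded journal instead of a journal on external storage.
    alluxio.master.journal.type=EMBEDDED
    # Masters participating in the Raft quorum (default embedded journal port 19200).
    alluxio.master.embedded.journal.addresses=master1:19200,master2:19200,master3:19200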
  20. 20. Release Roadmap
  21. 21. Alluxio 2.0.0 Release • Alluxio 2.0.0-preview is available now • Any and all feedback is appreciated! • File bugs and feature requests on our Github issues • Alluxio 2.0.0 will be released in ~3 months
  22. 22. Questions? Alluxio Website - www.alluxio.org Alluxio Community Slack Channel - www.alluxio.org/slack Alluxio Office Hours & Webinars - https://www.alluxio.org/resources/events
  23. 23. Wanchun Wang Chief Data Architect Near Real-time ETL platform using Spark & Alluxio
  24. 24. About VipShop • A leading online discount retailer for brands in China • $12.3 Billion net revenue in 2018 • 20M+ visitors/day
  26. 26. Overview - Big Data systems • Separate Streaming and Batch platforms, single data pre-processing pipeline, no longer a pure Lambda architecture • Typically, streaming data gets sunk into Hive tables every 5 minutes • More ETL jobs are moving toward Near Real Time • Pipeline: Log → Kafka → Data Cleansing → Kafka → Augmentation → Kafka → Hive Delta → Hive Daily, with the front half on Streaming (Storm/Flink/Spark) and the back half on Batch ETL (Hive/Spark)
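As a rough analogue of the 5-minute sink, here is a Spark Structured Streaming sketch (not necessarily what VipShop runs; the topic, paths and host names are made up):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.Trigger

    val spark = SparkSession.builder().appName("events-to-hive-delta").enableHiveSupport().getOrCreate()

    // Read the cleansed/augmented event stream from Kafka.
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka1:9092")
      .option("subscribe", "augmented_events")
      .load()
      .selectExpr("CAST(key AS STRING) AS user_id", "CAST(value AS STRING) AS event_json")

    // Append a new batch of files under the Hive delta table every 5 minutes.
    events.writeStream
      .format("parquet")
      .option("path", "alluxio://alluxio-master:19998/warehouse/events_delta")
      .option("checkpointLocation", "alluxio://alluxio-master:19998/checkpoints/events_delta")
      .trigger(Trigger.ProcessingTime("5 minutes"))
      .start()
      .awaitTermination()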
  27. 27. Sales attribution: the process of identifying a set of user actions (“events”) across screens and touch points that contribute in some manner to a product sale, and then assigning value to each of these events. Example click path: front page → today’s new → men’s special → Product A detail → men’s special → Product B detail → add cart → order
  28. Near Real-time sales attribution is a very complex process • Recompute the full day’s data at each iteration: ~30 minutes, worst case 2-3 hours • Many data sources involved: page view, add cart, order_with_discount, order_cookie_map, sub_order, prepay_order_goods, etc. • Several large data sources each contain billions of records and take up 300GB ~ 800GB on disk • Sales path assignment is a very CPU-intensive computation • Written by business analysts • Complex SQL scripts with UDF functions • Business expectation: updated results every 5 - 15 minutes
  29. 29. Running performance-sensitive jobs on the current batch platform is not an option • Around 200K batch jobs executed daily in Hadoop & Spark clusters • HDFS: 1400+ nodes • SSD HDFS: 50+ nodes • Spark clusters: 300+ nodes • Cluster usage is above 80% on normal days, and resources are even more saturated during monthly promotion periods • Many issues contribute to inconsistent data access times, such as high NN RPC load, slow DataNode responses, etc. • Scheduling overhead when running M/R jobs
  30. 30. 1. Adding more compute power • Too expensive - not a real option 2. Improve ETL jobs to process updates incrementally 3. Create a new, relatively isolated environment • consistent computing resource allocation • intermediate data caching • faster read/write
  31. 31. 2. Improve ETL jobs to process updates incrementally • Recompute the click paths for the users active in the current window • Merge active user paths with the previous full path result • Less data in each computation, but one extra read of the history data (see the sketch below)
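A minimal Spark/Scala sketch of the merge step (column and function names are hypothetical): keep prior results for users with no new activity and replace the rest with freshly recomputed paths.

    import org.apache.spark.sql.DataFrame

    // previousFull: full-day path result from the last iteration
    // currentWindow: paths recomputed only for users active in the current window
    def mergePaths(previousFull: DataFrame, currentWindow: DataFrame): DataFrame = {
      val activeUsers = currentWindow.select("user_id").distinct()
      // Drop stale rows for users who were active in this window...
      val untouched = previousFull.join(activeUsers, Seq("user_id"), "left_anti")
      // ...and union in their recomputed paths.
      untouched.unionByName(currentWindow)
    }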
  33. 33. • A satellite Spark + Alluxio 1.8.1 cluster with 27 nodes (48 cores, 256G memory) • Alluxio colocated with Spark • Very consistent read/write I/O time over iterations • Alluxio MEM + HDD • Disable multiple copies to save space • Leave enough memory to the OS to improve stability
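A hedged alluxio-site.properties sketch of that worker setup (Alluxio 1.8-era property names; the paths, quotas and the exact way "multiple copies" were disabled are assumptions):

    # Two storage tiers per worker: RAM first, spill to HDD.
    alluxio.worker.tieredstore.levels=2
    alluxio.worker.tieredstore.level0.alias=MEM
    alluxio.worker.tieredstore.level0.dirs.path=/mnt/ramdisk
    alluxio.worker.tieredstore.level0.dirs.quota=64GB
    alluxio.worker.tieredstore.level1.alias=HDD
    alluxio.worker.tieredstore.level1.dirs.path=/data1/alluxio,/data2/alluxio
    alluxio.worker.tieredstore.level1.dirs.quota=2TB,2TB
    # Avoid making an extra local copy on every remote read, to save space.
    alluxio.user.file.passive.cache.enabled=false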
  34. 34. A. Remote HDFS cluster: 1-2x slower than Alluxio; the biggest problem is that there are lots of spikes B. Local HDFS: 30%-100% slower than Alluxio (MEM + HDD) C. Dedicated SSD cluster: on par with Alluxio on regular days, but overall read/write latency doubled during busy days D. Dedicated Alluxio cluster: still not as good as the co-located setup (more tests to be done) E. Spark cache • Our daily views, clicks and path results are too big to fit into the JVM • Slow to create, and we have lots of “only used twice” data • Multiple downstream Spark apps need to share the data
  35. 35. • Move the downstream processes closer to the data, avoid duplicating large amounts of data from Alluxio to remote HDFS • Managing NRT jobs • A single big Spark Streaming job? too many inputs and outputs at different stages • Split into multiple jobs? how to coordinate multiple streaming jobs • NRT runs at a much higher frequency and is very sensitive to system hiccups • Current batch job scheduling • Process dependencies, executed at every fixed interval • When there is a severe delay, multiple batch instances for different slots run at the same time
  36. 36. • Report data readiness to a Watermark Service, which manages dependencies between loosely coupled jobs • The ultimate goal is to get the latest result fast • a delayed batch might consume unprocessed input blocks that span multiple cycles • Output at fixed intervals is not guaranteed • not all inputs are mandatory; an iteration gets kicked off even when optional input sources have not been updated for that particular cycle
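A toy Scala sketch of that trigger rule (the Watermark type and its fields are invented for illustration): an iteration starts once every mandatory input has advanced past the last processed point, regardless of the optional ones.

    // One entry per upstream source reported to the watermark service.
    case class Watermark(source: String, readyUpTo: Long, mandatory: Boolean)

    // Kick off the next iteration only when all mandatory inputs have new data;
    // optional inputs that were not updated this cycle do not block it.
    def shouldStartIteration(marks: Seq[Watermark], lastProcessedUpTo: Long): Boolean =
      marks.filter(_.mandatory).forall(_.readyUpTo > lastProcessedUpTo)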
  37. 37. • Easy to set up • Pluggable, just a simple switch from hdfs://xxxx to alluxio://xxxx • Together with Spark, either forms a separate satellite cluster or runs on labeled machines in our big clusters • Within our data centers it is easier to allocate computing resources, but SSD machines are scarce • Spark and Alluxio on K8S: over 1k machines; we need to shuffle those machines to run Streaming, Spark ETL, Presto ad hoc query or ML on different days or at different times of day • Very stable in production • Over two and a half years without any major issue. A big thanks to the Alluxio engineers!
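The "simple switch" amounts to changing the URI scheme inside the job, roughly as in this Scala/Spark sketch (paths and the master address are placeholders, an existing SparkSession `spark` is assumed, and the Alluxio client jar must be on the Spark classpath):

    // Before: spark.read.parquet("hdfs://nameservice1/warehouse/clicks/dt=20190314")
    val clicks = spark.read.parquet("alluxio://alluxio-master:19998/warehouse/clicks/dt=20190314")

    // Downstream reads and writes stay the same apart from the scheme.
    clicks.groupBy("page_id").count()
      .write.mode("overwrite")
      .parquet("alluxio://alluxio-master:19998/warehouse/clicks_agg/dt=20190314")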
  38. 38. • Async persist to remote HDFS • Avoid duplicated writes in user code/SQL • Put the Hadoop /tmp/ directory on Alluxio over SSD to reduce NN RPC and load on DNs • Cache hot/warm data for Presto; heavy traffic and ad hoc queries are very sensitive to HDFS stability
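One way to get the asynchronous write-back without touching user SQL is to set the Alluxio client write type through the Hadoop configuration the job already carries; a hedged Scala sketch (property value per the Alluxio docs; `spark`, `ordersDelta` and the path are illustrative):

    // Write to Alluxio first; data is persisted to the remote HDFS asynchronously.
    spark.sparkContext.hadoopConfiguration
      .set("alluxio.user.file.writetype.default", "ASYNC_THROUGH")

    ordersDelta.write
      .mode("append")
      .parquet("alluxio://alluxio-master:19998/warehouse/orders_delta")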
  39. 39. Questions? Alluxio Bay Area Meetup @alluxio alluxio.org/slack info@alluxio.com WeChat
