
Data processing at the speed of 100 Gbps@Apache Crail (Incubating)

Once the staple of HPC clusters, high-performance network and storage devices are now everywhere. For a fraction of the cost, one can rent 40/100 Gbps RDMA networks and high-end NVMe flash devices delivering tens of GB/s of bandwidth, sub-100-microsecond latencies, and millions of IOPS. How does one leverage this phenomenal performance in the popular data processing frameworks, such as Apache Spark, Flink, and Hadoop, that we all know and love?

In this talk, I will introduce Apache Crail (Incubating), a fast, distributed data store designed specifically for high-performance network and storage devices. The goal of the project is to deliver the true hardware performance to Apache data processing frameworks in the most accessible way. With its modular design, Crail supports multiple storage back ends (DRAM, NVMe flash, and 3D XPoint) and networking protocols (RDMA and TCP/sockets). Crail provides multiple flexible APIs (file system, KV, HDFS, streaming) for better integration with the high-level data access operations of Apache compute frameworks. As a result, on a 100 Gbps network infrastructure, Crail delivers all-to-all shuffle operations at 80+ Gbps, broadcast operations at less than 10 usec latency, and more than 8M lookups per second per namenode. Moreover, Crail is a generic solution that integrates well with the Apache ecosystem, including frameworks such as Spark, Hadoop, and Hive.

I will present the case for Crail, its current status, and future plans. As Crail is a young Apache project, we are seeking to build a community and expand its application to other interesting domains.

Speaker
Animesh Trivedi, IBM Research, Research Staff Member (RSM)


  1. Data Processing at the Speed of 100 Gbps @ Apache Crail (Incubating), http://crail.incubator.apache.org/ — Animesh Trivedi, IBM Research, Zurich
  2. The Performance Landscape. Trend 1: I/O performance is increasing dramatically.
  3. The Performance Landscape. Trend 2: I/O diversity is increasing dramatically.
  4. The Key Challenge: given the current hardware performance and diversity landscape, how can we orchestrate data movement at the speed of hardware? In other words: "Can we feed modern data processing stacks at 100+ Gbps data speeds with 1-10 usec access latencies?"
  5. Apache Crail (Incubating)
     • A distributed data "store" platform designed from scratch to "share" data at the speed of hardware
     • "store": in DRAM/Flash/3DXP, with multiple front-end APIs to storage
     • "share": intermediate data, both intra-job (shuffle, broadcast) and inter-job
     • Aims to speed up workloads by accelerating data sharing
     • An effort to unify isolated and incompatible efforts to integrate high-performance devices into big data frameworks
     • Written in Java 8, with multiple client-language support
     • Started at the IBM Research Lab, Zurich
     • An Apache Incubator project since November 2017
  6. System Overview – Crail Store (diagram: a client issues control RPCs to the namenode and data operations to the datanodes)
     • Namenode: logically centralized; keeps track of metadata
     • Clients: access metadata from the namenode; access data from the datanodes
     • Datanodes: donate storage and network resources to Crail; mostly dumb/stateless servers
  7. Apache Crail: Datanode (RDMA, TCP, DPDK, ...)
     • A datanode is responsible for exposing storage and networking resources to the system:
       – DRAM + RDMA [1], TCP/sockets
       – Flash + NVMeF [2]
       – Local DRAM + memcpy
       – Local flash + SPDK
     • Accepts client connections and serves data
     • Sends periodic heartbeat pings to the namenode
  8. Apache Crail: Namenode
     • Responsible for keeping track of storage resources in the cluster, managed as blocks (e.g., 1 MB block size)
     • Maintains a hierarchical node tree: directories, data files, stream files, multifiles, tables, KV
     • Clients connect to the namenode to create new nodes and to look up, read, and write nodes
     • Allocation policy spans different media and node types; a node can contain blocks from both DRAM and flash
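To make the block-based bookkeeping concrete, here is a minimal sketch (not the actual Crail namenode code) of how a file offset might be mapped onto fixed-size blocks and placed across datanodes, assuming the 1 MB block size mentioned on the slide and a toy round-robin placement policy:

```java
// Hypothetical sketch of namenode-style block bookkeeping; the class and
// method names are illustrative, not Crail's real internals.
class BlockMapper {
    static final int BLOCK_SIZE = 1 << 20; // 1 MB, as on the slide

    // Which block index does a byte offset in the file fall into?
    static long blockIndex(long fileOffset) {
        return fileOffset / BLOCK_SIZE;
    }

    // Offset within that block.
    static int blockOffset(long fileOffset) {
        return (int) (fileOffset % BLOCK_SIZE);
    }

    // Toy placement policy: round-robin blocks over the available datanodes.
    // Crail's actual allocation policy also considers media and node types.
    static int datanodeFor(long blockIdx, int numDatanodes) {
        return (int) (blockIdx % numDatanodes);
    }
}
```

With this layout, a client that knows the block metadata can compute the target datanode and intra-block offset for any read without further namenode involvement.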
  9. Apache Crail: Clients
     • Clients read and write data, either standalone or as part of a compute framework
     • Single-writer files without holes
     • Distributed clients often implicitly index and synchronize on the file/directory name
     • Multiple client-side storage abstractions/interfaces: a simple hierarchical file system, a key-value (KV) store, HDFS, and streaming (WiP)
  10. So where does the performance come from?
  11. Performance Principles – I (diagram: NIC → Ethernet → IP/TCP → sockets in the OS kernel, at 100 Gbps and 1 usec latency)
  12. Performance Principles – I (diagram: the same stack with a container, JVM, HDFS, Alluxio, Spark, and TeraSort layered on top — can the JVM keep up?)
  13. Performance Principles – I
      1. Use high-performance user-level I/O: RDMA [1], NVMeF [2], SPDK/DPDK in Java — one-sided RDMA read/write operations
  14. Performance Principles – II (diagram: an object traversing the network)
  15. Performance Principles – II (diagram: along the way, the object incurs buffering, copies, allocation, and (de)serialization)
  16. Performance Principles – II
      2. Careful software design for the μsec era: software time budgeting, pre-allocation of buffers, buffer caching and reuse, ownership, etc.
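The "pre-allocation of buffers, buffer caching and reuse" point can be sketched with a small buffer pool: buffers are allocated once up front and recycled, so the hot path never touches the allocator or stresses the garbage collector. The class below is an illustrative sketch, not Crail's actual buffer cache:

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ConcurrentLinkedQueue;

// Illustrative buffer pool for the "pre-allocate and reuse" principle.
// Direct (off-heap) buffers are the kind that RDMA NICs can address.
class BufferPool {
    private final ConcurrentLinkedQueue<ByteBuffer> free = new ConcurrentLinkedQueue<>();

    BufferPool(int count, int size) {
        for (int i = 0; i < count; i++) {
            free.add(ByteBuffer.allocateDirect(size)); // pay allocation cost once
        }
    }

    // Take ownership of a buffer; no allocation happens on this path.
    ByteBuffer take() {
        ByteBuffer b = free.poll();
        if (b == null) throw new IllegalStateException("pool exhausted");
        b.clear();
        return b;
    }

    // Return ownership; the caller must not touch the buffer afterwards.
    void give(ByteBuffer b) {
        free.add(b);
    }
}
```

The explicit take/give ownership hand-off is exactly the kind of discipline the slide's "ownership" keyword refers to: at microsecond budgets, even a stray allocation or copy is visible in the latency profile.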
  17. Performance Principles – III (diagram: a client reading from four datanodes)
  18. Performance Principles – III (diagram: multiple clients reading from the same set of datanodes)
  19. Performance Principles – III
      3. Careful data orchestration in the cluster: randomization, async requests, compute-I/O overlap, smart buffering, data allocation policies, etc.
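The "async requests, compute-I/O overlap" idea can be sketched in a few lines: issue the fetch for the next chunk before processing the current one, so the network and the CPU are busy at the same time. This is a generic sketch of the pattern, not Crail code; `fetchChunk` stands in for an asynchronous remote read such as a one-sided RDMA read:

```java
import java.util.concurrent.CompletableFuture;

// Sketch of compute-I/O overlap via prefetching. fetchChunk is a stand-in
// for an asynchronous remote read (e.g., a one-sided RDMA read).
class OverlapDemo {
    static CompletableFuture<int[]> fetchChunk(int id) {
        return CompletableFuture.supplyAsync(() -> new int[]{id, id + 1});
    }

    static long processAll(int numChunks) {
        long sum = 0;
        CompletableFuture<int[]> inFlight = fetchChunk(0);   // kick off first fetch
        for (int i = 0; i < numChunks; i++) {
            int[] chunk = inFlight.join();                   // wait for current chunk
            if (i + 1 < numChunks) {
                inFlight = fetchChunk(i + 1);                // prefetch the next one...
            }
            for (int v : chunk) {                            // ...while we compute
                sum += v;
            }
        }
        return sum;
    }
}
```

When the per-chunk compute time roughly matches the fetch time, this simple one-deep pipeline hides almost all of the I/O latency; deeper pipelines trade buffer memory for more slack.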
  20. Performance Principles – Recap
      1. Use high-performance user-level I/O
      2. Careful software design for the μsec era
      3. Careful data orchestration in the cluster
  21. But what if I don't have … RDMA? (recap of the same three principles)
  22. The Full Apache Crail (Incubating) Stack (diagram: data processing frameworks such as Spark, TensorFlow, and λ compute sit on Crail Store's APIs — FS with a hierarchical namespace, KV, HDFS, streaming; storage tiers: DRAM, NVMe, 3DXP, GPU, …; transports: RDMA, TCP, NVMf, DPDK; all over a fast network such as 100 Gb/s RoCE, delivering 10s of GB/s at ~10 µsec)
  23. Use in Apache Spark – Broadcast (diagram: the Spark driver writes the broadcast value to Crail Store under /job_id/bc/1/ via an RDMA/NVMeF write; Spark executors 0-2 fetch it with RDMA/NVMeF reads)
  24. Use in Apache Spark – Shuffle: a Spark shuffle plugin [4] that writes and reads data in Crail (diagram: Spark executors 0-2 write map output to /job_id/shuffle/0/, /job_id/shuffle/1/, and /job_id/shuffle/2/ in Crail Store, then read their partitions back)
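The shuffle paths on the slide illustrate how executors can rendezvous purely through Crail's hierarchical namespace: each map task writes its output for reduce partition r under a well-known path, and reducers find their inputs by name alone, with no separate shuffle-tracking service. The sketch below shows one such naming scheme; the exact layout is illustrative, not the plugin's actual one:

```java
// Hypothetical path scheme for shuffle data in a Crail-like hierarchical
// namespace. The /job_id/shuffle/<partition>/ prefix follows the slide;
// the per-map-task suffix is an assumption for illustration.
class ShufflePaths {
    static String mapOutputPath(String jobId, int reducePartition, int mapTask) {
        return "/" + jobId + "/shuffle/" + reducePartition + "/map_" + mapTask;
    }

    // A reducer only needs to list this directory to find all its inputs.
    static String reduceInputDir(String jobId, int reducePartition) {
        return "/" + jobId + "/shuffle/" + reducePartition + "/";
    }
}
```

This is the "distributed clients implicitly index and synchronize on the file/directory name" idea from slide 9 applied to shuffle: the namespace itself is the index.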
  25. Performance Numbers
      • Baseline performance for Apache Crail Store [6]: latency, bandwidth, and IOPS numbers in a distributed setting in the JVM
      • Crail + Spark integration, available at [4]:
        – Micro-operations: broadcast and GroupBy
        – Workloads: TeraSort and SQL JOIN
        – Mixed DRAM and flash setting
      • Hardware: a mix of x86 and POWER8 machines, 100 GbE network, 256 GB DDR3 DRAM, and NVMe flash
  26. Crail Store – DRAM "File Read Bandwidth": Crail delivers full hardware bandwidth from DRAM
  27. Crail Store – DRAM "File Read Latency": Crail delivers ultra-low file/data access latencies in the JVM
  28. Crail Store – NVMeF "File Read Bandwidth": with smart buffering and I/O scheduling, Crail delivers full hardware bandwidth from remote NVMe flash
  29. Crail Store – NVMeF "File Read Latency": Crail delivers native NVMe access performance in the JVM
  30. Crail Store – Metadata IOPS "getFile operation": Crail offers (almost) linear scalability with namenode count, reaching 8-9M IOPS
  31. What does data processing-level performance look like? (Apache Crail Store (Incubating) underneath the Apache Spark shuffle and broadcast modules, running TeraSort and SQL JOIN)
  32. Crail Workload - Spark - Broadcast: Crail offers 10-100x performance gains for Spark broadcasts
  33. Crail Workload - Spark - GroupBy: Crail offers 2-5x performance gains for Spark shuffle
  34. Crail Workload - Spark - TeraSort: Crail offers 6x performance gains for Spark TeraSort [7]
  35. Crail Workload - Spark - TeraSort (continued)
  36. Crail Workload - Spark - EquiJoin: Crail offers 2x performance gains for SQL JOIN
  37. Crail Store – Flash Disaggregation: Crail offers the possibility of storage disaggregation
  38. Current Status and Future Plans
      • Apache Incubator project since November 2017
      • First source release (tag: db64fd) a few weeks ago
      • Plenty of opportunities:
        – Multiple language support: C++ (WiP), Python, Rust, _your_fav_language_
        – Multiple framework integrations: Flink, Hadoop, λ compute, TensorFlow, SnapML [5], _your_fav_framework_here_
        – Multiple datanode (network and storage) integrations
        – Automated testing and packaging framework
        – Deploying in the cloud / containerization
        – JVM/runtime optimizations
      • And all the fun stuff that you know and love about Apache projects!
  39. Thanks to … Patrick Stuedi, Jonas Pfefferle, Michael Kaufmann, Adrian Schuepbach, Bernard Metzler
  40. Thank you!
      • Website: http://crail.incubator.apache.org/
      • Blog: http://crail.incubator.apache.org/blog/
      • Mailing list: dev@crail.incubator.apache.org
      • JIRA: https://issues.apache.org/jira/browse/CRAIL
      • Slack: https://the-asf.slack.com/messages/C8VDLDWMV
      See you all on the mailing list :)
  41. References
      [1] DiSNI: Direct Storage and Networking Interface, https://github.com/zrlio/disni
      [2] jNVMf: An NVMe-over-Fabrics library for Java, https://github.com/zrlio/jNVMf
      [3] "Crail: A High-Performance I/O Architecture for Distributed Data Processing", IEEE Bulletin of the Technical Committee on Data Engineering, Special Issue on Distributed Data Management with RDMA, Volume 40, pages 40-52, March 2017, http://sites.computer.org/debull/A17mar/p38.pdf
      [4] Crail I/O acceleration plugins for Apache Spark, https://github.com/zrlio/crail-spark-io
      [5] Snap Machine Learning (SnapML), https://www.zurich.ibm.com/snapml/
      [6] Apache Crail (Incubating) performance blogs:
          Part I: DRAM, http://crail.incubator.apache.org/blog/2017/08/crail-memory.html
          Part II: NVMeF, http://crail.incubator.apache.org/blog/2017/08/crail-nvme-fabrics-v1.html
          Part III: Metadata, http://crail.incubator.apache.org/blog/2017/11/crail-metadata.html
      [7] Sorting on a 100 Gbit/s Cluster using Spark/Crail, http://crail.incubator.apache.org/blog/2017/01/sorting.html
  42. Notice: IBM is a trademark of International Business Machines Corporation, registered in many jurisdictions worldwide. Intel and Intel Xeon are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates. Other products and service names might be trademarks of IBM or other companies.
  43. Backup
