Seastore: Next Generation Backing Store for Ceph (ScyllaDB)
Ceph is an open source distributed file system addressing file, block, and object storage use cases. Next-generation storage devices require a change in strategy, so the community has been developing crimson-osd, an eventual replacement for ceph-osd intended to minimize CPU overhead and improve throughput and latency. Seastore is a new backing store for crimson-osd targeted at emerging storage technologies, including persistent memory and ZNS devices.
BGP (Border Gateway Protocol) is a standardized exterior gateway protocol designed to
exchange routing and reachability information between autonomous systems (AS) on the Internet.
BGP makes routing decisions based on paths, network policies, or rule-sets configured by a
network administrator, and is involved in making core routing decisions.
BGP is a very robust and scalable routing protocol, as evidenced by the fact that BGP is the routing
protocol employed on the Internet.
Lambda Architecture has been a common way to build data pipelines for a long time, despite difficulties in maintaining two complex systems. An alternative, Kappa Architecture, was proposed in 2014, but many companies are still reluctant to switch to Kappa. And there is a reason for that: even though Kappa generally provides a simpler design and similar or lower latency, there are a lot of practical challenges in areas like exactly-once delivery, late-arriving data, historical backfill and reprocessing.
In this talk, I want to show how you can solve those challenges by embracing Apache Kafka as a foundation of your data pipeline and leveraging modern stream-processing frameworks like Apache Flink.
Takaya Saeki, Yuichi Nishiwaki, Takahiro Shinagawa, Shinichi Honiden.
A Robust and Flexible Operating System Compatibility Architecture.
In Proceedings of the 16th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE 2020), Mar 2020.
doi:10.1145/3381052.3381327
From: DataWorks Summit 2017 - Munich - 20170406
HBase has established itself as the backend for many operational and interactive use-cases, powering well-known services that support millions of users and thousands of concurrent requests. In terms of features HBase has come a long way, offering advanced options such as multi-level caching on- and off-heap, pluggable request handling, fast recovery options such as region replicas, table snapshots for data governance, tunable write-ahead logging and so on. This talk is based on the research for an upcoming second release of the speaker's HBase book, correlated with practical experience in medium to large HBase projects around the world. You will learn how to plan for HBase, starting with the selection of the matching use-cases, to determining the number of servers needed, leading into performance tuning options. There is no reason to be afraid of using HBase, but knowing its basic premises and technical choices will make using it much more successful. You will also learn about many of the new features of HBase up to version 1.3, and where they are applicable.
Presentation delivered at LinuxCon China 2016
UEFI HTTP/HTTPS Boot is a new feature of UEFI 2.5+. However, this feature is not yet implemented in any Linux bootloader. This Birds of a Feather session will give an introduction to UEFI HTTP/HTTPS Boot, and share a proof-of-concept implementation based on grub2 that works on both the emulator (QEMU/OVMF) and HPE ProLiant Gen10 servers.
For HTTPS, the experience and comparison will be shared between the purely software-based and UEFI-based implementations in the aspects of ease of implementation, security strength, and limitation.
How to Performance-Tune Apache Spark Applications in Large Clusters (Databricks)
Omkar Joshi offers an overview on how performance challenges were addressed at Uber while rolling out its newly built flagship ingestion system, Marmaray (open-sourced) for data ingestion from various sources like Kafka, MySQL, Cassandra, and Hadoop.
This document provides an overview of a new CPU capability called Intel® Speed Select
Technology – Base Frequency (Intel® SST-BF), which is available on select SKUs of 2nd
generation Intel® Xeon® Scalable processor (formerly codenamed Cascade Lake). The
document also includes benchmarking data and instructions on how to enable the
capability.
Value propositions of this capability include:
• Select SKUs of 2nd generation Intel® Xeon® Scalable processor (5218N, 6230N, and
6252N) offer a new capability called Intel® SST-BF.
• Intel® SST-BF allows the CPU to be deployed with an asymmetric core frequency
configuration.
• The placement of key workloads on higher frequency Intel® SST-BF enabled cores
can result in an overall system workload increase and potential overall energy
savings when compared to deploying the CPU with symmetric core frequencies.
Many network operators still struggle with which type of data-plane encoding they should use for segment routing. The world is hyper-connected and we can’t afford to be late to deliver 5G. Using IPv4, IPv6 and MPLS data-plane encoding keeps us moving forward.
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ... (Databricks)
Uber has real needs to provide faster, fresher data to data consumers & products, running hundreds of thousands of analytical queries every day. Uber engineers will share the design, architecture & use-cases of the second generation of ‘Hudi’, a self-contained Apache Spark library to build large scale analytical datasets designed to serve such needs and beyond. Hudi (formerly Hoodie) is created to effectively manage petabytes of analytical data on distributed storage, while supporting fast ingestion & queries. In this talk, we will discuss how we leveraged Spark as a general purpose distributed execution engine to build Hudi, detailing tradeoffs & operational experience. We will also show how to ingest data into Hudi using Spark Datasource/Streaming APIs and build Notebooks/Dashboards on top using Spark SQL.
This lesson describes the concept of VPN and introduces some VPN terminology.
Importance
This lesson is the foundation lesson for the MPLS VPN Curriculum.
Objectives
Upon completion of this lesson, the learner will be able to perform the following
tasks:
■ Describe the concept of VPN
■ Explain VPN terminology as defined by MPLS VPN architecture
These are the slides from a tutorial I presented at LOPSA-East in 2013. It covers spinning media and solid state drives in detail.
A video of the presentation can be found on YouTube: http://www.youtube.com/watch?v=G3wf1HMr6b0
Solid State Drives - Seminar Report for Semester 6 Computer Engineering - VIT... (ravipbhat)
This report is intended as a guide to emerging solid state storage technology, in particular, to the introduction of solid state drives.
Adding a solid-state drive (SSD) to your computer is simply the best upgrade at your disposal, capable of speeding up your computer in ways you hadn't thought possible. But as with any new technology, there's plenty to learn.
Consumers are no longer limited to accepting pre-configured systems; even when purchasing a complete system, they should have an avenue to understand what purpose the storage device inside serves, as well as how it works.
A solid-state drive (SSD) is a data storage device for your computer.
In everyday use, it provides the same functionality as a traditional hard disk drive (HDD)—the standard for computer storage for many years.
Amazon Aurora is a MySQL-compatible relational database engine that combines the speed and availability of high-end commercial databases with the simplicity and cost-effectiveness of open source databases. Amazon Aurora is a disruptive technology in the database space, bringing a new architectural model and distributed systems techniques to provide far higher performance, availability and durability than previously available using conventional monolithic database techniques. In this session, we will do a deep-dive into some of the key innovations behind Amazon Aurora, discuss best practices and configurations, and share early customer experience from the field.
Accelerating HBase with NVMe and Bucket Cache (David Grier)
This set of slides describes some initial experiments we designed to discover performance improvements in Hadoop technologies using NVMe technology.
Accelerating HBase with NVMe and Bucket Cache (Nicolas Poggi)
The Non-Volatile Memory express (NVMe) standard promises an order of magnitude faster storage than regular SSDs, while at the same time being more economical than regular RAM on TB/$. This talk evaluates the use cases and benefits of NVMe drives for use in Big Data clusters with HBase and Hadoop HDFS.
First, we benchmark the different drives using system level tools (FIO) to get maximum expected values for each different device type and set expectations. Second, we explore the different options and use cases of HBase storage and benchmark the different setups. And finally, we evaluate the speedups obtained by the NVMe technology for the different Big Data use cases from the YCSB benchmark.
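A device-level FIO baseline of the sort described might look like the following job file; all values (filename, sizes, queue depth, runtime) are illustrative assumptions, not the benchmark's actual configuration:

```ini
; Hypothetical fio job: 4 KiB random reads to establish a per-device ceiling.
; Run with: fio randread-4k.fio
[global]
ioengine=libaio
direct=1
time_based
runtime=30

[randread-4k]
rw=randread
bs=4k
iodepth=32
numjobs=4
size=1G
filename=/tmp/fio-testfile
```

Comparing the resulting IOPS and latency numbers across SATA SSD and NVMe devices sets the expectations referred to above before any HBase-level testing.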
In summary, while the NVMe drives show up to 8x speedup in best case scenarios, testing the cost-efficiency of new device technologies is not straightforward in Big Data, where we need to overcome system level caching to measure the maximum benefits.
2. Design Tradeoffs for SSD Performance. Ted Wobber, Principal Researcher, Microsoft Research, Silicon Valley.
3. Rotating Disks vs. SSDs: We have a good model of how rotating disks work… what about SSDs?
4. Rotating Disks vs. SSDs, main take-aways: Forget everything you knew about rotating disks. SSDs are different. SSDs are complex software systems. One size doesn’t fit all.
6. Will SSDs Fix All Our Storage Problems? Excellent read latency and sequential bandwidth; lower $/IOPS/GB; improved power consumption; no moving parts; form factor, noise, … Performance surprises?
7. Performance Surprises: Latency/bandwidth (“How fast can I read or write?”) Surprise: random writes can be slow. Persistence (“How soon must I replace this device?”) Surprise: flash blocks wear out.
8. What’s in This Talk: Introduction; background on NAND flash and SSDs; points of comparison with rotating disks (write-in-place vs. write-logging, moving parts vs. parallelism, failure modes); conclusion.
9. What’s *NOT* in This Talk: Windows; analysis of specific SSDs; cost; power savings.
10. Full Disclosure: A “black box” study based on the properties of NAND flash; a trace-based simulation of an “idealized” SSD. Workloads: TPC-C, Exchange, Postmark, IOzone.
11. Background, NAND flash blocks: A flash block is a grid of cells (4096 + 128 bit-lines by 64 page lines). Erase: quantum release for all cells. Program: quantum injection for some cells. Read: NAND operation with a page selected. Bits can’t be reset to 1 except with an erase.
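The erase/program asymmetry above is easy to model. Here is a minimal sketch; the class name, page count, and bit width are illustrative, not taken from the talk:

```python
# Toy model of a NAND flash block: "program" can only clear bits
# (1 -> 0); only a whole-block erase resets every bit back to 1.

class FlashBlock:
    def __init__(self, pages=64, page_bits=16):
        self.pages = pages
        self.page_bits = page_bits
        # Erased state: every cell holds 1.
        self.data = [[1] * page_bits for _ in range(pages)]

    def erase(self):
        # Quantum release for all cells: the whole block returns to 1s.
        self.data = [[1] * self.page_bits for _ in range(self.pages)]

    def program(self, page, bits):
        # Quantum injection for some cells: 1 -> 0 only, never 0 -> 1.
        for i, b in enumerate(bits):
            if b == 1 and self.data[page][i] == 0:
                raise ValueError("cannot set a 0 bit back to 1 without erase")
            self.data[page][i] &= b

    def read(self, page):
        return list(self.data[page])
```

Overwriting a page in place therefore fails as soon as any bit must go from 0 back to 1, which is exactly why SSDs log writes to fresh pages instead.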
12. Background, 4GB flash package (SLC): dies contain multiple planes (plane 0 through plane 3), each with its own register, sharing a serial-out path. MLC (multiple bits per cell): slower, less durable.
15. Write-in-Place vs. Logging: Rotating disks keep a constant map from LBA to on-disk location. SSDs write always to new locations; superseded blocks are cleaned later.
16. Log-based Writes, map granularity = 1 block: with an LBA-to-block map, a page update forces a read-modify-write of the whole block, in the foreground: write amplification.
17. Log-based Writes, map granularity = 1 page: with an LBA-to-page map, blocks must be cleaned in the background: write amplification.
18. Log-based Writes, simple simulation result: with map granularity = flash block (256KB), TPC-C average I/O latency = 20 ms; with map granularity = flash page (4KB), TPC-C average I/O latency = 0.2 ms.
19. Log-based Writes, block cleaning: move valid pages so the block can be erased. Cleaning efficiency: choose blocks to minimize page movement.
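The page-granularity map and greedy cleaning described in slides 17 through 19 can be sketched as follows. The structure, names, and cleaning trigger are my simplifications, not the talk's simulator:

```python
# Sketch of a page-mapped, log-structured FTL with greedy cleaning,
# to make the write-amplification idea concrete.

class PageMappedFTL:
    """Assumes the device never becomes completely full, so a cleaning
    victim with few valid pages always exists."""

    def __init__(self, nblocks=4, pages_per_block=4):
        self.ppb = pages_per_block
        self.free = list(range(1, nblocks))           # block 0 starts active
        self.active, self.fill = 0, 0
        self.map = {}                                 # LBA -> (block, page)
        self.valid = {b: {} for b in range(nblocks)}  # block -> {page: LBA}
        self.host_writes = 0
        self.flash_writes = 0

    def write(self, lba):
        self.host_writes += 1
        self._program(lba)
        if not self.free:          # out of free blocks: clean one
            self._clean()

    def _program(self, lba):
        if self.fill == self.ppb:  # active block full: take a free one
            self.active = self.free.pop()
            self.fill = 0
        old = self.map.get(lba)
        if old is not None:        # supersede the old copy of this LBA
            del self.valid[old[0]][old[1]]
        loc = (self.active, self.fill)
        self.fill += 1
        self.map[lba] = loc
        self.valid[self.active][loc[1]] = lba
        self.flash_writes += 1     # every flash program is counted

    def _clean(self):
        # Greedy cleaning: erase the block with the fewest valid pages,
        # relocating those pages first -- the relocations are the
        # write amplification.
        victim = min((b for b in self.valid if b != self.active),
                     key=lambda b: len(self.valid[b]))
        for lba in list(self.valid[victim].values()):
            self._program(lba)
        self.valid[victim] = {}
        self.free.append(victim)
```

Running a skewed workload (a few hot LBAs overwritten repeatedly, some cold LBAs written once) makes `flash_writes` exceed `host_writes`, since cleaning must relocate the cold pages out of its victims.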
20. Over-provisioning, putting off the work: Keep extra (unadvertised) blocks. Reduces “pressure” for cleaning, improves foreground latency, and reduces write amplification due to cleaning.
21. Delete Notification, avoiding the work: The SSD doesn’t know which LBAs are in use, so the logical disk is always full! If the SSD can know which pages are unused, these can be treated as “superseded”: better cleaning efficiency and de-facto over-provisioning. The “Trim” API is an important step forward.
23. LBA Map Tradeoffs: Large granularity: simple, small map size, low overhead for sequential write workloads, but foreground write amplification (read-modify-write). Fine granularity: complex, large map size, tolerates random write workloads, but background write amplification (cleaning).
24. Write-in-Place vs. Logging, summary: Rotating disks keep a constant map from LBA to on-disk location. SSDs use a dynamic LBA map with various possible strategies; the best strategy is deeply workload-dependent.
26. Moving Parts vs. Parallelism: Rotating disks minimize seek time and the impact of rotational delay. SSDs maximize the number of operations in flight while keeping the chip interconnect manageable.
27. Improving IOPS, strategies: Request-queue sort by sector address; defragmentation; application-level block ordering. Defragmentation for cleaning efficiency is unproven: the next write might re-fragment. One request at a time per disk head; null seek time.
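The request-queue sort mentioned above can be illustrated with a one-way elevator pass. This is a simplified sketch with made-up request numbers; real disk schedulers also track direction and deadlines:

```python
# Toy request-queue sort by sector address (one-way elevator pass):
# serve requests at or beyond the head in ascending order, then wrap
# around to the lowest remaining sectors.

def elevator_order(head_pos, requests):
    ahead = sorted(r for r in requests if r >= head_pos)
    behind = sorted(r for r in requests if r < head_pos)
    return ahead + behind

print(elevator_order(50, [10, 95, 60, 20, 75]))  # [60, 75, 95, 10, 20]
```

Ordering by sector address minimizes total seek distance on a rotating disk; on an SSD, with null seek time, this whole class of optimization buys nothing.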
28. Flash Chip Bandwidth: The serial interface is the performance bottleneck; reads are constrained by the 8-bit serial bus: 25ns/byte = 40 MB/s (not so great).
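The quoted ceiling follows directly from the per-byte transfer time:

```python
# One byte every 25 ns over the serial bus caps single-chip read bandwidth.
ns_per_byte = 25
mb_per_s = 1e9 / ns_per_byte / 1e6  # bytes per second, expressed in MB/s
print(mb_per_s)  # 40.0
```

Hence the deck's emphasis on parallelism: a single chip's serial interface cannot deliver competitive bandwidth on its own.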
31. Operations in Parallel: SSDs are akin to RAID controllers, with multiple onboard parallel elements. Multiple request streams are needed to achieve maximal bandwidth; cleaning runs on inactive flash elements. Non-trivial scheduling issues: much like “Log-Structured File System”, but at a lower level of the storage stack.
32. Interleaving: Concurrent ops on a package or die, e.g., a register-to-flash “program” on die 0 concurrent with a serial-line transfer on die 1: 25% extra throughput on reads, 100% on writes. Erase is slow and can be concurrent with other ops.
33. Interleaving, simulation: TPC-C and Exchange: no queuing, no benefit. IOzone and Postmark: the sequential I/O component results in queuing, and throughput increases.
34. Intra-plane Copy-back: Block-to-block transfer internal to the chip, but only within the same plane! Cleaning on-chip! Optimizing for this can hurt load balance and conflicts with striping, but data needn’t cross the serial I/O pins.
35. Cleaning with Copy-back, simulation: Using the copy-back operation for intra-plane transfer, TPC-C shows a 40% improvement in cleaning costs. No benefit for IOzone and Postmark, which already have perfect cleaning efficiency.
36. Ganging: Optimally, all flash chips are independent; in practice, there are too many wires! Flash packages can share a control bus, with or without separate data channels, with operations in lock-step or coordinated: shared-control gang vs. shared-bus gang.
38. Parallelism Tradeoffs: No one scheme is optimal for all workloads. With a faster serial connect, intra-chip ops are less important.
39. Moving Parts vs. Parallelism, summary: Rotating disks: seek and rotational optimization, with built-in assumptions everywhere. SSDs: operations in parallel are key; lots of opportunities for parallelism, but with tradeoffs.
41. Failure Modes, rotating disks: Media imperfections, loose particles, vibration. Latent sector errors [Bairavasundaram 07], e.g., with uncorrectable ECC: the frequency of affected disks increases linearly with time; most affected disks (80%) have < 50 errors; temporal and spatial locality; correlation with recovered errors. Disk scrubbing helps.
42. Failure Modes, SSDs: Types of NAND flash errors (mostly when erases > wear limit): write errors (probability varies with # of erasures); read disturb (increases with # of reads); data retention errors (charge leaks over time). Little spatial or temporal locality (within equally worn blocks). Better ECC can help. Errors increase with wear: need wear-leveling.
46. Wear-leveling, modified “greedy” algorithm: Track an expiry meter per block and migrate cold content into worn blocks. If Remaining(A) < Throttle-Threshold, reduce the probability of cleaning A. If Remaining(A) < Migrate-Threshold, clean A, but migrate cold data into A. If Remaining(A) >= Migrate-Threshold, clean A.
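One way to read the three threshold rules is as a per-block victim-selection policy. The sketch below encodes that reading; the threshold values, the skip probability, and the interpretation that Migrate-Threshold < Throttle-Threshold are my assumptions, not figures from the talk:

```python
# Sketch of the modified-greedy wear-leveling decision for a candidate
# block, given its remaining lifetime (erase budget left).
import random

THROTTLE_THRESHOLD = 20   # illustrative units of remaining lifetime
MIGRATE_THRESHOLD = 10

def clean_decision(remaining, rng=random.random):
    """Return (clean_block, migrate_cold_data_in) for one candidate."""
    if MIGRATE_THRESHOLD <= remaining < THROTTLE_THRESHOLD:
        # Worn, but not critically: probabilistically skip cleaning
        # so this block wears more slowly (rate-limiting).
        if rng() < 0.75:
            return (False, False)
        return (True, False)
    if remaining < MIGRATE_THRESHOLD:
        # Nearly expired: clean it, then park cold data there so it
        # stops being rewritten.
        return (True, True)
    return (True, False)      # plenty of life left: normal greedy cleaning
```

Under this reading, rate-limiting keeps worn blocks out of the cleaning rotation most of the time, and cold-data migration retires nearly expired blocks from further wear, matching the results on the next slide.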
47. Wear-leveling Results: Fewer blocks reach expiry with rate-limiting. Smaller standard deviation of remaining lifetimes with cold-content migration. Cost of migrating cold pages: ~5% average latency. [Chart: block wear in IOzone]
48. Failure Modes, summary: Rotating disks: reduced media tolerances; scrubbing to deal with latent sector errors. SSDs: better ECC; wear-leveling is critical. Does greater density mean more errors?
49. Rotating Disks vs. SSDs: Don’t think of an SSD as just a faster rotating disk. It is a complex firmware/hardware system with substantial tradeoffs.
51. Call to Action: Users need help in rationalizing workload-sensitive SSD performance: operation latency, bandwidth, persistence. One size doesn’t fit all; manufacturers should help users determine the right fit, open the “black box” a bit, and provide software-visible metrics.
53. Additional Resources: USENIX paper: http://research.microsoft.com/users/vijayanp/papers/ssd-usenix08.pdf. SSD simulator download: http://research.microsoft.com/downloads. Related session: ENT-C628, Solid State Storage in Server and Data Center Environments (2pm, 11/5).
54. Please Complete a Session Evaluation Form. Your input is important! Visit the WinHEC CommNet and complete a Session Evaluation for this session to be entered to win one of 150 Maxtor® BlackArmor™ 160GB External Hard Drives (50 drives will be given away daily!). http://www.winhec2008.com. BlackArmor Hard Drives provided by: