- Sharding is a client-side translator that splits files into equally sized chunks or shards to improve performance and utilization of storage resources. It sits above the distributed hash table (DHT) in Gluster.
- Sharding benefits virtual machine image storage by allowing data healing and replication at the shard level for better scalability. It also distributes load more evenly across bricks.
- For general purpose use, sharding aims to maximize parallelism during writes while maintaining consistency through atomic operations and locking frameworks. Key challenges include updating file metadata without locking and handling operations like truncates and appends correctly across shards.
Sharding: Past, Present and Future with Krutika Dhananjay
3. What is striping?
● Client-side translator - sits below DHT
● There will be <stripe-count> copies of every striped file
● Each file split into <block-size> chunks.
● Consecutive chunks are spread across multiple piece files
in a round-robin fashion.
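The round-robin placement described above can be sketched in a few lines. This is only an illustration of the arithmetic, not the stripe translator's code; the function name `stripe_location` and the example block size are assumptions.

```python
from typing import Tuple


def stripe_location(offset: int, block_size: int, stripe_count: int) -> Tuple[int, int]:
    """Map a byte offset to (piece_index, block_number).

    block_number is the index of the <block-size> chunk containing the offset;
    piece_index is the piece file that chunk lands in under round-robin placement.
    """
    block_number = offset // block_size
    piece_index = block_number % stripe_count
    return piece_index, block_number


# Hypothetical settings: 128 KiB blocks striped across 4 piece files.
BLOCK = 128 * 1024
for off in (0, BLOCK, 4 * BLOCK, 5 * BLOCK):
    print(off, stripe_location(off, BLOCK, 4))
```

Consecutive chunks cycle through piece files 0, 1, 2, 3 and then wrap around, so chunk 4 lands back in piece file 0.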
5. Stripe translator - shortcomings
● Cost - you can add servers only in multiples of
‘stripe-count * replica-count’.
● File splitting not granular enough
○ Self-heal of a striped file must still heal
‘total_file_size/stripe_count’ bytes of data.
○ Geo-replication of a striped file must still sync
‘total_file_size/stripe_count’ bytes of data.
6. Stripe shortcomings contd ...
● Suboptimal utilization of disks
○ An ‘x’ TB sized file would still require at least
‘x/stripe-count’ amount of space available in any
subvolume of DHT
○ … in turn implies suboptimal distribution of IOPs
across bricks for a given file.
8. What is sharding?
● Client-side xlator – sits above DHT
● Splits file into equal-sized chunks as it grows in size
● Shards beyond first block kept in a hidden /.shard
directory and the first block under its parent dir
● Translators above shard only see the user files
● Translators below shard see shards as normal files
● Shard naming is <gfid>.<num>
● Shard size configurable at volume level - 4MB to 4TB
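As a rough sketch of the layout above, the following maps a byte offset to the file that holds it. It assumes that `<num>` in the shard name is the block index and that block 0 is the base file itself; the function name `shard_path` and the example gfid and path are hypothetical.

```python
MiB = 1024 ** 2


def shard_path(base_path: str, gfid: str, offset: int, shard_size: int) -> str:
    """Return the path of the file that stores the byte at `offset`.

    Assumption: block 0 lives in the base file under its parent directory,
    and block N (N >= 1) lives in /.shard/<gfid>.<N>.
    """
    index = offset // shard_size
    if index == 0:
        return base_path
    return f"/.shard/{gfid}.{index}"


# Hypothetical gfid and path, with a 4 MiB shard size.
gfid = "0f1e2d3c-0000-0000-0000-000000000000"
print(shard_path("/vms/disk.img", gfid, 1 * MiB, 4 * MiB))  # base file
print(shard_path("/vms/disk.img", gfid, 9 * MiB, 4 * MiB))  # third block
```

Because the shard name depends only on the gfid and the block index, nothing under /.shard refers to the user-visible file name.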
11. How sharding benefits the use case
● Granularity of data heal is at shard level
● Minimal resource utilization by background processes
(self-heal, geo-rep, etc)
● VM image size no longer limited by the capacity of
individual brick(s)
● Better distribution of IOPs across bricks
● Geo-rep can now operate at shard level
● Add new bricks only after existing bricks’ space is fully
utilized
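To put the heal-granularity point in numbers, here is a back-of-the-envelope comparison. The file size, stripe count, shard size and number of modified shards are all made-up values, and the helper names are assumptions.

```python
MiB = 1024 ** 2
GiB = 1024 ** 3
TiB = 1024 ** 4


def stripe_heal_bytes(file_size: int, stripe_count: int) -> int:
    """Data to heal for one piece file: total_file_size / stripe_count."""
    return file_size // stripe_count


def shard_heal_bytes(dirty_shards: int, shard_size: int) -> int:
    """Upper bound on data to heal when only some shards were modified."""
    return dirty_shards * shard_size


# Hypothetical 1 TiB VM image.
striped = stripe_heal_bytes(1 * TiB, stripe_count=4)
sharded = shard_heal_bytes(dirty_shards=3, shard_size=64 * MiB)
print(striped // GiB, "GiB to heal with striping")   # 256 GiB
print(sharded // MiB, "MiB to heal with sharding")   # 192 MiB
```

With sharding, the amount of data healed or synced tracks how much of the image changed rather than the size of the image.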
15. Where is the file metadata stored?
● File permissions, ownership, aggregated file size, block-count
and user-set extended attributes only maintained on the
base file. Shards under “.shard” owned by root.
● Since sharding is only used in the single-writer use case,
mtime is maintained on a best-effort basis in memory and
kept up-to-date as individual shards witness writes.
Moral of the story - lookup, stat, {get,set,remove}xattr are directly
served from the base file => 1 network call.
16. How does writing to a sharded file work?
● Create ‘.shard’ if it doesn’t exist.
● Identify participant shards, given write offset and length.
● Create shards if non-existent, in parallel.
● Send writes on participant shards at appropriate shard
offsets in parallel.
● Once all write responses are received, update size and
block count through an xattrop operation.
● Update in-memory cache containing the file size and
block-count and unwind the call.
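The "identify participant shards" step can be sketched as below. This is a minimal illustration of the offset arithmetic only. It does not create shards or send any writes, and the function name `participant_shards` and the tuple layout are assumptions.

```python
from typing import List, Tuple


def participant_shards(offset: int, length: int, shard_size: int) -> List[Tuple[int, int, int]]:
    """Split a write into per-shard pieces.

    Returns a list of (shard_index, offset_within_shard, piece_length),
    in file order. Shard index 0 stands for the base file.
    """
    pieces = []
    pos = offset
    end = offset + length
    while pos < end:
        index = pos // shard_size
        shard_offset = pos % shard_size
        # A piece ends at the shard boundary or at the end of the write.
        piece_len = min(shard_size - shard_offset, end - pos)
        pieces.append((index, shard_offset, piece_len))
        pos += piece_len
    return pieces


# Tiny shard size so the result is easy to read: a 7-byte write at offset 6.
print(participant_shards(offset=6, length=7, shard_size=4))
# [(1, 2, 2), (2, 0, 4), (3, 0, 1)]
```

Each tuple is an independent write to one shard, which is what lets the translator send them in parallel before updating size and block count once.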
17. How are renames and hard-links handled?
● Both fops operate only on the base file.
● File’s gfid remains constant even after a rename => shards
under ‘.shard’ don’t need to be renamed.
● In other words, renaming and hard-linking a sharded file
completes in one single (atomic) network call.
18. Interoperability with existing Gluster
features?
● Verified that it works fine with geo-replication, hence
supported
● It should “theoretically” work fine in its current state with
features such as bit-rot detection, tiering etc because of its
position in the stack
● Features that won’t readily work with sharding (at least
not without additional code changes) - quota, snapshots,
etc.
20. Main challenges
● Classic trade-off between consistency and performance
● For performance
○ Maximise parallelism across non-overlapping regions
of the large file
● For consistency
○ Keep writes atomic
○ Keep file size and block-count updates atomic and
accurate
○ mtime should reflect highest value
○ Handle truncates and appending writes correctly
21. The idea so far ...
● Do not try to solve fault tolerance and
recovery.
○ Use replication!
● Avoid locking as far as possible.
○ Do away with locks for writes that do not span
more than one shard
○ Use locking only for writes that modify multiple
shards to prevent interleaving of multiple
parallel writes
○ Introduce common locking framework to
minimize impact of locking by multiple
translators, on performance.
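The locking rule proposed above comes down to a boundary check on the write. The following sketch shows one way to express it; it illustrates the idea on the slide, not an existing Gluster function, and the name `write_needs_lock` is an assumption.

```python
MiB = 1024 ** 2


def write_needs_lock(offset: int, length: int, shard_size: int) -> bool:
    """Return True if the write touches more than one shard.

    Under the proposed scheme, single-shard writes skip locking and only
    writes that cross a shard boundary take a lock.
    """
    if length <= 0:
        return False
    first = offset // shard_size
    last = (offset + length - 1) // shard_size
    return first != last


# Hypothetical 4 MiB shard size.
print(write_needs_lock(0, 4 * MiB, 4 * MiB))        # False: exactly fills shard 0
print(write_needs_lock(4 * MiB - 1, 2, 4 * MiB))    # True: crosses into shard 1
```

The last byte written is at `offset + length - 1`, so a write that ends exactly on a shard boundary still counts as a single-shard write.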
22. The idea so far ...
● Introduce a server-side translator to
manage size update
○ Eliminate the need to take locks over the
network for size update.
● Store ctime/mtime in the form of an xattr
on the base file
○ This is a generic problem that needs to be
solved across multiple translators
● Possibly leverage compound fops
● Bitmaps for counting blocks?