This presentation was given by Prashanth Menon at ICDE '14 on April 3, 2014 in Chicago, IL, USA.
The full paper and additional information are available at:
http://msrg.org/papers/Menon2013
Abstract:
With the ever growing size and complexity of enterprise systems there is a pressing need for more detailed application performance management. Due to the high data rates, traditional database technology cannot sustain the required performance. Alternatives are the more lightweight and, thus, more performant key-value stores. However, these systems tend to sacrifice read performance in order to obtain the desired write throughput by avoiding random disk access in favor of fast sequential accesses.
With the advent of SSDs, built upon the philosophy of no moving parts, the boundary between sequential vs. random access is now becoming blurred. This provides a unique opportunity to extend the storage memory hierarchy using SSDs in key-value stores. In this paper, we extensively evaluate the benefits of using SSDs in commercialized key-value stores. In particular, we investigate the performance of hybrid SSD-HDD systems and demonstrate the benefits of our SSD caching and our novel dynamic schema model.
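The hybrid design the abstract describes can be pictured as a read-through cache: an SSD tier absorbs repeated reads so the HDD-backed store is touched only on a miss. The sketch below is an in-memory stand-in for that idea; the class and method names are invented for illustration and are not from the paper.

```python
# Illustrative sketch of an SSD read cache in front of an HDD-backed
# key-value store. Dicts stand in for the two storage tiers.
from collections import OrderedDict

class HybridKVStore:
    def __init__(self, ssd_capacity):
        self.hdd = {}                      # backing store (slow, large)
        self.ssd = OrderedDict()           # cache tier (fast, limited), LRU order
        self.ssd_capacity = ssd_capacity

    def put(self, key, value):
        self.hdd[key] = value              # writes go to the backing store
        self.ssd.pop(key, None)            # invalidate any stale cached copy

    def get(self, key):
        if key in self.ssd:                # cache hit: serve from the SSD tier
            self.ssd.move_to_end(key)
            return self.ssd[key]
        value = self.hdd[key]              # cache miss: read from the HDD tier
        self.ssd[key] = value              # populate the cache on read
        if len(self.ssd) > self.ssd_capacity:
            self.ssd.popitem(last=False)   # evict the least recently used entry
        return value
```

A read-heavy workload then hits the fast tier most of the time, which is exactly the niche the paper evaluates SSDs for.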
Cassandra Day SV 2014: Designing Commodity Storage in Apache Cassandra - DataStax Academy
As we move into the world of Big Data and the Internet of Things, the systems architectures and data models we've relied on for decades are becoming a hindrance. At the core of the problem is the read-modify-write cycle. In this session, Al will talk about how to build systems that don't rely on RMW, with a focus on Cassandra. Finally, for those times when RMW is unavoidable, he will cover how and when to use Cassandra's lightweight transactions and collections.
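For the unavoidable-RMW case, Cassandra's lightweight transactions provide compare-and-set semantics (e.g. `UPDATE ... IF col = expected`, returning an `[applied]` flag). The toy store below only simulates those semantics in memory to show why a conditional write avoids lost updates; it is not the Cassandra API.

```python
# In-memory simulation of compare-and-set, the semantics behind
# Cassandra's lightweight transactions. Illustrative only.
class CasStore:
    def __init__(self):
        self.rows = {}

    def insert_if_not_exists(self, key, value):
        if key in self.rows:
            return False                   # [applied] = false
        self.rows[key] = value
        return True

    def update_if(self, key, expected, new_value):
        if self.rows.get(key) != expected:
            return False                   # condition failed: nothing written
        self.rows[key] = new_value
        return True
```

If two writers both read a balance of 100 and issue conditional updates, only the first succeeds; the second sees `[applied] = false` and must re-read and retry instead of silently overwriting.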
San Francisco Cassandra Meetup - March 2014: I/O Performance tuning on AWS fo... - DataStax Academy
What You'll Learn at this Meetup
Tips and Tricks to achieve high performance when running Cassandra on AWS
• Configuration tuning for Cassandra
• Tools to benchmark raw filesystem IO
• Available AWS AMIs to boost performance
• Stress testing on AWS i2 HVM instances
• Configuring AWS EC2 instances with SSDs and EBS storage with PIOPS
Global Azure Virtual 2020: What's new on Azure IaaS for SQL VMs - Marco Obinu
How to size a VM for SQL Server in Azure IaaS, in light of the latest platform updates. Session delivered on April 24, 2020, as part of Global Azure Virtual 2020.
Session video: https://youtu.be/7o80CJUtnh4
Demo: https://github.com/OmegaMadLab/SqlIaasVmPlayground
ARM template optimized for SQL Server: https://github.com/OmegaMadLab/OptimizedSqlVm-v2
Cassandra Summit 2014: Lesser Known Features of Cassandra 2.1 - DataStax Academy
Presenter: Aaron Morton, Apache Cassandra Committer & Co-Founder of The Last Pickle
Apache Cassandra 2.0 and 2.1 include a wealth of new and updated features. Some are well known, others are known to only a few. But any of them could help you reduce latency, improve throughput, or make operations easier. This talk will take a deep dive into features that improve: Compaction, Write Performance, Memory Management, CQL 3, TTL and Tombstones, & Repair. Existing and new users will benefit from this wide ranging view of the features Apache Cassandra offers.
Presentation from the 2016 Austin OpenStack Summit.
The Ceph upstream community is declaring CephFS stable for the first time in the recent Jewel release, but that declaration comes with caveats: while we have filesystem repair tools and a horizontally scalable POSIX filesystem, we have default-disabled exciting features like horizontally-scalable metadata servers and snapshots. This talk will present exactly what features you can expect to see, what's blocking the inclusion of other features, and what you as a user can expect and can contribute by deploying or testing CephFS.
Cassandra Day Chicago 2015: DataStax Enterprise & Apache Cassandra Hardware B... - DataStax Academy
Speaker(s): Kathryn Erickson, Engineering at DataStax
During this session we will discuss varying recommended hardware configurations for DSE. We’ll get right to the point and provide quick and solid recommendations up front. After we get the main points down take a brief tour of the history of database storage and then focus on designing a storage subsystem that won't let you down.
Introducing MagnetoDB, a key-value storage service for OpenStack - Mirantis
Introducing MagnetoDB, NoSQL database as a service for OpenStack. MagnetoDB acts as a key-value store, is tightly integrated with OpenStack, and is compatible with the Amazon DynamoDB API, so it can be used as a drop-in replacement.
Yesterday's thinking may still hold that NVMe (NVM Express) is in transition to a production-ready solution. In this session, we will discuss how NVMe has become ready for production, tracing the history and evolution of NVMe and the Linux stack to show where NVMe has progressed today: a low-latency, highly reliable key-value storage mechanism for databases that will drive the future of cloud expansion. Examples of protocol efficiencies and the types of storage engines that are optimizing for NVMe will be discussed. Please join us for an exciting session on how in-memory computing and persistence have evolved.
In this talk we report on our experience with Redis-on-Flash (RoF)—a recently introduced product that uses SSDs as a RAM extension to dramatically increase the effective dataset capacity that can be stored on a single server. This talk provides the first in-depth RoF system performance characterization: we consider different use cases (varying both RAM-to-disk access ratio and object size), and compare SATA-based RoF, NVMe-based RoF, and all-RAM Redis deployments. We show that the superior performance of NVMe drives in terms of both latency and peak bandwidth makes them a particularly good fit for RoF use cases. Specifically, we show that backing RoF with NVMe drives can deliver more than 2 million operations per second with sub-millisecond latency on a single server.
AWS June Webinar Series - Getting Started: Amazon Redshift - Amazon Web Services
Amazon Redshift is a fast, fully-managed petabyte-scale data warehouse service, for less than $1,000 per TB per year. In this presentation, you'll get an overview of Amazon Redshift, including how it uses columnar technology, optimized hardware, and massively parallel processing to deliver fast query performance on data sets ranging in size from hundreds of gigabytes to a petabyte or more. Learn how, with just a few clicks in the AWS Management Console, you can set up a fully functional data warehouse that is ready to accept data without learning any new languages and that plugs in easily with the existing business intelligence tools and applications you use today. This webinar is ideal for anyone looking to gain deeper insight into their data, without the usual challenges of time, cost, and effort.
In this webinar, you will learn how to:
• Understand what Amazon Redshift is and how it works
• Create a data warehouse interactively through the AWS Management Console
• Load data into your new Amazon Redshift data warehouse from S3
Who should attend:
• IT professionals, developers, line-of-business managers
With AWS you can choose the right database technology and software for the job. Given the myriad of choices, from relational databases to non-relational stores, this session provides details and examples of some of the choices available to you. This session also provides details about real-world deployments from customers using Amazon RDS, Amazon ElastiCache, Amazon DynamoDB, and Amazon Redshift.
SQL Server Reporting Services Disaster Recovery webinar - Denny Lee
This is the PASS DW|BI virtual chapter webinar on SQL Server Reporting Services Disaster Recovery with Ayad Shammout and me - hosted by Julie Koesmarno (@mssqlgirl)
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of Things - Amazon Web Services
Big Data is everywhere these days. But what is it and how can you use it to fuel your business? Data is as important to organizations as labour and capital, and if organizations can effectively capture, analyze, visualize and apply big data insights to their business goals, they can differentiate themselves from their competitors and outperform them in terms of operational efficiency and the bottom line.
Join this session to understand the different AWS Big Data and Analytics services such as Amazon Elastic MapReduce (Hadoop), Amazon Redshift (Data Warehouse) and Amazon Kinesis (Streaming), when to use them and how they work together.
Reasons to attend:
- Learn how AWS can help you process and make better use of your data with meaningful insights.
- Learn about Amazon Elastic MapReduce and Amazon Redshift, fully managed petabyte-scale data warehouse solutions.
- Learn about real time data processing with Amazon Kinesis.
Learn how Amazon Redshift, our fully managed, petabyte-scale data warehouse, can help you quickly and cost-effectively analyze all of your data using your existing business intelligence tools. Get an introduction to how Amazon Redshift uses massively parallel processing, scale-out architecture, and columnar direct-attached storage to minimize I/O time and maximize performance. Learn how you can gain deeper business insights and save money and time by migrating to Amazon Redshift. Take away strategies for migrating from on-premises data warehousing solutions, tuning schema and queries, and utilizing third party solutions.
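The columnar-storage argument above can be made concrete with a toy example: an aggregate over one column of a column-oriented table reads a single contiguous array instead of every field of every row. This is an illustrative sketch, not Redshift code.

```python
# Row layout vs. column layout for the same tiny "orders" table.
rows = [
    {"order_id": 1, "region": "EU", "amount": 120.0},
    {"order_id": 2, "region": "US", "amount": 75.5},
    {"order_id": 3, "region": "EU", "amount": 42.25},
]

# Row store: the scan walks whole rows just to read one field.
row_total = sum(r["amount"] for r in rows)

# Column store: the same table kept as one array per column; the
# aggregate touches only the "amount" array.
columns = {
    "order_id": [1, 2, 3],
    "region": ["EU", "US", "EU"],
    "amount": [120.0, 75.5, 42.25],
}
col_total = sum(columns["amount"])

assert row_total == col_total == 237.75
```

On disk the difference is I/O, not just iteration: the column scan reads a fraction of the bytes, which is where the "minimize I/O time" claim comes from.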
Getting Started with Managed Database Services on AWS - September 2016 Webina... - Amazon Web Services
On AWS you can choose from a variety of managed database services that save effort, save time, and unlock new capabilities and economies. In this session, we make it easy to understand how they differ, what they have in common, and how to choose one or more. We'll explain the fundamentals of Amazon RDS, a managed relational database service in the cloud; Amazon DynamoDB, a fully managed NoSQL database service; Amazon ElastiCache, a fast, in-memory caching service in the cloud; and Amazon Redshift, a fully managed, petabyte-scale data-warehouse solution that can be surprisingly economical. We will cover how each service might help support your application, how much each service costs, and how to get started.
Learning Objectives:
• Overview of managed database services available on AWS
• How to combine them for high-performance cost effective architectures
• Learn how to choose between the AWS database services based on the use case
Who Should Attend:
• IT Managers, DBAs, Enterprise and Solution Architects, DevOps Engineers, and Developers
SQL Server Reporting Services Disaster Recovery Webinar - Denny Lee
This is the PASS DW/BI webinar on SQL Server Reporting Services (SSRS) Disaster Recovery. You can find the video at: http://www.youtube.com/watch?v=gfT9ETyLRlA
Best Practices for Supercharging Cloud Analytics on Amazon Redshift - SnapLogic
In this webinar, we discuss how the secret sauce to your business analytics strategy remains rooted in your approach, methodologies, and the amount of data incorporated into this critical exercise. We also address best practices to supercharge your cloud analytics initiatives, and tips and tricks on designing the right information architecture, data models, and other tactical optimizations.
To learn more, visit: http://www.snaplogic.com/redshift-trial
New to MongoDB? We'll provide an overview of installation, high availability through replication, scale out through sharding, and options for monitoring and backup. No prior knowledge of MongoDB is assumed. This session will jumpstart your knowledge of MongoDB operations, providing you with context for the rest of the day's content.
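Scale-out through sharding, mentioned above, routes each document to a shard by its shard key; with hashed sharding the key is hashed first so writes spread evenly across shards. A minimal sketch of that routing idea follows (illustrative only; in a real deployment the mongos router and config servers do this, and shard names are invented here):

```python
# Hash-based shard routing: hash the shard key, pick a shard by modulo.
import hashlib

SHARDS = ["shard0", "shard1", "shard2"]

def route(shard_key):
    # A stable hash of the key; md5 is used here only as a convenient
    # deterministic hash, not for security.
    digest = hashlib.md5(str(shard_key).encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]
```

Because the hash is deterministic, a later read of `route("user:42")` lands on the same shard the write went to, while sequential keys (timestamps, counters) no longer pile onto one shard.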
Selecting the Right AWS Database Solution - AWS 2017 Online Tech Talks - Amazon Web Services
• Get an overview of managed database services available on AWS
• Learn how to combine them for high-performance cost effective architectures
• Learn how to choose between the AWS database services based on your use case
On AWS you can choose from a variety of managed database services that save effort, save time, and unlock new capabilities and economies. In this session, we make it easy to understand how they differ, what they have in common, and how to choose one or more. We'll explain the fundamentals of Amazon RDS, a managed relational database service in the cloud; Amazon DynamoDB, a fully managed NoSQL database service; Amazon ElastiCache, a fast, in-memory caching service in the cloud; and Amazon Redshift, a fully managed, petabyte-scale data-warehouse solution that can be economical. We will cover how each service might help support your application and how to get started.
Amazon Aurora is a MySQL- and PostgreSQL-compatible relational database engine with the speed, reliability, and availability of high-end commercial databases at one-tenth the cost. This session introduces you to Amazon Aurora, explores its capabilities and features, explains common use cases, and helps you get started with Aurora.
Storage Systems for High Scalable Systems Presentation - andyman3000
Presentation from http://www.hfadeel.com/Blog/?p=151 on what kind of storage systems players like Facebook or Google use for their extreme scalability requirements.
1. If it’s not SQL, it’s not a database.
2. It takes 5+ years to build a database.
3. Listen to your users.
4. Too much magic is a bad thing.
5. It’s the cloud, stupid.
Similar to CaSSanDra: An SSD Boosted Key-Value Store
TPC-DI - The First Industry Benchmark for Data Integration - Tilmann Rabl
This presentation was given by Meikel Poess on September 3, 2014 at VLDB 2014 in Hangzhou, China.
Full paper and additional information available at:
http://msrg.org/papers/VLDB2014TPCDI
Abstract:
Historically, the process of synchronizing a decision support system with data from operational systems has been referred to as Extract, Transform, Load (ETL) and the tools supporting this process have been referred to as ETL tools. Recently, ETL was replaced by the more comprehensive acronym, data integration (DI). DI describes the process of extracting and combining data from a variety of data source formats, transforming that data into a unified data model representation and loading it into a data store. This is done in the context of a variety of scenarios, such as data acquisition for business intelligence, analytics and data warehousing, but also synchronization of data between operational applications, data migrations and conversions, master data management, enterprise data sharing and delivery of data services in a service-oriented architecture context, amongst others. With these scenarios relying on up-to-date information, it is critical to implement a highly performing, scalable and easy to maintain data integration system. This is especially important as the complexity, variety and volume of data is constantly increasing and performance of data integration systems is becoming very critical. Despite the significance of having a highly performing DI system, there has been no industry standard for measuring and comparing their performance. The TPC, acknowledging this void, has released TPC-DI, an innovative benchmark for data integration. This paper motivates the reasons behind its development, describes its main characteristics, including its workload, run rules, and metric, and explains key design decisions.
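The extract-transform-load process the benchmark models can be sketched in miniature: pull records from a source format, normalize them into a unified representation, and load them into a target store. All names below are illustrative and not part of TPC-DI itself.

```python
# A toy ETL / data-integration pipeline: CSV-like text in, an
# aggregated warehouse table out.

def extract(source):
    # Extract: read raw records from the source system.
    return source.splitlines()

def transform(lines):
    # Transform: parse, clean, and normalize into a unified model.
    records = []
    for line in lines:
        name, amount = line.split(",")
        records.append({"name": name.strip().upper(),
                        "amount": float(amount)})
    return records

def load(records, target):
    # Load: merge the normalized records into the target store.
    for rec in records:
        target[rec["name"]] = target.get(rec["name"], 0.0) + rec["amount"]

source = "alice, 10.5\nbob, 3.0\nalice, 2.0"
warehouse = {}
load(transform(extract(source)), warehouse)
```

TPC-DI measures how fast and how scalably a real DI system executes this kind of pipeline over realistic, heterogeneous source data.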
This presentation was held at ISC 2014 on June 26, 2014 in Leipzig, Germany.
More information available at:
http://msrg.org/papers/ISC2014-Rabl
Abstract:
The Workshops for Big Data Benchmarking (http://clds.sdsc.edu/bdbc/workshops), which have been underway since May 2012, have identified a set of characteristics of big data applications that apply to industry as well as scientific application scenarios involving pipelines of processing with steps that include aggregation, cleaning, and annotation of large volumes of data; filtering, integration, fusion, subsetting, and compaction of data; and, subsequent analysis, including visualization, data mining, predictive analytics and, eventually, decision making. One of the outcomes of the WBDB workshops has been the formation of a Transaction Processing Council subcommittee on Big Data, which is initially defining a Hadoop systems benchmark, TPCx-HS, based on Terasort. TPCx-HS would be a simple, functional benchmark that would assist in determining basic resiliency and scalability features of large-scale systems. Other proposals are also actively under development including BigBench, which extends the TPC-DS benchmark for big data scenarios; Big Decision Benchmark from HP; HiBench from Intel; and the Deep Analytics Pipeline (DAP), which defines a sequence of end-to-end processing steps consisting of some of the operations mentioned above. Pipeline benchmarks reveal the need for different processing modalities and system characteristics for different steps in the pipeline. For example, early processing steps may process very large volumes of data and may benefit from a Hadoop and MapReduce-style of computing, while later steps may operate on more structured data and may require, say, SMP-style architectures or very large memory systems. This talk will provide an overview of these benchmark activities and discuss opportunities for collaboration and future work with industry partners.
This tutorial was held at IEEE BigData '14 on October 29, 2014 in Bethesda, MD, USA.
Presenters: Chaitan Baru and Tilmann Rabl
More information available at:
http://msrg.org/papers/BigData14-Rabl
Summary:
This tutorial will introduce the audience to the broad set of issues involved in defining big data benchmarks, for creating auditable industry-standard benchmarks that consider performance as well as price/performance. Big data benchmarks must capture the essential characteristics of big data applications and systems, including heterogeneous data, e.g. structured, semi- structured, unstructured, graphs, and streams; large-scale and evolving system configurations; varying system loads; processing pipelines that progressively transform data; workloads that include queries as well as data mining and machine learning operations and algorithms. Different benchmarking approaches will be introduced, from micro-benchmarks to application- level benchmarking.
Since May 2012, five workshops have been held on Big Data Benchmarking including participation from industry and academia. One of the outcomes of these meetings has been the creation of industry’s first big data benchmark, viz., TPCx-HS, the Transaction Processing Performance Council’s benchmark for Hadoop Systems. During these workshops, a number of other proposals have been put forward for more comprehensive big data benchmarking. The tutorial will present and discuss salient points and essential features of such benchmarks that have been identified in these meetings, by experts in big data as well as benchmarking. Two key approaches are now being pursued—one, called BigBench, is based on extending the TPC- Decision Support (TPC-DS) benchmark with big data applications characteristics. The other called Deep Analytics Pipeline, is based on modeling processing that is routinely encountered in real-life big data applications. Both will be discussed.
We conclude with a discussion of a number of future directions for big data benchmarking.
A BigBench Implementation in the Hadoop Ecosystem - Tilmann Rabl
This presentation was held at WBDB.us 2013 on October 10, 2013 in San Jose, CA, USA
Full paper and additional information at:
http://msrg.org/papers/WBDB2013BigBench
Abstract:
BigBench is the first proposal for an end-to-end big data analytics benchmark. It features a rich query set with complex, realistic queries. BigBench was developed based on the decision support benchmark TPC-DS. The first proof-of-concept implementation was built for the Teradata Aster parallel database system and the queries were formulated in the proprietary SQL-MR query language. To test other systems, the queries have to be translated. In this paper, an alternative implementation of BigBench for the Hadoop ecosystem is presented. All 30 queries of BigBench were realized using Apache Hive, Apache Hadoop, Apache Mahout, and NLTK. We will present the different design choices we took and show a proof-of-concept evaluation.
MADES - A Multi-Layered, Adaptive, Distributed Event Store - Tilmann Rabl
This demo was presented at DEBS'13 on July 1, 2013 in Arlington, Texas, USA.
The full paper and more information are available at:
http://msrg.org/papers/DEBS13Mades
Abstract:
Application performance monitoring (APM) is shifting towards capturing and analyzing every event that arises in an enterprise infrastructure. Current APM systems, for example, make it possible to monitor enterprise applications at the granularity of tracing each method invocation (i.e., an event). Naturally, there is great interest in monitoring these events in real-time to react to system and application failures and in storing the captured information for an extended period of time to enable detailed system analysis, data analytics, and future auditing of trends in the historic data. However, the high insertion-rates (up to millions of events per second) and the purposely limited resource, a small fraction of all enterprise resources (i.e., 1-2% of the overall system resources), dedicated to APM are the key challenges for applying current data management solutions in this context. Emerging distributed key-value stores, often positioned to operate at this scale, induce additional storage overhead when dealing with relatively small data points (e.g., method invocation events) inserted at a rate of millions per second. Thus, they are not a promising solution for such an important class of workloads given APM's highly constrained resource budget. In this paper, to address these shortcomings, we present Multilayered, Adaptive, Distributed Event Store (MADES): a massively distributed store for collecting, querying, and storing event data at a rate of millions of events per second.
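One tactic the problem statement implies is amortizing per-record overhead by packing many small events into a single batch before it is stored, instead of paying key-value-store overhead per tiny event. The fixed-width encoding below is an illustrative sketch, not MADES's actual format; the field layout is invented.

```python
# Pack small (timestamp_ms, duration_us) method-invocation events into
# one binary block: an 8-byte count header, then fixed 12-byte records.
import struct

def pack_events(events):
    block = bytearray(struct.pack("<Q", len(events)))
    for ts, dur in events:
        block += struct.pack("<QI", ts, dur)   # 8-byte ts + 4-byte duration
    return bytes(block)

def unpack_events(block):
    (count,) = struct.unpack_from("<Q", block, 0)
    events, offset = [], 8
    for _ in range(count):
        ts, dur = struct.unpack_from("<QI", block, offset)
        events.append((ts, dur))
        offset += 12
    return events
```

A batch of a million events costs one header plus 12 bytes each, versus per-key metadata for a million individual inserts, which is the kind of overhead the paper argues makes generic key-value stores a poor fit for APM's resource budget.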
Rapid Development of Data Generators Using Meta Generators in PDGF - Tilmann Rabl
This is a presentation that was held at the Sixth International Workshop on Testing Database Systems, collocated with ACM SIGMOD 2013, June 24, New York, USA.
Full paper and additional information available at:
http://msrg.org/papers/dbtest13-rabl
Abstract:
Generating data sets for the performance testing of database systems on a particular hardware configuration and application domain is a very time consuming and tedious process. It is time consuming because of the large amount of data that needs to be generated, and tedious because new data generators might need to be developed or existing ones adjusted. The difficulty in generating this data is amplified by constant advances in hardware and software that allow the testing of ever larger and more complicated systems. In this paper, we present an approach for rapidly developing customized data generators. Our approach, which is based on the Parallel Data Generator Framework (PDGF), deploys a new concept of so-called meta generators. Meta generators extend the concept of column-based generators in PDGF. Deploying meta generators in PDGF significantly reduces the development effort of customized data generators, facilitates their debugging, and eases their maintenance.
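The meta-generator concept can be sketched as a generator that wraps other column generators and post-processes their output, for example injecting NULLs. PDGF itself is a Java framework; the Python names below are invented for illustration, and per-row seeding stands in for PDGF's repeatable, parallel generation.

```python
# Column generators produce one value per row number; seeding the RNG
# with the row number makes generation repeatable and parallelizable.
import random

def id_generator(row):
    return row                                  # deterministic surrogate key

def name_generator(row):
    rng = random.Random(row)                    # seeded per row: repeatable
    return rng.choice(["alice", "bob", "carol"])

def null_meta_generator(inner, null_rate):
    """Meta generator: wraps an inner generator and injects NULLs."""
    def generate(row):
        rng = random.Random("null-%d" % row)    # independent per-row seed
        return None if rng.random() < null_rate else inner(row)
    return generate

# Compose generators instead of writing a new one from scratch.
email_gen = null_meta_generator(
    lambda row: name_generator(row) + "@example.com", null_rate=0.2)
```

The point of the composition is reuse: a new "email column with 20% NULLs" needs no new generator code, only a wrapper around existing pieces.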
Solving Big Data Challenges for Enterprise Application Performance Management - Tilmann Rabl
This is a presentation that was held at the 38th Conference on Very Large Databases (VLDB), 2012.
Full paper and additional information available at:
http://msrg.org/papers/vldb12-bigdata
Abstract:
As the complexity of enterprise systems increases, the need for monitoring and analyzing such systems also grows. A number of companies have built sophisticated monitoring tools that go far beyond simple resource utilization reports. For example, based on instrumentation and specialized APIs, it is now possible to monitor single method invocations and trace individual transactions across geographically distributed systems. This high-level of detail enables more precise forms of analysis and prediction but comes at the price of high data rates (i.e., big data). To maximize the benefit of data monitoring, the data has to be stored for an extended period of time for later analysis. This new wave of big data analytics imposes new challenges especially for the application performance monitoring systems. The monitoring data has to be stored in a system that can sustain the high data rates and at the same time enable an up-to-date view of the underlying infrastructure. With the advent of modern key-value stores, a variety of data storage systems have emerged that are built with a focus on scalability and the high data rates predominant in this monitoring use case.
In this work, we present our experience and a comprehensive performance evaluation of six modern (open-source) data stores in the context of application performance monitoring as part of CA Technologies initiative. We evaluated these systems with data and workloads that can be found in application performance monitoring, as well as, on-line advertisement, power monitoring, and many other use cases. We present our insights not only as performance results but also as lessons learned and our experience relating to the setup and configuration complexity of these data stores in an industry setting.
Epistemic Interaction - tuning interfaces to provide information for AI support - Alan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
The Art of the Pitch: WordPress Relationships and Sales - Laura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
GraphRAG is All You Need? LLM & Knowledge Graph - Guy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor... - Neo4j
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Climate Impact of Software Testing at Nordic Testing DaysKari Kakkonen
My slides at Nordic Testing Days 6.6.2024
Climate impact / sustainability of software testing discussed on the talk. ICT and testing must carry their part of global responsibility to help with the climat warming. We can minimize the carbon footprint but we can also have a carbon handprint, a positive impact on the climate. Quality characteristics can be added with sustainability, and then measured continuously. Test environments can be used less, and in smaller scale and on demand. Test techniques can be used in optimizing or minimizing number of tests. Test automation can be used to speed up testing.
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfPeter Spielvogel
Building better applications for business users with SAP Fiori.
• What is SAP Fiori and why it matters to you
• How a better user experience drives measurable business benefits
• How to get started with SAP Fiori today
• How SAP Fiori elements accelerates application development
• How SAP Build Code includes SAP Fiori tools and other generative artificial intelligence capabilities
• How SAP Fiori paves the way for using AI in SAP apps
Removing Uninteresting Bytes in Software FuzzingAftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing xml documents, and Binutil's readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format). Our preliminary results show that AFL+DIAR does not only discover new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
Pushing the limits of ePRTC: 100ns holdover for 100 daysAdtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
Elevating Tactical DDD Patterns Through Object CalisthenicsDorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices desire to take full advantage of the features
available on those devices, but many of the features provide convenience and capability but sacrifice security. This best practices guide outlines steps the users can take to better protect personal devices and information.
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024Neo4j
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
GridMate - End to end testing is a critical piece to ensure quality and avoid...ThomasParaiso2
End to end testing is a critical piece to ensure quality and avoid regressions. In this session, we share our journey building an E2E testing pipeline for GridMate components (LWC and Aura) using Cypress, JSForce, FakerJS…
GridMate - End to end testing is a critical piece to ensure quality and avoid...
CaSSanDra: An SSD Boosted Key-Value Store
1. CaSSanDra: An SSD Boosted Key-Value Store
Prashanth Menon, Tilmann Rabl, Mohammad Sadoghi (*), Hans-Arno Jacobsen
University of Toronto, Middleware Systems Research Group (MSRG.ORG)
2. Outline
• Application Performance Management
• Cassandra and SSDs
• Extending Cassandra's Row Cache
• Implementing a Dynamic Schema Catalogue
• Conclusions
3. Modern Enterprise Architecture
• Many different software systems
• Complex interactions
• Stateful systems often distributed/partitioned/replicated
• Stateless systems certainly duplicated
4. Application Performance Management
• Lightweight agent attached to each software system instance
• Monitors system health
• Traces transactions
• Determines root causes
• Raw APM metric:
[Diagram: agents attached to every system instance across the enterprise]
5. Application Performance Management
• Problem: Agents have short memory and only have a local view
• What was the average response time for requests served by servlet X between December 18-31, 2011?
• What was the average time spent in each service/database to respond to client requests?
6. APM Metrics Datastore
• All agents store metric data in high write-throughput datastore
• Metric data is at a fine granularity (per-action, millisecond, etc.)
• User now has global view of metrics
• What is the best database to store APM metrics?
7. Cassandra Wins APM
• APM experiments performed by Rabl et al. [1] show Cassandra performs best for the APM use case
• In-memory workloads including 95%, 50%, and 5% reads
• Workloads requiring disk access with 95%, 50%, and 5% reads
[Figures 3-6 from Rabl et al.: throughput and read/write latency for Workload R (50% reads) and Workload RW (95% reads) across Cassandra, HBase, Voldemort, VoltDB, Redis, and MySQL, scaling from 2 to 12 nodes with the problem size scaled to the cluster size; 10-minute maximum-throughput runs on freshly installed systems. On a single node, Redis has the highest throughput (more than 50K ops/sec) followed by VoltDB; Cassandra and MySQL are about half of Redis (25K ops/sec); Voldemort is 2x slower than Cassandra (12K ops/sec); HBase is slowest at 2.5K operations per second.]
[1] http://msrg.org/publications/pdf_files/2012/vldb12-bigdata-Solving_Big_Data_Challenges_fo.pdf
8. Cassandra
• Built at Facebook by previous Dynamo engineers
• Open sourced to Apache in 2009
• DHT with consistent hashing
• MD5 hash of key
• Multiple nodes handle segments of ring for load balancing
• Dynamo distribution and replication model + BigTable storage model
[Storage components: Commit Log, Memtable, SSTables]
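The ring placement described above can be sketched in a few lines. This is an illustrative toy, not Cassandra's implementation (which uses explicit partitioner tokens, replica placement strategies, and later virtual nodes); the node names and key are made up:

```python
import bisect
import hashlib

class HashRing:
    """Toy consistent-hash ring: each node owns the arc ending at its token."""

    def __init__(self, nodes):
        # Token derived from the node name here; Cassandra assigns tokens explicitly.
        self.tokens = sorted((self._md5(n), n) for n in nodes)

    @staticmethod
    def _md5(key):
        # MD5 of the key determines its position on the ring.
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def owner(self, key):
        # First node whose token is >= the key's hash, wrapping around the ring.
        h = self._md5(key)
        i = bisect.bisect_left(self.tokens, (h, ""))
        return self.tokens[i % len(self.tokens)][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.owner("HostA/AgentX/AVGResponse"))
```

Because only token boundaries move when a node joins or leaves, most keys keep their owner, which is the load-balancing property the slide refers to.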
9. Cassandra and SSDs
• Improve performance by either adding nodes or improving per-node performance
• Node performance is directly dependent on the disk I/O performance of the system
• Cassandra stores two entities on disk:
• Commit Log
• SSTables
• Should SSDs be used to store both?
• We evaluated each possible configuration
10. Experiment Setup
• Server specification:
• 2x Intel 8-core X5450, 16GB RAM, 2x 2TB RAID0 HDD, 2x 250GB Intel x520 SSD
• Apache Cassandra 1.10
• Used YCSB benchmark
• 100M rows, 50GB total raw data, 'latest' distribution
• 95% read, 5% write
• Minimum three runs per workload, fresh data on each run
• Broken into phases:
• Data load
• Fragmentation
• Cache warm-up
• Workload (> 12h process)
11. SSD vs. HDD
• Location of log is irrelevant
• Location of data is important
• Dramatic performance improvement of SSD over HDD
• SSD benefits from high parallelism

Configuration  # of clients  # of threads/client  Location of Data  Location of Commit Log
C1             1             2                    RAID (HDD)        RAID (HDD)
C2             1             2                    RAID (HDD)        SSD
C3             1             2                    SSD               RAID (HDD)
C4             1             2                    SSD               SSD
C5             4             16                   RAID (HDD)        RAID (HDD)
C6             4             16                   SSD               SSD

[Fig. 4(a)-(b): throughput and latency for configurations C1-C6. From the accompanying paper text: keeping the bulk of infrequently accessed data on HDD is also motivated by the fact that SSD performance degrades with higher fill ratios, as seen in Fig. 4(c).]
12. SSD vs. HDD (II)
• SSD offers more than 7x improvement to throughput on empty disk
• SSD performance degrades by half as storage device fills up
• Filling the SSD or running it near capacity is not advisable

[Fig. 4(c)-(d): throughput and latency for HDD vs. SSD with empty vs. 99%-full disks. From the accompanying paper text: a larger portion of the hot data is cached on the SSD; this configuration stored more than twice the amount of data compared to an in-memory cache alone, achieving a cache-hit ratio of more than 85%, and a read for a row not in the off-heap memory cache requires only a single SSD seek.]
13. SSD vs. HDD: Summary
• Cassandra benefits most when storing data on SSD (not the log)
• Location of commit log not important
• SSD performance inversely proportional to fill ratio
• Storing all data on SSD is uneconomical
• Replacing 3TB HDD with 3x 1TB SSD is 10x more costly
• SSDs have limited lifetime (10-50K write-erase cycles), need replacement more frequently
• Rabl et al. [1] show adding a node is 100% costlier, with 100% throughput improvement
• Build hybrid system to get comparable performance for marginal cost
14. Cassandra: Read + Write Path
• Write path is fast:
1. Write update into commit log
2. Write update into Memtable
• Memtables flush to SSTables asynchronously when full
• Never blocks writes
• Read path can be slow:
1. Read key-value from Memtable
2. Read key-value from each SSTable on disk
3. Construct merged view of row from each input source
• Each read needs to do O(# of SSTables) I/O
[Diagram: updates go to the commit log and Memtable in memory; reads merge the Memtable with multiple SSTables on disk]
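The merge step in the read path can be illustrated with a small sketch. Assumptions: each source is a plain dict keyed by row key, and sources are ordered newest first so a newer column value shadows an older one (real Cassandra resolves conflicts with per-cell timestamps; the row values are made up):

```python
def merged_row(key, sources):
    """Combine the column fragments of one row; a newer value shadows an older one."""
    row = {}
    for table in sources:                 # newest source first
        fragment = table.get(key, {})
        for column, value in fragment.items():
            row.setdefault(column, value) # keep the newest value already seen
    return row

# Newest-first: the Memtable, then SSTables from newest to oldest.
memtable = {"99231234": {"age": 26}}
sstable_new = {"99231234": {"age": 25, "dept": "MSRG"}}
sstable_old = {"99231234": {"first": "Prashanth", "last": "Menon"}}

print(merged_row("99231234", [memtable, sstable_new, sstable_old]))
# {'age': 26, 'dept': 'MSRG', 'first': 'Prashanth', 'last': 'Menon'}
```

Note that every source holding a fragment of the row must be consulted, which is exactly the O(# of SSTables) I/O cost the slide names.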
15. Cassandra: SSTables
• Cassandra allows blind-writes
• Row data can be fragmented over multiple SSTables over time
• Bloom filters and indexes can potentially help
• Ultimately, multiple fragments need to be read from disk

[Example: a single logical row (Employee ID 99231234, First Name Prashanth, Last Name Menon, Age 25, Department ID MSRG) with its columns spread across several SSTables]
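The "Bloom filters can help" bullet works because each SSTable keeps a small filter in memory, and a definitive "no" lets the read path skip that table's disk access entirely; only a "maybe" (possibly a false positive) forces a seek. A minimal sketch, where the bit size, hash scheme, and key are illustrative rather than Cassandra's actual filter:

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: 'no' is definitive, 'maybe' can be a false positive."""

    def __init__(self, bits=1024, hashes=3):
        self.bits, self.hashes = bits, hashes
        self.array = 0                        # bit array packed into one int

    def _positions(self, key):
        # Derive k positions by salting the key; real filters use faster hashes.
        for i in range(self.hashes):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.bits

    def add(self, key):
        for p in self._positions(key):
            self.array |= 1 << p

    def might_contain(self, key):
        return all(self.array >> p & 1 for p in self._positions(key))

# One filter per SSTable: consult it before paying for a disk read.
bf = BloomFilter()
bf.add("99231234")
print(bf.might_contain("99231234"))   # True: added keys are always found
```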
16. Cassandra: Row Cache
• Row cache buffers full merged row in memory
• Cache miss follows regular read path, constructs merged row, brings it into cache
• Makes read path faster for frequently accessed data
• Problem: Row cache occupies memory
• Takes away precious memory from rest of system
• Extend the row cache efficiently onto SSD
[Diagram: Row Cache sits in memory alongside the Memtable, in front of the SSTables and log on disk]
17. Extended Row Cache
• Extend the row cache onto SSD
• Chained with in-memory row cache
• LRU in memory, overflow onto LRU SSD row cache
• Implemented as append-only cache files
• Efficient sequential writes
• Fast random reads
• Zero I/O for hit in first-level row cache
• One random I/O on SSD for second-level row cache
[Diagram: Memtable and 1st-level row cache in memory, with a 2nd-level cache index pointing to the 2nd-level row cache on SSD; SSTables and log remain on disk]
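The two-level design above (LRU in memory, evictions appended sequentially to an SSD-resident file, second-level hits served with one seek through an in-memory offset index) can be sketched as follows. This is a simplified model, not the paper's implementation: capacity is counted in entries, there is no garbage collection of the cache file, and all names are invented:

```python
from collections import OrderedDict
import os
import tempfile

class TwoLevelRowCache:
    """LRU row cache in memory that spills evicted rows to an append-only file."""

    def __init__(self, capacity, path):
        self.capacity = capacity
        self.memory = OrderedDict()   # first level: key -> row bytes (LRU order)
        self.index = {}               # second level: key -> (offset, length)
        self.file = open(path, "ab+")

    def put(self, key, row: bytes):
        self.memory[key] = row
        self.memory.move_to_end(key)
        if len(self.memory) > self.capacity:
            # Evict the least recently used row with a cheap sequential append.
            victim, data = self.memory.popitem(last=False)
            offset = self.file.seek(0, os.SEEK_END)
            self.file.write(data)
            self.index[victim] = (offset, len(data))

    def get(self, key):
        if key in self.memory:        # zero I/O on a first-level hit
            self.memory.move_to_end(key)
            return self.memory[key]
        if key in self.index:         # one random read on a second-level hit
            offset, length = self.index[key]
            self.file.seek(offset)
            return self.file.read(length)
        return None                   # miss: fall back to the SSTable read path

path = os.path.join(tempfile.mkdtemp(), "rowcache.bin")
cache = TwoLevelRowCache(2, path)
cache.put("a", b"row-a"); cache.put("b", b"row-b"); cache.put("c", b"row-c")
print(cache.get("a"))   # b'row-a', served from the spill file with one seek
```

The append-only layout is what makes the scheme SSD-friendly: writes are sequential, and reads are single random seeks, which the slides note SSDs handle well.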
18. Evaluation: SSD Row Cache
• Setup:
• 100M rows, 50GB total data, 6GB row cache
• Results:
• 75% improvement in throughput
• 75% improvement in latency
• RAM-only cache has too low a hit ratio
[Fig. 5(a)-(b): throughput and latency at 95%, 85%, and 75% reads with the row cache disabled, RAM-only, and RAM+SSD]
19. Dynamic Schema
• Key-value stores covet schema-less data model
• Very flexible, good for highly varying data
• Schemas often change, defining up front can be detrimental
• Observation: many big data applications have relatively stable schemas
• e.g., click stream, APM, sensor data, etc.
• Redundant schemas have significant overhead in I/O and space usage

Application format:
Metric Name                Timestamp   Value  Max  Min
HostA/AgentX/AVGResponse   1332988833  4      6    1

On-disk format (column names repeated in every row):
Metric Name: HostA/AgentX/AVGResponse, Timestamp: 1332988833, Value: 4, Max: 6, Min: 1
Metric Name: HostA/AgentX/AVGResponse, Timestamp: 1332988848, Value: 5, Max: 7, Min: 1
Metric Name: HostA/AgentX/Failures, Timestamp: 1332988849, All: 4, Warn: 3, Error: 1
20. Dynamic Schema (III)
• Don't serialize redundant schema with rows
• Extract schema from data, store on SSD, serialize schema ID with data
• Allows for large number of schemas

Schema catalogue (on SSD):
S1: Metric Name, Timestamp, Value, Max, Min
S2: Metric Name, Timestamp, All, Warn, Error

New disk format (schema ID + values):
S1, HostA/AgentX/AVGResponse, 1332988833, 4, 6, 1
S1, HostA/AgentX/AVGResponse, 1332988848, 5, 7, 1
S2, HostA/AgentX/Failures, 1332988849, 4, 3, 1
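The catalogue idea can be sketched as interning column-name tuples: each distinct schema is stored once (on SSD, in the paper's design) and every row carries only a small schema ID plus its values. Class and method names here are invented for illustration:

```python
class SchemaCatalogue:
    """Intern each distinct column-name tuple once; rows become (schema_id, values)."""

    def __init__(self):
        self.by_schema = {}   # column tuple -> schema id
        self.by_id = {}       # schema id -> column tuple (the part persisted on SSD)

    def encode(self, row: dict):
        # Reuse the id if this column set was seen before, else register it.
        columns = tuple(row)
        sid = self.by_schema.setdefault(columns, len(self.by_schema))
        self.by_id[sid] = columns
        return (sid, tuple(row.values()))

    def decode(self, sid, values):
        # Rebuild the full row by zipping the catalogued names with the values.
        return dict(zip(self.by_id[sid], values))

cat = SchemaCatalogue()
r1 = cat.encode({"Metric Name": "HostA/AgentX/AVGResponse", "Timestamp": 1332988833,
                 "Value": 4, "Max": 6, "Min": 1})
r2 = cat.encode({"Metric Name": "HostA/AgentX/Failures", "Timestamp": 1332988849,
                 "All": 4, "Warn": 3, "Error": 1})
print(r1[0], r2[0])   # 0 1 -- two distinct schemas, each stored once
```

Since the metric rows above repeat the same five column names millions of times, storing each name set once and a small ID per row is where the space and I/O savings come from.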
21. Evaluation: Dynamic Schema
• Setup:
• 40M rows, variable columns 5-10 (638 schemas), 6GB row cache
• Results:
• 10% reduction in disk usage (6.8GB vs 6GB)
• Slightly improved throughput, stable latency
• Effective SSD usage (only random reads) & reduced I/O and space usage
[Fig. 5(c)-(d): throughput and latency at 95%, 50%, and 5% reads, regular vs. dynamic schema. From the accompanying paper text: data sizes averaged 6.8GB compressed after the initial load of 40 million keys, versus 6.01GB with the modified Cassandra, a savings of roughly 10% that grows with the number and length of column names.]
22. Conclusions
• Storing Cassandra commit logs on SSD doesn't help
• Running SSDs at capacity degrades their performance
• Using SSDs as a secondary row cache dramatically improves performance
• Extracting redundant schemas onto an SSD reduces disk space usage and required I/O
23. Thanks!
• Questions?
• Contact:
• Prashanth Menon (prashanth.menon@utoronto.ca)
24. Future Work
• What types of tables benefit most from a dynamic schema?
• Impact of compaction on read-heavy workloads
• How can SSDs be used to improve the performance of compaction?
• How does performance change when storing only SSTable indexes on SSD?