Castle is a storage system that uses a doubling array (DA) to transform random I/O into sequential I/O for distributed, shared-nothing databases handling big data workloads. The DA improves on B-Trees by allowing for faster small random inserts and range queries after inserts compared to traditional storage systems. Castle also uses snapshots and clones to address problems with supporting new big data workloads, such as reducing space blowup from copy-on-write operations. Based on benchmarks inserting 3 billion rows and performing subsequent small random queries, Castle provides over an order of magnitude better performance than standard Cassandra for these workloads.
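The doubling-array idea described above can be sketched in a few lines. This is a minimal illustrative model (not Castle's actual implementation): level i holds either zero or one sorted run of 2^i keys, and an insert merges equal-sized runs upward, so random writes become batched sequential merges.

```python
from bisect import bisect_left

# Toy doubling array: level i is None or a sorted run of 2^i items.
# Inserts carry a run upward, merging equal-sized runs, so all disk
# writes would be large sequential merges rather than random updates.
class DoublingArray:
    def __init__(self):
        self.levels = []  # levels[i] is None or a sorted list of 2^i keys

    def insert(self, key):
        carry = [key]
        i = 0
        while True:
            if i == len(self.levels):
                self.levels.append(None)
            if self.levels[i] is None:
                self.levels[i] = carry
                return
            # merge two sorted runs of size 2^i into one of size 2^(i+1)
            carry = sorted(self.levels[i] + carry)
            self.levels[i] = None
            i += 1

    def search(self, key):
        # binary-search each non-empty level: O(log^2 n) comparisons
        for run in self.levels:
            if run:
                j = bisect_left(run, key)
                if j < len(run) and run[j] == key:
                    return True
        return False
```

The `sorted()` call stands in for a linear two-way merge; the point is that each level is rewritten wholesale, which is sequential I/O.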
Cassandra & the Acunu Data Platform
Tom Wilkie discusses how the Acunu data platform provides significant performance improvements for Cassandra. Acunu uses a doubling array technique for inserts and range queries that is over 100x faster than standard Cassandra for small random operations. It bridges the performance gap between traditional and modern distributed databases through its shared memory interface and kernel optimizations.
Acunu is developing an enterprise Cassandra appliance called Castle that aims to simplify Cassandra deployment and management. Castle includes a storage engine optimized for large disks and workloads, and allows for high density on commodity hardware. It also features fast disk rebuilds through its shared memory architecture. Acunu provides a web UI called the Control Center to configure, monitor, and troubleshoot Castle without deep Cassandra expertise. Acunu performs extensive automated testing of Castle to ensure reliability.
This document discusses Cassandra performance improvements using Acunu technology. It describes how Acunu powers Cassandra with features like doubling arrays to improve insert and range query performance by over 100x and 3.5x respectively compared to standard Cassandra. It also discusses Acunu's monitoring, operations and open source aspects.
TUKE MediaEval 2012: Spoken Web Search using DTW and Unsupervised SVM (MediaEval2012)
This document describes a spoken web search system that uses dynamic time warping (DTW) and an unsupervised support vector machine (SVM). It consists of 3 sections:
1) System architecture - outlines the segmentation, feature extraction, SVM method, and searching algorithm components of the system.
2) Experimental results - provides results from testing the system but no details.
3) Conclusion - the concluding remarks for the system but no specifics are given.
CCNxCon2012: Session 5: Distributed Cooperative Caching Scheme in CCN (PARC, a Xerox company)
Distributed Cooperative Caching Scheme in CCN
Dariusz Bursztynowski, Mateusz Dzida, Tomasz Janaszka (Telekomunikacja Polska, Orange Labs, Poland), Adam Dubiel (Warsaw University of Technology, Poland), Michal Rowicki (Warsaw University of Technology and Telekomunikacja Polska, Poland)
The document discusses the topics that will be covered in a .NET summer training program, including introductions to .NET framework classes, data types, OOP concepts, inheritance, multithreading, exception handling, file I/O, ADO.NET, web forms, and HTML controls. The training will cover syntax, architecture, and implementations related to these .NET and web development technologies.
More at http://sites.google.com/site/cudaiap2009 and http://pinto.scripts.mit.edu/Classes/CUDAIAP2009
Note that some slides were borrowed from Matthew Bolitho (Johns Hopkins) and NVIDIA.
This document provides an overview of MXF and AAF file formats. It discusses:
1. Why these formats were developed, which was to allow for content-centric workflows with metadata handling, random access to material, and open standardized compression-independent formats.
2. What the formats are, with MXF being a wrapper format for interchange of finished audiovisual material and metadata, and AAF being a more complex wrapper of metadata and essence for post-production interchange.
3. Some key concepts around the formats, including the source reference chain that allows tracking material origins and derivations, and operational patterns that control complexity.
This document provides an overview of the FESTO line monitoring system. It describes the system components including the web application, DPWS function block, rule engine, ActiveMQ, and how messages flow between devices, through the rule engine and ActiveMQ, and are visualized. The document also describes the types of S1000 operating messages that are transmitted and processed including status, event, and property messages about operators, workstations, and workpieces.
IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NV...) (npinto)
The document discusses parallel computing using GPUs and CUDA. It introduces CUDA as a parallel programming model that allows writing parallel code in a C/C++-like language that can execute efficiently on NVIDIA GPUs. It describes key CUDA abstractions like a hierarchy of threads organized into blocks, different memory spaces, and synchronization methods. It provides an example of implementing parallel reduction and discusses strategies for mapping algorithms to GPU architectures. The overall message is that CUDA makes massively parallel computing accessible using a familiar programming approach.
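The tree-based parallel reduction mentioned above can be sketched sequentially. This is an illustrative model, not CUDA code: at each step element i adds element i + stride, and on a GPU each iteration of the inner loop would run as one thread per i, with a block-wide barrier (`__syncthreads()`) between strides.

```python
# Sequential model of the CUDA tree-reduction pattern: n values are
# summed in O(log n) strided steps. In a real kernel, each `i` in the
# inner loop is a separate thread, and the stride doubling is separated
# by a barrier so all partial sums are visible before the next step.
def tree_reduce(values):
    data = list(values)
    n = len(data)
    stride = 1
    while stride < n:
        for i in range(0, n, 2 * stride):  # one "thread" per i
            if i + stride < n:             # guard for non-power-of-two n
                data[i] += data[i + stride]
        stride *= 2                        # barrier point on a real GPU
    return data[0]
```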
The document discusses the differences between single-threaded and multithreaded programming. In single-threaded programming, each process has a single thread of control running in its address space. Multithreaded programming allows multiple threads of control to run concurrently within the same address space, similar to separate processes but sharing the same memory. This allows improved performance by overlapping I/O with computation and improving processor utilization with parallelism.
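The I/O-with-computation overlap described above is easy to demonstrate. In this sketch a worker thread simulates a blocking read (with `time.sleep` standing in for real I/O) while the main thread keeps computing, so wall time approaches the maximum of the two rather than their sum.

```python
import threading
import time

# One thread "blocks on I/O" while the main thread computes; both share
# the same address space, so the result dict needs no copying between them.
def overlapped(io_seconds, compute_items):
    results = {}

    def fake_io():
        time.sleep(io_seconds)   # stands in for a blocking read
        results["io"] = "data"

    t = threading.Thread(target=fake_io)
    t.start()                    # I/O proceeds in the background
    results["compute"] = sum(i * i for i in range(compute_items))
    t.join()                     # wait for the I/O to finish
    return results
```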
Cloumon is a management and monitoring tool for cloud computing platforms like ZooKeeper, Cassandra, and Hadoop. It collects metrics from these platforms and stores them in a database. It also provides notification and action management for alarms. Cloumon monitors nodes, clusters, and hosts across ZooKeeper, Cassandra, and Hadoop installations.
This document provides a summary of key concepts in Windows Communication Foundation (WCF) including configuration, contracts, bindings, behaviors, and more. It explains that WCF provides an abstraction layer over transports that allows developers to focus on message types rather than transport details. Contracts define message structure, bindings describe how messages are sent, and addresses specify where messages are sent. The document provides overviews of common WCF configuration sections and the purpose of various contracts, bindings, and behaviors.
The document discusses network intrusion detection and anomaly detection from a research perspective. It describes using network processors to develop a device that can perform high-speed packet capturing, timestamping, and processing. The device is used to build a traffic measurements system that can analyze traffic at wire speed and online to accurately characterize network traffic.
This document summarizes a presentation on the UNICORE Server Components. It discusses how UNICORE provides a web services framework for job submission and management across different computing resources. Key points include:
- UNICORE uses a gateway, service containers, and atomic services to expose target systems through standardized web service interfaces.
- Atomic services include job management, storage management, and file transfer services that provide abstract access to computing jobs and files on remote systems.
- Security is handled through XUUDB authentication, XACML authorization policies, and message signing. Configurable security handlers provide flexibility.
Building Applications Using NoSQL Architectures on top of SQL Azure: How MSN ... (DATAVERSITY)
Building highly-available and highly-scalable applications is one of the main reasons for using NoSQL database systems and processing frameworks over traditional relational database systems. Relational database systems have taken notice and are increasingly moving to provide solutions for this class of applications.
In this presentation we will showcase how the Windows Gaming Experience is using SQL Server Azure to build a highly-available and highly-scalable application that is used to create new experiences for millions of casual gamers in the next version of the Bing search engine and to integrate Microsoft games with social-networking sites. They employ several of the NoSQL architectural patterns, such as sharding. We will present the architecture and lessons learned, and provide insight into how the SQL Server Azure service is evolving to support NoSQL application development patterns such as sharding and open schema support, making SQL Server Azure a Not Only SQL database engine.
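The sharding pattern mentioned above can be sketched as a simple hash-based router: a stable hash of the key picks one of N shard databases, spreading data and load across independent instances. The shard names and key scheme below are illustrative, not the actual MSN/SQL Azure layout.

```python
import hashlib

# Hash-based shard routing: the same key always maps to the same shard,
# and keys spread roughly evenly across the shard list.
def shard_for(key, shards):
    digest = hashlib.md5(key.encode()).hexdigest()
    return shards[int(digest, 16) % len(shards)]
```

Real deployments usually add a lookup layer (a shard map) on top of this so shards can be split or moved without rehashing every key.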
by James Broberg - Presentation given at the 2nd International Workshop on Web APIs and Mashups (at ICSOC2008) on December 1st, 2008 in Sydney, Australia. http://www.icsoc-mashups.org/
The Oracle Server Architecture document outlines the core components that make up an Oracle database instance, including background processes, memory structures like the system global area (SGA) and program global area (PGA), online redo logs, control files, and more. It shows how client connections are handled by the database and how resources are shared between users. Key processes keep the database functioning and recoverable, while memory areas cache data and SQL for fast access.
The document discusses steps for deploying a successful virtual network, including designing the network, building and configuring hardware, and configuring the virtual machine manager. It covers providing isolation through techniques like VLANs and software defined networking. Topics include logical network addressing, host configuration options, and creating logical switches. Tenant configuration using network virtualization is described for isolation.
This document discusses Altera's FPGA strategy for reconfigurable hardware in industry applications. It defines reconfigurable hardware as an architecture that does not require on-the-fly timing analysis because product qualification is extensively done through temperature and cycle testing without hardware architecture changes. It then shows how programmable solutions have evolved from single CPU and DSP cores to multi-core processors and coarse-grained arrays with FPGAs moving to fine-grained, massively parallel arrays with embedded hard IP blocks. Future trends include challenges of scaling CPUs due to physical limits and the benefits of parallelism through hardware reconfiguration.
This document provides an overview of the ZFS file system. It discusses ZFS's design goals of simplifying storage and replacing outdated assumptions. It also covers key aspects of ZFS like its layered architecture, use of copy-on-write, lack of need for filesystem checking, virtual devices (vdevs) including mirroring and striping of storage, and dynamic block allocation.
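ZFS's copy-on-write behaviour, mentioned above, can be illustrated with a toy store: blocks are never overwritten in place; an update appends a new block and repoints the reference, which is also why snapshots are cheap. This is a conceptual model, not ZFS's on-disk format.

```python
# Toy copy-on-write store: writes append new blocks and move a pointer;
# old blocks stay intact, so a snapshot is just a copy of the pointers.
class CowStore:
    def __init__(self):
        self.blocks = []   # append-only block storage
        self.live = {}     # name -> index of the current block

    def write(self, name, data):
        self.blocks.append(data)            # old block is untouched
        self.live[name] = len(self.blocks) - 1

    def read(self, name):
        return self.blocks[self.live[name]]

    def snapshot(self):
        return dict(self.live)              # cheap: pointers only

    def read_snapshot(self, snap, name):
        return self.blocks[snap[name]]
```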
Tales From the Field: The Wrong Way of Using Cassandra (Carlos Rolo, Pythian)... (DataStax)
Cassandra is a distributed database with features including, but not limited to, Secondary Indexes, UDFs, and Materialized Views, and with not-so-strict hardware requirements.
It is important to use those features and select hardware correctly to make sure the use of Cassandra in your business can be as painless as possible.
I will address how these features are used in the wrong way, how hardware should be selected, and how to make Cassandra work in the best possible way.
Learning Objective #1:
Learn that Cassandra hardware requirements exist (and why), and the shortcomings of some of its features (Secondary Indexes, Compaction Strategies, etc.).
Learning Objective #2:
The most misused features and the most common hardware errors, and how they might seem harmless at first (on a small cluster or even a single node).
Learning Objective #3:
How to use Cassandra and its features correctly and achieve smooth operation.
About the Speaker
Carlos Rolo Cassandra Consultant, Pythian
Carlos Rolo is a Cassandra MVP, and has deep expertise with distributed architecture technologies. Carlos is driven by challenge, and enjoys the opportunity to discover new things. He has become known and trusted by customers and colleagues for his ability to understand complex problems, and to work well under pressure. When Carlos isn't working he can be found playing water polo or enjoying his local community.
This document summarizes a presentation about data grids versus databases. It discusses how data grids provide extremely fast in-memory access to distributed data and can be used for caching to improve database performance. While data grids offer benefits like scalability, their use requires a different programming model than databases. They may replace databases for some use cases like analytics but databases will remain important for their maturity and existing implementations. Data grids are best viewed as complementing rather than replacing databases.
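The caching use of a data grid described above is usually the cache-aside pattern: check the grid first, fall back to the database on a miss, and populate the grid on the way out. In this sketch the grid is just a dict and `db_load` stands in for a real database call.

```python
# Cache-aside lookup: fast path hits the in-memory grid, slow path
# loads from the database and populates the grid for next time.
def cached_get(key, grid, db_load, stats):
    if key in grid:
        stats["hits"] += 1
        return grid[key]
    stats["misses"] += 1
    value = db_load(key)   # slow path: go to the database
    grid[key] = value      # populate the grid
    return value
```

The different programming model the summary mentions shows up here too: the application, not the database, owns consistency between grid and store (invalidation, expiry, write-through).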
Scylla Summit 2016: ScyllaDB, Present and Future (ScyllaDB)
Where is Scylla now and where is it going? ScyllaDB's CTO Avi Kivity outlines the 3 ScyllaDB Commitments, and gives an overview of the ScyllaDB road map.
Galder Zamarreño from Red Hat presented on Infinispan, an open source in-memory data grid platform. Infinispan can be used as a local cache, clustered cache, or as a data grid. As a data grid, it provides a highly available, distributed, and elastic data store. Infinispan also enables users to build their own data-as-a-service solutions in private clouds by virtualizing data and making it accessible in an elastic and scalable manner. Major companies use Infinispan both as a cache (e.g. for Hibernate) and as a data grid for applications requiring real-time access to distributed data.
Fast, In-Memory SQL on Apache Cassandra with Apache Ignite (Rachel Pedreschi,...) (DataStax)
This document discusses using Apache Ignite to enable in-memory SQL on Apache Cassandra. It provides an overview of GridGain's enterprise and open source strategies, with Ignite being based on the open source version. It then discusses EPAM's engineering capabilities. The remainder discusses Ignite's capabilities for scalable SQL queries with ACID transactions on Cassandra and provides a demo comparing performance of OLTP and OLAP queries between Cassandra and Ignite. Contact information and URLs for more information on Ignite and using it with Cassandra are also provided.
High Availability with Novell Cluster Services for Novell Open Enterprise Ser... (Novell)
High availability provides a safety net for single points of hardware failure. This session will identify the software and hardware requirements for implementing Novell Cluster Services with Novell Open Enterprise Server. We'll cover concepts related to design, installation and monitoring. We'll also show you real-world clustering examples for Novell GroupWise, Novell Teaming and Novell iFolder.
A quick intro to DevCloud, the CloudStack sandbox, and how to use CloudMonkey to manage your cloud.
DevCloud is a VirtualBox image that contains the CloudStack source code and is set up to run the storage infrastructure needed by CloudStack, plus the networking setup to build the guest network of the VMs. Tiny Linux instances can be started within the DevCloud VM, making use of nested virtualization.
This is a perfect setup to discover CloudStack, give demos, and test new code. It is used to test new releases and verify basic functionality. You can run DevCloud on your laptop and then use the command line interface CloudMonkey to make API calls to your DevCloud instance.
This is the perfect complement to the talk on CloudMonkey and shows the basic functionality of a cloud: instance creation, snapshots, networking, network offerings, and AWS EC2 compatibility.
My talk from the BACD (http://buildacloud.org) workshop in Ghent, Belgium.
All videos can be viewed at: http://www.youtube.com/playlist?list=PLb899uhkHRoZZefRW5XmCb8QBcRO7o74E
This is an introductory talk for the workshop. It introduces CloudStack and the community at the Apache Software Foundation, presents the basic layers of cloud computing (IaaS, PaaS, and SaaS), and shows how the CloudStack ecosystem addresses all of them. It covers the basic features of CloudStack: networking with a focus on SDN (Software Defined Networking), storage with a focus on large-scale object stores (Ceph), a use case with Spotify, a PaaS with Karaf and Fuse Fabric, the API using Deltacloud (which provides the CIMI standard interface), and an application integration using the CloudStack API with Activeeon.
This is the perfect complement to the videos on YouTube and serves as an introduction to CloudStack.
The document discusses the semantic web and metadata management. It defines the semantic web as a universal medium for exchanging information electronically that can be processed and still have meaning. It discusses challenges like overcoming prior integration issues and determining return on investment. It also discusses the importance of the semantic web for business needs like re-purposing data instead of re-creating it. Finally, it discusses getting past relevancy overload on the web by making concept URIs more precise to improve search results.
Paris NoSQL User Group - In Memory Data Grids in Action (without transactions... (Cyrille Le Clerc)
In Memory Data Grids in Action with Oracle Coherence presented to No SQL users.
The "transactions" chapter is missing as it has been rescheduled to another session.
This document summarizes Tom Wilkie's presentation on Acunu & OCaml. It discusses how Acunu has evolved from small databases in 1990 to distributed, shared-nothing databases today. It presents the architecture and components of Acunu's storage core, which is built using OCaml. Performance results are shown for Acunu's prototypes using different data structures like doubling arrays, demonstrating high insertion rates.
This document summarizes SQL Azure, Microsoft's cloud-based relational database service. It describes the multi-tenant architecture with accounts, servers, and databases. It explains concepts like replication for high availability and reconfiguration to handle failures. The document also discusses the hardware and software deployment methods used to provide a reliable cloud database platform at scale.
Acunu and Hailo: a realtime analytics case study on CassandraAcunu
Hailo is a taxi app that receives a hail every 4 seconds across 15 cities. It launched on AWS using MySQL but adopted Cassandra and Acunu for greater resilience during international expansion. Cassandra provided high availability and global replication. Acunu provided analytics capabilities on Cassandra data. Hailo uses Cassandra for entity storage and Acunu for analytics, seeing benefits like simplified data modeling, rich queries, and infrastructure monitoring. Choosing these platforms allowed for high availability, multi-data center operation, and scaling to support growth.
- Cassandra nodes are clustered in a ring, with each node assigned a random token range to own.
- Adding or removing nodes traditionally required manually rebalancing the token ranges, which was complex, impacted many nodes, and took the cluster offline.
- Virtual nodes assign each physical node multiple random token ranges of varying sizes, allowing incremental changes where new nodes "steal" ranges from others, distributing the load evenly without manual work or downtime.
Acunu Analytics and Cassandra at Hailo All Your Base 2013 Acunu
Hailo, the taxi app, has served more than 5 million passengers in 15 cities and has taken fares of $100 million this year. I'm going to talk about how that rapid growth has been powered by a platform based on Cassandra and operational analytics and insights powered by Acunu Analytics. I'll cover some challenges and lessons learned from scaling fast!
Understanding Cassandra internals to solve real-world problemsAcunu
The document summarizes Nicolas Favre-Felix's presentation on Cassandra internals at a Cassandra London meetup. It discusses four common problems encountered with Cassandra - high read latency, high CPU usage with little activity, long nodetool repair times, and optimizing write throughput. For each problem, it describes symptoms, analysis using tools like nodetool, and solutions like adjusting the data model, increasing thread pool sizes, and adding hardware resources. The key takeaways are that monitoring Cassandra is important, using the right data model impacts performance, and understanding how Cassandra stores and arranges data on disk is essential to optimization.
Talk for the Cassandra Seattle Meetup April 2013: http://www.meetup.com/cassandra-seattle/events/114988872/
Cassandra's got some properties which make it an ideal fit for building real-time analytics applications -- but getting from atomic increments to live dashboards and streaming queries is quite a stretch. In this talk, Tim Moreton, CTO at Acunu, talks about how and why they built Acunu Analytics, which adds rich SQL-like queries and a RESTful API on top of Cassandra, and looks at how it keeps Cassandra's spirit of denormalization under the hood.
The document describes how Apache Cassandra can be used for real-time analytics on streaming data. It provides an example of counting Twitter mentions of a term per day in real-time by incrementing counters in Cassandra as tweets are processed. This allows queries to be answered by reading the counters. More complex queries can be supported by storing aggregated data in a denormalized format across rows and columns in Cassandra.
Realtime Analytics on the Twitter Firehose with Apache Cassandra - Denormaliz...Acunu
The document discusses implementing real-time analytics on Twitter data using Cassandra. It describes incrementing counters for each tweet to track token frequencies over time. This allows querying token mentions within a date range by reading the relevant counter columns. However, Cassandra's random partitioner prevents efficient range queries on rows. Instead, the solution denormalizes the data into wide rows with time buckets as columns to allow fast counting of token mentions within each time period through a single disk read. The document provides code examples and encourages experimenting with an open source implementation.
This document discusses real-time analytics with Cassandra. It includes sections on motivation/alternatives, what real-time analytics with Cassandra is, how it works, approximate analytics, and what problems it can help solve. The document contains log data as an example of the type of data that can be analyzed with this technique.
- The document discusses Acunu Analytics, a real-time big data analytics platform.
- It addresses the motivation for developing Acunu Analytics compared to alternatives. It also briefly describes what Acunu Analytics is, how it works, and what problems it can help solve.
- The main topics covered are the product itself, its capabilities for real-time analytics of big data, and potential use cases.
Realtime Analytics on the Twitter Firehose with CassandraAcunu
This document discusses using Cassandra for real-time analytics of Twitter data. It describes incrementing counters in Cassandra as tweets are processed to track metrics like mentions over time. This allows queries to retrieve trends by reading counters with a single I/O, rather than scanning large amounts of data. The document demonstrates preparing tweet data by tokenizing and incrementing counters in time buckets. It also covers implementing a range query to retrieve mentions between dates from a wide row with time buckets as columns.
This document discusses a distributed database called Acunu that is tunably consistent, highly available, and partition tolerant. It can scale out on commodity servers and provides high performance. The database uses a multi-master architecture without single points of failure and supports data replication across multiple data centers. It also provides a simple but powerful data model and is well-suited for applications involving high-velocity data.
Cassandra EU 2012 - Overview of Case Studies and State of the Market by 451 R...Acunu
The document discusses NoSQL, NewSQL, and other database technologies that are emerging to address limitations of relational databases in scaling to meet demands for performance, availability, and flexibility. It provides an overview of different categories of NoSQL databases and NewSQL solutions, and analyzes drivers like scalability, performance, relaxed consistency, agility, and complexity of data that are contributing to adoption of these new database approaches.
Cassandra EU 2012 - Putting the X Factor into CassandraAcunu
Malcolm Box discusses Tellybug's experience using Cassandra to power voting applications for reality TV shows like Britain's Got Talent and The X Factor. They started with Cassandra to handle high write loads from millions of votes but found counting to be more challenging than expected. They implemented sharded counters in Memcached with Cassandra as the source of truth. While Cassandra scaled well for writes, reads had performance issues. Backup and data integrity also presented operational challenges as their usage of Cassandra evolved.
Cassandra EU 2012 - CQL: Then, Now and When by Eric Evans Acunu
The document discusses the history and development of Cassandra Query Language (CQL), which provides an SQL-like interface for querying Apache Cassandra databases. It describes CQL evolving from versions 1.0 through 3.0 to become more standardized and user-friendly. Key points include CQL initially being introduced in Cassandra 0.8 to replace the low-level Thrift API, its goals of being simple, intuitive, and high performing, and ongoing work to improve its interface stability and driver support across languages.
Cassandra EU 2012 - Storage Internals by Nicolas Favre-FelixAcunu
The document discusses Cassandra's storage internals. It describes how Cassandra writes data to memtables and commit logs in memory before flushing to immutable SSTables on disk. It also explains how compaction merges SSTables to reclaim space and improve performance. For reads, Cassandra uses memtables, bloom filters on SSTables, key caches, and row caches to minimize disk I/O. Counters are implemented by coordinating writes across replicas.
Sudheer Mechineni, Head of Application Frameworks, Standard Chartered Bank
Discover how Standard Chartered Bank harnessed the power of Neo4j to transform complex data access challenges into a dynamic, scalable graph database solution. This keynote will cover their journey from initial adoption to deploying a fully automated, enterprise-grade causal cluster, highlighting key strategies for modelling organisational changes and ensuring robust disaster recovery. Learn how these innovations have not only enhanced Standard Chartered Bank’s data infrastructure but also positioned them as pioneers in the banking sector’s adoption of graph technology.
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIVladimir Iglovikov, Ph.D.
Presented by Vladimir Iglovikov:
- https://www.linkedin.com/in/iglovikov/
- https://x.com/viglovikov
- https://www.instagram.com/ternaus/
This presentation delves into the journey of Albumentations.ai, a highly successful open-source library for data augmentation.
Created out of a necessity for superior performance in Kaggle competitions, Albumentations has grown to become a widely used tool among data scientists and machine learning practitioners.
This case study covers various aspects, including:
People: The contributors and community that have supported Albumentations.
Metrics: The success indicators such as downloads, daily active users, GitHub stars, and financial contributions.
Challenges: The hurdles in monetizing open-source projects and measuring user engagement.
Development Practices: Best practices for creating, maintaining, and scaling open-source libraries, including code hygiene, CI/CD, and fast iteration.
Community Building: Strategies for making adoption easy, iterating quickly, and fostering a vibrant, engaged community.
Marketing: Both online and offline marketing tactics, focusing on real, impactful interactions and collaborations.
Mental Health: Maintaining balance and not feeling pressured by user demands.
Key insights include the importance of automation, making the adoption process seamless, and leveraging offline interactions for marketing. The presentation also emphasizes the need for continuous small improvements and building a friendly, inclusive community that contributes to the project's growth.
Vladimir Iglovikov brings his extensive experience as a Kaggle Grandmaster, ex-Staff ML Engineer at Lyft, sharing valuable lessons and practical advice for anyone looking to enhance the adoption of their open-source projects.
Explore more about Albumentations and join the community at:
GitHub: https://github.com/albumentations-team/albumentations
Website: https://albumentations.ai/
LinkedIn: https://www.linkedin.com/company/100504475
Twitter: https://x.com/albumentations
A tale of scale & speed: How the US Navy is enabling software delivery from l...sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...Neo4j
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Removing Uninteresting Bytes in Software FuzzingAftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing xml documents, and Binutil's readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format). Our preliminary results show that AFL+DIAR does not only discover new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionAggregage
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
Pushing the limits of ePRTC: 100ns holdover for 100 daysAdtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
Full-RAG: A modern architecture for hyper-personalizationZilliz
Mike Del Balso, CEO & Co-Founder at Tecton, presents "Full RAG," a novel approach to AI recommendation systems, aiming to push beyond the limitations of traditional models through a deep integration of contextual insights and real-time data, leveraging the Retrieval-Augmented Generation architecture. This talk will outline Full RAG's potential to significantly enhance personalization, address engineering challenges such as data management and model training, and introduce data enrichment with reranking as a key solution. Attendees will gain crucial insights into the importance of hyperpersonalization in AI, the capabilities of Full RAG for advanced personalization, and strategies for managing complex data integrations for deploying cutting-edge AI solutions.
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
How to Get CNIC Information System with Paksim Ga.pptxdanishmna97
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
UiPath Test Automation using UiPath Test Suite series, part 6DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 6. In this session, we will cover Test Automation with generative AI and Open AI.
UiPath Test Automation with generative AI and Open AI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI, a test automation solution, with Open AI advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers, and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI
Test Automation with generative AI and Open AI.
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
2. Outline
• Why Castle?
• A [quick] tour of Castle
• Cassandra on Castle
• An aside into Memcache
• Cross-cluster snapshots and clones
Saturday, 24 September 2011
3. Before the Flood
1990
Small databases
BTree indexes
BTree File systems
RAID
Old hardware
4. Two Revolutions
2010
Distributed, shared-nothing databases
Write-optimised indexes
BTree file systems
RAID
New hardware
[per-node stack, repeated across the cluster]
5. Bridging the Gap
2011
Distributed, shared-nothing databases
Castle
New hardware
[per-node stack, repeated across the cluster]
6. [Full-stack architecture diagram. Userspace: keys and values enter via an async shared-memory ring and a streaming interface (range queries, buffered key/value insert and get). Kernelspace, in the Acunu kernel: the Doubling Arrays layer (insert queues, Bloom filters, array management, doubling-array merges), the Arrays layer (mod-list B-trees and the version tree), and the cache / "extent" layer (extent manager & mapper, freespace allocator, prefetcher, flusher, block mapping & page cache). Below that, the Linux kernel's block and memory-management layers.]
7. Castle
• Like ZFS+BDB for Big Data
• Open source (GPLv2; MIT for user libraries): http://bitbucket.org/acunu
• Loadable kernel module, targeting CentOS's 2.6.18
• http://www.acunu.com/blogs/andy-twigg/why-acunu-kernel/
[The full-stack architecture diagram, shown alongside the bullets]
8. The Interface
[Architecture diagram with the userspace interface layer highlighted: the async shared-memory ring and the streaming interface (range queries, buffered key/value insert and get); implemented in castle_{back,objects}.c]
9. The Interface
Tree of versions, with an attachment
• Create, snapshot, clone
• Attach/detach
• Keys: any dimensional
• Values: any size
• Simple get, put, delete
• Iterator, slice interfaces
• Streaming interface
[Diagram: a version tree rooted at v0, with descendants v1, v3, v12, v13, v15, v16, v24]
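The operations listed on this slide (create, snapshot and clone over a tree of versions, plus get/put/delete) can be sketched as a toy in-memory model. This is a hypothetical illustration of the semantics only, not the real libcastle interface:

```python
class VersionedStore:
    """Toy model: each version stores only its own writes; reads walk
    up the version tree to the nearest ancestor that wrote the key."""

    def __init__(self):
        self.parent = {0: None}  # version tree, child -> parent
        self.writes = {0: {}}    # per-version deltas
        self.next_v = 1

    def clone(self, v):
        """Snapshot/clone: a new writable version whose parent is v."""
        child, self.next_v = self.next_v, self.next_v + 1
        self.parent[child] = v
        self.writes[child] = {}
        return child

    def put(self, v, key, value):
        self.writes[v][key] = value

    def delete(self, v, key):
        self.writes[v][key] = None  # tombstone shadows ancestor writes

    def get(self, v, key):
        while v is not None:
            if key in self.writes[v]:
                return self.writes[v][key]
            v = self.parent[v]
        return None

store = VersionedStore()
store.put(0, "k", "old")
v1 = store.clone(0)          # cheap snapshot of version 0
store.put(v1, "k", "new")
print(store.get(0, "k"), store.get(v1, "k"))  # old new
```

Because a clone records only its own writes, snapshots are cheap and the version tree can branch arbitrarily, matching the v0/v1/v3... tree on the slide.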
10. The Interface
[The interface diagram again, highlighting castle_{back,objects}.c]
11. Doubling Array
[Architecture diagram with the Doubling Arrays layer highlighted: insert queues, Bloom filters, array management, merges; implemented in castle_{da,bloom}.c]
12. Doubling Array: Inserts
Buffer arrays in memory until we have > B of them.
[Diagram: incoming keys (2, 9, ...) accumulating in small sorted in-memory arrays]
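The insert path described here, buffering small sorted arrays and repeatedly merging equal-sized ones, is the core doubling-array idea. A minimal in-memory sketch (illustrative assumptions only; the real structure keeps B-sized buffers and merges arrays on disk):

```python
import bisect
import heapq

class DoublingArray:
    """Toy doubling array: level i holds a sorted array of 2**i keys,
    or nothing. Inserts merge equal-sized arrays upward, so all the
    heavy work is sequential merging."""

    def __init__(self):
        self.levels = []  # levels[i]: sorted list of length 2**i, or None

    def insert(self, key):
        carry = [key]  # a new sorted array of size 1
        i = 0
        while True:
            if i == len(self.levels):
                self.levels.append(None)
            if self.levels[i] is None:
                self.levels[i] = carry  # slot free: park the array here
                return
            # Slot occupied: merge two size-2**i arrays into one of 2**(i+1).
            carry = list(heapq.merge(self.levels[i], carry))
            self.levels[i] = None
            i += 1

    def lookup(self, key):
        # Check every level; the real DA short-circuits with Bloom filters.
        for arr in self.levels:
            if arr:
                j = bisect.bisect_left(arr, key)
                if j < len(arr) and arr[j] == key:
                    return True
        return False

da = DoublingArray()
for k in [9, 2, 2, 9, 5]:
    da.insert(k)
print(da.lookup(5), da.lookup(7))  # True False
```

Every key is written through a cascade of sequential merges, which is how the DA turns small random inserts into sequential IO.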
15. Doubling Array
[The Doubling Arrays diagram again: castle_{da,bloom}.c]
16. "Mod-list" B-Tree
So how to do snapshots and clones?
[Architecture diagram with the Arrays layer highlighted: mod-list B-trees and the version tree; implemented in castle_{btree,versions}.c]
17. Copy-on-Write BTree
Idea:
• Apply path-copying [DSST] to the B-tree
Problems:
• Space blowup: each update may rewrite an entire path
• Slow updates: as above
A log file system makes updates sequential, but relies on random access and garbage collection (its Achilles' heel!)
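The path-copying idea, and the space blowup it causes, can be seen on a toy immutable binary search tree: each update copies every node on the root-to-leaf path while sharing everything off the path (a sketch for intuition, not Castle's B-tree code):

```python
class Node:
    __slots__ = ("key", "left", "right")
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def insert(root, key):
    """Path-copying insert: returns a new root. Every node on the
    root-to-leaf path is copied; everything off the path is shared."""
    if root is None:
        return Node(key)
    if key < root.key:
        return Node(root.key, insert(root.left, key), root.right)
    return Node(root.key, root.left, insert(root.right, key))

v0 = None
for k in [4, 2, 6]:
    v0 = insert(v0, k)
v1 = insert(v0, 5)            # v0 survives untouched as a snapshot
print(v0.right.left is None)  # True: old version never sees the 5
print(v1.right.left.key)      # 5
print(v1.left is v0.left)     # True: off-path subtree shared, path copied
```

Each update allocates O(depth) fresh nodes even for a one-key change, which is exactly the per-update rewrite cost the slide complains about.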
18.
                               Update                       Range query             Space
CoW B-Tree                     O(log_B Nv) random IOs       O(Z/B) random IOs       O(N B log_B Nv)
"BigTable"/LevelDB-style DA    O((log N)/B) sequential IOs  O(Z/B) sequential IOs   O(VN)
Castle: "Mod-list" in a DA     O((log N)/B) sequential IOs  O(Z/B) sequential IOs   O(N)

Nv = #keys live (accessible) at version v
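To make the table concrete, a back-of-the-envelope comparison of update costs, using illustrative assumptions of B = 4096 entries per block and the talk's 3-billion-row workload:

```python
import math

B = 4096           # entries per block (assumed for illustration)
N = 3_000_000_000  # keys, as in the 3-billion-row benchmark

cow_ios_per_update = math.log(N, B)    # O(log_B N) random IOs
da_ios_per_update  = math.log2(N) / B  # O((log N)/B) sequential IOs

print(f"CoW B-tree:     ~{cow_ios_per_update:.2f} random IOs per update")
print(f"Doubling array: ~{da_ios_per_update:.4f} sequential IOs per update")
# The DA amortises each update over large sequential merges, so it needs
# orders of magnitude fewer (and far cheaper) IOs than the CoW B-tree.
```

With these assumed numbers the DA does over two orders of magnitude fewer IOs per update, and they are sequential rather than random, which is where the throughput gap in the later benchmarks comes from.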
19. Stratified B-Trees
• Retires Copy-On-Write B-Trees, the bedrock of modern storage (Sun ZFS, NetApp WAFL, ...)
• Patent-pending, next-generation data structure
• Theoretically optimal, yet highly practical

[Embedded paper: "Copy-on-write B-tree finally beaten." Andy Twigg, Andrew Byde, Grzegorz Miłoś, Tim Moreton, John Wilkes and Tom Wilkie. Acunu and Google. http://goo.gl/INTb1 http://goo.gl/gzihe

Abstract: A classic versioned data structure in storage and computer science is the copy-on-write (CoW) B-tree – it underlies many of today's file systems and databases, including WAFL, ZFS, Btrfs and more. Unfortunately, it doesn't inherit the B-tree's optimality properties; it has poor space utilization, cannot offer fast updates, and relies on random IO to scale. Yet, nothing better has been developed since. We describe the 'stratified B-tree', which beats the CoW B-tree in every way. In particular, it is the first versioned dictionary to achieve optimal tradeoffs between space, query and update performance. Therefore, we believe there is no longer a good reason to use CoW B-trees for versioned data stores.

This paper presents some recent results on new constructions for B-trees that go beyond copy-on-write, that we call 'stratified B-trees'. They solve two open problems: firstly, they offer a fully-versioned B-tree with optimal space and the same lookup time as the CoW B-tree; secondly, they are the first to offer other points on the Pareto-optimal query/update tradeoff curve, and in particular, our structures offer fully-versioned updates in o(1) IOs while using linear space. Experimental results indicate 100,000s of updates/s on a large SATA disk, two orders of magnitude faster than a CoW B-tree. Since stratified B-trees subsume CoW B-trees (and indeed all other known versioned external-memory dictionaries), we believe there is no longer a good reason to use them for versioned data stores. Acunu is developing a commercial in-kernel implementation of stratified B-trees, which we hope to release soon.]
20. "Mod-list" B-Tree
[The Arrays-layer diagram again: mod-list B-trees and the version tree; castle_{btree,versions}.c]
21. Disk Layout: RDA
[Architecture diagram with the cache / "extent" layer highlighted: extent manager & mapper, freespace allocator, prefetcher, flusher, block mapping & page cache; implemented in castle_{cache,extent,freespace,rebuild}.c]
23. SSD tiering [taster]
• Why? Key to larger-than-cache random reads
• v1: SSD for metadata structures
• Redundancy provided by disk
• SSD for selected collection data (CFs)
• 10x the write rate on SSDs compared to regular filesystems
24. [The full-stack architecture diagram again, from the userspace shared-memory and streaming interfaces down through Doubling Arrays, Arrays and the cache / "extent" layer to the Linux kernel's block and memory-management layers]
25. Cassandra on Castle
• Eliminate all 'storage heavy lifting'
• Extend ColumnFamilyStore
• Efficient JNI bindings to libcastle C library
• row, col, value, t: (row, col) -> (t, value)
• row, a|b|c|d, value, t: (row, a, b, c, d, col) -> (t, value)
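The last two bullets describe flattening Cassandra's (row, column) coordinates into a multi-dimensional Castle key whose value carries the timestamp. A sketch of the 2-D case (hypothetical helpers, not the real JNI binding; the composite a|b|c|d case just adds dimensions to the key tuple):

```python
# Sketch of the (row, col) -> (t, value) mapping described above;
# hypothetical illustration, not the real libcastle JNI binding.

def make_key(row, col):
    """Castle keys are multi-dimensional; model one as a tuple."""
    return (row, col)

def put(store, row, col, value, t):
    key = make_key(row, col)
    old = store.get(key)
    if old is None or t > old[0]:  # last-write-wins on timestamp
        store[key] = (t, value)

def get(store, row, col):
    return store.get(make_key(row, col))

store = {}
put(store, "user:42", "name", "alice", t=1)
put(store, "user:42", "name", "bob", t=2)    # newer timestamp wins
put(store, "user:42", "name", "stale", t=0)  # older write is ignored
print(get(store, "user:42", "name"))  # (2, 'bob')
```

Keeping the timestamp in the value gives Cassandra's last-write-wins semantics while letting Castle's multi-dimensional keys do the storage heavy lifting.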
26. Small random inserts
Inserting 3 billion rows
[Chart: insert throughput over time, Acunu-powered Cassandra vs 'standard' Cassandra]
27. Insert latency
While inserting 3 billion rows
[Chart: insert latency, Acunu-powered Cassandra (x) vs 'standard' Cassandra (+)]
28. Small random range queries
Performed immediately after inserts
[Chart: range-query throughput, Acunu-powered Cassandra vs 'standard' Cassandra]
29. Memcache + Cassandra
Same data! 100k random inserts/sec!
[Diagram: a Cassandra client (get/insert) and a memcached client (get/put) hitting the same per-node stacks; on each node, Cassandra and memcache, each with its own replication logic, run side by side on Castle over the same hardware]
30. v2: Cross-cluster versions
• Eventually consistent
• Spans data centers
• Tolerates node failure, network partition
• High performance, no space overhead
• Dev/Test/Staging on Prod clusters
31. So...
• Castle = ZFS + BDB for Big Data
• Cassandra on Castle runs apps unmodified
• Up to 100x throughput under load
• No GC pauses: very predictable latencies
• v2: Cross-cluster snapshot and clone
• SSD optimisation
33. Questions?
Tim Moreton // @timmoreton
http://goo.gl/INTb1 http://goo.gl/gzihe
Apache, Apache Cassandra, Cassandra, Hadoop, and the eye and elephant logos are trademarks of the Apache Software Foundation.