This document summarizes a conference paper on implementing CCIndex, an indexing scheme, in Apache Cassandra to support multi-dimensional range queries. CCIndex reorganizes data into complementary tables indexed on non-primary-key columns so that range scans become efficient. The paper discusses the differences between Cassandra and HBase, where CCIndex was previously implemented, and proposes a new approach to estimating query result sizes based on the data distribution. Experimental results show that CCIndex improves Cassandra's query performance by 2.4–3.7 times across a range of selectivities. The paper also covers CCIndex's implementation details in Cassandra and its recovery mechanism.
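The core mechanism described here — a complementary table kept sorted on a non-primary-key column, so that a range predicate becomes a contiguous scan — can be sketched in a few lines of Python. This is an illustrative toy model, not the paper's actual Cassandra implementation; all class and method names are invented here:

```python
import bisect

class CCIndexSketch:
    """Toy model of the CCIndex idea: alongside the original table,
    keep one complementary table per indexed column, sorted on that
    column, so a range query becomes a contiguous scan."""

    def __init__(self, indexed_columns):
        self.original = {}                      # row key -> row dict
        self.indexed_columns = indexed_columns
        # complementary tables: column -> sorted list of (value, row key)
        self.complementary = {col: [] for col in indexed_columns}

    def put(self, row_key, row):
        self.original[row_key] = row
        # every write also maintains each complementary table
        for col in self.indexed_columns:
            bisect.insort(self.complementary[col], (row[col], row_key))

    def range_query(self, col, lo, hi):
        # scan only the contiguous [lo, hi] slice, not the whole data set
        table = self.complementary[col]
        start = bisect.bisect_left(table, (lo,))
        result = []
        for value, row_key in table[start:]:
            if value > hi:
                break
            result.append(self.original[row_key])
        return result
```

A multi-dimensional query would pick the most selective column (this is where the paper's result-size estimation comes in), run the range scan on that column's complementary table, and filter the returned rows on the remaining predicates.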
An Optimal Cooperative Provable Data Possession Scheme for Distributed Cloud ... – IJMER
International Journal of Modern Engineering Research (IJMER) is a peer-reviewed, online journal. It serves as an international archival forum of scholarly research related to engineering and science education.
Dimensionality Reduction Techniques for Document Clustering - A Survey – IJTET Journal
Abstract— Dimensionality reduction techniques are applied to get rid of inessential terms, such as redundant and noisy terms, in documents. In this paper a systematic study is conducted of seven dimensionality reduction methods: Latent Semantic Indexing (LSI), Random Projection (RP), Principal Component Analysis (PCA), CUR decomposition, Latent Dirichlet Allocation (LDA), Singular Value Decomposition (SVD), and Linear Discriminant Analysis (LDA).
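Of the methods the survey lists, random projection is the simplest to illustrate: each high-dimensional document vector is multiplied by a random Gaussian matrix, which approximately preserves pairwise distances (the Johnson–Lindenstrauss lemma). A minimal pure-Python sketch, not taken from the survey itself:

```python
import random
import math

def random_projection(vectors, k, seed=0):
    """Project d-dimensional vectors down to k dimensions using a
    random Gaussian matrix, scaled by 1/sqrt(k) so that pairwise
    distances are approximately preserved on average."""
    rng = random.Random(seed)
    d = len(vectors[0])
    # k x d projection matrix with independent N(0, 1) entries
    matrix = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(k)]
    scale = 1.0 / math.sqrt(k)
    return [
        [scale * sum(row[j] * v[j] for j in range(d)) for row in matrix]
        for v in vectors
    ]
```

Unlike PCA or SVD, this needs no pass over the data to build the projection, which is why it scales well to very large document-term matrices.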
CASSANDRA: A DISTRIBUTED NOSQL DATABASE FOR HOTEL MANAGEMENT SYSTEM – IJCI JOURNAL
Apache Cassandra is a distributed storage system for managing very large amounts of structured data. Cassandra provides a highly available service with no single point of failure. It aims to run on top of an infrastructure of hundreds of nodes, possibly spread across different data centers, in which small and large components fail continuously. Cassandra manages persistent state in the face of these failures, which drives the reliability and scalability of the software systems built on it. In many ways Cassandra resembles a relational database and shares many design and implementation strategies with one, but it does not support a full relational data model. In this paper, we discuss an implementation of Cassandra as a hotel management system application. The Cassandra system was designed to run on cheap commodity hardware, and it provides high write throughput and read efficiency.
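Cassandra's lack of a single point of failure comes from placing data on a consistent-hash ring and replicating each key to the next N distinct nodes. A toy version of that placement logic (illustrative only, not Cassandra's actual partitioner code):

```python
import hashlib
import bisect

class Ring:
    """Toy consistent-hash ring: each key is owned by the first node
    at or after its hash position, plus the next (n-1) distinct nodes,
    so the loss of any single node never makes the key unavailable."""

    def __init__(self, nodes):
        self.tokens = sorted((self._hash(n), n) for n in nodes)

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def replicas(self, key, n=3):
        # find the first token at or after the key's hash, wrapping around
        start = bisect.bisect(self.tokens, (self._hash(key),))
        out = []
        for i in range(len(self.tokens)):
            node = self.tokens[(start + i) % len(self.tokens)][1]
            if node not in out:
                out.append(node)
            if len(out) == n:
                break
        return out
```

Because every key has n independent owners, reads and writes can proceed while any one of them is down, which is the availability property the abstract describes.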
Cooperative Demonstrable Data Retention for Integrity Verification in Multi-C... – Editor IJCATR
Demonstrable data retention (DDR) is a technique that assures the integrity of data in storage outsourcing. In this paper we propose an efficient DDR protocol that prevents an attacker from gaining information from multiple cloud storage nodes. Our technique targets distributed cloud storage and supports service scalability and data migration; it cooperatively stores and maintains the client's data across multi-cloud storage. To ensure the security of our technique we use a zero-knowledge proof system, which satisfies the zero-knowledge, knowledge-soundness, and completeness properties. We present a Cooperative DDR (CDDR) protocol based on a hash index hierarchy and homomorphic verification responses. To optimize performance, we use a novel technique for selecting optimal parameter values, reducing the client's storage overhead and the computation costs of service providers.
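The challenge-response flavor of a possession proof can be conveyed with a much simpler scheme than CDDR: at setup the verifier precomputes a stock of nonce-salted block digests, then later challenges the storage node to recompute one. This is deliberately not the CDDR protocol (no zero-knowledge or homomorphic responses); every name below is invented for illustration:

```python
import hashlib
import os

def digest(nonce, block):
    return hashlib.sha256(nonce + block).hexdigest()

class Verifier:
    """Precomputes challenge answers at setup so the original blocks
    can be discarded; each challenge is usable once."""

    def __init__(self, blocks, challenges=10):
        self.expected = []
        for _ in range(challenges):
            nonce = os.urandom(16)
            i = int.from_bytes(os.urandom(2), "big") % len(blocks)
            self.expected.append((i, nonce, digest(nonce, blocks[i])))

    def challenge(self, storage):
        # ask the storage node to hash a random block with a fresh nonce
        i, nonce, want = self.expected.pop()
        return storage.respond(i, nonce) == want

class Storage:
    def __init__(self, blocks):
        self.blocks = list(blocks)

    def respond(self, i, nonce):
        return digest(nonce, self.blocks[i])
```

The fresh nonce is what prevents the storage node from caching answers and discarding the data; CDDR's contribution is making this idea work across cooperating clouds with constant-size responses.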
Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay... – Samsung Business USA
Which storage technology, HDDs or SSDs, excels in big data architecture? SSDs clearly win on speed, offering higher sequential read/write speeds and higher IOPS. However, deploying SSDs in hundreds or thousands of nodes could add up to a very expensive proposition. A better approach identifies critical locations where SSDs enable immediate cost-per-performance wins. This whitepaper will look at the basics of big data tools, review two performance wins with SSDs in a well-known framework, as well as present some examples of emerging opportunities on the leading edge of big data technology.
The NEC DX1000 MicroServer Chassis is an ultra-dense multi-server platform that offers scale-up and scale-out capabilities, but with a power profile comparable to modern single 2U servers. With enterprise-class features such as redundant switching, high bandwidth, and low-latency connections for up to 46 multi-core server nodes – each with 32 GB of RAM and onboard SSD storage, the NEC DX1000 can be easily leveraged to meet your private cloud provisioning needs.
EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMING – ijiert bestjournal
Unstructured data poses challenges to storage. Experts estimate that 80 to 90 percent of the data in any organization is unstructured, and the amount of unstructured data in enterprises is growing significantly, often many times faster than structured databases are growing. Structured data exists in table format, i.e., with a proper schema, while unstructured data is schema-less, which directly signifies the importance of the NoSQL storage model and the MapReduce platform. In the existing system, unstructured data is processed against a Cassandra dataset. In the present system, MongoDB is implemented alongside the Cassandra dataset, as MongoDB provides a flexible data model and a large number of options for querying unstructured data, whereas Cassandra models its data so as to minimize the total number of queries through more careful planning and denormalization. Cassandra offers basic secondary indexes, but for the best performance it is recommended to model data so as to use them infrequently.
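Hadoop Streaming drives exactly this kind of processing by piping records through stdin/stdout mapper and reducer scripts. The canonical word-count pair, shown here as in-process functions rather than two separate script files, looks like:

```python
from itertools import groupby

def mapper(lines):
    """Map step: emit a (word, 1) pair for every word, as a streaming
    mapper would write tab-separated pairs to stdout."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reducer(pairs):
    """Reduce step: Hadoop delivers pairs sorted by key, so counts for
    each word can be summed in a single grouped pass (we sort here to
    simulate the shuffle phase)."""
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))
```

In a real job the mapper and reducer would be standalone scripts passed to the `hadoop-streaming` jar; the point is that any language that can read stdin can process the unstructured records.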
EFFICIENT R-TREE BASED INDEXING SCHEME FOR SERVER-CENTRIC CLOUD STORAGE SYSTEM – Nexgen Technology
The role of materialized views is becoming vital in today's distributed data warehouses. Materialization is where parts of the data cube are pre-computed. Some real-time distributed architectures maintain materialization transparency, in the sense that users are not aware of which materializations exist at a node. What they typically employ is a cache maintenance mechanism in which query results are cached. When a query requesting a materialization arrives at a distributed node, the node checks its cache and, if the materialization is available, answers the query. If the materialization is not available, the node forwards the query through the network until a node holding the requested materialization is found. This kind of network communication increases the number of query forwardings between nodes. The aim of this paper is to reduce these multiple redirects: we propose a new CB-pattern tree index to identify the exact distributed node where the needed materialization is available.
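The forwarding problem being targeted can be made concrete: without an index, a cache miss triggers a hop-by-hop search, while a global index lets the first node jump straight to the holder. In the sketch below a plain dictionary stands in for the proposed CB-pattern tree; all names are illustrative:

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.cache = {}          # materialization id -> cached result

def lookup_with_forwarding(nodes, start, mat_id):
    """Baseline: walk node to node until a cache hit.
    Returns (result, number_of_hops)."""
    order = nodes[start:] + nodes[:start]
    for hops, node in enumerate(order):
        if mat_id in node.cache:
            return node.cache[mat_id], hops
    return None, len(nodes)

def lookup_with_index(nodes, index, mat_id):
    """Indexed: one lookup identifies the holding node directly,
    so at most one forward is ever needed."""
    holder = index.get(mat_id)
    if holder is None:
        return None, 0
    return nodes[holder].cache[mat_id], 1
```

The indexed path answers in a constant number of hops regardless of cluster size, which is exactly the redirect reduction the abstract claims.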
Big Data analytics is common in many business domains: the financial sector (for example, savings portfolio analysis), government agencies, scientific research, and insurance providers, to name a few. The uses of Big Data range from generating simple reports to executing complex analytical workloads. The increase in the amount of data being stored and processed in these domains exposes many challenges with respect to scalable processing of analytical queries. Massively Parallel Processing (MPP) databases address these challenges by distributing storage and query processing across multiple compute nodes and distributed processes in parallel, usually in a shared-nothing architecture. Today, new technologies are shaping the way platforms for the Internet of Services are designed and managed. One such technology is containers, such as Docker [1,2] and LXC [3,4,5]. The use of containers as a base technology for large-scale distributed systems opens many challenges in the area of run-time resource management, for example auto-scaling, optimal deployment, and monitoring.
Join Principal Strategy Architect Ankit Patel to discuss the digital modernization journey many enterprises have taken from relational to NoSQL databases. In this webinar we will discuss the following:
• Why is there a need for digital modernization?
• What are the characteristics of the innovative data platform?
• What is NoSQL Apache Cassandra?
• How does DataStax innovate the NoSQL data platform?
• What are some of the challenges associated with digital modernization and migration?
A NOVEL APPROACH FOR HOTEL MANAGEMENT SYSTEM USING CASSANDRA – ijfcstjournal
Apache Cassandra is a distributed storage system for managing very large amounts of structured data. Cassandra provides a highly available service with no single point of failure. It aims to run on top of an infrastructure of hundreds of nodes, possibly spread across different data centers, in which small and large components fail continuously. Cassandra manages persistent state in the face of these failures, which drives the reliability and scalability of the software systems built on it. In many ways Cassandra resembles a relational database and shares many design and implementation strategies with one, but it does not support a full relational data model. In this paper, we discuss an implementation of Cassandra as a hotel management system application. The Cassandra system was designed to run on cheap commodity hardware, and it provides high write throughput and read efficiency.
The Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data. Cassandra's support for replicating across multiple datacenters is best-in-class, providing lower latency for your users and the peace of mind of knowing that you can survive regional outages.
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility – inside-BigData.com
In this deck from the Swiss HPC Conference, Mark Wilkinson presents: 40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility.
"DiRAC is the integrated supercomputing facility for theoretical modeling and HPC-based research in particle physics, astrophysics, cosmology, and nuclear physics, all areas in which the UK is world-leading. DiRAC provides a variety of compute resources, matching machine architecture to the algorithm design and requirements of the research problems to be solved. As a single federated Facility, DiRAC allows more effective and efficient use of computing resources, supporting the delivery of the science programs across the STFC research communities. It provides a common training and consultation framework and, crucially, provides critical mass and a coordinating structure for both small- and large-scale cross-discipline science projects, the technical support needed to run and develop a distributed HPC service, and a pool of expertise to support knowledge transfer and industrial partnership projects. The on-going development and sharing of best-practice for the delivery of productive, national HPC services with DiRAC enables STFC researchers to produce world-leading science across the entire STFC science theory program."
Watch the video: https://wp.me/p3RLHQ-k94
Learn more: https://dirac.ac.uk/
and
http://hpcadvisorycouncil.com/events/2019/swiss-workshop/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Similar to "CCIndex for Cassandra: A Novel Scheme for Multi-dimensional Range Queries in Cassandra"
We are a company that delivers value to our customers by lowering costs with digital marketing and increasing the efficiency of campaigns and their conversions. Using the most advanced artificial intelligence models from a neuro-marketing perspective, we are able to predict the effectiveness of a marketing campaign before it is published. After publication, we evaluate the campaign, segmenting the audience according to the pattern extracted from each market segment and delivering information for strategic and efficient management.
High-Performance Applications with JHipster Full Stack – João Gabriel Lima
Talk presented at the Sou Java Campinas community meetup about JHipster, demystifying many assumptions and validating the best that the Java technology market has to offer.
Talk presented at FEMUG-PE in September! I show the ARKit framework and some very interesting applications of augmented reality. Finally, I present React-Native-ArKit, a library that lets you, the React Native developer, use ARKit in your projects in an easy and very practical way.
With the growing wave of generated data, it is increasingly clear that Big Data preparation and processing technologies need to lean on Artificial Intelligence. In this talk I present the state of the art in Big Data and AI, clearly showing the relationship between these topics and giving direction on how these concepts should be applied. A case study was presented on Operação Serenata de Amor, proposed by data scientists and journalists to fight corruption in Brazil.
The regression model is then used to predict the outcome of an unknown dependent variable, given the values of the independent variables.
In this lesson, I show a step-by-step theoretical and practical approach to performing linear regression using WEKA.
In this presentation, the main cases that occurred between 2015 and 2016 were discussed, detailing how each one was carried out, the techniques used and, above all, tips on how to protect yourself from them.
Data Mining with RapidMiner - A Case Study on the Churn Rate in... – João Gabriel Lima
In this talk, we work through a step-by-step approach to building a classification model to identify the patterns of customers of a telephony company who cancelled the service, so that the carrier can predict cancellation risk and start working to prevent it.
Data Mining with RapidMiner + WEKA - Clustering – João Gabriel Lima
In this presentation, I give a practical step-by-step on how to cluster and, more importantly, how to interpret the results, applying them to support decision making.
At the end there is a very interesting review exercise that gives us the opportunity to apply the knowledge we have acquired.
jgabriel.ufpa@gmail.com
In this presentation I show both architectures and demonstrate that, instead of choosing between one and the other, we can take the best of each and use them in a clean, simple, and objective way.
Game of Data - Prediction and Analysis of the Game of Thrones Series Using... – João Gabriel Lima
In this presentation I show a study carried out by the University of Munich that aims to predict the probability of a character dying in the next season based on 24 pre-selected characteristics.
Presentation about the e-Trânsito Cidadão app: https://play.google.com/store/apps/details?id=com.huddle3.etranstitocidadaov2
It contains news and provides lookups of the IPVA vehicle tax.
[Estácio - IESAM] Automating Tasks with Gulp.js – João Gabriel Lima
Tutorial on Gulp.js
Web Development specialization - Instituto de Estudos Superiores da Amazônia
In this tutorial I present the convenience provided by task automators and specifically cover [Gulp.js](gulpjs.com).
Talk presented at JsDay Recife 2015, where I give an overview of the Internet of Things landscape with JavaScript. First I highlight the general concepts, then justify the use of JavaScript; in addition, I show the main tools, libraries, and APIs. I cite the main projects in the area and demonstrate a practical project implemented in JavaScript, using Bluetooth technology to build smart homes, providing communication between the controller device and the user's smartphone.
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor... – Neo4j
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024 – Neo4j
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
A tale of scale & speed: How the US Navy is enabling software delivery from l... – sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
UiPath Test Automation using UiPath Test Suite series, part 4 – DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... – James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with a passion for technology and making things work, along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations on CI/CD and application security integrated into the software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
The Art of the Pitch: WordPress Relationships and Sales – Laura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Sudheer Mechineni, Head of Application Frameworks, Standard Chartered Bank
Discover how Standard Chartered Bank harnessed the power of Neo4j to transform complex data access challenges into a dynamic, scalable graph database solution. This keynote will cover their journey from initial adoption to deploying a fully automated, enterprise-grade causal cluster, highlighting key strategies for modelling organisational changes and ensuring robust disaster recovery. Learn how these innovations have not only enhanced Standard Chartered Bank’s data infrastructure but also positioned them as pioneers in the banking sector’s adoption of graph technology.
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
GridMate - End to end testing is a critical piece to ensure quality and avoid...ThomasParaiso2
End to end testing is a critical piece to ensure quality and avoid regressions. In this session, we share our journey building an E2E testing pipeline for GridMate components (LWC and Aura) using Cypress, JSForce, FakerJS…
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
1. This paper employs CCIndex to support multi-dimensional range queries, overcoming the limitations of Cassandra. The results show that CCIndex gains 2.4 times the performance of Cassandra's index scheme at 1% selectivity, and about 3.7 times the performance at 50% selectivity for 2 million records.
2. This paper shows that CCIndex is a general approach for DOTs, which gains better performance for DOTs with slow random read and fast sequential read. This paper shows that CCIndex improves query performance by about 2 times on DOTs with fast random read, and achieves an order-of-magnitude performance improvement for DOTs whose random read is significantly slower than sequential read or scan, such as HBase. This paper implements the CCIndex recovery mechanism and shows that the efficiency of CCIndex recovery is 33% of that of sequential write for Cassandra.
3. This paper reveals that Cassandra is optimized for hash tables rather than ordered tables. Cassandra provides both consistent hashing and order-preserving hashing, but the read and scan operations are not optimized for order-preserving hashing, for example by considering prefetch for reads or optimizing scans for range queries over ordered tables. Cassandra's strategy is good for hash tables, but inefficient for ordered tables.
This paper is organized as follows. Section 2 gives the background. Section 3 illustrates the design and implementation of CCIndex in Cassandra. Section 4 shows the experimental results and the discussion of the results. Section 5 concludes the whole work.

II. BACKGROUND

A. CCIndex Analysis
CCIndex is proposed to support multi-dimensional range queries over DOTs by reorganizing data. CCIndex introduces a ComplementalTable for each index column. A ComplementalTable stores all columns except the rowkey and the corresponding index column. The ComplementalTable rowkey is a concatenation of the index column value, the original rowkey, and the length of the index column value. This way of generating the ComplementalTable rowkey ensures that all rowkeys are unique and sorted by the index column and then by the original rowkey. The OriginalTable and the ComplementalTables are called Complemental Clustering Index Tables (CCITs). A CCIT sets the replica factor to 1 to decrease the storage overhead. CCIndex maintains the reliability of a CCIT through the other CCITs, and introduces a replicated CCT (Complemental Check Table) for each CCIT to help data recovery.

Fig. 1 Data layout of CCIndex.

In Fig. 1, there is an OriginalTable (CCIT0) with a primary id and two index columns, weight and height. CCIT-W and CCIT-H (ComplementalTables) are ordered by key1 and key2 respectively. With these CCITs, range queries over id, weight, or height can be converted to range queries on CCIT0, CCIT-W, or CCIT-H.
A CCT stores the rowkey and all index columns of a CCIT. CCTs are replicated while the CCITs are not.
CCIndex creates all ComplementalTables and CCTs when the OriginalTable is created. CCIndex maintains the index through the procedures of inserting and deleting.
The procedure of writing is shown in Fig. 2. When writing a record into the OriginalTable, CCIndex reads the OriginalTable by rowkey to get the old values, checks whether the index values are going to be modified, and deletes records from the corresponding CCITs and CCTs when updating index values. After that, CCIndex writes the records to all CCITs and CCTs. When deleting a record, CCIndex reads all index values from the OriginalTable and deletes the records from all CCITs and CCTs.

Fig. 2 The procedure of writing.

The procedure of multi-dimensional range queries is shown in Fig. 3. CCIndex estimates the result size for each query condition and selects the condition with the smallest result size to execute a range query on the corresponding CCIT. CCIndex applies the other conditions to filter the result of the range query and returns the final results of the multi-dimensional range query.
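The rowkey construction described above can be sketched in a few lines of Java (a minimal sketch with our own helper names, assuming string keys and a fixed-width decimal length suffix; the paper does not specify the exact encoding):

```java
// Hypothetical sketch of the ComplementalTable rowkey scheme: the rowkey is the
// index column value, then the original rowkey, then the index value's length.
// The length suffix lets the original rowkey be recovered by splitting.
public class CCIndexRowkey {
    // Build a ComplementalTable rowkey from an index value and the original rowkey.
    static String build(String indexValue, String originalRowkey) {
        return indexValue + originalRowkey + String.format("%04d", indexValue.length());
    }

    // Recover the original rowkey from a ComplementalTable rowkey.
    static String split(String complementalRowkey) {
        int n = complementalRowkey.length();
        int indexLen = Integer.parseInt(complementalRowkey.substring(n - 4));
        return complementalRowkey.substring(indexLen, n - 4);
    }
}
```

Because the index value leads the key, rows in a ComplementalTable sort by index column first and original rowkey second, which is what makes range scans over the index column possible.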
Fig. 3 The procedure of multi-dimensional range queries.

CCIndex for HBase uses a simple way to estimate the result size. In HBase, the HMaster stores region-to-server mapping information as in Fig. 4. The mapping information can be described as a set of <startKey, regionServer> pairs, ordered by startKey. CCIndex finds the regions covered by each range query and estimates the result size from the number of regions. When HBase has more than one region and a maximum region size Smax, each region size must be greater than Smax/2 and less than Smax. Thus CCIndex considers the result size to depend on the number of regions covered.

Fig. 4 The region-to-server mapping of HBase.

In HBase, the speed of scan is 8.2 times that of random read. The speed of multi-dimensional range queries with CCIndex is 11.4 times that of IndexedTable.
The performance of CCIndex is affected by two issues:
• The accuracy of result size estimation. The more accurate the estimation is, the fewer unnecessary records will be scanned.
• The speed ratio of range query to random read. To execute a multi-dimensional range query, CCIndex executes a range query on a CCIT and then filters the result. IndexedTable executes a range query on an index table to get the original rowkeys, and then gets the records by random reads on those rowkeys. Thus the speed ratio of CCIndex to IndexedTable is determined by the speed ratio of range query to random read.

B. Cassandra Analysis
Cassandra organizes nodes as a ring overlay, like Chord, to partition data. Each node manages a part of the data in the ring, with data ids ranging from the previous node's token to this node's token. Records use the same partitioner to map their keys to the token ring. The corresponding node writes records to the commitlog and then to its memtable.
A memtable is a memory structure containing sorted rows. A memtable is flushed to an SSTable on disk when it is full. SSTables are sorted structures that are flushed one by one and cannot be modified once flushed, so records across multiple SSTables are not sorted, as in Fig. 5. Cassandra combines several old SSTables into a new SSTable by compaction to reduce the number of SSTables. Each node contains more than one SSTable in most cases.

Fig. 5 An example of memtable and SSTables in a node.

Like Dynamo, Cassandra keeps strong consistency if W + R > N, where W and R indicate the minimum numbers of nodes that must have executed the write and read operation successfully, and N is the replication factor. Cassandra uses different ConsistencyLevels to balance consistency and availability. In writing, ConsistencyLevel.ONE and QUORUM ensure that the write operation has been executed successfully on at least 1 and N/2 + 1 node(s) respectively. In reading, ONE returns the record responded by the fastest node, while QUORUM returns the most recent record among those returned by at least N/2 + 1 nodes. Compared with ONE, QUORUM has higher latency while maintaining consistency.
Cassandra version 0.7+ provides APIs to execute multi-dimensional range queries, but with the limitation that the APIs require at least one equality operator on a configured index column in the query expression. Cassandra also provides APIs to execute range queries over the rowkey, but the speed of range query is only 1.3 times that of random read.
In summary, there are three mismatches between HBase and Cassandra, which impose challenges when utilizing CCIndex for Cassandra.
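The replica-count arithmetic behind the ConsistencyLevels above can be made concrete with a small helper (our own sketch, not Cassandra source code):

```java
// Sketch of Cassandra-style consistency arithmetic: QUORUM touches a majority
// of the N replicas (N/2 + 1), and strong consistency requires W + R > N so
// that every read quorum overlaps every write quorum in at least one node.
public class ConsistencyMath {
    static int quorum(int n) { return n / 2 + 1; }
    static boolean isStronglyConsistent(int w, int r, int n) { return w + r > n; }
}
```

With N = 3, QUORUM writes and reads each touch 2 nodes, so W + R = 4 > 3 and a QUORUM read is guaranteed to see the latest QUORUM write; ONE/ONE (1 + 1 = 2) gives no such guarantee.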
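The claim in Section II.A that the CCIndex-to-IndexedTable speedup is set by the range-query/random-read speed ratio can be checked numerically under a simple serial model (helper names are ours; the paper formalizes this model later, in its discussion equations):

```java
// Numeric sketch of the speed model: an IndexedTable query serializes a range
// query (speed Ss, records/s) with one random read per record (speed Sr),
// while CCIndex needs only the range query.
public class SpeedModel {
    // Combined IndexedTable speed: harmonic combination of the two steps.
    static double indexedSpeed(double ss, double sr) { return ss * sr / (ss + sr); }
    // Speedup of CCIndex over IndexedTable: Ss / Si = 1 + Ss / Sr.
    static double ccindexSpeedup(double ss, double sr) { return 1.0 + ss / sr; }
}
```

With HBase's measured Ss/Sr of 8.2 this model predicts a 9.2x speedup, in line with the 11.4x observed; with Cassandra's ratio of only about 1.3, the predicted gain is correspondingly small, which foreshadows the Cassandra results reported later.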
1) The smallest sorted unit is a region in HBase but a node in Cassandra: In HBase, regions are sorted by the rowkey of records. In Cassandra, records are stored in SSTables and sorted between nodes, while the multiple SSTables within one node are not sorted among themselves. This difference decreases the accuracy of result size estimation.
2) The speed of range query: Cassandra executes range queries by logical scan, traversing all SSTables to find the 'next' record, while HBase executes physical scans on regions.
3) The API differences between HBase and Cassandra: To implement CCIndex for Cassandra, the API issue must be considered, namely how to utilize the different APIs of HBase and Cassandra while unifying the APIs that CCIndex provides to the application level.

III. DESIGN AND IMPLEMENTATION
CCIndex for Cassandra uses different methods to deal with each of these differences.

A. The smallest sorted unit issue
As record sizes between nodes might be unbalanced, the way CCIndex for HBase estimates result size, by counting covered regions, cannot work on Cassandra. This paper uses a different way to estimate result size, which relies on Cassandra's data distribution information.
1) Data distribution information gathering: CCIndex for Cassandra first adds an API in CassandraClient to gather the SSTable information of a certain node, and then adds a daemon thread, Listener, in CassandraDaemon. Listener gets the token ring information from StorageService every other minute. With the token-IP mapping, Listener uses the API above to get SSTable information from every node. Thus each node saves the data distribution information of all nodes. The Cassandra kernel code is modified without performance degradation.
2) The estimation of result size: The CCIndex client uses a thread, Refiner, to get the data distribution information and token ring information from Listener; then CCIndex estimates the result size for every query condition:
• Calculate the nodes covered by the range; count the node number as N3.
• For every covered node, read the SSTable data file total size S and file number C.
• Sum S and C over all covered nodes to get N1 and N2.
Each search condition thus has a tuple [N1, N2, N3]. N1 has higher priority than N2, and N2 has higher priority than N3. CCIndex for Cassandra executes the range query on the CCIT whose condition has the smallest tuple.

B. The speed of range query
The speed of range query is determined by the Cassandra system. The aim of CCIndex for Cassandra is to implement CCIndex while making as few changes as possible. The low speed of range query affects the speed of multi-dimensional range queries but does not restrict the implementation.

C. The API issue
CCIndex encapsulates the APIs of HBase and Cassandra, and exposes the same CCIndex APIs to applications.

D. Data recovery
CCIndex introduces the replicated CCT to help recover damaged data. This paper implements the data recovery module with CCT in Cassandra.
To recover a record of the OriginalTable, CCIndex first reads the CCTs by rowkey to get all index columns. Then CCIndex concatenates an index column value and the original rowkey to form the rowkey of a certain ComplementalTable. CCIndex tries to read the record by the concatenated rowkey and writes the corresponding record into the OriginalTable. If the recovery fails, CCIndex tries to recover the data from another ComplementalTable.
To recover a record of a ComplementalTable, CCIndex gets the rowkey of the OriginalTable by splitting the given rowkey. Then CCIndex tries to read the record from the OriginalTable. If the read fails, CCIndex uses the other index column values from the CCT to recover the data from other ComplementalTables.
To recover a certain range of a table, CCIndex scans the corresponding CCT and uses the methods above to recover the records one by one. A range can be split into several parts for multi-threaded recovery to increase efficiency.

E. Implementation
The CCIndex for Cassandra prototype uses Cassandra v0.7.2 as its code base and is written in Java.
As the replica factor of Cassandra is associated with the keyspace, it is easy for CCIndex for Cassandra to replicate CCTs by putting them into a separate keyspace with replica factor 3. CCIndex sets the keyspace replica factor to 1 for the CCITs, and creates one ComplementalTable for each index column.

Fig. 6 The architecture of CCIndex for Cassandra.

The CCIndex for Cassandra client connects to a server node to perform operations like inserting, reading, and range query. As Fig. 6 shows, CCIndex for Cassandra uses a connection
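The smallest-tuple selection in Section III.A amounts to a lexicographic comparison of [N1, N2, N3]. A sketch (class and field names are ours, not the prototype's):

```java
// Sketch of the result-size cost tuple: N1 = total SSTable bytes covered,
// N2 = SSTable file count, N3 = covered node count, compared lexicographically
// with N1 most significant. The condition with the smallest tuple wins.
public class ConditionCost implements Comparable<ConditionCost> {
    final long n1, n2, n3;

    ConditionCost(long n1, long n2, long n3) { this.n1 = n1; this.n2 = n2; this.n3 = n3; }

    @Override public int compareTo(ConditionCost o) {
        if (n1 != o.n1) return Long.compare(n1, o.n1);
        if (n2 != o.n2) return Long.compare(n2, o.n2);
        return Long.compare(n3, o.n3);
    }
}
```

CCIndex would then run the range query for whichever query condition minimizes this cost, and filter the scanned rows with the remaining conditions.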
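The OriginalTable recovery flow of Section III.D, trying each ComplementalTable in turn, can be sketched as follows (interfaces and names are ours; the real module works against Cassandra tables, and the length suffix of the complemental rowkey is omitted here for brevity):

```java
import java.util.Map;
import java.util.Optional;

// Sketch of OriginalTable record recovery: given the index values read from the
// CCT, probe each ComplementalTable until one still holds a copy of the record.
public class RecoverySketch {
    interface Table { Optional<String> read(String rowkey); }

    static Optional<String> recover(String rowkey, Map<String, String> indexValues,
                                    Map<String, Table> complementalTables) {
        for (Map.Entry<String, String> e : indexValues.entrySet()) {
            Table t = complementalTables.get(e.getKey());
            // Complemental rowkey = index value + original rowkey (length suffix omitted).
            Optional<String> rec = t.read(e.getValue() + rowkey);
            if (rec.isPresent()) return rec; // would then be written back to the OriginalTable
        }
        return Optional.empty();
    }
}
```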
pool extended from Pelops [11]. The connection pool assigns a random connection to each client to avoid hot spots.
The client gets the token ring and data distribution information by sending a query to a certain node, in order to estimate the query result size.

IV. EVALUATION
CCIndex for Cassandra is implemented and evaluated through analysis and experiments.

A. Space Overhead Analysis
For the given metrics, the performance is easy to evaluate through experiments. As for the space overhead, theoretical analysis is more suitable.
Here we denote the number of index columns by N, the replica factor of Original Cassandra and the CCT by R, the average length of the key and all index columns by Ls, and the total record length by L.
In Original Cassandra, the space for every record is:
S_ORG = L * R (1)
In CCIndex, the space for each record is the CCITs plus the CCTs. The space for the CCITs is:
S_CCIT = L * (N + 1) (2)
The space for the CCTs is:
S_CCT = Ls * (N + 1) * R (3)
The total space for CCIndex is:
S_CC = S_CCIT + S_CCT = (N + 1) * (L + Ls * R) (4)
The space overhead ratio of CCIndex to Original Cassandra is:
S_CC / S_ORG - 1 = (N + 1) / R + (N + 1) * Ls / L - 1 (5)
In Cassandra, the replica number R is often set to 3. The ratio is then:
(N + 1) / 3 + (N + 1) * Ls / L - 1 (6)
Equation (6) can be plotted as Fig. 7.

Fig. 7 The space overhead ratio of CCIndex to Original Cassandra, using L/Ls values as the horizontal axis.

From Fig. 7, the overhead ratio drops significantly as Ls/L decreases and as N decreases, which indicates that, to avoid a huge space overhead, there should be fewer index columns in CCIndex and the index columns should be shorter. When N is smaller than 2, CCIndex does not have enough replicas for the CCITs. When N changes from 2 to 4 and Ls/L changes from 1/30 to 1/10, the overhead ratio changes from 10% to 116.7%.

B. Experiment Setup
This paper introduces a benchmark to evaluate the throughput of the basic operations, including sequential read/write, random read, and range query. The workload uses a table with columns rowkey, index1, index2, index3, and data. The lengths of rowkey, index1, index2, and index3 are 10 bytes, while the data column is 1 KB. Throughput is defined as rows per second over all clients.
CCIndex builds indexes for index1, index2, and index3; the ConsistencyLevel is ONE for the CCITs and QUORUM for the CCTs.
Original Cassandra and Cassandra Indexed set the replica factor to 3 and the ConsistencyLevel to QUORUM. Original Cassandra builds no index. Cassandra Indexed builds indexes for index1, index2, and index3.
The experimental cluster has 5 nodes. Each node has two 1.8 GHz dual-core AMD Opteron 270 processors, 4 GB of memory, and 321 GB RAID5 SCSI disks. All nodes are connected by Gigabit Ethernet. Each node runs Red Hat CentOS release 5.3 (kernel 2.6.18) with the ext3 file system and Sun JDK 1.6.0_14. The tests run on a separate client machine with a 2.0 GHz dual-core Intel Core Duo T5750 processor, 3 GB of memory, and 100 Mbps Broadcom NetLink Fast Ethernet, running Ubuntu 10.04 LTS, ext3, and Sun JDK 1.6.0_14.
The workload in the experiments has 2 million rows; the token of each node is initialized manually to keep the load balanced. Each test runs three times and the average value is reported. The client uses 25 concurrent threads for sequential write, sequential read, random read, and range query, and 1 thread for multi-dimensional range queries.

C. Experiment Results
The results in Fig. 8 show that the ConsistencyLevel has a great effect on every test, which is confirmed by the large differences between the throughput of Cassandra(1) and Cassandra(3), and between Cassandra Indexed(1) and Cassandra Indexed(3).
The sequential write throughput of CCIndex is significantly lower than that of Cassandra Indexed and much lower than that of Original Cassandra, because maintaining the index needs an extra random read to get the row data from the OriginalTable, and, if there are old index column values, further delete operations to update the index.
The performance of Original Cassandra(3) and Cassandra Indexed(3) on range query, random read, and sequential read is nearly identical due to the same implementation. It is lower than that of CCIndex because of the ConsistencyLevel, which is confirmed by the fact that Original Cassandra(1) and Cassandra Indexed(1) have nearly the same throughput as CCIndex.
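The arithmetic of Eq. (5) can be checked with a few lines (the helper name is ours):

```java
// Numeric check of Eq. (5): overhead = (N+1)/R + (N+1)*Ls/L - 1,
// for N index columns, replica factor R, and key/record length ratio Ls/L.
public class SpaceOverhead {
    static double ratio(int n, int r, double lsOverL) {
        return (n + 1) / (double) r + (n + 1) * lsOverL - 1.0;
    }
}
```

ratio(2, 3, 1.0/30) gives 0.10 and ratio(4, 3, 1.0/10) gives about 1.167, matching the 10% and 116.7% endpoints quoted above.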
Fig. 8 Basic operations for Original Cassandra, Cassandra Indexed, and CCIndex. Cassandra(1) is Cassandra with 1 replica and ConsistencyLevel ONE. Cassandra(3) is Cassandra with 3 replicas and ConsistencyLevel QUORUM. Cassandra Indexed builds indexes for the index columns.

In this experiment there are three index columns (N = 3, i.e. four tables in total) and Ls/L is 1/30, so CCIndex uses 46% more space than Original Cassandra(3) in theory. The results show that Original Cassandra(3) uses 1.39 GB per node while CCIndex uses 2.12 GB per node, a 52.6% space overhead. Because some memtables have not yet been flushed to disk, we consider that the measured storage overhead confirms the theoretical analysis.
The multi-dimensional range query tests write records with index1 and index2 values randomly generated from 0 to 2 million, and index3 randomly generated from 0 to MAXVALUE. In this way, the tests can use the expression 0 < index1 < 2000000 and 0 < index2 < 2000000 and index3 = 0 to match the requirement of the Cassandra API. The MAXVALUE of index3 is set from 100 down to 1 to change the selectivity from 1% to 100%.
The results of the multi-dimensional range query tests under different conditions are shown in Fig. 9. When the selectivity is under 10%, Cassandra Indexed performs well, but when the selectivity rises from 20% to 100%, the latency increases significantly.

Fig. 9 Throughput of multi-dimensional range queries by CCIndex, Cassandra Indexed(1), and Cassandra Indexed(3).

The throughput ratio of CCIndex to Cassandra Indexed(3) is at least 2.4. As the selectivity grows, the throughput of CCIndex increases to 3.7 times that of Cassandra Indexed(3). In the experiments, CCIndex is about 1.8 to 2.7 times as fast as Cassandra Indexed(1).
In another test on Cassandra Indexed, when MAXVALUE is 100 and the query expression is 0 < index1 < 10000, 0 < index2 < 10000 and index3 = 0, an exception occurred in every one of 10 attempts, while CCIndex performed well. We consider that this happens when many records are discarded by the non-equality column ranges.
The throughput of recovery averages 1819 records/s in Fig. 10. To recover one record, CCIndex executes a range query on the CCT, a random read on a CCIT, and a write on a CCIT. The CCT range query speed is 6013 records/s, while the write speed on the CCITs is 4778 records/s and the random read speed on the CCITs is 4797 records/s. The theoretical recovery speed is 1964.7 records/s; compared with the 1819 records/s measured in practice, the recovery speed matches the theoretical analysis.

Fig. 10 CCIndex recovery speed.

D. Discussion
The results provide many insights into CCIndex and Cassandra.
1) Overall, the results show that CCIndex is a general approach for DOTs, successfully improving both performance and query expressiveness.
2) The results show that in Cassandra, sequential read and random read have the same throughput, and range query throughput is only 1.3 times that of random read. But if a client sets Cassandra's partitioner to OrderedPartitioner, the client probably intends to use operations specific to ordered tables, such as sequential read and range query. Cassandra could then apply optimizations such as prefetching and caching of adjacent records.
3) CCIndex is suitable for tables with 2 to 4 index columns. CCIndex cannot guarantee reliability with fewer than 2 index columns because the CCITs are not replicated. With more than 4 index columns, the space overhead is more than 2 times that of Original Cassandra. When a table has more than 4 columns with query requirements, a solution is to build indexes for the 2 to 4 most frequently used columns and to filter the results by the non-indexed conditions in the application.
4) The throughput of CCIndex is determined by the speed ratio of range query to random read. This explains why the throughput of CCIndex for Cassandra is 2.4 to 3.7 times that of Cassandra
Indexed(3), while the throughput of CCIndex for HBase is 11.4 times that of IndexedTable. CCIndex converts random reads on the OriginalTable into a range query on a CCIT, so its performance gain is tied to the speed improvement from random read to range query.
During a multi-dimensional range query, IndexedTable executes the range query plus a random read for every record before filtering, while CCIndex only needs to execute the range query once.
We denote the speed of range query by Ss and the speed of random read by Sr.
The speed at which CCIndex gets records is:
Scc = Ss (7)
The speed for IndexedTable is:
Si = 1 / (1/Ss + 1/Sr) = Ss * Sr / (Ss + Sr) (8)
The ratio of CCIndex to IndexedTable is:
Scc / Si = (Ss + Sr) / Sr = 1 + Ss / Sr (9)
So the ratio of CCIndex to IndexedTable is decided by the value of Ss/Sr. For HBase, Ss/Sr equals 8.2, so Scc/Si equals 9.2. As there is no optimization of the query, IndexedTable filters more records as candidate results, so the measured ratio of CCIndex to IndexedTable on multi-dimensional range queries, 11.4, is consistent with the analysis.
From Fig. 9, the throughput of CCIndex is 1.9 and 2.4 times that of Cassandra Indexed(1) and Cassandra Indexed(3) respectively. CCIndex performs the same as Cassandra Indexed(1) in random read and scan.
From Fig. 8, Ss/Sr equals 1.2 on Cassandra Indexed(1); as CCIndex takes extra time to filter the result, the measured ratio of 1.9 is close to the predicted value of 2.2.

V. CONCLUSIONS
Cassandra is a Distributed Ordered Table supporting multi-dimensional range queries. However, the current design and implementation of Cassandra have two problems: (1) Cassandra's query expressions are limited in that there must be one dimension with an equality operator in the query expression; (2) the performance is poor. Given the success of the CCIndex scheme in Apache HBase, this paper studies the feasibility of employing CCIndex to improve multi-dimensional range queries in DOTs like Cassandra.
There are three mismatches between HBase and Cassandra when utilizing CCIndex for Cassandra, which impose challenges: (1) the smallest sorted unit is a region in HBase but a node in Cassandra, so the estimation method of HBase is not suitable for Cassandra; (2) the speed of range query in Cassandra is not fast enough to accelerate CCIndex; (3) the APIs of HBase and Cassandra are different.
This paper proposes a new approach to estimate result size and exposes the same CCIndex APIs to applications, tackling the first and third mismatches. The speed of range query is determined by the Cassandra system; Cassandra could apply optimizations such as prefetching and caching of adjacent records.
The experimental results show that CCIndex gains 2.4 to 3.7 times the performance of Cassandra's index scheme at 1% to 50% selectivity for 2 million records. This paper shows that CCIndex is a general approach for DOTs, gaining better performance on multi-dimensional range queries for DOTs with slow random read and fast sequential read. This paper implements the CCIndex recovery mechanism and shows that CCIndex recovery performance is 33% of that of sequential write in Cassandra. This paper also reveals that Cassandra is optimized for hash tables rather than ordered tables in read and range queries; Cassandra could optimize by prefetching and caching adjacent records.

ACKNOWLEDGMENT
This work is supported in part by the Hi-Tech Research and Development (863) Program of China (Grant No. 2006AA01A106) and the major national science and technology special projects (2010ZX03004-003-03).

REFERENCES
[1] Adam Silberstein, Brian F. Cooper, Utkarsh Srivastava, Erik Vee, Ramana Yerneni, and Raghu Ramakrishnan, "Efficient bulk insertion into a distributed ordered table," in Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, 2008.
[2] Ymir Vigfusson, Adam Silberstein, Brian F. Cooper, and Rodrigo Fonseca, "Adaptively parallelizing distributed range queries," Proc. VLDB Endow., vol. 2, pp. 682-693, 2009.
[3] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber, "Bigtable: a distributed storage system for structured data," in 7th USENIX Symposium on Operating Systems Design and Implementation, 2006.
[4] Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Philip Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver, and Ramana Yerneni, "PNUTS: Yahoo!'s hosted data serving platform," Proc. VLDB Endow., vol. 1, pp. 1277-1288, 2008.
[5] Apache HBase project. [Online]. Available: http://hbase.apache.org/
[6] Hai Zhuge, "Probabilistic Resource Space Model for Managing Resources in Cyber-Physical Society," IEEE Transactions on Services Computing, vol. 99, no. PrePrints, 2011.
[7] Yongqiang Zou, Jia Liu, Shicai Wang, Li Zha, and Zhiwei Xu, "CCIndex: a Complemental Clustering Index on Distributed Ordered Tables for Multi-dimensional Range Queries," in 7th IFIP International Conference on Network and Parallel Computing, 2010.
[8] Avinash Lakshman and Prashant Malik, "Cassandra: a decentralized structured storage system," SIGOPS Operating Systems Review, vol. 44, no. 2, pp. 35-40, Apr. 2010.
[9] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels, "Dynamo: Amazon's highly available key-value store," in Proceedings of the 21st ACM SIGOPS Symposium on Operating Systems Principles, 2007.
[10] Ion Stoica, Robert Morris, David Karger, Frans Kaashoek, and Hari Balakrishnan, "Chord: a scalable peer-to-peer lookup service for internet applications," in Proceedings of the 2001 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, 2001.
[11] Pelops project. [Online]. Available: https://github.com/s7/scale7-pelops