This document discusses how to implement operations like selection, joining, grouping, and sorting in Cassandra without SQL. It explains that Cassandra uses a nested data model to efficiently store and retrieve related data. Operations like selection can be performed by creating additional column families that index data by fields like birthdate and allow fast retrieval of records by those fields. Joining can be implemented by nesting related entity data within the same column family. Grouping and sorting are also achieved through additional indexing column families. While this requires duplicating data for different queries, it takes advantage of Cassandra's strengths in scalable updates.
Cassandra data modelling best practices
NoSQL database systems are designed for scalability. The downside is a primitive key-value data model and, as the name suggests, no support for SQL. That might sound like a serious limitation – how can I "select", "join", "group", and "sort" the data? This post explains how all these operations can be implemented quite naturally and efficiently in one of the most famous NoSQL systems – Cassandra.
To understand this post you need to know the Cassandra data model; you can find a quick introduction in my previous post. The power of the Cassandra data model is that it extends a basic key-value store with efficient data nesting (via columns and super columns), which means you can read or update a column (or a super column) without retrieving the whole record. Below I describe how we can exploit data nesting to support various query operations.
Let's consider a basic example: departments and employees, with a one-to-many relationship between them. We have two column families: Emps and Deps. In Emps, employee IDs are used as keys and there are Name, Birthdate, and City columns. In Deps, keys are department IDs and the single column is Name.
1) Select
For example: select * from Emps where Birthdate = '25/04/1975'
To support this query we need to add one more column family, named Birthdate_Emps, in which the key is a date and the column names are the IDs of the employees born on that date. The values are not used here and can be an empty byte array (denoted "-"). Every time a new employee is inserted into (or deleted from) Emps, we need to update Birthdate_Emps. To execute the query we just retrieve all the columns for the key '25/04/1975' from Birthdate_Emps.
Notice that Birthdate_Emps is essentially an index that allows us to execute the query very efficiently, and this index is scalable because it is distributed across Cassandra nodes. You can go even further and speed up the query by redundantly storing information about employees (i.e. the employee's columns from Emps) in Birthdate_Emps. In this case employee IDs become the names of super columns that contain the corresponding employee columns.
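The post describes the index in Thrift-era column-family terms; as a rough modern equivalent, here is a minimal CQL sketch of the same idea (the table and column names are my own, not from the post). The birthdate becomes the partition key and employee IDs become clustering columns, so one partition holds all employees born on a given date.
-- Hypothetical CQL translation of the Birthdate_Emps index.
-- One partition (one birthdate) plays the role of a row in the
-- Birthdate_Emps column family; emp_id plays the role of a column name.
CREATE TABLE birthdate_emps (
    birthdate text,
    emp_id    int,
    PRIMARY KEY (birthdate, emp_id)
);
-- Maintained by the application on every insert into Emps:
INSERT INTO birthdate_emps (birthdate, emp_id) VALUES ('25/04/1975', 42);
-- The "select" from the post becomes a single-partition read:
SELECT emp_id FROM birthdate_emps WHERE birthdate = '25/04/1975';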
2) Join
For example: select * from Emps e, Deps d where e.dep_id = d.dep_id
What does a join essentially do? It constructs records that represent a relationship between entities. Such relationships can be easily (and even more naturally) represented via nesting. To do that, add a column family Dep_Emps in which the key is a department ID and the column names are the IDs of the corresponding employees.
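Again as a hedged CQL-era sketch (names are mine, not from the post): the relationship is materialized as its own table, so reading a department's employees is a single partition read rather than a join.
-- Hypothetical CQL translation of the Dep_Emps column family.
CREATE TABLE dep_emps (
    dep_id int,
    emp_id int,
    PRIMARY KEY (dep_id, emp_id)
);
-- All employees of department 7 in a single lookup, no join needed:
SELECT emp_id FROM dep_emps WHERE dep_id = 7;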
3) Group By
For example: select count(*) from Emps group by City
From an implementation viewpoint, Group By is very similar to the select/indexing described above: you just need to add a column family City_Emps with cities as keys and employee IDs as column names. In this case you count the number of employees on retrieval. Alternatively, you can have a single column named count whose value is the pre-calculated number of employees in the city. A sketch of both variants follows.
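A minimal CQL sketch of both variants (my own naming, not from the post): counting on retrieval from an index table, or pre-aggregating with a counter.
-- Variant 1 (hypothetical): count on retrieval.
CREATE TABLE city_emps (
    city   text,
    emp_id int,
    PRIMARY KEY (city, emp_id)
);
SELECT COUNT(*) FROM city_emps WHERE city = 'London';
-- Variant 2 (hypothetical): pre-calculated count per city.
CREATE TABLE city_emp_count (
    city      text PRIMARY KEY,
    emp_count counter
);
UPDATE city_emp_count SET emp_count = emp_count + 1 WHERE city = 'London';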
4) Order By
To keep data sorted in Cassandra you can use two mechanisms: (a) records can be sorted by keys using OrderPreservingPartitioner together with range queries (more on this in Cassandra: RandomPartitioner vs OrderPreservingPartitioner); (b) to keep nested data sorted, you can rely on the automatically maintained ordering of column names.
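In CQL terms, the column-name ordering in (b) corresponds to clustering order; a small sketch under my own naming:
-- Hypothetical: employees stored sorted by name within each department.
CREATE TABLE dep_emps_by_name (
    dep_id   int,
    emp_name text,
    emp_id   int,
    PRIMARY KEY (dep_id, emp_name)
);
-- Rows come back ordered by emp_name with no explicit sort:
SELECT emp_name, emp_id FROM dep_emps_by_name WHERE dep_id = 7;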
To support all these operations we store redundant data optimized for each particular query. This has two implications:
1) You must know the queries in advance (i.e. there is no support for ad-hoc queries). However, in Web applications and enterprise OLTP applications queries are typically well known in advance, few in number, and do not change often. Read Mike Stonebraker convincingly arguing this point. BTW, the Constraint Tree Schema described in the latter paper also exploits nesting to organize data for predefined queries.
2) We shift the burden from querying to updating, because what we essentially maintain are materialized views (i.e. pre-computed results of queries). This makes a lot of sense for Cassandra, since Cassandra is heavily optimized for updates (thanks to eventual consistency and the "log-structured" storage borrowed from Google BigTable), so we can use fast updates to speed up query execution. Moreover, the use cases typical of social applications have proven to scale only with a push-on-change model (data is propagated ahead of time via updates and served with simple queries – the approach taken in this post), as opposed to a pull-on-demand model (data is stored normalized and combined by queries on demand – the classical relational approach). On push-on-change versus pull-on-demand, read WHY ARE FACEBOOK, DIGG, AND TWITTER SO HARD TO SCALE?
Considerations for NoSQL
• Do you need a more flexible data model to manage data that goes beyond a rigid RDBMS table/row structure and instead includes a combination of structured, semi-structured, and unstructured data?
• Do you need continuous availability, with redundancy in both data and function across one or more locations, versus simple failover for the database?
• Do you need a database that runs across multiple data centers / cloud availability zones?
• Do you need to handle high-velocity data coming in via sensors, mobile devices, and the like, with extreme write speed and low-latency query speed?
• Do you need to go beyond single-machine limits for scale-up and instead adopt a scale-out architecture that supports the easy addition of more processing power and storage capacity?
• Do you need to run different workloads (e.g. online, analytics, search) on the same data without needing to manually ETL the data to separate systems/machines?
• Do you need to manage a widely distributed system with minimal staff?
MIGRATING DATA
Moving data from an RDBMS or another database to Cassandra is generally quite easy. The following options exist for migrating data to Cassandra:
• COPY command – CQL provides a COPY command (very similar to Postgres) that can load data from an operating system file into a Cassandra table; see the sketch after this list. Note that it is not recommended for very large files.
• Bulk loader – this utility is designed to load a Cassandra table more quickly from a file that is delimited in some way (e.g. comma, tab, etc.).
• Sqoop – Sqoop is a utility used in Hadoop to load data from RDBMSs into a Hadoop cluster. DataStax supports pipelining data directly from an RDBMS table into a Cassandra table.
• ETL tools – a variety of ETL tools (e.g. Informatica) support Cassandra as both a source and a target data platform. Many of these tools not only extract and load data but also provide transformation routines that can manipulate the incoming data in many ways. A number of these tools are also free to use (e.g. Pentaho, Jaspersoft, Talend).
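A minimal cqlsh COPY sketch (the file name and column list are hypothetical, reusing the Emps example):
-- Load employees from a CSV file into an existing table via cqlsh.
COPY emps (emp_id, name, birthdate, city)
FROM 'emps.csv'
WITH HEADER = TRUE AND DELIMITER = ',';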
Advanced Command Line Performance Monitoring Tools
The Performance Service maintains the following levels of performance information:
• System level – supplies general memory, network, and thread pool statistics.
• Cluster level – provides metrics at the cluster, data center, and node level.
• Database level – provides drill-down metrics at the keyspace, table, and table-per-node level.
• Table histogram level – delivers histogram metrics for the tables being accessed.
• Object I/O level – supplies metrics concerning 'hot objects', i.e. data on which objects are being accessed the most.
• User level – provides metrics concerning user activity, 'top users' (those consuming the most resources on the cluster), and more.
• Statement level – captures queries that exceed a certain response-time threshold, along with all their relevant metrics.
Once the service has been configured and is running, statistics are populated in their associated tables, which are stored in a special keyspace (dse_perf). You can then query the various performance tables to get statistics such as the I/O metrics for certain objects:
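For instance, a query along these lines (the exact table names in dse_perf vary by DSE version; object_io here is illustrative, not taken from the deck):
-- Illustrative only: inspect per-object I/O metrics collected by the
-- DSE Performance Service (table name may differ across versions).
SELECT * FROM dse_perf.object_io LIMIT 10;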
Finding and Troubleshooting Problem Queries
Use the DataStax Enterprise Performance Service to automatically capture long-running queries (based on response-time thresholds you specify) and then query the performance table that holds those statements.
The trace information is stored in the system_traces keyspace, which holds two tables: sessions and events.
Tracing an individual query works like an explain plan; a cqlsh sketch follows.
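In cqlsh this looks like the following (the SELECT reuses the hypothetical emps table from earlier):
-- Turn tracing on in cqlsh, run a query, then inspect the trace data.
TRACING ON;
SELECT * FROM emps WHERE emp_id = 42;
-- The collected trace can also be queried directly:
SELECT * FROM system_traces.sessions LIMIT 5;
SELECT * FROM system_traces.events LIMIT 5;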
Cassandra data modelling best practices:
1. Using composite types through an API client is not recommended.
2. Super column families are not recommended, because accessing them deserializes all of their columns, as opposed to deserializing a single column.
3. We can create wide rows (many columns per row, fewer rows) and skinny rows (few columns per row, many rows).
4. Valueless columns: if the row ID is {City+uid} and we only want to write/read City, the column value can be left empty (a valueless column).
5. Columns can be set to expire based on a TTL given in seconds.
6. Counter columns store a number that incrementally counts the occurrences of a particular event or process. For example, you might use a counter column to count the number of times a page is viewed; a sketch follows.
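A small CQL sketch of the page-view example (table and column names are my own):
-- Hypothetical page-view counter.
CREATE TABLE page_views (
    page_id text PRIMARY KEY,
    views   counter
);
-- Counters are modified only via UPDATE, never INSERT:
UPDATE page_views SET views = views + 1 WHERE page_id = '/home';
SELECT views FROM page_views WHERE page_id = '/home';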
7. Keyspace: the top-level container for column families; a cluster has one keyspace per application.
Column Family: a container for rows, each identified by a row key, together with their columns.
Row Key: the unique identifier for data stored within a column family.
Column: a name-value pair with an additional field: a timestamp.
Super Column: a dictionary of columns stored under a single column name within a row.
8. The Random Partitioner (RP) is the recommended partitioning scheme. It has the following advantages over ordered partitioning, as in the Byte Ordered Partitioner (BOP).
Random Partitioner: uses a hash of the row key to determine which node in the cluster will be responsible for the data. The hash value is generated by applying MD5 to the row key. Each node in a data center is assigned a section of this hash range (a token) and is responsible for storing the data whose row key hash falls within that range:
token range per node = 2^127 / (number of nodes in the cluster)
If the cluster spans multiple data centers, tokens are created per data center, which gives a better distribution.
Byte Ordered Partitioner (BOP): allows you to calculate your own tokens and assign them to nodes yourself, as opposed to the Random Partitioner doing this for you automatically.
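You can see the partitioner's hash at work from CQL, which exposes it via the token() function (reusing the hypothetical emps table from earlier):
-- Show the token (partitioner hash) computed from each row key.
SELECT emp_id, token(emp_id) FROM emps;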
9. Partitioning => picking the node that stores the first copy of the data.
Replication => picking the additional nodes that store further copies of the data.
Storage: writes first go to the commit log (for durability), then to memtables (in-memory structures), which are flushed to SSTables on disk; compaction merges SSTables to remove stale data and tombstones (markers indicating that data was deleted).
10. The binary protocol is faster than Thrift.
11. Why RP?
1. RP ensures that data is evenly distributed across all nodes in the cluster and does not create data hotspots, as BOP can.
2. When a new node is added to the cluster, RP can quickly assign it a new token range and move the minimum amount of data from the other nodes to the new node now responsible for it. With BOP this has to be done manually.
3. Multiple column families: BOP can cause an uneven distribution of data if you have multiple column families.
4. The only benefit BOP has over RP is that it allows row slices: you can obtain a cursor, as in an RDBMS, and iterate over your rows.
12. A column family is a map of a map:
SortedMap<RowKey, SortedMap<ColumnKey, ColumnValue>>
A map gives efficient key lookup, and its sorted nature gives efficient scans. In Cassandra, we can use row keys and column keys to do efficient lookups and range scans.
13. The number of column keys is unbounded. In other words, you can have wide rows.
A key can itself hold a value. In other words, you can have a valueless column.
14. You need to pass the timestamp with each column value, for Cassandra to use internally for
conflict resolution. However, the timestamp can be safely ignored during modeling.
15. Start with the query patterns and create an ER model; then start denormalizing and duplicating. This helps to identify the most frequent query patterns and isolate the less frequent ones.
Query patterns (a sketch of matching tables follows the list):
Get user by user id
Get item by item id
Get all the items that a particular user likes
Get all the users who like a particular item
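A hedged CQL sketch of tables serving the last two patterns (one table per query pattern; the names are mine, not from the deck):
-- Hypothetical lookup tables, one per query pattern.
CREATE TABLE user_items (        -- items that a given user likes
    user_id int,
    item_id int,
    PRIMARY KEY (user_id, item_id)
);
CREATE TABLE item_users (        -- users who like a given item
    item_id int,
    user_id int,
    PRIMARY KEY (item_id, user_id)
);
-- Each 'like' is written twice, once per table:
INSERT INTO user_items (user_id, item_id) VALUES (1, 100);
INSERT INTO item_users (item_id, user_id) VALUES (100, 1);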
Option 1: exact replica of the relational model.
Option 2: normalized entities with custom indexes.
Option 3: normalized entities with denormalization into custom indexes.
Option 4: partially denormalized entities.
Keyspaces: the container for column families; a cluster has one keyspace per application.
CREATE KEYSPACE keyspace_name WITH
strategy_class = 'SimpleStrategy'
AND strategy_options:replication_factor='2';
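The statement above uses the old CQL 2 syntax; on current Cassandra versions the equivalent is written with a replication map:
-- Modern CQL equivalent of the keyspace definition above.
CREATE KEYSPACE keyspace_name WITH replication =
    {'class': 'SimpleStrategy', 'replication_factor': 2};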
Time Series Pattern 1 – single device per row.
Time Series Pattern 2 – partitioning to limit row size. The solution is a pattern called row partitioning: add time data to the row key to limit the number of columns you get per device (a sketch follows).
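A hedged CQL sketch of row partitioning, following the weather-station example used below (the date column added to the partition key is the "data added to the row key"):
-- Hypothetical: one partition per station per day, bounding row size.
CREATE TABLE temperature_by_day (
    weatherstation_id text,
    date              text,        -- e.g. '2013-04-03', part of the key
    event_time        timestamp,
    temperature       text,
    PRIMARY KEY ((weatherstation_id, date), event_time)
);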
Time Series Pattern 3 – reverse-order time series with expiring columns. Consider a dashboard application that only shows the last 10 temperature readings; with a TTL (time to live) on each data value this is possible:
CREATE TABLE latest_temperatures (
    weatherstation_id text,
    event_time timestamp,
    temperature text,
    PRIMARY KEY (weatherstation_id, event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);
INSERT INTO latest_temperatures (weatherstation_id, event_time, temperature)
VALUES ('1234ABCD', '2013-04-03 07:03:00', '72F') USING TTL 20;
As an example of moving from a relational design to a Cassandra one, compare a message Inbound table in MySQL with its Cassandra counterpart:
create table Inbound (
    InboundID int not null primary key auto_increment,
    ParticipantID int not null,
    FromParticipantID int not null,
    Occurred date not null,
    Subject varchar(50) not null,
    Story text not null,
    foreign key (ParticipantID) references Participant(ParticipantID),
    foreign key (FromParticipantID) references Participant(ParticipantID));
create table Inbound (
    ParticipantID int,
    Occurred timeuuid,
    FromParticipantID int,
    Subject text,
    Story text,
    primary key (ParticipantID, Occurred));
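A usage sketch: with primary key (ParticipantID, Occurred), a participant's inbox comes back in time order from a single partition (the DESC ordering and LIMIT are my additions, not from the deck):
-- Most recent messages first for one participant.
SELECT FromParticipantID, Subject, Story
FROM Inbound
WHERE ParticipantID = 42
ORDER BY Occurred DESC
LIMIT 10;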
13. 1 Define the User Scenarios This ensures User participation and commitment.
2
Define the Steps in each
Scenario
Clarify the User Interaction.
3 Derive the Data Model.
Use a Modelling Tool, such as Data Architect or ERWin to
generate SQL.
4 Relate Data Entities to each Step. Create Cross-reference matrix to check results.
5
Identify Transactions for each
Entity
Confirm that each Entity has Transactions to load and read
Data
6 Prepare sample Data In collaboration with the Users.
7 Prepare Test Scripts Agree sign-off with the Users.
8 Define a Load Sequence
Reference Data, basics such as Products, any existing Users
or Customers,etc..
9 Run the Test Scripts Get User Sign-off to record progress.