The document provides an introduction to Cassandra presented by Duy Hai Doan. It discusses Cassandra's history, key features including linear scalability, availability, support for multiple data centers, operational simplicity, and analytics capabilities. It also covers Cassandra architecture including the cluster layer based on Dynamo and data-store layer based on BigTable, data distribution, replication, consistency levels, and the write path. The data model of using the last write to resolve conflicts is explained along with CQL basics and modeling one-to-many relationships with clustered tables.
DOAN DuyHai – Cassandra: real world best use-cases and worst anti-patterns - ...NoSQLmatters
DOAN DuyHai – Cassandra: real world best use-cases and worst anti-patterns
In this session, you'll see how to leverage the best features of Cassandra to solve real world issues (Rate limiting/anti fraud system, account validation, security token …). We'll also highlight some common anti-patterns (queue,partition key miss,CQL3 null) and see how to solve them in the Cassandra way.
Cassandra Community Webinar | Introduction to Apache Cassandra 1.2DataStax
Title: Introduction to Apache Cassandra 1.2
Details: Join Aaron Morton, DataStax MVP for Apache Cassandra and learn the basics of the massively scalable NoSQL database. This webinar is will examine C*’s architecture and its strengths for powering mission-critical applications. Aaron will introduce you to core concepts such as Cassandra’s data model, multi-datacenter replication, and tunable consistency. He’ll also cover new features in Cassandra version 1.2 including virtual nodes, CQL 3 language and query tracing.
Speaker: Aaron Morton, Apache Cassandra Committer
Aaron Morton is a Freelance Developer based in New Zealand, and a Committer on the Apache Cassandra project. In 2010, he gave up the RDBMS world for the scale and reliability of Cassandra. He now spends his time advancing the Cassandra project and helping others get the best out of it.
DOAN DuyHai – Cassandra: real world best use-cases and worst anti-patterns - ...NoSQLmatters
DOAN DuyHai – Cassandra: real world best use-cases and worst anti-patterns
In this session, you'll see how to leverage the best features of Cassandra to solve real world issues (Rate limiting/anti fraud system, account validation, security token …). We'll also highlight some common anti-patterns (queue,partition key miss,CQL3 null) and see how to solve them in the Cassandra way.
Cassandra Community Webinar | Introduction to Apache Cassandra 1.2DataStax
Title: Introduction to Apache Cassandra 1.2
Details: Join Aaron Morton, DataStax MVP for Apache Cassandra and learn the basics of the massively scalable NoSQL database. This webinar is will examine C*’s architecture and its strengths for powering mission-critical applications. Aaron will introduce you to core concepts such as Cassandra’s data model, multi-datacenter replication, and tunable consistency. He’ll also cover new features in Cassandra version 1.2 including virtual nodes, CQL 3 language and query tracing.
Speaker: Aaron Morton, Apache Cassandra Committer
Aaron Morton is a Freelance Developer based in New Zealand, and a Committer on the Apache Cassandra project. In 2010, he gave up the RDBMS world for the scale and reliability of Cassandra. He now spends his time advancing the Cassandra project and helping others get the best out of it.
C*ollege Credit: Data Modeling for Apache CassandraDataStax
Cassandra stores data differently than traditional RDBMS’s. It is these differences that allow for improvements in performance, availability and scalability. Aaron Morton, DataStax MVP for Apache Cassandra will present the basics of the data model and outline the differences clearly. This webinar is 101 level and is suitable for people who are coming from a relational background and just starting to get into Apache Cassandra.
Если нашлась одна ошибка — есть и другие. Один способ выявить «наследуемые» у...Positive Hack Days
Ведущий: Асука Накадзима (Asuka Nakajima)
Практика повторного использования исходного кода позволяет сократить расходы на разработку программного обеспечения. Тем не менее, если в оригинальном исходном коде кроется уязвимость, она будет перенесена и в новое приложение. Докладчик расскажет о необычном способе обнаружения «наследуемых» уязвимостей в бинарных файлах без необходимости обращаться к исходному коду или символьным файлам.
Losing Data in a Safe Way – Advanced Replication Strategies in Apache Hadoop ...DataWorks Summit
All the distributed file systems and NoSQL databases work very well during the normal operation. We can find the big differences if we investigate the behaviour in case of emergency. Data replication strategies and recovery algorithms are the key ingredients of a distributed data storage to save data.
Recent studies proved that the random replication is not the safest choice for storing data as it almost guarantees to lose data in the common scenario of simultaneous node failures. Copyset Replication method significantly reduces the frequency of data loss events with selecting the replica groups in a smart way.
In this talk we will introduce the key elements of a successful data replication and show how advanced data replication strategies could help to survive outages. We will show how Apache Hadoop Ozone solves the problem with advanced techniques and present the challenges of using Copyset algorithm with advanced cluster topology support.
Bootstrapping Meta-Languages of Language WorkbenchesGabriël Konat
It is common practice to bootstrap compilers of programming languages. By using the compiled language to implement the compiler, compiler developers can code in their own high-level language and gain a large-scale test case. In this paper, we investigate bootstrapping of compiler-compilers as they occur in language workbenches. Language workbenches support the development of compilers through the application of multiple collaborating domain-specific meta-languages for defining a language's syntax, analysis, code generation, and editor support. We analyze the bootstrapping problem of language workbenches in detail, propose a method for sound bootstrapping based on fixpoint compilation, and develop recipes for conducting breaking meta-language changes in a bootstrapped language workbench. We have applied sound bootstrapping to the Spoofax language workbench and report on our experience.
What is the best full text search engine for Python?Andrii Soldatenko
Nowadays we can see lot’s of benchmarks and performance tests of different web frameworks and Python tools. Regarding to search engines, it’s difficult to find useful information especially benchmarks or comparing between different search engines. It’s difficult to manage what search engine you should select for instance, ElasticSearch, Postgres Full Text Search or may be Sphinx or Whoosh. You face a difficult choice, that’s why I am pleased to share with you my acquired experience and benchmarks and focus on how to compare full text search engines for Python.
C*ollege Credit: Data Modeling for Apache CassandraDataStax
Cassandra stores data differently than traditional RDBMS’s. It is these differences that allow for improvements in performance, availability and scalability. Aaron Morton, DataStax MVP for Apache Cassandra will present the basics of the data model and outline the differences clearly. This webinar is 101 level and is suitable for people who are coming from a relational background and just starting to get into Apache Cassandra.
Если нашлась одна ошибка — есть и другие. Один способ выявить «наследуемые» у...Positive Hack Days
Ведущий: Асука Накадзима (Asuka Nakajima)
Практика повторного использования исходного кода позволяет сократить расходы на разработку программного обеспечения. Тем не менее, если в оригинальном исходном коде кроется уязвимость, она будет перенесена и в новое приложение. Докладчик расскажет о необычном способе обнаружения «наследуемых» уязвимостей в бинарных файлах без необходимости обращаться к исходному коду или символьным файлам.
Losing Data in a Safe Way – Advanced Replication Strategies in Apache Hadoop ...DataWorks Summit
All the distributed file systems and NoSQL databases work very well during the normal operation. We can find the big differences if we investigate the behaviour in case of emergency. Data replication strategies and recovery algorithms are the key ingredients of a distributed data storage to save data.
Recent studies proved that the random replication is not the safest choice for storing data as it almost guarantees to lose data in the common scenario of simultaneous node failures. Copyset Replication method significantly reduces the frequency of data loss events with selecting the replica groups in a smart way.
In this talk we will introduce the key elements of a successful data replication and show how advanced data replication strategies could help to survive outages. We will show how Apache Hadoop Ozone solves the problem with advanced techniques and present the challenges of using Copyset algorithm with advanced cluster topology support.
Bootstrapping Meta-Languages of Language WorkbenchesGabriël Konat
It is common practice to bootstrap compilers of programming languages. By using the compiled language to implement the compiler, compiler developers can code in their own high-level language and gain a large-scale test case. In this paper, we investigate bootstrapping of compiler-compilers as they occur in language workbenches. Language workbenches support the development of compilers through the application of multiple collaborating domain-specific meta-languages for defining a language's syntax, analysis, code generation, and editor support. We analyze the bootstrapping problem of language workbenches in detail, propose a method for sound bootstrapping based on fixpoint compilation, and develop recipes for conducting breaking meta-language changes in a bootstrapped language workbench. We have applied sound bootstrapping to the Spoofax language workbench and report on our experience.
What is the best full text search engine for Python?Andrii Soldatenko
Nowadays we can see lot’s of benchmarks and performance tests of different web frameworks and Python tools. Regarding to search engines, it’s difficult to find useful information especially benchmarks or comparing between different search engines. It’s difficult to manage what search engine you should select for instance, ElasticSearch, Postgres Full Text Search or may be Sphinx or Whoosh. You face a difficult choice, that’s why I am pleased to share with you my acquired experience and benchmarks and focus on how to compare full text search engines for Python.
DuyHai DOAN - Real time analytics with Cassandra and Spark - NoSQL matters Pa...NoSQLmatters
Apache Spark is a general data processing framework which allows you perform map-reduce tasks (but not only) in memory. Apache Cassandra is a highly available and massively scalable NoSQL data-store. By combining Spark flexible API and Cassandra performance, we get an interesting alternative to the Hadoop eco-system for both real-time and batch processing. During this talk we will highlight the tight integration between Spark & Cassandra and demonstrate some usages with live code demo.
A lot has changed since I gave one of these talks and man, has it been good. 2.0 brought us a lot of new CQL features and now with 2.1 we get even more! Let me show you some real life data models and those new features taking developer productivity to an all new high. User Defined Types, New Counters, Paging, Static Columns. Exciting new ways of making your app truly killer!
Cassandra Summit 2014: Real Data Models of Silicon ValleyDataStax Academy
A lot has changed since I gave one of these talks and man, has it been good. 2.0 brought us a lot of new CQL features and now with 2.1 we get even more! Let me show you some real life data models and those new features taking developer productivity to an all new high. User Defined Types, New Counters, Paging, Static Columns. Exciting new ways of making your app truly killer!
Cassandra Community Webinar | Data Model on FireDataStax
Functional data models are great, but how can you squeeze out more performance and make them awesome? Let's talk through some example Cassandra 2.0 models, go through the tuning steps and understand the tradeoffs. Many time's just a simple understanding of the underlying Cassandra 2.0 internals can make all the difference. I've helped some of the biggest companies in the world do this and I can help you. Do you feel the need for Cassandra 2.0 speed?
PHP performance 101: so you need to use a databaseLeon Fayer
Being involved in performance audits on systems of every size, from start-up sites hacked together overnight, to a ginormous applications built by world-recognized brand companies, I’ve seen a lot of interesting (and sometimes very unique) performance issues in every level of the stack: code, architecture, databases (sometimes all of the above). But there are a few particular, very “Performance 101″, issues that (unfortunately) appear in a lot of code bases. In this talk I present the most common database-related performance bottlenecks that can happen in most PHP applications.
This PPT is intended to provide a thorough coverage of verilog HDL concepts based on fundamental principles of digital design. This is the basic fundamental concept for the programming of the digital electronics.
Introduction to HDLs: Overview of Digital Design with Verilog HDL, Basic Concepts, Data types, System tasks and Compiler Directives. Hierarchical modeling, concepts of modules and ports Gate level Modeling, Dataflow modeling-Continuous Assignments, Timing and Delays. Programming Language Interface
Design of Arithmetic Circuits using Gate level/ Data flow modeling –Adders, Subtractors, 4- bit Binary and BCD adders and 8-bit Comparators.
Verification: Functional verification, simulation types, Design of stimulus block.
Analytics: The Final Data Frontier (or, Why Users Need Your Data and How Pino...HostedbyConfluent
"Traditionally, analytics have served internal decision-makers—often an exclusive group of people in high-status positions in the organization. Recently, initiatives like Data Mesh have recognized that pushing analytics products down to people at all levels of the org chart can make for a more responsive and competitive organization. But what about people outside the organization? It used to be that you wanted to see their data, but now they need to see yours. Users are the final frontier of analytics, and going forward, you may have less and less of a choice whether to expose analytics products to your customers themselves.
This requires a completely re-engineered approach to data infrastructure. Apache Pinot is a database built from the ground up to ingest streaming data from Apache Kafka and serve filtered, grouped, and aggregated results in tens of milliseconds rather than tens of seconds. Built at LinkedIn to expose the social network's data to users as game-changing application features, Pinot is now powering user-facing analytics in real-time, event-driven systems in many different businesses all over the world. Come to this talk to understand the forces that have given rise to this class of database, learn about Pinot's internals, and see some examples of it in action."
Similar to Cassandra introduction @ NantesJUG (20)
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Search and Society: Reimagining Information Access for Radical FuturesBhaskar Mitra
The field of Information retrieval (IR) is currently undergoing a transformative shift, at least partly due to the emerging applications of generative AI to information access. In this talk, we will deliberate on the sociotechnical implications of generative AI for information access. We will argue that there is both a critical necessity and an exciting opportunity for the IR community to re-center our research agendas on societal needs while dismantling the artificial separation between the work on fairness, accountability, transparency, and ethics in IR and the rest of IR research. Instead of adopting a reactionary strategy of trying to mitigate potential social harms from emerging technologies, the community should aim to proactively set the research agenda for the kinds of systems we should build inspired by diverse explicitly stated sociotechnical imaginaries. The sociotechnical imaginaries that underpin the design and development of information access technologies needs to be explicitly articulated, and we need to develop theories of change in context of these diverse perspectives. Our guiding future imaginaries must be informed by other academic fields, such as democratic theory and critical theory, and should be co-developed with social science scholars, legal scholars, civil rights and social justice activists, and artists, among others.
PHP Frameworks: I want to break free (IPC Berlin 2024)Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Ramesh Iyer
In today's fast-changing business world, Companies that adapt and embrace new ideas often need help to keep up with the competition. However, fostering a culture of innovation takes much work. It takes vision, leadership and willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
"Impact of front-end architecture on development cost", Viktor TurskyiFwdays
I have heard many times that architecture is not important for the front-end. Also, many times I have seen how developers implement features on the front-end just following the standard rules for a framework and think that this is enough to successfully launch the project, and then the project fails. How to prevent this and what approach to choose? I have launched dozens of complex projects and during the talk we will analyze which approaches have worked for me and which have not.
DevOps and Testing slides at DASA ConnectKari Kakkonen
My and Rik Marselis slides at 30.5.2024 DASA Connect conference. We discuss about what is testing, then what is agile testing and finally what is Testing in DevOps. Finally we had lovely workshop with the participants trying to find out different ways to think about quality and testing in different parts of the DevOps infinity loop.
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio, cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors, and newer malware including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
2. @doanduyhai
Who Am I ?!
Duy Hai DOAN
Cassandra technical advocate
• talks, meetups, confs
• open-source devs (Achilles, …)
• OSS Cassandra point of contact
☞ duy_hai.doan@datastax.com
☞ @doanduyhai
2
3. @doanduyhai
Datastax!
• Founded in April 2010
• We contribute a lot to Apache Cassandra™
• 400+ customers (25 of the Fortune 100), 200+ employees
• Headquarter in San Francisco Bay area
• EU headquarter in London, offices in France and Germany
• Datastax Enterprise = OSS Cassandra + extra features
3
11. @doanduyhai
Multi-DC usages!
Prod data copy for testing/benchmarking
n2
n3
n4
n5
n6
n7
n8
n1
n2
n3n1
Use
LOCAL
consistency
My tiny test
cluster
Data copy
NEVER WRITE HERE !!!
11
34. @doanduyhai
Consistency summary!
ONERead + ONEWrite
☞ available for read/write even (N-1) replicas down
QUORUMRead + QUORUMWrite
☞ available for read/write even 1+ replica down
34
42. @doanduyhai
Last Write Win (LWW)!
jdoe
age
name
33 John DOE
INSERT INTO users(login, name, age) VALUES(‘jdoe’, ‘John DOE’, 33);
#partition
42
43. @doanduyhai
Last Write Win (LWW)!
jdoe
age (t1) name (t1)
33 John DOE
INSERT INTO users(login, name, age) VALUES(‘jdoe’, ‘John DOE’, 33);
auto-generated timestamp
.
43
44. @doanduyhai
Last Write Win (LWW)!
UPDATE users SET age = 34 WHERE login = ‘jdoe’;
jdoe
age (t1) name (t1)
33 John DOE
jdoe
age (t2)
34
SSTable1 SSTable2
44
45. @doanduyhai
Last Write Win (LWW)!
DELETE age FROM users WHERE login = ‘jdoe’;
jdoe
age (t3)
ý
tombstone
jdoe
age (t1) name (t1)
33 John DOE
jdoe
age (t2)
34
SSTable1 SSTable2 SSTable3
45
46. @doanduyhai
Last Write Win (LWW)!
SELECT age FROM users WHERE login = ‘jdoe’;
???
SSTable1 SSTable2 SSTable3
jdoe
age (t3)
ý
jdoe
age (t1) name (t1)
33 John DOE
jdoe
age (t2)
34
46
47. @doanduyhai
Last Write Win (LWW)!
SELECT age FROM users WHERE login = ‘jdoe’;
✓✕✕
SSTable1 SSTable2 SSTable3
jdoe
age (t3)
ý
jdoe
age (t1) name (t1)
33 John DOE
jdoe
age (t2)
34
47
49. @doanduyhai
Historical data!
history
id
date1(t1) date2(t2) … date9(t9)
… … … …
SSTable1 SSTable2
You want to keep data history ?
• do not use internal generated timestamp !!!
• ☞ time-series data modeling
id
date10(t10)date11(t11) …
…
… … … …
49
50. @doanduyhai
CRUD operations!
INSERT INTO users(login, name, age) VALUES(‘jdoe’, ‘John DOE’, 33);
UPDATE users SET age = 34 WHERE login = ‘jdoe’;
DELETE age FROM users WHERE login = ‘jdoe’;
SELECT age FROM users WHERE login = ‘jdoe’;
50
52. @doanduyhai
What about joins ?!
How can I join data between tables ?
How can I model 1 – N relationships ?
How to model a mailbox ?
EmailsUser
1 n
52
55. @doanduyhai
Queries!
Get message by user and message_id (date)
SELECT * FROM mailbox WHERE login = jdoe
and message_id = ‘2014-09-25 16:00:00’;
Get message by user and date interval
SELECT * FROM mailbox WHERE login = jdoe
and message_id <= ‘2014-09-25 16:00:00’
and message_id >= ‘2014-09-20 16:00:00’;
55
56. @doanduyhai
Queries!
Get message by message_id only ?
SELECT * FROM mailbox WHERE message_id = ‘2014-09-25 16:00:00’;
Get message by date interval only ?
SELECT * FROM mailbox WHERE
and message_id <= ‘2014-09-25 16:00:00’
and message_id >= ‘2014-09-20 16:00:00’;
❓
❓
56
57. @doanduyhai
Queries!
Get message by message_id only (#partition not provided)
SELECT * FROM mailbox WHERE message_id = ‘2014-09-25 16:00:00’;
Get message by date interval only (#partition not provided)
SELECT * FROM mailbox WHERE
and message_id <= ‘2014-09-25 16:00:00’
and message_id >= ‘2014-09-20 16:00:00’;
57
61. @doanduyhai
Queries!
SELECT * FROM mailbox WHERE login >= ‘hsue’ and login <= ‘jdoe’;
Get message by user range (range query on #partition)
SELECT * FROM mailbox WHERE login like ‘%doe%‘;
Get message by user pattern (non exact match on #partition)
61
62. @doanduyhai
WHERE clause restrictions!
All queries (INSERT/UPDATE/DELETE/SELECT) must provide #partition
Only exact match (=) on #partition, range queries (<, ≤, >, ≥) not allowed
• ☞ full cluster scan
On clustering columns, only range queries (<, ≤, >, ≥) and exact match
WHERE clause only possible
• on columns defined in PRIMARY KEY
• on indexed columns ( )
62
64. @doanduyhai
WHERE clause restrictions!
What if I want to perform « arbitrary » WHERE clause ?
• search form scenario, dynamic search fields
DO NOT RE-INVENT THE WHEEL !
☞ Apache Solr (Lucene) integration (Datastax Enterprise)
☞ Same JVM, 1-cluster-2-products (Solr & Cassandra)
64
65. @doanduyhai
WHERE clause restrictions!
What if I want to perform « arbitrary » WHERE clause ?
• search form scenario, dynamic search fields
DO NOT RE-INVENT THE WHEEL !
☞ Apache Solr (Lucene) integration (Datastax Enterprise)
☞ Same JVM, 1-cluster-2-products (Solr & Cassandra)
SELECT * FROM users WHERE solr_query = ‘age:[33 TO *] AND gender:male’;
SELECT * FROM users WHERE solr_query = ‘lastname:*schwei?er’;
65
66. @doanduyhai
Collections & maps!
CREATE TABLE users (
login text,
name text,
age int,
friends set<text>,
hobbies list<text>,
languages map<int, text>,
…
PRIMARY KEY(login));
66
Keep the cardinality low ≈ 1000
67. @doanduyhai
User Defined Type (UDT)!
CREATE TABLE users (
login text,
…
street_number int,
street_name text,
postcode int,
country text,
…
PRIMARY KEY(login));
Instead of
67
68. @doanduyhai
User Defined Type (UDT)!
CREATE TYPE address (
street_number int,
street_name text,
postcode int,
country text);
CREATE TABLE users (
login text,
…
location frozen <address>,
…
PRIMARY KEY(login));
68
70. @doanduyhai
UDT update!
UPDATE users set location =
{
‘street_number’: 125,
‘street_name’: ‘Congress Avenue’,
‘postcode’: 95054,
‘country’: ‘USA’
}
WHERE login = jdoe;
Can be nested ☞ store documents
• but no dynamic fields (or use map<text, blob>)
70
71. @doanduyhai
From SQL to CQL!
Normalized
Comment
User
1
n
CREATE TABLE comments (
article_id uuid,
comment_id timeuuid,
author_login text, // typical join id
content text,
PRIMARY KEY((article_id), comment_id));
71
72. @doanduyhai
From SQL to CQL
1 SELECT
- 10 last comments
- 10 author_login
What to do with 10 author_login ???
Comment
User
1
n
72
73. @doanduyhai
From SQL to CQL
1 SELECT
- 10 last comments
- 10 author_login
What to do with 10 author_login ???
10 extra SELECT → N+1 SELECT problem !
Comment
User
1
n
73
74. @doanduyhai
From SQL to CQL!
De-normalized
Comment
User
1
n
CREATE TABLE comments (
article_id uuid,
comment_id timeuuid,
author frozen<person>, // person is UDT
content text,
PRIMARY KEY((article_id), comment_id));
74
75. @doanduyhai
Data modeling best practices!
Start by queries
• identify core functional read paths
• 1 read scenario ≈ 1 SELECT
75
76. @doanduyhai
Data modeling best practices!
Start by queries
• identify core functional read paths
• 1 read scenario ≈ 1 SELECT
Denormalize
• wisely, only duplicate necessary & immutable data
• functional/technical trade-off
76
78. @doanduyhai
Data modeling best practices!
John DOE, male
birthdate: 21/02/1981
subscribed since 03/06/2011
☉ San Mateo, CA
’’Impossible is not John DOE’’
Full detail read from
User table on click
78
80. @doanduyhai
Data modeling trade-off
2 strategies
• either accept to normalize some data (extra SELECT required)
• or de-normalize and update everywhere upon data mutation
80
81. @doanduyhai
Data modeling trade-off
2 strategies
• either accept to normalize some data (extra SELECT required)
• or de-normalize and update everywhere upon data mutation
But always keep those scenarios rare (5%-10% max), focus on the 90%
81
82. @doanduyhai
Data modeling trade-off
2 strategies
• either accept to normalize some data (extra SELECT required)
• or de-normalize and update everywhere upon data mutation
But always keep those scenarios rare (5%-10% max), focus on the 90%
Example: Twitter tweet deletion
82
84. @doanduyhai
Lightweight Transaction (LWT)!
What ? ☞ make operations linearizable
Why ? ☞ solve a class of race conditions in Cassandra that
would require installing an external lock manager
84
85. @doanduyhai
Lightweight Transaction (LWT)!
INSERT INTO account (id, email)
VALUES (‘jdoe’,
‘john_doe@fiction.com’);
SELECT * FROM account
WHERE id= ‘jdoe’;
(0 rows)
SELECT * FROM account
WHERE id= ‘jdoe’;
(0 rows)
INSERT INTO account (id, email)
VALUES (‘jdoe’,
‘jdoe@fiction.com’);
winner
85
86. @doanduyhai
Lightweight Transaction (LWT)!
How ? ☞ implementing Paxos protocol on Cassandra
Syntax ?
INSERT INTO account (id, email) VALUES (‘jdoe’, ‘john_doe@fiction.com’)
IF NOT EXISTS;
UPDATE account SET email = ‘jdoe@fiction.com’
IF email = ‘john_doe@fiction.com’ WHERE id=‘jdoe’;
86