This document provides an introduction to Cassandra including:
1) An overview of Cassandra's architecture, including its linear scalability, continuous availability across data centers, and operational simplicity.
2) A discussion of Cassandra's data model including its use of Last Write Wins for conflict resolution and examples of modeling one-to-many relationships using clustered tables.
3) Details on Cassandra's consistency levels and how they impact availability and durability of writes and reads.
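The interaction between consistency levels, availability, and durability comes down to replica overlap: a read is guaranteed to observe the latest acknowledged write whenever the read and write replica counts sum to more than the replication factor. A minimal sketch of that rule (illustrative only; the function name is ours, not from the slides):

```python
def is_strongly_consistent(write_cl: int, read_cl: int, replication_factor: int) -> bool:
    """A read sees the latest acknowledged write whenever the write and
    read replica sets are forced to overlap: W + R > RF."""
    return write_cl + read_cl > replication_factor

# With RF=3, QUORUM writes (2) + QUORUM reads (2) overlap: 2 + 2 > 3
print(is_strongly_consistent(2, 2, 3))  # True
# ONE/ONE trades consistency for availability: reads may be stale
print(is_strongly_consistent(1, 1, 3))  # False
```

This is why QUORUM/QUORUM is the usual choice when strong read-your-writes behavior matters, while ONE/ONE maximizes availability and latency at the cost of possibly stale reads.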
Cassandra nice use cases and worst anti-patterns - NoSQL Matters Barcelona - Duyhai Doan
This document summarizes a presentation on Cassandra use cases and anti-patterns. It discusses several anti-patterns to avoid such as queue-like designs, intensive updates on the same column, and designing around a dynamic schema. It also provides examples of good use cases such as rate limiting, anti-fraud detection, and account validation. The document contains an agenda, descriptions of each anti-pattern and their level of failure, as well as explanations and demonstrations of the example use cases.
There are a few options for performing more complex queries in Cassandra beyond the restrictions of the WHERE clause:
1. Denormalize/duplicate data across tables to allow querying on different columns. For example, have one table keyed on user ID and another keyed on message date to allow filtering by date.
2. Offload complex queries to an external search index like Solr or Elasticsearch that can handle full-text and complex queries, and keep Cassandra as the system of record.
3. Use Spark/Hive on Cassandra to run queries across the cluster and leverage their more powerful query engines.
4. Consider a different database if your queries require joins, complex WHERE clauses, or don't map well to Cassandra's data model.
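Option 1, denormalization, means the application writes each record into every table that serves a query path. A hedged sketch of the idea using in-memory dicts as stand-in "tables" (the table and column names are illustrative, not from the source deck):

```python
from collections import defaultdict

# Two "tables" holding the same messages, keyed for different queries.
messages_by_user = defaultdict(list)   # partition key: user_id
messages_by_date = defaultdict(list)   # partition key: date

def save_message(user_id: str, date: str, text: str) -> None:
    # Dual write: the application duplicates the row into every
    # table that serves a query path.
    messages_by_user[user_id].append((date, text))
    messages_by_date[date].append((user_id, text))

save_message("alice", "2016-03-29", "hello")
print(messages_by_user["alice"])        # lookup by user
print(messages_by_date["2016-03-29"])   # lookup by date
```

The trade-off is extra write amplification and the burden of keeping the copies in sync, in exchange for every query being a cheap single-partition read.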
Cassandra data structures and algorithms - Duyhai Doan
This document discusses Cassandra data structures and algorithms. It begins with an introduction and agenda, then covers Cassandra's use of CRDTs, bloom filters, and Merkle trees for its data model. It explains how Cassandra columns can be modeled as a CRDT join semilattice and proves their eventual convergence. The document also covers Cassandra's write path, read path optimized with bloom filters, and the math behind bloom filter probabilities.
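The Bloom filter math mentioned above reduces to two well-known formulas: the false-positive probability is approximately (1 - e^(-kn/m))^k for m bits, n items, and k hash functions, and the optimal k is (m/n)·ln 2. A small sketch of both:

```python
import math

def bloom_fp_rate(m_bits: int, n_items: int, k_hashes: int) -> float:
    """Approximate Bloom filter false-positive probability:
    p ≈ (1 - e^(-k*n/m))^k."""
    return (1 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes

def optimal_k(m_bits: int, n_items: int) -> int:
    """Number of hash functions minimizing p: (m/n) * ln 2."""
    return max(1, round(m_bits / n_items * math.log(2)))

# Rule of thumb: ~10 bits per item with the optimal k gives ~1% FPs.
k = optimal_k(10_000, 1_000)
print(k, bloom_fp_rate(10_000, 1_000, k))
```

This is how Cassandra's read path can skip most SSTables cheaply: a negative answer from the filter is always correct, and the rare false positive only costs an extra disk probe.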
SASI, Cassandra on the full text search ride - Voxxed Days Belgrade 2016 - Duyhai Doan
The document discusses Apache Cassandra's SASI (SSTable Attached Secondary Index). It provides a 5 minute introduction to Cassandra, introduces SASI and how it follows the SSTable lifecycle, describes how SASI works at the cluster level for distributed queries and indexing, and details the local read/write process including data structures and query planning. Some benchmarks are shown for full table scans on a large dataset using SASI with Spark. The key advantages and use cases for SASI are discussed along with its limitations compared to dedicated search engines.
This document provides an overview of big data concepts for a new project in 2017. It discusses distributed systems theories like time ordering, latency, failure and consensus. It also covers data sharding, replication, and the CAP theorem. Key points include how latency is impacted by network delays, different failure modes, and that the CAP theorem states a distributed system cannot simultaneously guarantee all three of consistency, availability, and partition tolerance: when a network partition occurs, it must trade consistency against availability.
Distributed algorithms for big data @ GeeCon - Duyhai Doan
This document discusses distributed algorithms for big data. It begins with an overview of HyperLogLog for estimating cardinality and counting distinct elements in a large data set. It then explains how HyperLogLog works by using a hash function to distribute the data across buckets and applying the LogLog algorithm to each bucket before taking the harmonic mean. The document also covers Paxos for distributed consensus, explaining the phases of prepare, promise, accept and learn to reach agreement in the presence of failures.
Fast track to getting started with DSE Max @ ING - Duyhai Doan
This document provides an overview of Apache Spark and Apache Cassandra and how they can be used together. It begins with introductions to Spark, describing its core concepts like RDDs and transformations. It then introduces Cassandra and covers concepts like data distribution and token ranges. The remainder discusses the Spark Cassandra connector, covering how it allows reading and writing Cassandra data from Spark and maintaining data locality. It also discusses use cases, failure handling, and cross-datacenter/cluster operations.
MongoDB and Indexes - MUG Denver - 20160329 - Douglas Duncan
Indexes are data structures that store a subset of data to allow for efficient retrieval. MongoDB stores indexes using a b-tree format. There are several types of indexes including simple, compound, multikey, full-text, and geospatial indexes. Indexes improve performance by enabling efficient retrieval, sorting, and filtering of documents. Indexes are created using the createIndex command and their usage can be checked using explain plans.
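The reason an ordered (b-tree-like) index makes retrieval efficient can be shown with a minimal sketch: keep (key, document id) pairs sorted and binary-search them instead of scanning every document. This is an illustration of the principle, not MongoDB's actual storage format:

```python
import bisect

# A tiny "collection" and a sorted index over one field.
docs = {1: {"age": 31}, 2: {"age": 25}, 3: {"age": 25}}
index = sorted((doc["age"], doc_id) for doc_id, doc in docs.items())

def find_by_age(age: int) -> list:
    """Binary-search the sorted index instead of scanning the collection."""
    lo = bisect.bisect_left(index, (age, float("-inf")))
    hi = bisect.bisect_right(index, (age, float("inf")))
    return [doc_id for _, doc_id in index[lo:hi]]

print(find_by_age(25))  # matching ids found in O(log n), no full scan
```

The same sorted structure is what makes range queries and index-covered sorting cheap, and it is also why every insert or update must pay the cost of maintaining the index.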
Indexes are references to documents that are efficiently ordered by key and maintained in a tree structure for fast lookup. They improve the speed of document retrieval, range scanning, ordering, and other operations by enabling the use of the index instead of a collection scan. While indexes improve query performance, they can slow down document inserts and updates since the indexes also need to be maintained. The query optimizer aims to select the best index for each query but can sometimes be overridden.
Datastax Day 2016: Cassandra data modeling basics - Duyhai Doan
This document discusses data modeling with Apache Cassandra. It covers:
1. The objectives of data modeling like reducing query latency and avoiding disasters
2. Choosing the right partition key which is the main entry point for queries and helps distribute data
3. Using clustering columns to simulate one-to-many relationships and enable sorting and range queries
4. Other critical details like avoiding huge partitions, sub-partitioning techniques, and how deletes create tombstones
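Point 3 above rests on one invariant: rows inside a partition are kept sorted by the clustering column, which is what makes sorting and range queries free within a partition. A hedged in-memory sketch of that layout (names are illustrative):

```python
from bisect import bisect_left, bisect_right, insort
from collections import defaultdict

# partition key -> rows kept sorted by clustering column
table = defaultdict(list)

def insert(user_id: str, msg_date: str, text: str) -> None:
    # The clustering column (msg_date) keeps rows ordered on insert.
    insort(table[user_id], (msg_date, text))

def range_query(user_id: str, start: str, end: str) -> list:
    """Range scan over the clustering column of one partition."""
    rows = table[user_id]
    lo = bisect_left(rows, (start, ""))
    hi = bisect_right(rows, (end, "\uffff"))
    return rows[lo:hi]

insert("alice", "2016-01-02", "b")
insert("alice", "2016-01-01", "a")
print(range_query("alice", "2016-01-01", "2016-01-02"))
```

The one-to-many relationship falls out naturally: one partition (the "one" side) holds many ordered rows (the "many" side), all readable in a single contiguous scan.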
This document discusses using Spark with Apache Cassandra for various use cases including loading data from various sources, performing analytics, and sanitizing, validating, and transforming data. It provides examples of using Spark jobs to import data, clean data, perform schema migrations, and run analytics queries. It also covers aspects of the connector architecture like data locality, failure handling, and cross data center operations. The document concludes with discussing a benchmark that used Spark and Cassandra to perform parallel data ingestion and top-K queries on 3.2 billion rows of data with SASI indices.
This document discusses MySQL 5.7's JSON datatype. It introduces JSON and why it is useful for integrating relational and schemaless data. It covers creating JSON columns, inserting and selecting JSON data using functions like JSON_EXTRACT. It discusses indexing JSON columns using generated columns. Performance is addressed, showing JSON tables can be 40% larger with slower inserts and selects compared to equivalent relational tables without indexes. Options for stored vs virtual generated columns are presented.
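The generated-column trick described above amounts to extracting a JSON path once at write time so the value can be indexed, instead of parsing the document on every query. A sketch of that idea in plain Python (the field names are illustrative, not from the source deck):

```python
import json

rows = []

def insert(doc_text: str) -> None:
    doc = json.loads(doc_text)
    # "Generated column": the path is materialized at insert time,
    # so it can be indexed and queried without reparsing the JSON.
    rows.append({"doc": doc_text, "name": doc.get("name")})

insert('{"name": "widget", "price": 9.99}')
print([r["name"] for r in rows])  # extracted values, ready to index
```

The stored-vs-virtual distinction the deck mentions is the same trade-off: stored generated columns pay disk space to avoid recomputation, virtual ones recompute on read.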
The first half of this presentation is an introduction to Apache Cassandra's architecture, highlighting its main features: distributed (masterless), replicated, multi data center.
The second half is focused on data modeling with Apache Cassandra, the differences with the relational way of doing data modeling and a few real examples, highlighting potential issues and providing alternatives.
This document discusses Apache Cassandra and its features and use cases. It provides an overview of Cassandra's key characteristics like massive scalability, extreme availability, and rich data modeling. Example use cases mentioned include messaging, collections/playlists, fraud detection, recommendations, and IoT sensor data. New features introduced in Cassandra in 2016 are also summarized, such as delete by range, materialized views, atomic UDT updates, a new SASI index, and support for GROUP BY queries.
This document provides an introduction and overview of Cassandra including:
- Cassandra's history as a NoSQL database created at Facebook and open sourced in 2008
- Key features of Cassandra including linear scalability, continuous availability, support for multiple data centers, operational simplicity, and analytics capabilities
- Details on Cassandra's architecture including its cluster layer based on Amazon Dynamo and data store layer based on Google BigTable
- Explanations of Cassandra's data distribution, token ranges, replication, coordinator nodes, tunable consistency levels, and write path
- Descriptions of Cassandra's data model including last-write-wins conflict resolution and examples of CRUD operations and table schemas
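Last-write-wins reduces to a tiny merge rule: every cell carries a write timestamp, and on conflict the cell with the highest timestamp survives (with a deterministic tie-break so all replicas converge). A minimal sketch:

```python
def lww_merge(cell_a: tuple, cell_b: tuple) -> tuple:
    """Each cell is (timestamp, value). The highest timestamp wins;
    on a timestamp tie, tuple comparison falls back to the value,
    giving every replica the same deterministic winner."""
    return max(cell_a, cell_b)

replica_a = (1001, "bob")
replica_b = (1002, "alice")   # written later
print(lww_merge(replica_a, replica_b))  # (1002, 'alice')
```

Because the merge is commutative, associative, and idempotent, replicas can receive writes in any order, any number of times, and still converge on the same value.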
The document provides instructions on various MongoDB commands for working with databases, collections, and documents. It demonstrates how to start the MongoDB CLI, create and drop databases and collections, insert, update, find, and remove documents, and add indexes. It also discusses sharding, backups using mongodump, and restores with mongorestore.
MongoDB World 2016: Deciphering .explain() Output - MongoDB
The document discusses different explain modes for MongoDB queries and aggregations. It begins with an overview of explain() and query plans, then covers the default "queryPlanner" mode which shows the winning and rejected plans. It also mentions the "executionStats" and "allPlansExecution" modes which provide more runtime statistics. The document aims to help understand how queries and aggregations are executed and troubleshoot performance issues.
Cassandra 3.0 - JSON at scale - StampedeCon 2015 - StampedeCon
This session will explore the new features in Cassandra 3.0, starting with JSON support. Cassandra now allows storing JSON directly to Cassandra rows and vice versa, making it trivial to deploy Cassandra as a component in modern service-oriented architectures.
Cassandra 3.0 also delivers other enhancements to developer productivity: user-defined functions let developers deploy custom application logic server side with any language conforming to the Java scripting API, including JavaScript. Global indexes allow scaling indexed queries linearly with the size of the cluster, a first for open-source NoSQL databases.
Finally, we will cover the performance improvements in Cassandra 3.0 as well.
Cassandra and Spark, closing the gap between no sql and analytics codemotio... - Duyhai Doan
This document discusses how Spark and Cassandra can be used together. It begins with an introduction to Spark and Cassandra individually, explaining their architectures and key features. It then details the Spark-Cassandra connector, describing how Cassandra tables can be exposed as Spark RDDs and DataFrames. Various use cases for Spark and Cassandra are presented, including data cleaning, schema migration, and analytics. The document emphasizes the importance of data locality when performing joins and writes between Spark and Cassandra. Code examples are provided for common tasks like data cleaning, migration, and analytics.
The document discusses enhancements and new features in Cassandra 2.1 including user defined types, collection indexing, improved counters, data directory changes, bloom filter improvements, and more efficient repair. It also outlines the new query cache and row cache features in Cassandra.
Back to Basics Webinar 4: Advanced Indexing, Text and Geospatial IndexesMongoDB
This is the fourth webinar of a Back to Basics series that will introduce you to the MongoDB database. This webinar will introduce you to the aggregation framework.
Presented by Tom Schreiber, Senior Consulting Engineer, MongoDB
MongoDB supports a wide range of indexing options to enable fast querying of your data, but what are the right strategies for your application? In this talk we’ll cover how indexing works, the various indexing options, and cover use cases where each might be useful. We'll dive into common pitfalls using real-world examples to ensure that you're ready for scale. We'll show you the tools and techniques for diagnosing and tuning the performance of your MongoDB deployment. Whether you're running into problems or just want to optimize your performance, these skills will be useful.
MySQL 5.7 Tutorial - Dutch PHP Conference 2015 - Dave Stokes
MySQL 5.7 is the latest version of the MySQL database. It includes new features such as support for JSON as a native data type with functions for manipulating JSON documents. Security has also been improved with secure defaults, password rotation/expiration controls, and SSL encryption enabled by default for the C client library. The release candidate for 5.7 was released in April 2015 and includes patches, contributions, and enhancements over previous versions.
The document discusses various web technologies including Yahoo Query Language (YQL), mashups, screen scraping, web services, RSS, JSON, orchestration, SQL, OAuth, PHP, Python, and provides links to resources about these topics. It also thanks the organizer of an event and provides picture credits.
Data Wars: The Bloody Enterprise strikes back - Victor_Cr
I would like to describe cases where we create problems for our "future selves" by accident. I will show how different Java data types can ease or increase the pain of supporting the application later, covering the most common pitfalls and tricky corner cases you have probably never thought about.
This document summarizes a presentation on Cassandra Query Language version 3 (CQL3). It outlines the motivations for CQL3, provides examples of defining schemas and querying data with CQL3, and notes new features like collection support. The document also reviews changes from earlier versions like improved definition of static and dynamic column families using composite keys.
As your data grows, the need to establish proper indexes becomes critical to performance. MongoDB supports a wide range of indexing options to enable fast querying of your data, but what are the right strategies for your application?
In this talk we’ll cover how indexing works, the various indexing options, and use cases where each can be useful. We'll dive into common pitfalls using real-world examples to ensure that you're ready for scale.
This document provides an introduction and overview of Cassandra including:
- Cassandra's history as a NoSQL database created at Facebook and open sourced in 2008.
- Key features of Cassandra including linear scalability, continuous availability, ability to span multiple data centers, and operational simplicity.
- A high-level overview of Cassandra's architecture including its use of Dynamo and BigTable papers for the cluster and data storage layers.
- Concepts related to Cassandra's data model including data distribution, token ranges, replication, write path, and "last write wins" consistency.
This document summarizes a presentation about the KillrChat messaging application. KillrChat is a scalable messaging app built using AngularJS, Spring, and Cassandra. It demonstrates de-normalization and provides an exercise for attendees to work with user and chat room management, as well as chat messages. The document outlines the architecture, data models, and solutions for handling concurrent requests to avoid inconsistencies through the use of lightweight transactions in Cassandra.
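The lightweight-transaction pattern KillrChat relies on for user sign-up is essentially a compare-and-set: an insert that only succeeds when no row exists yet, so two concurrent registrations of the same login cannot both win. A hedged sketch of that semantics (the return shape mirrors a CQL LWT response; the data model is illustrative):

```python
users = {}

def insert_if_not_exists(login: str, profile: dict) -> tuple:
    """Compare-and-set insert: returns (applied, row), like the
    [applied] column plus existing row of a CQL LWT response."""
    if login in users:
        return False, users[login]      # lost the race: existing row returned
    users[login] = profile
    return True, profile                # won the race: insert applied

print(insert_if_not_exists("alice", {"name": "Alice"}))     # applied
print(insert_if_not_exists("alice", {"name": "Imposter"}))  # rejected
```

In Cassandra the check-and-write must be atomic across replicas, which is why LWTs run a Paxos round under the hood and cost noticeably more than plain writes; the app uses them only where a race would corrupt data.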
The presentation introduces KillrChat, a scalable messaging app built using Cassandra to demonstrate denormalization. It discusses the technology stack including Cassandra, Spring Boot, and AngularJS. It then covers the data models and solutions for various entities like users, chat rooms, and messages to handle concurrent modifications using lightweight transactions. Real-time features are implemented with WebSockets. The presentation provides a hands-on exercise for attendees and highlights how to build a real application with the Cassandra ecosystem.
The document describes the KillrChat application, which is a scalable chat application built with AngularJS, Cassandra, and Spring Boot. It discusses the application architecture including using Cassandra for distributed data storage and scaling out via a message broker. It also summarizes the key components of the application including controllers, services, REST resources, directives, and how data is distributed in Cassandra.
This document summarizes Cassandra drivers and tools. It discusses the Java driver architecture including connection pooling, load balancing policies, and automatic paging. It also demonstrates Cassandra Unit for testing, the Java driver object mapper module, and Achilles object mapper with features like dirty checking. Live coding examples are provided for these tools.
The document provides an introduction to Cassandra presented by Duy Hai Doan. It discusses Cassandra's history, key features including linear scalability, availability, support for multiple data centers, operational simplicity, and analytics capabilities. It also covers Cassandra architecture including the cluster layer based on Dynamo and data-store layer based on BigTable, data distribution, replication, consistency levels, and the write path. The data model of using the last write to resolve conflicts is explained along with CQL basics and modeling one-to-many relationships with clustered tables.
Datastax day 2016 introduction to apache cassandra - Duyhai Doan
This document provides an overview of Apache Cassandra and discusses its key features. It describes how Cassandra distributes and replicates data across multiple nodes for continuous availability and linear scalability. It also covers Cassandra's consistency model and how consistency levels can be tuned to balance availability and durability. The document lists Cassandra's features like collections, user-defined types, materialized views, and JSON support for flexible data modeling.
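The tunable-consistency trade-off described above follows a simple overlap rule: a read is guaranteed to see the latest acknowledged write whenever the number of replicas written plus the number read exceeds the replication factor. A small helper makes the arithmetic concrete (the function name is made up for this sketch):

```python
# Sketch of the replica-overlap rule for tunable consistency:
# strong consistency holds when write_replicas + read_replicas > RF,
# because the read set must then intersect the write set.

def is_strongly_consistent(write_cl, read_cl, rf):
    levels = {"ONE": 1, "TWO": 2, "THREE": 3,
              "QUORUM": rf // 2 + 1, "ALL": rf}
    return levels[write_cl] + levels[read_cl] > rf

print(is_strongly_consistent("QUORUM", "QUORUM", 3))  # True:  2 + 2 > 3
print(is_strongly_consistent("ONE", "ONE", 3))        # False: 1 + 1 <= 3
```

This is why QUORUM writes plus QUORUM reads is the common recipe: it buys strong consistency while still tolerating one replica being down (for RF = 3).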
This document summarizes a presentation about using Spark with Apache Cassandra. It discusses using Spark jobs to load and transform data in Cassandra for purposes such as data import, cleaning, schema migration and analytics. It also covers aspects of the connector architecture like data locality, failure handling and cross-cluster operations. Examples are given of using Spark and Cassandra together for parallel data ingestion and top-K queries on a large dataset.
Spark cassandra integration, theory and practice - Duyhai Doan
This document discusses Spark and Cassandra integration. It begins with an introduction to Spark, describing it as a general data processing framework that is faster than Hadoop. It then discusses the Cassandra database and its data distribution using token ranges. The document provides examples of using the Spark/Cassandra connector for reading and writing data between Spark and Cassandra, including techniques for ensuring data locality. It discusses best practices for cluster deployment and handling failures while maintaining data locality. Finally, it presents some use cases for using Spark/Cassandra including data cleaning, schema migration, and analytics.
This document discusses Libon's migration of contact data from an SQL database to Cassandra. It began with billions of contact records stored relationally in Oracle. Performance became unpredictable at scale. Tuning Oracle helped but new challenges like high availability and multi-datacenter support remained. The migration strategy involved writing to both databases, migrating old data, and switching fully to Cassandra while preserving no downtime and safe rollback. Business code refactoring maintained existing tests by modifying services and repositories to work with the new Cassandra data model.
This document discusses Cassandra and the Datastax Academy. It provides examples of companies using Cassandra as infrastructure including ING, Netflix, Sony, and Microsoft. It also discusses the increasing SQL support in Cassandra, such as user defined functions, materialized views, and secondary indexes. The document notes that skills in Cassandra are in high demand but difficult to find. It promotes the Datastax Academy as a free solution to this problem, offering self-paced courses, instructor-led training, and O'Reilly certification to boost careers.
Cassandra 3 new features @ Geecon Krakow 2016 - Duyhai Doan
Duyhai Doan gave a presentation on new features in Cassandra 3.0, including materialized views, user defined functions, user defined aggregates, and the new SASI full text search index. Materialized views allow pre-computing common queries to improve performance. User defined functions and aggregates enable pushing computation to the server. The new SASI index provides improved full text search capabilities in Cassandra.
Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris - Duyhai Doan
This document provides an overview of Spark and its integration with Cassandra for real-time data processing. It begins with introductions of the speaker and Datastax. It then discusses what Spark and Cassandra are, including their architectures and key characteristics like Spark being fast, easy to use, and supporting multiple languages. The document demonstrates basic Spark code and how RDDs work. It covers the Spark and Cassandra connectors and how they provide locality-aware joins. It also discusses use cases and deployment options. Finally, it considers future improvements like leveraging Solr for local filtering to improve data locality during joins.
Apache zeppelin the missing component for the big data ecosystem - Duyhai Doan
Duy Hai Doan presented Apache Zeppelin, an open-source web-based notebook that allows users to interact with data. Zeppelin provides a front-end GUI and display system for data analysis tools and uses interpreters to connect to back-end systems like Spark, Cassandra, and Flink. Doan demonstrated Zeppelin's notebook interface, display options, and how users can write their own interpreters to connect new systems to Zeppelin. Future plans for Zeppelin include improving usability, adding authentication and authorization, and developing more interpreters and visualizations.
This document provides an overview of DataStax Enterprise, a database platform for cloud applications. It discusses key features of DataStax Enterprise including that it is certified for production, offers automatic management services for configuration and administration through OpsCenter, and provides 24/7 expert support. The document also summarizes various DataStax Enterprise technologies and capabilities like advanced replication, tiered storage, security features, and integration with search, analytics, and graph databases.
The document discusses Cassandra architecture and operations. It provides an overview of key Cassandra concepts like data distribution across nodes, replication, consistency levels, and the write and read paths. It also covers topics like compaction strategies, best practices for configuration, and operational recommendations.
Sasi, cassandra on full text search ride - Duyhai Doan
This document discusses SASI (SSTable Attached Secondary Index), a new secondary index for Apache Cassandra that follows the SSTable lifecycle. It describes how SASI works, including its in-memory and on-disk structures. It also covers SASI's query planning optimizations and provides some benchmark results showing SASI's performance improvements over full scans. While SASI is not as full-featured as search engines, it can cover many search use cases within Cassandra.
This document provides an introduction to Cassandra including:
- Datastax is a company that contributes to Apache Cassandra and sells Datastax Enterprise.
- Cassandra was created at Facebook and is now open source software with the current version being 3.2.
- Cassandra's key features include linear scalability, continuous availability, multi-datacenter support, operational simplicity, and Spark integration.
DOAN DuyHai – Cassandra: real world best use-cases and worst anti-patterns - ... - NoSQLmatters
This document discusses Cassandra use cases and anti-patterns. It describes queue-like designs, intensive updates on the same column, and designing around a dynamic schema as anti-patterns that can lead to failures. Rate limiting, fraud prevention, and account validation are provided as examples of good use cases. Key-value modeling, clustering, compaction strategies, and time-to-live features are also overviewed.
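Rate limiting is named above as a good fit for Cassandra, typically via TTL'd rows or counters keyed by user and time window. The sketch below captures only the windowed-counting logic in plain Python (the dict stands in for a Cassandra partition; names and limits are invented for illustration):

```python
# Illustrative sliding-window rate limiter of the kind the talk
# describes. In Cassandra this state would live in a TTL'd row per
# user; here an in-memory dict plays that role so the logic is runnable.

import time

WINDOW_SECONDS = 60
MAX_ATTEMPTS = 3
attempts = {}  # user -> list of attempt timestamps

def allow(user, now=None):
    now = time.time() if now is None else now
    # Keep only attempts still inside the window (TTL expiry in Cassandra).
    recent = [t for t in attempts.get(user, []) if now - t < WINDOW_SECONDS]
    if len(recent) >= MAX_ATTEMPTS:
        attempts[user] = recent
        return False
    attempts[user] = recent + [now]
    return True

results = [allow("u1", now=1000 + i) for i in range(4)]
print(results)  # the fourth attempt inside the window is rejected
```

Using TTLs to expire old attempts means Cassandra does the cleanup for free, which is precisely why this pattern suits it better than queue-like designs do.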
"Real-time data processing with Spark & Cassandra", jDays 2015 Speaker: "Duy-..." - hamidsamadi
This document provides an overview of Spark and its integration with Cassandra for real-time data processing. It introduces Spark and its characteristics like being fast, easy to use, and having a rich API. It then discusses Cassandra's data distribution using token ranges and how Spark partitions data to maximize data locality when reading from and writing to Cassandra. The document demonstrates the Spark-Cassandra connector architecture and how it exposes Cassandra tables as RDDs and DataFrames while pushing predicates down for filtering. It also provides examples of using the connector API to read and write data and ensuring data locality.
Cassandra introduction apache con 2014 budapest - Duyhai Doan
This document provides an introduction and summary of Cassandra presented by Duy Hai Doan. It discusses Cassandra's history as a NoSQL database created at Facebook and open sourced in 2008. The key architecture of Cassandra including its data distribution across nodes, replication for failure tolerance, and consistency models for reads and writes is summarized.
DuyHai DOAN - Real time analytics with Cassandra and Spark - NoSQL matters Pa... - NoSQLmatters
Apache Spark is a general data processing framework which allows you to perform map-reduce tasks (but not only) in memory. Apache Cassandra is a highly available and massively scalable NoSQL data-store. By combining Spark's flexible API and Cassandra's performance, we get an interesting alternative to the Hadoop eco-system for both real-time and batch processing. During this talk we will highlight the tight integration between Spark & Cassandra and demonstrate some usages with a live code demo.
Cassandra nice use cases and worst anti patterns - Duyhai Doan
This document discusses Cassandra use cases and anti-patterns. Some good use cases include rate limiting, fraud prevention, account validation, and storing sensor time series data. Poor designs include using Cassandra like a queue, storing null values, intensive updates to the same column, and dynamically changing the schema. The document provides examples and explanations of how to properly implement these scenarios in Cassandra.
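The queue anti-pattern listed above fails for a concrete reason: a delete in Cassandra writes a tombstone rather than removing data, so a drained "queue" partition forces every read to scan a growing pile of tombstones before finding anything live. The toy model below illustrates that mechanism only; it is not how Cassandra stores cells internally.

```python
# Toy model of why queue-like designs hurt in Cassandra: deletes write
# tombstones, so reads on a drained queue partition must scan every
# tombstoned cell before concluding nothing live remains.

log = []  # append-only cell log for one partition

def enqueue(msg):
    log.append(("live", msg))

def dequeue():
    for i, (state, msg) in enumerate(log):
        if state == "live":
            log[i] = ("tombstone", msg)   # delete = write a tombstone
            return msg
    return None

# Produce and immediately consume 1000 messages.
for i in range(1000):
    enqueue(f"msg-{i}")
    dequeue()

# A subsequent read scans the whole partition and finds nothing live.
scanned = 0
for state, _ in log:
    scanned += 1
    if state == "live":
        break
print(scanned)
```

Real Cassandra eventually purges tombstones at compaction time (after gc_grace_seconds), but in a busy queue the read path pays this scan cost long before compaction catches up, which is why the pattern is flagged as an anti-pattern.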
Daniel Krasner - High Performance Text Processing with Rosetta - PyData
This talk covers rapid prototyping of a high performance scalable text processing pipeline development in Python. We demonstrate how Python modules, in particular from the Rosetta library, can be used to analyze, clean, extract features, and finally perform machine learning tasks such as classification or topic modeling on millions of documents. Our style is to build small and simple modules (each with command line interfaces) that use very little memory and are parallelized with the multiprocessing library.
Kernel Recipes 2019 - GNU poke, an extensible editor for structured binary data - Anne Nicolas
GNU poke is a new interactive editor for binary data. Not limited to editing basic entities such as bits and bytes, it provides a full-fledged procedural, interactive programming language designed to describe data structures and to operate on them. Once a user has defined a structure for binary data (usually matching some file format) she can search, inspect, create, shuffle and modify abstract entities such as ELF relocations, MP3 tags, DWARF expressions, partition table entries, and so on, with primitives resembling simple editing of bits and bytes. The program comes with a library of already written descriptions (or “pickles” in poke parlance) for many binary formats.
GNU poke is useful in many domains. It is very well suited to aid in the development of programs that operate on binary files, such as assemblers and linkers. This was in fact the primary inspiration that brought me to write it: easily injecting flaws into ELF files in order to reproduce toolchain bugs. Also, due to its flexibility, poke is also very useful for reverse engineering, where the real structure of the data being edited is discovered by experiment, interactively. It is also good for the fast development of prototypes for programs like linkers, compressors or filters, and it provides a convenient foundation to write other utilities such as diff and patch tools for binary files.
This talk (unlike Gaul) is divided into four parts. First I will introduce the program and show what it does: from simple bits/bytes editing to user-defined structures. Then I will show some of the internals, and how poke is implemented. The third block will cover the way of using Poke to describe user data, which is to say the art of writing “pickles”. The presentation ends with a status of the project, a call for hackers, and a hint at future works.
Jose E. Marchesi
The document provides information about the Go programming language. It discusses the history and creators of Go, key features of the language such as concurrency and garbage collection, basic Go code examples, and common data types like slices and maps. It also covers Go tools, environments, benchmarks showing Go's performance, and examples of companies using Go in production.
This document discusses query optimization in database systems. It explains that a query optimizer is needed because there are many possible ways to execute a query with different tables and joins. The optimizer uses statistics, cost modeling, and explores the search space of options to pick the most efficient plan. It also shows how database internals knowledge like indexes, joins, and parallelism can help the optimizer generate better execution plans.
This document summarizes the Cassandra Java driver and tools. It discusses the driver's architecture including connection pooling, request pipelining, load balancing policies, and automatic failover. It also covers using statements, asynchronous reads, the query builder, and the object mapper. Lastly, it discusses new automatic paging functionality in the driver.
Coding Like the Wind - Tips and Tricks for the Microsoft Visual Studio 2012 C... - Rainer Stropek
Microsoft Visual Studio 2012 contains a bunch of productivity features for C# developers. Rainer Stropek, MVP for Windows Azure, summarizes his top tips for the new VS2012 C# IDE in this presentation
The document provides information about an introduction to Python programming presented by Kiattisak Anoochitarom. It begins with welcoming messages and details about the presenter. It then discusses various Python topics like data types, operators, control flow statements, functions, built-in functions, and string and list methods. Examples are provided throughout to demonstrate different Python concepts and syntax. The goal is to teach the basics of the Python language.
A lot has changed since I gave one of these talks and man, has it been good. 2.0 brought us a lot of new CQL features and now with 2.1 we get even more! Let me show you some real life data models and those new features taking developer productivity to an all new high. User Defined Types, New Counters, Paging, Static Columns. Exciting new ways of making your app truly killer!
Unlock user behavior with 87 Million events using Hudi, StarRocks & MinIO - nadine39280
Understanding conversion funnel and rates is essential for deciphering e-commerce shopping behavior. In this live event, Albert Wong from StarRocks will provide an anonymized, real-world customer dataset featuring 87 million events and 4 million unique products spanning 10,000 product categories. He'll showcase how to deploy a modern data lakehouse with #ApacheHudi and MinIO, then conduct complex analytics, including JOIN operations, to analyze purchasing patterns and product conversion rates with #StarRocks as the analytical engine.
You can catch the live event:
https://youtu.be/-Wp7itPDtgo
This document provides an overview of Spark SQL and its architecture. Spark SQL allows users to run SQL queries over SchemaRDDs, which are RDDs with a schema and column names. It introduces a SQL-like query abstraction over RDDs and allows querying data in a declarative manner. The Spark SQL component consists of Catalyst, a logical query optimizer, and execution engines for different data sources. It can integrate with data sources like Parquet, JSON, and Cassandra.
Analytics: The Final Data Frontier (or, Why Users Need Your Data and How Pino... - HostedbyConfluent
Pinot is an open source distributed real-time data store. It ingests and indexes data from offline batch loads and real-time streams, and supports low latency queries. Key components include tables, segments, servers, brokers, and indexes like inverted indexes and star-tree indexes. Data can be ingested through batch or real-time modes, with batch loading segmented data and real-time continuously consuming streams.
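Of the index types listed, the inverted index is the easiest to make concrete: per segment, it maps each column value to the set of row ids containing it, so an equality filter becomes a set lookup and an AND becomes a set intersection. The sketch below is purely illustrative; the rows and column names are invented and this is not Pinot's actual on-disk format.

```python
# Minimal inverted index of the kind Pinot builds per segment: each
# column value maps to the docIds (row ids) containing it, turning
# equality filters into set lookups instead of full scans.

from collections import defaultdict

rows = [
    {"country": "US", "device": "mobile"},
    {"country": "FR", "device": "desktop"},
    {"country": "US", "device": "desktop"},
]

inverted = defaultdict(lambda: defaultdict(set))
for doc_id, row in enumerate(rows):
    for col, val in row.items():
        inverted[col][val].add(doc_id)

# WHERE country = 'US' AND device = 'desktop' -> set intersection
matches = inverted["country"]["US"] & inverted["device"]["desktop"]
print(sorted(matches))  # [2]
```

Pinot's star-tree index goes a step further by pre-aggregating along dimension combinations, but the inverted index above is the building block that makes the per-segment filtering cheap.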
SASI, Cassandra on the full text search ride - DuyHai Doan - Codemotion Milan... - Codemotion
Apache Cassandra is a scalable database with high availability features. But they come with severe limitations in terms of querying capabilities. Since the introduction of SASI in Cassandra 3.4, those limitations belong to the past. Now you can create performant indices on your columns as well as benefit from full text search capabilities with the introduction of the new LIKE %term% syntax. To illustrate how SASI works, we'll use a database of 100,000 albums and artists.
Apache Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It provides reliable storage through its distributed file system and scalable processing through its MapReduce programming model. Yahoo! uses Hadoop extensively for applications like log analysis, content optimization, and computational advertising, processing over 6 petabytes of data across 40,000 machines daily.
The document provides an overview of different index types in Postgres including B-Tree, GIN, GiST, and BRIN indexes. It discusses what each index type is best suited for, how to create each type of index, and their internal data structures. Specifically, it covers that B-Tree indexes are good for equality comparisons, GIN indexes store unique values efficiently for arrays/JSON and are useful for containment operators, GiST indexes allow overlapping ranges and are useful for nearest neighbor searches, and BRIN indexes provide scalable indexing for large tables.
Big data 101 for beginners riga dev days - Duyhai Doan
This document provides an overview and introduction to big data concepts for a new project in 2017. It discusses distributed systems theories like time ordering, latency, failure modes, and consensus protocols. It also covers data sharding and replication techniques. The document explains the CAP theorem and how it relates to consistency and availability. Finally, it discusses different distributed systems architectures like master/slave versus masterless designs.
Spark zeppelin-cassandra at synchrotron - Duyhai Doan
This document discusses using Spark, Cassandra, and Zeppelin for storing and aggregating metrics data from a particle accelerator project called HDB++. It provides an overview of the HDB++ project, how it previously used MySQL but now stores data in Cassandra. It describes the Spark jobs that are run to load metrics data from Cassandra and generate statistics that are written back to Cassandra. It also demonstrates visualizing the data using Zeppelin and discusses some tricks and traps to be aware of when using this stack.
This document discusses user-defined functions and materialized views in Cassandra. It provides information on how to create user-defined functions and user-defined aggregates, including the syntax and best practices. It also covers how user-defined functions and aggregates are executed. The document then discusses materialized views, including why they are useful and how they work at a high level. It provides the syntax for creating materialized views and describes how updates are handled.
Apache zeppelin, the missing component for the big data ecosystem - Duyhai Doan
Apache Zeppelin is a web-based notebook that allows users to interact with data via interpreters like Spark, SQL, and Cassandra. It provides a GUI for data scientists to write code and visualizations in notebooks. Zeppelin has a modular architecture that allows new interpreters to be easily added. It also includes features like scheduling, sharing, and exporting of notebooks.
Spark cassandra connector. API, Best Practices and Use-Cases - Duyhai Doan
- The document discusses Spark/Cassandra connector API, best practices, and use cases.
- It describes the connector architecture including support for Spark Core, SQL, and Streaming APIs. Data is read from and written to Cassandra tables mapped as RDDs.
- Best practices around data locality, failure handling, and cross-region/cluster operations are covered. Locality is important for performance.
- Use cases include data cleaning, schema migration, and analytics like joins and aggregation. The connector allows processing and analytics on Cassandra data with Spark.
UiPath Test Automation using UiPath Test Suite series, part 6 - DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 6. In this session, we will cover Test Automation with generative AI and Open AI.
The UiPath Test Automation with generative AI and OpenAI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI with OpenAI's advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers, and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI
Test Automation with generative AI and Open AI.
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
Removing Uninteresting Bytes in Software Fuzzing - Aftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing xml documents, and Binutil's readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format). Our preliminary results show that AFL+DIAR does not only discover new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
Driving Business Innovation: Latest Generative AI Advancements & Success Story - Safe Software
Are you ready to revolutionize how you handle data? Join us for a webinar where we’ll bring you up to speed with the latest advancements in Generative AI technology and discover how leveraging FME with tools from giants like Google Gemini, Amazon, and Microsoft OpenAI can supercharge your workflow efficiency.
During the hour, we’ll take you through:
Guest Speaker Segment with Hannah Barrington: Dive into the world of dynamic real estate marketing with Hannah, the Marketing Manager at Workspace Group. Hear firsthand how their team generates engaging descriptions for thousands of office units by integrating diverse data sources—from PDF floorplans to web pages—using FME transformers, like OpenAIVisionConnector and AnthropicVisionConnector. This use case will show you how GenAI can streamline content creation for marketing across the board.
Ollama Use Case: Learn how Scenario Specialist Dmitri Bagh has utilized Ollama within FME to input data, create custom models, and enhance security protocols. This segment will include demos to illustrate the full capabilities of FME in AI-driven processes.
Custom AI Models: Discover how to leverage FME to build personalized AI models using your data. Whether it’s populating a model with local data for added security or integrating public AI tools, find out how FME facilitates a versatile and secure approach to AI.
We’ll wrap up with a live Q&A session where you can engage with our experts on your specific use cases, and learn more about optimizing your data workflows with AI.
This webinar is ideal for professionals seeking to harness the power of AI within their data management systems while ensuring high levels of customization and security. Whether you're a novice or an expert, gain actionable insights and strategies to elevate your data processes. Join us to see how FME and AI can revolutionize how you work with data!
Pushing the limits of ePRTC: 100ns holdover for 100 days - Adtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
Building Production Ready Search Pipelines with Spark and Milvus - Zilliz
Spark is the widely used ETL tool for processing, indexing and ingesting data to serving stack for search. Milvus is the production-ready open-source vector database. In this talk we will show how to use Spark to process unstructured data to extract vector representations, and push the vectors to Milvus vector database for search serving.
Best 20 SEO Techniques To Improve Website Visibility In SERP - Pixlogix Infotech
Boost your website's visibility with proven SEO techniques! Our latest blog dives into essential strategies to enhance your online presence, increase traffic, and rank higher on search engines. From keyword optimization to quality content creation, learn how to make your site stand out in the crowded digital landscape. Discover actionable tips and expert insights to elevate your SEO game.
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU - panagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-und-domino-lizenzkostenreduzierung-in-der-welt-von-dlau/
DLAU and licensing under the CCB and CCX models have been a hot topic for many in the HCL community since last year. As a Notes or Domino customer, you may be struggling with unexpectedly high user counts and license fees. You may be wondering how this new kind of licensing works and what benefit it brings you. Above all, you surely want to stay within budget and save costs wherever possible. We understand that, and we want to help you!
We explain how to resolve common configuration problems that can cause more users to be counted than necessary, and how to identify and remove superfluous or unused accounts to save money. There are also some approaches that can lead to unnecessary spending, for example when a person document is used instead of a mail-in database for shared mailboxes. We show you such cases and their solutions. And of course we explain the new license model.
Join this webinar, in which HCL Ambassador Marc Thomas and guest speaker Franz Walder introduce you to this new world. It gives you the tools and the know-how to keep an overview. You will be able to reduce your costs through an optimized Domino configuration and keep them low in the future.
These topics are covered:
- Reducing license costs by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how best to use it
- Tips for common problem areas, such as team mailboxes, functional/test users, etc.
- Real-world examples and best practices you can apply immediately
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack - shyamraj55
Discover the seamless integration of RPA (Robotic Process Automation), COMPOSER, and APM with AWS IDP enhanced with Slack notifications. Explore how these technologies converge to streamline workflows, optimize performance, and ensure secure access, all while leveraging the power of AWS IDP and real-time communication via Slack notifications.
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024Neo4j
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!SOFTTECHHUB
As the digital landscape continually evolves, operating systems play a critical role in shaping user experiences and productivity. The launch of Nitrux Linux 3.5.0 marks a significant milestone, offering a robust alternative to traditional systems such as Windows 11. This article delves into the essence of Nitrux Linux 3.5.0, exploring its unique features, advantages, and how it stands as a compelling choice for both casual users and tech enthusiasts.
Infrastructure Challenges in Scaling RAG with Custom AI modelsZilliz
Building Retrieval-Augmented Generation (RAG) systems with open-source and custom AI models is a complex task. This talk explores the challenges in productionizing RAG systems, including retrieval performance, response synthesis, and evaluation. We’ll discuss how to leverage open-source models like text embeddings, language models, and custom fine-tuned models to enhance RAG performance. Additionally, we’ll cover how BentoML can help orchestrate and scale these AI components efficiently, ensuring seamless deployment and management of RAG systems in the cloud.
2. Copyright © 2014 ParisJug. Creative Commons 2.0 France licence – Attribution
- No Commercial Use - Share Alike under the Same Conditions
3. @doanduyhai
Who Am I ?!
Duy Hai DOAN
Cassandra technical advocate
• talks, meetups, confs
• open-source devs (Achilles, …)
• OSS Cassandra point of contact
☞ duy_hai.doan@datastax.com
☞ @doanduyhai
4. @doanduyhai
Datastax!
• Founded in April 2010
• We contribute a lot to Apache Cassandra™
• 400+ customers (25 of the Fortune 100), 200+ employees
• Headquarters in the San Francisco Bay Area
• EU headquarters in London, offices in France and Germany
• Datastax Enterprise = OSS Cassandra + extra features
12. @doanduyhai
Multi-DC usages!
Prod data copy for testing/benchmarking
[Diagram: an 8-node production cluster (n1–n8) shipping a data copy to a tiny 3-node test cluster]
• use LOCAL consistency on the production data center
• NEVER WRITE on the test cluster !!!
35. @doanduyhai
Consistency summary!
ONE Read + ONE Write
☞ available for read/write even with (N-1) replicas down
QUORUM Read + QUORUM Write
☞ available for read/write even with a minority of replicas down (1 for N=3)
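A minimal sketch (not driver code) of why the QUORUM + QUORUM combination above is strongly consistent: with replication factor N, a quorum is N // 2 + 1 replicas, so any write quorum and any read quorum must share at least one replica, and the read is guaranteed to touch a replica holding the latest write. With ONE + ONE no such overlap is guaranteed.

```python
# Sketch: quorum overlap with replication factor N (illustrative only).
from itertools import combinations

def quorum(n):
    """Quorum size for replication factor n."""
    return n // 2 + 1

def always_overlap(n, read_size, write_size):
    """True if every possible read set intersects every write set."""
    replicas = range(n)
    return all(set(w) & set(r)
               for w in combinations(replicas, write_size)
               for r in combinations(replicas, read_size))

# QUORUM + QUORUM: overlap guaranteed -> reads see the latest write
print(always_overlap(3, quorum(3), quorum(3)))   # True
# ONE + ONE: no overlap guarantee -> eventual consistency, max availability
print(always_overlap(3, 1, 1))                   # False
```

The trade-off in the summary above falls out directly: larger read/write sets give the overlap guarantee but tolerate fewer failed replicas.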
43. @doanduyhai
Last Write Win (LWW)!
jdoe
age
name
33 John DOE
INSERT INTO users(login, name, age) VALUES('jdoe', 'John DOE', 33);
#partition
44. @doanduyhai
Last Write Win (LWW)!
jdoe
age (t1) name (t1)
33 John DOE
INSERT INTO users(login, name, age) VALUES('jdoe', 'John DOE', 33);
auto-generated timestamp
45. @doanduyhai
Last Write Win (LWW)!
UPDATE users SET age = 34 WHERE login = 'jdoe';
jdoe
age (t1) name (t1)
33 John DOE
jdoe
age (t2)
34
SSTable1 SSTable2
46. @doanduyhai
Last Write Win (LWW)!
DELETE age FROM users WHERE login = 'jdoe';
jdoe
age (t3)
✕
tombstone
jdoe
age (t1) name (t1)
33 John DOE
jdoe
age (t2)
34
SSTable1 SSTable2 SSTable3
47. @doanduyhai
Last Write Win (LWW)!
SELECT age FROM users WHERE login = 'jdoe';
???
SSTable1 SSTable2 SSTable3
jdoe
age (t3)
✕
jdoe
age (t1) name (t1)
33 John DOE
jdoe
age (t2)
34
48. @doanduyhai
Last Write Win (LWW)!
SELECT age FROM users WHERE login = 'jdoe';
✓ ✕ ✕ (highest timestamp wins ☞ t3 tombstone, no value returned)
SSTable1 SSTable2 SSTable3
jdoe
age (t3)
✕
jdoe
age (t1) name (t1)
33 John DOE
jdoe
age (t2)
34
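The read reconciliation shown on the slides above can be sketched in a few lines (illustrative, not Cassandra's actual implementation): each SSTable may hold a cell for the same column, the cell with the highest timestamp wins, and if the winner is a tombstone the column has no value.

```python
# Sketch of Last Write Win reconciliation at read time.
TOMBSTONE = object()   # marker for a deleted cell

def reconcile(cells):
    """cells: (timestamp, value) pairs gathered from the SSTables."""
    timestamp, value = max(cells, key=lambda cell: cell[0])
    return None if value is TOMBSTONE else value

# The 'age' cells from the three SSTables on the slides:
cells = [(1, 33), (2, 34), (3, TOMBSTONE)]
print(reconcile(cells))        # None: the t3 tombstone wins
print(reconcile(cells[:2]))    # 34: without the DELETE, t2 wins
```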
50. @doanduyhai
Historical data!
history
[Diagram: one row id with columns date1(t1) … date9(t9) in SSTable1 and date10(t10), date11(t11) … in SSTable2]
Want to keep data history ?
• do not rely on the internally generated timestamps !!!
• ☞ use time-series data modeling instead
51. @doanduyhai
CRUD operations!
INSERT INTO users(login, name, age) VALUES('jdoe', 'John DOE', 33);
UPDATE users SET age = 34 WHERE login = 'jdoe';
DELETE age FROM users WHERE login = 'jdoe';
SELECT age FROM users WHERE login = 'jdoe';
53. @doanduyhai
What about joins ?!
How can I join data between tables ?
How can I model 1 – N relationships ?
How to model a mailbox ?
[Diagram: User 1 — n Emails]
56. @doanduyhai
Queries!
Get message by user and message_id (date)
SELECT * FROM mailbox WHERE login = 'jdoe'
and message_id = '2014-09-25 16:00:00';
Get message by user and date interval
SELECT * FROM mailbox WHERE login = 'jdoe'
and message_id >= '2014-09-20 16:00:00'
and message_id <= '2014-09-25 16:00:00';
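The two queries above work because of the storage layout implied by a PRIMARY KEY of the form ((login), message_id): one partition per login, with messages kept sorted by the clustering column. A minimal in-memory sketch (illustrative only, assuming that mailbox schema):

```python
# Sketch: exact match on the partition key locates one partition, then
# range predicates slice the sorted clustering columns within it.
import bisect

class Mailbox:
    def __init__(self):
        self.partitions = {}   # login -> (sorted message_ids, {id: body})

    def insert(self, login, message_id, body):
        ids, rows = self.partitions.setdefault(login, ([], {}))
        bisect.insort(ids, message_id)   # keep clustering order on write
        rows[message_id] = body

    def select_range(self, login, lo, hi):
        """All messages for one login with lo <= message_id <= hi."""
        ids, rows = self.partitions.get(login, ([], {}))
        i, j = bisect.bisect_left(ids, lo), bisect.bisect_right(ids, hi)
        return [rows[m] for m in ids[i:j]]

mb = Mailbox()
mb.insert('jdoe', '2014-09-21 10:00:00', 'hello')
mb.insert('jdoe', '2014-09-25 16:00:00', 'world')
print(mb.select_range('jdoe', '2014-09-20 16:00:00', '2014-09-25 16:00:00'))
```

Because rows are sorted on disk in clustering order, the date-interval query is a cheap contiguous slice of a single partition.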
57. @doanduyhai
Queries!
Get message by message_id only ?
SELECT * FROM mailbox WHERE message_id = '2014-09-25 16:00:00';
Get message by date interval only ?
SELECT * FROM mailbox WHERE
message_id >= '2014-09-20 16:00:00'
and message_id <= '2014-09-25 16:00:00';
❓
❓
58. @doanduyhai
Queries!
Get message by message_id only (#partition not provided)
SELECT * FROM mailbox WHERE message_id = '2014-09-25 16:00:00';
Get message by date interval only (#partition not provided)
SELECT * FROM mailbox WHERE
message_id >= '2014-09-20 16:00:00'
and message_id <= '2014-09-25 16:00:00';
62. @doanduyhai
Queries!
Get message by user range (range query on #partition)
SELECT * FROM mailbox WHERE login >= 'hsue' and login <= 'jdoe';
Get message by user pattern (non-exact match on #partition)
SELECT * FROM mailbox WHERE login like '%doe%';
63. @doanduyhai
WHERE clause restrictions!
All queries (INSERT/UPDATE/DELETE/SELECT) must provide the #partition
Only exact match (=) on the #partition; range queries (<, ≤, >, ≥) are not allowed
• ☞ they would require a full cluster scan
On clustering columns, only exact match (=) and range queries (<, ≤, >, ≥)
WHERE clause only possible
• on columns defined in the PRIMARY KEY
• on indexed columns
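The "no range on #partition" rule above follows from how partitions are placed: keys are distributed by hash (token), so alphabetically adjacent logins land on unrelated tokens and a key-range scan would have to visit the whole cluster. A tiny sketch (Cassandra uses Murmur3; md5 is used here purely for brevity):

```python
# Sketch: hash partitioning destroys key order, but keeps exact
# lookups cheap (one hash -> one token -> one replica set).
import hashlib

def token(partition_key):
    """Hash a partition key to its position on the token ring."""
    return int(hashlib.md5(partition_key.encode()).hexdigest(), 16)

logins = ['hsue', 'ismith', 'jdoe']
print(sorted(logins, key=token))   # token order, generally unrelated to alphabetical order

ring = {token(login): login for login in logins}
print(ring[token('jdoe')])         # exact match: straight to 'jdoe'
```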
65. @doanduyhai
WHERE clause restrictions!
What if I want to perform « arbitrary » WHERE clause ?
• search form scenario, dynamic search fields
DO NOT RE-INVENT THE WHEEL !
☞ Apache Solr (Lucene) integration (Datastax Enterprise)
☞ Same JVM, 1-cluster-2-products (Solr & Cassandra)
66. @doanduyhai
WHERE clause restrictions!
What if I want to perform « arbitrary » WHERE clause ?
• search form scenario, dynamic search fields
DO NOT RE-INVENT THE WHEEL !
☞ Apache Solr (Lucene) integration (Datastax Enterprise)
☞ Same JVM, 1-cluster-2-products (Solr & Cassandra)
SELECT * FROM users WHERE solr_query = 'age:[33 TO *] AND gender:male';
SELECT * FROM users WHERE solr_query = 'lastname:*schwei?er';
67. @doanduyhai
Collections & maps!
CREATE TABLE users (
login text,
name text,
age int,
friends set<text>,
hobbies list<text>,
languages map<int, text>,
…
PRIMARY KEY(login));
Keep the collection cardinality low (≈ 1000 elements max)
68. @doanduyhai
User Defined Type (UDT)!
CREATE TABLE users (
login text,
…
street_number int,
street_name text,
postcode int,
country text,
…
PRIMARY KEY(login));
Instead of
69. @doanduyhai
User Defined Type (UDT)!
CREATE TYPE address (
street_number int,
street_name text,
postcode int,
country text);
CREATE TABLE users (
login text,
…
location frozen <address>,
…
PRIMARY KEY(login));
71. @doanduyhai
UDT update!
UPDATE users SET location =
{
street_number: 125,
street_name: 'Congress Avenue',
postcode: 95054,
country: 'USA'
}
WHERE login = 'jdoe';
Can be nested ☞ store documents
• but no dynamic fields (or use map<text, blob>)
72. @doanduyhai
From SQL to CQL!
Normalized
[Diagram: User 1 — n Comment]
CREATE TABLE comments (
article_id uuid,
comment_id timeuuid,
author_login text, // typical join id
content text,
PRIMARY KEY((article_id), comment_id));
73. @doanduyhai
From SQL to CQL
1 SELECT
- 10 last comments
- 10 author_login
What to do with 10 author_login ???
[Diagram: User 1 — n Comment]
74. @doanduyhai
From SQL to CQL
1 SELECT
- 10 last comments
- 10 author_login
What to do with 10 author_login ???
10 extra SELECTs → the N+1 SELECT problem !
[Diagram: User 1 — n Comment]
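The N+1 SELECT problem above can be made concrete with a toy query counter (purely illustrative, the query functions are stand-ins): fetching the 10 last comments costs one query, then resolving each author_login costs one more query per comment.

```python
# Sketch: counting queries in the normalized model.
queries = 0

def select_last_comments(n):
    """Stand-in for: SELECT ... FROM comments LIMIT n"""
    global queries
    queries += 1
    return [{'author_login': f'user{i}'} for i in range(n)]

def select_user(login):
    """Stand-in for: SELECT ... FROM users WHERE login = ?"""
    global queries
    queries += 1
    return {'login': login}

comments = select_last_comments(10)
authors = [select_user(c['author_login']) for c in comments]
print(queries)   # 1 + 10 = 11 queries: the N+1 problem
```

The de-normalized table on the next slide collapses this back to a single SELECT by embedding the author UDT in each comment row.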
75. @doanduyhai
From SQL to CQL!
De-normalized
[Diagram: User 1 — n Comment]
CREATE TABLE comments (
article_id uuid,
comment_id timeuuid,
author frozen<person>, // person is UDT
content text,
PRIMARY KEY((article_id), comment_id));
76. @doanduyhai
Data modeling best practices!
Start by queries
• identify core functional read paths
• 1 read scenario ≈ 1 SELECT
77. @doanduyhai
Data modeling best practices!
Start by queries
• identify core functional read paths
• 1 read scenario ≈ 1 SELECT
Denormalize
• wisely, only duplicate necessary & immutable data
• functional/technical trade-off
79. @doanduyhai
Data modeling best practices!
John DOE, male
birthdate: 21/02/1981
subscribed since 03/06/2011
☉ San Mateo, CA
’’Impossible is not John DOE’’
Full detail read from
User table on click
81. @doanduyhai
Data modeling trade-off
2 strategies
• either accept to normalize some data (extra SELECT required)
• or de-normalize and update everywhere upon data mutation
82. @doanduyhai
Data modeling trade-off
2 strategies
• either accept to normalize some data (extra SELECT required)
• or de-normalize and update everywhere upon data mutation
But always keep those scenarios rare (5%-10% max), focus on the 90%
83. @doanduyhai
Data modeling trade-off
2 strategies
• either accept to normalize some data (extra SELECT required)
• or de-normalize and update everywhere upon data mutation
But always keep those scenarios rare (5%-10% max), focus on the 90%
Example: Twitter tweet deletion
85. @doanduyhai
Lightweight Transaction (LWT)!
What ? ☞ make operations linearizable
Why ? ☞ solves a class of race conditions in Cassandra that would
otherwise require an external lock manager
86. @doanduyhai
Lightweight Transaction (LWT)!
Check-then-insert race between two clients:
Client 1: SELECT * FROM account WHERE id = 'jdoe'; → (0 rows)
Client 2: SELECT * FROM account WHERE id = 'jdoe'; → (0 rows)
Client 1: INSERT INTO account (id, email) VALUES ('jdoe', 'john_doe@fiction.com');
Client 2: INSERT INTO account (id, email) VALUES ('jdoe', 'jdoe@fiction.com'); ☜ winner (last write wins)
87. @doanduyhai
Lightweight Transaction (LWT)!
How ? ☞ by implementing the Paxos protocol in Cassandra
Syntax ?
INSERT INTO account (id, email) VALUES ('jdoe', 'john_doe@fiction.com')
IF NOT EXISTS;
UPDATE account SET email = 'jdoe@fiction.com'
WHERE id = 'jdoe' IF email = 'john_doe@fiction.com';
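The semantics LWT buys can be sketched as an atomic compare-and-set (illustrative only; real LWT runs Paxos across replicas, while this toy version just models the observable behaviour): with INSERT … IF NOT EXISTS, only one of the two racing clients above can claim the 'jdoe' account, and the loser learns the existing value.

```python
# Sketch: INSERT ... IF NOT EXISTS as an atomic compare-and-set.
def insert_if_not_exists(accounts, account_id, email):
    """Returns (applied, current_email), mirroring the LWT result row."""
    if account_id in accounts:
        return False, accounts[account_id]['email']   # not applied
    accounts[account_id] = {'email': email}
    return True, email                                # applied

accounts = {}
print(insert_if_not_exists(accounts, 'jdoe', 'john_doe@fiction.com'))
# (True, 'john_doe@fiction.com')  -> first writer wins
print(insert_if_not_exists(accounts, 'jdoe', 'jdoe@fiction.com'))
# (False, 'john_doe@fiction.com') -> second writer is rejected
```

Unlike plain last-write-wins, the second INSERT is rejected instead of silently overwriting the first.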