This document introduces the concepts of NoSQL databases. It explains that NoSQL favors scalability, flexibility, naturalness and distribution over the ACID properties. It covers the CAP theorem, the BASE properties, and the four main types of NoSQL data stores (columnar families, Bigtable systems, document databases and graph databases), with examples of each type.
Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer..., by Michael Rys
When processing TB and PB of data, running your Big Data queries at scale and having them perform at peak is essential. In this session, we show you some state-of-the art tools on how to analyze U-SQL job performances and we discuss in-depth best practices on designing your data layout both for files and tables and writing performing and scalable queries using U-SQL. You will learn how to analyze performance and scale bottlenecks and will learn several tips on how to make your big data processing scripts both faster and scale better.
Hands-On with U-SQL and Azure Data Lake Analytics (ADLA), by Jason L Brugger
U-SQL is the query language for big data analytics on the Azure Data Lake platform. This session will explore the unification of SQL and C# in this new query language, examples of combining data from external sources such as Azure SQL Database and Blob storage with Azure Data Lake store, creating and referencing assemblies, job submission and tools. The ADL platform will also be compared and contrasted to the HDInsight/Hadoop platform.
Data Warehousing in the Era of Big Data: Deep Dive into Amazon Redshift, by Amazon Web Services
by Tony Gibbs, Data Warehouse Specialist SA, AWS
Analyzing big data quickly and efficiently requires a data warehouse optimized to handle and scale for large datasets. Amazon Redshift is a fast, petabyte-scale data warehouse that makes it simple and cost-effective to analyze all of your data for a fraction of the cost of traditional data warehouses. In this session, we take an in-depth look at data warehousing with Amazon Redshift for big data analytics. We cover best practices to take advantage of Amazon Redshift's columnar technology and parallel processing capabilities to deliver high throughput and query performance. We also discuss how to design optimal schemas, load data efficiently, and use work load management.
Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data..., by Databricks
This session will cover a series of problems that are adequately solved with Apache Spark, as well as those that require additional technologies to implement correctly. Here's an example outline of some of the topics that will be covered in the talk: Problems that are perfectly solved with Apache Spark: 1) Analyzing a large set of data files. 2) Doing ETL of a large amount of data. 3) Applying Machine Learning & Data Science to a large dataset. 4) Connecting BI/Visualization tools to Apache Spark to analyze large datasets internally.
By Vida Ha at Spark Summit East 2016.
This 1-day course provides hands-on skills in ingesting, analyzing, transforming and visualizing data using AWS Athena and getting the best performance when using it at scale.
Audience:
This class is intended for data engineers, analysts and data scientists responsible for: analyzing and visualizing big data, implementing cloud-based big data solutions, deploying or migrating big data applications to the public cloud, implementing and maintaining large-scale data storage environments, and transforming/processing big data.
This presentation focuses on Apache Spark’s MLlib library for distributed ML, focusing on how we simplified elements of production-grade ML by building MLlib on top of Spark’s distributed DataFrame API.
NoSQL DBs are really popular in the Big Data landscape, but SQL semantics are taking revenge. Instead of learning many DSLs, developers prefer the well-known and universal SQL query language, so roughly all big data solutions are forced to support SQL semantics over their data models.
From Document to Graph DBs, from search to streaming platforms, all the ways to query Big data through SQL.
Introduction to Azure Data Lake and U-SQL presented at Seattle Scalability Meetup, January 2016. Demo code available at https://github.com/Azure/usql/tree/master/Examples/TweetAnalysis
Please sign up for the preview at http://www.azure.com/datalake. Install Visual Studio Community Edition and the Azure Data Lake Tools (http://aka.ms/adltoolvs) to use U-SQL locally for free.
Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platfor..., by Michael Rys
More and more customers who are looking to modernize analytics needs are exploring the data lake approach in Azure. Typically, they are most challenged by a bewildering array of poorly integrated technologies and a variety of data formats, data types not all of which are conveniently handled by existing ETL technologies. In this session, we’ll explore the basic shape of a modern ETL pipeline through the lens of Azure Data Lake. We will explore how this pipeline can scale from one to thousands of nodes at a moment’s notice to respond to business needs, how its extensibility model allows pipelines to simultaneously integrate procedural code written in .NET languages or even Python and R, how that same extensibility model allows pipelines to deal with a variety of formats such as CSV, XML, JSON, Images, or any enterprise-specific document format, and finally explore how the next generation of ETL scenarios are enabled though the integration of Intelligence in the data layer in the form of built-in Cognitive capabilities.
Common Strategies for Improving Performance on Your Delta Lakehouse, by Databricks
The Delta Architecture pattern has made the lives of data engineers much simpler, but what about improving query performance for data analysts? What are some common places to look at for tuning query performance? In this session we will cover some common techniques to apply to our delta tables to make them perform better for data analysts queries. We will look at a few examples of how you can analyze a query, and determine what to focus on to deliver better performance results.
This presentation introduces NoSQL databases: their types with examples, how they differ from the 40-year-old relational database management system, their usage, and why we should use them.
Presentation on NoSQL Database related RDBMS, by abdurrobsoyon
This presentation is about NoSQL, which means Not Only SQL. It covers the aspects of using NoSQL for Big Data and how it differs from RDBMS.
The document talks about the overview behind the need and drive for NoSQL databases. It also mentions about some of the most popular NoSQL databases in the market.
Conquering "big data": An introduction to shard query, by Justin Swanhart
This talk introduces Shard-Query, an MPP distributed parallel processing middleware solution for MySQL.
Shard-Query is a federation engine which provides a virtual "grid computing" layer on top of MySQL. This can be used to access data spread over many machines (sharded) and also data partitioned in MySQL tables using the MySQL partitioning option. This is similar to using partitions for parallelism with Oracle Parallel Query.
This talk focuses on why Shard-Query is needed, how it works (not detailed) and the best schema to use with it. Shard-Query is designed to scan massive amounts of data in parallel.
A NoSQL (often interpreted as Not Only SQL) database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases.
GIDS 2016 Understanding and Building NoSQLs, by techmaddy
Storage is a key part of any Big Data system. A few non-functional properties are expected from Big Data storage systems, such as reliability, horizontal scalability, high availability and fault tolerance. Supporting these properties, together with the change in data storage and access patterns in Big Data systems, led to a new class of storage: NoSQL. If there's one rule in design, it is that there will always be trade-offs. The CAP theorem defines the choices we can make with those trade-offs, and the ACID rules change to BASE in NoSQL systems.
This talk focuses on understanding NoSQL systems, the design decisions for designing NoSQL databases, a complete design example of a key-value database, and patterns of replication and sharding.
Has the emergence of NoSQL killed RDBMS and SQL? This slide deck discusses what NoSQL is and its history. It also briefly discusses polyglot persistence.
4. NoSQL
NoSQL is not “no-SQL”
It is not only SQL
It does not aim to provide the ACID properties
Originated as no-SQL though
Later changed since RDBMS is too powerful to always ignore
Aims to provide
Scalability
Flexibility
Naturalness
Distribution
Performance
Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: NoSQL 2013-14 2 / 12
7. CAP theorem
All of consistency (C), availability (A) and partition tolerance (P) cannot be satisfied simultaneously
CA: single-site; partitioning is not allowed
CP: whatever is available is consistent
AP: everything is available but may not be consistent
Originally stated by Brewer as a conjecture; later formalised and proven by Gilbert and Lynch
10. BASE properties
Basically Available: the system guarantees availability
Soft state: the state of the system is soft, i.e., it may change without any input in order to maintain consistency
Eventual consistency: data will eventually become consistent, provided there is no perturbation in the interim
Sacrifices consistency
Coined as a counterpoint to ACID
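The soft-state and eventual-consistency ideas can be sketched as follows — a toy simulation with hypothetical classes (not any real system), assuming a last-writer-wins merge on version numbers:

```python
# A minimal sketch of eventual consistency: writes land on one replica and
# are propagated lazily, so reads from other replicas may be stale ("soft
# state") until an anti-entropy sync runs.
class Replica:
    def __init__(self):
        self.data = {}          # key -> (value, version)

    def write(self, key, value, version):
        cur = self.data.get(key)
        if cur is None or version > cur[1]:  # keep the newest version only
            self.data[key] = (value, version)

    def read(self, key):
        entry = self.data.get(key)
        return entry[0] if entry else None

def sync(replicas):
    """Anti-entropy: merge every replica's state into every other one."""
    for src in replicas:
        for dst in replicas:
            for key, (value, version) in src.data.items():
                dst.write(key, value, version)

r1, r2 = Replica(), Replica()
r1.write("x", "new", version=2)      # the update reaches r1 only
print(r2.read("x"))                  # prints "None": r2 is still stale
sync([r1, r2])
print(r2.read("x"))                  # prints "new": replicas have converged
```

Once no further writes arrive, every sync round only moves replicas closer to the same state — which is exactly the "eventual consistency without interim perturbation" guarantee.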
11. Types
Four main types of NoSQL data stores:
1. Columnar families
2. Bigtable systems
3. Document databases
4. Graph databases
13. Columnar storage
Instead of rows being stored together, columns are stored consecutively
A single disk block (or a set of consecutive blocks) stores a single column family
A column family may consist of one or multiple columns
Such a set of columns is called a super column
Two main types
Columnar relational models
Key-value stores and/or big tables
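The column-family layout above can be sketched as nested maps — the row key, family names and values here are made up for illustration:

```python
# A minimal sketch of a column-family layout: each row key maps to column
# families, and a family holding several columns acts as a "super column".
store = {
    "user:42": {
        "profile": {                       # super column: several columns
            "name": "Alice",
            "city": "Kanpur",
        },
        "stats": {                         # single-column family
            "logins": 17,
        },
    },
}

# On disk, each family's columns would sit in consecutive blocks, so
# reading only "profile" never touches the blocks holding "stats".
profile = store["user:42"]["profile"]
print(profile["name"])                     # prints "Alice"
```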
18. Columnar relational models
Not NoSQL; actually an RDBMS
Column-wise storage on the disk
Allows faster querying when only a few columns of the entire data are touched
Allows compression of columns
Provides better memory caching
Joins are faster since they mostly operate on similar columns from the two tables
Not good for updates
Not good when many columns of a few tuples are accessed
Good for OLAP (online analytical processing)
Not good for OLTP (online transaction processing)
Example: MonetDB
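The OLAP advantage can be sketched by contrasting the two layouts on a made-up sales table — an aggregate over one column only needs that column's list in the columnar layout:

```python
# A minimal sketch contrasting row-wise and column-wise layouts for an
# OLAP-style aggregate (table and column names are illustrative only).
rows = [                                   # row store: one dict per tuple
    {"id": 1, "city": "Delhi",  "sales": 100},
    {"id": 2, "city": "Mumbai", "sales": 250},
    {"id": 3, "city": "Delhi",  "sales": 175},
]

columns = {                                # column store: one list per column
    "id":    [1, 2, 3],
    "city":  ["Delhi", "Mumbai", "Delhi"],
    "sales": [100, 250, 175],
}

# Row layout: every whole tuple is visited even though only "sales" matters.
total_rows = sum(r["sales"] for r in rows)

# Column layout: only the "sales" column is scanned; the other columns
# (and, on disk, their blocks) are never touched.
total_cols = sum(columns["sales"])

print(total_rows, total_cols)              # prints "525 525"
```

The same contrast explains the weaknesses: updating one tuple in the columnar layout means touching every column list, which is why updates and OLTP workloads suffer.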
24. Key-value stores
Two columns: a key and a value
The key is mostly text
The value can be anything and is simply an object
Essentially, the actual data becomes the “value” and a unique id is generated which becomes the “key”
The whole database is then just one big table with these two columns
Becomes schema-less
Can be distributed and is, thus, highly scalable
So, in essence, a big distributed hash table
All queries are on keys
Keys are necessarily indexed
Uses a memory cache (e.g., memcached) where the most-queried keys are kept in memory for faster access
Example: Cassandra, CouchDB, Tokyo Cabinet, Redis
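The "generated key, arbitrary value" idea can be sketched with a hypothetical API (not any real store's interface):

```python
import uuid

# A minimal sketch of a key-value store: the value can be any object, and
# a generated unique id serves as its key.
class KeyValueStore:
    def __init__(self):
        self.table = {}                # the one big key -> value "table"

    def put(self, value):
        key = uuid.uuid4().hex         # generated unique id becomes the key
        self.table[key] = value
        return key

    def get(self, key):
        return self.table.get(key)     # all queries go through the key

kv = KeyValueStore()
key = kv.put({"name": "Bob", "tags": ["a", "b"]})   # schema-less value
print(kv.get(key)["name"])                          # prints "Bob"
```

Since lookups only ever hash a key, the table can be partitioned across machines by key range or hash — hence the "big distributed hash table" characterisation.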
28. Bigtable systems
Started from Google’s BigTable implementation
Uses a key-value store
Data can be replicated for better availability
Uses a timestamp
The timestamp is used to
Expire data
Delete stale data
Resolve read-write conflicts
The same value can be indexed using multiple keys
Uses the map-reduce framework for computation
Example: BigTable, HBase, Cassandra, HyperTable, SimpleDB
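The role of timestamps can be sketched with a toy versioned-cell structure (hypothetical, not BigTable's actual implementation): each cell keeps multiple timestamped versions, reads return the latest one, and old versions can be expired.

```python
# A minimal sketch of Bigtable-style timestamped cells: each (row, column)
# holds versions keyed by timestamp, letting the store resolve write
# conflicts (latest wins) and expire stale data.
class TimestampedTable:
    def __init__(self):
        self.cells = {}                          # (row, col) -> {ts: value}

    def put(self, row, col, value, ts):
        self.cells.setdefault((row, col), {})[ts] = value

    def get(self, row, col):
        versions = self.cells.get((row, col), {})
        if not versions:
            return None
        return versions[max(versions)]           # latest timestamp wins

    def expire_before(self, cutoff_ts):
        for versions in self.cells.values():     # drop stale versions
            for ts in [t for t in versions if t < cutoff_ts]:
                del versions[ts]

t = TimestampedTable()
t.put("row1", "name", "old", ts=100)
t.put("row1", "name", "new", ts=200)             # conflicting write
print(t.get("row1", "name"))                     # prints "new"
t.expire_before(150)                             # the ts=100 version is removed
```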
32. Document databases
Use documents as the main storage format of data
Popular document formats are XML, JSON, BSON and YAML
The document itself is the key while its content is the value
A document can be indexed by its id or simply its location (e.g., URI)
The content needs to be parsed to make sense of it
The content can be organised further
Extremely useful for insert-once read-many scenarios
Can use the map-reduce framework for computation
Example: MongoDB, CouchDB
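The "indexed by id, parsed on read" pattern can be sketched with a hypothetical API over JSON content:

```python
import json

# A minimal sketch of a document store: documents are stored by id, and
# their JSON content must be parsed before it can be queried or organised
# further.
class DocumentStore:
    def __init__(self):
        self.docs = {}                     # id -> raw document text

    def insert(self, doc_id, raw_json):
        self.docs[doc_id] = raw_json       # indexed by id

    def find(self, doc_id):
        raw = self.docs.get(doc_id)
        return json.loads(raw) if raw else None   # parse to make sense of it

db = DocumentStore()
db.insert("post:1", '{"title": "NoSQL", "tags": ["db", "scale"]}')
doc = db.find("post:1")
print(doc["title"])                        # prints "NoSQL"
```

Because the raw text is written once and only parsed on reads, the layout naturally suits the insert-once read-many workloads mentioned above.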
35. Graph databases
Nodes represent entities
Edges encode relationships between nodes
Can be directed
Can have hyper-edges as well
Easier to find distances and neighbors
Example: Neo4J, HyperGraph, Infinite Graph, FlockDB
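Why neighbor and distance queries are easy can be sketched with an adjacency list and breadth-first search (the data here is made up, and real graph databases add indexes and edge properties on top):

```python
from collections import deque

# A minimal sketch of a graph database's core layout: an adjacency list,
# which makes neighbor and distance queries cheap.
graph = {                                  # node -> set of neighbors
    "alice": {"bob", "carol"},
    "bob":   {"alice", "dave"},
    "carol": {"alice"},
    "dave":  {"bob"},
}

def distance(graph, src, dst):
    """Breadth-first search: number of hops from src to dst, or -1."""
    seen, frontier = {src}, deque([(src, 0)])
    while frontier:
        node, hops = frontier.popleft()
        if node == dst:
            return hops
        for nbr in graph.get(node, ()):
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, hops + 1))
    return -1

print(graph["alice"])                      # direct neighbors of alice
print(distance(graph, "carol", "dave"))    # prints 3
```

In a relational schema the same query would need repeated self-joins on an edges table, one per hop, which is what makes the native graph layout attractive.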
38. Discussion
NoSQL, although it started as anti-SQL, is no longer so
It is more a realisation that in some cases RDBMSs do not scale or distribute well, and in other cases ACID guarantees are overkill
NoSQL is not good for every scenario
Not every good feature reported has been validated
Most legacy systems still use RDBMSs
The NoSQL horizon is shifting rapidly, with little control or standardisation
However, the trend favours NoSQL, since cloud computing and big data rely on it