NoSQL databases


Harri Kauhanen

NoSQL database paradigms, how to query NoSQL, about relationships, comparison with RDBMS


Harri Kauhanen
2010-03-09

NoSQL databases

Database paradigms

• Relational (RDBMS)
• NoSQL
• Key-value stores

• Document databases

• Wide column stores (BigTable and clones)

• Graph databases

• Others

Relational databases

• ACID (Atomicity, Consistency, Isolation and
Durability)
• SQL
• MySQL, PostgreSQL, Oracle, ...

dog
  id: integer
  name: varchar
  mood: varchar
  birthdate: date
  color: enum

bark
  id: integer
  dog_id: integer
  bark: text

comment
  id: integer
  bark_id: integer
  dog_id: integer
  comment: text

Contents of the dog table:

id   name     mood     birth_date   color
12   Stella   Happy    2007-04-01   NULL
13   Wimma    Hungry   NULL         black
9    Ninja    NULL     NULL         NULL

Key-value stores
• “One key, one value, no duplicates, and
crazy fast.”
• It is a Hash!
• The value is a binary object, a.k.a. “blob”: the
DB does not understand it and does not
want to understand it.
• Amazon Dynamo, MemcacheDB, ...

“key”     “value”
dog_12    ...name_€#_Stella^^^mood_€#_Happy^^^birthdate%///135465645)...
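
The idea can be sketched in a few lines, assuming an in-memory Map stands in for the store. The class name KeyValueStore and the encode/decode helpers are hypothetical and not any product's API. The store returns exactly the bytes it was given and never looks inside them; serialization is the application's job.

```javascript
// Hypothetical in-memory key-value store: one key, one opaque value.
class KeyValueStore {
  constructor() {
    this.data = new Map();
  }

  // Store raw bytes under a key; a later put with the same key overwrites.
  put(key, bytes) {
    this.data.set(key, bytes);
  }

  // Return the stored bytes, or undefined when the key is unknown.
  get(key) {
    return this.data.get(key);
  }
}

// The application, not the store, decides how values are serialized.
const encode = (obj) => new TextEncoder().encode(JSON.stringify(obj));
const decode = (bytes) => JSON.parse(new TextDecoder().decode(bytes));

const store = new KeyValueStore();
store.put('dog_12', encode({ name: 'Stella', mood: 'Happy' }));

const stella = decode(store.get('dog_12'));
console.log(stella.name);
```

Note that there is no way to ask this store for "all happy dogs": the only access path is the key.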

Document databases

• Key-value store, but the value is (usually)
structured and “understood” by the DB.
• Querying data is possible (by other means
than just a key).
• Amazon SimpleDB, CouchDB, MongoDB,
Riak, ...

“key”     “document”
dog_12    { type: “Dog”, name: “Stella”, mood: “Happy”, birthdate: 2007-04-01 }

vs.

id   name     mood     birth_date   color
12   Stella   Happy    2007-04-01   NULL
13   Wimma    Hungry   NULL         black
9    Ninja    NULL     NULL         NULL

“key”: dog_12

{
  type: “Dog”,
  name: “Stella”,
  mood: “Happy”,
  birthdate: 2007-04-01,
  barks: [
    {
      bark: “I had to wear stupid..”,
      comments: [
        { dog_id: “dog_4”, comment: “You look so cute!” },
        { dog_id: “dog_14”, comment: “I hate it, too!” }
      ]
    }
  ]
}
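
Because the database understands the document's structure, it can answer queries on fields other than the key. A minimal sketch of that idea follows, assuming documents are plain objects held in a Map. The function findIds is hypothetical, and real document databases expose their own query mechanisms.

```javascript
// Hypothetical document collection keyed by id; documents are plain objects.
// Fields that a relational table would store as NULL are simply left out.
const docs = new Map([
  ['dog_12', { type: 'Dog', name: 'Stella', mood: 'Happy', birthdate: '2007-04-01' }],
  ['dog_13', { type: 'Dog', name: 'Wimma', mood: 'Hungry', color: 'black' }],
  ['dog_9', { type: 'Dog', name: 'Ninja' }],
]);

// Query by field values instead of by key: return the ids of all documents
// whose fields equal every entry in `criteria`. A missing field never matches.
function findIds(collection, criteria) {
  const ids = [];
  for (const [id, doc] of collection) {
    const matches = Object.entries(criteria).every(
      ([field, value]) => doc[field] === value
    );
    if (matches) ids.push(id);
  }
  return ids;
}

const happyDogs = findIds(docs, { type: 'Dog', mood: 'Happy' });
const blackDogs = findIds(docs, { color: 'black' });
console.log(happyDogs, blackDogs);
```

This sketch scans every document; a real database would normally use an index or a precomputed view for the same question.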

Wide column stores

• Often referred to as “BigTable clones”
• "a sparse, distributed multi-dimensional
sorted map"
• Google BigTable, Cassandra (Facebook),
HBase, ...

“row-id”   “column family”   “title”     “time”   “value”
dog_12     dog               birthdate   15       2007-04-01
           dog               mood        11       Angry
           dog               mood        45       Happy
           dog               name        25       Stella
           dog               color       34       Black
           bark              text        11       I had to wear...

(“column family” and “title” together form a “column”.)
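
The table above can be read as a map from (row-id, column family, title, time) to a value. The sketch below models that lookup over a plain array of cells. The function read is hypothetical, and its handling of an explicit time (return the newest version written at or before that time) is an assumption for illustration, not something the slides specify.

```javascript
// Hypothetical cell list: [rowId, columnFamily, title, time, value].
const cells = [
  ['dog_12', 'dog', 'birthdate', 15, '2007-04-01'],
  ['dog_12', 'dog', 'mood', 11, 'Angry'],
  ['dog_12', 'dog', 'mood', 45, 'Happy'],
  ['dog_12', 'dog', 'name', 25, 'Stella'],
  ['dog_12', 'dog', 'color', 34, 'Black'],
  ['dog_12', 'bark', 'text', 11, 'I had to wear...'],
];

// Look up row-id/column-family:title[/time].
// Without a time, the latest version wins. With a time, the newest version
// written at or before that time wins (assumed semantics).
function read(rowId, family, title, time = Infinity) {
  let best;
  for (const [r, f, t, ts, value] of cells) {
    if (r === rowId && f === family && t === title && ts <= time) {
      if (best === undefined || ts > best.time) best = { time: ts, value };
    }
  }
  return best === undefined ? undefined : best.value;
}

console.log(read('dog_12', 'dog', 'mood'));     // latest mood
console.log(read('dog_12', 'dog', 'mood', 20)); // mood as of time 20
```

A real wide column store keeps the cells sorted so that such lookups do not need a full scan; the linear loop here only keeps the sketch short.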

Graph databases

• “A relational database is a collection of loosely
connected tables” whereas “a graph
database is a multi-relational graph”.
• Neo4j, InfoGrid, ...

dog_12 --type-- Dog
dog_12 --name-- Stella
dog_12 --mood-- Happy
dog_12 --birth_date-- 2007-04-01
dog_12 --barks-- bark_59 (“I had to wear stupid...”)
comment_83 (“You look so Cute”) --comment_to-- bark_59
dog_4 --comments-- comment_83
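
To show what querying such a graph by traversal might look like, here is a small sketch that stores the example as a list of labeled edges. The edge directions and the helper names out, into and commentersOn are assumptions made for illustration. The original picture does not survive in this transcript, and this is not the API of Neo4j or any other product.

```javascript
// Hypothetical edge list: [from, label, to]. Edge directions are assumed.
const edges = [
  ['dog_12', 'type', 'Dog'],
  ['dog_12', 'name', 'Stella'],
  ['dog_12', 'mood', 'Happy'],
  ['dog_12', 'birth_date', '2007-04-01'],
  ['dog_12', 'barks', 'bark_59'],
  ['comment_83', 'comment_to', 'bark_59'],
  ['dog_4', 'comments', 'comment_83'],
];

// Follow edges with the given label forward from a node.
const out = (node, label) =>
  edges.filter(([from, l]) => from === node && l === label).map(([, , to]) => to);

// Follow edges with the given label backward into a node.
const into = (node, label) =>
  edges.filter(([, l, to]) => to === node && l === label).map(([from]) => from);

// Traversal: which dogs commented on any bark of the given dog?
function commentersOn(dog) {
  const result = new Set();
  for (const bark of out(dog, 'barks')) {
    for (const comment of into(bark, 'comment_to')) {
      for (const commenter of into(comment, 'comments')) result.add(commenter);
    }
  }
  return [...result];
}

console.log(commentersOn('dog_12'));
```

The query is expressed as a walk along relationships rather than as a join over tables, which is the sense in which relationships are part of the data model itself.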

• Relationships in RDBMS are “weak”.
• You may “define” one by using constraints,
documenting a relationship, writing code, using
naming conventions etc.

• Relationships in graph databases are first-class
citizens.
• There are no relationships in key-value
stores, document databases and wide
column stores.
• You may “define” one by using validations,
documenting a relationship, writing code, using
naming conventions etc.

• Relational databases have almost limitless
indexing, and a very strong language for
dynamic, cross-table queries (SQL)
• That’s why they handle all kinds of relationships
well and dynamically.

• NoSQL databases...
• ...might have limited support for dynamic queries
and indexing

• ...don’t support the JOIN-like operations of SQL

• ...but you can store some relationships in the
document itself
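
The two options can be sketched side by side: embedding a relationship in the document, or keeping a reference by key and resolving it in application code. The field names follow the dog/bark example; the helper withAuthor and the Map of dogs are hypothetical.

```javascript
// Option 1: embed the relationship. Comments live inside the bark document,
// so one read by key returns the bark together with its comments.
const barkDoc = {
  id: 'bark_59',
  dog_id: 'dog_12',
  bark: 'I had to wear stupid..',
  comments: [
    { dog_id: 'dog_4', comment: 'You look so cute!' },
    { dog_id: 'dog_14', comment: 'I hate it, too!' },
  ],
};

// Option 2: keep only a key (dog_id) and resolve it in application code,
// because the store itself offers no JOIN.
const dogs = new Map([['dog_12', { name: 'Stella' }]]);

// Hypothetical application-side "join": attach the author's name to a bark.
// An unknown dog_id yields null instead of failing.
function withAuthor(bark, dogsById) {
  const author = dogsById.get(bark.dog_id);
  return { ...bark, author: author ? author.name : null };
}

const joined = withAuthor(barkDoc, dogs);
console.log(joined.author, joined.comments.length);
```

Embedding makes reads cheap but lets documents grow; referencing keeps documents small at the cost of extra lookups and of consistency checks that the application has to do itself.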

How to query NoSQL?
• Key-Value
• Row-id/column-family:title[/time]
• “stella_12”/“dogs”:“name” → Stella

• Graph traversal
• API
• Query-language
• Integration to indexing and search engines
• Map-Reduce

Map-Reduce
• “MapReduce is a programming model and
an associated implementation for
processing and generating large data sets.”
• Often JavaScript (NoSQL implementations)

Map-function
function map(doc) {
  // Index only Dog documents: the key is the mood, the value the birthdate.
  if (doc['type'] === 'Dog') {
    emit(doc['mood'], doc['birthdate']);
  }
}

• Generates an “indexed view” of data/documents
• This view is just another hash, but both key
and value can be “anything”

Reduce-function
• Aggregate results for a “view” (after the
Map-function)

function reduce(mood, listOfBirthdates) {
  // averageBirthDate is a helper that is not shown on the slide.
  return averageBirthDate(listOfBirthdates);
}

• The Map-phase is easy to distribute, but it is also
easy to write poor Reduce-functions
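
To see the two functions work together, here is a single-process sketch of the whole pipeline. It is an illustration under assumptions: emit is passed in as a parameter instead of being a global supplied by the database, the driver mapReduce is hypothetical, the average is computed over millisecond timestamps, and every sample document except Stella's is made up.

```javascript
// Sample documents. Only Stella comes from the running example;
// the other rows are made up for illustration.
const documents = [
  { type: 'Dog', name: 'Stella', mood: 'Happy', birthdate: '2007-04-01' },
  { type: 'Dog', name: 'Dog A', mood: 'Happy', birthdate: '2007-04-03' },
  { type: 'Dog', name: 'Dog B', mood: 'Grumpy', birthdate: '2005-06-15' },
  { type: 'Bark', bark: 'I had to wear...' },
];

// Map step: emit (mood, birthdate) for every Dog document.
function map(doc, emit) {
  if (doc.type === 'Dog') emit(doc.mood, doc.birthdate);
}

// Reduce step: average ISO dates via their UTC millisecond timestamps.
function reduce(mood, listOfBirthdates) {
  const total = listOfBirthdates.reduce((sum, d) => sum + Date.parse(d), 0);
  return new Date(total / listOfBirthdates.length).toISOString().slice(0, 10);
}

// Tiny driver: group emitted values by key, then reduce each group.
function mapReduce(docs) {
  const groups = new Map();
  const emit = (key, value) => {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };
  docs.forEach((doc) => map(doc, emit));

  const view = {};
  for (const [key, values] of groups) view[key] = reduce(key, values);
  return view;
}

const view = mapReduce(documents);
console.log(view);
```

An average is also a good example of a Reduce-function that is easy to get wrong: if an engine combines partial results from several nodes, averaging the partial averages gives a wrong answer unless the counts are carried along.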

Theorems
• CAP
• Consistency,
Availability,
Partition tolerance
• “Pick two”
• N/R/W (adjusting CAP)
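
N/R/W is commonly read as in Dynamo-style stores: N replicas per item, R replicas that must answer a read, W replicas that must acknowledge a write. Under that reading, the trade-off can be sketched as simple arithmetic. The function quorumProfile and its field names are hypothetical.

```javascript
// N = replicas per item, R = replicas that must answer a read,
// W = replicas that must acknowledge a write.
function quorumProfile(n, r, w) {
  if (![n, r, w].every(Number.isInteger) || r < 1 || w < 1 || r > n || w > n) {
    throw new RangeError('expected integers with 1 <= R <= N and 1 <= W <= N');
  }
  return {
    // R + W > N: every read set shares at least one replica with every
    // write set, so a read reaches a replica that has the latest write.
    readsSeeLatestWrite: r + w > n,
    // How many replicas may be down while reads / writes still succeed.
    readFaultTolerance: n - r,
    writeFaultTolerance: n - w,
  };
}

console.log(quorumProfile(3, 2, 2)); // overlapping quorums
console.log(quorumProfile(3, 1, 1)); // fast, but reads may be stale
```

Lowering R and W favours availability and latency, while raising them until R + W > N favours consistency, which is one way to read "adjusting CAP".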

Why NoSQL?
• Schema-free
• Massive data stores
• Scalability
• Some services simpler to implement than
using RDBMS
• Great fit for many “Web 2.0” services

Why NOT NoSQL

• RDBMS databases and tools are mature
• NoSQL implementations often “alpha”
• Data consistency, transactions
• “Don’t scale until you need it”

RDBMS vs. NoSQL

• Strong consistency vs. Eventual consistency
• Big datasets vs. HUGE datasets
• Scaling is possible vs. Scaling is easy
• SQL vs. Map-Reduce
• Good availability vs. Very high availability

21. No consistency?

22. Eventual consistency


26. Questions?

Editor's Notes

Hello... Today, I’m going to talk about one of the latest buzzwords, so-called “NoSQL” databases. A little disclaimer: I personally have real-life experience with only one NoSQL database, CouchDB, which I use at my Haukut.fi service. So why did I want to talk about the subject? I wanted to learn this stuff myself: what alternatives there are to CouchDB, and whether I should perhaps pick another option for my next-generation Haukut.fi. Also, I strongly feel that relational databases, if not replaced, will at least get strong competition from these NoSQL solutions. This will not happen within all domains and services, but for simple consumer web services it eventually will.
Ok, let’s quickly review the database paradigms we have to choose from. We have relational databases, the ones you use here at Futurice every day in most of our projects. Then there is NoSQL. NoSQL is a relatively new term that is now getting hot and popular in the blogosphere and on Twitter. Like many buzzwords out there, NoSQL does not have a definite definition. It could refer to all those databases where you don’t have to use SQL. Some more friendly and wise people would put it more softly: “not only SQL”. Well, I am not even going to try to give a definition, but instead I will list what kinds of data stores are usually categorized under this umbrella term. They are: ... These I am going to talk about today, but then there are also others, such as object databases, and they are out of the scope of this presentation.
Let’s start with something familiar. The great promise of relational databases is that they are ACID. They all support a very strong query language called SQL. And these are some familiar examples of relational database implementations.
This could be a visual representation of a relational database and the relationships between the tables. We have dogs. A dog must have a name and it may have a mood, a birthdate and a color. We all know that dogs can really speak, and they speak by barking. Dogs may also comment on barks made by others. Quite simple. The content of the table ‘Dog’ could be something like this. We can see that Stella is a happy dog and was born on April Fools’ Day. Nothing special here, either. You know the stuff.
Ok, that was quick. So, let’s get down to business :) The simplest form of NoSQL is the key-value store. They are really simple; you could say “One key...” All in all, a key-value store is just a persistent hash. The value part can be anything. The key-value store does not care a bit about its content. Example implementations: Amazon Dynamo is a very important one, because those Amazon guys published some theoretical material, and other NoSQL databases are partly based on these writings.
If you want to visualize it, you can see that we have a string-based key, and that the value could be a serialized Java object, or anything at all.
Document databases. The basic idea is still quite simple. They are key-value stores, but the difference is that the value... Because of this, querying the data is possible to some extent. There might be a query language like SQL, but that is not actually very common. The examples of document databases are...
What is the difference between this picture and the one two slides earlier? The “value” is now called “document”. That’s it!? Well, the document here has a structure. The structure could be JSON, XML or anything. Having a structure means that we might do something with the data, not just return a binary blob. But having a structure does not mean it must have a predefined structure. With relational databases you have to define a schema before you can store data. If you compare this document here with our relational example, you can see that in the document we do not have a “color”. It could be there, but it does not need to be there. On the other hand, if we want to add a new attribute, we just add it. With a relational database we need to adjust the structure of the table, and that’s not always so pleasant. What about relationships? Well, if the relationship is strong enough, you could add it as a part of the document itself. You could do...
...this. Here, barks and comments are just a part of the dog document. I would not say that this would be a smart move, because the document might become huge. In the Haukut.fi service “dog” is one document, and “bark” is another. But comments made on a bark ARE part of the bark document.
The next category does not have a widely accepted name. Some would call them “wide column stores” and others might say they are “BigTable clones”. Whatever you call them, they ideologically reside somewhere between key-value stores and relational databases. There is no schema, but the data is still semi-structured. Like in relational databases, you could think that there are rows, but they can have any number of columns, and there is no need to store NULL values. Again, if you think of it in terms of relational databases, and talk about “rows” and “tables”, you might just get more confused. I still am a bit confused myself :-) This is one definition... It might not make you any wiser. Hopefully an example will help, but before that let’s see the example implementations. Cassandra is probably the most interesting one, because it is getting a lot of attention. And being the store under the hood of Facebook is quite a good reference, or what do you think :-)
Ok, here’s the example I promised. I could not figure out the best way to depict this, but let’s hope this works. As in the previous examples, we have a dog called Stella. Stella’s ID is dog_12 and Stella has a number of attributes such as birthdate, mood and name. Internally, we could store this information in a table structure, much like in a relational database. Now, if we want to add an attribute “color” to Stella, we can do it easily. Ok, let’s look at the terms at the top of the picture. “Row-id” is the “key” to the stored item. “Column family” and “title” together form a “column”. The difference between the two is that the “title” is dynamic and you can define new titles on the fly, whereas the column family is more or less fixed. It is a bit like a “table” in a relational database, and it is costly to change its name, for instance. The “time” is simply the version of an attribute. If we look at the attribute “mood” here, you will see that Stella has been angry from time 11 on, and she became happy at time 45. If you query an attribute without a time, you simply get the latest value. A record is not tied to a single column family. As in the document database example, I could say “barks are attributes of a dog”. Like this. Perhaps this explains why they are called “wide column stores”: a record can easily consist of a large number of attributes.
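The row-id / column family / title / time / value layout above can be approximated with nested dicts. This is a toy sketch in Python, not how BigTable, Cassandra or HBase are actually implemented: the class `WideColumnStore`, its methods, the family names `attributes` and `barks`, the title `bark_1`, and every timestamp except 11 and 45 are assumptions made up for the illustration.

```python
class WideColumnStore:
    """Toy store: row-id -> (column family, title) -> {time: value}."""

    def __init__(self):
        self._rows = {}

    def put(self, row_id, family, title, time, value):
        """Store one version of a column; new titles can appear on the fly."""
        columns = self._rows.setdefault(row_id, {})
        columns.setdefault((family, title), {})[time] = value

    def get(self, row_id, family, title, time=None):
        """Return the value as of `time`, or the latest value when no time is given."""
        versions = self._rows.get(row_id, {}).get((family, title), {})
        visible = [t for t in versions if time is None or t <= time]
        return versions[max(visible)] if visible else None

    def titles(self, row_id, family):
        """List the titles a row currently has within one column family."""
        return sorted(t for (f, t) in self._rows.get(row_id, {}) if f == family)


store = WideColumnStore()
store.put("dog_12", "attributes", "name", 1, "Stella")   # time 1 is invented
store.put("dog_12", "attributes", "mood", 11, "Angry")
store.put("dog_12", "attributes", "mood", 45, "Happy")
store.put("dog_12", "attributes", "color", 50, "brown")  # invented value and time
# A record is not tied to one family: barks live next to the attributes.
store.put("dog_12", "barks", "bark_1", 60, "I had to wear stupid..")
```

Asking for `store.get("dog_12", "attributes", "mood")` gives the latest mood, while passing `time=20` gives the mood that was valid at time 20. Missing columns simply do not exist, so nothing like NULL is stored.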
Then there are graph databases. Some would perhaps leave them out of the “NoSQL” family, but their authors themselves insist that they belong to the hype. I don’t really know too much about them. Neo4j seems to be the most popular one.
Here’s an example of what a graph database could look like. Again, there is our dog_12 with a name, mood, birthdate and so on. The relationships within graph databases are strong. What I mean to say is...
Relationships in... There is no REAL relationship between the tables, you may... In graph databases, however, the relationships are... It means that you can very efficiently do the calculations you might need in some social applications. I’m talking about friends and friends-of-friends here. What about the other NoSQL databases? You could say that... BUT just like in relational databases, you may... The only difference between these sentences is that here I talk about “constraints” and here about “validations”... and again, they are, in a sense, the same thing. Why, then, are relational databases so good at handling relationships? It is because they have...
...SQL. And great support for indexing! NoSQL databases, on the other hand, might... And they usually don’t... But, on the positive side, you... However, compared to the power of SQL, isn’t this quite disappointing?
Well, how do you query NoSQL then? Of course, you can access a document with a key you know. ...and with a BigTable you can also query per attribute, like this. With graph databases you do “graph traversals”. Usually there is an API which might support queries such as “give me all the dogs”. Some NoSQL solutions also provide an SQL-like query language. Then you can integrate with various search engines, and some NoSQL solutions provide such integration out of the box. ...but the most common way seems to be to use Map-Reduce functions.
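Before moving on to Map-Reduce, here is roughly what a “graph traversal” such as the friends-of-friends query could look like. It is a plain Python sketch over an adjacency dict, not the API of Neo4j or any other graph database, and the friendships themselves are invented for the example.

```python
# Hypothetical friendship graph: node id -> set of directly connected node ids.
friends = {
    "dog_12": {"dog_4", "dog_14"},
    "dog_4": {"dog_12", "dog_9"},
    "dog_14": {"dog_12", "dog_9", "dog_13"},
    "dog_9": {"dog_4", "dog_14"},
    "dog_13": {"dog_14"},
}


def friends_of_friends(graph, start):
    """Walk two steps out from `start`, excluding `start` and its direct friends."""
    direct = graph.get(start, set())
    two_steps = set()
    for friend in direct:
        two_steps |= graph.get(friend, set())
    return two_steps - direct - {start}
```

Because every node stores its own relationships, the traversal only touches the neighbours it actually visits instead of joining whole tables, which is the reason such queries can be cheap in a graph database.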
When Google popularized the term Map-Reduce, the first sentence of the publication was: “...” The big idea is that, without understanding parallel and distributed programming, users can write quite simple Map and Reduce functions to solve many kinds of problems. These functions can then be run on a big cluster of processes and computers. The Map and Reduce functions could be written in any language, but quite often NoSQL databases provide a JavaScript-based solution. The main point is not the language. Rather, you just need a means to process a record, and then a way to return the processed data back to the pipeline.
A very simple example of a Map function could be this. The parameter passed to the Map function is always a single document. The example here checks that the given document is a dog, and if it is, it returns the dog’s mood as the key and its birthdate as the value. Now, imagine that we have 10 million dogs in our database. It is very easy to distribute this function because, given a single document, you always get the same key/value pair back. It is a little bit like rendering an animated 3D movie, where you distribute the rendering of each individual frame to a number of workers, and the result is the frame number as the key and the bitmap as the value. From the NoSQL database’s point of view, you can use Map functions to generate views of your data. Internally the database caches the results so that the data is very fast to access. With this map function defined as a view, I could easily find the birth dates of happy dogs.
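A Map function of the kind described above might look like this. The slide most likely shows JavaScript; this is a Python rendering for illustration, and `map_dog`, `build_view` and the sample documents are assumptions of mine rather than any database’s real API.

```python
def map_dog(doc):
    """Emit (mood, birthdate) for dog documents and nothing for other documents."""
    if doc.get("type") == "Dog":
        yield doc.get("mood"), doc.get("birthdate")


def build_view(docs, map_fn):
    """Apply the map function to every document and group emitted values by key.

    Each document is processed independently, so in a real system this loop
    could be spread over many workers.
    """
    view = {}
    for doc in docs:
        for key, value in map_fn(doc):
            view.setdefault(key, []).append(value)
    return view


docs = [
    {"type": "Dog", "name": "Stella", "mood": "Happy", "birthdate": "2007-04-01"},
    {"type": "Dog", "name": "Wimma", "mood": "Hungry", "birthdate": None},
    {"type": "Bark", "dog_id": "dog_12", "bark": "I had to wear stupid.."},
]
view = build_view(docs, map_dog)
```

Looking up `view["Happy"]` then answers “birth dates of happy dogs” without scanning the documents again, which is the role the cached view plays inside the database.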
The Reduce function exists to aggregate results from a dataset. A common scenario would be to calculate a sum of values. Here’s an example of a quite stupid reduce function. You always pass two parameters to the reduce phase: a key and the values for that key. This reduce function would simply calculate the average age of happy dogs, angry dogs, bored dogs and so on. And as a footnote: you know it is easy to write poorly performing SQL, but you can just as easily write poor reduce functions.
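The average-age reduce step could be sketched as follows, again in Python rather than the JavaScript a database would typically expect. The function name, the extra `today` parameter (added only so the result is reproducible) and the sample birthdates are assumptions for the illustration.

```python
from datetime import date


def reduce_average_age(key, birthdates, today):
    """Return the average age in years for all birthdates emitted under one key.

    `key` (the mood) is not needed for the arithmetic; it is kept only to mirror
    the key/values signature of a reduce function.
    """
    ages = [(today - born).days / 365.25 for born in birthdates]
    return sum(ages) / len(ages)


# Hypothetical grouped output of the map phase: mood -> list of birthdates.
grouped = {
    "Happy": [date(2007, 4, 1), date(2005, 4, 1)],
    "Angry": [date(2009, 3, 9)],
}
today = date(2010, 3, 9)
average_age = {
    mood: reduce_average_age(mood, births, today) for mood, births in grouped.items()
}
```

One reason this counts as a poor reduce function: if a cluster reduces partial groups first and then reduces those partial results again, an average of averages is generally wrong. A more robust version would carry a sum and a count through the pipeline and divide only at the very end.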
If you want to get deeper into the subject, you should probably google this keyword: CAP. In short, the CAP theorem says that in a distributed data system you can only get two out of these three features. Consistency means that once you write something into the database, all readers of the system will get the same result. Availability means that the database will keep functioning even if a single component in the system fails. Partition tolerance is easiest to explain with an example. Imagine we have a distributed database with nodes all over Europe and the United States. Now, if the network connections between the continents suddenly disappear, we will have two partitions. If the system is still able to operate normally, allowing both reads and writes, and is able to recover once the connections between Europe and the USA work again, the system is called “partition tolerant”. If you think of a relational database, there is no such concept as partition tolerance at all. However, relational databases are very consistent and can be quite highly available. Most NoSQL databases focus on providing very high availability. Most of the systems are partition tolerant, but there are also systems that provide consistency over partition tolerance. Then, there are some databases that let you decide the level of CAP you want. If you are interested, you could try googling N/R/W.
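As a pointer for the N/R/W search: in Dynamo-style systems N is usually the number of replicas of an item, R the number of replicas that must answer a read, and W the number that must acknowledge a write. When R + W > N, every read set overlaps every write set, so a read sees the latest acknowledged write. The slides do not spell this out, so treat the Python sketch below, including the function name, as my own illustration of that rule.

```python
def read_sees_latest_write(n, r, w):
    """Return True when read and write quorums must overlap (R + W > N)."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("R and W must be between 1 and N")
    return r + w > n


# With three replicas, majority reads and writes overlap in at least one replica...
strict = read_sees_latest_write(n=3, r=2, w=2)
# ...while reading and writing a single replica is faster but only eventually consistent.
relaxed = read_sees_latest_write(n=3, r=1, w=1)
```

Lowering R or W trades consistency for latency and availability, which is one concrete way a database can let you decide the level of CAP you want.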
Ok, doesn’t it sound quite bad if a database is not consistent? If you write a comment on Facebook, some of your friends will see it immediately and others only after a couple of seconds? The database designers (and I) think that consistency is not always the top priority. What matters is that the data will be...
...eventually consistent. Eventual consistency means that the consistent state may be reached instantly, after a few seconds, or only after network connections have been restored. But it will eventually be reached. If you get better performance, better availability and better scalability, ACID-level consistency might not be that important after all. And of course, this depends on the domain and the application you are building. Eventual consistency might not be a good idea when you are moving money around or doing some other equally critical task.
Why should you consider a NoSQL solution for your next project? Being schema-free is a HUGE help with many problems. And it almost always makes the life of a developer much more pleasant. If you need to store a massive amount of data, or know you need to scale easily, NoSQL might be for you. I could also claim that some services are perhaps easier to implement with a NoSQL solution than with a relational database. And this is very true for many simple Web 2.0 services.
Then, why is NoSQL not always the best choice? First, relational databases have been around for many decades, and they are mature. The developer tools are mature. NoSQL solutions, on the other hand, are often new projects with great ideas and even greater promises, but only “alpha” quality. If you need strong data consistency, or if you need uncompromised support for transactions, use a relational database. And the last point: we often tend to think about scalability issues too early. It might not be a bad choice to write an application with the tools you know best, and when the time comes and you need to scale, then consider whether a NoSQL solution could solve some of your scalability issues.
This is my last slide. If you want to compare relational databases with NoSQL databases, these could be the main points. We talked about consistency. In the relational model, with ACID operations, it is very strong. Relational databases support rather big datasets, but some NoSQL solutions give you almost unlimited scalability in terms of data size. Scaling can mean many things, but you could safely say that NoSQL solutions usually scale much better than relational databases do. SQL is a wonderful query language, and NoSQL solutions often do not support any kind of query language. If they do, the languages are likely not as expressive as SQL is. On the other hand, Map-Reduce can solve some problems very efficiently. And in theory at least, high availability is also easier to achieve with many NoSQL solutions. And remember, NoSQL solutions are very different from each other. And this slide perhaps simplifies things too much. But it is still a very good slide to finish this presentation with. Thank you!

Editor's Notes