This document provides an overview and introduction to NoSQL databases. It discusses key-value stores like Dynamo and BigTable, which are distributed, scalable databases that sacrifice complex queries for availability and performance. It also explains column-oriented databases like Cassandra that scale to massive workloads. The document compares the CAP theorem and consistency models of these databases and provides examples of their architectures, data models, and operations.
3. Contents
Part I
Introduction
NoSQL Databases Mar 2010 #3
4. Why NoSQL?
• Does NoSQL mean “No SQL” or “Not Only SQL”?
• ACID doesn’t scale well
• Web apps have different needs (than the apps that RDBMS were designed for)
– Low and predictable response time (latency)
– Scalability & elasticity (at low cost!)
– High availability
– Flexible schemas / semi-structured data
– Geographic distribution (multiple datacenters)
• Web apps can (usually) do without
– Transactions / strong consistency / integrity
– Complex queries
5. Some NoSQL use cases
1. Massive data volumes
– Massively distributed architecture required to store the data
– Google, Amazon, Yahoo, Facebook – 10-100K servers
2. Extreme query workload
– Impossible to efficiently do joins at that scale with an RDBMS
3. Schema evolution
– Schema flexibility (migration) is not trivial at large scale
– Schema changes can be gradually introduced with NoSQL
6. NoSQL pros/cons
• Advantages
– Massive scalability
– High availability
– Lower cost (than competitive solutions at that scale)
– (usually) predictable elasticity
– Schema flexibility, sparse & semi-structured data
• Disadvantages
– Limited query capabilities (so far)
– Eventual consistency is not intuitive to program for
• Makes client applications more complicated
– No standardization
• Portability might be an issue
– Insufficient access control
7. CAP theorem
• Requirements to distributed systems
– Consistency – the system is in a consistent state after an operation
• All clients see the same data
• Strong consistency (ACID) vs. eventual consistency (BASE)
– Availability – the system is “always on”, no downtime
• Node failure tolerance – all clients can find some available replica
• software/hardware upgrade tolerance
– Partition tolerance – the system continues to function even when split into disconnected subsets (by a network disruption)
• Not only for reads, but writes as well!
• CAP Theorem (E. Brewer, N. Lynch)
– You can satisfy at most 2 out of the 3 requirements
8. CAP (2)
• CA
– Single site clusters (easier to ensure all nodes are always in contact)
– e.g. 2PC
– When a partition occurs, the system blocks
• CP
– Some data may be inaccessible (availability sacrificed), but the rest is
still consistent/accurate
– e.g. sharded database
• AP
– System is still available under partitioning, but some of the data returned may be inaccurate
– e.g. DNS, caches, Master/Slave replication
– Need some conflict resolution strategy
9. … and also CAE
• CAE trade-off (Amazon?)
– Cost-efficiency
– High Availability
– Elasticity
• Pick any two (Cost, Availability, Elasticity); the third is what you give up
– Give up Cost Efficiency
• By over-provisioning you get HA and E
– Give up High Availability
• Just make the client wait when the system is overloaded (CE and E)
– Give up Elasticity
• If you can predict your workload, you can provide HA + CE by booking the resources in advance
• Everyone wants HA, so the challenge is providing E in a CE
way
10. NoSQL taxonomy
• Key-Value stores
– Simple K/V lookups (DHT)
• Column stores
– Each key is associated with many attributes (columns)
– NoSQL column stores are actually hybrid row/column stores
• Different from “pure” relational column stores!
• Document stores
– Store semi-structured documents (JSON)
– Map/Reduce based materialisation, sorting, aggregation, etc.
• Graph databases
– Not exactly NoSQL…
• can’t satisfy the requirements for High Availability and
Scalability/Elasticity very well
11. Contents
Part II
Key-Value stores
PNUTS, Dynamo, Voldemort
12. PNUTS
• Yahoo, ~2008, part of the Y! Sherpa platform
• Requirements
– Scale out / elasticity
– Geo-replication
– High availability / Fault tolerance
– Relaxed consistency
• Simplified relational data model
– Flexible schema
– No Referential Integrity
– No joins, aggregation, etc
– Updates/deletes only by Primary Key
– “Multiget” (retrieve many records by PKs)
13. PNUTS – consistency model
• Timeline consistency
– In-between serializability and eventual consistency
• All replicas apply updates for a record in the same order
• but not necessarily at the same time (eventually)
– One replica is always chosen as a Master (per record!)
• adaptively changed for load re-balancing (usage pattern tracking!)
• updates go only to the Master, then propagate asynchronously to the other replicas; reads go to replicas or the Master
– Read consistency levels
• Read any – may return a stale version of a record (any replica, fast)
• Read critical (V) – return a newer (or the same) record version than V
• Read latest – most recent version (goes to Master, slow)
– Write consistency levels
• Write
• Test-and-set (V) – write only if the record version is the same as V
(ensures serialization)
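The consistency levels above can be sketched for a single record as follows. This is a minimal illustration in Python, not the PNUTS API: the class TimelineRecord, its method names, and the use of a simple in-memory version log are all assumptions made for the example. The master appends versions to a log, and each replica applies that log in order but possibly later, which is the essence of timeline consistency.

```python
class TimelineRecord:
    """One record: a master-side version log plus lagging replicas (hypothetical model)."""

    def __init__(self, replica_count):
        self.log = [None]                    # log[v] is the value at version v; version 0 is empty
        self.applied = [0] * replica_count   # highest version each replica has applied so far

    def write(self, value):
        """Plain write: the master appends a new version and returns its number."""
        self.log.append(value)
        return len(self.log) - 1

    def test_and_set(self, expected_version, value):
        """Write only if the current version equals expected_version, else return None."""
        if expected_version != len(self.log) - 1:
            return None
        return self.write(value)

    def propagate(self, replica, steps=1):
        """Asynchronous replication: a replica applies the next updates in log order."""
        latest = len(self.log) - 1
        self.applied[replica] = min(latest, self.applied[replica] + steps)

    def read_any(self, replica):
        """Fast read from a replica; the result may be stale."""
        version = self.applied[replica]
        return version, self.log[version]

    def read_critical(self, replica, min_version):
        """Return a version at least as new as min_version, falling back to the master."""
        version = self.applied[replica]
        if version >= min_version:
            return version, self.log[version]
        return self.read_latest()

    def read_latest(self):
        """Slow read served by the master: always the most recent version."""
        version = len(self.log) - 1
        return version, self.log[version]
```

A client that has just written version V can pass V to read_critical to be sure it never reads older data than its own write, without paying the cost of read_latest on every request.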
15. PNUTS - architecture
• Regions
– (usually) geographically distributed replicas (20+ datacenters)
– Pub/Sub asynchronous replication (Tribble)
• Tablets
– Horizontal partitioning of the tables (keyspace)
– One copy of a Tablet within each Region
• Components
– Routers
• Cached copy of the interval mapping for tablets
– Storage Units (SU)
• Filesystem or MySQL
– Tablet Controllers (TC)
• Maintains interval mapping, tablet split, load re-balancing, etc
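The interval mapping that routers cache can be pictured as a sorted list of tablet boundaries searched with binary search. The sketch below is an assumption-laden illustration, not PNUTS code: TabletMap, locate, split and the storage unit names are hypothetical, and keys are plain strings compared lexicographically.

```python
import bisect


class TabletMap:
    """Maps key intervals (tablets) to storage units (hypothetical model)."""

    def __init__(self, boundaries, storage_units):
        # boundaries must be sorted; boundaries[i] is the smallest key of tablet i + 1,
        # so k boundaries describe k + 1 tablets.
        if len(storage_units) != len(boundaries) + 1:
            raise ValueError("need exactly one storage unit per tablet")
        self.boundaries = list(boundaries)
        self.storage_units = list(storage_units)

    def locate(self, key):
        """Return (tablet index, storage unit) for the tablet whose interval holds key."""
        tablet = bisect.bisect_right(self.boundaries, key)
        return tablet, self.storage_units[tablet]

    def split(self, boundary, new_unit):
        """Split the tablet containing boundary; keys >= boundary move to new_unit."""
        i = bisect.bisect_right(self.boundaries, boundary)
        self.boundaries.insert(i, boundary)
        self.storage_units.insert(i + 1, new_unit)
```

In this picture the tablet controller would own the authoritative map and perform the splits, while each router keeps a copy and refreshes it when a lookup lands on the wrong storage unit.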
17. PNUTS – architecture (3)
• Message Broker
– Pub/Sub replication
– Notification service (even for external apps)
– Reliable logging
– 1 broker per tablet, but subscriptions on the table level possible too
• “Mastership”
– Record master – updates, “read latest”
– Tablet master – inserts & deletes
• Query processing
– Replica – “read any”, “read critical”
– Master – updates, “read latest”
– Scatter/Gather engine – range queries, table scans, multi-record
updates
18. Dynamo
• Amazon, ~2007
• P2P key-value store
– Object versioning
– Consistent hashing
– Gossip – membership & failure detection
– Quorum reads
• Requirements & assumptions
– Simple query model (unique keys, blobs, no schema, no multi-access)
– Scale out (elasticity)
– Eventual consistency (improved availability)
– Decentralized & symmetric (P2P), heterogeneous (load distribution)
– Low latency / high throughput
– Internal service (no security model)
19. Dynamo - partitioning
• Consistent hashing
– “Ring” of nodes (random position assigned to nodes when joining)
– Node leaving/joining the ring affects only its 2 neighbours
– Virtual nodes (improves the load balancing) – node is assigned
several positions in the ring
• Replication
– N replicas (successor nodes on ring)
• Versioning
– All updates provide version context
– Eventual consistency
– Conflict reconciliation by clients (vector clocks)
– Vector clock size thresholds (truncation)
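The partitioning and replication bullets above can be sketched as a small hash ring. This is an illustration under stated assumptions rather than Dynamo's implementation: the names Ring, ring_position and preference_list are hypothetical, and ring positions are derived from an MD5 hash of a node label instead of being drawn at random, so that the example is reproducible.

```python
import bisect
import hashlib


def ring_position(label):
    """Map a string to a 64-bit position on the ring (assumption: MD5-derived)."""
    return int.from_bytes(hashlib.md5(label.encode("utf-8")).digest()[:8], "big")


class Ring:
    def __init__(self, vnodes_per_node=8):
        self.vnodes_per_node = vnodes_per_node
        self.positions = []   # sorted positions of all virtual nodes
        self.owner = {}       # position -> physical node name

    def add_node(self, node):
        """A joining node takes several positions on the ring (virtual nodes)."""
        for i in range(self.vnodes_per_node):
            pos = ring_position(f"{node}#{i}")
            bisect.insort(self.positions, pos)
            self.owner[pos] = node

    def remove_node(self, node):
        """A leaving node gives up all of its positions."""
        self.positions = [p for p in self.positions if self.owner[p] != node]
        self.owner = {p: n for p, n in self.owner.items() if n != node}

    def preference_list(self, key, n):
        """First n distinct physical nodes met when walking clockwise from the key."""
        if not self.positions:
            return []
        start = bisect.bisect_right(self.positions, ring_position(key))
        nodes = []
        for step in range(len(self.positions)):
            pos = self.positions[(start + step) % len(self.positions)]
            node = self.owner[pos]
            if node not in nodes:
                nodes.append(node)
                if len(nodes) == n:
                    break
        return nodes
```

The first entry of the preference list plays the role of the node responsible for the key, and the remaining entries are the successor replicas. Skipping repeated physical nodes keeps the N replicas on N different machines even though each machine appears on the ring several times.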
(C) Amazon
20. Dynamo – operations
• Read/Write requests
– Client sends a request to a random node, the node forwards it to the
proper node (1st replica responsible for that partition – coordinator)
– Coordinator will forward R/W requests to top N replicas
• Consistency
– Quorum protocol
• R, W – minimum number of nodes that must participate in a Read/ Write
operation (R+W > N)
– Writes
• Coordinator generates the version and the vector clock
• Coordinator sends requests to N replicas, if W replicas confirm then OK
– Reads
• Coordinator sends requests to N replicas, if R replicas respond then OK
• If different versions are returned then reconcile and write back the
reconciled version (read repair)
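The quorum rules above can be sketched with a toy in-memory store. Several simplifications are assumed and are not part of Dynamo: QuorumStore and its methods are hypothetical names, a single integer version stands in for the vector clock, replicas are plain dictionaries, and failures are simulated by listing replica indexes in the down set.

```python
class QuorumStore:
    """N in-memory replicas with R/W quorums (illustrative model only)."""

    def __init__(self, n, r, w):
        if not (1 <= r <= n and 1 <= w <= n):
            raise ValueError("R and W must be between 1 and N")
        if r + w <= n:
            raise ValueError("need R + W > N so read and write quorums overlap")
        self.n, self.r, self.w = n, r, w
        self.replicas = [dict() for _ in range(n)]   # key -> (version, value)
        self.down = set()                            # indexes of unreachable replicas

    def _live(self):
        return [i for i in range(self.n) if i not in self.down]

    def put(self, key, value):
        """Write to every reachable replica; succeed only if at least W can confirm."""
        live = self._live()
        if len(live) < self.w:
            return False
        # Simplification: an integer version replaces the vector clock.
        version = 1 + max(self.replicas[i].get(key, (0, None))[0] for i in live)
        for i in live:
            self.replicas[i][key] = (version, value)
        return True

    def get(self, key):
        """Read from reachable replicas; needs at least R answers, then repairs stale ones."""
        live = self._live()
        if len(live) < self.r:
            raise RuntimeError("read quorum not reached")
        answers = [self.replicas[i].get(key, (0, None)) for i in live]
        newest = max(answers, key=lambda answer: answer[0])
        for i in live:
            if self.replicas[i].get(key, (0, None))[0] < newest[0]:
                self.replicas[i][key] = newest       # read repair
        return newest[1]
```

Because R + W > N, every read quorum shares at least one replica with the latest successful write quorum, so the newest version is always among the answers. Lower R gives faster reads and lower W gives faster writes, as long as the sum stays above N.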
21. Dynamo – failure handling
• “Hinted handoff”
– Dynamo is “always writable”
– If a replica node NR is down, the coordinator sends the write data to a
new node NHH (with a hint that NR was the intended recipient)
– Hinted writes are not readable, i.e. read quorum may still fail!
– NHH will keep the data until NR is available again, then send the data
version to NR and delete the local copy
• “Anti-entropy” replica synchronisation
– What if a hinted replica NHH fails before it can sync with NR?
– Merkle trees
• Pros – minimum data transfer (only the root node hash compared when
in-sync), very fast location of out-of-sync keys (tree traversal)
• Cons – trees need to be re-calculated when nodes join/leave the ring
(keyspace partitioning changes)
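The Merkle tree comparison can be illustrated as follows. This is a simplified sketch under stated assumptions: keys are grouped into a fixed number of buckets (`LEAVES`), SHA-256 is used for hashing, and the tree is stored as a heap-style array. None of these details come from the slide.

```python
import hashlib

LEAVES = 8  # number of key buckets; a power of two (assumption of this sketch)


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def bucket_of(key: str) -> int:
    """Assign a key to one of the leaf buckets."""
    return _h(key.encode("utf-8"))[0] % LEAVES


def build_tree(store: dict) -> list:
    """Heap-style array: tree[1] is the root, leaves sit at tree[LEAVES:]."""
    buckets = [[] for _ in range(LEAVES)]
    for key in sorted(store):
        buckets[bucket_of(key)].append(f"{key}={store[key]}")
    tree = [b""] * (2 * LEAVES)
    for i, items in enumerate(buckets):
        tree[LEAVES + i] = _h("\n".join(items).encode("utf-8"))
    for i in range(LEAVES - 1, 0, -1):
        tree[i] = _h(tree[2 * i] + tree[2 * i + 1])
    return tree


def out_of_sync_buckets(a: list, b: list, node: int = 1) -> list:
    """Walk both trees from the root, descending only into differing subtrees."""
    if a[node] == b[node]:
        return []                    # identical subtree: nothing to transfer
    if node >= LEAVES:
        return [node - LEAVES]       # a leaf bucket that differs
    return (out_of_sync_buckets(a, b, 2 * node)
            + out_of_sync_buckets(a, b, 2 * node + 1))
```

When two replicas are in sync only the root hashes are compared. When one key differs, the traversal narrows down to the single bucket holding it.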
22. Dynamo – architecture
• Bootstrapping & node removal
– New node gets a position in the ring
– Several re-balancing strategies
• Partition split - the node responsible for that keyspace partition splits it
and sends half the data to the new node
• Fixed size partitions – one or more partitions are sent to the new node
– Similar process when a node leaves the ring
• Node components
– Request handling & coordination
– Membership & failure detection
– Local persistence engine
• MySQL
• BerkeleyDB
• In-memory buffer + persistence
23. Voldemort
• http://project-voldemort.com/
• Distributed key-value store
– Based on Dynamo
• Originally developed by LinkedIn, now open source
• Features
– Simple data model (no joins or complex queries, no RI, …)
– P2P
– Scale-out / elastic
• Consistent hashing of keyspace
• Fixed partitions (no splits, but owner may change when re-balancing)
– Eventual consistency / High Availability
– Replication
– Failure handling
24. Voldemort & Dynamo
• Similar to Dynamo
– Consistent hashing
– Versions & Vector clocks -> eventual consistency
– R/W quorum (R+W>N)
– Hinted handoff (failure handling)
– Pluggable persistence engines
• MySQL, BerkeleyDB, in-memory
25. Contents
Part III
Column stores
BigTable, HBase, Cassandra
26. BigTable
• Google, ~2006
• Sparse, distributed, persistent multidimensional sorted map
• (row, column, timestamp) dimensions, value is string
• Key features
– Hybrid row/column store
– Single master (stand-by replica)
– Versioning
– Compression
(C) Google
27. BigTable - data model
• Row
– Keys are arbitrary strings
– Data is sorted by row key
• Tablet
– Row range is dynamically partitioned into tablets (sequence of rows)
– Range scans are very efficient
– Row keys should be chosen to improve locality of data access
• Column, Column Family
– Column keys are arbitrary strings, unlimited number of columns
– Column keys can be grouped into families
– Data in a CF is stored and compressed together (Locality Groups)
– Access control on the CF level
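A toy model of the row-sorted layout described above, showing why range scans are cheap when row keys are chosen for locality. The class `SortedTable` and its methods are hypothetical names for this sketch. Tablets, compression and access control are left out.

```python
import bisect


class SortedTable:
    """Rows kept sorted by key; each row maps 'family:qualifier' -> value."""

    def __init__(self):
        self._keys = []   # sorted row keys
        self._rows = {}   # row key -> {column key: value}

    def put(self, row: str, column: str, value: str) -> None:
        if row not in self._rows:
            bisect.insort(self._keys, row)
            self._rows[row] = {}
        self._rows[row][column] = value

    def scan(self, start: str, end: str):
        """Yield (row key, columns) for start <= row key < end, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        for key in self._keys[lo:hi]:
            yield key, self._rows[key]
```

Because rows are sorted, keys sharing a prefix (for instance reversed host names such as `com.example.mail` and `com.example.www`) sit next to each other, and `scan("com.example.", "com.example/")` touches one contiguous slice.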
28. BigTable - data model (2)
• Timestamps
– Each cell has multiple versions
– Can be manually assigned
• Versioning
– Automated garbage collection
• Retain last N versions
• Retain versions newer than TS
• Architecture
– Data stored on GFS
– Relies on Chubby (distributed lock service, Paxos)
– 1 Master server
– thousands of Tablet servers
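The two garbage-collection rules for cell versions listed on this slide (retain last N, retain versions newer than TS) could look roughly like this. The function name and the list-of-tuples representation of a cell are assumptions of the sketch.

```python
def gc_versions(versions, keep_last=None, newer_than=None):
    """Apply version GC to one cell.

    versions is a list of (timestamp, value) pairs; the result is newest first.
    keep_last  - retain only the last N versions
    newer_than - retain only versions with a timestamp greater than TS
    """
    ordered = sorted(versions, key=lambda tv: tv[0], reverse=True)
    if newer_than is not None:
        ordered = [tv for tv in ordered if tv[0] > newer_than]
    if keep_last is not None:
        ordered = ordered[:keep_last]
    return ordered
```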
29. BigTable - architecture
• Master server
– Assign tablets to Tablet Servers
– Balance TS load
– Garbage collection
– Schema management
– Client data does not move through the MS (directly through TS)
– Tablet location not handled by MS
• Tablet server (many)
– thousands of tablets per TS
– Manages Read / Write / Split of its tablets
30. BigTable - workflow
• Tablet location
– B+ tree index
– root (Chubby file) - location of the Root Metadata tablet
– 1st level (Root Metadata tablet) – locations of all Metadata tablets
– 2nd level (Metadata tablet) – location of a set of tablets
• Each row is SSTable ID + end row ID
– Clients cache tablet locations
(C) Google
31. BigTable – workflow (2)
• Tablet access
– SSTable – persistent, ordered immutable map from keys to values
• Contains a sequence of blocks + block index
• Compression at block level
• 2 phase compression, 10:1 ratio (standard gzip up to 3:1)
– Updates
• 1st added to the redo log
• Then to the in-memory buffer (memtable)
• If memtable is full, then it is flushed into a new SSTable and emptied
– Reads
• Executed over the merged memtable and all SSTables (Bloom filters)
• Merged view is very efficient (data is already sorted)
• Bloom filters – probabilistic maps on whether a specific SSTable contains
a specific cell (row, column), i.e. disk lookups are drastically reduced
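The update and read paths above can be sketched as a tiny log-structured store. It is an illustration only: `TinyLSM` and `BloomFilter` are hypothetical classes, SSTables are plain in-memory dicts, and the Bloom filter sizing (1024 bits, 3 hashes) is arbitrary.

```python
import hashlib


class BloomFilter:
    def __init__(self, bits: int = 1024, hashes: int = 3):
        self.bits, self.hashes, self.field = bits, hashes, 0

    def _positions(self, key: str):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits

    def add(self, key: str) -> None:
        for p in self._positions(key):
            self.field |= 1 << p

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all((self.field >> p) & 1 for p in self._positions(key))


class TinyLSM:
    def __init__(self, memtable_limit: int = 2):
        self.log = []          # stand-in for the redo log
        self.memtable = {}     # in-memory buffer
        self.sstables = []     # oldest first; each entry is (sorted dict, bloom)
        self.limit = memtable_limit

    def put(self, key: str, value: str) -> None:
        self.log.append((key, value))     # 1. append to the redo log
        self.memtable[key] = value        # 2. then update the memtable
        if len(self.memtable) >= self.limit:
            self._flush()                 # 3. full memtable becomes an SSTable

    def _flush(self) -> None:
        bloom = BloomFilter()
        for key in self.memtable:
            bloom.add(key)
        self.sstables.append((dict(sorted(self.memtable.items())), bloom))
        self.memtable = {}

    def get(self, key: str):
        if key in self.memtable:
            return self.memtable[key]
        for table, bloom in reversed(self.sstables):   # newest SSTable first
            if bloom.might_contain(key) and key in table:
                return table[key]
        return None
```

Reads consult the memtable first, then SSTables from newest to oldest, skipping any SSTable whose Bloom filter rules the key out.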
33. BigTable – workflow (4)
• SSTable compactions
– Minor compaction
• Memtable flushed on disk into a new SSTable and emptied
• When: memtable is full
– Merging compaction
• Merge the memtable and several SSTables into 1+ new SSTable(s)
• Then, empty the memtable and delete the input SSTables
• New SSTable(s) can still contain deletion info + deleted data (no purging)
• When: number of SSTables limit reached
– Major compaction
• A merging compaction with exactly 1 output SSTable and all deleted data
purged
• When: regularly (background)
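The difference between a merging and a major compaction comes down to whether deletion markers survive the merge. A minimal sketch, assuming tables are dicts ordered oldest first and a `TOMBSTONE` sentinel marks a deleted cell (both are assumptions of the example). For simplicity the merge always produces a single output table.

```python
TOMBSTONE = object()   # marker for a deleted cell (assumption of this sketch)


def merge_tables(tables, purge_deleted: bool) -> dict:
    """Merge tables given oldest first; newer entries overwrite older ones."""
    merged = {}
    for table in tables:
        merged.update(table)
    if purge_deleted:
        merged = {k: v for k, v in merged.items() if v is not TOMBSTONE}
    return dict(sorted(merged.items()))


def merging_compaction(memtable: dict, sstables: list) -> dict:
    """Deletion markers are kept: older SSTables may still hold the data."""
    return merge_tables(sstables + [memtable], purge_deleted=False)


def major_compaction(memtable: dict, sstables: list) -> dict:
    """All inputs are merged into one table, so deleted data can be purged."""
    return merge_tables(sstables + [memtable], purge_deleted=True)
```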
34. BigTable - performance
• Benchmark with 1, 50, 250 and 500 Tablet Servers
(C) Google
35. BigTable, real world data volumes (ca. 2006)
(C) Google
36. HBase
• http://hadoop.apache.org/hbase
• Developed by Powerset, now Apache
• Based on BigTable
– HDFS (GFS), ZooKeeper (Chubby)
– Master Node (Master Server), Region Servers (Tablet Servers)
– HStore (tablet), memcache (memtable), MapFile (SSTable)
• Features
– Data is stored sorted (no real indexes)
– Automatic partitioning
– Automatic re-balancing / re-partitioning
– Fault tolerance (HDFS, 3 replicas)
38. HBase - operations
• Workflow
– HStore/tablet location (== BigTable)
• clients talk to ZooKeeper
• root metadata, metadata tables, client side caching, etc
– Writes (== BigTable)
• First added to the commit log, then to the memcache (memtable),
eventually flushed to MapFile (SSTable)
– Reads (== BigTable)
• First read from memcache, if info is incomplete then read the MapFiles
– Compactions (== BigTable)
• APIs
– Thrift
– REST
• GET | PUT | POST | DELETE <table>/<row>/<column>:<qualifier>/<timestamp>
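As a small illustration of the REST resource layout shown above, the helper below assembles a path of the form `<table>/<row>/<column>:<qualifier>/<timestamp>`. The function name `cell_path` is hypothetical, and the sketch only builds the string. It does not talk to a server or claim to cover the full HBase REST interface.

```python
from urllib.parse import quote


def cell_path(table, row, family, qualifier, timestamp=None):
    """Build '/<table>/<row>/<column>:<qualifier>[/<timestamp>]'.

    Each component is percent-encoded so that slashes or spaces inside a
    row key do not change the structure of the path.
    """
    parts = [
        quote(table, safe=""),
        quote(row, safe=""),
        f"{quote(family, safe='')}:{quote(qualifier, safe='')}",
    ]
    if timestamp is not None:
        parts.append(str(int(timestamp)))
    return "/" + "/".join(parts)
```

The same path would then be used with GET, PUT, POST or DELETE depending on the operation.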
39. HBase – MapReduce jobs
• Can be a source/sink for M/R jobs in Hadoop
(C) ???
40. HBase vs. BigTable
• http://www.larsgeorge.com/2009/11/hbase-vs-bigtable-comparison.html
Feature                   BigTable              HBase
Access control            +                     -
Master server             single                single/multiple*
Locality groups           +                     -
Cell cache                +                     -
Commit logs (WAL)         primary + secondary   single
Skip WAL for bulk loads   ?                     yes
Data isolation            +                     -
Replication               +                     in progress
Everything else pretty much the same!
41. Cassandra
• http://cassandra.apache.org/
• Developed by Facebook (inbox), now Apache
– Facebook now developing its own version again
• Based on Google BigTable (data model) and Amazon Dynamo
(partitioning & consistency)
• P2P
– Every node is aware of all other nodes in the cluster
• Design goals
– High availability
– Eventual consistency (improves HA)
– Incremental scalability / elasticity
– Optimistic replication
42. Cassandra – data model, partitioning
• Data model
– Same as BigTable
– Super Columns (nested Columns) and Super Column Families
– column order in a CF can be specified (name, time)
• Cluster membership
– Gossip – every node gossips to 1-3 other nodes about the state of
the cluster (merging incoming info with its own)
– Changes in the cluster (node in/out, failure) propagate quickly (O(log N) rounds)
– Probabilistic failure detection (sliding window, Exp(α) or Nor(μ, σ²))
• Dynamic partitioning
– Consistent hashing
– Ring of nodes
– Nodes can be “moved” on the ring for load balancing
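To get a feel for how quickly gossip spreads a cluster change, here is a small simulation. It is a sketch under simplifying assumptions: rounds are synchronous, every informed node contacts `fanout` random peers per round, and the function name and parameters are made up for the example.

```python
import random


def gossip_rounds(n_nodes: int, fanout: int = 2, seed: int = 0) -> int:
    """Number of rounds until a change known to node 0 reaches every node."""
    rng = random.Random(seed)          # seeded so that runs are repeatable
    informed = {0}
    rounds = 0
    while len(informed) < n_nodes:
        for _node in list(informed):
            # Pick `fanout` random peers (may include already informed ones).
            peers = rng.sample(range(n_nodes), fanout)
            informed.update(peers)
        rounds += 1
    return rounds
```

Since the informed set can at most triple per round with a fanout of 2, the round count grows roughly with the logarithm of the cluster size rather than linearly.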
43. Cassandra - operations
• Writes (== BigTable… almost)
– Client sends request to a random node, the node determines the
actual node responsible for storing the data (cluster awareness)
– Data replicated to N nodes (config)
– Data first added to the commit log, then to the memtable, eventually
flushed to SSTable
• Reads (== BigTable… almost)
– Send request to a random node, which forwards it to all the N nodes
having the data
• Single read – return first response / Quorum read – value that the
majority of nodes agrees on
– First read from memtable, if info is incomplete then read the
SSTables
– Bloom filters for efficient SSTable lookups
• Compactions (== BigTable)
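The two read modes described for Cassandra above differ only in how replica responses are combined. A minimal sketch, with hypothetical function names and responses given as a plain list in arrival order:

```python
from collections import Counter


def single_read(responses):
    """Single read: return whatever the first replica answered."""
    return responses[0]


def quorum_read(responses, n_replicas: int):
    """Quorum read: return the value a majority of the N replicas agree on.

    Returns None when no value reaches a majority.
    """
    value, votes = Counter(responses).most_common(1)[0]
    return value if votes > n_replicas // 2 else None
```

With responses `["v2", "v1", "v2"]` from three replicas, a single read may return the stale or the fresh value depending on arrival order, while the quorum read settles on `"v2"`.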
45. Contents
Part IV
Document stores
CouchDB
46. CouchDB
• http://couchdb.apache.org/ , ~2005
• Schema-free, document oriented database
– Documents stored in JSON format (XML in old versions)
– B-tree storage engine
– MVCC model, no locking
– no joins, no PK/FK (UUIDs are auto assigned)
– Implemented in Erlang
• 1st version in C++, 2nd in Erlang and 500 times more scalable (source:
“Erlang Programming” by Cesarini & Thompson)
– Replication (incremental)
• Documents
– UUID, version
– Old versions retained
{
"name": "Ontotext",
"url": "www.ontotext.com",
"employees": 40,
"products": ["OWLIM", "KIM", "LifeSKIM", "JOCI"]
}
47. CouchDB (2)
• REST API
CRUD     HTTP     params
Create   PUT      /db/docid
Read     GET      /db/docid
Update   PUT      /db/docid (with the current revision)
Delete   DELETE   /db/docid
• Views
– Filter, sort, “join”, aggregate, report
– Map/Reduce based, language independent (JavaScript by default)
– K/V pairs from Map/Reduce are also stored in the B-tree engine
– Built on demand
– Can be materialized & incrementally updated
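A rough Python analogue of a Map/Reduce view, to show the shape of the computation. CouchDB views are written in JavaScript by default and stored in its B-tree. The `run_view` helper, the second sample document ("Example Ltd") and the choice of `sum` as the reduce step are assumptions for illustration.

```python
from collections import defaultdict

# Sample documents; the second one is made up for the example.
docs = [
    {"name": "Ontotext", "products": ["OWLIM", "KIM", "LifeSKIM", "JOCI"]},
    {"name": "Example Ltd", "products": ["KIM"]},
]


def map_products(doc):
    """Map step: emit one (key, value) pair per product in the document."""
    for product in doc.get("products", []):
        yield product, 1


def run_view(documents, map_fn, reduce_fn) -> dict:
    """Group the emitted values by key, then reduce each group."""
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return {key: reduce_fn(values) for key, values in sorted(groups.items())}
```

`run_view(docs, map_products, sum)` counts how many documents list each product, with the keys returned in sorted order, similar to how a view is sorted by key.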
48. CouchDB – views
(C) Chris Anderson / Apache
49. Contents
Part V
Benchmarks
50. NoSQL feature matrix (incomplete)
(C) R. Ramakrishnan / Yahoo
51. CAP positioning
(Figure: CAP triangle with corners Consistency, Availability and Partition tolerance. Systems plotted: BigTable, HBase, Dynamo, PNUTS, Cassandra, Voldemort, CouchDB?)
52. Yahoo Cloud Serving Benchmark (YCSB)
• Benchmark various NoSQL systems
– Performance
• Measure latency/throughput curve on a fixed hardware
– Scale out (elasticity)
• add hardware, increase data size / workload proportionally
• Measure latency (best case: constant)
• Or… add hardware but same data/workload (latency should drop)
– Later – replication, availability, …
• Testbed
– 6 boxes {2x4 core 2.5GHz, 8GB RAM, 6 HDD (SAS RAID1+0), GB eth}
– 120M records (1K) ~ 20GB data per server
– 100+ client threads
53. YCSB – performance
• Update heavy workload (50/50 reads/updates)
• Reminders
– Cassandra (Dynamo) is “always writable”
– HBase buffers updates in memory
– PNUTS uses MySQL as a persistence engine
(C) B. Cooper et al. / Yahoo
54. YCSB – performance (2)
• Read heavy workload
• Reminders
– Cassandra (Dynamo) will read several replicas for read consistency
– HBase will access several files during a read (even with Bloom filters)
– PNUTS uses MySQL as a persistence engine
(C) B. Cooper et al. / Yahoo
55. YCSB – scalability
• Can more servers handle proportionally bigger workload?
– Best case: latency should be constant
• Notes
– PNUTS and Cassandra scale very well
– HBase performance varies a lot
(C) B. Cooper et al. / Yahoo
56. YCSB – elastic speedup
• What is the impact of adding one server (same workload)?
– Best case: latency should drop
• Notes
– 3 servers, add 1 more at 5th min (different latency scales!)
– Cassandra – performance improvement by 10% (quickly!)
– HBase – large latency spike initially, then 10% performance
improvement (within 20 min)
(C) B. Cooper et al. / Yahoo
57. Contents
Part VI
RDF & NoSQL?
58. RDF & NoSQL
• No easy match
– Partitioning by key not so suitable for RDF (S, P, O)
– Data locality insufficient for RDF materialisation
– Full featured query language (SPARQL) cannot be efficiently mapped
to NoSQL key/value lookups and range scans
– Eventual consistency is a problem (materialisation with stale data)
• So far RDF limited to the graph databases subset of NoSQL
– But graph databases don’t provide massive scalability
59. Useful links
• http://groups.google.com/group/nosql-discussion
• “Bigtable: A Distributed Storage System for Structured Data”
• “Cassandra - A Decentralized Structured Storage System”
• "Dynamo: Amazon’s Highly Available Key-value Store"
• “PNUTS: Yahoo!'s Hosted Data Serving Platform”
• “Benchmarking Cloud Serving Systems with YCSB”
• “HBase vs. BigTable Comparison”