This document discusses polyglot persistence and multi-cloud data management solutions. It begins by noting the huge amounts of data being generated and stored globally, such as the billions of pieces of content shared daily on social media platforms. It then discusses challenges in storing and accessing these massive datasets, which can range from the petabyte to exabyte scale. The document introduces the concept of polyglot persistence, where enterprises use a variety of data storage technologies suited to different types of data, rather than assuming a single relational database. It also discusses using NoSQL databases and deploying databases across multiple cloud platforms.
NoSQL databases get a lot of press coverage, but there is a lot of confusion surrounding them: in which situations do they work better than a relational database, and how do you choose one over another? This talk will give an overview of the NoSQL landscape and a classification of the different architectural categories, clarifying the base concepts and the terminology, and will provide a comparison of the features, strengths and drawbacks of the most popular projects (CouchDB, MongoDB, Riak, Redis, Membase, Neo4j, Cassandra, HBase, Hypertable).
Enterprise Architect's view of Couchbase 4.0 with N1QL (Keshav Murthy)
Enterprise architects have to decide on a database platform that will meet various requirements: performance and scalability on one side; ease of data modeling and agile development on the other; elasticity and flexibility to handle change easily; and a database platform that integrates well with tools and within the ecosystem. This presentation will highlight the challenges and approaches to a solution using Couchbase with N1QL.
The way we store and manage data is changing. In the old days, there were only a handful of file formats and databases. Now there are countless databases and numerous file formats. The methods by which we access the data have also increased in number. As R users, we often access and analyze data in highly inefficient ways. Big Data tech has solved some of those problems.
This presentation will take attendees on a quick tour of the various relevant Big Data technologies. I’ll explain how these technologies fit together to form a stack for various data analysis use cases. We’ll talk about what these technologies mean for the future of analyzing data with R.
Even if you work with “small data” this presentation will still be of interest because some Big Data tech has a small data use case.
The presentation begins with an overview of the growth of non-structured data and the benefits NoSQL products provide. It then provides an evaluation of the more popular NoSQL products on the market including MongoDB, Cassandra, Neo4J, and Redis. With NoSQL architectures becoming an increasingly appealing database management option for many organizations, this presentation will help you effectively evaluate the most popular NoSQL offerings and determine which one best meets your business needs.
The NoSQL movement has introduced four new database architectural patterns that complement, but not replace, traditional relational and analytical databases. This presentation will introduce these four patterns and discuss their relative strengths and weaknesses for solving a variety of business problems. These problems include Big Data (scalability), search, high availability and agility. For each type of problem we look at how NoSQL databases take different approaches to solving these problems and how you can use this knowledge to find the right database architecture for your business challenges.
What Should I Do? Choosing SQL, NoSQL or Both for Scalable Web Applications (Todd Hoff)
This is the slide deck I used for a webinar (http://voltdb.com/choosing-sql-nosql-or-both-scalable-web-apps-webinar) on helping people choose SQL or NoSQL for building scalable web applications. Hint: the answer is both.
Relational databases vs Non-relational databases (James Serra)
There is a lot of confusion about the place and purpose of the many recent non-relational database solutions ("NoSQL databases") compared to the relational database solutions that have been around for so many years. In this presentation I will first clarify what exactly these database solutions are, compare them, and discuss the best use cases for each. I'll discuss topics involving OLTP, scaling, data warehousing, polyglot persistence, and the CAP theorem. We will even touch on a new type of database solution called NewSQL. If you are building a new solution it is important to understand all your options so you take the right path to success.
Webinar: ROI on Big Data - RDBMS, NoSQL or Both? A Simple Guide for Knowing H... (DataStax)
Big data doesn't mean big money. In fact, choosing a NoSQL solution will almost certainly save your business money, in terms of hardware, licensing, and total cost of ownership. What's more, choosing the correct technology for your use case will almost certainly increase your top line as well.
Big words, right? We'll back them up with customer case studies and lots of details.
This webinar will give you the basics for growing your business in a profitable way. What's the use of growing your top line but outspending any gains on cumbersome, ineffective, outdated IT? We'll take you through the specific use cases and business models that are the best fit for NoSQL solutions.
By the way, no prior knowledge is required. If you don't even know what RDBMS or NoSQL stand for, you are in the right place. Get your questions answered, and get your business on the right track to meeting your customers' needs in today's data environment.
In the past few years, the term "data lake" has leaked into our lexicon. But what exactly IS a data lake? Some IT managers confuse data lakes with data warehouses. Some people think data lakes replace data warehouses. Both of these conclusions are false. There is room in your data architecture for both data lakes and data warehouses; they have different use cases, and those use cases can be complementary.
Todd Reichmuth, Solutions Engineer with Snowflake Computing, has spent the past 18 years in the world of Data Warehousing and Big Data, first at Netezza and then later at IBM Data, before making the jump to the cloud at Snowflake Computing earlier in 2018.
Mike Myer, Sales Director with Snowflake Computing, has spent the past 6 years in the world of Security and is looking to drive awareness of the better Data Warehousing and Big Data solutions available. He was previously at local tech companies FireMon and Lockpath, and decided to join Snowflake because of its disruptive technology, which is truly helping folks in the Big Data world on a day-to-day basis.
This video explains the problems that led to the emergence of this type of database, the kinds of projects in which it can be used, and gives an overview of its history, advantages and drawbacks.
https://youtu.be/I9zgrdCf0fY
Building High Performance MySQL Query Systems and Analytic Applications (Calpont)
This presentation describes how to build fast running MySQL applications that service read-based systems. It takes a special look at column databases and Calpont's InfiniDB
Building High Performance MySQL Query Systems And Analytic Applications (guest40cda0b)
This presentation gives practical advice and tips on how to build high-performance read intensive databases, and discusses innovations such as column-oriented databases
Deep Dive on ElasticSearch, a Meetup event held on 23rd May '15 at www.meetup.com/abctalks
Agenda:
1) Introduction to NOSQL
2) What is ElasticSearch and why is it required
3) ElasticSearch architecture
4) Installation of ElasticSearch
5) Hands on session on ElasticSearch
Best Practices For Building and Operating A Managed Data Lake - StampedeCon 2016 (StampedeCon)
This session will detail best practices for architecting, building, operating and managing an Analytics Data Lake platform. Key topics will include:
1) Defining next-generation Data Lake architectures. The de facto standard has been commodity DAS servers with HDFS, but there are now multiple solutions aimed at separating compute and storage, virtualizing or containerizing Hadoop applications, and utilizing Hadoop-compatible or embedded HDFS filesystems. This portion will explore the options available, and the pros and cons of each.
2) Data Ingest. There are many ways to load data into a Data Lake, including standardized Apache tools (Sqoop, Flume, Kafka, Storm, Spark, NiFi), standard file and object protocols (SFTP, NFS, REST, WebHDFS), and proprietary tools (e.g., Zaloni Bedrock, DataTorrent). This section will explore these options in the context of best fit to workflows; it will also look at key gaps and challenges, particularly in the areas of data formats and integration with metadata/cataloging tools.
3) Metadata & Cataloguing. One of the biggest inhibitors of successful Data Lake deployments is Data Governance, particularly in the areas of indexing, cataloguing and metadata management. It is nearly impossible to run analytics on top of a Data Lake and get meaningful & timely results without solving these problems. This portion will explore both emerging open standards (Apache Atlas, HCatalog) and proprietary tools (Cloudera Navigator, Zaloni Bedrock/Mica, Informatica Metadata Manager), and balance the pros, cons and gaps of each.
4) Security & Access Controls. Solving these challenges is key for adoption in regulatory-driven industries like Healthcare & Financial Services. There are multiple Apache projects and proprietary tools to address this, but the challenge is making security and access controls consistent across the entire application and infrastructure stack and over the data lifecycle, and being able to audit this in the face of legal challenges. This portion will explore available options and best practices.
5) Provisioning & Workflow Management. The real promise of the Data Lake is integrating Analytics workflows and tools on converged infrastructure, with shared data, and building "As A Service" oriented architectures aimed at self-service data exploration and Analytics for end users. This is an emerging and immature area, but this session will explore some potential concepts, tools and options to achieve this.
This will be a moderately technical session, with the above topics being illustrated by real world examples. Attendees should have basic familiarity with Hadoop and the associated Apache projects.
The document gives an overview of the need and drive for NoSQL databases. It also mentions some of the most popular NoSQL databases on the market.
With tremendous growth in big data, low latency and high throughput are the key asks for many big data applications. The in-memory technology market is growing rapidly. We see that traditional database vendors are extending their platforms to support in-memory capability, while others are offering in-memory data grid and NoSQL solutions for high performance and scalability. In this talk, we will share our point of view on In-Memory Data Grid and NoSQL technology. It is all about how to build an architecture that meets low latency and high throughput requirements. We will share our thoughts and experiences in implementing use cases that demand low latency and high throughput with inherent scale-out features.
You will learn how in-memory data grids and NoSQL are used to meet low latency and high throughput needs, and how to choose the in-memory technology that is a good fit for your use case.
This deck gives a basic overview of NoSQL technologies, implementation vendors/products, case studies, and some of the core implementation algorithms. The presentation also gives a quick overview of emerging trends such as "Polyglot Persistence" and "NewSQL".
The deck is targeted at beginners who want an overview of NoSQL databases.
SURVEY ON IMPLEMENTATION OF COLUMN ORIENTED NOSQL DATA STORES (BIGTABLE & CA...) (IJCERT JOURNAL)
NoSQL is a class of database that provides a mechanism for the storage and retrieval of data modeled for the huge amounts of data used in big data and cloud computing. NoSQL systems are also called "Not only SQL" to emphasize that they may support SQL-like query languages. A basic classification of NoSQL is based on the data model: column, document, key-value, etc. The objective of this paper is to study and compare the implementation of various column-oriented data stores such as Bigtable and Cassandra.
FROM GRADUATE SCHOOL TO PROFESSIONAL LIFE: PREPARING A LONG JOURNEY (Genoveva Vargas-Solar)
Long studies, and particularly the PhD, imply being ready to perform an exciting journey that leads to unexpected places. It is a personal experience that demands courage, patience and discipline, but above all it is a human experience that calls for intellectual ambition and humility. As with any important personal adventure, the journey must be thoroughly prepared beforehand according to one's own expectations, objectives and motivation. Of course, as with any plan, it will be adjusted as events come up and as the individual discovers new opportunities and weaves them into a personal life.
This talk points out, like a travel guide, practical information about things that should be considered during the PhD when preparing one's own strategy, so as to have access to the "best" opportunities when planning and starting a professional career.
Moving forward data centric sciences weaving AI, Big Data & HPC (Genoveva Vargas-Solar)
This novel and multidisciplinary data-centric scientific movement promises new, not yet imagined applications that rely on massive amounts of evolving data that need to be cleaned, integrated and analysed for modelling purposes. Yet data management issues are not usually perceived as central. In this keynote I will explore the key challenges and opportunities for data management in this new scientific world, and discuss how a data-centric artificial intelligence supported by high performance computing (HPC) can best contribute to these exciting domains. Even where the motivation is not academic, the huge sums being devoted to related applications are moving industry and academia to explore these directions.
Vargas polyglot-persistence-cloud-edbt
1. POLYGLOT PERSISTENCE AND MULTI-CLOUD DATA MANAGEMENT SOLUTIONS
GENOVEVA VARGAS-SOLAR
FRENCH COUNCIL OF SCIENTIFIC RESEARCH, LIG-LAFMIA, FRANCE
Genoveva.Vargas@imag.fr
http://www.vargas-solar.com
2. DATA ALL AROUND
30 billion pieces of content are shared on Facebook every month.
40 billion+ hours of video are watched on YouTube each month.
As of 2011, the global size of data in healthcare was estimated to be 150 Exabytes (161 billion Gigabytes).
40 million Tweets are sent per day by about 200 million monthly active users.
By 2014 it is anticipated there will be 400 million wearable wireless health monitors.
For Web 2.0 sites, where millions of users may both read and write data, scalability for simple database operations has become important.
Data collections are available through front ends managed by public/private organizations, accessible through the Web, e.g., Sloan Sky Server.
3. STORING AND ACCESSING HUGE AMOUNTS OF DATA
Scales: Peta (10^15), Exa (10^18), Zetta (10^21), Yotta (10^24)
Storage supports: RAID, Disk, Cloud
• Data formats
• Data collection sizes
• Data storage supports
• Data delivery mechanisms
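To make the storage scales above concrete, here is a small Python sketch (mine, not from the talk) that lists the prefixes as powers of ten and of two, and reproduces the healthcare figure from the previous slide: 150 Exabytes is roughly 161 billion Gigabytes when binary (1024-based) units are used.

```python
# Storage-scale prefixes from the slide, as powers of 10 (decimal) and 2 (binary).
DECIMAL = {"Peta": 10**15, "Exa": 10**18, "Zetta": 10**21, "Yotta": 10**24}
BINARY = {"Peta": 2**50, "Exa": 2**60, "Zetta": 2**70, "Yotta": 2**80}

def exabytes_to_gigabytes(eb, binary=True):
    """Convert Exabytes to Gigabytes (binary units by default)."""
    factor = 2**60 / 2**30 if binary else 10**18 / 10**9
    return eb * factor

# The healthcare estimate: 150 EB is about 161 billion GB in binary units.
print(f"{exabytes_to_gigabytes(150):.3e} GB")  # ~1.611e+11
```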
5. THIS TALK IS NOT ABOUT
http://nosql-database.org
The debate on whether NoSQL stores and relational systems are better or worse... that is not the point. Of course we can surf on these waves at the end of the talk and during the EDBT School!
6. THIS TALK IS ABOUT
An alternative for managing multiform and multimedia data collections according to different properties and requirements.
7. AGENDA
• Dealing with multiform and multimedia data collections: it’s all about management requirements
• Polyglot persistence general principle
• The NoSQL plethora
• Polyglot database
  • Design
  • Deployment
  • Querying, update and maintenance
• Conclusions and outlook
9. POLYGLOT PERSISTENCE
¡ Polyglot Programming: applications should be written in a mix of languages to take
advantage of different languages are suitable for tackling different problems
¡ Polyglot persistence: any decent sized enterprise will have a variety of different data
storage technologies for different kinds of data
¡ a new strategic enterprise application should no longer be built assuming a relational
persistence support
¡ the relational option might be the right one - but you should seriously look at other
alternatives
M. Fowler and P. Sadalage. NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence. Pearson Education, Limited, 2012
10. (Katsov-2012)
Use the right tool for the right job…
How do I know which is the
right tool for the right job?
11. SCALING DATABASE SYSTEMS
¡ A system is scalable if increasing its resources
(CPU, memory, disk) results in increased
performance proportionally to the added
resources
¡ Improving performance means serving more units of work, or handling larger units of work, e.g. when data sets grow
¡ Database systems have traditionally been scaled by buying bigger, faster and more expensive machines
¡ Vertically (SCALE UP)
¡ Add resources (CPU, memory) to a single node in a system
¡ Horizontally (SCALE OUT)
¡ Add more nodes to a system
12. NOSQL STORES CHARACTERISTICS
¡ Simple operations
¡ Key lookups, reads and writes of one record or a small number of records
¡ No complex queries or joins
¡ Ability to dynamically add new attributes to data records
¡ Horizontal scalability
¡ Distribute data and operations over many servers
¡ Replicate and distribute data over many servers
¡ No shared memory or disk
¡ High performance
¡ Efficient use of distributed indexes and RAM for data storage
¡ Weak consistency model
¡ Limited transactions
Next generation databases mostly addressing some of the points: being non-relational, distributed, open-source and horizontally scalable [http://nosql-database.org]
13. DEALING WITH HUGE AMOUNTS OF DATA
Scales: Peta (10^15), Exa (10^18), Zetta (10^21), Yotta (10^24); storage supports: RAID, Disk, Cloud
Properties: Concurrency, Consistency, Atomicity
Data models: Relational, Graph, Key-value, Columns
14. DATA MANAGEMENT SYSTEMS ARCHITECTURES
[Figure: the ANSI/SPARC three-level architecture (external, logic and physical models) and the classic DBMS components (storage manager, schema manager, query engine, transaction manager), contrasted with a customisable architecture: custom components and glue code composing data, access and storage services, plus extension services (streaming, XML, procedures, queries, replication)]
15. Data stores designed to scale simple OLTP-style application loads
• Data model
• Consistency
• Storage
• Durability
• Availability
• Query support
Read/write operations by thousands/millions of users
16. PROBLEM STATEMENT: HOW MUCH TO GIVE UP?
¡ CAP theorem¹: a distributed system can provide at most two of the three properties: Consistency, Availability, Partition tolerance
¡ NoSQL systems sacrifice consistency
¹ Eric Brewer, "Towards robust distributed systems." PODC, 2000
http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf
17. NOSQL STORES: AVAILABILITY AND PERFORMANCE
¡ Replication
¡ Copy data across multiple servers (each bit of data can be found in
multiple servers)
¡ Increase data availability
¡ Faster query evaluation
¡ Sharding
¡ Distribute different data across multiple servers
¡ Each server acts as the single source of a data subset
¡ Orthogonal techniques
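Since the two techniques are orthogonal, they compose naturally: a key is first mapped to exactly one shard (sharding), and each shard is then placed on several nodes (replication). A minimal sketch, where the node names, shard count and replica count are illustrative assumptions, not part of any real system:

```python
import hashlib

# Hypothetical cluster; the node names and counts are illustrative only.
NODES = ["node0", "node1", "node2", "node3", "node4", "node5"]
N_SHARDS = 3      # sharding: data is split into three disjoint subsets
N_REPLICAS = 2    # replication: each shard is copied onto two nodes

def shard_of(key: str) -> int:
    """Sharding: map a key to exactly one shard (distribution)."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % N_SHARDS

def replicas_of(shard: int) -> list[str]:
    """Replication: map a shard to several nodes (availability)."""
    return [NODES[(shard + i) % len(NODES)] for i in range(N_REPLICAS)]

def nodes_for(key: str) -> list[str]:
    # The composition of the two orthogonal techniques: every key lives
    # on N_REPLICAS nodes, so losing one node still leaves a copy.
    return replicas_of(shard_of(key))

print(nodes_for("contact:42"))
```

Each bit of data can thus be found on several servers, while each server remains responsible for only a subset of the data.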
18. REPLICATION: PROS & CONS
¡ Data is more available
¡ Failure of a site containing E does not result in
unavailability of E if replicas exist
¡ Performance
¡ Parallelism: queries processed in parallel on several
nodes
¡ Reduce data transfer for local data
¡ Increased updates cost
¡ Synchronisation: each replica must be updated
¡ Increased complexity of concurrency control
¡ Concurrent updates to distinct replicas may lead to
inconsistent data unless special concurrency control
mechanisms are implemented
19. SHARDING: WHY IS IT USEFUL?
¡ Scaling applications by reducing data sets in any single
databases
¡ Segregating data
¡ Sharing application data
¡ Securing sensitive data by isolating it
¡ Improve read and write performance
¡ Smaller amount of data in each user group implies faster querying
¡ By isolating data into smaller shards, accessed data is more likely to stay in cache
¡ More write bandwidth: writing can be done in parallel
¡ Smaller data sets are easier to backup, restore and manage
¡ Massively parallel work
¡ Parallel work: scale out across more nodes
¡ Parallel backend: handling higher user loads
¡ Share nothing: very few bottlenecks
¡ Decreased resilience, but improved availability
¡ If a box goes down the others still operate
¡ But: part of the data is missing
[Figure: a load balancer in front of Web 1-3 and Cache 1-3, backed by two MySQL masters holding a site database and a resume database]
20. SHARDING AND REPLICATION
¡ Sharding with no replication: unique copy, distributed data sets
¡ (+) Better concurrency levels (shards are accessed independently)
¡ (-) Cost of checking constraints, rebuilding aggregates
¡ Ensure that queries and updates are distributed across shards
¡ Replication of shards
¡ (+) Query performance (availability)
¡ (-) Cost of updating, of checking constraints, complexity of concurrency control
¡ Partial replication (most of the times)
¡ Only some shards are duplicated
21. NOSQL STORES: DATA MANAGEMENT PROPERTIES
¡ Indexing
¡ Distributed hashing, as in the memcached open-source cache
¡ In-memory indexes are scalable when distributing and
replicating objects over multiple nodes
¡ Partitioned tables
¡ High availability and scalability: eventual consistency
¡ Data fetched are not guaranteed to be up-to-date
¡ Updates are guaranteed to be propagated to all nodes
eventually
¡ Shared nothing horizontal scaling
¡ Replicating and partitioning data over many servers
¡ Support large number of simple read/write operations
per second (OLTP)
¡ No ACID guarantees
¡ Updates eventually propagated but limited guarantees
on reads consistency
¡ BASE: basically available; soft state, eventually consistent
¡ Multi-version concurrency control
22. COMPARING NOSQL & NEWSQL SYSTEMS
SYSTEM CONCURRENCY-CONTROL DATA-STORAGE REPLICATION TRANSACTIONS
Redis Locks RAM Asynchronous No
Scalaris Locks RAM Synchronous Local
Tokyo Locks RAM/Disk Asynchronous Local
Voldemort MVCC RAM/BDB Asynchronous No
Riak MVCC Plug in Asynchronous No
Membrain Locks Flash+Disk Synchronous Local
Membase Locks Disk Synchronous Local
Dynamo MVCC Plug in Asynchronous No
SimpleDB Non S3 Asynchronous No
MongoDB Locks Disk Asynchronous No
CouchDB MVCC Disk Asynchronous No
SYSTEM CONCURRENCY-CONTROL DATA-STORAGE REPLICATION TRANSACTIONS
Terrastore Locks RAM+ Synchronous L
Hbase Locks HADOOP Asynchronous L
HyperTable Locks Files Synchronous L
Cassandra MVCC Disk Asynchronous L
BigTable Locks+stamps GFS Both L
PNuts MVCC Disk Asynchronous L
MySQL-C ACID Disk Synchronous Y
VoltDB ACID/no Lock RAM Synchronous Y
Clustrix ACID/no Lock Disk Synchronous Y
ScaleDB ACID Disk Synchronous Y
ScaleBase ACID Disk Asynchronous Y
NimbusDB ACID/no Lock Disk Synchronous Y
Categories: Key-value, Document, Extended records, Relational
Cattell, Rick. "Scalable SQL and NoSQL data stores." ACM SIGMOD Record 39.4 (2011): 12-27
23. AGENDA
ü Dealing with multiform and multimedia data collections:
ü it’s all about management requirements
ü Polyglot persistence general principle
ü The NoSQL plethora
¡ Polyglot database
¡ Design
¡ Deployment
¡ Querying, update and maintenance
¡ Conclusions and outlook
25. OBJECTIVE
¡ Build a MyNet app based on a polyglot database for building an integrated directory
of my contacts including their status and posts from several social networks
[Figure: the MyNet App connected to the MyNet DB and to several social networks]
26.
• Analysis on contact networks, overlapping according to interests and post topics
• Top 10 most popular contacts
• User sessions in different social networks
• Integrating posts from all networks
• Contact graph traversal for building groups out of common characteristics
• Synchronizing posts to all SN friends networks
• User account activity in different social networks
• Directory synchronisation: integrating contacts' information from all SNs
27. NOSQL DESIGN AND CONSTRUCTION PROCESS
¡ Data resides in RAM (memcached) and is eventually replicated and stored
¡ Querying = designing the database according to the type of queries / the map-reduce model
¡ "On demand" data management: the database is virtually organized per view (external schema) on cache, and some views are made persistent
¡ An elastic, easy-to-evolve and explicitly configurable architecture
[Figure: database organization, population and querying over an index that is kept in memcached, replicated and stored]
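The "on demand" organization above can be sketched as a cache-aside scheme: each view is computed on first access, kept in a memcached-like cache, and only explicitly chosen views are made persistent. All names and data here are hypothetical, with plain dicts standing in for the cache and the store:

```python
import json

CACHE = {}        # stands in for memcached (volatile, per-view entries)
PERSISTENT = {}   # stands in for the backing store (only chosen views)

def get_view(name, compute, persist=False):
    """Return the view `name`, computing it on demand."""
    if name in CACHE:                      # cache hit: no recomputation
        return CACHE[name]
    if name in PERSISTENT:                 # fall back to the persisted copy
        CACHE[name] = json.loads(PERSISTENT[name])
        return CACHE[name]
    view = compute()                       # build the view on first access
    CACHE[name] = view
    if persist:                            # only some views are made persistent
        PERSISTENT[name] = json.dumps(view)
    return view

posts = [{"author": "ana", "likes": 3}, {"author": "bob", "likes": 9}]
top = get_view("top_posts",
               lambda: sorted(posts, key=lambda p: -p["likes"]),
               persist=True)
print(top[0]["author"])   # bob
```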
28. GENERATING NOSQL PROGRAMS FROM HIGH LEVEL
ABSTRACTIONS
[Figure: high-level abstractions (a UML class diagram of the application classes) are translated by Spring Roo into low-level abstractions: a Java web app using Spring Data over a graph database and a relational database]
http://code.google.com/p/model2roo/
30. DEPLOYING A DATABASE APPLICATION ON THE CLOUD
Have a first experience on
¡ creating a database on a relational DBMS deployed on the cloud,
¡ building a simple database application exported as a service,
¡ deploying the service on the cloud and implement an interface.
[Figure: the MyNet App accesses the MyNet DB through a Service App; the model contains a Contact entity (idContact: Integer, firstName: String, lastName: String)]
31. SHARDING AND DEPLOYING SHARDS ON THE CLOUD
Have a first experience on
¡ sharding a relational DBMS
¡ dealing with shards synchronization
¡ deploying the service on the cloud and implement an interface
[Figure: the MyNet UML model. Contact (idContact: Integer, lastName: String, givenName: String, society: String) hasBasicInfo BasicInfo (webSite: URI, socialNetworkID: URI); Contact publishesPost Post (postID: Integer, timeStamp: Date, geoStamp: String), and Post hasContent Content (contentID: Integer, text: String); Contact hasAddress Address (street: String, number: Integer, city: String, zipcode: Integer); Contact hasEmail Email (email: String, type: {pers, prof}); Contact speaksLanguage Language (ID: Integer, lang: String). Language chart: Français, Español, English, Others]
35. EXAMPLE 1: SYNCHRONIZING REDIS+MYSQL
https://oracleus.activeevents.com/connect/sessionDetail.ww?SESSION_ID=4775
Updating Redis #FAIL
Case 1: begin MySQL transaction → update MySQL → update Redis → rollback MySQL transaction
⇒ Redis has been updated, MySQL has not
Case 2: begin MySQL transaction → update MySQL → commit MySQL transaction → << system crashes >> → update Redis
⇒ MySQL has been updated, Redis has not
36. EXAMPLE 1: UPDATING REDIS RELIABLY
Step 1 (ACID):
  begin MySQL transaction
  update MySQL
  queue CRUD event in MySQL (event id; operation: create, update, delete; new entity state, e.g. JSON)
  commit transaction
Step 2:
  for each CRUD event in MySQL queue
    get next CRUD event from MySQL queue
    if CRUD event is not duplicate then
      update Redis (incl. eventID)
    end if
    begin MySQL transaction
    mark CRUD event processed
    commit transaction
  end for
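The two-step scheme above can be sketched with sqlite3 standing in for MySQL and a plain dict standing in for Redis; the table names, function names and entity shape are illustrative, not part of any real API:

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")   # stands in for MySQL
db.execute("CREATE TABLE contact (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("CREATE TABLE crud_event (id INTEGER PRIMARY KEY AUTOINCREMENT, "
           "op TEXT, payload TEXT, processed INTEGER DEFAULT 0)")
redis = {}                         # stands in for Redis
applied_events = set()             # event ids already applied (deduplication)

def update_contact(cid, name):
    # Step 1: the business update and the CRUD event are queued in ONE
    # ACID transaction, so neither can exist without the other.
    with db:
        db.execute("INSERT OR REPLACE INTO contact VALUES (?, ?)", (cid, name))
        db.execute("INSERT INTO crud_event (op, payload) VALUES (?, ?)",
                   ("update", json.dumps({"id": cid, "name": name})))

def drain_queue():
    # Step 2: apply each queued event to Redis, skip duplicates, then
    # mark the event processed in its own transaction.
    rows = db.execute("SELECT id, op, payload FROM crud_event "
                      "WHERE processed = 0").fetchall()
    for eid, op, payload in rows:
        if eid not in applied_events:
            entity = json.loads(payload)
            redis[f"contact:{entity['id']}"] = payload
            applied_events.add(eid)
        with db:
            db.execute("UPDATE crud_event SET processed = 1 WHERE id = ?",
                       (eid,))

update_contact(1, "Ada")
drain_queue()
print(redis["contact:1"])   # {"id": 1, "name": "Ada"}
```

If the process crashes between applying an event and marking it processed, the event is simply re-applied on the next drain; the duplicate check makes this harmless.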
37. EXAMPLE 1: UPDATING REDIS RELIABLY
[Figure: Step 1: the application INSERTs into an EntityCRUDEventRepository table (ID, JSON, Processed?); Step 2: a Timer-driven EntityCRUDEventProcessor SELECTs pending events and calls apply(event) on a Redis updater]
39. POLYGLOT DATABASE EVOLUTION
¡ Problem statement:
¡ Evolution of the application: modification of classes, new
classes, new relationships among classes
¡ Evolution of the “entities” managed in the polyglot
database
¡ Some change structure, change values, …
¡ The content of the stores starts drifting from the application data structures
¡ What is the current structure of the stored entities?
¡ Are there elements that are not being accessed because they no longer correspond to the application data structures?
[Figure: evolution of the model, from a single Contact class (idContact: Integer, firstName: String, lastName: String) to the extended model with BasicInfo, Post, Content, Address, Email and Language]
40. EXTRACTING SCHEMATA FROM NOSQL PROGRAMS CODE
http://code.google.com/p/exschema/
[Figure: the ExSchema metalayer (Struct, Set, Attribute and Relationship elements) and its declarations, updates and repositories analyzers, which discover schemas from Spring Data code]
Example output:
Schema1 (Spring Repository): Set fr.imag.twitter.domain.UserInfo with Struct {userId, firstName, lastName}
Schema2 (Spring Repository): Set fr.imag.twitter.domain.Tweet with Struct {text, userId}
Schema3 (Neo4j): Struct fr.imag.twitter.domain.User {nodeId, userName, userId, password} and a followers Relationship (start, end)
42. AGENDA
ü Dealing with multiform and multimedia data collections:
ü it’s all about management requirements
ü Polyglot persistence general principle
ü The NoSQL plethora
ü Polyglot database
ü Design
ü Deployment
ü Querying, update and maintenance
¡ Conclusions and outlook
43. CHALLENGE: POLYGLOT MEETS XPERANTO
Given a data collection coming from different social networks stored on NoSQL systems (Neo4J and Mongo) [possibly
according to a strategy combining sharding and replication techniques], extend the UnQL pivot query language considering
¡ Data processing operators adapted to query different data models (graph, document). Example: a query over Neo4j + MongoDB; and what about Join, Union, …?
¡ Assuming concurrent CRUD operations to the stores, can you expect query results to be consistent? How can you tag your results or implement a sharding strategy in order to determine whether results are consistent?
¡ Querying data represented in different models: how can you exploit the structure of the different stores for expressing queries? Provide adapted operators? Give generic operators and then rewrite queries?
¡ Polyglot solutions tend to solve some data processing issues in the application code, which can be penalizing. Discuss the challenges to address for ensuring that your queries will be able to scale as the collection grows.
44. CHALLENGE: EXPECTED RESULTS
¡ Give the principle of your proposal through a partial programming solution of the operators of your UnQL extension; detail the query evaluation process if you want your solution to scale
¡ We ask you to sketch the solution on the polyglot database that we provide, consisting of MongoDB and Neo4j stores
¡ https://github.com/jccastrejon/edbt-unql
¡ Technical requirements: VMware Player 5
45. WHEN IS POLYGLOT PERSISTENCE PERTINENT?
¡ Applications essentially composing and serving web pages
¡ They only looked up page elements by ID; they had different needs of availability and concurrency, and no need to share all their data
¡ A problem like this is much better suited to a NoSQL store than the corporate relational
DBMS
¡ Scaling to lots of traffic gets harder and harder to do with vertical scaling
¡ Many NoSQL databases are designed to operate over clusters
¡ They can tackle larger volumes of traffic and data than is realistic with a single server
46. CONCLUSIONS
¡ Data are growing bigger and more heterogeneous, and they need new, adapted ways to be managed; thus the NoSQL movement is gaining momentum
¡ Data heterogeneity implies different management requirements; this is where polyglot persistence comes in
¡ Consistency – Availability – Fault tolerance theorem: find the balance !
¡ Which data store according to its data model?
¡ A lot of programming implied …
Open opportunities if you’re interested in this topic!
48.
Prof. Dr. Christine Collet, Grenoble INP, France
Juan Carlos Castrejón, University of Grenoble, France
Javier Espinosa, University of Grenoble, France
Dr. Genoveva Vargas-Solar, CNRS, LIG-LAFMIA, France
49. REFERENCES
¡ Eric A. Brewer, "Towards robust distributed systems." PODC, 2000
¡ Rick Cattell, "Scalable SQL and NoSQL data stores." ACM SIGMOD Record 39.4 (2011): 12-27
¡ Juan Castrejon, Genoveva Vargas-Solar, Christine Collet, and Rafael Lozano, "ExSchema: Discovering and Maintaining Schemas from Polyglot Persistence Applications." In Proceedings of the International Conference on Software Maintenance, Demo Paper, IEEE, 2013
¡ M. Fowler and P. Sadalage, NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence. Pearson Education, Limited, 2012
¡ C. Richardson, Developing polyglot persistence applications, http://fr.slideshare.net/chris.e.richardson/developing-polyglotpersistenceapplications-gluecon2013
51. DATA MODELS
¡ Tuple
¡ Row in a relational table, where attributes are pre-defined in a schema, and the values are scalar
¡ Document
¡ Allows values to be nested documents or lists, as well as scalar values.
¡ Attributes are not defined in a global schema
¡ Extensible record
¡ Hybrid between tuple and document, where families of attributes are defined in a schema, but new attributes can be added
on a per-record basis
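The three models above can be contrasted with literal values; the field names and data are illustrative:

```python
# Tuple: attributes pre-defined in a schema, scalar values only
tuple_row = ("42", "Ada", "Lovelace")          # (id, first, last)

# Document: nested values and lists, no global schema
document = {
    "id": "42",
    "name": {"first": "Ada", "last": "Lovelace"},   # nested document
    "emails": ["ada@example.org"],                  # list value
}

# Extensible record: schema-defined attributes plus per-record additions
record = {"id": "42", "last": "Lovelace"}      # attributes from the schema
record["twitter_handle"] = "@ada"              # added on a per-record basis
print(sorted(record))
```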
52. DATA STORES
¡ Key-value
¡ Systems that store values and an index to find them, based on a key
¡ Document
¡ Systems that store documents, providing index and simple query mechanisms
¡ Extensible record
¡ Systems that store extensible records that can be partitioned vertically and horizontally across nodes
¡ Graph
¡ Systems that model data as graphs, where nodes can represent content modelled as document or key-value structures, and arcs represent relations between the data modelled by the nodes
¡ Relational
¡ Systems that store, index and query tuples
53. KEY-VALUE STORES
¡ “Simplest data stores” use a data model similar to
the memcached distributed in-memory cache
¡ Single key-value index for all data
¡ Provide a persistence mechanism
¡ Replication, versioning, locking, transactions, sorting
¡ API: inserts, deletes, index lookups
¡ No secondary indices or keys
SYSTEM ADDRESS
Redis code.google.com/p/redis
Scalaris code.google.com/p/scalaris
Tokyo tokyocabinet.sourceforge.net
Voldemort project-voldemort.com
Riak riak.basho.com
Membrain schoonerinfotech.com/products
Membase membase.com
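The API surface described above is small enough to sketch in full: a single key-value index supporting inserts, deletes and index lookups, and deliberately nothing else (no secondary indexes or keys, no queries). Class and method names are illustrative:

```python
class KVStore:
    """Toy key-value store: one key index, no secondary indexes, no queries."""

    def __init__(self):
        self._index = {}               # the single key-value index

    def insert(self, key, value):
        self._index[key] = value

    def delete(self, key):
        self._index.pop(key, None)

    def lookup(self, key):
        return self._index.get(key)    # None if the key is absent

store = KVStore()
store.insert("user:1", b"Ada")
print(store.lookup("user:1"))          # b'Ada'
store.delete("user:1")
print(store.lookup("user:1"))          # None
```

Real systems add persistence, replication and versioning on top of this interface, but the query model stays this narrow.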
54.
SELECT name FROM group
WHERE gid IN (SELECT gid FROM group_member WHERE uid = me())

SELECT name, pic, profile_url FROM user WHERE uid = me()

SELECT name, pic FROM user
WHERE online_presence = "active"
AND uid IN (SELECT uid2 FROM friend WHERE uid1 = me())

SELECT name FROM friendlist WHERE owner = me()

SELECT message, attachment FROM stream
WHERE source_id = me() AND type = 80

https://developers.facebook.com/docs/reference/fql/
56. DOCUMENT STORES
¡ Support more complex data: pointerless objects, i.e.,
documents
¡ Secondary indexes (e.g., B-trees), multiple types of documents (objects) per database, nested documents and lists
¡ Automatic sharding (scale writes), no explicit locks,
weaker concurrency (eventual for scaling reads) and
atomicity properties
¡ API: select, delete, getAttributes, putAttributes on documents
¡ Queries can be distributed in parallel over multiple
nodes using a map-reduce mechanism
SYSTEM ADDRESS
SimpleDB amazon.com/simpledb
Couch DB couchdb.apache.org
Mongo DB mongodb.org
Terrastore code.google.com/terrastore
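The map-reduce evaluation style mentioned above can be sketched with threads standing in for nodes: the same map function runs over each node's local documents, and a reduce step merges the partial results. Data, node layout and function names are illustrative:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Two "nodes", each holding a partition of the document collection.
node1 = [{"author": "ana", "tags": ["nosql"]},
         {"author": "bob", "tags": ["sql"]}]
node2 = [{"author": "ana", "tags": ["graph", "nosql"]}]

def map_node(docs):
    # The map step runs on each node over its local documents.
    return Counter(d["author"] for d in docs)

def reduce_counts(partials):
    # The reduce step merges the per-node partial aggregates.
    total = Counter()
    for p in partials:
        total.update(p)
    return total

with ThreadPoolExecutor() as pool:
    partials = list(pool.map(map_node, [node1, node2]))
print(reduce_counts(partials))   # Counter({'ana': 2, 'bob': 1})
```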
58. EXTENSIBLE RECORD STORES
¡ Basic data model is rows and columns
¡ Basic scalability model is splitting rows and columns over
multiple nodes
¡ Rows split across nodes through sharding on the primary key
¡ Split by range rather than hash function
¡ Rows analogous to documents: variable number of attributes, attribute
names must be unique
¡ Grouped into collections (tables)
¡ Queries on ranges of values do not go to every node
¡ Columns are distributed over multiple nodes using “column
groups”
¡ Which columns are best stored together
¡ Column groups must be pre-defined with the extensible record
stores
SYSTEM ADDRESS
HBase hbase.apache.com
HyperTable hypertable.org
Cassandra incubator.apache.org/cassandra
59. SCALABLE RELATIONAL SYSTEMS
¡ SQL: rich declarative query language
¡ Databases reinforce referential integrity
¡ ACID semantics
¡ Well understood operations:
¡ Configuration, Care and feeding, Backups,Tuning, Failure and recovery,
Performance characteristics
¡ Use small-scope operations
¡ Challenge: joins that do not scale with sharding
¡ Use small-scope transactions
¡ ACID transactions inefficient with communication and 2PC overhead
¡ Shared nothing architecture for scalability
¡ Avoid cross-node operations
SYSTEM ADDRESS
MySQL C mysql.com/cluster
Volt DB voltdb.com
Clustrix clustrix.com
ScaleDB scaledb.com
Scale Base scalebase.com
Nimbus DB nimbusdb.com
61. REPLICATION: MASTER-SLAVE
¡ Makes one node the authoritative copy/replica that handles writes, while replicas synchronize with the master and may handle reads
¡ All replicas have the same weight
¡ Replicas can all accept writes
¡ The loss of one of them does not prevent access to the data store
¡ Helps with read scalability but does not help with write
scalability
¡ Read resilience: should the master fail, slaves can still handle
read requests
¡ Master failure eliminates the ability to handle writes until
either the master is restored or a new master is appointed
¡ Biggest complication is consistency
¡ Possible write – write conflict
¡ Attempt to update the same record at the same time from two different places
¡ Master is a bottle-neck and a point of failure
[Figure: all updates are made to the master; changes propagate to the slaves; reads can be done from master or slaves]
62. MASTER-SLAVE REPLICATION MANAGEMENT
¡ Masters can be appointed
¡ Manually: when configuring the nodes cluster
¡ Automatically: when configuring a nodes cluster, one of them is elected as master. The cluster can appoint a new master when the current master fails, reducing downtime
¡ Read resilience
¡ Read and write paths have to be managed separately, so that a failure in the write path can be handled while reads still occur
¡ Reads and writes are put in different database connections, if the database library supports it
¡ Replication comes inevitably with a dark side: inconsistency
¡ Different clients reading different slaves will see different values if changes have not been propagated to all slaves
¡ In the worst case, a client cannot read a write it just made
¡ Even if master-slave is used for hot backups: if the master fails, any updates not yet passed on to the backup are lost
63. REPLICATION: PEER-TO-PEER
¡ Allows writes to any node; the nodes coordinate
to synchronize their copies
¡ The replicas have equal weight
¡ Deals with inconsistencies
¡ Replicas coordinate to avoid conflict
¡ Network traffic cost for coordinating writes
¡ Unnecessary to make all replicas agree to write, only
the majority
¡ Survival to the loss of the minority of replicas nodes
¡ Policy to merge inconsistent writes
¡ Full performance on writing to any replica
[Figure: peer-to-peer replication; all nodes read and write all data, and nodes communicate their writes]
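The majority-write idea above can be sketched in a few lines: a write commits once more than half of the replicas acknowledge it, so losing a minority of nodes does not block writes. The node names and in-memory stores are illustrative assumptions:

```python
replicas = {"n1": {}, "n2": {}, "n3": {}}   # three peer replicas
down = {"n3"}                               # a minority of nodes is down

def quorum_write(key, value):
    """Commit a write once a majority of replicas acknowledge it."""
    acks = 0
    for name, store in replicas.items():
        if name in down:
            continue                        # unreachable replica: no ack
        store[key] = value
        acks += 1
    majority = len(replicas) // 2 + 1       # 2 of 3 here
    return acks >= majority

print(quorum_write("contact:7", "Ada"))     # True
```

With a majority down, the same call returns False and the write does not commit; pairing majority writes with majority reads is what lets two quorums always overlap.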
64. SHARDING
¡ Ability to distribute both data and load of simple
operations over many servers, with no RAM or disk
shared among servers
¡ A way to horizontally scale writes
¡ Improve read performance
¡ Application/data store support
¡ Puts different data on separate nodes
¡ Each user only talks to one server, so she gets rapid responses
¡ The load should be balanced out nicely between
servers
¡ Ensure that
¡ data that is accessed together is clumped together on
the same node
¡ that clumps are arranged on the nodes to provide best
data access
[Figure: each shard reads and writes its own data]
65. SHARDING
Database laws
¡ Small databases are fast
¡ Big databases are slow
¡ Keep databases small
Principle
¡ Start with a big monolithic database
¡ Break into smaller databases
¡ Across many clusters
¡ Using a key value
Instead of keeping one million customers' information on a single big machine, put 100,000 customers on each of several smaller machines
66. SHARDING CRITERIA
¡ Partitioning
¡ Relational: handled by the DBMS (homogeneous DBMS)
¡ NoSQL: based on ranges of the key value
¡ Federation
¡ Relational
¡ Combine tables stored in different physical databases
¡ Easier with denormalized data
¡ NoSQL:
¡ Store together data that are accessed together
¡ Aggregates unit of distribution
67. SHARDING
Architecture
¡ Each application server (AS) is running DBS/client
¡ Each shard server is running
¡ a database server
¡ replication agents and query agents for supporting
parallel query functionality
Process
¡ Pick a dimension that helps sharding easily (customers,
countries, addresses)
¡ Pick strategies that will last a long time as repartition/
re-sharding of data is operationally difficult
¡ This is done according to two different principles
¡ Partitioning: a partition is a structure that divides a space into two parts
¡ Federation: a set of things that together compose a centralized unit, but each individually maintains some aspect of autonomy
Customer data is partitioned by ID into shards, using an algorithm to determine which shard a customer ID belongs to
68. REPLICATION: ASPECTS TO CONSIDER
¡ Conditioning
¡ Important elements to consider
¡ Data to duplicate
¡ Copies location
¡ Duplication model (master – slave / P2P)
¡ Consistency model (global – copies)
→ Find a compromise between fault tolerance, availability, transparency levels and performance!
70. BACKGROUND: DISTRIBUTED RELATIONAL DATABASES
¡ External schemas (views) are often subsets of relations
(contacts in Europe and America)
¡ Access defined on subsets of relations: 80% of the
queries issued in a region have to do with contacts of
that region
¡ Partitioning relations
¡ Better concurrency level
¡ Fragments accessed independently
¡ Implications
¡ Check integrity constraints
¡ Rebuild relations
71. FRAGMENTATION
¡ Horizontal
¡ Groups of tuples of the same relation
¡ Budget < 300 000 or >= 150 000
¡ Non-disjoint fragments are more difficult to manage
¡ Vertical
¡ Groups attributes of the same relation
¡ Separate budget from loc and pname of the relation
project
¡ Hybrid
72. FRAGMENTATION: RULES
Vertical
¡ Clustering
¡ Grouping elementary fragments
¡ Budget and location information in two relations
¡ Splitting
¡ Decomposing a relation according to affinity
relationships among attributes
Horizontal
¡ Tuples of the same fragment must be statistically
homogeneous
¡ If t1 and t2 are tuples of the same fragment then t1 and t2 have
the same probability of being selected by a query
¡ Keep important conditions
¡ Complete
¡ Every tuple (attribute) belongs to a fragment (without information
loss)
¡ If tuples where budget >= 150 000 are more likely to be selected
then it is a good candidate
¡ Minimum
¡ If no application distinguishes between budget >= 150 000 and
budget < 150 000 then these conditions are unnecessary
73. SHARDING: HORIZONTAL PARTITIONING
¡ The entities of a database are split into two or
more sets (by row)
¡ In relational: same schema, several physical bases/servers
¡ Partition contacts into Europe and America shards, where the zip code indicates where they will be found
¡ Efficient if there exists some robust and implicit way to identify in which partition to find a particular entity
¡ Last resort shard
¡ Needs to find a sharding function: modulo, round robin,
hash – partition, range - partition
[Figure: a load balancer in front of Web 1-3 and Cache 1-3; two MySQL masters, one for even IDs and one for odd IDs, each replicated to MySQL slaves 1..n]
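The sharding functions named above (modulo, hash-partition, range-partition; round robin is stateful and omitted here) can be sketched side by side. The shard count and range split points are illustrative assumptions:

```python
import bisect
import hashlib

N = 3                             # number of shards (illustrative)

def modulo_shard(cid: int) -> int:
    # Modulo: trivial, spreads dense numeric IDs evenly.
    return cid % N

def hash_shard(cid: int) -> int:
    # Hash-partition: spreads any key shape evenly.
    return int(hashlib.md5(str(cid).encode()).hexdigest(), 16) % N

BOUNDS = [100_000, 200_000]       # range-partition split points (illustrative)

def range_shard(cid: int) -> int:
    # Range-partition: keeps neighbouring IDs together on one shard.
    return bisect.bisect_right(BOUNDS, cid)

print(modulo_shard(42), range_shard(150_000))   # 0 1
```

The choice matters operationally: hash and modulo spread load evenly but scatter range queries, while range-partitioning keeps range scans local but risks hot shards.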
74. FEDERATION
A FEDERATION IS A SET OF THINGS THAT TOGETHER COMPOSE A CENTRALIZED UNIT BUT EACH
INDIVIDUALLY MAINTAINS SOME ASPECT OF AUTONOMY
75. FEDERATION: VERTICAL SHARDING
¡ Principle
¡ Partition data according to their logical affiliation
¡ Put together data that are commonly accessed
¡ The search load for the large partitioned entity can
be split across multiple servers (logical and
physical) and not only according to multiple
indexes in the same logical server
¡ Different schemas, systems, and physical bases/
servers
¡ Shards the components of a site and not only data
[Figure: a load balancer in front of Web 1-3 and Cache 1-3; separate MySQL masters for the site database and the resume database, each with its own MySQL slaves; internal users access the resume database directly]
77. «MEMCACHED»
¡ «memcached» is a memory management protocol based on a cache:
¡ Uses the key-value notion
¡ Information is completely stored in RAM
¡ «memcached» protocol for:
¡ Creating, retrieving, updating, and deleting information from the database
¡ Applications with their own «memcached» manager (Google, Facebook,
YouTube, FarmVille,Twitter,Wikipedia)
78. STORAGE ON DISC (1)
¡ For efficiency reasons, information is stored using the RAM:
¡ Working information is kept in RAM in order to answer low-latency requests
¡ Yet, this is not always possible or desirable
Ø The process of moving data from RAM to disc is called "eviction"; this process is configured automatically for every bucket
79. STORAGE ON DISC (2)
¡ NoSQL servers support the storage of key-value pairs on disc:
¡ Persistence – the store can be shut down and reinitialized from its own data, without having to load data from another source
¡ Hot backups – loaded data are stored on disc so that the store can be reinitialized in case of failures
¡ Storage on disc – the disc is used when the quantity of data is larger than the physical size of the RAM; frequently used information is kept in RAM and the rest is stored on disc
80. STORAGE ON DISC (3)
¡ Strategies for ensuring:
¡ Each node maintains in RAM information on the key-value pairs it stores. Keys:
¡ may not be found, or
¡ they can be stored in memory or on disc
¡ The process of moving information from RAM to disc is asynchronous:
¡ The server can continue processing new requests
¡ A queue manages requests to disc
Ø In periods with many write requests, clients can be notified that the server is temporarily out of memory until information is evicted
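The asynchronous RAM-to-disc path described above can be sketched with a queue and a background worker thread, with plain dicts standing in for RAM and disc; all names are illustrative:

```python
import queue
import threading

ram = {}                  # stands in for the in-RAM working set
disc = {}                 # stands in for the disc
to_disc = queue.Queue()   # the queue that manages requests to disc

def put(key, value):
    ram[key] = value            # low-latency write served from RAM
    to_disc.put((key, value))   # persistence is handled asynchronously

def disc_writer():
    # Background worker: the server keeps processing new requests
    # while this thread drains the queue to disc.
    while True:
        item = to_disc.get()
        if item is None:        # shutdown sentinel
            break
        key, value = item
        disc[key] = value

worker = threading.Thread(target=disc_writer)
worker.start()
put("contact:1", "Ada")
to_disc.put(None)
worker.join()
print(disc["contact:1"])        # Ada
```

A bounded queue would model the back-pressure case: when it fills up, writers are told the server is temporarily out of memory.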
82. MULTIVERSION CONCURRENCY CONTROL (MVCC)
¡ Objective: Provide concurrent access to the database and in programming languages to implement transactional
memory
¡ Problem: If someone is reading from a database at the same time as someone else is writing to it, the reader could
see a half-written or inconsistent piece of data.
¡ Lock: readers wait until the writer is done
¡ MVCC:
¡ Each user connected to the database sees a snapshot of the database at a particular instant in time
¡ Any changes made by a writer will not be seen by other users until the changes have been completed (until the transaction has been committed)
¡ When an MVCC database needs to update an item of data, it marks the old data as obsolete and adds the newer version elsewhere → multiple versions are stored, but only one is the latest
¡ Writes can be isolated by virtue of the old versions being maintained
¡ Requires (generally) the system to periodically sweep through and delete the old, obsolete data objects
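A minimal MVCC sketch of the behaviour described above: each committed write appends a version stamped with a logical timestamp, and a reader keeps seeing the snapshot taken when it started, never a half-written value. Class and method names are illustrative:

```python
class MVCCStore:
    """Toy MVCC: every write appends a stamped version; readers see snapshots."""

    def __init__(self):
        self.clock = 0        # logical commit timestamp
        self.versions = {}    # key -> list of (txid, value), oldest first

    def write(self, key, value):
        # Old versions are kept; the new one is simply added elsewhere.
        self.clock += 1
        self.versions.setdefault(key, []).append((self.clock, value))

    def snapshot(self):
        return self.clock     # the instant a reading transaction starts

    def read(self, key, snap):
        # Latest version committed at or before the reader's snapshot.
        for txid, value in reversed(self.versions.get(key, [])):
            if txid <= snap:
                return value
        return None

db = MVCCStore()
db.write("x", "v1")
snap = db.snapshot()          # a reader starts here
db.write("x", "v2")           # a concurrent writer commits a newer version
print(db.read("x", snap))     # v1: the reader still sees its snapshot
```

The garbage-collection sweep mentioned above would simply drop versions older than the oldest live snapshot.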