This document discusses architectural anti-patterns related to data distribution and handling failures. It describes issues like using tables for queues, logs, or caches instead of the proper tools. Alternatives presented include document databases, key-value stores, message queues, and avoiding over-normalization. The document advocates simplifying data models and thinking about architecture and data flow rather than only databases.
Architectural anti-patterns for data handling, by Gleicon Moraes
Now with three more anti patterns and a new required listening. This is the Discipline release, all hail to King Crimson and Fripp's care with details.
1. Architectural Anti Patterns
Notes on Data Distribution and Handling Failures
Gleicon Moraes
http://zenmachine.wordpress.com
http://github.com/gleicon
@gleicon
3. Anti Patterns
Evolution from SQL Anti Patterns (NoSQL:br May 2010)
More than just RDBMS
Large volumes of data
Distribution
Architecture
Research on other tools
Message Queues, DHT, Job Schedulers, NoSQL
Indexing, Map/Reduce
4. RDBMS Anti Patterns
Not everything fits in a relational database, single or distributed
The eternal table-as-a-tree
Dynamic table creation
Table as cache
Table as queue
Table as log file
Stoned Procedures
Row Alignment
Extreme JOINs
Your schema must be printed on an A3 sheet
Your ORM issues full queries for dataset iterations
6. The eternal tree
Problem: Most threaded discussion examples use a single
table that holds all threads and answers, related to each
other by an id. Usually the developer comes up with their
own binary-tree version to manage this mess.
id - parent_id - author - text
1 - 0 - gleicon - hello world
2 - 1 - elvis - shout !
Alternative: Document storage:
{ thread_id: 1, title: 'the meeting', author: 'gleicon', replies: [
{
author: 'elvis', text: 'shout', replies: [{...}]
}
]
}
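A minimal sketch of how such a nested document can be walked, assuming the thread is held as a plain Python dict shaped like the example above. The helper names (count_replies, flatten) and the extra sample reply are hypothetical, not part of the original slides.

```python
# Hypothetical thread document, nested the same way as the slide's example.
thread = {
    "thread_id": 1,
    "title": "the meeting",
    "author": "gleicon",
    "replies": [
        {"author": "elvis", "text": "shout", "replies": [
            {"author": "gleicon", "text": "ok", "replies": []},
        ]},
    ],
}


def count_replies(node):
    """Count every reply below a node by walking the nested 'replies' lists."""
    return sum(1 + count_replies(child) for child in node.get("replies", []))


def flatten(node, depth=0):
    """Yield (depth, author, text) for each reply, in reading order."""
    for child in node.get("replies", []):
        yield depth, child["author"], child["text"]
        yield from flatten(child, depth + 1)
```

The whole thread travels as one unit, so rendering it needs no parent_id bookkeeping, only a recursive walk.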
7. Dynamic table creation
Problem: To avoid huge tables, one must come up with a
"dynamic schema". For example, think about a document
management company that is adding new facilities across the
country. For each storage facility, a new table is created:
item_id - row - column - stuff
1 - 10 - 20 - cat food
2 - 12 - 32 - trout
Now you have to come up with "dynamic queries", which will
probably query a "central storage" table and issue a huge join
to check whether you have enough cat food across the country.
Alternatives:
- Document storage, modeling a facility as a document
- Key/Value, modeling each facility as a SET
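Both alternatives can be sketched in a few lines of Python, assuming in-memory dicts and sets stand in for the document store and the key/value SETs. The facility names, sample items, and the count_item helper are hypothetical.

```python
# Hypothetical data: one document per facility instead of one table per facility.
facilities = [
    {"facility": "north", "items": [
        {"row": 10, "column": 20, "stuff": "cat food"},
        {"row": 12, "column": 32, "stuff": "trout"},
    ]},
    {"facility": "south", "items": [
        {"row": 3, "column": 7, "stuff": "cat food"},
    ]},
]


def count_item(docs, stuff):
    """Count storage slots holding `stuff` across all facility documents."""
    return sum(
        1 for doc in docs for item in doc["items"] if item["stuff"] == stuff
    )


# Key/value variant: each facility maps to a set of the things it stores.
facility_sets = {
    doc["facility"]: {item["stuff"] for item in doc["items"]}
    for doc in facilities
}
facilities_with_cat_food = sorted(
    name for name, stuff in facility_sets.items() if "cat food" in stuff
)
```

Adding a facility means adding one more document or one more key, so no table is created at runtime and no query has to be assembled dynamically.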
8. Table as cache
Problem: Complex queries demand that a result be stored in a
separate table so it can be queried quickly. Worse than views.
Alternatives:
- Really ?
- Memcached
- Redis + AOF + EXPIRE
- De-normalization
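A minimal sketch of the expiring-cache idea behind Memcached or Redis with EXPIRE, assuming an in-memory dict as the store. TTLCache and cached_query are hypothetical names, and this mimics the behaviour only, not either product's API.

```python
import time


class TTLCache:
    """In-memory stand-in for a cache with per-key expiry."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}  # key -> (expires_at, value)

    def set(self, key, value, ttl):
        self._data[key] = (self._clock() + ttl, value)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired: drop it and report a miss
            return None
        return value


def cached_query(cache, key, run_query, ttl=60):
    """Return the cached result, running the expensive query only on a miss."""
    result = cache.get(key)
    if result is None:
        result = run_query()
        cache.set(key, result, ttl)
    return result
```

Expiry replaces the purge jobs a cache table would need, and the database never has to store derived rows.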
9. Table as queue
Problem: A table that holds messages waiting to be processed.
Worse, they must be ordered by
time of creation.
Corollary: Job Scheduler table
Alternatives:
- RestMQ, Resque
- Any other message broker
- Redis (LISTS - LPUSH + RPOP)
- Use the right tool
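The LPUSH + RPOP pattern can be sketched with a double-ended queue, assuming an in-memory deque stands in for a Redis list. ListQueue is a hypothetical name and the method names only echo the Redis commands.

```python
from collections import deque


class ListQueue:
    """FIFO queue using the LPUSH + RPOP pattern on an in-memory list."""

    def __init__(self):
        self._items = deque()

    def lpush(self, message):
        # Producers add at the left end.
        self._items.appendleft(message)

    def rpop(self):
        # Consumers take from the right end, so the oldest message comes
        # out first. An empty queue yields None.
        return self._items.pop() if self._items else None
```

Ordering by time of creation falls out of the data structure itself, so no ORDER BY on a timestamp column is needed.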
10. Table as log file
Problem: A table in which data gets written as a log file. From
time to time it needs to be purged. Truncating this table once
a day usually is the first task assigned to new DBAs.
Alternatives:
- MongoDB capped collection
- Redis, and the RRD pattern
- Riak
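A minimal sketch of the fixed-size behaviour that a capped collection or an RRD-style store provides, assuming a bounded in-memory deque. CappedLog is a hypothetical name and this is not MongoDB's API.

```python
from collections import deque


class CappedLog:
    """Keeps only the newest `max_entries` lines; older ones fall off."""

    def __init__(self, max_entries):
        self._entries = deque(maxlen=max_entries)

    def append(self, line):
        # When the deque is full, appending discards the oldest entry.
        self._entries.append(line)

    def tail(self, n):
        """Return up to the last n lines, oldest first."""
        return list(self._entries)[-n:] if n > 0 else []

    def __len__(self):
        return len(self._entries)
```

Because the structure discards old entries on its own, nobody has to truncate anything once a day.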
11. Stoned procedures
Problem: Stored procedures hold most of your application's
logic. Also, some triggers are used to - well - trigger important
data events.
SPs and triggers have the magic property of vanishing from our
memories and being impossible to keep versioned.
Alternatives:
- Be careful not to use map/reduce as modern stoned
procedures. It is unfit for real-time search/processing.
- Use your preferred language for business logic, and leave
event handling to pub/sub or message queues.
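A minimal in-process sketch of moving a "trigger" into application code with pub/sub, assuming a simple topic-to-handlers registry. The PubSub class and the topic name used below are hypothetical, and a real deployment would more likely use a message broker.

```python
from collections import defaultdict


class PubSub:
    """Tiny in-process publish/subscribe registry."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        """Call every handler for the topic; return how many were notified."""
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(event)
        return len(handlers)


# Usage: the reaction to a data event lives in versioned application code.
bus = PubSub()
audit_trail = []
bus.subscribe("order.created", audit_trail.append)
bus.publish("order.created", {"id": 1})
```

The handler is ordinary code in the repository, so it can be reviewed, tested, and versioned like the rest of the application.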
12. Row Alignment
Problem: Extra columns are created but not used, just in case.
Usually they are named a1, a2, a3, a4 and called padding.
There's good will behind that, especially when version 1 of the
software needed an extra column in a 150M-row database
and it took 2 days to run an ALTER TABLE. But that's no
excuse.
Alternatives:
- Quit being cheap. Quit feeling 'hacker' about padding.
- Document-based databases such as MongoDB and CouchDB
have no schema. New attributes are local to the document and
can be added easily.
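A minimal sketch of attributes being local to a document, assuming plain Python dicts stand in for stored documents. The field names, sample values, and the get_attr helper are hypothetical.

```python
# Hypothetical documents; they do not need to share the same fields.
documents = [
    {"_id": 1, "name": "cat food"},
    {"_id": 2, "name": "trout"},
]

# A new attribute is added to one document only: no ALTER TABLE, no padding.
documents[1]["expires_on"] = "2011-01-31"


def get_attr(doc, name, default=None):
    """Read an attribute that may be absent from older documents."""
    return doc.get(name, default)
```

Readers have to tolerate missing attributes, which is the price paid for not reserving spare columns in advance.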
13. Extreme JOINs
Problem: Business stuff modeled as tables. Table inheritance
(Product -> SubProduct_A). To find the complete data for a
user plan, one must issue gigantic queries with lots of JOINs.
Alternatives:
- Document storage, such as MongoDB, might help keep
important information together.
- De-normalization
- Serialized objects
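A minimal sketch combining de-normalization with a serialized object, assuming a dict stands in for a key/value store and JSON is the serialization format. The key layout, the field names, and load_user_plan are hypothetical.

```python
import json

# Hypothetical de-normalized record: everything needed to show a user's plan
# is kept together instead of being spread over joined tables.
user_plan = {
    "user": "gleicon",
    "plan": {
        "name": "basic",
        "products": [{"type": "SubProduct_A", "desc": "storage"}],
    },
}

# Serialize once on write; a dict stands in for the key/value store.
store = {"user_plan:gleicon": json.dumps(user_plan, sort_keys=True)}


def load_user_plan(kv, user):
    """Fetch the complete plan with a single key lookup, no JOINs."""
    return json.loads(kv[f"user_plan:{user}"])
```

The trade-off is that any data duplicated across records must be updated in every copy.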
14. Your schema fits on an A3 sheet
Problem: Huge data schemas are difficult to manage. Extreme
specialization creates tables which converge to a key/value
model. The normal form gets priority over common sense.
Product_A Product_B
id - desc id - desc
Alternatives:
- De-normalization
- Another schema?
- Document store for flattening model
- Key/Value
- See 'Extreme JOINs'
15. Your ORM ...
Problem: Your ORM issues full queries for dataset iterations,
your ORM maps and creates tables which mimic your
classes, even the inheritance, and the performance is bad
because the queries are huge, etc, etc
Alternative:
- Apart from denormalization and good old common sense,
remember that ORMs try to bridge two things with distinct
impedance.
- Nothing in the relational model maps cleanly to classes and
objects, not even the basic unit, which is the domain (set) of
each column. Black Magic?
16. No silver bullet
- Think about data handling and your system architecture
- Think outside the norm
- De-normalize
- Simplify
- Know stuff (Message queues, NoSQL, DHT)
17. Cycle of changes - Product A
1. There was the database model
2. Then, the cache was needed. Performance was no good.
3. Cache key: query, value: resultset
4. High or nonexistent expiration time [w00t]
(Now there's a turning point. Data didn't need to change often.
Denormalization was a given with cache)
5. The cache needs to be warmed or the app won't work.
6. Key/Value storage was a natural choice. No data on MySQL
anymore.
18. Cycle of changes - Product B
1. Postgres DB storing crawler results.
2. There was a counter in each row, and updating this counter
caused contention errors.
3. Memcache for reads. Performance is better.
4. First MongoDB test, no more deadlocks from counter
update.
5. Data model was simplified, the entire crawled doc was
stored.
19. Stuff to think about
Check whether the data you use is already de-normalized
somewhere (cached).
Most of the anti-patterns signal architectural issues rather
than database issues alone.
The NoSQL route (or at least a partial NoSQL route) may
simplify things.
Are you dependent on cache? Does your application fail
when there is no cache? Does it just slow down?
Think about how you put data into the database and get it
back (be it SQL or NoSQL).
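The cache-dependency question above can be made concrete with a read-through sketch that only slows down when the cache is gone. It assumes the cache client signals an outage by raising an exception; CacheDown, read_through, and the two callables are hypothetical.

```python
class CacheDown(Exception):
    """Raised by the hypothetical cache client when the cache is unreachable."""


def read_through(cache_get, db_get, key):
    """Try the cache first; on a miss or an outage, fall back to the database."""
    try:
        value = cache_get(key)
    except CacheDown:
        value = None  # no cache: slower, but the request still succeeds
    if value is None:
        value = db_get(key)
    return value
```

If the database path cannot serve the request at all, as in Product A once the cache had to be warmed, the cache has become the primary data store and should be treated as one.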