Agenda
• Architecture for data - even if you don’t want it
• Databases
• Message Queues
• Cache

Architecture
“Everyone has a plan until they get punched in the mouth” – Mike Tyson

Even if you don’t want it ...
• There is an innate architecture in everything
• You may end up with more data than you had planned for
• You may move away from your quick and dirty CRUD
• You are probably querying more than one database
• At some point you laugh when your boss asks you about integrating systems
• Code turns into legacy, and so do architectures
• Scattered is not the same as distributed

It usually starts like this
App Server, Database

then
App Servers, Database

it
App Servers, Master DB, Slave DB

goes
App Servers, Master DB, Slave DB, Cache

like
App Servers, Master DB, Slave DB, Cache, Indexing Service

this
App Servers, Master DB, Slave DB, Cache, Indexing Service, API Servers

and
Load Balancer/Reverse Proxy, App Servers, Master DB, Slave DB, Cache, Indexing Service, API Servers

beyond
Load Balancer/Reverse Proxy, App Servers, Master DB, Slave DB, Cache, Indexing Service, API Servers, Auth Service

Problem is...
An architect’s first work is apt to be spare and clean. He knows he doesn’t know what he’s doing, so he does it carefully and with great restraint.
As he designs the first work, frill after frill and embellishment after embellishment occur to him. These get stored away to be used next time. Sooner or later the first system is finished, and the architect, with firm confidence and a demonstrated mastery of that class of systems, is ready to build a second system.
This second is the most dangerous system a man ever designs. When he does his third and later ones, his prior experiences will confirm each other as to the general characteristics of such systems, and their differences will identify those parts of his experience that are particular and not generalizable.
The general tendency is to over-design the second system, using all the ideas and frills that were cautiously sidetracked on the first one. The result, as Ovid says, is a big pile.
– Frederick P. Brooks, Jr., The Mythical Man-Month

Databases
Databases
• Not off-the-shelf architectural duct tape
• Not only relational; there are other paradigms
• Usually the last place people look to optimize
• Usually the first place to accommodate last-minute changes
• Good ideas to try out: sharding and denormalization
• Some of your problems may require something other than a relational database

Relevant RDBMS Anti-Patterns
– Dynamic table creation
– Table as cache
– Table as queue
– Table as log file
– Distributed Global Locking
– Stoned Procedures
– Row Alignment
– Extreme JOINs
– Your ORM issues full queries for dataset iterations
– Throttle Control

Dynamic table creation
Problem: To avoid huge tables, a "dynamic schema" is created. For example, think of a document management company that is adding new facilities across the country. For each storage facility, a new table is created:

item_id - row - column - stuff
1       - 10  - 20     - cat food
2       - 12  - 32     - trout

Side Effect: "dynamic queries", which will probably query a "central storage" table and issue a huge join to check whether you have enough cat food across the country. This is different from sharding.
Alternative:
- Document storage, modeling a facility as a document
- Key/Value, modeling each facility as a SET
- Sharding properly

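To make the contrast with per-facility tables concrete, here is a minimal sketch of what "sharding properly" could look like: a fixed number of shards with the same layout, and a deterministic function that routes each item to one of them. The shard count, the function names and the in-memory dicts standing in for databases are all assumptions made for illustration, not something taken from the slides.

```python
import hashlib

NUM_SHARDS = 4  # assumed, fixed shard count (hypothetical value)

# Each dict stands in for one database shard; all shards share one layout.
shards = [dict() for _ in range(NUM_SHARDS)]


def shard_for(item_id: int) -> int:
    """Map an item id to a shard index in a stable, deterministic way."""
    digest = hashlib.sha256(str(item_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


def put(item_id: int, row: dict) -> None:
    """Store a row on the shard that owns this item id."""
    shards[shard_for(item_id)][item_id] = row


def get(item_id: int):
    """Fetch a row by asking only the shard that owns this item id."""
    return shards[shard_for(item_id)].get(item_id)


put(1, {"row": 10, "column": 20, "stuff": "cat food"})
put(2, {"row": 12, "column": 32, "stuff": "trout"})
```

The point of the sketch is that the schema never changes when capacity grows: the routing function, not a new table name, decides where a row lives.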
Table as cache
Problem: Complex queries demand that a result be stored in a separate table, so it can be queried quickly. Worse than views.
Alternative:
- Really?
- Memcached
- Redis + AOF + EXPIRE
- Denormalization

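As a rough illustration of what an expiring key/value cache buys you over a cache table, the sketch below keeps query results under a key with a time-to-live. The `TTLCache` class is an in-memory stand-in for memcached or Redis with EXPIRE, and `cached_query`, the `query:` key prefix and the injectable clock are hypothetical names chosen for this example.

```python
import time


class TTLCache:
    """In-memory stand-in for a key/value store with per-key expiration."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}  # key -> (expires_at, value)

    def set(self, key, value, ttl_seconds):
        self._data[key] = (self._clock() + ttl_seconds, value)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Expired entries disappear on their own; no purge job needed.
            del self._data[key]
            return None
        return value


def cached_query(cache, sql, run_query, ttl_seconds=60):
    """Return the cached result for a query, running it only on a miss."""
    key = "query:" + sql
    result = cache.get(key)
    if result is None:
        result = run_query(sql)
        cache.set(key, result, ttl_seconds)
    return result
```

Passing the clock in makes the expiration behaviour easy to exercise without sleeping; a real deployment would let the cache server handle time.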
Table as queue
Problem: A table which holds messages waiting to be processed. Worse, they must be sorted by date.
Alternative:
- RestMQ, Resque
- Any other message broker
- Redis (LISTS - LPUSH + RPOP)
- Use the right tool

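The LPUSH + RPOP pairing gives first-in, first-out ordering without any sorting by date. The sketch below mimics those two list operations with a `collections.deque` so it runs without a Redis server; the `ListQueue` class is a hypothetical stand-in, not a Redis client.

```python
from collections import deque


class ListQueue:
    """Mimics a Redis list used as a queue: push on the left, pop on the right."""

    def __init__(self):
        self._items = deque()

    def lpush(self, message):
        """Add a message at the head; returns the new length, as LPUSH does."""
        self._items.appendleft(message)
        return len(self._items)

    def rpop(self):
        """Remove and return the oldest message, or None when the list is empty."""
        return self._items.pop() if self._items else None


queue = ListQueue()
queue.lpush("resize image 1")
queue.lpush("resize image 2")
oldest = queue.rpop()  # the first message pushed comes out first
```

Producers only ever touch one end and consumers the other, which is why no ORDER BY on a timestamp column is needed.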
Table as log file
Problem: A table to which data gets written as if it were a log file. From time to time it needs to be purged. Truncating this table once a day is usually the first task assigned to trainee DBAs.
Alternative:
- MongoDB capped collection
- Redis, and an RRD pattern
- Riak

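The idea shared by a capped collection and an RRD-style store is a fixed-size buffer where new entries push out the oldest ones, so nobody has to truncate anything. A minimal sketch of that behaviour, assuming a cap expressed as a number of entries (MongoDB capped collections are sized in bytes) and a hypothetical `CappedLog` class:

```python
from collections import deque


class CappedLog:
    """Fixed-size log: once full, each new entry evicts the oldest one."""

    def __init__(self, max_entries):
        self._entries = deque(maxlen=max_entries)

    def append(self, entry):
        self._entries.append(entry)

    def tail(self, n):
        """Return up to the n most recent entries, oldest first."""
        if n <= 0:
            return []
        return list(self._entries)[-n:]

    def __len__(self):
        return len(self._entries)


log = CappedLog(max_entries=3)
for line in ["boot", "login", "query", "logout", "shutdown"]:
    log.append(line)
```

After five appends only the last three entries remain, with no purge job involved.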
Distributed Global Locking
Problem: Someone learns Java and synchronized. A bit later, the genius thinks that a distributed synchronized would be awesome. The proper place to do that would be the database, of course. Start with a reference counter in a table and end up with this:
> SELECT COALESCE(GET_LOCK('my_lock', 0), 0)
Plain and simple, you might find it embedded in a magic class called DistributedSynchronize or ClusterSemaphore. Locks, transactions and reference counters (which may act as soft locks) don’t belong in the database.

Stoned procedures
Problem: Stored procedures hold most of your application’s logic. Also, some triggers are used to - well - trigger important data events.
SPs and triggers have the magic property of vanishing from our minds instantly, and of being impossible to keep versioned.
Alternative:
- Be careful not to use map/reduce as stoned procedures.
- Use your preferred language for business logic, and leave event handling to pub/sub or message queues.

Row Alignment
Problem: Extra columns are created but not used, just in case. Usually they are named a1, a2, a3, a4 and called padding.
There’s good intent behind that, especially when version 1 of the software needed an extra column in a 150M-row database and it took 2 days to run an ALTER TABLE.
Alternative:
- Document-based databases such as MongoDB and CouchDB, where new attributes are local to the document. Having no schema also helps.
- Column-based databases may not be the best choice if column creation needs restarts/migrations

Extreme JOINs
Problem: Business rules modeled as tables. Table inheritance (Product -> SubProduct_A). To find the complete data for a user plan, one must issue gigantic queries with lots of JOINs.
Alternative:
- Document storage, such as MongoDB
- Denormalization
- Serialized objects

Your ORM ...
Problem: Your ORM issues full queries for dataset iterations, your ORM maps and creates tables which mimic your classes, even the inheritance, and the performance is bad because the queries are huge, etc, etc.
Alternative:
Apart from denormalization and good old common sense, remember that ORMs try to bridge two models with mismatched impedance.
Nothing in the relational model maps cleanly to classes and objects. Not even the basic unit, which is the domain (set) of each column. Black magic?

Throttle Control
Problem: A request tracker that implements throttle control by IP address, login, operation or any other event using a relational database. Ranging from an update … select to a lock/transaction block, no relational database is the best place to do that.
Alternative: use memcached, Redis or any other DHT which has expiration: create a key such as THROTLE:<IP>:YYYYMMDDHH and increment it. At first glance it sounds the same, but the expiration takes care of cleaning up old entries. Also, search time is just the time of a key lookup.

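The key-per-hour counter can be sketched as follows. `ExpiringCounters` is an in-memory stand-in for a store with increment and expiration (memcached or Redis would do the real work), and `throttle_key`, `allow_request`, the one-hour TTL and the use of UTC for the hour bucket are assumptions made for this example. The key prefix keeps the spelling used on the slide.

```python
import time
from datetime import datetime, timezone


class ExpiringCounters:
    """In-memory stand-in for a key/value store with INCR plus expiration."""

    def __init__(self, clock=time.time):
        self.clock = clock
        self._data = {}  # key -> (count, expires_at)

    def incr(self, key, ttl_seconds):
        now = self.clock()
        count, expires_at = self._data.get(key, (0, now + ttl_seconds))
        if now >= expires_at:
            # An expired counter starts over, as if the key had been deleted.
            count, expires_at = 0, now + ttl_seconds
        self._data[key] = (count + 1, expires_at)
        return count + 1


def throttle_key(ip, timestamp):
    """Build the THROTLE:<IP>:YYYYMMDDHH key for a Unix timestamp (UTC assumed)."""
    hour = datetime.fromtimestamp(timestamp, tz=timezone.utc).strftime("%Y%m%d%H")
    return f"THROTLE:{ip}:{hour}"


def allow_request(counters, ip, limit):
    """Count the request and report whether the caller is still under the limit."""
    key = throttle_key(ip, counters.clock())
    return counters.incr(key, ttl_seconds=3600) <= limit
```

Because the hour is part of the key, a new hour means a new counter, and the old key simply expires instead of being deleted by a cleanup query.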
No silver bullet
- Consider alternatives
- Think outside the norm
- Denormalize
- Simplify

Cycle of changes - Product A
1. There was the database model
2. Then, the cache was needed. Performance was no good.
3. Cache key: query, value: resultset
4. High or nonexistent expiration time [w00t]
(Now there’s a turning point. Data didn’t need to change often. Denormalization was a given with cache)
5. The cache needs to be warmed or the app won’t work.
6. Key/Value storage was a natural choice. No data on MySQL anymore.

Cycle of changes - Product B
1. Postgres DB storing crawler results.
2. There was a counter in each row, and updating this counter caused contention errors.
3. Memcache for reads. Performance is better.
4. First MongoDB test, no more deadlocks from counter update.
5. Data model was simplified, the entire crawled doc was stored.

Stuff to think about
Consider whether the data you use is already denormalized (cached).
Most of the anti-patterns contain signs that a non-relational route (or at least a partial route) may help.
Are you dependent on cache? Does your application fail when there is no cache? Does it just slow down?
Are you ready to think more about your data?
Think about how you put data into the database and how you get it back (be it SQL or NoSQL).

Extra - MongoDB and Redis
The next two slides show what it is like to use MongoDB and Redis for the same task.
There is more to managing your data than stuffing it inside a database. You have to plan ahead for searches and migrations.
This example is about storing books and searching among them. MongoDB makes it simpler: you just use its query language. Redis requires that you keep track of tags and ids, and use SET operations to retrieve the books you want.
Check http://rediscookbook.org and http://cookbook.mongodb.org/ for recipes on data handling.

MongoDB/Redis recap - Books

MongoDB
{ id: 1, title: "Dive Into Python", author: "Mark Pilgrim", tags: ["python", "programming", "computing"] }
{ id: 2, title: "Programming Erlang", author: "Joe Armstrong", tags: ["erlang", "programming", "computing", "distributedcomputing", "FP"] }
{ id: 3, title: "Programming in Haskell", author: "Graham Hutton", tags: ["haskell", "programming", "computing", "FP"] }

Redis
SET book:1 '{"title": "Dive Into Python", "author": "Mark Pilgrim"}'
SET book:2 '{"title": "Programming Erlang", "author": "Joe Armstrong"}'
SET book:3 '{"title": "Programming in Haskell", "author": "Graham Hutton"}'
SADD tag:python 1
SADD tag:erlang 2
SADD tag:haskell 3
SADD tag:programming 1 2 3
SADD tag:computing 1 2 3
SADD tag:distributedcomputing 2
SADD tag:FP 2 3

MongoDB/Redis recap - Books

MongoDB
Search tags for erlang or haskell:
db.books.find({"tags": { $in: ["erlang", "haskell"] } })
Search tags for erlang AND haskell (no results):
db.books.find({"tags": { $all: ["erlang", "haskell"] } })
This search yields 3 results:
db.books.find({"tags": { $all: ["programming", "computing"] } })

Redis
SINTER tag:erlang tag:haskell
0 results
SINTER tag:programming tag:computing
3 results: 1, 2, 3
SUNION tag:erlang tag:haskell
2 results: 2 and 3
SDIFF tag:programming tag:haskell
2 results: 1 and 2 (haskell is excluded)

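The Redis side of the recap boils down to plain set algebra over book ids. The sketch below replays the same tag sets with Python sets, so the results can be checked without a server; the `sinter`, `sunion` and `sdiff` helpers are hypothetical functions named after the Redis commands they imitate, and the `tags` dict plays the role of the `tag:*` keys.

```python
# Tag name -> set of book ids, mirroring the SADD tag:<name> ... commands.
tags = {
    "python": {1},
    "erlang": {2},
    "haskell": {3},
    "programming": {1, 2, 3},
    "computing": {1, 2, 3},
    "distributedcomputing": {2},
    "FP": {2, 3},
}


def sinter(*names):
    """Ids carrying every listed tag (AND), like SINTER."""
    return set.intersection(*(tags[name] for name in names))


def sunion(*names):
    """Ids carrying at least one listed tag (OR), like SUNION."""
    return set.union(*(tags[name] for name in names))


def sdiff(first, *rest):
    """Ids carrying the first tag but none of the others, like SDIFF."""
    return tags[first].difference(*(tags[name] for name in rest))


erlang_and_haskell = sinter("erlang", "haskell")
erlang_or_haskell = sunion("erlang", "haskell")
programming_not_haskell = sdiff("programming", "haskell")
```

SINTER corresponds to MongoDB's $all and SUNION to $in; the difference is that with Redis the application maintains the tag sets itself.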