03 - Porcelli e Gleicon - A importância dos dados em sua arquitetura… uma visão muito além do SQL Server


Published on

Nos últimos 30 anos temos vivido a hegemonia dos bancos de dados relacionais, a grande bala de prata da TI. O armazenamento de dados se tornou tão comoditizado, que nem mesmo nos questionamos se o modelo relacional é adequado as nossas necessidades. Mas será que o armazenamento de dados se resume ao modelo relacional? Será que as técnicas tradicionais de normalização ou ferramentas de produtividade como ORM são realmente adequadas? Será que você está tratando seus dados com a devida atenção?
Nesta palestra iremos responder estas e outras perguntas sobre tratamento e armazenamento de dados. Vamos colocar o "dedo na ferida" e apresentar uma nova escola de pensamento e algumas ferramentas que suportam esta nova realidade.
Tags: Anti-patterns, Arquitetura, Database, noSQL, newSQL

  • Be the first to comment

  • Be the first to like this

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

03 - Porcelli e Gleicon - A importância dos dados em sua arquitetura… uma visão muito além do SQL Server

  1. 1. A importância dos dados em suaarquitetura… uma visão muito alémdo SQL ServerAlexandre Porcelli - @porcelliGleicon Moraes - @gleicon
  2. 2. Alexandre Porcelli Creator & DictatorAlexandre PorcelliWriter Alexandre Porcelli Organizer Alexandre Porcelli Commiter / Parser Developer Alexandre Porcelli Core Developer / API Designer
  3. 3. Gleicon Moraeshttp://zenmachine.wordpress.com http://github.com/gleicon @gleicon
  4. 4. existe um mundoalém do...
  5. 5. ou do...
  6. 6. além, até mesmodos...
  7. 7. inclusive do...
  8. 8. próximo domundo sombriodo...
  9. 9. nosql
  10. 10. umanovaescola
  11. 11. contexto
  12. 12. falta de capital
  13. 13. big data
  14. 14. história...
  15. 15. modelos • Hierarchical (IMS): late 1960’s and 1970’s • Directed graph (CODASYL): 1970’s • Relational: 1970’s and early 1980’s • Entity-Relationship: 1970’s • Extended Relational: 1980’s • Semantic: late 1970’s and 1980’s • Object-oriented: late 1980’s and early 1990’s • Object-relational: late 1980’s and early 1990’s • Semi-structured (XML): late 1990’s to late 2000’s • The next big thing: ??? ref: What Goes Around Comes Around por Michael Stonebraker e Joey Hellerstein
  16. 16. next big thing?
  17. 17. definição...
  18. 18. abaixo ao banco de dadosrelacional!
  19. 19. abaixo ao banco de dados relacional!como bala de prata!
  20. 20. momentohistórico...
  21. 21. resolverproblemasespecíficos
  22. 22. estruturade dados
  23. 23. chave-valor
  24. 24. modelo
  25. 25. família de colunas
  26. 26. modelo Keyspace Família de Colunas linha chave coluna coluna coluna coluna coluna . . . coluna . . . linha chave coluna coluna coluna ... coluna Coluna nome timestamp valor
  27. 27. documento
  28. 28. modelo
  29. 29. grafo
  30. 30. visão geral
  31. 31. arquitetura
  32. 32. Architectural Anti PatternsNotes on Data Distribution and Handling Failures
  33. 33. Fail
  34. 34. Anti Patterns• Evolution from SQL Anti Patterns (NoSQL:br May 2010)• More than just RDBMS• Large volumes of data• Distribution• Architecture• Research on other tools• Message Queues, DHT, Job Schedulers, NoSQL• Indexing, Map/Reduce• New revision since QConSP 2010: included Hierarchical Sharding, Embedded lists and Distributed Global Locking
  35. 35. RDBMS Anti PatternsNot all things fit on a relational database, single ou distributed• The eternal table-as-a-tree• Dynamic table creation• Table as cache, queue, log file• Stoned Procedures• Row Alignment• Extreme JOINs• Your scheme must be printed in an A3 sheet.• Your ORM issue full queries for Dataset iterations• Hierarchical Sharding• Embedded lists• Distributed global locking• Throttle Control
  36. 36. The eternal treeProblem: Most threaded discussion example uses somethinglike a table which contains all threads and answers, relating toeach other by an id. Usually the developer will come up withhis own binary-tree version to manage this mess.id - parent_id -author - text1 - 0 - gleicon - hello world2 - 1 - elvis - shout !Alternative: Document storage:{ thread_id:1, title: the meeting, author: gleicon, replies:[ { author: elvis, text:shout, replies:[{...}] } ]}
  37. 37. Dynamic table creationProblem: To avoid huge tables, one must come with a"dynamic schema". For example, lets think about a documentmanagement company, which is adding new facilities over thecountry. For each storage facility, a new table is created:item_id - row - column - stuff1 - 10 - 20 - cat food2 - 12 - 32 - troutNow you have to come up with "dynamic queries", which willprobably query a "central storage" table and issue a huge jointo check if you have enough cat food over the country.Alternatives:- Document storage, modeling a facility as a document- Key/Value, modeling each facility as a SET
  38. 38. Table as cacheProblem: Complex queries demand that a result be stored in aseparated table, so it can be queried quickly. Worst than viewsAlternatives:- Really ?- Memcached- Redis + AOF + EXPIRE- De-normalization
  39. 39. Table as queueProblem: A table which holds messages to be completed.Worse, they must be ordered bytime of creation.Corolary: Job Scheduler tableAlternatives:- RestMQ, Resque- Any other message broker- Redis (LISTS - LPUSH + RPOP)- Use the right tool
  40. 40. Table as log fileProblem: A table in which data gets written as a log file. Fromtime to time it needs to be purged. Truncating this table oncea day usually is the first task assigned to new DBAs.Alternative:- MongoDB capped collection- Redis, and RRD pattern- RIAK
  41. 41. Stoned proceduresProblem: Stored procedures hold most of your applicationslogic. Also, some triggers are used to - well - trigger importantdata events.SP and triggers has the magic property of vanishing of ourmemories and being impossible to keep versioned.Alternative:- Now be careful so you dont use map/reduce as modernstoned procedures. Unfit for real time search/processing- Use your preferred language for business stuff, and let eventhandling to pub/sub or message queues.
  42. 42. Row AlignmentProblem: Extra rows are created but not used, just in case.Usually they are named as a1, a2, a3, a4 and called padding.Theres good will behind that, specially when version 1 of thesoftware needed an extra column in a 150M lines databaseand it took 2 days to run an ALTER TABLE. But thats noexcuse.Alternative:- Quit being cheap. Quit feeling hacker about padding- Document based databases as MongoDB and CouchDB, hasno schema. New atributes are local to the document and canbe added easily.
  43. 43. Extreme JOINsProblem: Business stuff modeled as tables. Table inheritance(Product -> SubProduct_A). To find the complete data for auser plan, one must issue gigantic queries with lots of JOINs.Alternative:- Document storage, as MongoDB might help having important information together.- De-normalization- Serialized objects
  44. 44. Your scheme fits in an A3 sheetProblem: Huge data schemes are difficult to manage. Extremespecialization creates tables which converges to key/valuemodel. The normal form get priority over common sense.Product_A Product_Bid - desc id - descAlternatives:- De-normalization- Another scheme ?- Document store for flattening model- Key/Value- See Extreme JOINs
  45. 45. Your ORM ...Problem: Your ORM issue full queries for dataset iterations,your ORM maps and creates tables which mimics yourclasses, even the inheritance, and the performance is badbecause the queries are huge, etc, etcAlternative:- Apart from denormalization and good old common sense,ORMs are trying to bridge two things with distinct impedance.- There is nothing to relational models which maps cleanly toclasses and objects. Not even the basic unit which is thedomain(set) of each column. Black Magic ?
  46. 46. Hierarchical ShardingProblem: Genius at work. Distinct databases inside a RDBMSranging from A to Z, each database has tables for usersstarting with the proper letter. Each table has user data.Fictional example: e-mail accounts management> show databases;abcdefghijklmnopqrstuwxz> use a> show tables;...alberto alice alma ... (and a lot more)There is no way to query anything in common for all users without application side processing. In this particular case thissharding was uncalled for as relational databases have alltools to deal with this particular case of different clients anddata
  47. 47. Embedded ListsProblem: As data complexity grows, one thinks that its properfor the application to handle different data structuresembedded in a single cell or row. The popular Lets usecommas to separe it. You may find distinct separators as |, -, []and so on.> select group_field from that_email_sharded_database.user"a@email1.net, b@email1.net,c@email2.net"> select flags from stupid_email_admin where id = 21|0|1|1|0|Either learn to model your data, or resort to model keys on K/Vstores. Or any other way to show up flags, as you are notprogramming in C over a RDBMS hopefully.
  48. 48. Distributed Global LockingProblem: Trying to emulate JAVAs synchronize in a distributedmanner. As there is no primitive architectura block to do that,sounds like the proper place to do that would be a RDBMS.May starts with a reference counter in a table and end up withthis:> select COALESCE(GET_LOCK(my_lock,0 ),0 )Plain and simple, you might find it embedded in a magic classcalled DistributedSynchronize or ClusterSemaphore. Locks,transactions and reference counters (which may act as softlocks) doesnt belongs to the database. While its they use isquestionable even in code, the matter of fact is that you aredoing it wrong, if you are doing like that.
  49. 49. Throttle ControlProblem: To control and track access to a given resource, asequence of statements is issued, varying from aupdate...select to a transaction block using a stored procedure:> select count(access) from throttle_ctl where ip_addr like ...> update .... or begin ... commitApart from having IP addresses stored as string, each requestwould have to check on this block. It gets worse if throttlecontrol is mixed with a table-based access.Using memcached (or any other k/v) as the data is ephemeralwould work as (after creating the entry and setting expiretime):if (add IPADDR:YYYY:MM:DD:HH, 1) < your_limit: do_stuff()
  50. 50. ferramentas
  51. 51. noSQL
  52. 52. columnkey-value document graph family
  53. 53. columnkey-value document graph family
  54. 54. newSQL
  55. 55. noSQL newSQL
  56. 56. cada escolha uma renúncia
  57. 57. padrões
  58. 58. how-to
  59. 59. acid
  60. 60. (
  61. 61. existe nosql acid
  62. 62. )
  63. 63. A word about ops
  64. 64. Meaningful data• Email traffic accounting ~4500 msgs/sec in, ~2000 msgs out• 10 Gb SPAM/NOT SPAM tokenized data per day• 300 Gb logfile data/day• 80 Billion DNS queries per day• ~1k req/sec tcptable queries.• 0.8 Pb of data migrated over mixed quality network. Planned 3mo, executed 6mo, online, on production.• traffic from 400mb/s to 3.1Gb/s
  65. 65. Stuff to think aboutThink if the data you use arent de-normalized somewhere(cached)Most of the anti-patterns signals that there are architecturalissues instead of only database issues.Call it NoSQL, non relational, or any other name, but assemblea toolbox to deal with different data types.Are you dependent on cache ? Does your application failswhen there is no warm cache ? Does it just slows down ?Think about the way to put and to get back your data from thedatabase (be it SQL or NoSQL). Migrations are painful.
  66. 66. Stuff to think about- Without operational requirements, its a personal choiceabout data organization- Without time pressure, any migration is easy, regardless datasize.- Out of production environment, events flow in a different timeflow- Normal Accidents (Charles Perrow) - how resilient is youroperation ? Are you ready to tackle your incidents ?- Everything breaks under scale (Benjamin Black)- "Picking up pennies in front of a steamroller" (Nassin N.Taleb)
  67. 67. Perguntas?
  68. 68. Obrigadogithub.com/porcelli github.com/gleiconlinkedin.com/in/alexandreporcelli linkedin.com/in/gleicon@porcelli @gleiconporcelli.com.br zenmachine.wordpress.com