The importance of data in your architecture... a view far beyond SQL Server!

For the last 30 years we have lived under the hegemony of relational databases, IT's great silver bullet. Data storage has become so commoditized that we no longer even ask whether the relational model suits our needs. But does data storage come down to the relational model alone? Are traditional normalization techniques, or productivity tools such as ORMs, really adequate? Are you giving your data the attention it deserves?
In this talk we answer these and other questions about data handling and storage. We touch the sore spot and present a new school of thought, along with some tools that support this new reality.

  1. 1. The importance of data in your architecture... a view far beyond SQL Server. Alexandre Porcelli (@porcelli) and Gleicon Moraes (@gleicon). Friday, June 3, 2011
  2. 2. Alexandre Porcelli: Creator & Dictator; Writer; Organizer; Committer / Parser Developer; Core Developer / API Designer
  3. 3. Gleicon Moraes: http://zenmachine.wordpress.com, http://github.com/gleicon, @gleicon
  4. 4.
  5. 5. there is a world beyond...
  6. 6. or beyond...
  7. 7. beyond even...
  8. 8. including...
  9. 9. close to the dark world of...
  10. 10. nosql
  11. 11. a new school
  12. 12.
  13. 13. context
  14. 14.
  15. 15. lack of capital
  16. 16. big data
  17. 17. history...
  18. 18. models
      • Hierarchical (IMS): late 1960s and 1970s
      • Directed graph (CODASYL): 1970s
      • Relational: 1970s and early 1980s
      • Entity-Relationship: 1970s
      • Extended Relational: 1980s
      • Semantic: late 1970s and 1980s
      • Object-oriented: late 1980s and early 1990s
      • Object-relational: late 1980s and early 1990s
      • Semi-structured (XML): late 1990s to late 2000s
      • The next big thing: ???
      ref: "What Goes Around Comes Around" by Michael Stonebraker and Joey Hellerstein
  19. 19. next big thing?
  20. 20. definition...
  21. 21. down with the relational database!
  22. 22. down with the relational database! as a silver bullet!
  23. 23. historical moment...
  24. 24.
  25. 25. solving specific problems
  26. 26. data structure
  27. 27. key-value
  28. 28. model
  29. 29. column family
  30. 30. model
      Keyspace > Column Family
      row: key | column | column | column | column | column | ... | column
      ...
      row: key | column | column | column | ... | column
      Column: name, timestamp, value
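The layout on slide 30 can be pictured as nested maps. The sketch below is plain Python and not the API of any particular column-family store; the family name `users` and the sample rows are made up for illustration.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Column:
    """A single cell: name, value and the timestamp of its last write."""
    name: str
    value: str
    timestamp: float = field(default_factory=time.time)


# A keyspace maps: column family -> row key -> column name -> Column
Keyspace = Dict[str, Dict[str, Dict[str, Column]]]


def put(ks: Keyspace, family: str, row_key: str, name: str, value: str) -> None:
    """Write one column; rows in the same family may hold different columns."""
    row = ks.setdefault(family, {}).setdefault(row_key, {})
    row[name] = Column(name, value)


def get(ks: Keyspace, family: str, row_key: str, name: str) -> Optional[str]:
    """Read one column value, or None when the row or column is absent."""
    column = ks.get(family, {}).get(row_key, {}).get(name)
    return column.value if column else None


app: Keyspace = {}
put(app, "users", "gleicon", "blog", "zenmachine.wordpress.com")
put(app, "users", "porcelli", "site", "porcelli.com.br")
put(app, "users", "porcelli", "twitter", "@porcelli")
```

Note that the two rows carry different column names, which is the main difference from a fixed relational schema.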
  31. 31. document
  32. 32. model
  33. 33. graph
  34. 34. overview
  35. 35.
  36. 36.
  37. 37. architecture
  38. 38. Architectural Anti Patterns: Notes on Data Distribution and Handling Failures
  39. 39. Fail
  40. 40. Anti Patterns
      • Evolution from SQL Anti Patterns (NoSQL:br May 2010)
      • More than just RDBMS
      • Large volumes of data
      • Distribution
      • Architecture
      • Research on other tools
      • Message Queues, DHT, Job Schedulers, NoSQL
      • Indexing, Map/Reduce
      • New revision since QConSP 2010: included Hierarchical Sharding, Embedded lists and Distributed Global Locking
  41. 41. RDBMS Anti Patterns
      Not all things fit in a relational database, single or distributed
      • The eternal table-as-a-tree
      • Dynamic table creation
      • Table as cache, queue, log file
      • Stoned Procedures
      • Row Alignment
      • Extreme JOINs
      • Your schema must be printed on an A3 sheet
      • Your ORM issues full queries for dataset iterations
      • Hierarchical Sharding
      • Embedded lists
      • Distributed global locking
      • Throttle Control
  42. 42. The eternal tree
      Problem: Most threaded-discussion examples use a single table that holds all threads and answers, related to each other by an id. The developer usually ends up writing a custom binary-tree implementation to manage this mess.
          id - parent_id - author - text
          1 - 0 - gleicon - hello world
          2 - 1 - elvis - shout !
      Alternative: Document storage:
          { thread_id: 1, title: "the meeting", author: "gleicon", replies: [ { author: "elvis", text: "shout", replies: [{...}] } ] }
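As a rough illustration of the document alternative on slide 42, the thread can be kept as one nested structure and walked recursively, with no `parent_id` bookkeeping. This is a plain-Python sketch, not tied to any document store; the second-level reply is a hypothetical addition to show nesting.

```python
from typing import Any, Dict, Iterator, Tuple

# One document per thread; replies nest inside their parent (slide 42 layout).
thread: Dict[str, Any] = {
    "thread_id": 1,
    "title": "the meeting",
    "author": "gleicon",
    "replies": [
        {
            "author": "elvis",
            "text": "shout",
            "replies": [
                # Hypothetical second-level reply, added for illustration.
                {"author": "gleicon", "text": "hello again", "replies": []},
            ],
        },
    ],
}


def walk(node: Dict[str, Any], depth: int = 0) -> Iterator[Tuple[int, str, str]]:
    """Yield (depth, author, text) for every reply, depth-first."""
    for reply in node.get("replies", []):
        yield depth, reply["author"], reply["text"]
        yield from walk(reply, depth + 1)


for depth, author, text in walk(thread):
    print("  " * depth + f"{author}: {text}")
```

The whole discussion is read or written as a single unit, which is the trade-off: fetching a thread is one lookup, but updating a deep reply means rewriting the document.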
  43. 43. Dynamic table creation
      Problem: To avoid huge tables, one must come up with a "dynamic schema". For example, think of a document management company that is adding new storage facilities across the country. For each storage facility, a new table is created:
          item_id - row - column - stuff
          1 - 10 - 20 - cat food
          2 - 12 - 32 - trout
      Now you have to come up with "dynamic queries", which will probably query a "central storage" table and issue a huge join to check whether you have enough cat food across the country.
      Alternatives:
      - Document storage, modeling a facility as a document
      - Key/Value, modeling each facility as a SET
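The key/value alternative on slide 43 could look roughly like the following, using a Python dict of sets as a stand-in for a key/value store with set values. The `facility:<name>` key scheme and the facility names are assumptions made for this sketch.

```python
from typing import Dict, List, Set

# Hypothetical key scheme: one set of stored items per facility.
store: Dict[str, Set[str]] = {}


def add_item(facility: str, item: str) -> None:
    """Add an item to the facility's set, creating the set on first use."""
    store.setdefault(f"facility:{facility}", set()).add(item)


def facilities_with(item: str) -> List[str]:
    """Return the keys of every facility that holds the item, sorted."""
    return sorted(key for key, items in store.items() if item in items)


add_item("sao_paulo", "cat food")
add_item("sao_paulo", "trout")
add_item("recife", "cat food")

print(facilities_with("cat food"))
```

A set records presence only. Answering "do we have enough cat food" would need a count per item, for example a counter kept under its own key.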
  44. 44. Table as cache
      Problem: Complex queries demand that a result be stored in a separate table so it can be queried quickly. Worse than views.
      Alternatives:
      - Really?
      - Memcached
      - Redis + AOF + EXPIRE
      - De-normalization
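The idea behind the Memcached and Redis + EXPIRE alternatives on slide 44 is a cache whose entries age out on their own. The sketch below is an in-memory stand-in written for illustration, not a client for either tool; `TTLCache`, `cached_report` and the 60-second lifetime are assumptions.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """Tiny in-memory stand-in for a key/value cache with per-key expiry."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._data: Dict[str, Tuple[float, Any]] = {}

    def set(self, key: str, value: Any, ttl: float) -> None:
        """Store value and let it expire ttl seconds from now."""
        self._data[key] = (self._clock() + ttl, value)

    def get(self, key: str) -> Optional[Any]:
        """Return the cached value, or None if it is missing or expired."""
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._data[key]  # lazy eviction on read
            return None
        return value


def cached_report(cache: TTLCache, compute: Callable[[], Any]) -> Any:
    """Cache-aside: serve from cache, fall back to the expensive query."""
    result = cache.get("report")
    if result is None:
        result = compute()
        cache.set("report", result, ttl=60.0)
    return result
```

Nothing has to purge stale rows here: an expired entry simply stops being served and the next request recomputes it.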
  45. 45. Table as queue
      Problem: A table which holds messages to be completed. Worse, they must be ordered by time of creation.
      Corollary: Job Scheduler table
      Alternatives:
      - RestMQ, Resque
      - Any other message broker
      - Redis (LISTS - LPUSH + RPOP)
      - Use the right tool
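The LPUSH + RPOP pairing on slide 45 gives first-in, first-out ordering without any `ORDER BY created_at`. The sketch below imitates that behaviour with a `collections.deque`; the method names mirror the Redis commands but this is an in-memory stand-in, not a Redis client, and the job names are invented.

```python
from collections import deque
from typing import Deque, Optional


class ListQueue:
    """In-memory stand-in for a list used as a FIFO queue."""

    def __init__(self) -> None:
        self._items: Deque[str] = deque()

    def lpush(self, message: str) -> int:
        """Push onto the head (left); return the new queue length."""
        self._items.appendleft(message)
        return len(self._items)

    def rpop(self) -> Optional[str]:
        """Pop from the tail (right); None when the queue is empty."""
        return self._items.pop() if self._items else None


jobs = ListQueue()
jobs.lpush("send-welcome-email")
jobs.lpush("resize-avatar")
```

Producers push on one end and workers pop from the other, so the oldest message always comes out first.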
  46. 46. Table as log file
      Problem: A table in which data gets written as a log file. From time to time it needs to be purged. Truncating this table once a day is usually the first task assigned to new DBAs.
      Alternatives:
      - MongoDB capped collection
      - Redis and an RRD pattern
      - Riak
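Capped collections and the RRD pattern from slide 46 share one idea: a fixed-size buffer that overwrites its oldest entries, so nobody has to truncate anything. A minimal in-memory sketch of that idea follows; `CappedLog` is a name invented here and does not correspond to any MongoDB or Redis API.

```python
from collections import deque
from typing import Deque, List


class CappedLog:
    """Keeps only the newest `capacity` entries, like a ring buffer."""

    def __init__(self, capacity: int) -> None:
        # A deque with maxlen silently drops the oldest entry when full.
        self._entries: Deque[str] = deque(maxlen=capacity)

    def append(self, line: str) -> None:
        self._entries.append(line)

    def tail(self, n: int) -> List[str]:
        """Return the last n entries, oldest first."""
        return list(self._entries)[-n:] if n > 0 else []


log = CappedLog(capacity=3)
for i in range(5):
    log.append(f"event {i}")
```

After five writes into a buffer of three, only the three newest events remain and storage never grows.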
  47. 47. Stoned procedures
      Problem: Stored procedures hold most of your application's logic. Also, some triggers are used to (well) trigger important data events. Stored procedures and triggers have the magic property of vanishing from our memories and being impossible to keep versioned.
      Alternatives:
      - Be careful not to use map/reduce as modern stoned procedures; it is unfit for real-time search/processing
      - Use your preferred language for business logic, and leave event handling to pub/sub or message queues.
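To make the second alternative on slide 47 concrete, here is a minimal sketch of business logic in application code that announces events over pub/sub instead of relying on a trigger. The `Bus` class is a synchronous in-process stand-in for a real broker, and the topic name `user.created` is an assumption.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Handler = Callable[[Dict[str, str]], None]


class Bus:
    """Minimal synchronous pub/sub; a real broker would deliver asynchronously."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: Dict[str, str]) -> int:
        """Deliver the event to every subscriber; return how many received it."""
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(event)
        return len(handlers)


bus = Bus()
audit: List[str] = []

# What might have been an insert trigger becomes a function under version control.
bus.subscribe("user.created", lambda e: audit.append(f"welcome {e['name']}"))


def create_user(name: str) -> None:
    """Business logic lives in application code, then announces the event."""
    # ... persist the user with whatever store is in use ...
    bus.publish("user.created", {"name": name})


create_user("alice")
```

The handler is ordinary code: it can be reviewed, tested and deployed with the rest of the application.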
  48. 48. Row Alignment
      Problem: Extra columns are added to each row but not used, just in case. Usually they are named a1, a2, a3, a4 and called padding. There's goodwill behind that, especially when version 1 of the software needed an extra column in a 150M-row database and it took 2 days to run an ALTER TABLE. But that's no excuse.
      Alternatives:
      - Quit being cheap. Quit feeling like a hacker about padding
      - Document-based databases such as MongoDB and CouchDB have no schema. New attributes are local to the document and can be added easily.
  49. 49. Extreme JOINs
      Problem: Business entities modeled as tables. Table inheritance (Product -> SubProduct_A). To find the complete data for a user plan, one must issue gigantic queries with lots of JOINs.
      Alternatives:
      - Document storage such as MongoDB might help by keeping important information together.
      - De-normalization
      - Serialized objects
  50. 50. Your schema fits on an A3 sheet
      Problem: Huge data schemas are difficult to manage. Extreme specialization creates tables which converge to the key/value model. The normal form gets priority over common sense.
          Product_A (id - desc), Product_B (id - desc)
      Alternatives:
      - De-normalization
      - Another schema?
      - Document store for flattening the model
      - Key/Value
      - See Extreme JOINs
  51. 51. Your ORM ...
      Problem: Your ORM issues full queries for dataset iterations, your ORM maps and creates tables which mimic your classes, even the inheritance, and the performance is bad because the queries are huge, etc.
      Alternatives:
      - Apart from de-normalization and good old common sense, keep in mind that ORMs are trying to bridge two things with distinct impedance.
      - Nothing in the relational model maps cleanly to classes and objects, not even the basic unit, which is the domain (set) of each column. Black magic?
  52. 52. Hierarchical Sharding
      Problem: Genius at work. Distinct databases inside an RDBMS ranging from A to Z; each database has tables for users starting with the proper letter. Each table has user data. Fictional example: e-mail accounts management
          > show databases;
          a b c d e f g h i j k l m n o p q r s t u w x z
          > use a
          > show tables;
          ... alberto alice alma ... (and a lot more)
      There is no way to query anything in common for all users without application-side processing. In this particular case the sharding was uncalled for, as relational databases have all the tools needed to deal with different clients and their data.
  53. 53. Embedded Lists
      Problem: As data complexity grows, one thinks it is proper for the application to handle different data structures embedded in a single cell or row. The popular "Let's use commas to separate it". You may find distinct separators such as |, -, [] and so on.
          > select group_field from that_email_sharded_database.user
          "a@email1.net, b@email1.net,c@email2.net"
          > select flags from stupid_email_admin where id = 2
          1|0|1|1|0|
      Either learn to model your data, or resort to modeling keys on K/V stores, or find any other way to represent flags; hopefully you are not programming in C on top of an RDBMS.
  54. 54. Distributed Global Locking
      Problem: Trying to emulate Java's synchronized in a distributed manner. As there is no primitive architectural block to do that, it sounds like the proper place to do it would be an RDBMS. It may start with a reference counter in a table and end up with this:
          > select COALESCE(GET_LOCK('my_lock', 0), 0)
      Plain and simple, you might find it embedded in a magic class called DistributedSynchronize or ClusterSemaphore. Locks, transactions and reference counters (which may act as soft locks) don't belong in the database. While their use is questionable even in code, the fact of the matter is that if you are doing it like that, you are doing it wrong.
  55. 55. Throttle Control
      Problem: To control and track access to a given resource, a sequence of statements is issued, ranging from an update...select to a transaction block using a stored procedure:
          > select count(access) from throttle_ctl where ip_addr like ...
          > update ....
          or begin ... commit
      Apart from having IP addresses stored as strings, each request would have to go through this block. It gets worse if throttle control is mixed with table-based access.
      Since the data is ephemeral, memcached (or any other k/v store) would work as follows (after creating the entry and setting its expire time):
          if (add IPADDR:YYYY:MM:DD:HH, 1) < your_limit: do_stuff()
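The key/value approach on slide 55 amounts to a fixed-window counter keyed by address and hour. The sketch below uses a Python dict in place of memcached, so nothing expires here, whereas the slide relies on the store's expire time to drop old windows. The function names are assumptions, and this version lets through exactly `limit` requests per window, a slightly different cut-off from the slide's `<` comparison.

```python
from collections import defaultdict
from datetime import datetime, timezone
from typing import DefaultDict

# In-memory stand-in for the counters; a real store would expire old keys.
counters: DefaultDict[str, int] = defaultdict(int)


def window_key(ip_addr: str, now: datetime) -> str:
    """Build the per-hour key from the slide: IPADDR:YYYY:MM:DD:HH."""
    return f"{ip_addr}:{now:%Y:%m:%d:%H}"


def allow(ip_addr: str, limit: int, now: datetime) -> bool:
    """Count this request and report whether it is within the hourly limit."""
    key = window_key(ip_addr, now)
    counters[key] += 1
    return counters[key] <= limit


t = datetime(2011, 6, 3, 14, 30, tzinfo=timezone.utc)
print(window_key("10.0.0.1", t))
```

Each request costs one increment on one key, and a new hour starts a fresh counter simply because the key changes.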
  56. 56. tools
  57. 57. noSQL
  58. 58. column family, key-value, document, graph
  59. 59. column family, key-value, document, graph
  60. 60.
  61. 61. newSQL
  62. 62.
  63. 63.
  64. 64.
  65. 65. noSQL newSQL
  66. 66.
  67. 67.
  68. 68. every choice means giving something up
  69. 69. patterns
  70. 70. how-to
  71. 71.
  72. 72. acid
  73. 73. (
  74. 74. ACID NoSQL exists
  75. 75. )
  76. 76. A word about ops
  77. 77. Meaningful data
      • Email traffic accounting: ~4500 msgs/sec in, ~2000 msgs out
      • 10 Gb SPAM/NOT SPAM tokenized data per day
      • 300 Gb logfile data/day
      • 80 billion DNS queries per day
      • ~1k req/sec tcptable queries
      • 0.8 Pb of data migrated over a mixed-quality network. Planned 3 months, executed in 6 months, online, in production.
      • Traffic from 400 Mb/s to 3.1 Gb/s
  78. 78. Stuff to think about
      Check whether the data you use isn't already de-normalized somewhere (cached).
      Most of the anti-patterns signal architectural issues rather than database issues alone.
      Call it NoSQL, non-relational, or any other name, but assemble a toolbox to deal with different data types.
      Are you dependent on cache? Does your application fail when there is no warm cache? Does it just slow down?
      Think about how you put data into the database and get it back (be it SQL or NoSQL). Migrations are painful.
  79. 79. Stuff to think about
      - Without operational requirements, data organization is a matter of personal choice
      - Without time pressure, any migration is easy, regardless of data size.
      - Outside the production environment, events flow on a different timescale
      - Normal Accidents (Charles Perrow): how resilient is your operation? Are you ready to tackle your incidents?
      - Everything breaks under scale (Benjamin Black)
      - "Picking up pennies in front of a steamroller" (Nassim N. Taleb)
  80. 80. Questions?
  81. 81. Thank you. github.com/porcelli, github.com/gleicon, linkedin.com/in/alexandreporcelli, linkedin.com/in/gleicon, @porcelli, @gleicon, porcelli.com.br, zenmachine.wordpress.com
