Data Is A Heart Of Scalability
A general discussion of scalability problems with respect to data access patterns and systems.


Published in: Technology

Transcript

  • 1. [email_address] December, 2009 – Data is the heart of a scalable system (based on “Philosophy of transaction processing”)
  • 2. Read vs. write dialectic
  • 3. Read vs. write
    - Processing a business transaction involves both read and write operations. These operations impose contradictory requirements on data structures and architecture.
  • 4. Data schema
    - Normalized: reads are bad (complex queries, slow joins); writes are good (no contradictions, fewer rows to update)
    - Denormalized: reads are good (fast, simple queries, no joins); writes are bad (potential inconsistency, more rows to update, complex update procedures)
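The read side of this trade-off can be illustrated with a minimal sketch, using plain dicts as stand-in tables. All table and field names here are hypothetical, for illustration only.

```python
# Normalized: orders reference customers by id; a read needs a "join".
customers = {1: {"name": "Alice", "city": "Riga"}}
orders_normalized = {100: {"customer_id": 1, "total": 42}}

def read_order_normalized(order_id):
    order = orders_normalized[order_id]         # lookup 1
    customer = customers[order["customer_id"]]  # lookup 2 (the join)
    return {"total": order["total"], "customer": customer["name"]}

# Denormalized: the customer name is copied into the order row, so a
# read is a single lookup -- but every customer rename must now touch
# every order row, which is the write-side cost named in the table.
orders_denormalized = {100: {"customer": "Alice", "total": 42}}

def read_order_denormalized(order_id):
    return orders_denormalized[order_id]        # single lookup

assert read_order_normalized(100)["customer"] == "Alice"
assert read_order_denormalized(100)["customer"] == "Alice"
```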
  • 5. Redundancy
    - Single copy: reads are bad (bottleneck); writes are good (no consistency problems, a single place to update)
    - Multiple copies: reads are good (load is balanced between copies); writes are bad (multiple places to update, synchronization and consistency problems)
  • 6. Storage
  • 7. Message queue as a storage
    - Sending a message = write operation
    - Consuming a message = read operation (with very limited semantics)
    - Durable subscriptions
    - Transaction support
    - An MQ does not need to keep indexes, so it can write transactions to disk extremely fast
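The storage interface of an MQ can be sketched like this: send = write, consume = destructive read. A `deque` is a stand-in for the broker; a real MQ adds durability, subscriptions, and transactions on top of the same narrow contract.

```python
from collections import deque

class QueueStorage:
    """Toy stand-in for a message queue used as storage."""

    def __init__(self):
        # Append-only log: no indexes to maintain, which is why a real
        # MQ can persist writes sequentially and extremely fast.
        self._log = deque()

    def send(self, message):
        """Write operation: append a message."""
        self._log.append(message)

    def consume(self):
        """Read with very limited semantics: next message only,
        consumed exactly once (destructive read)."""
        return self._log.popleft()

q = QueueStorage()
q.send({"txn": 1, "amount": 10})
q.send({"txn": 2, "amount": 20})
assert q.consume()["txn"] == 1   # FIFO order
assert q.consume()["txn"] == 2
```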
  • 8. Storage media
    - Magnetic disks: slow, persistent
    - Dynamic memory: very fast, volatile
    - Flash memory (starting to change the IT landscape): fast, persistent
  • 9. RDBMS are not sleeping
    - MQ gets integrated into the core of the RDBMS
      - MySQL, Postgres, BerkeleyDB are starting to move from anemic storage to a processing facility
    - In-memory operations (e.g. Oracle TimesTen)
    - Materialized views
  • 10. Distribution
  • 11. You have to go distributed
    - You cannot avoid building a distributed system:
    - Fault tolerance: the system should survive a server failure
    - Scaling: the resources of a single server are limited
    - Globalization: modern business is distributed
  • 12. Network
    - The network is your enemy; never forget it
    - The network is unreliable
    - The network is slow
    - The network has limited bandwidth
    - Network interactions also require complex data format transformations – HTTP + SSL + XML may kill performance in the blink of an eye
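Why round trips dominate can be shown with a back-of-the-envelope cost model. The latency and per-item figures below are illustrative assumptions matching the ~1 ms ballpark from the next slide, not measurements.

```python
# Cost model in microseconds: N small calls each pay the round-trip
# latency; one batched call pays it once.
LATENCY_US = 1000   # ~1 ms per network round trip (assumed)
PER_ITEM_US = 10    # serialization + processing per item (assumed)

def cost_unbatched(n_items):
    """Each item is a separate small network call."""
    return n_items * (LATENCY_US + PER_ITEM_US)

def cost_batched(n_items):
    """All items travel in one call; latency is paid once."""
    return LATENCY_US + n_items * PER_ITEM_US

assert cost_unbatched(1000) == 1_010_000   # ~1 s, mostly waiting
assert cost_batched(1000) == 11_000        # ~11 ms
```

The model is crude, but it explains the slide's warning: a large number of separate small interactions ruins performance regardless of bandwidth.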
  • 13. Network vs. disk access to data
    - Latency: network – lower (~1 ms); magnetic disk – seek time (~10 ms) if not cached
    - Random access: network – good, unless a large number of separate small interactions is used; magnetic disk – bad, high seek time
    - Bandwidth: network – limited, the network infrastructure is shared; magnetic disk – higher; if no seeks are required, throughput is very high
  • 14. Data as a state of the system
  • 15. Validity of state
    - Valid = has an interpretation which makes sense from a business or operational perspective
    - It should look meaningful; it doesn’t matter what happens inside
    - A contradiction is not a contradiction as long as we know how to resolve it
  • 16. Time is just another axis
    - The speed of light is limited, but that does not make the stars less beautiful, even if their light is thousands of years old.
    - There is no need to force all changes instantly; offloading operations to asynchronous and batch processing is a powerful method to increase performance and deal with peak loads.
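Offloading to asynchronous processing can be sketched with a queue and a background worker: the request path only records the intent and acknowledges, while the slow work drains later. Names are hypothetical; a real system would use a durable MQ rather than an in-process queue.

```python
import queue
import threading

pending = queue.Queue()   # stand-in for a durable MQ
processed = []

def handle_request(payload):
    """Fast path: just enqueue the intent and acknowledge."""
    pending.put(payload)
    return "accepted"     # acknowledged before the work is done

def worker():
    """Background path: drains the queue at its own pace,
    which smooths out peak load."""
    while True:
        item = pending.get()
        if item is None:          # shutdown sentinel
            break
        processed.append(item)    # the slow part, off the request path

t = threading.Thread(target=worker)
t.start()
for i in range(3):
    assert handle_request(i) == "accepted"
pending.put(None)
t.join()
assert processed == [0, 1, 2]
```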
  • 17. What data do we usually have?
    - Transactional: data changing dynamically (either by us or by a 3rd party)
    - Static: data that does not change that often; can be treated as immutable for most operations
  • 18. Operations on data
    - Operational data: may include transactional and static data; complex read operations; low latency requirements
    - Transactional data: intensive write operations; low latency requirements
    - Secondary operations (e.g. daily reporting): read access to the entire scope of information; operations over large datasets
  • 19. Data structures
    - Data structures should be tailored to the access pattern.
    - How to deal with the read/write contradiction?
    - Use two storages
    - Synchronize data between the storages: synchronously, asynchronously, or periodically
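The two-storage idea can be sketched as a write-optimized append-only log plus a read-optimized view, with periodic synchronization between them. All names are hypothetical and the stores are in-memory stand-ins.

```python
write_log = []     # write-optimized: sequential appends, no indexes
read_view = {}     # read-optimized: keyed for single-lookup queries
synced_upto = 0    # how much of the log the view has absorbed

def write(account, delta):
    """Cheap write: append to the log only."""
    write_log.append((account, delta))

def sync():
    """Periodic synchronization: fold new log entries into the view."""
    global synced_upto
    for account, delta in write_log[synced_upto:]:
        read_view[account] = read_view.get(account, 0) + delta
    synced_upto = len(write_log)

def read_balance(account):
    """Fast read from the view; possibly slightly stale."""
    return read_view.get(account, 0)

write("acc1", 100)
write("acc1", -30)
assert read_balance("acc1") == 0   # stale until the next sync
sync()
assert read_balance("acc1") == 70
```

The staleness visible between `write` and `sync` is exactly the "time is just another axis" trade: reads get a structure tailored for them, at the price of a synchronization lag.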
  • 20. Storages
    - Operational log storage: transactional data only; fast writes, limited reads (e.g. for recovery only)
    - Operational view: transactional + static data; tailored for business logic queries; write intensive
    - Long term storage: all data in the system; required for migration and backup/restore of information; suitable for analytics and ad hoc queries
  • 21. Sweet spot of data grids
    - An IMDG is very fast at simple queries
    - An IMDG has great write throughput
    - This makes an IMDG an ideal solution as a “view” of operational data
    - We can tailor data structures to queries
    - No need for persistence
  • 22. Long term storage
    - The RDBMS is unbeatable in this field
    - Complex analytic queries
    - Ad hoc queries
    - Trust in big vendors
    - Asynchronous synchronization works best here
  • 23. Inventing a bicycle: general transaction processing system revisited
  • 24. Transaction processing style
    - Synchronous processing: we can return a transaction acknowledgement to the client only when we can guarantee that the transaction is successful and durable
    - Asynchronous processing: we acknowledge only the fact of starting the business transaction
    - In both cases we have to fixate the request before acknowledging it
  • 25. A bicycle
    1. Fixate the incoming request (optional)
    2. Acknowledge (async processing)
    3. Operational processing
    4. Fixate the operation result
    5. Respond (sync processing)
    6. Update the operational view
    7. Backend processing: updating the warehouse, working with downstreams (long term storage)
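The steps above can be sketched as one request path. In-memory lists stand in for the MQ (operational log), the IMDG (operational view), and the RDBMS (long term storage); all names are hypothetical, and step 7 would run asynchronously in a real system.

```python
operational_log = []    # steps 1 and 4: fixate request/result (MQ role)
operational_view = {}   # step 6: read-optimized state (IMDG role)
long_term = []          # step 7: warehouse (RDBMS role)

def process_transaction(txn_id, account, amount):
    operational_log.append(("request", txn_id))              # 1. fixate request
    ack = {"txn": txn_id, "status": "accepted"}              # 2. acknowledge
    new_balance = operational_view.get(account, 0) + amount  # 3. process
    operational_log.append(("result", txn_id, new_balance))  # 4. fixate result
    response = {"txn": txn_id, "balance": new_balance}       # 5. respond
    operational_view[account] = new_balance                  # 6. update view
    long_term.append((txn_id, account, amount))              # 7. backend
    return ack, response

_, resp = process_transaction(1, "acc1", 50)
assert resp["balance"] == 50
assert operational_view["acc1"] == 50
assert long_term == [(1, "acc1", 50)]
```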
  • 26. Operational log
    - Files: backup solution required; local access only
    - DBMS: low transaction throughput; index management overhead
    - In-memory/IMDG: not durable
    - MQ: best fit?
  • 27. Operational view
    - RDBMS – possible, with some tuning: a normalized data model is efficient; disk is slow, but there are in-memory options
    - Key/value DBMS – possible: write throughput is not so high
    - In-memory/IMDG – best fit: limited capacity
  • 28. Long term storage
    - RDBMS – a monopoly
    - Key/value – possible, but long term storage anticipates a schema and strong consistency
  • 29. Different approach
    - Traditional design – one size fits all: we need to design one storage good for both read and write operations; we are working against physics here
    - Multiple layer storage: one storage optimized for writes and reliability*; one storage optimized for read operations; synchronization between the storages
    - We replaced one impossible problem with 3 hard problems, but it is clear how to solve each of them, and such solutions can be reused.
    - * Only for write-intensive applications
  • 30. IMDG
  • 31. Weak
    - No SQL
    - “In-memory” is not so fast: flash memory technology; strong belief in hardware
    - In-memory RDBMS (though limited to a single server)
    - The network is slow: multiple network round trips per request may ruin performance; bandwidth is limited
    - An RDBMS is better with complex queries
  • 32. Strong – data
    - The schema should be adapted: denormalized – a single lookup per operation; data affinity is your friend
    - … but if you cook the right schema: true horizontal scale-out; fast operations; great write throughput – and it scales!
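Data affinity can be sketched as follows: everything one operation needs is co-located under one key, so the operation is a single partition-local lookup and never crosses the network to a second node. Hash partitioning here is a crude stand-in for what a real IMDG does; all key and field names are hypothetical.

```python
PARTITIONS = 4

def partition_of(key):
    """Route a key to a partition; a real grid uses consistent hashing
    and maps partitions onto cluster nodes."""
    return hash(key) % PARTITIONS

grid = [dict() for _ in range(PARTITIONS)]

def put(key, value):
    grid[partition_of(key)][key] = value

def get(key):
    # One lookup, one partition: the sweet spot for IMDG reads.
    return grid[partition_of(key)][key]

# Denormalized entry: the customer and their open orders travel
# together under one key, so the operation stays partition-local.
put("customer:42", {"name": "Alice",
                    "open_orders": [{"id": 7, "total": 99}]})
entry = get("customer:42")
assert entry["open_orders"][0]["total"] == 99
```

Because each key routes independently, adding partitions (nodes) spreads both reads and writes, which is where the horizontal scale-out of the slide comes from.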
  • 33. Strong – distribution
    - Addresses the headaches of distributed systems
    - Coordination of work: keeping the cluster together; node communication; failover (an IMDG facilitates availability)
    - Dealing with state (data): data bottleneck problems; data consistency (an IMDG provides consistency)
    - + Recovery from failure
  • 34. The CAP theorem
    - * Introduced in 2000 by Eric Brewer, formally proven by Seth Gilbert and Nancy Lynch in 2002
  • 35. Q&A
