A multi-tool in computing clouds: Tuple Space


1. A multi-tool in computing clouds: Tuple Space
   Joerg Fritsch, School of Computer Science & Informatics, Cardiff University, 16 January 2014
2. Agenda
   • Key themes: parallelization, shared nothing, challenging data (aka "Big Data")
   • Tuple Space: the multi-tool
   • Use case (1): Overcoming limitations of tier-based architectures
   • Use case (2): In-stream processing of Big Data
   • Some miscellaneous remarks
3. The whys and hows
   • Eventually everything is about scalability.
   • Scalable software must make use of thousands of cores:
     – Distribution
     – Decomposition & modularity
     – Coordination
   • Data does not fit in main memory:
     – Distribution
     – Stream processing
   • Need for speed: reduce time complexity
4. Key themes: Parallelization
   • Clouds will need to support scalable programs.
   • Many programs have to parallelize relatively small computations with high inter-dependency.
   • "Any" application is scaled through distribution over parallel (multicore) hardware.
   • Everything "inside a cloud" is physically distributed (data, processing).
   • Large-scale distributed processing: "many core".
5. Key themes: Shared Nothing
   • Synchronization = a shared "something", for example memory, disk, or a data(base)
   • Asynchronous = "shared nothing"
   • Avoid synchronization issues
   • Abstract multithreading and parallelization issues away from the developer, i.e. the actor model (see the sketch below)
   • Highly scalable! For example Erlang
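To make the shared-nothing point concrete, here is a minimal sketch in Python standing in for Erlang-style actors: the worker owns its state and communicates only by message passing, with no shared memory and no locks. The worker function and the None stop sentinel are illustrative choices, not anything prescribed by the talk.

```python
# Shared-nothing message passing: the worker's state is private and is
# only ever observed through messages on the queues.
from multiprocessing import Process, Queue

def worker(inbox: Queue, outbox: Queue):
    total = 0                           # private state, never shared
    for msg in iter(inbox.get, None):   # None is the stop sentinel
        total += msg
        outbox.put(total)

if __name__ == "__main__":
    inbox, outbox = Queue(), Queue()
    p = Process(target=worker, args=(inbox, outbox))
    p.start()
    for n in (1, 2, 3):
        inbox.put(n)
        print("running total:", outbox.get())
    inbox.put(None)                     # tell the worker to stop
    p.join()
```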
6. Challenging data (aka "Big Data")
   • Data in computing clouds is challenging
   • 3V data (Gartner, 2001): Volume, Variety, Velocity
   • Volume: perceived as "big"
     – Hadoop and traditional RDBMSs often hold similar data volumes
     – They differ in the number of nodes (proportional to the number of cores)
     – Analytics
   • Variety: unstructured data, data mashups
     – Hadoop does not force data into schemas, rows, and columns
   • Velocity: streams
7. Challenging data (aka "Big Data"), cont'd
   • Batch tasks are the prevailing computational model:
     – MapReduce
     – Computation over an "offline" data set (on disks)
     – Parallelized polynomial time: N^m / k over k workers
   • Stream processing is catching up (see the one-pass sketch below):
     – Operates on real-time data
     – N log N time
     – You only get "one shot" at each tuple
     – In-memory data structures (e.g. Redis, Memcached)
     – Examples: the Storm project, AWS Kinesis, Apache S4
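The "one shot" constraint is the defining difference from batch: aggregates must be maintained incrementally as tuples fly past, rather than by re-scanning an on-disk data set. A minimal sketch, with an invented streaming_mean helper:

```python
# One-pass stream processing: the stream is consumed exactly once, so the
# aggregate (here a mean) is kept as running state and is available in
# real time after every tuple.
def streaming_mean(stream):
    count, total = 0, 0.0
    for x in stream:              # each tuple is seen exactly once
        count += 1
        total += x
        yield total / count       # running mean, no second pass needed

for mean in streaming_mean(iter([4.0, 8.0, 6.0])):
    print(mean)                   # 4.0, 6.0, 6.0
```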
8. Tuple Space, Gelernter (1985)
   • Tuples are key-value pairs
   • Tuple Space acts as Distributed Shared Memory (DSM)
   • Four primitives manipulate and store tuples: rd(), in(), out(), eval() (sketched below)
   • No schema, ideal for unstructured data
   • Tuples are matched using associative lookup
   • Associative lookup is generally very powerful: CAM tables/routing, dataflow programming & processors
   • Commercial use as in-memory data grids
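A toy rendering of the four primitives may help fix the ideas. This is a single-process sketch, not a distributed tuple space: None serves as the wildcard field for associative matching, rd()/in() block until a match appears, and everything beyond the four primitive names is our own invention (in is spelled in_ because in is a Python keyword).

```python
import threading

class TupleSpace:
    """A minimal in-process sketch of Gelernter's four primitives."""

    def __init__(self):
        self._tuples = []
        self._cond = threading.Condition()

    def out(self, tup):
        """Write a tuple into the space."""
        with self._cond:
            self._tuples.append(tup)
            self._cond.notify_all()

    def _match(self, pattern):
        # Associative lookup: None acts as a wildcard field.
        for tup in self._tuples:
            if len(tup) == len(pattern) and all(
                p is None or p == t for p, t in zip(pattern, tup)
            ):
                return tup
        return None

    def rd(self, pattern):
        """Read a matching tuple without removing it (blocks until found)."""
        with self._cond:
            while (tup := self._match(pattern)) is None:
                self._cond.wait()
            return tup

    def in_(self, pattern):
        """Read and remove a matching tuple (blocks until found)."""
        with self._cond:
            while (tup := self._match(pattern)) is None:
                self._cond.wait()
            self._tuples.remove(tup)
            return tup

    def eval(self, fn, *args):
        """Spawn a computation whose result is out()-ed into the space."""
        threading.Thread(target=lambda: self.out(fn(*args))).start()
```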
9. Tuple Space, cont'd
   • Loose coupling
   • Decoupled in
     – Time
     – Space
     – Synchronization
   • Distributed shared memory (DSM) vs distributed memory ("like MPI")
10. Redis key-value store as Tuple Space
   • In-memory key-value store, can be persistent across system reboots
   • No schema
   • Exact key lookup in O(1) time; glob-style pattern matching over keys
     – A good-enough implementation of associative lookup (see the sketch below)
   • Binary safe
   • Other key-value stores may be equally suitable, with different advantages and disadvantages:
     – Distributed Hash Tables (DHTs)
     – Memcached → distribution
     – Dynamo → presence as an application service in AWS
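A hedged sketch of the mapping onto Redis, assuming the redis-py client (pip install redis) and a key layout of our own invention (tuple:&lt;field&gt;:&lt;field&gt;); glob-style matching with SCAN supplies the associative lookup, and the destructive in() becomes read-then-delete:

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# out(): store a tuple under a structured key (layout is our convention).
r.set("tuple:temperature:sensor42", "21.5")

# rd(): associative lookup with a glob-style pattern.
# scan_iter() is preferred over KEYS for production use.
for key in r.scan_iter(match="tuple:temperature:*"):
    print(key, "->", r.get(key))

# in(): destructive read = read a match, then delete it.
keys = list(r.scan_iter(match="tuple:temperature:*"))
if keys:
    value = r.get(keys[0])
    r.delete(keys[0])
```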
11. Coordination language LINDA, Gelernter (1992)
   • Coordination vs threading
   • Composition happens outside of the worker or agent code (see the pipeline sketch below):
     – FPLs: composition and currying outside of functions
     – Stream processing and composition of kernels
     – Unix pipes: application_1 | application_3 | application_2 → std_out
     – Pipes/(message queues) represent the dataflow graph
   • Error handling?
     – What happens to the mutable state if app_3 (or any of the kernels) fails?
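The point that composition lives outside the workers can be shown in a few lines: the three apps below know nothing about each other, and a small pipeline() helper (our own name, mimicking a shell pipe) holds the entire dataflow graph.

```python
from functools import reduce

def app_1(lines):           # produce: normalize whitespace
    return (l.strip() for l in lines)

def app_3(lines):           # transform: uppercase
    return (l.upper() for l in lines)

def app_2(lines):           # filter: drop empty lines
    return (l for l in lines if l)

def pipeline(source, *stages):
    # The dataflow graph lives here, not inside the workers,
    # just like app_1 | app_3 | app_2 in a shell.
    return reduce(lambda data, stage: stage(data), stages, source)

for line in pipeline(["hello", " world ", ""], app_1, app_3, app_2):
    print(line)             # HELLO, WORLD
```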
12. Coordination language LINDA, cont'd
   • Not enough expressive power (for complex coordination)!
   • Ways to make it more expressive:
     – Algebra of Communicating Processes (ACP)
     – ACP is generally quite suitable for streams, clicks, GUIs, dataflow programming
     – Constraint Handling Rules (CHRs)
     – Agents Communicating through Logic Theories (ACLT), Omicini et al. (1995), Denti et al. (1998)
   • For example: barriers (i.e. MPI_barrier)/eureka conditions (sketched below), Turing-powerful implementations
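As one example of such coordination, a barrier with a eureka condition is a classic LINDA idiom. This sketch builds on the TupleSpace class from the earlier sketch; the ("count", n) tuple layout is invented for illustration.

```python
# An MPI_barrier-style rendezvous expressed in tuple-space primitives.
def barrier(space, n_workers):
    # Take the shared counter tuple, bump it, and put it back ...
    tup = space.in_(("count", None))
    space.out(("count", tup[1] + 1))
    # ... then block on the eureka condition: all workers have arrived.
    space.rd(("count", n_workers))

# Setup, done once by the coordinator:
#   space.out(("count", 0))
# Each of the n workers then calls barrier(space, n).
```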
13. Recap: Tuple Space is like a(n) …
   • Database / data grid
     – No schema
   • Key/value store
   • Extension to programming languages
     – Without adaptation, not Turing-powerful
   • Message bus / message queue
   • Means of coordination
     – Workers, agents, skeletons
   • Memory virtualization
     – Extension of main memory across physical boundaries
14. Use case (1 of 2): Overcoming limitations of tier-based architectures
15. Tier-based architectures
   • The concept has been around since 1998
   • Costly serialization (of data) is required at every system boundary → latency!
   • Often depicted with three simple tiers: web server, application server, and data(base)
   • Many more devices & protocols are involved: redundant load balancers, spanning tree, etc.
16. Tier-based architectures (alternatives)
   • To date: not many alternatives
   • Space-based architectures
     – GigaSpaces
     – TIBCO ActiveSpaces
   • Notion of a one-stop shop
     – Networks → L2 Ethernet fabrics
     – Networks → integrated packet processing
   • Nobody wants to hit a spindle!
     – In-memory computing
17. The end of tier-based architectures (figure; source: http://wiki.gigaspaces.com)
18. The end of tier-based architectures, cont'd 1 (figure; source: http://wiki.gigaspaces.com)
19. The end of tier-based architectures, cont'd 2
   • Space-based cloud platform: no tiers, implicit load balancing, harmonization of messaging, data, and coordination
   • Contrasted with a traditional tier-based cloud platform
20. Use case (2 of 2): In-stream processing of Big Data
21. "More programmer-friendly parallel dataflow languages await discovery, I think. Map Reduce is one (small) step in that direction."
    Jeff Hammerbacher, Engineer-to-Engineer Lectures, June 2010
22. Stream processing of 3V data
   • Stream: an unbounded sequence of tuples
   • MapReduce excels at ad-hoc queries, but is a poor fit for the recursion that machine learning (ML) needs
   • Error resilient: stateful stream processing (see the transaction sketch below)
     – Redis supports transactions
     – The tuple space can hold global mutable state
   • Tuple vs batch / fine grain vs coarse grain
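A hedged sketch of keeping global mutable stream state consistent with a Redis transaction (MULTI/EXEC, exposed in redis-py as a pipeline); the hash names and the record_event helper are invented for illustration:

```python
import redis

r = redis.Redis(decode_responses=True)

def record_event(user, amount):
    # Both updates commit atomically (MULTI/EXEC), so a failed worker
    # never leaves the running count and the running sum out of step.
    pipe = r.pipeline(transaction=True)
    pipe.hincrby("stream:counts", user, 1)
    pipe.hincrbyfloat("stream:sums", user, amount)
    pipe.execute()

record_event("alice", 12.50)
```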
23. (Reactive) in-memory Tuple Space
   • Redis has a built-in Lua interpreter to manipulate data server-side (see the sketch below)
   • Commercial tuple spaces are mostly "reactive"
   • Context-based recursion on the portion of data that is in memory (aka "granularity")
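For example, Redis's EVAL command runs a Lua script atomically on the server, which is enough to express the destructive in() primitive as a single get-and-delete step. A sketch, with an invented script and key name:

```python
import redis

r = redis.Redis(decode_responses=True)

# Atomically pop a tuple: GET then DEL in one server-side step.
POP = """
local v = redis.call('GET', KEYS[1])
if v then redis.call('DEL', KEYS[1]) end
return v
"""

r.set("tuple:job:1", "payload")
print(r.eval(POP, 1, "tuple:job:1"))   # -> payload
print(r.eval(POP, 1, "tuple:job:1"))   # -> None (already taken)
```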
24. Tuple space architecture for in-stream processing of Big Data (figure)
25. Commonalities
   FPLs & flow-based programming (Johnston, 2004):
   • Immutable data; shared nothing
   • Freedom from side effects
   • Locality of effects
   • Lazy evaluation
   • Data dependency equivalent to scheduling
   FPLs & Tuple Space (Fritsch & Walker, 2013):
   • Coordination
   • Distribution
   • Decoupling
   • Inter-process communication (IPC)
26. Commonalities, cont'd
   Flow-based programming & Tuple Space:
   • Both need "a space"
     – IP space in flow-based programs
     – Tuple Space in LINDA
   Altogether:
   • (Data) queues
   • Coordination does not need to reckon with side effects
   • Coordination & composition
   • Representation of a dataflow graph in place of a (thread) call graph
27. Real-world applications
   • News/RSS streams
   • Clicks
     – Online advertisement analytics (e.g. spider.io)
     – URLs (e.g. bit.ly)
     – GUI programming
   • Logistics & transportation
   • Media
     – GPUs (streams + kernels)
   • Mashups: create new wisdom from multiple data sources (incompatible in velocity, volume, variety/structure)
     – Separate errors
   • Debit card transactions
     – Data masking
     – Fraud detection/feedback → context → mashups
28. Points to ponder
   • The ultimate mashup: batch data (aka "MapReduce") and speed data (aka "streams")
     – Lambda architecture
     – Complementary to each other (e.g. Apache Spark, the Lambda Architecture)
   • Currently three paradigms: RDBMSs, MapReduce, streams
     – Distributed query processing is a key element
29. Issues
   • Tuple Space is a piece of software as well
   • Scalability of the tuple space
     – Distribution vs fast in-memory computation
   • Complex coordination is a must!
     – So is error handling (stream replay?)
   • A number of supporting elements are needed:
     – An (auto)scaler
     – Cloud-like deployment: DevOps recipes
     – ZooKeeper?
30. Thank you
31. References
   Denti, Enrico, Antonio Natali, and Andrea Omicini. "On the expressive power of a language for programming coordination media." Proceedings of the 1998 ACM Symposium on Applied Computing. ACM, 1998.
   Fritsch, Joerg, and Coral Walker. "CMQ: A lightweight, asynchronous high-performance messaging queue for the cloud." Journal of Cloud Computing 1.1 (2012): 1-13.
   Fritsch, Joerg, and Coral Walker. "Cwmwl, a LINDA-based PaaS fabric for the cloud." Journal of Communications, Special Issue on Cloud and Big Data (2013, to be published).
   Gelernter, David. "Generative communication in Linda." ACM Transactions on Programming Languages and Systems (TOPLAS) 7.1 (1985): 80-112.
   Gelernter, David, and Nicholas Carriero. "Coordination languages and their significance." Communications of the ACM 35.2 (1992): 96.
   Johnston, Wesley M., J. R. Hanna, and Richard J. Millar. "Advances in dataflow programming languages." ACM Computing Surveys (CSUR) 36.1 (2004): 1-34.
   Omicini, Andrea, Enrico Denti, and Antonio Natali. "Agent coordination and control through logic theories." Topics in Artificial Intelligence. Springer Berlin Heidelberg, 1995. 439-450.
