Sector CloudSlam 09


A talk about Sector that was presented at CloudSlam '09.

Published in: Technology

  1. Sector: An Open Source Cloud for Data Intensive Computing. Robert Grossman, University of Illinois at Chicago & Open Data Group; Yunhong Gu, University of Illinois at Chicago. April 20, 2009.
  2. Part 1: Varieties of Clouds
  3. What is a Cloud?
     • Clouds provide on-demand resources or services over a network with the scale and reliability of a data center.
     • There is no standard definition, and cloud architectures are not new. What is new: scale, ease of use, and the pricing model.
  4. Categories of Clouds
     • On-demand resources & services over the Internet at the scale of a data center.
     • On-demand computing instances – IaaS: Amazon EC2, S3, etc.; Eucalyptus – supports many Web 2.0 users.
     • On-demand computing capacity – data intensive computing (say 100 TB, 500 TB, 1 PB, 5 PB) – GFS/MapReduce/Bigtable, Hadoop, Sector, …
  5. Requirements for Clouds Designed for Data Intensive Computing
     [Table: which of four requirements – scale to data centers, scale across data centers, support for large data flows, and security – apply to business, e-science, and healthcare workloads.] Sector/Sphere is a cloud designed for data intensive computing supporting all four requirements.
  6. Sector Overview
     • Sector is fast: over 2x faster than Hadoop on the MalStone benchmark; Sector exploits data locality and network topology to improve performance.
     • Sector is easy to program: it supports MapReduce style computation over (key, value) pairs and user-defined functions over records, and makes it easy to process binary data (images, specialized formats, etc.).
     • Sector clouds can be wide area.
  7. Part 2: Sector Design
  8. Google's Layered Cloud Services (Google's stack, top to bottom):
     • Applications
     • Compute Services: Google's MapReduce
     • Data Services: Google's BigTable
     • Storage Services: Google File System (GFS)
  9. Hadoop's Layered Cloud Services (Hadoop's stack, top to bottom):
     • Applications
     • Compute Services: Hadoop's MapReduce
     • Data Services: (none shown)
     • Storage Services: Hadoop Distributed File System (HDFS)
  10. Sector's Layered Cloud Services (Sector's stack, top to bottom):
     • Applications
     • Compute Services: Sphere's UDFs
     • Data Services: (none shown)
     • Storage Services: Sector's Distributed File System (SDFS)
     • Routing & Transport Services: UDP-based Data Transport Protocol (UDT)
  11. Computing an Inverted Index Using Hadoop's MapReduce
     • Stage 1 (Map and Shuffle): process each HTML file (e.g. page_1 containing word_x, word_y, word_z) and hash each (word, file_id) pair to a bucket, here by the first character of the word.
     • Stage 2 (Sort and Reduce): sort each bucket on the local node and merge entries for the same word, e.g. word_z → pages 1, 5, 10.
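The two stages above can be sketched as a toy, single-process Python version of the same dataflow (an illustration only, not Hadoop's actual API): Map emits (word, page_id) pairs, Shuffle hashes each word to a bucket by its first character, and Sort/Reduce merges the page list for each word.

```python
from collections import defaultdict

def map_pages(pages):
    """Stage 1 (Map): emit a (word, page_id) pair for every word on every page."""
    for page_id, text in pages.items():
        for word in text.split():
            yield word, page_id

def shuffle(pairs):
    """Stage 1 (Shuffle): hash each pair to a bucket by the word's first character."""
    buckets = defaultdict(list)
    for word, page_id in pairs:
        buckets[word[0]].append((word, page_id))
    return buckets

def sort_and_reduce(bucket):
    """Stage 2 (Sort + Reduce): sort the bucket, then merge page ids per word."""
    index = defaultdict(list)
    for word, page_id in sorted(bucket):
        index[word].append(page_id)
    return index

pages = {1: "word_x word_y word_z", 5: "word_z", 10: "word_z"}
inverted = {}
for bucket in shuffle(map_pages(pages)).values():
    inverted.update(sort_and_reduce(bucket))
print(inverted["word_z"])  # [1, 5, 10]
```

In real Hadoop the buckets correspond to reducer partitions and the sort/merge runs per reducer node; the toy preserves only the dataflow.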
  12. Idea 1 – Support UDFs Over Files
     • Think of MapReduce as Map acting on (text) records, with a fixed Shuffle and Sort, followed by Reduce acting on (text) records.
     • We generalize this framework as follows: support a sequence of user-defined functions (UDFs) acting on segments (= chunks) of files.
     • In both cases, the framework takes care of assigning nodes to process data, restarting failed processes, etc.
  13. Computing an Inverted Index Using Sphere's User Defined Functions (UDFs)
     • The same computation expressed as a UDF pipeline: UDF1 (Map) and UDF2 (Shuffle) process each HTML file and hash (word, file_id) pairs to buckets; UDF3 (Sort) and UDF4 (Reduce) sort each bucket on the local node and merge entries for the same word, e.g. word_z → pages 1, 5, 10.
  14. Applying a UDF Using Sector/Sphere
     • 1. The application (via the Sphere client) splits the input stream into segments.
     • 2. The client locates and schedules SPEs to process the segments.
     • 3. The client collects the results into the output stream.
  15. Sphere's UDF dataflow patterns:
     • Input → UDF → Output
     • Input → UDF → Intermediate → UDF → Output
     • Input 1 + Input 2 → UDF → Output
  16. Sector Programming Model
     • A Sector dataset consists of one or more physical files.
     • Sphere applies user-defined functions over streams of data consisting of data segments.
     • Data segments can be data records, collections of data records, or files.
     • Examples of UDFs: a Map function, a Reduce function, a Split function for CART, etc.
     • Outputs of UDFs can be returned to the originating node, written to the local node, or shuffled to another node.
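A minimal sketch of this model (hypothetical Python names, not Sphere's actual C++ API): each UDF maps one data segment to an output segment plus a destination, and the framework, not the UDF, assigns nodes and restarts failed processing.

```python
# Toy sketch of the Sphere processing model (hypothetical API, not Sector's code).
# A UDF takes one data segment (a record, a batch of records, or a file) and
# returns (output_segment, destination), where the destination tells the
# framework whether the result goes back to the originating node, stays on
# the local node, or is shuffled to another node.

ORIGIN, LOCAL, SHUFFLE = "origin", "local", "shuffle"

def to_upper(segment):
    """A trivial record-level UDF: normalize a record and keep it locally."""
    return segment.upper(), LOCAL

def run_udf_chain(segments, udfs):
    """Apply a sequence of UDFs to every segment of the input stream."""
    for seg in segments:
        for udf in udfs:
            seg, dest = udf(seg)   # the framework would route seg per dest
        yield seg

print(list(run_udf_chain(["rec a", "rec b"], [to_upper])))  # ['REC A', 'REC B']
```

A Map/Shuffle/Sort/Reduce job is then just one particular chain of four such UDFs.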
  17. Idea 2: Add Security From the Start
     • A security server maintains information (AAA data) about users and slaves.
     • User access control: password and client IP address.
     • File-level access control.
     • Messages between the client, master, security server, and slaves are encrypted over SSL; certificates are used for authentication.
     • Sector is HIPAA capable.
  18. Idea 3: Extend the Stack
     • Google and Hadoop stacks: Compute Services, Data Services, Storage Services.
     • Sector adds a fourth layer underneath: Routing & Transport Services.
  19. Sector is Built on Top of UDT
     • UDT is a specialized network transport protocol.
     • UDT can take advantage of wide area, high performance 10 Gb/s networks.
     • Sector is a wide area distributed file system built over UDT.
     • Sector is layered over the native file system (rather than being a block-based file system).
  20. UDT Has Been Downloaded 25,000+ Times
     • Users include Sterling Commerce, Movie2Me, Globus, Power Folder, and Nifty TV.
  21. Alternatives to TCP
     • [Chart: AIMD-style protocols compared by their increase of the packet sending rate x and their decrease factor – AIMD (TCP NewReno), HighSpeed TCP, Scalable TCP, and UDT.]
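The protocols in the chart are all AIMD-style: the sending rate x grows by an increase function alpha(x) per update interval and is cut by a multiplicative decrease factor on loss. A generic sketch (illustrative only, not UDT's actual control law; UDT's real increase step is derived from the estimated link bandwidth):

```python
def aimd_step(x, loss, alpha, beta):
    """One AIMD rate-control step for sending rate x:
    cut multiplicatively by factor beta on loss, otherwise increase by alpha(x)."""
    if loss:
        return x * (1.0 - beta)   # multiplicative decrease
    return x + alpha(x)           # additive (possibly rate-dependent) increase

# TCP NewReno: constant increase of one segment per RTT, halve on loss.
newreno = lambda x: 1.0

# A decreasing increase function of the kind the chart attributes to UDT
# (a stand-in shape for illustration, not UDT's published formula).
udt_like = lambda x: max(0.01, 10.0 / (x + 1.0))

x = 100.0
x = aimd_step(x, loss=False, alpha=newreno, beta=0.5)   # 101.0
x = aimd_step(x, loss=True,  alpha=newreno, beta=0.5)   # 50.5
```

The chart's point is that the high-speed variants differ mainly in how aggressively alpha(x) grows the rate and how gently the decrease factor cuts it.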
  22. Using UDT Enables Wide Area Clouds
     • Using UDT, Sector can take advantage of wide area, high performance networks (10+ Gb/s) – 10 Gb/s per application.
  23. Part 3: Experimental Studies
  24. Comparing Sector and Hadoop

                            Hadoop                    Sector
     Storage cloud          Block-based file system   File-based
     Programming model      MapReduce                 UDF & MapReduce
     Protocol               TCP                       UDP-based protocol (UDT)
     Replication            At time of writing        Periodically
     Security               Not yet                   HIPAA capable
     Language               Java                      C++
  25. Open Cloud Testbed – Phase 1 (2008)
     • Connected by C-Wave, CENIC, Dragon, and MREN.
     • Phase 1: 4 racks, 120 nodes, 480 cores, 10+ Gb/s; runs Hadoop, Sector/Sphere, Thrift, and Eucalyptus.
     • Each node in the testbed is a Dell 1435 computer with 12 GB memory, 1 TB disk, two dual-core 2.0 GHz AMD Opteron 2212 processors, and 1 Gb/s network interface cards.
  26. MalStone Benchmark
     • A benchmark developed by the Open Cloud Consortium for clouds supporting data intensive computing.
     • Code to generate the required synthetic data is available.
     • A stylized analytic computation that is easy to implement in MapReduce and its generalizations.
  27. MalStone B
     • [Diagram: entities interacting with sites over a time window marked d(k−2), d(k−1), d(k).]
  28. MalStone B Benchmark

     Hadoop v0.18.3              799 min
     Hadoop Streaming v0.18.3    142 min
     Sector v1.19                 44 min

     20 nodes; 10 billion records; 1 TB dataset. These are preliminary results and we expect them to change as we improve the implementations of MalStone B.
  29. Terasort – Sector vs Hadoop Performance

                        LAN     MAN       WAN 1             WAN 2
     Number of cores    58      116       178               236
     Hadoop (secs)      2252    2617      3069              3702
     Sector (secs)      1265    1301      1430              1526
     Locations          UIC     UIC, SL   UIC, SL, Calit2   UIC, SL, Calit2, JHU

     All times in seconds.
  30. With Sector, the “Wide Area Penalty” Is Less Than 5%
     • Used the Open Cloud Testbed and wide area 10 Gb/s networks.
     • Ran a data intensive computing benchmark on 4 clusters distributed across the U.S. versus one cluster in Chicago.
     • The difference in performance was less than 5% for Terasort.
     • One expects quite different results, depending upon the particular computation.
  31. Penalty for Wide Area Cloud Computing on an Uncongested 10 Gb/s Network

                             28 local nodes   4 x 7 distributed nodes   Wide area “penalty”
     Hadoop, 3 replicas      8650             11600                     34%
     Hadoop, 1 replica       7300             9600                      31%
     Sector                  4200             4400                      4.7%

     All times in seconds, using the MalStone A benchmark on the Open Cloud Testbed.
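The penalty column is just the relative slowdown of the distributed run versus the all-local run; a quick check of the table's arithmetic:

```python
def wide_area_penalty(local_secs, wide_secs):
    """Relative slowdown of the wide-area run versus the all-local run."""
    return (wide_secs - local_secs) / local_secs

print(f"{wide_area_penalty(8650, 11600):.0%}")  # 34% (Hadoop, 3 replicas)
print(f"{wide_area_penalty(7300, 9600):.0%}")   # 32% (Hadoop, 1 replica; the slide reports 31%)
print(f"{wide_area_penalty(4200, 4400):.1%}")   # 4.8% (Sector; the slide reports 4.7%)
```

The small rounding differences aside, the numbers confirm the slide's point: Sector's wide-area overhead is roughly an order of magnitude smaller than Hadoop's on this benchmark.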
  32. For More Information & To Obtain Sector
     • To obtain Sector or learn more about it:
     • To learn more about the Open Cloud Consortium:
     • For related work by Robert Grossman:
     • For related work by Yunhong Gu:
  33. Thank you!