The role of cloud computing in big biology
Deepak Singh
Via Reavel under a CC-BY-NC-ND license
life science industry
Credit: Bosco Ho
By ~Prescott under a CC-BY-NC license
context
analysis methods
technology
? ? technology ? ?
back of the room
technology technology technology technology
technology (the word repeated at scattered angles across the slide)
Image: Keith Allison under a CC-BY-SA license
inherent characteristics
data driven
multi-dimensional
collaborative
distributed
<amazon web services>
the cloud
has_many :definitions
infrastructure as a service
precursors
virtualization
service oriented architecture
distributed computing
Compute                          Storage
    Amazon Elastic Compute                                   Database
           ...
Payments          On-Demand
Parallel Processing                                     Messaging
                           C...
Isolated Networks
         Monitoring                    Management                         Tools
                        ...
Your Custom Applications and Services

                                                                                   ...
scalable
scalable
cost effective
scalable
pay as you go
cost effective
scalable
cost effective
   reliable
scalable
cost effective
   reliable
   secure
Amazon EC2
servers on demand
highly scalable
3000 CPUs for one firm’s risk management application
Chart: Number of EC2 Instances (axis label: 3000)




                          ...
design for failure
“Everything fails, all the time”
                   -- Werner Vogels
assume failure
assume failure

design backwards
assume failure

design backwards

  nothing fails
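
The "assume failure, design backwards" idea can be illustrated with a minimal sketch: wrap every remote call in a retry loop so that a transient failure is the expected case rather than the exception. This is only an illustration under stated assumptions; `call_with_retries` and `make_flaky` are hypothetical names invented here, not part of any AWS library or of the talk.

```python
import time


def call_with_retries(operation, attempts=4, base_delay=0.0, sleep=time.sleep):
    """Run operation(), retrying on any exception with exponential backoff."""
    last_error = None
    for attempt in range(attempts):
        try:
            return operation()
        except Exception as error:  # assume any call can fail
            last_error = error
            if attempt < attempts - 1:
                # Wait base_delay, 2*base_delay, 4*base_delay, ... between tries.
                sleep(base_delay * (2 ** attempt))
    raise last_error


def make_flaky(fail_times):
    """Hypothetical stand-in for a remote call that fails a few times first."""
    state = {"calls": 0}

    def flaky():
        state["calls"] += 1
        if state["calls"] <= fail_times:
            raise ConnectionError("simulated failure")
        return "ok after %d calls" % state["calls"]

    return flaky


print(call_with_retries(make_flaky(2)))  # prints "ok after 3 calls"
```

In a real deployment the delay would be non-zero and the caller would usually retry only on errors known to be transient.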
highly available systems
elastic block store
elastic IP
SQS
US East Region



Availability     Availability
 Zone A           Zone B



Availability     Availability
 Zone C         ...
data storage
one size does not fit all
Amazon S3
distributed object store
durable
available
scalable
fast
simple
structured data anyone?
Amazon SimpleDB
zero administration
highly available
schema less
key-value store
Amazon Relational Database Service
single API call
MySQL database
automatic backup
scale up with API call
futures
futures
master-slave replication
  data center failover
what do people do?
solve problems
> 1PB of data in S3
provide platforms & services
Platform as a Service




http://heroku.com
Computation as a Service




http://cyclecomputing.com
Computational Platforms




                                                   sudo gem install cloud-crowd

     http://c...
http://cyclecomputing.com
http://wiki.github.com/documentcloud/cloud-crowd
they do science



Image: Matt Wood
3.7 million classifications in just over three days
~15 million in less than a month
>2.6 million clicks in 100 hours
Image via image editor under a CC-BY License
Protein Docking @ Pfizer




http://bioteam.net
http://aws.amazon.com/publicdatasets/
</amazon web services>
anecdote
collaborative project
800 GB
Image: Wikipedia Commons
weeks to get started
Image: Matt Wood
Image: Chris Dagdigian
gigabytes
terabytes
petabytes
really fast
constant flux
Image: Chris Dagdigian
data management is not
     data storage
masterclass
Big data & Biology: The implications of
          petascale science
             Tuesday November 17

        ...
“science data platform”
deliver data to applications
deliver data to people
typical informatics workflow
Via Christolakis under a CC-BY-NC-ND license
Via Argonne National Labs under a CC-BY-SA license
p p
                                                             r a
                                                     ...
Data
Apps
Data Platform




App Platform
Data Platform




App Platform
Data Platform




App Platform
Data Platform




data services
application services

       App Platform
Scalable Data Platform


                Services


                  APIs


Getters         Filters            Savers



...
must accommodate change
must scale
highly available
loosely coupled
dynamic
task-based resources
one project
one set of resources
no waiting
Protein Docking @ Pfizer




http://bioteam.net
distributed mindset
one approach
disk read/writes
slow & expensive
data processing
 fast & cheap
distribute data
parallelize reads
map/reduce
distributed data processing
          at scale
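
A minimal single-machine sketch of the map/reduce pattern, assuming the data is already split into chunks: each chunk is mapped to a partial result independently (the step that a framework such as Hadoop would run in parallel, close to the data), and the partial results are then reduced into one answer. The function names and the toy reads are illustrative assumptions, not taken from the talk.

```python
from collections import Counter
from functools import reduce


def map_chunk(reads):
    """Map step: count bases in one chunk of reads, independent of other chunks."""
    counts = Counter()
    for read in reads:
        counts.update(read)
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial counts."""
    return left + right


# Toy data: three chunks that could live on three different machines.
chunks = [["ACGT", "AACC"], ["GGTT"], ["ACAC", "TTTT"]]

partials = [map_chunk(chunk) for chunk in chunks]  # parallelizable
total = reduce(reduce_counts, partials, Counter())
print(dict(total))
```

Because each map call touches only its own chunk, the reads from disk can proceed in parallel, which is the point of distributing the data in the first place.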
abstracting away hadoop
apache hive



 http://hadoop.apache.org/hive/
apache pig



http://hadoop.apache.org/pig/
cascading



http://www.cascading.org/
hosted hadoop service
hadoop easy & simple
Amazon Elastic
                                    MapReduce

                                     Amazon EC2 Instances
  ...
developers
develop & distribute
scientists/analysts
     consume
CloudBurst




Catalog k-mers     Collect seeds   End-to-end alignment
Mike Schatz, University of Maryland
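
The three stages named on the CloudBurst slide (catalog k-mers, collect seeds, end-to-end alignment) follow the general seed-and-extend idea. The sketch below is a toy, single-process illustration of that idea only; it is not CloudBurst's actual implementation, which runs on Hadoop. All function names, the value of `k`, and the sequences are assumptions made for the example, and mismatches are counted position by position (no gaps).

```python
from collections import defaultdict


def catalog_kmers(reference, k):
    """Stage 1: index every k-mer of the reference by its start position."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def collect_seeds(read, index, k):
    """Stage 2: exact k-mer hits give candidate start positions for the read."""
    candidates = set()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], []):
            start = pos - offset
            if start >= 0:
                candidates.add(start)
    return candidates


def align_end_to_end(read, reference, start, max_mismatches):
    """Stage 3: compare the whole read at a candidate position."""
    window = reference[start:start + len(read)]
    if len(window) < len(read):
        return None
    mismatches = sum(1 for a, b in zip(read, window) if a != b)
    return mismatches if mismatches <= max_mismatches else None


def map_read(read, reference, k=4, max_mismatches=1):
    """Return {start position: mismatch count} for acceptable alignments."""
    index = catalog_kmers(reference, k)
    hits = {}
    for start in sorted(collect_seeds(read, index, k)):
        mismatches = align_end_to_end(read, reference, start, max_mismatches)
        if mismatches is not None:
            hits[start] = mismatches
    return hits


reference = "ACGTACGTTAGCCGATAGGCTA"
print(map_read("TAGCCGTT", reference))  # prints {8: 1}
```

The seed stage also proposes position 1 for this read (via the shared k-mer `CGTT`), and the end-to-end check rejects it for having too many mismatches.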
Scalable Data Platform


                Services


                  APIs


Getters         Filters            Savers



...
IN CONCLUSION
large scale biology
complex multidimensional data
whole lot of data
distributed collaborations
new computing and data
     architectures
a solution: cloud services
distributed
scalable
economical
here today
Thank you!

deesingh@amazon.com
Twitter: @mndoci
Presentation ideas from @mza, James Hamilto...
Talk given at "Cloud Computing for Systems Biology" workshop
  1. 1. The role of cloud computing in big biology, Deepak Singh
  2. 2. Via Reavel under a CC-BY-NC-ND license
  3. 3. life science industry
  4. 4. Credit: Bosco Ho
  5. 5. By ~Prescott under a CC-BY-NC license
  6. 6. context
  7. 7. analysis methods
  8. 8. technology
  9. 9. ? ? technology ? ?
  10. 10. back of the room
  11. 11. technology technology technology technology
  12. 12. technology (the word repeated at scattered angles across the slide)
  13. 13. Image: Keith Allison under a CC-BY-SA license
  14. 14. inherent characteristics
  15. 15. data driven
  16. 16. multi-dimensional
  17. 17. collaborative
  18. 18. distributed
  19. 19. <amazon web services>
  20. 20. the cloud
  21. 21. has_many :definitions
  22. 22. infrastructure as a service
  23. 23. precursors
  24. 24. virtualization
  25. 25. service oriented architecture
  26. 26. distributed computing
  27. 27. Compute: Amazon Elastic Compute Cloud (EC2), Elastic Load Balancing, Auto Scaling. Storage: Amazon Simple Storage Service (S3), AWS Import/Export. Database: Amazon RDS and SimpleDB.
  28. 28. Payments: Amazon Flexible Payments Service (FPS). On-Demand Workforce: Amazon Mechanical Turk. Parallel Processing: Amazon Elastic MapReduce. Messaging: Amazon Simple Queue Service (SQS). Content Delivery: Amazon CloudFront. Compute: Amazon Elastic Compute Cloud (EC2), Elastic Load Balancing, Auto Scaling. Storage: Amazon Simple Storage Service (S3), AWS Import/Export. Database: Amazon RDS and SimpleDB.
  29. 29. Isolated Networks: Amazon Virtual Private Cloud. Monitoring: Amazon CloudWatch. Management: AWS Management Console. Tools: AWS Toolkit for Eclipse. Payments: Amazon Flexible Payments Service (FPS). On-Demand Workforce: Amazon Mechanical Turk. Parallel Processing: Amazon Elastic MapReduce. Messaging: Amazon Simple Queue Service (SQS). Content Delivery: Amazon CloudFront. Compute: Amazon Elastic Compute Cloud (EC2), Elastic Load Balancing, Auto Scaling. Storage: Amazon Simple Storage Service (S3), AWS Import/Export. Database: Amazon RDS and SimpleDB.
  30. 30. Your Custom Applications and Services. Isolated Networks: Amazon Virtual Private Cloud. Monitoring: Amazon CloudWatch. Management: AWS Management Console. Tools: AWS Toolkit for Eclipse. Payments: Amazon Flexible Payments Service (FPS). On-Demand Workforce: Amazon Mechanical Turk. Parallel Processing: Amazon Elastic MapReduce. Messaging: Amazon Simple Queue Service (SQS). Content Delivery: Amazon CloudFront. Compute: Amazon Elastic Compute Cloud (EC2), Elastic Load Balancing, Auto Scaling. Storage: Amazon Simple Storage Service (S3), AWS Import/Export. Database: Amazon RDS and SimpleDB.
  31. 31. scalable
  32. 32. scalable cost effective
  33. 33. scalable pay as you go cost effective
  34. 34. scalable cost effective reliable
  35. 35. scalable cost effective reliable secure
  36. 36. Amazon EC2
  37. 37. servers on demand
  38. 38. highly scalable
  39. 39. 3000 CPUs for one firm’s risk management application. Chart: Number of EC2 Instances by day, Wednesday 4/22/2009 through Tuesday 4/28/2009 (axis labels 3000 and 300), annotated "300 CPUs on weekends".
  40. 40. design for failure
  41. 41. “Everything fails, all the time” -- Werner Vogels
  42. 42. assume failure
  43. 43. assume failure design backwards
  44. 44. assume failure design backwards nothing fails
  45. 45. highly available systems
  46. 46. elastic block store
  47. 47. elastic IP
  48. 48. SQS
  49. 49. US East Region Availability Availability Zone A Zone B Availability Availability Zone C Zone D
  50. 50. data storage
  51. 51. one size does not fit all
  52. 52. Amazon S3
  53. 53. distributed object store
  54. 54. durable
  55. 55. available
  56. 56. !"#$%&'()*+ T T T
  57. 57. scalable
  58. 58. fast
  59. 59. simple
  60. 60. structured data anyone?
  61. 61. Amazon SimpleDB
  62. 62. zero administration
  63. 63. highly available
  64. 64. schema less
  65. 65. key-value store
  66. 66. Amazon Relational Database Service
  67. 67. single API call
  68. 68. MySQL database
  69. 69. automatic backup
  70. 70. scale up with API call
  71. 71. futures
  72. 72. futures master-slave replication data center failover
  73. 73. what do people do?
  74. 74. solve problems
  75. 75. > 1PB of data in S3
  76. 76. provide platforms & services
  77. 77. Platform as a Service http://heroku.com
  78. 78. Computation as a Service http://cyclecomputing.com
  79. 79. Computational Platforms sudo gem install cloud-crowd http://cyclecomputing.com http://wiki.github.com/documentcloud/cloud-crowd
  80. 80. http://cyclecomputing.com http://wiki.github.com/documentcloud/cloud-crowd
  81. 81. they do science Image: Matt Wood
  82. 82. 3.7 million classifications in just over three days ~15 million in less than a month >2.6 million clicks in 100 hours
  83. 83. Image via image editor under a CC-BY License
  84. 84. Protein Docking @ Pfizer http://bioteam.net
  85. 85. http://aws.amazon.com/publicdatasets/
  86. 86. </amazon web services>
  87. 87. anecdote
  88. 88. collaborative project
  89. 89. 800 GB
  90. 90. Image: Wikipedia Commons
  91. 91. weeks to get started
  92. 92. Image: Matt Wood
  93. 93. Image: Chris Dagdigian
  94. 94. gigabytes
  95. 95. terabytes
  96. 96. petabytes
  97. 97. really fast
  98. 98. constant flux
  99. 99. Image: Chris Dagdigian
  100. 100. data management is not data storage
  101. 101. masterclass Big data & Biology: The implications of petascale science Tuesday November 17 1:30PM - 3:00PM Room: PB253-254-257-258
  102. 102. “science data platform”
  103. 103. deliver data to applications
  104. 104. deliver data to people
  105. 105. typical informatics workflow
  106. 106. Via Christolakis under a CC-BY-NC-ND license
  107. 107. Via Argonne National Labs under a CC-BY-SA license
  108. 108. p p r a il le k Via Argonne National Labs under a CC-BY-SA license
  109. 109. Data Apps
  110. 110. Data Platform App Platform
  111. 111. Data Platform App Platform
  112. 112. Data Platform App Platform
  113. 113. Data Platform data services
  114. 114. application services App Platform
  115. 115. Scalable Data Platform Services APIs Getters Filters Savers WORK
  116. 116. must accommodate change
  117. 117. must scale
  118. 118. highly available
  119. 119. loosely coupled
  120. 120. dynamic
  121. 121. task-based resources
  122. 122. one project one set of resources
  123. 123. no waiting
  124. 124. Protein Docking @ Pfizer http://bioteam.net
  125. 125. distributed mindset
  126. 126. one approach
  127. 127. disk read/writes slow & expensive
  128. 128. data processing fast & cheap
  129. 129. distribute data parallelize reads
  130. 130. map/reduce
  131. 131. distributed data processing at scale
  132. 132. abstracting away hadoop
  133. 133. apache hive http://hadoop.apache.org/hive/
  134. 134. apache pig http://hadoop.apache.org/pig/
  135. 135. cascading http://www.cascading.org/
  136. 136. hosted hadoop service
  137. 137. hadoop easy & simple
  138. 138. Amazon Elastic MapReduce Amazon EC2 Instances End Deploy Application Hadoop Hadoop Hadoop Elastic Elastic MapReduce MapReduce Hadoop Hadoop Hadoop Notify Web Console, Command line tools Input output dataset results Input  S3   Output  S3   Get Results Input Data bucket bucket Amazon S3
  139. 139. developers develop & distribute
  140. 140. scientists/analysts consume
  141. 141. CloudBurst Catalog k-mers Collect seeds End-to-end alignment
  142. 142. Mike Schatz, University of Maryland
  143. 143. Scalable Data Platform Services APIs Getters Filters Savers WORK
  144. 144. IN CONCLUSION
  145. 145. large scale biology
  146. 146. complex multidimensional data
  147. 147. whole lot of data
  148. 148. distributed collaborations
  149. 149. new computing and data architectures
  150. 150. a solution: cloud services
  151. 151. distributed
  152. 152. scalable
  153. 153. economical
  154. 154. here today
  155. 155. Thank you! deesingh@amazon.com Twitter: @mndoci Presentation ideas from @mza, James Hamilton, and @lessig