Clouds: All fluff and no substance?

Keynote given at BOSC, 2010.

Does the hype surrounding cloud match the reality?
Can we use them to solve the problems in provisioning IT services to support next-generation sequencing?



  1. 1. Clouds: All fluff and no substance? <ul><li>Guy Coates
  2. 2. Wellcome Trust Sanger Institute
  3. 3. [email_address] </li></ul>
  4. 4. Outline <ul><li>Background
  5. 5. Cloud: Where are we at?
  6. 6. Good Fit: Web services
  7. 7. Bad Fit: HPTC compute
  8. 8. Better fit...?
  9. 9. Data management
  10. 10. Collaboration
  11. 11. Grids </li></ul>
  12. 12. The Sanger Institute <ul><li>Funded by the Wellcome Trust. </li><ul><li>2nd largest research charity in the world.
  13. 13. ~700 employees.
  14. 14. Based in Hinxton Genome Campus, Cambridge, UK. </li></ul><li>Large scale genomic research. </li><ul><li>Sequenced 1/3 of the human genome (the largest single contributor).
  15. 15. We have active cancer, malaria, pathogen and genomic variation / human health studies. </li></ul><li>All data is made publicly available. </li><ul><li>Websites, ftp, direct database access, programmatic APIs. </li></ul></ul>
  16. 16. DNA sequencing
  17. 17. Economic Trends: <ul><li>The cost of sequencing halves every 12 months. </li><ul><li>cf Moore's Law </li></ul><li>The Human Genome Project: </li><ul><li>13 years.
  18. 18. 23 labs.
  19. 19. $500 Million. </li></ul><li>A Human genome today: </li><ul><li>3 days.
  20. 20. 1 machine.
  21. 21. $10,000.
  22. 22. Large centres are now doing studies with 1000s and 10,000s of genomes. </li></ul><li>Changes in sequencing technology are going to continue this trend. </li><ul><li>“Next-next” generation sequencers are on their way.
  23. 23. $500 genome is probable within 5 years. </li></ul></ul>
  24. 24. The scary graph Instrument upgrades Peak Yearly capillary sequencing
  25. 25. Managing Growth <ul><li>We have exponential growth in storage and compute. </li><ul><li>Storage/compute doubles every 12 months. </li><ul><li>2009 ~7 PB raw </li></ul></ul><li>Gigabase of sequence ≠ Gigabyte of storage. </li><ul><li>16 bytes per base for sequence data.
  26. 26. Intermediate analysis typically needs 10x the disk space of the raw data. </li></ul><li>Moore's law will not save us. </li><ul><li>Transistor/disk density: Td = 18 months
  27. 27. Sequencing cost: Td = 12 months </li></ul></ul>
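The mismatch in those two doubling times, and the bytes-per-base arithmetic above, can be sketched as a back-of-envelope calculation (figures taken from the slides; the 6-year horizon is an arbitrary illustration):

```python
# Back-of-envelope sketch of why Moore's law will not save us.
# Assumed doubling times: sequencing 12 months, transistor/disk 18 months.

def growth(years, doubling_months):
    """Growth factor after `years`, given a doubling time in months."""
    return 2 ** (12 * years / doubling_months)

# Over 6 years the gap compounds: demand outgrows disk density 4x.
demand = growth(6, 12)   # 64x more sequence
disk = growth(6, 18)     # 16x denser disks
print(demand / disk)     # -> 4.0

# Storage per gigabase at 16 bytes/base, plus ~10x for intermediates:
raw_gb = 1e9 * 16 / 1e9          # 16 GB of raw sequence data
total_gb = raw_gb * (1 + 10)     # ~176 GB including intermediate analysis
print(raw_gb, total_gb)
```

However the constants are tweaked, exponential demand against slower exponential supply means the gap itself grows exponentially.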
  28. 28. Cloud: Where are we at?
  29. 29. What is cloud? <ul>Informatician's view: <ul><li>On demand, virtual machines. </li></ul><ul><li>Root access, total ownership. </li></ul><ul><li>Pay-as-you-go model. </li></ul><li>Upper management view: </li><ul><li>“Free” compute we can use to solve all of the hard problems thrown up by new sequencing. </li><ul><li>(8 cents/hour is almost free, right...?) </li></ul><li>Twatter/friendface use it, so it must be good. </li></ul></ul>
  30. 30. Hype Cycle Awesome! Just works...
  31. 31. Lost in the clouds...
  32. 32. Victory!
  33. 33. Where are we? ? ? ?
  34. 34. Where are we? <ul><li>We currently have three areas of activity: </li><ul><li>Web presence </li></ul><ul><li>HPTC workload </li></ul><ul><li>Active Data Warehousing </li></ul></ul>
  35. 35. Ensembl <ul><li>Ensembl is a system for genome annotation.
  36. 36. Data visualisation (Web Presence) </li><ul><li>www.ensembl.org
  37. 37. Provides web / programmatic interfaces to genomic data.
  38. 38. 10k visitors / 126k page views per day. </li></ul><li>Compute Pipeline (HPTC Workload) </li><ul><li>Take a raw genome and run it through a compute pipeline to find genes and other features of interest.
  39. 39. Ensembl at Sanger/EBI provides automated analysis for 51 vertebrate genomes. </li></ul><ul><li>Software is Open Source (apache license).
  40. 40. Data is free for download. </li></ul><li>We have done cloud experiments with both the web site and pipeline. </li></ul>
  41. 41. Ensembl Website
  42. 43. Web Presence <ul><li>Ensembl has a worldwide audience.
  43. 44. Historically, web site performance was not great. </li><ul><li>Pages were quite heavyweight.
  44. 45. Not properly cached etc. </li></ul><li>Web team spent a long time re-designing the code to make it more streamlined. </li><ul><li>Greatly improved performance. </li></ul><li>Coding can only get you so far. </li><ul><li>If we want the website to be responsive, we need low latency.
  45. 46. “A canna' change the laws of physics.” </li><ul><li>240 ms round trip time. </li></ul><li>We need a set of geographically dispersed mirrors. </li></ul></ul>
  46. 47. uswest.ensembl.org <ul><li>Traditional mirror: Real machines in a co-lo facility in California.
  47. 48. Hardware was initially configured on site. </li><ul><li>16 servers, SAN storage, SAN switches, SAN management appliance, Ethernet switches, firewall, out-of-band management etc etc. </li></ul><li>Shipped to the co-lo for installation. </li><ul><li>Sent a person to California for 3 weeks.
  48. 49. Spent 1 week getting stuff into/out of customs. </li><ul><li>****ing FCC paperwork! </li></ul></ul><li>Additional infrastructure work. </li><ul><li>VPN between UK and US. </li></ul><li>Incredibly time consuming. </li><ul><li>Really don't want to end up having to send someone on a plane to the US to fix things. </li></ul></ul>
  49. 50. Usage <ul><li>Geo-IP database to point people to the nearest mirror:
  50. 51. US-West currently takes ~1/3 rd of total Ensembl web traffic. </li><ul><li>Latency down from XXXMs to XXms. </li></ul></ul>
  51. 52. Usage
  52. 53. What has this got to do with clouds?
  53. 54. useast.ensembl.org <ul><li>We want an east coast US mirror to complement our west coast mirror.
  54. 55. Built the mirror in AWS. </li><ul><li>Initially a proof of concept / test-bed for virtual co-location.
  55. 56. Plan for production “real soon now”. </li></ul></ul>
  56. 57. Building a mirror on AWS <ul><li>No physical hardware. </li><ul><li>Work can start as soon as we enter our credit card numbers... </li></ul><li>Some software development / sysadmin work needed. </li><ul><li>Preparation of OS images, software stack configuration.
  57. 58. West-coast was built as an extension of Sanger internal network via VPN.
  58. 59. AWS images built as standalone systems. </li></ul><li>Significant amount of tuning required. </li><ul><li>Initial mysql performance was pretty bad, especially for the large ensembl databases. (~1TB).
  59. 60. Lots of people doing Apache/mysql on AWS, so there is a good amount of best-practice etc available. </li></ul></ul>
  60. 61. Does it work?
  61. 62. Is it cost effective? <ul><li>Lots of misleading cost statements made about cloud. </li><ul><li>“Our analysis only cost $500.”
  62. 63. “$0.085 / hr”. </li></ul><li>What are we comparing against? </li><ul><li>Doing the analysis once? Continually?
  63. 64. Buying a $2000 server?
  64. 65. Leasing a $2000 server for 3 years?
  65. 66. Using $150 of time at your local supercomputing facility?
  66. 67. Buying a $2000 server but having to build a $1M datacentre to put it in? </li></ul><li>Requires the dreaded Total Cost of Ownership (TCO) calculation. </li><ul><li>hardware + power + cooling + facilities + admin/developers etc </li><ul><li>Incredibly hard to do. </li></ul></ul></ul>
  67. 68. Let's do it anyway... <ul><li>Comparing costs to the co-lo is simpler. </li><ul><li>power, cooling costs are all included.
  68. 69. Admin costs are the same, so we can ignore them. </li><ul><li>Same people responsible for both. </li></ul></ul><li>Cost for Co-location facility: </li><ul><li>$120,000 hardware + $51,000 /yr colo.
  69. 70. $91,000 per year (3 years hardware lifetime). </li></ul><li>Cost for AWS : </li><ul><li>$77,000 per year. </li></ul><li>Result: Estimated 16% cost saving. </li><ul><li>Good saving.
  70. 71. It is not free! </li></ul></ul>
  71. 72. Additional Benefits <ul><li>No need to deal with “real” hardware. </li><ul><li>Faster implementation.
  72. 73. No need to ship server or deal with US customs. </li></ul><li>“Free” hardware upgrades. </li><ul><li>As faster machines become available we can take advantage of them immediately.
  73. 74. No need to get tin decommissioned /re-installed at Co-lo. </li></ul><li>Website + code is packaged together. </li><ul><li>Can be conveniently given away to end users in a “ready-to-run” config.
  74. 75. Simplifies configuration for other users wanting to run Ensembl sites.
  75. 76. Configuring an Ensembl site is non-trivial for non-informaticians. </li><ul><li>CVS, mysql setup, apache configuration etc. </li></ul></ul></ul>
  76. 77. Added benefits
  77. 78. Downsides <ul><li>Packaging OS images and codes did take longer than expected. </li><ul><li>Most of the web-code refactoring to make it “mirror ready” had been done for the initial real colo. </li></ul><li>This needs to be re-done every ensembl release. </li><ul><li>Now part of the ensembl software release process. </li></ul><li>Management overhead does not necessarily go down. </li><ul><li>But it does change. </li></ul></ul>
  78. 79. Going forward <ul><li>Expect mirror to go live later this year. </li><ul><li>Far-east Amazon availability zone is also of interest. </li><ul><li>No timeframe so far. </li></ul></ul><li>“Virtual” Co-location concept will be useful for a number of other projects. </li><ul><li>Other Sanger websites? </li></ul><li>Disaster recovery. </li><ul><li>Eg replicate critical databases / storage into AWS. </li></ul></ul>
  79. 80. Hype Cycle Web services
  80. 81. Ensembl Pipeline <ul><li>HPTC element of Ensembl. </li><ul><li>Takes raw genomes and lays annotation on top. </li></ul></ul>
  81. 82. Compute Pipeline TCCTCTCTTTATTTTAGCTGGACCAGACCAATTTTGAGGAAAGGATACAGACAGCGCCTG GAATTGTCAGACATATACCAAATCCCTTCTGTTGATTCTGCTGACAATCTATCTGAAAAA TTGGAAAGGTATGTTCATGTACATTGTTTAGTTGAAGAGAGAAATTCATATTATTAATTA TTTAGAGAAGAGAAAGCAAACATATTATAAGTTTAATTCTTATATTTAAAAATAGGAGCC AAGTATGGTGGCTAATGCCTGTAATCCCAACTATTTGGGAGGCCAAGATGAGAGGATTGC TTGAGACCAGGAGTTTGATACCAGCCTGGGCAACATAGCAAGATGTTATCTCTACACAAA ATAAAAAAGTTAGCTGGGAATGGTAGTGCATGCTTGTATTCCCAGCTACTCAGGAGGCTG AAGCAGGAGGGTTACTTGAGCCCAGGAGTTTGAGGTTGCAGTGAGCTATGATTGTGCCAC TGCACTCCAGCTTGGGTGACACAGCAAAACCCTCTCTCTCTAAAAAAAAAAAAAAAAAGG AACATCTCATTTTCACACTGAAATGTTGACTGAAATCATTAAACAATAAAATCATAAAAG AAAAATAATCAGTTTCCTAAGAAATGATTTTTTTTCCTGAAAAATACACATTTGGTTTCA GAGAATTTGTCTTATTAGAGACCATGAGATGGATTTTGTGAAAACTAAAGTAACACCATT ATGAAGTAAATCGTGTATATTTGCTTTCAAAACCTTTATATTTGAATACAAATGTACTCC
  82. 83. Raw Sequence -> Something useful
  83. 84. Example annotation
  84. 85. Gene Finding DNA HMM Prediction Alignment with known proteins Alignment with fragments recovered in vivo Alignment with other genes and other species
  85. 86. Compute Pipeline <ul><li>Architecture: </li><ul><li>OO perl pipeline manager.
  86. 87. Core algorithms are C.
  87. 88. 200 auxiliary binaries. </li></ul><li>Workflow: </li><ul><li>Investigator describes analysis at high level.
  88. 89. Pipeline manager splits the analysis into parallel chunks. </li><ul><li>Typically 50k-100k jobs. </li></ul><li>Sorts out the dependencies and then submits jobs to a DRM. </li><ul><li>Typically LSF or SGE. </li></ul><li>Pipeline state and results are stored in a mysql database. </li></ul><li>Workflow is embarrassingly parallel. </li><ul><li>Integer, not floating point.
  89. 90. 64 bit memory address is nice, but not required. </li><ul><li>64 bit file access is required. </li></ul><li>Single threaded jobs.
  90. 91. Very IO intensive. </li></ul></ul>
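The chunk-and-submit pattern above can be sketched in a few lines. This is an illustration only, not the actual Ensembl pipeline (which is OO perl); the script name and queue are made up, and real dependency handling is the pipeline manager's job:

```python
# Illustrative sketch of an embarrassingly parallel workflow: split the
# analysis into chunks and emit one DRM (LSF-style) submission per chunk.
# "run_analysis.pl" and the "normal" queue are hypothetical names.

def make_chunks(n_items, chunk_size):
    """Yield (start, end) half-open slices covering n_items."""
    for start in range(0, n_items, chunk_size):
        yield start, min(start + chunk_size, n_items)

def lsf_commands(n_items, chunk_size, script="run_analysis.pl"):
    """One bsub command per chunk; single-threaded integer jobs."""
    return [
        f"bsub -q normal {script} --start {s} --end {e}"
        for s, e in make_chunks(n_items, chunk_size)
    ]

# e.g. 100,000 input slices in chunks of 1,000 -> 100 jobs
jobs = lsf_commands(100_000, 1_000)
print(len(jobs))   # -> 100
```

Because each chunk is independent, throughput scales with whatever core count the scheduler can hand out, which is exactly what makes this workload look cloud-friendly at first glance.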
  91. 92. Running the pipeline in practice <ul><li>Requires a significant amount of domain knowledge.
  92. 93. Software install is complicated. </li><ul><li>Lots of perl modules and dependencies.
  93. 94. Apache wrangling if you want to run a website. </li></ul><li>Need a well-tuned compute cluster. </li><ul><li>Pipeline takes ~500 CPU days for a moderate genome. </li><ul><li>Ensembl chewed up 160k CPU days last year. </li></ul><li>Code is IO bound in a number of places.
  94. 95. Typically need a high performance filesystem. </li><ul><li>Lustre, GPFS, Isilon, Ibrix etc. </li></ul><li>Need large mysql database. </li><ul><li>100GB-TB mysql instances, very high query load generated from the cluster. </li></ul></ul></ul>
  95. 96. Why Cloud? <ul><li>Provides a good example for testing HPTC capabilities of the cloud. </li></ul>
  96. 97. Why Cloud? <ul>Proof of concept <ul><li>Is HPTC even possible in cloud infrastructures? </li></ul><li>Coping with the big increase in data </li><ul><li>Will we be able to provision new machines/datacentre space to keep up?
  97. 98. What happens if we need to “out-source” our compute?
  98. 99. Can we be in a position to shift peaks of demand to cloud facilities? </li></ul></ul>
  99. 100. Expanding markets <ul><li>There are going to be lots of new genomes that need annotating. </li><ul><li>Sequencers moving into small labs, clinical settings.
  100. 101. Limited informatics / systems experience. </li><ul><li>Typically postdocs/PhDs who have a “real” job to do. </li></ul><li>They may want to run the genebuild pipeline on their data, but they may not have the expertise to do so. </li></ul><li>We have already done all the hard work on installing the software and tuning it. </li><ul><li>Can we package up the pipeline and put it in the cloud? </li></ul><li>Goal: End user should simply be able to upload their data, insert their credit-card number, and press “GO”. </li></ul>
  101. 102. Porting HPTC code to the cloud <ul><li>Software stack / machine image. </li><ul><li>Creating images with software is reasonably straightforward.
  102. 103. No big surprises </li></ul><li>Queuing system </li><ul><li>Pipeline requires a queueing system: (LSF/SGE)
  103. 104. Getting them to run took a lot of fiddling.
  104. 105. Machines need to find each other once they are inside the cloud.
  105. 106. Building an automated “self-discovering” cluster takes significant effort. </li><ul><li>Hopefully others can re-use it. </li></ul></ul><li>Mysql databases </li><ul><li>Lots of best practice on how to do that on EC2. </li></ul><li>But it took time, even for experienced systems people. </li><ul><li>(You will not be firing your system-administrators just yet!). </li></ul></ul>
  106. 107. The big problem... <ul><li>Data:
  107. 108. Moving data into the cloud is hard
  108. 109. Doing stuff with data once it is in the cloud is also hard
  109. 110. If you look closely, most successful cloud projects have small amounts of data (10-100 Mbytes). </li></ul>
  110. 111. Moving data is hard <ul><li>Tools: </li><ul><li>Commonly used tools (FTP, ssh/rsync) are not suited to wide-area networks.
  111. 112. WAN tools: gridFTP/FDT/Aspera. </li></ul><li>Data transfer rates (gridFTP/FDT via our 2 Gbit/s site link). </li><ul><li>Cambridge -> EC2 East coast: 12 Mbytes/s (96 Mbits/s)
  112. 113. Cambridge -> EC2 Dublin: 25 Mbytes/s (200 Mbits/s)
  113. 114. 11 hours to move 1TB to Dublin.
  114. 115. 23 hours to move 1 TB to East coast. </li></ul><li>What speed should we get? </li><ul><li>Once we leave JANET (UK academic network) finding out what the connectivity is and what we should expect is almost impossible. </li></ul><li>Are our disks fast enough? </li><ul><li>Do you have fast enough disks at each end to keep the network full? </li></ul></ul>
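The quoted transfer times follow directly from the measured rates:

```python
# Sanity-checking the slide's 1 TB transfer times from the measured rates.

def hours_to_move(tb, mbytes_per_s):
    """Hours to move `tb` terabytes at a sustained rate in Mbytes/s."""
    seconds = tb * 1e6 / mbytes_per_s   # 1 TB ~ 1e6 Mbytes
    return seconds / 3600

print(hours_to_move(1, 25))   # Dublin at 25 Mbytes/s: ~11 hours
print(hours_to_move(1, 12))   # East coast at 12 Mbytes/s: ~23 hours
```

At these rates a single 2 TB set of raw intensities is a two-day transfer to the US East coast, before anyone has computed anything.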
  115. 116. Networking <ul><li>How do we improve data transfers across the public internet? </li><ul><li>CERN approach: don't.
  116. 117. Dedicated networking has been put in between CERN and the T1 centres who get all of the CERN data. </li></ul><li>Our collaborations are different. </li><ul><li>We have relatively short lived and fluid collaborations. (1-2 years, many institutions).
  117. 118. As more labs get sequencers, our potential collaborators also increase.
  118. 119. We need good connectivity to everywhere. </li></ul></ul>
  119. 120. Moving data in the cloud <ul><li>Compute nodes need to be able to see the data.
  120. 121. No viable global filesystems on EC2. </li><ul><li>NFS has poor scaling at the best of times.
  121. 122. EC2 has poor inter-node networking. With more than 8 NFS clients, everything stops. </li></ul><li>“The cloud way”: store data in S3. </li><ul><li>Web based object store. </li><ul><li>Get, put, delete objects. </li></ul><li>Not POSIX. </li><ul><li>Code needs re-writing / forking. </li></ul><li>Limitation: cannot store objects > 5 GB. </li></ul><li>Nasty hacks: </li><ul><li>Subcloud: a commercial product that allows you to run a POSIX filesystem on top of S3. </li><ul><li>Interesting performance, and you are paying by the hour... </li></ul></ul></ul>
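The gap between the two models is easiest to see in code. This toy stub mimics only the whole-object get/put/delete semantics and the per-object size cap mentioned above; it is emphatically not the real S3 API, and the key names are invented:

```python
# Toy illustration of why S3-style stores force application changes:
# the interface is whole-object get/put/delete, not POSIX open/seek/write.
# This dict-backed stub mimics the basic semantics only.

FIVE_GB = 5 * 1024**3   # per-object cap mentioned in the talk

class ObjectStore:
    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        if len(data) > FIVE_GB:
            raise ValueError("object too large: split it yourself")
        self._objects[key] = data     # whole object replaced in one go

    def get(self, key):
        return self._objects[key]     # whole object back; no seek here

    def delete(self, key):
        del self._objects[key]

store = ObjectStore()
store.put("genomes/chr1.fa", b"TCCTCTCTTT")
print(store.get("genomes/chr1.fa"))   # -> b'TCCTCTCTTT'
```

An IO-bound pipeline that assumes it can seek into the middle of a 200 GB alignment file has no direct translation into this model; that is the re-writing/forking cost.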
  122. 123. Compute architecture: classic HPTC (CPUs on a fat network sharing a POSIX global filesystem, driven by a batch scheduler) vs cloud (CPUs on a thin network with per-node local storage, backed by hadoop/S3 data-stores).
  123. 124. Elephant in the room
  124. 125. Why not use map-reduce? <ul><li>Re-writing apps to use S3 or hadoop/HDFS is a real hurdle. </li><ul><li>Nobody wants to re-write existing applications. </li><ul><li>They already work on our compute farm. </li></ul><li>Not an issue for new apps.
  125. 126. But hadoop apps do not exist in isolation.
  126. 127. Barrier for entry seems much lower for file-systems. </li><ul><li>We have a lot of non-expert users (this is a good thing). </li></ul></ul><li>Am I being a reactionary old fart? </li><ul><li>15 years ago clusters of PCs were not real supercomputers.
  127. 128. ...then beowulf took over the world. </li></ul><li>Big difference: porting applications between the two architectures was easy. </li><ul><li>MPI/PVM etc. </li></ul><li>Will the market provide “traditional” compute clusters in the cloud? </li></ul>
  128. 129. Hype cycle HPTC
  129. 130. Where are we? <ul><li>You cannot take an existing data-rich HPTC app and expect it to work. </li><ul><li>IO architectures are too different. </li></ul><li>There is some re-factoring going on for the ensembl pipeline. </li><ul><li>Currently on a case-by-case basis.
  130. 131. For the less-data intensive parts. </li></ul></ul>
  131. 132. Shared data archives
  132. 133. Past Collaborations: data flows from many individual sequencing centres into one central sequencing centre + DCC.
  133. 134. Future Collaborations: collaborations are short term (18 months-3 years); multiple sequencing centres share data via federated access.
  134. 135. International Cancer Genome Project <ul><li>Many cancer mutations are rare. </li><ul><li>Low signal-to-noise ratio. </li></ul><li>How do we find the rare but important mutations? </li><ul><li>Sequence lots of cancer genomes. </li></ul><li>International Cancer Genome Project. </li><ul><li>Consortia of sequencing and cancer research centres in 10 countries. </li></ul><li>Aim of the consortia. </li><ul><li>Complete genomic analysis of 50 different tumor types. (50,000 genomes). </li></ul></ul>
  135. 136. Genomics Data: per-genome data sizes, ranging from unstructured flat files (used by sequencing informatics specialists) to structured databases (used by clinical researchers and non-informaticians): intensities / raw data (2 TB), sequence + quality data (500 GB), alignments (200 GB), variation data (1 GB), individual features (3 MB).
  136. 137. Sharing Unstructured data <ul><li>Large data volumes, flat files.
  137. 138. Federated access. </li><ul><li>Data is not going to be in one place.
  138. 139. Single institute will have data distributed for DR / worldwide access. </li><ul><li>Some parts of the data may be on cloud stores. </li></ul></ul><li>Controlled access. </li><ul><li>Many archives will be public.
  139. 140. Some will have patient identifiable data.
  140. 141. Plan for it now. </li></ul></ul>
  141. 142. Dark Archives <ul><li>Storing data in an archive is not particularly useful. </li><ul><li>You need to be able to access the data and do something useful with it. </li></ul><li>Data in current archives is “dark”. </li><ul><li>You can put/get data, but cannot compute across it.
  142. 143. Is data in an inaccessible archive really useful? </li></ul></ul>
  143. 144. Last week's bombshell <ul><li>“We want to run our pipeline across 100TB of data currently in EGA/SRA.”
  144. 145. We will need to de-stage the data to Sanger, and then run the compute. </li><ul><li>Extra 0.5 PB of storage, 1000 cores of compute.
  145. 146. 3 month lead time.
  146. 147. ~$1.5M capex. </li></ul></ul>
  147. 148. Cloud / Computable archives <ul><li>Can we move the compute to the data? </li><ul><li>Upload workload onto VMs.
  148. 149. Put VMs on compute that is “attached” to the data. </li></ul></ul>Data CPU CPU CPU CPU Data CPU CPU CPU CPU VM
  149. 150. Practical Hurdles
  150. 151. Where does it live? <ul><li>Most of us are funded to hold data, not to fund everyone else's compute costs too. </li><ul><li>Now need to budget for raw compute power as well as disk.
  151. 152. Implement virtualisation infrastructure, billing etc. </li><ul><li>Are you legally allowed to charge?
  152. 153. Who underwrites it if nobody actually uses your service? </li></ul></ul><li>Strongly implies data has to be held on a commercial provider. </li><ul><li>Amazon etc already have billing infrastructures; why not use them?
  153. 154. Directly exposed to costs. </li><ul><li>Is the service cost effective? </li></ul></ul></ul>
  154. 155. Identity management <ul><li>Which identity management system to use for controlled access?
  155. 156. Culture shock.
  156. 157. Lots of solutions: </li><ul><ul><li>OpenID, Shibboleth (ASPIS), Globus/X.509 etc. </li></ul></ul><li>What features are important? </li><ul><li>How much security?
  157. 158. Single sign on?
  158. 159. Delegated authentication? </li></ul><li>Finding consensus will be hard. </li></ul>
  159. 160. Networking: <ul>We still need to get data in. <ul><li>Fixing the internet is not going to be cost effective for us. </li></ul><li>Fixing the internet may be cost effective for big cloud providers. </li><ul><li>Core to their business model.
  160. 161. All we need to do is get data into Amazon, and then everyone else can get the data from there. </li></ul><li>Do we invest in fast links to Amazon? </li><ul><li>It changes the business dynamic.
  161. 162. We have effectively tied ourselves to a single provider. </li></ul></ul>
  162. 163. Summary
  163. 164. Acknowledgements <ul><li>Phil Butcher
  164. 165. ISG Team </li><ul><li>James Beal
  165. 166. Gen-Tao Chiang
  166. 167. Pete Clapham
  167. 168. Simon Kelley </li></ul><li>1k Genomes Project </li><ul><li>Thomas Keane
  168. 169. Jim Stalker </li></ul><li>Cancer Genome Project </li><ul><li>Adam Butler
  169. 170. John Teague </li></ul></ul>
  170. 171. Backup
