Masterworks talk on Big Data and the implications of petascale science

Published in: Technology

  1. 1. Big Data and Biology: The implications of petascale science Deepak Singh
  2. 2. Via Reavel under a CC-BY-NC-ND license
  3. 3. life science industry
  4. 4. Credit: Bosco Ho
  5. 5. By ~Prescott under a CC-BY-NC license
  6. 6. data
  7. 7. Image: Wikipedia
  8. 8. biology
  9. 9. big data
  10. 10. Source: http://www.nature.com/news/specials/bigdata/index.html
  11. 11. Image: Matt Wood
  12. 12. Human genome Image: Matt Wood
  13. 13. not just sequencing
  14. 14. Image: Ricardipus
  15. 15. more data
  16. 16. Image: Matt Wood
  17. 17. all hell breaks loose
  18. 18. ~100 TB/Week
  19. 19. ~100 TB/Week >2 PB/Year
  20. 20. years
  21. 21. weeks
  22. 22. days
  23. 23. days
  24. 24. minutes days ?
  25. 25. gigabytes
  26. 26. terabytes
  27. 27. petabytes
  28. 28. exabytes?
  29. 29. really fast
  30. 30. Image: http://www.broadinstitute.org/~apleite/photos.html
  31. 31. single lab
  32. 32. Image: Chris Dagdigian
  33. 33. implications of scale
  34. 34. data management
  35. 35. data processing
  36. 36. data sharing
  37. 37. fundamental concepts
  38. 38. 1. architecting for scale
  39. 39. “Everything fails, all the time” -- Werner Vogels
  40. 40. “Things will crash. Deal with it” -- Jeff Dean
  41. 41. “Remember everything fails” -- Randy Shoup
  42. 42. fun with numbers
  43. 43. datacenter availability
  44. 44. Source: Uptime Institute
  45. 45. Tier I: 28.8 hours annual downtime (99.67% availability); Tier II: 22.0 hours annual downtime (99.75% availability); Tier III: 1.6 hours annual downtime (99.98% availability); Tier IV: 0.8 hours annual downtime (99.99% availability). Source: Uptime Institute
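The relationship between an availability percentage and annual downtime on the slide above can be checked with a little arithmetic. The sketch below assumes a 365-day year (8,760 hours); the function name is illustrative. Because the slide's percentages are rounded to two decimals, the computed hours land close to, but not exactly on, the quoted downtime figures.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years ignored (assumption)


def annual_downtime_hours(availability_percent: float) -> float:
    """Expected hours of downtime per year for a given availability percentage."""
    return (1.0 - availability_percent / 100.0) * HOURS_PER_YEAR


# Availability figures as quoted on the slide (Source: Uptime Institute).
TIERS = {"Tier I": 99.67, "Tier II": 99.75, "Tier III": 99.98, "Tier IV": 99.99}

for tier, pct in TIERS.items():
    print(f"{tier}: {annual_downtime_hours(pct):.2f} hours/year")
```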
  46. 46. cooling systems go down
  47. 47. power units fail
  48. 48. 2-4% of servers will die annually Source: Jeff Dean, LADIS 2009
  49. 49. 1-5% of disk drives will die every year Source: Jeff Dean, LADIS 2009
  50. 50. 2.3% AFR in population of 13,250 3.3% AFR in population of 22,400 4.2% AFR in population of 246,000 Source: James Hamilton
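To make the annualized failure rates (AFR) above concrete, the following sketch turns a rate and a fleet size into an expected number of failures per year, and into the chance of seeing at least one failure. It assumes failures are independent, which is a simplification not stated on the slides; the function names are illustrative.

```python
def expected_failures(population: int, afr: float) -> float:
    """Expected number of failed units per year at the given AFR (0.023 = 2.3%)."""
    return population * afr


def prob_at_least_one_failure(population: int, afr: float) -> float:
    """Chance that at least one unit fails in a year, assuming independent failures."""
    return 1.0 - (1.0 - afr) ** population


# Populations and AFRs as quoted on the slide (Source: James Hamilton).
for population, afr in [(13_250, 0.023), (22_400, 0.033), (246_000, 0.042)]:
    print(f"{population:>7} units at {afr:.1%} AFR: "
          f"about {expected_failures(population, afr):,.0f} failures per year")

# Even a small rack of 100 drives at a 2% AFR will very likely lose a drive.
print(f"{prob_at_least_one_failure(100, 0.02):.0%} chance of at least one failure")
```

At this scale, hardware failure is a routine event rather than an exception, which is the point of the quotes earlier in the talk.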
  51. 51. software breaks
  52. 52. human errors
  53. 53. human errors ~20% admin issues have unintended consequences Source: James Hamilton
  54. 54. achieving scalability and availability
  55. 55. partitioning
  56. 56. redundancy
  57. 57. recovery oriented computing Source: http://perspectives.mvdirona.com/, http://roc.cs.berkeley.edu/
  58. 58. assume sw/hw failure
  59. 59. design apps to be resilient
  60. 60. automation
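One small, common way to "assume sw/hw failure" and "design apps to be resilient" is to retry transient errors with exponential backoff. The sketch below is an illustration only, not something taken from the talk; `call_with_retries` and `flaky` are hypothetical names, and real systems would usually retry only specific error types.

```python
import time


def call_with_retries(operation, attempts=3, base_delay=0.001, sleep=time.sleep):
    """Call operation(); on failure wait base_delay * 2**attempt, then try again.

    The last failure is re-raised once all attempts are used up.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Hypothetical dependency that fails twice before succeeding.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


result = call_with_retries(flaky)
print(result, "after", calls["count"], "calls")
```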
  61. 61. Your Custom Applications and Services. Isolated Networks: Amazon Virtual Private Cloud. Monitoring: Amazon CloudWatch. Management: AWS Management Console. Tools: AWS Toolkit for Eclipse. Payments: Amazon Flexible Payments Service (FPS). On-Demand Workforce: Amazon Mechanical Turk. Parallel Processing: Amazon Elastic MapReduce. Messaging: Amazon Simple Queue Service (SQS). Content Delivery: Amazon CloudFront. Compute: Amazon Elastic Compute Cloud (EC2), with Elastic Load Balancing and Auto Scaling. Storage: Amazon Simple Storage Service (S3), with AWS Import/Export. Database: Amazon RDS and SimpleDB.
  62. 62. Amazon S3
  63. 63. durable
  64. 64. available
  66. 66. Amazon EC2
  67. 67. highly scalable
  68. 68. 3000 CPUs for one firm’s risk management application; 300 CPUs on weekends. Chart: Number of EC2 Instances, Wednesday 4/22/2009 through Tuesday 4/28/2009
  69. 69. highly available systems
  70. 70. dynamic
  71. 71. fault tolerant
  72. 72. US East Region: Availability Zone A, Availability Zone B, Availability Zone C, Availability Zone D
  73. 73. 2. one size does not fit all
  74. 74. data 2. one size does not fit all ^
  75. 75. many data types
  76. 76. structured data
  77. 77. using the right data store
  78. 78. (a) feature first
  79. 79. RDBMS: Oracle, SQL Server, DB2, MySQL, Postgres
  80. 80. Source: http://www.bioinformaticszen.com/
  81. 81. Source: http://www.bioinformaticszen.com/
  82. 82. Source: http://www.bioinformaticszen.com/
  83. 83. use a bigger computer
  84. 84. remove joins
  85. 85. scaling limits
  86. 86. (b) scale first
  87. 87. scale is highest priority
  88. 88. single RDBMS incapable
  89. 89. solution 1: data sharding
  90. 90. 10’s
  91. 91. 100’s
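Data sharding, the first solution named above, spreads rows across many database servers by routing each key to one shard. The slides do not say which routing scheme is meant; the sketch below assumes simple hash-modulo routing, and `shard_for` is an illustrative name. A stable hash from `hashlib` is used because Python's built-in `hash()` for strings varies between runs.

```python
import hashlib
from collections import Counter


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard number in [0, num_shards) using a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


# Route 10,000 hypothetical sample IDs across 10 shards and inspect the spread.
keys = [f"sample-{i}" for i in range(10_000)]
load = Counter(shard_for(key, 10) for key in keys)
print(sorted(load.items()))
```

One known cost of this scheme is that changing the number of shards remaps most keys, which is part of why going from 10's to 100's of shards is operationally painful.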
  92. 92. solution 2: scalable key-value store
  93. 93. scale is design point: MongoDB, Project Voldemort, Cassandra, HBase, BigTable, Amazon SimpleDB, Dynamo
  94. 94. (c) simple structured storage
  95. 95. simple, fast, low ops cost: BerkeleyDB, Tokyo Cabinet, Amazon SimpleDB
  96. 96. (d) purpose optimized stores
  97. 97. data warehousing, stream processing: Aster Data, Vertica, Netezza, Greenplum, VoltDB, StreamBase
  98. 98. what about files?
  99. 99. cluster file systems: Lustre, GlusterFS
  100. 100. distributed file systems: HDFS, GFS
  101. 101. distributed object store: Amazon S3, Dynomite
  102. 102. Your Custom Applications and Services. Isolated Networks: Amazon Virtual Private Cloud. Monitoring: Amazon CloudWatch. Management: AWS Management Console. Tools: AWS Toolkit for Eclipse. Payments: Amazon Flexible Payments Service (FPS). On-Demand Workforce: Amazon Mechanical Turk. Parallel Processing: Amazon Elastic MapReduce. Messaging: Amazon Simple Queue Service (SQS). Content Delivery: Amazon CloudFront. Compute: Amazon Elastic Compute Cloud (EC2), with Elastic Load Balancing and Auto Scaling. Storage: Amazon Simple Storage Service (S3), with AWS Import/Export. Database: Amazon RDS and SimpleDB.
  103. 103. Your Custom Applications and Services. Isolated Networks: Amazon Virtual Private Cloud. Monitoring: Amazon CloudWatch. Management: AWS Management Console. Tools: AWS Toolkit for Eclipse. Payments: Amazon Flexible Payments Service (FPS). On-Demand Workforce: Amazon Mechanical Turk. Parallel Processing: Amazon Elastic MapReduce. Messaging: Amazon Simple Queue Service (SQS). Content Delivery: Amazon CloudFront. Compute: Amazon Elastic Compute Cloud (EC2), with Elastic Load Balancing and Auto Scaling. Storage: Amazon Simple Storage Service (S3), with AWS Import/Export. Database: Amazon RDS and SimpleDB.
  104. 104. 3. processing big data
  105. 105. disk read/writes slow & expensive
  106. 106. data processing fast & cheap
  107. 107. distribute the data parallel reads
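The argument for distributing data and reading it in parallel comes down to simple arithmetic. The sketch below uses assumed numbers that are not from the slides: a 1 TB dataset and 100 MB/s of sequential read throughput per disk, with reads that scale perfectly across disks.

```python
def scan_time_seconds(data_bytes: int, disks: int, bytes_per_second_per_disk: int) -> float:
    """Time to read the whole dataset when it is spread evenly over `disks` disks."""
    return data_bytes / (disks * bytes_per_second_per_disk)


ONE_TB = 10 ** 12            # assumed dataset size
DISK_THROUGHPUT = 10 ** 8    # assumed 100 MB/s sequential read per disk

for disks in (1, 10, 100):
    seconds = scan_time_seconds(ONE_TB, disks, DISK_THROUGHPUT)
    print(f"{disks:>3} disk(s): {seconds:,.0f} s ({seconds / 3600:.2f} h)")
```

Under these assumptions one disk needs close to three hours for a full scan, while a hundred disks reading in parallel need under two minutes.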
  108. 108. data processing for the cloud
  109. 109. distributed file system (HDFS)
  110. 110. map/reduce
  111. 111. Via Cloudera under a Creative Commons License
  112. 112. Via Cloudera under a Creative Commons License
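The map/reduce model named above can be sketched in a few lines of plain Python. This is a single-process illustration of the three phases (map, shuffle, reduce), not Hadoop code; the function names and the k-mer counting task are chosen for illustration. In a real cluster each phase runs in parallel across machines and the framework handles the shuffle.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record and stream out (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def kmer_mapper(read, k=3):
    """Emit (k-mer, 1) for every length-k substring of a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k], 1


def count_reducer(key, values):
    return sum(values)


reads = ["ACGTAC", "GTACGT"]  # toy input
counts = reduce_phase(shuffle(map_phase(reads, kmer_mapper)), count_reducer)
print(counts)
```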
  113. 113. fault tolerance
  114. 114. massive scalability
  115. 115. petabyte scale
  116. 116. hosted hadoop service
  117. 117. hadoop easy and simple
  118. 118. Amazon Elastic MapReduce (diagram): input data is placed in an input S3 bucket; the application is deployed through the web console or command line tools; Elastic MapReduce runs Hadoop on Amazon EC2 instances; results are written to an output S3 bucket; the user is notified at the end and gets the results from Amazon S3
  119. 119. back to the science
  120. 120. basic informatics workflow
  121. 121. Via Christolakis under a CC-BY-NC-ND license
  122. 122. Via Argonne National Labs under a CC-BY-SA license
  123. 123. killer app Via Argonne National Labs under a CC-BY-SA license
  124. 124. getting the data
  125. 125. Register projects, Register samples, Sample prep, Sequencing, Analysis. These slides cover work presented by Matt Wood at various conferences
  126. 126. Image: Matt Wood
  127. 127. constant change
  128. 128. flexible data capture
  129. 129. virtual fields
  130. 130. no schema
  131. 131. specify at run time
  132. 132. specify at run time (bootstrapping)
  133. 133. Sample: Name, Organism, Concentration. Source: Matt Wood
  134. 134. Source: Matt Wood
  135. 135. key value pairs
  136. 136. change happens
  137. 137. V1 Sample: Name, Organism, Concentration. V2 Sample: Name, Organism, Concentration, Origin, Quality metric. Source: Matt Wood
  138. 138. Source: Matt Wood
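The idea of flexible data capture with no fixed schema, fields specified at run time, and key value pairs can be illustrated with plain dictionaries. This is a minimal sketch, not the system described in the talk: the field names follow the V1 and V2 sample slides, while the values and the `project` helper are hypothetical.

```python
def project(record, field_names):
    """Return the requested fields, using None for any the record lacks."""
    return {name: record.get(name) for name in field_names}


# A V1 record and a V2 record live side by side; no migration is needed
# when the V2 fields (Origin, Quality metric) appear.
samples = [
    {"Name": "sample-001", "Organism": "S. cerevisiae", "Concentration": 12.5},
    {"Name": "sample-002", "Organism": "C. elegans", "Concentration": 8.0,
     "Origin": "lab-A", "Quality metric": 0.97},
]

# The fields of interest are chosen at run time, not baked into a schema.
fields = ["Name", "Origin"]
rows = [project(sample, fields) for sample in samples]
print(rows)
```

The trade-off is that readers of the data must tolerate missing fields, which is what the `None` default stands in for here.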
  139. 139. high throughput
  140. 140. lots of pipelines
  141. 141. scaling projects/pipelines?
  142. 142. lots of apps
  143. 143. loosely coupled
  144. 144. automation
  145. 145. scale operationally
  146. 146. be agile
  147. 147. now what?
  148. 148. Via asklar under a CC-BY license
  149. 149. Via Argonne National Labs under a CC-BY-SA license
  150. 150. many data types
  151. 151. changing data types
  152. 152. Shaq Image: Keith Allison under a CC-BY-SA license
  153. 153. Shaq Image: Keith Allison under a CC-BY-SA license
  154. 154. Shaq Image: Keith Allison under a CC-BY-SA license
  155. 155. Shaq Image: Keith Allison under a CC-BY-SA license
  156. 156. Shaq Image: Keith Allison under a CC-BY-SA license
  157. 157. ?
  158. 158. lots and lots and lots and lots and lots and lots of data and lots and lots of lots of data
  159. 159. By bitterlysweet under a CC-BY-NC-ND license
  160. 160. Source: http://bit.ly/anderson-bigdata
  161. 161. Chris Anderson doesn’t understand science
  162. 162. “more is different”
  163. 163. few data points
  164. 164. elaborate models
  165. 165. the unreasonable effectiveness of data Source: “The Unreasonable Effectiveness of Data”, Alon Halevy, Peter Norvig, and Fernando Pereira
  166. 166. simple models lots of data
  167. 167. information platform
  168. 168. information platforms at scale
  169. 169. one organization
  170. 170. 4 TB daily added (compressed)
  171. 171. 135 TB data scanned daily (compressed)
  172. 172. 15 PB data total capacity
  173. 173. ???
  174. 174. Facebook data from Ashish Thusoo’s HadoopWorld 2009 talk
  175. 175. not always that big
  176. 176. can we learn any lessons? Source: “Information Platforms and the Rise of the Data Scientist”, Jeff Hammerbacher in Beautiful Data
  177. 177. analytics platform
  178. 178. Data warehouse
  179. 179. Data warehouse is a repository of an organization's electronically stored data. Data warehouses are designed to facilitate reporting and analysis
  180. 180. ETL
  181. 181. extract
  182. 182. transform
  183. 183. load
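The three ETL steps above can be shown end to end with the Python standard library. This is a toy sketch under assumed inputs: the CSV text, the table layout, and the rule of dropping rows without a concentration are all invented for illustration, and an in-memory SQLite database stands in for the data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; the second row is missing its concentration.
RAW = """name,organism,concentration
sample-001,S. cerevisiae,12.5
sample-002,C. elegans,
"""


def extract(text):
    """Extract: parse the raw CSV text into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop incomplete rows and convert concentration to a float."""
    return [
        (row["name"], row["organism"], float(row["concentration"]))
        for row in rows
        if row["concentration"]
    ]


def load(rows, conn):
    """Load: write the cleaned rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS samples "
        "(name TEXT, organism TEXT, concentration REAL)"
    )
    conn.executemany("INSERT INTO samples VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW)), conn)
print(conn.execute("SELECT * FROM samples").fetchall())
```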
  184. 184. Via asklar under a CC-BY license
  185. 185. 1 TB
  186. 186. MySQL --> Oracle
  187. 187. more data
  188. 188. more data types
  189. 189. changing data types
  190. 190. limit data warehouse
  191. 191. too limited
  192. 192. how do you scale and adapt?
  193. 193. 100’s of TBs
  194. 194. 1000’s of jobs
  195. 195. back to the science
  196. 196. back in the day
  197. 197. small data sets
  198. 198. flat files
  199. 199. ../ ../folder1/ ../folder2/ . . . ../folderN/ file1 file2 . . fileN
  200. 200. shared file system
  201. 201. RDBMS
  202. 202. Image: Wikimedia Commons
  203. 203. Image: Chris Dagdigian
  204. 204. need to process
  205. 205. need to analyze
  206. 206. 100’s of TBs
  207. 207. 1000’s of jobs
  208. 208. Facebook data from Ashish Thusoo’s HadoopWorld 2009 talk
  209. 209. ETL
  210. 210. Via asklar under a CC-BY license
  211. 211. data mining & analytics
  212. 212. Via Argonne National Labs under a CC-BY-SA license
  213. 213. analysts are not programmers
  214. 214. not savvy with map/reduce
  215. 215. apache hive http://hadoop.apache.org/hive/
  216. 216. manage & query data
  217. 217. manage & query data on top of Hadoop
  218. 218. work by @peteskomoroch
  219. 219. cascading http://www.cascading.org/
  220. 220. apache pig http://hadoop.apache.org/pig/
  221. 221. Amazon Elastic MapReduce (diagram): input data is placed in an input S3 bucket; the application is deployed through the web console or command line tools; Elastic MapReduce runs Hadoop on Amazon EC2 instances; results are written to an output S3 bucket; the user is notified at the end and gets the results from Amazon S3
  222. 222. hadoop and bioinformatics
  223. 223. High Throughput Sequence Analysis Mike Schatz, University of Maryland
  224. 224. Short Read Mapping
  225. 225. Seed & Extend: good alignments must have a significant exact alignment. Minimal exact alignment length = l/(k+1)
  226. 226. Seed & Extend: good alignments must have a significant exact alignment. Minimal exact alignment length = l/(k+1). Expensive to scale
  227. 227. Seed & Extend: good alignments must have a significant exact alignment. Minimal exact alignment length = l/(k+1). Expensive to scale
  228. 228. Seed & Extend: good alignments must have a significant exact alignment. Minimal exact alignment length = l/(k+1). Expensive to scale. Need parallelization framework
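The minimal exact alignment length l/(k+1) follows from a pigeonhole argument: if a read of length l is cut into k+1 equal chunks and the alignment has at most k mismatches, at least one chunk must match exactly. The sketch below takes l to be the read length and k the number of allowed mismatches, which is the usual reading but is not spelled out on the slides. It handles substitutions only; insertions and deletions shift positions and need more care. The function names and sequences are illustrative.

```python
def min_seed_length(read_length: int, max_mismatches: int) -> int:
    """Length of the exact match guaranteed by the pigeonhole argument."""
    return read_length // (max_mismatches + 1)


def split_into_seeds(read: str, max_mismatches: int):
    """Cut the read into max_mismatches + 1 consecutive seeds of equal length."""
    size = min_seed_length(len(read), max_mismatches)
    return [read[i * size:(i + 1) * size] for i in range(max_mismatches + 1)]


def has_exact_seed(read: str, reference_window: str, max_mismatches: int) -> bool:
    """True if some seed matches the reference at the same offset."""
    size = min_seed_length(len(read), max_mismatches)
    return any(
        read[i * size:(i + 1) * size] == reference_window[i * size:(i + 1) * size]
        for i in range(max_mismatches + 1)
    )


read = "ACGTACGTACGT"        # toy read, l = 12
reference = "ACCTACGTACGA"   # same window with 2 substitutions
print(min_seed_length(len(read), 2))
print(split_into_seeds(read, 2))
print(has_exact_seed(read, reference, 2))
```

Seeds found this way are then extended into full alignments, which is the step that becomes expensive at scale.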
  229. 229. CloudBurst: Catalog k-mers, Collect seeds, End-to-end alignment
  230. 230. http://cloudburst-bio.sourceforge.net; Bioinformatics 2009 25: 1363-1369
  231. 231. Bowtie: Ultrafast short read aligner Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10 (3): R25.
  232. 232. SOAPSnp: Consensus alignment and SNP calling. Ruiqiang Li, Yingrui Li, Xiaodong Fang, et al. (2009) "SNP detection for massively parallel whole-genome resequencing" Genome Res
  233. 233. Crossbow: Rapid whole genome SNP analysis Ben Langmead http://bowtie-bio.sourceforge.net/crossbow/index.shtml
  234. 234. Preprocessed reads
  235. 235. Preprocessed reads Map: Bowtie
  236. 236. Preprocessed reads Map: Bowtie Sort: Bin and partition
  237. 237. Preprocessed reads Map: Bowtie Sort: Bin and partition Reduce: SoapSNP
  238. 238. Crossbow condenses over 1,000 hours of resequencing computation into a few hours without requiring the user to own or operate a computer cluster
  239. 239. Comparing Genomes
  240. 240. Estimating relative evolutionary rates from sequence comparisons: Identification of probable orthologs. Admissible comparisons: A or B vs. D; C vs. E. Inadmissible comparisons: A or B vs. E; C vs. D. (Figure: species tree and gene tree for genes A, B, C, D, E in S. cerevisiae and C. elegans)
  241. 241. Estimating relative evolutionary rates from sequence comparisons: 1. Orthologs found using the Reciprocal smallest distance algorithm. 2. Build alignment between two orthologs: >Sequence C MSGRTILASTIAKPFQEEVTKAVKQLNFT-----PKLVGLLSNEDPAAKMYANWTGKTCESLGFKYEL-… >Sequence E MSGRTILASKVAETFNTEIINNVEEYKKTHNGQGPLLVGFLANNDPAAKMYATWTQKTSESMGFRYDL… 3. Estimate distance given a substitution matrix. (Figure: substitution matrix over Phe, Ala, Pro, Leu, Thr; species tree and gene tree for S. cerevisiae and C. elegans)
  242. 242. RSD algorithm summary (figure): Genome I and Genome J, sequences Ib and Jc; align sequences & calculate distances in each direction, with candidate distances such as D=1.2, D=0.2, D=0.1, D=0.3, D=0.9; Orthologs: ib - jc, D = 0.1
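The reciprocal step at the heart of the RSD summary can be sketched as follows: a pair is reported only when each sequence is the other's closest match across the two genomes. This is a simplified illustration, not the published implementation, which first gathers candidate hits by sequence search and estimates distances from alignments. The sequence labels and distance values below are hypothetical.

```python
def closest(query, candidates, distance):
    """Return the candidate with the smallest distance to the query."""
    return min(candidates, key=lambda candidate: distance(query, candidate))


def reciprocal_smallest_distance(genome_i, genome_j, distance):
    """Pairs (i, j, d) where j is closest to i and i is also closest to j."""
    orthologs = []
    for seq_i in genome_i:
        best_j = closest(seq_i, genome_j, distance)
        back_i = closest(best_j, genome_i, distance)
        if back_i == seq_i:
            orthologs.append((seq_i, best_j, distance(seq_i, best_j)))
    return orthologs


# Hypothetical pairwise distances between sequences of genome I and genome J.
D = {
    ("Ia", "Ja"): 0.2, ("Ia", "Jb"): 1.2, ("Ia", "Jc"): 0.9,
    ("Ib", "Ja"): 1.1, ("Ib", "Jb"): 0.8, ("Ib", "Jc"): 0.1,
    ("Ic", "Ja"): 0.3, ("Ic", "Jb"): 0.9, ("Ic", "Jc"): 0.6,
}


def lookup(a, b):
    """Symmetric lookup into the distance table."""
    return D[(a, b)] if (a, b) in D else D[(b, a)]


pairs = reciprocal_smallest_distance(["Ia", "Ib", "Ic"], ["Ja", "Jb", "Jc"], lookup)
print(pairs)
```

In this toy table Ic is closest to Ja, but Ja is closer to Ia, so Ic gets no ortholog; the reciprocal check is what filters such one-way matches.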
  243. 243. Prof. Dennis Wall Harvard Medical School
  244. 244. Roundup is a database of orthologs and their evolutionary distances. To get started, click browse. Alternatively, you can read our documentation here. Good luck, researchers!
  245. 245. massive computational demand
  246. 246. 1000 genomes = 5,994,000 processes = 23,976,000 hours
  247. 247. 2737 years
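The figures on the two slides above are mutually consistent, as a quick calculation shows. The factors of 12 processes per genome pair and 4 hours per process are not stated on the slides; they are inferred here by dividing the quoted totals, and should be read as assumptions.

```python
genomes = 1000
pairs = genomes * (genomes - 1) // 2   # every genome compared with every other

processes_per_pair = 12   # inferred: 5,994,000 processes / 499,500 pairs
hours_per_process = 4     # inferred: 23,976,000 hours / 5,994,000 processes

processes = pairs * processes_per_pair
hours = processes * hours_per_process
years = hours / (24 * 365)

print(f"{pairs:,} pairs, {processes:,} processes, {hours:,} hours, about {years:,.0f} years")
```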
  248. 248. compared 50+ genomes
  249. 249. trends in data sharing
  250. 250. data motion is hard
  251. 251. cloud services are a viable dataspace
  252. 252. share data
  253. 253. share applications
  254. 254. share results
  255. 255. http://aws.amazon.com/publicdatasets/
  256. 256. Data Platform App Platform
  257. 257. Data Platform App Platform
  258. 258. Scalable Data Platform Services APIs Getters Filters Savers WORK
  259. 259. to conclude
  260. 260. big data
  261. 261. change thinking
  262. 262. data management data processing data sharing
  263. 263. think distributed
  264. 264. new software architectures
  265. 265. new computing paradigms
  266. 266. cloud services
  267. 267. the cloud works
  268. 268. Thank you! deesingh@amazon.com Twitter: @mndoci Presentation ideas from @mza, James Hamilton, and @lessig
