Masterworks talk on Big Data and the implications of petascale science
  • 2,822 views
Published in Technology

Transcript

  • 1. Big Data and Biology: The implications of petascale science. Deepak Singh
  • 2. Via Reavel under a CC-BY-NC-ND license
  • 3. life science industry
  • 4. Credit: Bosco Ho
  • 5. By ~Prescott under a CC-BY-NC license
  • 6. data
  • 7. Image: Wikipedia
  • 8. biology
  • 9. big data
  • 10. Source: http://www.nature.com/news/specials/bigdata/index.html
  • 11. Image: Matt Wood
  • 12. Human genome Image: Matt Wood
  • 13. not just sequencing
  • 14. Image: Ricardipus
  • 15. more data
  • 16. Image: Matt Wood
  • 17. all hell breaks loose
  • 18. ~100 TB/Week
  • 19. ~100 TB/Week >2 PB/Year
  • 20. years
  • 21. weeks
  • 22. days
  • 23. days
  • 24. days → minutes?
  • 25. gigabytes
  • 26. terabytes
  • 27. petabytes
  • 28. exabytes?
  • 29. really fast
  • 30. Image: http://www.broadinstitute.org/~apleite/photos.html
  • 31. single lab
  • 32. Image: Chris Dagdigian
  • 33. implications of scale
  • 34. data management
  • 35. data processing
  • 36. data sharing
  • 37. fundamental concepts
  • 38. 1. architecting for scale
  • 39. “Everything fails, all the time” -- Werner Vogels
  • 40. “Things will crash. Deal with it” -- Jeff Dean
  • 41. “Remember everything fails” -- Randy Shoup
  • 42. fun with numbers
  • 43. datacenter availability
  • 44. Source: Uptime Institute
  • 45. Tier I: 28.8 hours annual downtime (99.67% availability). Tier II: 22.0 hrs annual downtime (99.75% availability). Tier III: 1.6 hrs annual downtime (99.98% availability). Tier IV: 0.8 hrs annual downtime (99.99% availability). Source: Uptime Institute
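The tier figures above are just availability percentages converted into hours of downtime per year. A minimal sketch of that conversion (note the quoted hours track a straight 8,760-hour year only approximately, since the Uptime Institute rounds its published figures):

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours in a non-leap year

def annual_downtime_hours(availability: float) -> float:
    """Hours of downtime per year implied by an availability fraction."""
    return HOURS_PER_YEAR * (1.0 - availability)

for tier, avail in [("I", 0.9967), ("II", 0.9975), ("III", 0.9998), ("IV", 0.9999)]:
    print(f"Tier {tier}: {annual_downtime_hours(avail):.1f} h/year")
```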
  • 46. cooling systems go down
  • 47. power units fail
  • 48. 2-4% of servers will die annually Source: Jeff Dean, LADIS 2009
  • 49. 1-5% of disk drives will die every year Source: Jeff Dean, LADIS 2009
  • 50. 2.3% AFR in population of 13,250 3.3% AFR in population of 22,400 4.2% AFR in population of 246,000 Source: James Hamilton
  • 51. software breaks
  • 52. human errors
  • 53. human errors: ~20% of admin operations have unintended consequences Source: James Hamilton
  • 54. achieving scalability and availability
  • 55. partitioning
  • 56. redundancy
  • 57. recovery oriented computing Source: http://perspectives.mvdirona.com/, http://roc.cs.berkeley.edu/
  • 58. assume sw/hw failure
  • 59. design apps to be resilient
  • 60. automation
  • 61. [AWS service stack diagram: Your Custom Applications and Services, built on Isolated Networks (Amazon Virtual Private Cloud), Monitoring (Amazon CloudWatch), Management Tools (AWS Management Console, AWS Toolkit for Eclipse), Payments (Amazon Flexible Payments Service (FPS)), On-Demand Workforce (Amazon Mechanical Turk), Parallel Processing (Amazon Elastic MapReduce), Messaging (Amazon Simple Queue Service (SQS)), Content Delivery (Amazon CloudFront), Compute (Amazon Elastic Compute Cloud (EC2), Elastic Load Balancing, Auto Scaling), Storage (Amazon Simple Storage Service (S3), AWS Import/Export), Database (Amazon RDS and SimpleDB)]
  • 62. Amazon S3
  • 63. durable
  • 64. available
  • 65. [diagram]
  • 66. Amazon EC2
  • 67. highly scalable
  • 68. 3000 CPUs for one firm's risk management application [chart: number of EC2 instances by day of week]
  • 69. highly available systems
  • 70. dynamic
  • 71. fault tolerant
  • 72. US East Region: Availability Zone A, Availability Zone B, Availability Zone C, Availability Zone D
  • 73. 2. one size does not fit all
  • 74. 2. one size does not fit all data
  • 75. many data types
  • 76. structured data
  • 77. using the right data store
  • 78. (a) feature first
  • 79. RDBMS: Oracle, SQL Server, DB2, MySQL, Postgres
  • 80. Source: http://www.bioinformaticszen.com/
  • 81. Source: http://www.bioinformaticszen.com/
  • 82. Source: http://www.bioinformaticszen.com/
  • 83. use a bigger computer
  • 84. remove joins
  • 85. scaling limits
  • 86. (b) scale first
  • 87. scale is highest priority
  • 88. single RDBMS incapable
  • 89. solution 1: data sharding
  • 90. 10’s
  • 91. 100’s
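Sharding as described above splits one logical dataset across tens to hundreds of database instances by routing each record on its key. A minimal sketch (the shard count and key names are illustrative, not from the talk):

```python
import hashlib

NUM_SHARDS = 16  # illustrative; real deployments run tens to hundreds of shards

def shard_for(key: str) -> int:
    """Route a record to a shard by hashing its key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Every lookup for the same key deterministically lands on the same shard.
print(shard_for("sample-42"), shard_for("sample-43"))
```

A design caveat the slides imply: a fixed modulus makes growing the shard count painful, which is one reason scale-first systems move to consistent hashing or key-value stores.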
  • 92. solution 2: scalable key-value store
  • 93. scale is design point: MongoDB, Project Voldemort, Cassandra, HBase, BigTable, Amazon SimpleDB, Dynamo
  • 94. (c) simple structured storage
  • 95. simple, fast, low ops cost: BerkeleyDB, Tokyo Cabinet, Amazon SimpleDB
  • 96. (d) purpose optimized stores
  • 97. data warehousing, stream processing: Aster Data, Vertica, Netezza, Greenplum, VoltDB, StreamBase
  • 98. what about files?
  • 99. cluster file systems: Lustre, GlusterFS
  • 100. distributed file systems: HDFS, GFS
  • 101. distributed object store: Amazon S3, Dynomite
  • 102. [AWS service stack diagram, repeated from slide 61]
  • 103. [AWS service stack diagram, repeated from slide 61]
  • 104. 3. processing big data
  • 105. disk read/writes slow & expensive
  • 106. data processing fast & cheap
  • 107. distribute the data parallel reads
  • 108. data processing for the cloud
  • 109. distributed file system (HDFS)
  • 110. map/reduce
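The map/reduce model the slides introduce can be sketched in a few lines of plain Python: a map phase emits key-value pairs, the framework shuffles (groups) them by key, and a reduce phase folds each group. This toy word count is a sketch of the programming model only, not of Hadoop's distributed execution:

```python
from collections import defaultdict

def mapper(record):
    # Map phase: emit (key, 1) for every word in the record.
    for word in record.split():
        yield word, 1

def shuffle(pairs):
    # Shuffle phase: group values by key, as the framework does
    # between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reducer(key, values):
    # Reduce phase: fold each group into a single result.
    return key, sum(values)

records = ["a g c t a", "g g c"]
pairs = (pair for rec in records for pair in mapper(rec))
counts = dict(reducer(k, vs) for k, vs in shuffle(pairs))
print(counts)  # {'a': 2, 'g': 3, 'c': 2, 't': 1}
```

Because the map and reduce functions are side-effect free, the framework can run them on thousands of machines and rerun any task that fails, which is where the fault tolerance on the next slides comes from.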
  • 111. Via Cloudera under a Creative Commons License
  • 112. Via Cloudera under a Creative Commons License
  • 113. fault tolerance
  • 114. massive scalability
  • 115. petabyte scale
  • 116. hosted hadoop service
  • 117. hadoop easy and simple
  • 118. Amazon Elastic MapReduce [diagram: the input dataset is uploaded to an input S3 bucket; Elastic MapReduce deploys a Hadoop application across EC2 instances, driven from the web console or command line tools; results land in an output S3 bucket, with notification when the job ends]
  • 119. back to the science
  • 120. basic informatics workflow
  • 121. Via Christolakis under a CC-BY-NC-ND license
  • 122. Via Argonne National Labs under a CC-BY-SA license
  • 123. killer app Via Argonne National Labs under a CC-BY-SA license
  • 124. getting the data
  • 125. Register projects → Register samples → Sample prep → Sequencing → Analysis. These slides cover work presented by Matt Wood at various conferences
  • 126. Image: Matt Wood
  • 127. constant change
  • 128. flexible data capture
  • 129. virtual fields
  • 130. no schema
  • 131. specify at run time
  • 132. specify at run time (bootstrapping)
  • 133. Sample Name, Organism, Concentration. Source: Matt Wood
  • 134. Source: Matt Wood
  • 135. key value pairs
  • 136. change happens
  • 137. V1: Sample Name, Organism, Concentration → V2: Sample Name, Organism, Concentration, Origin, Quality metric. Source: Matt Wood
  • 138. Source: Matt Wood
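The schemaless, key-value capture the slides describe can be sketched with plain dictionaries: a V2 record adds fields without migrating V1 records, and readers tolerate absent fields. The sample values and field defaults here are illustrative, not from the talk:

```python
# Each sample is a bag of key-value pairs; no fixed schema, so new
# fields can appear without migrating old records.
v1_sample = {"Sample Name": "S1", "Organism": "E. coli", "Concentration": "50 ng/ul"}
v2_sample = {**v1_sample, "Origin": "lab A", "Quality metric": 0.97}

def field(sample, name, default=None):
    # Readers ask for a field and tolerate its absence instead of
    # failing on a schema mismatch.
    return sample.get(name, default)

print(field(v1_sample, "Quality metric", "n/a"))  # old record: field absent
print(field(v2_sample, "Quality metric"))
```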
  • 139. high throughput
  • 140. lots of pipelines
  • 141. scaling projects/pipelines?
  • 142. lots of apps
  • 143. loosely coupled
  • 144. automation
  • 145. scale operationally
  • 146. be agile
  • 147. now what?
  • 148. Via asklar under a CC-BY license
  • 149. Via Argonne National Labs under a CC-BY-SA license
  • 150. many data types
  • 151. changing data types
  • 152. Shaq Image: Keith Allison under a CC-BY-SA license
  • 153. Shaq Image: Keith Allison under a CC-BY-SA license
  • 154. Shaq Image: Keith Allison under a CC-BY-SA license
  • 155. Shaq Image: Keith Allison under a CC-BY-SA license
  • 156. Shaq Image: Keith Allison under a CC-BY-SA license
  • 157. ?
  • 158. lots and lots and lots and lots and lots and lots of data and lots and lots of lots of data
  • 159. By bitterlysweet under a CC-BY-NC-ND license
  • 160. Source: http://bit.ly/anderson-bigdata
  • 161. Chris Anderson doesn’t understand science
  • 162. “more is different”
  • 163. few data points
  • 164. elaborate models
  • 165. the unreasonable effectiveness of data Source: “The Unreasonable Effectiveness of Data”, Alon Halevy, Peter Norvig, and Fernando Pereira
  • 166. simple models lots of data
  • 167. information platform
  • 168. information platforms at scale
  • 169. one organization
  • 170. 4 TB daily added (compressed)
  • 171. 135 TB data scanned daily (compressed)
  • 172. 15 PB data total capacity
  • 173. ???
  • 174. Facebook data from Ashish Thusoo’s HadoopWorld 2009 talk
  • 175. not always that big
  • 176. can we learn any lessons? Source: “Information Platforms and the Rise of the Data Scientist”, Jeff Hammerbacher in Beautiful Data
  • 177. analytics platform
  • 178. Data warehouse
  • 179. A data warehouse is a repository of an organization's electronically stored data. Data warehouses are designed to facilitate reporting and analysis.
  • 180. ETL
  • 181. extract
  • 182. transform
  • 183. load
  • 184. Via asklar under a CC-BY license
  • 185. 1 TB
  • 186. MySQL --> Oracle
  • 187. more data
  • 188. more data types
  • 189. changing data types
  • 190. limit data warehouse
  • 191. too limited
  • 192. how do you scale and adapt?
  • 193. 100’s of TBs
  • 194. 1000’s of jobs
  • 195. back to the science
  • 196. back in the day
  • 197. small data sets
  • 198. flat files
  • 199. ../ ../folder1/ ../folder2/ . . . ../folderN/ file1 file2 . . fileN
  • 200. shared file system
  • 201. RDBMS
  • 202. Image: Wikimedia Commons
  • 203. Image: Chris Dagdigian
  • 204. need to process
  • 205. need to analyze
  • 206. 100’s of TBs
  • 207. 1000’s of jobs
  • 208. Facebook data from Ashish Thusoo’s HadoopWorld 2009 talk
  • 209. ETL
  • 210. Via asklar under a CC-BY license
  • 211. data mining & analytics
  • 212. Via Argonne National Labs under a CC-BY-SA license
  • 213. analysts are not programmers
  • 214. not savvy with map/reduce
  • 215. apache hive http://hadoop.apache.org/hive/
  • 216. manage & query data
  • 217. manage & query data on top of Hadoop
  • 218. work by @peteskomoroch
  • 219. cascading http://www.cascading.org/
  • 220. apache pig http://hadoop.apache.org/pig/
  • 221. Amazon Elastic MapReduce Amazon EC2 Instances End Deploy Application Hadoop Hadoop Hadoop Elastic Elastic MapReduce MapReduce Hadoop Hadoop Hadoop Notify Web Console, Command line tools Input output dataset results Input  S3   Output  S3   Get Results Input Data bucket bucket Amazon S3
  • 222. hadoop and bioinformatics
  • 223. High Throughput Sequence Analysis Mike Schatz, University of Maryland
  • 224. Short Read Mapping
  • 225. Seed & Extend Good alignments must have significant exact alignment Minimal exact alignment length = l/(k+1)
  • 226. Seed & Extend Good alignments must have significant exact alignment Minimal exact alignment length = l/(k+1) Expensive to scale
  • 227. Seed & Extend Good alignments must have significant exact alignment Minimal exact alignment length = l/(k+1) Expensive to scale
  • 228. Seed & Extend Good alignments must have significant exact alignment Minimal exact alignment length = l/(k+1) Expensive to scale Need parallelization framework
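The l/(k+1) bound on the seed-and-extend slides is the pigeonhole principle: splitting a length-l read into k+1 pieces, at most k mismatches can spoil at most k pieces, so at least one piece of length floor(l/(k+1)) must match exactly. A minimal sketch:

```python
def min_exact_seed_length(read_length: int, max_mismatches: int) -> int:
    """Pigeonhole bound from the slides: any alignment of a length-l read
    with at most k mismatches contains an exact match of length at least
    floor(l / (k + 1))."""
    return read_length // (max_mismatches + 1)

# e.g. a 36 bp read allowing up to 2 mismatches guarantees a 12 bp exact seed
print(min_exact_seed_length(36, 2))  # 12
```

The aligner therefore only has to find short exact seeds (cheap) and extend them (expensive but rare), which is the step the slides say is expensive to scale without a parallelization framework.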
  • 229. CloudBurst: Catalog k-mers → Collect seeds → End-to-end alignment
  • 230. http://cloudburst-bio.sourceforge.net; Bioinformatics 2009 25: 1363-1369
  • 231. Bowtie: Ultrafast short read aligner Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10 (3): R25.
  • 232. SOAPsnp: Consensus alignment and SNP calling Ruiqiang Li, Yingrui Li, Xiaodong Fang, et al. (2009) "SNP detection for massively parallel whole-genome resequencing" Genome Res
  • 233. Crossbow: Rapid whole genome SNP analysis Ben Langmead http://bowtie-bio.sourceforge.net/crossbow/index.shtml
  • 234. Preprocessed reads
  • 235. Preprocessed reads Map: Bowtie
  • 236. Preprocessed reads Map: Bowtie Sort: Bin and partition
  • 237. Preprocessed reads Map: Bowtie Sort: Bin and partition Reduce: SoapSNP
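The Crossbow pipeline on the slides (map: Bowtie, sort: bin and partition, reduce: SOAPsnp) fits the map/reduce shape directly. A toy sketch of the wiring only; the stub functions, bin size, and read tuples below stand in for the real aligner and SNP caller:

```python
from collections import defaultdict

BIN_SIZE = 1000  # partition the genome into fixed-size bins (illustrative)

def map_align(read):
    # Stand-in for Bowtie: pretend each read already aligns at a known position.
    chrom, pos, bases = read
    return (chrom, pos // BIN_SIZE), (pos, bases)

def reduce_call_snps(alignments):
    # Stand-in for SOAPsnp: here we just count the reads covering the bin.
    return len(alignments)

reads = [("chr1", 120, "ACGT"), ("chr1", 980, "GGTA"), ("chr1", 1500, "TTAG")]
bins = defaultdict(list)
for read in reads:
    key, value = map_align(read)   # map: align each read
    bins[key].append(value)        # sort: bin and partition by genome region
coverage = {key: reduce_call_snps(vals) for key, vals in sorted(bins.items())}
print(coverage)  # {('chr1', 0): 2, ('chr1', 1): 1}
```

Partitioning by genome bin is what makes the reduce step embarrassingly parallel: each bin's reads can be handed to an independent SNP-calling task.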
  • 238. Crossbow condenses over 1,000 hours of resequencing computation into a few hours without requiring the user to own or operate a computer cluster
  • 239. Comparing Genomes
  • 240. Estimating relative evolutionary rates from sequence comparisons: identification of probable orthologs. Admissible comparisons: A or B vs. D; C vs. E. Inadmissible comparisons: A or B vs. E; C vs. D. [species tree and gene tree for S. cerevisiae and C. elegans, genes A–E]
  • 241. Estimating relative evolutionary rates from sequence comparisons: 1. Orthologs found using the reciprocal smallest distance algorithm. 2. Build alignment between two orthologs: >Sequence C MSGRTILASTIAKPFQEEVTKAVKQLNFT-----PKLVGLLSNEDPAAKMYANWTGKTCESLGFKYEL-… >Sequence E MSGRTILASKVAETFNTEIINNVEEYKKTHNGQGPLLVGFLANNDPAAKMYATWTQKTSESMGFRYDL… 3. Estimate distance given a substitution matrix. [substitution matrix fragment: Phe, Ala, Pro, Leu, Thr] [species tree and gene tree: S. cerevisiae, C. elegans]
  • 242. RSD algorithm summary [diagram: for Genome I and Genome J, align sequences and calculate distances in both directions; the ortholog pair is the reciprocal smallest-distance hit, e.g. i_b vs. j_c with D = 0.1]
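The reciprocal-smallest-distance idea on the slide can be sketched as: each gene picks its smallest-distance hit in the other genome, and a pair counts as orthologous only if the choice is mutual. The gene names and distance table below are toy stand-ins for the alignment-based distance estimates the real RSD computes:

```python
def d(a, b):
    # Stand-in symmetric pairwise distance; real RSD aligns the two
    # sequences and estimates a maximum-likelihood distance.
    table = {
        frozenset(("i_a", "j_a")): 1.2, frozenset(("i_a", "j_b")): 0.8,
        frozenset(("i_a", "j_c")): 1.1, frozenset(("i_b", "j_a")): 0.3,
        frozenset(("i_b", "j_b")): 0.9, frozenset(("i_b", "j_c")): 0.1,
    }
    return table[frozenset((a, b))]

def orthologs(genome_i, genome_j):
    # A pair is orthologous only if each gene is the other's
    # smallest-distance hit (the "reciprocal" in RSD).
    pairs = []
    for gi in genome_i:
        gj = min(genome_j, key=lambda g: d(gi, g))
        if min(genome_i, key=lambda g: d(g, gj)) == gi:
            pairs.append((gi, gj, d(gi, gj)))
    return pairs

print(orthologs(["i_a", "i_b"], ["j_a", "j_b", "j_c"]))
```

With these toy distances, i_b pairs with j_c at D = 0.1, matching the ortholog called out on the slide.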
  • 243. Prof. Dennis Wall Harvard Medical School
  • 244. Roundup is a database of orthologs and their evolutionary distances. To get started, click browse. Alternatively, you can read our documentation here. Good luck, researchers!
  • 245. massive computational demand
  • 246. 1000 genomes = 5,994,000 processes = 23,976,000 hours
  • 247. 2737 years
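The arithmetic behind the two slides above: the per-process cost of 4 hours is an assumption implied by dividing 23,976,000 hours by 5,994,000 processes, and "2737 years" is the total divided by the hours in a year:

```python
processes = 5_994_000
hours_per_process = 4           # assumption: implied by 23,976,000 / 5,994,000
total_hours = processes * hours_per_process
print(total_hours)              # 23976000 CPU-hours, as on the slide

years = total_hours / (24 * 365)
print(round(years))             # 2737 years on a single CPU
```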
  • 248. compared 50+ genomes
  • 249. trends in data sharing
  • 250. data motion is hard
  • 251. cloud services are a viable dataspace
  • 252. share data
  • 253. share applications
  • 254. share results
  • 255. http://aws.amazon.com/publicdatasets/
  • 256. Data Platform App Platform
  • 257. Data Platform App Platform
  • 258. [Scalable Data Platform diagram: Services, APIs, Getters, Filters, Savers]
  • 259. to conclude
  • 260. big data
  • 261. change thinking
  • 262. data management data processing data sharing
  • 263. think distributed
  • 264. new software architectures
  • 265. new computing paradigms
  • 266. cloud services
  • 267. the cloud works
  • 268. Thank you! deesingh@amazon.com Twitter: @mndoci. Presentation ideas from @mza, James Hamilton, and @lessig