There is no magic. There is only awesome. A platform for data science. Deepak Singh
bioinformatics (image: Ethan Hein)
3
collection
curation
analysis
what’s the big deal?
Source: http://www.nature.com/news/specials/bigdata/index.html
Image: Yael Fitzpatrick (AAAS)
Image: Yael Fitzpatrick (AAAS)
lots of data
lots of people
lots of places
to make data effective
versioning
provenance
filter
aggregate
extend
mashup
human interfaces
hard problem
really hard problem
change how we think about compute
change how we think about data
change how we think about science
information platforms
Image: Drew Conway
dataspaces. Further reading: Jeff Hammerbacher, “Information Platforms and the Rise of the Data Scientist”, in Beautiful Data
the unreasonable effectiveness of data. Halevy et al., IEEE Intelligent Systems 24, 8-12 (2009)
accept all data   formats
evolve APIs
beyond the database and the data warehouse
data as a programmable resource
data as a royal garden
compute as a fungible commodity
which brings us to ...
amazon web services
common characteristics
on demand
pay as you go
secure
elastic
3000 CPUs for one firm’s risk management application (chart of instance usage over one week omitted)
programmable
“infrastructure as code”
# Chef recipe: install Phusion Passenger under Apache
include_recipe "packages"
include_recipe "ruby"
include_recipe "apache2"

if platform?("centos", "redhat")
  if dist_only?
    # just the gem, we'll install the apache module within apache2
    package "rubygem-passenger"
    return
  else
    package "httpd-devel"
  end
else
  %w{ apache2-prefork-dev libapr1-dev }.each do |pkg|
    package pkg do
      action :upgrade
    end
  end
end

gem_package "passenger" do
  version node[:passenger][:version]
end

execute "passenger_module" do
  command 'echo -en "\n\n\n\n" | passenger-install-apache2-module'
  creates node[:passenger][:module_path]
end
# boto script from the slide: launch a streaming word-count job on Amazon Elastic MapReduce
import time
import boto
import boto.emr
from boto.emr.step import StreamingStep
from boto.emr.bootstrap_action import BootstrapAction

# set your AWS keys and S3 bucket, e.g. from the environment or .boto
AWSKEY = ""
SECRETKEY = ""
S3_BUCKET = ""
NUM_INSTANCES = 1

# connect to Elastic MapReduce
conn = boto.connect_emr(AWSKEY, SECRETKEY)

# install packages via a bootstrap action
bootstrap_step = BootstrapAction("download.tst",
                                 "s3://elasticmapreduce/bootstrap-actions/download.sh",
                                 None)

# set up mappers and reducers
step = StreamingStep(name="Wordcount",
                     mapper="s3n://elasticmapreduce/samples/wordcount/wordSplitter.py",
                     cache_files=["s3n://" + S3_BUCKET + "/boto.mod#boto.mod"],
                     reducer="aggregate",
                     input="s3n://elasticmapreduce/samples/wordcount/input",
                     output="s3n://" + S3_BUCKET + "/output/wordcount_output")

jobid = conn.run_jobflow(name="testbootstrap",
                         log_uri="s3://" + S3_BUCKET + "/logs",
                         steps=[step],
                         bootstrap_actions=[bootstrap_step],
                         num_instances=NUM_INSTANCES)
print "finished spawning job (note: starting still takes time)"

# poll the job state until the flow completes
state = conn.describe_jobflow(jobid).state
print "job state = ", state
print "job id = ", jobid
while state != u'COMPLETED':
    print time.localtime()
    time.sleep(30)
    state = conn.describe_jobflow(jobid).state
    print "job state = ", state
    print "job id = ", jobid

print "final output can be found in s3://" + S3_BUCKET + "/output/wordcount_output"
print "try: $ s3cmd sync s3://" + S3_BUCKET + "/output/wordcount_output ."
“I terminate the instance and relaunch it. That’s my error handling.” Source: @jtimberman on Twitter
compute is a fungible commodity
emphasis on productivity
you can get a lot of awesome
dive in
just a little
S3: Simple Storage Service
highly durable
99.999999999%
highly scalable
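A minimal sketch of what “data as a programmable resource” looks like against S3, using the same boto library that appears elsewhere in this deck; the bucket and key names here are placeholders, not anything from the talk:

import boto

conn = boto.connect_s3()                             # credentials come from the environment or ~/.boto
bucket = conn.create_bucket("my-dataset-bucket")     # placeholder name; bucket names are globally unique
key = bucket.new_key("runs/sample-001/results.txt")  # placeholder key
key.set_contents_from_filename("results.txt")        # upload a local file
key.get_contents_to_filename("results-copy.txt")     # and read it back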
EC2: Elastic Compute Cloud
dynamic
autoscaling
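As a rough illustration of autoscaling driven from code (not a recipe from the talk), this boto sketch creates a launch configuration and an Auto Scaling group; the AMI ID, names, zone, and sizes are placeholder assumptions:

import boto.ec2.autoscale
from boto.ec2.autoscale import LaunchConfiguration, AutoScalingGroup

conn = boto.ec2.autoscale.connect_to_region("us-east-1")

# describe what each new instance should look like
lc = LaunchConfiguration(name="analysis-lc",
                         image_id="ami-12345678",    # placeholder AMI
                         instance_type="m1.large")
conn.create_launch_configuration(lc)

# keep between 2 and 8 instances running in the group
group = AutoScalingGroup(group_name="analysis-group",
                         availability_zones=["us-east-1a"],
                         launch_config=lc,
                         min_size=2, max_size=8,
                         connection=conn)
conn.create_auto_scaling_group(group)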
EC2 instance types
standard “m1”, high CPU “c1”, high memory “m2” (http://aws.amazon.com/ec2/instance-types/)
cluster compute instances
cluster GPU instances
cluster compute “cc1”, cluster GPU “cg1” (http://aws.amazon.com/ec2/instance-types/)
10 Gbps
Placement  Group
full bisection bandwidth
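For illustration, here is a small boto sketch of how a placement group is used to get instances onto that 10 Gbps, full-bisection fabric; the group name, AMI, and instance count are placeholder assumptions:

import boto.ec2

conn = boto.ec2.connect_to_region("us-east-1")
conn.create_placement_group("hpc-group", strategy="cluster")

# launch cluster compute instances into the group so they share the 10 Gbps network
conn.run_instances("ami-12345678",                   # placeholder cluster-compute AMI
                   min_count=8, max_count=8,
                   instance_type="cc1.4xlarge",
                   placement_group="hpc-group")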
Linpack benchmark: 880-instance CC1 cluster. Performance: 41.82 TFlops* (*#231 in the Nov 2010 Top 500 rankings)
WIEN2k parallel performance (credit: K. Jorissen, F. D. Villa, and J. J. Rehr, U. Washington): KS for a huge system at 1 k-point, H size 56,000 (25 GB). Runtime on 16x8 processors: local cluster (Infiniband) 3h:48, cloud (10 Gbps) 1h:30 ($40). Very demanding network performance: 1200-atom unit cell, SCALAPACK+MPI diagonalization, matrix size 50k-100k.
cost and use models
Chart: % utilization over time, comparing ideal effective utilization with spot, on-demand, and reserved utilization.
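To make the spot portion of that picture concrete, here is a small boto sketch (not from the talk) that bids for spot capacity; the bid price, AMI, and instance count are placeholder assumptions:

import boto.ec2

conn = boto.ec2.connect_to_region("us-east-1")

# bid for 10 high-CPU instances at a maximum price of $0.20 per instance-hour
requests = conn.request_spot_instances(price="0.20",
                                       image_id="ami-12345678",   # placeholder AMI
                                       count=10,
                                       instance_type="c1.xlarge")
for req in requests:
    print req.id, req.state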
making things easier
Elastic Beanstalk
Heroku
Chart (repeated): % utilization over time, comparing ideal effective utilization with spot, on-demand, and reserved utilization.
data at scale
some practicalconsiderations
everything fails all the         time
compute needs vary
new data/compute   paradigms
Amazon Elastic MapReduce
doing stuff
Customer case study: cyclopic energy (OpenFOAM®). http://aws.amazon.com/solutions/case-studies/cyclopic-energy/
NASA JPL
Credit: Angel Pizarro, U. Penn
http://aws.amazon.com/solutions/case-studies/numerate/
Bioproximity          http://aws.amazon.com/solutions/case-studies/bioproximity/
http://usegalaxy.org/cloud
mapreduce for genomics: http://bowtie-bio.sourceforge.net/crossbow/index.shtml, http://contrail-bio.sourceforge.net, http://bowtie-bio.sourceforge.net/myrna/index.shtml
http://cloudbiolinux.org/
in summary
large scale data requires a rethink
data architecture
compute architecture
in infrastructure
the cloud
distributed, programmable infrastructure
rapid, massive scaling
architecture evolved  with the internet
can we build data science platforms?
there is no magic, there is only awesome
two more things
10 minutes
http://aws.amazon.com/about-aws/build-a-cluster-in-under-10/
http://aws.amazon.com/education
deesingh@amazon.com. Twitter: @mndoci. http://slideshare.net/mndoci. http://mndoci.com. Inspiration and ideas from Matt Wood & Larry Lessig. Credit: Oberazzi, under a CC-BY-NC-SA license.