eResearch New Zealand Keynote

Bill Howe gave a keynote at eResearch New Zealand in early July 2013

Published in: Technology

  1. 1. Bill Howe, PhD Director of Research, Scalable Data Analytics University of Washington eScience Institute eScience and Data Science at the UW eScience Institute 7/8/2013 Bill Howe, UW 1
  2. 2. “It’s a great time to be a data geek.” -- Roger Barga, Microsoft Research “The greatest minds of my generation are trying to figure out how to make people click on ads” -- Jeff Hammerbacher, co-founder, Cloudera
  3. 3. Jim Gray
  4. 4. The University of Washington eScience Institute • Rationale – The exponential increase in physical and virtual sensing tech is transitioning all fields of science and engineering from data-poor to data-rich – Techniques and technologies include • Sensors and sensor networks, data management, data mining, machine learning, visualization, cluster/cloud computing • Mission – Help position the University of Washington and partners at the forefront of research both in modern eScience techniques and technologies, and in the fields that depend upon them. • Strategy – Bootstrap a cadre of Research Scientists – Add faculty in key fields – Build out a “consultancy” of students and non-research staff
  5. 5. 3 V’s of Big Data: Volume, Variety, Velocity [Figure: # of bytes vs. # of data sources] Astronomy: LSST (~100PB; images, spectra); PanSTARRS (~40PB; images, trajectories); SDSS (~400TB; images, spectra, catalogs). Ocean Sciences: OOI (~50TB/year; sims, RSN); IOOS (~50TB/year; sims, satellite, gliders, AUVs, vessels, more); CMOP (~10TB/year; sims, stations, gliders, AUVs, vessels, more). Data sources shown: telescopes, spectra, n-body sims, models, AUVs, stations, cruises, CTDs, flow cytometry, gliders, ADCP, satellites
  6. 6. PDB GenBank UniProt Pfam Spreadsheets, Notebooks Local, Lost High throughput experimental methods Industrial scale Commons based production Publicly data sets Cherry picked results Preserved CATH, SCOP (Protein Structure Classification) ChemSpider Long Tail of Research Data [src: Carol Goble]
  7. 7. Types of Data Stored [bar chart] Bar values as extracted: 5.20%, 11.70%, 15.20%, 20.50%, 22.80%, 48.00%, 61.20%, 71.60%. Categories as extracted: Quantitative/tabular/statistical; Text (literature, transcriptions, field notes); Images (photographs, maps); Video recordings; Audio recordings; Multimedia digital objects; Geo-tagged objects/spatial data; Other. Lewis et al 2008. src: Conversations with Research Leaders (2008)
  8. 8. Where do you store your data? [bar chart] Bar values as extracted: 5%, 6%, 12%, 27%, 41%, 66%, 87%. Categories as extracted: Other; Department-managed data center; External (non-UW) data center; Server managed by research group; Department-managed server; External device (hard drive, thumb drive); My computer. Lewis et al 2011. src: Conversations with Research Leaders (2008); src: Faculty Technology Survey (2011)
  9. 9. How much data do you work with? Wright 2013
  10. 10. The Long Tail is worthy of investment Jean-Michel Fortin, David J. Currie, Big Science vs. Little Science: How Scientific Impact Scales with Funding. PLoS ONE 8(6) [Plot axes: log(NSERC grants) ($); log(NSERC + CIHR grants) ($)]
  11. 11. The Long Tail is worthy of investment Jean-Michel Fortin, David J. Currie, Big Science vs. Little Science: How Scientific Impact Scales with Funding. PLoS ONE 8(6) “In sum, greater productivity is not strongly related to greater funding.” “…the best article of one rich researcher received, on average, 14% fewer citations than the best article from any random pair of researchers, each of whom received only half as much funding!” “if maximizing the total impact of the entire pool of grantees is the goal, then the “few big” strategy would be less effective than the “many small” strategy.”
  12. 12. The long tail is worthy of investment “Ensuring a successful future for the biological sciences will require restraint in the growth of large centers and -omics-like projects, so as to provide more financial support for the critical work of innovative small laboratories striving to understand the wonderful complexity of living systems.” Alberts B (2012) The End of “Small Science”? Science 337: 1583
  15. 15. 100x more impact How much more productive is a great programmer than a mediocre programmer?
  16. 16. Problem: How much time do you spend “handling data” as opposed to “doing science”? Mode answer: “90%”
  17. 17. Simple Example

      ###query                   length  COG hit #1  e-value #1  identity #1  score #1  hit length #1  description #1
      chr_4[480001-580000].287   4500
      chr_4[560001-660000].1     3556
      chr_9[400001-500000].503   4211    COG4547     2.00E-04    19           44.6      620            Cobalamin biosynthesis protein C
      chr_9[320001-420000].548   2833    COG5406     2.00E-04    38           43.9      1001           Nucleosome binding factor SPN,
      chr_27[320001-404298].20   3991    COG4547     5.00E-05    18           46.2      620            Cobalamin biosynthesis protein C
      chr_26[320001-420000].378  3963    COG5099     5.00E-05    17           46.2      777            RNA-binding protein of the Puf fa
      chr_26[400001-441226].196  2949    COG5099     2.00E-04    17           43.9      777            RNA-binding protein of the Puf fa
      chr_24[160001-260000].65   3542
      chr_5[720001-820000].339   3141    COG5099     4.00E-09    20           59.3      777            RNA-binding protein of the Puf fa
      chr_9[160001-260000].243   3002    COG5077     1.00E-25    26           114       1089           Ubiquitin carboxyl-terminal hydr
      chr_12[720001-820000].86   2895    COG5032     2.00E-09    30           60.5      2105           Phosphatidylinositol kinase and p
      chr_12[800001-900000].109  1463    COG5032     1.00E-09    30           60.1      2105           Phosphatidylinositol kinase and p
      chr_11[1-100000].70        2886
      chr_11[80001-180000].100   1523
      ANNOTATIONSUMMARY-COMBINEDORFANNOTATION16_Phaeo_genome

      id    query             hit      e_value   identity_  score  query_start  query_end  hit_start  hit_end  hit_length
      1     FHJ7DRN01A0TND.1  COG0414  1.00E-08  28         51     1            74         180        257      285
      2     FHJ7DRN01A1AD2.2  COG0092  3.00E-20  47         89.9   6            85         41         120      233
      3     FHJ7DRN01A2HWZ.4  COG3889  0.0006    26         35.8   9            94         758        845      872
      …
      2853  FHJ7DRN02HXTBY.5  COG5077  7.00E-09  37         52.3   3            77         313        388      1089
      2854  FHJ7DRN02HZO4J.2  COG0444  2.00E-31  67         127    1            73         135        207      316
      …
      3566  FHJ7DRN02FUJW3.1  COG5032  1.00E-09  32         54.7   1            75         1965       2038     2105
      …
      COGAnnotation_coastal_sample.txt

      SELECT * FROM Phaeo_genome p, coastal_sample c WHERE p.COG_hit = c.hit
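The join on this slide can be tried locally. The following is a minimal sketch, not SQLShare itself: it assumes Python's standard-library sqlite3 module as a stand-in database, keeps only two columns per table, and copies a handful of rows from the slide. The column name COG_hit follows the slide's query; the trimmed table layouts are an assumption made for brevity.

```python
import sqlite3

# In-memory stand-ins for the two annotation tables on the slide.
# Only the query identifier and the COG hit are kept (an assumption for brevity).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Phaeo_genome (query TEXT, COG_hit TEXT);
CREATE TABLE coastal_sample (query TEXT, hit TEXT);
""")

# A few rows copied from the slide; a missing COG hit is stored as NULL.
conn.executemany("INSERT INTO Phaeo_genome VALUES (?, ?)", [
    ("chr_9[160001-260000].243", "COG5077"),
    ("chr_12[720001-820000].86", "COG5032"),
    ("chr_11[1-100000].70", None),
])
conn.executemany("INSERT INTO coastal_sample VALUES (?, ?)", [
    ("FHJ7DRN01A0TND.1", "COG0414"),
    ("FHJ7DRN02HXTBY.5", "COG5077"),
    ("FHJ7DRN02FUJW3.1", "COG5032"),
])

# Same join condition as the slide, with explicit columns and a stable order.
join_rows = conn.execute("""
    SELECT p.query, c.query, c.hit
    FROM Phaeo_genome p, coastal_sample c
    WHERE p.COG_hit = c.hit
    ORDER BY p.query
""").fetchall()

for row in join_rows:
    print(row)
```

Rows without a COG hit drop out of the result because NULL never compares equal in SQL, which is the behavior one would usually want when matching annotations across samples.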
  18. 18. “The future is already here; it’s just not very evenly distributed.” -- William Gibson
  19. 19. Three Avenues of Attack • Technological • Educational • Organizational
  20. 20. SQLSHARE: QUERY-AS-A-SERVICE Technology for the Long Tail
  21. 21. Alex Szalay Jim Gray How can we deliver 1000 little SDSSs?
  22. 22. 1) Upload data “as is” Cloud-hosted; no need to install or design a database; no pre-defined schema 2) Analyze data with SQL Right in your browser, writing queries on top of queries on top of queries ... SELECT hit, COUNT(*) AS cnt FROM tigrfam_surface GROUP BY hit ORDER BY cnt DESC 3) Share the results Queries on queries on queries…
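To make the "upload as is, then query" steps concrete, here is a small sketch under stated assumptions: sqlite3 stands in for the cloud-hosted database, `load_as_is` is a hypothetical helper (not part of SQLShare's API), and the tab-separated sample with its TIGR identifiers is made up for illustration. The table is created straight from the file's header row, with no types and no schema design.

```python
import csv
import io
import sqlite3


def load_as_is(conn, table, text, delimiter="\t"):
    """Create a table whose columns are taken from the header row, untyped."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    columns = ", ".join(f'"{name}"' for name in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)


# Hypothetical file contents; the real tigrfam_surface data is not in the slides.
SAMPLE = "query\thit\nq1\tTIGR00001\nq2\tTIGR00001\nq3\tTIGR00002\n"

conn = sqlite3.connect(":memory:")
load_as_is(conn, "tigrfam_surface", SAMPLE)

# The slide's aggregate, with the count given the alias that ORDER BY refers to.
hit_counts = conn.execute("""
    SELECT hit, COUNT(*) AS cnt
    FROM tigrfam_surface
    GROUP BY hit
    ORDER BY cnt DESC
""").fetchall()
print(hit_counts)
```

Leaving the columns untyped mirrors the "no pre-defined schema" idea: the data can be queried immediately, and types or cleaner names can be layered on later.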
  23. 23. Browse English descriptions
  24. 24. Save the results, share them with others Edit a Query
  25. 25. VizDeck Python Client “Flagship” SQLShare App (Python) on EC2 SQLShare REST API Excel Addin R Client Spreadsheet Crawler ASP.NET OAuth2 WCF
  26. 26. METADATA Aside
  27. 27. What about metadata? • Claim: Comprehensive metadata standards* represent a shared consensus about the world • At the frontier of research, this shared consensus does not exist, by definition • Any consensus that does emerge will change frequently, by definition • Data found “in the wild” will typically not conform to any standard, by definition So are we stuffed? * ontology / schema / controlled vocabulary / etc.
  28. 28. Maslow’s Needs Hierarchy “As each need is satisfied, the next higher level in the hierarchy dominates conscious functioning.” -- Maslow, 1943
  29. 29. A “Needs Hierarchy” of Science Data Management storage sharing query curation analytics “As each need is satisfied, the next higher level in the hierarchy dominates conscious functioning.” -- Maslow, 1943
  30. 30. A “Needs Hierarchy” of Science Data Management storage sharing curation query analytics “As each need is satisfied, the next higher level in the hierarchy dominates conscious functioning.” -- Maslow, 1943
  31. 31. Views • A view is just a query with a name • We can use the view just like a real table Why can we do this? Because we know that every query returns a relation: We say that the language is “algebraically closed”
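Algebraic closure is easy to see in a few lines. The sketch below assumes sqlite3 as a stand-in engine; the table, view names, and rows are hypothetical. A view is defined over a table, then a second view is defined over the first, and the result is queried exactly as if it were a base table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical base table of annotation hits.
CREATE TABLE hits (query TEXT, hit TEXT, e_value REAL);
INSERT INTO hits VALUES
    ('q1', 'COG5032', 1e-9),
    ('q2', 'COG5032', 2e-9),
    ('q3', 'COG5099', 2e-4),
    ('q4', 'COG4547', 5e-5);

-- A view is a named query...
CREATE VIEW strong_hits AS
    SELECT query, hit FROM hits WHERE e_value < 1e-4;

-- ...and because every query returns a relation, a view can be built on a view.
CREATE VIEW strong_hit_counts AS
    SELECT hit, COUNT(*) AS cnt FROM strong_hits GROUP BY hit;
""")

# The view is used just like a real table.
view_rows = conn.execute(
    "SELECT hit, cnt FROM strong_hit_counts ORDER BY cnt DESC, hit"
).fetchall()
print(view_rows)
```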
  32. 32. Scientific data management reduces to sharing views • Integrate data from multiple sources? – joins and unions with views • Standardize on units, apply naming conventions? – rename columns, apply functions with views • Attach metadata? – add new tables with descriptive names, add new columns with views • Data cleaning, quality control? – hide bad values with views • Maintain provenance? – inspect view dependencies • Propagate updates? – view maintenance • Protect sensitive data? – expose subsets with views (assuming views carry permissions)
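Three of the bullets above (unit standardization, quality control, provenance) can be sketched with plain views. This is an illustration only: sqlite3 stands in for the database, the `raw_casts` table, the -999.0 sentinel for bad readings, and all names are hypothetical, and provenance is approximated by reading stored view definitions from SQLite's `sqlite_master` catalog.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical raw upload: depth in feet, -999.0 marks a failed reading.
CREATE TABLE raw_casts (station TEXT, depth_ft REAL, temp_c REAL);
INSERT INTO raw_casts VALUES
    ('A', 10.0, 12.5),
    ('A', 20.0, -999.0),
    ('B', 30.0, 11.0);

-- Standardize units and names with a view (1 ft = 0.3048 m).
CREATE VIEW casts_metric AS
    SELECT station, depth_ft * 0.3048 AS depth_m, temp_c FROM raw_casts;

-- Quality control: hide bad values with another view.
CREATE VIEW casts_clean AS
    SELECT * FROM casts_metric WHERE temp_c <> -999.0;
""")

clean_rows = conn.execute(
    "SELECT station, depth_m, temp_c FROM casts_clean ORDER BY station"
).fetchall()

# Provenance: each view's stored definition names what it was derived from.
view_defs = dict(conn.execute(
    "SELECT name, sql FROM sqlite_master WHERE type = 'view'"
).fetchall())

print(clean_rows)
print(view_defs["casts_clean"])
```

The raw table is never modified, so a collaborator who disagrees with the cleaning rule can define a different view over the same upload.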
  33. 33. Deeply nested hierarchies of views Provenance Controlled Sharing Implicit re-execution
  34. 34. Bring the computation to the data • Rich query services, not data cemeteries • Avoid “Transloading”: pointless data movement from one environment to another – Compute → Vis – Server → Client – Cloud → Cloud • “Share the soup” and curate incrementally as a side effect of usage • Pay-as-you-go metadata
  35. 35. USE CASES
  36. 36. Laser Microscope Objective Pine Hole Lens Nozzle d1 d2 FSC (Forward scatter) Orange fluo Red fluo SeaFlow Francois Ribalet Jarred Swalwell Ginger Armbrust
  37. 37. SeaFlow [Flow cytometry scatter plots, log axes from 10^0 to 10^4. Panels: ps3.fcs…Focus; ps3.fcs…subset; P35-surf. Axis labels: D1/FSC, D2/FSC, FSC, 692-40 RED fluorescence. Other labels: Picoplankton, Nanoplankton, Small Stuff, 580-30, IS, Ultraplankton, Prochlorococcus] • Continuous observations of various phytoplankton groups from 1-20 µm in size • Based on RED fluo: Prochlorococcus, Pico-, Ultra- and Nanoplankton • Based on ORANGE fluo: Synechococcus, Cryptophytes • Based on FSC: Coccolithophores Francois Ribalet Jarred Swalwell Ginger Armbrust
  38. 38. SeaFlow Francois Ribalet Jarred Swalwell Ginger Armbrust
  39. 39. Script-oriented computing was killing them • Scripts (typically in R) must be pre-shared with all collaborators • When the data changes, everybody has to re-run all the scripts • When the scripts change, everybody has to re-run all the scripts. • Implicit assumption that all data fits in main memory • Terrible provenance, terrible reproducibility • Pipeline of scripts dependent on intricate file formats and file naming schemes
  40. 40. Howe, et al., CISE 2012
  41. 41. Workflow Systems • Scripts (typically in R) must be pre-shared with all collaborators • When the data changes, everybody has to re-run all the scripts • When the scripts change, everybody has to re-run all the scripts. • Implicit assumption that data fits in main memory • No provenance • Pipeline of scripts dependent on intricate file formats and/or file naming schemes
  42. 42. Steven Roberts SQL as a lab notebook: http://bit.ly/16Xj2JP Calculate # methylated CGs Calculate # all CGs Calculate methylation ratio Link methylation with gene description GFF of methylated CG locations GFF of all genes GFF of all CG locations Gene descriptions Join Reorder columns Count Count JoinJoin Reorder columns Reorder columns Compute Trim Excel Join Join misstep: join w/ wrong fill Calculate # methylated CGs Calculate # all CGs GFF of methylated CG locations GFF of all genes GFF of all CG locations Gene descriptions Calculate methylation ratio and link with gene description Popular service for Bioinformatics Workflows
  43. 43. Halperin, Howe, et al. SSDBM 2013
  44. 44. Robin Kodner “I have had two students who are struggling with R come up and tell me how much more they like working in SQLShare.” Data management and statistics for biologists
  45. 45. “An undergraduate student and I are working with gigabytes of tabular data derived from analysis of protein surfaces. Previously, we were using huge directory trees and plain text files. Now we can accomplish a 10 minute 100 line script in 1 line of SQL.” -- Andrew D White Andrew White, UW Chemistry Decoding nonspecific interactions from nature. A. White, A. Nowinski, W. Huang, A. Keefe, F. Sun, S. Jiang. (2012) Chemical Science. Accepted
  46. 46. SELECT x.strain, x.chr, x.region as snp_region, x.start_bp as snp_start_bp , x.end_bp as snp_end_bp, w.start_bp as nc_start_bp, w.end_bp as nc_end_bp , w.category as nc_category , CASE WHEN (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp) THEN x.end_bp - x.start_bp + 1 WHEN (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp) THEN x.end_bp - w.start_bp + 1 WHEN (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp) THEN w.end_bp - x.start_bp + 1 END AS len_overlap FROM [koesterj@washington.edu].[hotspots_deserts.tab] x INNER JOIN [koesterj@washington.edu].[table_noncoding_positions.tab] w ON x.chr = w.chr WHERE (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp) OR (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp) OR (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp) ORDER BY x.strain, x.chr ASC, x.start_bp ASC Non-programmers can write very complex queries (rather than relying on staff programmers) Example: Computing the overlaps of two sets of blast results We see thousands of queries written by non-programmers
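The overlap length computed by the CASE expression above can be cross-checked with a closed form. The sketch below is a swapped-in formula, not the query on the slide: for closed intervals with inclusive base-pair coordinates, the overlap is `min(ends) - max(starts) + 1`, clamped at zero. The function name and the sample coordinates are hypothetical.

```python
def overlap_length(a_start, a_end, b_start, b_end):
    """Number of positions shared by two closed intervals, or 0 if disjoint."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


# The three arrangements handled by the slide's CASE branches:
print(overlap_length(10, 20, 5, 30))   # first interval inside the second
print(overlap_length(10, 20, 15, 30))  # second interval starts inside the first
print(overlap_length(10, 20, 5, 12))   # second interval ends inside the first
```

One design note: the closed form also covers the arrangement where the second interval lies strictly inside the first (for example 10-20 against 12-15), which the slide's second WHEN branch would score as `x.end_bp - w.start_bp + 1`. In SQL dialects that provide two-argument `MIN` and `MAX` scalar functions (SQLite does; others use `LEAST` and `GREATEST`), the same expression could replace the CASE.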
  47. 47. DATA SCIENCE Education
  49. 49. Drew Conway’s Data Science Venn Diagram
  50. 50. “I worry that the Data Scientist role is like the mythical “webmaster” of the 90s: master of all trades.” -- Aaron Kimball, CTO Wibidata
  51. 51. But what are the abstractions of data science? tools abstr. “Data Jujitsu” “Data Wrangling” “Data Munging” Translation: “We have no idea what this is all about” Claim: Relational Algebra is the universal formalism for data modeling and manipulation, and every data scientist should know it
  52. 52. UW Data Science Education Efforts Students Non-Students CS/Informatics Non-Major professionals researchers undergrads grads undergrads grads UWEO Data Science Certificate Graduate Certificate in Big Data CS Data Management Courses eScience workshops Intro to data programming eScience Masters (planned) Coursera Course: Intro to Data Science Previous courses: Scientific Data Management, Graduate CS, Summer 2006, Portland State University Scientific Data Management, Graduate CS, Spring 2010, University of Washington
  53. 53. Bill Howe
  54. 54. “Very very interesting tht there is high correlation in the way the election results were being announced and the way the graph is shaped.” “I was quite amazined that I was able to obtain this analysis with just 2 days of lecturing and practice!” “Inspired by all my new learning, I thought about doing a little sentiment analysis myself for our national elections!”
  55. 55. “Darth Grader”
  56. 56. … I was frankly amazed that you were so fast in responding to my query, in spite of the class being so huge. I’d like to thank you so much for teaching this course. This has been one of the most useful courses I’ve ever taken and you’re an awesome instructor. With such great quality education freely available, I wonder why I joined a Master’s program haha. Thanks again and take care. I’ll try my best to finish another competition. “With such great quality education freely available, I wonder why I joined a Master’s program haha.”
  57. 57. INTELLECTUAL INFRASTRUCTURE Institution
  58. 58. Seek work in “Pasteur’s Quadrant” Considerations of use Quest for fundamental understanding Pasteur Edison Bohr
  59. 59. Multiple modes of interaction, multiple time scales 1-2 years and up 1-2 weeks and down 1-2 quarters Incubation (projects) Communication (events) Collaboration (partnerships)
  60. 60. 2018 2008 Some local observations: • Big data work exposes common ground • Every job is becoming “data scientist” • More π-shaped people! • Democratization to the long tail is key • Industry and research aren’t too different Incubator • Seed grants to students and postdocs • Rotating staff from science and industry • An evolving portfolio of reusable tools • Produce digital capital and human capital Data Science Incubator 2013
  61. 61. On NoSQL
  62. 62. http://sqlshare.escience.washington.edu billhowe@cs.washington.edu http://escience.washington.edu
  64. 64. WHY SQL?
  65. 65. 5/18/10 Garret Cole, eScience Institute What’s the point? • Conventional wisdom says “Science data isn’t relational” – This is nonsense • Conventional wisdom says “Scientists won’t write SQL” – This is nonsense • So why aren’t databases being used more often? – They’re a PITA • We implicate difficulty in – installation, configuration – schema design, data loading – performance tuning – app-building (NoGUI?) We ask instead, “What kind of platform can support ad hoc scientific Q&A with SQL?”
  66. 66. God made the integers; all else is the work of man. (Leopold Kronecker, 19th Century Mathematician) slide src: Mike Franklin Codd made relations; all else is the work of man. (Raghu Ramakrishnan, DB text book author)
  67. 67. Key Idea: Algebraic Optimization N = ((z*2)+((z*3)+0))/1 Algebraic Laws: 1. (+) identity: x+0 = x 2. (/) identity: x/1 = x 3. (*) distributes: (n*x+n*y) = n*(x+y) 4. (*) commutes: x*y = y*x Apply rules 1, 3, 4, 2: N = (2+3)*z Two operations instead of five, no division operator. Same idea works with the Relational Algebra!
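The rewrite on this slide can be mechanized with a tiny rule-based simplifier. The sketch below is illustrative only and is not how any particular query optimizer is implemented: expressions are assumed to be nested tuples of the form `(op, left, right)`, and the function names are hypothetical. It applies the slide's four laws to the example expression and counts operators before and after.

```python
import operator

OPS = {"+": operator.add, "*": operator.mul, "/": operator.truediv}


def simplify(expr):
    """Apply the slide's algebraic laws bottom-up to a nested-tuple expression."""
    if not isinstance(expr, tuple):
        return expr  # a constant or a variable name
    op, left, right = expr
    left, right = simplify(left), simplify(right)
    if op == "+" and right == 0:  # law 1: x + 0 = x
        return left
    if op == "/" and right == 1:  # law 2: x / 1 = x
        return left
    if (op == "+" and isinstance(left, tuple) and isinstance(right, tuple)
            and left[0] == "*" and right[0] == "*" and left[1] == right[1]):
        # law 3: n*x + n*y = n*(x+y), written as (x+y)*n using law 4
        return ("*", ("+", left[2], right[2]), left[1])
    return (op, left, right)


def count_ops(expr):
    """Count the operators in an expression tree."""
    if not isinstance(expr, tuple):
        return 0
    _, left, right = expr
    return 1 + count_ops(left) + count_ops(right)


def evaluate(expr, env):
    """Evaluate an expression tree given variable bindings in env."""
    if isinstance(expr, tuple):
        op, left, right = expr
        return OPS[op](evaluate(left, env), evaluate(right, env))
    if isinstance(expr, str):
        return env[expr]
    return expr


# N = ((z*2) + ((z*3) + 0)) / 1
original = ("/", ("+", ("*", "z", 2), ("+", ("*", "z", 3), 0)), 1)
optimized = simplify(original)

print(optimized, count_ops(original), count_ops(optimized))
```

The same pattern (a tree of operators plus equivalence-preserving rewrite rules) is what the slide alludes to for relational algebra, where the operators are selections, projections, and joins rather than arithmetic.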
  68. 68. Data Curation Data Management Data Analytics Cyberinfrastructure Database & Systems Researchers Stats, ML, and Viz Library Science Researchers DataONE Hadoop GraphLab Vertica Greenplum Oracle/MS/IBM Dataverse R/SPSS/MATLAB/Stata Dryad ICPSR Geodata.gov Tableau Weka GenBank Intellectual Infrastructure Spark Pig HIVE Shark Dremel
  69. 69. SQLShare as a CS Research Platform • Automatic “Starter” Queries – (Bill Howe, Garret Cole, Nodira Khoussainova, Leilani Battle) • VizDeck: Automatic Mashups and Visualization – (Bill Howe, Alicia Key, Daniel Perry, Cecilia Aragon) • Info Extraction from Spreadsheets – (Mike Cafarella, Dave Maier, Bill Howe, Sagar Chitnis, Abdu Alwani) • Scalable Analytics-as-a-Service – (Dan Suciu, Magda Balazinska, Bill Howe) • Optimizing Iterative Queries for Machine Learning – (Dan Suciu, Magda Balazinska, Bill Howe) • Case Studies in Metagenomics, Chemistry, more SSDBM 2011 SIGMOD 2011 (demo) SSDBM 2011 CHI 2012 SIGMOD 2012 (demo) VLDB 2010 Datalog2.0 2012 CIDR 2013 Data engineering 2012 CiSE 2012
  70. 70. Two Failure Modes Serving the Long Tail over-abstraction “uber-system” “neither fish nor fowl” tries to address so many requirements, it addresses none too reactive, ad hoc, one-off Addresses exactly 1 application no leverage; doesn’t scale
  71. 71. A stripped-down version of Jim Gray’s “20 questions” methodology Experimental Engagement Algorithm for the Long Tail 1. Get the data 2. Load the data “as is” – no schema design 3. Get ~20 questions (in English) 4. Translate the questions into SQL (when possible) 5. Provide these “starter queries” to the researchers Q: Can researchers’ questions be expressed in SQL? Q: Are a few examples sufficient for novices to self-train with SQL? Q: Can we scale this process up? Q: If so, will the use of SQL reduce their data handling overhead?
