Data Science at Berkeley


Published on

Keynote talk at PyData Silicon Valley 2014 (Facebook Headquarters) on May 4, 2014, by Prof. Joshua Bloom.

Published in: Data & Analytics, Technology
  • Be the first to comment

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

Data Science at Berkeley

  1. Data$Science$at$Berkeley Joshua'Bloom' UC'Berkeley,'Astronomy @pro%sb PyData,'May'4.'2014
  2. The$First$Rule$of$Data$Science...
  3. diagram'from'Drew'Conway Data$Science$as$a$Discipline
  4. diagram'from'Drew'Conway Data$Science$as$a$Discipline • What'are'the'Core$Principles? • Is'it'an'academic$pursuit'to'be'taught'or'a' skillset'to'be'trained? • When'should'they'be'taught?'At'what'level'of' depth/breath? • Who'should'teach'them'and'who'should'know' data'science? • Where'should'investments'be'made?'Does'Data' Science'need'an'intellectual'home'within'insHtuHons?'
  5. “I'love'working'with'astronomers,'since'their' data'is'worthless.” K'Jim'Gray,'Microso'
  6. Bayesian FrequenHst Theory/Hypothesis Driven Data Driven non-parametric parametric Data$Inference$Space Hardware---laptops-→-clusters/supercomputers So6ware---Python/Scipy,-R,-... Carbonware---(astro)-grad-students,-postdocs
  7. mij = µi + M0j + ↵j log10 (Pi/P0) + E(B V )i ⇥ [RV ⇥ a (1/ i) + b (1/ i)] + ✏ij Bayesian$Distance$Ladder Some-Variable-Stars-show-a-PeriodDBrightness-CorrelaFon i'indexes'over'individual'stars j'indexes'over'wavebands a'and'b'are'fixed'constants'at'each'color Data--134'RR'Lyrae'(ultraviolet'to'infrared) Fit--307-dimensional-model-parameter-inference K'determinisHc'MCMC'model'with'PyMC K'~6'days'for'a'single'run'(one'core) K'parallelism'for'convergence'tests Klein+12;'Klein,JSB+14, “Leavitt Law” HenrieKa-Swan-LeaviK
  8. Figure 6. Multi-band period–luminosity relations. RRab stars are in blue, RRc stars in red. Blazhko-a↵ected stars are denoted with • Sub-1% distance uncertainty • Precision 3D dust in the Milky Way Bayesian$Distance$Ladder Milky Way projection distance dust Brightness Period Klein+12;'Klein,JSB+14,
  9. Morphology of LMC RR Lyrae stars 3 the 62 individual science CCDs2 were pro- andard reduction algorithms (bias subtrac- g, etc.) using the computational resources at nergy Research Scientific Computing Center y on individual frames was calibrated with ence sources from the Two Micron All Sky S; Skrutskie et al. 2006) using the astrom- re package (Lang et al. 2010). Photometric performed using same-night observations of Sloan Digital Sky Survey Stripe 82 Standard Ivezi´c et al. 2007). Applying standard cali- dology (e.g., Ofek et al. 2012), we find that a robust scatter in our absolute photometric . 0.02 mag on clear nights (& 50% of the ob- om the Science Verification run). While this be improved with more advanced modeling signatures (e.g., Tucker et al. 2014), we find on is su cient for our scientific objectives. m observing program produced on average Figure 3. V -, I-, and z-band period–magnitude relations (solid lines) derived for the LMC RR Lyrae population, superimposed on scatter plots of the RR Lyrae posteriors (M computed using µPost). The dashed lines denote the 1 prediction intervals for a new RR Lyrae star with known period. used for all of the µi. This standard deviation was selected to be a fractional distance error of 10 per cent (⇡ 5 kpc), which is much larger than the depth of the LMC and significantly larger than (> 2 times) the median posterior µi . To fit the model given by equation 6 ten identical MCMC traces were run, each generating 3.5 million iter- ations. The first 0.5 million were discarded as burn-in and the remaining 3 million were thinned by 300 to result in ten traces of 10,000 iterations each. The Gelman-Rubin conver- now with 15,040 stars... scipy.sparse,'hdf5 Klein..JSB+14 Dust Map basemap 4-✕'improvement'in'distance'error 3D density projection mayavi2
  10. Astronomical$Data$Deluge Serious$Challenge$to$Tradi;onal$Approaches$&$Toolkits Large$Synop;c$Survey$Telescope$(LSST)$A$2020$ ' Light'curves'for'800M'sources'every'3'days '''''106'supernovae/yr,'105'eclipsing'binaries '''''3.2'gigapixel'camera,'20'TB/night LOFAR$&$SKA ''''150'Gps'(27'Tflops)'→'20'Pps'(~100'Pflops) Gaia$space$astrometry$mission$A$2014 ''''1'billion'stars'observed'∼70'Hmes'over'5'years '''''''Will'observe'20K'supernovae Many'other'astronomical'surveys'are'already'producing'data: SDSS,'iPTF,'CRTS,'PanKSTARRS,'Hipparcos,'OGLE,'ASAS,'Kepler,'LINEAR,'DES'etc.,
  11. strategy scheduling observing reducFon finding discovery classificaFon followup inference Towards$a$Fully$Automated$ScienAfic$Stack$for$Transients } current state)of)the)art stack automated not-(yet)-automated published-work NSF/CDI NSF/BIGDATA
  12. Our$ML$framework$found$ the$Nearest$Supernova$in$3$ Decades$..‣ Built'&'Deployed'robust,'realKHme' machine'learning'framework,' discovering'>10,000'events'in'>'10' TB'of'imaging' ''''''''→'50+'journal'arHcles ‣ Built'ProbabilisHc'Event' classificaHon'catalogs'with' innovaHve'acHve'learning' hhp:// hhps://
  13. What is the toolbox of the modern (data-driven) scientist? domain training statistics advanced computing database GUI parallel visualization Bayesian machine learning Physics laboratory techniques MCMC MapReduce
  14. And...How do we teach this with what little time the students have? What is the toolbox of the modern (data-driven) scientist?
  15. Data-Centric Coursework, Bootcamps, Seminars, & Lecture Series BDAS: Berkeley Data Analytics Stack [Spark, Shark, ...] parallel programming bootcamp ...and entire degree programs
  16. 2010: 85 campers 2012a: 135 campers Python Bootcamps at Berkeley
  17. a modern superglue computing language for (data) science ‣ high-level scripting language ‣ open source, huge & growing community in academia & industry ‣ Just in time compilation but also fast numerical computation ‣ Extensive interfaces to 3rd party frameworks
  18. a modern superglue computing language for (data) science ‣ high-level scripting language ‣ open source, huge & growing community in academia & industry ‣ Just in time compilation but also fast numerical computation ‣ Extensive interfaces to 3rd party frameworks A reasonable lingua franca for scientists...
  19. 2012b: 210 campers Python Bootcamps at Berkeley 2013a: 253 campers
  20. ‣ 3 days of live/archive streamed lectures ‣ all open material in GitHub ‣ widely disseminated (e.g., @ NASA)
  21. Part of the Designated Emphasis in Computational Science & Engineering at Berkeley visualization machine learning database interaction user interface & web frameworks timeseries & numerical computing interfacing to other languages Bayesian inference & MCMC hardware control parallelism
  22. “Are we alone in the universe? What makes up the missing mass of the universe? ... And maybe the biggest question of all: How in the wide world can you add $3 billion in market capitalization simply by adding .com to the end of a name?” President William Jefferson Clinton Science and Technology Policy Address 21 January 2000 “Add Data Science or Big Data to your course name to increase enrollment by tenfold.” Joshua Bloom Just Now
  23. Python for Data Science @ Berkeley [Sept 2013]
  24. 64% 36% female male 8% 4% 8% 12% 4% 12% 8% 16% 16% 12%Psychology Astronomy Neuroscience Biostatistics Physics Chemical Engineering ISchool Earth and Planetary Sciences Industrial Engineering Mechanical Engineering “Parallel Image Reconstruction from Radio Interferometry Data” “Graph Theory Analysis of Growing Graphs” “Realtime Prediction of Activity Behavior from Smartphone” “Bus Arrival Time Prediction in Spain”
  25. Time domain preprocessing - Start with raw photometry! - Gaussian process detrending! - Calibration! - Petigura & Marcy 2012! ! Transit search - Matched filter! - Similar to BLS algorithm (Kovcas+ 2002)! - Leverages Fast-Folding Algorithm O(N^2) → O(N log N) (Staelin+ 1968)! ! Data validation - Significant peaks in periodogram, but inconsistent with exoplanet transit TERRA – optimized for small planets Detrended/calibrated photometry TERRA RawFlux(ppt)CalibratedFlux Erik Petigura Berkeley Astro Grad Student Prevalence of Earth-size planets orbiting Sun-like stars Erik A. Petiguraa,b,1 , Andrew W. Howardb , and Geoffrey W. Marcya a Astronomy Department, University of California, Berkeley, CA 94720; and b Institute for Astronomy, University of Hawaii at Manoa, Honolulu, HI 96822 Contributed by Geoffrey W. Marcy, October 22, 2013 (sent for review October 18, 2013) Determining whether Earth-like planets are common or rare looms as a touchstone in the question of life in the universe. We searched for Earth-size planets that cross in front of their host stars by examining the brightness measurements of 42,000 stars from National Aeronautics and Space Administration’s Kepler mission. We found 603 planets, including 10 that are Earth size (1 − 2 R⊕) and receive comparable levels of stellar energy to that of Earth (0:25 − 4 F⊕). We account for Kepler’s imperfect detectability of such planets by injecting synthetic planet–caused dimmings into the Kepler brightness measurements and recording the fraction detected. We find that 11 ± 4% of Sun-like stars harbor an Earth- size planet receiving between one and four times the stellar inten- sity as Earth. We also find that the occurrence of Earth-size planets is constant with increasing orbital period (P), within equal intervals of logP up to ∼200 d. Extrapolating, one finds 5:7+1:7 −2:2 % of Sun-like stars harbor an Earth-size planet with orbital periods of 200–400 d. extrasolar planets | astrobiology The National Aeronautics and Space Administration’s (NASA’s) Kepler mission was launched in 2009 to search for planets that transit (cross in front of) their host stars (1–4). The resulting dimming of the host stars is detectable by measuring their bright- ness, and Kepler monitored the brightness of 150,000 stars every 30 min for 4 y. To date, this exoplanet survey has detected more than 3,000 planet candidates (4). We searched for transiting planets in Kepler brightness mea- surements using our custom-built TERRA software package described in previous works (6, 9) and in SI Appendix. In brief, TERRA conditions Kepler photometry in the time domain, re- moving outliers, long timescale variability (>10 d), and systematic errors common to a large number of stars. TERRA then searches for transit signals by evaluating the signal-to-noise ratio (SNR) of prospective transits over a finely spaced 3D grid of orbital period, P, time of transit, t0, and transit duration, ΔT. This grid-based search extends over the orbital period range of 0.5–400 d. TERRA produced a list of “threshold crossing events” (TCEs) that meet the key criterion of a photometric dimming SNR ratio SNR > 12. Unfortunately, an unwieldy 16,227 TCEs met this cri- terion, many of which are inconsistent with the periodic dimming profile from a true transiting planet. Further vetting was performed by automatically assessing which light curves were consistent with theoretical models of transiting planets (10). We also visually inspected each TCE light curve, retaining only those exhibiting a consistent, periodic, box-shaped dimming, and rejecting those caused by single epoch outliers, correlated noise, and other data anomalies. The vetting process was applied homogeneously to all TCEs and is described in further detail in SI Appendix. To assess our vetting accuracy, we evaluated the 235 Kepler objects of interest (KOIs) among Best42k stars having P > 50 d, which had been found by the Kepler Project and identified as planet candidates in the official Exoplanet Archive (exoplanetarchive. RONOMY Bootcamp/Seminar Alum Python DOE/NERSC computation PNAS [2014]
  26. Dangers+of+a+Cursory+Training/Teaching+ Curriculum
  27. first-this... ...then-this. Undergraduate$&$Graduate$Training$Mission Thinking$Data'Literacy$before$ Thinking$Big'Data'Proficiency
  28. 34 Fitting a straight line to data 0 50 100 150 200 250 300 x 0 100 200 300 400 500 600 700 y y = 1.33 x + 164 0 50 100 150 200 250 300 x 0 100 200 300 400 500 600 700 y 0.0 1.0 1.0 1.0 0.0 0.0 0.0 0.00.0 0.0 0.0 0.0 0.00.0 0.0 0.0 0.0 0.0 0.0 0.0 Data analysis recipes: Fitting a model to data⇤ David W. Hogg Center for Cosmology and Particle Physics, Department of Physics, New York University Max-Planck-Institut f¨ur Astronomie, Heidelberg Jo Bovy Center for Cosmology and Particle Physics, Department of Physics, New York University Dustin Lang Department of Computer Science, University of Toronto Princeton University Observatory Abstract We go through the many considerations involved in fitting a model to data, using as an example the fit of a straight line to a set of points in a two-dimensional plane. Standard weighted least-squares fitting is only appropriate when there is a dimension along which the data points have negligible uncertainties, and another along which all the uncertainties can be described by Gaussians of known variance; these D Fi Da Cen Ma Jo Cen Du Dep Prin tha dit line and ⇤ arXiv:1008.4686v1 [astro-ph.IM] 27 Aug 2010 Sta8s8cal+Inference Undergraduate$&$Graduate$Training$Mission Thinking$Data'Literacy$before$ Thinking$Big'Data'Proficiency
  29. Versioning+&+Reproducibility “Recently, the scientific community was shaken by reports that a troubling proportion of peer-reviewed preclinical studies are not reproducible.” McNutt, 2014 K'Git'has'emerged'as'the'de'facto'versioning'tool K'Berkeley'Common'Environment'(BCE)'Sonware'Stack K'“Reproducible'and'CollaboraHve'StaHsHcal'Data'Science”'(StaHsHcs'157:'P.'Stark) Undergraduate$&$Graduate$Training$Mission Thinking$Data'Literacy$before$ Thinking$Big'Data'Proficiency
  30. The'IPython'Notebook'–' SupporHng'Research'at'Berkeley •Designing'nuclear'reactor'cores •SimulaHng'electron'flow'in'plasmas •CompuHng'supernovae'spectra •Analyzing'brain'acHvity •Modeling'neural'networks •CalculaHng'quantum'dynamics'and'spectroscopy •Visualizing'MRI'results
  31. Berkeley Data Analytics Stack A Comprehensive Big Data Reference Architecture AMP! Alpha or Soon! AMP! Released! BSD/Apache! 3rd Party Open Source! Apache Mesos! YARN Resource Manager! Resource! HDFS / Hadoop Storage! Tachyon! Storage! ! Apache Spark! Spark Streaming! ML-lib! Processing! and Data ! Management! Applications: Traffic, Carat, Genomics, 3rd Party! Tools: Visualization, Data Cleaning, …! Shark (SQL)! BlinkDB! GraphX! MLBase! Analytics! Frameworks! Spark-R! Berkeley$Data$Analy;cs$Stack A,Comprehensive,Big,Data,Reference, Architecture Methodology'innovaHon:'InvenHng'What’s'Next'in'Big'Data'AnalyHcs Testing the Vision/Early Adopters/Momentum 2010+
  32. (Recent)'Data'Science'Industry'Spinoffs'at'Berkeley
  33. Data'Science'growing'organically'everywhere Feb'15,'2013 AMP'Lab Ion$Stoica,$CS Michael$Franklin,$CS Adam$Arkin,$Bioengineering Emmanuel$Saez,$Economics Reconstruc;ng$the$movies in$your$mind Bin$Yu,$Sta;s;cs Jack$Gallant,$Neuroscience Earthquake Strong Shaking in 11seconds Richard$Allen$ Earth&$Plan.$Science Geospa;al$Lab Fernando$Perez,$ Brain$Imaging$Center iPython$tools$and$community Charles$Marshall Rosie$Gillespie Integra;ve$Biology Digi;zed$Museum
  34. Created by Natalia Bilenko. Data source: PubMed Central.
  35. Established-CS/Stats/Math-in,Service of-novelty-in-domain-science vs. Novelty-in-domain-science-driving-&-informing-novelty-in- CS/Stats/Math “novelty2$problem” an'extra'Burden'for'Forefront'ScienHsts hhps://
  36. Berkeley Institute for Data Science “Bold new partnership launches to harness potential of data scientists and big data” Founded'in'December'2013'as'a'result'of'a'year+'long'naHonal'selecHon'process $37.8M'over'5'years,'along'with'University'of'Washington'&'NYU ‣ An'accelerator'for'dataKdriven'discovery ‣ An'agent$of$change'in'the'modern'university'as'Data'Science'takes'hold ‣ An'incubator'for'the'next'generaHon'of'Data'Science'technology'and'pracHce
  37. Leadership$from$across$the$spectrum Joshua$Bloom,'Professor,'Astronomy;'Director,'Center' for'Time'Domain'InformaHcs '''' Henry$Brady,'Dean,'Goldman'School'of'Public'Policy' ''''' Cathryn$Carson,'Associate'Dean,'Social'Sciences;'AcHng' Director'of'Social'Sciences'Data'Laboratory'"DKLab” '''''' David$Culler,'Chair,'EECS ''''''' Michael$Franklin,'Professor;'EECS,'CoKDirector,'AMP' Lab ''''''''' Erik$Mitchell,'Associate'University'Librarian ''''''''' Faculty'Lead/PI:'Saul$PerlmuXer,'Physics,'Berkeley'Center'for'Cosmological'Physics Fernando$Perez,'Researcher,'Henry'H.'Wheeler'Jr.'Brain' Imaging'Center Jasjeet$Sekhon,'Professor,'PoliHcal'Science'and' StaHsHcs;'Center'for'Causal'Inference'and'Program' EvaluaHon '''''' Jamie$Sethian,'Professor,'MathemaHcs '''''' Kimmen$Sjölander,'Professor,'Bioengineering,'Plant'and' Microbial'Biology '''''' Philip$Stark,'Chair,'StaHsHcs ''''' Ion$Stoica,'Professor,'EECS;'CoKDirector,'AMP'Lab
  38. BIDS goals ‣ Support$meaningful$and$sustained$interac;ons$and$collabora;ons$ between'Methodology'fields'&'Science'domains'to'recognize'what'it'takes'to' move'these'fields'forward ‣ Establish$new$Data$Science$career$paths$that$are$longAterm$and$sustainable • A'generaHon'of'mulHKdisciplinary'scienHsts'in'dataKintensive'science • A'generaHon'of'data'scienHsts'focused'on'tool'development ‣ Build$an$ecosystem$of$analy;cal$tools,$teaching,$&$research$prac;ces • Sustainable,'reusable,'extensible,'easy'to'learn'and'to'translate'across' research'domains • Enables'scienHsts'to'spend'more'Hme'focusing'on'their'science 37
  39. A'place'to'bring'it'all'together'at' the'Center Vibrant'nexus'in'the'heart'of'campus Doe-Library Enhancing'strengths'of: •Simons-InsFtute-for-the- Theory-of-CompuFng •-AMP-Lab •-CITRIS •etc.
  40. Doe'Memorial'Library @'the'center'of'UC'Berkeley
  41. Berkeley Institute for Data Science Opening
  42. Berkeley Institute for Data Science Opening
  43. Berkeley/UW/NYU$Working$Groups$as$Bridges$ Applied'Math /
  44. Towards+an+Inclusive+Ecosystem Expanding+Par8cipa8on+Among+Underrepresented+Groups 11% 56% 33% female male decline'to'state 2013'Python'bootcamp K'2013'AMP'Camp:'''<'5%'women K'Today'@'PyData:'''''1'women'out'of'18'speakers K'2013'Python'Seminar:''36%'women
  45. Chris-Mentzel Moore-FoundaFon @NYU,-on-Monday Josh-Greenberg Sloan-FoundaFon Yann-LeCun NYU/Facebook
  46. Summary Data+science+at+Berkeley+is+thriving+and+is+geJng+an+ intellectual+home ---D-incubaFng-novel-science-&-methodologies ---D-teaching-&-training ---D-innovate-environments,-interacFons,-&-networks A+data+scien8st+is+a+unicorn,+but... Looking-for-founda8onal+industry+partners-to- parFcipate-and-help-us-grow
  47. @pro{sb Thank+you. PyData,'May'4.'2014