Computational Training for Domain
Scientists & Data Literacy
Joshua	
  Bloom
Moore	
  Founda1on	
  Data-­‐Driven	
  Inves1gator	
  	
  
UC	
  Berkeley,	
  Astronomy
@pro%sb
AAAS,	
  San	
  Jose	
  	
  Sunday	
  Feb	
  15,	
  2015
“knowledge and understanding of scientific concepts and processes required
for personal decision making, participation in civic and cultural affairs,
and economic productivity"
Scientific Literacy
National Science Education Standards Report, National
Academies of Science 1996
Data Literacy, as a broad new educational imperative
Data Literacy
“knowledge and understanding of analytical concepts and processes required
to draw fair conclusions from data that impact personal, professional, and
civic lives”
now @ AAAS
http://boingboing.net/2013/01/01/correlation-between-autism-dia.html
via reddit/r/skeptic
Our	
  ML	
  framework	
  found	
  
the	
  Nearest	
  Supernova	
  in	
  3	
  
Decades	
  ..‣ Built	
  &	
  Deployed	
  robust,	
  real-­‐?me	
  
machine	
  learning	
  framework,	
  
discovering	
  >10,000	
  events	
  in	
  >	
  10	
  
TB	
  of	
  imaging	
  

	
  	
  	
  	
  	
  	
  	
  	
  →	
  50+	
  journal	
  ar?cles
‣ Built	
  Probabilis?c	
  Event	
  
classifica?on	
  catalogs	
  with	
  
innova?ve	
  ac?ve	
  learning	
  
hPp://?medomain.org hPps://www.nsf.gov/news/news_summ.jsp?cntn_id=122537
What is the toolbox
of the modern
(data-driven) scientist?
domain
training
statistics
advanced
computing
database
GUI
parallel
visualization
Bayesian
machine learning
Physics
laboratory techniques
MCMC
MapReduce
And...How do we teach this with what
little time the students have?
What is the toolbox
of the modern
(data-driven) scientist?
Data-Centric Coursework, Bootcamps, Seminars, & Lecture
Series
BDAS: Berkeley Data Analytics
Stack
[Spark, Shark, ...]
parallel
programming
bootcamp
...and entire degree programs
Taught by CS/Stats
Aimed at Engineers & Programmers
Heading Toward Industry
2010: 85 campers 2012a: 135 campers
Python Bootcamps at Berkeley
2012b: 210 campers
Python Bootcamps at Berkeley
2013a: 253 campers
‣ 3 days of live/archive streamed lectures
‣ all open material in GitHub
‣ widely disseminated (e.g., @ NASA)
http://pythonbootcamp.info
Part of the
Designated
Emphasis in
Computational
Science &
Engineering at
Berkeley
visualization
machine learning
database interaction
user interface & web frameworks
timeseries & numerical computing
interfacing to other languages
Bayesian inference & MCMC
hardware control
parallelism
Python for Data Science @ Berkeley [Sept 2013]
64%
36%
female male
8%
4%
8%
12%
4%
12%
8%
16%
16%
12%Psychology
Astronomy
Neuroscience
Biostatistics
Physics
Chemical Engineering
ISchool
Earth and Planetary Sciences
Industrial Engineering
Mechanical Engineering
“Parallel Image Reconstruction
from Radio Interferometry Data”
“Graph Theory Analysis of Growing
Graphs”
http://mb3152.github.io/Graph-Growth/
“Realtime Prediction of Activity Behavior
from Smartphone”
“Bus Arrival Time
Prediction in Spain”
Time domain preprocessing
- Start with raw photometry!
- Gaussian process detrending!
- Calibration!
- Petigura & Marcy 2012!
!
Transit search
- Matched filter!
- Similar to BLS algorithm (Kovcas+ 2002)!
- Leverages Fast-Folding Algorithm
O(N^2) → O(N log N) (Staelin+ 1968)!
!
Data validation
- Significant peaks in periodogram, but
inconsistent with exoplanet transit
TERRA – optimized for small planets
Detrended/calibrated photometry
TERRA
RawFlux(ppt)CalibratedFlux
Erik Petigura
Berkeley Astro
Grad Student
Prevalence of Earth-size planets orbiting Sun-like stars
Erik A. Petiguraa,b,1
, Andrew W. Howardb
, and Geoffrey W. Marcya
a
Astronomy Department, University of California, Berkeley, CA 94720; and b
Institute for Astronomy, University of Hawaii at Manoa, Honolulu, HI 96822
Contributed by Geoffrey W. Marcy, October 22, 2013 (sent for review October 18, 2013)
Determining whether Earth-like planets are common or rare looms
as a touchstone in the question of life in the universe. We searched
for Earth-size planets that cross in front of their host stars by
examining the brightness measurements of 42,000 stars from
National Aeronautics and Space Administration’s Kepler mission.
We found 603 planets, including 10 that are Earth size (1 − 2 R⊕)
and receive comparable levels of stellar energy to that of Earth
(0:25 − 4 F⊕). We account for Kepler’s imperfect detectability of
such planets by injecting synthetic planet–caused dimmings into
the Kepler brightness measurements and recording the fraction
detected. We find that 11 ± 4% of Sun-like stars harbor an Earth-
size planet receiving between one and four times the stellar inten-
sity as Earth. We also find that the occurrence of Earth-size planets is
constant with increasing orbital period (P), within equal intervals of
logP up to ∼200 d. Extrapolating, one finds 5:7+1:7
−2:2 % of Sun-like stars
harbor an Earth-size planet with orbital periods of 200–400 d.
extrasolar planets | astrobiology
The National Aeronautics and Space Administration’s (NASA’s)
Kepler mission was launched in 2009 to search for planets
that transit (cross in front of) their host stars (1–4). The resulting
dimming of the host stars is detectable by measuring their bright-
ness, and Kepler monitored the brightness of 150,000 stars every
30 min for 4 y. To date, this exoplanet survey has detected more
We searched for transiting planets in Kepler brightness mea-
surements using our custom-built TERRA software package
described in previous works (6, 9) and in SI Appendix. In brief,
TERRA conditions Kepler photometry in the time domain, re-
moving outliers, long timescale variability (>10 d), and systematic
errors common to a large number of stars. TERRA then searches
for transit signals by evaluating the signal-to-noise ratio (SNR) of
prospective transits over a finely spaced 3D grid of orbital period,
P, time of transit, t0, and transit duration, ΔT. This grid-based
search extends over the orbital period range of 0.5–400 d.
TERRA produced a list of “threshold crossing events” (TCEs)
that meet the key criterion of a photometric dimming SNR ratio
SNR > 12. Unfortunately, an unwieldy 16,227 TCEs met this cri-
terion, many of which are inconsistent with the periodic dimming
profile from a true transiting planet. Further vetting was performed
by automatically assessing which light curves were consistent with
theoretical models of transiting planets (10). We also visually
inspected each TCE light curve, retaining only those exhibiting a
consistent, periodic, box-shaped dimming, and rejecting those
caused by single epoch outliers, correlated noise, and other data
anomalies. The vetting process was applied homogeneously to all
TCEs and is described in further detail in SI Appendix.
To assess our vetting accuracy, we evaluated the 235 Kepler
objects of interest (KOIs) among Best42k stars having P > 50 d,
which had been found by the Kepler Project and identified as planet
candidates in the official Exoplanet Archive (exoplanetarchive.
ONOMY
Bootcamp/Seminar Alum
Python
DOE/NERSC computation
PNAS [2014]
Dangers	
  of	
  a	
  Cursory	
  Training/Teaching	
  
Curriculum
34 Fitting a straight line to data
0 50 100 150 200 250 300
x
0
100
200
300
400
500
600
700
y
y = 1.33 x + 164
0 50 100 150 200 250 300
x
0
100
200
300
400
500
600
700
y
0.0
1.0
1.0
1.0
0.0
0.0
0.0
0.00.0
0.0
0.0
0.0
0.00.0
0.0
0.0
0.0
0.0
0.0
0.0
Data analysis recipes:
Fitting a model to data⇤
David W. Hogg
Center for Cosmology and Particle Physics, Department of Physics, New York University
Max-Planck-Institut f¨ur Astronomie, Heidelberg
Jo Bovy
Center for Cosmology and Particle Physics, Department of Physics, New York University
Dustin Lang
Department of Computer Science, University of Toronto
Princeton University Observatory
Abstract
We go through the many considerations involved in fitting a model
to data, using as an example the fit of a straight line to a set of points
in a two-dimensional plane. Standard weighted least-squares fitting
is only appropriate when there is a dimension along which the data
points have negligible uncertainties, and another along which all the
uncertainties can be described by Gaussians of known variance; these
D
F
Da
Ce
Ma
Jo
Ce
Du
De
Pr
th
di
lin
an
arXiv:1008.4686v1 [astro-ph.IM] 27 Aug 2010
Sta7s7cal	
  Inference
Data	
  Literacy	
  in	
  the	
  Undergraduate	
  &	
  Graduate	
  Training	
  Mission	
  
Versioning	
  &	
  Reproducibility
“Recently, the scientific community was shaken by reports that a troubling proportion of peer-reviewed
preclinical studies are not reproducible.” McNutt, 2014
http://www.sciencemag.org/content/343/6168/229.summary
-­‐	
  Git	
  has	
  emerged	
  as	
  the	
  de	
  facto	
  versioning	
  tool	
  
-­‐	
  Berkeley	
  Common	
  Environment	
  (BCE)	
  So]ware	
  Stack	
  
-­‐	
  “Reproducible	
  and	
  Collabora?ve	
  Sta?s?cal	
  Data	
  Science”	
  (Sta?s?cs	
  157:	
  P.	
  Stark)
Data	
  Literacy	
  in	
  the	
  Undergraduate	
  &	
  Graduate	
  Training	
  Mission	
  
Undergraduate Curriculum
• Data	
  Science	
  Task	
  Force	
  report	
  generated	
  at	
  Chancellor/Provost	
  behest	
  
• Recommenda?ons:	
  
– A	
  founda?onal	
  lower-­‐division	
  course	
  accessible	
  to	
  everyone	
  (sta?s?cal	
  &	
  computa?onal	
  thinking),	
  with	
  
broad	
  applicability	
  
– A	
  suite	
  of	
  courses	
  that	
  advance	
  the	
  use	
  of	
  data	
  sciences	
  within	
  the	
  broad	
  swath	
  of	
  

disciplines	
  available	
  to	
  Berkeley	
  undergraduates	
  	
  
– A	
  high-­‐quality	
  minor	
  for	
  students	
  who	
  wish	
  to	
  couple	
  data	
  sciences	
  with	
  their	
  major	
  

discipline	
  	
  
– A	
  data	
  sciences	
  major	
  for	
  those	
  students	
  who	
  want	
  this	
  to	
  be	
  their	
  primary	
  area	
  of	
  study
pg. 12; Announced Feb 11
Chancellor’s Report for 2015
Grand	
  challenges	
  at	
  the	
  intersec7on	
  of:	
  
• Open	
  science	
  and	
  reproducible	
  data	
  analysis	
  
• Responses	
  of	
  coupled	
  human-­‐natural	
  systems	
  to	
  rapid	
  environmental	
  change	
  (water	
  
resources,	
  land	
  use	
  planning,	
  agriculture,	
  etc.)	
  
• Development	
  of	
  evidence-­‐based	
  solu?ons	
  in	
  public	
  policy,	
  resource	
  management	
  and	
  
environmental	
  design
• Five	
  schools	
  and	
  colleges:	
  LeFers	
  &	
  Science,	
  Natural	
  Resources,	
  Engineering,	
  Public	
  Policy	
  
and	
  Environmental	
  Design	
  
• Two	
  year	
  graduate	
  training	
  program;	
  Master’s	
  and	
  PhD	
  students	
  
• Four	
  cohorts	
  of	
  up	
  to	
  20	
  students	
  each,	
  first	
  cohort	
  in	
  fall	
  2015
Data Sciences for the 21st Century (DS421): Environment and Society
via Prof. David Ackerly (Integrative Biology)
Data Sciences for the 21st Century (DS421): Environment and Society
via Prof. David Ackerly (Integrative Biology)
http://bitly.com/bundles/fperezorg/1
“Bold new partnership launches to harness potential of data scientists and big data”
Founded	
  in	
  December	
  2013	
  as	
  a	
  result	
  of	
  a	
  year+	
  long	
  na?onal	
  selec?on	
  process	
  
$37.8M	
  over	
  5	
  years,	
  along	
  with	
  University	
  of	
  Washington	
  &	
  NYU	
  
‣ An	
  accelerator	
  for	
  data-­‐driven	
  discovery	
  
‣ An	
  agent	
  of	
  change	
  in	
  the	
  modern	
  university	
  as	
  Data	
  Science	
  takes	
  hold	
  
‣ An	
  incubator	
  for	
  the	
  next	
  genera?on	
  of	
  Data	
  Science	
  technology	
  and	
  prac?ce
Leadership	
  from	
  across	
  the	
  spectrum
Joshua	
  Bloom,	
  Professor,	
  Astronomy;	
  Director,	
  Center	
  
for	
  Time	
  Domain	
  Informa?cs

	
  	
  	
  	
  	
  
Henry	
  Brady,	
  Dean,	
  Goldman	
  School	
  of	
  Public	
  Policy	
  	
  	
  	
  	
  	
  	
  	
  
Cathryn	
  Carson,	
  Associate	
  Dean,	
  Social	
  Sciences;	
  Ac?ng	
  
Director	
  of	
  Social	
  Sciences	
  Data	
  Laboratory	
  "D-­‐Lab”

	
  	
  	
  	
  	
  	
  	
  


David	
  Culler,	
  Chair,	
  EECS

	
  	
  	
  	
  	
  	
  	
  	
  
Michael	
  Franklin,	
  Professor;	
  EECS,	
  Co-­‐Director,	
  AMP	
  Lab

	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  
Erik	
  Mitchell,	
  Associate	
  University	
  Librarian

	
  	
  	
  	
  	
  	
  	
  	
  	
  

Faculty	
  Lead/PI:	
  Saul	
  PerlmuFer,	
  Physics,	
  Berkeley	
  Center	
  for	
  Cosmological	
  Physics
Fernando	
  Perez,	
  Researcher,	
  Henry	
  H.	
  Wheeler	
  Jr.	
  Brain	
  
Imaging	
  Center

Jasjeet	
  Sekhon,	
  Professor,	
  Poli?cal	
  Science	
  and	
  Sta?s?cs;	
  
Center	
  for	
  Causal	
  Inference	
  and	
  Program	
  Evalua?on

	
  	
  	
  	
  	
  	
  	
  


Jamie	
  Sethian,	
  Professor,	
  Mathema?cs

	
  	
  	
  	
  	
  	
  	
  
Kimmen	
  Sjölander,	
  Professor,	
  Bioengineering,	
  Plant	
  and	
  
Microbial	
  Biology

	
  	
  	
  	
  	
  	
  	
  
Philip	
  Stark,	
  Chair,	
  Sta?s?cs

	
  	
  	
  	
  	
  	
  


Ion	
  Stoica,	
  Professor,	
  EECS;	
  Co-­‐Director,	
  AMP	
  Lab
“I	
  love	
  working	
  with	
  astronomers,	
  since	
  their	
  
data	
  is	
  worthless.”
-­‐	
  Jim	
  Gray,	
  Microso'
Created by Natalia Bilenko.
Data source: PubMed Central. http://sciencereview.berkeley.edu/bsr_design/Issue26/datascience/cover/
Established	
  CS/Stats/Math	
  in	
  Service	
  
of	
  novelty	
  in	
  domain	
  science	
  
vs.	
  
Novelty	
  in	
  domain	
  science	
  driving	
  &	
  informing	
  novelty	
  in	
  
CS/Stats/Math
“novelty2	
  problem”	
  
an	
  extra	
  Burden	
  for	
  Forefront	
  Scien?sts
hPps://medium.com/tech-­‐talk/dd88857f662
ỉπ vs.
21st Century Educational Mission: Produce Gamma-shaped people
deep domain skill/knowledge/training
deep methodological knowledge/skill
deep domain or methodological skill/knowledge/training
strong methodological or domain knowledge/skill
Goal: empower teams of gamma’s to excel
Summary
Data	
  science	
  at	
  Berkeley	
  is	
  thriving	
  and	
  BIDS	
  is	
  its	
  intellectual	
  home	
  
	
  	
  	
  -­‐	
  incuba?ng	
  novel	
  science	
  &	
  methodologies	
  
	
  	
  	
  -­‐	
  teaching	
  &	
  training	
  
	
  	
  	
  -­‐	
  innovate	
  environments,	
  interac?ons,	
  &	
  networks
Teaching	
  impera7ves	
  emerging	
  around	
  data	
  literacy	
  
	
  	
  	
  -­‐	
  new	
  campus-­‐wide	
  undergraduate	
  curriculum	
  (star?ng	
  with	
  a	
  founda?on	
  
for	
  all	
  1st	
  years)	
  
	
  	
  	
  -­‐	
  new	
  mul?-­‐year,	
  mul?-­‐disciplinary	
  graduate	
  program
@pro%sb
Thank	
  you.
PyData,	
  May	
  4.	
  2014

Computational Training for Domain Scientists & Data Literacy

  • 1.
    Computational Training forDomain Scientists & Data Literacy Joshua  Bloom Moore  Founda1on  Data-­‐Driven  Inves1gator     UC  Berkeley,  Astronomy @pro%sb AAAS,  San  Jose    Sunday  Feb  15,  2015
  • 2.
    “knowledge and understandingof scientific concepts and processes required for personal decision making, participation in civic and cultural affairs, and economic productivity" Scientific Literacy National Science Education Standards Report, National Academies of Science 1996 Data Literacy, as a broad new educational imperative Data Literacy “knowledge and understanding of analytical concepts and processes required to draw fair conclusions from data that impact personal, professional, and civic lives” now @ AAAS
  • 3.
  • 4.
    Our  ML  framework  found   the  Nearest  Supernova  in  3   Decades  ..‣ Built  &  Deployed  robust,  real-­‐?me   machine  learning  framework,   discovering  >10,000  events  in  >  10   TB  of  imaging  
                →  50+  journal  ar?cles ‣ Built  Probabilis?c  Event   classifica?on  catalogs  with   innova?ve  ac?ve  learning   hPp://?medomain.org hPps://www.nsf.gov/news/news_summ.jsp?cntn_id=122537
  • 5.
    What is thetoolbox of the modern (data-driven) scientist? domain training statistics advanced computing database GUI parallel visualization Bayesian machine learning Physics laboratory techniques MCMC MapReduce
  • 6.
    And...How do weteach this with what little time the students have? What is the toolbox of the modern (data-driven) scientist?
  • 7.
    Data-Centric Coursework, Bootcamps,Seminars, & Lecture Series BDAS: Berkeley Data Analytics Stack [Spark, Shark, ...] parallel programming bootcamp ...and entire degree programs Taught by CS/Stats Aimed at Engineers & Programmers Heading Toward Industry
  • 8.
    2010: 85 campers2012a: 135 campers Python Bootcamps at Berkeley
  • 9.
    2012b: 210 campers PythonBootcamps at Berkeley 2013a: 253 campers
  • 10.
    ‣ 3 daysof live/archive streamed lectures ‣ all open material in GitHub ‣ widely disseminated (e.g., @ NASA) http://pythonbootcamp.info
  • 11.
    Part of the Designated Emphasisin Computational Science & Engineering at Berkeley visualization machine learning database interaction user interface & web frameworks timeseries & numerical computing interfacing to other languages Bayesian inference & MCMC hardware control parallelism
  • 12.
    Python for DataScience @ Berkeley [Sept 2013]
  • 13.
    64% 36% female male 8% 4% 8% 12% 4% 12% 8% 16% 16% 12%Psychology Astronomy Neuroscience Biostatistics Physics Chemical Engineering ISchool Earthand Planetary Sciences Industrial Engineering Mechanical Engineering “Parallel Image Reconstruction from Radio Interferometry Data” “Graph Theory Analysis of Growing Graphs” http://mb3152.github.io/Graph-Growth/ “Realtime Prediction of Activity Behavior from Smartphone” “Bus Arrival Time Prediction in Spain”
  • 14.
    Time domain preprocessing -Start with raw photometry! - Gaussian process detrending! - Calibration! - Petigura & Marcy 2012! ! Transit search - Matched filter! - Similar to BLS algorithm (Kovcas+ 2002)! - Leverages Fast-Folding Algorithm O(N^2) → O(N log N) (Staelin+ 1968)! ! Data validation - Significant peaks in periodogram, but inconsistent with exoplanet transit TERRA – optimized for small planets Detrended/calibrated photometry TERRA RawFlux(ppt)CalibratedFlux Erik Petigura Berkeley Astro Grad Student Prevalence of Earth-size planets orbiting Sun-like stars Erik A. Petiguraa,b,1 , Andrew W. Howardb , and Geoffrey W. Marcya a Astronomy Department, University of California, Berkeley, CA 94720; and b Institute for Astronomy, University of Hawaii at Manoa, Honolulu, HI 96822 Contributed by Geoffrey W. Marcy, October 22, 2013 (sent for review October 18, 2013) Determining whether Earth-like planets are common or rare looms as a touchstone in the question of life in the universe. We searched for Earth-size planets that cross in front of their host stars by examining the brightness measurements of 42,000 stars from National Aeronautics and Space Administration’s Kepler mission. We found 603 planets, including 10 that are Earth size (1 − 2 R⊕) and receive comparable levels of stellar energy to that of Earth (0:25 − 4 F⊕). We account for Kepler’s imperfect detectability of such planets by injecting synthetic planet–caused dimmings into the Kepler brightness measurements and recording the fraction detected. We find that 11 ± 4% of Sun-like stars harbor an Earth- size planet receiving between one and four times the stellar inten- sity as Earth. We also find that the occurrence of Earth-size planets is constant with increasing orbital period (P), within equal intervals of logP up to ∼200 d. Extrapolating, one finds 5:7+1:7 −2:2 % of Sun-like stars harbor an Earth-size planet with orbital periods of 200–400 d. extrasolar planets | astrobiology The National Aeronautics and Space Administration’s (NASA’s) Kepler mission was launched in 2009 to search for planets that transit (cross in front of) their host stars (1–4). The resulting dimming of the host stars is detectable by measuring their bright- ness, and Kepler monitored the brightness of 150,000 stars every 30 min for 4 y. To date, this exoplanet survey has detected more We searched for transiting planets in Kepler brightness mea- surements using our custom-built TERRA software package described in previous works (6, 9) and in SI Appendix. In brief, TERRA conditions Kepler photometry in the time domain, re- moving outliers, long timescale variability (>10 d), and systematic errors common to a large number of stars. TERRA then searches for transit signals by evaluating the signal-to-noise ratio (SNR) of prospective transits over a finely spaced 3D grid of orbital period, P, time of transit, t0, and transit duration, ΔT. This grid-based search extends over the orbital period range of 0.5–400 d. TERRA produced a list of “threshold crossing events” (TCEs) that meet the key criterion of a photometric dimming SNR ratio SNR > 12. Unfortunately, an unwieldy 16,227 TCEs met this cri- terion, many of which are inconsistent with the periodic dimming profile from a true transiting planet. Further vetting was performed by automatically assessing which light curves were consistent with theoretical models of transiting planets (10). We also visually inspected each TCE light curve, retaining only those exhibiting a consistent, periodic, box-shaped dimming, and rejecting those caused by single epoch outliers, correlated noise, and other data anomalies. The vetting process was applied homogeneously to all TCEs and is described in further detail in SI Appendix. To assess our vetting accuracy, we evaluated the 235 Kepler objects of interest (KOIs) among Best42k stars having P > 50 d, which had been found by the Kepler Project and identified as planet candidates in the official Exoplanet Archive (exoplanetarchive. ONOMY Bootcamp/Seminar Alum Python DOE/NERSC computation PNAS [2014]
  • 15.
    Dangers  of  a  Cursory  Training/Teaching   Curriculum
  • 16.
    34 Fitting astraight line to data 0 50 100 150 200 250 300 x 0 100 200 300 400 500 600 700 y y = 1.33 x + 164 0 50 100 150 200 250 300 x 0 100 200 300 400 500 600 700 y 0.0 1.0 1.0 1.0 0.0 0.0 0.0 0.00.0 0.0 0.0 0.0 0.00.0 0.0 0.0 0.0 0.0 0.0 0.0 Data analysis recipes: Fitting a model to data⇤ David W. Hogg Center for Cosmology and Particle Physics, Department of Physics, New York University Max-Planck-Institut f¨ur Astronomie, Heidelberg Jo Bovy Center for Cosmology and Particle Physics, Department of Physics, New York University Dustin Lang Department of Computer Science, University of Toronto Princeton University Observatory Abstract We go through the many considerations involved in fitting a model to data, using as an example the fit of a straight line to a set of points in a two-dimensional plane. Standard weighted least-squares fitting is only appropriate when there is a dimension along which the data points have negligible uncertainties, and another along which all the uncertainties can be described by Gaussians of known variance; these D F Da Ce Ma Jo Ce Du De Pr th di lin an arXiv:1008.4686v1 [astro-ph.IM] 27 Aug 2010 Sta7s7cal  Inference Data  Literacy  in  the  Undergraduate  &  Graduate  Training  Mission  
  • 17.
    Versioning  &  Reproducibility “Recently,the scientific community was shaken by reports that a troubling proportion of peer-reviewed preclinical studies are not reproducible.” McNutt, 2014 http://www.sciencemag.org/content/343/6168/229.summary -­‐  Git  has  emerged  as  the  de  facto  versioning  tool   -­‐  Berkeley  Common  Environment  (BCE)  So]ware  Stack   -­‐  “Reproducible  and  Collabora?ve  Sta?s?cal  Data  Science”  (Sta?s?cs  157:  P.  Stark) Data  Literacy  in  the  Undergraduate  &  Graduate  Training  Mission  
  • 18.
    Undergraduate Curriculum • Data  Science  Task  Force  report  generated  at  Chancellor/Provost  behest   • Recommenda?ons:   – A  founda?onal  lower-­‐division  course  accessible  to  everyone  (sta?s?cal  &  computa?onal  thinking),  with   broad  applicability   – A  suite  of  courses  that  advance  the  use  of  data  sciences  within  the  broad  swath  of  
 disciplines  available  to  Berkeley  undergraduates     – A  high-­‐quality  minor  for  students  who  wish  to  couple  data  sciences  with  their  major  
 discipline     – A  data  sciences  major  for  those  students  who  want  this  to  be  their  primary  area  of  study
  • 19.
    pg. 12; AnnouncedFeb 11 Chancellor’s Report for 2015
  • 20.
    Grand  challenges  at  the  intersec7on  of:   • Open  science  and  reproducible  data  analysis   • Responses  of  coupled  human-­‐natural  systems  to  rapid  environmental  change  (water   resources,  land  use  planning,  agriculture,  etc.)   • Development  of  evidence-­‐based  solu?ons  in  public  policy,  resource  management  and   environmental  design • Five  schools  and  colleges:  LeFers  &  Science,  Natural  Resources,  Engineering,  Public  Policy   and  Environmental  Design   • Two  year  graduate  training  program;  Master’s  and  PhD  students   • Four  cohorts  of  up  to  20  students  each,  first  cohort  in  fall  2015 Data Sciences for the 21st Century (DS421): Environment and Society via Prof. David Ackerly (Integrative Biology)
  • 21.
    Data Sciences forthe 21st Century (DS421): Environment and Society via Prof. David Ackerly (Integrative Biology)
  • 22.
    http://bitly.com/bundles/fperezorg/1 “Bold new partnershiplaunches to harness potential of data scientists and big data” Founded  in  December  2013  as  a  result  of  a  year+  long  na?onal  selec?on  process   $37.8M  over  5  years,  along  with  University  of  Washington  &  NYU   ‣ An  accelerator  for  data-­‐driven  discovery   ‣ An  agent  of  change  in  the  modern  university  as  Data  Science  takes  hold   ‣ An  incubator  for  the  next  genera?on  of  Data  Science  technology  and  prac?ce
  • 23.
    Leadership  from  across  the  spectrum Joshua  Bloom,  Professor,  Astronomy;  Director,  Center   for  Time  Domain  Informa?cs
           Henry  Brady,  Dean,  Goldman  School  of  Public  Policy                 Cathryn  Carson,  Associate  Dean,  Social  Sciences;  Ac?ng   Director  of  Social  Sciences  Data  Laboratory  "D-­‐Lab”
               
 David  Culler,  Chair,  EECS
                 Michael  Franklin,  Professor;  EECS,  Co-­‐Director,  AMP  Lab
                     Erik  Mitchell,  Associate  University  Librarian
                  
 Faculty  Lead/PI:  Saul  PerlmuFer,  Physics,  Berkeley  Center  for  Cosmological  Physics Fernando  Perez,  Researcher,  Henry  H.  Wheeler  Jr.  Brain   Imaging  Center
 Jasjeet  Sekhon,  Professor,  Poli?cal  Science  and  Sta?s?cs;   Center  for  Causal  Inference  and  Program  Evalua?on
               
 Jamie  Sethian,  Professor,  Mathema?cs
               Kimmen  Sjölander,  Professor,  Bioengineering,  Plant  and   Microbial  Biology
               Philip  Stark,  Chair,  Sta?s?cs
             
 Ion  Stoica,  Professor,  EECS;  Co-­‐Director,  AMP  Lab
  • 24.
    “I  love  working  with  astronomers,  since  their   data  is  worthless.” -­‐  Jim  Gray,  Microso'
  • 25.
    Created by NataliaBilenko. Data source: PubMed Central. http://sciencereview.berkeley.edu/bsr_design/Issue26/datascience/cover/
  • 26.
    Established  CS/Stats/Math  in  Service   of  novelty  in  domain  science   vs.   Novelty  in  domain  science  driving  &  informing  novelty  in   CS/Stats/Math “novelty2  problem”   an  extra  Burden  for  Forefront  Scien?sts hPps://medium.com/tech-­‐talk/dd88857f662
  • 27.
    ỉπ vs. 21st CenturyEducational Mission: Produce Gamma-shaped people deep domain skill/knowledge/training deep methodological knowledge/skill deep domain or methodological skill/knowledge/training strong methodological or domain knowledge/skill Goal: empower teams of gamma’s to excel
  • 28.
    Summary Data  science  at  Berkeley  is  thriving  and  BIDS  is  its  intellectual  home        -­‐  incuba?ng  novel  science  &  methodologies        -­‐  teaching  &  training        -­‐  innovate  environments,  interac?ons,  &  networks Teaching  impera7ves  emerging  around  data  literacy        -­‐  new  campus-­‐wide  undergraduate  curriculum  (star?ng  with  a  founda?on   for  all  1st  years)        -­‐  new  mul?-­‐year,  mul?-­‐disciplinary  graduate  program
  • 29.