SlideShare a Scribd company logo
1 of 32
Download to read offline
Computational Training & Data
Literacy for Domain Scientists
Joshua Bloom
UC Berkeley, Astronomy
@profjsb
“Training Students to Extract Value from Big Data” National Academies of Science, DC 11 April 2014
What is the toolbox
of the modern
(data-driven)
scientist?
domain
training
statistics
advanced
computing
database
GUI
parallel
visualization
Bayesian
machine learning
Physics
laboratory techniques
MCMC
MapReduce
And...How do we teach
this with what little time
the students have?
What is the toolbox
of the modern
(data-driven)
scientist?
Astronomical Data Deluge
Serious Challenge to Traditional Approaches & Toolkits
Astronomical Data Deluge
Serious Challenge to Traditional Approaches & Toolkits
Large Synoptic Survey Telescope (LSST) - 2020
! Light curves for 800M sources every 3 days
106 supernovae/yr, 105 eclipsing binaries
3.2 gigapixel camera, 20 TB/night
LOFAR & SKA
150 Gps (27 Tflops) → 20 Pps (~100 Pflops)
Gaia space astrometry mission - 2014
1 billion stars observed ∼70 times over 5 years
Will observe 20K supernovae
Many other astronomical surveys are already producing data:
SDSS, iPTF, CRTS, Pan-STARRS, Hipparcos, OGLE, ASAS,
Kepler, LINEAR, DES etc.,
strategy
scheduling
observing
reduction
finding
discovery
classification
followup
inference
Towards a Fully Automated Scientific Stack
for Transients
}
current
state-of-the-art
stack
automated (e.g. iPTF)
not (yet) automated
published work
NSF/CDI
NSF/BIGDATA
Our ML framework found the
Nearest Supernova in 3 Decades ..‣ Built & Deployed Real-
time ML framework,
discovering >10,000
events in > 10 TB of
imaging
→ 50+ journal articles
‣ Built Probabilistic
Event classification
catalogs with
innovative active
learning
http://timedomain.org https://www.nsf.gov/news/news_summ.jsp?cntn_id=122537
Data-Centric Coursework, Bootcamps,
Seminars, & Lecture Series
BDAS: Berkeley Data
Analytics Stack
[Spark, Shark, ...]
parallel
programming
bootcamp
...and entire degree programs
Data-Centric Coursework, Bootcamps,
Seminars, & Lecture Series
BDAS: Berkeley Data
Analytics Stack
[Spark, Shark, ...]
parallel
programming
bootcamp
...and entire degree programs
Taught by CS/Stats
Aimed at Engineers &
Programmers Heading
Toward Industry
2010: 85 campers 2012a: 135 campers
Python Bootcamps at Berkeley
a modern superglue computing
language for science
‣ high-level scripting language
‣ open source, huge & growing community in
academia & industry
‣ Just in time compilation but also fast numerical
computation
‣ Extensive interfaces to 3rd party frameworks
A reasonable lingua franca for scientists...
2012b: 210 campers
Python Bootcamps at Berkeley
2013a: 253 campers
‣ 3 days of live/archive streamed lectures
‣ all open material in GitHub
‣ widely disseminated (e.g., @ NASA)
‣ funded (~$18k) by the Vice Chancellor for Research
& NSF (BIGDATA)
http://pythonbootcamp.info
Part of the
Designated
Emphasis in
Computation
al Science &
Engineering
at Berkeley
visualization
machine learning
database interaction
user interface & web frameworks
timeseries & numerical
computing
interfacing to other languages
Bayesian inference & MCMC
hardware control
parallelism
64%
36%
female
male
8%
4%
8%
12%
4%
12%
8%
16%
16%
12%Psychology
Astronomy
Neuroscience
Biostatistics
Physics
Chemical Engineering
ISchool
Earth and Planetary Sciences
Industrial Engineering
Mechanical Engineering
“Parallel Image
Reconstruction from
Radio Interferometry
Data”
“Graph Theory Analysis of
Growing Graphs”
http://mb3152.github.io/Graph-Growth/
“Realtime Prediction of Activity
Behavior from Smartphone”
“Bus Arrival
Time Prediction
in Spain”
Time domain preprocessing
- Start with raw photometry!
- Gaussian process detrending!
- Calibration!
- Petigura & Marcy 2012!
!
Transit search
- Matched filter!
- Similar to BLS algorithm (Kovcas+ 2002)!
- Leverages Fast-Folding Algorithm
O(N^2) → O(N log N) (Staelin+ 1968)!
!
Data validation
- Significant peaks in periodogram, but
inconsistent with exoplanet transit
TERRA – optimized for small planets
Detrended/calibrated photometry
TERRA
RawFlux(ppt)CalibratedFlux
Erik Petigura
Berkeley Astro
Grad Student
Petigura, Howard, & Marcy (20
Prevalence of Earth-size planets orbiting Sun-like stars
Erik A. Petiguraa,b,1
, Andrew W. Howardb
, and Geoffrey W. Marcya
a
Astronomy Department, University of California, Berkeley, CA 94720; and b
Institute for Astronomy, University of Hawaii at Manoa, Honolulu, HI 96822
Contributed by Geoffrey W. Marcy, October 22, 2013 (sent for review October 18, 2013)
Determining whether Earth-like planets are common or rare looms
as a touchstone in the question of life in the universe. We searched
for Earth-size planets that cross in front of their host stars by
examining the brightness measurements of 42,000 stars from
National Aeronautics and Space Administration’s Kepler mission.
We found 603 planets, including 10 that are Earth size (1 − 2 R⊕)
and receive comparable levels of stellar energy to that of Earth
(0:25 − 4 F⊕). We account for Kepler’s imperfect detectability of
such planets by injecting synthetic planet–caused dimmings into
the Kepler brightness measurements and recording the fraction
detected. We find that 11 ± 4% of Sun-like stars harbor an Earth-
size planet receiving between one and four times the stellar inten-
sity as Earth. We also find that the occurrence of Earth-size planets is
constant with increasing orbital period (P), within equal intervals of
logP up to ∼200 d. Extrapolating, one finds 5:7+1:7
−2:2 % of Sun-like stars
harbor an Earth-size planet with orbital periods of 200–400 d.
extrasolar planets | astrobiology
The National Aeronautics and Space Administration’s (NASA’s)
Kepler mission was launched in 2009 to search for planets
that transit (cross in front of) their host stars (1–4). The resulting
dimming of the host stars is detectable by measuring their bright-
ness, and Kepler monitored the brightness of 150,000 stars every
30 min for 4 y. To date, this exoplanet survey has detected more
than 3,000 planet candidates (4).
The most easily detectable planets in the Kepler survey are
those that are relatively large and orbit close to their host stars,
especially those stars having lower intrinsic brightness fluctua-
tions (noise). These large, close-in worlds dominate the list of
known exoplanets. However, the Kepler brightness measurements
can be analyzed and debiased to reveal the diversity of planets,
We searched for transiting planets in Kepler brightness mea-
surements using our custom-built TERRA software package
described in previous works (6, 9) and in SI Appendix. In brief,
TERRA conditions Kepler photometry in the time domain, re-
moving outliers, long timescale variability (>10 d), and systematic
errors common to a large number of stars. TERRA then searches
for transit signals by evaluating the signal-to-noise ratio (SNR) of
prospective transits over a finely spaced 3D grid of orbital period,
P, time of transit, t0, and transit duration, ΔT. This grid-based
search extends over the orbital period range of 0.5–400 d.
TERRA produced a list of “threshold crossing events” (TCEs)
that meet the key criterion of a photometric dimming SNR ratio
SNR > 12. Unfortunately, an unwieldy 16,227 TCEs met this cri-
terion, many of which are inconsistent with the periodic dimming
profile from a true transiting planet. Further vetting was performed
by automatically assessing which light curves were consistent with
theoretical models of transiting planets (10). We also visually
inspected each TCE light curve, retaining only those exhibiting a
consistent, periodic, box-shaped dimming, and rejecting those
caused by single epoch outliers, correlated noise, and other data
anomalies. The vetting process was applied homogeneously to all
TCEs and is described in further detail in SI Appendix.
To assess our vetting accuracy, we evaluated the 235 Kepler
objects of interest (KOIs) among Best42k stars having P > 50 d,
which had been found by the Kepler Project and identified as planet
candidates in the official Exoplanet Archive (exoplanetarchive.
ipac.caltech.edu; accessed 19 September 2013). Among them, we
found four whose light curves are not consistent with being
planets. These four KOIs (364.01, 2,224.02, 2,311.01, and 2,474.01)
have long periods and small radii (SI Appendix). This exercise
suggests that our vetting process is robust and that careful scrutiny
of the light curves of small planets in long period orbits is useful to
identify false positives.
ASTRONOMY
Bootcamp/
Seminar Alum
Python
DOE/NERSC computation
PNAS [2014]
“Are we alone in the universe? What makes up the missing mass
of the universe? ... And maybe the biggest question of all: How in
the wide world can you add $3 billion in market capitalization
simply by adding .com to the end of a name?”
President William Jefferson Clinton
Science and Technology Policy Address
21 January 2000
“Add Data Science or Big Data to your course name to increase
enrollment by tenfold.”
Joshua Bloom
Just Now
Python for Data Science @ Berkeley [Sept 2013]
‣ Where do Bootcamps & Seminars fit into
traditional domain science curricula?
- formal coursework competes with research
obligations for graduate students
‣ Are they too vocational/practical for
higher Ed?
‣ Who should teach them & how do we
credit them?
first this... ...then this.
Undergraduate & Graduate Training Mission
Thinking Data Literacy before
Thinking Big Data Proficiency
Undergraduate & Graduate Training Mission
Thinking Data Literacy before
Thinking Big Data Proficiency
Data analysis recipes:
Fitting a model to data⇤
David W. Hogg
Center for Cosmology and Particle Physics, Department of Physics, New York University
Max-Planck-Institut f¨ur Astronomie, Heidelberg
Jo Bovy
Center for Cosmology and Particle Physics, Department of Physics, New York University
Dustin Lang
Department of Computer Science, University of Toronto
Princeton University Observatory
Abstract
We go through the many considerations involved in fitting a model
to data, using as an example the fit of a straight line to a set of points
in a two-dimensional plane. Standard weighted least-squares fitting
is only appropriate when there is a dimension along which the data
points have negligible uncertainties, and another along which all the
uncertainties can be described by Gaussians of known variance; these
34 Fitting a straight line to data
0 50 100 150 200 250 300
x
0
100
200
300
400
500
600
700
y
y = 1.33 x + 164
0 50 100 150 200 250 300
x
0
100
200
300
400
500
600
700
y
0.0
1.0
1.0
1.0
0.0
0.0
0.0
0.00.0
0.0
0.0
0.0
0.00.0
0.0
0.0
0.0
0.0
0.0
0.0
D
Fi
Da
Cen
Ma
Jo
Cen
Du
Dep
Pri
tha
dit
lin
and
arXiv:1008.4686v1 [astro-ph.IM] 27 Aug 2010
Statistical Inference
Versioning & Reproducibility
“Recently, the scientific community was shaken by reports that
a troubling proportion of peer-reviewed preclinical studies are
not reproducible.” McNutt, 2014
http://www.sciencemag.org/content/343/6168/229.summary
- Git has emerged as the de facto versioning tool
- Berkeley Common Environment (BCE) Software Stack
- “Reproducible and Collaborative Statistical Data
Science” (Statistics 157: P. Stark)
- Next up: Versioning (big) data?
Undergraduate & Graduate Training Mission
Thinking Data Literacy before
Thinking Big Data Proficiency
Fernando Pérez
IPython Creator
Fernando Pérez
IPython Creator
Julia
R
IPython notebook
is ~agnostic to the
backend
Established CS/Stats/Math in Service
of novelty in domain science
vs.
Novelty in domain science driving &
informing novelty in CS/Stats/Math
“novelty2 problem”
Extra Burden for Forefront Scientists
https://medium.com/tech-talk/dd88857f662
Berkeley Institute for Data Sciences (BIDS)
‣ Physical Space & New Entity dedicated
to the Moore/Sloan Data Science
principles
‣ Goal: rich resource and ecosystem for
domain scientists to connect &
collaborate with methodologists
http://bitly.com/bundles/fperezorg/1
“Bold new partnership launches to harness potential of data scientists and big data”
Berkeley Institute for Data Sciences
Berkeley Institute for Data Sciences
Towards an Inclusive Ecosystem
Expanding Participation Among
Underrepresented Groups
11%
56%
33%
female
male
decline to state
2013 Python
bootcamp
- 2013 AMP Camp: < 5% women at
- This Workshop: 2 women out of 22 speakers
- 2013 Python Seminar: 36% women
Summary
‣ Data Literacy before Big Data Proficiency
‣ Domain Science increasingly dependent
upon methodological competencies
‣ Higher-Ed Role of such training still TBD
• formal courses competes for time
‣ Need to create inclusive, collaborative
environments bridging domains & methodologies
“Training Students to Extract Value from Big Data” National Academies of Science, DC 11 April 2014@profjsb
@profjsb
Thank you.

More Related Content

What's hot

Solar System Processing with LSST: A Status Update
Solar System Processing with LSST: A Status UpdateSolar System Processing with LSST: A Status Update
Solar System Processing with LSST: A Status UpdateMario Juric
 
LSST Solar System Science: MOPS Status, the Science, and Your Questions
LSST Solar System Science: MOPS Status, the Science, and Your QuestionsLSST Solar System Science: MOPS Status, the Science, and Your Questions
LSST Solar System Science: MOPS Status, the Science, and Your QuestionsMario Juric
 
Insights to the Morphology of Planetary Nebulae from 3D Spectroscopy
Insights to the Morphology of Planetary Nebulae from 3D SpectroscopyInsights to the Morphology of Planetary Nebulae from 3D Spectroscopy
Insights to the Morphology of Planetary Nebulae from 3D SpectroscopyAshkbiz Danehkar
 
Novel Techniques & Connections Between High-Pressure Mineral Physics, Microto...
Novel Techniques & Connections Between High-Pressure Mineral Physics, Microto...Novel Techniques & Connections Between High-Pressure Mineral Physics, Microto...
Novel Techniques & Connections Between High-Pressure Mineral Physics, Microto...EarthCube
 
Ultra-Fast Outflows in Seyfert I AGN
Ultra-Fast Outflows in Seyfert I AGNUltra-Fast Outflows in Seyfert I AGN
Ultra-Fast Outflows in Seyfert I AGNAshkbiz Danehkar
 
A seven-planet resonant chain in TRAPPIST-1
A seven-planet resonant chain in TRAPPIST-1A seven-planet resonant chain in TRAPPIST-1
A seven-planet resonant chain in TRAPPIST-1Sérgio Sacani
 
Kinematical Properties of Planetary Nebulae with WR-type Nuclei
Kinematical Properties of Planetary Nebulae with WR-type NucleiKinematical Properties of Planetary Nebulae with WR-type Nuclei
Kinematical Properties of Planetary Nebulae with WR-type NucleiAshkbiz Danehkar
 
Astronomical Research in the Classroom with the Faulkes Telescope Project by ...
Astronomical Research in the Classroom with the Faulkes Telescope Project by ...Astronomical Research in the Classroom with the Faulkes Telescope Project by ...
Astronomical Research in the Classroom with the Faulkes Telescope Project by ...GTTP-GHOU-NUCLIO
 
ASTRONOMICAL OBJECTS DETECTION IN CELESTIAL BODIES USING COMPUTER VISION ALGO...
ASTRONOMICAL OBJECTS DETECTION IN CELESTIAL BODIES USING COMPUTER VISION ALGO...ASTRONOMICAL OBJECTS DETECTION IN CELESTIAL BODIES USING COMPUTER VISION ALGO...
ASTRONOMICAL OBJECTS DETECTION IN CELESTIAL BODIES USING COMPUTER VISION ALGO...csandit
 
Comaskey_William_Poster_SULI_FALL_2014
Comaskey_William_Poster_SULI_FALL_2014Comaskey_William_Poster_SULI_FALL_2014
Comaskey_William_Poster_SULI_FALL_2014William Comaskey
 
20131107 damasso great
20131107 damasso great20131107 damasso great
20131107 damasso greatOAVdA_APACHE
 
The kepler 10_planetary_system_revisited_by_harps
The kepler 10_planetary_system_revisited_by_harpsThe kepler 10_planetary_system_revisited_by_harps
The kepler 10_planetary_system_revisited_by_harpsSérgio Sacani
 
The big data Universe. Literally.
The big data Universe. Literally.The big data Universe. Literally.
The big data Universe. Literally.J On The Beach
 
Science Coffee - Algorithms to Monitor Telemetry for Subtle Indications of De...
Science Coffee - Algorithms to Monitor Telemetry for Subtle Indications of De...Science Coffee - Algorithms to Monitor Telemetry for Subtle Indications of De...
Science Coffee - Algorithms to Monitor Telemetry for Subtle Indications of De...Advanced-Concepts-Team
 
A dynamically packed_planetary _system_around_gj667_c_with_three_superearths_...
A dynamically packed_planetary _system_around_gj667_c_with_three_superearths_...A dynamically packed_planetary _system_around_gj667_c_with_three_superearths_...
A dynamically packed_planetary _system_around_gj667_c_with_three_superearths_...Sérgio Sacani
 

What's hot (20)

Solar System Processing with LSST: A Status Update
Solar System Processing with LSST: A Status UpdateSolar System Processing with LSST: A Status Update
Solar System Processing with LSST: A Status Update
 
LSST Solar System Science: MOPS Status, the Science, and Your Questions
LSST Solar System Science: MOPS Status, the Science, and Your QuestionsLSST Solar System Science: MOPS Status, the Science, and Your Questions
LSST Solar System Science: MOPS Status, the Science, and Your Questions
 
Cosmology Research in China
Cosmology Research in ChinaCosmology Research in China
Cosmology Research in China
 
Insights to the Morphology of Planetary Nebulae from 3D Spectroscopy
Insights to the Morphology of Planetary Nebulae from 3D SpectroscopyInsights to the Morphology of Planetary Nebulae from 3D Spectroscopy
Insights to the Morphology of Planetary Nebulae from 3D Spectroscopy
 
Novel Techniques & Connections Between High-Pressure Mineral Physics, Microto...
Novel Techniques & Connections Between High-Pressure Mineral Physics, Microto...Novel Techniques & Connections Between High-Pressure Mineral Physics, Microto...
Novel Techniques & Connections Between High-Pressure Mineral Physics, Microto...
 
Ultra-Fast Outflows in Seyfert I AGN
Ultra-Fast Outflows in Seyfert I AGNUltra-Fast Outflows in Seyfert I AGN
Ultra-Fast Outflows in Seyfert I AGN
 
Space Tug Rendezvous
Space Tug RendezvousSpace Tug Rendezvous
Space Tug Rendezvous
 
A seven-planet resonant chain in TRAPPIST-1
A seven-planet resonant chain in TRAPPIST-1A seven-planet resonant chain in TRAPPIST-1
A seven-planet resonant chain in TRAPPIST-1
 
Kinematical Properties of Planetary Nebulae with WR-type Nuclei
Kinematical Properties of Planetary Nebulae with WR-type NucleiKinematical Properties of Planetary Nebulae with WR-type Nuclei
Kinematical Properties of Planetary Nebulae with WR-type Nuclei
 
Astronomical Research in the Classroom with the Faulkes Telescope Project by ...
Astronomical Research in the Classroom with the Faulkes Telescope Project by ...Astronomical Research in the Classroom with the Faulkes Telescope Project by ...
Astronomical Research in the Classroom with the Faulkes Telescope Project by ...
 
ASTRONOMICAL OBJECTS DETECTION IN CELESTIAL BODIES USING COMPUTER VISION ALGO...
ASTRONOMICAL OBJECTS DETECTION IN CELESTIAL BODIES USING COMPUTER VISION ALGO...ASTRONOMICAL OBJECTS DETECTION IN CELESTIAL BODIES USING COMPUTER VISION ALGO...
ASTRONOMICAL OBJECTS DETECTION IN CELESTIAL BODIES USING COMPUTER VISION ALGO...
 
Comaskey_William_Poster_SULI_FALL_2014
Comaskey_William_Poster_SULI_FALL_2014Comaskey_William_Poster_SULI_FALL_2014
Comaskey_William_Poster_SULI_FALL_2014
 
20131107 damasso great
20131107 damasso great20131107 damasso great
20131107 damasso great
 
CV-AIHyaAAS
CV-AIHyaAASCV-AIHyaAAS
CV-AIHyaAAS
 
FDL 2017 3D Shape Modeling
FDL 2017 3D Shape ModelingFDL 2017 3D Shape Modeling
FDL 2017 3D Shape Modeling
 
The kepler 10_planetary_system_revisited_by_harps
The kepler 10_planetary_system_revisited_by_harpsThe kepler 10_planetary_system_revisited_by_harps
The kepler 10_planetary_system_revisited_by_harps
 
The big data Universe. Literally.
The big data Universe. Literally.The big data Universe. Literally.
The big data Universe. Literally.
 
AGanguly_ALuo
AGanguly_ALuoAGanguly_ALuo
AGanguly_ALuo
 
Science Coffee - Algorithms to Monitor Telemetry for Subtle Indications of De...
Science Coffee - Algorithms to Monitor Telemetry for Subtle Indications of De...Science Coffee - Algorithms to Monitor Telemetry for Subtle Indications of De...
Science Coffee - Algorithms to Monitor Telemetry for Subtle Indications of De...
 
A dynamically packed_planetary _system_around_gj667_c_with_three_superearths_...
A dynamically packed_planetary _system_around_gj667_c_with_three_superearths_...A dynamically packed_planetary _system_around_gj667_c_with_three_superearths_...
A dynamically packed_planetary _system_around_gj667_c_with_three_superearths_...
 

Viewers also liked

Joshua Bloom Data Science at Berkeley
Joshua Bloom Data Science at BerkeleyJoshua Bloom Data Science at Berkeley
Joshua Bloom Data Science at BerkeleyPyData
 
Corrall - Data literacy conceptions and pedagogies: Redefining information li...
Corrall - Data literacy conceptions and pedagogies: Redefining information li...Corrall - Data literacy conceptions and pedagogies: Redefining information li...
Corrall - Data literacy conceptions and pedagogies: Redefining information li...IL Group (CILIP Information Literacy Group)
 
Introduction to Ethics of Big Data
Introduction to Ethics of Big DataIntroduction to Ethics of Big Data
Introduction to Ethics of Big Data28 Burnside
 
Promoting Data Literacy at the Grassroots (ACRL 2015, Portland, OR)
Promoting Data Literacy at the Grassroots (ACRL 2015, Portland, OR)Promoting Data Literacy at the Grassroots (ACRL 2015, Portland, OR)
Promoting Data Literacy at the Grassroots (ACRL 2015, Portland, OR)Adam Beauchamp
 
Ethics of Big Data
Ethics of Big DataEthics of Big Data
Ethics of Big DataMatti Vesala
 
Service Design as Method: Library Services Developed from User's Needs
Service Design as Method: Library Services Developed from User's NeedsService Design as Method: Library Services Developed from User's Needs
Service Design as Method: Library Services Developed from User's NeedsLIBER Europe
 
HUMANITIES DATA LITERACY: STUDENT PERSPECTIVE ON DIGITAL CULTURAL HERITAGE CO...
HUMANITIES DATA LITERACY: STUDENT PERSPECTIVE ON DIGITAL CULTURAL HERITAGE CO...HUMANITIES DATA LITERACY: STUDENT PERSPECTIVE ON DIGITAL CULTURAL HERITAGE CO...
HUMANITIES DATA LITERACY: STUDENT PERSPECTIVE ON DIGITAL CULTURAL HERITAGE CO...LIBER Europe
 
Introduction to Ethics of Big Data
Introduction to Ethics of Big DataIntroduction to Ethics of Big Data
Introduction to Ethics of Big Data28 Burnside
 

Viewers also liked (9)

Joshua Bloom Data Science at Berkeley
Joshua Bloom Data Science at BerkeleyJoshua Bloom Data Science at Berkeley
Joshua Bloom Data Science at Berkeley
 
Corrall - Data literacy conceptions and pedagogies: Redefining information li...
Corrall - Data literacy conceptions and pedagogies: Redefining information li...Corrall - Data literacy conceptions and pedagogies: Redefining information li...
Corrall - Data literacy conceptions and pedagogies: Redefining information li...
 
Building Data Literacy Among Middle School Administrators and Teachers
Building Data Literacy Among Middle School Administrators and TeachersBuilding Data Literacy Among Middle School Administrators and Teachers
Building Data Literacy Among Middle School Administrators and Teachers
 
Introduction to Ethics of Big Data
Introduction to Ethics of Big DataIntroduction to Ethics of Big Data
Introduction to Ethics of Big Data
 
Promoting Data Literacy at the Grassroots (ACRL 2015, Portland, OR)
Promoting Data Literacy at the Grassroots (ACRL 2015, Portland, OR)Promoting Data Literacy at the Grassroots (ACRL 2015, Portland, OR)
Promoting Data Literacy at the Grassroots (ACRL 2015, Portland, OR)
 
Ethics of Big Data
Ethics of Big DataEthics of Big Data
Ethics of Big Data
 
Service Design as Method: Library Services Developed from User's Needs
Service Design as Method: Library Services Developed from User's NeedsService Design as Method: Library Services Developed from User's Needs
Service Design as Method: Library Services Developed from User's Needs
 
HUMANITIES DATA LITERACY: STUDENT PERSPECTIVE ON DIGITAL CULTURAL HERITAGE CO...
HUMANITIES DATA LITERACY: STUDENT PERSPECTIVE ON DIGITAL CULTURAL HERITAGE CO...HUMANITIES DATA LITERACY: STUDENT PERSPECTIVE ON DIGITAL CULTURAL HERITAGE CO...
HUMANITIES DATA LITERACY: STUDENT PERSPECTIVE ON DIGITAL CULTURAL HERITAGE CO...
 
Introduction to Ethics of Big Data
Introduction to Ethics of Big DataIntroduction to Ethics of Big Data
Introduction to Ethics of Big Data
 

Similar to Computational Training and Data Literacy for Domain Scientists

Computational Training for Domain Scientists & Data Literacy
Computational Training for Domain Scientists & Data LiteracyComputational Training for Domain Scientists & Data Literacy
Computational Training for Domain Scientists & Data LiteracyJoshua Bloom
 
Data Science at Berkeley
Data Science at BerkeleyData Science at Berkeley
Data Science at BerkeleyJoshua Bloom
 
ExoSGAN and ExoACGAN: Exoplanet Detection using Adversarial Training Algorithms
ExoSGAN and ExoACGAN: Exoplanet Detection using Adversarial Training AlgorithmsExoSGAN and ExoACGAN: Exoplanet Detection using Adversarial Training Algorithms
ExoSGAN and ExoACGAN: Exoplanet Detection using Adversarial Training AlgorithmsIRJET Journal
 
Astronomical Data Processing on the LSST Scale with Apache Spark
Astronomical Data Processing on the LSST Scale with Apache SparkAstronomical Data Processing on the LSST Scale with Apache Spark
Astronomical Data Processing on the LSST Scale with Apache SparkDatabricks
 
GrantProposalSethKrantzler_Fra
GrantProposalSethKrantzler_FraGrantProposalSethKrantzler_Fra
GrantProposalSethKrantzler_FraSeth Krantzler
 
The Emerging Cyberinfrastructure for Earth and Ocean Sciences
The Emerging Cyberinfrastructure for Earth and Ocean SciencesThe Emerging Cyberinfrastructure for Earth and Ocean Sciences
The Emerging Cyberinfrastructure for Earth and Ocean SciencesLarry Smarr
 
Toward a Global Interactive Earth Observing Cyberinfrastructure
Toward a Global Interactive Earth Observing CyberinfrastructureToward a Global Interactive Earth Observing Cyberinfrastructure
Toward a Global Interactive Earth Observing CyberinfrastructureLarry Smarr
 
A Review on Astronomy, Space Science and Technology Development in Thailand
A Review on Astronomy, Space Science and Technology Development in ThailandA Review on Astronomy, Space Science and Technology Development in Thailand
A Review on Astronomy, Space Science and Technology Development in ThailandILOAHawaii
 
Identifying Exoplanets with Machine Learning Methods: A Preliminary Study
Identifying Exoplanets with Machine Learning Methods: A Preliminary StudyIdentifying Exoplanets with Machine Learning Methods: A Preliminary Study
Identifying Exoplanets with Machine Learning Methods: A Preliminary StudyIJCI JOURNAL
 
A Search for Technosignatures Around 11,680 Stars with the Green Bank Telesco...
A Search for Technosignatures Around 11,680 Stars with the Green Bank Telesco...A Search for Technosignatures Around 11,680 Stars with the Green Bank Telesco...
A Search for Technosignatures Around 11,680 Stars with the Green Bank Telesco...Sérgio Sacani
 
201109021 mcguinness ska_meeting
201109021 mcguinness ska_meeting201109021 mcguinness ska_meeting
201109021 mcguinness ska_meetingDeborah McGuinness
 
FDL 2017 Lunar Water and Volatiles
FDL 2017 Lunar Water and VolatilesFDL 2017 Lunar Water and Volatiles
FDL 2017 Lunar Water and VolatilesLeonard Silverberg
 
The Possible Tidal Demise of Kepler’s First Planetary System
The Possible Tidal Demise of Kepler’s First Planetary SystemThe Possible Tidal Demise of Kepler’s First Planetary System
The Possible Tidal Demise of Kepler’s First Planetary SystemSérgio Sacani
 
Cyberinfrastructure to Support Ocean Observatories
Cyberinfrastructure to Support Ocean ObservatoriesCyberinfrastructure to Support Ocean Observatories
Cyberinfrastructure to Support Ocean ObservatoriesLarry Smarr
 
Research Data Infrastructure for Geochemistry (DFG Roundtable)
Research Data Infrastructure for Geochemistry (DFG Roundtable)Research Data Infrastructure for Geochemistry (DFG Roundtable)
Research Data Infrastructure for Geochemistry (DFG Roundtable)Kerstin Lehnert
 
Kepler 432b a_massive_warm_jupiter_in_a_52days_eccentric_orbit_transiting_a_g...
Kepler 432b a_massive_warm_jupiter_in_a_52days_eccentric_orbit_transiting_a_g...Kepler 432b a_massive_warm_jupiter_in_a_52days_eccentric_orbit_transiting_a_g...
Kepler 432b a_massive_warm_jupiter_in_a_52days_eccentric_orbit_transiting_a_g...Sérgio Sacani
 

Similar to Computational Training and Data Literacy for Domain Scientists (20)

Computational Training for Domain Scientists & Data Literacy
Computational Training for Domain Scientists & Data LiteracyComputational Training for Domain Scientists & Data Literacy
Computational Training for Domain Scientists & Data Literacy
 
Data Science at Berkeley
Data Science at BerkeleyData Science at Berkeley
Data Science at Berkeley
 
ExoSGAN and ExoACGAN: Exoplanet Detection using Adversarial Training Algorithms
ExoSGAN and ExoACGAN: Exoplanet Detection using Adversarial Training AlgorithmsExoSGAN and ExoACGAN: Exoplanet Detection using Adversarial Training Algorithms
ExoSGAN and ExoACGAN: Exoplanet Detection using Adversarial Training Algorithms
 
Astronomical Data Processing on the LSST Scale with Apache Spark
Astronomical Data Processing on the LSST Scale with Apache SparkAstronomical Data Processing on the LSST Scale with Apache Spark
Astronomical Data Processing on the LSST Scale with Apache Spark
 
GrantProposalSethKrantzler_Fra
GrantProposalSethKrantzler_FraGrantProposalSethKrantzler_Fra
GrantProposalSethKrantzler_Fra
 
The Emerging Cyberinfrastructure for Earth and Ocean Sciences
The Emerging Cyberinfrastructure for Earth and Ocean SciencesThe Emerging Cyberinfrastructure for Earth and Ocean Sciences
The Emerging Cyberinfrastructure for Earth and Ocean Sciences
 
Toward a Global Interactive Earth Observing Cyberinfrastructure
Toward a Global Interactive Earth Observing CyberinfrastructureToward a Global Interactive Earth Observing Cyberinfrastructure
Toward a Global Interactive Earth Observing Cyberinfrastructure
 
A Review on Astronomy, Space Science and Technology Development in Thailand
A Review on Astronomy, Space Science and Technology Development in ThailandA Review on Astronomy, Space Science and Technology Development in Thailand
A Review on Astronomy, Space Science and Technology Development in Thailand
 
Identifying Exoplanets with Machine Learning Methods: A Preliminary Study
Identifying Exoplanets with Machine Learning Methods: A Preliminary StudyIdentifying Exoplanets with Machine Learning Methods: A Preliminary Study
Identifying Exoplanets with Machine Learning Methods: A Preliminary Study
 
A Search for Technosignatures Around 11,680 Stars with the Green Bank Telesco...
A Search for Technosignatures Around 11,680 Stars with the Green Bank Telesco...A Search for Technosignatures Around 11,680 Stars with the Green Bank Telesco...
A Search for Technosignatures Around 11,680 Stars with the Green Bank Telesco...
 
Teletransportacion marcikic2003
Teletransportacion  marcikic2003Teletransportacion  marcikic2003
Teletransportacion marcikic2003
 
201109021 mcguinness ska_meeting
201109021 mcguinness ska_meeting201109021 mcguinness ska_meeting
201109021 mcguinness ska_meeting
 
FDL 2017 Lunar Water and Volatiles
FDL 2017 Lunar Water and VolatilesFDL 2017 Lunar Water and Volatiles
FDL 2017 Lunar Water and Volatiles
 
6%2E2017-2021
6%2E2017-20216%2E2017-2021
6%2E2017-2021
 
The Possible Tidal Demise of Kepler’s First Planetary System
The Possible Tidal Demise of Kepler’s First Planetary SystemThe Possible Tidal Demise of Kepler’s First Planetary System
The Possible Tidal Demise of Kepler’s First Planetary System
 
Cyberinfrastructure to Support Ocean Observatories
Cyberinfrastructure to Support Ocean ObservatoriesCyberinfrastructure to Support Ocean Observatories
Cyberinfrastructure to Support Ocean Observatories
 
MIRI & the James Webb Space Telescope
MIRI & the James Webb Space TelescopeMIRI & the James Webb Space Telescope
MIRI & the James Webb Space Telescope
 
Research Data Infrastructure for Geochemistry (DFG Roundtable)
Research Data Infrastructure for Geochemistry (DFG Roundtable)Research Data Infrastructure for Geochemistry (DFG Roundtable)
Research Data Infrastructure for Geochemistry (DFG Roundtable)
 
Kepler 432b a_massive_warm_jupiter_in_a_52days_eccentric_orbit_transiting_a_g...
Kepler 432b a_massive_warm_jupiter_in_a_52days_eccentric_orbit_transiting_a_g...Kepler 432b a_massive_warm_jupiter_in_a_52days_eccentric_orbit_transiting_a_g...
Kepler 432b a_massive_warm_jupiter_in_a_52days_eccentric_orbit_transiting_a_g...
 
Presentation
PresentationPresentation
Presentation
 

More from Joshua Bloom

Autoencoding RNN for inference on unevenly sampled time-series data
Autoencoding RNN for inference on unevenly sampled time-series dataAutoencoding RNN for inference on unevenly sampled time-series data
Autoencoding RNN for inference on unevenly sampled time-series dataJoshua Bloom
 
Industrial Machine Learning (SIGKDD17)
Industrial Machine Learning (SIGKDD17)Industrial Machine Learning (SIGKDD17)
Industrial Machine Learning (SIGKDD17)Joshua Bloom
 
Industrial Machine Learning (at GE)
Industrial Machine Learning (at GE)Industrial Machine Learning (at GE)
Industrial Machine Learning (at GE)Joshua Bloom
 
PyData 2015 Keynote: "A Systems View of Machine Learning"
PyData 2015 Keynote: "A Systems View of Machine Learning" PyData 2015 Keynote: "A Systems View of Machine Learning"
PyData 2015 Keynote: "A Systems View of Machine Learning" Joshua Bloom
 
Large-Scale Inference in Time Domain Astrophysics
Large-Scale Inference in Time Domain AstrophysicsLarge-Scale Inference in Time Domain Astrophysics
Large-Scale Inference in Time Domain AstrophysicsJoshua Bloom
 
Joshua Bloom: Machine Learning and Classification in the Synoptic Survey Era
Joshua Bloom: Machine Learning and Classification in the Synoptic Survey EraJoshua Bloom: Machine Learning and Classification in the Synoptic Survey Era
Joshua Bloom: Machine Learning and Classification in the Synoptic Survey EraJoshua Bloom
 

More from Joshua Bloom (6)

Autoencoding RNN for inference on unevenly sampled time-series data
Autoencoding RNN for inference on unevenly sampled time-series dataAutoencoding RNN for inference on unevenly sampled time-series data
Autoencoding RNN for inference on unevenly sampled time-series data
 
Industrial Machine Learning (SIGKDD17)
Industrial Machine Learning (SIGKDD17)Industrial Machine Learning (SIGKDD17)
Industrial Machine Learning (SIGKDD17)
 
Industrial Machine Learning (at GE)
Industrial Machine Learning (at GE)Industrial Machine Learning (at GE)
Industrial Machine Learning (at GE)
 
PyData 2015 Keynote: "A Systems View of Machine Learning"
PyData 2015 Keynote: "A Systems View of Machine Learning" PyData 2015 Keynote: "A Systems View of Machine Learning"
PyData 2015 Keynote: "A Systems View of Machine Learning"
 
Large-Scale Inference in Time Domain Astrophysics
Large-Scale Inference in Time Domain AstrophysicsLarge-Scale Inference in Time Domain Astrophysics
Large-Scale Inference in Time Domain Astrophysics
 
Joshua Bloom: Machine Learning and Classification in the Synoptic Survey Era
Joshua Bloom: Machine Learning and Classification in the Synoptic Survey EraJoshua Bloom: Machine Learning and Classification in the Synoptic Survey Era
Joshua Bloom: Machine Learning and Classification in the Synoptic Survey Era
 

Recently uploaded

Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesFatimaKhan178732
 
social pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajanpragatimahajan3
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAssociation for Project Management
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDThiyagu K
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfciinovamais
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3JemimahLaneBuaron
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docxPoojaSen20
 
Disha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfDisha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfchloefrazer622
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeThiyagu K
 
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...anjaliyadav012327
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfsanyamsingh5019
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfchloefrazer622
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxSayali Powar
 
9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room servicediscovermytutordmt
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Celine George
 

Recently uploaded (20)

Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and Actinides
 
social pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajan
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docx
 
Disha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfDisha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdf
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
 
9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room service
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 

Computational Training and Data Literacy for Domain Scientists

  • 1. Computational Training & Data Literacy for Domain Scientists Joshua Bloom UC Berkeley, Astronomy @profjsb “Training Students to Extract Value from Big Data” National Academies of Science, DC 11 April 2014
  • 2. What is the toolbox of the modern (data-driven) scientist? domain training statistics advanced computing database GUI parallel visualization Bayesian machine learning Physics laboratory techniques MCMC MapReduce
  • 3. And...How do we teach this with what little time the students have? What is the toolbox of the modern (data-driven) scientist?
  • 4. Astronomical Data Deluge Serious Challenge to Traditional Approaches & Toolkits
  • 5. Astronomical Data Deluge Serious Challenge to Traditional Approaches & Toolkits Large Synoptic Survey Telescope (LSST) - 2020 ! Light curves for 800M sources every 3 days 106 supernovae/yr, 105 eclipsing binaries 3.2 gigapixel camera, 20 TB/night LOFAR & SKA 150 Gps (27 Tflops) → 20 Pps (~100 Pflops) Gaia space astrometry mission - 2014 1 billion stars observed ∼70 times over 5 years Will observe 20K supernovae Many other astronomical surveys are already producing data: SDSS, iPTF, CRTS, Pan-STARRS, Hipparcos, OGLE, ASAS, Kepler, LINEAR, DES etc.,
  • 6. strategy scheduling observing reduction finding discovery classification followup inference Towards a Fully Automated Scientific Stack for Transients } current state-of-the-art stack automated (e.g. iPTF) not (yet) automated published work NSF/CDI NSF/BIGDATA
  • 7. Our ML framework found the Nearest Supernova in 3 Decades ..‣ Built & Deployed Real- time ML framework, discovering >10,000 events in > 10 TB of imaging → 50+ journal articles ‣ Built Probabilistic Event classification catalogs with innovative active learning http://timedomain.org https://www.nsf.gov/news/news_summ.jsp?cntn_id=122537
  • 8. Data-Centric Coursework, Bootcamps, Seminars, & Lecture Series BDAS: Berkeley Data Analytics Stack [Spark, Shark, ...] parallel programming bootcamp ...and entire degree programs
  • 9. Data-Centric Coursework, Bootcamps, Seminars, & Lecture Series BDAS: Berkeley Data Analytics Stack [Spark, Shark, ...] parallel programming bootcamp ...and entire degree programs Taught by CS/Stats Aimed at Engineers & Programmers Heading Toward Industry
  • 10. 2010: 85 campers 2012a: 135 campers Python Bootcamps at Berkeley
  • 11. a modern superglue computing language for science ‣ high-level scripting language ‣ open source, huge & growing community in academia & industry ‣ Just in time compilation but also fast numerical computation ‣ Extensive interfaces to 3rd party frameworks A reasonable lingua franca for scientists...
  • 12. 2012b: 210 campers Python Bootcamps at Berkeley 2013a: 253 campers
  • 13. ‣ 3 days of live/archive streamed lectures ‣ all open material in GitHub ‣ widely disseminated (e.g., @ NASA) ‣ funded (~$18k) by the Vice Chancellor for Research & NSF (BIGDATA) http://pythonbootcamp.info
  • 14. Part of the Designated Emphasis in Computation al Science & Engineering at Berkeley visualization machine learning database interaction user interface & web frameworks timeseries & numerical computing interfacing to other languages Bayesian inference & MCMC hardware control parallelism
  • 15. 64% 36% female male 8% 4% 8% 12% 4% 12% 8% 16% 16% 12%Psychology Astronomy Neuroscience Biostatistics Physics Chemical Engineering ISchool Earth and Planetary Sciences Industrial Engineering Mechanical Engineering “Parallel Image Reconstruction from Radio Interferometry Data” “Graph Theory Analysis of Growing Graphs” http://mb3152.github.io/Graph-Growth/ “Realtime Prediction of Activity Behavior from Smartphone” “Bus Arrival Time Prediction in Spain”
  • 16. Time domain preprocessing - Start with raw photometry! - Gaussian process detrending! - Calibration! - Petigura & Marcy 2012! ! Transit search - Matched filter! - Similar to BLS algorithm (Kovcas+ 2002)! - Leverages Fast-Folding Algorithm O(N^2) → O(N log N) (Staelin+ 1968)! ! Data validation - Significant peaks in periodogram, but inconsistent with exoplanet transit TERRA – optimized for small planets Detrended/calibrated photometry TERRA RawFlux(ppt)CalibratedFlux Erik Petigura Berkeley Astro Grad Student Petigura, Howard, & Marcy (20 Prevalence of Earth-size planets orbiting Sun-like stars Erik A. Petiguraa,b,1 , Andrew W. Howardb , and Geoffrey W. Marcya a Astronomy Department, University of California, Berkeley, CA 94720; and b Institute for Astronomy, University of Hawaii at Manoa, Honolulu, HI 96822 Contributed by Geoffrey W. Marcy, October 22, 2013 (sent for review October 18, 2013) Determining whether Earth-like planets are common or rare looms as a touchstone in the question of life in the universe. We searched for Earth-size planets that cross in front of their host stars by examining the brightness measurements of 42,000 stars from National Aeronautics and Space Administration’s Kepler mission. We found 603 planets, including 10 that are Earth size (1 − 2 R⊕) and receive comparable levels of stellar energy to that of Earth (0:25 − 4 F⊕). We account for Kepler’s imperfect detectability of such planets by injecting synthetic planet–caused dimmings into the Kepler brightness measurements and recording the fraction detected. We find that 11 ± 4% of Sun-like stars harbor an Earth- size planet receiving between one and four times the stellar inten- sity as Earth. We also find that the occurrence of Earth-size planets is constant with increasing orbital period (P), within equal intervals of logP up to ∼200 d. Extrapolating, one finds 5:7+1:7 −2:2 % of Sun-like stars harbor an Earth-size planet with orbital periods of 200–400 d. extrasolar planets | astrobiology The National Aeronautics and Space Administration’s (NASA’s) Kepler mission was launched in 2009 to search for planets that transit (cross in front of) their host stars (1–4). The resulting dimming of the host stars is detectable by measuring their bright- ness, and Kepler monitored the brightness of 150,000 stars every 30 min for 4 y. To date, this exoplanet survey has detected more than 3,000 planet candidates (4). The most easily detectable planets in the Kepler survey are those that are relatively large and orbit close to their host stars, especially those stars having lower intrinsic brightness fluctua- tions (noise). These large, close-in worlds dominate the list of known exoplanets. However, the Kepler brightness measurements can be analyzed and debiased to reveal the diversity of planets, We searched for transiting planets in Kepler brightness mea- surements using our custom-built TERRA software package described in previous works (6, 9) and in SI Appendix. In brief, TERRA conditions Kepler photometry in the time domain, re- moving outliers, long timescale variability (>10 d), and systematic errors common to a large number of stars. TERRA then searches for transit signals by evaluating the signal-to-noise ratio (SNR) of prospective transits over a finely spaced 3D grid of orbital period, P, time of transit, t0, and transit duration, ΔT. This grid-based search extends over the orbital period range of 0.5–400 d. TERRA produced a list of “threshold crossing events” (TCEs) that meet the key criterion of a photometric dimming SNR ratio SNR > 12. Unfortunately, an unwieldy 16,227 TCEs met this cri- terion, many of which are inconsistent with the periodic dimming profile from a true transiting planet. Further vetting was performed by automatically assessing which light curves were consistent with theoretical models of transiting planets (10). We also visually inspected each TCE light curve, retaining only those exhibiting a consistent, periodic, box-shaped dimming, and rejecting those caused by single epoch outliers, correlated noise, and other data anomalies. The vetting process was applied homogeneously to all TCEs and is described in further detail in SI Appendix. To assess our vetting accuracy, we evaluated the 235 Kepler objects of interest (KOIs) among Best42k stars having P > 50 d, which had been found by the Kepler Project and identified as planet candidates in the official Exoplanet Archive (exoplanetarchive. ipac.caltech.edu; accessed 19 September 2013). Among them, we found four whose light curves are not consistent with being planets. These four KOIs (364.01, 2,224.02, 2,311.01, and 2,474.01) have long periods and small radii (SI Appendix). This exercise suggests that our vetting process is robust and that careful scrutiny of the light curves of small planets in long period orbits is useful to identify false positives. ASTRONOMY Bootcamp/ Seminar Alum Python DOE/NERSC computation PNAS [2014]
  • 17. “Are we alone in the universe? What makes up the missing mass of the universe? ... And maybe the biggest question of all: How in the wide world can you add $3 billion in market capitalization simply by adding .com to the end of a name?” President William Jefferson Clinton Science and Technology Policy Address 21 January 2000 “Add Data Science or Big Data to your course name to increase enrollment by tenfold.” Joshua Bloom Just Now
  • 18. Python for Data Science @ Berkeley [Sept 2013]
  • 19. ‣ Where do Bootcamps & Seminars fit into traditional domain science curricula? - formal coursework competes with research obligations for graduate students ‣ Are they too vocational/practical for higher Ed? ‣ Who should teach them & how do we credit them?
  • 20. first this... ...then this. Undergraduate & Graduate Training Mission Thinking Data Literacy before Thinking Big Data Proficiency
  • 21. Undergraduate & Graduate Training Mission Thinking Data Literacy before Thinking Big Data Proficiency Data analysis recipes: Fitting a model to data⇤ David W. Hogg Center for Cosmology and Particle Physics, Department of Physics, New York University Max-Planck-Institut f¨ur Astronomie, Heidelberg Jo Bovy Center for Cosmology and Particle Physics, Department of Physics, New York University Dustin Lang Department of Computer Science, University of Toronto Princeton University Observatory Abstract We go through the many considerations involved in fitting a model to data, using as an example the fit of a straight line to a set of points in a two-dimensional plane. Standard weighted least-squares fitting is only appropriate when there is a dimension along which the data points have negligible uncertainties, and another along which all the uncertainties can be described by Gaussians of known variance; these 34 Fitting a straight line to data 0 50 100 150 200 250 300 x 0 100 200 300 400 500 600 700 y y = 1.33 x + 164 0 50 100 150 200 250 300 x 0 100 200 300 400 500 600 700 y 0.0 1.0 1.0 1.0 0.0 0.0 0.0 0.00.0 0.0 0.0 0.0 0.00.0 0.0 0.0 0.0 0.0 0.0 0.0 D Fi Da Cen Ma Jo Cen Du Dep Pri tha dit lin and arXiv:1008.4686v1 [astro-ph.IM] 27 Aug 2010 Statistical Inference
  • 22. Versioning & Reproducibility “Recently, the scientific community was shaken by reports that a troubling proportion of peer-reviewed preclinical studies are not reproducible.” McNutt, 2014 http://www.sciencemag.org/content/343/6168/229.summary - Git has emerged as the de facto versioning tool - Berkeley Common Environment (BCE) Software Stack - “Reproducible and Collaborative Statistical Data Science” (Statistics 157: P. Stark) - Next up: Versioning (big) data? Undergraduate & Graduate Training Mission Thinking Data Literacy before Thinking Big Data Proficiency
  • 26. Established CS/Stats/Math in Service of novelty in domain science vs. Novelty in domain science driving & informing novelty in CS/Stats/Math “novelty2 problem” Extra Burden for Forefront Scientists https://medium.com/tech-talk/dd88857f662
  • 27. Berkeley Institute for Data Sciences (BIDS) ‣ Physical Space & New Entity dedicated to the Moore/Sloan Data Science principles ‣ Goal: rich resource and ecosystem for domain scientists to connect & collaborate with methodologists http://bitly.com/bundles/fperezorg/1 “Bold new partnership launches to harness potential of data scientists and big data”
  • 28. Berkeley Institute for Data Sciences
  • 29. Berkeley Institute for Data Sciences
  • 30. Towards an Inclusive Ecosystem Expanding Participation Among Underrepresented Groups 11% 56% 33% female male decline to state 2013 Python bootcamp - 2013 AMP Camp: < 5% women at - This Workshop: 2 women out of 22 speakers - 2013 Python Seminar: 36% women
  • 31. Summary ‣ Data Literacy before Big Data Proficiency ‣ Domain Science increasingly dependent upon methodological competencies ‣ Higher-Ed Role of such training still TBD • formal courses competes for time ‣ Need to create inclusive, collaborative environments bridging domains & methodologies “Training Students to Extract Value from Big Data” National Academies of Science, DC 11 April 2014@profjsb