A talk at the Urban Science workshop at the Puget Sound Regional Council, July 20, 2014, organized by the Northwest Institute for Advanced Computing, a joint effort between Pacific Northwest National Laboratory and the University of Washington.
An invited talk in the Big Data session of the Industrial Research Institute meeting in Seattle, Washington.
Some notes on how to train data science talent and exploit the fact that the membrane between academia and industry has become more permeable.
A 25-minute talk from a panel on big data curricula at JSM 2013
http://www.amstat.org/meetings/jsm/2013/onlineprogram/ActivityDetails.cfm?SessionID=208664
Talk given at Los Alamos National Labs in Fall 2015.
As research becomes more data-intensive and platforms become more heterogeneous, we need to shift focus from performance to productivity.
Smart Data - How you and I will exploit Big Data for personalized digital hea... (Amit Sheth)
Amit Sheth's keynote at IEEE BigData 2014, Oct 29, 2014.
Abstract from:
http://cci.drexel.edu/bigdata/bigdata2014/keynotespeech.htm
Big Data has captured a lot of interest in industry, with the emphasis on the challenges of the four Vs of Big Data: Volume, Variety, Velocity, and Veracity, and their applications to drive value for businesses. Recently, there has been rapid growth in situations where a big data challenge relates to making individually relevant decisions. A key example is personalized digital health, which relates to making better decisions about our health, fitness, and well-being. Consider, for instance, understanding the reasons for, and avoiding, an asthma attack based on Big Data in the form of personal health signals (e.g., physiological data measured by devices/sensors or the Internet of Things around, on, and inside humans), public health signals (e.g., information coming from the healthcare system, such as hospital admissions), and population health signals (e.g., tweets by people related to asthma occurrences and allergens, or Web services providing pollen and smog information). However, no individual can process all these data without the help of appropriate technology, and each person has a different set of relevant data!
In this talk, I will describe Smart Data that is realized by extracting value from Big Data, to benefit not just large companies but each individual. If my child is an asthma patient, then for all the data relevant to my child with the four V-challenges, what I care about is simply: “How is her current health, and what is the risk of an asthma attack in her current situation (now and today), especially if that risk has changed?” As I will show, Smart Data that gives such personalized and actionable information will need to utilize metadata, use domain-specific knowledge, employ semantics and intelligent processing, and go beyond traditional reliance on ML and NLP. I will motivate the need for a synergistic combination of techniques, similar to the close interworking of the top brain and the bottom brain in cognitive models.
For harnessing Volume, I will discuss the concept of Semantic Perception: how to convert massive amounts of data into information, meaning, and insight useful for human decision-making. For dealing with Variety, I will discuss our experience in using agreement, represented in the form of ontologies, domain models, or vocabularies, to support semantic interoperability and integration. For Velocity, I will discuss more recent work on Continuous Semantics, which seeks to dynamically create models of new objects, concepts, and relationships, and to use them to better understand new cues in the data that capture rapidly evolving events and situations.
Smart Data applications in development at Kno.e.sis come from the domains of personalized health, energy, disaster response, and smart city.
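The Variety discussion above treats shared vocabularies as the "agreement" that makes heterogeneous sources interoperable. A minimal sketch of that idea, assuming two hypothetical sources with invented field names (this is illustrative only, not Kno.e.sis code; real systems would use published ontologies rather than a hand-built dictionary):

```python
# Two sources describe the same observation with different schemas.
source_a = {"temp_f": 98.6, "pt_id": "p42"}
source_b = {"temperature_celsius": 37.0, "patient": "p42"}

# A tiny shared vocabulary: canonical term -> (source field, unit converter).
# The canonical terms here are invented for illustration.
VOCAB = {
    "body_temperature_c": {
        "a": ("temp_f", lambda f: round((f - 32) * 5 / 9, 1)),
        "b": ("temperature_celsius", lambda c: c),
    },
    "patient_id": {
        "a": ("pt_id", lambda x: x),
        "b": ("patient", lambda x: x),
    },
}

def to_canonical(record, source):
    """Map a source-specific record onto the shared vocabulary."""
    out = {}
    for term, aliases in VOCAB.items():
        field, convert = aliases[source]
        if field in record:
            out[term] = convert(record[field])
    return out

# Both records now agree term-for-term and can be integrated.
print(to_canonical(source_a, "a"))  # {'body_temperature_c': 37.0, 'patient_id': 'p42'}
print(to_canonical(source_b, "b"))  # {'body_temperature_c': 37.0, 'patient_id': 'p42'}
```

Once both sources are expressed in the canonical vocabulary, downstream integration and querying no longer need to know which source a record came from.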
Citizen Sensor Data Mining, Social Media Analytics and Applications (Amit Sheth)
Opening talk at Singapore Symposium on Sentiment Analysis (S3A), February 6, 2015, Singapore. http://s3a.sentic.net/#s3a2015
Abstract
With the rapid rise in the popularity of social media and near-ubiquitous mobile access, the sharing of observations and opinions has become commonplace. This has given us unprecedented access to the pulse of a populace and the ability to perform analytics on social data to support a variety of socially intelligent applications -- be it for brand tracking and management, crisis coordination, organizing revolutions, or promoting social development in underdeveloped and developing countries.
I will review: 1) understanding and analysis of informal text, especially microblogs (e.g., issues of cultural entity extraction and the role of semantic/background-knowledge-enhanced techniques), and 2) how we built Twitris, a comprehensive social media analytics (social intelligence) platform.
I will describe the analysis capabilities along three dimensions: spatio-temporal-thematic, people-content-network, and sentiment-emotion-intent. I will couple technical insights with identification of computational techniques and real-world examples using live demos of Twitris (http://twitris2.knoesis.org).
The document provides an overview of funding and active projects at Kno.e.sis as of December 2015. Key details include total extramural funds exceeding $8.3 million with the majority obtained that year from competitive NSF and NIH sources. Active projects focus on areas such as context-aware harassment detection on social media, monitoring drug trends on social media, disaster management using social and physical sensing, and modeling social behavior for healthcare utilization in depression. The summary highlights student and faculty involvement and accomplishments across multiple funded projects.
Sensors and mobile devices are rapidly intertwining with the fabric of our lives. This has resulted in unprecedented growth in the number of observations from the physical and social worlds reported in the cyber world. A system of sensing and computational components embedded in the physical world is termed a Cyber-Physical System (CPS). The current science of CPS has yet to effectively integrate citizen observations into CPS analysis. We demonstrate the role of citizen observations in CPS and propose a novel approach to perform a holistic analysis of machine and citizen sensor observations. Specifically, we demonstrate the complementary, corroborative, and timely aspects of citizen sensor observations compared to machine sensor observations in Physical-Cyber-Social (PCS) Systems.
Physical processes are inherently complex and embody uncertainties. They manifest as machine and citizen sensor observations in PCS Systems. We propose a generic framework to move from observations to decision-making and actions in PCS systems, consisting of: (a) PCS event extraction, (b) PCS event understanding, and (c) PCS action recommendation. We demonstrate the role of Probabilistic Graphical Models (PGMs) as a unified framework to deal with the uncertainty, complexity, and dynamism involved in translating observations into actions. Data-driven approaches alone are not guaranteed to synthesize PGMs that accurately reflect real-world dependencies. To overcome this limitation, we propose to empower PGMs with declarative domain knowledge. Specifically, we propose four techniques: (a) automatic creation of massive training data for Conditional Random Fields (CRFs) using domain knowledge of entities, used in PCS event extraction; (b) Bayesian Network structure refinement using causal knowledge from ConceptNet, used in PCS event understanding; (c) knowledge-driven piecewise linear approximation of nonlinear time-series dynamics using Linear Dynamical Systems (LDS), used in PCS event understanding; and (d) transformation of knowledge of goals and actions into a Markov Decision Process (MDP) model, used in PCS action recommendation.
We evaluate the benefits of the proposed techniques on real-world applications involving traffic analytics and Internet of Things (IoT).
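Technique (d) above encodes knowledge of goals and actions as an MDP and solves it to recommend actions. A minimal sketch of that step, using a hypothetical traffic scenario with invented states, transitions, and rewards (not the abstract's actual model), solved by standard value iteration:

```python
# Toy traffic scenario: states of a road segment, actions a responder
# might recommend. Transition probabilities and rewards stand in for
# declarative knowledge of goals and action effects.
states = ["clear", "congested", "blocked"]
actions = ["wait", "reroute"]

# P[s][a] = list of (next_state, probability); R[s][a] = immediate reward.
P = {
    "clear":     {"wait": [("clear", 0.9), ("congested", 0.1)],
                  "reroute": [("clear", 1.0)]},
    "congested": {"wait": [("congested", 0.6), ("blocked", 0.4)],
                  "reroute": [("clear", 0.7), ("congested", 0.3)]},
    "blocked":   {"wait": [("blocked", 1.0)],
                  "reroute": [("congested", 0.8), ("blocked", 0.2)]},
}
R = {
    "clear":     {"wait": 1.0, "reroute": 0.5},
    "congested": {"wait": -1.0, "reroute": -0.5},
    "blocked":   {"wait": -2.0, "reroute": -1.0},
}

def value_iteration(gamma=0.9, eps=1e-6):
    """Solve the MDP; return state values and the greedy policy."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(
                R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a])
                for a in actions
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < eps:
            break
    policy = {
        s: max(actions,
               key=lambda a: R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a]))
        for s in states
    }
    return V, policy

V, policy = value_iteration()
print(policy)  # the recommended action for each observed state
```

In the PCS pipeline, the state would come from event extraction and understanding, and the policy lookup would drive the action recommendation.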
The Ohio Center of Excellence in Knowledge-enabled Computing at Wright State University:
1) Ranks second globally in research impact on the World Wide Web and hosts the largest academic research group in the US working on the semantic web, social media, big data, and health applications.
2) Has exceptional student success, with internships and jobs at top companies, and a total of 100 researchers, including 15 highly cited faculty and 45 PhD students, largely funded through $2M+ annually in research funding.
3) Provides world-class resources for multidisciplinary projects across information technology and domains like biomedicine, with collaboration from industry partners like Google and IBM.
Data Science in 2016: Moving up, by Paco Nathan at Big Data Spain 2015 (Big Data Spain)
This document discusses trends in data science in 2016, including how data science is moving into new use cases such as medicine, politics, government, and neuroscience. It also covers trends in hardware, generalized libraries, leveraging workflows, and frameworks that could enable a big leap ahead. The document discusses learning trends like MOOCs, inverted classrooms, collaborative learning, and how O'Reilly Media is embracing Jupyter notebooks. It also covers measuring distance between learners and subject communities, and the importance of both people and automation working together.
The document discusses the concept of "Broad Data" which refers to the large amount of freely available but widely varied open data on the World Wide Web, including structured and semi-structured data. It provides examples such as the growing linked open data cloud and over 710,000 datasets available from governments around the world. Broad data poses new challenges for data search, modeling, integration and visualization of partially modeled datasets. International open government data search and linking government data to additional contexts are also discussed.
This document summarizes a project on social and physical sensing enabled decision support for disaster management. The project involves a collaboration between Kno.e.sis at Wright State University and Ohio State University. It aims to extract relevant information from citizen sensed data, develop adaptive models of hurricane storm surge coupled with citizen and remote sensed data, and provide tools to assist first responders by integrating data from multiple sources. The project will analyze multimodal data and develop methodologies to predict consequences of infrastructure damage. It is supported by the National Science Foundation.
This document discusses how to make data more engaging for the public. It suggests using games, art, and storytelling to bring data closer to people. Data needs to entertain and excite people as well as inform them. Frameworks are examined for how varying levels of participation, localization, and shareability impact public engagement with factual evidence. Tools and guidance are proposed to help communities communicate about data in inspiring ways and achieve wider civic participation. The talk considers how data interaction research can help understand how people search for, make sense of, and share data stories on social media in order to design systems that better support these tasks.
The Evidence Hub: Harnessing the Collective Intelligence of Communities to Bu... (Anna De Liddo)
The Evidence Hub is a tool that harnesses collective intelligence to build evidence-based knowledge. It allows communities to gather and debate evidence for ideas and solutions. Users can easily add evidence, counter-evidence, and have conversations to share knowledge. Visual analytics show social dynamics like key players and agreements/disagreements. Future research focuses on defining participation roles and processes, and developing reporting, discourse analytics, and geo-deliberation analytics.
This document discusses data-centric education and learning. It begins by outlining past and present technologies used in education. It then discusses how data-centric learning is enabled by devices that connect to the cloud and collect real-time student data. This data can provide adaptive instruction, feedback, and insights into learning processes. Examples are given of social network analysis and predictive analytics projects using large educational datasets. Finally, frameworks for designing data-driven learning environments and strategies to improve performance are presented. The conclusion emphasizes using data and analytics responsibly and strategically to improve education.
RDAP14: Developing a cross-institutional data management plan for a major par... (ASIS&T)
The document discusses the challenges of building a cross-institutional data management plan for the GlueX Experiment, a large particle physics collaboration involving over 30 institutions and one national lab that is expected to generate 15 petabytes of data per year. It notes the legal and compliance issues that can arise when sharing research data across multiple institutions, including responsibility, infrastructure, intellectual property, data ownership, and export controls. Representatives from the collaborating institutions acknowledge the difficulty of developing a consistent cross-institutional data management plan but recognize the benefits of improved data sharing and access.
Guest presentation: SASUF Symposium: Digital Technologies, Big Data, and Cybersecurity, Vaal University of Technology, Vanderbijlpark, South Africa, 15 May 2018
TRANSFORMING BIG DATA INTO SMART DATA: Deriving Value via Harnessing Volume, ... (Amit Sheth)
Transforming big data into smart data involves deriving value from harnessing the volume, variety, and velocity of big data using semantics and the semantic web. This allows making sense of big data by providing actionable information that improves decision making. Examples discussed include a healthcare application called kHealth that uses personal sensor data along with population level data to provide personalized and timely health recommendations and interventions for conditions like asthma.
NITRD Big Data Interagency Working Group Workshop: Pioneering the Future of Federally Supported Data Repositories Jan 13, 2021 - Opening comments on where we are and one suggestion of where we might go with an International Data Science Institute (IDSI) - A blue sky view.
Talk given at Delft University speaker series on "Crowd Computing & Human-Centered AI" (https://www.academicfringe.org/). November 23, 2020. Covers two 2020 works:
(1) Anubrata Das, Brandon Dang, and Matthew Lease. Fast, Accurate, and Healthier: Interactive Blurring Helps Moderators Reduce Exposure to Harmful Content. In Proceedings of the 8th AAAI Conference on Human Computation and Crowdsourcing (HCOMP), 2020.
(2) Alexander Braylan and Matthew Lease. Modeling and Aggregation of Complex Annotations via Annotation Distances. In Proceedings of the Web Conference, pages 1807--1818, 2020.
Explainable Fact Checking with Humans in-the-loop (Matthew Lease)
Invited Keynote at KDD 2021 TrueFact Workshop: Making a Credible Web for Tomorrow, August 15, 2021.
https://www.microsoft.com/en-us/research/event/kdd-2021-truefact-workshop-making-a-credible-web-for-tomorrow/#!program-schedule
There are many online and in-person courses available for librarians to learn about research data management, data analysis, and visualization, but after you have taken a course, how do you go about applying what you have learned? While it is possible to just start offering classes and consultations, your service will have a better chance of becoming relevant if you consider stakeholders and review your institutional environment. This lecture will give you some ideas to get started with data services at your institution.
Philip Bourne presented his viewpoint on the future of open science at an NIAID workshop. He argued that as science becomes more democratized, it will lead to more scrutiny of research, a need for new types of rewards beyond publications and citations, and a removal of artificial boundaries between fields. As an example, he discussed how open science allowed two researchers working in different areas to connect via common data references in their notebooks. Bourne believes this digitization and interconnection of research will accelerate, transforming institutions into digital enterprises where digital assets are identifiable and interoperable. However, fully realizing this vision will require coordinating tools across the research lifecycle through common frameworks and developing new support structures.
Opening/Framing Comments: John Behrens, Vice President, Center for Digital Data, Analytics, & Adaptive Learning, Pearson
Discussion of how the field of educational measurement is changing: how long-held assumptions may no longer be taken for granted, and how new terminology and language are coming into use.
Panel 1: Beyond the Construct: New Forms of Measurement
This panel presents new views of what assessment can be and new species of big data that push our understanding of what can be used in evidentiary arguments.
Marcia Linn (UC Berkeley) and Lydia Liu (ETS) discuss continuous assessment of science and new kinds of constructs that relate to collaboration and student reasoning.
John Byrnes from SRI International discusses text and other semi-structured data sources and different methods of analysis.
Kristin Dicerbo from Pearson discusses hidden assessments and the different student interactions and events that can be used in inferential processes.
Panel 2: The Test is Just the Beginning: Assessments Meet Systems Context
This panel looks at how assessments are not the end game, but often the first step in larger big-data practices at the district, state, and national levels.
Gerald Tindal from the University of Oregon discusses State data systems and special education, including curriculum-based measurement across geographic settings.
Jack Buckley, Commissioner of the National Center for Education Statistics, discusses national datasets where tests and other data connect.
Lindsay Page and Will Marinell from the Strategic Data Project at Harvard discuss state and district datasets used for evaluating teachers, colleges of education, and student progress.
Panel 3: Connecting the Dots: Research Agendas to Integrate Different Worlds
This panel looks at how research organizations view the connections between the perspectives presented in Panels 1 and 2: what is known, and what is yet to be discovered in order to achieve the promise of big connected data in education.
Andrea Conklin Bueschel Program Director at the Spencer Foundation
Ed Dieterle Senior Program Officer at the Bill and Melinda Gates Foundation
Edith Gummer Program Manager at National Science Foundation
Elsevier CWTS Open Data Report Presentation at RDA Meeting in Barcelona - Elsevier
The Open Data report is a result of a year-long, co-conducted study between Elsevier and the Centre for Science and Technology Studies (CWTS), part of Leiden University, the Netherlands. The study is based on a complementary methods approach consisting of a quantitative analysis of bibliometric and publication data, a global survey of 1,200 researchers and three case studies including in-depth interviews with key individuals involved in data collection, analysis and deposition in the fields of soil science, human genetics and digital humanities.
This document discusses the need for critical infrastructure to promote data synthesis and evidence-based nutrient management. It outlines 10 steps for real-time data uptake, analysis, and customized nutrient recommendations. Key challenges include data standards, minimum data sets, provenance, and repositories. The Purdue University Research Repository is presented as a solution, providing preservation, curation, and publication of agricultural data. Hands-on support from librarians and agronomists is discussed to help researchers transition data and ensure best practices.
Realizing the Potential of Research Data - Carole L. Palmer
The document discusses the challenges and opportunities in realizing the potential of research data. It notes that while institutions are well positioned with expertise and infrastructure to support data-intensive research, the scale and pace of changes pose significant challenges. New programs have emerged to train experts in data curation and e-science, and there is an abundance of data repositories, standards, and initiatives. Realizing the full potential of research data will require overcoming issues of interoperability between heterogeneous distributed data sources and establishing consensus around data sharing policies and practices.
Data science applications and use cases were discussed. Examples included using data science in business for tasks like optimizing operations, healthcare to improve efficiency and care, and urban planning to address challenges in cities. Data science contrasts with other disciplines by combining technical skills from computer science, mathematics, and statistics to analyze large datasets. Case studies demonstrated data science applications in domains like cancer research using patterns in biomedical data, healthcare to power precision medicine, political campaigns using social media microtargeting, and the growing Internet of Things producing large volumes of data.
Data science applications can be found in many domains including business, healthcare, urban planning, and more. In business, data science is used to optimize operations and customer experiences. In healthcare, data science aims to improve efficiency, reduce readmissions, and enable earlier disease detection. For urban areas experiencing rapid growth, data science combines with urban informatics to help address challenges. Case studies show how data science is used in cancer research by leveraging large datasets and algorithms, in healthcare by Stanford and Google to advance precision medicine, in political elections through micro-targeting, and with the growing Internet of Things to analyze data from billions of connected devices.
CODATA International Training Workshop in Big Data for Science for Researchers - Johann van Wyk
Presentation at NeDICC Meeting on 16 July 2014. Feedback from CODATA International Training Workshop in Big Data for Science for Researchers from Emerging and Developing Countries, Beijing, China, 5-20 June 2014
Data science applications and use cases were discussed. Examples included using data science in business for tasks like car design and insurance, in healthcare for reducing readmissions and improving care, and in urban planning to address challenges in growing cities. Cancer research was highlighted as an area using big data analytics and machine learning to identify patterns linked to cancer. Healthcare examples included using genetic data at Stanford Medicine for precision health. Data science was applied to political elections through Obama's targeted social media campaigns. Finally, the growing field of internet of things was noted as an area that will produce huge volumes of data for analysis.
The document discusses the rise of data science and its disruptive impact on higher education. It analyzes precedents like bioinformatics that were enabled by new digital data sources and technologies. The author advocates that universities should embrace data science by establishing interdisciplinary collaborations, investing in data infrastructure, and ensuring research has societal value and responsibility.
1) Rensselaer Polytechnic Institute aims to incorporate data science education across its curriculum to develop "data dexterity" in every student.
2) A proposed core curriculum includes data-intensive courses in science and the major, as well as collaborative projects through a new Data INCITE laboratory.
3) The goal is for data management and analysis to become as fundamental as calculus, with open data sharing and verification of results.
A Revolution in Open Science: Open Data and the Role of Libraries (Professor ...) - LIBER Europe
This document discusses the opportunities and challenges of open science and open data. It argues that openly sharing scientific data and findings has significant benefits, including enabling faster scientific progress, deterring fraud, and supporting citizen science. However, for data to be truly open and useful to others, it needs to be accessible, intelligible, assessable, and reusable. The document also examines the roles and responsibilities of different stakeholders in working towards more open and reproducible science. This includes changing incentives for scientists, strategic funding for technical solutions from funders, and exploring how institutions like libraries and learned societies can help address the challenges of managing and making sense of the growing volume of research data.
What Data Science Will Mean to You - One Person's View, by Philip Bourne
This document provides an overview of data science from the perspective of Philip Bourne. Some key points:
- Data science is disruptive to higher education and all disciplines are being impacted by large amounts of digital data.
- Data science can be defined using a 4+1 model focusing on value, design, systems, analytics, and practice.
- Principles of excellence, inclusivity, openness, and fairness should guide data science work.
- Lessons from advances in computational biology and AlphaFold2 show the importance of open data, collaboration, and engineering challenges.
- A data science school should focus on responsible data practices while balancing open research that benefits patients.
Why should I care about information literacy? - nmjb
This document summarizes a workshop on improving researchers' competency in information handling and data management. The workshop covered how information literacy relates to researcher development, defined information literacy using the 7 Pillars model, and discussed national initiatives and case studies in applying information literacy. Participants engaged in group work applying information literacy concepts to the Researcher Development Framework and discussed motivation and examples of good practice in supporting information literacy development.
This document summarizes the Library Impact Data Project, which aimed to show correlations between library usage data (books borrowed, e-resources accessed) and student attainment across multiple universities. Phase 1 found statistical significance between library usage and grades. Phase 2 added more student data points and found further correlations with demographics. The project aims to create a shared analytics service to allow libraries to analyze usage and benchmark against peers. Key areas for the next phase include developing an intuitive dashboard, addressing ethical issues around profiling individuals, and integrating additional data sources.
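As an illustration of the kind of Phase 1 analysis described (testing whether library usage correlates with attainment), here is a minimal Pearson-correlation sketch in Python; the figures are invented, and the project's actual statistical tests may differ:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented example data: items borrowed vs. final grade per student.
books_borrowed = [2, 5, 9, 14, 20, 31]
final_grade = [52, 58, 61, 64, 70, 75]
r = pearson_r(books_borrowed, final_grade)
```

A strong positive r on data like this is what "statistical significance between library usage and grades" would look like at its simplest, before demographics and further data points are added.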
This presentation was provided by Kristi Holmes of Northwestern University during the NISO hot topic virtual conference "Effective Data Management," which was held on September 29, 2021.
Similar to Data Science and Urban Science @ UW:
The document discusses using machine learning techniques to learn vector representations of SQL queries that can then be used for various workload management tasks without requiring manual feature engineering. It shows that representations learned from SQL strings using models like Doc2Vec and LSTM autoencoders can achieve high accuracy for tasks like predicting query errors, auditing users, and summarizing workloads for index recommendation. These learned representations allow workload management to be database agnostic and avoid maintaining database-specific feature extractors.
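As a toy illustration of representing SQL text as vectors for workload tasks, the sketch below hashes query tokens into a fixed-length count vector and compares queries by cosine similarity. This is not the Doc2Vec or LSTM autoencoder models from that work, and the queries are invented; it only shows the database-agnostic "strings in, vectors out" idea:

```python
import math
import re
import zlib

DIM = 64  # small, illustrative dimensionality

def sql_to_vector(query, dim=DIM):
    """Map a SQL string to a fixed-length vector of hashed token counts.
    A crude stand-in for learned representations; it requires no manual
    feature engineering and no knowledge of the database schema."""
    vec = [0.0] * dim
    for tok in re.findall(r"[a-z_]+|\d+|[=<>*(),]", query.lower()):
        vec[zlib.crc32(tok.encode()) % dim] += 1.0
    return vec

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Invented queries: similar workloads should land near each other.
q1 = "SELECT name FROM users WHERE age > 30"
q2 = "SELECT name FROM users WHERE age > 40"
q3 = "DELETE FROM logs"
v1, v2, v3 = (sql_to_vector(q) for q in (q1, q2, q3))
```

Once queries live in a vector space, downstream tasks such as error prediction, user auditing, or workload summarization can operate on the vectors alone.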
This document discusses the responsible use of data science techniques and technologies. It describes data science as answering questions using large, noisy, and heterogeneous datasets that were collected for unrelated purposes. It raises concerns about the irresponsible use of data science, such as algorithms amplifying biases in data. The work of the DataLab group at the University of Washington is presented, which aims to address these issues by developing techniques to balance predictive accuracy with fairness, increase data sharing while protecting privacy, and ensure transparency in datasets and methods.
Brief remarks on big data trends and responsible data science at the Workshop on Science and Technology for Washington State: Advising the Legislature, October 4th 2017 in Seattle.
Talk at ISIM 2017 in Durham, UK on applying database techniques to querying model results in the geosciences, with a broader position about the interaction between data science and simulation as modes of scientific inquiry.
Bill Howe discussed emerging topics in responsible data science for the next decade. He described how the field will focus more on what should be done with data rather than just what can be done. Specifically, he talked about incorporating societal constraints like fairness, transparency and ethics into algorithmic decision making. He provided examples of unfair outcomes from existing algorithms and discussed approaches to measure and achieve fairness. Finally, he discussed the need for reproducibility in science and potential techniques for more automatic scientific claim checking and deep data curation.
This document discusses democratizing data science in the cloud. It describes how cloud data management involves sharing resources like infrastructure, schema, data, and queries between tenants. This sharing enables new query-as-a-service systems that can provide smart cross-tenant services by learning from metadata, queries, and data across all users. Examples of possible services discussed include automated data curation, query recommendation, data discovery, and semi-automatic data integration. The document also describes some cloud data systems developed at the University of Washington like SQLShare and Myria that aim to realize this vision.
The document discusses teaching data ethics in data science education. It provides context about the eScience Institute and a data science MOOC. It then presents a vignette on teaching data ethics using the example of an alcohol study conducted in Barrow, Alaska in 1979. The study had methodological and ethical issues in how it presented results to the community. The document concludes by discussing incorporating data ethics into all of the Institute's data science programs and initiatives like automated data curation and analyzing scientific literature visuals.
Data science remains a high-touch activity, especially in life, physical, and social sciences. Data management and manipulation tasks consume too much bandwidth: Specialized tools and technologies are difficult to use together, issues of scale persist despite the Cambrian explosion of big data systems, and public data sources (including the scientific literature itself) suffer curation and quality problems.
Together, these problems motivate a research agenda around “human-data interaction:” understanding and optimizing how people use and share quantitative information.
I’ll describe some of our ongoing work in this area at the University of Washington eScience Institute.
In the context of the Myria project, we're building a big data "polystore" system that can hide the idiosyncrasies of specialized systems behind a common interface without sacrificing performance. In scientific data curation, we are automatically correcting metadata errors in public data repositories with cooperative machine learning approaches. In the Viziometrics project, we are mining patterns of visual information in the scientific literature using machine vision, machine learning, and graph analytics. In the VizDeck and Voyager projects, we are developing automatic visualization recommendation techniques. In graph analytics, we are working on parallelizing best-of-breed graph clustering algorithms to handle multi-billion-edge graphs.
The common thread in these projects is the goal of democratizing data science techniques, especially in the sciences.
A talk I gave at the MMDS workshop June 2014 on the Myria system as well as some of Seung-Hee Bae's work on scalable graph clustering.
https://mmds-data.org/
This document summarizes a presentation about Myria, a relational algorithmics-as-a-service platform developed by researchers at the University of Washington. Myria allows users to write queries and algorithms over large datasets using declarative languages like Datalog and SQL, and executes them efficiently in a parallel manner. It aims to make data analysis scalable and accessible for researchers across many domains by removing the need to handle low-level data management and integration tasks. The presentation provides an overview of the Myria architecture and compiler framework, and gives examples of how it has been used for projects in oceanography, astronomy, biology and medical informatics.
Talk delivered at High Performance Transaction Processing 2013
Myria is a new Big Data service being developed at the University of Washington. We feature high level language interfaces, a hybrid graph-relational data model, database-style algebraic optimization, a comprehensive REST API, an iterative programming model suitable for machine learning and graph analytics applications, and a tight connection to new theories of parallel computation.
In this talk, we describe the motivation for another big data platform emphasizing requirements emerging from the physical, life, and social sciences.
The University of Washington eScience Institute aims to help position UW at the forefront of eScience techniques and technologies. Its strategy includes hiring research scientists, adding faculty in key fields, and building a consultancy of students. The exponential growth of data is transitioning science from data-poor to data-rich. Techniques like sensors, data management, and cloud computing are important. The "long tail" of smaller science projects is also worthy of investment and can have high impact if properly supported.
A taxonomy for data science curricula; a motivation for choosing a particular point in the design space; an overview of some of our activities, including a Coursera course slated for Spring 2012.
Relational databases remain underused in the long tail of science, despite a number of significant success stories and a natural correspondence between scientific inquiry and ad hoc database query. Barriers to adoption have been articulated in the past, but spreadsheets and other file-oriented approaches still dominate. At the University of Washington eScience Institute, we are exploring a new "delivery vector" for selected database features targeting researchers in the long tail: a web-based query-as-a-service system called SQLShare that eschews conventional database design, instead emphasizing a simple Upload-Query-Share workflow and exposing a direct, full-SQL query interface over "raw" tabular data. We augment the basic query interface with services for cleaning and integrating data, recommending and authoring queries, and automatically generating visualizations. We find that even non-programmers are able to create and share SQL views for a variety of tasks, including quality control, integration, basic analysis, and access control. Researchers in oceanography, molecular biology, and ecology report migrating data to our system from spreadsheets, from conventional databases, and from ASCII files. In this paper, we will provide some examples of how the platform has enabled science in other domains, describe our SQLShare system, and propose some emerging research directions in this space for the database community.
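The Upload-Query-Share workflow can be sketched with Python's built-in sqlite3; the table, data, and view names here are invented for illustration and do not reflect SQLShare's actual interface:

```python
import sqlite3

# "Upload": load raw tabular rows as-is, with no up-front schema design.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE casts (station TEXT, depth REAL, temp REAL)")
conn.executemany("INSERT INTO casts VALUES (?, ?, ?)", [
    ("P12", 5.0, 11.2), ("P12", 50.0, 9.8), ("P19", 5.0, 12.1),
])

# "Query": a direct, full-SQL interface over the raw table.
rows = conn.execute(
    "SELECT station, AVG(temp) AS avg_temp FROM casts GROUP BY station"
).fetchall()

# "Share": publish the query as a named view that others can build on.
conn.execute(
    "CREATE VIEW avg_temp_by_station AS "
    "SELECT station, AVG(temp) AS avg_temp FROM casts GROUP BY station"
)
```

The point of the design is visible even at this scale: no schema-design step, and the shareable artifact is a SQL view rather than a copied file.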
This document discusses the roles that cloud computing and virtualization can play in reproducible research. It notes that virtualization allows for capturing the full computational environment of an experiment. The cloud builds on this by providing scalable resources and services for storage, computation and managing virtual machines. Challenges include costs, handling large datasets, and cultural adoption issues. Databases in the cloud may help support exploratory analysis of large datasets. Overall, the cloud shows promise for improving reproducibility by enabling sharing of full experimental environments and resources for computationally intensive analysis.
This document discusses enabling end-to-end eScience through integrating query, workflow, visualization, and mashups at an ocean observatory. It describes using a domain-specific query algebra to optimize queries on unstructured grid data from ocean models. It also discusses enabling rapid prototyping of scientific mashups through visual programming frameworks to facilitate data integration and analysis.
This document describes HaLoop, a system that extends MapReduce to efficiently support iterative data processing on large clusters. HaLoop introduces caching mechanisms that allow loop-invariant data to be accessed without reloading or reshuffling between iterations. This improves performance for iterative algorithms like PageRank, transitive closure, and k-means clustering. The largest gains come from caching invariant data in the reducer input cache to avoid unnecessary loading and shuffling. HaLoop also eliminates extra MapReduce jobs for termination checking in some cases. Overall, HaLoop shows that minimal extensions to MapReduce can efficiently support a wide range of recursive programs and languages on large-scale clusters.
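A minimal sketch of the loop-invariant caching idea, using an in-memory toy PageRank rather than a real MapReduce cluster (the graph and parameters are invented):

```python
# Toy PageRank illustrating HaLoop's central idea: the loop-invariant
# link structure is cached once, while only the small rank table is
# recomputed each iteration. In real HaLoop this caching avoids
# reloading and reshuffling the invariant data between MapReduce jobs.
def pagerank(links, iterations=20, d=0.85):
    nodes = list(links)
    ranks = {n: 1.0 / len(nodes) for n in nodes}
    # Invariant data, computed once and reused every iteration.
    out_degree = {n: len(outs) for n, outs in links.items()}
    for _ in range(iterations):
        new = {n: (1 - d) / len(nodes) for n in nodes}
        for n, outs in links.items():  # read from the "cache", not disk
            for m in outs:
                new[m] += d * ranks[n] / out_degree[n]
        ranks = new
    return ranks

# Invented three-node cycle; by symmetry each rank converges to 1/3.
ranks = pagerank({"a": ["b"], "b": ["c"], "c": ["a"]})
```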
This document discusses query-driven visualization in the cloud using MapReduce. It begins by explaining how all science is reducing to a database problem as data is acquired en masse independently of hypotheses. It then discusses why visualization and a cloud approach are useful before reviewing relevant technologies like relational databases, MapReduce, GridFields mesh algebra, and VisTrails workflows. Preliminary results are shown for climatology queries on a shared cloud and core visualization algorithms on a private cluster using MapReduce.
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intelligent Systems - University of Maribor
Slides from talk:
Aleš Zamuda: Remote Sensing and Computational, Evolutionary, Supercomputing, and Intelligent Systems.
11th International Conference on Electrical, Electronics and Computer Engineering (IcETRAN), Niš, 3-6 June 2024
Inter-Society Networking Panel GRSS/MTT-S/CIS Panel Session: Promoting Connection and Cooperation
https://www.etran.rs/2024/en/home-english/
The binding of cosmological structures by massless topological defects - Sérgio Sacani
Assuming spherical symmetry and weak field, it is shown that if one solves the Poisson equation or the Einstein field equations sourced by a topological defect, i.e. a singularity of a very specific form, the result is a localized gravitational field capable of driving flat rotation (i.e. Keplerian circular orbits at a constant speed for all radii) of test masses on a thin spherical shell without any underlying mass. Moreover, a large-scale structure which exploits this solution by assembling concentrically a number of such topological defects can establish a flat stellar or galactic rotation curve, and can also deflect light in the same manner as an equipotential (isothermal) sphere. Thus, the need for dark matter or modified gravity theory is mitigated, at least in part.
The debris of the ‘last major merger’ is dynamically young - Sérgio Sacani
The Milky Way’s (MW) inner stellar halo contains an [Fe/H]-rich component with highly eccentric orbits, often referred to as the ‘last major merger.’ Hypotheses for the origin of this component include Gaia-Sausage/Enceladus (GSE), where the progenitor collided with the MW proto-disc 8–11 Gyr ago, and the Virgo Radial Merger (VRM), where the progenitor collided with the MW disc within the last 3 Gyr. These two scenarios make different predictions about observable structure in local phase space, because the morphology of debris depends on how long it has had to phase mix. The recently identified phase-space folds in Gaia DR3 have positive caustic velocities, making them fundamentally different from the phase-mixed chevrons found in simulations at late times. Roughly 20 per cent of the stars in the prograde local stellar halo are associated with the observed caustics. Based on a simple phase-mixing model, the observed number of caustics is consistent with a merger that occurred 1–2 Gyr ago. We also compare the observed phase-space distribution to FIRE-2 Latte simulations of GSE-like mergers, using a quantitative measurement of phase mixing (2D causticality). The observed local phase-space distribution best matches the simulated data 1–2 Gyr after collision, and certainly not later than 3 Gyr. This is further evidence that the progenitor of the ‘last major merger’ did not collide with the MW proto-disc at early times, as is thought for the GSE, but instead collided with the MW disc within the last few Gyr, consistent with the body of work surrounding the VRM.
Immersive Learning That Works: Research Grounding and Paths Forward - Leonel Morgado
We will metaverse into the essence of immersive learning, into its three dimensions and conceptual models. This approach encompasses elements from teaching methodologies to social involvement, through organizational concerns and technologies. Challenging the perception of learning as knowledge transfer, we introduce a ‘Uses, Practices & Strategies’ model operationalized by the ‘Immersive Learning Brain’ and ‘Immersion Cube’ frameworks. This approach offers a comprehensive guide through the intricacies of immersive educational experiences, spotlighting research frontiers along the immersion dimensions of system, narrative, and agency. Our discourse extends to stakeholders beyond the academic sphere, addressing the interests of technologists, instructional designers, and policymakers. We span various contexts, from formal education to organizational transformation to the new horizon of an AI-pervasive society. This keynote aims to unite the iLRN community in a collaborative journey towards a future where immersive learning research and practice coalesce, paving the way for innovative educational research and practice landscapes.
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Westerlund 1 and 2 Open Clusters Survey - Sérgio Sacani
Context. With a mass exceeding several 10⁴ M⊙ and a rich and dense population of massive stars, supermassive young star clusters represent the most massive star-forming environments, dominated by feedback from massive stars and gravitational interactions among stars.
Aims. In this paper we present the Extended Westerlund 1 and 2 Open Clusters Survey (EWOCS) project, which aims to investigate the influence of the starburst environment on the formation of stars and planets, and on the evolution of both low- and high-mass stars. The primary targets of this project are Westerlund 1 and 2, the closest supermassive star clusters to the Sun.
Methods. The project is based primarily on recent observations conducted with the Chandra and JWST observatories. Specifically, the Chandra survey of Westerlund 1 consists of 36 new ACIS-I observations, nearly co-pointed, for a total exposure time of 1 Msec. Additionally, we included 8 archival Chandra/ACIS-S observations. This paper presents the resulting catalog of X-ray sources within and around Westerlund 1. Sources were detected by combining various existing methods, and photon extraction and source validation were carried out using the ACIS-Extract software.
Results. The EWOCS X-ray catalog comprises 5963 validated sources out of the 9420 initially provided to ACIS-Extract, reaching a photon flux threshold of approximately 2 × 10⁻⁸ photons cm⁻² s⁻¹. The X-ray sources exhibit a highly concentrated spatial distribution, with 1075 sources located within the central 1 arcmin. We have successfully detected X-ray emissions from 126 out of the 166 known massive stars of the cluster, and we have collected over 71 000 photons from the magnetar CXO J164710.20-455217.
Describing and Interpreting an Immersive Learning Case with the Immersion Cube - Leonel Morgado
Current descriptions of immersive learning cases are often difficult or impossible to compare. This is due to a myriad of different options on what details to include, which aspects are relevant, and on the descriptive approaches employed. Also, these aspects often combine very specific details with more general guidelines or indicate intents and rationales without clarifying their implementation. In this paper we provide a method to describe immersive learning cases that is structured to enable comparisons, yet flexible enough to allow researchers and practitioners to decide which aspects to include. This method leverages a taxonomy that classifies educational aspects at three levels (uses, practices, and strategies) and then utilizes two frameworks, the Immersive Learning Brain and the Immersion Cube, to enable a structured description and interpretation of immersive learning cases. The method is then demonstrated on a published immersive learning case on training for wind turbine maintenance using virtual reality. Applying the method results in a structured artifact, the Immersive Learning Case Sheet, that tags the case with its proximal uses, practices, and strategies, and refines the free text case description to ensure that matching details are included. This contribution is thus a case description method in support of future comparative research of immersive learning cases. We then discuss how the resulting description and interpretation can be leveraged to change immersion learning cases, by enriching them (considering low-effort changes or additions) or innovating (exploring more challenging avenues of transformation). The method holds significant promise to support better-grounded research in immersive learning.
The ability to recreate computational results with minimal effort and actionable metrics provides a solid foundation for scientific research and software development. When people can replicate an analysis at the touch of a button using open-source software, open data, and methods to assess and compare proposals, it significantly eases verification of results, engagement with a diverse range of contributors, and progress. However, we have yet to fully achieve this; there are still many sociotechnical frictions.
Inspired by David Donoho's vision, this talk aims to revisit the three crucial pillars of frictionless reproducibility (data sharing, code sharing, and competitive challenges) with the perspective of deep software variability.
Our observation is that multiple layers — hardware, operating systems, third-party libraries, software versions, input data, compile-time options, and parameters — are subject to variability that exacerbates frictions but is also essential for achieving robust, generalizable results and fostering innovation. I will first review the literature, providing evidence of how the complex variability interactions across these layers affect qualitative and quantitative software properties, thereby complicating the reproduction and replication of scientific studies in various fields.
I will then present some software engineering and AI techniques that can support the strategic exploration of variability spaces. These include the use of abstractions and models (e.g., feature models), sampling strategies (e.g., uniform, random), cost-effective measurements (e.g., incremental build of software configurations), and dimensionality reduction methods (e.g., transfer learning, feature selection, software debloating).
I will finally argue that deep variability is both the problem and the solution of frictionless reproducibility, and call on the software science community to develop new methods and tools to manage variability and foster reproducibility in software systems.
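The sampling strategies mentioned above (e.g., uniform, random) can be illustrated on a tiny configuration space. This is only a sketch: the boolean option names and the cost model below are invented for the example, standing in for real compile-time options and real measurements.

```python
import random

# Hypothetical boolean compile-time options; names are invented for the example.
OPTIONS = ["opt_level_high", "debug_symbols", "use_simd", "static_link"]

def sample_configurations(n, seed=0):
    """Uniform random sampling from the 2^|OPTIONS| configuration space."""
    rng = random.Random(seed)
    return [{opt: rng.choice([True, False]) for opt in OPTIONS} for _ in range(n)]

def measure(config):
    """Stand-in for a real measurement (e.g., build time or binary size)."""
    weights = {"opt_level_high": 3, "debug_symbols": 5, "use_simd": 1, "static_link": 4}
    return sum(w for opt, w in weights.items() if config[opt])

configs = sample_configurations(10)
costs = [measure(c) for c in configs]
print(min(costs), max(costs))
```

In a real study, `measure` would run a build or benchmark, and the sampled measurements would feed the learning techniques (transfer learning, feature selection) listed above.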
Invited talk at the Journées Nationales du GDR GPL 2024.
Or: Beyond linear.
Abstract: Equivariant neural networks are neural networks that incorporate symmetries. The nonlinear activation functions in these networks result in interesting nonlinear equivariant maps between simple representations, and motivate the key player of this talk: piecewise linear representation theory.
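As a minimal numeric instance of the abstract's point, the sketch below (illustrative, not from the talk) checks that an elementwise ReLU commutes with the permutation action of S_n on coordinates, i.e., ReLU is an equivariant map for the permutation representation.

```python
def relu(v):
    """Elementwise ReLU applied coordinate by coordinate."""
    return [max(0.0, x) for x in v]

def permute(v, perm):
    """Permutation action of S_n on coordinates: (perm . v)[i] = v[perm[i]]."""
    return [v[i] for i in perm]

# Equivariance check: applying ReLU commutes with permuting coordinates,
# i.e. relu(perm . v) == perm . relu(v) for the permutation representation.
v = [1.5, -2.0, 0.0, 3.0]
perm = [2, 0, 3, 1]
lhs = relu(permute(v, perm))
rhs = permute(relu(v), perm)
print(lhs == rhs)  # True
```

The interesting representation theory starts here: ReLU is equivariant but not linear, which is what motivates the piecewise linear maps the talk studies.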
Disclaimer: No one is perfect, so please mind that there might be mistakes and typos.
dtubbenhauer@gmail.com
Corrected slides: dtubbenhauer.com/talks.html
2.
“It’s a great time to be a data geek.”
-- Roger Barga, Microsoft Research
“The greatest minds of my generation are trying to figure out how to make people click on ads”
-- Jeff Hammerbacher, co-founder, Cloudera
3. The Fourth Paradigm
1. Empirical + experimental
2. Theoretical
3. Computational
4. Data-Intensive
Jim Gray
7/21/2014 Bill Howe, UW 3
4. “All across our campus, the process of discovery will increasingly rely on researchers’ ability to extract knowledge from vast amounts of data… In order to remain at the forefront, UW must be a leader in advancing these techniques and technologies, and in making [them] accessible to researchers in the broadest imaginable range of fields.”
2005-2008
In other words:
• Data-driven discovery will be ubiquitous
• UW must be a leader in inventing the capabilities
• UW must be a leader in translational activities – in putting these capabilities to work
• It’s about intellectual infrastructure (human capital) and software infrastructure (shared tools and services – digital capital)
5. A 5-year, US$37.8 million cross-institutional collaboration to create a data science environment
2014
6. $9.3 million from Washington Research Foundation to amplify the Moore/Sloan effort
• 6 × 5-year faculty lines in Data Science
• 6 × startup packages
• 15 × 3-year postdoctoral fellows
• Funds to remodel and furnish a WRF Data Science Studio
• Also $7.1 million to the closely related Institute for Neuroengineering, $8.0 million to the Institute for Protein Design, and $6.7 million to the Clean Energy Institute
7. Data Science Kickoff Session: 137 posters from 30+ departments and units
8. Broad collaborations
PIs on Moore/Sloan effort + eScience Institute Steering Committee + UW participants in February 7 Data Science poster session
9. Establish a virtuous cycle
• 6 working groups, each with 3–6 faculty from each institution
10. Key Activity: Promote interdisciplinary careers
• Interdisciplinary graduate students
– New, interdisciplinary “Data Science” Ph.D. tracks and program
• Interdisciplinary postdocs (“Data Science Fellows”)
– Dual-mentored postdocs with interests in both methods and a domain science
• Interdisciplinary research scientists (“Data Scientists”)
• Work across disciplines to solve people’s data science challenges
• Interdisciplinary faculty
– Supported with special hiring and funding initiatives
• “Senior Research Fellows”
– Short-term and long-term visitors
• A diverse faculty steering committee
11. UW Data Science Education Efforts
Audiences: students (CS/Informatics undergrads and grads; non-major undergrads and grads) and non-students (professionals and researchers)
• UWEO Data Science Certificate
• MOOC: Intro to Data Science
• IGERT: Big Data PhD Track
• New CS courses
• Bootcamps and workshops
• Intro to Data Programming
• Data Science Masters (planned)
• Incubator: hands-on training
12. Educational transformation · Big Data access and management · Big Data modeling · Big Data analytics · Collaborative Big Data science
Key Activity: Foster Interdisciplinary Education
• Ultimate goal: A new PhD program
– Initial goal: A new certificate based on Big Data tracks in all departments
– Education highlights: data science courses, co-advising, and internships
• End-to-End Research Agenda
– Big Data mgmt, analytics, modeling, & collaboration
• Cyberinfrastructure Development
– Big Data analysis service
13. • Additional data science educational activities
– Coursera MOOCs
• Introduction to Data Science (Bill Howe)
• Computational Methods of Data Analysis (Nathan Kutz)
• High Performance Scientific Computing (Randy LeVeque)
– Traditional courses
• Many! Example: Biochemistry for Computer Scientists (Joe Hellerstein)
• We try to list relevant courses on the eScience Institute website
– UW Educational Outreach
• 3-course Certificate in Data Science
• 3-course Certificate in Cloud Data Management & Analytics
• 3-course Certificate in Cloud Application Development on Amazon Web Services
• 3-course Certificate in Data Visualization
– Workshops and bootcamps
• Software Carpentry (Winter & Spring 2013; Winter, Spring, & Summer 2014)
• Cosmology and Machine Learning (Autumn 2014)
14. • An open shared R&D space where researchers from
across the campus will come to collaborate
• A resident data science team
– Permanent staff of ~5 Data Scientists – applied research and development
– ~15-20 Data Science Fellows (research scientists, visitors, postdocs, students)
– Entrepreneurial mentorship
• Modes of engagement
– Drop-in open workspace
– Studio “Office Hours”
– Incubation Program
– Plus seminars, sponsored lunches, workshops, bootcamps, joint proposals…
Key Activity: “Re-establish the watercooler”
15. Key Activity: Create scalable impact through a
Data Science Incubation Program
• Scale and concentrate our efforts
– Move from “accidental” encounters to engineered partnerships
– Identify emerging opportunities around campus
– Provide a shared environment where researchers can learn from an in-house team, external mentors, and each other
• A startup environment!
– “Seed grant” program
• Lightweight – 1-page proposals
– Significant potential for technology spinout – new markets for existing technology and new technology for existing markets
16. Key Activity: Democratize Access to Big Data and Big Data Infrastructure
• SQLShare: Database-as-a-Service for scientists and engineers
• Myria: Easy, Scalable Analytics-as-a-Service
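SQLShare's premise is that analysts load data once and then manipulate it with plain, declarative SQL rather than one-off scripts. The sketch below uses Python's standard-library sqlite3 purely as a stand-in to illustrate that workflow; it is not SQLShare's API, and the table and values are invented.

```python
import sqlite3

# sqlite3 is only a stand-in here: SQLShare itself is a hosted service, but
# the workflow is the same -- load a table, then analyze it with plain SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (site TEXT, depth REAL, temp REAL)")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?, ?)",
    [("A", 10.0, 12.5), ("A", 20.0, 11.5), ("B", 10.0, 13.0)],
)

# The analyst writes a query instead of a custom processing script.
rows = conn.execute(
    "SELECT site, AVG(temp) FROM samples GROUP BY site ORDER BY site"
).fetchall()
print(rows)  # [('A', 12.0), ('B', 13.0)]
conn.close()
```

The database-as-a-service point is that this declarative step, not the hosting, is what lets non-specialists get answers from shared data.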
17. Open Data sharing platforms
• Database-as-a-service for open data analytics
• Interoperable with external tools and languages
• Local or cloud deployments
• Interoperable with existing database platforms
• Built-in data integration, profiling, analytics
Related platforms: Google Fusion Tables
Entrepreneurship
1) “Data once guarded for assumed but untested reasons is now open, and we're seeing benefits.”
-- Nigel Shadbolt, Open Data Institute
2) Need to help “non-specialists within an organization use data that had been the realm of programmers and DB admins”
-- Benjamin Romano, Xconomy
3) “Businesses are now using data the way scientists always have”
-- Jeff Hammerbacher, Cloudera
24. “Much of the material remains unprocessed, or, if processed, unanalyzed, or, if analyzed, not read, or, if read, not used or acted upon”
Objectives
• Design generalizable method to process HIS-like data
• Make important dataset available for analysis
• Explore actionable data analysis of HIS data
Why do we care?
25. Metadata Trace - saving
Reports of year n saved in January of year n+1
Years were not recorded for the first year of use…
26.
27.
28. REDPy
Repeating Earthquake Detector (Python)
An eScience Incubator Project
Project Lead: Alicia Hotovec-Ellis
Data Scientist: Jake Vanderplas
John Vidale, Alicia Hotovec-Ellis, Jake Vanderplas
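REDPy finds repeating earthquakes by comparing event waveforms for similarity. The sketch below is not REDPy's implementation; it only illustrates the underlying idea with a zero-lag normalized cross-correlation between two invented traces (a real detector correlates over lags and scans continuous data).

```python
import math

def norm_correlation(a, b):
    """Zero-lag normalized cross-correlation of two equal-length traces."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    return num / den

# Invented "waveforms": the second is a scaled copy of the first, i.e. a
# textbook repeating event; a high correlation flags the pair as repeats.
trace1 = [0.0, 1.0, -0.5, 2.0, -1.0, 0.5]
trace2 = [2.0 * x for x in trace1]
cc = norm_correlation(trace1, trace2)
print(cc > 0.9)  # True
```

Thresholding this coefficient (repeats at the same source with the same path produce near-identical waveforms regardless of amplitude) is what lets small, routinely missed events be grouped into families.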
33. I talked with Alicia a bit yesterday, and she showed me that her earthquake-repeater-searching implementation is more general, and more powerful than I had thought, and closer to trial by others (and I have a particular use in mind in the ongoing iMUSH experiment on Mount St Helens) <snip>
So I'm encouraging her to continue to work on it a day per week or so for the foreseeable future, assuming you have the facilities to continue the incubation.
The project outlives the incubator…
Publications in the works on both the software and the science – from three months of half-time work
34. Using Twitter data to identify geographic clustering of anti-vaccination sentiments
Ben Brooks
June 12, 2014
Benjamin Brooks, Andrew Whitaker, Abie Flaxman
35. Initial approach
• Sentiment regarding vaccination can be discerned from Twitter.
• Can we find city- or county-level pockets of anti-vaccination sentiment?
• Do these locales correlate with outbreak and vaccination rate data (beyond H1N1)?
36. Training data issues
• Training data from PSU study labeled tweets as positive, negative, neutral, or irrelevant.
• Many tweet categorizations seemed suspect.
• Produced new training dataset; switched approach to negative tweets vs. all others.
• Of tweets we labeled as negative, PSU training data agreed with 36%.
• Sample non-negative tweets in training dataset from PSU study:
  – “RT @Lyn_Sue Lyn_Sue18 Reasons Why u Should NOT Vaccinate Your Children Against The Flu This Season”
  – “1882 -3 O RT @alexHroz Citizens From All Walks Intend To Refuse Swine Flu "Vaccine,”
  – “Eighteen Reasons Why You Should NOT Vaccinate Your Children Against The Flu This Season by Bill Sard”
  – “Swine Flu Vaccine not necessary and not healthy:”
37. Background: Previous work
• “For our sentiment classification, we used an ensemble method combining the Naive Bayes and the Maximum Entropy classifiers… The accuracy of this ensemble classifier was 84.29%.”
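For context on this kind of classifier, here is a minimal bag-of-words Naive Bayes sketch with Laplace smoothing. It is illustrative only: the study quoted above used an ensemble of Naive Bayes and Maximum Entropy, and the toy documents and labels below are invented.

```python
import math
from collections import Counter

def train(docs):
    """docs: list of (text, label). Returns per-label word counts and doc counts."""
    counts, totals = {}, Counter()
    for text, label in docs:
        c = counts.setdefault(label, Counter())
        for w in text.lower().split():
            c[w] += 1
        totals[label] += 1
    return counts, totals

def classify(text, counts, totals):
    """Pick the label maximizing log P(label) + sum log P(word | label)."""
    vocab = {w for c in counts.values() for w in c}
    n = sum(totals.values())
    best, best_lp = None, float("-inf")
    for label, c in counts.items():
        lp = math.log(totals[label] / n)
        denom = sum(c.values()) + len(vocab)  # Laplace (add-one) smoothing
        for w in text.lower().split():
            lp += math.log((c[w] + 1) / denom)
        if lp > best_lp:
            best, best_lp = label, lp
    return best

# Invented training tweets for the binary negative-vs-others framing above.
docs = [
    ("vaccines are safe and effective", "other"),
    ("get your flu shot today", "other"),
    ("do not vaccinate your children", "negative"),
    ("refuse the flu vaccine", "negative"),
]
counts, totals = train(docs)
print(classify("refuse to vaccinate your children", counts, totals))
```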
38. Other sentiment approaches
• Precision: of all tweets labeled negative by the algorithm, what percentage are “true negatives”?
• Recall: of all “true negative” tweets, what percentage are labeled negative by the algorithm?

                              Precision   Recall
Vaccine-specific keywords        19%       59%
Modified general sentiment       25%       41%
Naïve Bayes                      79%       19%
Logistic regression              70%       28%
Labeled data from PSU study      41%       36%
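The precision and recall definitions on this slide can be computed directly from parallel lists of predicted and true labels. A minimal sketch, with labels invented for illustration (not the study's data):

```python
def precision_recall(predicted, actual):
    """Precision and recall for the 'negative-sentiment' label.

    predicted/actual are parallel lists of booleans; True means the tweet is
    (predicted to be / truly) negative toward vaccination.
    """
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    precision = tp / sum(predicted)  # of tweets labeled negative, share truly negative
    recall = tp / sum(actual)        # of truly negative tweets, share labeled negative
    return precision, recall

# Invented labels for five tweets, for illustration only.
predicted = [True, True, False, False, True]
actual    = [True, False, False, True, True]
p, r = precision_recall(predicted, actual)
print(round(p, 2), round(r, 2))  # 0.67 0.67
```

The table's pattern then reads naturally: keyword matching is high-recall/low-precision, while the trained classifiers trade recall for precision.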
39. Other sentiment approaches
• Data labeled by human beings does not perform dramatically better than other classifiers!

                              Precision   Recall
Vaccine-specific keywords        19%       59%
Modified general sentiment       25%       41%
Naïve Bayes                      79%       19%
Logistic regression              70%       28%
Labeled data from PSU study      41%       36%
40. Scalable Analytics over Call Record Data in Developing Nations
Project Lead: Ian Kelley, Information School, University of Washington
E-mail: ikelley@uw.edu
eScience Data Incubator - 12 June 2014
Ian Kelley, Andrew Whitaker, Josh Blumenstock
41. Map migration patterns of workers during labor market shortages (Rwanda)
Measure and categorize mobility patterns
Determine people’s geographic center of gravity
Discover the effects of violent events on internal population mobility (Afghanistan)
Track activity patterns over time; identify changes
Map connected areas of the country
42. Average position during a time period (e.g., day, week)
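The "geographic center of gravity" above can be sketched as the average position of a user's call locations over a window. The records and field layout below are invented for illustration; plain lat/lon averaging is only a reasonable approximation at small spatial scales.

```python
# Hypothetical call records: (user, day, lat, lon); values invented.
calls = [
    ("u1", "2014-06-01", -1.94, 30.06),
    ("u1", "2014-06-01", -1.96, 30.10),
    ("u1", "2014-06-02", -2.60, 29.74),
    ("u2", "2014-06-01", -1.50, 29.60),
]

def center_of_gravity(records, user):
    """Average call position for one user -- the 'geographic center of gravity'.

    Plain lat/lon averaging is adequate for a small-scale illustration; a real
    analysis would average on the sphere and might weight by call activity.
    """
    pts = [(lat, lon) for u, _, lat, lon in records if u == user]
    lat = sum(p[0] for p in pts) / len(pts)
    lon = sum(p[1] for p in pts) / len(pts)
    return lat, lon

lat, lon = center_of_gravity(calls, "u1")
print(round(lat, 2), round(lon, 2))  # -2.17 29.97
```

Tracking how this point shifts week to week is one way to quantify migration and the mobility changes the slide lists.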
44. Towards an Urban Science Incubation Cohort
• OneBusAway: Transit Traveler Information Systems
• Foreclosure rates and changes in poverty concentration
• PNW Seismic Network Early Warning System
• Ocean Observatories Initiative
• Education: CRPE
45. Seattle, the tech and innovation hub
• “most innovative state” (Bloomberg, 12/13)
• “smartest city” (Fast Company, 11/13)
• only US city on “ten best Internet cities” (UBM’s Future Cities blog, 8/13)
• ranked 2nd for women entrepreneurs (GeekWire, 2/13)
• ranked 4th as global startup hub, ahead of NYC (GeekWire, 11/12)
• “the top tech city” (GeekWire, 6/12)
• …and so on
46. eScience Institute + Urban Science
• Better public engagement than in physical and earth sciences
• Leverages our core interest in open data and open science
• Acute need relative to traditionally data-intensive fields
– relative newcomers in DS techniques and technologies
– We prefer collaborations with smaller labs and individuals as opposed to “Big Science” projects
• Seattle offers a unique testbed as an urbanizing region
– Brookings “metro”: Interconnected urban, suburban, rural, environment
– Engaged, active communities
– Strong local interest in open data, open government
– Global hub for technology and innovation (next slide)
• Connections with King County Executive’s office, State CIO’s office, Seattle CTO’s office, local gov data companies (Socrata)
47. Data Science @ UW
We are at the dawn of a revolutionary new era of discovery and learning
Editor's Notes
Institutional change rather than specific research projects
What is the studio?
it’s an open research space where anyone on campus can come to collaborate with a data science team that consists of several permanent staff with expertise in databases, machine learning, visualization, software engineering, reproducibility, and cluster and cloud computing – these are new “research and development” career paths in applied data science, attracting people with significant software backgrounds who are interested in applying their expertise to science problems
The Studio will also house a number of data science fellows – partially funded research scientists, visiting scientists, postdocs, and students (including IGERT students as Magda discussed)
The Studio will be a delivery vector for a number of activities – the seminar series, the lunches, workshops and bootcamps. But you can engage directly with the Studio in a number of ways: the space will be designed to support drop-in collaboration, we will hold scheduled office hours, and a flagship program that I’m really excited about is our data science incubator, which I’ll describe in a moment.
These data science collaborations can spin out tools like SQLShare, but we need to make these technology-oriented collaborations more common.
The next generation of this is an incubation program to scale and concentrate our collaborations
We want to move from “accidental” encounters to engineered partnerships -- identify promising new opportunities and new partners around campus and invest our time with them.
We need a shared environment where researchers can learn not only from our team, but also external mentors and most importantly **each other** – we routinely find shared solutions across very different fields. John Wilkerson in political science is using sequence alignment algorithms from biology for text analytics to trace the flow of ideas through legislation – he’ll have a student participating in our incubation program this spring.
And we intend this to be a true startup environment, with significant potential for technology spinout. We can help find new markets for existing technology as well as finding opportunities for new technologies.
Let me give you a brief example of a project a little further upstream that the incubation program can provide access to.
This work is in a space of open data sharing platforms, along with Socrata here in Seattle, products from Google and Microsoft, and a number of other companies.
Two observations motivate the products in this space:
First, there’s a movement toward open data that has researchers, government agencies, and even companies exposing their data assets online for use by others for reasons of transparency, efficiency, accountability. Even for commercial data, there are marketplaces emerging to facilitate the buying and selling of data. All of these use cases need new technology. So that’s one reason.
Second, if you’re going to use someone else’s data, you need it to be as accessible as possible. In particular, you need to help data analysts use data that “had previously been the realm of programmers and DB administrators” – here I’m quoting Benjamin Romano in an Xconomy article about Socrata.
SQLShare is an open data system, but emphasizes rich data manipulation rather than just fetch and retrieval, interoperability with external tools and existing databases, local or cloud deployments, and built-in services for data integration, profiling, and visualization.
Ginger mentioned this system in her talk – we have maintained a production deployment here on campus for three years focusing on science users. Our observation is that science use cases are a predictor for commercial use cases – businesses are beginning to use data the same way scientists always have – they collect it aggressively, torture it with analytics, use it to make predictions about the world. So we think if we can handle these difficult science use cases that we will also be addressing a significant commercial problem.
GENERALIZATION POSSIBLE – NOT A ONE-OFF ISSUE
DESCRIBE THE GRAPH AND THE AXES CLEARLY
Cluster in space and often in time
Everywhere! But small, often missed by routine network detection…
Eruptions!!
What can they tell us?… What’s the state of the science?
Great – we have a classifier that is accurate.
Let’s extend this work to see to what extent opinions manifest themselves in actions of public health importance.
A lot of discussion in media on return of vaccine-preventable disease outbreaks.
Fear of autism, etc.
H1N1 vaccination rates recorded in January 2010 (older than 6 months) vs. average sentiment score of users in regions (black) and states (gray)
Impressed with accuracy of 84% in a 4-class problem.
Not a surprising result that state-level information might not be that strongly correlated. Wanted to dig deeper into the geographic features of users.
MENTION INABILITY TO REPRODUCE ACCURATE CLASSIFIER
Dow Constantine’s office, King County Executive
Fred Jarrett, Chief of Staff for King County
Tom Stritikus, Dean of the UW College of Education
Thaisa Way, Landscape Architecture
Bill Glenn, Socrata
local company behind data.seattle.gov