GalvanizeU Seattle: Eleven Almost-Truisms About Data - Paco Nathan
http://www.meetup.com/Seattle-Data-Science/events/223445403/
Almost a dozen almost-truisms about Data that almost everyone should consider carefully as they embark on a journey into Data Science. There are a number of preconceptions about working with data at scale where the realities beg to differ. This talk estimates that number to be at least eleven, though probably much larger. At least that number has a great line from a movie. Let's consider some of the less-intuitive directions in which this field is heading, along with likely consequences and corollaries -- especially for those who are just now beginning to study the technologies, the processes, and the people involved.
Microservices, containers, and machine learning - Paco Nathan
http://www.oscon.com/open-source-2015/public/schedule/detail/41579
In this presentation, an open source developer community considers itself algorithmically. This shows how to surface data insights from the developer email forums for just about any Apache open source project. It leverages advanced techniques for natural language processing, machine learning, graph algorithms, time series analysis, etc. As an example, we use data from the Apache Spark email list archives to help understand its community better; however, the code can be applied to many other communities.
Exsto is an open source project that demonstrates Apache Spark workflow examples for SQL-based ETL (Spark SQL), machine learning (MLlib), and graph algorithms (GraphX). It surfaces insights about developer communities from their email forums. A natural language processing service in Python (based on NLTK, TextBlob, WordNet, etc.) gets containerized and used to crawl and parse the email archives. These produce JSON data sets; we then run machine learning on a Spark cluster to surface insights such as:
* What are the trending topic summaries?
* Who are the leaders in the community for various topics?
* Who discusses most frequently with whom?
This talk shows how to use cloud-based notebooks for organizing and running the analytics and visualizations. It reviews the background for how and why the graph analytics and machine learning algorithms generalize patterns within the data — based on open source implementations for two advanced approaches, Word2Vec and TextRank. The talk also illustrates best practices for leveraging functional programming for big data.
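As a rough illustration of the TextRank side of the approach above (a toy sketch in plain Python, not the Exsto implementation, which runs on Spark): build a co-occurrence graph over tokens, then rank the tokens with PageRank.

```python
# Minimal TextRank-style keyword ranking: co-occurrence graph + PageRank.
# A toy sketch for illustration only, not the Exsto code.
from collections import defaultdict

def pagerank(graph, damping=0.85, iters=50):
    """Iterative PageRank over an adjacency dict {node: set(neighbors)}."""
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {}
        for n in nodes:
            incoming = sum(rank[m] / len(graph[m]) for m in nodes if n in graph[m])
            new[n] = (1 - damping) / len(nodes) + damping * incoming
        rank = new
    return rank

def textrank_keywords(tokens, window=2):
    # link each token to its neighbors within the sliding window
    graph = defaultdict(set)
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            graph[w].add(tokens[j])
            graph[tokens[j]].add(w)
    return sorted(pagerank(graph).items(), key=lambda kv: -kv[1])

tokens = "spark streaming uses spark sql and spark mllib".split()
print(textrank_keywords(tokens)[0][0])  # "spark" -- the best-connected token
```

In the real pipeline the tokens would come from the NLP parse of email messages, and the graph would be far larger; the ranking idea is the same.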
Use of standards and related issues in predictive analytics - Paco Nathan
My presentation at KDD 2016 in SF, in the "Special Session on Standards in Predictive Analytics In the Era of Big and Fast Data" morning track about PMML and PFA http://dmg.org/kdd2016.html
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More - Paco Nathan
Spark and Databricks component of the O'Reilly Media webcast "2015 Data Preview: Spark, Data Visualization, YARN, and More", as a preview of the 2015 Strata + Hadoop World conference in San Jose http://www.oreilly.com/pub/e/3289
Databricks Meetup @ Los Angeles Apache Spark User Group - Paco Nathan
Los Angeles Apache Spark Users Group 2014-12-11 http://meetup.com/Los-Angeles-Apache-Spark-Users-Group/events/218748643/
A look ahead at Spark Streaming in Spark 1.2 and beyond, with case studies, demos, plus an overview of approximation algorithms that are useful for real-time analytics.
How Apache Spark fits into the Big Data landscape - Paco Nathan
How Apache Spark fits into the Big Data landscape http://www.meetup.com/Washington-DC-Area-Spark-Interactive/events/217858832/
2014-12-02 in Herndon, VA and sponsored by Raytheon, Tetra Concepts, and MetiStream
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming - Paco Nathan
London Spark Meetup 2014-11-11 @Skimlinks
http://www.meetup.com/Spark-London/events/217362972/
To paraphrase the immortal crooner Don Ho: "Tiny Batches, in the wine, make me happy, make me feel fine." http://youtu.be/mlCiDEXuxxA
Apache Spark provides support for streaming use cases, such as real-time analytics on log files, by leveraging a model called discretized streams (D-Streams). These "micro batch" computations operate on small time intervals, generally from 500 milliseconds up. One major innovation of Spark Streaming is that it leverages a unified engine. In other words, the same business logic can be used across multiple use cases: streaming, but also interactive, iterative, machine learning, etc.
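The micro-batch idea can be illustrated in a few lines of plain Python (a toy sketch, not Spark's actual D-Stream machinery): events get grouped into fixed time intervals, and the same batch logic is applied to each interval.

```python
# Toy illustration of discretized streams: timestamped events are bucketed
# into fixed micro-batch intervals; the same function runs on every batch.
from collections import defaultdict

def discretize(events, interval_ms=500):
    """Group (timestamp_ms, value) events into micro-batches."""
    batches = defaultdict(list)
    for ts, value in events:
        batches[ts // interval_ms].append(value)
    return [batches[k] for k in sorted(batches)]

events = [(0, "GET /a"), (120, "GET /b"), (640, "GET /c"), (990, "GET /d")]
for batch in discretize(events):
    # the "business logic" -- here just a count -- is reused per batch
    print(len(batch))
```

The unified-engine point is that in Spark the per-batch function would be the same RDD code used for batch or interactive jobs.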
This talk will compare case studies for production deployments of Spark Streaming, emerging design patterns for integration with popular complementary OSS frameworks, plus some of the more advanced features such as approximation algorithms, and take a look at what's ahead — including the new Python support for Spark Streaming that will be in the upcoming 1.2 release.
Also, let's chat a bit about the new Databricks + O'Reilly developer certification for Apache Spark…
QCon São Paulo: Real-Time Analytics with Spark Streaming - Paco Nathan
"Real-Time Analytics with Spark Streaming" presented at QCon São Paulo, 2015-03-26
http://qconsp.com/presentation/real-time-analytics-spark-streaming
This talk presents an overview of Spark and its history and applications, then focuses on the Spark Streaming component used for real-time analytics. We compare it with earlier frameworks such as MillWheel and Storm, and explore industry motivations for open-source micro-batch streaming at scale.
The talk will include demos for streaming apps that include machine-learning examples. We also consider public case studies of production deployments at scale.
We’ll review the use of open-source sketch algorithms and probabilistic data structures that get leveraged in streaming – for example, the trade-off of 4% error bounds on real-time metrics for two orders of magnitude reduction in required memory footprint of a Spark app.
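For a flavor of that trade-off, here is a minimal Count-Min Sketch in plain Python (toy parameters, not the production-tuned structures used in streaming apps): bounded memory, with counts that may overestimate but never undercount.

```python
# A minimal Count-Min Sketch: approximate frequency counts in fixed memory.
# Toy sketch for illustration; real deployments tune width/depth to the
# desired error bound.
import hashlib

class CountMinSketch:
    def __init__(self, width=256, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # one hashed column per row, derived from a seeded digest
        for row in range(self.depth):
            h = hashlib.sha1(f"{row}:{item}".encode()).hexdigest()
            yield row, int(h, 16) % self.width

    def add(self, item):
        for row, col in self._cells(item):
            self.table[row][col] += 1

    def count(self, item):
        # min across rows: collisions can only inflate a cell, never shrink it
        return min(self.table[row][col] for row, col in self._cells(item))

cms = CountMinSketch()
for word in ["spark"] * 5 + ["storm"] * 2:
    cms.add(word)
print(cms.count("spark"))  # at least 5; the sketch never undercounts
```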
https://www.eventbrite.com/e/talk-by-paco-nathan-graph-analytics-in-spark-tickets-17173189472
Big Brains meetup hosted by BloomReach, 2015-06-04
Case study / demo of a large-scale graph analytics project, leveraging GraphX in Apache Spark to surface insights about open source developer communities — based on data mining of their email forums. The project works with any Apache email archive, applying NLP and machine learning techniques to analyze message threads, then constructs a large graph. Graph analytics, based on concise Scala coding examples in Spark, surface themes and interactions within the community. Results are used as feedback for respective developer communities, such as leaderboards, etc. As an example, we will examine analysis of the Spark developer community itself.
Data Science with Spark - Training at SparkSummit (East) - Krishna Sankar
Slideset of the training we gave at the Spark Summit East.
Blog : https://doubleclix.wordpress.com/2015/03/25/data-science-with-spark-on-the-databricks-cloud-training-at-sparksummit-east/
The video is posted on YouTube: https://www.youtube.com/watch?v=oTOgaMZkBKQ
Microservices, Containers, and Machine Learning - Paco Nathan
Session talk for Data Day Texas 2015, showing GraphX and SparkSQL for text analytics and graph analytics of an Apache developer email list -- including an implementation of TextRank in Spark.
Functional programming for optimization problems in Big Data - Paco Nathan
Enterprise Data Workflows with Cascading.
Silicon Valley Cloud Computing Meetup talk at Cloud Tech IV, 4/20 2013
http://www.meetup.com/cloudcomputing/events/111082032/
Big Graph Analytics on Neo4j with Apache Spark - Kenny Bastani
In this talk I will introduce you to a Docker container that provides you an easy way to do distributed graph processing using Apache Spark GraphX and a Neo4j graph database. You'll learn how to analyze big data graphs that are exported from Neo4j and consequently updated from the results of a Spark GraphX analysis. The types of analysis I will be talking about are PageRank, connected components, triangle counting, and community detection.
Database technologies have evolved to be able to store big data, but are largely inflexible. For complex graph data models stored in a relational database there may be tedious transformations and shuffling around of data to perform large scale analysis.
Fast and scalable analysis of big data has become a critical competitive advantage for companies. There are open source tools like Apache Hadoop and Apache Spark that are providing opportunities for companies to solve these big data problems in a scalable way. Platforms like these have become the foundation of the big data analysis movement.
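For a feel of what one of the analyses named above actually computes, here is connected components in plain Python (a toy sketch; the talk itself uses Spark GraphX against Neo4j exports):

```python
# Connected components via BFS on a small undirected graph -- a plain-Python
# stand-in for the distributed GraphX version discussed in the talk.
from collections import deque

def connected_components(edges):
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            if node in comp:
                continue
            comp.add(node)
            queue.extend(adj[node] - comp)
        seen |= comp
        components.append(comp)
    return components

edges = [("a", "b"), ("b", "c"), ("d", "e")]
print(sorted(len(c) for c in connected_components(edges)))  # [2, 3]
```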
Agile data science: Distributed, Interactive, Integrated, Semantic, Micro Ser... - Andy Petrella
Distributed Data Science…
* A genomics use case
* Spark Notebook
* Interactive Distributed Data Science
Distributed Data Science… Pipeline
* Pipeline: productizing Data Science
* Demo of Distributed Pipeline (ADAM, Akka, Cassandra, Parquet, Spark)
* Why Micro Services?
* Painful points:
* Data science is Discontiguous
* Context Lost in Translation
* Solution: Data Fellas’ Agile Data Science Toolkit
Presented 2015-08-24 at SF Bay ACM, held at the eBay south campus in San Jose.
http://meetup.com/SF-Bay-ACM/events/221693508/
Project Jupyter https://jupyter.org/ evolved from IPython notebooks, and now supports a wide variety of programming language back-ends. Notebooks have proven to be effective tools in Data Science, providing convenient packaging for what Don Knuth coined "literate programming" in the 1980s: code plus exposition in markdown. Results of running the code appear in-line as interactive graphics -- all packaged as collaborative, web-based documents. Some have said that the introduction of cloud-based notebooks is nearly as fundamental a change in software practice as the introduction of spreadsheets.
O'Reilly Media has been considering the question, "What comes after books and video?" Or, as one might imagine more pointedly, what comes after Kindle? To that point we have collaborated with Project Jupyter to integrate notebooks into our content management process, allowing authors to generate articles, tutorials, reports, and other media products as notebooks that also incorporate video segments. Code dependencies are containerized using Docker, and all of the content gets managed in Git repositories. We have added another layer, an open source project called Thebe that provides a kind of "media player" for embedding the containerized notebooks into web pages.
See 2020 update: https://derwen.ai/s/h88s
SF Python Meetup, 2017-02-08
https://www.meetup.com/sfpython/events/237153246/
PyTextRank is a pure Python open source implementation of *TextRank*, based on the [Mihalcea 2004 paper](http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf) -- a graph algorithm which produces ranked keyphrases from texts. Keyphrases are generally more useful than simple keyword extraction. PyTextRank integrates use of `TextBlob` and `SpaCy` for NLP analysis of texts, including full parse, named entity extraction, etc. It also produces auto-summarization of texts, making use of an approximation algorithm, `MinHash`, for better performance at scale. Overall, the package is intended to complement machine learning approaches -- specifically deep learning used for custom search and recommendations -- by developing better feature vectors from raw texts. This package is in production use at O'Reilly Media for text analytics.
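The MinHash approximation mentioned above can be sketched in a few lines of plain Python (a toy illustration, not the PyTextRank internals): keep k hash minima per token set, and the fraction of matching minima estimates the Jaccard similarity of the sets.

```python
# Bare-bones MinHash: estimate Jaccard similarity from k hash minima.
# Toy illustration only; library implementations use cheaper hash families.
import hashlib

def minhash_signature(tokens, k=64):
    def h(seed, token):
        # seeded hash: one independent hash function per signature slot
        return int(hashlib.md5(f"{seed}:{token}".encode()).hexdigest(), 16)
    return [min(h(seed, t) for t in tokens) for seed in range(k)]

def estimated_jaccard(sig_a, sig_b):
    # P(minima agree) equals the true Jaccard similarity, in expectation
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = {"spark", "streaming", "graphx", "mllib"}
b = {"spark", "streaming", "graphx", "sql"}
sig_a, sig_b = minhash_signature(a), minhash_signature(b)
print(round(estimated_jaccard(sig_a, sig_b), 2))  # estimate of the true 3/5
```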
Big data & data science challenges and opportunities - Jose Quesada
Even though most companies see the advantages of using more data in their decisions, few actually do so. Why is that? A few ideas on challenges and opportunities for (mid-size) companies. The talk audience was an engineering association, where most people represented engineering-centric companies in Germany (often in manufacturing).
Future of data science as a profession - Jose Quesada
How can you thrive in a future where machine learning has been popular for a few years already?
In this talk, I will give you actionable advice from my experience training serious data scientists at our retreat center in Berlin. You are going to face these pointed, hard questions:
- What is the promise of machine learning? Has it happened yet?
- Is it easy to take advantage of machine learning, now that most algorithms are nicely packaged in APIs and libraries?
- How much time should I spend getting good at machine learning? Am I good enough now?
- Are data scientists going to be replaced by algorithms? Are we all?
- Is it easy to hire talent in machine learning after the explosion of MOOCs?
How Apache Spark fits into the Big Data landscape - Paco Nathan
Boulder/Denver Spark Meetup, 2014-10-02 @ Datalogix
http://www.meetup.com/Boulder-Denver-Spark-Meetup/events/207581832/
Apache Spark is intended as a general purpose engine that supports combinations of Batch, Streaming, SQL, ML, Graph, etc., for apps written in Scala, Java, Python, Clojure, R, etc.
This talk provides an introduction to Spark — how it provides so much better performance, and why — and then explores how Spark fits into the Big Data landscape — e.g., other systems with which Spark pairs nicely — and why Spark is needed for the work ahead.
Stratio's Cassandra Lucene index: Geospatial use cases - Big Data Spain 2016 - Stratio
Stratio’s Cassandra Lucene Index, derived from Stratio Cassandra, is an open source plugin for Apache Cassandra that extends its index functionality to provide near real-time search, as in Elasticsearch or Solr, including full-text search capabilities and free multivariable, geospatial, and bitemporal search. It is achieved through an Apache Lucene based implementation of Cassandra secondary indexes, where each node of the cluster indexes its own data. Stratio’s Cassandra indexes are one of the core modules on which Stratio’s Big Data platform is based.
Andres de la Peña discusses the recently added geospatial search features in Stratio's Cassandra Lucene index using some Nephila Capital use cases. These new features include indexing complex polygons, nearest neighbour search, and the application of chained geometrical transformations such as bounding box, convex hull, centroid, union, intersection, exclusion and distance buffer.
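Two of the simpler geometric transformations named above can be illustrated in plain Python (a toy sketch; the actual index applies these through Lucene's spatial support, and the centroid here is just the vertex average):

```python
# Bounding box and vertex-average centroid of a point set -- plain-Python
# stand-ins for two of the geometrical transformations listed above.
def bounding_box(points):
    xs, ys = zip(*points)
    return (min(xs), min(ys)), (max(xs), max(ys))

def centroid(points):
    xs, ys = zip(*points)
    return sum(xs) / len(xs), sum(ys) / len(ys)

polygon = [(0.0, 0.0), (4.0, 0.0), (4.0, 2.0), (0.0, 2.0)]
print(bounding_box(polygon))  # ((0.0, 0.0), (4.0, 2.0))
print(centroid(polygon))      # (2.0, 1.0)
```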
Apache Spark & Cassandra use case at Telefónica CBS by Antonio Alcacer - Stratio
Spark & Cassandra use case at Telefónica CyberSecurity (CBS), presented by Antonio Alcocer (antonio@stratio.com) and Oscar Mendez (oscar@stratio.com, @omendezsoto) at Cassandra Summit 2014.
Multiplatform Solution for Graph Datasources - Stratio
One of the top banks in Europe needed a system that would provide better performance, scale almost linearly with the increase in information to be analyzed, and allow the processes currently executed on the Host to move to a Big Data infrastructure. Over the course of a year we worked on a system that provides greater agility, flexibility, and simplicity for the user viewing information when profiling, and that can now analyze the structure of profile data. It offers a powerful way to make online queries against a graph database, integrated with Apache Spark and different graph libraries. Essentially, we obtain all the necessary information through Cypher queries sent to a Neo4j database.
Using the latest Big Data technologies, such as Spark DataFrames, HDFS, Stratio Intelligence, and Stratio Crossdata, we have developed a solution that can obtain critical information from multiple datasources, such as text files or graph databases.
Data Science - An emerging Stream of Science with its Spreading Reach & Impact - Dr. Sunil Kr. Pandey
This is my presentation on the topic "Data Science - An emerging Stream of Science with its Spreading Reach & Impact". I have compiled and collected statistics and data from different sources. It may be useful for students and others interested in this field of study.
Everybody has heard of Big Data, and its promise as the next great frontier for innovation. However, Big Data is neither new nor easily defined. What are the key drivers that make Big Data so critically important today? What is the single idea behind Big Data that promises such game changing outcomes for capable organizations? Who are the skilled talent that deliver Big Data results?
This presentation briefly reviews the opportunities, motivation and trends that are driving Big Data disruption. Data science is introduced as the enabling engine for Big Data transformation via the creation of new Data Products. The data scientist is defined and their tools, workflow and challenges are reviewed. Finally, practical tips are presented for approaching data product development.
Key takeaways include:
- Big Data disruption is driven by four megatrends
- Data is the essential raw material for creating valuable Data Products
- Data scientists are heterogeneous by role & skill set, but share common tools, workflows and challenges
- Data science talent is more important than raw data for Big Data success
These slides are modified from an invited presentation for the Gwinnett Chamber of Commerce on March 18, 2014. An excerpt was presented at the Georgia Pacific Social Media Working Session on March 19, 2014.
A 2015 update to the 2012 "Data Big and Broad" talk - http://www.slideshare.net/jahendler/data-big-and-broad-oxford-2012 - which extends the coverage and brings more of it into the context of recent "big data" work.
Talk at the World Science Festival at Columbia, June 2, 2017: session on Big Data and Physics: http://www.worldsciencefestival.com/programs/big-data-future-physics/
Data; Data manipulation, sorting, grouping, rearranging. Plotting the data. Descriptive statistics. Inferential statistics. Python Libraries for Data Science. - jybufgofasfbkpoovh
Data; Data manipulation, sorting, grouping, rearranging. Plotting the data. Descriptive statistics. Inferential statistics. Python Libraries for Data Science.
Introduction for skills seminar on Search and Data Mining, Master of European... - Gerben Zaagsma
These are the slides for the introductory lecture that I gave as part of a skills seminar on Search and Data Mining (Luxembourg, 11 December 2014). The slides are rather visual and for the most part don’t include notes, yet I believe the gist of the talk will be clear. At the end links are included for tools, further reading and a link to the exercises we did.
An invited talk in the Big Data session of the Industrial Research Institute meeting in Seattle Washington.
Some notes on how to train data science talent and exploit the fact that the membrane between academia and industry has become more permeable.
"Big Data" is term heard more and more in industry – but what does it really mean? There is a vagueness to the term reminiscent of that experienced in the early days of cloud computing. This has led to a number of implications for various industries and enterprises. These range from identifying the actual skills needed to recruit talent to articulating the requirements of a "big data" project. Secondary implications include difficulties in finding solutions that are appropriate to the problems at hand – versus solutions looking for problems. This presentation will take a look at Big Data and offer the audience with some considerations they may use immediately to assess the use of analytics in solving their problems.
The talk begins with an idea of how big "Big Data" can be. This leads to an appreciation of how important "Management Questions" are to assessing analytic needs. The fields of data and analysis have become extremely important and impact nearly all facets of life and business. During the talk we will look at the two pillars of Big Data – Data Warehousing and Predictive Analytics. Then we will explore the open source tools and datasets available to NATO action officers to work in this domain. Use cases relevant to NATO will be explored to show where analytics lies hidden within many of the day-to-day problems of enterprises. The presentation will close with a look at the future. Advances in the area of semantic technologies continue. The much acclaimed consultants at Gartner listed Big Data and Semantic Technologies as the first- and third-ranked top technology trends to modernize information management in the coming decade. They note there is an incredible value "locked inside all this ungoverned and underused information." HQ SACT can leverage this powerful analytic approach to capture requirement trends when establishing acquisition strategies, monitor Priority Shortfall Areas, prepare solicitations, and retrieve meaningful data from archives.
A brief introduction to data science and machine learning, with an emphasis on application scenarios, from traditional ones to the more innovative. The overview covers the basic definition of data science, an overview of machine learning, and examples across traditional scenarios, recommender systems and social network analysis, IoT, and deep learning.
Data Science Innovations : Democratisation of Data and Data Science - Suresh Sood
Data Science Innovations : Democratisation of Data and Data Science covers the opportunity of citizen data science lying at the convergence of natural language generation and discoveries in data made by the professions, not data scientists.
Human in the loop: a design pattern for managing teams working with ML - Paco Nathan
Strata CA 2018-03-08
https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/64223
Although it has long been used for use cases like simulation, training, and UX mockups, human-in-the-loop (HITL) has emerged as a key design pattern for managing teams where people and machines collaborate. One approach, active learning (a special case of semi-supervised learning), employs mostly automated processes based on machine learning models, but exceptions are referred to human experts, whose decisions help improve new iterations of the models.
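As a rough illustration of the loop described above (not code from any of these talks), here is a minimal active-learning sketch in Python; the stand-in model, confidence threshold, and expert callback are all invented placeholders:

```python
# Sketch of one human-in-the-loop active-learning pass.
# The "model" is a stand-in: it scores items and reports a confidence.

CONFIDENCE_THRESHOLD = 0.8  # below this, defer to a human expert

def model_predict(item):
    """Placeholder model: returns (label, confidence)."""
    score = item.get("score", 0.5)
    return ("spam" if score > 0.5 else "ham", abs(score - 0.5) * 2)

def active_learning_pass(items, ask_expert):
    """Auto-label confident items; route exceptions to a human.

    Expert judgements are collected so the next model iteration
    can be retrained on them.
    """
    auto_labeled, new_training_examples = [], []
    for item in items:
        label, confidence = model_predict(item)
        if confidence >= CONFIDENCE_THRESHOLD:
            auto_labeled.append((item, label))
        else:
            expert_label = ask_expert(item)  # the human in the loop
            new_training_examples.append((item, expert_label))
    return auto_labeled, new_training_examples

items = [{"score": 0.95}, {"score": 0.55}, {"score": 0.05}]
auto, queued = active_learning_pass(items, ask_expert=lambda i: "ham")
print(len(auto), len(queued))  # prints "2 1": confident items vs. expert referrals
```

The point of the pattern is the second return value: each expert judgement becomes a labeled training example for the next model iteration.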
Human-in-the-loop: a design pattern for managing teams that leverage ML
Paco Nathan
Strata Singapore 2017 session talk 2017-12-06
https://conferences.oreilly.com/strata/strata-sg/public/schedule/detail/65611
Human-in-the-loop is an approach which has been used for simulation, training, UX mockups, etc. A more recent design pattern is emerging for human-in-the-loop (HITL) as a way to manage teams working with machine learning (ML). A variant of semi-supervised learning called active learning allows for mostly automated processes based on ML, where exceptions get referred to human experts. Those human judgements in turn help improve new iterations of the ML models.
This talk reviews key case studies about active learning, plus other approaches for human-in-the-loop which are emerging among AI applications. We’ll consider some of the technical aspects — including available open source projects — as well as management perspectives for how to apply HITL:
* When is HITL indicated vs. when isn’t it applicable?
* How do HITL approaches compare/contrast with more “typical” use of Big Data?
* What’s the relationship between use of HITL and preparing an organization to leverage Deep Learning?
* Experiences training and managing a team which uses HITL at scale
* Caveats to know ahead of time
* In what ways do the humans involved learn from the machines?
In particular, we’ll examine use cases at O’Reilly Media where ML pipelines for categorizing content are trained by subject matter experts providing examples, based on HITL and leveraging open source [Project Jupyter](https://jupyter.org/) for implementation.
Human-in-the-loop: a design pattern for managing teams which leverage ML
Paco Nathan
Big Data Spain, 2017-11-16
https://www.bigdataspain.org/2017/talk/human-in-the-loop-a-design-pattern-for-managing-teams-which-leverage-ml
Human-in-the-loop is an approach which has been used for simulation, training, UX mockups, etc. A more recent design pattern is emerging for human-in-the-loop (HITL) as a way to manage teams working with machine learning (ML). A variant of semi-supervised learning called _active learning_ allows for mostly automated processes based on ML, where exceptions get referred to human experts. Those human judgements in turn help improve new iterations of the ML models.
This talk reviews key case studies about active learning, plus other approaches for human-in-the-loop which are emerging among AI applications. We'll consider some of the technical aspects -- including available open source projects -- as well as management perspectives for how to apply HITL:
* When is HITL indicated vs. when isn't it applicable?
* How do HITL approaches compare/contrast with more "typical" use of Big Data?
* What's the relationship between use of HITL and preparing an organization to leverage Deep Learning?
* Experiences training and managing a team which uses HITL at scale
* Caveats to know ahead of time
* In what ways do the humans involved learn from the machines?
In particular, we'll examine use cases at O'Reilly Media where ML pipelines for categorizing content are trained by subject matter experts providing examples, based on HITL and leveraging open source [Project Jupyter](https://jupyter.org/) for implementation.
Humans in a loop: Jupyter notebooks as a front-end for AI
Paco Nathan
JupyterCon NY 2017-08-24
https://www.safaribooksonline.com/library/view/jupytercon-2017-/9781491985311/video313210.html
Paco Nathan reviews use cases where Jupyter provides a front-end to AI as the means for keeping "humans in the loop". This talk introduces *active learning* and the "human-in-the-loop" design pattern for managing how people and machines collaborate in AI workflows, including several case studies.
The talk also explores how O'Reilly Media leverages AI in Media, and in particular some of our use cases for active learning such as disambiguation in content discovery. We're using Jupyter as a way to manage active learning ML pipelines, where the machines generally run automated until they hit an edge case and refer the judgement back to human experts. In turn, the experts train the ML pipelines purely through examples, not feature engineering, model parameters, etc.
Jupyter notebooks serve as one part configuration file, one part data sample, one part structured log, one part data visualization tool. O'Reilly has released an open source project on GitHub called `nbtransom` which builds atop `nbformat` and `pandas` for our active learning use cases.
This work anticipates upcoming work on collaborative documents in JupyterLab, based on Google Drive. In other words, where the machines and people are collaborators on shared documents.
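The actual `nbtransom` project builds atop `nbformat` and `pandas`; purely to sketch the underlying idea of a notebook as a shared JSON document that both machines and people annotate, here is a stdlib-only toy. The cell layout mirrors the nbformat v4 schema, but the `prediction` metadata key and both annotate functions are invented for illustration:

```python
import json

# Minimal notebook-shaped document, following the nbformat v4 layout.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": "Raw content to label"},
    ],
}

def machine_annotate(nb, cell_index, label, confidence):
    """A machine writes its prediction into cell metadata..."""
    nb["cells"][cell_index]["metadata"]["prediction"] = {
        "label": label, "confidence": confidence,
    }

def human_annotate(nb, cell_index, label):
    """...and a person confirms or overrides it in the same document."""
    nb["cells"][cell_index]["metadata"]["prediction"]["human_label"] = label

machine_annotate(notebook, 0, label="data-science", confidence=0.62)
human_annotate(notebook, 0, label="machine-learning")

print(json.dumps(notebook["cells"][0]["metadata"]["prediction"], indent=2))
```

Because both parties write to one document, the notebook doubles as configuration, data sample, and structured log, as described above.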
Humans in the loop: AI in open source and industry
Paco Nathan
Nike Tech Talk, Portland, 2017-08-10
https://niketechtalks-aug2017.splashthat.com/
O'Reilly Media gets to see the forefront of trends in artificial intelligence: what the leading teams are working on, which use cases are getting the most traction, previews of advances before they get announced on stage. Through conferences, publishing, and training programs, we've been assembling resources for anyone who wants to learn. An excellent recent example: Generative Adversarial Networks for Beginners, by Jon Bruner.
This talk covers current trends in AI, industry use cases, and recent highlights from the AI Conf series presented by O'Reilly and Intel, plus related materials from Safari learning platform, Strata Data, Data Show, and the upcoming JupyterCon.
Along with reporting, we're leveraging AI in Media. This talk dives into O'Reilly uses of deep learning -- combined with ontology, graph algorithms, probabilistic data structures, and even some evolutionary software -- to help editors and customers alike accomplish more of what they need to do.
In particular, we'll show two open source projects in Python from O'Reilly's AI team:
• pytextrank built atop spaCy, NetworkX, datasketch, providing graph algorithms for advanced NLP and text analytics
• nbtransom leveraging Project Jupyter for a human-in-the-loop design pattern approach to AI work: people and machines collaborating on content annotation
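The real pytextrank builds on spaCy, NetworkX, and datasketch; the toy below shows only the core idea (rank words by running a PageRank-style power iteration over a word co-occurrence graph) in plain Python, and uses none of pytextrank's actual API:

```python
from collections import defaultdict
from itertools import combinations

def rank_keywords(sentences, damping=0.85, iterations=30):
    """Tiny TextRank-style ranking: build a per-sentence word
    co-occurrence graph, then run PageRank power iteration over it."""
    neighbors = defaultdict(set)
    for sent in sentences:
        words = sent.lower().split()
        for a, b in combinations(set(words), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    nodes = list(neighbors)
    rank = {w: 1.0 / len(nodes) for w in nodes}
    for _ in range(iterations):
        new_rank = {}
        for w in nodes:
            incoming = sum(rank[u] / len(neighbors[u]) for u in neighbors[w])
            new_rank[w] = (1 - damping) / len(nodes) + damping * incoming
        rank = new_rank
    return sorted(rank, key=rank.get, reverse=True)

sentences = [
    "spark clusters run machine learning",
    "machine learning needs data",
    "spark clusters process data",
]
print(rank_keywords(sentences)[:3])  # most central words first
```

Words that co-occur with many well-connected words rise to the top, which is the graph-algorithmic intuition behind keyphrase extraction.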
Lessons learned from 3 (going on 4) generations of Jupyter use cases at O'Reilly Media. In particular, about "Oriole" tutorials which combine video with Jupyter notebooks, Docker containers, backed by services managed on a cluster by Marathon, Mesos, Redis, and Nginx.
https://conferences.oreilly.com/fluent/fl-ca/public/schedule/detail/62859
https://conferences.oreilly.com/velocity/vl-ca/public/schedule/detail/62858
Strata UK 2017. Computable content leverages Jupyter notebooks to make learning materials more powerful by integrating compute engines, data sources, etc. O’Reilly Media extended this approach to create the new Oriole Online Tutorial medium, publishing notebooks from authors along with video timelines. (A free public tutorial, Regex Golf, by Peter Norvig demonstrates what’s possible with this technology integration.) Each user session launches a Docker container on a Mesos cluster for fully personalized compute environments. The UX is entirely browser based.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
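PowSyBl's own power-flow engines are far richer; purely to show the kind of computation a "power flow" performs, here is a three-bus DC power-flow sketch in plain Python (the linearized model: solve B'·theta = P with bus 0 as slack, then recover line flows). All network numbers are invented, and this does not use the PowSyBl or pypowsybl APIs:

```python
# DC power flow on a 3-bus network.
# Lines: (from_bus, to_bus, susceptance b in per-unit)
lines = [(0, 1, 10.0), (1, 2, 10.0), (0, 2, 5.0)]
injections = {1: -0.6, 2: -0.4}  # net power at non-slack buses (loads)

# Build the reduced susceptance matrix over buses 1 and 2 (bus 0 = slack).
buses = [1, 2]
B = [[0.0, 0.0], [0.0, 0.0]]
for f, t, b in lines:
    for i, bi in enumerate(buses):
        if bi in (f, t):
            B[i][i] += b
    if f in buses and t in buses:
        i, j = buses.index(f), buses.index(t)
        B[i][j] -= b
        B[j][i] -= b

# Solve the 2x2 system B * theta = p by Cramer's rule.
det = B[0][0] * B[1][1] - B[0][1] * B[1][0]
p = [injections[b] for b in buses]
theta = {0: 0.0,
         1: (p[0] * B[1][1] - B[0][1] * p[1]) / det,
         2: (B[0][0] * p[1] - p[0] * B[1][0]) / det}

# Line flow = susceptance * angle difference.
flows = {(f, t): b * (theta[f] - theta[t]) for f, t, b in lines}
for line, flow in flows.items():
    print(line, round(flow, 3))
```

Sanity check: the flows balance at every bus (0.65 into bus 1 minus 0.05 out equals its 0.6 load), which is exactly the invariant a power-flow solver enforces.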
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
Securing your Kubernetes cluster: a step-by-step guide to success!
KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
JMeter webinar - integration with InfluxDB and Grafana
RTTS
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
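To make the JMeter-to-InfluxDB hookup concrete: InfluxDB ingests metrics as plain-text "line protocol" (measurement, comma-separated tags, fields, optional timestamp). A minimal formatter follows; the metric and tag names are invented for the example, and real line protocol also marks integer fields with a trailing `i`, which this sketch omits:

```python
def to_line_protocol(measurement, tags, fields, timestamp_ns=None):
    """Format one point in InfluxDB line protocol:
    measurement,tag=v,... field=v,... [timestamp]"""
    tag_part = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_part = ",".join(
        f'{k}="{v}"' if isinstance(v, str) else f"{k}={v}"
        for k, v in sorted(fields.items())
    )
    line = f"{measurement},{tag_part} {field_part}"
    if timestamp_ns is not None:
        line += f" {timestamp_ns}"
    return line

# One load-test sampler result, as it might land in InfluxDB:
print(to_line_protocol(
    "jmeter",
    tags={"application": "demo", "transaction": "login"},
    fields={"avg": 182.5, "count": 42.0},
    timestamp_ns=1700000000000000000,
))
```

Grafana then queries these points by measurement and tags to draw the real-time dashboards demonstrated in the webinar.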
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio, using data from Sectrio's cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
In today's fast-changing business world, companies that fail to adapt and embrace new ideas often struggle to keep up with the competition. However, fostering a culture of innovation takes much work: it takes vision, leadership, and a willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Data Science in 2016: Moving Up
1. Data Science in 2016:
Moving Up
2015-10-15 • Madrid • http://bigdataspain.org/
Paco Nathan, @pacoid
O’Reilly Media
2. • general patterns
• trends and analysis: the discipline, the jobs
• some good examples: moving up into use cases
• glimpses ahead: an emerging context
• a proposed theme
Data Science 2016: Moving Up
8. Design Patterns: Issues
[diagram label: some cloud]
• integration could be better
• that implies sharing markets
• VCs in Silicon Valley dislike that
• customers need integration
14. Design Patterns: Where?
[diagram label: some cloud]
• that playing field becomes overly crowded, soon…
• what happens at that point?
15. • so much emphasis on plumbing: `data engineering`
• not enough on domain expertise, which trumps all
Much activity in Big Data seems awkwardly focused at the
bottom of the tech stack: infrastructure, not domain
However, that may be changing…
Design Patterns: Opinion
17. Interesting Trends
There are many possible trends to discuss, but let’s
concentrate on four of these going into 2016:
• leveraging multicore and large memory spaces
• generalized libraries for frequently repeated work
• workflows blend the best of people and computing
• framework for a big leap ahead, not just incremental
18. Original definitions for what became relational databases had less to do with dedicated SQL products, and more in common with something like Spark SQL
Interesting Trend #1: Contemporary Hardware
A relational model of data
for large shared data banks
Edgar Codd
Communications of the ACM (1970)
dl.acm.org/citation.cfm?id=362685
19. Unified API, One Engine, Automatically Optimized
[Diagram from Databricks: language frontends (Python, Java/Scala, R, SQL, …) over a DataFrame API and Logical Plan, with the Tungsten backend targeting the JVM, LLVM, GPU, and NVRAM.]
Interesting Trend #1: Contemporary Hardware
20. Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal
Josh Rosen
spark-summit.org/2015/events/deep-dive-into-project-tungsten-bringing-spark-closer-to-bare-metal/
Physical execution: CPU-efficient data structures keep data closer to the CPU cache
Interesting Trend #1: Contemporary Hardware
from Databricks
21. Interesting Trend #2: Generalized Libraries
Tensors are a good way to handle time-series
geo-spatially distributed linked data with lots
of N-dimensional attributes
In other words, nearly a general case for handling
much of the data that we’re likely to encounter
That’s better than attempting to shoehorn data
into matrix representation, then writing lots of
custom code to support it
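To make the tensor point concrete: a sparse mapping keyed by (time, location, attribute) handles N-dimensional data directly, where a matrix would force you to flatten two of those axes together. A stdlib-only sketch, with all values invented:

```python
from collections import defaultdict

# A sparse 3-way tensor: (time, location, attribute) -> value.
tensor = defaultdict(float)
tensor[("2015-10-15", "Madrid", "temperature")] = 22.5
tensor[("2015-10-15", "Madrid", "attendees")] = 800.0
tensor[("2015-10-16", "Seattle", "temperature")] = 12.0

def slice_by_location(t, location):
    """Fix one mode of the tensor, yielding a (time, attribute) matrix."""
    return {(time, attr): v
            for (time, loc, attr), v in t.items() if loc == location}

madrid = slice_by_location(tensor, "Madrid")
print(sorted(madrid))  # the remaining two modes for one location
```

Slicing a mode recovers an ordinary matrix, which is exactly what the custom shoehorning code above would have had to maintain by hand.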
22. Tensor factorization may be problematic, but
probabilistic solutions seem to provide relatively
general case solutions:
The Tensor Renaissance in Data Science
Anima Anandkumar @UC Irvine
radar.oreilly.com/2015/05/the-tensor-renaissance-in-data-science.html
Spacey Random Walks and Higher Order Markov Chains
David Gleich @Purdue
slideshare.net/dgleich/spacey-random-walks-and-higher-order-markov-chains
Interesting Trend #2: Generalized Libraries
23. Interesting Trend #3: Leveraging Workflows
[Workflow diagram, circa 2010, spanning representation, evaluation, and optimization: ETL into cluster/cloud; data prep; features; learners and parameters; unsupervised learning and exploration over train and test sets; models; evaluate; optimize; scoring against production data; data pipelines feeding use cases; actionable results; decisions and feedback. Generic "foo algorithms" and "bar developers" labels mark the developer-centric pieces.]
APIs, algorithms, developer-centric template thinking –
these only go so far; the overall context is a workflow…
26. Chris Ré
https://www.macfound.org/fellows/943/
Drugs, DNA, and Dinosaurs: Building High Quality Knowledge Bases with DeepDive
Strata CA (2015)
The Thorn in the Side of Big Data: too few artists
Strata CA (2014)
Interesting Trend #4: A Leap Ahead
cognitive computing “flywheel”:
probabilistic reasoning about complex
data and predictions together
29. William Cleveland
“Data Science: an Action Plan for Expanding
the Technical Areas of the Field of Statistics,”
International Statistical Review (2001), 69, 21-26
http://www.stat.purdue.edu/~wsc/papers/
datascience.pdf
Leo Breiman
“Statistical modeling: the two cultures”,
Statistical Science (2001), 16:199-231
http://projecteuclid.org/euclid.ss/1009213726
…also good to mention John Tukey
Data Scientists: Primary Sources
31. One 2015 report (RJMetrics) tallied a minimum of
11,400 data scientists worldwide by scraping LinkedIn
So many suddenly, really? Perhaps that’s doubtful…
Comparing surveys: O’Reilly Media conducts salary surveys
for data scientists, along with exploring the tools they use
2013 – tools, trends, not all data is “Big”, coding scripts!
2014 – correlation of tools and skills, rapid evolution
2015 – divide blurring between open source and proprietary
Data Scientists: Everywhere, all the time?
36. Moving Up: Medicine
“Whatever the models might discover or predict, Howard
isn’t suggesting they’ll do away with a doctor’s judgment.
Rather, artificially intelligent computers could provide strong,
unbiased second opinions, or perhaps lead a doctor down
a path of investigation she otherwise wouldn’t have considered.”
With Enlitic, a veteran data scientist plans
to fight disease using deep learning
GigaOM (2014-08-22)
https://gigaom.com/2014/08/22/with-enlitic-a-veteran-
data-scientist-plans-to-fight-disease-using-deep-learning/
37. Moving Up: Political Platform
http://www.predikon.ch/en/voting-patterns/residents
38. Moving Up: Political Platform
Mining Democracy
Matthias Grossglauser @EPFL
ICT Labs (2015)
http://ictlabs-summer-school.sics.se/
slides/mining%20democracy.pdf
What if a political candidate could cluster political
positions in a multi-dimensional data space, to
optimize for being recommended to voters?
http://www.predikon.ch/en/voting-patterns/residents
39. Moving Up: Government Ethics
The White House has a plan to help society through data analysis
Fortune (2015-09-30)
http://fortune.com/2015/09/30/dj-patil-white-house-data/
40. Moving Up: Government Ethics
The White House has a plan to help society through data analysis
Fortune (2015-09-30)
http://fortune.com/2015/09/30/dj-patil-white-house-data/
“Opening up government data about child labor to concerned data
scientists; recruiting folks to help analyze data about suicide prevention,
social injustice and incarceration; a call for mandatory and `intrinsic`
ethics instruction in every course teaching students data science; and an
effort to help the transgender community create its own census of sorts,
so that members and society can get a better grasp on the issues that
matter to the group.”
41. Moving Up: Neuroscience
Analytics + Visualization for Neuroscience: Spark, Thunder, Lightning
Jeremy Freeman
2015-01-29
youtu.be/cBQm4LhHn9g?t=28m55s
42. For excellent examples of Science and Data
together see CodeNeuro, particularly for
use of Jupyter notebooks + Apache Spark
Moving Up: Neuroscience
45. Massive Open Online Courses – a seven-year trend, beginning with:
Connectivism and Connective Knowledge
George Siemens, Stephen Downes
University of PEI (2008)
http://cck11.mooc.ca/
Learning: What About MOOCs?
Adios EdTech. Hola something else
George Siemens (2015-09-09)
http://www.elearnspace.org/blog/2015/09/09/
adios-ed-tech-hola-something-else/
46. Online education: MOOCs taken by educated few
Ezekiel Emanuel, Nature 503, 342 (2013-11-21)
• 80% of students already have an advanced degree
• 80% come from the richest 6% of the population
Michael Shanks @Stanford: “retrenchment around traditional
disciplines will make disparities even more pronounced”
An Early Report Card on Massive Open Online Courses
Geoffrey Fowler, WSJ (2013-10-08)
Amherst, Duke, etc., have rejected edX
Learning: What About MOOCs?
47. Learning: What About MOOCs?
So then, what else works better?
48. How to Flip a Class
CTL @UT/Austin
http://ctl.utexas.edu/teaching/flipping-a-class/how
1. identify where the flipped classroom model makes
the most sense for your course
2. spend class time engaging students in application
activities with feedback
3. clarify connections between inside and outside
of class learning
4. adapt your materials for students to acquire course
content in preparation of class
5. extend learning beyond class through individual
and collaborative practice
Learning: Inverted Classroom
49. Scalable Learning
David Black-Schaffer @Uppsala
Sverker Janson @KTH SICS
https://www.scalable-learning.com/
• active learning: Flipped Classroom and Just-in-Time Teaching
• exams built directly into specific diagrams within videos
• metrics for where in video+code that students get stuck
• instructor can customize subsequent classroom discussions
(active teaching phase) based on stuck/unstuck metrics
Learning: Inverted Classroom
50. Learning programming at scale
Philip Guo
O’Reilly Radar (2015-08-13)
http://radar.oreilly.com/2015/08/learning-
programming-at-scale.html
• Python Tutor
• Codechella
Tutors could keep an eye on around 50 learners during a 30-minute session, start 12 chat conversations, and help 3 learners concurrently
Learning: Collaborative Learning
51. Data-driven Education and the Quantified Student
Lorena Barba @GWU
PyData Seattle (2015)
https://youtu.be/2YIZ2SY9mW4
• keynote talk: abstract, slides
• homepage
• Open edX Universities Symposium, DC 2015-11-11
Learning: If you study just one link from this talk…
52. If by some bizarre chance you haven’t used
it already, go to https://jupyter.org/
• 50+ different language kernels
• new funding 2015-07
• UC Berkeley, Cal Poly
• nbgrader autograder by Jess Hamrick
• jupyterhub multi-user server
• curating a list of examples
• repeatable science!
see also:
Teaching with Jupyter Notebooks
http://tinyurl.com/scipy2015-education
Learning: Jupyter Project
53. Embracing Jupyter Notebooks at O'Reilly
Andrew Odewahn
O’Reilly Media (2015-05-07)
https://beta.oreilly.com/ideas/jupyter-at-oreilly
O’Reilly Media is using our Atlas platform
to make Jupyter Notebooks a first class
authoring environment for our publishing
program
Jupyter, Thebe, Atlas, Docker, etc.
Learning: O’Reilly Media
56. Is it possible to measure “distance” between
a learner and a subject community?
From Amateurs to Connoisseurs:
Modeling the Evolution of User
Expertise through Online Reviews
Julian McAuley, Jure Leskovec
http://i.stanford.edu/~julian/pdfs/www13.pdf
Learning: Machine Learning about People Learning
57. Learning, Assessment, Team Building, Diversity – these can be accomplished together, in situ
Collective Intelligence in Human Groups
Anita Williams Woolley @CMU
https://youtu.be/Bz1dDiW2mvM
• balance of participation (no one dominates)
• 2+ women engaging within the group
• group size < 9
• diversity of formal backgrounds
Learning: Machine Learning about People Learning
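The four group-level criteria from Woolley's talk can be captured as a simple checklist function; the field names, thresholds beyond those stated on the slide, and the sample team below are invented for illustration, not taken from the study:

```python
def collective_intelligence_check(group):
    """Check a team against the four criteria from the talk:
    balanced participation, 2+ women, group size < 9, diverse backgrounds."""
    checks = {
        "balanced_participation": max(group["talk_share"]) < 0.5,  # no one dominates
        "two_plus_women": group["num_women"] >= 2,
        "small_group": len(group["talk_share"]) < 9,
        "diverse_backgrounds": len(set(group["backgrounds"])) >= 3,
    }
    return checks

team = {
    "talk_share": [0.2, 0.25, 0.15, 0.2, 0.2],  # fraction of speaking time each
    "num_women": 2,
    "backgrounds": ["statistics", "design", "engineering", "biology", "design"],
}
result = collective_intelligence_check(team)
print(all(result.values()), result)
```

Framing the criteria as data makes the "machine learning about people learning" point tangible: group composition is itself measurable input.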
59. Data Science teams apply machine learning (automation)
to help arrive at key insights, to learn what is important
in data sets – finding the proverbial needle in the haystack
Cognitive Computing exhibits people + automation
as a process, in a learning context
That’s also a basic tenet of workflows in general:
people + automation
And a key aspect of the emerging gig economy too…
People + Automation
61. People + Automation: Gig Economy
http://orchestra.unlimitedlabs.com/
“Workflows with humans and machines”
62. People + Automation: Gig Economy
Workers in a World of Continuous Partial Employment
Tim O’Reilly
Medium (2015-08-31)
https://medium.com/the-wtf-economy/workers-in-a-
world-of-continuous-partial-employment-4d7b53f18f96
http://conferences.oreilly.com/next-economy
63. Learning is key. Effective use of Data Science in these new
economic conditions requires people + automation, learning
together – albeit in different ways. Plus, there’s an excellent
framework for that:
Autopoiesis and Cognition
Humberto Maturana, Francisco Varela
Springer (1973)
https://books.google.es/books?id=nVmcN9Ja68kC
People + Automation
64. I’d like to leave this as a theme for you to consider for Data Science 2016, Moving Up into use cases…
We see an intersection of key points in both the emerging
Cognitive Computing context and the Gig Economy in general:
systems of people + automation, learning together
It posits an interesting duality for us to leverage
With that I wish you a great conference here at Big Data Spain!
People + Automation