Eleven Almost-Truisms
About Data
2015-07-24 • Seattle
Paco Nathan, @pacoid

O’Reilly Learning
Set and Setting:
Almost a Dozen Almost-Truisms about Data …
to consider when embarking on a journey into Data Science
There are a number of preconceptions about
working with data at scale, where the realities
beg to differ
We’ll crank this number up to eleven – the
actual number is of course much larger,
but that’s perhaps for another day
Let’s discuss some less-intuitive directions,
along with likely consequences and corollaries
This is not intended to prove a set of points,
rather to provide a set of launching points
Set and Setting:
#01: Because Rates
The rates of data being stored and analyzed
jumped quite dramatically in the late 1990s
to early 2000s … partly because storage
became incredibly cheap … partly because
internetworked machines suddenly started
producing much more machine data
Fifteen years later, the rates jump again, this
time by orders of magnitude … Because IoT
It’s almost like this thing has a pulse?
#01: Because Rates
In other words, to paraphrase von Schelling,
experience precedes analysis
Typically, we’re swimming in data, and we tend
to respond by struggling to understand its
structure and dynamics
That, in contrast to the myth that our analysis
drives data collection
#01: Because Rates
Four independent teams were working toward horizontal scale-out of workflows based on commodity hardware
This effort prepared the way for huge Internet successes during the 1997 holiday season…
AMZN, EBAY, Inktomi (YHOO Search), then GOOG
MapReduce on clusters of commodity hardware and the Apache Hadoop open source stack emerged from this context
#01: Because Rates – 1997 Q3 Inflection Point
Amazon
“Early Amazon: Splitting the website” – Greg Linden
glinden.blogspot.com/2006/02/early-amazon-splitting-website.html
eBay
“The eBay Architecture” – Randy Shoup, Dan Pritchett
addsimplicity.com/adding_simplicity_an_engi/2006/11/you_scaled_your.html
addsimplicity.com.nyud.net:8080/downloads/eBaySDForum2006-11-29.pdf
Inktomi (YHOO Search)
“Inktomi’s Wild Ride” – Eric Brewer (0:05:31 ff)
youtu.be/E91oEn1bnXM
Google
“Underneath the Covers at Google” – Jeff Dean (0:06:54 ff)
youtu.be/qsan-GQaeyk
perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx
#01: Because Rates – 1997 Q3 Inflection Point
[Diagram: e-commerce data architecture. Labels: Customers, Web Apps, Middleware (servlets, models), RDBMS, customer transactions, SQL Query result sets, Logs, event history, DW ETL, aggregation, dashboards, Algorithmic Modeling, recommenders + classifiers. Roles: Stakeholder, Product, Engineering, UX.]
#01: Because Rates – Circa 2001, post e-commerce success
[Same diagram, now annotated with “data products”]
#01: Because Rates – Circa 2001, post e-commerce success
Primary sources for the notion:
Cleveland, W. S., “Data Science: an Action Plan for Expanding the Technical Areas of the Field of Statistics,” International Statistical Review (2001), 69, 21-26.
http://cm.bell-labs.com/stat/doc/datascience.ps
Breiman, L., “Statistical modeling: the two cultures,” Statistical Science (2001), 16:199-231.
http://projecteuclid.org/euclid.ss/1009213726
…also good to mention John Tukey
#01: Because Rates – Whither Data Science?
Rashomon, the 1950 Japanese period drama by Akira Kurosawa, symbolizes a long-standing tension in Statistics, one which Mark Twain described ever so succinctly…
wikipedia.org/wiki/Rashomon:
“The film is known for a plot device which involves various characters providing alternative, self-serving and contradictory versions of the same incident.”
#01: Because Rates – A Sea Change
Because IoT! (exabytes/day per sensor)
bits.blogs.nytimes.com/2013/06/19/g-e-makes-the-machine-and-then-uses-sensors-to-listen-to-it/
#01: Because Rates – A Sea Change, Redux
#02: Batch Defenestration
Batch Analytics
Going strong, since 1944

Been there, done that
Businesses want to join the 21st c. and level up to streaming analytics
“I saw what you did … in batch,” now performed a zillion times faster
#02: Batch Defenestration – Infrastructure, Remodeled
[Chart: Contributors per Month to Spark, 2011 to 2015; y-axis 0 to 100]
Most active project at Apache,
More than 500 known production deployments
Tuning Spark Streaming for Throughput
Gerard Maas, 2014-12-22
virdata.com/tuning-spark/
#02: Batch Defenestration – “Team Apache”, $316.4M funding
Can Spark Streaming survive Chaos Monkey?
Bharat Venkat, Prasanna Padmanabhan, Antony Arokiasamy, Raju Uppalapati
techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
#02: Batch Defenestration – Resiliency, at the edge of Comp Sci
#03: Circa 1904
Trending interests:
• electric cars
• organic farm-to-table cuisine
• permaculture
• sustainable urbanism
#03: Circa 1904
Speaking of batch windows…
The last century or two of statistics
represent an extremely huge mess
Let’s start the clock over, then move
forward into a more real-time near-future
#03: Circa 1904
#03: Circa 1904 – Lies, Damn Lies, Statistics, Data Science
Probability got going, formally, in the 16th c. – although interesting mathematical estimations trace back to classical times
Arabs in the 9th c. used frequency analysis – later rediscovered by Europeans during the early Italian Renaissance
Statistics followed, originally more about what we might call demographics – through 18th c.
Laplace, Gauss, et al., bridged prob & stats in the late 18th c. using distributions (what we studied in Stats 101) to infer the probability of errors in estimates
Much of the 19th/20th c. work was about using goodness of fit tests, etc., to justify some distribution
• generally speaking, that requires samples
• that, in turn, implies batch windows
#03: Circa 1904 – Lies, Damn Lies, Statistics, Data Science
While 19th/20th c. stats work focused on defensibility, 21st c. work, w.r.t. Big Data apps, focuses more on predictability – plus there’s a shift in how we make estimates…
BTW, doesn’t it seem weird to crunch through piles
of data in large batch jobs, at large expense, when
the results get used to approximate features
ultimately? Why not perform that in stream?
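One way to read the question above: many summary statistics can be maintained incrementally as records arrive, instead of being recomputed over a stored batch. The following is a minimal sketch, assuming Welford's online algorithm for mean and variance as the illustration (the talk does not name a specific method); the class name `RunningStats` and the sample readings are hypothetical.

```python
class RunningStats:
    """Tracks count, mean, and variance of a stream in O(1) memory (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        # Fold one new observation into the running summary.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    def variance(self):
        # Sample variance; undefined for fewer than two observations.
        return self._m2 / (self.n - 1) if self.n > 1 else float("nan")


stats = RunningStats()
for reading in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(reading)  # no batch window: each value is seen once, then discarded
```

The summary is available at any point in the stream, which is the property a batch job lacks.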
#03: Circa 1904 – Lies, Damn Lies, Statistics, Data Science
Probabilistic data structures: a fascinating, relatively new area pioneered by relatively few people – e.g., Philippe Flajolet
Provides approximation with error bounds using far fewer resources (RAM, CPU, etc.)
highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
#03: Circa 1904 – Lies, Damn Lies, Statistics, Data Science
algorithm          use case
Bloom Filter       set membership
MinHash            set similarity
HyperLogLog        set cardinality
Count-Min Sketch   frequency summaries
DSQ                streaming quantiles
SkipList           ordered sequence search
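To make the first row of the table concrete, here is a minimal Bloom filter sketch for approximate set membership. It is an illustration under stated assumptions, not a production implementation: the class name `BloomFilter`, the sizes, and the choice of salted SHA-256 as the hash family are all hypothetical choices made for this example.

```python
import hashlib


class BloomFilter:
    """Approximate set membership: no false negatives, some false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity over compactness

    def _positions(self, item):
        # Derive k bit positions by salting a SHA-256 digest with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(item))


seen = BloomFilter()
for user in ["alice", "bob", "carol"]:
    seen.add(user)
```

The filter's memory is fixed up front, no matter how many items are added; the cost is a false-positive rate that grows as the bit array fills.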
#03: Circa 1904 – Lies, Damn Lies, Statistics, Data Science
E.g., accepting ±4% error could buy you a two-orders-of-magnitude reduction in the memory footprint required for an analytics app
OSS projects such as Algebird and BlinkDB
provide for this newer approach to the math of
approximations at scale
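The trade of bounded error for memory can be sketched with a Count-Min Sketch, which keeps frequency estimates in a fixed-size table. This is a minimal sketch under stated assumptions: the class name `CountMinSketch`, the table dimensions, and the salted SHA-256 hashing are hypothetical choices for illustration, and this is not the API of Algebird or BlinkDB.

```python
import hashlib


class CountMinSketch:
    """Frequency estimates in fixed memory; estimates never undercount."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _columns(self, item):
        # One column per row, chosen by a row-salted hash of the item.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._columns(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum is the tightest bound.
        return min(self.table[row][col] for row, col in self._columns(item))


sketch = CountMinSketch()
events = ["page_a"] * 50 + ["page_b"] * 5 + ["page_c"]
for event in events:
    sketch.add(event)
```

The table holds width × depth counters regardless of how many distinct items appear in the stream; widening it tightens the error bound.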
#03: Circa 1904 – Lies, Damn Lies, Statistics, Data Science
#04: Your API is an Illusion
IMO, many notions of “API” are illusions
Arguably, reductionist shell games
And that imposes limitations on how we
work, and even how we think…
#04: Your API is an Illusion
[Diagram: a data workflow circa 2010, spanning representation, optimization, and evaluation. Stages: ETL into cluster/cloud; Data Prep; Features; Explore; Unsupervised Learning; Learners, Parameters; train set, test set; models; Evaluate; Optimize; Scoring on production data; visualize, reporting; data pipelines; use cases; actionable results; decisions, feedback. Annotations: “foo algorithms”, “bar developers”.]
Algorithms and developer-centric template thinking
only go so far in a workflow…
Results are shown in blue, while the real work is highlighted in red
#04: Your API is an Illusion – The Libraries: Alexandria, Redux
On the other hand, Physics does well to teach modeling – I like to hire physicists to work on Data teams…
They tend to get the interdisciplinary aspects: got the math background, coding experience, generally good at systems engineering, etc.
Not saying we must all rush out to get Physics degrees, but there’s something to be learned there, vital for the work and priorities ahead
#04: Your API is an Illusion – The Interzone
“The impact of computing extends far beyond science… affecting all aspects of our lives. To flourish in today's world, everyone needs computational thinking.” – Jeannette Wing, CMU
Computing now ranks alongside the proverbial Reading, Writing, and Arithmetic…
Center for Computational Thinking @ CMU
http://www.cs.cmu.edu/~CompThink/
Exploring Computational Thinking @ Google
https://www.google.com/edu/computational-thinking/
#04: Your API is an Illusion – Antidote: Computational Thinking
#05: Code Inceptionism
Even so, do we really need to write code for WordCount 10^N times?
#05: Code Inceptionism
Inceptionism: Going Deeper into Neural Networks
Alexander Mordvintsev, Christopher Olah, Mike Tyka
Google (2015-06-17)
googleresearch.blogspot.com/2015/06/inceptionism-going-deeper-into-neural.html
Artificial Neural Networks have spurred remarkable recent
progress in image classification and speech recognition. But
even though these are very useful tools based on well-known
mathematical methods, we actually understand surprisingly
little of why certain models work and others don’t. So let’s
take a look at some simple techniques for peeking inside
these networks.
#05: Code Inceptionism
Imagine data mining GitHub commit
histories of popular open source projects,
then applying genetic programming to
evolve patches for other OSS projects... 



In other words, brilliant:
Sidebar: Claire Le Goues, automating software repair
Claire Le Goues
cmu.edu
GenProg: A Generic Method for Automatic Software Repair
Claire Le Goues, ThanhVu Nguyen, Stephanie Forrest, Westley Weimer
IEEE TSE (2012)
www.cs.cmu.edu/~clegoues/docs/legoues-tse-genprog12.pdf
We describe the algorithm and report experimental results of its success on 16 programs totaling 1.25M lines of C code and 120K lines of module code, spanning eight classes of defects, in 357 seconds, on average. We analyze the generated repairs qualitatively and quantitatively to demonstrate that the process efficiently produces evolved programs that repair the defect, are not fragile input memorizations, and do not lead to serious degradation in functionality.
#05: Code Inceptionism
#06: Database Extinction?
Are databases going extinct?
Distributed file systems that can be accessed
as column stores are generally quite useful
There’s an old saying in Computer Science: it’s difficult to distinguish a really good file system from a database, and vice versa
#06: Database Extinction?
Original definitions for what became relational
databases had less to do with dedicated SQL
products, more similarity with something like
Spark SQL:
A relational model of data for large shared data banks
Edgar Codd
Communications of the ACM (1970)
dl.acm.org/citation.cfm?id=362685
#06: Database Extinction?
[Diagram: Spark with Project Tungsten. Labels: Python, SQL, R, Streaming, Advanced Analytics, DataFrame, Tungsten Execution. Physical Execution: CPU Efficient Data Structures; keep data closer to CPU cache]
#07: “N Dims good, 2 Dims baa-d”
Consider: matrices, pivot tables, etc.
Our thinking about data representation is often quite two-dimensional…
#07: “N Dims good, 2 Dims baa-d”
• many real-world problems are often represented as graphs
• graphs can generally be converted into sparse matrices (bridge to linear algebra)
• eigenvectors find the stable points in a system defined by matrices – which may be more efficient to compute
• beyond simpler graphs, complex data may require work with tensors
#07: “N Dims good, 2 Dims baa-d”
Suppose we have a graph as shown below:
We call x a vertex (sometimes called a node)
An edge (sometimes called an arc) is any line
connecting two vertices
[Figure: a graph with four vertices: u, v, w, x]
#07: “N Dims good, 2 Dims baa-d”
We can represent this kind of graph as an
adjacency matrix:
• label the rows and columns based on the vertices
• entries get a 1 if an edge connects the corresponding vertices, or 0 otherwise
[Figure: the same graph with vertices u, v, w, x]
    u  v  w  x
u   0  1  0  1
v   1  0  1  1
w   0  1  0  1
x   1  1  1  0
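The construction above can be sketched in a few lines. The edge list below is read off the matrix on this slide; the variable names are hypothetical, and plain nested lists are used to keep the sketch dependency-free (a sparse representation would be the usual choice at scale).

```python
vertices = ["u", "v", "w", "x"]
# Undirected edges, as implied by the 1 entries in the matrix above.
edges = [("u", "v"), ("u", "x"), ("v", "w"), ("v", "x"), ("w", "x")]

# Label rows and columns by vertex.
index = {name: i for i, name in enumerate(vertices)}
n = len(vertices)
adjacency = [[0] * n for _ in range(n)]

# An undirected edge sets both (a, b) and (b, a).
for a, b in edges:
    adjacency[index[a]][index[b]] = 1
    adjacency[index[b]][index[a]] = 1

for name, row in zip(vertices, adjacency):
    print(name, *row)
```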
#07: “N Dims good, 2 Dims baa-d”
The adjacency matrix of an undirected graph always has certain
properties:
• it is symmetric, i.e., A = A^T
• it has real eigenvalues
Therefore algebraic graph theory bridges
between linear algebra and graph theory
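As a small check of these properties on the four-vertex example, the sketch below verifies symmetry and estimates the largest eigenvalue. Power iteration is used here as an assumed, illustrative method (the slides do not prescribe one), and the function names `matvec` and `dominant_eigenpair` are hypothetical.

```python
import math

# Adjacency matrix of the example graph (rows and columns ordered u, v, w, x).
A = [
    [0, 1, 0, 1],
    [1, 0, 1, 1],
    [0, 1, 0, 1],
    [1, 1, 1, 0],
]
n = len(A)

# Symmetry check: A equals its transpose.
is_symmetric = all(A[i][j] == A[j][i] for i in range(n) for j in range(n))


def matvec(matrix, vec):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vec)) for row in matrix]


def dominant_eigenpair(matrix, iterations=200):
    """Power iteration: repeatedly apply the matrix and renormalize."""
    vec = [1.0] * len(matrix)
    for _ in range(iterations):
        nxt = matvec(matrix, vec)
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    # Rayleigh quotient v^T A v for the unit vector v.
    value = sum(v * w for v, w in zip(vec, matvec(matrix, vec)))
    return value, vec


eigenvalue, eigenvector = dominant_eigenpair(A)
```

For this graph the largest eigenvalue works out to (1 + √17) / 2, roughly 2.56, and the eigenvector assigns equal weight to u and w, which play identical roles in the graph.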
#07: “N Dims good, 2 Dims baa-d”
Tensors are a good way to handle time-series geo-spatially distributed linked data with lots of N-dimensional attributes
In other words, potentially a general case for handling much of the data that we’re likely to encounter
#07: “N Dims good, 2 Dims baa-d”
Although tensor factorization is considered
problematic, it may provide more general
case solutions:
The Tensor Renaissance in Data Science
Anima Anandkumar @UC Irvine
radar.oreilly.com/2015/05/the-tensor-renaissance-in-data-science.html
Spacey Random Walks and Higher Order Markov Chains
David Gleich @Purdue
slideshare.net/dgleich/spacey-random-walks-and-higher-order-markov-chains
#07: “N Dims good, 2 Dims baa-d”
#08: Science … and Data
There is Science … and there is Data
Data Science is largely about interdisciplinary
teams, largely about crossing boundaries
(organizational, cognitive) that might otherwise
preclude arriving at crucial insights –
In other words, about learning
It’s also about the repeatability and predictive
aspects of science, where workflows combine
people + automation
NB: may conflict with large portions of academia
which tend to decontextualize subjects
#08: Science … and Data
The Science in Data Science tends to rely on
the phenomenology and modeling of complex
systems (did we already mention Physics?)
Speaking of science and predictions, two
important works to include:
• Charles Sanders Peirce – one of the most prolific scientists in the US, and also one of the most fierce critics (abduction, etc.)
• Karl Popper – who articulated some of the inherent risks of mixing “science”, “history”, and politics
#08: Science … and Data
For excellent examples of Science and Data
together, see CodeNeuro, particularly for
use of notebooks:
#08: Science … and Data
#09: Learning Curves are Forever
Learning Curves are forever – the part you need to manage more carefully than just about anything else, especially within a social context
In some sense, this is the essence of Data Science: How well do you learn?
Much of the risk in managing a Data Science team is about budgeting for the learning curve
#09: Learning Curves are Forever
In contrast, IT has a long history of
practicing a flavor of engineering
“conservatism”: highly structured
process, strictly codified practices
People learn a few things well, then
avoid having to struggle with learning
many new things perpetually…
That leads to enormous teams and
low ROI, among other badness
[Chart axes: scale ➞, complexity ➞]
#09: Learning Curves are Forever
Throw Your Life a Curve
Whitney Johnson
blogs.hbr.org/johnson/2012/09/throw-your-life-a-curve.html
Aggressively Pro-Active Learning:
• deconstruction of the cognitive bias One Size Fits All
• “makes a compelling case for personal disruption”
• “plan your career around learning curves”
• hire people who learn/re-learn efficiently
#09: Learning Curves are Forever
Education is more than just lessons, exams,
certifications, instructor evaluations, etc., …
though some tools would try to reduce it to that level
What’s even more interesting is to leverage
ML to understand the “distance” between
the learner and the subject material
#10: Books, not so much, sadly…
Speaking as a former alt bookstore owner…
Sadly, we don’t use books quite as much these days:
• above ~35: buy it on Kindle
• below ~35: watch it on YouTube
#10: Books, not so much, sadly…
From a publisher perspective, consider
some of the risks:
• fewer people buy the titles
• search engines surface oh-so-much noise
• increasingly, it’s more difficult for experts
to take time to author good content and
keep it updated
#10: Books, not so much, sadly…
However, it’s unlikely that Kindle, etc.,
represent the end-all-be-all of publishing…
Here’s an idea: your next “book” or
“video” should be able to compute
something useful
#10: Books, not so much, sadly…
Interactive notebooks: Sharing the code
Helen Shen
Nature (2014-11-05)
nature.com/news/interactive-notebooks-sharing-the-code-1.16261
#10: Books, not so much – Repeatable Science
Embracing Jupyter Notebooks at O'Reilly
Andrew Odewahn, 2015-05-07
https://beta.oreilly.com/ideas/jupyter-at-oreilly
“O'Reilly Media is using our Atlas platform to make Jupyter Notebooks a first class authoring environment for our publishing program.”
Jupyter, Thebe, Docker, etc.
#10: Books, not so much – Something Borrowed, Something New
#11: A MOOCish Edumacation?
MOOCs have become popular, some are
quite useful … even so, these tend to have a very low completion rate
Don’t hold your breath waiting for MOOCs
to replace other modes of education
Learning generally requires a social context:
for reinforcement, peer insights/modeling,
and frankly some people really feel a need
to be given permission to learn
#11: A MOOCish Edumacation?
One problem with university study is that
disciplines tend to decontextualize
GalvanizeU is a rare opportunity in that way:
accredited, with contextualized hands-on
experience
#11: A MOOCish Edumacation?
A significant improvement may be found in the notion of “flipped” or inverted classrooms
For a good example, see:
Caltech Offers Online Course with Live Lectures in Machine Learning
Yaser Abu-Mostafa (2012-03-30)
http://www.caltech.edu/news/caltech-offers-online-course-live-lectures-machine-learning-4248
#11: A MOOCish Edumacation?
So a good bit of advice about learning and
Data Science … is to invert your classrooms,
recontextualize, cross the boundaries to do
things that matter, and leverage the hands-on
social aspects of learning
Like here at GalvanizeU
Summary…
Thank You
contact:
Just Enough Math
O’Reilly (2014)
justenoughmath.com
preview: youtu.be/TQ58cWgdCpA
monthly newsletter for updates, events, conf summaries, etc.:
liber118.com/pxn/
Intro to Apache Spark
O’Reilly (2015)
shop.oreilly.com/product/0636920036807.do
Sometimes A Strange Notion
After we’ve cleaned up data, formulated workflows
in terms of monoids, used graph representation, and
parallelized with a wealth of linear algebra, much of
the heavy-lifting that remains on the clusters is in
optimization
For example, deep learning @Google uses many layers of neural nets trained with gradient descent optimization
Taming Latency Variability and Scaling Deep Learning
Jeff Dean @Google (2013)
youtu.be/S9twUcX1Zp0
Vector Quantization:
One advantage of quantum algorithms is to run large gradient descent problems in constant time… When reworking high-ROI apps to leverage lots of ML and large clusters, SGD represents the datacenter cost basis, notably the part that scales…
Want to slash costs exponentially? Plug in quantum for a game-changer, maybe
Fast quantum algorithm for numerical gradient estimation
Stephen P. Jordan
Phys. Rev. Lett. 95, 050501 (2005)
arxiv.org/abs/quant-ph/0405146
dwavesys.com
Vector Quantization:
Proposal: let’s drop clusters of quantum devices into lunar polar craters, so we can handle massive vector quantization workloads
• micro-kelvin environs
• near perpetual sunlight for energy sources
• park routers at L4
• approx. $15B to finance, i.e., ~6 days DoD budget
Vector Quantization:
We’ll just put this here… a couple o’ Googly projects in progress:
qCraft: Quantum Physics In Minecraft
plus.google.com/u/1/+QuantumAILab/posts/grMbaaDGChH
Vector Quantization:
“We’re going back to the Moon. For good.”
lunar.xprize.org

More Related Content

What's hot

How Apache Spark fits in the Big Data landscape
How Apache Spark fits in the Big Data landscapeHow Apache Spark fits in the Big Data landscape
How Apache Spark fits in the Big Data landscape
Paco Nathan
 
How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscape
Paco Nathan
 
Big Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & AntidotesBig Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Krishna Sankar
 
Databricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User GroupDatabricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User Group
Paco Nathan
 
Augury and Omens Aside, Part 1:
 The Business Case for Apache Mesos
Augury and Omens Aside, Part 1:
 The Business Case for Apache MesosAugury and Omens Aside, Part 1:
 The Business Case for Apache Mesos
Augury and Omens Aside, Part 1:
 The Business Case for Apache Mesos
Paco Nathan
 
Gephi, Graphx, and Giraph
Gephi, Graphx, and GiraphGephi, Graphx, and Giraph
Gephi, Graphx, and Giraph
Doug Needham
 
Data Science with Spark - Training at SparkSummit (East)
Data Science with Spark - Training at SparkSummit (East)Data Science with Spark - Training at SparkSummit (East)
Data Science with Spark - Training at SparkSummit (East)
Krishna Sankar
 
Graph Analytics in Spark
Graph Analytics in SparkGraph Analytics in Spark
Graph Analytics in Spark
Paco Nathan
 
PyData 2015 Keynote: "A Systems View of Machine Learning"
PyData 2015 Keynote: "A Systems View of Machine Learning" PyData 2015 Keynote: "A Systems View of Machine Learning"
PyData 2015 Keynote: "A Systems View of Machine Learning"
Joshua Bloom
 
Big Graph Analytics on Neo4j with Apache Spark
Big Graph Analytics on Neo4j with Apache SparkBig Graph Analytics on Neo4j with Apache Spark
Big Graph Analytics on Neo4j with Apache Spark
Kenny Bastani
 
H2O Big Join Slides
H2O Big Join SlidesH2O Big Join Slides
H2O Big Join Slides
Sri Ambati
 
Agile data science with scala
Agile data science with scalaAgile data science with scala
Agile data science with scala
Andy Petrella
 
Scalable Data Science and Deep Learning with H2O
Scalable Data Science and Deep Learning with H2OScalable Data Science and Deep Learning with H2O
Scalable Data Science and Deep Learning with H2O
odsc
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communities
Paco Nathan
 
Agile data science: Distributed, Interactive, Integrated, Semantic, Micro Ser...
Agile data science: Distributed, Interactive, Integrated, Semantic, Micro Ser...Agile data science: Distributed, Interactive, Integrated, Semantic, Micro Ser...
Agile data science: Distributed, Interactive, Integrated, Semantic, Micro Ser...
Andy Petrella
 
Strata sf - Amundsen presentation
Strata sf - Amundsen presentationStrata sf - Amundsen presentation
Strata sf - Amundsen presentation
Tao Feng
 
Deep Learning with MXNet - Dmitry Larko
Deep Learning with MXNet - Dmitry LarkoDeep Learning with MXNet - Dmitry Larko
Deep Learning with MXNet - Dmitry Larko
Sri Ambati
 
Data Science Folk Knowledge
Data Science Folk KnowledgeData Science Folk Knowledge
Data Science Folk Knowledge
Krishna Sankar
 
How Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryHow Lyft Drives Data Discovery
How Lyft Drives Data Discovery
Neo4j
 
Realtime Data Analysis Patterns
Realtime Data Analysis PatternsRealtime Data Analysis Patterns
Realtime Data Analysis Patterns
Mikio L. Braun
 

What's hot (20)

How Apache Spark fits in the Big Data landscape
How Apache Spark fits in the Big Data landscapeHow Apache Spark fits in the Big Data landscape
How Apache Spark fits in the Big Data landscape
 
How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscape
 
Big Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & AntidotesBig Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & Antidotes
 
Databricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User GroupDatabricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User Group
 
Augury and Omens Aside, Part 1:
 The Business Case for Apache Mesos
Augury and Omens Aside, Part 1:
 The Business Case for Apache MesosAugury and Omens Aside, Part 1:
 The Business Case for Apache Mesos
Augury and Omens Aside, Part 1:
 The Business Case for Apache Mesos
 
Gephi, Graphx, and Giraph
Gephi, Graphx, and GiraphGephi, Graphx, and Giraph
Gephi, Graphx, and Giraph
 
Data Science with Spark - Training at SparkSummit (East)
Data Science with Spark - Training at SparkSummit (East)Data Science with Spark - Training at SparkSummit (East)
Data Science with Spark - Training at SparkSummit (East)
 
Graph Analytics in Spark
Graph Analytics in SparkGraph Analytics in Spark
Graph Analytics in Spark
 
PyData 2015 Keynote: "A Systems View of Machine Learning"
PyData 2015 Keynote: "A Systems View of Machine Learning" PyData 2015 Keynote: "A Systems View of Machine Learning"
PyData 2015 Keynote: "A Systems View of Machine Learning"
 
Big Graph Analytics on Neo4j with Apache Spark
Big Graph Analytics on Neo4j with Apache SparkBig Graph Analytics on Neo4j with Apache Spark
Big Graph Analytics on Neo4j with Apache Spark
 
H2O Big Join Slides
H2O Big Join SlidesH2O Big Join Slides
H2O Big Join Slides
 
Agile data science with scala
Agile data science with scalaAgile data science with scala
Agile data science with scala
 
Scalable Data Science and Deep Learning with H2O
Scalable Data Science and Deep Learning with H2OScalable Data Science and Deep Learning with H2O
Scalable Data Science and Deep Learning with H2O
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communities
 
Agile data science: Distributed, Interactive, Integrated, Semantic, Micro Ser...
Agile data science: Distributed, Interactive, Integrated, Semantic, Micro Ser...Agile data science: Distributed, Interactive, Integrated, Semantic, Micro Ser...
Agile data science: Distributed, Interactive, Integrated, Semantic, Micro Ser...
 
Strata sf - Amundsen presentation
Strata sf - Amundsen presentationStrata sf - Amundsen presentation
Strata sf - Amundsen presentation
 
Deep Learning with MXNet - Dmitry Larko
Deep Learning with MXNet - Dmitry LarkoDeep Learning with MXNet - Dmitry Larko
Deep Learning with MXNet - Dmitry Larko
 
Data Science Folk Knowledge
Data Science Folk KnowledgeData Science Folk Knowledge
Data Science Folk Knowledge
 
How Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryHow Lyft Drives Data Discovery
How Lyft Drives Data Discovery
 
Realtime Data Analysis Patterns
Realtime Data Analysis PatternsRealtime Data Analysis Patterns
Realtime Data Analysis Patterns
 

Viewers also liked

Apache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big DataApache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big Data
Paco Nathan
 
Data Science Reinvents Learning?
Data Science Reinvents Learning?Data Science Reinvents Learning?
Data Science Reinvents Learning?
Paco Nathan
 
Microservices, Containers, and Machine Learning
Microservices, Containers, and Machine LearningMicroservices, Containers, and Machine Learning
Microservices, Containers, and Machine Learning
Paco Nathan
 
QCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark StreamingQCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark Streaming
Paco Nathan
 
What's new with Apache Spark?
What's new with Apache Spark?What's new with Apache Spark?
What's new with Apache Spark?
Paco Nathan
 
Microservices, containers, and machine learning
GalvanizeU Seattle: Eleven Almost-Truisms About Data

  • 1. Eleven Almost-Truisms About Data 2015-07-24 • Seattle Paco Nathan, @pacoid
 O’Reilly Learning
  • 2. Set and Setting: Almost a Dozen Almost-Truisms about Data … to consider when embarking on a journey into Data Science. There are a number of preconceptions about working with data at scale, where the realities beg to differ. We’ll crank this number up to eleven – even though the actual number is of course much larger, but that’s perhaps for another day.
  • 3. Almost a Dozen Almost-Truisms about Data … to consider when embarking on a journey into Data Science. Let’s discuss some less-intuitive directions, along with likely consequences and corollaries. This is not intended to prove a set of points, rather to provide a set of launching points. Set and Setting:
  • 5. The rates of data being stored and analyzed jumped quite dramatically in the late 1990s to early 2000s … partly because storage became incredibly cheap … partly because internetworked machines suddenly started producing much more machine data. Fifteen years later, the rates jump again, this time by orders of magnitude … Because IoT. It’s almost like this thing has a pulse? #01: Because Rates
  • 6. In other words, to paraphrase von Schelling, experience precedes analysis. Typically, we’re swimming in data, and we tend to respond by struggling to understand its structure and dynamics. That, in contrast to the myth that our analysis drives data collection. #01: Because Rates
  • 7. Four independent teams were working toward horizontal scale-out of workflows based on commodity hardware. This effort prepared the way for huge Internet successes during the 1997 holiday season… AMZN, EBAY, Inktomi (YHOO Search), then GOOG. MapReduce on clusters of commodity hardware and the Apache Hadoop open source stack emerged from this context. #01: Because Rates – 1997 Q3 Inflection Point
  • 8. Amazon: “Early Amazon: Splitting the website” – Greg Linden, glinden.blogspot.com/2006/02/early-amazon-splitting-website.html
 eBay: “The eBay Architecture” – Randy Shoup, Dan Pritchett, addsimplicity.com/adding_simplicity_an_engi/2006/11/you_scaled_your.html and addsimplicity.com.nyud.net:8080/downloads/eBaySDForum2006-11-29.pdf
 Inktomi (YHOO Search): “Inktomi’s Wild Ride” – Eric Brewer (0:05:31 ff), youtu.be/E91oEn1bnXM
 Google: “Underneath the Covers at Google” – Jeff Dean (0:06:54 ff), youtu.be/qsan-GQaeyk and perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx
 #01: Because Rates – 1997 Q3 Inflection Point
  • 9. [Architecture diagram; labels: RDBMS, SQL Query, result sets, recommenders + classifiers, Web Apps, customer transactions, Algorithmic Modeling, Logs, event history, aggregation, dashboards, Product, Engineering, UX, Stakeholder, Customers, DW, ETL, Middleware, servlets, models] #01: Because Rates – Circa 2001, post e-commerce success
  • 10. [Architecture diagram; labels: RDBMS, SQL Query, result sets, recommenders + classifiers, Web Apps, customer transactions, Algorithmic Modeling, Logs, event history, aggregation, dashboards, Product, Engineering, UX, Stakeholder, Customers, DW, ETL, Middleware, servlets, models, “data products”] #01: Because Rates – Circa 2001, post e-commerce success
  • 11. Primary sources for the notion:
 Cleveland, W. S., “Data Science: an Action Plan for Expanding the Technical Areas of the Field of Statistics,” International Statistical Review (2001), 69, 21-26. http://cm.bell-labs.com/stat/doc/datascience.ps
 Breiman, L., “Statistical modeling: the two cultures,” Statistical Science (2001), 16:199-231. http://projecteuclid.org/euclid.ss/1009213726
 …also good to mention John Tukey
 #01: Because Rates – Whither Data Science?
  • 12. Rashomon, the 1950 Japanese period drama by Akira Kurosawa, symbolizes a long-standing tension in Statistics, one which Mark Twain described ever so succinctly…
 wikipedia.org/wiki/Rashomon: “The film is known for a plot device which involves various characters providing alternative, self-serving and contradictory versions of the same incident.”
 #01: Because Rates – A Sea Change
  • 13. Because IoT! (exabytes/day per sensor) bits.blogs.nytimes.com/2013/06/19/g-e-makes-the-machine-and-then-uses-sensors-to-listen-to-it/ #01: Because Rates – A Sea Change, Redux
  • 16. #02: Batch Defenestration Batch Analytics Going strong, since 1944
 Been there, done that
  • 17. Businesses want to join the 21c., and level up to streaming analytics. “I saw what you did … in batch,” now performed a zillion times faster. #02: Batch Defenestration – Infrastructure, Remodeled
 [Chart: Contributors per Month to Spark, 2011 to 2015; y-axis 0 to 100]
 Most active project at Apache. More than 500 known production deployments.
  • 18. Tuning Spark Streaming for Throughput, Gerard Maas, 2014-12-22, virdata.com/tuning-spark/ #02: Batch Defenestration – “Team Apache”, $316.4M funding
  • 19. Can Spark Streaming survive Chaos Monkey? Bharat Venkat, Prasanna Padmanabhan, Antony Arokiasamy, Raju Uppalapati, techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html #02: Batch Defenestration – Resiliency, at the edge of Comp Sci
  • 21. Trending interests: • electric cars • organic farm-to-table cuisine • permaculture • sustainable urbanism #03: Circa 1904
  • 22. Speaking of batch windows… The last century or two of statistics represent an extremely huge mess. Let’s start the clock over, then move forward into a more real-time near-future. #03: Circa 1904
  • 23. #03: Circa 1904 – Lies, Damn Lies, Statistics, Data Science
 Probability got going, formally, in the 16th c. – although interesting mathematical estimations trace back to classical times.
 Arabs in the 9th c. used frequency analysis – later rediscovered by Europeans during the early Italian Renaissance.
 Statistics followed, originally more about what we might call demographics – through 18th c.
  • 24. Laplace, Gauss, et al., bridged prob & stats in the late 18th c. using distributions (what we studied in Stats 101) to infer the probability of errors in estimates. Much of the 19th/20th c. work was about using goodness of fit tests, etc., justifying some distribution: • generally speaking, that requires samples • that, in turn, implies batch windows #03: Circa 1904 – Lies, Damn Lies, Statistics, Data Science
  • 25. While 19th/20th c. stats work focused on defensibility, 21st c. work, w.r.t. Big Data apps, focuses more on predictability – plus there’s a shift in how we make estimates… BTW, doesn’t it seem weird to crunch through piles of data in large batch jobs, at large expense, when the results get used to approximate features ultimately? Why not perform that in stream? #03: Circa 1904 – Lies, Damn Lies, Statistics, Data Science
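As one small illustration of computing a summary "in stream" rather than over a batch window, here is a minimal Python sketch of a single-pass mean and variance using Welford's method. The class name `RunningStats` and the sample values are assumptions made for illustration; they are not from the talk, and this particular summary is exact rather than approximate.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Single-pass mean and variance (Welford's method); hypothetical helper."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the current mean

    def update(self, x: float) -> None:
        # Fold one new observation into the summary; no history is stored.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Sample variance; undefined for fewer than two observations.
        return self.m2 / (self.n - 1) if self.n > 1 else float("nan")


stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:  # illustrative stream
    stats.update(value)
```

The state is three numbers regardless of how many values arrive, which is the property that makes this style of estimate fit a streaming workload.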
  • 26. A fascinating, relatively new area pioneered by relatively few people – e.g., Philippe Flajolet. Provides approximation with error bounds using far fewer resources (RAM, CPU, etc.) highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/ #03: Circa 1904 – Lies, Damn Lies, Statistics, Data Science
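A Bloom filter is one well-known structure of this kind: it answers set-membership queries in fixed memory, with possible false positives but no false negatives. The following is a minimal Python sketch, not a production implementation; the class name, the default sizes, and the use of salted SHA-256 to derive positions are all assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch; sizes and hashing scheme are illustrative."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # fixed-size bit array

    def _positions(self, item: str):
        # Derive k bit positions by salting one hash function with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


seen = BloomFilter()
for user in ["alice", "bob", "carol"]:  # hypothetical stream of keys
    seen.add(user)
```

Memory stays at `num_bits` no matter how many items are added; what grows instead is the false-positive rate, which is the trade the slide describes.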
  • 27. algorithm | use case | example
 Bloom Filter | set membership | code
 MinHash | set similarity | code
 HyperLogLog | set cardinality | code
 Count-Min Sketch | frequency summaries | code
 DSQ | streaming quantiles | code
 SkipList | ordered sequence search | code
 #03: Circa 1904 – Lies, Damn Lies, Statistics, Data Science
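To make the "frequency summaries" row concrete, here is a minimal Python sketch of a Count-Min Sketch. It is a sketch under stated assumptions: the class name, the `width` and `depth` defaults, and the salted SHA-256 hashing are illustrative choices, not anything specified in the talk.

```python
import hashlib


class CountMinSketch:
    """Minimal Count-Min Sketch; dimensions and hashing are illustrative."""

    def __init__(self, width: int = 256, depth: int = 4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row: int, item: str) -> int:
        # One salted hash per row stands in for independent hash functions.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item: str) -> int:
        # Collisions only inflate counters, so the row minimum is the
        # tightest available estimate and never undercounts.
        return min(
            self.table[row][self._index(row, item)]
            for row in range(self.depth)
        )


sketch = CountMinSketch()
events = ["click", "view", "click", "purchase", "view", "click"]  # sample data
for event in events:
    sketch.add(event)
```

The estimate for any item is at least its true count and at most the total number of events seen, with the overcount controlled by `width` and `depth`.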
  • 28. E.g., accepting ±4% error could buy you a reduction of two orders of magnitude in the required memory footprint for an analytics app. OSS projects such as Algebird and BlinkDB provide for this newer approach to the math of approximations at scale #03: Circa 1904 – Lies, Damn Lies, Statistics, Data Science
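To make the error-versus-memory trade concrete, here is a rough HyperLogLog sketch in pure Python. It is not Algebird's or BlinkDB's API; the class, the register count, and the synthetic `user-N` keys are assumptions for illustration. With m = 1024 registers the typical relative error is about 1.04/√m, roughly 3%.

```python
import hashlib
import math


class HyperLogLog:
    """Approximate distinct count using m = 2**p small registers."""

    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant, valid for m >= 128

    def add(self, item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")       # 64-bit hash value
        idx = h >> (64 - self.p)                    # first p bits choose a register
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1  # position of the leftmost 1-bit
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def count(self):
        z = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / z
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # small-range correction (linear counting)
            estimate = self.m * math.log(self.m / zeros)
        return estimate


hll = HyperLogLog(p=10)
true_count = 50_000
for i in range(true_count):
    hll.add(f"user-{i}")
estimate = hll.count()
relative_error = abs(estimate - true_count) / true_count
```

The sketch keeps 1,024 small integers however many distinct keys it sees, whereas an exact count has to remember every key. That gap is where the memory savings come from.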
  • 29. #04: Your API is an Illusion
  • 30. IMO, many notions of “API” are illusions Arguably, reductionist shell games And that imposes limitations on how we work, and even how we think… #04: Your API is an Illusion
  • 31. [Workflow diagram, circa 2010, with stages grouped as representation, optimization, evaluation. Labels: ETL into cluster/cloud; data; Data Prep; Features; Learners, Parameters; Unsupervised Learning; Explore; train set; test set; models; Evaluate; Optimize; Scoring; production data; use cases; data pipelines; visualize, reporting; actionable results; decisions, feedback; foo algorithms; bar developers] Algorithms and developer-centric template thinking only go so far in a workflow… Results are shown in blue, while the real work is highlighted in red #04: Your API is an Illusion – The Libraries: Alexandria, Redux
  • 32. On the other hand, Physics does well to teach modeling – I like to hire physicists to work on Data teams… They tend to get the interdisciplinary aspects: they have the math background and coding experience, are generally good at systems engineering, etc. Not saying we must all rush out to get Physics degrees, but there’s something to be learned there, vital for the work and priorities ahead #04: Your API is an Illusion – The Interzone
  • 33. “The impact of computing extends far beyond science… affecting all aspects of our lives. To flourish in today's world, everyone needs computational thinking.” – Jeannette Wing, CMU. Computing now ranks alongside the proverbial Reading, Writing, and Arithmetic…
 Center for Computational Thinking @ CMU
 http://www.cs.cmu.edu/~CompThink/
 Exploring Computational Thinking @ Google
 https://www.google.com/edu/computational-thinking/
 #04: Your API is an Illusion – Antidote: Computational Thinking
  • 35. Even so, do we really need to 
 write code for WordCount 
 10^N times? #05: Code Inceptionism
  • 36. Inceptionism: Going Deeper into 
 Neural Networks
 Alexander Mordvintsev, 
 Christopher Olah, Mike Tyka
 Google (2015-06-17)
 googleresearch.blogspot.com/2015/06/inceptionism-going-deeper-into-neural.html
 “Artificial Neural Networks have spurred remarkable recent progress in image classification and speech recognition. But even though these are very useful tools based on well-known mathematical methods, we actually understand surprisingly little of why certain models work and others don’t. So let’s take a look at some simple techniques for peeking inside these networks.” #05: Code Inceptionism
  • 37. Imagine data mining GitHub commit histories of popular open source projects, then applying genetic programming to evolve patches for other OSS projects… In other words, brilliant. Sidebar: Claire Le Goues (cmu.edu), automating software repair
 GenProg: A Generic Method for Automatic Software Repair
 Claire Le Goues, ThanhVu Nguyen, Stephanie Forrest, Westley Weimer
 IEEE TSE (2012)
 www.cs.cmu.edu/~clegoues/docs/legoues-tse-genprog12.pdf
 “We describe the algorithm and report experimental results of its success on 16 programs totaling 1.25M lines of C code and 120K lines of module code, spanning eight classes of defects, in 357 seconds, on average. We analyze the generated repairs qualitatively and quantitatively to demonstrate that the process efficiently produces evolved programs that repair the defect, are not fragile input memorizations, and do not lead to serious degradation in functionality.” #05: Code Inceptionism
  • 39. Are databases going extinct? Distributed file systems that can be accessed as column stores are generally quite useful There’s an old saying in Computer Science: 
 it’s difficult to distinguish a really good file system from a database, and vice versa #06: Database Extinction?
  • 40. Original definitions for what became relational databases had less to do with dedicated SQL products, and more in common with something like Spark SQL: A relational model of data for 
 large shared data banks
 Edgar Codd
 Communications of the ACM (1970)
 dl.acm.org/citation.cfm?id=362685 #06: Database Extinction?
  • 41. #06: Database Extinction? [Diagram: Spark with Tungsten execution. Labels: Python, SQL, R, Streaming, Advanced Analytics, DataFrame, Tungsten Execution. Physical Execution: CPU-efficient data structures; keep data closer to CPU cache]
  • 42. #07: “N Dims good, 2 Dims baa-d”
  • 43. Consider: matrices, pivot tables, etc. Our thinking about data representation 
 is often quite two-dimensional… #07: “N Dims good, 2 Dims baa-d”
  • 44. • many real-world problems are often represented as graphs • graphs can generally be converted into sparse matrices (bridge to linear algebra) • eigenvectors find the stable points in 
 a system defined by matrices – which 
 may be more efficient to compute • beyond simpler graphs, complex data 
 may require work with tensors #07: “N Dims good, 2 Dims baa-d”
  • 45. Suppose we have a graph as shown below. We call x a vertex (sometimes called a node). An edge (sometimes called an arc) is any line connecting two vertices. [Figure: graph with vertices u, v, w, x] #07: “N Dims good, 2 Dims baa-d”
  • 46. We can represent this kind of graph as an adjacency matrix: • label the rows and columns based on the vertices • entries get a 1 if an edge connects the corresponding vertices, or 0 otherwise [Figure: graph with vertices u, v, w, x]
       u  v  w  x
    u  0  1  0  1
    v  1  0  1  1
    w  0  1  0  1
    x  1  1  1  0
 #07: “N Dims good, 2 Dims baa-d”
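The construction can be sketched in a few lines of Python. The edge list below is read off the matrix on this slide (u–v, u–x, v–w, v–x, w–x); the variable names are illustrative.

```python
vertices = ["u", "v", "w", "x"]
edges = [("u", "v"), ("u", "x"), ("v", "w"), ("v", "x"), ("w", "x")]

# label rows and columns by vertex, in a fixed order
index = {name: i for i, name in enumerate(vertices)}
n = len(vertices)

adjacency = [[0] * n for _ in range(n)]
for a, b in edges:
    # undirected graph: mark the edge in both directions
    adjacency[index[a]][index[b]] = 1
    adjacency[index[b]][index[a]] = 1

for name, row in zip(vertices, adjacency):
    print(name, row)
```

For large graphs most entries are 0, so a sparse representation (storing only the edges) is the usual choice. The dense list of lists here is only for readability.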
  • 47. For an undirected graph, an adjacency matrix always has certain properties: • it is symmetric, i.e., A = A^T • it has real eigenvalues. Therefore algebraic graph theory bridges between linear algebra and graph theory #07: “N Dims good, 2 Dims baa-d”
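Both properties can be checked on the four-vertex example with plain Python. The sketch below tests symmetry and then uses power iteration (one simple technique among several, chosen here for brevity) to find the dominant eigenvalue and its eigenvector; the function names are assumptions for this example.

```python
A = [
    [0, 1, 0, 1],  # u
    [1, 0, 1, 1],  # v
    [0, 1, 0, 1],  # w
    [1, 1, 1, 0],  # x
]


def is_symmetric(m):
    size = len(m)
    return all(m[i][j] == m[j][i] for i in range(size) for j in range(size))


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def power_iteration(m, steps=200):
    """Repeatedly apply m and renormalize; converges to the dominant eigenvector."""
    v = [1.0] * len(m)
    for _ in range(steps):
        w = matvec(m, v)
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # Rayleigh quotient v.(m v), with v already at unit length
    eigenvalue = sum(a * b for a, b in zip(v, matvec(m, v)))
    return eigenvalue, v


symmetric = is_symmetric(A)
dominant_eigenvalue, centrality = power_iteration(A)
```

The resulting vector is the direction left unchanged by repeated multiplication by A. In graph terms it is often read as eigenvector centrality, and here vertices v and x (degree 3) score higher than u and w (degree 2).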
  • 48. Tensors are a good way to handle time-series geo-spatially distributed linked data with lots of N-dimensional attributes. In other words, potentially a general case for handling much of the data that we’re likely to encounter #07: “N Dims good, 2 Dims baa-d”
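As a small illustration of going beyond two dimensions, the sketch below stores a sparse 3-mode tensor as a dictionary keyed by (hour, station, event type). The data, the key layout, and the helper names are hypothetical, chosen only to show slicing and collapsing along a mode.

```python
from collections import defaultdict

# Hypothetical event counts keyed by (hour, station, event_type); missing keys mean zero.
counts = {
    (0, "seattle", "login"): 3,
    (0, "seattle", "purchase"): 1,
    (1, "seattle", "login"): 5,
    (1, "portland", "login"): 2,
}


def slice_mode(tensor, mode, key):
    """Fix one index (mode 0, 1, or 2) and return the remaining 2-D slice."""
    return {
        tuple(k for i, k in enumerate(idx) if i != mode): value
        for idx, value in tensor.items()
        if idx[mode] == key
    }


def sum_over_mode(tensor, mode):
    """Collapse one mode by summing, leaving a 2-D table keyed by the other two."""
    out = defaultdict(int)
    for idx, value in tensor.items():
        out[tuple(k for i, k in enumerate(idx) if i != mode)] += value
    return dict(out)


seattle = slice_mode(counts, mode=1, key="seattle")  # hour x event_type table
totals = sum_over_mode(counts, mode=0)               # station x event_type table
```

A pivot table is one such 2-D slice or collapse. Keeping the full tensor means the choice of which two dimensions to look at can be deferred.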
  • 49. Although tensor factorization is considered problematic, it may provide more general case solutions:
 The Tensor Renaissance in Data Science
 Anima Anandkumar @UC Irvine
 radar.oreilly.com/2015/05/the-tensor-renaissance-in-data-science.html
 Spacey Random Walks and Higher Order Markov Chains
 David Gleich @Purdue
 slideshare.net/dgleich/spacey-random-walks-and-higher-order-markov-chains
 #07: “N Dims good, 2 Dims baa-d”
  • 50. #08: Science … and Data
  • 51. There is Science … and there is Data. Data Science is largely about interdisciplinary teams, largely about crossing boundaries (organizational, cognitive) that might otherwise preclude arriving at crucial insights – in other words, about learning. It’s also about the repeatability and predictive aspects of science, where workflows combine people + automation. NB: may conflict with large portions of academia, which tend to decontextualize subjects #08: Science … and Data
  • 52. The Science in Data Science tends to rely on the phenomenology and modeling of complex systems (did we already mention Physics?). Speaking of science and predictions, two important figures to include: • Charles Sanders Peirce – one of the most prolific scientists in the US, and also one of its fiercest critics (abduction, etc.) • Karl Popper – who articulated some of the inherent risks of mixing “science”, “history”, and politics #08: Science … and Data
  • 53. For excellent examples of Science and Data together, see CodeNeuro, particularly for use of notebooks: #08: Science … and Data
  • 54. #09: Learning Curves are Forever
  • 55. Learning Curves are forever – the part you need to manage more carefully than just about anything else, especially within a social context. In some sense, this is the essence of Data Science: How well do you learn? Much of the risk in managing a Data Science team is about budgeting for the learning curve #09: Learning Curves are Forever
  • 56. In contrast, IT has a long history of practicing a flavor of engineering “conservatism”: highly structured process, strictly codified practices. People learn a few things well, then avoid having to struggle with learning many new things perpetually… That leads to enormous teams and low ROI, among other badness [Chart axes: scale ➞, complexity ➞] #09: Learning Curves are Forever
  • 57. Throw Your Life a Curve
 Whitney Johnson
 blogs.hbr.org/johnson/2012/09/throw-your-life-a-curve.html
 Aggressively Pro-Active Learning: • deconstruction of the cognitive bias One Size Fits All • “makes a compelling case for personal disruption” • “plan your career around learning curves” • hire people who learn/re-learn efficiently #09: Learning Curves are Forever
  • 58. #09: Learning Curves are Forever Education is more than just lessons, exams, certifications, instructor evaluations, etc., … though some tools would try to reduce it 
 to that level What’s even more interesting is to leverage ML to understand the “distance” between the learner and the subject material
  • 59. #10: Books, not so much, sadly…
  • 60. Speaking as a former alt bookstore owner… Sadly, we don’t use books quite as much these days: • above ~35: buy it on Kindle • below ~35: watch it on YouTube #10: Books, not so much, sadly…
  • 61. From a publisher perspective, consider some of the risks: • fewer people buy the titles • search engines surface oh-so-much noise • increasingly, it’s more difficult for experts to take time to author good content and keep it updated #10: Books, not so much, sadly… [Chart: Contributors per Month to Spark, 2011–2015, y-axis 0–100. Most active project at Apache; more than 500 known production deployments]
  • 62. However, it’s unlikely that Kindle, etc., represent the end-all-be-all of publishing… Here’s an idea: your next “book” or “video” should be able to compute something useful #10: Books, not so much, sadly…
  • 63. Interactive notebooks: Sharing the code
 Helen Shen
 Nature (2014-11-05)
 nature.com/news/interactive-notebooks-sharing-the-code-1.16261
 #10: Books, not so much – Repeatable Science
  • 64. Embracing Jupyter Notebooks at O'Reilly
 Andrew Odewahn, 2015-05-07 https://beta.oreilly.com/ideas/jupyter-at-oreilly “O'Reilly Media is using our Atlas platform to 
 make Jupyter Notebooks a first class authoring environment for our publishing program.” Jupyter, Thebe, Docker, etc. #10: Books, not so much – Something Borrowed, Something New
  • 65. #10: Books, not so much – Something Borrowed, Something New
  • 66. #11: A MOOCish Edumacation?
  • 67. MOOCs have become popular, some are quite useful … even so, these tend to have 
 a very low completion rate Don’t hold your breath waiting for MOOCs to replace other modes of education Learning generally requires a social context: for reinforcement, peer insights/modeling, and frankly some people really feel a need to be given permission to learn #11: A MOOCish Edumacation?
  • 68. One problem with university study is that disciplines tend to decontextualize. GalvanizeU is a rare opportunity in that way: accredited, with contextualized hands-on experience #11: A MOOCish Edumacation?
  • 69. A significant improvement may be found in the notion of “flipped” or inverted classrooms. For a good example, see:
 Caltech Offers Online Course with Live Lectures in Machine Learning
 Yaser Abu-Mostafa (2012-03-30)
 http://www.caltech.edu/news/caltech-offers-online-course-live-lectures-machine-learning-4248
 #11: A MOOCish Edumacation?
  • 70. So a good bit of advice about learning and Data Science … is to invert your classrooms, recontextualize, cross the boundaries to do things that matter, and leverage the hands-on social aspects of learning Like here at GalvanizeU Summary…
  • 72. contact: Just Enough Math O’Reilly (2014) justenoughmath.com
 preview: youtu.be/TQ58cWgdCpA monthly newsletter for updates, 
 events, conf summaries, etc.: liber118.com/pxn/ Intro to Apache Spark
 O’Reilly (2015)
 shop.oreilly.com/product/0636920036807.do
  • 74. After we’ve cleaned up data, formulated workflows in terms of monoids, used graph representation, and parallelized with a wealth of linear algebra, much of the heavy-lifting that remains on the clusters is in optimization For example, deep learning @Google 
 uses many layers of neural nets trained 
 with gradient descent optimization. Taming Latency Variability and Scaling Deep Learning
 Jeff Dean @Google (2013)
 youtu.be/S9twUcX1Zp0 Vector Quantization:
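For readers who have not seen it written out, gradient descent reduces to a short loop. The sketch below is plain (full-gradient) descent on a toy two-variable objective; the function names, learning rate, and objective are illustrative assumptions and have nothing to do with the systems described in the talk cited above.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=200):
    """Repeatedly step against the gradient from a starting point."""
    x = list(start)
    for _ in range(steps):
        g = grad(x)
        x = [xi - learning_rate * gi for xi, gi in zip(x, g)]
    return x


# Toy objective f(x, y) = (x - 3)^2 + (y + 1)^2, which is minimized at (3, -1).
def grad_f(point):
    x, y = point
    return [2.0 * (x - 3.0), 2.0 * (y + 1.0)]


minimum = gradient_descent(grad_f, start=[0.0, 0.0])
```

Stochastic gradient descent (SGD) follows the same loop but estimates the gradient from a small sample of the data at each step, which is what makes it practical on clusters.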
  • 75. One potential advantage of quantum algorithms is estimating a gradient in a constant number of queries, regardless of dimension, which could speed up large gradient descent problems… If you rework high-ROI apps to leverage lots of ML and large clusters, then SGD represents the datacenter cost basis, notably the part that scales… Want to slash costs exponentially? 
 Plug in quantum for a game-changer,
 maybe Fast quantum algorithm for 
 numerical gradient estimation
 Stephen P. Jordan
 Phys. Rev. Lett. 95, 050501 (2005)
 arxiv.org/abs/quant-ph/0405146 dwavesys.com Vector Quantization:
  • 76. Proposal: let’s drop clusters of quantum devices into lunar polar craters, so we 
 can handle massive vector quantization workloads • micro-kelvin environs • near perpetual sunlight 
 for energy sources • park routers at L4 • approx. $15B to finance, 
 i.e., ~6 days DoD budget Vector Quantization:
  • 77. We’ll just put this here… 
 a couple o’ Googly projects in progress: qCraft: Quantum Physics In Minecraft
 plus.google.com/u/1/+QuantumAILab/posts/grMbaaDGChH
 “We’re going back to the Moon. For good.” lunar.xprize.org
 Vector Quantization: