1. Big Data & Artificial Intelligence
2014 Technology Review and Primer
Zavain Dar
2. High Level
Data —> Infrastructure —> Enables more Data —> Analytics,
Applications, & Artificial Intelligence
If we buy the above, we see ‘AI’, ‘Big Data’, ‘Deep Learning’, etc…
not as buzzwords, but as a logical next step in the technological
progress of the past 20 years
3. Outline
• Historical Context: The Web, Big Data, & Distributed Computing
• Modern Infrastructure
• Artificial Intelligence
• Learnings & Thesis Directions
4. Computing Infrastructure pre Web
• Storage Paradigm: Relational
Databases (Oracle, MySQL, etc…)
• Access Paradigm: Relational Algebra
(SQL)
• Each computer owned its data,
computation was generally done on
a single computer
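A minimal sketch of this single-machine model, using Python's built-in sqlite3 module as a stand-in for Oracle or MySQL. The tables, names, and amounts are made up for illustration; the point is only that storage (relational tables) and access (a SQL join and aggregate) both live on one computer.

```python
import sqlite3

def orders_per_customer():
    # One in-memory database on one machine: data and computation sit together.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
        INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
        INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 5.0), (3, 2, 7.5);
    """)
    # Relational access: join the two tables and aggregate per customer.
    rows = conn.execute("""
        SELECT c.name, SUM(o.amount)
        FROM customers AS c JOIN orders AS o ON o.customer_id = c.id
        GROUP BY c.name
        ORDER BY c.name
    """).fetchall()
    conn.close()
    return rows

if __name__ == "__main__":
    print(orders_per_customer())
```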
5. 1983: 100 Nodes convert to TCP/IP
• Until 1983, there was no unified
‘internet’, rather a collection of
fragmented networks using one-off
protocols
• In 1983, the most connected 100
nodes switched to TCP/IP. The modern
Internet was born
6. The Web as a ‘Big Data’-base
• We can view the Web itself as the
first big database
• Storage Paradigm: HTML, DOM,
Relational Databases (Oracle,
MySQL)
• Access Paradigm: HTTP
7. The Web emerged as the first ‘Big Data’-set
• Other than HTTP requests, which were slow and clunky, we had no
way to index and parse web content
• A handful of search engines came and went, but all struggled to
effectively deploy algorithms atop this massive distributed data set
8. Google in 1998
• Data uniformly distributed across
computers
• Storage Paradigm: GFS (Google
File System)
• Access Paradigm: ???
• Google kept Access Paradigm
proprietary for years
9. 2004: Big Data leaves Google’s confines
• Jeff Dean and Sanjay Ghemawat
publish seminal paper outlining
MapReduce, a distributed data
access paradigm
• Storage Paradigm: GFS
• Access Paradigm: MapReduce
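To make the access paradigm concrete, here is a toy single-process word count in the MapReduce style. It is only a sketch: the function names (map_phase, shuffle, reduce_phase, word_count) are illustrative and are not Google's or Hadoop's API, and a real system would run the map and reduce steps in parallel across many machines reading from a distributed file system.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (key, value) pair for every word in one document.
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    # Shuffle: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: collapse the values for one key into a single result.
    return key, sum(values)

def word_count(documents):
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

if __name__ == "__main__":
    print(word_count(["big data", "big compute"]))
```

Because each map call sees only one document and each reduce call sees only one key, both steps can in principle be spread across a cluster without coordination beyond the shuffle.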
10. Modern Big Data
• Apache Hadoop was born as an open source project from Yahoo in 2005.
It followed Google’s GFS and Google MapReduce implementations
• Hadoop consisted of HDFS (Hadoop Distributed File System) and Hadoop MapReduce
• It took years for the open source framework to become enterprise ready. In
the interim, Cloudera and Hortonworks began offering enterprise solutions
based around Hadoop
• Others wrote completely black box, proprietary versions based on GFS and
MapReduce. Examples: Palantir and Discovery Engine
• Palantir only recently switched over to Hadoop based code.
11. Emergent Themes
• Commoditization of Infrastructure
• Early infrastructure providers have plateaued in value;
Hortonworks is a recent example, with a down-round IPO
• DevOps
• As computing models shifted from local machines to heterogeneous
hardware, new solutions emerged to help keep pace with innovation
• ‘Appification’ and Analytics atop Hadoop
12. DevOps: Docker
• Programming and testing on a
laptop is different from running on Dell
x86 clusters or a mobile + HP server setup.
• Docker wraps an application in a
portable container, making it easy
to port to heterogeneous environments
[Figure: the same containerized Application running on a laptop, on x86 clusters, and on HP and iOS hardware]
13. DevOps: Mesosphere
• The old world had Virtual Machines,
which sliced single computers into
numerous ‘virtual instances’ for
security, debugging, etc…
• Now we need the opposite: to view
entire clusters as a single computer
with shared (and hence optimized)
storage, network, and compute
15. Computational Logic + Planning
• Based on implementing static rules for a computer to follow. The
end algorithm and rules are independent of the data
• Old school (Chomskyan) NLP and chess playing followed this
approach
• Planning based on route optimization and ‘graph search’
• Eg how do you efficiently plan a UPS route, or guide a robotic
arm around obstacles of a pre known course
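As a sketch of rule-based planning via graph search, the snippet below finds the cheapest route through a small road graph with Dijkstra's algorithm. The graph ROADS and its costs are invented for illustration. Note that the algorithm is fixed in advance: nothing about it is learned from data.

```python
import heapq

def shortest_route(graph, start, goal):
    # Frontier entries are (cost so far, node, path taken), cheapest first.
    frontier = [(0, start, [start])]
    visited = set()
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, weight in graph.get(node, {}).items():
            if neighbor not in visited:
                heapq.heappush(frontier, (cost + weight, neighbor, path + [neighbor]))
    return None  # goal is unreachable

# Hypothetical delivery map: node -> {neighbor: travel cost}.
ROADS = {
    "depot": {"a": 4, "b": 1},
    "b": {"a": 2, "c": 5},
    "a": {"c": 1},
    "c": {},
}

if __name__ == "__main__":
    print(shortest_route(ROADS, "depot", "c"))
```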
16. Computational Logic + Planning
• From the 1940s through the early 1990s this was the preferred methodology for AI
• Key assumption: The world is guided by rules, and it’s just a matter of time
before we can encode the minimal viable set of rules from which computers can
deduce future outcomes and propositions
• AI results slowed, and hence so did funding, from the 70s through the 80s. This was
known as the AI Winter, largely due to heavy academic emphasis on these
methods
• The early 90s showed a focus on statistical methods, commonly dubbed the
Bayesian Revolution
• This led to the proliferation and growth of machine learning
17. Machine Learning
• Premise for machine learning:
• Have a dataset
• Have an algorithm f(D)
• f(D) applied to a dataset gives a new function (model) m(i)
• m(i) applied to any input i predicts an output o
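The premise above can be sketched as a higher-order function: f takes a dataset D and returns a new function m. The choice of algorithm here (predict the most frequent output seen for an input, with a global fallback) and the example data are assumptions made purely for illustration.

```python
from collections import Counter

def f(D):
    """Learning algorithm: consumes a dataset D, returns a model m."""
    counts = {}
    for i, o in D:
        counts.setdefault(i, Counter())[o] += 1
    # Fallback for inputs never seen in D: the most common output overall.
    fallback = Counter(o for _, o in D).most_common(1)[0][0]

    def m(i):
        """Model: maps any input i to a predicted output o."""
        if i in counts:
            return counts[i].most_common(1)[0][0]
        return fallback

    return m

if __name__ == "__main__":
    D = [("rain", "umbrella"), ("rain", "umbrella"), ("sun", "hat"), ("rain", "coat")]
    m = f(D)
    print(m("rain"), m("sun"), m("snow"))
```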
18. Machine Learning (Pictorially)
D -> f -> m(i) -> o
1. The machine learning algorithm f is
applied to the dataset D, giving the model
m
2. For any input i, the model m predicts an
output o
19. 3 Types of Machine Learning
1) Supervised Learning
D -> f -> m(i) -> o
• D consists of pairs of input, output types: <i, o>
• The larger D is, the more general and accurate the end model m tends to be
• Learn by example
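A minimal supervised example, assuming numeric inputs and outputs: the learning algorithm is an ordinary least-squares line fit, and D is a handful of made-up (input, output) pairs. The function name fit_line is illustrative.

```python
def fit_line(D):
    """Supervised f: fit o = slope * i + intercept to (i, o) pairs by least squares."""
    n = len(D)
    mean_i = sum(i for i, _ in D) / n
    mean_o = sum(o for _, o in D) / n
    covariance = sum((i - mean_i) * (o - mean_o) for i, o in D)
    variance = sum((i - mean_i) ** 2 for i, _ in D)
    slope = covariance / variance
    intercept = mean_o - slope * mean_i

    def m(i):
        # The learned model: predicts an output for an input it has never seen.
        return slope * i + intercept

    return m

if __name__ == "__main__":
    D = [(1, 3), (2, 5), (3, 7), (4, 9)]  # examples generated from o = 2i + 1
    m = fit_line(D)
    print(m(10))
```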
20. 3 Types of Machine Learning
2) Unsupervised (Topological) Learning
D -> f -> m(i) -> o
• D consists of just inputs: <i>
• Generally end up with a partitioning of D
• Good at finding patterns
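A small sketch of unsupervised learning as partitioning: a two-cluster, one-dimensional k-means. The data points are invented, and seeding the two centers at the minimum and maximum is a simplification chosen to keep the example deterministic.

```python
def two_means(D, iterations=10):
    """Unsupervised f: split unlabeled numbers into two clusters (1-D k-means, k=2)."""
    centers = [min(D), max(D)]  # deterministic starting centers (an assumption)
    clusters = [[], []]
    for _ in range(iterations):
        clusters = [[], []]
        # Assignment step: each point joins its nearest center.
        for x in D:
            nearest = 0 if abs(x - centers[0]) <= abs(x - centers[1]) else 1
            clusters[nearest].append(x)
        # Update step: move each center to the mean of its cluster.
        centers = [
            sum(cluster) / len(cluster) if cluster else centers[index]
            for index, cluster in enumerate(clusters)
        ]
    return clusters

if __name__ == "__main__":
    print(two_means([1.0, 1.2, 0.8, 9.0, 9.5, 10.0]))
```

No outputs are supplied anywhere in D; the partition emerges from the structure of the inputs alone.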
21. 3 Types of Machine Learning
3) Reinforcement Learning
D -> f -> m(i) -> o -> (fed back into D)
• You add some derivative of the output back to the initial dataset, and reoptimize your
model
• Eg Learning to play chess by playing over and over again. Ideally the more you play the
less you lose
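The loop described above can be sketched with a toy two-action game. Everything here is a hypothetical setup: the environment play returns a fixed reward per action, the model is just the average observed reward per action, and untried actions are treated optimistically so each gets tried once.

```python
def play(action):
    """Hypothetical environment: a fixed reward for each action."""
    return {0: 0.2, 1: 0.8}[action]

def f(D, actions=(0, 1)):
    """Refit the model from D: average observed reward per action."""
    averages = {}
    for a in actions:
        rewards = [r for act, r in D if act == a]
        # Untried actions look infinitely good, so each is explored once.
        averages[a] = sum(rewards) / len(rewards) if rewards else float("inf")

    def m():
        # The model picks the action with the best average so far.
        return max(actions, key=lambda a: averages[a])

    return m

def train(rounds=5):
    D = []
    m = f(D)
    for _ in range(rounds):
        action = m()
        reward = play(action)
        D.append((action, reward))  # feed the outcome back into the dataset
        m = f(D)                    # re-optimize the model
    return D, m

if __name__ == "__main__":
    D, m = train()
    print([action for action, _ in D], m())
```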
22. Deep Learning
• Deep Learning and Neural Nets are synonymous
• Deep Learning is a subset of machine learning, it is a class of
functions f from the previous slides
• Deep learning algorithms take in a data set and spit out another
function, or model, m
• Can be deployed in supervised, unsupervised, and reinforcement
contexts
23. Deep Learning
• First theorized and worked on in the 80s
• However, lacked the infrastructure and data to meaningfully deploy
• Has seen a massive resurgence from 2009 onwards
• Loosely inspired by (vague) knowledge of the brain: layers of abstraction
24. Deep Learning
• Useful for noisy, large, human generated data
• That is, data for which even the correct form of the model input i can be tricky to
characterize
• When I see a picture of a human face, I immediately recognize eyes, a nose
and ears … hence a face
• When a computer receives the same image, it’s a rectangular grid of RGB
values. How do we map the computer’s input space to our semantic space?
• Types of data that this makes sense for: Text, Visual (images & video), Audio,
User behavior (my patterns on Twitter or Facebook), Basketball (player
millisecond movement), etc…
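To illustrate what the computer actually receives, here is a made-up 2x2 image as a grid of RGB values, flattened into the raw numeric input a model would see. The function name flatten_image and the scaling to the range 0 to 1 are illustrative choices, not a prescribed preprocessing step.

```python
def flatten_image(image):
    """Turn a grid of (R, G, B) pixels into one flat list of numbers in [0, 1]."""
    return [channel / 255.0 for row in image for pixel in row for channel in pixel]

# Hypothetical 2x2 image: red, green / blue, white.
image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]

if __name__ == "__main__":
    i = flatten_image(image)
    print(len(i), i[:3])
```

Nothing in these numbers says "eye" or "nose"; bridging from this input space to semantic concepts is the job left to the learned layers.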
25. Deep Learning: Functions Artificial Neural Nets Can Learn
[Figure collage; recoverable panel titles and examples:]
• Image Models: Good Fine-grained Classification; Sensible Errors (example labels: “hibiscus”, “dahlia”, “dog”)
• Audio: “sh ang hai res taur aun ts”
• LSTM for End to End Translation
• Embeddings are Powerful (plotted words: fall, fell, fallen, draw, drew, drawn, take, took, taken, give, gave, given)
• Sentence representations under PCA: linearly separable with respect to subject vs object
• Generating Image Captions from Pixels (work in progress by Oriol Vinyals)
Human: A young girl asleep on the sofa cuddling a stuffed bear.
Model sample 1: A close up of a child holding a stuffed animal.
Model sample 2: A baby is asleep next to a teddy bear.
26. Current Landscape
GPUs, FPGAs, ASICs (User wants specialized deployments either for the learning
function f or the end model m):
Select examples: Nervana Systems, TeraDeep, Qualcomm Neuromorphic Group
APIs, SDKs (User wants to use prewritten algos on their datasets):
Select examples: MetaMind, Skymind.io, Vicarious, DeepMind
Vertical (Technology is black-boxed from user):
Select examples: Clarifai, Butterfly Network, Binatix, etc…
28. Learnings
Static software commoditizes
• Early big data infrastructure providers stagnating
• Google’s algorithms are essentially public (PageRank etc…)
• Deep Learning algos are an arms race & race to the bottom
Defensibility, and the ability to grow into a large (100M+) company, lies in owning proprietary data from which you can train
better models, and/or in having a network or scale effect
Why is now special? We’re sitting at the intersection of:
1. a matured big data infrastructure driven by well understood distributed storage and data access paradigms
2. data that continues to explode, not only through the web, but also via noisy sensor and human generated data
3. the AI tools necessary to make sense of unstructured and noisy datasets whose features don’t map well
to our a priori intuition
31. Feedback Loops
• Google collects click data from each user; this enables better search
for the next user: the (n+1)th user has a better experience than the nth user
• Google widens its margin over the competition the more we use it
• Leads to a run-away effect
• Can explain Google’s monopoly in search
• Same analogy with Facebook/Twitter ads and other large tech companies
• Prediction: Early movers who can bootstrap initial feedback loop will
be big, potentially winner-take-all, winners
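A toy simulation of such a run-away effect, under loudly stated assumptions that are not from the slides: two competing products, users choosing between them in proportion to quality squared (a small quality edge attracts a more-than-proportional share), and quality improving linearly with the usage data collected. The parameter values are arbitrary.

```python
def simulate(quality_a=1.1, quality_b=1.0, rounds=20, gain=0.5):
    """Return product A's usage share per round under a data feedback loop."""
    shares = []
    for _ in range(rounds):
        # Assumption: preference is superlinear in quality (quality squared).
        share_a = quality_a ** 2 / (quality_a ** 2 + quality_b ** 2)
        shares.append(share_a)
        # Assumption: each product improves in proportion to the usage it captures.
        quality_a += gain * share_a
        quality_b += gain * (1 - share_a)
    return shares

if __name__ == "__main__":
    shares = simulate()
    print(round(shares[0], 3), round(shares[-1], 3))
```

Under these assumptions a small initial lead compounds every round. With purely proportional preference (exponent 1 instead of 2) the quality ratio would stay constant, so the superlinearity is doing the work in this sketch.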
32. Data —> Infrastructure —> Enables more Data —> Analytics,
Applications, & Artificial Intelligence