Predictive Analytics - Big Data & Artificial Intelligence

Demystify the following buzzwords:
• Big Data
• Artificial Intelligence (AI)
• Natural Language Processing (NLP)
• Predictive Analytics
Ultimate Goal: Predictive Analytics
Predict what users will want to buy.
Example: a consumer searches for a TV and, based on the data, we show a product that also has a high probability of being bought.
Evolution of Data Analytics
• Excel: What Happened?
• Business Intelligence (BI): What’s Happening?
• Big Data & Artificial Intelligence (2015 and beyond): What Will Happen? Big Data is where the data is stored; AI algorithms process that data to detect patterns, running on Central Processing Units (CPUs) / Graphics Processing Units (GPUs).
How Did We Get Here?
Traditional systems:
• Relational databases
• Gigabytes in size
• Low latency
Big data systems:
• Terabytes in size
• Custom hardware
Supervised Learning
We know what we are trying to predict. We use examples whose answers both we and the model know to “train” the model; it can then generate predictions for examples we don’t know the answer to.
Example: Predict the price of a house based on the size of the house.
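The house-price example is the classic supervised setup. Below is a minimal sketch, assuming invented (size, price) training pairs: a straight line is fit by ordinary least squares, then used to predict a size the model has not seen.

```python
# Minimal supervised-learning sketch: fit price = slope * size + intercept
# on labeled examples, then predict for an unseen size.
# All numbers are invented for illustration.

def fit_line(xs, ys):
    """Ordinary least squares for a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

sizes = [1000, 1500, 2000, 2500]               # sq ft (training inputs)
prices = [200_000, 300_000, 400_000, 500_000]  # known answers ("labels")

slope, intercept = fit_line(sizes, prices)
predicted = slope * 1800 + intercept  # price for an unseen 1800 sq ft house
print(predicted)  # → 360000.0
```

Real models use many features and held-out test data, but the train-then-predict shape is the same.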
Unsupervised Learning
We don’t know what we are trying to predict. We are trying to identify naturally occurring patterns in the data which may be informative.
Example: Try to identify “clusters” of customers based on the data we have.
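A minimal sketch of the clustering idea: one-dimensional k-means with k = 2 on invented annual customer spend. The algorithm is given no labels; it only looks for groups.

```python
# Minimal unsupervised-learning sketch: 1-D k-means clustering.
# Spend figures and starting centers are invented for illustration.

def kmeans_1d(values, centers, iters=10):
    """Alternate assigning points to the nearest center and
    moving each center to the mean of its assigned points."""
    for _ in range(iters):
        groups = {c: [] for c in centers}
        for v in values:
            nearest = min(centers, key=lambda c: abs(v - c))
            groups[nearest].append(v)
        centers = [sum(g) / len(g) if g else c for c, g in groups.items()]
    return sorted(centers)

spend = [120, 150, 130, 900, 950, 1000]  # invented annual spend per customer
centers = kmeans_1d(spend, centers=[100, 1000])
print(centers)  # two cluster centers: low spenders vs. high spenders
```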
What is Deep Learning?
• Deep Learning and Neural Networks are often used synonymously
• It’s a branch of machine learning based on a set of algorithms that attempt to model high-level abstractions in data by using a deep graph with multiple processing layers, composed of multiple linear and non-linear transformations
[Figure: what we see vs. what the computer “sees”: the same image as a grid of numeric pixel values]
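The point of the comparison is that, to a computer, an image is only numbers. A tiny sketch with an invented 4×4 grayscale image:

```python
# What the computer "sees": a grayscale image is just a grid of pixel
# intensities (0 = black, 255 = white). Values are invented.
image = [
    [  0,   0, 255, 255],
    [  0,   0, 255, 255],
    [255, 255,   0,   0],
    [255, 255,   0,   0],
]

# A model consumes these raw numbers, often flattened into one vector.
flat = [px for row in image for px in row]
print(len(flat))  # → 16
```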
Tools of The Trade
Named after the yellow toy elephant of Doug Cutting’s son. In 2006, while working at Yahoo, Doug came up with the Hadoop framework. In 2008 it became a project of the open source Apache Software Foundation, hence the official name Apache Hadoop.
Hadoop to the Rescue
“an open source framework written in Java for storing and
processing massive amounts of data in a distributed manner”
2 Key Components of the Framework:
1. Storage – Hadoop Distributed File System (HDFS): a scalable file system that distributes and stores data across many machines in a cluster.
2. Analysis – MapReduce: a framework for processing that data in parallel across the cluster.
Hadoop can run on cheap commodity hardware, on premises or in the cloud.
HDFS stores files in large blocks (64 MB) across multiple machines for fault tolerance. By default, each block is stored on 3 separate machines.
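The storage arithmetic can be sketched directly. The 64 MB block size and 3-way replication come from the text; the 200 MB file size is invented:

```python
import math

BLOCK_MB = 64  # HDFS block size used in this deck
REPLICAS = 3   # default replication factor

def block_count(file_mb):
    """How many HDFS blocks a file of file_mb megabytes occupies."""
    return math.ceil(file_mb / BLOCK_MB)

file_mb = 200                  # invented example file
blocks = block_count(file_mb)  # 200 MB / 64 MB -> 4 blocks
copies = blocks * REPLICAS     # each block lives on 3 machines
print(blocks, copies)  # → 4 12
```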
MapReduce breaks large data processing problems into multiple steps, namely map and reduce tasks, that can be worked on in parallel on multiple machines. (A Job Tracker schedules the tasks; Task Trackers execute them on the worker machines, which also host the HDFS Data Nodes.)
MapReduce Store Sales Data
[Figure: a MapReduce job over store sales. Mappers running on the Data Nodes emit records keyed by city (LA, NYC); a shuffle-and-sort step groups records by key; Reducers aggregate each group. The Name Node tracks block locations, and the Job Tracker assigns work to Task Trackers.]
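The flow above can be sketched in memory: map emits (city, amount) pairs, shuffle-and-sort groups them by key, and reduce aggregates each group. The sales records are invented:

```python
from collections import defaultdict

# Invented (city, sale amount) records from two stores.
records = [("LA", 100), ("NYC", 250), ("LA", 75), ("NYC", 25), ("LA", 50)]

# Map: emit (key, value) pairs; here each record already is one.
mapped = [(city, amount) for city, amount in records]

# Shuffle and sort: group values by key, as the framework does
# between the map and reduce phases.
grouped = defaultdict(list)
for city, amount in mapped:
    grouped[city].append(amount)

# Reduce: aggregate each key's values (total sales per city).
totals = {city: sum(amounts) for city, amounts in grouped.items()}
print(totals)  # → {'LA': 225, 'NYC': 275}
```

In real Hadoop the map and reduce tasks run on different machines and the shuffle moves data over the network, but the logical phases are the same.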