Implementing AI
with Big Data
SoCal Data Science Conference 2017
Raymond Fu
Los Angeles, CA
10-22-2017
Raymond Fu
Data Architect
Trace3
Email: rfu@trace3.com
Twitter: @RaymondxFu
The Future World with AI
What Problem is AI Solving Today
Input
Emails
Images
Audio
Chinese (你妹)
Text
Response
Is it spam? (0/1)
What is it? (1, …, 100)
Text
English (Your Sister)
Audio
“The massive economic value of AI
today is driven by supervised
learning.”
- Andrew Ng
Machine Learning
Features
Emails
Labels
Is it spam? (0/1)
1 f1 f2 f.. fn
2
...
m
1
0
0
1
... ... ... ... ... ?
Training
Predicting
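The feature/label table above can be sketched in code. This is a minimal illustration, not the method used in the talk: it assumes a perceptron as the learner, and the three features (contains "free", contains "winner", number of links) are hypothetical stand-ins for f1..fn. The labels 1, 0, 0, 1 match the slide.

```python
# Minimal sketch of "Training" and "Predicting" on a feature/label table.
# Assumption: a perceptron stands in for whatever model is actually used.

def score(w, b, x):
    """Weighted sum of the features plus a bias term."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def predict(w, b, x):
    """Return 1 (spam) if the score is positive, otherwise 0 (not spam)."""
    return 1 if score(w, b, x) > 0 else 0

def train_perceptron(X, y, epochs=20, lr=1.0):
    """Learn weights from m labeled rows; each row has n features."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = yi - predict(w, b, xi)
            if err:  # update only on a misclassified row
                w = [wj + lr * err * xj for wj, xj in zip(w, xi)]
                b += lr * err
    return w, b

# Hypothetical features: [contains "free", contains "winner", number of links]
X = [[1, 1, 3], [0, 0, 0], [0, 0, 1], [1, 0, 4]]
y = [1, 0, 0, 1]  # labels from the slide: is it spam? (0/1)

w, b = train_perceptron(X, y)      # Training
new_email = [1, 1, 2]              # the "?" row: features known, label unknown
print(predict(w, b, new_email))    # Predicting
```

Training consumes rows whose labels are known; predicting fills in the "?" for a row whose label is not.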
Machine Learning
AI vs. Machine Learning
vs. Deep Learning
Artificial Intelligence - Machines that think, talk,
and behave like humans.
Machine Learning - Computers that make decisions
without being explicitly programmed.
Deep Learning - A network of multi-layer non-linear
processing units capable of adapting
itself to new data.
“AI problem is a Data Problem. The
more data, the merrier.”
- Raymond Fu
Machine Learning vs. Statistics
Machine Learning
Goal: “learning” from data of all sorts
No assumptions about data distributions
Generalization is through training,
validation and test datasets
Tolerant of redundant features.
Does not promote data reduction prior to
learning.
Statistics
Goal: Analyzing and summarizing data
Tight assumptions about data
distributions
Generalization is pursued using statistical
tests on the training dataset.
Preferable to use fewer input features
Promotes data reduction as much as
possible before modeling
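The training, validation and test datasets mentioned on the Machine Learning side can be produced with a simple shuffled split. A minimal sketch follows; the 60/20/20 proportions and the function name `split_dataset` are assumptions for illustration, not figures from the talk.

```python
import random

def split_dataset(rows, train=0.6, val=0.2, seed=0):
    """Shuffle rows and cut them into train, validation and test parts.

    The test part receives whatever remains after train and validation.
    """
    shuffled = list(rows)                  # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * val)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )

rows = list(range(10))
train_rows, val_rows, test_rows = split_dataset(rows)
print(len(train_rows), len(val_rows), len(test_rows))
```

The model is fit on the training part, tuned on the validation part, and judged once on the test part, which it has never seen.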
Computing Cluster
GPU
Cloud
Large Scale Data Processing
Dataset Labeling
Labeled data is a set of samples, each tagged with a specific meaning.
● Label an image with objects in it.
● Label an X-ray photo with whether or not the patient has
a certain disease.
● Join datasets that may correlate with each other.
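The last bullet, labeling by joining datasets, might look like the sketch below. The record layout (`patient_id`, `image`, `has_disease`) and the function name `join_labels` are hypothetical; samples with no matching label record are dropped here, which is one possible choice among several.

```python
def join_labels(samples, label_rows, key="patient_id"):
    """Inner-join samples with label records on a shared key.

    Samples without a matching label record are left out of the result.
    """
    label_by_key = {row[key]: row["has_disease"] for row in label_rows}
    return [
        {**sample, "label": label_by_key[sample[key]]}
        for sample in samples
        if sample[key] in label_by_key
    ]

# Hypothetical X-ray images and a separate table of diagnoses.
xrays = [
    {"patient_id": 1, "image": "xray_001.png"},
    {"patient_id": 2, "image": "xray_002.png"},
    {"patient_id": 3, "image": "xray_003.png"},  # no diagnosis on file
]
diagnoses = [
    {"patient_id": 1, "has_disease": 1},
    {"patient_id": 2, "has_disease": 0},
]

labeled = join_labels(xrays, diagnoses)
for row in labeled:
    print(row)
```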
Big Data Engineering
1. Data Cleansing: Create both better features and better
labels
2. Self-Service Analytics: Give data analysts tools to easily
prepare their data
3. Data Storage: Build a performant, cost-efficient data
storage strategy.
4. Streaming: Fast data feed + AI = Fast decision making.
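The streaming point can be illustrated with a generator that scores each event as it arrives instead of waiting for a batch. This is only a sketch: `toy_score`, the `amount` field and the 0.5 threshold are hypothetical placeholders for a trained model and a real feed.

```python
def decide_on_stream(events, score, threshold=0.5):
    """Yield (event, decision) pairs one at a time as events arrive."""
    for event in events:
        yield event, score(event) >= threshold

def toy_score(event):
    """Hypothetical stand-in for a model: larger amounts score higher."""
    return min(event["amount"] / 1000, 1.0)

def feed():
    """Hypothetical fast data feed; a real one would read from a queue."""
    for amount in (20, 900, 500):
        yield {"amount": amount}

decisions = list(decide_on_stream(feed(), toy_score))
for event, flagged in decisions:
    print(event, flagged)
```

Because `decide_on_stream` is lazy, each decision is available as soon as its event has been scored.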
AI in Today’s Industry
Questions?


Editor's Notes

  • #11 Difference in goal: let the machine learn vs. give facts to a human so the human can make a decision. Difference in methodology: statistics reduces data in two directions, the number of data points (sampling) and the number of features (simplification).
  • #13 Label
  • #14 Target