AI on Big Data

Jongwook Woo
HiPIC
CalStateLA
Dong-Eui University
Prof. Dong Soon Lim, Department of Economics, College of Commerce and Economics
May 29 2018
Jongwook Woo, PhD, jwoo5@calstatela.edu
High-Performance Information Computing Center (HiPIC)
California State University Los Angeles
Introduction to AI on Big Data

High Performance Information Computing Center
Jongwook Woo
CalStateLA
Contents
 Myself
 Introduction To Big Data
 Artificial Intelligence
 Artificial Intelligence and Big Data
 Summary

Myself
Experience:
 Since 2002, Professor at California State University Los Angeles
– PhD in 2001: Computer Science and Engineering at USC
 Since 1998: R&D consulting in Hollywood
– Warner Bros (Matrix online game), E!, citysearch.com, ARM, etc.
– Information Search and Integration with FAST, Lucene/Solr, Sphinx
– implements eBusiness applications using J2EE and middleware
 Since 2007: Exposed to Big Data at CitySearch.com
 2012 - Present : Big Data Academic Partnerships
– For Big Data research and training
• Amazon AWS, Microsoft Azure, IBM Bluemix
• Databricks, Hadoop vendors

Myself: S/W Development Lead
http://www.mobygames.com/game/windows/matrix-online/credits

Myself
Experience (Cont’d): Bring in Big Data R&D and training to Korea since 2009
Collaborating with LA city since 2016
– Collect, Search, and Analyze City Data
• Spark, Hadoop, ElasticSearch, Solr, Java, Cloudera
Sept 2013: Samsung Advanced Technology Training Institute
Since 2008
– Introduce Hadoop Big Data and education to Univ and Research Centers
• Yonsei, Gachon, DongEui
• US: USC, Pennsylvania State Univ, University of Maryland College Park, Univ of Bridgeport, Louisiana State Univ, California State Univ LB
• Europe: Univ of Luxembourg

Myself: Partners for Services

Experience in Big Data
 Collaboration
 Council Member of IBM Spark Technology Center
 City of Los Angeles for OpenHub and Open Data
 Startup Companies in Los Angeles
 External Collaborator and Advisor in Big Data
– IMSC of USC
– Pennsylvania State University
– The Big Link, Softzen, Wiken in Korea
 Grants
 Research and Education Grants from IBM Bluemix, Microsoft Windows Azure, and Amazon AWS
 Partnership
 Academic Education Partnership with Databricks, Tableau, Qlik, Cloudera, Hortonworks, SAS,
Teradata

Myself: Public Partners

Data Issues
Large-Scale data
Terabyte (10^12 bytes), Petabyte (10^15 bytes)
– Because of web
– Sensor Data (IoT), Bioinformatics, Social Computing, Streaming data,
smart phone, online game…
Cannot be handled with the legacy approach
Too big
Non-/Semi-structured data
Too expensive
Need new systems
Inexpensive

Two Cores in Big Data
How to store Big Data
How to compute Big Data
Google
How to store Big Data
– GFS
– Distributed Systems on inexpensive commodity computers
How to compute Big Data
– MapReduce
– Parallel Computing with inexpensive computers
Built its own supercomputers
Published papers in 2003, 2004
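A minimal sketch of the MapReduce idea named on this slide, written in plain Python rather than Hadoop's Java API. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the word-count task are illustrative assumptions, not taken from the slides; on a real cluster the map and reduce calls would run in parallel on many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on a list of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data on hadoop", "big data on spark"])
print(counts)
```

Each phase only needs its own slice of the data, which is what lets the framework spread the work over inexpensive machines.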

What is Hadoop?
 Hadoop Founder:
o Doug Cutting
 Apache Committer:
Lucene, Nutch, …

Super Computer vs Hadoop
Parallel vs. Distributed file systems by Michael Malak
Updated by Jongwook Woo
[Diagram: the supercomputer uses one cluster for storage and a separate cluster for compute; Hadoop uses a single cluster for both compute and storage]

Hadoop Cluster: Logical Diagram
[Diagram: a web browser reaches the cluster monitor (CM/Ambari) over HTTP(S); each node runs an agent alongside Hadoop and HDFS; cluster services include Hive, ZooKeeper, and Impala]

Hadoop Ecosystems
http://dawn.dbsdataprojects.com/tag/hadoop/

Definition: Big Data
Inexpensive frameworks: distributed parallel systems that can store large-scale data and process it in parallel [1, 2]
Hadoop
– Inexpensive supercomputer
– More accessible than traditional supercomputers
• You can store and process your applications
– In your university labs, small companies, research centers
Others
– NoSQL DB (Cassandra, MongoDB, Redis, HBase)
– ElasticSearch

NoSQL DB
 Key-Value
Memcached, Memcachedb, Redis
 Column Oriented (Column Family Store)
BigTable, HBase
Cassandra (Key-Value Column Oriented)
Amazon SimpleDB
 Document Oriented
MongoDB, Couchbase, CouchDB
 Graph Oriented
Neo4j, InfiniteGraph

Alternative to Hadoop MapReduce
Limitation in MapReduce
Hard to program in Java
Batch Processing
– Not interactive
Disk storage for intermediate data
– Performance issue
Spark by UC Berkeley AMPLab
 In-Memory storage for intermediate data
 20 to 100 times faster than network- and disk-based MapReduce
Good in Machine Learning
– Iterative algorithms

Spark and Hadoop
Spark
File Systems: Tachyon
Resource Manager: Mesos
But Hadoop has been dominating the market
Integrating Spark into Hadoop cluster
Cloud Computing
– Amazon AWS, Azure HDInsight, IBM Bluemix
• Object Storage, S3
Hadoop vendors
– HDP, CDH
Databricks: Spark on AWS & Azure
– No Hadoop ecosystems

Sentiment Map of Alphago
Positive
Negative

Sentiment Map of Lee Se-Dol vs Alphago
 YouTube video: “alphago sentiment” by Google
 The sentiment of the World in Geo and Time:
https://youtu.be/vAzdnj4fkOg?list=PLaEg1tCLuW0BYLqVS5RTbToiB8wQ2w14a

K-Election 2017
(April 29 – May 9)

Map of Crimes That Occurred within 5 Miles
of CalStateLA, UCLA and USC in 2015

Review count of popular sub-categories of
business

Popular Businesses within 5 Miles of CalStateLA,
USC, and UCLA

Average Undergraduates Receiving
PELL GRANT in Each College
East Georgia State College: $2,854 Avg.
PELL grant: 97.285%

Big Data Analysis Flow
Data Collection
– Batch API: Yelp, Google
– Streaming: Twitter, Apache NiFi, Kafka, Storm
– Open Data: Government
Data Storage
– HDFS, S3, Object Storage, NoSQL DB (Couchbase)…
Data Filtering
– Hive, Pig
Data Analysis and Science
– Hive, Pig, Spark, BI Tools (Datameer, Qlik, Tableau,…)
Data Visualization
– Qlik, Datameer, Excel PowerView
- Big Data Engineering
- Big Data Analysis
- Big Data Science
- Data Visualization

Terms
We know
Data Engineering
– Collect, clean, filter data
Data Analysis
– Find insights from the data
Data Science (Predictive Analysis)
– Predict the trend or pattern from the existing data
Do we know?
Big Data Analysis and Science
– Using Big Data for Data Analysis and Science
• Hadoop, Spark, NoSQL DB, SAP HANA, ElasticSearch,..
– For Massive Data Set
• How to store and compute?

AI and Deep Learning
[Diagram labels: Artificial Intelligence, Machine Learning, Deep Learning, Neural Networks]
▪Deep learning
▪Sub-field of neural networks, machine learning, and artificial intelligence
▪Deep learning is neural networks with many layers
▪Inspired by, but not limited to, the architecture of the human brain
© 2017 SAP SE or an SAP affiliate company. All rights
reserved. ǀ PUBLIC

Deep Learning and TensorFlow
▪Development led by Google
▪Open-source library for deep learning
▪ Define model structures, library for efficient
execution
▪Define once, run anywhere:
▪ can run on CPUs and GPUs, many devices
▪ NVidia, Google GPU
▪Can be used in Python
▪ and many other languages
▪Built for large-scale machine learning
▪ development and operations

Deep Learning [9]
• Neural Networks
• Multi-Layer Perceptron
• Convolutional Neural Networks
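A minimal sketch of the forward pass of a multi-layer perceptron, in plain Python so the arithmetic is visible. The layer sizes, weight values, and the choice of ReLU and softmax are illustrative assumptions, not taken from the slides; a real model would learn the weights and would normally use a library such as TensorFlow.

```python
import math


def dense(inputs, weights, biases):
    """One fully connected layer: weights[j] holds the input weights of unit j."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    """Element-wise ReLU activation."""
    return [max(0.0, v) for v in values]


def softmax(values):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def mlp_forward(x, layers):
    """layers is a list of (weights, biases); ReLU between layers, softmax at the end."""
    for weights, biases in layers[:-1]:
        x = relu(dense(x, weights, biases))
    weights, biases = layers[-1]
    return softmax(dense(x, weights, biases))


# Hypothetical weights: 2 inputs -> 3 hidden units -> 2 classes.
layers = [
    ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1]),
    ([[1.0, -1.0, 0.5], [-0.5, 0.7, 0.2]], [0.0, 0.0]),
]
probs = mlp_forward([1.0, 2.0], layers)
print(probs)
```

"Deep" in this picture simply means adding more `(weights, biases)` entries to `layers`.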

Convolutional Neural Networks
• Good at problems like image classification.
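A minimal sketch of the operation a convolutional layer is built on: sliding a small kernel over an image and summing the element-wise products. The tiny image and the vertical-edge kernel are illustrative assumptions; as in most deep learning libraries, the kernel is not flipped, so this is strictly a cross-correlation.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image with no padding and stride 1."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


# Hypothetical 3x4 image: dark on the left, bright on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hypothetical kernel that responds to a dark-to-bright vertical edge.
edge_kernel = [
    [-1, 1],
    [-1, 1],
]
feature_map = conv2d(image, edge_kernel)
print(feature_map)  # [[0, 2, 0], [0, 2, 0]]
```

The same small kernel is reused at every position, which is why such networks suit images.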

Recurrent Neural Networks (RNN)
• Has 3 types of parameters
▫ W – Hidden weights
▫ U – Hidden to Hidden weights
▫ V – Hidden to Label weights
• Good for Text Processing such as sentiment analysis:
• My Projects > sapDeepLearningTensorflow > Week_03_Unit_05_S
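A minimal sketch of how the three parameter sets could fit together in a simple recurrent network. It assumes that W maps each input into the hidden state (the slide only says "Hidden weights"), that U carries the previous hidden state forward, and that V maps the final hidden state to label scores; the tanh activation and all numeric values are illustrative.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def rnn_forward(inputs, W, U, V):
    """Process a sequence one step at a time and score the final hidden state."""
    hidden = [0.0] * len(U)
    for x in inputs:
        # New hidden state mixes the current input (W) and the previous state (U).
        pre = [a + b for a, b in zip(matvec(W, x), matvec(U, hidden))]
        hidden = [math.tanh(p) for p in pre]
    return matvec(V, hidden)


# Hypothetical sizes: 2-dimensional inputs, 2 hidden units, 1 label score.
W = [[0.5, -0.1], [0.3, 0.8]]
U = [[0.1, 0.0], [0.0, 0.1]]
V = [[1.0, -1.0]]
scores = rnn_forward([[1.0, 0.0], [0.0, 1.0]], W, U, V)
print(scores)
```

Because the hidden state is carried from word to word, the order of the input matters, which is what makes this structure useful for text such as sentiment analysis.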

Key takeaways from modeling Deep Neural Networks
 Neural Networks are resource intensive
o Typically require huge dedicated hardware (RAM, GPUs)
 Parameter space huge
o 100s of thousands of parameters
o Tuning is important
 Architecture choice is important:
o See http://www.asimovinstitute.org/neural-network-zoo/

Recap
Spark:
an efficient framework for running computations on
thousands of computers
TensorFlow:
high-performance numerical framework
Get the best of both
Simple API for distributed numerical computing
Can leverage the hardware of the cluster

Deep Learning + Apache Spark
 Investment in Big-Data
o infrastructure
 GPUs
o Require specialized hardware
o Niche Use-cases
 Can enterprises reuse existing infrastructure
o for deep learning applications?
 What use-cases in Deep learning can leverage Apache Spark?

Spark using TensorFlow [8, 9]
 Neural networks
 have seen spectacular progress during the last few years
 the state of the art in image recognition and automated translation.
 TensorFlow
 a new framework released by Google
– for numerical computations and neural networks.
 Spark and TensorFlow
 use Spark and a cluster of machines
– to improve deep learning pipelines with TensorFlow
– how to use TensorFlow and Spark together to train and apply deep learning models
 Hyperparameter Tuning:
– use Spark to find the best set of hyperparameters for neural network training,
• leading to 10X reduction in training time and 34% lower error rate.
 Deploying models at scale:
– use Spark to apply a trained neural network model on a large amount of data

Spark Cluster with TensorFlow
 The accuracy of Spark with the default set of hyperparameters
 99.2%.
 best result with hyperparameter tuning
– has a 99.47% accuracy on the test set,
• which is a 34% reduction of the test error.
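The 34% figure follows from the two accuracies on this slide once they are converted to error rates. A short worked check, where the function name `relative_error_reduction` is an illustrative assumption:

```python
def relative_error_reduction(baseline_accuracy, tuned_accuracy):
    """Fraction by which the test error shrinks when accuracy improves."""
    baseline_error = 1.0 - baseline_accuracy   # 99.2%  accuracy -> 0.80% error
    tuned_error = 1.0 - tuned_accuracy         # 99.47% accuracy -> 0.53% error
    return (baseline_error - tuned_error) / baseline_error


reduction = relative_error_reduction(0.992, 0.9947)
print(round(reduction, 2))  # 0.34
```

A gain of only 0.27 percentage points in accuracy is therefore roughly a one-third cut in the remaining errors.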

Efforts on using Deep Learning Frameworks with Spark
 Databricks
 Platform for running Spark with TensorFlow
 BigDL
 Intel’s library for deep learning on existing data frameworks.
 TensorflowOnSpark
 Yahoo’s Distributed Deep Learning on Big Data
 SparkNet
 AMPLab’s framework for training deep networks in Spark

Efforts on using Deep Learning Frameworks with Spark
 DeepLearning4J
 Uses data parallelism to train on separate neural networks
 DeepDist
 Lightning-Fast Deep Learning on Spark via parallel stochastic gradient updates
 IBM DSX

Databricks
https://databricks.com/blog/2016/12/21/deep-learning-on-databricks.html
 Deploying trained models
o to make predictions on data stored in Spark RDDs or Dataframes
o Inception model: https://www.tensorflow.org/tutorials/image_recognition
o Each prediction requires about 4.8 billion operations
o Parallelizing with Spark helps scale operations

Databricks
https://databricks.com/blog/2016/12/21/deep-learning-on-databricks.html
• Distributed model training
 Use deep learning libraries like TensorFlow to test different model hyperparameters on each worker
 Task parallelism
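A minimal sketch of the task-parallel pattern described here: every worker receives one hyperparameter combination, trains independently, and reports its test error. To stay runnable without a cluster it uses Python's `concurrent.futures` thread pool in place of Spark workers, and `train_and_evaluate` is a hypothetical stand-in that computes a made-up error instead of training a TensorFlow model.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def train_and_evaluate(params):
    """Hypothetical stand-in for training one model; returns (params, test_error)."""
    learning_rate, hidden_units = params
    # Made-up error surface: best near learning_rate = 0.1, slightly better with more units.
    error = (learning_rate - 0.1) ** 2 + 0.001 / hidden_units
    return params, error


# Every combination of the candidate values is one independent task.
grid = list(product([0.001, 0.01, 0.1, 1.0], [64, 128, 256]))

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(train_and_evaluate, grid))

best_params, best_error = min(results, key=lambda result: result[1])
print(best_params)  # (0.1, 256)
```

On a Spark cluster the same shape would typically be written with `sc.parallelize(grid).map(train_and_evaluate).collect()`, assuming the training library is installed on every worker.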

IBM DSX
 Data Science Experience (DSX) includes
TensorFlow library
GPU
Easy to develop and run Spark with TensorFlow
Don’t need to configure library
Databricks’ examples run in DSX
–While Databricks CE does not support GPU
Brunel for visualization lately

Spark Cluster with TensorFlow (Cont’d)
Multiple nodes in the cluster:
 the computations scaled linearly
[Graph: computation time (in seconds) with respect to the number of machines on the cluster]
– using a 13-node cluster,
• train 13 models in parallel,
• which translates into a 7x speedup compared to training the models one at a time on one machine.
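A short worked calculation relating the two numbers on this slide. The function names and the 60-second training time per model are hypothetical; only the 13 nodes and the 7x speedup come from the slide.

```python
def parallel_efficiency(speedup, workers):
    """Fraction of the ideal (one-per-worker) speedup that was achieved."""
    return speedup / workers


def wall_clock_seconds(num_models, seconds_per_model, speedup):
    """Total time when the sequential run is sped up by the given factor."""
    return num_models * seconds_per_model / speedup


efficiency = parallel_efficiency(7.0, 13)
# Hypothetical: each model takes 60 seconds to train on one machine.
sequential = wall_clock_seconds(13, 60.0, 1.0)
parallel = wall_clock_seconds(13, 60.0, 7.0)
print(round(efficiency, 2), sequential, round(parallel, 1))
```

A 7x speedup on 13 nodes is an efficiency of about 54%, so part of each node's time goes to overhead such as scheduling and data transfer.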

the learning rate for different numbers of neurons:
The learning rate is critical:
– if it is too low,
• the neural network does not learn anything (high test error).
– If it is too high,
• the training process may oscillate randomly and even diverge in some configurations.
The number of neurons
– not as important for getting good performance,
• and networks with many neurons
– are much more sensitive to the learning rate.
– This is the Occam’s Razor principle:
• simpler models tend to be “good enough” for most purposes.
• If you have the time and resources to go after the missing 1% test error, you must be willing to invest a lot of resources in training
• to find the proper hyperparameters that will make the difference.
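A minimal sketch of why the learning rate is critical, using plain gradient descent on the one-variable function f(w) = w². This toy function and the three learning rates are illustrative assumptions, not the network from the slides, but the behavior is the same in kind: too low barely moves, too high oscillates and diverges.

```python
def gradient_descent(learning_rate, steps=50, start=1.0):
    """Minimize f(w) = w**2 (gradient 2*w) and return the final distance from 0."""
    w = start
    for _ in range(steps):
        w -= learning_rate * 2.0 * w
    return abs(w)


too_low = gradient_descent(0.0001)    # almost no progress after 50 steps
reasonable = gradient_descent(0.1)    # converges close to the minimum
too_high = gradient_descent(1.1)      # sign flips every step and the value grows
print(too_low, reasonable, too_high)
```

Each step multiplies w by (1 - 2 × learning rate), so any rate above 1.0 makes the magnitude grow instead of shrink on this function.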

Distributed processing of images using
TensorFlow
 Apache Spark with a Deep Learning library
takes an existing neural network (Inception-v3)
– applies it to a corpus of images.
requires that TensorFlow be installed on the cluster
Run in IBM DSX
– Not in Databricks CE
• Built by Databricks but needs GPU
 Spark integration workflow:
define TensorFlow operations as methods, to be used within Spark tasks.
broadcast the model for use within Spark tasks.
parallelize a list of image URLs.
Using Spark, we process the image URLs in parallel:
– Load image.
– Run inference on the image using TensorFlow to predict the image contents.
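A minimal sketch of the shape of this workflow. To stay runnable without a cluster or TensorFlow, it swaps Spark for a standard-library thread pool; `StubModel`, `load_image`, and `make_task` are hypothetical stand-ins, and the example URLs are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor


class StubModel:
    """Hypothetical stand-in for a loaded TensorFlow model."""

    def predict(self, image_bytes):
        # A real model would run inference; this stub just reports the payload size.
        return len(image_bytes)


def load_image(url):
    """Hypothetical loader; a real one would download and decode the image."""
    return url.encode("utf-8")


def make_task(model):
    """Close over the shared model, much as a Spark task reads a broadcast value."""
    def classify(url):
        return url, model.predict(load_image(url))
    return classify


image_urls = ["http://example.com/a.jpg", "http://example.com/diver.jpg"]
classify = make_task(StubModel())

# Process the URLs in parallel: load each image, then run the model on it.
with ThreadPoolExecutor(max_workers=2) as pool:
    predictions = list(pool.map(classify, image_urls))
print(predictions)
```

In PySpark the corresponding steps would typically be `sc.broadcast(model)` to ship the model once per worker, `sc.parallelize(image_urls)` to distribute the list, and `rdd.map(classify).collect()` to gather the results.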

Distributed processing of image classification using TensorFlow
 use the “Simple image classification with
Inception” example from TensorFlow,
which applies the Inception model to predict the
contents of a set of images.
 For example, given a photo of two scuba divers
The Inception model will tell us the contents of the image:
('scuba diver', 0.88708681),
('electric ray, crampfish, numbfish, torpedo', 0.012277877),
('sea snake', 0.005639134),
('tiger shark, Galeocerdo cuvieri', 0.0051873429),
('reel', 0.0044495272)
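A minimal sketch of how a ranked list like the one above can be produced from a model's per-class scores. The five scores are the ones on the slide; the extra `coral reef` entry and the function name `top_k` are hypothetical, added only so that there is something to filter out.

```python
import heapq


def top_k(scores, k=5):
    """Return the k (label, score) pairs with the highest scores, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


scores = {
    "reel": 0.0044495272,
    "sea snake": 0.005639134,
    "coral reef": 0.001,  # hypothetical extra class, not from the slide
    "scuba diver": 0.88708681,
    "tiger shark, Galeocerdo cuvieri": 0.0051873429,
    "electric ray, crampfish, numbfish, torpedo": 0.012277877,
}
for label, score in top_k(scores):
    print((label, score))
```

Only the ranking step is shown here; producing the scores themselves is the expensive part that the following slide proposes to distribute with Spark.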

Distributed processing of image classification using TensorFlow (Cont’d)
Each of the lines above represents a “synset,”
or a set of synonymous terms
– representing a concept.
The weight given to each synset
– represents a confidence in how applicable the synset is to the image.
– In this case, “scuba diver” is pretty accurate!
Making predictions with Inception-v3
 expensive:
– each prediction requires about 4.8 billion operations (Szegedy et al., 2015).
Even with smaller datasets,
– worthwhile to parallelize this computation.
– distribute these costly predictions using Spark.
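A rough back-of-the-envelope sketch of why parallelizing pays off even for modest datasets. Only the 4.8 billion operations per prediction comes from the slide; the dataset size, the per-worker throughput of 10 billion operations per second, and the assumption of ideal scaling are all hypothetical.

```python
OPS_PER_PREDICTION = 4.8e9  # from the slide (Szegedy et al., 2015)


def total_operations(num_images):
    """Total arithmetic work to classify every image once."""
    return num_images * OPS_PER_PREDICTION


def estimated_seconds(num_images, ops_per_second_per_worker, workers):
    """Idealized run time, assuming work splits evenly with no overhead."""
    return total_operations(num_images) / (ops_per_second_per_worker * workers)


# Hypothetical: 10,000 images, each worker sustaining 1e10 operations per second.
one_worker = estimated_seconds(10_000, 1e10, 1)
eight_workers = estimated_seconds(10_000, 1e10, 8)
print(one_worker, eight_workers)
```

Under these assumptions the job drops from 80 minutes on one worker to 10 minutes on eight; real clusters will fall short of this ideal because of scheduling and data-transfer overhead.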

Summary
Introduction to Big Data
Introduction to AI
AI on Big Data

Databricks Partners

Training Hadoop and Spark
Cloudera visited to interview Jongwook Woo

Training Hadoop on IBM Bluemix at
California State Univ. Los Angeles

Question?

References
1. “Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing”, Jongwook Woo and Yuhang Xu, The 2011 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA 2011), Las Vegas (July 18-21, 2011)
2. Jongwook Woo, DMKD-00150, “Market Basket Analysis Algorithms with MapReduce”, Wiley Interdisciplinary Reviews Data Mining and Knowledge Discovery, Oct 28 2013, Volume 3, Issue 6, pp445-452, ISSN 1942-4795
3. Jongwook Woo, “Big Data Trend and Open Data”, UKC 2016, Dallas, TX, Aug 12 2016
4. How to choose algorithms for Microsoft Azure Machine Learning, https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-algorithm-choice
5. “Big Data Analysis using Spark for Collision Rate Near CalStateLA”, Manik Katyal, Parag Chhadva, Shubhra Wahi & Jongwook Woo, https://globaljournals.org/GJCST_Volume16/1-Big-Data-Analysis-using-Spark.pdf
6. Spark Programming Guide: http://spark.apache.org/docs/latest/programming-guide.html
7. Github URL: https://github.com/nmelche/IntroductionToBigDataScience

References
8. TensorFrames: Google Tensorflow on Apache Spark, https://www.slideshare.net/databricks/tensorframes-google-tensorflow-on-apache-spark
9. Deep learning and Apache Spark, https://www.slideshare.net/QuantUniversity/deep-learning-and-apache-spark
10. Which Is Deeper - Comparison Of Deep Learning Frameworks On Spark, https://www.slideshare.net/SparkSummit/which-is-deeper-comparison-of-deep-learning-frameworks-on-spark
11. Accelerating Machine Learning and Deep Learning At Scale with Apache Spark, https://www.slideshare.net/SparkSummit/accelerating-machine-learning-and-deep-learning-at-scalewith-apache-spark-keynote-by-ziya-ma
12. Deep Learning with Apache Spark and TensorFlow, https://databricks.com/blog/2016/01/25/deep-learning-with-apache-spark-and-tensorflow.html
13. Tensor Flow Deep Learning Open SAP

Deep Learning for the Intelligent Enterprise
Deep learning
[Diagram labels: Artificial Intelligence, Machine Learning, Deep Learning, Neural Networks]
▪ Sub-field of neural networks, machine learning, and artificial intelligence
▪ Deep learning is neural networks with many layers
▪ Inspired by, but not limited to, the architecture of the human brain
▪ Deep learning is the reality behind artificial intelligence
