Jongwook Woo
HiPIC
CalStateLA
Dong-Eui University
Prof. Dong-Soon Lim, Department of Economics, College of Commerce and Economics
May 29, 2018
Jongwook Woo, PhD, jwoo5@calstatela.edu
High-Performance Information Computing Center (HiPIC)
California State University Los Angeles
Introduction to AI on Big Data
High Performance Information Computing Center
Jongwook Woo
CalStateLA
Contents
• Myself
• Introduction To Big Data
• Artificial Intelligence
• AI and Big Data
• Summary
Myself
Experience:
 Since 2002, Professor at California State University Los Angeles
– PhD in 2001: Computer Science and Engineering at USC
 Since 1998: R&D consulting in Hollywood
– Warner Bros (Matrix online game), E!, citysearch.com, ARM, etc.
– Information search and integration with FAST, Lucene/Solr, Sphinx
– Implemented e-business applications using J2EE and middleware
 Since 2007: Exposed to Big Data at CitySearch.com
 2012 - Present : Big Data Academic Partnerships
– For Big Data research and training
• Amazon AWS, Microsoft Azure, IBM Bluemix
• Databricks, Hadoop vendors
Myself: S/W Development Lead
http://www.mobygames.com/game/windows/matrix-online/credits
Myself
Experience (Cont’d): bringing Big Data R&D and training to Korea since 2009
Collaborating with the City of Los Angeles since 2016
– Collect, search, and analyze city data
• Spark, Hadoop, ElasticSearch, Solr, Java, Cloudera
Sept 2013: Samsung Advanced Technology Training Institute
Since 2008
– Introducing Hadoop, Big Data, and related education to universities and research centers
• Korea: Yonsei, Gachon, DongEui
• US: USC, Pennsylvania State Univ, University of Maryland College Park, Univ of Bridgeport, Louisiana State Univ, California State Univ LB
• Europe: Univ of Luxembourg
Myself: Partners for Services
Experience in Big Data
Collaboration
– Council Member of IBM Spark Technology Center
– City of Los Angeles for OpenHub and Open Data
– Startup companies in Los Angeles
– External collaborator and advisor in Big Data
• IMSC of USC
• Pennsylvania State University
• The Big Link, Softzen, Wiken in Korea
Grants
– Research and education grants from IBM Bluemix, Microsoft Azure, and Amazon AWS
Partnership
– Academic education partnerships with Databricks, Tableau, Qlik, Cloudera, Hortonworks, SAS, and Teradata
Myself: Public Partners
Contents
• Myself
• Introduction To Big Data
• Artificial Intelligence
• AI and Big Data
• Summary
Data Issues
Large-scale data
Terabytes (10^12 bytes), petabytes (10^15 bytes)
– Because of the web
– Sensor data (IoT), bioinformatics, social computing, streaming data, smartphones, online games…
Cannot be handled with the legacy approach
Too big
Non-/semi-structured data
Too expensive
Need new systems
Inexpensive
Two Cores in Big Data
How to store Big Data
How to compute Big Data
Google
How to store Big Data
– GFS
– Distributed systems on inexpensive commodity computers
How to compute Big Data
– MapReduce
– Parallel computing with inexpensive computers
Its own supercomputers
Published the papers in 2003 (GFS) and 2004 (MapReduce)
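As a rough illustration of the MapReduce idea named above, here is a minimal single-machine sketch of a word count split into map, shuffle, and reduce phases. The function names are made up for this example and are not Hadoop APIs; on a real cluster each phase would run in parallel across many nodes.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine the values of each key into one result."""
    return {key: sum(values) for key, values in groups.items()}

def word_count(documents):
    # On a cluster, each document (or file block) would be mapped on a different node.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle_phase(pairs))

counts = word_count(["big data on hadoop", "big data on spark"])
```

Only the map and reduce functions carry application logic; the grouping step in between is what the framework provides.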
What is Hadoop?
Hadoop founder:
o Doug Cutting
Apache committer:
o Lucene, Nutch, …
Super Computer vs Hadoop
[Figure: “Parallel vs. Distributed file systems” by Michael Malak, updated by Jongwook Woo. Panels: Cluster for Store and Cluster for Compute (separate clusters); Cluster for Compute/Store (one combined cluster)]
Hadoop Cluster: Logical Diagram
[Diagram: a web browser reaches the cluster monitor (CM/Ambari) over HTTP(S); each node runs an agent alongside Hadoop and HDFS; services shown include Hive, ZooKeeper, and Impala]
Hadoop Ecosystems
http://dawn.dbsdataprojects.com/tag/hadoop/
Definition: Big Data
Inexpensive frameworks, built as distributed parallel systems, that can store large-scale data and process it in parallel [1, 2]
Hadoop
– Inexpensive supercomputer
– More accessible to the public than traditional supercomputers
• You can store and process your own data and applications
– In university labs, small companies, research centers
Others
– NoSQL DB (Cassandra, MongoDB, Redis, HBase)
– ElasticSearch
NoSQL DB
Key-Value
– Memcached, MemcacheDB, Redis
Column Oriented (Column Family Store)
– BigTable, HBase
– Cassandra (Key-Value Column Oriented)
– Amazon SimpleDB
Document Oriented
– MongoDB, Couchbase, CouchDB
Graph Oriented
– Neo4j, InfiniteGraph
Alternative to Hadoop MapReduce
Limitations of MapReduce
Hard to program in Java
Batch processing
– Not interactive
Disk storage for intermediate data
– Performance issue
Spark by UC Berkeley AMPLab
In-memory storage for intermediate data
20 to 100 times faster than MapReduce
– MapReduce moves intermediate data over the network and disk
Good for machine learning
– Iterative algorithms
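The difference between disk-backed and in-memory intermediate data can be sketched on one machine. The toy below is not Spark or Hadoop code; it assumes a made-up iterative update and simply counts how often the disk-backed version has to re-read its intermediate result.

```python
import json
import os
import tempfile

def update(values):
    """One iteration of a toy algorithm: move every value halfway toward 2.0."""
    return [v * 0.5 + 1.0 for v in values]

def iterate_via_disk(values, rounds):
    """Each round reads the previous result from a file and writes the new one back."""
    current = list(values)
    disk_reads = 0
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "intermediate.json")
        with open(path, "w") as f:
            json.dump(current, f)
        for _ in range(rounds):
            with open(path) as f:
                current = json.load(f)
            disk_reads += 1
            current = update(current)
            with open(path, "w") as f:
                json.dump(current, f)
    return current, disk_reads

def iterate_in_memory(values, rounds):
    """Each round works directly on the result kept in memory."""
    current = list(values)
    for _ in range(rounds):
        current = update(current)
    return current, 0

disk_result, disk_reads = iterate_via_disk([0.0, 4.0, 10.0], rounds=5)
memory_result, memory_reads = iterate_in_memory([0.0, 4.0, 10.0], rounds=5)
```

Both versions compute the same values; the disk-backed one pays one read and one write per iteration, which is the overhead that grows with iterative machine learning algorithms.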
Spark and Hadoop
Spark
File Systems: Tachyon
Resource Manager: Mesos
But Hadoop has been dominating the market
Integrating Spark into Hadoop cluster
Cloud Computing
– Amazon AWS, Azure HDInsight, IBM Bluemix
• Object Storage, S3
Hadoop vendors
– HDP, CDH
Databricks: Spark on AWS & Azure
– No Hadoop ecosystems
Sentiment Map of Alphago
[Map legend: Positive, Negative]
Sentiment Map of Lee Se-Dol vs Alphago
 YouTube video: “alphago sentiment” by Google
 The sentiment of the World in Geo and Time:
https://youtu.be/vAzdnj4fkOg?list=PLaEg1tCLuW0BYLqVS5RTbToiB8wQ2w14a
K-Election 2017
(April 29 – May 9)
Map of crimes that occurred within 5 miles of CalStateLA, UCLA, and USC in 2015
Review count of popular business sub-categories
Popular businesses within 5 miles of CalStateLA, USC, and UCLA
Average Undergraduates Receiving Pell Grant in Each College
East Georgia State College: $2,854 Avg.
PELL grant: 97.285%
Big Data Analysis Flow
Data Collection
– Batch API: Yelp, Google
– Streaming: Twitter, Apache NiFi, Kafka, Storm
– Open Data: Government
Data Storage
– HDFS, S3, Object Storage, NoSQL DB (Couchbase)…
Data Filtering
– Hive, Pig
Data Analysis and Science
– Hive, Pig, Spark, BI Tools (Datameer, Qlik, Tableau,…)
Data Visualization
– Qlik, Datameer, Excel PowerView
- Big Data Engineering
- Big Data Analysis
- Big Data Science
- Data Visualization
Terms
We know
Data Engineering
– Collect, clean, filter data
Data Analysis
– Find insights from the data
Data Science (Predictive Analysis)
– Predict the trend or pattern from the existing data
Do we know?
Big Data Analysis and Science
– Using Big Data for Data Analysis and Science
• Hadoop, Spark, NoSQL DB, SAP HANA, ElasticSearch,..
– For Massive Data Set
• How to store and compute?
Contents
• Myself
• Introduction To Big Data
• Artificial Intelligence
• AI and Big Data
• Summary
AI and Deep Learning
[Figure: diagram relating Artificial Intelligence, Machine Learning, Deep Learning, and Neural Networks]
▪ Deep learning
▪ Sub-field of neural networks, machine learning, and artificial intelligence
▪ Deep learning is neural networks with many layers
▪ Inspired by, but not limited to, the architecture of the human brain
© 2017 SAP SE or an SAP affiliate company. All rights reserved. ǀ PUBLIC
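To make "neural networks with many layers" concrete, the following is a minimal sketch of a forward pass through a stack of fully connected layers. The weights are arbitrary illustrative numbers, not a trained model, and the choice of ReLU between layers is an assumption of this example.

```python
def dense(inputs, weights, biases):
    """One fully connected layer: weights[j][i] connects input i to unit j."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def relu(values):
    """Element-wise rectified linear activation."""
    return [max(0.0, v) for v in values]

def forward(inputs, layers):
    """Pass the input through every layer; ReLU between layers, none after the last."""
    activations = inputs
    for index, (weights, biases) in enumerate(layers):
        activations = dense(activations, weights, biases)
        if index < len(layers) - 1:
            activations = relu(activations)
    return activations

layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),  # 2 inputs -> 2 hidden units
    ([[1.0, 1.0], [-1.0, 1.0]], [0.0, 0.0]),  # 2 hidden -> 2 hidden units
    ([[1.0, 2.0]], [0.5]),                    # 2 hidden -> 1 output
]
output = forward([2.0, 1.0], layers)
```

Making the network "deeper" here just means appending more (weights, biases) pairs to the list.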
Deep Learning and TensorFlow
▪ Development led by Google
▪ Open-source library for deep learning
▪ Define model structures; library for efficient execution
▪ Define once, run anywhere:
▪ Can run on CPUs and GPUs, and on many devices
▪ NVIDIA, Google GPU
▪ Can be used in Python
▪ and many other languages
▪ Built for large-scale machine learning
▪ development and operations
Deep Learning [9]
• Neural Networks
• Multi-Layer Perceptron
• Convolutional Neural Networks
Convolutional Neural Networks
• Good at problems like image classification.
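The core operation behind a convolutional layer can be sketched in a few lines. This is a plain-Python illustration with a hand-picked kernel, assuming no padding and a stride of 1; real frameworks learn the kernel values and run the operation on optimized hardware.

```python
def conv2d(image, kernel):
    """Valid (no padding) 2D cross-correlation, as used in CNN layers."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [sum(image[r + i][c + j] * kernel[i][j]
             for i in range(kh) for j in range(kw))
         for c in range(out_w)]
        for r in range(out_h)
    ]

# A tiny image whose right half is bright, and a kernel that responds to
# a dark-to-bright transition from left to right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
vertical_edge = [[-1, 1], [-1, 1]]
feature_map = conv2d(image, vertical_edge)
```

The feature map is non-zero only where the kernel overlaps the edge, which is why stacks of such filters suit image classification.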
Recurrent Neural Networks (RNN)
• Has 3 types of parameters
▫ W – Hidden weights
▫ U – Hidden to Hidden weights
▫ V – Hidden to Label weights
• Good for text processing such as sentiment analysis:
• My Projects > sapDeepLearningTensorflow > Week_03_Unit_05_S
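A minimal sketch of how the three parameter sets interact, assuming W maps the current input to the hidden state, U maps the previous hidden state to the next one, and V maps the final hidden state to label scores. That reading of W, the tanh activation, and the weight values are assumptions of this example.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def rnn_forward(inputs, W, U, V):
    """Run a simple RNN over a sequence and score the final hidden state."""
    hidden = [0.0] * len(U)
    for x in inputs:
        # New hidden state depends on the current input and the previous hidden state.
        pre = [a + b for a, b in zip(matvec(W, x), matvec(U, hidden))]
        hidden = [math.tanh(p) for p in pre]
    return matvec(V, hidden), hidden

W = [[0.5], [-0.3]]            # input (size 1) -> hidden (size 2)
U = [[0.1, 0.2], [0.0, 0.4]]   # hidden -> hidden
V = [[1.0, -1.0]]              # hidden -> label score

scores_a, _ = rnn_forward([[1.0], [0.0]], W, U, V)
scores_b, _ = rnn_forward([[0.0], [1.0]], W, U, V)
```

The two sequences contain the same values in a different order and produce different scores, which is the property that makes RNNs useful for text.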
Key takeaways from modeling Deep Neural Networks
Neural networks are resource intensive
o Typically require large amounts of dedicated hardware (RAM, GPUs)
Parameter space is huge
o Hundreds of thousands of parameters
o Tuning is important
Architecture choice is important:
o See http://www.asimovinstitute.org/neural-network-zoo/
Contents
• Myself
• Introduction To Big Data
• Artificial Intelligence
• AI and Big Data
• Summary
Recap
Spark:
an efficient framework for running computations on
thousands of computers
TensorFlow:
high-performance numerical framework
Get the best of both
Simple API for distributed numerical computing
Can leverage the hardware of the cluster
Deep Learning + Apache Spark
Investment in Big Data infrastructure
GPUs
o Require specialized hardware
o Niche use cases
Can enterprises reuse existing infrastructure for deep learning applications?
Which deep learning use cases can leverage Apache Spark?
Spark using TensorFlow [8, 9]
Neural networks
– Have seen spectacular progress during the last few years
– Now the state of the art in image recognition and automated translation
TensorFlow
– A framework released by Google for numerical computations and neural networks
Spark and TensorFlow
– Use Spark and a cluster of machines to improve deep learning pipelines with TensorFlow
– How to use TensorFlow and Spark together to train and apply deep learning models
Hyperparameter tuning:
– Use Spark to find the best set of hyperparameters for neural network training,
• leading to a 10X reduction in training time and a 34% lower error rate.
Deploying models at scale:
– Use Spark to apply a trained neural network model to a large amount of data
Spark Cluster with TensorFlow
Accuracy with the default set of hyperparameters
– 99.2%
Best result with hyperparameter tuning
– 99.47% accuracy on the test set,
• which is a 34% reduction of the test error.
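The 34% figure follows from the two accuracies: the test error falls from 0.8% to 0.53%, and (0.8 - 0.53) / 0.8 is about 0.34. A short check, with a helper name chosen only for this example:

```python
def relative_error_reduction(baseline_accuracy, tuned_accuracy):
    """Fraction by which the test error shrinks when accuracy improves."""
    baseline_error = 1.0 - baseline_accuracy
    tuned_error = 1.0 - tuned_accuracy
    return (baseline_error - tuned_error) / baseline_error

# Accuracies quoted on the slide: 99.2% by default, 99.47% after tuning.
reduction = relative_error_reduction(0.992, 0.9947)
```

A gain of 0.27 percentage points in accuracy looks small, but measured against the remaining error it is roughly a one-third improvement.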
Efforts on using Deep Learning Frameworks with Spark
Databricks
– Platform for running Spark with TensorFlow
BigDL
– Intel’s library for deep learning on existing data frameworks
TensorFlowOnSpark
– Yahoo’s distributed deep learning on Big Data
SparkNet
– AMPLab’s framework for training deep networks in Spark
Efforts on using Deep Learning Frameworks with Spark (Cont’d)
DeepLearning4J
– Uses data parallelism to train on separate neural networks
DeepDist
– Lightning-fast deep learning on Spark via parallel stochastic gradient updates
IBM DSX
Databricks
https://databricks.com/blog/2016/12/21/deep-learning-on-databricks.html
Deploying trained models
o To make predictions on data stored in Spark RDDs or DataFrames
o Inception model: https://www.tensorflow.org/tutorials/image_recognition
o Each prediction requires about 4.8 billion operations
o Parallelizing with Spark helps scale operations
Databricks
https://databricks.com/blog/2016/12/21/deep-learning-on-databricks.html
• Distributed model training
– Use deep learning libraries like TensorFlow to test different model hyperparameters on each worker
– Task parallelism
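Task parallelism over hyperparameters can be sketched locally. The example below uses a thread pool as a stand-in for Spark workers and a hypothetical `train_and_score` function with a made-up error surface in place of real TensorFlow training; only the pattern of one configuration per worker is the point.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def train_and_score(params):
    """Hypothetical stand-in for training one model; returns (test_error, params)."""
    learning_rate, hidden_units = params
    # Toy error surface with its minimum at learning_rate=0.1, hidden_units=64.
    error = (learning_rate - 0.1) ** 2 + ((hidden_units - 64) / 1000.0) ** 2
    return error, params

# Every combination of learning rate and layer size is an independent task.
grid = list(product([0.001, 0.01, 0.1, 1.0], [16, 64, 256]))

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(train_and_score, grid))

best_error, best_params = min(results)
```

Because the tasks share nothing, adding workers shortens the search without changing the code that trains a single model.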
IBM DSX
Data Science Experience (DSX) includes
TensorFlow library
GPU
Easy to develop and run Spark with TensorFlow
No need to configure the library
Databricks’ examples run in DSX
– While Databricks CE does not support GPU
Brunel for visualization, added recently
Spark Cluster with TensorFlow (Cont’d)
Multiple nodes in the cluster:
The computations scaled linearly
[Graph: computation time (in seconds) with respect to the number of machines in the cluster]
– Using a 13-node cluster,
• train 13 models in parallel,
• which translates into a 7x speedup compared to training the models one at a time on one machine.
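The reported speedup can be related to the cluster size with a one-line calculation. The helper name below is chosen for this example; an ideal speedup on 13 nodes would be 13x, so 7x corresponds to a parallel efficiency of 7/13.

```python
def parallel_efficiency(speedup, workers):
    """Share of the ideal (linear) speedup that was actually achieved."""
    return speedup / workers

# Figures quoted on the slide: 7x speedup on a 13-node cluster.
efficiency = parallel_efficiency(7.0, 13)
```

The result is a little over one half, which leaves room for scheduling and data movement overhead.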
Spark Cluster with TensorFlow (Cont’d)
Spark Cluster with TensorFlow (Cont’d)
[Graph: the learning rate for different numbers of neurons]
The learning rate is critical:
– If it is too low,
• the neural network does not learn anything (high test error).
– If it is too high,
• the training process may oscillate randomly and even diverge in some configurations.
The number of neurons
– is not as important for getting good performance,
• and networks with many neurons
– are much more sensitive to the learning rate.
– This is Occam’s Razor principle:
• simpler models tend to be “good enough” for most purposes.
• If you have the time and resources to go after the missing 1% test error, you must be willing to invest a lot of resources in training
• to find the proper hyperparameters that will make the difference.
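The effect of the learning rate can be seen even on a one-parameter problem. This sketch assumes a simple quadratic loss instead of a neural network, with learning rates picked to show the three regimes.

```python
def gradient_descent(learning_rate, steps=20, start=5.0):
    """Minimize f(w) = w**2, whose gradient is 2*w, and return the final w."""
    w = start
    for _ in range(steps):
        w -= learning_rate * 2.0 * w
    return w

too_low = gradient_descent(0.001)   # barely moves away from the start
good = gradient_descent(0.3)        # converges close to the minimum at 0
too_high = gradient_descent(1.1)    # overshoots on every step and diverges
```

With the highest rate, each update flips the sign of w and increases its magnitude, which mirrors the oscillation and divergence described for neural network training.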
Distributed processing of images using TensorFlow
Apache Spark with a deep learning library
– Takes an existing neural network (Inception-v3)
• and applies it to a corpus of images
– Requires that TensorFlow be installed on the cluster
– Runs in IBM DSX
• Not in Databricks CE
• Built by Databricks but needs a GPU
Spark integration workflow:
– Define TensorFlow operations as methods, to be used within Spark tasks.
– Broadcast the model for use within Spark tasks.
– Parallelize a list of image URLs.
– Using Spark, process the image URLs in parallel:
• Load the image.
• Run inference on the image using TensorFlow to predict the image contents.
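The workflow above can be mimicked on a single machine to show its shape. Everything here is a stand-in: a thread pool replaces Spark, a keyword dictionary replaces the broadcast Inception model, and `load_image`, `run_inference`, and the example URLs are hypothetical placeholders rather than TensorFlow or Spark calls.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a trained model that would be broadcast to workers.
MODEL = {"diver": "scuba diver", "shark": "tiger shark"}

def load_image(url):
    """Placeholder for downloading and decoding an image; returns the URL itself."""
    return url

def run_inference(model, image):
    """Placeholder inference: match keywords instead of running a network."""
    for keyword, label in model.items():
        if keyword in image:
            return label
    return "unknown"

def classify(url):
    """The per-task function: load one image and predict its contents."""
    return url, run_inference(MODEL, load_image(url))

image_urls = [
    "http://example.com/diver.jpg",
    "http://example.com/shark.jpg",
    "http://example.com/cat.jpg",
]

# The list of URLs is split across workers, each applying the same function.
with ThreadPoolExecutor(max_workers=3) as pool:
    predictions = dict(pool.map(classify, image_urls))
```

In the Spark version, the model would be shared once per worker and the URL list distributed across the cluster, but the per-image function keeps this same structure.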
Distributed processing of image classification using TensorFlow
Use the “Simple image classification with Inception” example from TensorFlow,
– which applies the Inception model to predict the contents of a set of images.
For example, given a photo of two scuba divers,
– the Inception model will tell us the contents of the image:
('scuba diver', 0.88708681),
('electric ray, crampfish, numbfish, torpedo', 0.012277877),
('sea snake', 0.005639134),
('tiger shark, Galeocerdo cuvieri', 0.0051873429),
('reel', 0.0044495272)
Distributed processing of image classification using TensorFlow (Cont’d)
Each of the lines above represents a “synset,” or a set of synonymous terms
– representing a concept.
The weight given to each synset
– represents the confidence in how applicable the synset is to the image.
– In this case, “scuba diver” is pretty accurate!
Making predictions with Inception-v3 is expensive:
– each prediction requires about 4.8 billion operations (Szegedy et al., 2015).
Even with smaller datasets,
– it is worthwhile to parallelize this computation
– and distribute these costly predictions using Spark.
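Two small calculations follow from these points: picking the most confident synset from the scores shown for the scuba diver photo, and estimating how the 4.8 billion operations per prediction divide across workers. The batch size of 1000 images and the 10 workers are illustrative assumptions, as is the even split.

```python
# Scores quoted for the scuba diver example.
predictions = [
    ("scuba diver", 0.88708681),
    ("electric ray, crampfish, numbfish, torpedo", 0.012277877),
    ("sea snake", 0.005639134),
    ("tiger shark, Galeocerdo cuvieri", 0.0051873429),
    ("reel", 0.0044495272),
]

OPS_PER_PREDICTION = 4.8e9  # figure quoted on the slide

def top_prediction(scored_synsets):
    """Return the (synset, confidence) pair with the highest confidence."""
    return max(scored_synsets, key=lambda pair: pair[1])

def operations_per_worker(num_images, num_workers):
    """Operations each worker performs if the images are split evenly."""
    return num_images * OPS_PER_PREDICTION / num_workers

label, confidence = top_prediction(predictions)
ops = operations_per_worker(num_images=1000, num_workers=10)
```

Under these assumptions each worker handles 100 images, so the per-worker cost drops by the number of workers while the per-image cost stays fixed.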
Contents
• Myself
• Introduction To Big Data
• Artificial Intelligence
• AI and Big Data
• Summary
Summary
Introduction to Big Data
Introduction to AI
AI on Big Data
Databricks Partners
Training Hadoop and Spark
Cloudera visited to interview Jongwook Woo
Training Hadoop on IBM Bluemix at
California State Univ. Los Angeles
Question?
References
1. Jongwook Woo and Yuhang Xu, “Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing”, The 2011 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA 2011), Las Vegas, July 18-21, 2011
2. Jongwook Woo, “Market Basket Analysis Algorithms with MapReduce”, DMKD-00150, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, Oct 28 2013, Volume 3, Issue 6, pp. 445-452, ISSN 1942-4795
3. Jongwook Woo, “Big Data Trend and Open Data”, UKC 2016, Dallas, TX, Aug 12 2016
4. How to choose algorithms for Microsoft Azure Machine Learning, https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-algorithm-choice
5. Manik Katyal, Parag Chhadva, Shubhra Wahi, and Jongwook Woo, “Big Data Analysis using Spark for Collision Rate Near CalStateLA”, https://globaljournals.org/GJCST_Volume16/1-Big-Data-Analysis-using-Spark.pdf
6. Spark Programming Guide, http://spark.apache.org/docs/latest/programming-guide.html
7. GitHub URL: https://github.com/nmelche/IntroductionToBigDataScience
8. TensorFrames: Google Tensorflow on Apache Spark, https://www.slideshare.net/databricks/tensorframes-google-tensorflow-on-apache-spark
9. Deep learning and Apache Spark, https://www.slideshare.net/QuantUniversity/deep-learning-and-apache-spark
10. Which Is Deeper - Comparison Of Deep Learning Frameworks On Spark, https://www.slideshare.net/SparkSummit/which-is-deeper-comparison-of-deep-learning-frameworks-on-spark
11. Accelerating Machine Learning and Deep Learning At Scale with Apache Spark, https://www.slideshare.net/SparkSummit/accelerating-machine-learning-and-deep-learning-at-scalewith-apache-spark-keynote-by-ziya-ma
12. Deep Learning with Apache Spark and TensorFlow, https://databricks.com/blog/2016/01/25/deep-learning-with-apache-spark-and-tensorflow.html
13. Tensor Flow Deep Learning Open SAP
Deep Learning for the Intelligent Enterprise
[Figure: diagram relating Artificial Intelligence, Machine Learning, Deep Learning, and Neural Networks]
Deep learning
▪ Sub-field of neural networks, machine learning, and artificial intelligence
▪ Deep learning is neural networks with many layers
▪ Inspired by, but not limited to, the architecture of the human brain
▪ Deep learning is the reality behind artificial intelligence

 
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced LoadsFEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
 
Design For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startDesign For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the start
 
Thermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptThermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.ppt
 
Unleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leapUnleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leap
 
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
 
Standard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power PlayStandard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power Play
 
Double Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torqueDouble Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torque
 
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
 
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills KuwaitKuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
 
A Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna MunicipalityA Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna Municipality
 

AI on Big Data

  • 1. Jongwook Woo, HiPIC, CalStateLA. Dong-Eui University, College of Commerce and Economics, Department of Economics, Professor Lim Dong-Soon (임동순). May 29, 2018. Jongwook Woo, PhD, jwoo5@calstatela.edu, High-Performance Information Computing Center (HiPIC), California State University Los Angeles. Introduction to AI on Big Data
  • 2. Contents: Myself; Introduction to Big Data; Artificial Intelligence; Artificial Intelligence and Big Data; Summary
  • 3. Myself. Experience: Since 2002, Professor at California State University Los Angeles (PhD in 2001: Computer Science and Engineering at USC). Since 1998: R&D consulting in Hollywood: Warner Bros (Matrix online game), E!, citysearch.com, ARM, etc.; information search and integration with FAST, Lucene/Solr, Sphinx; implementing eBusiness applications using J2EE and middleware. Since 2007: exposed to Big Data at CitySearch.com. 2012 to present: Big Data academic partnerships for Big Data research and training: Amazon AWS, Microsoft Azure, IBM Bluemix; Databricks, Hadoop vendors
  • 4. Myself: S/W Development Lead. http://www.mobygames.com/game/windows/matrix-online/credits
  • 5. Myself. Experience (Cont’d): Bringing Big Data R&D and training to Korea since 2009. Collaborating with LA city since 2016: collect, search, and analyze city data (Spark, Hadoop, ElasticSearch, Solr, Java, Cloudera). Sept 2013: Samsung Advanced Technology Training Institute. Since 2008: introducing Hadoop Big Data and education to universities and research centers: Yonsei, Gachon, DongEui; US: USC, Pennsylvania State Univ, University of Maryland College Park, Univ of Bridgeport, Louisiana State Univ, California State Univ LB; Europe: Univ of Luxembourg
  • 6. Myself: Partners for Services
  • 7. Experience in Big Data. Collaboration: Council Member of IBM Spark Technology Center; City of Los Angeles for OpenHub and Open Data; startup companies in Los Angeles; External Collaborator and Advisor in Big Data (IMSC of USC; Pennsylvania State University; The Big Link, Softzen, Wiken in Korea). Grants: IBM Bluemix, Microsoft Windows Azure, Amazon AWS in Research and Education Grant. Partnership: Academic Education Partnership with Databricks, Tableau, Qlik, Cloudera, Hortonworks, SAS, Teradata
  • 8. Myself: Public Partners
  • 9. Contents: Myself; Introduction to Big Data; Artificial Intelligence; Artificial Intelligence and Big Data; Summary
  • 10. Data Issues. Large-scale data: terabytes (10^12 bytes), petabytes (10^15 bytes), because of the web, sensor data (IoT), bioinformatics, social computing, streaming data, smartphones, online games… The legacy approach cannot handle it: too big; non-/semi-structured data; too expensive. New, inexpensive systems are needed
  • 11. Two Cores in Big Data: how to store Big Data and how to compute Big Data. Google: how to store Big Data: GFS, a distributed system on inexpensive commodity computers; how to compute Big Data: MapReduce, parallel computing with inexpensive computers. Own supercomputers. Published papers in 2003 and 2004
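To make the "how to compute" half of slide 11 concrete, here is a minimal single-process sketch of the MapReduce pattern (map, shuffle, reduce) applied to word counting. It is only an illustration: the function names and the two input strings are hypothetical, and a real Hadoop job would run the map and reduce calls on the cluster nodes that hold the data.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical input: each string stands in for a file block on a different node.
documents = ["big data on hadoop", "big data on spark"]

# On a cluster, each map call would run where its block is stored.
mapped = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(shuffle_phase(mapped))
print(word_counts)
```

The point of the split is that the map and reduce functions contain no coordination logic, so a framework is free to run many copies of them in parallel.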
  • 12. What is Hadoop? Hadoop founder: Doug Cutting, Apache committer: Lucene, Nutch, …
  • 13. Super Computer vs Hadoop. "Parallel vs. Distributed file systems" by Michael Malak, updated by Jongwook Woo. [Figure labels: Cluster for Store; Cluster for Compute/Store; Cluster for Compute]
  • 14. Hadoop Cluster: Logical Diagram. [Figure: a web browser for the cluster monitor (CM/Ambari) connects over HTTP(S) to the Cluster Monitor; each node runs an Agent with Hadoop and HDFS; other labels: HIVE, ZooKeeper, Impala]
  • 15. Hadoop Ecosystems. http://dawn.dbsdataprojects.com/tag/hadoop/
  • 16. Definition: Big Data. Inexpensive frameworks that are distributed parallel systems and that can store large-scale data and process it in parallel [1, 2]. Hadoop: an inexpensive supercomputer, more accessible to the public than traditional supercomputers; you can store and process your applications in university labs, small companies, and research centers. Others: NoSQL DB (Cassandra, MongoDB, Redis, HBase), ElasticSearch
  • 17. NoSQL DB. Key-Value: Memcached, Memcachedb, Redis. Column Oriented (Column Family Store): BigTable, HBase, Cassandra (Key-Value Column Oriented), Amazon SimpleDB. Document Oriented: MongoDB, Couchbase, CouchDB. Graph Oriented: Neo4j, InfiniteGraph
  • 18. Alternative to Hadoop MapReduce. Limitations of MapReduce: hard to program in Java; batch processing, not interactive; disk storage for intermediate data, a performance issue. Spark by UC Berkeley AMPLab: in-memory storage for intermediate data; 20 to 100 times faster than MapReduce, which relies on network and disk; good for machine learning (iterative algorithms)
  • 19. Spark and Hadoop. Spark file system: Tachyon; resource manager: Mesos. But Hadoop has been dominating the market, so Spark is being integrated into Hadoop clusters. Cloud computing: Amazon AWS, Azure HDInsight, IBM Bluemix (Object Storage, S3). Hadoop vendors: HDP, CDH. Databricks: Spark on AWS & Azure, no Hadoop ecosystems
  • 20. Sentiment Map of Alphago. [Figure legend: Positive, Negative]
  • 21. Sentiment Map of Lee Se-Dol vs Alphago. YouTube video: “alphago sentiment” by Google. The sentiment of the world in geo and time: https://youtu.be/vAzdnj4fkOg?list=PLaEg1tCLuW0BYLqVS5RTbToiB8wQ2w14a
  • 22. K-Election 2017 (April 29 – May 9)
  • 23. Mapping of Crimes That Occurred within 5 Miles of CalStateLA, UCLA and USC in 2015
  • 24. Review count of popular sub-categories of business
  • 25. Businesses popular within 5 miles of CalStateLA, USC, UCLA
  • 26. Average Undergraduates Receiving PELL GRANT in Each College. East Georgia State College: $2,854. Avg. PELL grant: 97.285%
  • 27. Big Data Analysis Flow. Data Collection (batch API: Yelp, Google; streaming: Twitter, Apache NiFi, Kafka, Storm; open data: government). Data Storage (HDFS, S3, Object Storage, NoSQL DB (Couchbase)…). Data Filtering (Hive, Pig). Data Analysis and Science (Hive, Pig, Spark, BI tools (Datameer, Qlik, Tableau,…)). Data Visualization (Qlik, Datameer, Excel PowerView). Stages: Big Data Engineering, Big Data Analysis, Big Data Science, Data Visualization
  • 28. Terms. We know: Data Engineering (collect, clean, filter data); Data Analysis (find insights from the data); Data Science, or Predictive Analysis (predict the trend or pattern from the existing data). Do we know? Big Data Analysis and Science: using Big Data for data analysis and science (Hadoop, Spark, NoSQL DB, SAP HANA, ElasticSearch, ...) for massive data sets: how to store and compute?
  • 29. NoSQL DB. Key-Value: Memcached, Memcachedb, Redis. Column Oriented (Column Family Store): BigTable, HBase, Cassandra (Key-Value Column Oriented), Amazon SimpleDB. Document Oriented: MongoDB, Couchbase, CouchDB. Graph Oriented: Neo4j, InfiniteGraph
  • 30. Contents: Myself; Introduction to Big Data; Artificial Intelligence; Artificial Intelligence and Big Data; Summary
  • 31. AI and Deep Learning. [Figure labels: Artificial Intelligence, Machine Learning, Deep Learning, Neural Networks.] Deep learning: a sub-field of neural networks, machine learning, and artificial intelligence; deep learning is neural networks with many layers; inspired by, but not limited to, the architecture of the human brain. © 2017 SAP SE or an SAP affiliate company. All rights reserved.
  • 32. Deep Learning and TensorFlow. Development led by Google. Open-source library for deep learning: define model structures; library for efficient execution. Define once, run anywhere: can run on CPUs and GPUs and on many devices (NVIDIA, Google GPU). Can be used from Python and many other languages. Built for large-scale machine learning development and operations
  • 33. Deep Learning [9]: Neural Networks; Multi-Layer Perceptron; Convolutional Neural Networks
  • 34. Convolutional Neural Networks: good at problems like image classification
  • 35. Recurrent Neural Networks (RNN). Has 3 types of parameters: W (hidden weights), U (hidden-to-hidden weights), V (hidden-to-label weights). Good for text processing such as sentiment analysis. My Projects > sapDeepLearningTensorflow > Week_03_Unit_05_S
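The three parameter groups on slide 35 can be seen in a single recurrent step. The sketch below is a toy in plain Python under stated assumptions: it reads W as the input-to-hidden weights (the slide only says "hidden weights"), it uses a tanh hidden activation and a softmax output, and all sizes and weight values are made up for illustration.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def rnn_step(x, h_prev, W, U):
    """One recurrent step: h_t = tanh(W x_t + U h_{t-1})."""
    wx = matvec(W, x)
    uh = matvec(U, h_prev)
    return [math.tanh(a + b) for a, b in zip(wx, uh)]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical weights: 2 inputs, 2 hidden units, 2 labels.
W = [[0.5, -0.3], [0.1, 0.8]]   # assumed: input-to-hidden
U = [[0.2, 0.0], [0.0, 0.2]]    # hidden-to-hidden
V = [[1.0, -1.0], [-1.0, 1.0]]  # hidden-to-label

# A toy sequence of three input vectors, e.g. three encoded words.
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

h = [0.0, 0.0]
for x in sequence:
    h = rnn_step(x, h, W, U)  # the same W and U are reused at every step

label_probs = softmax(matvec(V, h))
print(label_probs)
```

Because the hidden state h is carried from one step to the next, the final prediction depends on the whole sequence, which is why this family of models suits text.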
  • 36. Key takeaways from modeling Deep Neural Networks. Neural networks are resource intensive: they typically require huge dedicated hardware (RAM, GPUs). The parameter space is huge: hundreds of thousands of parameters, so tuning is important. Architecture choice is important: see http://www.asimovinstitute.org/neural-network-zoo/
  • 37. Contents: Myself; Introduction to Big Data; Artificial Intelligence; Artificial Intelligence and Big Data; Summary
  • 38. Recap. Spark: an efficient framework for running computations on thousands of computers. TensorFlow: a high-performance numerical framework. Get the best of both: a simple API for distributed numerical computing that can leverage the hardware of the cluster
  • 39. Deep Learning + Apache Spark. Investment in Big Data infrastructure. GPUs: require specialized hardware; niche use cases. Can enterprises reuse existing infrastructure for deep learning applications? What use cases in deep learning can leverage Apache Spark?
  • 40. Spark using TensorFlow [8, 9]. Neural networks have seen spectacular progress during the last few years and are the state of the art in image recognition and automated translation. TensorFlow: a new framework released by Google for numerical computations and neural networks. Spark and TensorFlow: use Spark and a cluster of machines to improve deep learning pipelines with TensorFlow; how to use TensorFlow and Spark together to train and apply deep learning models. Hyperparameter tuning: use Spark to find the best set of hyperparameters for neural network training, leading to a 10x reduction in training time and a 34% lower error rate. Deploying models at scale: use Spark to apply a trained neural network model to a large amount of data
  • 41. Spark Cluster with TensorFlow. Accuracy with the default set of hyperparameters: 99.2%. The best result with hyperparameter tuning has 99.47% accuracy on the test set, which is a 34% reduction of the test error
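The 34% figure on slide 41 is a relative reduction in test error, not in accuracy. A short check using only the two accuracies quoted on the slide:

```python
# Accuracies quoted on the slide.
baseline_accuracy = 0.992   # default hyperparameters
tuned_accuracy = 0.9947     # after hyperparameter tuning

# Test error is the complement of accuracy.
baseline_error = 1.0 - baseline_accuracy   # about 0.80%
tuned_error = 1.0 - tuned_accuracy         # about 0.53%

# Relative reduction: how much of the original error was removed.
relative_reduction = (baseline_error - tuned_error) / baseline_error
print(f"baseline error {baseline_error:.2%}, tuned error {tuned_error:.2%}")
print(f"relative reduction {relative_reduction:.4f}")
```

The error drops from about 0.80% to about 0.53%, and (0.80 - 0.53) / 0.80 is roughly 0.34, which matches the slide.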
  • 42. Efforts on using Deep Learning Frameworks with Spark. Databricks: platform for running Spark with TensorFlow. BigDL: Intel’s library for deep learning on existing data frameworks. TensorFlowOnSpark: Yahoo’s distributed deep learning on Big Data. SparkNet: AMPLab’s framework for training deep networks in Spark
  • 43. Efforts on using Deep Learning Frameworks with Spark. DeepLearning4J: uses data parallelism to train on separate neural networks. DeepDist: lightning-fast deep learning on Spark via parallel stochastic gradient updates. IBM DSX
  • 44. Databricks. Deploying trained models to make predictions on data stored in Spark RDDs or DataFrames. Inception model: https://www.tensorflow.org/tutorials/image_recognition. Each prediction requires about 4.8 billion operations; parallelizing with Spark helps scale operations. https://databricks.com/blog/2016/12/21/deep-learning-on-databricks.html
  • 45. Databricks. Distributed model training: use deep learning libraries like TensorFlow to test different model hyperparameters on each worker (task parallelism). https://databricks.com/blog/2016/12/21/deep-learning-on-databricks.html
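The task parallelism described on slide 45 (each worker trains one model with its own hyperparameters) can be sketched without a cluster. The code below is a stand-in, not the Databricks implementation: a local thread pool plays the role of the Spark workers, `train_and_score` is a hypothetical function with a made-up error surface instead of real TensorFlow training, and the grid values are arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def train_and_score(params):
    """Hypothetical stand-in for training one model; returns (test_error, params)."""
    learning_rate, num_neurons = params
    # Toy error surface whose minimum is at learning_rate=0.1, num_neurons=64.
    error = (learning_rate - 0.1) ** 2 + ((num_neurons - 64) / 1000) ** 2
    return error, params


# Every combination is an independent task, so the tasks can run in any order.
learning_rates = [0.001, 0.01, 0.1, 1.0]
neuron_counts = [16, 64, 256]
grid = list(product(learning_rates, neuron_counts))

# The thread pool stands in for the cluster: one task per hyperparameter set.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(train_and_score, grid))

best_error, best_params = min(results)
print(best_params, best_error)
```

In Spark the same shape would be a parallelized collection of hyperparameter sets mapped through the training function, with only the small result tuples collected back on the driver.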
  • 46. IBM DSX. Data Science Experience (DSX) includes the TensorFlow library and GPU support. Easy to develop and run Spark with TensorFlow; no need to configure the library. Databricks’ examples run in DSX, while Databricks CE does not support GPU. Brunel for visualization lately
  • 47. Spark Cluster with TensorFlow (Cont’d). Multiple nodes in the cluster: the computations scaled linearly. [Figure: computation time (in seconds) with respect to the number of machines in the cluster.] Using a 13-node cluster to train 13 models in parallel translates into a 7x speedup compared to training the models one at a time on one machine
  • 48. Spark Cluster with TensorFlow (Cont’d)
  • 49. Spark Cluster with TensorFlow (Cont’d). [Figure: the learning rate for different numbers of neurons.] The learning rate is critical: if it is too low, the neural network does not learn anything (high test error); if it is too high, the training process may oscillate randomly and even diverge in some configurations. The number of neurons is not as important for getting good performance, and networks with many neurons are much more sensitive to the learning rate. This is the Occam’s Razor principle: simpler models tend to be “good enough” for most purposes. If you have the time and resources to go after the missing 1% test error, you must be willing to invest a lot of resources in training to find the proper hyperparameters that will make the difference
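The learning-rate behavior on slide 49 can be reproduced on a toy problem. The sketch below is an illustration only, not the experiment from the slides: it runs plain gradient descent on f(w) = w², and the three learning rates and the step count are arbitrary choices.

```python
def final_loss(learning_rate, steps=50):
    """Run gradient descent on f(w) = w**2 from w = 1.0 and return the final loss."""
    w = 1.0
    for _ in range(steps):
        gradient = 2.0 * w            # derivative of w**2
        w -= learning_rate * gradient
    return w * w


too_low = final_loss(0.0001)   # barely moves: loss stays near the starting value of 1.0
reasonable = final_loss(0.1)   # converges toward the minimum at w = 0
too_high = final_loss(1.1)     # each step overshoots further: the loss grows

print(too_low, reasonable, too_high)
```

Each update multiplies w by (1 - 2 × learning rate), so a tiny rate leaves w almost unchanged, a moderate rate shrinks it quickly, and a rate above 1 flips its sign and enlarges it on every step.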
  • 50. Distributed processing of images using TensorFlow. Apache Spark with a deep learning library takes an existing neural network (Inception-v3) and applies it to a corpus of images. This requires that TensorFlow be installed on the cluster. Runs in IBM DSX, not in Databricks CE (built by Databricks but needs GPU). Spark integration workflow: define TensorFlow operations as methods to be used within Spark tasks; broadcast the model for use within Spark tasks; parallelize a list of image URLs. Using Spark, the image URLs are processed in parallel: load the image, then run inference on the image using TensorFlow to predict the image contents
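The workflow on slide 50 follows a load-once, map-over-URLs pattern. The sketch below shows that shape with the standard library only, so every piece is a stand-in: `load_model` and `classify` are hypothetical (the "model" is a keyword lookup, not Inception), the URLs are invented, and a thread pool replaces the Spark cluster. In PySpark the corresponding calls would be `SparkContext.broadcast` for the model and `SparkContext.parallelize(...).map(...)` for the URLs.

```python
from concurrent.futures import ThreadPoolExecutor


def load_model():
    """Hypothetical stand-in for loading a pretrained network once on the driver."""
    return {"cat": "tabby cat", "diver": "scuba diver"}


def classify(model, url):
    """Hypothetical stand-in for 'load image, run inference' on one URL."""
    for keyword, label in model.items():
        if keyword in url:
            return url, label
    return url, "unknown"


# Step 1: load the model once; in Spark this object would be broadcast to the workers.
model = load_model()

# Step 2: the list of image URLs that would be parallelized across the cluster.
image_urls = [
    "http://example.com/images/diver_01.jpg",
    "http://example.com/images/cat_07.jpg",
    "http://example.com/images/tree_03.jpg",
]

# Step 3: process the URLs in parallel; each task reuses the shared model.
with ThreadPoolExecutor(max_workers=3) as pool:
    predictions = list(pool.map(lambda url: classify(model, url), image_urls))

for url, label in predictions:
    print(url, "->", label)
```

Sharing one copy of the model per worker matters because the model is large compared with a URL, so shipping it with every task would dominate the cost.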
  • 51. Distributed processing of image classification using TensorFlow. Use the “Simple image classification with Inception” example from TensorFlow, which applies the Inception model to predict the contents of a set of images. For example, given a photo of two scuba divers, the Inception model will tell us the contents of the image: ('scuba diver', 0.88708681), ('electric ray, crampfish, numbfish, torpedo', 0.012277877), ('sea snake', 0.005639134), ('tiger shark, Galeocerdo cuvieri', 0.0051873429), ('reel', 0.0044495272)
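The output on slide 51 is a list of (label, weight) pairs, so picking the predicted class is a matter of taking the pair with the largest weight. A small sketch using the exact values from the slide (the variable names are illustrative):

```python
# The five (label, weight) pairs quoted on the slide.
predictions = [
    ("scuba diver", 0.88708681),
    ("electric ray, crampfish, numbfish, torpedo", 0.012277877),
    ("sea snake", 0.005639134),
    ("tiger shark, Galeocerdo cuvieri", 0.0051873429),
    ("reel", 0.0044495272),
]

# The predicted class is the pair with the highest weight.
top_label, top_weight = max(predictions, key=lambda pair: pair[1])

# The listed weights do not sum to 1: the remainder is spread over classes not shown.
listed_weight = sum(weight for _, weight in predictions)

print(top_label, top_weight)
print(f"weight covered by the five listed labels: {listed_weight:.4f}")
```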
  • 52. Distributed processing of image classification using TensorFlow (Cont’d). Each of the lines above represents a “synset,” or a set of synonymous terms representing a concept. The weight given to each synset represents a confidence in how applicable the synset is to the image. In this case, “scuba diver” is pretty accurate! Making predictions with Inception-v3 is expensive: each prediction requires about 4.8 billion operations (Szegedy et al., 2015). Even with smaller datasets, it is worthwhile to parallelize this computation and distribute these costly predictions using Spark
  • 53. Contents: Myself; Introduction to Big Data; Artificial Intelligence; Artificial Intelligence and Big Data; Summary
  • 54. Summary: Introduction to Big Data; Introduction to AI; AI on Big Data
  • 55. Databricks Partners
  • 56. Training Hadoop and Spark. Cloudera visits to interview Jongwook Woo
  • 57. Training Hadoop on IBM Bluemix at California State Univ. Los Angeles
  • 58. Question?
  • 59. References 1. “Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing”, Jongwook Woo and Yuhang Xu, The 2011 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA 2011), Las Vegas (July 18-21, 2011) 2. Jongwook Woo, DMKD-00150, “Market Basket Analysis Algorithms with MapReduce”, Wiley Interdisciplinary Reviews Data Mining and Knowledge Discovery, Oct 28 2013, Volume 3, Issue 6, pp445-452, ISSN 1942-4795 3. Jongwook Woo, “Big Data Trend and Open Data”, UKC 2016, Dallas, TX, Aug 12 2016 4. How to choose algorithms for Microsoft Azure Machine Learning, https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-algorithm-choice 5. “Big Data Analysis using Spark for Collision Rate Near CalStateLA”, Manik Katyal, Parag Chhadva, Shubhra Wahi & Jongwook Woo, https://globaljournals.org/GJCST_Volume16/1-Big-Data-Analysis-using-Spark.pdf 6. Spark Programming Guide: http://spark.apache.org/docs/latest/programming-guide.html 7. Github URL: https://github.com/nmelche/IntroductionToBigDataScience
  • 60. References 8. TensorFrames: Google Tensorflow on Apache Spark, https://www.slideshare.net/databricks/tensorframes-google-tensorflow-on-apache-spark 9. Deep learning and Apache Spark, https://www.slideshare.net/QuantUniversity/deep-learning-and-apache-spark 10. Which Is Deeper - Comparison Of Deep Learning Frameworks On Spark, https://www.slideshare.net/SparkSummit/which-is-deeper-comparison-of-deep-learning-frameworks-on-spark 11. Accelerating Machine Learning and Deep Learning At Scale with Apache Spark, https://www.slideshare.net/SparkSummit/accelerating-machine-learning-and-deep-learning-at-scalewith-apache-spark-keynote-by-ziya-ma 12. Deep Learning with Apache Spark and TensorFlow, https://databricks.com/blog/2016/01/25/deep-learning-with-apache-spark-and-tensorflow.html 13. Tensor Flow Deep Learning Open SAP
  • 61. Deep Learning for the Intelligent Enterprise. [Figure labels: Artificial Intelligence, Machine Learning, Deep Learning, Neural Networks.] Deep learning: a sub-field of neural networks, machine learning, and artificial intelligence; deep learning is neural networks with many layers; inspired by, but not limited to, the architecture of the human brain; deep learning is the reality behind artificial intelligence