Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT Austin at MLconf ATL - 9/18/15

Building a Recommender System for Publications using Vector Space Model and Python: In recent years it has become very common to have access to a large number of publications on similar or related topics. Recommendation systems for publications are needed to locate appropriate articles within a large body of work on the same or similar topics. In this talk, I will describe a recommender system framework for PubMed articles. PubMed is a free search engine that primarily accesses the MEDLINE database of references and abstracts on life-sciences and biomedical topics. The proposed recommender system produces two types of recommendations: (i) content-based recommendations and (ii) recommendations based on similarities with other users' search profiles. The first type, content-based recommendation, can efficiently find material that is similar in context or topic to the input publication. The second mechanism generates recommendations using the search history of users whose search profiles match the current user's. The content-based recommender uses a Vector Space Model to rank PubMed articles by content similarity. To implement the second mechanism, we use Python libraries and frameworks: we find the profile similarity of users and recommend additional publications based on the history of the most similar user. In the talk I will present the background and motivation for these recommendation systems and discuss the implementation of this PubMed recommender system with examples.


  1. Machine Learning (ML) and TACC Supercomputers
  2. A little about me
     • Data Scientist at Texas Advanced Computing Center (TACC)
     • Contact: atrivedi@tacc.utexas.edu
     • TACC is an independent research center at UT Austin
     • TACC is one of the largest HIPAA-compliant supercomputing centers
     • ~250 faculty, researchers, students and staff
     • We provide support for large-scale computing problems
  3. Some Basic Observations
      Data access patterns differ fundamentally between Data Intensive Computing and High Performance Computing (HPC)
      Today, most ML researchers want or need to work with big data, vectorization, code optimization, etc.
  4. Data Intensive Computing
      Specializes in dealing effectively with vast quantities of data in distributed environments
      Generates high demand for computational resources, e.g. storage capacity, processing power, etc.
  5. Data Intensive Computing & Big Data
      Big data plays the key role in the popularity and growth of data-intensive computing
      It has increased the volume of data
      It improves the accuracy of existing algorithms
      It helps create better predictive models
      It has also increased the complexity
  6. What's the challenge with big data analysis?
  7.  Big data analysis requires even more computational resources
      Storage requirements are roughly triple the raw data size
      Algorithms use large numbers of data points and are memory intensive
      Big data analysis also takes much longer
      A typical hard drive reads at about 150 MB/sec
      At that rate, reading 1 TB takes ~2 hours
      Analysis can require processing time proportional to the size of the data
      Analysis at a rate of 1 MB/second would take roughly 11.6 days for 1 TB of data
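As a sanity check on these numbers, here is a quick back-of-the-envelope sketch (the helper function is ours for illustration, not from the talk):

```python
TB = 10**12  # bytes

def transfer_time(size_bytes, rate_bytes_per_sec):
    """Return (hours, days) needed to move or process the given data."""
    seconds = size_bytes / rate_bytes_per_sec
    return seconds / 3600, seconds / 86400

hours, _ = transfer_time(1 * TB, 150 * 10**6)   # 150 MB/s disk read
print(f"Reading 1 TB at 150 MB/s: ~{hours:.1f} hours")   # ~1.9 hours

_, days = transfer_time(1 * TB, 1 * 10**6)      # 1 MB/s analysis rate
print(f"Analyzing 1 TB at 1 MB/s: ~{days:.1f} days")     # ~11.6 days
```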
  8. High Performance Computing (HPC)
      Hardware with more computational power per compute node
      Computation can be spread across multiple nodes
      Provides highly efficient numeric processing in distributed environments
      HPC has seen recent growth in shared-memory architectures
  9. Sample TACC Computing Cluster
  10. Combining HPC & Data Intensive Computing
      The intersection of these two domains is driven mainly by the use of machine learning (ML)
      ML methodologies help extract knowledge from big data
      These hybrid environments:
      take advantage of data locality
      keep data exchanges over the network at a manageable level
      offer high performance through distributed libraries
  11. TACC Ecosystem
      Stampede – traditional cluster HPC system
      Stockyard and Corral – 25 petabytes of combined disk storage for all data needs
      Ranch – 160 petabytes of tape archive storage
      Maverick/Rustler/Rodeo – “niche” systems with GPU clusters, great for data analytics and visualization
      Wrangler – a new generation of data-intensive supercomputer
  12. TACC Ecosystem Goals
      Goal: address the data problem in multiple dimensions
      Supports data at large and small scales
      Supports data reliability
      Supports data security
      Supports multiple data types: structured and unstructured
      Supports sequential access
      Fast for large files
      Goal: support a wide range of applications and interfaces
      Hadoop (and Mahout) & Spark (and MLlib)
      Traditional R, GIS, databases, and other HPC-style workflows
      Goal: support the full data lifecycle
      Metadata and collection management support
  13. Why use TACC Supercomputers?
      Need to analyze large datasets quickly
      Need a more on-demand, interactive analysis environment
      Need to work with databases at high transaction rates
      Have a Hadoop or Spark workflow that needs a large HDFS datastore
      Have a dataset that many users will compute with or analyze
      Need a system with data management capabilities
      Have a job that is currently IO-bound
  14. TACC Success Stories
  15. (image slide)
  16. (image slide)
  17. Available ML tools/libraries in TACC supercomputers: Scikit-learn, Caffe, Theano, CUDA/cuDNN, Hadoop, PyHadoop, RHadoop, Mahout, Spark, PySpark, SparkR, MLlib
  18. Two Sample ML Workflows in TACC Supercomputers
      GPU-powered deep learning on MRI images with NVIDIA DIGITS on the Maverick supercomputer
      PubMed recommender system on the Wrangler supercomputer
  19. Deep Learning on Images
      Deep neural networks are computationally quite demanding
      The input data is large even at small image resolutions
      A 256 x 256 RGB image implies 196,608 input neurons (256 x 256 x 3)
      Many of the floating-point matrix operations involved can be offloaded to GPUs
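To make the input-size arithmetic concrete, a tiny NumPy sketch (illustrative, not from the talk):

```python
import numpy as np

# A 256 x 256 RGB image flattened into an input vector: every pixel channel
# becomes one input neuron, which is where 196,608 comes from.
image = np.zeros((256, 256, 3), dtype=np.float32)
input_vector = image.reshape(-1)
print(input_vector.size)  # 196608 == 256 * 256 * 3
```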
  20. Deep Learning on MRI using TACC Supercomputers
      Maverick has large GPU clusters
      Three major GPU-accelerated deep learning frameworks are available: Theano, Torch and Caffe
      We use NVIDIA DIGITS (based on Caffe), a web server providing a convenient web interface for training and testing deep neural networks
      For classifying MRI images we use a convolutional DNN to learn the features
      We use CUDA 7, cuDNN, Caffe and DIGITS on Maverick to classify our MRI images
      Over the course of 30 epochs, our classification accuracy ranges from 74.21% to 82.09%
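DIGITS drives Caffe under the hood; below is a minimal sketch of what the equivalent classification step looks like with Caffe's Python API. The model and image file names are hypothetical placeholders, and the preprocessing settings are typical Caffe defaults rather than the talk's actual configuration.

```python
import caffe

MODEL_DEF = "deploy.prototxt"               # network definition exported by DIGITS (assumed name)
WEIGHTS   = "snapshot_iter_30.caffemodel"   # trained weights (assumed name)

caffe.set_mode_gpu()  # run on one of Maverick's GPUs
net = caffe.Classifier(MODEL_DEF, WEIGHTS,
                       image_dims=(256, 256),
                       raw_scale=255,            # load_image gives [0,1]; scale to [0,255]
                       channel_swap=(2, 1, 0))   # RGB -> BGR, Caffe's channel order

image = caffe.io.load_image("mri_slice.png")     # placeholder input image
probabilities = net.predict([image], oversample=False)[0]
print("predicted class:", probabilities.argmax())
```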
  21. PubMed Recommender System on Wrangler
  22. What is a Recommendation System?
      A recommender system helps match users with items
      It uses implicit or explicit user feedback for item suggestions
      Our recommendation system: a model that recommends PubMed documents to users based on each user's search profile
  23. Types of Recommender Systems
      Knowledge-based (i.e., search)
        Pros: deterministic recommendations, assured quality, no cold start
        Cons: knowledge-engineering effort to bootstrap, basically static
      Content-based
        Pros: no community required, comparison between items possible
        Cons: content descriptions necessary, cold start for new users
      Collaborative
        Pros: no knowledge-engineering effort, serendipity of results
        Cons: requires some form of rating feedback, cold start for new users and new items
  24. Using the Vector Space Model (VSM) for PubMed
      Given:
      A set of PubMed documents
      N features (unique terms) describing the documents in the set
      VSM builds an N-dimensional vector space
      Each item/document is represented as a point in the vector space
      Information retrieval is based on search
      Query: a point in the vector space
      We apply TF-IDF to the tokenized documents to weight them and convert them to vectors
      We compute cosine similarity between the tokenized documents and the query term
      We select the top 3 documents matching our query (see the sketch below)
      We weight the query term in the sparse matrix and rank the documents
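A minimal scikit-learn sketch of this pipeline, using a made-up toy corpus in place of the real PubMed abstracts:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus standing in for PubMed abstracts (hypothetical data).
documents = [
    "mri segmentation of brain tumors",
    "deep learning for medical image analysis",
    "vector space models for document retrieval",
]
query = "deep learning on mri images"

# TF-IDF turns each document into a point in the N-dimensional term space.
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)
query_vector = vectorizer.transform([query])

# Cosine similarity between the query point and every document point.
scores = cosine_similarity(query_vector, doc_vectors).ravel()
top = scores.argsort()[::-1][:3]   # indices of the top-3 matches
for i in top:
    print(f"{scores[i]:.3f}  {documents[i]}")
```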
  25. MPI or Hadoop or Spark? Which is really the most suitable for this ML problem on an HPC system?
  26. Message Passing in HPC
      The Message Passing Interface (MPI) was one of the key factors behind the initial growth of cluster computing
      MPI helped shape what the HPC world has become today
      MPI has supported a substantial majority of all supercomputing work
      Scientists and engineers have relied on MPI for decades
      MPI works great for data-intensive computing on a GPU cluster
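For illustration, a minimal message-passing example using mpi4py (our choice of Python binding; the talk does not prescribe one):

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    # Explicit point-to-point send: the programmer manages every exchange.
    comm.send({"gradient": [0.1, -0.2]}, dest=1, tag=0)
elif rank == 1:
    data = comm.recv(source=0, tag=0)   # matching explicit receive
    print("rank 1 received:", data)

# Run with: mpiexec -n 2 python mpi_demo.py
```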
  27. Why MPI is Not the Best Tool for ML
      A researcher/developer working with MPI must manually decompose common data structures across processors
      Every update of the data structure must be recast into a flurry of messages, syncs, and data exchanges
      Programming at the transport layer is an awkward fit for numerical application developers
      This led to the advent of other techniques
  28. Choosing Hadoop over MPI
      Hadoop is an open-source implementation of the MapReduce programming model in Java
      It has interfaces to other programming languages such as R, Python, etc.
      Hadoop includes:
      HDFS: a distributed file system based on the Google File System (GFS)
      YARN: a resource manager that assigns resources to computational tasks
      MapReduce: a library that makes efficient distributed data processing easy
      Mahout: a scalable machine learning and data mining library
      Hadoop Streaming: a generic API that allows writing mappers and reducers in any language (illustrated below)
      Hadoop is a good fit for large single-pass data processing, but has its own limitations
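As an illustration of Hadoop Streaming, here is the classic word-count mapper/reducer pair in Python (a standard example, not code from the talk; in practice the two functions run as separate scripts):

```python
import sys

# Hadoop Streaming pipes data through stdin/stdout, so mappers and
# reducers can be written in any language.

def mapper():
    # mapper.py: emit one "word<TAB>1" line per word of input
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # reducer.py: Hadoop sorts by key, so identical words arrive together
    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rsplit("\t", 1)
        if word != current and current is not None:
            print(f"{current}\t{count}")
            count = 0
        current = word
        count += int(n)
    if current is not None:
        print(f"{current}\t{count}")

# Submitted with something like (paths are placeholders):
#   hadoop jar hadoop-streaming.jar \
#       -input /pubmed/in -output /pubmed/out \
#       -mapper mapper.py -reducer reducer.py
```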
  29. Limitations of Hadoop in HPC
      Hadoop comes with mandatory MapReduce logging of output to disk after every Map/Reduce stage
      In HPC, logging output to disk can be sped up with caching or SSDs
      In general, this rendered Hadoop unusable for many ML approaches that require iteration or interactive use
      The real issue with Hadoop was its HDFS file system
      HDFS was intimately tied to Hadoop cluster scheduling
      The large-scale ML community sought in-memory approaches to avoid this problem
  30. Spark
      For large-scale technical computing, one very promising in-memory approach is Spark
      Spark is free of Map/Reduce-style requirements
      Spark can run standalone, without a scheduler like YARN
      It has interfaces to other programming languages such as R, Python, etc.
      Spark supports HDFS through YARN
      MLlib: a scalable machine learning and data mining library
      Spark Streaming: enables stream processing of live data streams
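A minimal sketch of the in-memory advantage: caching an RDD so that repeated passes avoid the disk round-trips MapReduce forces between stages. The HDFS path is a placeholder, not the talk's actual dataset.

```python
from pyspark import SparkContext

sc = SparkContext(appName="cache-demo")

lines = sc.textFile("hdfs:///pubmed/abstracts.txt")   # placeholder path
tokens = lines.flatMap(lambda l: l.split()).cache()   # keep the RDD in memory

print("total tokens :", tokens.count())               # first pass reads from disk
print("unique tokens:", tokens.distinct().count())    # second pass served from RAM

sc.stop()
```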
  31. Our Recommendation Model
      We apply collaborative filtering on the weighted/ranked documents
      We use Alternating Least Squares (pyspark.mllib.recommendation.ALS) to recommend PubMed documents
      MatrixFactorizationModel.recommendProducts(user_id, num_recommendations)
      We use collaborative filtering in Scikit-learn and Hadoop as baselines
      We use the python-recsys library along with Python Scikit-learn
      svd.recommend(product_id)
      We use Mahout's Alternating Least Squares for Hadoop
      A comparative study of our model shows improved performance in Spark
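A hedged sketch of the ALS step named on this slide, run on made-up toy ratings rather than the real TF-IDF-weighted PubMed matrix:

```python
from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS, Rating

sc = SparkContext(appName="pubmed-als")

# Toy (user, document, score) triples; in the real workflow these come
# from the weighted/ranked PubMed documents.
ratings = sc.parallelize([
    Rating(user=1, product=101, rating=4.0),
    Rating(user=1, product=102, rating=1.0),
    Rating(user=2, product=101, rating=5.0),
    Rating(user=2, product=103, rating=3.0),
])

# Factorize the user-document matrix with Alternating Least Squares.
model = ALS.train(ratings, rank=10, iterations=10, lambda_=0.01)

# Top-2 document recommendations for user 1, via recommendProducts.
for rec in model.recommendProducts(1, 2):
    print(rec.user, rec.product, round(rec.rating, 2))

sc.stop()
```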
  32. Performance Evaluation of the PubMed Recommendation Model
      We evaluate our recommendation model using Python Scikit-learn, Apache Mahout and PySpark MLlib on Wrangler
      The recommendation model is evaluated with Root Mean Square Error (RMSE) and Mean Absolute Error (MAE)
      The lower the errors, the more accurate the model
      The lower the time taken to train/test the model, the better the performance

      Algorithm       | Dataset                   | ML library    | Evaluation              | Training time | Test time
      Recommendation  | Weighted PubMed documents | Python Scikit | RMSE=17.96%, MAE=16.53% | 42 secs       | 19 secs
      Recommendation  | Weighted PubMed documents | Hadoop Mahout | RMSE=16.02%, MAE=14.98% | 38 secs       | 14 secs
      Recommendation  | Weighted PubMed documents | PySpark MLlib | RMSE=15.88%, MAE=14.23% | 34 secs       | 11 secs
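For reference, how RMSE and MAE are computed with scikit-learn (generic sketch; the arrays below are stand-ins, not the talk's data):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Held-out relevance scores vs. the model's predictions (made-up values).
actual    = np.array([4.0, 3.0, 5.0, 2.0])
predicted = np.array([3.6, 3.4, 4.5, 2.3])

rmse = np.sqrt(mean_squared_error(actual, predicted))  # penalizes large errors more
mae  = mean_absolute_error(actual, predicted)          # average absolute deviation
print(f"RMSE = {rmse:.3f}, MAE = {mae:.3f}")
```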
  33. THANK YOU! Questions?
