SRA-SV | Cloud Research Lab
Guangdeng Liao
Zhan Zhang
Samsung Cloud Research Lab
Data Platform at Samsung
SRA-SV | Cloud Research Lab Slide 2
Our Mission: provide scalable, reliable, and secure storage and
computation for Samsung R&D
Samsung Data Platform
Resources:
• Hundreds of machines
• Petabytes of storage
• And still growing
SRA-SV | Cloud Research Lab Slide 3
What we have in our platform
Offline:
• Distributed MR processing
• Data warehousing with Hive/Pig
• In-house web-based ETL portal
• Many more..
Online:
• K-V store HBase
• In-house Blob store
• Online Storm
• Many more..
Dev. & management tools:
• Apache Mahout
• ElasticSearch
• In-house unified web portal
• In-house Single Sign On
• Visualization
• Many more..
With the platform, we have already significantly improved the ETL process, data
management, and data processing for other teams!
SRA-SV | Cloud Research Lab Slide 4
So, are we done?
No. Many more complex challenges.
SRA-SV | Cloud Research Lab Slide 5
Challenge #1: How to build scalable and efficient machine
learning over Big Data?
SRA-SV | Cloud Research Lab Slide 6
MR-based Mahout is good but...
Not good at expressing data dependencies and iterative algorithms such as PageRank
Map: distribute rank to link targets
Reduce: collect ranks from multiple sources
Iterate
$$PR(x) = (1 - \alpha)\frac{1}{N} + \alpha \sum_{i=1}^{n} \frac{PR(t_i)}{C(t_i)}$$
One job per iteration:
• Startup penalty
• I/O penalty
Unfortunately, many machine learning and data mining (MLDM) algorithms are iterative jobs
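To make the per-iteration cost concrete, here is a minimal single-process sketch of PageRank written in the map/reduce style described on this slide. It is only an illustration, not Mahout or Hadoop code: the function names, the toy graph, and the damping factor `alpha=0.85` are assumptions. Each call to `run_iteration` stands in for one full MapReduce job, which on a real cluster would pay the job startup cost and write its output to disk before the next iteration could start.

```python
from collections import defaultdict

def pagerank_map(page, rank, links):
    """Map: distribute this page's rank evenly to its link targets."""
    for target in links:
        yield target, rank / len(links)

def pagerank_reduce(contributions, n_pages, alpha=0.85):
    """Reduce: collect ranks from multiple sources and apply damping."""
    return (1 - alpha) / n_pages + alpha * sum(contributions)

def run_iteration(ranks, graph, alpha=0.85):
    """One map + shuffle + reduce pass; stands in for one MapReduce job."""
    shuffled = defaultdict(list)  # plays the role of the shuffle phase
    for page, links in graph.items():
        for target, share in pagerank_map(page, ranks[page], links):
            shuffled[target].append(share)
    return {p: pagerank_reduce(shuffled[p], len(graph), alpha) for p in graph}

# Toy graph (assumed): every page has at least one outgoing link.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = {page: 1 / len(graph) for page in graph}
for _ in range(20):  # on Hadoop, each pass here would be a separate job
    ranks = run_iteration(ranks, graph)
```

Because the toy graph has no dangling pages, the ranks keep summing to 1 after every iteration.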
SRA-SV | Cloud Research Lab Slide 7
Graph naturally represents data dependency
SRA-SV | Cloud Research Lab Slide 8
Graph-based Processing: Think like a Vertex
[Figure: in-memory data graph over a cluster, with a scheduling component]
Communication
– Message-based
– Shared memory-based
Vertex abstraction
– Think like a vertex
– In-memory processing
Execution engine
– Bulk synchronous parallel
– Asynchronous parallel
Popular frameworks:
– Giraph
– GraphLab
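The vertex abstraction with message-based communication and a bulk synchronous parallel engine can be sketched in a few lines. This is a single-process toy, not the Giraph or GraphLab API: `Vertex`, `compute`, `run_bsp`, and the damping value are hypothetical names and choices made for illustration. The user writes only `compute`, from the point of view of one vertex; the engine keeps the graph in memory and delivers messages at the superstep barrier.

```python
class Vertex:
    def __init__(self, vid, out_edges, value):
        self.id = vid
        self.out_edges = out_edges
        self.value = value

def compute(vertex, messages, superstep, n_vertices, alpha=0.85):
    """User code for PageRank, written from a single vertex's point of view."""
    if superstep > 0:
        vertex.value = (1 - alpha) / n_vertices + alpha * sum(messages)
    share = vertex.value / len(vertex.out_edges)
    return [(target, share) for target in vertex.out_edges]

def run_bsp(vertices, supersteps):
    """Bulk synchronous parallel loop: messages sent in step s arrive in step s + 1."""
    inbox = {vid: [] for vid in vertices}
    for step in range(supersteps):
        outbox = {vid: [] for vid in vertices}
        for v in vertices.values():
            for target, msg in compute(v, inbox[v.id], step, len(vertices)):
                outbox[target].append(msg)
        inbox = outbox  # barrier: all vertices finish before messages are delivered
    return {vid: v.value for vid, v in vertices.items()}

# Toy graph (assumed); the graph stays in memory across all supersteps.
edges = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
vertices = {vid: Vertex(vid, out, 1 / len(edges)) for vid, out in edges.items()}
result = run_bsp(vertices, supersteps=21)
```

Unlike the one-job-per-iteration approach, nothing is reloaded or written out between supersteps; only messages move.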
SRA-SV | Cloud Research Lab Slide 9
Graph-based Machine Learning
We used Apache Giraph 1.0 and developed a machine learning library on top of it:
Recommendation:
• Alternating Least Squares (ALS)
• Weighted ALS
• SGD (Matrix Factorization)
• Biased SGD
Graphical Model:
• Belief Propagation
Clustering:
• KMeans
• KMeans++
• Fuzzy Clustering
We see an order-of-magnitude speedup compared to the MR-based approach
in our cluster
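As an illustration of one algorithm from this list, here is a minimal single-machine sketch of SGD matrix factorization for recommendation. It is not the Giraph-based library described on the slide: the function names, learning rate, regularization, and toy ratings are all assumptions. In a graph formulation, users and items would be vertices and each rating an edge, so every update below touches only the two endpoints of one edge.

```python
import math
import random

def init_factors(n, rank, rng):
    """Small random latent factors for n users or n items."""
    return [[rng.uniform(0.1, 0.5) for _ in range(rank)] for _ in range(n)]

def predict(P, Q, u, i):
    return sum(pu * qi for pu, qi in zip(P[u], Q[i]))

def rmse(ratings, P, Q):
    se = sum((r - predict(P, Q, u, i)) ** 2 for u, i, r in ratings)
    return math.sqrt(se / len(ratings))

def sgd_epoch(ratings, P, Q, lr=0.02, reg=0.02):
    """One pass over the ratings; updates P and Q in place."""
    for u, i, r in ratings:
        err = r - predict(P, Q, u, i)
        for k in range(len(P[u])):
            pu, qi = P[u][k], Q[i][k]
            P[u][k] += lr * (err * qi - reg * pu)
            Q[i][k] += lr * (err * pu - reg * qi)

# Toy data (assumed): (user, item, rating) triples.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
rng = random.Random(0)
P, Q = init_factors(3, 2, rng), init_factors(3, 2, rng)
before = rmse(ratings, P, Q)
for _ in range(300):
    sgd_epoch(ratings, P, Q)
after = rmse(ratings, P, Q)
```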
SRA-SV | Cloud Research Lab Slide 10
Challenge #2: How to make Big Model + Big Data workloads, such as Deep
Learning, scalable and efficient?
SRA-SV | Cloud Research Lab Slide 11
One example: Deep Learning [1]
Many more examples (millions to billions of parameters) in Speech
Recognition, Image Processing and NLP
[1] ImageNet Classification with Deep Convolutional Neural Networks, in NIPS 2012
SRA-SV | Cloud Research Lab Slide 12
Model-Parallel Framework
Pipeline:
• User-defined model
• Auto-generation of model topology
• Auto-partition of topology over cluster (c1, c2, c3)
• Auto-deployment of topology (in-memory)
Programming model:
• Neuron-like programming
• Message-based communication
• Message-driven computation
Parallelize a big machine learning model over a cluster
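The idea of generating a topology from a model definition, partitioning it over a cluster, and driving computation by messages can be sketched as follows. This is a toy single-process illustration under stated assumptions, not the framework from the slide: the function names, the round-robin placement, the constant edge weight, and the `tanh` activation are all hypothetical. A neuron fires once all of its incoming messages have arrived, and the sketch counts how many messages would cross container boundaries.

```python
import math
from collections import defaultdict, deque

def build_topology(layer_sizes):
    """Generate a fully connected feed-forward topology from layer sizes."""
    layers, next_id = [], 0
    for size in layer_sizes:
        layers.append(list(range(next_id, next_id + size)))
        next_id += size
    edges = defaultdict(list)  # source neuron -> target neurons
    for src_layer, dst_layer in zip(layers, layers[1:]):
        for src in src_layer:
            edges[src].extend(dst_layer)
    return layers, edges

def partition(layers, n_containers):
    """Place neurons on containers round-robin within each layer (an assumption)."""
    return {n: idx % n_containers for layer in layers for idx, n in enumerate(layer)}

def forward(layers, edges, placement, inputs, weight=0.5):
    """Message-driven forward pass: a neuron fires once all its inputs arrive."""
    fan_in = defaultdict(int)
    for targets in edges.values():
        for t in targets:
            fan_in[t] += 1
    received = defaultdict(list)
    output, remote_messages = {}, 0
    queue = deque(zip(layers[0], inputs))  # (neuron, activation) pairs ready to fire
    while queue:
        neuron, activation = queue.popleft()
        output[neuron] = activation
        for target in edges[neuron]:
            if placement[neuron] != placement[target]:
                remote_messages += 1  # would travel over the network
            received[target].append(weight * activation)
            if len(received[target]) == fan_in[target]:
                queue.append((target, math.tanh(sum(received[target]))))
    return output, remote_messages

layers, edges = build_topology([3, 4, 2])
placement = partition(layers, n_containers=3)
output, remote_messages = forward(layers, edges, placement, inputs=[1.0, 0.5, -0.5])
```

A smarter partitioner would try to reduce `remote_messages`; round-robin is used here only because it is easy to follow.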
SRA-SV | Cloud Research Lab Slide 13
Architecture over Yarn
[Diagram components: Application Master, Controller (partition and deploy topology), Node Managers, Containers]
Data communication:
• node-level
• group-level
Control communication based on Thrift
Data communication based on Netty
SRA-SV | Cloud Research Lab Slide 14
Execution Engine
• Execution Engine (Deep Neural Net)
– Layer-by-layer training controlled by the Execution Engine
– Progress reporting
– Process control: the end user can control the training process, and even restart it from a certain point
– System snapshot for fault tolerance
[Figure: layer stack: Input, RBM, RBM, Softmax; fully connected]
• Generic Execution Engine
– Abstracts the common design patterns from our experience developing deep neural net algorithms
– Generalized to support various other algorithms
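Layer-by-layer training with progress reporting, snapshots, and restart from a certain point could look roughly like the sketch below. Everything in it is an assumption made for illustration: the `ExecutionEngine` class shape, the JSON snapshot file, the layer names, and the `fake_train` stand-in, which simply doubles its input and fails once to simulate a crash.

```python
import json
import os
import tempfile

class ExecutionEngine:
    def __init__(self, layer_names, train_layer, snapshot_path):
        self.layer_names = layer_names
        self.train_layer = train_layer  # hypothetical callback: (name, data) -> output
        self.snapshot_path = snapshot_path

    def run(self, data, resume=False):
        done, current = [], data
        if resume and os.path.exists(self.snapshot_path):
            with open(self.snapshot_path) as f:
                snapshot = json.load(f)
            done, current = snapshot["done"], snapshot["output"]
        for name in self.layer_names[len(done):]:
            current = self.train_layer(name, current)
            done.append(name)
            with open(self.snapshot_path, "w") as f:  # snapshot after every layer
                json.dump({"done": done, "output": current}, f)
            print(f"progress: {len(done)}/{len(self.layer_names)} layers ({name})")
        return current

calls = []

def fake_train(name, xs):
    """Stand-in for real layer training; fails the first time 'softmax' runs."""
    calls.append(name)
    if name == "softmax" and calls.count("softmax") == 1:
        raise RuntimeError("simulated container failure")
    return [x * 2 for x in xs]

path = os.path.join(tempfile.mkdtemp(), "snapshot.json")
engine = ExecutionEngine(["rbm1", "rbm2", "softmax"], fake_train, path)
try:
    result = engine.run([1, 2])
except RuntimeError:
    result = engine.run([1, 2], resume=True)  # picks up again at 'softmax'
```

On resume, the two RBM layers are not retrained; the engine reloads their output from the snapshot and continues with the failed layer.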
SRA-SV | Cloud Research Lab Slide 15
Model-parallel is still not scalable enough over Big Data
SRA-SV | Cloud Research Lab Slide 16
Deep Learning Platform: Hybrid of Data-parallelism and Model-parallelism
[Diagram: data chunks feed multiple model-parallel instances, which coordinate parameters through Parameter Server 1 ... Parameter Server n]
Data-parallelism
• Lots of model instances
• Parameter servers help models learn from each other
SRA-SV | Cloud Research Lab Slide 17
Distributed Parameter Servers
[Diagram: three clients connect through a Netty communication layer to Server 1, Server 2, and Server 3; each server has an in-memory cache/storage, with HBase/HDFS underneath]
Client operations: Pull/Push/Sync
Currently we support asynchronous parameter pull and push
A synchronized version is also supported
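A minimal in-process sketch of asynchronous pull and push against sharded parameter servers is shown below. It only illustrates the pattern and is not the system on the slide: threads stand in for clients, plain dictionaries stand in for the in-memory stores, there is no Netty or HBase/HDFS layer, and the class names, CRC-based sharding, learning rate, and toy regression data (`y = 2x + 1`) are all assumptions.

```python
import threading
import zlib

class ParameterServer:
    """One shard: an in-memory key-value store with pull and push."""
    def __init__(self):
        self._params = {}
        self._lock = threading.Lock()

    def pull(self, key):
        with self._lock:
            return self._params.setdefault(key, 0.0)

    def push(self, key, grad, lr):
        with self._lock:
            self._params[key] = self._params.get(key, 0.0) - lr * grad

class Client:
    """Routes each parameter key to a shard using a stable hash."""
    def __init__(self, servers):
        self.servers = servers

    def _shard(self, key):
        return self.servers[zlib.crc32(key.encode()) % len(self.servers)]

    def pull(self, key):
        return self._shard(key).pull(key)

    def push(self, key, grad, lr):
        self._shard(key).push(key, grad, lr)

def worker(client, chunk, steps, lr=0.05):
    """One model instance training on its own data chunk without waiting for others."""
    for step in range(steps):
        x, y = chunk[step % len(chunk)]
        w, b = client.pull("w"), client.pull("b")  # values may already be stale
        err = w * x + b - y
        client.push("w", err * x, lr)
        client.push("b", err, lr)

servers = [ParameterServer() for _ in range(3)]
client = Client(servers)
data = [(x, 2 * x + 1) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]
chunks = [data[0::2], data[1::2]]  # one data chunk per worker
threads = [threading.Thread(target=worker, args=(client, c, 2000)) for c in chunks]
for t in threads:
    t.start()
for t in threads:
    t.join()
w, b = client.pull("w"), client.pull("b")
```

Neither worker ever blocks on the other, yet both end up refining the same shared `w` and `b`. A synchronized variant would add a barrier so that all workers push before anyone pulls the next version.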
SRA-SV | Cloud Research Lab Slide 18
Deep Learning Algorithms
We target three major application fields: speech recognition, image
processing, and NLP
What we have developed Our Roadmap
Feed Forward Neural Network
Restricted Boltzmann Machine
Deep Belief Network
Sparse Auto-encoder
Convolutional Neural Network
Recurrent Neural Network
SRA-SV | Cloud Research Lab Slide 19
Summary
• We are providing our Hadoop-based data platform
– Hundreds of machines, petabytes of storage
– Hadoop ecosystem (MapReduce, HBase, Yarn, HDFS, Zookeeper, Oozie, Lipstick, Mahout etc.)
– In-house ETL pipeline
– In-house unified web portal with SSO
• We are working hard on big learning to make our platform intelligent
– Large-scale graph-based machine learning
– Large-scale deep learning
– And many more in progress
Q&A
