Data platform at Samsung (Big Learning)

SRA-SV | Cloud Research Lab
Guangdeng Liao
Zhan Zhang
Samsung Cloud Research Lab
Data Platform at Samsung

Our Mission: provide scalable, reliable, and secure storage and
computation for Samsung R&D
Samsung Data Platform
Resources:
• Hundreds of machines
• Petabytes of storage
• And still growing

What we have in our platform
Offline
• Distributed MR processing
• Data warehousing with Hive/Pig
• In-house web-based ETL portal
• Many more..
Online
• K-V store: HBase
• In-house blob store
• Online processing with Storm
• Many more..
Dev. & management tools
• Apache Mahout
• ElasticSearch
• In-house unified web portal
• In-house Single Sign-On
• Visualization
• Many more..
With this platform, we have already significantly improved ETL, data
management, and data processing for other teams!

So, are we done?
No. Many more complex challenges.

Challenge #1: How to build scalable and efficient machine
learning over Big Data?

MR-based Mahout is good but...
Not good at expressing data dependencies and iterative algorithms such as PageRank:
Map: distribute rank to link targets
Reduce: collect ranks from multiple sources
Iterate








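PageRank of page x, where t_1, ..., t_n are the pages linking to x, C(t_i) is the number of outgoing links of t_i, N is the total number of pages, and \alpha is the random-jump probability:
$$PR(x) = \alpha \frac{1}{N} + (1 - \alpha) \sum_{i=1}^{n} \frac{PR(t_i)}{C(t_i)}$$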
One MR job per iteration
Each job pays a startup penalty and an I/O penalty
Unfortunately, many machine learning and data mining (MLDM) algorithms are iterative
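
To make the Map / Reduce / Iterate pattern concrete, here is a minimal single-process sketch in Python. It is not Mahout or Hadoop code. The function names, the toy graph, the random-jump probability `alpha`, and the iteration count are assumptions chosen for illustration, and the sketch assumes every page has at least one outgoing link.

```python
from collections import defaultdict

def map_phase(graph, ranks):
    """Map: each page distributes its rank evenly to its link targets."""
    for page, targets in graph.items():
        for target in targets:
            yield target, ranks[page] / len(targets)

def reduce_phase(pairs, pages, alpha):
    """Reduce: collect ranks from all sources, then apply the random jump."""
    incoming = defaultdict(float)
    for target, share in pairs:
        incoming[target] += share
    n = len(pages)
    return {p: alpha / n + (1 - alpha) * incoming[p] for p in pages}

def pagerank(graph, alpha=0.15, iterations=30):
    pages = list(graph)
    ranks = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        # On a MapReduce cluster each pass would be a separate job that
        # re-reads and re-writes the whole rank table.
        ranks = reduce_phase(map_phase(graph, ranks), pages, alpha)
    return ranks

# Hypothetical toy graph: page -> pages it links to.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(toy_graph)
```

The loop in `pagerank` is where the cost comes from on a real cluster: every iteration is a full job with its own startup and its own round trip through storage.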

Graph naturally represents data dependency

Graph-based Processing: Think like a Vertex
In-memory data graph over a cluster
Vertex abstraction
– Think like a vertex
– In-memory processing
Communication
– Message-based
– Shared-memory-based
Execution engine
– Bulk synchronous parallel
– Asynchronous parallel
Scheduling
Popular frameworks:
– Giraph
– GraphLab
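
A minimal sketch of the "think like a vertex" model with message-based communication and a bulk synchronous parallel schedule, written as plain single-process Python. It does not use the Giraph or GraphLab APIs. `run_supersteps`, `make_pagerank_vertex`, the toy graph, and `alpha` are illustrative assumptions.

```python
def run_supersteps(graph, compute, initial, supersteps):
    """Run a vertex program under a bulk synchronous parallel schedule."""
    values = dict(initial)              # the in-memory data graph state
    inbox = {v: [] for v in graph}
    for step in range(supersteps):
        outbox = {v: [] for v in graph}
        for v, out_edges in graph.items():
            # A vertex sees only its own value, its inbox, and its out-edges.
            values[v], outgoing = compute(step, values[v], inbox[v], out_edges)
            for target, message in outgoing:
                outbox[target].append(message)
        inbox = outbox  # barrier: messages are delivered in the next superstep
    return values

def make_pagerank_vertex(n, alpha=0.15):
    """Build a PageRank vertex program for a graph with n vertices."""
    def compute(step, value, messages, out_edges):
        if step > 0:  # superstep 0 only sends the initial rank
            value = alpha / n + (1 - alpha) * sum(messages)
        share = value / len(out_edges)
        return value, [(target, share) for target in out_edges]
    return compute

# Hypothetical toy graph: vertex -> out-edges.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
start = {v: 1.0 / len(toy_graph) for v in toy_graph}
bsp_ranks = run_supersteps(
    toy_graph, make_pagerank_vertex(len(toy_graph)), start, supersteps=31
)
```

The graph and the vertex values stay in memory across supersteps, so an iteration costs one round of messages rather than a new job.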

Graph-based Machine Learning
We used Apache Giraph 1.0 and developed a machine learning library on top of it:
Recommendation
• Alternating Least Squares (ALS)
• Weighted ALS
• SGD (Matrix Factorization)
• Biased SGD
Graphical Model
• Belief Propagation
Clustering
• KMeans
• KMeans++
• Fuzzy Clustering
We see an order-of-magnitude speedup compared to the MR-based approach
on our cluster
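
As one illustration of why these algorithms map onto a graph, here is a minimal single-process sketch of SGD matrix factorization, where each rating is an edge between a user vertex and an item vertex and each update touches only the two endpoints. This is not the library described above. The function names, hyperparameters, and ratings are made up for illustration.

```python
import random

def predict(P, Q, user, item):
    """Dot product of the user and item factor vectors."""
    return sum(a * b for a, b in zip(P[user], Q[item]))

def rmse(P, Q, ratings):
    total = sum((r - predict(P, Q, u, i)) ** 2 for u, i, r in ratings)
    return (total / len(ratings)) ** 0.5

def init_factors(ratings, rank, seed=0):
    """Small random factor vectors for every user and item vertex."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    P = {u: [rng.uniform(-0.1, 0.1) for _ in range(rank)] for u in users}
    Q = {i: [rng.uniform(-0.1, 0.1) for _ in range(rank)] for i in items}
    return P, Q

def sgd_epoch(P, Q, ratings, lr=0.05, reg=0.02):
    """One pass over the rating edges, updating both endpoints of each edge."""
    for u, i, r in ratings:
        err = r - predict(P, Q, u, i)
        for k in range(len(P[u])):
            pu, qi = P[u][k], Q[i][k]
            P[u][k] += lr * (err * qi - reg * pu)
            Q[i][k] += lr * (err * pu - reg * qi)

# Hypothetical ratings: (user, item, rating).
ratings = [("u1", "i1", 5), ("u1", "i2", 3), ("u2", "i1", 4),
           ("u2", "i3", 1), ("u3", "i2", 2), ("u3", "i3", 5)]
P, Q = init_factors(ratings, rank=2)
rmse_before = rmse(P, Q, ratings)
for _ in range(200):
    sgd_epoch(P, Q, ratings)
rmse_after = rmse(P, Q, ratings)
```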

Challenge #2: How to make Big Model + Big Data like Deep
Learning scalable and efficient?

One example: Deep Learning [1]
Many more examples (millions to billions of parameters) in Speech
Recognition, Image Processing, and NLP
[1] ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012

Model-Parallel Framework
Pipeline:
• User-defined model
• Auto-generation of model topology
• Auto-partition of topology over the cluster (partitions c1, c2, c3)
• Auto-deployment of topology (in-memory)
Programming model:
• Neuron-like programming
• Message-based communication
• Message-driven computation
Parallelize a big machine learning model over a cluster
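
A minimal sketch of the idea, assuming a single fully connected layer whose neurons are split round-robin over three workers. The `Worker` class, the partitioning rule, and the weights are hypothetical. The real framework's topology generation and deployment are not shown in the slides, so this only illustrates partitioned, message-driven computation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class Worker:
    """Holds one partition of the layer: the neurons assigned to this worker."""
    def __init__(self, name):
        self.name = name
        self.neurons = {}  # neuron id -> (weights, bias)

    def on_message(self, activations):
        """Message-driven computation: fire local neurons when inputs arrive."""
        return {
            nid: sigmoid(sum(w * a for w, a in zip(weights, activations)) + bias)
            for nid, (weights, bias) in self.neurons.items()
        }

def partition_layer(layer, workers):
    """Assign neurons to workers round-robin (an assumed partitioning rule)."""
    for nid, params in enumerate(layer):
        workers[nid % len(workers)].neurons[nid] = params

def forward(workers, activations, width):
    """Send the input to every worker and gather the partial outputs."""
    output = [0.0] * width
    for worker in workers:
        for nid, value in worker.on_message(activations).items():
            output[nid] = value
    return output

# Hypothetical layer: four neurons with two inputs each.
layer = [([0.5, -0.2], 0.1), ([0.3, 0.8], -0.4),
         ([-0.6, 0.1], 0.0), ([0.9, 0.9], 0.2)]
workers = [Worker("c1"), Worker("c2"), Worker("c3")]
partition_layer(layer, workers)
inputs = [1.0, 2.0]
partitioned = forward(workers, inputs, len(layer))
single_machine = [sigmoid(sum(w * a for w, a in zip(ws, inputs)) + b)
                  for ws, b in layer]
```

No worker holds the whole layer, yet the gathered output matches what a single machine would compute.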

Architecture over Yarn
Components:
• Application Master
• Controller: partitions and deploys the topology
• Node Managers, each hosting Containers
Communication:
• Data communication (node-level and group-level), based on Netty
• Control communication, based on Thrift

Execution Engine
• Execution Engine (Deep Neural Net)
– Layer-by-layer training, controlled by the Execution Engine
– Progress reporting
– Process control: the end user can control the training process, and even
restart it from a certain point
– System snapshots for fault tolerance
– Example network (figure): Input, RBM, RBM, Softmax; fully connected
• Generic Execution Engine
– Abstracts the common design patterns from our experience developing
deep neural net algorithms
– Generalized to support various other algorithms
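
A minimal sketch of layer-by-layer execution with progress reporting, snapshots, and restart from a chosen point. The `ExecutionEngine` class here, its methods, and the stage names are hypothetical and only illustrate the control flow. The "training" functions are placeholders that record which stage ran.

```python
import copy

class ExecutionEngine:
    """Runs training stages in order and snapshots the state before each one."""
    def __init__(self, stages):
        self.stages = stages    # ordered list of (name, train_fn)
        self.snapshots = {}     # stage index -> state before that stage
        self.progress = []      # names of completed stages, in order

    def run(self, state, start=0):
        for index in range(start, len(self.stages)):
            name, train = self.stages[index]
            self.snapshots[index] = copy.deepcopy(state)  # for fault tolerance
            state = train(state)
            self.progress.append(name)                    # progress reporting
        return state

    def restart_from(self, index):
        """Resume training from the snapshot taken before stage `index`."""
        return self.run(copy.deepcopy(self.snapshots[index]), start=index)

def make_stage(name):
    """Placeholder for real layer training: just records the trained layer."""
    def train(state):
        return state + [name]
    return name, train

engine = ExecutionEngine([make_stage("rbm1"), make_stage("rbm2"),
                          make_stage("softmax")])
trained = engine.run([])
retrained = engine.restart_from(1)
```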

Model parallelism alone is still not scalable enough for Big Data

Deep Learning Platform: Hybrid of Data-parallelism and Model-parallelism
Data-parallelism
• Lots of model instances, each model-parallel and trained on its own data chunk
• Parameter servers (1 to n) coordinate parameters and help the model
instances learn from each other

Distributed Parameter Servers
Architecture:
• Clients: Pull/Push/Sync parameters
• Netty communication layer
• Servers 1 to 3, each with in-memory cache/storage
• HBase/HDFS
Currently we support asynchronous parameter pulls and pushes
A synchronized version is also supported
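
A minimal single-process sketch of the pull/push pattern with parameters sharded over several servers and two model replicas training on different data chunks. All class names, the CRC-based sharding rule, the learning rate, and the toy linear model are assumptions for illustration. The replicas take turns without a barrier to imitate asynchronous updates. There is no real networking, HBase, or HDFS involved.

```python
import zlib

class ParameterServer:
    """One shard of the parameters, kept in an in-memory store."""
    def __init__(self):
        self.store = {}

    def pull(self, key):
        return self.store.get(key, 0.0)

    def push(self, key, gradient, lr):
        self.store[key] = self.store.get(key, 0.0) - lr * gradient

class ShardedClient:
    """Routes each parameter key to the server that owns it."""
    def __init__(self, servers):
        self.servers = servers

    def _owner(self, key):
        return self.servers[zlib.crc32(key.encode()) % len(self.servers)]

    def pull(self, keys):
        return {k: self._owner(k).pull(k) for k in keys}

    def push(self, gradients, lr):
        for key, grad in gradients.items():
            self._owner(key).push(key, grad, lr)

def gradient(params, chunk):
    """Gradient of the mean squared error of y = w*x + b on one data chunk."""
    gw = gb = 0.0
    for x, y in chunk:
        err = params["w"] * x + params["b"] - y
        gw += 2 * err * x / len(chunk)
        gb += 2 * err / len(chunk)
    return {"w": gw, "b": gb}

servers = [ParameterServer() for _ in range(3)]
replicas = [ShardedClient(servers) for _ in range(2)]
chunks = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0), (3.0, 7.0)]]  # y = 2x + 1

for _ in range(2000):
    # No barrier between replicas: each pulls whatever is current,
    # computes on its own chunk, and pushes its gradient.
    for replica, chunk in zip(replicas, chunks):
        params = replica.pull(["w", "b"])
        replica.push(gradient(params, chunk), lr=0.02)

learned = replicas[0].pull(["w", "b"])
```

Each replica only ever sees its own chunk, but both converge on the same parameters because they share them through the servers.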

Deep Learning Algorithms
Target three major application fields: speech recognition, image
processing, and NLP
What we have developed and our roadmap:
Feed Forward Neural Network
Restricted Boltzmann Machine
Deep Belief Network
Sparse Auto-encoder
Convolutional Neural Network
Recurrent Neural Network

Summary
• We provide a Hadoop-based data platform
– Hundreds of machines, petabytes of storage
– Hadoop ecosystem (MapReduce, HBase, Yarn, HDFS, Zookeeper, Oozie, Lipstick, Mahout, etc.)
– In-house ETL pipeline
– In-house unified web portal with SSO
• We are working hard on big learning to make our platform intelligent
– Large-scale graph-based machine learning
– Large-scale deep learning
– And many more in progress
