Machine Intelligent Cluster:
The next generation of data center
Evan Lin @Linker Networks
About me
Cloud Architect @ Linker Networks
Golang User Group - Co-Organizer
Top 5 Taiwan Golang open source contributor (GitHub award)
Developer, Curator, Blogger
Recap Cloud Summit 2016
Agenda
• Problems in the data center
• How machine learning helps
• Machine Intelligent Cluster
• Applications
• Q&A
Data center
Efficiency
• Power consumption
• Low usage
• Unpredictable peak
• Noisy neighbors
Risk
• Physical damage
• Networking problem
• Anomaly
• Attack
Real data center
Power consumption
Low usage and Unpredictable peak
Noisy neighbor
Use machine learning to improve DC power consumption
None of your business?
Modern Data center: Machine Cluster
Before machine cluster
DB Master:
IP: 192.168.1.222
DB Slave:
IP: 192.168.1.223
Web Server 1:
IP: 192.168.1.101
Web Server 2:
IP: 192.168.1.102
Web Server 3:
IP: 192.168.1.103
Load Balancer:
IP: 1.2.3.4
Container orchestration
Resource arrangement
Scalability
Portability
Automation migration
Resource management
3 Web App Servers
2 DB Servers
1 Load Balancer
Scalability
Automation migration
But we need something better
No prediction
How do we define the scale-out threshold?
50 %?
75 %?
25 %?
Machine Intelligent Cluster
Efficiency
Maximize Utilization
Operation Optimization
Accident
Risk Mitigation
Serviceability Management
Machine Intelligence Cluster
How MIC helps
Operation Optimization
1. Reinforcement learning
2. Adjust thermostat
3. Check the reward (CPU performance).
[1]: Source: https://goo.gl/ly3zyX
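The three steps above can be sketched as a small learning loop. This is only an illustration, not the system described in the talk: it uses an epsilon-greedy bandit (one of the simplest forms of reinforcement learning), and the setpoint values, the `simulated_reward` function, and the assumption that performance peaks near 22 °C are all hypothetical stand-ins for real monitoring data.

```python
import random

# Hypothetical thermostat setpoints in degrees Celsius (illustrative values).
SETPOINTS = [18, 20, 22, 24, 26]


def simulated_reward(setpoint, rng):
    """Toy stand-in for measured CPU performance at a given setpoint.

    Assumption: performance peaks near 22 C and is noisy. A real system
    would read this value from monitoring instead.
    """
    return 100 - 2.0 * (setpoint - 22) ** 2 + rng.gauss(0, 1.0)


def train(episodes=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy loop: adjust the thermostat, then check the reward."""
    rng = random.Random(seed)
    value = {s: 0.0 for s in SETPOINTS}  # running average reward per setpoint
    count = {s: 0 for s in SETPOINTS}
    for _ in range(episodes):
        # Explore occasionally, otherwise exploit the best-known setpoint.
        if rng.random() < epsilon:
            action = rng.choice(SETPOINTS)
        else:
            action = max(SETPOINTS, key=lambda s: value[s])
        reward = simulated_reward(action, rng)  # step 3: check the reward
        count[action] += 1
        # Incremental update of the average reward for this setpoint.
        value[action] += (reward - value[action]) / count[action]
    return value


values = train()
best_setpoint = max(values, key=values.get)
```

In this toy environment the loop settles on the setpoint with the highest average reward. A production controller would also need safety limits on how far the thermostat may move.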
Maximize Utilization
Analyze utilization and reduce the number of working machines to save our customers' budget
- Predict utilization trend
- Provide auto-scaling threshold adjustment
Prediction and dynamic threshold
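One way to picture trend prediction and a dynamic threshold is the sketch below. It is an assumption-laden example, not the model used in MIC: it fits a least-squares line to recent utilization samples, and the names `forecast`, `dynamic_threshold`, `base`, `lead_steps`, and `floor` are hypothetical. The idea is that when utilization is rising, the scale-out threshold is lowered by the growth expected while a new machine starts up.

```python
def linear_trend(samples):
    """Least-squares slope and intercept for evenly spaced samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def forecast(samples, steps_ahead):
    """Extrapolate the fitted line steps_ahead samples past the last one."""
    slope, intercept = linear_trend(samples)
    return slope * (len(samples) - 1 + steps_ahead) + intercept


def dynamic_threshold(samples, base=75.0, lead_steps=5, floor=25.0):
    """Lower the scale-out threshold by the growth expected in lead_steps.

    base, lead_steps and floor are illustrative defaults, not values
    taken from the talk.
    """
    slope, _ = linear_trend(samples)
    expected_growth = max(0.0, slope * lead_steps)
    return max(floor, base - expected_growth)


rising = [40, 42, 44, 46, 48]   # utilization in percent, one sample per step
flat = [50, 50, 50, 50, 50]
rising_threshold = dynamic_threshold(rising)
flat_threshold = dynamic_threshold(flat)
```

With a flat load the threshold stays at `base`. With a rising load it drops, so scale-out starts earlier instead of waiting for a fixed 50% or 75% mark.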
Optimized Scheduler
[Figure: two placements of the same workloads across Node 1, Node 2, and Node 3]
Workloads shown: Nginx (CPU 30%), Apache (CPU 30%), Backend Process (CPU 35%), NodeJS (CPU 7%), Go backend (CPU 8%), DB-MySQL (IO 25%), DB-Mongo (IO 30%), DB-Oracle (IO 35%)
Maximize Utilization
P.S. We do not rearrange running processes; we change the scheduler so that poor placement does not happen in the first place.
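A resource-aware placement decision of this kind could look like the following sketch. It is not the MIC scheduler: it is a hypothetical greedy heuristic that places each workload on the node least loaded on that workload's dominant resource, using the CPU and IO percentages shown on the slide. The `Node` class and `schedule` function are names invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    cpu: int = 0   # accumulated CPU load in percent
    io: int = 0    # accumulated IO load in percent
    workloads: list = field(default_factory=list)


def schedule(workloads, node_names):
    """Greedy placement decided at scheduling time (no later migration)."""
    nodes = [Node(n) for n in node_names]
    # Place the heaviest workloads first, a common greedy heuristic.
    for name, resource, pct in sorted(workloads, key=lambda w: -w[2]):
        # Pick the node currently least loaded on this workload's resource.
        target = min(nodes, key=lambda n: getattr(n, resource))
        setattr(target, resource, getattr(target, resource) + pct)
        target.workloads.append(name)
    return nodes


# Workloads and percentages as listed on the slide.
WORKLOADS = [
    ("Nginx", "cpu", 30),
    ("Apache", "cpu", 30),
    ("Backend Process", "cpu", 35),
    ("NodeJS", "cpu", 7),
    ("Go backend", "cpu", 8),
    ("DB-MySQL", "io", 25),
    ("DB-Mongo", "io", 30),
    ("DB-Oracle", "io", 35),
]

placed = schedule(WORKLOADS, ["Node 1", "Node 2", "Node 3"])
```

Because CPU-bound and IO-bound workloads compete for different resources, this heuristic ends up mixing them on each node rather than stacking all web servers on one machine and all databases on another.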
Model 1
Serial Number Prediction
S.M.A.R.T. RNN Prediction
Serviceability Management (cont.)
Model 2
Dummy VM Detection Outlier Attack Detection
Mitigate risk
Storage SDN
Zombie Tagging system
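The slides name outlier detection without giving the method, so the sketch below swaps in a simple, well-known technique as an assumption: the modified z-score based on the median absolute deviation (MAD). The function name `find_outliers`, the threshold of 3.5, and the sample metrics are all hypothetical. The same check can flag an idle ("zombie") VM whose CPU usage is far below its peers, or a traffic spike that might indicate an attack.

```python
import statistics


def find_outliers(values, threshold=3.5):
    """Return indices whose modified z-score exceeds the threshold.

    Uses the median and the median absolute deviation (MAD), which are
    less distorted by the outliers themselves than mean and stdev.
    """
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return []  # all values (nearly) identical, nothing to flag
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - median) / mad > threshold
    ]


# Hypothetical per-VM CPU utilization in percent: index 3 is almost idle.
vm_cpu = [35, 40, 38, 1, 42, 37, 39]
idle_vms = find_outliers(vm_cpu)

# Hypothetical requests per second: index 6 is a sudden spike.
requests_per_sec = [120, 118, 125, 122, 119, 121, 900, 117]
spikes = find_outliers(requests_per_sec)
```

A flagged index is only a candidate. Tagging a VM as a zombie or treating a spike as an attack would still need further evidence.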
Architecture
Cloud Native Architecture
[Figure: layered architecture diagram]
Infrastructure: HPC (with GPU), Server, Storage, SDN
Layers: Data Collect (Probe & Sensor & Smart GW), Data Process, Data Analysis & Machine Learning, Visualization
Platform: DCOS / Kubernetes
Components: Cassandra (Storage), Kafka (Queueing), Go/Akka (Connector), Spark (ETL/Streaming), Spark ML, TensorFlow, Scikit Learn, R, D3.js, Interactive Dashboard, Jupyter Notebook, Zeppelin, ML Job Scheduler (Chronos)
MIC System Architecture
[Figure: data flow diagram]
Components: Data Agent, Kafka, Spark Streaming, Cassandra, Spark ML (Classification, Clustering), TensorFlow (Deep Learning), Spark ML Predict, TensorFlow Predict, Backend Server, API, Portal
MIC Data Flow
Applications on MIC
Machine Intelligent Cluster
IOT Gaming 5G NFV E-Commerce
Machine Intelligent Cluster Summary
• Machine cluster with intelligence
• Features
• Self-Optimization
• Self-Learning
• Self-Recovery
• Green, Secure and Predictive machine cluster
Subscribe to CodeTengu (碼天狗) Weekly
http://weekly.codetengu.com/
Thank You

iThome Cloud Summit: The next generation of data center: Machine Intelligent Cluster