14. Copyright@2017 Samsung Electronics Co., Ltd. All Rights Reserved.
Why Real-Time Service System?
User Experience
Mobile
Personalized
Differentiation
A busy service deployment
● End to end response time
○ <1000ms
● Platform response time
○ < 100ms
● Total number of users
● Concurrent users per region
● Log volume per day
○ Over 100 TB per day
● Personal model creation time
○ < 30s
● Personal model update time
○ < 1s
● Number of models
○ > 10x10x10
● Production deadline
○ < 30 days
● Farm
○ Sandbox/Staging/Beta/Production
○ US, EU, CN, KR
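The numeric targets on this slide can be written down as a small table and checked against measurements. The following is only a sketch: the metric names and the shape of the measurement dictionary are hypothetical, and only the limits themselves come from the slide.

```python
# Upper bounds from the slide, in milliseconds. The key names are hypothetical.
TARGETS_MS = {
    "end_to_end_response": 1000,
    "platform_response": 100,
    "personal_model_creation": 30_000,
    "personal_model_update": 1_000,
}

# The slide gives the model count as "> 10x10x10", i.e. more than 1000 models.
MIN_MODEL_COUNT = 10 * 10 * 10


def violations(measured_ms):
    """Return the names of metrics whose measured value is not below its target."""
    return sorted(
        name
        for name, limit in TARGETS_MS.items()
        if name in measured_ms and measured_ms[name] >= limit
    )


if __name__ == "__main__":
    # Made-up sample measurements, for illustration only.
    print(violations({"end_to_end_response": 800, "platform_response": 120}))
```

A check like this could run in each farm (Sandbox, Staging, Beta, Production) so that a regression is caught before it reaches users.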
Performance goals at Scale
● Production == Benchmark
○ Environment should be capable of handling required workload
○ Strict SLA: Near 100% QoS, HA
○ Scalable architecture: the number of users who will connect at launch time cannot be estimated in advance
○ Low cost: more users and more user data per machine
○ Automation from deployment to monitoring
○ Highly secure
Time vs Demand
[Figure: demand over time for Application A and Application B. Axes: Time (x, with the launch point marked), Demand (y).]
The other non-functional constraints
● ML engineers don't care about scalability, cost, operation, maintenance, or software version control
● Abrupt requirement change
● Deadline
● Limited operational resources
● Debugging environment
Troubleshooting #Automation (1/2)
● Problem: Automated deployment, HA, Failover
● Solution
○ Autoscaling and Farm/Role based deployment system over multi-region (http://scalr.com)
○ Standardized environments let applications become production ready in a more resource-efficient way.
○ Roles in Scalr are abstract (because they do not include all the information needed to launch
Servers) and reusable (because you can use them multiple times, and share them)
infrastructure components.
○ For example, you could have an "ubuntu-1404-ansible" Role, with Images in all regions of
AWS and in your on-premise CloudStack cloud, Orchestration to install and run Ansible at
startup, and Global Variables to configure Ansible.
Troubleshooting #Automation (2/2)
● Problem: Automated deployment, HA, Failover
● Solution
○ Farms are logical groupings of configured Roles (Farm Roles) that are built within the
Environment scope. They represent infrastructure topologies, and may be launched to
provision actual cloud instances.
○ For example, you could have a 3-Tier Web Application Farm, composed of a Load Balancer
Farm Role, an Application Farm Role, and a Database Farm Role.
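The relationship between Roles, Farm Roles, and Farms can be pictured with a small data model. This is a sketch for illustration only and is not the Scalr API: the class names, fields, and the `launch_plan` helper are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Role:
    """Abstract and reusable: carries no instance counts or placement."""
    name: str


@dataclass
class FarmRole:
    """A Role configured for use inside one Farm (hypothetical fields)."""
    role: Role
    region: str
    min_instances: int
    max_instances: int


@dataclass
class Farm:
    """A logical grouping of configured Roles that describes a topology."""
    name: str
    farm_roles: list = field(default_factory=list)

    def add(self, role, region, min_instances, max_instances):
        self.farm_roles.append(FarmRole(role, region, min_instances, max_instances))
        return self

    def launch_plan(self):
        """Instances to provision at launch: the minimum for each Farm Role."""
        return [(fr.role.name, fr.region, fr.min_instances) for fr in self.farm_roles]


def three_tier_farm(region):
    """Build the 3-tier example from the slide with made-up instance counts."""
    return (
        Farm("3-tier-web-app")
        .add(Role("load-balancer"), region, 1, 2)
        .add(Role("application"), region, 2, 10)
        .add(Role("database"), region, 1, 1)
    )
```

Because a Role holds no launch details, the same Role object can be reused in farms for several regions, which mirrors the "abstract and reusable" description on the previous slide.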
Troubleshooting #Bi-directional Transport & Security
● Problem: Bi-directional communication, Binary Format, Reliable
communication, Secure channel, Compression support, License
● Solution
○ gRPC (not available at the time)
○ Plain TLS rather than HTTP
○ Candidates: Thrift, Protobuf, RCF, ICE
○ Modified version of its software
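To make the requirements above concrete (binary format, compression support, a byte stream carried over TLS rather than HTTP), here is a minimal length-prefixed framing sketch. The frame layout (4-byte length, 1-byte flags) is an assumption made for illustration and is not the wire format of the deck's system or of any of the candidates listed.

```python
import struct
import zlib

# Hypothetical frame header: big-endian payload length, then one flags byte.
HEADER = struct.Struct(">IB")
FLAG_COMPRESSED = 0x01


def encode_frame(payload, compress=True):
    """Wrap a binary payload in a frame, compressing only when it saves space."""
    flags = 0
    body = payload
    if compress:
        packed = zlib.compress(payload)
        if len(packed) < len(payload):
            body, flags = packed, FLAG_COMPRESSED
    return HEADER.pack(len(body), flags) + body


def decode_frames(buffer):
    """Split a byte buffer into complete messages plus the unconsumed tail."""
    messages = []
    offset = 0
    while len(buffer) - offset >= HEADER.size:
        length, flags = HEADER.unpack_from(buffer, offset)
        end = offset + HEADER.size + length
        if end > len(buffer):
            break  # incomplete frame: wait for more bytes
        body = buffer[offset + HEADER.size:end]
        if flags & FLAG_COMPRESSED:
            body = zlib.decompress(body)
        messages.append(body)
        offset = end
    return messages, buffer[offset:]
```

In use, either peer would write frames to a TLS-wrapped socket (for example one created with Python's `ssl` module) whenever it has something to send, which is what makes the channel bi-directional; the tail returned by `decode_frames` is kept and prepended to the next read.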
Troubleshooting #Latency (Device-F/E)
● Problem: Latency is too high when TLS is terminated
● Situation: User in Korea, service in the US
● Solution
○ Profiling
○ Optimize the TLS handshake time
■ Run TLS termination proxy server in Seoul
■ Network traffic between AWS regions traverses the AWS global network backbone by
default
○ Secure fabric between our VPCs in the US and Korea regions, such as SteelConnect from AWS
Marketplace
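The benefit of terminating TLS near the user can be estimated with simple round-trip arithmetic. The sketch below assumes one round trip for the TCP handshake and two for a full TLS 1.2 handshake, and assumes the Seoul proxy keeps already-established connections to the US service; the function names and all round-trip times are hypothetical.

```python
def connection_setup_ms(rtt_ms, tcp_round_trips=1, tls_round_trips=2):
    """Time spent on handshakes before the first request byte can be sent."""
    return rtt_ms * (tcp_round_trips + tls_round_trips)


def first_response_ms(client_to_terminator_rtt_ms,
                      terminator_to_service_rtt_ms,
                      service_time_ms=0.0):
    """Handshake with the TLS terminator, then one request/response exchange.

    Assumes the terminator-to-service leg reuses a warm connection, so it
    costs one round trip and no extra handshake.
    """
    setup = connection_setup_ms(client_to_terminator_rtt_ms)
    request = (client_to_terminator_rtt_ms
               + terminator_to_service_rtt_ms
               + service_time_ms)
    return setup + request


if __name__ == "__main__":
    # Hypothetical round-trip times: 180 ms Korea to US, 10 ms within Seoul.
    print(first_response_ms(180, 0))    # TLS terminated in the US
    print(first_response_ms(10, 180))   # TLS terminated in Seoul
```

With these made-up numbers the first response takes 720 ms when TLS is terminated in the US and 220 ms with a Seoul terminator, because the handshake round trips now cross the short path only.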
Troubleshooting #Latency (Server side)
● Problem: End-to-end latency of the pure platform must be under 100 ms
● Situation: The lower, the better.
● Solution
○ RPC framework on the server side
○ Placement group guarantees 10 Gbps within a data center
○ SR-IOV enhances the linearity of IO performance
Troubleshooting #GPU Instance with high memory
● Problem: GPU instances with high memory are needed
● Situation: a tight deadline to build infrastructure
● Solution
○ Visiting, direct message, calling
○ Use P2 instances
○ P2 instances can be deployed in Korea and China if we guarantee the number of instances
Troubleshooting #Performance Considerations for P2
● Problem : Performance
● Solution :
○ Use NVIDIA Driver 352.99
○ Use ENA and Placement Groups
○ Set all GPU clock speeds to their maximum frequency
○ Set up driver persistence mode
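The last two bullets map onto a few `nvidia-smi` invocations. The sketch below only assembles the commands and prints them by default. The clock pair 2505,875 (memory, graphics, in MHz) is the value AWS has documented for P2 instances and should be confirmed with `nvidia-smi -q -d SUPPORTED_CLOCKS`; disabling auto boost is an added assumption that the slide does not mention; the function names are hypothetical.

```python
import subprocess


def p2_gpu_tuning_commands(memory_clock_mhz=2505, graphics_clock_mhz=875):
    """Build nvidia-smi commands for persistence mode and fixed maximum clocks."""
    return [
        # Keep the driver loaded between jobs (persistence mode).
        ["nvidia-smi", "-pm", "1"],
        # Assumption: disable auto boost so the clocks stay where they are set.
        ["nvidia-smi", "--auto-boost-default=0"],
        # Pin application clocks as "<memory>,<graphics>" in MHz.
        ["nvidia-smi", "-ac", f"{memory_clock_mhz},{graphics_clock_mhz}"],
    ]


def apply(commands, dry_run=True):
    """Print the commands, or run them with sudo when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(["sudo", *cmd], check=True)


if __name__ == "__main__":
    apply(p2_gpu_tuning_commands())
```

Running this at instance startup, for example from the orchestration step of a role, would keep the settings consistent across the farm.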
Troubleshooting #Cost Reduction & Data Protection for Migration
● Problem: By business decision, S3 data must be transferred to object
storage in the US-*** Region at minimum cost
● Situation: Only the application is running in the US-*** Region
● Solution
○ Snowball? Snowmobile?
○ Routing Over AWS Networks
■ AWS Direct Connect Inter-Region Connectivity to access AWS Services in any US
Region
■ Pay only for data transfer from the remote Regions to Direct Connect Region
○ Routing Over Non-AWS Networks
■ Corporate Network Backbone
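Choosing between shipping devices and routing over a network comes down to transfer time and per-GB price. The sketch below shows the arithmetic only: the data size, link speed, utilization, and price used in the example are hypothetical placeholders, not figures from the deck or from any price list.

```python
def transfer_days(data_tb, link_gbps, utilization=0.8):
    """Days needed to move data_tb terabytes over a link at the given utilization."""
    bits = data_tb * 1e12 * 8
    seconds = bits / (link_gbps * 1e9 * utilization)
    return seconds / 86_400


def transfer_cost(data_tb, price_per_gb):
    """Data transfer charge for data_tb terabytes at a flat per-GB price."""
    return data_tb * 1000 * price_per_gb


if __name__ == "__main__":
    # Hypothetical inputs: 500 TB over a 10 Gbps link at 0.02 per GB.
    print(round(transfer_days(500, 10), 1), "days")
    print(transfer_cost(500, 0.02))
```

Comparing the result for each route (Direct Connect, corporate backbone) against the turnaround time of a physical device makes the trade-off explicit.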
Summary
● Why Real-Time Service System? UX, Mobile, Personalized
● Requirements :
● Constraints
● Troubleshooting
○ Automation
○ Bi-directional Transport & Security
○ Latency
○ Cost & Data Protection
○ GPU Instances with High Memory
○ Performance Considerations for P2
○ Cost Reduction & Data Protection for Migration