On-Premise AI Platform: From DC to Edge
Bukhary Ikhwan Ismail, Mohammad Fairus Khalid, Rajendar Kandan, Ong Hong Hoe
Advanced Computing Lab, MIMOS Berhad, Kuala Lumpur, Malaysia
{ikhwan.ismail, fairus.khalid, rajendar.kandan, hh.ong}@mimos.my
Abstract— Artificial Intelligence (AI) is powering everything from devices to applications and services. Machine learning, a branch of AI, requires a powerful infrastructure platform to train and serve AI models. In this paper, we share our blueprint for building and hosting an internal on-premise AI platform. We make use of our existing services, such as the private cloud, distributed storage, and unified authentication platform, and build the AI platform on top of them. We discuss the requirements gathered from users, the technologies that make the platform possible, the implementation, and the lessons learned from hosting it internally. Based on our evaluation, for specific needs it is an economical and viable option to host an on-premise AI platform.
Keywords—AI, Artificial Intelligence, Machine Learning,
Kubernetes, Edge Computing, Distributed Computing
I. INTRODUCTION
To support AI projects, e.g. machine learning, there are readily available commercial cloud AI solutions. These solutions are comprehensive but do not satisfy every user's requirements. There is a need to host an on-premise platform when the project is sensitive in nature, confidential, or part of a closed system, for example government-related projects for prisons, the military, or enforcement agencies. For others, the motivation could simply be to own and self-host the infrastructure. Based on our literature survey, there is a lack of blueprints for implementing such a system to support AI training and deployment of AI models.
In this paper, we present our approach to hosting an on-premise AI infrastructure. The goal is to identify a suitable technology stack and enable an ecosystem for the institute to develop and host AI projects internally. The remainder of this paper is organized as follows. Section II discusses the AI background. Section III discusses the requirements of an AI training and delivery platform. Section IV discusses our implementation in detail. Lastly, Section V presents discussion and future work. We implement this platform to support our future projects that deploy to the edge in the areas of IoT, smart manufacturing, and healthcare [1].
II. BACKGROUND
AI techniques apply in many areas, such as computer vision, speech recognition, natural language processing (NLP), audio, material inspection, and more. Machine learning (ML), a branch of AI, uses algorithms and mathematical models that make decisions without being explicitly programmed to perform the task. ML usually requires huge datasets for training and high compute for the multiple iterations of model training.
A. Commercial vs OSS
Commercial AI platforms, e.g. Azure ML, AWS SageMaker, and Google AI Platform, are built on top of their providers' existing cloud services. They offer training services by providing dedicated GPU, CPU, or TPU hardware or virtual machine (VM) options. Users can utilize object storage, e.g. AWS S3, or block storage, e.g. EBS volumes, to store training datasets. The customer gets the latest iteration of technology updates and powerful integration with existing cloud services, e.g. big data, queue systems, compute, and databases. Some providers do offer limited AI services and tools for free. The charging mechanism is based on resource usage, e.g. compute, storage, and network.
Running an AI project on the public cloud has its own limitations. First, it creates vendor lock-in: it is nearly impossible to port the project outside of the ecosystem once developers utilize provider-specific services, e.g. AWS Lambda or AWS SQS. Secondly, hosting an AI solution for a sensitive or highly confidential project is tricky; issues such as data privacy, security, and workload placement are some of the concerns. Using a commercial solution, the developer is limited to the provider's intended design.
B. AI project lifecycle
The AI process requires different entry points to the facility, e.g. compute, storage, and repository, and different features of the infrastructure platform. Figure 1 shows a high-level AI project flow.
Figure 1. High-Level AI Project Flow
1) Train model – training requires high compute power and consists of multiple training cycles.
2) Package model – the developer packages the AI model as an application.
3) Validate model – the model and application undergo a validation process to attain the desired accuracy and speed.
4) Deploy model – the AI application is deployed to target endpoints, e.g. cloud or edge computing [2].
5) Monitor model – continuous monitoring to assess the accuracy of the results.
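The five steps above can be chained into a skeleton pipeline. The sketch below is purely illustrative: the stage functions, their return values, and the accuracy figures are stubs, not our platform's actual interfaces.

```python
# Minimal sketch of the five lifecycle stages chained into one pipeline.
def train(dataset):
    # 1) Train model: multiple high-compute training cycles (stubbed here).
    return {"weights": f"trained-on-{dataset}", "version": 1}

def package(model):
    # 2) Package model: bundle the model with its serving application.
    return {"image": "ai-app", "model": model}

def validate(artifact, min_accuracy=0.9):
    # 3) Validate model: accept only if it meets the accuracy target.
    accuracy = 0.95  # stub: would come from a held-out validation set
    return accuracy >= min_accuracy

def deploy(artifact, target):
    # 4) Deploy model: push to a cloud or edge endpoint (stubbed).
    return f"{artifact['image']}:v{artifact['model']['version']}@{target}"

def monitor(deployment):
    # 5) Monitor model: continuously assess the accuracy of the results.
    return {"deployment": deployment, "accuracy": 0.95}

artifact = package(train("datasets/defects"))
assert validate(artifact)
status = monitor(deploy(artifact, "edge-site-1"))
```

In practice, monitoring feeds back into training: a drop in observed accuracy triggers a new cycle from step 1.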
C. Core Technologies
To design our AI platform, we rely on the following technology stacks:
1) Virtualization – abstracts resources into virtualized commodities. It enables resource sharing, multi-tenancy, and increased resource utilization. The hypervisor catalyzed the cloud, and containerization enables application virtualization [3][4].
2) Hardware – multicore GPUs or specific classes of dedicated processors have propelled the advancement of AI in the last decade [5], speeding up both training and inferencing.
3) Middleware – solutions such as OpenNebula for cloud, Ceph for distributed storage, and Kubernetes for application management form the platform stack we use in our solution.
III. THE REQUIREMENTS FOR AN AI PLATFORM
In any AI and ML project, the success of the project depends on three main factors [5]:
1) The algorithm as part of the model itself.
2) The quality and quantity of the data available for training.
3) The available compute cores, network throughput, and storage performance.
Points 2 and 3 are the core motivation for this paper. We aim for a proper infrastructure design, coupled with a good management process, that will contribute to the success of overall AI project delivery.
AI Operations (AIOps) is a recent buzzword inspired by the DevOps paradigm. It focuses on streamlining AI process operations from data acquisition to model deployment. The objective is to be able to automate and reproduce AI models with predictable results.
To deliver a platform for machine learning development, a cohesive ecosystem that encourages sharing, reduces the time to develop and deploy, and streamlines the process requires objective and specific requirements. Here, we lay out four areas of the ML management process that are important.
A. Datasets management
Data usually comes in raw format, possibly from multiple sources, e.g. an in-house database, a databank, or another institute. Inaccessibility of data is often the bottleneck of an AI project, as it prevents experimentation. In the early stage of an enterprise AI project, experimentation is critical for business staff, DevOps, and operators to develop a clear understanding of how and where AI can be applied in the most impactful manner. Thus, an easily accessible dataset is required. Dataset management requires:
1) Storage optimized for write-once-read-many, managed through a storage pool to minimize cost and maximize usage.
2) Accessibility through shared protocols, e.g. NFS, Samba, FTP.
3) Sharing without copying. The same dataset can be replicated without a physical copy; datasets are huge, and it is cheaper to share through a snapshot and store only the delta.
B. Training Management
Training is an important step in ML. Multiple training iterations are required until the desired accuracy or speed is reached. New datasets, hyper-parameter changes, and new model steps require retraining. Training requires high compute power but is done sparsely. The requirements are:
1) An on-demand, non-dedicated training platform.
2) Utilizing the cloud, with no dedicated hardware or resources.
3) Multi-tenancy, for efficient resource utilization.
4) Support for both GPU and CPU for training and inferencing.
5) Support for distributed training to speed up training.
C. Version management
Training is an inconclusive activity: at each iteration, model accuracy may increase or decrease. Some iterations of the model are worth storing. The requirements are:
1) A centralized repository able to store AI models and applications.
2) Model sharing – the ability to share or restrict AI models and applications.
3) Trackability – the ability to track the versions of applications and models.
D. Deployment Management
A deployment consists of the application and the model itself. A newer model version can replace the old model while retaining the same application. The system must be able to support:
1) Private cloud deployment – deploy within our internal private cloud platform.
2) Edge deployment – deploy at a remote site onto a single device.
3) Multi-site deployment – the ability to deploy at multiple sites.
Deploying AI applications at the edge is gaining traction, as we can send the model to the remote site and do the processing on-site instead of sending a huge amount of data to the datacenter. Processing at the edge is much more cost-effective, fast, and does not require stable internet connectivity [6][1][7].
IV. IMPLEMENTATION
MIMOS is a research institute consisting of 11 core R&D areas, from AI, IoT, and Advanced Computing to Microelectronics and more. We host our own on-premise cloud called Mi-Cloud. Currently, there are 900 VMs across 50 physical machines. We have distributed storage with a capacity of 1000 TB, with current utilization at 300 TB. To accelerate AI and ML activities, we propose to host our own on-demand AI ecosystem. To achieve this, we utilize the existing on-premise cloud and storage. The goal is to host and support internal and external deployments.
Figure 2. AI steps with our solution
Figure 2 shows the AI steps that we plan to support with the integration of existing in-house solutions as well as new components.
1) Mi-Cloud – our existing self-service portal to request VMs, based on the OpenNebula solution.
2) Mi-ROSS – our existing distributed storage that stores VM images, backups, and database data, configured with different performance, reliability, and availability storage policies. It is based on Ceph storage.
3) Mi-Focus Container – a platform that deploys containers on Mi-Cloud VMs as well as to remote locations such as the edge.
4) Mi-Registry – a platform that stores application and model versions.
Figure 3. Infrastructure layers of overall AI platform
Figure 3 shows the overall components of our AI platform. We run our VM disks on top of Mi-ROSS distributed storage. The AI training facility and AI inferencing are hosted using containers on top of our private cloud. To manage the containers, we utilize Mi-Focus, which is based on Kubernetes.
Mi-Focus Registry handles application versioning as well as storing the AI models. We enhanced the existing Docker registry to make it distributed in order to increase performance during high loads of Docker image pulls and pushes. We added vulnerability assessment [8] to help application developers spot problematic application libraries or packages even before the image is deployed [9].
We cater for two deployment options: to the cloud using Mi-Focus Container, and to edge computing using Mi-Focus Edge.
Users such as data and ML engineers can mount the datasets on their own laptops and make changes at their convenience. The same datasets can then be mounted inside a container to proceed with the ML training and validation process.
A. Dataset & storage management
Our distributed storage system (Mi-ROSS) [10] handles three separate storage pools: AI dataset storage, running VM disks [11], and storage for the image registry.
Pool 1 is optimized for dataset management with write-once-read-many policies. We plan to provide a facility for volume management and snapshot capability on the dataset folders to enable copy-less sharing. Once users store their datasets, they have the option to create a snapshot. Using a snapshot, users can manipulate the data without affecting the original raw dataset. All of this is done through the copy-less snapshot capability introduced in the storage system.
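The copy-less behaviour can be pictured as a copy-on-write overlay. The storage layer provides this natively; the toy class below is only a sketch of the idea, not our storage system's interface.

```python
# Toy copy-on-write overlay: a snapshot starts as references to the base
# dataset's blocks; writes store only the changed blocks (the delta).
class DatasetSnapshot:
    def __init__(self, base):
        self.base = base   # original blocks, never modified
        self.delta = {}    # overlay: block index -> modified block

    def read(self, i):
        # Prefer the overlay; fall back to the shared base blocks.
        return self.delta.get(i, self.base[i])

    def write(self, i, block):
        self.delta[i] = block  # only the delta is stored

base = ["blk0", "blk1", "blk2"]
snap = DatasetSnapshot(base)
snap.write(1, "blk1-modified")
```

Many snapshots can thus share one huge base dataset while each pays only for the blocks it actually changes.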
Pool 2 is optimized for VM disk usage; our Mi-Cloud relies on this pool. Pool 3 is optimized for write-once-read-many with three replicas to increase availability and durability.
To encourage dataset sharing and usage, we expose datasets via three protocols: object storage, NFS, and WebDAV. For training, users have the option to mount a dataset using the NFS or object storage protocol. The same datasets can be accessed through WebDAV using the NextCloud solution, similar to Dropbox. Users have the freedom to view and manipulate datasets from a personal laptop.
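As a hedged sketch, the two training-side access paths might look as follows; the host names, export paths, and bucket layout are assumptions for illustration only, not our actual deployment.

```python
# Two access paths for the same dataset: an NFS mount (typically run
# inside a training container) and an S3-style object-storage URL.
def nfs_mount_cmd(server, export, mountpoint):
    # Command a training host would run to mount the dataset over NFS.
    return ["mount", "-t", "nfs", f"{server}:{export}", mountpoint]

def object_url(endpoint, bucket, key):
    # S3-compatible object-storage path to the same dataset artifact.
    return f"{endpoint}/{bucket}/{key}"

cmd = nfs_mount_cmd("storage.internal", "/pools/datasets/cifar10", "/mnt/data")
url = object_url("https://storage.internal:9000", "datasets", "cifar10/train.tar")
```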
Figure 4. Mi-Cloud portal with 6 VM running
Figure 5. Mi-Focus Container with multiple running containers
B. Training management
We provide a training service using our Figure 5: Mi-
Focus container management, and Figure 4: Mi-Cloud VM
platform. Training runs within containers on top of VMs. It
is on-demand using our self-service portal. User needs to
launch VM first and later register that VM within Mi-Focus
system to manage containers. Both Mi-Cloud and Mi-Focus
is multi-tenant.
Once the training is done and the model is safely stored
in shared storage, the training rig is teardown. We do not
apply billing features to charge employees on their usage.
However, we do send periodic weekly & monthly email to
show their current usage of VMs.
Figure 6. Distributed training using containers & VM.
We support both single-node and distributed training facilities. Users have the option of GPU- or CPU-based training. Figure 6 shows distributed training. For distributed training, the code itself must support distributed mode, as seen in the TensorFlow or XGBoost libraries. To enable this feature, we rely on KubeFlow running on top of a cluster of Kubernetes node VMs.
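For TensorFlow, KubeFlow's TFJob controller injects a TF_CONFIG environment variable into each training replica, which TensorFlow's distribution strategies read to discover their peers. A minimal sketch of that variable follows; the worker host names are illustrative, not our cluster's actual addresses.

```python
import json
import os

# TF_CONFIG is the JSON environment variable a KubeFlow TFJob sets on
# each replica so TensorFlow can discover the other workers in the job.
def tf_config(workers, index):
    return {
        "cluster": {"worker": workers},              # all replicas in the job
        "task": {"type": "worker", "index": index},  # this replica's role
    }

workers = [f"trainer-worker-{i}:2222" for i in range(3)]
os.environ["TF_CONFIG"] = json.dumps(tf_config(workers, index=0))
parsed = json.loads(os.environ["TF_CONFIG"])
```

Each of the three replicas receives the same cluster section but a different task index, which is how the framework assigns shards of the work.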
C. Version Management
The application and the model are packaged as a Docker image. This method makes it easier to launch and maintain the deployment of the application in a precise, repeatable manner [12]. Figure 7 shows an image with different model versions.
Figure 7. Mi-Focus registry
Our version management is an enhanced stock Docker registry. The registry can run in a 2+n configuration to increase image download speed as well as the availability of the registry. It is open to internal employees, enabling Docker hosts to pull images. To download an image, the Docker host talks to a load balancer, and the requests are spread across multiple backend registries.
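The spreading of pull requests can be pictured as a simple round-robin over the backend registries; the backend addresses are illustrative, and the real load balancer's policy may differ.

```python
import itertools

# Round-robin selector standing in for the load balancer that spreads
# image pulls across the 2+n backend registries.
backends = itertools.cycle(
    ["registry-1:5000", "registry-2:5000", "registry-3:5000"]
)

def next_backend():
    # Each incoming pull request is handed to the next backend in turn.
    return next(backends)

picks = [next_backend() for _ in range(4)]
```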
Users can push images, which consist of the AI model and the application itself. Using Docker images, we can track image versions properly, as shown in Figure 7. We enable share and restrict functionality: for any folder, a user can decide to share the images with a specific user or user group, or make them public for everybody to use. Our Mi-Focus, Mi-ROSS, Mi-Cloud, and Mi-Registry use a centralized Unified Authentication Platform [13]. All authentication is synchronized across these platforms for ease of user management.
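One way to keep both application and model versions trackable in a single image reference is a structured tag. The scheme below is an assumed convention for illustration only, not the registry's mandated format.

```python
# Parse an assumed "app<ver>-model<ver>" tag convention that encodes the
# application version and model version in one Docker image reference.
def parse_image_ref(ref):
    # e.g. "registry.internal/vision/defect-detector:app1.2-model7"
    repo, _, tag = ref.rpartition(":")
    registry, _, name = repo.partition("/")
    app_ver, _, model_ver = tag.partition("-model")
    return {"registry": registry, "name": name,
            "app": app_ver.removeprefix("app"), "model": model_ver}

info = parse_image_ref("registry.internal/vision/defect-detector:app1.2-model7")
```

With such a convention, a new model version simply becomes a new tag on the same repository, so old combinations remain pullable.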
Figure 8. Pushing an image to a host
D. Deployment Management
Docker is the most suitable technology to deploy applications and models for both the DC and edge computing [14][15]. Docker provides excellent repeatability and is less prone to error. Provisioning and tearing down applications is fast, making it suitable for AI deployment.
Figure 9. Edge nodes across different states
We plan to support both DC and edge device deployments. For DC-based deployments, we manage them through Mi-Focus DC, where the applications take the form of Kubernetes deployments. Managing with Kubernetes syntax is much more streamlined than using other deployment tools such as Docker Swarm.
Our system currently supports any device equipped with a Docker daemon. Figure 9 shows four devices across three states in Malaysia. Each device displays the total number of running containers and available cached images. We treat each device as a Docker host, issuing deploy, start, stop, and monitor commands remotely through Mi-Focus. We can manage the deployment of a single device or of multiple sites with multiple devices at the same time. This enables us to, for example, update all AI models concurrently or perform maintenance more easily.
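A concurrent multi-site update can be sketched as a fan-out over the registered Docker hosts. In the sketch, deploy() is a stand-in for the real per-host Docker API call, and the device names are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

# Fan-out update across all registered edge devices at once.
def deploy(device, image):
    # Placeholder for: instruct the device's Docker daemon to pull the
    # new image and restart the container.
    return f"{device}: running {image}"

devices = ["edge-site-1", "edge-site-2", "edge-site-3", "edge-site-4"]
with ThreadPoolExecutor(max_workers=len(devices)) as pool:
    results = list(pool.map(lambda d: deploy(d, "vision-app:v7"), devices))
```

A real rollout would also collect per-device failures so a site that is temporarily offline can be retried without blocking the others.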
Figure 10. Edge node with AI application installed on the desktop
Figure 10 shows a screenshot of a remote machine's desktop. We have deployed two types of AI applications. When there is a new application version, we can deploy it remotely and replace or create a new shortcut on that machine's desktop.
V. FUTURE WORK
We have created a prototype for an on-premise AI platform. We discussed and shared the requirements, implementation, and technology that make it possible.
Based on user feedback, there are possible future enhancements. For dataset management, there is a need for an on-demand, non-persistent console to modify datasets on the spot rather than mounting them to the user's laptop. Engineers prefer to use scripting languages, and having this functionality may increase user productivity.
The MLOps idea is to have reproducible and repeatable models. In the future, we plan to enhance our repository to store all artefacts related to an AI project, e.g. code, datasets, and model versions, in a single workspace. This makes it easier to replicate or continue a project when every artefact is in a single location.
Currently, we support edge deployment on a single Docker machine. Standardizing the deployment of AI projects on the DC (a Kubernetes setup) and on edge devices (Docker) would ease the manageability of a mixed ecosystem. K3s or KubeEdge cluster solutions for the edge are good candidates.
An AI model is a single file, ranging from 100 MB to 700 MB depending on the complexity of the model. We would like to study the possibility of supporting incremental model delta updates, so that only the delta is transferred to the edge instead of one big file. Edge devices are remote and geographically distributed, often without a reliable internet connection and behind a firewall. Implementing a scheduled pull from the edge node to the management node in the cloud is a more elegant approach. The inputs gathered at the edge during inferencing can help create new datasets: the application may include a systematic approach to evaluate and reinforce the AI model, and selectively tagged valuable training data can be cached locally at the edge and sent back to a centralized repository.
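An incremental delta update could follow an rsync-style scheme: split the model file into fixed-size chunks, hash each chunk, and transfer only chunks whose hashes differ from what the edge already holds. A minimal sketch, with the chunk size shrunk for illustration:

```python
import hashlib

# Rsync-style sketch: hash fixed-size chunks of the model file and ship
# only the chunks the edge does not already have.
CHUNK = 4  # bytes, tiny for illustration; a real system might use MiB

def chunk_hashes(blob):
    chunks = [blob[i:i + CHUNK] for i in range(0, len(blob), CHUNK)]
    return [hashlib.sha256(c).hexdigest() for c in chunks], chunks

old_hashes, _ = chunk_hashes(b"AAAABBBBCCCC")           # model on the edge
new_hashes, new_chunks = chunk_hashes(b"AAAABBXBCCCC")  # updated model

# Transfer only chunks that are new or whose hash changed.
delta = [(i, c) for i, (h, c) in enumerate(zip(new_hashes, new_chunks))
         if i >= len(old_hashes) or h != old_hashes[i]]
```

Here only one 4-byte chunk changed, so only that chunk travels to the edge; the device reassembles the new model from its cached chunks plus the delta.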
REFERENCES
[1] M. Shafique, T. Theocharides, C. Bouganis, and M. A. Hanif, “An
Overview of Next-Generation Architectures for Machine Learning :
Roadmap, Opportunities and Challenges in the IoT Era.”
[2] Y. Huang, X. Ma, X. Fan, J. Liu, and W. Gong, “When Deep
Learning Meets Edge Computing,” 2017.
[3] M. Amaral, J. Polo, D. Carrera, I. Mohomed, M. Unuvar, M. Steinder,
and C. Pahl, “Containerisation and the PaaS Cloud,” Proc. - 2015
IEEE 14th Int. Symp. Netw. Comput. Appl. NCA 2015, no.
September, pp. 1–6, 2015.
[4] D. Bernstein, “Containers and Cloud: From LXC to Docker to
Kubernetes,” IEEE Cloud Comput., vol. 1, no. 3, pp. 81–84, 2014.
[5] T. Volk, “ARTIFICIAL INTELLIGENCE AND MACHINE
LEARNING FOR OPTIMIZING DEVOPS, IT OPERATIONS AND
BUSINESS - EMA Top 3 Report & Decision Guide for Enterprise,”
2018.
[6] S. B. Calo, M. Touna, D. C. Verma, and A. Cullen, “Edge Computing
Architecture for applying AI to IoT.”
[7] Y. He, F. R. Yu, N. Zhao, V. C. M. Leung, and H. Yin, “Software-
Defined Networks with Mobile Edge Computing and Caching for
Smart Cities : A Big Data Deep Reinforcement Learning Approach,”
no. December, 2017.
[8] E. Mostajeran, M. N. M. Mydin, M. F. Khalid, B. I. Ismail, R.
Kandan, and O. H. Hoe, “Quantitative risk assessment of container
based cloud platform,” in 2017 IEEE Conference on Application,
Information and Network Security (AINS), 2017, pp. 19–24.
[9] E. Mostajeran, M. F. Khalid, M. N. M. Mydin, B. I. Ismail, and H. H.
Ong, “Multifaceted Trust Assessment Framework for Container based
Edge Computing Platform,” in Fifth International Conference On
Advances in Computing, Control and Networking - ACCN 2016,
2016.
[10] M. T. Wong, M. B. A. Karim, J.-Y. Luke, and H. Ong, “Ceph as
WAN Filesystem–Performance and Feasibility Study through
Simulation.,” in Proceedings of the Asia-Pacific Advanced Network
38, 2014, pp. 1–11.
[11] J. J. Johari, M. F. Khalid, M. Nizam, M. Mydin, and N. Wijee,
“Comparison of Various Virtual Machine Disk Images Performance
on GlusterFS and Ceph Rados Block Devices,” pp. 1–7, 2014.
[12] M. F. Khalid, B. I. Ismail, and M. N. M. Mydin, “Performance
Comparison of Image and Workload Management of Edge
Computing Using Different Virtualization Technologies,” in 2016 3rd
International Conference on Computer, Communication and Control
Technology (I4CT), 2016.
[13] K. A. A. Bakar and G. R. Haron, “Context-Aware Analysis for
Adaptive Unified Authentication Platform,” in Proceedings of the
5Th International Conference on Computing & Informatics, 2015, pp.
417–422.
[14] B. I. Ismail, E. M. Goortani, M. B. Ab Karim, M. T. Wong, S. Setapa,
J. Y. Luke, and H. H. Ong, “Evaluation of Docker as Edge computing
platform,” in 2015 IEEE Conference on Open Systems (ICOS), 2015,
pp. 130–135.
[15] M. Bazli, A. Karim, B. I. Ismail, W. M. Tat, E. M. Goortani, S.
Setapa, Y. Luke, and H. Ong, “Extending Cloud Resources to the
Edge : Possible Scenarios, Challenges and Experiments,” 2016 Int.
Conf. Cloud Comput. Res. Innov. (ICCCRI 2016), 2016.