The document discusses GSK's Edge Node on Demand (ENOD) capability, which enables rapid prototyping by allowing users to provision pre-configured computing environments on demand. ENOD uses Docker containers to deploy virtual computing environments connected to GSK's Hadoop clusters, allowing users to access data and run queries and Spark jobs without needing local software installations. The demo showcases how ENOD provides user isolation, reusability, seamless migration, and sharing of environments across GSK researchers. Some challenges along the way included optimizing network file system performance and container image sizes.
2. A science-led global healthcare company with a special purpose: to help people do more, feel better, live longer.
We have 3 global businesses that research, develop and manufacture innovative pharmaceutical medicines, vaccines and consumer healthcare products.
3. Our goal is to be one of the world’s most innovative, best performing and trusted healthcare companies.
5. Drug discovery is tough…
[Figure: bar chart of number of compounds by highest phase reached (Candidate nomination, Phase I, Phase II, Phase III, Phase IV), split into Not provided, Progressing and Terminated. Source: doi:10.1038/nrd4609]
225/231 (97%) small molecules entering human studies* did not become medicines
*2000 - 2010
8. Data Lifecycle at GSK
Data creation and editing
Operational data use & conformational analytics: Discovery, Clinical studies, Submissions
Data Re-use and Curation
Exploratory data use and analytics: Descriptive, Predictive, Diagnostic
Data Lock Down
Archive: Retention, Redundant, Recoverable
9. An easier way to acquire and store all R&D data
The opportunity
An agile platform to enable analytics at scale
10. R&D Information Platform
An ecosystem of tools and technologies that enable the secondary use of data to harness strategic value for GSK
Roles and functions: Platform Admins, Data Curation User, (Meta) Data Browser, Governance, Data Acquisition User, Advanced Analytics User
Tools and technologies: Sentry, Zookeeper, Docker, Jenkins, Spark, Cloudera Navigator (Encrypt), YARN, Anaconda Package Manager, Cloudera Navigator, Kosh, Kafka, R Connect, Anaconda Enterprise Notebooks, Local Disk, HDFS, Hive, Solr, HBase, Oozie, Impala, MongoDB, Kinetica, Streamsets, Trifacta, Tamr, Spark R, AtScale, ZoomData, R Studio
11. Platform Summary
“MCC”: Bring your own data
“SCC”: Automated data pipelines exist here
“VSC”: Validated Processes are housed here
10 clusters
~700 nodes
12. The NEXT opportunity
We needed an easier way for users to interact with the platform... be more user experience driven
14. Connect people to information seamlessly through a knowledge model based on personalized experiences that leads to the creation of value; in doing so, we plan to become the operating system for analytics at GSK.
Mission
15. Product Lines
Delivering platform capabilities
The User Experience, On Demand, Metrics, Security & Compliance, Intelligence, Sustainability
Enabling, Understanding, Safeguarding, Transforming, Maintaining
16. Why Edge Node On Demand?
There is a sustained need for vetting new tools and technologies, code packages
[Diagram: Acquire, Assess, Agree, Procure and Deploy steps; 3 – 6 months]
17. What is “Edge Node on Demand”?
A capability that enables rapid prototyping by putting the user at the center with pre-configurations to enable multi cluster communication.
18. Architecture - ENOD
[Diagram components: GUI, Backend Service (REST), Mongo DB, Ansible, UCP Cluster, DTR, Docker Worker Node 1 and Docker Worker Node 2 (each with ENOD Containers, a CM Agent and a mounted CDH Config), PROD Hadoop Cluster, DEV Hadoop Cluster]
• User will request an edge node for a specific cluster. Can also request changes in environments
• API call to the backend
• Backend to the PaaP service. Stores all the user and the container info
• Triggers the Ansible playbook with specific host details
• The Docker Python API is used to run the containers and pull/push images to DTR
• Base image for ENOD
• Change cluster snapshots
• Running Containers
• User Access: SSH in the container
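The request flow on this slide (a user request, an API call to the backend, stored user and container info, then an Ansible playbook triggered with host details) can be sketched roughly as follows. This is only an illustration under stated assumptions: `EdgeNodeRequest`, `Backend`, the playbook name `enod.yml`, the container naming scheme and the variable names are all hypothetical, a plain dict stands in for the Mongo DB store, and the command is built but not executed.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class EdgeNodeRequest:
    user: str
    cluster: str      # target Hadoop cluster, e.g. "DEV" or "PROD"
    image: str        # ENOD base image reference (hypothetical value)
    worker_host: str  # Docker worker node that should run the container


class Backend:
    """Stand-in for the REST backend; a dict replaces the Mongo DB store."""

    def __init__(self):
        self.store = {}

    def submit(self, req):
        # Hypothetical naming scheme for the user's container.
        container = f"enod-{req.user}-{req.cluster.lower()}"
        # Persist the user and container info.
        self.store[container] = asdict(req)
        # Pass the request details to the playbook as JSON extra vars.
        extra_vars = json.dumps(
            {"container_name": container, "cluster": req.cluster, "image": req.image}
        )
        # Build (but do not run) the ansible-playbook call for the chosen host.
        return [
            "ansible-playbook", "enod.yml",
            "--limit", req.worker_host,
            "--extra-vars", extra_vars,
        ]


backend = Backend()
command = backend.submit(
    EdgeNodeRequest("alice", "DEV", "dtr.example.com/enod/base:latest", "worker1")
)
print(" ".join(command))
```

In a real deployment the returned argument list would be handed to a process runner, and the playbook itself would start the container on the worker node; both steps are left out here because the slides do not describe them in detail.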
19. ENOD Demo Scenario
Users: Owner, Shared User
Access levels: SSH with admin access; SSH but no admin access
Mounted: CDH Client Libraries, NFS (Home Directories)
Internet: access Internet through proxy
Access data; run queries/Spark jobs
Edge Node Container
• Enterprise Integrations
• Enabled users
• Shareable
• Mobility
CDH Cluster: HDFS, HIVE, YARN, Impala, HBase, Spark
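The demo scenario (an owner and a shared user with different SSH rights, mounted CDH client libraries and NFS home directories, Internet access through a proxy) could be modelled along these lines. It is a minimal sketch, not the actual ENOD implementation: the `EdgeNode` class, all paths, the proxy URL and the image name are hypothetical, and it assumes that the owner is the one with admin access while shared users get SSH without admin.

```python
from dataclasses import dataclass, field


@dataclass
class EdgeNode:
    owner: str
    shared_with: set = field(default_factory=set)
    # (source, target, mode) bind mounts; all paths are hypothetical.
    mounts: tuple = (
        ("/opt/cdh-client", "/opt/cdh-client", "ro"),  # CDH client libraries
        ("/nfs/home", "/home", "rw"),                  # NFS home directories
    )
    proxy: str = "http://proxy.example.com:8080"       # hypothetical proxy

    def can_ssh(self, user):
        # Both the owner and shared users may SSH into the container.
        return user == self.owner or user in self.shared_with

    def is_admin(self, user):
        # Assumption: only the owner has admin access.
        return user == self.owner

    def docker_run_args(self, image):
        # Build a `docker run` argument list with mounts and proxy settings.
        args = ["docker", "run", "-d"]
        for src, dst, mode in self.mounts:
            args += ["-v", f"{src}:{dst}:{mode}"]
        args += ["-e", f"HTTP_PROXY={self.proxy}", "-e", f"HTTPS_PROXY={self.proxy}"]
        return args + [image]


node = EdgeNode(owner="alice", shared_with={"bob"})
print(node.can_ssh("bob"), node.is_admin("bob"))
print(" ".join(node.docker_run_args("enod-base")))
```

Keeping home directories on a shared mount rather than inside the container is what lets an environment be recreated or moved without losing user files, which fits the "Shareable" and "Mobility" points on the slide.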
23. Opportunities Faced Along the Way
• NFS for data
• Systemd & tmpfs
• Huge images, ~1 GB
• Kernel patches
• Docker UCP/DTR stability
• DTR NFS performance issues
• Security concern for ENOD container
• Data persistency