Brian Brownlow is an experienced senior analyst programmer at Mayo Clinic. He gave a workshop presentation at the 2014 BDPA Technology Conference on the topic 'Big Data Implementation – Mayo Clinic Case Study'. The presentation tells part of the Mayo Clinic story of embarking on an exploration of 'Big Data' technologies. 'Big Data' is seen as one set of tools that can be used to enhance medical research, medical education, and practice management. Mayo Clinic is always searching for better, faster, and cheaper ways to use its data to improve patient care and sustain financial outcomes in a challenging reimbursement environment. Our approach combines several open-source components with data from various sources to provide information to decision makers in near real time. We have created a center of 'Big Data' excellence using in-house staff and vendor engagements. 'Big Data' is one element of our Enterprise Data Trust framework.
2. What is Big Data?
• A silver bullet that will solve all the world's problems? NO
• An arrow in the IT quiver to help solve customer problems? YES
• Does anyone have large data problems? All sales transactions, log reviews, device output, text processing?
• How does your relational DB handle index creation or backup for 500,000,000,000-row tables?
• Popular things that are similar
• SETI@home: many networked computers doing small pieces of work
• Watson: many networked computers working together to solve a problem
• What's one computer that beat a chess master? Kasparov – Deep Blue (1996–1997); there are others…
• Big data has been around a long time
• Why now? Bigger, cheaper, faster processing, memory,
networking and disk
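The "many networked computers doing small pieces of work" idea above is the map/shuffle/reduce pattern at the heart of Hadoop. A minimal single-machine sketch of that pattern — illustrative only, not Hadoop itself — using a word count, the classic example:

```python
# Minimal sketch of the map/shuffle/reduce pattern behind Hadoop,
# simulated in-process with Python's standard library. In a real
# cluster, each mapper and reducer runs on a different networked machine.
from collections import defaultdict

def mapper(line):
    # Emit (word, 1) for every word in one small piece of the input.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(mapped):
    # Group all emitted values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Combine the partial results for one key.
    return key, sum(values)

lines = ["big data is not a silver bullet",
         "big data is an arrow in the quiver"]
mapped = [pair for line in lines for pair in mapper(line)]
counts = dict(reducer(k, v) for k, v in shuffle(mapped).items())
print(counts["big"], counts["data"])  # prints: 2 2
```

Because each mapper sees only its own piece of the input and the reducers are independent per key, the same structure scales from one laptop to the 500-billion-row tables a single relational DB struggles with.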
11/16/2014 2
3. Mayo Big Data Elements
• Patient Information
• Appointments
• Labs
• Images
• Genome
• Appointment Check-in/Check-out
• Report text
• Vitals
• Device reporting, e.g. Holter Monitor
• Many more, it keeps growing…
4. Mayo Big Data Elements
Potentially Affecting Patient Care
• ALL OF THEM!
• The more we know about a patient, the better we can build tools and models to help the care team improve patient care and help the business manage reimbursement.
5. Mayo Big Data Initial Evaluations
• Hortonworks HDP on a Virtual
Machine on my laptop
• HDP 1.3.2, 2.1 on Oracle VM
• HDP 1.3.2, 2.1 on VMWare
• What can HDP do?
• Pig, Hive, HBase, HDFS, Ambari, Hue, MapReduce, Flume, Storm, Elasticsearch, Sqoop…
6. Mayo Big Data Presentations to
Leadership
• What is “BIG DATA”? What is Hadoop?
• What are “BIG DATA” capabilities?
• Here is one way you can answer your customer queries
about big data!
• Many people want to have a “BIG DATA” story
• Proved out at Mayo by some initial proof-of-concept projects
• Genomics on Cloudera (early work)
• HDP on Oracle VM (my project)
• Multi-node DEV environment on HDP 1.3.2 running CentOS on XenCenter and an outside edge node
• Helped by media hype.
9. Big Data DEV Setup
• Lots of help on the web, Hortonworks website, other websites
• Using the latest version of CentOS: 6.5 (x64)
• Exported VM to CentOS6.5_Hadoop1.32_SSD3.ova
• Installed as a VM from Oracle Virtual Box on Citrix XenCenter
• Installed or Updated latest packages for yum, rpm, wget, curl,
scp, pdsh, …
• Downloaded and generated a local HDP repository in /etc/yum.repos.d (note: 3 HDP Hadoop stack versions – 1.3.2, 1.3.3, 2.0.6)
• Configured network (hosts, security, firewall…)
• Installed Ambari (v1.4.4.23) and the embedded PostgreSQL DB (v8.4.18)
• Installed Hadoop components from Ambari
10. Big Data DEV Environment
• Was it Perfect? NO
• Less stable than preferred due to enabled
updates
• Lightly used
• Checked daily
• By the time of heavier use we had our INT and PROD environments, so we didn't need DEV
• Was It Good Enough? YES
11. Mayo Big Data Platform RFP
• Sent out RFP, got demos based on a
use case we submitted with the RFP
• IBM Big Insights
• Cloudera Hadoop Distribution
• Teradata/Hortonworks Hadoop Distribution
• Selected Teradata/Hortonworks on a Teradata hardware frame
• TDH (Teradata Hadoop) is not an exact copy of HDP (Hortonworks Data Platform)
• The Teradata appliance brings some good things to the table: Viewpoint, HCLI, …
12. Big Data INT and PROD
• TDH INT in one cabinet, TDH PROD in the other; asked Teradata for a VM version
• Additional expansion space available in existing INT and PROD racks – want a big data project? Fund a new edge or data node!
• Teradata add-ons: RAID, InfiniBand, Viewpoint, HCLI
• TDH 1.3.2, not HDP 1.3.2: same source base but minor differences to support the Teradata infrastructure
• Ideal: DEV=INT=PROD, hardware and software
13. TDH INT and PROD
[Rack diagram: two cabinets. PROD cabinet: Master Prod 1–2, Edge Prod 1–2, Data Prod 1–6, Viewpoint TMS, primary/secondary SM Enet switches, Network-0/Network-1 InfiniBand switches, KVM, cabling slot, system VMS, and space for additional nodes. INT cabinet: Master Test 1–2, Edge Test 1–2, Data Test 1–6, primary/secondary SM Enet switches, cabinet VMS, and space for additional nodes.]
• 20 Hadoop nodes total – 10 per cabinet
• 2 Hadoop clusters, one per cabinet:
• Prod: 2 Master, 2 Edge, 1 Viewpoint TMS, 6 Data nodes (can add up to 7 more Edge and/or Data nodes in-cabinet, plus add additional cabinets to the cluster)
• Integration Test: 2 Master, 2 Edge, 6 Data nodes (can add up to 8 more Edge and/or Data nodes in-cabinet, plus add additional cabinets to the cluster)
• Raw user data capacity per cluster: 57+ TB
• Includes HDFS 3x replication & work space
• Does NOT include any compression! Example: at 2x compression, user data space per cluster is 114+ TB
• Power: 3-phase; 2 × 60 amps per cabinet; bottom egress
• HDP 1.3.2; Storm, Elasticsearch, and WebSphere MQ to be installed on the appliance by the project team
• Teradata Managed Server (TMS) for Viewpoint
14. Big Data Project Setup
• Agile development – 2 week sprints, daily scrums
• Extreme Programming
• Java Development Environment tool tree
• SVN (Subversion)
• Jenkins
• Maven
• Eclipse – Kepler
• Open Source Components
• Storm
• Flume
• Elastic Search (Marvel)
• NLP - cTAKES
• Acquired training for all components as needed, e.g. Storm, Flume, Elastic
Search, SVN, Drools
• Used in DEV, INT and PROD environments
• Consulting engagements
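The open-source components above form a typical ingest → process → index data flow: Flume collects events, Storm transforms them in flight, and Elasticsearch makes them searchable. A toy single-process sketch of that flow — the component names are analogies only, and the event data is invented; none of the actual libraries are used:

```python
# Toy pipeline mimicking a Flume -> Storm -> Elasticsearch data flow
# with plain Python generators. Analogy only, not the real components.
def collect(raw_events):
    # "Flume": ingest raw events from a source, trimming whitespace.
    for event in raw_events:
        yield event.strip()

def transform(events):
    # "Storm": normalize/enrich each event as it streams through.
    for event in events:
        yield {"text": event, "length": len(event)}

def index(docs):
    # "Elasticsearch": build a tiny inverted index for full-text search.
    inverted = {}
    for doc_id, doc in enumerate(docs):
        for term in doc["text"].lower().split():
            inverted.setdefault(term, set()).add(doc_id)
    return inverted

raw = ["Holter monitor output received ", "lab result received"]
idx = index(transform(collect(raw)))
print(sorted(idx["received"]))  # prints: [0, 1]
```

Each stage consumes the previous one lazily, which is the same back-pressure-friendly shape the real streaming components give you across machines.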
15. DEV Team
• The team
• Executive support
• Project manager
• Senior Technical staff member
• 4 very experienced Programmers
• Very motivated, flexible, hearts of teachers and
learners
• Agile and Extreme programming relatively
new to Mayo IT
• Parts of the tool tree were also relatively
new to Mayo IT
16. Part 1
• Verify the development tool tree
• Verify the development process
• Verify the open source components
• Define first use cases
• Start and manage the project backlog
list
17. Part 1 Projects
• Natural Language Processing
• Let's get more value from unstructured text!
• Standard big data use cases
• Exploration
• Log exploration
• Search
• …
• Data lake
• Cohort identification
• …
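At its simplest, cohort identification over unstructured text means flagging patients whose notes mention target clinical terms. A deliberately naive sketch with invented sample notes — real clinical NLP must handle negation, abbreviations, and context, which is exactly why tools like cTAKES exist:

```python
import re

# Naive cohort screen: flag patients whose note text mentions any target
# term. All patient data here is invented for illustration. A production
# pipeline (e.g. one built on cTAKES) would also detect negation
# ("no evidence of...") and map synonyms to standard vocabularies.
notes = {
    "patient_a": "History of type 2 diabetes, on metformin.",
    "patient_b": "No evidence of cardiac disease.",
    "patient_c": "Diabetes well controlled; follow up in 6 months.",
}
terms = [r"\bdiabetes\b"]

def cohort(notes, terms):
    pattern = re.compile("|".join(terms), re.IGNORECASE)
    return sorted(pid for pid, text in notes.items() if pattern.search(text))

matches = cohort(notes, terms)
print(matches)  # prints: ['patient_a', 'patient_c']
```

Even this crude version shows the payoff: report text that a relational schema ignores becomes a queryable signal for trial recruitment.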
18. Part 1 Pig, Hive
Pig
A = LOAD 'default.bnb_test_from_file' USING org.apache.hcatalog.pig.HCatLoader();
DUMP A;
Hive
SELECT * FROM default.bnb_test_from_file LIMIT 2;
19. Part 1
• In production!
• Well received
• Met expectations for the development
process and schedule
• Lots of people lined up now to use
the environment!
20. Part 2
• More NLP work
• Get more source data from more sources
• Explore via Drools, Elasticsearch, MapReduce
• Many more lined up
• Security – log examination
• Clinical Trials cohort discovery
• Genomics/Phenomics
• Molecular biology
• Protein studies
• …
21. Conclusion
• Big Data via Hadoop is a relevant choice in certain problem spaces
• Open source can provide valuable
tools for our customers
• Questions?