Mayo Clinic
Big Data Projects
Experience
Brian Brownlow
Big Data Professional
What is Big Data?
• A silver bullet that will solve all the world’s problems? NO
• An arrow in the IT quiver to help solve customer problems?
YES
• Does anyone have large data problems? All sales
transactions, log reviews, device output, text processing?
• How does your relational DB handle index creation or backup for 500,000,000,000-row tables?
• Popular things that are similar
• SETI@home, many networked computers doing small pieces of work
• Watson, many networked computers working together to
solve a problem
• What’s one computer that beat a chess grandmaster? Deep Blue vs. Kasparov (1996–1997); there are others…
• Big data has been around a long time
• Why now? Bigger, cheaper, faster processing, memory,
networking and disk
Mayo Big Data Elements
• Patient Information
• Appointments
• Labs
• Images
• Genome
• Appointment Check-in/Check-out
• Report text
• Vitals
• Device reporting, e.g. Holter Monitor
• Many more, it keeps growing…
Mayo Big Data Elements
Potentially Affecting Patient Care
• ALL OF THEM!
• The more we know about a patient, the better we can build tools and models to help the care team improve patient care and help the business manage to reimbursement.
Mayo Big Data Initial Evaluations
• Hortonworks HDP on a Virtual
Machine on my laptop
• HDP 1.3.2, 2.1 on Oracle VM
• HDP 1.3.2, 2.1 on VMware
• What can HDP do?
• Pig, Hive, HBase, HDFS, Ambari, Hue, MapReduce, Flume, Storm, Elasticsearch, Sqoop…
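As a concrete taste of what the stack exposes to developers, here is a minimal sketch of touching HDFS from Java. This is an illustration rather than code from the evaluation: the NameNode URI and path are sandbox-style placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode URI; Hadoop 1.x reads fs.default.name,
        // Hadoop 2.x reads fs.defaultFS, so set both for a mixed estate.
        conf.set("fs.default.name", "hdfs://sandbox:8020");
        conf.set("fs.defaultFS", "hdfs://sandbox:8020");
        FileSystem fs = FileSystem.get(conf);
        // List whatever sits under /user on the sandbox VM.
        for (FileStatus s : fs.listStatus(new Path("/user"))) {
            System.out.println(s.getPath() + "  " + s.getLen() + " bytes");
        }
    }
}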
Mayo Big Data Presentations to
Leadership
• What is “BIG DATA”? What is Hadoop?
• What are “BIG DATA” capabilities?
• Here is one way you can answer your customer queries
about big data!
• Many people want to have a “BIG DATA” story
• Proved out at Mayo by some initial proof-of-concept projects
• Genomics on Cloudera (early work)
• HDP on Oracle VM (my project)
• Multi-node DEV environment on HDP 1.3.2 running CentOS on XenCenter and an outside edge node
• Helped by media hype.
The Virtual Machine!
• Show it.
Mayo Big Data DEV
Big Data DEV Setup
• Lots of help on the web, Hortonworks website, other websites
• Using the latest version of CentOS: 6.5 (x64)
• Exported VM to CentOS6.5_Hadoop1.32_SSD3.ova
• Installed as a VM from Oracle VirtualBox on Citrix XenCenter
• Installed or Updated latest packages for yum, rpm, wget, curl,
scp, pdsh, …
• Downloaded and generated a local HDP repository in /etc/yum.repos.d (Note: 3 versions of the HDP Hadoop stack – 1.3.2, 1.3.3, 2.0.6)
• Configured network (hosts, security, firewall…)
• Installed Ambari (v1.4.4.23) and the embedded PostgreSQL DB (v8.4.18)
• Installed Hadoop components from Ambari
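Once Ambari is up, its REST interface makes a handy sanity check that the install took. Below is a minimal sketch using only JDK classes; the host name is a placeholder and admin/admin is just Ambari's out-of-the-box default credential, so substitute real values.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import javax.xml.bind.DatatypeConverter;

public class AmbariCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder host; 8080 is Ambari's default port.
        URL url = new URL("http://ambari-host:8080/api/v1/clusters");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // admin/admin is only the default credential shipped with Ambari.
        String token = DatatypeConverter.printBase64Binary("admin:admin".getBytes("UTF-8"));
        conn.setRequestProperty("Authorization", "Basic " + token);
        BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream(), "UTF-8"));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line); // JSON listing of registered clusters
        }
        in.close();
    }
}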
Big Data DEV Environment
• Was it Perfect? NO
• Less stable than preferred due to enabled
updates
• Lightly used
• Checked daily
• By the time of heavier use we had our INT and PROD environments, so we didn’t need DEV
• Was It Good Enough? YES
Mayo Big Data Platform RFP
• Sent out an RFP; received demos based on a use case we submitted with the RFP
• IBM BigInsights
• Cloudera Hadoop Distribution
• Teradata/Hortonworks Hadoop Distribution
• Selected Teradata/Hortonworks on a Teradata hardware frame
• TDH (Teradata Hadoop) is not an exact copy of HDP (Hortonworks Data Platform)
• The Teradata appliance brings some good things to the table: Viewpoint, HCLI, …
Big Data INT and PROD
• TDH INT in one cabinet, TDH PROD in the other; asked Teradata for a VM version
• Additional expansion space available in the existing INT and PROD racks. Want a big data project? Fund a new edge or data node!
• Teradata add-ons: RAID, InfiniBand, Viewpoint, HCLI
• TDH 1.3.2, not HDP 1.3.2: same source base, but minor differences to support the Teradata infrastructure
• Ideal: DEV=INT=PROD, hardware and software
TDH INT and PROD
[Rack elevation diagram: two cabinets. Each holds 2 Master nodes, 2 Edge nodes, 6 Data nodes, primary and secondary SM Ethernet switches, Network-0/Network-1 InfiniBand switches, a KVM, a cabling slot, VMS, and space for additional nodes; the PROD cabinet also houses the Viewpoint TMS.]
• 20 Hadoop nodes total – 10 per cabinet
• 2 Hadoop clusters, one per cabinet:
• Prod: 2 Master, 2 Edge, 1 Viewpoint TMS, 6
Data nodes (can add up to 7 more Edge
and/or Data nodes in-cabinet, plus add
additional cabinets to the cluster)
• Integration Test: 2 Master, 2 Edge, 6 Data
nodes (can add up to 8 more Edge and/or
Data nodes in-cabinet, plus add additional
cabinets to the cluster)
• Raw user data capacity per cluster: 57+ TB
• Includes HDFS 3x replication & work space
• Does NOT include any compression!
• Example: at 2x compression, user
data space per cluster is 114+ TB
• Power: 3 phase; 2 x 60 amps per cabinet; bottom
egress
• HDP 1.3.2; Storm, Elasticsearch, and WebSphere
MQ to be installed on appliance by project team
• Teradata Managed Server (TMS) for Viewpoint
Big Data Project Setup
• Agile development – 2 week sprints, daily scrums
• Extreme Programming
• Java Development Environment tool tree
• SVN (Subversion)
• Jenkins
• Maven
• Eclipse – Kepler
• Open Source Components
• Storm (see the topology sketch after this list)
• Flume
• Elasticsearch (Marvel)
• NLP - cTAKES
• Acquired training for all components as needed, e.g. Storm, Flume, Elasticsearch, SVN, Drools
• Used in DEV, INT and PROD environments
• Consulting engagements
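To show how the Storm pieces above fit together, here is a minimal topology sketch against the backtype.storm API of that era (Storm 0.9.x). The spout and bolt are hypothetical stand-ins for illustration, not our actual feed components.

import java.util.Map;
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

public class ReportFeedTopology {

    // Hypothetical spout: emits one canned report line per second.
    // A real spout would read from a queue or a Flume channel.
    public static class ReportSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;

        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        public void nextTuple() {
            Utils.sleep(1000);
            collector.emit(new Values("sample report text"));
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("report"));
        }
    }

    // Hypothetical bolt: stands in for tokenizing/annotation work (e.g. NLP).
    public static class AnnotateBolt extends BaseBasicBolt {
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            System.out.println("annotating: " + tuple.getStringByField("report"));
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // terminal bolt: no downstream stream declared
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("report-spout", new ReportSpout(), 1);
        builder.setBolt("annotate-bolt", new AnnotateBolt(), 2)
               .shuffleGrouping("report-spout");

        // In-process cluster for development; on a real cluster the same
        // topology would be handed to StormSubmitter.submitTopology instead.
        new LocalCluster().submitTopology("report-feed", new Config(), builder.createTopology());
    }
}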
DEV Team
• The team
• Executive support
• Project manager
• Senior Technical staff member
• 4 very experienced Programmers
• Very motivated, flexible, hearts of teachers and
learners
• Agile and Extreme Programming were relatively new to Mayo IT
• Parts of the tool tree were also relatively
new to Mayo IT
Part 1
• Verify the development tool tree
• Verify the development process
• Verify the open source components
• Define first use cases
• Start and manage the project backlog
list
Part 1 Projects
• Natural Language Processing
• Let’s get more value from unstructured text!
• Standard big data use cases
• Exploration (see the MapReduce sketch after this list)
• Log exploration
• Search
• …
• Data lake
• Cohort identification
• …
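For the exploration and log-exploration use cases, a plain MapReduce job is often the first pass. Here is a minimal, self-contained sketch that counts term frequencies across text files in HDFS; the class, job, and path names are illustrative, not from the actual project.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TermCount {

    // Emit (term, 1) for every whitespace-delimited token in the input text.
    public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text term = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString().toLowerCase());
            while (itr.hasMoreTokens()) {
                term.set(itr.nextToken());
                context.write(term, ONE);
            }
        }
    }

    // Sum the counts for each term; also reused as the combiner.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        // This Job constructor matches the Hadoop 1.x API on HDP 1.3
        // (deprecated, but still present, in Hadoop 2.x).
        Job job = new Job(new Configuration(), "term count");
        job.setJarByClass(TermCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, it runs in the usual way: hadoop jar termcount.jar TermCount <input dir> <output dir>.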
Part 1 Pig, Hive
Pig
-- Load a Hive-managed table through HCatalog and dump it to the console
A = LOAD 'default.bnb_test_from_file' USING org.apache.hcatalog.pig.HCatLoader();
DUMP A;
Hive
-- Peek at the first two rows of the same table
SELECT * FROM default.bnb_test_from_file LIMIT 2;
Part 1
• In production!
• Well received
• Met expectations for the development
process and schedule
• Lots of people lined up now to use
the environment!
Part 2
• More NLP work
• Get more source data from more sources
• Explore via Drools, Elasticsearch, MapReduce (see the query sketch after this list)
• Many more lined up
• Security – log examination
• Clinical Trials cohort discovery
• Genomics/Phenomics
• Molecular biology
• Protein studies
• …
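As a hint of what the Elasticsearch exploration looks like, here is a minimal sketch that posts a match query to the standard _search REST endpoint using only JDK classes, which avoids pinning a client-library version; the host, index, and field names are hypothetical.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class SearchReports {
    public static void main(String[] args) throws Exception {
        // Placeholder host; the "reports" index and "text" field are hypothetical.
        URL url = new URL("http://es-node:9200/reports/_search");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        // Standard Elasticsearch match query, limited to two hits.
        String query = "{\"query\":{\"match\":{\"text\":\"holter\"}},\"size\":2}";
        OutputStream out = conn.getOutputStream();
        out.write(query.getBytes("UTF-8"));
        out.close();
        BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream(), "UTF-8"));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line); // raw JSON hits
        }
        in.close();
    }
}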
Conclusion
• Big Data via Hadoop is a relevant choice in certain problem spaces
• Open source can provide valuable
tools for our customers
• Questions?