Mayo Clinic
Big Data Projects
Experience
Brian Brownlow
Big Data Professional
What is Big Data?
• A silver bullet that will solve all the world’s problems? NO
• An arrow in the IT quiver to help solve customer problems?
YES
• Does anyone have large data problems? All sales
transactions, log reviews, device output, text processing?
• How does your relational DB handle index creation or
backup for 500,000,000,000-row tables?
• Popular things that are similar
• SETI@home, many networked computers doing small pieces of work
• Watson, many networked computers working together to
solve a problem
• What’s one computer that beat a chess master? Kasparov –
Deep Blue (1996–1997), there are others…
• Big data has been around a long time
• Why now? Bigger, cheaper, faster processing, memory,
networking and disk
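The SETI and Watson bullets above describe the idea that Hadoop's MapReduce also builds on: split a large job into small independent pieces, run them on many workers, and combine the partial results. Here is a minimal single-machine sketch of that pattern, assuming a word-count task and using a thread pool as a stand-in for networked computers. The function names and sample lines are illustrative and do not come from the presentation.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(chunk):
    """Map step: count the words in one small piece of the input."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def split_into_chunks(lines, chunk_size):
    """Cut the input into pieces that can be processed independently."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def parallel_word_count(lines, chunk_size=2, workers=4):
    """Run the map step on every chunk, then merge the partial counts."""
    chunks = split_into_chunks(lines, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(count_words, chunks))
    total = Counter()
    for partial in partial_counts:  # reduce step: combine partial results
        total.update(partial)
    return total


# Made-up sample input, for illustration only.
sample_lines = [
    "patient checked in",
    "patient checked out",
    "lab result received",
    "patient lab result reviewed",
]
totals = parallel_word_count(sample_lines)
print(totals.most_common(3))
```

On a real cluster the chunks would be HDFS blocks and the workers separate machines, but the split, process, and combine shape stays the same.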
11/16/2014 2
Mayo Big Data Elements
• Patient Information
• Appointments
• Labs
• Images
• Genome
• Appointment Check-in/Check-out
• Report text
• Vitals
• Device reporting, e.g. Holter Monitor
• Many more, it keeps growing…
11/16/2014 3
Mayo Big Data Elements
Potentially Affecting Patient Care
• ALL OF THEM!
• The more we know about a patient
the better we can build tools and
models to help the care team improve
patient care and help the business
manage to reimbursement.
11/16/2014 4
Mayo Big Data Initial Evaluations
• Hortonworks HDP on a Virtual
Machine on my laptop
• HDP 1.3.2, 2.1 on Oracle VM
• HDP 1.3.2, 2.1 on VMware
• What can HDP do?
• Pig, Hive, HBase, HDFS, Ambari,
Hue, MapReduce, Flume, Storm,
Elasticsearch, Sqoop…
11/16/2014 5
Mayo Big Data Presentations to
Leadership
• What is “BIG DATA”? What is Hadoop?
• What are “BIG DATA” capabilities?
• Here is one way you can answer your customer queries
about big data!
• Many people want to have a “BIG DATA” story
• Proved out at Mayo by some initial proof of concept projects
• Genomics on Cloudera (early work)
• HDP on Oracle VM (my project)
• Multi-node DEV environment on HDP 1.3.2 running CentOS on
XenCenter, with an outside edge node
• Helped by media hype.
11/16/2014 6
The Virtual Machine!
• Show it.
11/16/2014 7
Mayo Big Data DEV
11/16/2014 8
Big Data DEV Setup
• Lots of help on the web, Hortonworks website, other websites
• Using the latest version of CentOS: 6.5 (x64)
• Exported VM to CentOS6.5_Hadoop1.32_SSD3.ova
• Installed as a VM from Oracle VirtualBox on Citrix XenCenter
• Installed or updated the latest packages for yum, rpm, wget, curl,
scp, pdsh, …
• Downloaded and generated a local HDP repository under
/etc/yum.repos.d (Note: 3 versions of the HDP Hadoop stack: 1.3.2, 1.3.3,
2.0.6)
• Configured network (hosts, security, firewall…)
• Installed Ambari (v1.4.4.23) and embedded PostgreSQL DB
(v8.4.18)
• Installed Hadoop components from Ambari
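The local-repository step in the list above can be sketched as follows. This is a minimal illustration, assuming a yum `.repo` definition in its usual INI form. The repository id, the `baseurl` host, the `gpgcheck` setting, and the output file name are hypothetical placeholders, not values from the original setup.

```python
import configparser
import io


def build_repo_definition(repo_id, name, baseurl):
    """Return the text of a yum .repo file for one local repository."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep option names exactly as written
    parser[repo_id] = {
        "name": name,
        "baseurl": baseurl,
        "enabled": "1",
        "gpgcheck": "0",  # assumption: unsigned local mirror on a DEV box
    }
    buffer = io.StringIO()
    parser.write(buffer, space_around_delimiters=False)
    return buffer.getvalue()


# Hypothetical values, for illustration only.
repo_text = build_repo_definition(
    repo_id="HDP-1.3.2-local",
    name="Local HDP 1.3.2 repository",
    baseurl="http://repo.example.internal/hdp/HDP-1.3.2",
)
print(repo_text)
# On the target host this text would be saved under /etc/yum.repos.d/,
# for example as hdp-local.repo (the file name is a placeholder).
```

One such section per stack version (1.3.2, 1.3.3, 2.0.6) would give the three repositories mentioned in the note.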
11/16/2014 9
Big Data DEV Environment
• Was it Perfect? NO
• Less stable than preferred because updates
were enabled
• Lightly used
• Checked daily
• By the time usage grew heavier, we had our INT and
PROD environments, so we didn’t need DEV
• Was It Good Enough? YES
11/16/2014 10
Mayo Big Data Platform RFP
• Sent out RFP, got demos based on a
use case we submitted with the RFP
• IBM BigInsights
• Cloudera Hadoop Distribution
• Teradata/Hortonworks Hadoop Distribution
• Selected Teradata/Hortonworks on a
Teradata hardware frame
• TDH (Teradata Hadoop) is not an exact copy of
HDP (Hortonworks Data Platform)
• The Teradata appliance brings some good
things to the table: Viewpoint, HCLI, …
11/16/2014 11
Big Data INT and PROD
• TDH INT in one cabinet, TDH PROD in the other,
asked Teradata for a VM version
• Additional expansion space available in existing
INT and PROD racks, want a big data project?
Fund a new edge or data node!
• Teradata add-ons: RAID, InfiniBand, Viewpoint,
HCLI
• TDH 1.3.2, not HDP 1.3.2: same source base but
minor differences to support the Teradata
infrastructure
• Ideal: DEV=INT=PROD, hardware and software
11/16/2014 12
11/16/2014 13
[Rack diagram, two cabinets. Prod cabinet: Master Prod 1-2, Edge Prod 1-2, Data Prod 1-6, primary and secondary SM Ethernet switches, System VMS, Network-0 and Network-1 InfiniBand switches, KVM, cabling slot, space for additional nodes. Test cabinet: Master Test 1-2, Edge Test 1-2, Data Test 1-6, primary and secondary SM Ethernet switches, Cabinet VMS, space for additional nodes. Also shown: Viewpoint TMS.]
• 20 Hadoop nodes total – 10 per cabinet
• 2 Hadoop clusters, one per cabinet:
• Prod: 2 Master, 2 Edge, 1 Viewpoint TMS, 6
Data nodes (can add up to 7 more Edge
and/or Data nodes in-cabinet, plus add
additional cabinets to the cluster)
• Integration Test: 2 Master, 2 Edge, 6 Data
nodes (can add up to 8 more Edge and/or
Data nodes in-cabinet, plus add additional
cabinets to the cluster)
• Raw user data capacity per cluster: 57+ TB
• Includes HDFS 3x replication & work space
• Does NOT include any compression!
• Example: at 2x compression, user
data space per cluster is 114+ TB
• Power: 3 phase; 2 x 60 amps per cabinet; bottom
egress
• HDP 1.3.2; Storm, Elasticsearch, and WebSphere
MQ to be installed on appliance by project team
• Teradata Managed Server (TMS) for Viewpoint
TDH INT and PROD
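The compression example on this slide is simple multiplication: the 57 TB figure already accounts for HDFS 3x replication and work space, so only the compression ratio scales it. Here is a small sketch of that arithmetic. The function name and the 3x ratio in the last call are illustrative assumptions, not figures from the slide.

```python
def effective_user_capacity_tb(raw_user_tb, compression_ratio=1.0):
    """Scale the post-replication user capacity by a compression ratio."""
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    return raw_user_tb * compression_ratio


RAW_USER_TB = 57  # per cluster, from the slide ("57+ TB", so a lower bound)

print(effective_user_capacity_tb(RAW_USER_TB))       # no compression
print(effective_user_capacity_tb(RAW_USER_TB, 2.0))  # the slide's 2x example
print(effective_user_capacity_tb(RAW_USER_TB, 3.0))  # hypothetical 3x ratio
```

Real compression ratios depend on the data and the codec, so the result is best read as a planning estimate.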
Big Data Project Setup
• Agile development – 2 week sprints, daily scrums
• Extreme Programming
• Java Development Environment tool tree
• SVN (Subversion)
• Jenkins
• Maven
• Eclipse – Kepler
• Open Source Components
• Storm
• Flume
• Elasticsearch (Marvel)
• NLP - cTAKES
• Acquired training for all components as needed, e.g. Storm, Flume,
Elasticsearch, SVN, Drools
• Used in DEV, INT and PROD environments
• Consulting engagements
11/16/2014 14
DEV Team
• The team
• Executive support
• Project manager
• Senior Technical staff member
• 4 very experienced Programmers
• Very motivated, flexible, hearts of teachers and
learners
• Agile and Extreme programming relatively
new to Mayo IT
• Parts of the tool tree were also relatively
new to Mayo IT
11/16/2014 15
Part 1
• Verify the development tool tree
• Verify the development process
• Verify the open source components
• Define first use cases
• Start and manage the project backlog
list
11/16/2014 16
Part 1 Projects
• Natural Language Processing
• Let’s get more value from unstructured text!
• Standard big data use cases
• Exploration
• Log exploration
• Search
• …
• Data lake
• Cohort identification
• …
11/16/2014 17
Part 1 Pig, Hive
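Pig
-- Load a Hive table through HCatalog (start Pig with: pig -useHCatalog)
A = LOAD 'default.bnb_test_from_file' USING org.apache.hcatalog.pig.HCatLoader();
-- Print the rows to the console
DUMP A;
Hive
-- Return the first two rows of the same table
SELECT * FROM default.bnb_test_from_file LIMIT 2;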
11/16/2014 18
Part 1
• In production!
• Well received
• Met expectations for the development
process and schedule
• Lots of people lined up now to use
the environment!
11/16/2014 19
Part 2
• More NLP work
• Get more source data from more sources
• Explore via Drools, Elasticsearch, MapReduce
• Many more lined up
• Security – log examination
• Clinical Trials cohort discovery
• Genomics/Phenomics
• Molecular biology
• Protein studies
• …
11/16/2014 20
Conclusion
• Big Data via Hadoop is a relevant
choice in certain problem spaces
• Open source can provide valuable
tools for our customers
• Questions?
11/16/2014 21

9953330565 Low Rate Call Girls In Rohini  Delhi NCR9953330565 Low Rate Call Girls In Rohini  Delhi NCR
9953330565 Low Rate Call Girls In Rohini Delhi NCR
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
 
Roles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in PharmacovigilanceRoles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in Pharmacovigilance
 

A Mayo Clinic Big Data Implementation

  • 1. Mayo Clinic Big Data Projects Experience Brian Brownlow Big Data Professional
  • 2. What is Big Data? • A silver bullet that will solve all the world's problems? NO • An arrow in the IT quiver to help solve customer problems? YES • Does anyone have large data problems? All sales transactions, log reviews, device output, text processing? • How does your relational DB handle index creation or backup for 500,000,000,000-row tables? • Popular things that are similar • SETI, many networked computers doing small pieces of work • Watson, many networked computers working together to solve a problem • What’s one computer that beat a chess master? Kasparov – Deep Blue (1996–1997), there are others… • Big data has been around a long time • Why now? Bigger, cheaper, faster processing, memory, networking and disk 11/16/2014 2
  • 3. Mayo Big Data Elements • Patient Information • Appointments • Labs • Images • Genome • Appointment Check-in/Check-out • Report text • Vitals • Device reporting, e.g. Holter Monitor • Many more, it keeps growing…
  • 4. Mayo Big Data Elements Potentially Affecting Patient Care • ALL OF THEM! • The more we know about a patient the better we can build tools and models to help the care team improve patient care and help the business manage to reimbursement.
  • 5. Mayo Big Data Initial Evaluations • Hortonworks HDP on a Virtual Machine on my laptop • HDP 1.3.2, 2.1 on Oracle VM • HDP 1.3.2, 2.1 on VMware • What can HDP do? • Pig, Hive, HBase, HDFS, Ambari, Hue, MapReduce, Flume, Storm, Elasticsearch, Sqoop…
  • 6. Mayo Big Data Presentations to Leadership • What is “BIG DATA”? What is Hadoop? • What are “BIG DATA” capabilities? • Here is one way you can answer your customer queries about big data! • Many people want to have a “BIG DATA” story • Proved out at Mayo by some initial proof of concept projects • Genomics on Cloudera (early work) • HDP on Oracle VM (my project) • Multi-node DEV environment on HDP 1.3.2 running CentOS on XenCenter and an outside edge node • Helped by media hype.
  • 7. The Virtual Machine! • Show it.
  • 8. Mayo Big Data DEV
  • 9. Big Data DEV Setup • Lots of help on the web, Hortonworks website, other websites • Using the latest version of CentOS: 6.5 (x64) • Exported VM to CentOS6.5_Hadoop1.32_SSD3.ova • Installed as a VM from Oracle VirtualBox on Citrix XenCenter • Installed or updated latest packages for yum, rpm, wget, curl, scp, pdsh, … • Downloaded and generated a local HDP repository in /etc/yum.repos.d (Note: 3 versions of the HDP Hadoop stack: 1.3.2, 1.3.3, 2.0.6) • Configured network (hosts, security, firewall…) • Installed Ambari (v1.4.4.23) and embedded PostgreSQL DB (v8.4.18) • Installed Hadoop components from Ambari
  • 10. Big Data DEV Environment • Was it Perfect? NO • Less stable than preferred due to enabled updates • Lightly used • Checked daily • By the time usage grew heavier we had our INT and PROD environments, so we didn’t need DEV • Was It Good Enough? YES
  • 11. Mayo Big Data Platform RFP • Sent out RFP, got demos based on a use case we submitted with the RFP • IBM BigInsights • Cloudera Hadoop Distribution • Teradata/Hortonworks Hadoop Distribution • Selected Teradata/Hortonworks on a Teradata hardware frame • TDH (Teradata Hadoop) is not an exact copy of HDP (Hortonworks Data Platform) • The Teradata appliance brings some good things to the table: Viewpoint, HCLI, …
  • 12. Big Data INT and PROD • TDH INT in one cabinet, TDH PROD in the other; asked Teradata for a VM version • Additional expansion space available in existing INT and PROD racks. Want a big data project? Fund a new edge or data node! • Teradata add-ons: RAID, InfiniBand, Viewpoint, HCLI • TDH 1.3.2, not HDP 1.3.2: same source base but minor differences to support the Teradata infrastructure • Ideal: DEV=INT=PROD, hardware and software
  • 13. TDH INT and PROD • [Rack diagram of the two cabinets (Prod and Integration Test), each showing 2 Master, 2 Edge, and 6 Data nodes, primary and secondary SM Ethernet switches, a VMS, and space for additional nodes; the diagram also labels InfiniBand switches Network-0 and Network-1, a KVM, a cabling slot, and the Viewpoint TMS] • 20 Hadoop nodes total – 10 per cabinet • 2 Hadoop clusters, one per cabinet: • Prod: 2 Master, 2 Edge, 1 Viewpoint TMS, 6 Data nodes (can add up to 7 more Edge and/or Data nodes in-cabinet, plus add additional cabinets to the cluster) • Integration Test: 2 Master, 2 Edge, 6 Data nodes (can add up to 8 more Edge and/or Data nodes in-cabinet, plus add additional cabinets to the cluster) • Raw user data capacity per cluster: 57+ TB • Includes HDFS 3x replication & work space • Does NOT include any compression! • Example: at 2x compression, user data space per cluster is 114+ TB • Power: 3 phase; 2 x 60 amps per cabinet; bottom egress • HDP 1.3.2; Storm, Elasticsearch, and WebSphere MQ to be installed on appliance by project team • Teradata Managed Server (TMS) for Viewpoint
  • 14. Big Data Project Setup • Agile development – 2 week sprints, daily scrums • Extreme Programming • Java Development Environment tool tree • SVN (Subversion) • Jenkins • Maven • Eclipse – Kepler • Open Source Components • Storm • Flume • Elasticsearch (Marvel) • NLP - cTAKES • Acquired training for all components as needed, e.g. Storm, Flume, Elasticsearch, SVN, Drools • Used in DEV, INT and PROD environments • Consulting engagements
  • 15. DEV Team • The team • Executive support • Project manager • Senior Technical staff member • 4 very experienced Programmers • Very motivated, flexible, hearts of teachers and learners • Agile and Extreme programming relatively new to Mayo IT • Parts of the tool tree were also relatively new to Mayo IT
  • 16. Part 1 • Verify the development tool tree • Verify the development process • Verify the open source components • Define first use cases • Start and manage the project backlog list
  • 17. Part 1 Projects • Natural Language Processing • Let's get more value from unstructured text! • Standard big data use cases • Exploration • Log exploration • Search • … • Data lake • Cohort identification • …
  • 18. Part 1 Pig, Hive • Pig: A = LOAD 'default.bnb_test_from_file' USING org.apache.hcatalog.pig.HCatLoader(); DUMP A; • Hive: 'SELECT * FROM default.bnb_test_from_file limit 2'
  • 19. Part 1 • In production! • Well received • Met expectations for the development process and schedule • Lots of people lined up now to use the environment!
  • 20. Part 2 • More NLP work • Get more source data from more sources • Explore via Drools, ElasticSearch, MapReduce • Many more lined up • Security – log examination • Clinical Trials cohort discovery • Genomics/Phenomics • Molecular biology • Protein studies • …
  • 21. Conclusion • Big Data via Hadoop is a relevant choice in certain problem spaces • Open source can provide valuable tools for our customers • Questions?
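
The capacity figures on slide 13 (57+ TB of user data per cluster after HDFS 3x replication, 114+ TB at 2x compression) can be checked with a short calculation. The sketch below is illustrative only: the function names are hypothetical, and the size of the "work space" reserve is not given on the slide, so the physical-disk figure is only a lower bound that ignores it.

```python
def effective_capacity_tb(user_tb: float, compression_ratio: float = 1.0) -> float:
    """User data that fits once compression is applied.

    user_tb is the capacity already net of HDFS replication and work
    space (57 TB per cluster on the slide).
    """
    if user_tb < 0 or compression_ratio <= 0:
        raise ValueError("capacity must be >= 0 and compression ratio > 0")
    return user_tb * compression_ratio


def min_physical_disk_tb(user_tb: float, replication_factor: int = 3) -> float:
    """Lower bound on physical disk implied by the replication factor.

    Assumption: work space is ignored because the slide does not say
    how much is reserved, so real disk capacity is larger than this.
    """
    if user_tb < 0 or replication_factor < 1:
        raise ValueError("capacity must be >= 0 and replication factor >= 1")
    return user_tb * replication_factor


if __name__ == "__main__":
    user_tb = 57  # per-cluster figure from slide 13
    print(effective_capacity_tb(user_tb, compression_ratio=2.0))
    print(min_physical_disk_tb(user_tb, replication_factor=3))
```

With the slide's numbers, 2x compression doubles 57 TB to the quoted 114 TB, and 3x replication implies at least 171 TB of physical disk per cluster before any work space is set aside.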