Using Advanced Analytics for
Value-based Healthcare Delivery

Copyright © Prime Dimensions 2013 All rights reserved.
We are a DC-based consulting firm that provides advanced
analytics capabilities and implementation services
 Prime Dimensions offers expertise in the processes, tools and techniques
associated with:
 Data Management
 Business Intelligence
 Advanced Analytics

 We focus on operational aspects, with an emphasis on Big Data strategy and
technology

 We assist organizations in transforming data into actionable insights to improve
performance, make informed decisions and achieve measurable results.
 We partner closely with clients to ensure cost-effective implementations.

Big Data requires a new generation of scalable
technologies designed to extract insights from very large
volumes of disparate, multi-structured data by enabling
high velocity capture, discovery, and analysis.
Copyright © Prime Dimensions 2013 All rights reserved.

1
Agenda
 Challenges & Opportunities
 Payment Reform
 Big Data Enabling Technologies
 Big Data Roadmap

Copyright © Prime Dimensions 2013 All rights reserved.

2
Typical Client Challenges

Copyright © Prime Dimensions 2013 All rights reserved.

3
Healthcare in the 21st Century
“The transformational force that has brought affordability and accessibility to
other industries is disruptive innovation. Today's health-care industry screams for
disruption. Politicians are consumed with how we can afford health care. But
disruption solves the more fundamental question: How do we make health care
affordable?”
Clayton M. Christensen, The Innovator's Prescription: A Disruptive Solution for
Health Care
National Challenges
 Escalating Costs
 Dwindling Budgets
 More Oversight and Scrutiny
 Expectation of Transparency and Accountability
 New Laws, Regulations and Policies
 Focus on Performance

Transformational Opportunities
 Radical innovation
 Go beyond measuring outcomes to changing them
 Aggregate and/or discover data in ways that were not possible until recently
 Visualize problems in human-friendly formats

Copyright © Prime Dimensions 2013 All rights reserved.

4
Current U.S. Healthcare Situation

Copyright © Prime Dimensions 2013 All rights reserved.

5
Medicare Accounts for Over 20% of Total U.S.
Healthcare Spending in 2010

Note: Other health insurance
programs include DoD and VA

Source: Medicare Payment Advisory Commission, June 2012 Data Book

Copyright © Prime Dimensions 2013 All rights reserved.

6
Healthcare Spending as a Percentage of GDP
In 2010, total U.S. healthcare spending was $2.6 trillion, representing 18% of
Gross Domestic Product and doubling over the past 30 years. If this trend
continues, estimates are that healthcare spending will reach a staggering
$5 trillion, or 20% of GDP, by 2021. Moreover, patients’ out-of-pocket costs
have doubled over the past 10 years.
Source: Medicare Payment Advisory Commission, June 2012 Data Book
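As a quick sanity check on these figures, the implied growth rate and GDP base can be recomputed directly from the numbers quoted above; this sketch only restates the slide's arithmetic.

```python
# Sanity-check the spending projection quoted on this slide.
spend_2010, spend_2021, years = 2.6e12, 5.0e12, 11   # dollars, 2010 -> 2021

implied_cagr = (spend_2021 / spend_2010) ** (1 / years) - 1
print(f"Implied average annual growth: {implied_cagr:.1%}")   # roughly 6% per year

gdp_2010 = spend_2010 / 0.18                          # $2.6T stated as 18% of GDP
print(f"Implied 2010 GDP: ${gdp_2010 / 1e12:.1f} trillion")   # about $14.4 trillion
```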

Copyright © Prime Dimensions 2013 All rights reserved.

7
Transforming the Healthcare Ecosystem

Copyright © Prime Dimensions 2013 All rights reserved.

8
Agenda
 Challenges & Opportunities
 Payment Reform
 Big Data Enabling Technologies
 Big Data Roadmap

Copyright © Prime Dimensions 2013 All rights reserved.

9
The DNA of Healthcare Data

Copyright © Prime Dimensions 2013 All rights reserved.

10
Unsustainable Fee-for-Service (FFS)
Payment Model
 Current FFS payment structure results in redundant testing, medical errors,
and over-utilization of the system
 Maximizes providers’ fees and reimbursements based on volume
 Incentivizes multiple tests and procedures, regardless of necessity
 Limited coordination of care across the provider network
 Medical errors and failed procedures result in higher fees for providers

 Data overload: vital health information assets are not being leveraged
 96% of Medicare dollars go toward treating chronic illnesses, with only
3-5% spent on prevention (Source: CMS, Chronic Conditions Chartbook, 2012)

Research indicates that payment reform would reduce spending by
$1.33 trillion by 2023 and significantly improve quality and outcomes.
(Source: The Commonwealth Fund Commission on a High Performance Health System, 1/13)

Copyright © Prime Dimensions 2013 All rights reserved.

11
Four Major Payment Reform Models
Global Payments
• A single PMPM payment is made for all services delivered to a patient, with
payment adjustments based on measured performance and patient risk.

Bundled Payments
• A single “bundled” payment, which may include multiple providers in multiple
care settings, is made for services delivered during an episode of care.

Patient-centered Medical Home
• A physician practice or other provider is eligible to receive additional
payments if medical home criteria are met, based on quality and cost performance.

Accountable Care Organizations
• Groups of providers that voluntarily assume responsibility for the care of a
population of patients share payer savings if they meet quality and cost targets.

(A simplified numeric sketch of a risk- and performance-adjusted global PMPM
payment follows.)

Copyright © Prime Dimensions 2013 All rights reserved.

12
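To make the global-payment mechanics above concrete, here is a minimal numeric sketch of a risk- and performance-adjusted PMPM payment. The base rate, risk score, and the +/-5% performance corridor are illustrative assumptions, not actual payer or MSSP parameters.

```python
# Hypothetical illustration of a global (PMPM) payment with risk and
# performance adjustments. All rates and factors are made-up examples.

def global_pmpm_payment(base_pmpm, risk_score, quality_score, members, months=12):
    """Annual global payment for a panel of members.

    base_pmpm     -- contracted per-member-per-month rate (dollars)
    risk_score    -- average patient risk (1.0 = average acuity)
    quality_score -- measured performance in [0, 1], scaling an assumed +/-5% corridor
    """
    performance_adj = 0.95 + 0.10 * quality_score      # 0.95x to 1.05x
    return base_pmpm * risk_score * performance_adj * members * months

if __name__ == "__main__":
    payment = global_pmpm_payment(base_pmpm=450.0, risk_score=1.08,
                                  quality_score=0.82, members=5_000)
    print(f"Annual global payment: ${payment:,.0f}")
```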
ACO Implementation Process

Copyright © Prime Dimensions 2013 All rights reserved.

13
Solution
 We selected three CMS programs for analytics: Accountable Care

Organizations (ACO), Bundled Payments, and Patient Centered
Medical Home.
 The focus of Phase 1 of the prototype is on ACOs.

Sources:
Bundled Payments for Care Improvement (BPCI) Initiative: General Information, http://innovation.cms.gov/initiatives/bundled-payments/

Note: Against this backdrop, CMS has launched new delivery and payment initiatives aimed at improving patient outcomes and quality while
containing costs. We selected the following programs for this business case: Bundled Payments, Accountable Care Organizations, and
Patient Centered Medical Home.

Copyright © Prime Dimensions 2013 All rights reserved.

14
Payment Reform Solution
 Helps healthcare payers find the optimal cost, revenue, and provider
incentive models to improve health outcomes and contain patient costs
 Embeds the business rules and algorithms of the Medicare Shared Savings
Program (MSSP) Accountable Care Organization (ACO) model; a simplified
shared-savings sketch follows the feature list below
 Includes the following features and capabilities:
 ACO benchmarks and budget based on historical cost baseline, trend
estimates and risk adjustments
 Performance monitoring of key measures and metrics related to cost,
utilization and quality
 Predictive modeling to determine the proper mix of inputs to
maximize payment incentives
 Dynamic dashboards and visualizations to perform trade-off analysis
and scenario planning
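As an illustration of the benchmark-versus-actual logic such a solution encodes, here is a simplified shared-savings sketch. The trend factor, minimum savings rate, and sharing rate are simplified placeholders, not the full MSSP methodology.

```python
# Simplified shared-savings illustration (not the full MSSP methodology).
# Benchmark = trended, risk-adjusted historical per-capita cost x beneficiaries.

def shared_savings(historical_pc_cost, trend, risk_adj, actual_spend,
                   beneficiaries, sharing_rate=0.5, min_savings_rate=0.02):
    """Return the ACO's shared-savings payment (0 if savings miss the threshold)."""
    benchmark = historical_pc_cost * trend * risk_adj * beneficiaries
    savings = benchmark - actual_spend
    if savings < min_savings_rate * benchmark:   # must clear the minimum savings rate
        return 0.0
    return sharing_rate * savings                # payer and ACO split the savings

if __name__ == "__main__":
    payment = shared_savings(historical_pc_cost=10_200, trend=1.03, risk_adj=1.01,
                             actual_spend=100_500_000, beneficiaries=10_000)
    print(f"Shared-savings payment to ACO: ${payment:,.0f}")
```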

Copyright © Prime Dimensions 2013 All rights reserved.

15
Methodology and Scope
 The team acquired and loaded public datasets from
CMS and related sources (a loading sketch using the
synthetic public use file follows the data source list below)
 Phase 1 scope:
 In-patient Medicare data for 2010
 Sampling of 5 metropolitan areas representing a range
of regions and population sizes (CA, FL, MA, OH, GA)
 Diabetes, hypertension, coronary disease, and heart
failure measures
 Readmissions and ER utilization
 Created scenarios for comparative analysis and
benchmarking
 Population, cost, clinical and utilization data
 Analytical models and predictive modeling
 Seeking input from SMEs
 Applying industry best practices
 Assessing technical solutions

Data Sources
1. Data Entrepreneurs’ Synthetic Public Use
File
2. Institutional Provider & Beneficiary
Summary
3. Data for ACO Applicant Share Calculations
4. CMS Statistics reference booklet for the
Medicare and Medicaid health insurance
programs
5. National Health Expenditure Accounts
(health care goods and services, public
health activities, government
administration, the net cost of health
insurance, and investment related to health
care)
6. Premier Hospital Quality Incentive
Demonstration (expanding the information
available about quality of care and through
direct incentives to reward the delivery of
superior quality care)
7. Hospital Compare (including 27 quality
measures of clinical process of care and
outcomes)
8. Assorted files from the CMS Download
Database
9. The Dartmouth Atlas of Health Care
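As a concrete example of the acquisition step, here is a minimal sketch that loads an inpatient claims extract from the DE-SynPUF (data source 1 above) and summarizes payments per beneficiary. The file name and column names (DESYNPUF_ID, CLM_PMT_AMT) follow the published SynPUF layout but should be treated as assumptions to verify against the actual download.

```python
# Minimal sketch: load a DE-SynPUF inpatient claims sample and summarize
# Medicare payments per beneficiary. File path and column names are assumed
# from the published SynPUF layout -- verify against your download.
import pandas as pd

claims = pd.read_csv("DE1_0_2010_Inpatient_Claims_Sample_1.csv",
                     usecols=["DESYNPUF_ID", "CLM_PMT_AMT"])

per_beneficiary = (claims.groupby("DESYNPUF_ID")["CLM_PMT_AMT"]
                         .agg(total_paid="sum", admissions="count")
                         .reset_index())

print(per_beneficiary.describe())   # distribution of inpatient cost and utilization
print("Mean inpatient cost per beneficiary:",
      round(per_beneficiary["total_paid"].mean(), 2))
```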

Copyright © Prime Dimensions 2013 All rights reserved.

16
Agenda
 Challenges & Opportunities
 Payment Reform
 Big Data Enabling Technologies
 Big Data Roadmap

Copyright © Prime Dimensions 2013 All rights reserved.

17
Big Data Enabling Technologies
 New systems that handle a wide variety of data, from sensor data to web and
social media data
 Improved analytical capabilities (sometimes called advanced analytics),
including event, predictive and text analytics
 Operational business intelligence that improves business agility by enabling
automated real-time actions and better decision making
 Faster hardware, ranging from faster multi-core processors and large memory
spaces to solid-state drives and virtual data storage
 Cloud computing, including on-demand software-as-a-service (SaaS) and
platform-as-a-service (PaaS) analytical solutions in public and private clouds

Big Data requires a new generation of scalable technologies designed to extract
meaning from very large volumes of disparate, multi-structured data by enabling
high velocity capture, discovery, and analysis.

 Massively Parallel Processing
 Hadoop and MapReduce
 NoSQL Databases
 In-Memory Technology

Copyright © Prime Dimensions 2013 All rights reserved.

18
Massively Parallel Processing

 Massively parallel processing (MPP) is the coordinated processing of a program by
multiple processors that work on different parts of the program, with each processor
using its own operating system and memory
 Known as “loosely coupled” or “shared-nothing” systems
 Enables “analytic offload”
 Permits parallelized data loading, i.e., E-L-T vs. E-T-L (see the sketch below)
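To illustrate the E-L-T pattern, here is a minimal single-node sketch: raw records are loaded first, and the transformation runs as SQL inside the database. SQLite stands in for an MPP analytic database purely for illustration; in a real MPP system both the load and the SQL would be distributed across nodes, and the table and column names here are assumptions.

```python
# Minimal E-L-T sketch: load raw rows first, then transform inside the database.
import sqlite3

raw_claims = [
    ("B001", "2010-03-01", 12400.0),
    ("B001", "2010-07-15",  3100.0),
    ("B002", "2010-01-20",  8900.0),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_claims (bene_id TEXT, admit_date TEXT, paid REAL)")

# Extract-Load: bulk-load the raw data as-is (no upfront transformation)
conn.executemany("INSERT INTO raw_claims VALUES (?, ?, ?)", raw_claims)

# Transform: run the shaping/aggregation inside the database engine
conn.execute("""
    CREATE TABLE claims_by_beneficiary AS
    SELECT bene_id, COUNT(*) AS admissions, SUM(paid) AS total_paid
    FROM raw_claims
    GROUP BY bene_id
""")

for row in conn.execute("SELECT * FROM claims_by_beneficiary ORDER BY bene_id"):
    print(row)
```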

Copyright © Prime Dimensions 2013 All rights reserved.

19
The Hadoop Ecosystem
[Diagram: a Hadoop compute cluster ingesting multi-structured source data. The
master node runs the NameNode and JobTracker; each slave node runs a DataNode and
TaskTracker. The NameNode and DataNodes form the data layer (HDFS), the MapReduce
engine and YARN form the workload management layer, and an application layer sits
on top.]
Copyright © Prime Dimensions 2013 All rights reserved.

20
The Hadoop Ecosystem
[Diagram: the same compute cluster viewed at the workload management layer under
YARN (MapReduce 2.0). The cluster adds a ResourceManager and per-application
ApplicationMasters, and each slave node runs a NodeManager alongside its DataNode
and TaskTracker. A minimal MapReduce example follows.]
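To ground the MapReduce layer, here is a minimal MapReduce-style job written with Hadoop Streaming conventions but run locally: the mapper emits a (state, payment) pair per claim and the reducer totals payments by state. The comma-separated input format is an illustrative assumption; in a real Streaming job the mapper and reducer would be separate scripts reading stdin.

```python
# Minimal MapReduce-style example (Hadoop Streaming conventions, simulated locally).
# Mapper emits (state, payment) per claim line; reducer sums payments by state.
# Assumed input line format for illustration: claim_id,state,payment
from itertools import groupby

def mapper(lines):
    for line in lines:
        claim_id, state, payment = line.strip().split(",")
        yield state, float(payment)

def reducer(pairs):
    # Hadoop sorts mapper output by key before the reduce phase; emulate that here.
    for state, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield state, sum(payment for _, payment in group)

if __name__ == "__main__":
    sample = ["c1,CA,12400.0", "c2,FL,8900.0", "c3,CA,3100.0"]
    for state, total in reducer(mapper(sample)):
        print(f"{state}\t{total:.2f}")
```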
Copyright © Prime Dimensions 2013 All rights reserved.

21
NoSQL (Not only SQL) Databases
 Like HDFS, NoSQL databases have a distributed, schema-less data structure
and scale out across low-cost commodity hardware
 NoSQL databases are not reliant on a relational, fixed data model or
Structured Query Language (SQL)
 Most NoSQL databases are not ACID-compliant and lack the strict data integrity
guarantees of relational databases
 Instead, NoSQL databases have BASE properties: Basically Available, Soft
state, Eventual consistency
 Many NoSQL databases are queried via REST/JSON APIs (see the sketch below)

Key-Value → Columnar → Document/Object → Graph (increasing data complexity)
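To make the REST access pattern concrete, here is a minimal sketch of fetching a document from a document-oriented store over HTTP. The endpoint follows CouchDB-style conventions (GET /<db>/<doc_id>); the host, database name, and document fields are illustrative assumptions.

```python
# Minimal sketch of querying a document-oriented NoSQL store via its REST/JSON API.
# Endpoint, database name, and document fields are assumed for illustration.
import requests

BASE_URL = "http://localhost:5984/patients"      # assumed CouchDB-style database

def get_patient(doc_id):
    """Fetch one patient document as a Python dict."""
    resp = requests.get(f"{BASE_URL}/{doc_id}", timeout=5)
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    doc = get_patient("bene-0001")
    print(doc.get("conditions", []), doc.get("total_paid"))
```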

Copyright © Prime Dimensions 2013 All rights reserved.

22
In-memory Database

 An in-memory database system (IMDS) stores data entirely in main memory
(sometimes on NVDIMMs), as opposed to disk-based storage
 Storing and retrieving data in memory is much faster than writing to and reading from disk
 Reduces data access latency caused by I/O operations, significantly speeding query and
response times
 This trend toward in-memory is made possible by low memory prices and multi-core,
64-bit processing capabilities
 Data stored entirely in-memory is immediately available for analytic processing
(see the sketch below)
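As a toy illustration of the in-memory point, the sketch below runs the same aggregate query against an in-memory and a disk-backed SQLite database. SQLite is only a stand-in here, and with a small dataset the gap is modest because the operating system caches the file; the point is simply where the data lives.

```python
# Toy comparison: identical aggregation against in-memory vs. disk-backed storage.
import sqlite3, time, os, random

DISK_DB = "claims_demo.db"
if os.path.exists(DISK_DB):
    os.remove(DISK_DB)

def build_and_query(target):
    conn = sqlite3.connect(target)
    conn.execute("CREATE TABLE claims (bene_id INTEGER, paid REAL)")
    rows = [(random.randint(1, 10_000), random.uniform(100, 20_000))
            for _ in range(200_000)]
    conn.executemany("INSERT INTO claims VALUES (?, ?)", rows)
    conn.commit()
    start = time.perf_counter()
    conn.execute("SELECT bene_id, SUM(paid) FROM claims GROUP BY bene_id").fetchall()
    elapsed = time.perf_counter() - start
    conn.close()
    return elapsed

print("in-memory query:", round(build_and_query(":memory:"), 4), "s")
print("on-disk query:  ", round(build_and_query(DISK_DB), 4), "s")
os.remove(DISK_DB)
```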

Copyright © Prime Dimensions 2013 All rights reserved.

23
Database Options

[Diagram: analytic, relational, and NoSQL database options positioned along two
axes: data velocity (scale up) and data variety (scale out).]
Copyright © Prime Dimensions 2013 All rights reserved.

24
Agenda
 Challenges & Opportunities
 Payment Reform
 Solution Prototype
 Big Data Enabling Technologies

 Big Data Roadmap

Copyright © Prime Dimensions 2013 All rights reserved.

25
Our roadmap reduces risk while progressing towards a
unified analytics environment
[Diagram: the roadmap progresses from an Information Management Health Assessor,
through cycles of design thinking & prototyping and a Mission Oriented Analytics
Framework, to a Unified Analytics Environment, built on an evolving on-demand
infrastructure, commodity IT, and open source tools. External drivers include
changes to laws (e.g., ACA), permanent budget pressure, smarter and more demanding
data consumers and partners, and continuous improvement & innovation.]

Copyright © Prime Dimensions 2013 All rights reserved.

26
Your Brain on Big Data
Query & Reporting

Discovery & Engagement

Copyright © Prime Dimensions 2013 All rights reserved.

27
Growth of Data Exploding

 Eric Schmidt: “Every 2 days we create as much information as we did up to 2003.”
 Relational databases cannot effectively ingest, store and query the increased
volume, variety and velocity of Big Data

Copyright © Prime Dimensions 2013 All rights reserved.

28
Begin with the End in Mind

Copyright © Prime Dimensions 2013 All rights reserved.

29
Cycles of Design Thinking and Prototyping

Copyright © Prime Dimensions 2013 All rights reserved.

30
Mission Oriented Analytics Framework

Copyright © Prime Dimensions 2013 All rights reserved.

31
Evolving On-Demand Infrastructure for Big Data
Advanced Analytics

[Diagram: an evolving on-demand infrastructure for Big Data advanced analytics.
Multi-structured and streaming source data land in a Hadoop/YARN cluster (with
Storm, Tez, HCatalog and a NoSQL database) via E-L-T, enabling analytic offload
and scale-up/scale-out; structured source data feed a data warehouse via E-T-L
for OLAP. An in-memory, columnar analytic database, data discovery tools,
dashboards & visualizations, and analytic applications consume both sides through
REST/JSON APIs, providing data warehouse augmentation.]

Copyright © Prime Dimensions 2013 All rights reserved.

32
Unified Analytics Environment

Copyright © Prime Dimensions 2013 All rights reserved.

33
Data Management | Business Intelligence | Advanced Analytics

Copyright © Prime Dimensions 2013 All rights reserved.

34
Questions?
Our Contact Information
Michael Joseph
Managing Partner
Direct: 703.861.9897
Email: mjoseph@primedimensions.com

Richard Rowan
Managing Director
Direct: 703.201.2641
Email: rrowan@primedimensions.com

Prime Dimensions, LLC
www.primedimensions.com
Data Management | Business Intelligence | Advanced Analytics
Follow us @primedimensions

Copyright © Prime Dimensions 2013 All rights reserved.

35

More Related Content

What's hot

The Foundations of Success in Population Health Management
The Foundations of Success in Population Health ManagementThe Foundations of Success in Population Health Management
The Foundations of Success in Population Health ManagementHealth Catalyst
 
Why Accurate Financial Data is Critical for Successful Value Transformation
Why Accurate Financial Data is Critical for Successful Value TransformationWhy Accurate Financial Data is Critical for Successful Value Transformation
Why Accurate Financial Data is Critical for Successful Value TransformationHealth Catalyst
 
Tackle These 8 Challenges of MACRA Quality Measures
Tackle These 8 Challenges of MACRA Quality MeasuresTackle These 8 Challenges of MACRA Quality Measures
Tackle These 8 Challenges of MACRA Quality MeasuresHealth Catalyst
 
Why Healthcare Costing Matters to Enable Strategy and Financial Performance
Why Healthcare Costing Matters to Enable Strategy and Financial PerformanceWhy Healthcare Costing Matters to Enable Strategy and Financial Performance
Why Healthcare Costing Matters to Enable Strategy and Financial PerformanceHealth Catalyst
 
Healthcare Reform Initiatives Affecting Physician Compensation
Healthcare Reform Initiatives Affecting Physician CompensationHealthcare Reform Initiatives Affecting Physician Compensation
Healthcare Reform Initiatives Affecting Physician CompensationPYA, P.C.
 
Platforms and Partnerships: The Building Blocks for Digital Innovation
Platforms and Partnerships: The Building Blocks for Digital InnovationPlatforms and Partnerships: The Building Blocks for Digital Innovation
Platforms and Partnerships: The Building Blocks for Digital InnovationHealth Catalyst
 
Population Health Management: Path to Value
Population Health Management: Path to ValuePopulation Health Management: Path to Value
Population Health Management: Path to ValueHealth Catalyst
 
Population Stratification Made Easy, Quick, and Transparent for Anyone
Population Stratification Made Easy, Quick, and Transparent for AnyonePopulation Stratification Made Easy, Quick, and Transparent for Anyone
Population Stratification Made Easy, Quick, and Transparent for AnyoneHealth Catalyst
 
Webinar Deck: The Changing Face of IT Outsourcing in the Healthcare Payer Mar...
Webinar Deck: The Changing Face of IT Outsourcing in the Healthcare Payer Mar...Webinar Deck: The Changing Face of IT Outsourcing in the Healthcare Payer Mar...
Webinar Deck: The Changing Face of IT Outsourcing in the Healthcare Payer Mar...Everest Group
 
Activity-Based Costing in Healthcare During COVID-19: Meeting Four Critical N...
Activity-Based Costing in Healthcare During COVID-19: Meeting Four Critical N...Activity-Based Costing in Healthcare During COVID-19: Meeting Four Critical N...
Activity-Based Costing in Healthcare During COVID-19: Meeting Four Critical N...Health Catalyst
 
The Top Seven Healthcare Outcome Measures and Three Measurement Essentials
The Top Seven Healthcare Outcome Measures and Three Measurement EssentialsThe Top Seven Healthcare Outcome Measures and Three Measurement Essentials
The Top Seven Healthcare Outcome Measures and Three Measurement EssentialsHealth Catalyst
 
Healthcare Consumerism and Cost: Dispelling the Myth of Price Transparency
Healthcare Consumerism and Cost: Dispelling the Myth of Price TransparencyHealthcare Consumerism and Cost: Dispelling the Myth of Price Transparency
Healthcare Consumerism and Cost: Dispelling the Myth of Price TransparencyHealth Catalyst
 
Optimize Your Labor Management with Health Catalyst PowerLabor™
Optimize Your Labor Management with Health Catalyst PowerLabor™Optimize Your Labor Management with Health Catalyst PowerLabor™
Optimize Your Labor Management with Health Catalyst PowerLabor™Health Catalyst
 
Why Health Systems Must Use Data Science to Improve Outcomes
Why Health Systems Must Use Data Science to Improve OutcomesWhy Health Systems Must Use Data Science to Improve Outcomes
Why Health Systems Must Use Data Science to Improve OutcomesHealth Catalyst
 
A Reference Architecture for Digital Health: The Health Catalyst Data Operati...
A Reference Architecture for Digital Health: The Health Catalyst Data Operati...A Reference Architecture for Digital Health: The Health Catalyst Data Operati...
A Reference Architecture for Digital Health: The Health Catalyst Data Operati...Health Catalyst
 
The Doctor’s Orders for Engaging Physicians to Drive Improvements
The Doctor’s Orders for Engaging Physicians to Drive ImprovementsThe Doctor’s Orders for Engaging Physicians to Drive Improvements
The Doctor’s Orders for Engaging Physicians to Drive ImprovementsHealth Catalyst
 
The Data Maze: Navigating the Complexities of Data Governance
The Data Maze: Navigating the Complexities of Data GovernanceThe Data Maze: Navigating the Complexities of Data Governance
The Data Maze: Navigating the Complexities of Data GovernanceHealth Catalyst
 
Todd Berner: Assessment of Payer ACOs: Industry's Role
Todd Berner: Assessment of Payer ACOs: Industry's RoleTodd Berner: Assessment of Payer ACOs: Industry's Role
Todd Berner: Assessment of Payer ACOs: Industry's RoleTodd Berner MD
 
Against the Odds: How this Small Community Hospital Used Six Strategies to Su...
Against the Odds: How this Small Community Hospital Used Six Strategies to Su...Against the Odds: How this Small Community Hospital Used Six Strategies to Su...
Against the Odds: How this Small Community Hospital Used Six Strategies to Su...Health Catalyst
 

What's hot (20)

The Foundations of Success in Population Health Management
The Foundations of Success in Population Health ManagementThe Foundations of Success in Population Health Management
The Foundations of Success in Population Health Management
 
Why Accurate Financial Data is Critical for Successful Value Transformation
Why Accurate Financial Data is Critical for Successful Value TransformationWhy Accurate Financial Data is Critical for Successful Value Transformation
Why Accurate Financial Data is Critical for Successful Value Transformation
 
Tackle These 8 Challenges of MACRA Quality Measures
Tackle These 8 Challenges of MACRA Quality MeasuresTackle These 8 Challenges of MACRA Quality Measures
Tackle These 8 Challenges of MACRA Quality Measures
 
Why Healthcare Costing Matters to Enable Strategy and Financial Performance
Why Healthcare Costing Matters to Enable Strategy and Financial PerformanceWhy Healthcare Costing Matters to Enable Strategy and Financial Performance
Why Healthcare Costing Matters to Enable Strategy and Financial Performance
 
Healthcare Reform Initiatives Affecting Physician Compensation
Healthcare Reform Initiatives Affecting Physician CompensationHealthcare Reform Initiatives Affecting Physician Compensation
Healthcare Reform Initiatives Affecting Physician Compensation
 
Platforms and Partnerships: The Building Blocks for Digital Innovation
Platforms and Partnerships: The Building Blocks for Digital InnovationPlatforms and Partnerships: The Building Blocks for Digital Innovation
Platforms and Partnerships: The Building Blocks for Digital Innovation
 
Population Health Management: Path to Value
Population Health Management: Path to ValuePopulation Health Management: Path to Value
Population Health Management: Path to Value
 
Making Sense of MACRA
Making Sense of MACRAMaking Sense of MACRA
Making Sense of MACRA
 
Population Stratification Made Easy, Quick, and Transparent for Anyone
Population Stratification Made Easy, Quick, and Transparent for AnyonePopulation Stratification Made Easy, Quick, and Transparent for Anyone
Population Stratification Made Easy, Quick, and Transparent for Anyone
 
Webinar Deck: The Changing Face of IT Outsourcing in the Healthcare Payer Mar...
Webinar Deck: The Changing Face of IT Outsourcing in the Healthcare Payer Mar...Webinar Deck: The Changing Face of IT Outsourcing in the Healthcare Payer Mar...
Webinar Deck: The Changing Face of IT Outsourcing in the Healthcare Payer Mar...
 
Activity-Based Costing in Healthcare During COVID-19: Meeting Four Critical N...
Activity-Based Costing in Healthcare During COVID-19: Meeting Four Critical N...Activity-Based Costing in Healthcare During COVID-19: Meeting Four Critical N...
Activity-Based Costing in Healthcare During COVID-19: Meeting Four Critical N...
 
The Top Seven Healthcare Outcome Measures and Three Measurement Essentials
The Top Seven Healthcare Outcome Measures and Three Measurement EssentialsThe Top Seven Healthcare Outcome Measures and Three Measurement Essentials
The Top Seven Healthcare Outcome Measures and Three Measurement Essentials
 
Healthcare Consumerism and Cost: Dispelling the Myth of Price Transparency
Healthcare Consumerism and Cost: Dispelling the Myth of Price TransparencyHealthcare Consumerism and Cost: Dispelling the Myth of Price Transparency
Healthcare Consumerism and Cost: Dispelling the Myth of Price Transparency
 
Optimize Your Labor Management with Health Catalyst PowerLabor™
Optimize Your Labor Management with Health Catalyst PowerLabor™Optimize Your Labor Management with Health Catalyst PowerLabor™
Optimize Your Labor Management with Health Catalyst PowerLabor™
 
Why Health Systems Must Use Data Science to Improve Outcomes
Why Health Systems Must Use Data Science to Improve OutcomesWhy Health Systems Must Use Data Science to Improve Outcomes
Why Health Systems Must Use Data Science to Improve Outcomes
 
A Reference Architecture for Digital Health: The Health Catalyst Data Operati...
A Reference Architecture for Digital Health: The Health Catalyst Data Operati...A Reference Architecture for Digital Health: The Health Catalyst Data Operati...
A Reference Architecture for Digital Health: The Health Catalyst Data Operati...
 
The Doctor’s Orders for Engaging Physicians to Drive Improvements
The Doctor’s Orders for Engaging Physicians to Drive ImprovementsThe Doctor’s Orders for Engaging Physicians to Drive Improvements
The Doctor’s Orders for Engaging Physicians to Drive Improvements
 
The Data Maze: Navigating the Complexities of Data Governance
The Data Maze: Navigating the Complexities of Data GovernanceThe Data Maze: Navigating the Complexities of Data Governance
The Data Maze: Navigating the Complexities of Data Governance
 
Todd Berner: Assessment of Payer ACOs: Industry's Role
Todd Berner: Assessment of Payer ACOs: Industry's RoleTodd Berner: Assessment of Payer ACOs: Industry's Role
Todd Berner: Assessment of Payer ACOs: Industry's Role
 
Against the Odds: How this Small Community Hospital Used Six Strategies to Su...
Against the Odds: How this Small Community Hospital Used Six Strategies to Su...Against the Odds: How this Small Community Hospital Used Six Strategies to Su...
Against the Odds: How this Small Community Hospital Used Six Strategies to Su...
 

Viewers also liked

Quality and Value-based Healthcare India Presentation
Quality and Value-based Healthcare India PresentationQuality and Value-based Healthcare India Presentation
Quality and Value-based Healthcare India PresentationJoseph Britto
 
Software Tools to Improve Your Workflow and Project Management
Software Tools to Improve Your Workflow and Project ManagementSoftware Tools to Improve Your Workflow and Project Management
Software Tools to Improve Your Workflow and Project ManagementDAHLIA+
 
JODEE R OLSON Resume 5
JODEE R OLSON Resume 5JODEE R OLSON Resume 5
JODEE R OLSON Resume 5Jodee Olson
 
EHR Integration: The Decision to Build or Buy
EHR Integration: The Decision to Build or BuyEHR Integration: The Decision to Build or Buy
EHR Integration: The Decision to Build or BuyRedox Engine
 
Healthcare Interface Engine Selection
Healthcare Interface Engine SelectionHealthcare Interface Engine Selection
Healthcare Interface Engine SelectionChad Johnson
 
Integration in the Age of Value-Based Care: Webinar Slides
Integration in the Age of Value-Based Care: Webinar SlidesIntegration in the Age of Value-Based Care: Webinar Slides
Integration in the Age of Value-Based Care: Webinar SlidesRedox Engine
 
Drive Healthcare Transformation with a Strategic Analytics Framework and Impl...
Drive Healthcare Transformation with a Strategic Analytics Framework and Impl...Drive Healthcare Transformation with a Strategic Analytics Framework and Impl...
Drive Healthcare Transformation with a Strategic Analytics Framework and Impl...Frank Wang
 
Integrating PRO Solutions with Health System EHRs
Integrating PRO Solutions with Health System EHRsIntegrating PRO Solutions with Health System EHRs
Integrating PRO Solutions with Health System EHRsRedox Engine
 
Value Based Purchasing for Back Pain
Value Based Purchasing for Back PainValue Based Purchasing for Back Pain
Value Based Purchasing for Back PainSelena Horner
 
Developing a Strategic Analytics Framework that Drives Healthcare Transformation
Developing a Strategic Analytics Framework that Drives Healthcare TransformationDeveloping a Strategic Analytics Framework that Drives Healthcare Transformation
Developing a Strategic Analytics Framework that Drives Healthcare TransformationTrevor Strome
 
Surviving Value-Based Purchasing in Healthcare: Connecting Your Clinical and ...
Surviving Value-Based Purchasing in Healthcare: Connecting Your Clinical and ...Surviving Value-Based Purchasing in Healthcare: Connecting Your Clinical and ...
Surviving Value-Based Purchasing in Healthcare: Connecting Your Clinical and ...Health Catalyst
 
The Formula for Optimizing the Value-Based Healthcare Equation
The Formula for Optimizing the Value-Based Healthcare EquationThe Formula for Optimizing the Value-Based Healthcare Equation
The Formula for Optimizing the Value-Based Healthcare EquationHealth Catalyst
 
7 Essential Practices for Data Governance in Healthcare
7 Essential Practices for Data Governance in Healthcare7 Essential Practices for Data Governance in Healthcare
7 Essential Practices for Data Governance in HealthcareHealth Catalyst
 
How to Build a Rock-Solid Analytics and Business Intelligence Strategy
How to Build a Rock-Solid Analytics and Business Intelligence StrategyHow to Build a Rock-Solid Analytics and Business Intelligence Strategy
How to Build a Rock-Solid Analytics and Business Intelligence StrategySAP Analytics
 
Healthcare Analytics Adoption Model -- Updated
Healthcare Analytics Adoption Model -- UpdatedHealthcare Analytics Adoption Model -- Updated
Healthcare Analytics Adoption Model -- UpdatedHealth Catalyst
 
6 Steps for Implementing Successful Performance Improvement Initiatives in He...
6 Steps for Implementing Successful Performance Improvement Initiatives in He...6 Steps for Implementing Successful Performance Improvement Initiatives in He...
6 Steps for Implementing Successful Performance Improvement Initiatives in He...Health Catalyst
 
The Key to Transitioning from Fee-for-Service to Value-Based Reimbursements
The Key to Transitioning from Fee-for-Service to Value-Based ReimbursementsThe Key to Transitioning from Fee-for-Service to Value-Based Reimbursements
The Key to Transitioning from Fee-for-Service to Value-Based ReimbursementsHealth Catalyst
 
Data Analytics Strategy
Data Analytics StrategyData Analytics Strategy
Data Analytics StrategyeHealthCareers
 

Viewers also liked (19)

Quality and Value-based Healthcare India Presentation
Quality and Value-based Healthcare India PresentationQuality and Value-based Healthcare India Presentation
Quality and Value-based Healthcare India Presentation
 
Software Tools to Improve Your Workflow and Project Management
Software Tools to Improve Your Workflow and Project ManagementSoftware Tools to Improve Your Workflow and Project Management
Software Tools to Improve Your Workflow and Project Management
 
JODEE R OLSON Resume 5
JODEE R OLSON Resume 5JODEE R OLSON Resume 5
JODEE R OLSON Resume 5
 
EHR Integration: The Decision to Build or Buy
EHR Integration: The Decision to Build or BuyEHR Integration: The Decision to Build or Buy
EHR Integration: The Decision to Build or Buy
 
Healthcare Interface Engine Selection
Healthcare Interface Engine SelectionHealthcare Interface Engine Selection
Healthcare Interface Engine Selection
 
Integration in the Age of Value-Based Care: Webinar Slides
Integration in the Age of Value-Based Care: Webinar SlidesIntegration in the Age of Value-Based Care: Webinar Slides
Integration in the Age of Value-Based Care: Webinar Slides
 
Drive Healthcare Transformation with a Strategic Analytics Framework and Impl...
Drive Healthcare Transformation with a Strategic Analytics Framework and Impl...Drive Healthcare Transformation with a Strategic Analytics Framework and Impl...
Drive Healthcare Transformation with a Strategic Analytics Framework and Impl...
 
Redox Enterprise
Redox EnterpriseRedox Enterprise
Redox Enterprise
 
Integrating PRO Solutions with Health System EHRs
Integrating PRO Solutions with Health System EHRsIntegrating PRO Solutions with Health System EHRs
Integrating PRO Solutions with Health System EHRs
 
Value Based Purchasing for Back Pain
Value Based Purchasing for Back PainValue Based Purchasing for Back Pain
Value Based Purchasing for Back Pain
 
Developing a Strategic Analytics Framework that Drives Healthcare Transformation
Developing a Strategic Analytics Framework that Drives Healthcare TransformationDeveloping a Strategic Analytics Framework that Drives Healthcare Transformation
Developing a Strategic Analytics Framework that Drives Healthcare Transformation
 
Surviving Value-Based Purchasing in Healthcare: Connecting Your Clinical and ...
Surviving Value-Based Purchasing in Healthcare: Connecting Your Clinical and ...Surviving Value-Based Purchasing in Healthcare: Connecting Your Clinical and ...
Surviving Value-Based Purchasing in Healthcare: Connecting Your Clinical and ...
 
The Formula for Optimizing the Value-Based Healthcare Equation
The Formula for Optimizing the Value-Based Healthcare EquationThe Formula for Optimizing the Value-Based Healthcare Equation
The Formula for Optimizing the Value-Based Healthcare Equation
 
7 Essential Practices for Data Governance in Healthcare
7 Essential Practices for Data Governance in Healthcare7 Essential Practices for Data Governance in Healthcare
7 Essential Practices for Data Governance in Healthcare
 
How to Build a Rock-Solid Analytics and Business Intelligence Strategy
How to Build a Rock-Solid Analytics and Business Intelligence StrategyHow to Build a Rock-Solid Analytics and Business Intelligence Strategy
How to Build a Rock-Solid Analytics and Business Intelligence Strategy
 
Healthcare Analytics Adoption Model -- Updated
Healthcare Analytics Adoption Model -- UpdatedHealthcare Analytics Adoption Model -- Updated
Healthcare Analytics Adoption Model -- Updated
 
6 Steps for Implementing Successful Performance Improvement Initiatives in He...
6 Steps for Implementing Successful Performance Improvement Initiatives in He...6 Steps for Implementing Successful Performance Improvement Initiatives in He...
6 Steps for Implementing Successful Performance Improvement Initiatives in He...
 
The Key to Transitioning from Fee-for-Service to Value-Based Reimbursements
The Key to Transitioning from Fee-for-Service to Value-Based ReimbursementsThe Key to Transitioning from Fee-for-Service to Value-Based Reimbursements
The Key to Transitioning from Fee-for-Service to Value-Based Reimbursements
 
Data Analytics Strategy
Data Analytics StrategyData Analytics Strategy
Data Analytics Strategy
 

Similar to Using Advanced Analytics for Value-based Healthcare Delivery

OPY-150702_BigData_Healthcare_070915
OPY-150702_BigData_Healthcare_070915OPY-150702_BigData_Healthcare_070915
OPY-150702_BigData_Healthcare_070915Ravi Sripada
 
Big data -future_of_healthcare
Big data -future_of_healthcareBig data -future_of_healthcare
Big data -future_of_healthcarehealthitech
 
Data-driven Healthcare for Manufacturers
Data-driven Healthcare for ManufacturersData-driven Healthcare for Manufacturers
Data-driven Healthcare for ManufacturersLindaWatson19
 
Data-Driven Healthcare for Manufacturers
Data-Driven Healthcare for Manufacturers Data-Driven Healthcare for Manufacturers
Data-Driven Healthcare for Manufacturers Amit Mishra
 
Data-driven Healthcare for Payers
Data-driven Healthcare for PayersData-driven Healthcare for Payers
Data-driven Healthcare for PayersLindaWatson19
 
Copyright © 2017 Health CatalystWhite Paperby Steve Ba.docx
Copyright © 2017 Health CatalystWhite Paperby Steve Ba.docxCopyright © 2017 Health CatalystWhite Paperby Steve Ba.docx
Copyright © 2017 Health CatalystWhite Paperby Steve Ba.docxbobbywlane695641
 
Three Must-Haves for a Successful Healthcare Data Strategy
Three Must-Haves for a Successful Healthcare Data StrategyThree Must-Haves for a Successful Healthcare Data Strategy
Three Must-Haves for a Successful Healthcare Data StrategyHealth Catalyst
 
Harnessing the Power of Healthcare Data: Are We There Yet
Harnessing the Power of Healthcare Data: Are We There YetHarnessing the Power of Healthcare Data: Are We There Yet
Harnessing the Power of Healthcare Data: Are We There YetHealth Catalyst
 
Developing an adaptable and sustainable All Payer Database
Developing an adaptable and sustainable All Payer DatabaseDeveloping an adaptable and sustainable All Payer Database
Developing an adaptable and sustainable All Payer DatabaseRyan Hayden
 
CMS’ New Interoperability and Patient Access Proposed Rule - Top 5 Payer Impacts
CMS’ New Interoperability and Patient Access Proposed Rule - Top 5 Payer ImpactsCMS’ New Interoperability and Patient Access Proposed Rule - Top 5 Payer Impacts
CMS’ New Interoperability and Patient Access Proposed Rule - Top 5 Payer ImpactsCitiusTech
 
Life Sciences: Leveraging Customer Data for Commercial Success
Life Sciences: Leveraging Customer Data for Commercial SuccessLife Sciences: Leveraging Customer Data for Commercial Success
Life Sciences: Leveraging Customer Data for Commercial SuccessCognizant
 
Life Sciences: Leveraging Customer Data for Commercial Success
Life Sciences: Leveraging Customer Data for Commercial SuccessLife Sciences: Leveraging Customer Data for Commercial Success
Life Sciences: Leveraging Customer Data for Commercial SuccessCognizant
 
Big Data: Implications of Data Mining for Employed Physician Compliance Manag...
Big Data: Implications of Data Mining for Employed Physician Compliance Manag...Big Data: Implications of Data Mining for Employed Physician Compliance Manag...
Big Data: Implications of Data Mining for Employed Physician Compliance Manag...PYA, P.C.
 
The Digitization of Healthcare: Why the Right Approach Matters and Five Steps...
The Digitization of Healthcare: Why the Right Approach Matters and Five Steps...The Digitization of Healthcare: Why the Right Approach Matters and Five Steps...
The Digitization of Healthcare: Why the Right Approach Matters and Five Steps...Health Catalyst
 
Introducing The Next-Generation In Opportunity Analysis, Benchmarking, And Im...
Introducing The Next-Generation In Opportunity Analysis, Benchmarking, And Im...Introducing The Next-Generation In Opportunity Analysis, Benchmarking, And Im...
Introducing The Next-Generation In Opportunity Analysis, Benchmarking, And Im...Health Catalyst
 
2016 IBM Interconnect - medical devices transformation
2016 IBM Interconnect  - medical devices transformation2016 IBM Interconnect  - medical devices transformation
2016 IBM Interconnect - medical devices transformationElizabeth Koumpan
 
Predictive modeling healthcare
Predictive modeling healthcarePredictive modeling healthcare
Predictive modeling healthcareTaposh Roy
 

Similar to Using Advanced Analytics for Value-based Healthcare Delivery (20)

OPY-150702_BigData_Healthcare_070915
OPY-150702_BigData_Healthcare_070915OPY-150702_BigData_Healthcare_070915
OPY-150702_BigData_Healthcare_070915
 
Big data -future_of_healthcare
Big data -future_of_healthcareBig data -future_of_healthcare
Big data -future_of_healthcare
 
Data-driven Healthcare for Manufacturers
Data-driven Healthcare for ManufacturersData-driven Healthcare for Manufacturers
Data-driven Healthcare for Manufacturers
 
Data-Driven Healthcare for Manufacturers
Data-Driven Healthcare for Manufacturers Data-Driven Healthcare for Manufacturers
Data-Driven Healthcare for Manufacturers
 
Data-driven Healthcare for Payers
Data-driven Healthcare for PayersData-driven Healthcare for Payers
Data-driven Healthcare for Payers
 
ACO faq 111611
ACO faq 111611ACO faq 111611
ACO faq 111611
 
Copyright © 2017 Health CatalystWhite Paperby Steve Ba.docx
Copyright © 2017 Health CatalystWhite Paperby Steve Ba.docxCopyright © 2017 Health CatalystWhite Paperby Steve Ba.docx
Copyright © 2017 Health CatalystWhite Paperby Steve Ba.docx
 
Three Must-Haves for a Successful Healthcare Data Strategy
Three Must-Haves for a Successful Healthcare Data StrategyThree Must-Haves for a Successful Healthcare Data Strategy
Three Must-Haves for a Successful Healthcare Data Strategy
 
Harnessing the Power of Healthcare Data: Are We There Yet
Harnessing the Power of Healthcare Data: Are We There YetHarnessing the Power of Healthcare Data: Are We There Yet
Harnessing the Power of Healthcare Data: Are We There Yet
 
Developing an adaptable and sustainable All Payer Database
Developing an adaptable and sustainable All Payer DatabaseDeveloping an adaptable and sustainable All Payer Database
Developing an adaptable and sustainable All Payer Database
 
CMS’ New Interoperability and Patient Access Proposed Rule - Top 5 Payer Impacts
CMS’ New Interoperability and Patient Access Proposed Rule - Top 5 Payer ImpactsCMS’ New Interoperability and Patient Access Proposed Rule - Top 5 Payer Impacts
CMS’ New Interoperability and Patient Access Proposed Rule - Top 5 Payer Impacts
 
Life Sciences: Leveraging Customer Data for Commercial Success
Life Sciences: Leveraging Customer Data for Commercial SuccessLife Sciences: Leveraging Customer Data for Commercial Success
Life Sciences: Leveraging Customer Data for Commercial Success
 
Life Sciences: Leveraging Customer Data for Commercial Success
Life Sciences: Leveraging Customer Data for Commercial SuccessLife Sciences: Leveraging Customer Data for Commercial Success
Life Sciences: Leveraging Customer Data for Commercial Success
 
Big Data: Implications of Data Mining for Employed Physician Compliance Manag...
Big Data: Implications of Data Mining for Employed Physician Compliance Manag...Big Data: Implications of Data Mining for Employed Physician Compliance Manag...
Big Data: Implications of Data Mining for Employed Physician Compliance Manag...
 
The Digitization of Healthcare: Why the Right Approach Matters and Five Steps...
The Digitization of Healthcare: Why the Right Approach Matters and Five Steps...The Digitization of Healthcare: Why the Right Approach Matters and Five Steps...
The Digitization of Healthcare: Why the Right Approach Matters and Five Steps...
 
Introducing The Next-Generation In Opportunity Analysis, Benchmarking, And Im...
Introducing The Next-Generation In Opportunity Analysis, Benchmarking, And Im...Introducing The Next-Generation In Opportunity Analysis, Benchmarking, And Im...
Introducing The Next-Generation In Opportunity Analysis, Benchmarking, And Im...
 
2016 IBM Interconnect - medical devices transformation
2016 IBM Interconnect  - medical devices transformation2016 IBM Interconnect  - medical devices transformation
2016 IBM Interconnect - medical devices transformation
 
Predictive modeling healthcare
Predictive modeling healthcarePredictive modeling healthcare
Predictive modeling healthcare
 
Health Information Technology Implementation Challenges and Responsive Soluti...
Health Information Technology Implementation Challenges and Responsive Soluti...Health Information Technology Implementation Challenges and Responsive Soluti...
Health Information Technology Implementation Challenges and Responsive Soluti...
 
CSC_HealthcareJourney
CSC_HealthcareJourneyCSC_HealthcareJourney
CSC_HealthcareJourney
 

Recently uploaded

Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentationphoebematthew05
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 

Recently uploaded (20)

Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentation
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 

Using Advanced Analytics for Value-based Healthcare Delivery

  • 1. Using Advanced Analytics for Value-based Healthcare Delivery Copyright © Prime Dimensions 2013 All rights reserved.
  • 2. We are a DC-based consulting firm that provides advanced analytics capabilities and implementation services  Prime Dimensions offers expertise in the processes, tools and techniques associated with:  Data Management  Business Intelligence  Advanced Analytics  We focus on operational aspects and emphasis on Big Data strategy and technology  We assist organizations in transforming data into actionable insights to improve performance, make informed decisions and achieve measurable results.  We partner closely with clients to ensure cost-effective implementations. Big Data requires a new generation of scalable technologies designed to extract insights from very large volumes of disparate, multi-structured data by enabling high velocity capture, discovery, and analysis. Copyright © Prime Dimensions 2013 All rights reserved. 1
  • 3. Agenda  Challenges & Opportunities  Payment Reform  Big Data Enabling Technologies  Big Data Roadmap Copyright © Prime Dimensions 2013 All rights reserved. 2
  • 4. Typical Client Challenges Copyright © Prime Dimensions 2013 All rights reserved. 3
  • 5. Healthcare in the 21st Century “The transformational force that has brought affordability and accessibility to other industries is disruptive innovation. Today's health-care industry screams for disruption. Politicians are consumed with how we can afford health care. But disruption solves the more fundamental question: How do we make health care affordable?” Clayton M. Christensen, The Innovator's Prescription: A Disruptive Solution for Health Care National Challenges       Escalating Costs Dwindling Budgets More Oversight and Scrutiny Expectation of Transparency and Accountability New Laws, Regulations and Polices Focus on Performance Transformational Opportunities     Radical innovation Go beyond measuring outcomes to changing them Aggregate and/or discover data in ways that were not possible until recently Visualize problems in humanfriendly formats Copyright © Prime Dimensions 2013 All rights reserved. 4
  • 6. Current U.S. Healthcare Situation Copyright © Prime Dimensions 2013 All rights reserved. 5
  • 7. Medicare Accounts for Over 20% of Total U.S. Healthcare Spending in 2010 Note: Other health insurance programs includes DoD and VA Source: Medicare Payment Advisory Committee, June 2012 Data Book Copyright © Prime Dimensions 2013 All rights reserved. 6
  • 8. Healthcare Spending as a Percentage of GDP In 2010, total U.S. healthcare spending was $2.6 trillion, representing 18% of Gross Domestic Product, doubling over the past 30 years. If this trend continues, estimates are that healthcare spending will reach a staggering $5 trillion, or 20% of GDP, by 2021. Moreover, patients’ outof-pocket costs have doubled over the past 10 years. Source: Medicare Payment Advisory Committee, June 2012 Data Book Copyright © Prime Dimensions 2013 All rights reserved. 7
  • 9. Transforming the Healthcare Ecosystem Copyright © Prime Dimensions 2013 All rights reserved. 8
  • 10. Agenda  Challenges & Opportunities  Payment Reform  Big Data Enabling Technologies  Big Data Roadmap Copyright © Prime Dimensions 2013 All rights reserved. 9
  • 11. The DNA of Healthcare Data Copyright © Prime Dimensions 2013 All rights reserved. 10
  • 12. Unsustainable Fee-for-Service (FFS) Payment Model  Current FFS payment structure results in redundant testing, medical errors, and over-utilization of the system     Maximizes providers’ fees and reimbursements based on volume Incentivizes multiple tests and procedures, regardless of necessity Limited coordination of care across provider network Medical errors and failed procedures result in higher fees for providers  Data overload: vital health information assets are not being leveraged  96% of Medicare dollars account for treating chronic illnesses, with only 3-5% for prevention (Source: CMS, Chronic Conditions Chartbook, 2012) Research indicates that payment reform would reduce spending by $1.33 Trillion by 2023 and significantly improve quality and outcomes. (Source: The Commonwealth Fund Commission on a High Performance Health System, 1/13) Copyright © Prime Dimensions 2013 All rights reserved. 11
  • 13. Four Major Payment Reform Models Global Payments • A single PMPM payment is made for all services delivered to a patient, with payment adjustments based on measured performanc e and patient risk. Bundled Payments Patientcentered Medical Home • A single • A physician “bundled” practice or payment, w other hich may provider is include eligible to multiple receive providers in additional multiple payments if care medical settings, is home made for criteria are services met, based delivered on quality during an and cost Copyright Prime episode ©of Dimensions 2013 All rights reserved. performanc Accountab le Care Organizati ons • Groups of providers that voluntarily assume responsibilit y for the care of a population of patients share payer savings if they meet quality and 12 cost
  • 14. ACO Implementation Process Copyright © Prime Dimensions 2013 All rights reserved. 13
  • 15. Solution
   We selected three CMS programs for analytics: Accountable Care Organizations (ACO), Bundled Payments, and Patient-Centered Medical Home.
   The focus of Phase 1 of the prototype is on ACOs.
  Against this backdrop, CMS has launched new delivery and payment initiatives aimed at improving patient outcomes and quality while containing costs. We have selected the following programs for this business case: Bundled Payments, Accountable Care Organizations, and Patient-Centered Medical Home.
  Source: Bundled Payments for Care Improvement (BPCI) Initiative: General Information, http://innovation.cms.gov/initiatives/bundled-payments/
  Copyright © Prime Dimensions 2013 All rights reserved. 14
  • 16. Payment Reform Solution
   Helps healthcare payers find the optimal cost, revenue, and provider incentive models to improve health outcomes and contain patient costs
   Embeds business rules and algorithms of the Medicare Shared Savings Program (MSSP) Accountable Care Organization (ACO)
   Includes the following features and capabilities:
     ACO benchmarks and budget based on historical cost baseline, trend estimates and risk adjustments
     Performance monitoring of key measures and metrics related to cost, utilization and quality
     Predictive modeling to determine the proper mix of inputs to maximize payment incentives
     Dynamic dashboards and visualizations to perform trade-off analysis and scenario planning
  Copyright © Prime Dimensions 2013 All rights reserved. 15
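  As a rough illustration of the kind of shared-savings logic such a model embeds, the sketch below walks through a simplified one-sided calculation in Python. The benchmark construction, minimum savings rate, sharing rate, and quality scaling are placeholders chosen for illustration, not the actual MSSP formulas.

```python
# Illustrative sketch of ACO shared-savings arithmetic (not the CMS formulas).
# All inputs below are hypothetical placeholders.

def shared_savings(historical_baseline_pmpy, trend_factor, risk_adjustment,
                   actual_spend_pmpy, beneficiaries, quality_score,
                   min_savings_rate=0.02, sharing_rate=0.50):
    """Return a shared-savings payment under simplified one-sided rules."""
    benchmark_pmpy = historical_baseline_pmpy * trend_factor * risk_adjustment
    total_benchmark = benchmark_pmpy * beneficiaries
    total_actual = actual_spend_pmpy * beneficiaries
    savings = total_benchmark - total_actual

    # Savings must clear a minimum savings rate before any sharing occurs.
    if savings < min_savings_rate * total_benchmark:
        return 0.0

    # Sharing is scaled by a 0-1 quality score, as in pay-for-performance designs.
    return savings * sharing_rate * quality_score

payment = shared_savings(
    historical_baseline_pmpy=10_000, trend_factor=1.03, risk_adjustment=1.01,
    actual_spend_pmpy=9_800, beneficiaries=12_000, quality_score=0.92)
print(f"Shared-savings payment: ${payment:,.0f}")
```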
  • 17. Methodology and Scope
   The team acquired and loaded public datasets from CMS and related sources
   Phase 1 scope:
     In-patient Medicare data for 2010
     Sampling of 5 metropolitan areas representing a range of regions and population sizes (CA, FL, MA, OH, GA)
     Diabetes, hypertension, coronary disease, and heart failure measures
     Readmissions and ER utilization
   Created scenarios for comparative analysis and benchmarking
     Population, cost, clinical and utilization data
     Analytical models and predictive modeling
   Seeking input from SMEs
   Applying industry best practices
   Assessing technical solutions
  Data Sources
  1. Data Entrepreneurs’ Synthetic Public Use File
  2. Institutional Provider & Beneficiary Summary
  3. Data for ACO Applicant Share Calculations
  4. CMS Statistics reference booklet for the Medicare and Medicaid health insurance programs
  5. National Health Expenditure Accounts (health care goods and services, public health activities, government administration, the net cost of health insurance, and investment related to health care)
  6. Premier Hospital Quality Incentive Demonstration (expanding the information available about quality of care and through direct incentives to reward the delivery of superior quality care)
  7. Hospital Compare (including 27 quality measures of clinical process of care and outcomes)
  8. Assorted files from the CMS Download Database
  9. The Dartmouth Atlas of Health Care
  Copyright © Prime Dimensions 2013 All rights reserved. 16
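  To make the readmission measure concrete, here is a hedged sketch of that calculation in Python with pandas. The miniature data frame and its column names (BENE_ID, ADMIT_DT, DISCHARGE_DT) are invented stand-ins for whatever the CMS extracts above actually provide.

```python
import pandas as pd

# Hypothetical miniature extract standing in for the 2010 inpatient file.
claims = pd.DataFrame({
    "BENE_ID":      ["A1", "A1", "B2", "B2", "C3"],
    "ADMIT_DT":     pd.to_datetime(["2010-01-05", "2010-01-25", "2010-03-01",
                                    "2010-06-15", "2010-04-10"]),
    "DISCHARGE_DT": pd.to_datetime(["2010-01-10", "2010-01-30", "2010-03-08",
                                    "2010-06-20", "2010-04-14"]),
})

# Flag stays followed by another admission within 30 days for the same beneficiary.
claims = claims.sort_values(["BENE_ID", "ADMIT_DT"])
claims["NEXT_ADMIT_DT"] = claims.groupby("BENE_ID")["ADMIT_DT"].shift(-1)
claims["DAYS_TO_NEXT"] = (claims["NEXT_ADMIT_DT"] - claims["DISCHARGE_DT"]).dt.days
claims["READMIT_30"] = claims["DAYS_TO_NEXT"].between(0, 30)

print(f"Crude 30-day readmission rate: {claims['READMIT_30'].mean():.1%}")
```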
  • 18. Agenda  Challenges & Opportunities  Payment Reform  Big Data Enabling Technologies  Big Data Roadmap Copyright © Prime Dimensions 2013 All rights reserved. 17
  • 19. Big Data Enabling Technologies
  Big Data requires a new generation of scalable technologies designed to extract meaning from very large volumes of disparate, multi-structured data by enabling high velocity capture, discovery, and analysis.
   New systems that handle a wide variety of data, from sensor data to web and social media data
   Improved analytical capabilities (sometimes called advanced analytics), including event, predictive and text analytics
   Operational business intelligence that improves business agility by enabling automated real-time actions and better decision making
   Faster hardware, ranging from faster multi-core processors and large memory spaces to solid-state drives and virtual data storage
   Cloud computing, including on-demand software-as-a-service (SaaS) and platform-as-a-service (PaaS) analytical solutions in public and private clouds
   Massively Parallel Processing   Hadoop and MapReduce   NoSQL Databases   In-Memory Technology
  Copyright © Prime Dimensions 2013 All rights reserved. 18
  • 20. Massively Parallel Processing
   Massively parallel processing (MPP) is the coordinated processing of a program by multiple processors that work on different parts of the program, with each processor using its own operating system and memory
   Known as “loosely coupled” or “shared nothing” systems
   Enables “analytic offload”
   Permits parallelized data loading, i.e., E-L-T vs. E-T-L
  Copyright © Prime Dimensions 2013 All rights reserved. 19
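  A toy Python sketch of the shared-nothing idea: each worker process scans only its own partition of the data and returns a partial aggregate, and a coordinator merges the partials. The partitions and the aggregation are invented for illustration; a real MPP database does this across separate servers with their own disks.

```python
# Each "node" aggregates its own partition; the coordinator merges the results.
from multiprocessing import Pool
from collections import Counter

PARTITIONS = [
    [("diabetes", 412.0), ("chf", 1250.0), ("diabetes", 97.5)],
    [("hypertension", 55.0), ("chf", 2100.0)],
    [("diabetes", 300.0), ("hypertension", 80.0)],
]

def scan_partition(rows):
    """Aggregate spend per condition for one partition (one worker per partition)."""
    totals = Counter()
    for condition, cost in rows:
        totals[condition] += cost
    return totals

if __name__ == "__main__":
    with Pool(processes=len(PARTITIONS)) as pool:
        partials = pool.map(scan_partition, PARTITIONS)
    merged = sum(partials, Counter())   # coordinator combines partial aggregates
    print(dict(merged))
```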
  • 21. The Hadoop Ecosystem (diagram) A compute cluster ingesting multi-structured source data: the master node runs the NameNode and JobTracker, each slave node runs a DataNode and TaskTracker, and the MapReduce engine and YARN coordinate work across the data layer, workload management layer, and application layer. Copyright © Prime Dimensions 2013 All rights reserved. 20
  • 22. The Hadoop Ecosystem (diagram) The same compute cluster shown with YARN (MapReduce 2.0): the master node runs the NameNode and the ResourceManager, a per-application ApplicationMaster negotiates resources, and each slave node runs a DataNode and NodeManager (taking over the JobTracker/TaskTracker roles of the original MapReduce engine) within the workload management layer. Copyright © Prime Dimensions 2013 All rights reserved. 21
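  To ground the MapReduce model these diagrams describe, here is a minimal word-count example in the Hadoop Streaming style, with the shuffle simulated in-process so the script runs standalone; on a real cluster the framework would distribute the map and reduce tasks across the slave nodes.

```python
# Minimal MapReduce sketch: map emits (key, 1) pairs, the framework groups by
# key, reduce sums the counts. The shuffle/sort is simulated here.
from itertools import groupby
from operator import itemgetter

def mapper(line):
    for word in line.strip().lower().split():
        yield word, 1

def reducer(word, counts):
    return word, sum(counts)

lines = ["bundled payments and global payments",
         "accountable care organizations share savings"]

# Map phase
pairs = [kv for line in lines for kv in mapper(line)]
# Shuffle/sort phase (handled by the framework on a real cluster)
pairs.sort(key=itemgetter(0))
# Reduce phase
for word, group in groupby(pairs, key=itemgetter(0)):
    print(reducer(word, (count for _, count in group)))
```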
  • 23. NoSQL (Not only SQL) Databases
   Like HDFS, NoSQL databases have a distributed, schema-less data structure and scale out across low-cost commodity hardware
   NoSQL databases are not reliant on a relational, fixed data model and Structured Query Language (SQL)
   NoSQL databases are generally not ACID-compliant and lack the strict data integrity of relational databases
   NoSQL databases have BASE properties – Basically Available, Soft state, Eventual consistency
   NoSQL databases are typically queried via REST APIs
  Key-Value → Columnar → Document/Object → Graph (increasing data complexity)
  Copyright © Prime Dimensions 2013 All rights reserved. 22
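  A minimal sketch of the key/value pattern behind many NoSQL stores: the key is hashed to choose a node, and reads route back to the same node. The node count and records are illustrative; real systems add replication, partition tolerance, and eventual consistency on top.

```python
# Toy key/value store: hash the key to pick a "server", then read/write there.
import hashlib

NODES = [dict() for _ in range(4)]        # four in-process stand-ins for servers

def node_for(key: str) -> dict:
    digest = hashlib.md5(key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

def put(key: str, value) -> None:
    node_for(key)[key] = value

def get(key: str):
    return node_for(key).get(key)

put("patient:123", {"condition": "diabetes", "readmitted_30d": False})
print(get("patient:123"))
```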
  • 24. In-memory Database
   An in-memory database system (IMDS) stores data entirely in main memory (with NVDIMMs adding persistence), as opposed to disk-based storage
   Storing and retrieving data in memory is much faster than writing to and reading from disk
     Reduces data-access latency caused by I/O operations, significantly speeding query and response times
     The trend toward in-memory is made possible by low memory prices and multi-core 64-bit processing capabilities
   Data stored entirely in-memory is immediately available for analytic processing
  Copyright © Prime Dimensions 2013 All rights reserved. 23
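  The contrast can be illustrated with SQLite, which can hold a database entirely in RAM via ":memory:". Absolute timings vary by machine and SQLite is not an analytic IMDS, so treat this only as a sketch of running the same workload with and without disk I/O.

```python
# Same insert-and-aggregate workload against an in-memory and an on-disk database.
import sqlite3, time, os

def run(path):
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE claims (id INTEGER, cost REAL)")
    start = time.perf_counter()
    conn.executemany("INSERT INTO claims VALUES (?, ?)",
                     ((i, i * 1.5) for i in range(200_000)))
    conn.commit()
    total = conn.execute("SELECT SUM(cost) FROM claims").fetchone()[0]
    elapsed = time.perf_counter() - start
    conn.close()
    return total, round(elapsed, 3)

print("in-memory:", run(":memory:"))
print("on-disk:  ", run("claims.db"))
os.remove("claims.db")   # clean up the example file
```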
  • 25. Database Options (diagram) A two-axis chart plotting Data Velocity (scale up) against Data Variety (scale out): relational databases sit near the origin, analytic databases scale up for velocity, and NoSQL databases scale out for variety. Copyright © Prime Dimensions 2013 All rights reserved. 24
  • 26. Agenda  Challenges & Opportunities  Payment Reform  Solution Prototype  Big Data Enabling Technologies  Big Data Roadmap Copyright © Prime Dimensions 2013 All rights reserved. 25
  • 27. Our roadmap reduces risk while progressing towards a unified analytics environment (diagram) Stages: Information Management → Health Assessor → Design Thinking & Prototyping → Mission Oriented Analytics Framework → Unified Analytics Environment, built on an evolving on-demand infrastructure (commodity IT, open source tools). External pressures: continuous improvement & innovation, changes to laws (e.g., ACA), permanent budget pressure, and smarter, more demanding data consumers & partners. Copyright © Prime Dimensions 2013 All rights reserved.
  • 28. Your Brain on Big Data Query & Reporting Discovery & Engagement Copyright © Prime Dimensions 2013 All rights reserved. 27
  • 29. Growth of Data Exploding  Eric Schmidt: “Every 2 days we create as much information as we did up to 2003.”  Relational databases cannot effectively ingest, store and query the increased volume, variety and velocity of Big Data Copyright © Prime Dimensions 2013 All rights reserved. 28
  • 30. Begin with the End in Mind Copyright © Prime Dimensions 2013 All rights reserved. 29
  • 31. Cycles of Design Thinking and Prototyping Copyright © Prime Dimensions 2013 All rights reserved. 30
  • 32. Mission Oriented Analytics Framework Copyright © Prime Dimensions 2013 All rights reserved. 31
  • 33. Evolving On-Demand Infrastructure for Big Data (diagram) Structured source data feeds the data warehouse (E-T-L, OLAP), while multi-structured and streaming source data lands in Hadoop/YARN (Storm, Tez, HCatalog) and NoSQL databases for data discovery and data warehouse augmentation; an analytic database (E-L-T, in-memory, columnar) handles analytic offload and scale-up/scale-out; analytic applications, REST/JSON APIs, and dashboards & visualizations deliver advanced analytics. Copyright © Prime Dimensions 2013 All rights reserved.
  • 34. Unified Analytics Environment Copyright © Prime Dimensions 2013 All rights reserved. 33
  • 35. Data Management | Business Intelligence | Advanced Analytics Copyright © Prime Dimensions 2013 All rights reserved. 34
  • 36. Questions? Our Contact Information Michael Joseph Managing Partner Direct: 703.861.9897 Email: mjoseph@primedimensions.com Richard Rowan Managing Director Direct: 703.201.2641 Email: rrowan@primedimensions.com Prime Dimensions, LLC www.primedimensions.com Data Management | Business Intelligence | Advanced Analytics Follow us @primedimensions Copyright © Prime Dimensions 2013 All rights reserved. 35

Editor's Notes

  1. The fundamental challenge is balancing three aspects: the cost of care, access to care, and the quality of care provided. This is known as the iron triangle of healthcare. Unfortunately, only two of these three areas can be optimized at once. For example, if a nation chose to provide high-quality care to all, then costs must be high as well. If instead a healthcare system was designed to be low cost but deliver high-quality care, access to care would need to be limited. Finally, if a country wanted a low-cost healthcare system with universal access, the quality delivered would suffer.
  The US Healthcare Crisis: The healthcare crisis in America is particularly startling when you realize that the United States does the worst on all three aspects. We have the highest cost per capita, do not provide healthcare to all, and our health quality outcomes are the worst among industrialized nations. It is expected that by 2016, healthcare costs will account for one of every five dollars spent. This is double the amount spent a decade earlier, and the same problems facing the country today will likely remain unchanged: limited access, less-than-optimal care, and high costs. Nevertheless, within our country there are many health plans, employers, doctors, and other organizations trying to reform our healthcare system. Time will tell whether they will succeed. Whatever healthcare reform looks like in our country, the solution will be uniquely American, as many innovative solutions to problems often are.
  2. The current payment structure results in redundant testing, medical errors, and over-utilization that maximizes providers’ fees and reimbursements; focuses on volume of services provided and revenue generated, not quality and outcomes; and incentivizes multiple tests and procedures, regardless of necessity, quality, and efficiency.
  3. Our model embeds business rules and algorithms of the Medicare Shared Savings Program (MSSP) Accountable Care Organization (ACO). Our application includes the following features and capabilities: ACO benchmarks and budget based on historical cost baseline, trend estimates and risk adjustments; performance monitoring of key measures and metrics related to cost, utilization and quality; predictive modeling to determine the proper mix of inputs to maximize payment incentives; and dynamic dashboards and visualizations to perform trade-off analysis and scenario planning.
  4. Two questions: How many seconds would it take for one person to find four Jacks in a deck of 52 cards? How long would it take 52 people, each holding one card?
  Why is this important? Accelerating time to value. MPP serves as the basis for “next generation” database management software designed to run on a shared-nothing massively parallel processing (MPP) platform with features such as row- and/or column-based storage, compression, and in-database analytics. These database systems alter the data management landscape in terms of response times at almost any scale, enabling analytic offload. Apache Hadoop, which will follow, is based on MPP and has emerged from humble beginnings to worldwide adoption, infusing data centers with new infrastructure concepts and generating new business opportunities by placing parallel processing into the hands of the average programmer. (http://whatis.techtarget.com/definition/MPP-massively-parallel-processing)
  In some implementations, up to 200 or more processors can work on the same application. An MPP system is also known as a “loosely coupled” or “shared nothing” system. Many vendor solutions, both appliance-based and software only, are deployed on commodity hardware. MPP architecture offers linear scalability on commodity technology at a competitive price point. Analytical applications with frequent full table scans and complex algorithms such as regression analysis, joins, sorting, and aggregations can potentially saturate network bandwidth in a shared-everything architecture. In an MPP architecture, each compute node (usually a standard server) equally shares the workload in parallel and is able to utilize the full IO bandwidth of locally attached disk storage. The ability to take queries that once ran in hours and reduce them to minutes opens up the opportunity to search for value in massive data volumes quickly and iteratively.
  Parallelized Data Loading: The MPP architecture is designed to compartmentalize individual node processing and is therefore ideal for moving load processes from dedicated servers to the MPP database. Rather than extract, transform, and load, processes are shifted to extract, load, and then transform, taking advantage of parallelism (see the sketch below). The same is true for advanced analytics: the more functions that can be pushed down into the database, the faster analysis will complete. Reducing network traffic between the database and the analytical application frees up IO and bandwidth for additional processing and data loads.
  Intel has packed just shy of a billion transistors into the 216 square millimeters of silicon that compose its latest chip, each one far, far thinner than a sliver of human hair: >50 nm feature sizes using optical lithography, 13.5 nm wavelength using extreme UV lithography.
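  A small sketch of the E-L-T pattern described above, using SQLite as a stand-in for an MPP analytic database: raw rows are landed first, and the transformation is pushed down into the engine as SQL rather than being done in an external ETL tier. Table and column names are invented for the example.

```python
# Extract-Load-Transform sketch: land raw data, then transform in-database.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_claims (bene_id TEXT, condition TEXT, cost REAL)")

# Extract + Load: land the data as-is, in parallel on a real MPP system
conn.executemany("INSERT INTO raw_claims VALUES (?, ?, ?)", [
    ("A1", "diabetes", 412.0), ("A1", "chf", 1250.0), ("B2", "diabetes", 300.0)])

# Transform: the aggregation runs inside the database engine (pushdown)
conn.execute("""CREATE TABLE cost_per_condition AS
                SELECT condition, SUM(cost) AS total_cost
                FROM raw_claims GROUP BY condition""")
print(conn.execute("SELECT * FROM cost_per_condition").fetchall())
```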
  5. Hadoop is an Apache Open Source project that provides a framework that allows for the distributed processing of large data sets across clusters of computers, each offering local computation and storage. Based on Google File System and MapReduce papers.Hadoop scales out to large clusters of servers (nodes) using the Hadoop Distributed File System (HDFS) to manage huge data sets and spread them across the servers.Hadoop’s disributed architecture as a Big Data platform allows MapReduce programs to run in parallel across 10s to 1000s of servers, or nodes.MapReduce is a general-purpose execution engine that handles the complexities of parallel programming for a wide variety of applicationsThe Map phase, for computation or analysis applied to a set of input key/value pairs to produce a set of intermediate key/value pairsThe Reduce phase, in which the set of values associated with the intermediate key/value pairs output by the Map phase are combined (that is, “reduced”) to provide the results.More on MapReduce later…We have seen that Hadoop also augments Data Warehouse environments. Hadoop is becoming a critical part of many modern information technology (IT) departments. It is being used for a growing range of requirements, including analytics, data storage, data processing, and shared compute resources. As Hadoop’s significance grows, it is important that it be treated as a component of your larger IT organization, and managed as one. Hadoop is no longer relegated to only research projects, and should be managed as your agency would manage any other large component of your IT infrastructure.A multi-node Hadoop clusterA small Hadoop cluster will include a single master and multiple slave nodes. The master node consists of a JobTracker, TaskTracker, NameNode and DataNode. A slave node acts as both a DataNode and TaskTracker, though it is possible to have data-only worker nodes and compute-only worker nodes. These are normally used only in nonstandard applications.[13]Hadoop requires Java Runtime Environment (JRE) 1.6 or higher. In a larger cluster, the HDFS is managed through a dedicated NameNode server to host the file system index, and Similarly, a standalone JobTracker server can manage job scheduling. HADOOP—THE FOUNDATION FOR CHANGEHadoop has the potential to reach beyond Big Data to catalyze new levels of business productivity and transformation. As the foundation for change in business, Hadoop represents an unprecedented opportunity to improve how organizations can get the most value from large amounts of data. Businesses that rely on Hadoop as the core of their infrastructure can not only do analytics on top of vast amounts of data, but can also go beyond analytics and the foundation for that data layer to build applications that are meaningful, and that have a very tightly coupled relationship with the data. Consumer Internet companies have reaped the benefits of this approach, and EMC believes more traditional enterprises will adopt the same model as they evolve and transform their businesses.Hadoop has rapidly emerged as the preferred solution for Big Data analytics applications that grapple with vast repositories of unstructured data. It is flexible, scalable, inexpensive, fault-tolerant, and enjoys rapid adoption rates and a rich ecosystem surrounded by massive investment. 
However, customers face high hurdles to broadly adopting Hadoop as their singular data repository due to a lack of useful interfaces and high-level tooling for Business Intelligence and data mining—components that are critical to data analytics and building a data-driven enterprise. As the world's first true SQL processing for Hadoop, Pivotal HD addresses these challenges. THE HADOOP ECOSYSTEM The Hadoop family of products includes the Hadoop Distributed File System (HDFS), MapReduce, Pig, Hive, HBase, Mahout, Lucene, Oozie, Flume, Cassandra, YARN, Ambari, Avro, Chukwa, and Zookeeper.  Pivtoal hd: HDFS, MapReduce, Hive, Mahout, Pig, HBase, Yarn, Zookeeper, Sqoop and FlumeHDFS A distributed file system that partitions large files across multiple machines for high-throughput access to dataData LayerFlume: Flume is a framework for populating Hadoop with data. Agents are populated throughout ones IT infrastructure – inside web servers, application servers and mobile devices, for example – to collect data and integrate it into Hadoop.Flume: Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows.Sqoop: Sqoop is a connectivity tool for moving data from non-Hadoop data stores – such as relational databases and data warehouses – into Hadoop. It allows users to specify the target location inside of Hadoop and instruct Sqoop to move data from Oracle, Teradata or other relational databases to the target. HCatalog: HCatalog is a centralized metadata management and sharing service for Apache Hadoop. It allows for a unified view of all data in Hadoop clusters and allows diverse tools, including Pig and Hive, to process any data elements without needing to know physically where in the cluster the data is stored.   Workload Management LayerMapReduceA programming framework for distributed batch processing of large data sets distributed across multiple serversMapReduce, which is typically used to analyze web logs on hundreds, sometimes thousands of web application servers without moving the data into a data warehouse, is not a database system, but is a parallel and distributed programming model for analyzing massive data sets (“big data”). One elegant aspect of the MapReduce is its simplicity, mostly due to its dependence on two basic operations that are applied to sets or lists of data value pairs:The Map phase, for computation or analysis applied to a set of input key/value pairs to produce a set of intermediate key/value pairs, andThe Reduce phase, in which the set of values associated with the intermediate key/value pairs output by the Map phase are combined (that is, “reduced”) to provide the results. MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system. 
The framework takes care of scheduling tasks, monitoring them and re-executes the failed tasks.Typically the compute nodes and the storage nodes are the same, that is, the MapReduce framework and the Hadoop Distributed File System (see HDFS Architecture Guide) are running on the same set of nodes. This configuration allows the framework to effectively schedule tasks on the nodes where data is already present, resulting in very high aggregate bandwidth across the cluster. ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications. Each time they are implemented there is a lot of work that goes into fixing the bugs and race conditions that are inevitable. Because of the difficulty of implementing these kinds of services, applications initially usually skimp on them ,which make them brittle in the presence of change and difficult to manage. Even when done correctly, different implementations of these services lead to management complexity when the applications are deployed. Oozie is a workflow scheduler system to coordinate and manage Apache Hadoop jobs.Oozie is integrated with the rest of the Hadoop stack supporting several types of Hadoop jobs out of the box (such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop and Distcp) as well as system specific jobs (such as Java programs and shell scripts).Oozie: Oozie is a workflow processing system that lets users define a series of jobs written in multiple languages – such as Map Reduce, Pig and Hive -- then intelligently link them to one another. Oozie allows users to specify, for example, that a particular query is only to be initiated after specified previous jobs on which it relies for data are completed. Mahout: Scalable to reasonably large data sets. Mahout also provides Java libraries for common math (focused on linear algebra and statistics) operations and primitive Java collections. Mahout is a work in progress; the number of implemented algorithms has grown quickly,[3] but there are still various algorithms missing.While Mahout's core algorithms for clustering, classification and batch based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm, it does not restrict contributions to Hadoop based implementationsMahout: Mahout is a scalable machine learning and data mining library. It takes the most popular data mining algorithms for performing clustering, regression testing and statistical modeling and implements them using the Map Reduce model. Application LayerApache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turns enables them to handle very large data sets. At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin, which has the following key properties:Pig [1] is a high-level platform for creating MapReduce programs used with Hadoop. 
The language for this platform is called Pig Latin.[1] Pig Latin abstracts the programming from the Java MapReduce idiom into a notation which makes MapReduce programming high level, similar to that of SQL for RDBMS systems. Pig Latin can be extended using UDF (User Defined Functions) which the user can write in Java, Python, JavaScript, Ruby or Groovy [2] and then call directly from the language. PigA high-level data-flow language for expressing Map/Reduce programs for analyzing large HDFS distributed data setsPig was originally [3] developed at Yahoo Research around 2006 for researchers to have an ad-hoc way of creating and executing map-reduce jobs on very large data sets. In 2007,[4] it was moved into the Apache Software Foundation.[5] Apache HBase™ is the Hadoop database, a distributed, scalable, big data store. Use Apache HBase when you need random, realtime read/write access to your Big Data. This project's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.HBase: HBase is a non-relational database that allows for low-latency, quick lookups in Hadoop. It adds transactional capabilities to Hadoop, allowing users to conduct updates, inserts and deletes. EBay and Facebook use HBase heavily.HBaseAn open-source, distributed, versioned, column-oriented store modelled after Google’s BigtableHive: Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL. Hive: Hive is a Hadoop-based data warehousing-like framework originally developed by Facebook. It allows users to write queries in a SQL-like language caled HiveQL, which are then converted to MapReduce. This allows SQL programmers with no MapReduce experience to use the warehouse and makes it easier to integrate with business intelligence and visualization tools such as Microstrategy, Tableau, Revolutions Analytics, etc.HiveA data warehouse system for Hadoop that facilitates data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems. Hive provides a mechanism to project structure onto this data and query it using a SQL-like language called HiveQL. HiveQL programs are converted into Map/Reduce programsCluster Sizing The sizing guide for HDFS is very simple: each file has a default replication factor of 3 and you need to leave approximately 25% of the disk space for intermediate shuffle files.  So you need 4x times the raw size of the data you will store in the HDFS.  However, the files are rarely stored uncompressed and, depending on the file content and the compression algorithm, on average we have seen a compression ratio of up to 10-20 for the text files stored in HDFS.  
So the actual raw disk space required is only about 30-50% of the original uncompressed size.  Compression also helps in moving the data between different systems, e.g. Teradata and Hadoop.MemoryMemory demand for a master node is based on the NameNode data structures that grow with the storage capacity of your cluster. We found 1 GB per petabyte of storage is a good guideline for master node memory. You then need to add on your OS overhead,etc. We have found that with Intel Sandybridge processors 32GB is more than enough memory for a master node.Cluster Design TradeoffsWe classify clusters as small (around 2-3 racks), medium(4-10 racks) and large(above 10 racks). What we have been covering so far are design guidelines and part of the design process is to understand how to bend the design guidelines to meet you goals. In the case of small, medium and large clusters things get progressively more stringent and sensitive when you bend the guidelines. For a small the smaller number of slave nodes allow you greater flexibility in your decisions. There are a few guidelines you don’t want to violate like isolation. When you get to a medium size cluster the number of nodes will increase your design sensitivity. You also now have enough hardware the physical plant issues of cooling and power become more important. Your interconnects also become more important. At the large scale things become really sensitive and you have to be careful because making a mistake here could result in a design that will fail. Our experience at Hortonworks has allowed us to develop expertise in this area and we strongly recommend you work with us if you want to build Internet scale clusters.    detailed and specific on what a typical slave node for Hadoop should be: Mid-range processor4 to 32 GB memory1 GbE network connection to each node, with a 10 GbE top-of-rack switchA dedicated switching infrastructure to avoid Hadoop saturating the network4 to 12 drives (cores) per machine, Non-RAIDEach node has 8 cores, 16G RAM and 1.4T storage.FacebookWe use Hadoop to store copies of internal log and dimension data sources and use it as a source for reporting/analytics and machine learning.Currently we have 2 major clusters:A 1100-machine cluster with 8800 cores and about 12 PB raw storage.A 300-machine cluster with 2400 cores and about 3 PB raw storage.Each (commodity) node has 8 cores and 12 TB of storage. Yahoo now manages more than 42,000 Hadoop nodes.(2011)Yahoo!More than 100,000 cores in >40,00 nodes running HadoopOur biggest cluster: 4500 nodes (2*4cpu boxes w 4*1TB disk & 16GB RAM) 
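  The sizing guideline above (3x replication plus roughly 25% headroom for shuffle files, offset by text compression) can be expressed as a short calculation; the input figures below are examples only, not a recommendation.

```python
# Worked version of the HDFS sizing rule of thumb quoted in the note above.
raw_uncompressed_tb = 100        # data you want to keep in HDFS
compression_ratio = 10           # the note cites 10-20x for text files
replication_factor = 3           # HDFS default
shuffle_headroom = 0.25          # reserve ~25% of disk for intermediate files

compressed_tb = raw_uncompressed_tb / compression_ratio
disk_needed_tb = compressed_tb * replication_factor / (1 - shuffle_headroom)
print(f"Approximate raw disk needed: {disk_needed_tb:.1f} TB")   # 40.0 TB here
```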
  6. MapReduce is a general-purpose execution engine that handles the complexities of parallel programming for a wide variety of applications. The Map phase applies computation or analysis to a set of input key/value pairs to produce a set of intermediate key/value pairs. The Reduce phase combines (that is, “reduces”) the sets of values associated with the intermediate key/value pairs output by the Map phase to provide the results.
  YARN: Apache Hadoop NextGen MapReduce (YARN). MapReduce has undergone a complete overhaul in hadoop-0.23, and we now have what we call MapReduce 2.0 (MRv2), or YARN. The fundamental idea of MRv2 is to split up the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global ResourceManager (RM) and a per-application ApplicationMaster (AM). An application is either a single job in the classical sense of MapReduce jobs or a DAG of jobs. The ResourceManager and the per-node slave, the NodeManager (NM), form the data-computation framework. The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system. The per-application ApplicationMaster is, in effect, a framework-specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.
  As folks are aware, Hadoop HDFS is the data storage layer for Hadoop and MapReduce was the data-processing layer. However, the MapReduce algorithm, by itself, isn’t sufficient for the very wide variety of use cases we see Hadoop being employed to solve. With YARN, Hadoop now has a generic resource-management and distributed application framework whereby one can implement multiple data processing applications customized for the task at hand. Hadoop MapReduce is now one such application for YARN, and I see several others given my vantage point – in the future you will see MPI, graph processing, simple services, etc., all co-existing with MapReduce applications in a Hadoop YARN cluster.
  7. There are many categories of NoSQL designs -- key value, graph, document-oriented -- and well-known technologies include BigTable, HBase, Cassandra, Couchbase, MongoDB and SimpleDB.The buzzIf Not Tables, Then What?Instead of using structured tables to store multiple related attributes in a row, NoSQL databases use the concept of a key/value store. Quite simply, there is no schema for the database. It simply stores values for each provided key, distributes them across the database and then allows their efficient retrieval. The lack of a schema prevents complex queries and essentially prevents the use of NoSQL as a transactional database environment. There are four main types of NoSQL databases:The basic key/value store performs nothing other than the function described above – taking a binary data object, associating it with a key, and storing it in the database for later retrieval.Columnar databases are a hybrid between NoSQL and relational databases. They provide some row-and-column structure, but do not have the strict rules of relational databases.Document stores go beyond this slightly by imposing a little more structure on the binary object. The objects must be documents, encoded in some recognizable format, such as XML or PDF, but there are no requirements about the structure or content of the document. Each document is stored as the value portion of a key/value store and may be accompanied by metadata embedded in the document itself.Graph databases store information in multi-attribute tuples that reflect relationships in a different way. For example, a graph database might be used to store the "friend" relationships of a social network, with a record merely consisting of two friends who share a relationship.NoSQL ArchitectureThe core of the NoSQL database is the hash function – a mathematical algorithm that takes a variable length input and produces a consistent, fixed-length output. The key of each key/value pair being fed to a NoSQL database is hashed and this hash value is used to direct the pair to a particular NoSQL database server, where the record is stored for later retrieval. When an application wishes to retrieve a key value pair, it provides the database with the key. This key is then hashed again to determine the appropriate server where the data would be stored (if the key exists in the database) and then the database engine retrieves the key/value pair from that server. As you read the description of this process, you may find yourself wondering “How does the user or application perform more advanced queries, such as finding all of the keys that have a particular value or sorting data by a value?” And, there’s the rub – NoSQL databases simply do not support this type of functionality. They are designed for the rapid, efficient storage of key/value pairs where the application only needs a place to stash data, later retrieving it by the key, and only by the key. If you need to perform other queries, NoSQL is not the appropriate platform for your use.Redundancy and Scalability in NoSQLThe simplistic architecture of NoSQL databases is a major benefit when it comes to redundancy and scalability. To add redundancy to a database, administrators simply add duplicate nodes and configure replication between a primary node and its counterpart. Scalability is simply a matter of adding additional nodes. 
When those nodes are added, the NoSQL engine adjusts the hash function to assign records to the new node in a balanced fashion.STILL WANT TO KNOW -- WHAT ARE NOSQL DATABASES?NoSQL database technologymakes it to ThoughtWorksMPP hardware, NoSQL databases: New DBMS optionsCan Oracle peddle NoSQL databases?Developers looking to build big Web applications needed to add more and more processing nodes to keep up with almost boundless demand for computing power. Relational databases came up short, so software engineers at Google, Amazon, Facebook and Yahoo devised non-SQL solutions -- laying the groundwork for big data analytics and expanded cloud computing services. Since 2009, a cavalry charge of NoSQL software startups entered the void with commercial products.The realityNoSQL is also known as "not only SQL," because some NoSQL databases do support SQL elements. But most don’t share key traits of relational databases like atomicity and consistency, so though NoSQL may help keep the auction chant going on eBay, it might break any bank using it for transaction processing. And with a ready pool of skilled developers, the incumbent relational database will hold on in most business applications.Data integrity. The ACID properties (atomicity, consistency, isolation, durability) guarantee that database transactions are processed with integrity. Hadoop and NoSQL are not a DBMS’s, so it is not ACID compliant, and therefore is not appropriate where inserts and updates are required. The advantages of NoSQL data stores are:elastic scaling meaning that they scale up transparently by adding a new node, and they are usually designed with low-cost hardware in mind.NoSQL data stores can handle Big data easilyNoSQL databases are designed to have less management, automatic repair, data distribution, and simpler data models therefore no need to have a DBA on site for using it.NoSQL databases use clusters of cheap servers to manage the exploding data and transaction volumes and therefore they are cheap in comparison to the high cost of licenses of RDBMS systems.Flexible data models, the key value stores and document databases schema changes don’t have to be managed as on complicated change unit, therefore it lets application to iterate faster.http://catmousavi.wordpress.com/2012/03/29/what-is-big-data-what-are-nosql-databases-what-is-hadoop-pig-hive/http://nosql-database.org/ACID (atomicity, consistency, isolation, and durability) is an acronym and mnemonic device for learning and remembering the four primary attributes ensured to any transaction by atransaction manager (which is also called a transaction monitor). These attributes are:Atomicity. In a transaction involving two or more discrete pieces of information, either all of the pieces are committed or none are.Consistency. A transaction either creates a new and valid state of data, or, if any failure occurs, returns all data to its state before the transaction was started.Isolation. A transaction in process and not yet committed must remain isolated from any other transaction.Durability. Committed data is saved by the system such that, even in the event of a failure and system restart, the data is available in its correct state.A well-known problem with NoSQL databases in general is that they do not support the 'ACID' principals held dear by traditional RDBMS DBAs. The Register is reporting that this may soon change with FoundationDB. Atomicity ensures that a transaction is saved or undone, but never exists halfway between the two states. 
Consistency ensures that only valid data can be stored. Isolation of transactions prevents one transaction interfering with another. Finally, durability ensures that transactions committed will endure and be protected from loss.ACID provides principles governing how changes are applied to a database. In a very simplified way, it states (my own version):(A) when you do something to change a database the change should work or fail as a whole(C) the database should remain consistent (this is a pretty broad topic)(I) if other things are going on at the same time they shouldn't be able to see things mid-update(D) if the system blows up (hardware or software) the database needs to be able to pick itself back up; and if it says it finished applying an update, it needs to be certainAtomicity: Either the task (or all tasks) within a transaction are performed or none of them are. This is the all-or-none principle. If one element of a transaction fails the entire transaction fails.Consistency: The transaction must meet all protocols or rules defined by the system at all times. The transaction does not violate those protocols and the database must remain in a consistent state at the beginning and end of a transaction; there are never any half-completed transactions.Isolation: No transaction has access to any other transaction that is in an intermediate or unfinished state. Thus, each transaction is independent unto itself. This is required for both performance and consistency of transactions within a database.Durability: Once the transaction is complete, it will persist as complete and cannot be undone; it will survive system failure, power loss and other types of system breakdowns.There are of course many facets to those definitions and within the actual ACID requirement of each particular database, but overall in the RDBMS world, ACID is overlord and without ACID reliability is uncertain. BASE Introduces Itself and Takes a BowLuckily for the world of distributed computing systems, their engineers are clever. How do the vast data systems of the world such as Google’s BigTable and Amazon’s Dynamo and Facebook’s Cassandra (to name only three of many) deal with a loss of consistency and still maintain system reliability? The answer, while certainly not simple, was actually a matter of chemistry or pH: BASE (Basically Available, Soft state, Eventual consistency). In a system where BASE is the prime requirement for reliability, the activity/potential (p) of the data (H) changes; it essentiallyslows down. On the pH scale, a BASE system is closer to soapy water (12) or maybe the Great Salt Lake (10). Such a statement is not claiming that billions of transactions are not happening rapidly, they still are, but it is the constraints on those transactions that have changed; those constraints are happening at different times with different rules. In an ACID system, the data fizzes and bubbles and is perpetually active; in a BASE system, the bubbles are still there much like bath water, popping, gurgling, and spinning, but not with the same vigor required from ACID. Here is why:Basically Available: This constraint states that the system does guarantee the availability of the data as regards CAP Theorem; there will be a response to any request. 
But, that response could still be ‘failure’ to obtain the requested data or the data may be in an inconsistent or changing state, much like waiting for a check to clear in your bank account.Soft state: The state of the system could change over time, so even during times without input there may be changes going on due to ‘eventual consistency,’ thus the state of the system is always ‘soft.’Eventual consistency: The system will eventually become consistent once it stops receiving input. The data will propagate to everywhere it should sooner or later, but the system will continue to receive input and is not checking the consistency of every transaction before it moves onto the next one. Werner Vogel’s article “Eventually Consistent – Revisited” covers this topic is much greater detail.Conclusion – Moving ForwardThe new pH of database transaction processing has allowed for more efficient vertical scaling at cost effective levels; checking the consistency of every single transaction at every moment of every change adds gargantuan costs to a system that has literally trillions of transactions occurring. The computing requirements are even more astronomical. Eventual consistency gave organizations such as Yahoo! and Google and Twitter and Amazon, plus thousands (if not millions) more the ability to interact with customers across the globe, continuously, with the necessary availability and partition tolerance, while keeping their costs down, their systems up, and their customers happy. Of course they would all like to have complete consistency all the time, but as Dan Pritchett discusses in his article “BASE: An Acid Alternative,” there has to be tradeoffs, and eventual consistency allowed for the effective development of systems that could deal with the exponential increase of data due to social networking, cloud computing and other Big Data projects.Why NoSQL Is Effective for Mobile DevicesNoSQL databases are designed to handle the dynamic needs of mobile applications. NoSQL databases do not use fixed schemas. So, in the example used above, adding new characters does not require developers to make drastic changes to the database. The developer would just be adding to the database rather than altering an existing schema. I mentioned the different use cases that mobile applications must address. This is another issue that is fixed when using NoSQL databases. One of the best examples of NoSQL databases handling the complex use cases of mobile users is Foursquare. Because Foursquare is location based, the results users get from queries or even the options available to them will differ based on location. The geospatial capabilities of an open source NoSQL database such as MongoDB make it possible for developers to easily add location-aware features. Another issue with mobile applications that NoSQL addresses is the need for constant updates. After an application has been released, maintenance becomes a major concern, among other things to consider. Because NoSQL is document based, fixing certain types of bugs and other problems doesn’t require a complete overhaul of the database, because the changes made by developers don’t necessarily affect every other aspect of the application. Finally, NoSQL is well known for its scalability. Unlike relational databases, NoSQL databases scale outward rather than vertically. This is important because as the application’s user base grows, so will the amount of data being stored in the database. 
It’s important to have a growth strategy in place prior to developing an application because worrying about data constraints after the application has been released will result in downtime for maintenance and upset users. REST stands for Representational State Transfer, and it was proposed in a doctorate dissertation. It uses the four HTTP methods GET, POST, PUT and DELETE to execute different operations. This in contrast to SOAP (Simple Object Access Protocol) for example, which creates new arbitrary commands (verbs) like getAccounts() orapplyDiscount()A REST API is a set of operations that can be invoked by means of any the four verbs, using the actual URI as parameters for your operations. For example you may have a method to query all your accounts which can be called from /accounts/all/ this invokes a HTTP GET and the 'all' parameter tells your application that it shall return all accounts.A RESTful web API (also called a RESTful web service) is a web API implemented using HTTP and REST principles. It is a collection of resources, with four defined aspects:the base URI for the web API, such as http://example.com/resources/the Internet media type of the data supported by the web API. This is often JSON but can be any other valid Internet media type provided that it is a valid hypertext standard.the set of operations supported by the web API using HTTP methods (e.g., GET, PUT, POST, or DELETE).The API must be hypertext driven.[13]A good example for horizontal scaling is Cassandra , MongoDB (scale out; add nodes)Vertical scaling – relational data – limitations (scale up; add CPU/RAM to existing )NoSQL versus relational columnar databases – Is NoSql right for you?Relational columnar databases such as SybaseIQ continue to use a relational model and are accessed via traditional SQL. The physical storage structure is very different when compared to non-relational NoSQL columnar stores, which store data as rows whose structure may vary and are organized by the developer into families of columns according to the application use case.Relational columnar databases, on the other hand, require a fixed schema with each column physically distinct from the others, which makes it impossible to declaratively optimize retrievals by organizing logical units or families. Because a NoSQL database retrieval can specify one or more column families while ignoring others, NoSQL databases can offer a significant advantage when performing individual row queries. NoSQL databases cannot meet the performance characteristics of relational columnar databases when it comes to retrieving aggregated results from groups of underlying records, however.This distinction is a litmus test when deciding between NoSQL and traditional SQL databases. NoSQL databases are not as flexible and are exceptional at speedily returning individual rows from a query. Traditional SQL databases, on the other hand, forfeit some storage capacity and scalability but provide extra flexibility with a standard, more familiar SQL interface.Since relational databases must adhere to a schema, they typically need to reserve space even for unused columns. 
NoSQL databases have a dense per-row schema and so tend to be better at optimizing the storage of sparse data, although the relational databases often use sophisticated storage-optimization techniques to mitigate this perceived shortcoming.Most importantly, relational columnar databases are generally intended for the read-only access found in conjunction with data warehouses, which provide data that was loaded collectively from conventional data stores. This can be contrasted with NoSQL columnar tables, which can handle a much higher rate of updates.
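  Since several of the stores discussed above expose REST interfaces, here is a hedged sketch of that access pattern using the Python requests library. The base URI and document schema are hypothetical; no particular product's API is implied.

```python
# REST-style document access: PUT to create/replace, GET to read, DELETE to remove.
import requests

BASE = "http://localhost:5984/patients"   # hypothetical resource collection URI

doc = {"condition": "diabetes", "readmitted_30d": False}
try:
    requests.put(f"{BASE}/patient-123", json=doc, timeout=5)

    resp = requests.get(f"{BASE}/patient-123", timeout=5)
    if resp.ok:
        print(resp.json())

    requests.delete(f"{BASE}/patient-123", timeout=5)
except requests.ConnectionError:
    print("No document store is running at the example endpoint.")
```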
  8. relies on main memory for computer data storage. It is contrasted with database management systems which employ a disk storage mechanism. Main memory databases are faster than disk-optimized databases since the internal optimization algorithms are simpler and execute fewer CPU instructions. Accessing data in memory eliminates seek time when querying the data, which provides faster and more predictable performance than disk.[1][2]In applications where response time is critical, such as telecommunications network equipment and mobile advertising networks, main memory databases are often used.[3] IMDBs have gained a lot of traction, especially in the data analytics space, starting mid-2000s mainly due to cheaper RAM.[4][5]With the introduction of NVDIMM technology[6] , in-memory databases will now be able to run at full speed and maintain data in the event of power failure.In-memory allows analytics that are unconstrained by hardware and software limitations…The NVDIMM (dual in-line memory module) is a mixed memory subsystem that combines the speed and endurance of DRAM, together with the non-volatile data retention properties of NAND flash. NVDIMMs using DRAM and NAND technology can deliver high speed and low latency “non-volatile / persistent” memory with unlimited read/write activity that can sustain itself from host power failure or a system crash. The NVDIMM can be viewed as the first commercially viable “Storage Class Memory” for the enterprise computing market.(In digital electronics, a NAND gate (Negated AND or NOT AND) is a logic gate which produces an output that is false only if all its inputs are true. A LOW (0) output results only if both the inputs to the gate are HIGH (1); if one or both inputs are LOW (0), a HIGH (1) output results. It is made using transistors.The NAND gate is significant because any boolean function can be implemented by using a combination of NAND gates. This property is called functional completeness.)With pressure growing by the day to make their products quicker to deploy and easier to use, business intelligence (BI) and data warehouse vendors are increasingly turning to in-memory technology in place of traditional disk-based storage to speed up implementations and extend self-service capabilities.Traditional BI technology loads data onto disk, often in the form of intricately modeled tables and multidimensional cubes, which can take weeks or months to develop. Queries are then made against the tables and cubes on disk. In-memory technology removes these steps, as data is loaded into random access memory and queried in the application or database itself. This greatly increases query speed and lessens the amount of data modeling needed, experts agree, meaning that in-memory BI apps can be up and running significantly faster than disk-based tools.Caching on disc not as efficient as in-memory: Caching is the process whereby on-disk databases keep frequently-accessed records in memory, for faster access. However, caching only speeds up retrieval of information, or “database reads.” Any database write – that is, an update to a record or creation of a new record – must still be written through the cache, to disk. So, the performance benefit only applies to a subset of database tasks. 
In addition, managing the cache is itself a process that requires substantial memory and CPU resources, so even a “cache hit” underperforms an in-memory database.In-memory technology is emerging now thanks to both increased customer demand for fast and flexible operational BI and data analysis capabilities, as well as technological innovation, specifically the emergence of 64-bit processors.64-bit processors, which began to replace 32-bit processors in personal computers earlier this decade, significantly increased the amount of data that could be stored in-memory and ultimately helped reduce the price of memory, which traditionally had been much more expensive than disk, spurring its use in enterprise applications.http://whatis.techtarget.com/definition/in-memory-databaseAn in-memory database (IMDB, also known as a main memory database or MMDB) is a database whose data is stored in main memory to facilitate faster response times. Source data is loaded into system memory in a compressed, non-relational format. In-memory databases streamline the work involved in processing queries. An IMDB is one type of analytic database, which is a read-only system that stores historical data on metrics for business intelligence/business analytics (BI/BA) applications, typically as part of a data warehouse or data mart. These systems allow users to run queries and reports on the information contained, which is regularly updated  to incorporate recent transaction data from an organization’s operational systems.In addition to providing extremely fast query response times, in-memory analytics can reduce or eliminate the need for data indexing and storing pre-aggregated data in OLAPcubes or aggregate tables.  This capacity reduces IT costs and allows faster implementation of BI/BA applications.Three developments in recent years have made in-memory analytics increasingly feasible:64-bit computing, multi-core servers and lower RAM prices. In-memory analytics is an approach to querying data when it resides in a computer’s random access memory (RAM), as opposed to querying data that is stored on physical disks.  This results in vastly shortened query response times, allowing business intelligence (BI) and analytic applications to support faster business decisions.As the cost of RAM declines, in-memory analytics is becoming feasible for many businesses. BI and analytic applications have long supported caching data in RAM, but older 32-bit operating systems provided only 4 GB of addressable memory.  Newer 64-bitoperating systems, with up to 1 terabyte (TB) addressable memory (and perhaps more in the future), have made it possible to cache large volumes of data -- potentially an entire data warehouse or data mart -- in a computer’s RAM.In addition to providing incredibly fast query response times, in-memory analytics can reduce or eliminate the need for data indexing and storing pre-aggregated data in OLAPcubes or aggregate tables.  This reduces IT costs and allows faster implementation of BI and analytic applications. It is anticipated that as BI and analytic applications embrace in-memory analytics, traditional data warehouses may eventually be used only for data that is not queried frequently.
9. http://www.dwbiconcepts.com/data-warehousing/18-dwbi-basic-concepts/102-nosql-database-tutorial.html

Optimize the environment based on the analytical workload. The type of database to select depends on the characteristics of the data and the mix of workloads (transactional versus batch). For high-velocity capture and analysis, you need a “scale up” approach, i.e., vertical scaling using in-memory databases. For high-variety datasets, you need the ability to distribute processing and leverage parallelism. As illustrated, the relational database has limited ability to scale up and out. As data volume increases (with velocity and/or variety), both in-memory and MPP become necessary. Our sponsor, SAP, offers solutions that embed in-memory and MPP: HANA is an in-memory, appliance-based database, while Sybase IQ is an on-disk, software-only (commodity hardware) database. Both use columnar (NoSQL) structures to optimize performance, with compression rates of at least 5x those of relational databases.

An analytic database, also called an analytical database, is a read-only system that stores historical data on business metrics such as sales performance and inventory levels. Business analysts, corporate executives and other workers can run queries and reports against an analytic database. The information is updated on a regular basis to incorporate recent transaction data from an organization’s operational systems. An analytic database is specifically designed to support business intelligence (BI) and analytic applications, typically as part of a data warehouse or data mart. This differentiates it from an operational, transactional or OLTP database, which is used for transaction processing – i.e., order entry and other “run the business” applications. Databases that do transaction processing can also be used to support data warehouses and BI applications, but analytic database vendors claim that their products offer performance and scalability advantages over conventional relational database software. There are currently five main types of analytic databases on the market:

• Columnar databases, which organize data by columns instead of rows, reducing the number of data elements that typically have to be read by the database engine while processing queries.
• Data warehouse appliances, which combine the database with hardware and BI tools in an integrated platform that is tuned for analytical workloads and designed to be easy to install and operate.
• In-memory databases, which load the source data into system memory in a compressed, non-relational format to streamline the work involved in processing queries.
• Massively parallel processing (MPP) databases, which spread data across a cluster of servers, enabling the systems to share the query processing workload.
• Online analytical processing (OLAP) databases, which store multidimensional “cubes” of aggregated data for analyzing information based on multiple data attributes.
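The row-versus-column distinction can be illustrated with a short, hedged Python sketch that contrasts a row-oriented layout (a list of records) with a column-oriented layout (one array per attribute). The field names are hypothetical; the point is only that an aggregate over one attribute touches far less data in the columnar layout, and that a column of repeated values compresses well, which is where columnar compression claims such as 5x come from.

# Row-oriented: each record stored together, as in a traditional RDBMS page.
rows = [
    {"claim_id": 1, "state": "VA", "amount": 120.0},
    {"claim_id": 2, "state": "VA", "amount": 85.5},
    {"claim_id": 3, "state": "MD", "amount": 300.0},
]

# Column-oriented: one array per attribute, as in a columnar analytic store.
columns = {
    "claim_id": [1, 2, 3],
    "state":    ["VA", "VA", "MD"],
    "amount":   [120.0, 85.5, 300.0],
}

# An analytic query such as SUM(amount) only needs to read one column...
total_columnar = sum(columns["amount"])

# ...whereas the row layout forces a scan over every full record.
total_rowwise = sum(r["amount"] for r in rows)

# Repeated values within a column (e.g. "VA", "VA") also run-length encode
# cheaply, which is the basis of columnar compression.
assert total_columnar == total_rowwise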
10. The left side is your relational brain, modeled around many tables, each with rows, columns, attributes, and keys. The right side is your more casual brain, mixing formal and distributed data structures that scale horizontally (maximizing in-memory). It is used for exploring large amounts of data, when performance and real-time responsiveness matter more than consistency.

Query: trying to find a needle in a haystack. Discovery: trying to find a specific needle in a needle stack. Discovery problems are being worked on every day: intelligence analysts are trying to discover new threats, medical researchers are trying to discover new drugs, and financial analysts are trying to discover new trading strategies. Across every industry, discovery problems abound, and solving them generates incredible value for organizations – whether that is discovering a new threat, a new drug, or a new trading strategy.
11. We have all heard about the explosive growth in data. The more important story is the software and hardware advancements that allow users to explore this data.

Data Growth Chart
1 Bit = Binary Digit
8 Bits = 1 Byte
1000 Bytes = 1 Kilobyte
1000 Kilobytes = 1 Megabyte
1000 Megabytes = 1 Gigabyte
1000 Gigabytes = 1 Terabyte
1000 Terabytes = 1 Petabyte
1000 Petabytes = 1 Exabyte
1000 Exabytes = 1 Zettabyte
1000 Zettabytes = 1 Yottabyte
1000 Yottabytes = 1 Brontobyte
1000 Brontobytes = 1 Geopbyte

Wikipedia definition, crafted by melding together commentary from analysts at The 451, IDC, Monash Research, a TDWI alumnus, and a Gartner expert: “Big Data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process within a tolerable elapsed time. Big Data sizes are a constantly moving target, currently ranging from a few dozen terabytes to many petabytes in a single data set.”

RDBMSs store gigabytes of data (for transactional data); data warehouses store terabytes of information (for analysis); Big Data repositories hold petabytes of data and growing.
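A small Python helper, sketched below under the same decimal (1000-based) convention as the chart above, converts a raw byte count into these human-readable units. The function name and the unit list are our own illustration, not part of the original chart.

# Decimal (SI-style) units, matching the 1000-based chart above.
UNITS = ["bytes", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def human_readable(num_bytes: float) -> str:
    """Convert a byte count to the largest sensible decimal unit."""
    value = float(num_bytes)
    for unit in UNITS:
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:.1f} {unit}"
        value /= 1000.0

# A "Big Data" repository in the petabyte range:
print(human_readable(3.2e15))   # -> "3.2 PB"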
  12. Answers to questions from attendees
13. The overall goal is to move from exploratory ideation and prototyping to high-value business intelligence that a program or policy organization can act upon for the benefit of customers and taxpayers.

The typical path to actionable business intelligence begins with several iterations of customer-driven Design-Thinking sessions, using observations, questions and hypotheses to explore the customers’ complex problems and their context (e.g., the customers’ customer or the regulatory environment). The initial exploratory session is focused on defining the problem, understanding the impact on customers and users, and rapidly creating and testing alternative solutions on paper. During subsequent sessions, customers and analysts work together to mature the prototype by aggregating, filtering and correlating data using cloud-based software tools. It is important to confirm (reality test) any insights resulting from exploratory activities before making policy or program changes. The entire process takes a small, diverse team from 60 to 120 days.
14. Distributed DW architecture. The issue in a multi-workload environment is whether a single-platform data warehouse can be designed and optimized such that all workloads run optimally, even when concurrent. More DW teams are concluding that a multi-platform data warehouse environment is more cost-effective and flexible; in addition, some workloads are better optimized when moved to a platform alongside the data warehouse. In reaction, many organizations now maintain a core DW platform for traditional workloads but offload other workloads to other platforms. For example, data and processing for SQL-based analytics are regularly offloaded to DW appliances and columnar DBMSs. A few teams offload workloads for big data and advanced analytics to HDFS, discovery platforms, MapReduce, and similar platforms. The result is a strong trend toward distributed DW architectures, where many areas of the logical DW architecture are physically deployed on standalone platforms instead of the core DW platform. Big Data requires a new generation of scalable technologies designed to extract meaning from very large volumes of disparate, multi-structured data by enabling high-velocity capture, discovery, and analysis.

Source of second graphic: http://www.saama.com/blog/bid/78289/Why-large-enterprises-and-EDW-owners-suddenly-care-about-BigData
http://www.cloudera.com/content/dam/cloudera/Resources/PDF/Hadoop_and_the_Data_Warehouse_Whitepaper.pdf

Complex Hadoop jobs can use the data warehouse as a data source, simultaneously leveraging the massively parallel capabilities of two systems. Any MapReduce program can issue SQL statements to the data warehouse. In one context, a MapReduce program is “just another program,” and the data warehouse is “just another database.” Now imagine 100 MapReduce programs concurrently accessing 100 data warehouse nodes in parallel: both the raw processing layer and the data warehouse scale to meet any big data challenge. Inevitably, visionary companies will take this step to achieve competitive advantages.

Promising uses of Hadoop that impact DW architectures. I see a handful of areas in data warehouse architectures where HDFS and other Hadoop products have the potential to play positive roles:

Data staging. A lot of data processing occurs in a DW’s staging area, to prepare source data for specific uses (reporting, analytics, OLAP) and for loading into specific databases (DWs, marts, appliances). Much of this processing is done by homegrown or tool-based solutions for extract, transform, and load (ETL). Imagine staging and processing a wide variety of data on HDFS. Users who prefer to hand-code most of their ETL solutions will most likely feel at home in code-intense environments like Apache MapReduce, and they may be able to refactor existing code to run there. For users who prefer to build their ETL solutions atop a vendor tool, the community of ETL and data management tool vendors is rolling out new interfaces and functions for the entire Hadoop product family. Note the assumption that, whether you use Hadoop or not, you should physically locate your data staging area(s) on standalone systems outside the core data warehouse, if you haven’t already. That way, you preserve the core DW’s capacity for what it does best: squeaky clean, well-modeled data (with an audit trail via metadata and master data) for standard reports, dashboards, performance management, and OLAP.
In this scenario, the standalone data staging area(s) offload most of the management of big data, the archiving of source data, and much of the data processing for ETL, data quality, and so on.

Data archiving. When organizations embrace forms of advanced analytics that require detailed source data, they amass large volumes of source data, which taxes the areas of the DW architecture where source data is stored. Imagine managing detailed source data as an archive on HDFS. You probably already do archiving with your data staging area, though you probably don’t call it archiving. If you think of it as an archive, maybe you’ll adopt the best practices of archiving, especially information lifecycle management (ILM), which is valuable but woefully absent from most DWs today. Archiving is yet another thing the staging area in a modern DW architecture must do, and thus another reason to offload the staging area from the core DW platform. Traditionally, enterprises had three options when it came to archiving data: leave it within a relational database, move it to tape or optical disk, or delete it. Hadoop’s scalability and low cost enable organizations to keep far more data in a readily accessible online environment. An online archive can greatly expand applications in business intelligence, advanced analytics, data exploration, auditing, security, and risk management.

Multi-structured data. Relatively few organizations are getting BI value from semi- and unstructured data, despite years of wishing for it. Imagine HDFS as a special place within your DW environment for managing and processing semi-structured and unstructured data. Another way to put it: imagine not stretching your RDBMS-based DW platform to handle data types it is not all that good with. One of Hadoop’s strongest complements to a DW is its handling of semi- and unstructured data. But don’t assume that Hadoop is only for unstructured data: HDFS handles the full range of data, including structured forms. In fact, Hadoop can manage just about any data you can store in a file and copy into HDFS.

Processing flexibility. Given its ability to manage diverse multi-structured data, Hadoop’s NoSQL approach is a natural framework for manipulating non-traditional data types. These data types are often free of schema or metadata, which makes them challenging for SQL-based relational DBMSs. Hadoop supports a variety of programming languages (Java, R, C), thus providing more capabilities than SQL alone can offer. In addition, Hadoop enables the growing practice of “late binding”: instead of transforming data as it is ingested (the way you often do with ETL for data warehousing), which imposes an a priori model on the data, structure is applied at runtime. This, in turn, enables the open-ended data exploration and discovery analytics that many users are looking for today.

Advanced analytics. Imagine HDFS as a data staging area, archive, or twenty-first-century operational data store that manages and processes big data for advanced forms of analytics, especially those based on MapReduce, data mining, statistical analysis, and natural language processing (NLP). Advanced analytics is one of the strongest influences on data warehouse architectures today, whether Hadoop is in use or not.

Analyze and store approach (ELT?). The analyze and store approach analyzes data as it flows through business processes, across networks, and between systems.
The analytical results can then be published to interactive dashboards and/or into a data store (such as a data warehouse) for user access, historical reporting and additional analysis. This approach can also be used to filter and aggregate big data before it is brought into a data warehouse. There are two main ways of implementing the analyze and store approach:

• Embedding the analytical processing in business processes. This technique works well when implementing business process management and service-oriented technologies, because the analytical processing can be called as a service from the process workflow. It is particularly useful for monitoring and analyzing business processes and activities in close to real time – action times of a few seconds or minutes are possible. The process analytics created can also be published to an operational dashboard or stored in a data warehouse for subsequent use.

• Analyzing streaming data as it flows across networks and between systems. This technique is used to analyze data from a variety of different (possibly unrelated) data sources where the volumes are too high for the store and analyze approach, where sub-second action times are required, and/or where there is a need to analyze the data streams for patterns and relationships. To date, many vendors have focused on analyzing event streams (from trading systems, for example) using the services of a complex event processing (CEP) engine, but this style of processing is evolving to support a wider variety of streaming technologies and data, creating stream analytics from many types of streaming data such as event, video and GPS data.

The benefits of the analyze and store approach are fast action times and lower data storage overheads, because the raw data does not have to be gathered and consolidated before it can be analyzed. For example, HiveQL can be used to create a load-ready file for a relational database.
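As a hedged sketch of the staging pattern described above (and of the "load-ready file" idea), the Python script below could be run as a Hadoop Streaming mapper: it reads raw, semi-structured log lines from standard input, applies structure at read time, and emits pipe-delimited records suitable for bulk loading into a relational data warehouse. The field layout and the notion of a "claim event" are hypothetical examples, not part of the source material.

#!/usr/bin/env python
# Hypothetical Hadoop Streaming mapper: raw source lines in, load-ready rows out.
# Illustrative invocation only: hadoop jar hadoop-streaming.jar -mapper stage_mapper.py ...
import sys
import json

def main():
    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)          # semi-structured source record
        except ValueError:
            continue                          # late binding: skip what doesn't parse
        # Keep only the fields the warehouse schema needs (schema applied at read time).
        claim_id = event.get("claim_id")
        amount = event.get("amount")
        state = event.get("state", "UNKNOWN")
        if claim_id is None or amount is None:
            continue
        # Pipe-delimited, load-ready output for a bulk loader or an external table.
        sys.stdout.write(f"{claim_id}|{state}|{float(amount):.2f}\n")

if __name__ == "__main__":
    main()

Because the script is just a stdin-to-stdout filter, it can be tested locally on a sample file before being handed to the cluster, which is part of the appeal of staging ETL work on Hadoop.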
15. From the EDW to the multi-platform unified data architecture. A consequence of the workload-centric approach (coupled with a reassessment of DW economics) is a trend away from the single-platform monolith of the enterprise data warehouse (EDW) and toward a physically distributed unified data architecture (UDA).2 A modern UDA is a logical design that assumes deployment onto multiple platform types, ranging from the traditional warehouse (and its satellite systems for marts and ODSs) to new platforms such as DW appliances, columnar DBMSs, NoSQL databases, MapReduce tools, and even a file system on steroids such as HDFS. The multi-platform approach of UDA adds complexity to the DW environment, but vendor R&D is working to abstract that complexity and take advantage of the various capability and cost options. Moving data around is inevitable in a multi-platform UDA, so there also needs to be a well-defined data integration architecture. Even so, an assumption behind UDA is that data structures and their deployment platforms will integrate on the fly (due to the exploratory nature of analytics) in a loosely coupled fashion, so the architecture should define data standards, preferred interfaces, and shared business rules to give loose coupling consistent usage.

Data virtualization is an umbrella term used to describe any approach to data management that allows an application to retrieve and manipulate data without requiring technical details about the data, such as how it is formatted or where it is physically located. You are probably familiar with the concept if you store photos on the social networking site Facebook. When you upload a photo to Facebook from your desktop computer, you must provide the upload tool with information about the location of the photo – the photo’s file path. Once it has been uploaded, however, you can retrieve the photo without having to know its new file path; in fact, you will have no idea where Facebook is storing your photo, because Facebook’s software has an abstraction layer that hides that technical information. This abstraction layer is what some vendors mean when they use the term data virtualization. The term can be confusing because some vendors use the labels data virtualization and data federation interchangeably, but they mean slightly different things. The goal of data federation technology is to aggregate heterogeneous data from disparate sources and view it in a consistent manner from a single point of access. The term data virtualization simply means that the technical information about the data has been hidden; strictly speaking, it does not imply that the data is heterogeneous or that it can be viewed from a single point of access.
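The abstraction-layer idea can be illustrated with a short Python sketch: the class below exposes a single get() call and hides whether a record actually lives in a local dictionary or behind a remote HTTP endpoint. The class, method names and URL are hypothetical illustrations of the pattern, not any vendor's data virtualization API.

import urllib.request
import json

class VirtualizedStore:
    """Single point of access that hides where (and how) the data is stored."""

    def __init__(self, local_cache: dict, remote_base_url: str):
        self._local = local_cache              # e.g. an in-memory reference table
        self._remote = remote_base_url         # e.g. a REST service in front of a DW

    def get(self, key: str) -> dict:
        # Callers never learn which source answered; that hiding is the "virtualization".
        if key in self._local:
            return self._local[key]
        with urllib.request.urlopen(f"{self._remote}/{key}") as resp:
            return json.loads(resp.read())

# Usage: the caller asks for a record by key and needs no file paths or schemas.
store = VirtualizedStore({"patient-42": {"name": "example"}},
                         "https://example.invalid/records")  # hypothetical endpoint
print(store.get("patient-42"))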