Shreya Pal
Chief Architect, Saama
9-Sep-2017
BDaaS - Big Data as a Service
Content
Digital Vortex 2015
• What is BDaaS?
• Challenges
• BDaaS layers
• BDaaS Advantages
• BDaaS Enterprise Requirements
• Life Sciences Case Study
Conflicting Enterprise Needs
Data scientists want flexibility
• Different versions (new releases) of Hadoop, Spark, etc.
• Different sets of BI/analytics tools
IT wants control
• Multitenancy
• QoS, data access
• Security
• Network authentication and authorization
Challenges
• Data is becoming increasingly:
  • Voluminous
  • Varied
  • Complex
  • Less structured
• Infrastructure setup
• Maintenance of infrastructure (updates, patching, etc.)
• Deployment time
• On Demand Scaling
• Cost
Rise of BDaaS
What is BDaaS?
• On demand
• Self service
• Elastic
• Big data: infrastructure, applications, analytics
BDaaS provides a cloud-based framework that offers end-to-end big data solutions to business organizations.
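Layers in BDaaS
(Stack diagram. Layers are listed in slide order with the example technology shown for each.)
• Infrastructure: hardware platform
• Cloud Infrastructure: IaaS
• Data Storage: HDFS
• Computing: Spark, MR
• Data Management: RDS
• Data Analytics: Tableau, R
• Presentation Layer
Diagram axis labels: Ease of use; Big data as a service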
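The layering on this slide can be modeled as an ordered stack. The following is a minimal sketch and not part of the original deck: the `Layer` class and the `layers_below` helper are hypothetical names, the example technologies are the ones named on the slide, and the sketch assumes the slide lists the layers from lowest to highest.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Layer:
    """One layer of the BDaaS stack with the example technologies from the slide."""
    name: str
    examples: Tuple[str, ...]


# Assumed ordering: lowest layer first, highest layer last.
BDAAS_STACK = (
    Layer("Infrastructure", ("Hardware platform",)),
    Layer("Cloud Infrastructure", ("IaaS",)),
    Layer("Data Storage", ("HDFS",)),
    Layer("Computing", ("Spark", "MR")),
    Layer("Data Management", ("RDS",)),
    Layer("Data Analytics", ("Tableau", "R")),
)


def layers_below(name: str) -> List[str]:
    """Return the names of the layers that the given layer builds on."""
    names = [layer.name for layer in BDAAS_STACK]
    return names[: names.index(name)]


if __name__ == "__main__":
    # Show what the Computing layer depends on in this model.
    print(layers_below("Computing"))
```

Keeping the stack as an ordered tuple makes the dependency direction explicit: each layer relies only on the entries that precede it.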
BDaaS Advantages
- Scalability
- Reliability
- Availability
- Flexibility
- Pre-stitched big data stack
- Cost Effectiveness
BDaaS Enterprise Requirements
- Multitenancy
- Support for Application
- High Availability
- Support for HA
- Cluster expansion and contraction
- Infrastructure and Operation requirements
- Integration with existing network configuration
- Supported versions of OS, containers etc.
- Integration with LDAP
- Upgrade
- Capacity expansion
- Monitoring
Life Sciences Case Study - Operational Data Repository
Business Problem
Data Sources (varied-structure data)
Source categories: CDISC Standards Clinical Data, Safety Data, Varied Sources, Syndicated & Large Data
• Electronic Data Capture
• Clinical Trials Management System
• Safety Data Warehouse
• Global Safety Data Warehouse
• ARGUS
• Clinical Study Reports
• Disparate Business Unit Reports
• External analyses
• Non-Clinical, Pre-Clinical Data & Reports
• Real World Claims Data
• Internal Genomics Data
• Public Data (KEGG, NCBI, ChEMBL, etc.)
• Trials Trove, CT.gov
Infrastructure
• Big Data
• Relational Data
• Advanced Analytical Tools
• Shared Metadata
Enabled Analytics
Patient & Studies Analytics
• Clinical Study Data Mart
• Clinical Outcomes Analytics
Drug Safety & Analytics
• Safety Outcome & Reporting Analytics
• Trial Management Analytics
• Real World Signal Detection Analytics
• Activity Enablement
Fluid Analytics Engine and AWS
Cloud provider – AWS
Hadoop distribution – Cloudera
Storage – S3, Hive, Impala
Archival – Glacier
Processing – Spark
Monitoring – CloudWatch
Metadata storage – Amazon RDS
Automation – CloudFormation template
Access – AWS IAM
Cluster – VPC
LAN connectivity – Direct Connect
High Level Flow
(Flow diagram. Recoverable labels are listed below. Components in the diagram are tagged FAE, presumably the Fluid Analytics Engine, or AWS.)
• Sources: CTMS, IRT, EDC, CRO data, master data
• Layers: Landing Layer, Standardized Layer, Aggregated Layer, Reporting & Analysis Layer
• Processing steps: Data Cleansing, Data Transformation, Data Aggregation
• Data: Raw CDC, detail data, Common Data Model, Aggregated Data Model
• Supporting components: Data Quality Rules Repository, Data Vocabulary, Metadata Repository and Execution Engine, Scheduling, Monitoring, Alerts and Notifications, Data Security & Governance
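The movement of data from a landing layer through cleansing and standardization to an aggregated layer can be illustrated with a small in-memory sketch. Everything in it is a hypothetical illustration: the record fields (`study_id`, `site`, `enrolled`), the two quality rules, and the function names are assumptions made for this example and do not reflect the Fluid Analytics Engine's actual API or data model.

```python
from collections import defaultdict

# Hypothetical raw records as they might arrive in the landing layer.
LANDING = [
    {"study_id": "S1", "site": " us-01 ", "enrolled": "12"},
    {"study_id": "S1", "site": "US-02", "enrolled": "8"},
    {"study_id": "", "site": "US-03", "enrolled": "5"},     # missing study_id
    {"study_id": "S2", "site": "DE-01", "enrolled": "-3"},  # negative count
]


def _is_non_negative_int(text):
    """True when text is a plain base-10 integer that is zero or greater."""
    return text.isdigit()


# Hypothetical data quality rules: rule name -> predicate over a raw record.
QUALITY_RULES = {
    "study_id_present": lambda r: bool(r["study_id"].strip()),
    "enrolled_non_negative": lambda r: _is_non_negative_int(r["enrolled"]),
}


def cleanse(records):
    """Split records into those passing every rule and those rejected, with reasons."""
    passed, rejected = [], []
    for record in records:
        failures = [name for name, rule in QUALITY_RULES.items() if not rule(record)]
        if failures:
            rejected.append((record, failures))
        else:
            passed.append(record)
    return passed, rejected


def standardize(records):
    """Map cleansed records onto a common shape with typed, normalized fields."""
    return [
        {
            "study_id": r["study_id"].strip(),
            "site": r["site"].strip().upper(),
            "enrolled": int(r["enrolled"]),
        }
        for r in records
    ]


def aggregate(records):
    """Roll standardized detail rows up to one enrollment total per study."""
    totals = defaultdict(int)
    for r in records:
        totals[r["study_id"]] += r["enrolled"]
    return dict(totals)


clean, rejected = cleanse(LANDING)
standardized = standardize(clean)
aggregated = aggregate(standardized)
```

Rejected records are kept alongside the names of the rules they failed, which is the kind of information an alerting or monitoring component could consume.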
Advantages
• Development time reduced by 35-40%
• Testing of individual components not required
• Pre-built data quality rules
• Pre-built workflows
• Pre-built KPIs
• Pre-built common data model and aggregated data model
Questions?

Editor's Notes

  • #8 Layers in BDaaS
    Data Analytics: This layer includes high-level analytical applications, such as R or Tableau, delivered over a cloud computing platform and used to analyze the underlying data. Users access these technologies through a web interface, where they can create queries and define reports based on the data in the storage layer. Technologies in this layer abstract the complexities of the underlying BDaaS stack and enable better utilization of data within the system. Their web interfaces may include wizards and graphical tools that let the user perform complex statistical analysis.
    Data Management: In this layer, higher-level applications such as Amazon Relational Database Service (RDS) and DynamoDB provide distributed data management and processing services. Technologies in this layer provide database management services over a cloud platform.
    Computation Layer: This layer is composed of technologies that provide computing services over a web platform. For example, using Amazon Elastic MapReduce (EMR), users can write programs to manipulate data and store the results in a cloud platform. The layer includes the processing framework as well as the APIs and other programs that help user programs make use of it.
    Cloud Infrastructure: In this layer, cloud platforms such as OpenStack or VMware ESX Server provide the virtual cloud environment that forms the basis of the BDaaS stack.
    Data Infrastructure: This layer is composed of the actual data center hardware and the physical nodes of the system. Data centers are typically composed of thousands of servers connected to each other by a high-speed network that enables data transfer. They also have routers, firewalls, and backup systems to ensure protection against data loss.