#ATAGTR2017
16th–17th March 2017
BigData Performance Testing
Abhinav Gupta
Agile Testing Alliance Global Testing Retreat 2017
Content
 What is BigData?
 Motivation
 Data Acquisition & HDFS Architecture
 Why Performance Testing?
 Performance Testing Approach
 Security
 Summary
What is BigData?

"In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log, they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but for more systems of computers."
—Grace Hopper
Hallmarks of a BigData platform:
• Resilient to failure
• Highly scalable
• Cost effective
• Flexible
• Fast to process massive volumes of data

[Diagram: the SEMMA analytics cycle: Sample, Explore, Modify, Model, Assess.]

Motivation
With the advent of digital platforms, data creation has outgrown the storage and processing capabilities of a single server. This has led to the design of new frameworks and tools that keep pace with data generation and turn the data into competitive advantage.
Data Acquisition & HDFS

[Diagram: unstructured application logs flow through Apache Flume agents (source → channel → sink) into HDFS; structured data from Teradata is imported with Sqoop. The cluster consists of a Name Node and many Data Nodes.]

• HDFS clients talk to the NameNode for metadata-related activities, and to DataNodes to read and write files.
• The HDFS NameNode keeps the filesystem metadata in memory, such as which DataNodes manage the blocks for each file.
• DataNodes communicate with each other for pipelined file reads and writes.
• Data is split into blocks and stored on HDFS, distributed over many nodes for fault tolerance.
• HDFS has one or two name nodes and many slave/data nodes; name nodes and data nodes reside on commodity servers, and each node offers local storage and computation.
• The core concept of Hadoop is to move the processing to the data store, rather than move the data to the processing system.
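To make the client/NameNode/DataNode split concrete, here is a minimal Python sketch against the WebHDFS REST API (the host name, port and file path are illustrative assumptions; 9870 is the Hadoop 3 default NameNode web port). The NameNode answers the metadata request itself, but redirects the actual read to a DataNode:

  import requests

  NAMENODE = "http://namenode.example.com:9870"  # hypothetical host
  PATH = "/data/app-logs/2017-03-16.log"         # hypothetical file

  # Metadata request: answered by the NameNode itself.
  meta = requests.get(f"{NAMENODE}/webhdfs/v1{PATH}",
                      params={"op": "GETFILESTATUS"}).json()["FileStatus"]
  print(meta["length"], "bytes in blocks of", meta["blockSize"])

  # Data request: the NameNode replies with an HTTP 307 redirect pointing
  # at a DataNode, which is what actually streams the file content.
  resp = requests.get(f"{NAMENODE}/webhdfs/v1{PATH}",
                      params={"op": "OPEN"}, allow_redirects=False)
  print("read redirected to DataNode:", resp.headers.get("Location"))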
Why Performance Testing?

Need:
 The rise of digital platforms to engage customers has generated massive volumes of data
 Increased compliance cost due to regulatory demands
 Upsurge in competition
 Numerous technologies with a high degree of interaction
 Minimal tolerance for application downtime

Benefits:
 Increase the customer base by harnessing data from digital channels (mobile, website, social)
 Reduce the compliance and operational cost of maintaining data
 Faster decisions using advanced data analytics
 High availability of systems and reduced downtime
 Low maintenance cost
 Minimal data loss
Performance Testing Approach

Setup BigData Application → Identify and Design Workload → Prepare Individual Clients → Execution and Analysis → Optimum Configuration → Tune Components and Deployment

Performance testing a big data application involves huge volumes of structured and unstructured data and requires a specific approach (a sketch of the execution-and-analysis phase appears after the considerations list below).

Key Performance Considerations:
• Level of parallelism
• Serialization format
• Memory management
• User code
• Hardware provisioning
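As a hedged sketch of the execution-and-analysis phase, the snippet below repeatedly times a Hive workload submitted through beeline and reports summary statistics (the JDBC URL, query and table name are illustrative assumptions, not from the deck):

  import statistics
  import subprocess
  import time

  JDBC_URL = "jdbc:hive2://hiveserver.example.com:10000"  # hypothetical HiveServer2
  QUERY = "SELECT COUNT(*) FROM app_logs WHERE dt='2017-03-16'"  # hypothetical table

  timings = []
  for _ in range(5):  # repeat to smooth out run-to-run variance
      start = time.perf_counter()
      subprocess.run(["beeline", "-u", JDBC_URL, "-e", QUERY],
                     check=True, capture_output=True)
      timings.append(time.perf_counter() - start)

  print(f"median {statistics.median(timings):.1f}s, "
        f"max {max(timings):.1f}s over {len(timings)} runs")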
Key Performance Indicators: HDFS Metrics

NameNode metrics:
• Capacity Remaining
• Corrupt Blocks / Missing Blocks
• Volume Failures Total
• Num Live DataNodes / Num Dead DataNodes
• Files Total
• Total Load
• Block Capacity / Blocks Total
• Under-Replicated Blocks
• Num Stale DataNodes

DataNode metrics:
• Remaining
• Num Failed Volumes
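Most of the NameNode indicators above are exposed as attributes of the FSNamesystem (and FSNamesystemState) JMX beans, which the NameNode web server also serves as JSON under /jmx. A minimal polling sketch, with an illustrative host:

  import requests

  JMX_URL = "http://namenode.example.com:9870/jmx"  # hypothetical host

  # The FSNamesystem bean carries CapacityRemaining, CorruptBlocks,
  # MissingBlocks, FilesTotal, TotalLoad, BlocksTotal and more.
  bean = requests.get(JMX_URL, params={
      "qry": "Hadoop:service=NameNode,name=FSNamesystem"}).json()["beans"][0]

  for key in ("CapacityRemaining", "CorruptBlocks", "MissingBlocks",
              "FilesTotal", "TotalLoad", "BlocksTotal",
              "UnderReplicatedBlocks"):
      print(f"{key}: {bean[key]}")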
Key Performance Indicators: YARN Metrics

Cluster metrics:
• Unhealthy Nodes
• Active Nodes
• Lost Nodes
• Apps Failed
• Total MB / Allocated MB

Application metrics:
• Progress

NodeManager metrics:
• Containers Failed
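The cluster-level YARN indicators are served by the ResourceManager REST API at /ws/v1/cluster/metrics. A hedged sketch, assuming the default ResourceManager web port 8088 and an illustrative host:

  import requests

  RM_URL = "http://resourcemanager.example.com:8088"  # hypothetical host

  m = requests.get(f"{RM_URL}/ws/v1/cluster/metrics").json()["clusterMetrics"]

  print("active nodes:   ", m["activeNodes"])
  print("unhealthy nodes:", m["unhealthyNodes"])
  print("lost nodes:     ", m["lostNodes"])
  print("apps failed:    ", m["appsFailed"])
  print("memory:         ", f'{m["allocatedMB"]}/{m["totalMB"]} MB allocated')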
Security

• The file permission system in HDFS prevents one user from accidentally wiping out the whole filesystem, e.g. with a stray

  hadoop fs -rmr /

• It does not, however, prevent a malicious user from assuming root’s identity to access or delete any data in the cluster.
• To meet regulatory requirements for data protection, secure authentication must be in place for a shared cluster.
• This led to the adoption of Kerberos, a mature open-source network authentication protocol, to authenticate users.

Source: Hadoop: The Definitive Guide
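On a Kerberized cluster a client must hold a valid ticket (obtained with kinit) before HDFS will trust its identity. A minimal sketch of an authenticated WebHDFS call using the third-party requests-kerberos package (host and path are illustrative; this assumes a ticket already sits in the credential cache):

  import requests
  from requests_kerberos import HTTPKerberosAuth, OPTIONAL

  NAMENODE = "http://namenode.example.com:9870"  # hypothetical host

  # SPNEGO/Kerberos handshake using the ticket from `kinit user@REALM`.
  resp = requests.get(f"{NAMENODE}/webhdfs/v1/user",
                      params={"op": "LISTSTATUS"},
                      auth=HTTPKerberosAuth(mutual_authentication=OPTIONAL))
  resp.raise_for_status()
  for entry in resp.json()["FileStatuses"]["FileStatus"]:
      print(entry["pathSuffix"], entry["owner"], entry["permission"])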
Case Study

Background:
• The client, a U.S. insurer, is a major writer of commercial property casualty and personal insurance.
• Its Personal Insurance line of business developed a framework to consume all application logs and DB data, using that data to troubleshoot production issues, understand user patterns, forecast transactions and create workload models.
• The client decided to move to BigData Hadoop: a huge volume of transaction data (multiple terabytes) from production was moved onto the Hadoop platform.

Customer Needs:
 Data loading into Hadoop for various scenarios
 Transformation of unstructured data into structured data
 Workload models built using Hive

Challenges:
• Historical data retention
• Delays in getting the required results from data mining
• New technology
• Due to the huge data volume, performance monitoring on Hadoop was required to give stakeholders confidence in the framework’s performance

The Solution:
• Cognizant needed to define the testing approach, identify a monitoring tool for Hadoop, provide recommendations for data loading and create the performance testing strategy.
• Cognizant helped finalize the non-functional requirements and the data loading strategy during planning.
• A performance monitoring strategy was defined.
• The team provided recommendations to resolve issues related to scalability and memory and to improve end-to-end execution timings.
• The recommendations were implemented and the framework was successfully operationalised in a production setting.

Outcome:
• Non-functional testing (NFT) enabled successful testing of the framework and provided various solutions for tuning it.
• An enhanced approach was created from the lessons learnt.
Example – Data Transformation
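The figure from this slide is not preserved in the transcript. As an illustrative stand-in for the kind of transformation the case study describes, here is a small sketch that parses unstructured application-log lines into structured, delimited records (the log format and field names are assumptions):

  import csv
  import re
  import sys

  # Hypothetical log line:
  # "2017-03-16 10:42:01 INFO checkout user=alice latency_ms=87"
  LINE_RE = re.compile(
      r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
      r"(?P<level>\w+) (?P<txn>\w+) user=(?P<user>\w+) latency_ms=(?P<latency>\d+)")

  writer = csv.writer(sys.stdout)
  writer.writerow(["ts", "level", "txn", "user", "latency_ms"])
  for line in sys.stdin:
      match = LINE_RE.match(line)
      if match:  # skip lines that do not fit the expected format
          writer.writerow([match["ts"], match["level"], match["txn"],
                           match["user"], match["latency"]])

Delimited output like this maps naturally onto a Hive external table, consistent with the case study’s use of Hive for workload models.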
Summary

• Performance testing of Big Data applications is a pressing need: it is necessary not only to validate an application’s reliability under massive data flow and heavy computation, but also to validate its robustness and ensure minimal data loss.
• The performance testing approach described above:
 validates application stability under heavy data load,
 drives high efficiency in the code that transforms and acts on the data,
 establishes a framework for validating the infrastructure, and
 drives any server-related changes should memory or CPU errors appear.
• It has also paved the way to explore other areas, such as:
 forecasting transaction response times from current operational metrics,
 predicting efficient workload models across the IT infrastructure, and
 predicting production bottlenecks.
Appendix
Metrics Captured on the NameNode
An example of how intermediate results are split into parts and stored in HDFS
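The figure from this appendix slide is also not preserved. Conceptually, a MapReduce or Hive job hash-partitions its intermediate results by key across N reducers, and each reducer writes its own part file (part-r-00000, part-r-00001, ...) into the job’s HDFS output directory. A toy sketch of that partitioning step (records and partition count are illustrative; a real partitioner is deterministic, whereas Python salts its string hash per process):

  NUM_PARTS = 3  # one part file per reducer, illustratively

  records = [("alice", 87), ("bob", 120), ("carol", 45), ("dave", 200)]

  # Default MapReduce behaviour: partition = hash(key) mod reducer count.
  parts = {i: [] for i in range(NUM_PARTS)}
  for key, value in records:
      parts[hash(key) % NUM_PARTS].append((key, value))

  for i, rows in sorted(parts.items()):
      print(f"part-r-{i:05d}: {rows}")  # each would be one file in HDFS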
