Insights Into Big Data Testing
VodQA 2019
AGENDA
• Introduction to Big Data Applications
• Testing Aspects
• Various Types of Tests
• Automation Tools/Framework
• Testing Challenges
What is Big Data?
Mammoth of Data
Is any data big?
The V’s of Big Data: Volume, Velocity, Variety & Veracity
Big Data Applications
Big Data Ecosystem
Hadoop is one of the solutions to the Big Data problem.
How you test depends on the kinds of tools used in the Big Data application.
Big Data Application Workflow
A Big Data application typically has the following stages (a minimal Spark sketch follows the list):
1. Load source data files into HDFS
2. Perform MapReduce/Spark operations
3. Extract/query the output from HDFS
4. Reporting & analysis
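To make these stages concrete, here is a minimal Spark sketch in Java. It is an illustrative assumption rather than code from the deck: the HDFS paths, the "year" column, and the app name are all hypothetical.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PipelineSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("big-data-pipeline-sketch")
                .getOrCreate();

        // Stage 1: read source data files already loaded into HDFS (path is hypothetical).
        Dataset<Row> source = spark.read().option("header", "true")
                .csv("hdfs:///data/source/batting");

        // Stage 2: perform a Spark operation (standing in for the MapReduce/Spark stage).
        Dataset<Row> yearly = source.groupBy("year").count();

        // Stage 3: write the output back to HDFS, where reporting & analysis query it.
        yearly.write().mode("overwrite").parquet("hdfs:///data/output/yearly_counts");

        spark.stop();
    }
}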
Big Data Testing Aspects
Few things to consider during testing:
1. Validation of data
2. Structured vs. unstructured data considerations
3. An optimal test environment
4. Availability of Hadoop-centric testing tools
5. Performing non-functional testing
6. An efficient test data set
7. Hive internal vs. external tables
In the application discussed here, the data is structured.
Big Data Application Must-Have Tests
● Unit Test
● Hive Query Validator
● Hive Test
● Integration Test
● Oozie Test
● Functional Test
Automation Tools & Frameworks
Unit Testing Frameworks
● JUnit: unit testing framework
● Mockito: Java-based mocking framework
● Worker Bee: framework to perform tasks with Apache Hive
Mockito: Mock Framework
● Mocks external dependencies
● Inserts mocks into the code under test
● Executes the code
● Validates that the code executed as expected
● Uses the when/thenReturn stubbing style (see the sketch below)
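A minimal sketch of the when/thenReturn style with JUnit; HiveRepository and ReportService are hypothetical stand-ins for the code under test, not from the deck.

import static org.junit.Assert.assertEquals;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.verify;
import static org.mockito.Mockito.when;

import org.junit.Test;

public class ReportServiceTest {

    // Hypothetical external dependency to be mocked.
    interface HiveRepository {
        long countRows(String table);
    }

    // Hypothetical code under test.
    static class ReportService {
        private final HiveRepository repo;
        ReportService(HiveRepository repo) { this.repo = repo; }
        long totalRows(String table) { return repo.countRows(table); }
    }

    @Test
    public void returnsRowCountFromMockedRepository() {
        HiveRepository repo = mock(HiveRepository.class);   // mock the external dependency
        when(repo.countRows("batting")).thenReturn(42L);    // when/thenReturn stubbing

        ReportService service = new ReportService(repo);    // insert the mock into the code under test
        assertEquals(42L, service.totalRows("batting"));    // execute the code and check the result

        verify(repo).countRows("batting");                  // validate it executed as expected
    }
}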
Worker Bee: Hive Test Framework
● Define the schema of the database & tables
● A query builder at your disposal
● Go with TDD
● Run migrations against the test table

Workflow:
1. Create the database using the new operator, and tables using havingTable(Class)
2. Generate migration files using the Migration Generator
3. Set up test data
4. Execute queries and verify results: execute the function and assert on the result
Let's understand:

1. Create the database and table:
public static final BaseBall db = new BaseBall();
db.havingTable(Batting.tb);

2. Define columns and types as needed:
public static final Column playerId = HavingColumn(tb, "player_id", Column.Type.STRING);

3. Create rows (the dataset):
private static Row<Batting> lowestRun
    = Batting.tb.getNewRow()
        .set(Batting.playerId, PLAYER_1_ID)
        .set(Batting.year, 1990);

4. Call the query logic using execute:
List<Row<Table>> years = repo.execute(BaseBall.highestScoreForEachYear());

5. Verify the data using an assert:
assertThat(years.size(), is(1));
Functional Test
Framework Information
● Verification of end-to-end workflows, data setup, and reports
● Functional test pipeline: smoke & regression pipelines
● Tools: Selenium, Cucumber, JUnit
● A dedicated cluster for automation
Let's understand how an end-to-end workflow test is run:
1. Data set up: tabular data kept in Excel is entered as text tables
2. Table file conversions: depending on the target table, the text table is converted to a Parquet or Avro table
3. Data verification: using Cucumber's data table (see the sketch below)
4. Front-end validation: using Selenium
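A sketch of the data verification step using Cucumber's data table in Java; the step text, the fetchReportRows helper, and the sample values are illustrative assumptions, not the deck's actual suite.

import static org.junit.Assert.assertEquals;

import java.util.List;
import java.util.Map;

import io.cucumber.datatable.DataTable;
import io.cucumber.java.en.Then;

public class ReportVerificationSteps {

    // Hypothetical helper; a real suite would query the table the workflow produced.
    static List<Map<String, String>> fetchReportRows() {
        return List.of(Map.of("year", "1990", "runs", "1500"));
    }

    @Then("the report contains the following rows")
    public void verifyReportRows(DataTable expected) {
        // Compare the feature file's data table against the rows queried from the report table.
        assertEquals(expected.asMaps(), fetchReportRows());
    }
}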
Oozie Test

Tools and framework:
● Worker Bee
● Oozie Client (see the sketch below)
● JUnit

The Oozie test framework runs on the cluster and does the following:
❏ Works directly on any workflow
❏ Gives a quick feedback cycle
❏ Submits jobs and tracks their completion
❏ Makes test data easier to set up
❏ Helps in debugging production issues
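A minimal sketch of how the Oozie Java client submits a job and tracks its completion, assuming an illustrative server URL, HDFS application path, and cluster properties:

import java.util.Properties;

import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class OozieWorkflowRunner {
    public static void main(String[] args) throws Exception {
        // Connect to the Oozie server on the cluster (URL is hypothetical).
        OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

        // Set the workflow properties, including the workflow app path in HDFS.
        Properties conf = oozie.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs:///apps/workflows/report-wf");
        conf.setProperty("nameNode", "hdfs://namenode:8020");
        conf.setProperty("jobTracker", "resourcemanager:8032");

        // Submit the job and poll until it completes.
        String jobId = oozie.run(conf);
        while (oozie.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
            Thread.sleep(10_000);
        }
        System.out.println("Workflow finished with status: " + oozie.getJobInfo(jobId).getStatus());
    }
}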
Let's understand, at the workflow level (a JUnit-style skeleton follows):
1. Identify the test table: create a table class with a row method
2. Create the schema, columns & partitions: use the create method, create(Table Name)
3. Insert records: set up the test data
4. Run the workflow: call the Oozie workflow & set its properties
5. Verify the result: check the table headers and records
Once the test is completed, drop the table using the Query Generator.
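A hedged JUnit skeleton of these steps; TestTables and Workflows are hypothetical stubs standing in for Worker Bee's table utilities and the Oozie client, not real framework APIs.

import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class ReportWorkflowTest {

    // Hypothetical stub standing in for Worker Bee's create/insert/drop utilities.
    static class TestTables {
        static void create(String table) { /* create schema, columns & partitions */ }
        static void insertRow(String table, Object... values) { /* insert test records */ }
        static int rowCount(String table) { return 1; /* query the output table */ }
        static void drop(String table) { /* drop via the query generator */ }
    }

    // Hypothetical stub standing in for the Oozie client (see the earlier sketch).
    static class Workflows {
        static String runAndWait(String workflow) { return "SUCCEEDED"; }
    }

    @Test
    public void workflowProducesExpectedReportRows() {
        TestTables.create("batting_report");                          // 1 & 2: identify table, create schema
        TestTables.insertRow("batting_report", "player_1", 1990);     // 3: insert records (data set up)
        assertEquals("SUCCEEDED", Workflows.runAndWait("report-wf")); // 4: run the workflow
        assertEquals(1, TestTables.rowCount("report_output"));        // 5: verify the result
        TestTables.drop("batting_report");                            // clean up: drop the table
    }
}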
Testing Challenges
Challenges
• Big Data pipelines can take a huge amount of time
• A cluster environment is important for testing
• Support for different file formats (e.g., Parquet, Avro, text)
• Testing only a few selected workflows
• Maintaining the test data set
• Testing migration scenarios, e.g., when a different file format is preferred
• Cluster performance issues
• Hive/Impala service issues
• Incorporating schema changes into the automation test set and table set
• Managing partition formats
Issues Caught in Testing
● Spark 1.6 issue: Spark has a known issue where, if a partition exists but its directory does not, it throws an exception
● Report columns: sorting issues on columns showed the importance of a secondary sort (see the sketch below)
● Files in HDFS: query performance decreases with many small files; a small number of large files performs better
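A small Spark illustration of the secondary-sort point, with hypothetical column names and path: ordering on the primary column alone leaves tied rows in nondeterministic order, so a secondary sort column stabilizes report output.

import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SecondarySortSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("secondary-sort-sketch")
                .getOrCreate();

        Dataset<Row> report = spark.read().parquet("hdfs:///data/output/report");

        // Sorting by "year" alone leaves players within a year in arbitrary order;
        // adding "player_id" as a secondary sort makes the report ordering stable.
        Dataset<Row> ordered = report.orderBy(col("year"), col("player_id"));

        ordered.show();
        spark.stop();
    }
}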
PRIYANKA RAWAT
QUALITY ANALYST
prawat@thoughtworks.com | thoughtworks.com
THANK YOU
