Insights Into Big Data Testing
VodQA 2019
AGENDA
• Introduction to Big Data Applications
• Testing Aspects
• Various Types of Tests
• Automation Tools/Framework
• Testing Challenges
What is Big Data?
Mammoth of Data
Is any data big?
The V’s of Big Data: Volume, Velocity, Variety & Veracity
Big Data Applications
Big Data Ecosystem
Hadoop is one of the solutions to the Big Data problem.
How you test depends on the kinds of tools used in the Big Data application.
Big Data Application Workflow
A Big Data application typically has the following stages (a minimal Spark sketch follows the list):
1. Load source data files into HDFS
2. Perform MapReduce/Spark operations
3. Extract/query the output from HDFS
4. Reporting & analysis
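To make these stages concrete, here is a minimal Spark sketch in Java. It is an illustrative assumption rather than code from the deck: the HDFS paths, the "year" column, and the app name are all hypothetical.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PipelineSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("big-data-pipeline-sketch")
                .getOrCreate();

        // Stage 1: read source data files already loaded into HDFS (path is hypothetical).
        Dataset<Row> source = spark.read().option("header", "true")
                .csv("hdfs:///data/source/batting");

        // Stage 2: perform a Spark operation (standing in for the MapReduce/Spark stage).
        Dataset<Row> yearly = source.groupBy("year").count();

        // Stage 3: write the output back to HDFS, where reporting & analysis query it.
        yearly.write().mode("overwrite").parquet("hdfs:///data/output/yearly_counts");

        spark.stop();
    }
}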
Big Data Testing Aspects
Few things to consider during testing:
1. Validation of data
2. Structured vs. unstructured data considerations
3. An optimal test environment
4. Availability of Hadoop-centric testing tools
5. Performing non-functional testing
6. An efficient test data set
7. Hive internal vs. external tables
In the application discussed here, the data is structured.
Big Data Application Must-Have Tests
● Unit Test
● Hive Query Validator
● Hive Test
● Integration Test
● Oozie Test
● Functional Test
Automation Tools & Frameworks
Unit Testing Frameworks
● JUnit: unit testing framework
● Mockito: Java-based mocking framework
● Worker Bee: framework to perform tasks with Apache Hive
Mockito: Mock Framework
● Mocks external dependencies
● Inserts mocks into the code under test
● Executes the code
● Validates that the code executed as expected
● Uses the when/thenReturn stubbing style (see the sketch below)
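A minimal sketch of the when/thenReturn style with JUnit; HiveRepository and ReportService are hypothetical stand-ins for the code under test, not from the deck.

import static org.junit.Assert.assertEquals;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.verify;
import static org.mockito.Mockito.when;

import org.junit.Test;

public class ReportServiceTest {

    // Hypothetical external dependency to be mocked.
    interface HiveRepository {
        long countRows(String table);
    }

    // Hypothetical code under test.
    static class ReportService {
        private final HiveRepository repo;
        ReportService(HiveRepository repo) { this.repo = repo; }
        long totalRows(String table) { return repo.countRows(table); }
    }

    @Test
    public void returnsRowCountFromMockedRepository() {
        HiveRepository repo = mock(HiveRepository.class);   // mock the external dependency
        when(repo.countRows("batting")).thenReturn(42L);    // when/thenReturn stubbing

        ReportService service = new ReportService(repo);    // insert the mock into the code under test
        assertEquals(42L, service.totalRows("batting"));    // execute the code and check the result

        verify(repo).countRows("batting");                  // validate it executed as expected
    }
}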
Worker Bee: Hive Test Framework
● Define the schema of the database & tables
● A query builder at your disposal
● Go with TDD
● Run migrations against the test table

Workflow:
1. Create the database using the new operator, and tables using havingTable(Class)
2. Generate migration files using the Migration Generator
3. Set up test data
4. Execute queries and verify results: execute the function and assert on the result
Let's understand:

1. Create the database and table:
public static final BaseBall db = new BaseBall();
db.havingTable(Batting.tb);

2. Define columns and types as needed:
public static final Column playerId = HavingColumn(tb, "player_id", Column.Type.STRING);

3. Create rows (the dataset):
private static Row<Batting> lowestRun
    = Batting.tb.getNewRow()
        .set(Batting.playerId, PLAYER_1_ID)
        .set(Batting.year, 1990);

4. Call the query logic using execute:
List<Row<Table>> years = repo.execute(BaseBall.highestScoreForEachYear());

5. Verify the data using an assert:
assertThat(years.size(), is(1));
Functional Test
Framework Information
● Verification of end-to-end workflows, data setup, and reports
● Functional test pipeline: smoke & regression pipelines
● Tools: Selenium, Cucumber, JUnit
● A dedicated cluster for automation
Let's understand how an end-to-end workflow test is run:
1. Data set up: tabular data kept in Excel is entered as text tables
2. Table file conversions: depending on the target table, the text table is converted to a Parquet or Avro table
3. Data verification: using Cucumber's data table (see the sketch below)
4. Front-end validation: using Selenium
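A sketch of the data verification step using Cucumber's data table in Java; the step text, the fetchReportRows helper, and the sample values are illustrative assumptions, not the deck's actual suite.

import static org.junit.Assert.assertEquals;

import java.util.List;
import java.util.Map;

import io.cucumber.datatable.DataTable;
import io.cucumber.java.en.Then;

public class ReportVerificationSteps {

    // Hypothetical helper; a real suite would query the table the workflow produced.
    static List<Map<String, String>> fetchReportRows() {
        return List.of(Map.of("year", "1990", "runs", "1500"));
    }

    @Then("the report contains the following rows")
    public void verifyReportRows(DataTable expected) {
        // Compare the feature file's data table against the rows queried from the report table.
        assertEquals(expected.asMaps(), fetchReportRows());
    }
}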
Oozie Test

Tools and framework:
● Worker Bee
● Oozie Client (see the sketch below)
● JUnit

The Oozie test framework runs on the cluster and does the following:
❏ Works directly on any workflow
❏ Gives a quick feedback cycle
❏ Submits jobs and tracks their completion
❏ Makes test data easier to set up
❏ Helps in debugging production issues
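A minimal sketch of how the Oozie Java client submits a job and tracks its completion, assuming an illustrative server URL, HDFS application path, and cluster properties:

import java.util.Properties;

import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class OozieWorkflowRunner {
    public static void main(String[] args) throws Exception {
        // Connect to the Oozie server on the cluster (URL is hypothetical).
        OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

        // Set the workflow properties, including the workflow app path in HDFS.
        Properties conf = oozie.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs:///apps/workflows/report-wf");
        conf.setProperty("nameNode", "hdfs://namenode:8020");
        conf.setProperty("jobTracker", "resourcemanager:8032");

        // Submit the job and poll until it completes.
        String jobId = oozie.run(conf);
        while (oozie.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
            Thread.sleep(10_000);
        }
        System.out.println("Workflow finished with status: " + oozie.getJobInfo(jobId).getStatus());
    }
}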
Let's understand, at the workflow level (a JUnit-style skeleton follows):
1. Identify the test table: create a table class with a row method
2. Create the schema, columns & partitions: use the create method, create(Table Name)
3. Insert records: set up the test data
4. Run the workflow: call the Oozie workflow & set its properties
5. Verify the result: check the table headers and records
Once the test is completed, drop the table using the Query Generator.
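A hedged JUnit skeleton of these steps; TestTables and Workflows are hypothetical stubs standing in for Worker Bee's table utilities and the Oozie client, not real framework APIs.

import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class ReportWorkflowTest {

    // Hypothetical stub standing in for Worker Bee's create/insert/drop utilities.
    static class TestTables {
        static void create(String table) { /* create schema, columns & partitions */ }
        static void insertRow(String table, Object... values) { /* insert test records */ }
        static int rowCount(String table) { return 1; /* query the output table */ }
        static void drop(String table) { /* drop via the query generator */ }
    }

    // Hypothetical stub standing in for the Oozie client (see the earlier sketch).
    static class Workflows {
        static String runAndWait(String workflow) { return "SUCCEEDED"; }
    }

    @Test
    public void workflowProducesExpectedReportRows() {
        TestTables.create("batting_report");                          // 1 & 2: identify table, create schema
        TestTables.insertRow("batting_report", "player_1", 1990);     // 3: insert records (data set up)
        assertEquals("SUCCEEDED", Workflows.runAndWait("report-wf")); // 4: run the workflow
        assertEquals(1, TestTables.rowCount("report_output"));        // 5: verify the result
        TestTables.drop("batting_report");                            // clean up: drop the table
    }
}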
Testing Challenges
Challenges
• Big Data pipelines can take a huge amount of time
• A cluster environment is important for testing
• Support for different file formats (e.g., Parquet, Avro, text)
• Testing only a few selected workflows
• Maintaining the test data set
• Testing migration scenarios, e.g., when a different file format is preferred
• Cluster performance issues
• Hive/Impala service issues
• Incorporating schema changes into the automation test set and table set
• Managing partition formats
Issues Caught in Testing
● Spark 1.6 issue: Spark has a known issue where, if a partition exists but its directory does not, it throws an exception
● Report columns: sorting issues on columns showed the importance of a secondary sort (see the sketch below)
● Files in HDFS: query performance decreases with many small files; a small number of large files performs better
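A small Spark illustration of the secondary-sort point, with hypothetical column names and path: ordering on the primary column alone leaves tied rows in nondeterministic order, so a secondary sort column stabilizes report output.

import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SecondarySortSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("secondary-sort-sketch")
                .getOrCreate();

        Dataset<Row> report = spark.read().parquet("hdfs:///data/output/report");

        // Sorting by "year" alone leaves players within a year in arbitrary order;
        // adding "player_id" as a secondary sort makes the report ordering stable.
        Dataset<Row> ordered = report.orderBy(col("year"), col("player_id"));

        ordered.show();
        spark.stop();
    }
}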
PRIYANKA RAWAT
QUALITY ANALYST
prawat@thoughtworks.com | thoughtworks.com
THANK YOU
