A glimpse of test automation in hadoop ecosystem by Deepika Achary

A Glimpse of Test Automation in Hadoop Ecosystem
Deepika Achary
Test Engineer, OCLC

Who Am I?
def self.info
  name = 'Deepika Achary'
  job = 'Test Automation Engineer'
  company = 'OCLC'
  email = 'acharyd@oclc.org'
  hobbies = ['Baking', 'Watching movies', 'Painting']
end

What are we going to talk about?

What is BigData?
• "BigData" refers to data sets that are so large, and so often unstructured or semi-structured, that they are difficult to process using traditional methods
• Volumes too great for a typical DBMS (terabytes, petabytes, exabytes of data)
• Sources of BigData: social media, IoT appliances (smart devices), e-commerce transactions, GPS location data, etc.

Let's look at how much data is generated per minute on the internet
[Infographic: per-minute figures of 1+ million, 2+ million, 2+ million and 4.5 million; the service labels were not preserved]

Initial Thought Process and Approach
We decided to move to Hadoop to handle the huge volume of data
Some of the applications are being developed in the Hadoop ecosystem
That's when the Hadoop ecosystem components came into the picture, and we started thinking about test automation for these applications
Why JRuby?

What is JRuby and Why use it?
• JRuby is a 100% pure-Java implementation of the Ruby programming language
• JRuby allows Ruby programs to use Java classes. This is a powerful concept that JRuby brings to Ruby users.
• JRuby can integrate with Java code. If you have Java class libraries (.jar files), you can reference and use them from within Ruby code with JRuby
• By combining the Java platform with the power of the Ruby programming language, programmers get the best of both worlds
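
To make the Java interop point concrete, here is a minimal sketch. It relies only on the JDK classes java.util.ArrayList and java.util.Collections; the helper names jruby? and sorted_names are illustrative and not from the talk. The early return for non-JRuby interpreters is there only so the snippet also runs on plain Ruby.

```ruby
# True when the interpreter is JRuby (RUBY_PLATFORM is "java" there).
def jruby?
  RUBY_PLATFORM == 'java'
end

# Sorts names with java.util.Collections on JRuby, or Array#sort elsewhere.
def sorted_names(names)
  return names.sort unless jruby?

  require 'java'
  list = java.util.ArrayList.new        # a plain Java class, created from Ruby
  names.each { |name| list.add(name) }  # Java method called with Ruby strings
  java.util.Collections.sort(list)      # static Java method
  list.to_a                             # back to a Ruby Array
end

puts sorted_names(%w[Solr HBase Kafka HDFS]).inspect
```

The same pattern is what makes the Hadoop, Kafka and Solr client jars usable from Ruby test code.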

Ruby vs JRuby
Ruby | JRuby
A dynamic, interpreted, open source programming language with a focus on simplicity and productivity | A high performance, stable, fully threaded Java implementation of the Ruby programming language
High performance | Comparatively higher performance than Ruby
Uses Ruby gems | Can use Ruby gems along with Java libraries

Let's explore individual components

HDFS
Hadoop Distributed File System
Storage layer of Hadoop
Data is stored in a distributed manner in HDFS
Files are broken down into smaller chunks (blocks) and stored on various machines
Copies (replicas) of each block are created and stored on different nodes
If one machine fails, the data can still be retrieved from other machines
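
The splitting and replication idea can be sketched in a few lines of Ruby. This is a toy model, not how HDFS is implemented: the 100 MB block size and two replicas are example values, the node names are made up, and the round-robin placement is a simplification (real HDFS defaults to 128 MB blocks and three replicas, with rack-aware placement).

```ruby
# Splits a file of size_mb into blocks of at most block_mb each.
def split_into_blocks(size_mb, block_mb)
  full, rest = size_mb.divmod(block_mb)
  sizes = Array.new(full, block_mb)
  sizes << rest if rest > 0
  sizes
end

# Assigns each block to `replicas` distinct nodes, round-robin.
def place_replicas(block_count, nodes, replicas)
  raise ArgumentError, 'not enough nodes' if replicas > nodes.size

  (0...block_count).map do |i|
    (0...replicas).map { |r| nodes[(i + r) % nodes.size] }
  end
end

# True if every block still has a replica once failed_node is gone.
def readable_after_failure?(placement, failed_node)
  placement.all? { |holders| (holders - [failed_node]).any? }
end

nodes = %w[node1 node2 node3]                       # hypothetical node names
blocks = split_into_blocks(300, 100)                # three 100 MB blocks
placement = place_replicas(blocks.size, nodes, 2)   # two copies of each block
puts readable_after_failure?(placement, 'node2')
```

Because every block lives on two different nodes, losing any single node still leaves one readable copy of each block.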

[Diagram: a 300 MB BigData file enters the Hadoop framework; HDFS splits it into three 100 MB blocks (A, B, C), and each block appears on two nodes]

[Diagram: Data, HDFS, Kafka, Solr, HBase, Automation]

Automation Setup for HDFS
• Jar used: org.apache.hadoop:hadoop-common
• Some classes used:
  ◦ org.apache.hadoop.conf.Configuration
    Provides access to configuration parameters
  ◦ org.apache.hadoop.fs.Path
    Used to construct a file path from a string
  ◦ org.apache.hadoop.fs.FileStatus
    Represents the client-side information for a file
  ◦ org.apache.hadoop.fs.FileSystem
    Class that provides an object to interact with the Hadoop file system
• Some HDFS operations:
  ◦ copyFromLocal: copies files from the local file system to HDFS for storage or processing
  ◦ copyToLocal: copies stored files from HDFS to the local file system
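
One common check built on these two operations is a round trip: upload a file, download it again, and compare checksums. The sketch below shows only the shape of that check. A local temporary directory stands in for HDFS so the snippet runs anywhere; in a JRuby test against a real cluster, the two FileUtils.cp calls would presumably be replaced by FileSystem#copyFromLocalFile and FileSystem#copyToLocalFile. The method name round_trip_intact? and the file names are hypothetical.

```ruby
require 'digest'
require 'fileutils'
require 'tmpdir'

# Copies a file into the store, copies it back, and compares SHA-256 checksums.
# store_dir is a local directory standing in for an HDFS path.
def round_trip_intact?(local_file, store_dir, download_dir)
  stored  = File.join(store_dir, File.basename(local_file))
  fetched = File.join(download_dir, File.basename(local_file))
  FileUtils.cp(local_file, stored)   # stand-in for copyFromLocal
  FileUtils.cp(stored, fetched)      # stand-in for copyToLocal
  Digest::SHA256.file(local_file).hexdigest == Digest::SHA256.file(fetched).hexdigest
end

result = Dir.mktmpdir do |dir|
  store = File.join(dir, 'store')
  download = File.join(dir, 'download')
  FileUtils.mkdir_p([store, download])
  source = File.join(dir, 'records.txt')
  File.write(source, "row1\nrow2\n")
  round_trip_intact?(source, store, download)
end
puts result
```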

HBase
• NoSQL database
• Built as a layer on top of HDFS
• HBase is a column-oriented, distributed database designed to run on top of the Hadoop Distributed File System (HDFS)
• It is a part of the Hadoop ecosystem that provides random, real-time read/write access to data in the Hadoop file system
• One can store data in HDFS either directly or through HBase. Data consumers read/access the data in HDFS randomly using HBase

HBase Table
Row Key | Personal: Name | Personal: City | Professional: Designation | Professional: Salary
1234 | Josh | Columbus | TAD | 80,000
5678 | Alex | Atlanta | Sr. Developer | 90,000

Automation Setup for HBASE
• Gem used: hbase-jruby
• hbase-jruby is a simple JRuby binding for HBase
• hbase-jruby provides:
  ◦ An easy, Ruby-style interface for the fundamental HBase operations
• Operations done using hbase-jruby:
  ◦ PUT: puts data into the table
  ◦ GET: retrieves data from the table using one or more row keys
  ◦ SCAN: scans the table for a given range of row keys
  ◦ DELETE: deletes data from the table
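
The four operations are easier to picture with a small in-memory stand-in. The class below is a hypothetical FakeTable written for illustration; it does not use or reproduce the hbase-jruby API. It keeps each row as a map of "family:qualifier" to value, and its scan treats the start key as inclusive and the stop key as exclusive, which mirrors HBase's default scan behavior.

```ruby
# In-memory stand-in for an HBase table: row key => { "family:qualifier" => value }.
class FakeTable
  def initialize
    @rows = {}
  end

  # PUT: adds or overwrites cells in a row.
  def put(row_key, cells)
    (@rows[row_key] ||= {}).merge!(cells)
  end

  # GET: returns the rows for one or more row keys (nil for a missing row).
  def get(*row_keys)
    row_keys.map { |key| @rows[key] }
  end

  # SCAN: rows with start_key <= key < stop_key, in key order.
  def scan(start_key, stop_key)
    keys = @rows.keys.sort.select { |k| k >= start_key && k < stop_key }
    keys.map { |k| [k, @rows[k]] }
  end

  # DELETE: removes a whole row.
  def delete(row_key)
    @rows.delete(row_key)
  end
end

table = FakeTable.new
table.put('1234', { 'personal:name' => 'Josh', 'personal:city' => 'Columbus' })
table.put('1234', { 'professional:designation' => 'TAD', 'professional:salary' => '80,000' })
table.put('5678', { 'personal:name' => 'Alex', 'personal:city' => 'Atlanta' })
puts table.scan('1000', '5000').map(&:first).inspect
```

A test written against the real table would follow the same rhythm: put known rows, read them back with get or scan, assert on the cells, then delete the test data.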

Kafka
• Kafka is a publish-subscribe (pub-sub) messaging system
• Publish-subscribe is a messaging pattern in which senders of messages are called publishers and receivers are called subscribers. Publishers send messages to a Kafka topic, and subscribers consume the messages from that topic
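
The pattern described above can be sketched without a broker. The Topic class below is a toy written for illustration, not the Kafka client: it keeps an append-only log and remembers, per subscriber, how far that subscriber has read. The subscriber names and messages are made up.

```ruby
# A tiny in-memory topic: an append-only log plus a read offset per subscriber.
class Topic
  def initialize
    @log = []
    @offsets = Hash.new(0)
  end

  # Appends a message and returns its offset in the log.
  def publish(message)
    @log << message
    @log.size - 1
  end

  # Returns the messages this subscriber has not seen yet and advances its offset.
  def poll(subscriber)
    unseen = @log.drop(@offsets[subscriber])
    @offsets[subscriber] = @log.size
    unseen
  end
end

topic = Topic.new
topic.publish('order-1')
topic.publish('order-2')
first_poll  = topic.poll('billing')   # both messages
second_poll = topic.poll('billing')   # nothing new
audit_poll  = topic.poll('audit')     # an independent subscriber still sees both
puts first_poll.inspect, second_poll.inspect, audit_poll.inspect
```

Publishers never address a subscriber directly; they only write to the topic, which is what lets a test act as either side independently.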

Automation Setup for KAFKA
• Jar used: org.apache.kafka:kafka-clients
• Classes used:
  ◦ org.apache.kafka.clients.consumer.ConsumerConfig
    Configuration for the Kafka consumer
  ◦ org.apache.kafka.clients.producer.ProducerConfig
    Configuration for the Kafka producer
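
As a rough illustration of what these two classes configure, the sketch below builds the settings as plain Ruby hashes. The keys are standard Kafka client property names (ProducerConfig and ConsumerConfig expose constants for them, such as BOOTSTRAP_SERVERS_CONFIG); the broker address, group id and the choice of string serializers are assumptions for the example, not values from the talk. In JRuby these pairs would typically be copied into a java.util.Properties object before creating the client.

```ruby
STRING_SERIALIZER   = 'org.apache.kafka.common.serialization.StringSerializer'
STRING_DESERIALIZER = 'org.apache.kafka.common.serialization.StringDeserializer'

# Settings a test producer might use; brokers is a "host:port" list.
def producer_settings(brokers)
  {
    'bootstrap.servers' => brokers,
    'key.serializer'    => STRING_SERIALIZER,
    'value.serializer'  => STRING_SERIALIZER,
    'acks'              => 'all'        # wait for full acknowledgement
  }
end

# Settings a test consumer might use; group_id identifies the consumer group.
def consumer_settings(brokers, group_id)
  {
    'bootstrap.servers'  => brokers,
    'group.id'           => group_id,
    'key.deserializer'   => STRING_DESERIALIZER,
    'value.deserializer' => STRING_DESERIALIZER,
    'auto.offset.reset'  => 'earliest'  # start from the oldest message if no offset is stored
  }
end

# Placeholder broker address and group id.
producer = producer_settings('localhost:9092')
consumer = consumer_settings('localhost:9092', 'qa-automation')
puts producer.keys.inspect, consumer.keys.inspect
```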

Solr
• Solr is a search/storage engine in which you index a set of documents and then run queries that return the documents matching the user's query
• Solr can be used along with Hadoop. As Hadoop handles a large amount of data, Solr helps us find the required information in such a large source
[Diagram: a REST service sends a request to Solr and receives a response]

Automation Setup for SOLR
• Jar used: org.apache.solr:solr-solrj
• Classes used:
  ◦ org.apache.solr.common.SolrDocument
    A concrete representation of a document within a Solr index
  ◦ org.apache.solr.client.solrj.SolrQuery
    An augmented SolrParams with get/set/add fields for common fields used in the Standard and Dismax request handlers
  ◦ org.apache.solr.client.solrj.impl.CloudSolrClient
    Instances of this class communicate with ZooKeeper to discover Solr endpoints for Solr collections
  ◦ org.apache.solr.client.solrj.impl.HttpSolrClient
    A SolrClient implementation that talks directly to a Solr server via HTTP
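
Whichever client is used, a query ends up as a set of request parameters sent to Solr over HTTP. The sketch below builds such a request URL by hand for Solr's /select handler, using the standard parameters q, fq, rows and wt (the same q, fq and rows that SolrQuery sets through setQuery, addFilterQuery and setRows). The base URL, the collection name "books" and the field names are hypothetical.

```ruby
require 'uri'

# Builds a URL for Solr's /select handler from a query and optional filters.
def solr_select_url(base_url, collection, query, filters: [], rows: 10)
  params = [['q', query]]
  filters.each { |fq| params << ['fq', fq] }   # one fq parameter per filter
  params << ['rows', rows.to_s] << ['wt', 'json']
  "#{base_url}/#{collection}/select?#{URI.encode_www_form(params)}"
end

# Hypothetical server, collection and fields.
url = solr_select_url('http://localhost:8983/solr', 'books', 'title:hadoop',
                      filters: ['year:2019'], rows: 5)
puts url
```

A test can issue the request, parse the response, and assert that the expected documents (and only those) come back.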

Advantages of Automation
• Concrete framework: validates individual components and pinpoints where things went wrong
• Gray box for QA: ability to provide additional information to developers, which helps them debug the issue and do root cause analysis
• Ease of use: execute automation scripts in near real time or in batch mode
• Data flexibility: create our own test data based on use cases and have it automated, which makes life simple
• Bugs: identifying an incorrect entry among a million records
• Quick turnaround time and faster feedback
• Always up and running: health check of all systems
• Reusability: created a gem, used across OCLC

Key Takeaways
• Overview of BigData
• Hadoop ecosystem and its components
• How to automate?
