Anjuman
How this translated into their business objective would be something like this. They have a vision...
Pranesh
Thanks, Anjuman.
So this particular problem had two perspectives. One is Data Science: coming up with an algorithm that can address Sally’s problem statement and predict product prices using historical data and analysis. So I can say data science is something that provides meaningful information based on large amounts of complex data.
Pranesh
And the other perspective to solve this problem is Data Engineering, which will collect the product-specific data from different sources and transform it as per requirements for further processing, or store the transformed data in some storage for data science to make use of.
Anjuman
That sounds like the way to go. Visually, if I were to break Sally’s ask down, this is what it may look like.
Essentially what she is asking for are prices and reports on a periodic basis.
Since we want prices that make profits, we want a data-intelligent mechanism to get prices at optimised profitability.
Which needs some valid sales data for products, like aggregate prices across stores, and maybe aggregate sales on a weekly basis.
And this data needs to be ingested from source systems: Point of Sale transactions, historical product prices from production pricing databases, and competitor data from external agencies.
Pranesh
So what you are saying is, we have translated the problem statement into a data architecture that solves Sally’s problem. Now we will try to map those stages to data terminologies.
We first identified all the data sources as input, following which we ingest them into the system with quality checks applied. Then we moved on to applying data transformation rules on the ingested data so that it is available for the algorithm to consume. Finally we got the expected results, in our case price predictions, and we published them and had them stored somewhere so that we can export the outputs to the pricing system.
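To make that flow a little more concrete, here is a minimal sketch of the stages as plain Python function stubs; the function and column names (for example `selling_price`) are illustrative assumptions, not the project's actual code.

```python
# Minimal sketch of the pipeline stages above (illustrative names only).
import pandas as pd

def ingest(source_paths):
    """Collect raw product/price data from the identified sources."""
    return pd.concat([pd.read_csv(p) for p in source_paths], ignore_index=True)

def quality_check(df):
    """Quality gates applied as part of ingestion."""
    assert df["selling_price"].notna().all(), "null prices found"
    return df

def transform(df):
    """Shape the data the way the pricing algorithm expects it."""
    return df.groupby("product_id", as_index=False)["selling_price"].mean()

def publish(df, out_path):
    """Store the outputs so they can be exported to the pricing system."""
    df.to_csv(out_path, index=False)

# The flow, end to end, would chain these stages:
# predictions = algorithm(transform(quality_check(ingest([...]))))
```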
Anjuman
Pranesh
With that said, the way a story moves through usual non-data agile projects differs fairly from data projects such as this one. Here is an example of a usual non-data story life cycle.
We start with the Iteration Planning Meeting, what we call the IPM, where BAs, developers and testers sit together to discuss the scoping and analysis of the stories to be covered in a particular sprint.
Once the IPM is done, the team continues with story kick-offs, where developers and testers discuss the functional and technical aspects of the story.
Once this is done, the developer builds the actual logic, followed by a desk check where the developer showcases the developed functionality to BAs and testers. QAs then test the functionality with all the checks and showcase it to stakeholders to provide sign-off, so that the functionality can be promoted to the next stage.
Now let’s see how this fits data projects with a problem statement like ours.
Pranesh
So let’s now consider a specific story from our problem statement, where we need to consume historical product prices to predict the optimum price.
With this said, here come the pain points.
Since the scope of this story is so vast, its analysis and scoping become tricky, since its coverage could spill over into the next cycles.
Since we are only consuming the historical data, its desk check would only consist of the developer showing that we have consumed specific data. But what we do in a regular desk check is validate more checks on the logic, so that we can find issues early in the life cycle. In a desk check for a data project, it is humanly impossible to check the different data variations and their outputs.
The next pain point is providing sign-off from the perspective of the Data Scientists, and since we are only consuming the historical data, deploying this functionality on its own may not add much value to go into the next phase.
Pranesh
So with data projects, we see certain practice changes in the day-to-day product life cycle. If you remember the previous pain points that we talked about, we will now try to map Data Engineering analogies onto them instead.
As mentioned here, for Data Engineering we first do Data Mapping, where we identify the data sources and the required data to be collected from those sources. In our case it will be historical pricing data.
Once we identify all the data sources and required data, the next stage is Data Modeling, which focuses on how we structure and store this data for our use case.
Then comes acquiring that particular data and validating its quality.
Afterwards we transform the acquired data to extract the required attributes, say product information and its prices, and validate the transformed data for the next phase.
Anjuman
With that said, even the way a story moves in non-data agile projects differs fairly from data projects. Such as this one here…
DS
Pranesh/Anjuman
Whereas the Data Science journey usually begins with a literature review to get more insights into the problem statement and the forces at work in the given domain. With this basic understanding we analyse the data to find patterns. An algorithm is developed based on the understanding from the previous two stages. The results of this algorithm are analysed to see if they reach the desired accuracy. Once we are satisfied with the algorithm or model, it can be deployed.
Anjuman
Pranesh
Going back to the data pipeline, this is what a data architecture for a data pipeline commonly looks like. We can now look into specific QA activities around each stage in it.
Pranesh
As the first part of our data pipeline is consuming the data from different sources, let’s now discuss the QA activities around this stage.
We need a continuous stream of events like sales transactions from Point of Sale terminals of different stores, historical product prices from production pricing databases, and competitor data from external agencies.
We ensured that all of the inputs are as current as they are on production, because this helped the algorithm analyse valid data.
Since we must test using different sets of data, as the algorithm is going to need different types of pricing as inputs, we ensured that all of these prices are available from their respective sources, such as files.
We also ensured that all mandatory attributes are present in the source data; in our case these are the selling price of the products, discount prices, etc.
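A minimal sketch of that "mandatory attributes present" check, assuming the source extracts land as CSV files; the column names here are illustrative assumptions only.

```python
# Check that a source extract carries the mandatory attributes.
import pandas as pd

MANDATORY_COLUMNS = ["product_id", "selling_price", "discount_price"]

def check_mandatory_attributes(path: str) -> None:
    df = pd.read_csv(path)
    # all mandatory columns must exist in the extract
    missing = [col for col in MANDATORY_COLUMNS if col not in df.columns]
    assert not missing, f"missing mandatory attributes: {missing}"
    # every row must carry a selling price
    assert df["selling_price"].notna().all(), "rows with empty selling_price"
```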
Pranesh
If we recall, the next phase in the data pipeline is ingesting the source data into the system.
With that said, we validated that the products and their respective prices which we consumed are stored in the correct underlying storage locations. Storage systems here can be HDFS (Hadoop Distributed File System) or even storage buckets like S3.
If we recall, we only needed historical data, let’s say in the range of the past 5 years. Hence, from the input sources, we also ensured that we are ingesting data within this specific time period only, rather than considering all historical data.
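A hedged sketch of that time-windowed ingestion, assuming the raw history lands as CSV on S3 with an ISO-format `price_date` column and is curated into Parquet; paths and column names are assumptions for illustration, not the project's actual layout.

```python
# Ingest only the last five years of pricing history before writing to storage.
from datetime import datetime, timedelta

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("price-history-ingest").getOrCreate()

cutoff = datetime.now() - timedelta(days=5 * 365)

history = (
    spark.read.csv("s3a://pricing-raw/historical_prices/", header=True, inferSchema=True)
    # keep only records inside the agreed time window
    .where(F.col("price_date") >= F.lit(cutoff.strftime("%Y-%m-%d")))
)

history.write.mode("overwrite").parquet("s3a://pricing-curated/historical_prices/")
```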
Pranesh
The next stage in the data pipeline is to check the data quality of the data ingested in the previous step.
After ingestion, we ensured data integrity by comparing products and their prices with the source data.
We also validated that all products have valid prices and that the data is pushed to the correct storage location; for example, no negative selling prices and no null or blank values.
We also checked for duplicate or missing product information, as this could impact the outcome of the algorithm’s intelligence.
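A minimal sketch of those data-quality checks using pandas; the column names and the row-count comparison are illustrative assumptions about how integrity was measured, not the project's exact rules.

```python
# Data-quality gates run against the ingested data.
import pandas as pd

def quality_checks(source: pd.DataFrame, ingested: pd.DataFrame) -> None:
    # integrity: nothing lost or invented between source and ingested data
    assert len(ingested) == len(source), "row count changed during ingestion"

    # validity: no null/blank or negative selling prices
    assert ingested["selling_price"].notna().all(), "null selling prices found"
    assert (ingested["selling_price"] >= 0).all(), "negative selling prices found"

    # uniqueness: no duplicated product/price records
    duplicated = ingested.duplicated(subset=["product_id", "store_id", "price_date"])
    assert not duplicated.any(), "duplicate product/price records found"
```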
Pranesh
The most crucial step in our data pipeline is transforming the ingested data. Up to the last stage we only validated the quality of the data we ingested from the source. Data transformation, at a high level, means extracting the necessary attributes from the data set, since we may not be interested in all of the ingested data. With that said,
We ensured the transformed data met the algorithm’s requirements. For example, aggregate prices of a product across different stores, or aggregate sales for a product on a weekly basis. Once that is done,
We validated there is no corrupt data in the system.
After transformation, we also verified data integrity to check that values are intact, i.e. the aggregation logic does not corrupt them.
I can give one example of how we ensured that: let’s say we ingested data for X products and we applied some transformation rules on product prices. If we applied the transformation to X product prices, at the end we still have X transformed product prices (a small check like the sketch below can automate this).
We also validated that the data is ready to be consumed by the algorithm in the next stage.
Data transformation is defined by what input the algorithm needs or what needs to be generated in the reports. For example, if there was a report that needed to be generated out of the system, required by C-level executives every week, then the transformation logic would only need to run weekly.
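A hedged sketch of the weekly aggregation and the "X products in, X products out" integrity check; column names such as `price_date` and `units_sold` are assumptions for illustration.

```python
# Weekly aggregates per product, plus a product-count integrity check.
import pandas as pd

def weekly_aggregates(prices: pd.DataFrame) -> pd.DataFrame:
    prices = prices.assign(week=pd.to_datetime(prices["price_date"]).dt.to_period("W"))
    return prices.groupby(["product_id", "week"], as_index=False).agg(
        avg_price=("selling_price", "mean"),
        total_sales=("units_sold", "sum"),
    )

def check_product_counts(raw: pd.DataFrame, transformed: pd.DataFrame) -> None:
    # every product that went into the transformation should still come out
    assert set(raw["product_id"]) == set(transformed["product_id"]), \
        "products lost or invented during transformation"
```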
Anjuman
All data to be consumed by the algorithm is in the expected format, for example converting into CSV at run time.
Data modeling parameters are available to be used by the algorithm, for example hyperparameters.
Ensuring that the ingested data has all outliers removed. An example would be a product whose selling price is way off its average maintained price over the years. This could be due to good relations with the sales manager, who gave heavy discounts to her favourite customer.
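One common way to do that outlier removal is an interquartile-range filter; this is a minimal sketch, and the 1.5×IQR threshold and column name are assumptions rather than the project's actual rule.

```python
# Drop price outliers using the interquartile range (IQR).
import pandas as pd

def remove_price_outliers(df: pd.DataFrame, column: str = "selling_price") -> pd.DataFrame:
    q1, q3 = df[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # keep only prices inside the accepted band
    return df[(df[column] >= lower) & (df[column] <= upper)]
```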
Anjuman
(Example price points, etc.)
Algorithm Errors
To append to the historical data and to also evaluate if the algorithm is improving or not
Anjuman
Mostly with these kinds of storage systems, you would also take care that the data is stored in the correct partitions and correct locations.
The right level of metadata is generated with every write to storage.
Anjuman
Pranesh
Apart from the stages mentioned in the data pipeline earlier, we feel that there could be more validation checks we can apply to further test our pipeline.
One of them was validating how the pipeline behaves when it encounters huge volumes of data.
We validated whether the system consumes such huge data efficiently or not.
Message queues are getting cleared on time so that there is no overlap between two subsequent jobs.
This huge data is getting stored in the underlying storage systems with ease.
And the same is getting transformed while applying the data processing rules.
We also ensured memory and resource utilization of our jobs stays within thresholds while handling huge data.
And there is no pain in visualizing it.
Pranesh
Another form of testing we thought of was recovery tests for jobs.
Like how the system recovers itself from any failures.
And even after recovery, how data is getting consumed.
We also ensured that if any node failed, the other nodes shared the load, so that our data analysis process stays intact.
The next important aspect that we covered is the logging mechanism, as it helped debug failures.
Pranesh
The health of the environment need not be validated in each iteration, so these checks can be done at some time interval, like checking whether the environment has enough storage capacity and clusters with distributed nodes.
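As a small, hedged example of such a periodic health check, the free-space check could be scripted along these lines; the path and threshold here are assumptions, not the project's actual configuration.

```python
# Periodic environment health check: enough free disk space on a data volume.
import shutil

def check_storage_capacity(path: str = "/data", min_free_gb: float = 100.0) -> bool:
    usage = shutil.disk_usage(path)
    free_gb = usage.free / (1024 ** 3)
    return free_gb >= min_free_gb
```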
So just to summarize the QA activities that we talked about: they are not the same as the traditional activities that we follow in normal projects, but the approach to testing remains the same, that is, "challenge the business logic to make it more robust and the one that gives us confidence". These QA activities might still differ from one data project to another.
Anjuman
Automation tool selection examples
Anjuman
Precise data - We need only selling prices, discount prices and promotion prices; there is no need for buying prices or refunds.
Production data - We are dealing with modeling an algorithm that takes historical data from production into account and produces near-accurate price recommendations.
Logging and Monitoring - Helped in debugging job failures.
Close collaboration
Setting up expectations
Change management - Developing a framework to adapt to changes at any data processing stage
Iterative approach - So that we can promote the algorithm or business logic as an MVP to the production phase
Important 3 V's - To have variations in data so that price predictions will be close to accurate
Good friendship - Data scientists and QAs should work hand in hand, collaborating with data scientists
Since data is most valuable to organisations, its precision and integrity are the most important attributes
Important 3 V’s while testing huge data systems - Volume, Variety and Velocity
Validating output with SMEs
QA need not wait for the end result on Data projects
Always use production-like data to test
Change Management - Scope Management?
Good friendships with data scientists, bribe them!
As a QA, if I were to QA a Data Science project, do I need to know about Data Science?