SlideShare a Scribd company logo
1 of 44
Lessons learned from
designing a QA automation
for analytics databases
(Big Data)
Omid Vahdaty, Big Data ninja.
Agenda
● Introduction to Big Data Testing
● Methodological approach for QA Automation road map
● True Story
● The Solution?
● Challenges of Big Data Testing.
● Performance & Maintenance.
Introduction to Big Data Testing
● 100M ?
● 1B ?
● 10B?
● 100B?
● 1Tr?
How Big is Big Data?
Scale OUT VS. Scale UP systems
Big Data ?
● Structured (SQL, tabular, strings, numbers)
● Unstructured (logs, pictures, binary, json,blob, video etc)
OLAP or OLTP?
● Simply put:
○ One query running on a lot of data for reporting/analytics
○ Many queries running quickly in parralel applications (e.g user login to a website, credentials
are stored on a DB)
Type of Big Data products
● Databases
○ OLAP, OLTP
○ Sql, NO SQL
○ In Memory, Disk based.
○ Hadoop Ecosystem, Spark
● Ecosystem tools
○ ETL tools
○ Visualizations tools
● General Application
● Analytics Based products
○ Froud
○ Cyber
○ Finances and etc.
Expectation Matching for today’s lecture...
● OLAP Database
● Structured data only, Synthetic data was used in testing.
● (not about QA of analytics based products/solutions).
● (not about hadoop , event streaming cluster ).
● (not try to sell anything)
● In a nutshell:
○ How to test an SQL based database?
○ What sort of challenges were met?
How Hard Can it be (aggregation)?
● Select Count(*) from t;
○ Assume
■ 1 Trillion records
■ ad-hoc query (no indexes)
■ Full Scan (no cache)
○ Challenges
■ Time it takes to compute?
■ IO bottleneck? What is IO pattern?
■ CPU bottleneck?
■ Scale up limitations?
How Hard Can it be (join)?
● Select * from t1 join t2 where t1.id =t2.id
○ Assume
■ 1 Trillion records both tables.
■ ad-hoc query (no indexes)
■ Full Scan (no cache)
○ Challenges
■ Time it takes to compute?
■ IO bottleneck? What is IO pattern?
■ CPU bottleneck?
■ Scale out limitations?
3 Pain Points in Big Data DBMS
Big Data System
Ingress data rate Egress data rate
Maintenance rate
Big Data ingress challenges
● Parsing of Data type
○ Strings
○ Dates
○ Floats
● Rate of data coming into the system
● ACID: Atomicity, Consistency, Isolation, Durability
● Compression on the fly (non unique values)
● amount of data
● On the fly analytics
● Time constraints (target: x rows per sec/hour)
Big Data egress challenges
● Sort
● Group by
● Reduce
● Join (sortMergeJoin)
● Data distribution
● Compression
● Theoretical Bottlenecks of hardware.
Methodological approach
for QA Automation road map
Business constraints?
● Budget?
● Time to market?
● Resources available?
● Skill set available ?
Product requirements?
● What are the product’s supported use cases?
● Supported Scale?
● Supported rate of ingress?
● Supported rate on egress?
● Complexity of Cluster?
● High availability?
The Automation: Must have requirements?
● Scale out/up?
● Fault tolerance!
● Reproducibility from scratch!
● Reporting!
● “Views” to avoid duplication for testing?
● Orchestration of same environment per Developer!
● Debugging options
● versioning!!!
The method● Phase 0: get ready
○ Product requirements & business constraints
○ Automation requirements
○ Creating a testing matrix based on insight and product features.
● Phase 1: Get insights (some baselines)
○ Test Ingress separately, find your baseline.
○ Test Egress separately, find your baseline.
○ Test Ingress while Egress in progress., find your baseline.
○ Stability and Performance
● Phase 2: Agile: Design an automation system that satisfies the requirements
○ Prioritize automation features based on your insights from phase 1.
○ Implement an automation infrastructure as soon as possible.
○ Update your testing matrix as you go (insight will keep coming)
○ Analyze the test results daily.
● Phase 3: Cost reduction
○ How to reduce compute time/ IO pattern/ network pattern/storage footprint
○ Maintenance time (build/package/deploy/monitor)
○ Hardware costs
True Story
Testing GPU based DBMS
(Peta Scale)
So why build a DBMS from scratch?
● Most compiler are designed for CPU → execution tree for CPU runtime
engine → new compiler with GPU resource in mind.
● Most Algorithm were written for CPU → same algorithm logic, redesign for
parallelism philosophy of GPU → new runtime with GPU resource in mind →
performance increase by order of magnitude.
● VISION: fastest DBMS, cost effective, true scalability.
True story
● Product: Big Data gpu based DBMS system for analytics. [peta scale]
○ Structed data only, designed for OLAP use cases only
● Technical Constraints
○ How to test SQL syntax?(infinite input) SQL Coverage from 0% to 20% to 90% to 100%?.
○ What is the expected impact of big data?
○ How to debug performance issues?
○ Can you virtualize? Should you virtualize?
○ Running time - how much it takes to run all the tests?
● Company Challenges
○ Cost of Hardware? High end 2U Server
○ Expertise ? skillset?
○ Human resources: Size of QA Automation team?
○ How do you manage automation?
○ What is your MVP? (Chicken and Egg)
○ TIME TO MARKET!!! The product needs to be released.
The baseline concept
● Requirements
○ CSV , DDL , query
○ Competing DB
○ Your DB
○ Synthetic data used mostly.
○ (real data in peta scale is hard to comeby)
● Steps
○ Insert same data to same table on both DB’s
○ Run same query on both DB’s
○ Compare results on both.
○ If equal → test pass.
The Solution: SQL testing framework.
● Lightweight python test suite per feature
○ DDL
○ Set of Queries
○ Json for data generation requirements
○ Data generation utilities (genUtil)
○ test results report - in CSV format.
○ Expected error mechanism.
○ Results compare mechanism
○ Command line arguments for advanced testing/config/tuning/views.
● Wrapper test suite
○ group set of test suites by logic - e.g Daily
○ Aggregate reporting mechanism
● Scheduling ,reporting,monitoring,deployment, alerting mechanism
Testing concept of SQL syntax
Simple Aggregation Joins
Simple x x x
Aggregation x x x
Joins x x x
Challenges with SQL syntax testing
● (Binary data not supported in the product)
● Repetition of queries.
● Different data type names.
● Different data ranges per data types.
● Different accuracy per data types.
● The competition DB that supports big data is expensive...
● Accuracy of results. (different DB’s return different accuracy)
● SQL has some extreme cases (differ per vendor).
● Datetime format.
● Unsupported features.
● Duplication of testing.
● Very hard to predict which queries are useful (negative and positive testing).
Challenges with Small Data testing
● Generating Random data
○ Reproducible every time -->Strict data ranges per test (length, numeric range, format, accuracy)
○ What is the amount of unique values? Different histogram, different bottlenecks:
■ Unique values challenges:
● Per chunk?
● Per Column
■ Non unique values challenges:
● String?
○ Lengths
○ Compressible?
● Numeric?
○ Overflow?
○ Floating point?
Testing matrix: Data Integrity on Big data.
Data (no null) Compression Null Data Lookup. Partitions JDBC/ODBC
1 row
1M Rows
10M rows
100M Rows
1B rows
10B rows
100B rows
1 Trillion rows
10 Trillion
rows
Challenges with Big Data testing.
● Generating Time of data → genUtil with json for user input
● Reproducibility → genUtil again.
● Disk space for testing data → peta scale product → peta scale testing data
● Results compared mechanism on big data require a big data solution→ SMJ
● Network bottlenecks on a scale out system...
● ORDER IS NOT GUARANTEED!
Insights on the fly: User Defined Test Flow
● A series of queries in a specific order - crushed the system
● Solution: Automation infrastructure on a group of queries: User defined Test
Flow.
● Good for: pre and post condition of , custom testing, system perspective
testing. (partitions)
The solution: some Metrics
1. Extra large Sanity Testing per commit (!)
2. Hourly testing on latest commit (24 X 7)
a. 800 queries
b. Zero data
3. Daily test:
a. 400,000 queries
b. Zero data for 90% , 10% testing upto 1B records.
4. Weekly testing:
a. Big data testing 10B and above.
5. Monthly testing:
a. Full regression per version.
The solution: Hardware perspective
● 50 x Desktops: Optiplex 7040, I7 4 cores, 32GB. 1X 4Tb.
● 10 x High End Servers: Dell R730/R720: 20 cores, 128GB. 16X 1TB disks.
Nvidia K40
● DDN storage with 200TB+ GPFS
● Mellanox FDR switch for the storage
● 1GB switches for the desktop compute nodes.
Performance and Maintenance
Performance Challenges: app perspective
● Architecture bottlenecks [ count(*), group by, compression, sort Merge Join]
● What is IO pattern?
○ OLTP VS OLAP.
○ Columnar or Row based?
○ How big is your data?
○ READ % vs WRITE %.
○ Sequential? random?
○ Temporary VS permanent
○ Latency VS. throughput.
○ Multi threaded ? or single thread?
○ Power query? Or production?
○ Cold data? Host data?
Performance Challenges: OPS perspective
● Metrics
○ What is Expected Total RAM required per Query?
○ OS RAM cache , CPU Cache hits ?
○ SWAP ? yes/no how much?
○ OS metrics - open files, stack size, realtime priority?
● Theoretical limits?
○ Disk type selection -limitation, expected throughput
○ Raid controller, RAID type? Configuration? caching?
○ File system used? DFS? PFS? Ext4? XFS?
○ Recommended file system filesystem Block size? File system metadata?
○ Recommended File size on disk?
○ CPU selection - number of threads , cache level, hyper threading. Cilicon
○ Ram Selection - and placement on chassi
○ Network - the difference b/w 40Gbit and 25Gbit. NIC Bonding is good for?
○ PCI Express 3, 16 Langes. PCI Switch.
Maintenance challenges: OPS Perspective
How to Optimize your hardware selection ? ( Analytics on your testing data)
a. Compute intensive
i. GPU CORE intensive
ii. CPUcores intensive
1. Frequency
2. Amount of thread for concurrency
b. IO intensive?
i. SSD?
ii. NearLine sata?
iii. Partitions?
c. Hardware uniformity - use same hardware all over.
d. Get rid of weak links
Maintenance challenges: DevOps Perspective
● Lightweight - python code
○ Advantage → focus on generic skill set (python coding)
○ Disadvantage → reinventing the wheel
● Fault tolerance - Scale out topology
○ Cheap desktops → quick recovery from Image when hardware crushes
■ Advantage - >
● Cheap, low risk, pay as you go.
● Gets 90 % of the job DONE!
■ Disadvantage
● footprint is hell.
● Management of desktops - requires creativity
● Rely heavily on automation feature: reproducibility from scratch
● Validation automation on infrastructure & Deployment mechanism.
Maintenance challenges: DevOps Perspective
● Infrastructure
○ Continuous integration: continuous build, continuous packaging, installer.
○ Continuous Deployment: pre flight check, remote machine (on site, private cloud, cloud)
○ Continuous testing: Sanity, Hourly, Daily, weekly
● Reporting and Monitoring:
○ Extensive report server and analytics on current testing.
○ Monitoring: Green/Red and system real time metrics.
Maintenance challenges: Innovation Perspective
1. Innovation on Data generation (using GPU to generate data)
2. Innovation on Competitor DB running time (hashing)
3. Innovation on Workload (one dispatcher, many compute nodes)
4. Innovation on saving testing data /compute time
a. Hashing
b. Cacheing
c. Smart Data Creating - no need to create 100TB CSV to generate 100TB test.
d. Logs: Test results were managed on The DBMS itself :)
5. File system ZFS Cluster (utilizing unused disk space, redundancy, deduplication)
6. Nvidia TK1 /GRID GPU → Footprint!!!
7. “Virtualizing GPU” → Footprint!!!
a. Dockers
b. rCUDA
8. Amazon p2 gpu instances did not exist at that time (even so still very expensive)
Building the team Challenge
● Engineering skills required
■ Hardware: Servers
■ IO (storage, disks, RAID, IO patten)
■ FileSystems
■ Linux , CLI, CMD, installing, compiling, GIT
■ Basic network understanding
■ High Availability, Redundancy
■ HPC
● Coding: python.
● QA Big Data mindset
● DevOps mindset
● Automation Mindset
● Systematic Innovation methodologies
Lesson learned (insights)● Very hard to guarantee 100% QA coverage
● Quality in a reasonable cost is a huge challenge
● Very easy to miss milestones with QA of big data,
● each mistake create a huge ripple effect in timelines
● Main costs:
○ Employees : training them.
○ Running Time: Coding of innovative algorithms inside testing framework,
○ Maintenance: Automation, Monitoring, Analyzing test results, Environments setup
○ Hardware: innovative DevOps , innovative OPS, research
● Most of our innovation was spent on:
○ Doing more tests with less resources
○ agile improvement of backbone
○ Avoiding big data complexity problems in our testing!
● Industry tools ? great, but not always enough. Know their philosophy...
● Sometimes you need to get your hands dirty
● Keep a fine balance between business and technology
Take away message:
● Testing simplifications:
○ Ingress only
○ Egress only
○ Ingress while egress
○ Testing matrix - useful on complex big data systems
● The QA automation team
○ coding/scripting skills is a MUST
○ Engineering skill is a MUST
○ DevOps understanding is a MUST
○ Allocating time for innovation is a MUST
○ Allocating time for cost reduction is a MUST
● The Automation infrastructure
○ KISS
○ MVP
○ IS A PRODUCT by itself. With its own Product manager and Architect
Special Thanks...
● Citi innovation Center
● Asaf - Birenzvieg Big Data & Data Science - Israel meetup
● PayPal Risk and Data solution Group
Stay in touch...
● Omid Vahdaty
● +972-54-2384178

More Related Content

What's hot

Apache Solr as a compressed, scalable, and high performance time series database
Apache Solr as a compressed, scalable, and high performance time series databaseApache Solr as a compressed, scalable, and high performance time series database
Apache Solr as a compressed, scalable, and high performance time series database
Florian Lautenschlager
 
Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSLSebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
Flink Forward
 

What's hot (15)

Everything I Ever Learned About JVM Performance Tuning @Twitter
Everything I Ever Learned About JVM Performance Tuning @TwitterEverything I Ever Learned About JVM Performance Tuning @Twitter
Everything I Ever Learned About JVM Performance Tuning @Twitter
 
OpenTSDB for monitoring @ Criteo
OpenTSDB for monitoring @ CriteoOpenTSDB for monitoring @ Criteo
OpenTSDB for monitoring @ Criteo
 
Chronix Poster for the Poster Session FAST 2017
Chronix Poster for the Poster Session FAST 2017Chronix Poster for the Poster Session FAST 2017
Chronix Poster for the Poster Session FAST 2017
 
Efficient and Fast Time Series Storage - The missing link in dynamic software...
Efficient and Fast Time Series Storage - The missing link in dynamic software...Efficient and Fast Time Series Storage - The missing link in dynamic software...
Efficient and Fast Time Series Storage - The missing link in dynamic software...
 
Apache Solr as a compressed, scalable, and high performance time series database
Apache Solr as a compressed, scalable, and high performance time series databaseApache Solr as a compressed, scalable, and high performance time series database
Apache Solr as a compressed, scalable, and high performance time series database
 
Debugging Your Production JVM
Debugging Your Production JVMDebugging Your Production JVM
Debugging Your Production JVM
 
Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...
Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...
Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...
 
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
 
Garbage collection in JVM
Garbage collection in JVMGarbage collection in JVM
Garbage collection in JVM
 
Writing Applications for Scylla
Writing Applications for ScyllaWriting Applications for Scylla
Writing Applications for Scylla
 
OpenTSDB: HBaseCon2017
OpenTSDB: HBaseCon2017OpenTSDB: HBaseCon2017
OpenTSDB: HBaseCon2017
 
Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSLSebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
 
Golang in TiDB (GopherChina 2017)
Golang in TiDB  (GopherChina 2017)Golang in TiDB  (GopherChina 2017)
Golang in TiDB (GopherChina 2017)
 
Kubernetes as data platform
Kubernetes as data platformKubernetes as data platform
Kubernetes as data platform
 
Indexing Complex PostgreSQL Data Types
Indexing Complex PostgreSQL Data TypesIndexing Complex PostgreSQL Data Types
Indexing Complex PostgreSQL Data Types
 

Viewers also liked

Introduction to DevOps
Introduction to DevOpsIntroduction to DevOps
Introduction to DevOps
Omid Vahdaty
 
Ppt on automation
Ppt on automation Ppt on automation
Ppt on automation
harshaa
 

Viewers also liked (9)

Designing of databases
Designing of databasesDesigning of databases
Designing of databases
 
Getting Started With QA Automation
Getting Started With QA AutomationGetting Started With QA Automation
Getting Started With QA Automation
 
Aws s3 security
Aws s3 securityAws s3 security
Aws s3 security
 
QA automation
QA automationQA automation
QA automation
 
Introduction to DevOps
Introduction to DevOpsIntroduction to DevOps
Introduction to DevOps
 
Graph Databases for SQL Server Professionals
Graph Databases for SQL Server ProfessionalsGraph Databases for SQL Server Professionals
Graph Databases for SQL Server Professionals
 
Automation presentation
Automation presentationAutomation presentation
Automation presentation
 
BDD with Cucumber
BDD with CucumberBDD with Cucumber
BDD with Cucumber
 
Ppt on automation
Ppt on automation Ppt on automation
Ppt on automation
 

Similar to Lessons learned from designing a QA Automation for analytics databases (big data)

AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
Omid Vahdaty
 
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3  Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Omid Vahdaty
 

Similar to Lessons learned from designing a QA Automation for analytics databases (big data) (20)

Ledingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @LendingkartLedingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @Lendingkart
 
Machine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systemsMachine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systems
 
Production ready big ml workflows from zero to hero daniel marcous @ waze
Production ready big ml workflows from zero to hero daniel marcous @ wazeProduction ready big ml workflows from zero to hero daniel marcous @ waze
Production ready big ml workflows from zero to hero daniel marcous @ waze
 
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
 
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3  Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3
 
Monitoring Big Data Systems - "The Simple Way"
Monitoring Big Data Systems - "The Simple Way"Monitoring Big Data Systems - "The Simple Way"
Monitoring Big Data Systems - "The Simple Way"
 
Introduction to Apache Tajo: Future of Data Warehouse
Introduction to Apache Tajo: Future of Data WarehouseIntroduction to Apache Tajo: Future of Data Warehouse
Introduction to Apache Tajo: Future of Data Warehouse
 
Introduction to Apache Tajo: Future of Data Warehouse
Introduction to Apache Tajo: Future of Data WarehouseIntroduction to Apache Tajo: Future of Data Warehouse
Introduction to Apache Tajo: Future of Data Warehouse
 
Production-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to heroProduction-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to hero
 
Counting Unique Users in Real-Time: Here's a Challenge for You!
Counting Unique Users in Real-Time: Here's a Challenge for You!Counting Unique Users in Real-Time: Here's a Challenge for You!
Counting Unique Users in Real-Time: Here's a Challenge for You!
 
Elasticsearch Performance Testing and Scaling @ Signal
Elasticsearch Performance Testing and Scaling @ SignalElasticsearch Performance Testing and Scaling @ Signal
Elasticsearch Performance Testing and Scaling @ Signal
 
A Kaggle Talk
A Kaggle TalkA Kaggle Talk
A Kaggle Talk
 
Benchmarks, performance, scalability, and capacity what s behind the numbers...
Benchmarks, performance, scalability, and capacity  what s behind the numbers...Benchmarks, performance, scalability, and capacity  what s behind the numbers...
Benchmarks, performance, scalability, and capacity what s behind the numbers...
 
Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Quick dive into the big data pool without drowning - Demi Ben-Ari @ PanoraysQuick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
 
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB AtlasMongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
 
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
 
PGConf APAC 2018 - High performance json postgre-sql vs. mongodb
PGConf APAC 2018 - High performance json  postgre-sql vs. mongodbPGConf APAC 2018 - High performance json  postgre-sql vs. mongodb
PGConf APAC 2018 - High performance json postgre-sql vs. mongodb
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
My benchmarks brings all the boys to the yard
My benchmarks brings all the boys to the yardMy benchmarks brings all the boys to the yard
My benchmarks brings all the boys to the yard
 
Machine Learning for Capacity Management
 Machine Learning for Capacity Management Machine Learning for Capacity Management
Machine Learning for Capacity Management
 

More from Omid Vahdaty

More from Omid Vahdaty (20)

Data Pipline Observability meetup
Data Pipline Observability meetup Data Pipline Observability meetup
Data Pipline Observability meetup
 
Couchbase Data Platform | Big Data Demystified
Couchbase Data Platform | Big Data DemystifiedCouchbase Data Platform | Big Data Demystified
Couchbase Data Platform | Big Data Demystified
 
Machine Learning Essentials Demystified part2 | Big Data Demystified
Machine Learning Essentials Demystified part2 | Big Data DemystifiedMachine Learning Essentials Demystified part2 | Big Data Demystified
Machine Learning Essentials Demystified part2 | Big Data Demystified
 
Machine Learning Essentials Demystified part1 | Big Data Demystified
Machine Learning Essentials Demystified part1 | Big Data DemystifiedMachine Learning Essentials Demystified part1 | Big Data Demystified
Machine Learning Essentials Demystified part1 | Big Data Demystified
 
The technology of fake news between a new front and a new frontier | Big Dat...
The technology of fake news  between a new front and a new frontier | Big Dat...The technology of fake news  between a new front and a new frontier | Big Dat...
The technology of fake news between a new front and a new frontier | Big Dat...
 
Making your analytics talk business | Big Data Demystified
Making your analytics talk business | Big Data DemystifiedMaking your analytics talk business | Big Data Demystified
Making your analytics talk business | Big Data Demystified
 
BI STRATEGY FROM A BIRD'S EYE VIEW (How to become a trusted advisor) | Omri H...
BI STRATEGY FROM A BIRD'S EYE VIEW (How to become a trusted advisor) | Omri H...BI STRATEGY FROM A BIRD'S EYE VIEW (How to become a trusted advisor) | Omri H...
BI STRATEGY FROM A BIRD'S EYE VIEW (How to become a trusted advisor) | Omri H...
 
AI and Big Data in Health Sector Opportunities and challenges | Big Data Demy...
AI and Big Data in Health Sector Opportunities and challenges | Big Data Demy...AI and Big Data in Health Sector Opportunities and challenges | Big Data Demy...
AI and Big Data in Health Sector Opportunities and challenges | Big Data Demy...
 
Aerospike meetup july 2019 | Big Data Demystified
Aerospike meetup july 2019 | Big Data DemystifiedAerospike meetup july 2019 | Big Data Demystified
Aerospike meetup july 2019 | Big Data Demystified
 
ALIGNING YOUR BI OPERATIONS WITH YOUR CUSTOMERS' UNSPOKEN NEEDS, by Eyal Stei...
ALIGNING YOUR BI OPERATIONS WITH YOUR CUSTOMERS' UNSPOKEN NEEDS, by Eyal Stei...ALIGNING YOUR BI OPERATIONS WITH YOUR CUSTOMERS' UNSPOKEN NEEDS, by Eyal Stei...
ALIGNING YOUR BI OPERATIONS WITH YOUR CUSTOMERS' UNSPOKEN NEEDS, by Eyal Stei...
 
AWS Big Data Demystified #4 data governance demystified [security, networ...
AWS Big Data Demystified #4   data governance demystified   [security, networ...AWS Big Data Demystified #4   data governance demystified   [security, networ...
AWS Big Data Demystified #4 data governance demystified [security, networ...
 
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
 
AWS Big Data Demystified #2 | Athena, Spectrum, Emr, Hive
AWS Big Data Demystified #2 |  Athena, Spectrum, Emr, Hive AWS Big Data Demystified #2 |  Athena, Spectrum, Emr, Hive
AWS Big Data Demystified #2 | Athena, Spectrum, Emr, Hive
 
Amazon aws big data demystified | Introduction to streaming and messaging flu...
Amazon aws big data demystified | Introduction to streaming and messaging flu...Amazon aws big data demystified | Introduction to streaming and messaging flu...
Amazon aws big data demystified | Introduction to streaming and messaging flu...
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
 
Emr spark tuning demystified
Emr spark tuning demystifiedEmr spark tuning demystified
Emr spark tuning demystified
 
Emr zeppelin & Livy demystified
Emr zeppelin & Livy demystifiedEmr zeppelin & Livy demystified
Emr zeppelin & Livy demystified
 
Zeppelin and spark sql demystified
Zeppelin and spark sql demystifiedZeppelin and spark sql demystified
Zeppelin and spark sql demystified
 
Introduction to AWS Big Data
Introduction to AWS Big Data Introduction to AWS Big Data
Introduction to AWS Big Data
 
Introduction to streaming and messaging flume,kafka,SQS,kinesis
Introduction to streaming and messaging  flume,kafka,SQS,kinesis Introduction to streaming and messaging  flume,kafka,SQS,kinesis
Introduction to streaming and messaging flume,kafka,SQS,kinesis
 

Recently uploaded

Online crime reporting system project.pdf
Online crime reporting system project.pdfOnline crime reporting system project.pdf
Online crime reporting system project.pdf
Kamal Acharya
 
Final DBMS Manual (2).pdf final lab manual
Final DBMS Manual (2).pdf final lab manualFinal DBMS Manual (2).pdf final lab manual
Final DBMS Manual (2).pdf final lab manual
BalamuruganV28
 
Activity Planning: Objectives, Project Schedule, Network Planning Model. Time...
Activity Planning: Objectives, Project Schedule, Network Planning Model. Time...Activity Planning: Objectives, Project Schedule, Network Planning Model. Time...
Activity Planning: Objectives, Project Schedule, Network Planning Model. Time...
Lovely Professional University
 

Recently uploaded (20)

Online crime reporting system project.pdf
Online crime reporting system project.pdfOnline crime reporting system project.pdf
Online crime reporting system project.pdf
 
"United Nations Park" Site Visit Report.
"United Nations Park" Site  Visit Report."United Nations Park" Site  Visit Report.
"United Nations Park" Site Visit Report.
 
Involute of a circle,Square, pentagon,HexagonInvolute_Engineering Drawing.pdf
Involute of a circle,Square, pentagon,HexagonInvolute_Engineering Drawing.pdfInvolute of a circle,Square, pentagon,HexagonInvolute_Engineering Drawing.pdf
Involute of a circle,Square, pentagon,HexagonInvolute_Engineering Drawing.pdf
 
Electrical shop management system project report.pdf
Electrical shop management system project report.pdfElectrical shop management system project report.pdf
Electrical shop management system project report.pdf
 
Piping and instrumentation diagram p.pdf
Piping and instrumentation diagram p.pdfPiping and instrumentation diagram p.pdf
Piping and instrumentation diagram p.pdf
 
5G and 6G refer to generations of mobile network technology, each representin...
5G and 6G refer to generations of mobile network technology, each representin...5G and 6G refer to generations of mobile network technology, each representin...
5G and 6G refer to generations of mobile network technology, each representin...
 
The battle for RAG, explore the pros and cons of using KnowledgeGraphs and Ve...
The battle for RAG, explore the pros and cons of using KnowledgeGraphs and Ve...The battle for RAG, explore the pros and cons of using KnowledgeGraphs and Ve...
The battle for RAG, explore the pros and cons of using KnowledgeGraphs and Ve...
 
litvinenko_Henry_Intrusion_Hong-Kong_2024.pdf
litvinenko_Henry_Intrusion_Hong-Kong_2024.pdflitvinenko_Henry_Intrusion_Hong-Kong_2024.pdf
litvinenko_Henry_Intrusion_Hong-Kong_2024.pdf
 
Interfacing Analog to Digital Data Converters ee3404.pdf
Interfacing Analog to Digital Data Converters ee3404.pdfInterfacing Analog to Digital Data Converters ee3404.pdf
Interfacing Analog to Digital Data Converters ee3404.pdf
 
Theory for How to calculation capacitor bank
Theory for How to calculation capacitor bankTheory for How to calculation capacitor bank
Theory for How to calculation capacitor bank
 
NEWLETTER FRANCE HELICES/ SDS SURFACE DRIVES - MAY 2024
NEWLETTER FRANCE HELICES/ SDS SURFACE DRIVES - MAY 2024NEWLETTER FRANCE HELICES/ SDS SURFACE DRIVES - MAY 2024
NEWLETTER FRANCE HELICES/ SDS SURFACE DRIVES - MAY 2024
 
Final DBMS Manual (2).pdf final lab manual
Final DBMS Manual (2).pdf final lab manualFinal DBMS Manual (2).pdf final lab manual
Final DBMS Manual (2).pdf final lab manual
 
Activity Planning: Objectives, Project Schedule, Network Planning Model. Time...
Activity Planning: Objectives, Project Schedule, Network Planning Model. Time...Activity Planning: Objectives, Project Schedule, Network Planning Model. Time...
Activity Planning: Objectives, Project Schedule, Network Planning Model. Time...
 
Vip ℂall Girls Karkardooma Phone No 9999965857 High Profile ℂall Girl Delhi N...
Vip ℂall Girls Karkardooma Phone No 9999965857 High Profile ℂall Girl Delhi N...Vip ℂall Girls Karkardooma Phone No 9999965857 High Profile ℂall Girl Delhi N...
Vip ℂall Girls Karkardooma Phone No 9999965857 High Profile ℂall Girl Delhi N...
 
Insurance management system project report.pdf
Insurance management system project report.pdfInsurance management system project report.pdf
Insurance management system project report.pdf
 
Geometric constructions Engineering Drawing.pdf
Geometric constructions Engineering Drawing.pdfGeometric constructions Engineering Drawing.pdf
Geometric constructions Engineering Drawing.pdf
 
BORESCOPE INSPECTION for engins CFM56.pdf
BORESCOPE INSPECTION for engins CFM56.pdfBORESCOPE INSPECTION for engins CFM56.pdf
BORESCOPE INSPECTION for engins CFM56.pdf
 
Linux Systems Programming: Semaphores, Shared Memory, and Message Queues
Linux Systems Programming: Semaphores, Shared Memory, and Message QueuesLinux Systems Programming: Semaphores, Shared Memory, and Message Queues
Linux Systems Programming: Semaphores, Shared Memory, and Message Queues
 
Module-III Varried Flow.pptx GVF Definition, Water Surface Profile Dynamic Eq...
Module-III Varried Flow.pptx GVF Definition, Water Surface Profile Dynamic Eq...Module-III Varried Flow.pptx GVF Definition, Water Surface Profile Dynamic Eq...
Module-III Varried Flow.pptx GVF Definition, Water Surface Profile Dynamic Eq...
 
Software Engineering - Modelling Concepts + Class Modelling + Building the An...
Software Engineering - Modelling Concepts + Class Modelling + Building the An...Software Engineering - Modelling Concepts + Class Modelling + Building the An...
Software Engineering - Modelling Concepts + Class Modelling + Building the An...
 

Lessons learned from designing a QA Automation for analytics databases (big data)

  • 1. Lessons learned from designing a QA automation for analytics databases (Big Data) Omid Vahdaty, Big Data ninja.
  • 2. Agenda ● Introduction to Big Data Testing ● Methodological approach for QA Automation road map ● True Story ● The Solution? ● Challenges of Big Data Testing. ● Performance & Maintenance.
  • 3. Introduction to Big Data Testing
  • 4. ● 100M ? ● 1B ? ● 10B? ● 100B? ● 1Tr? How Big is Big Data?
  • 5. Scale OUT VS. Scale UP systems
  • 6. Big Data ? ● Structured (SQL, tabular, strings, numbers) ● Unstructured (logs, pictures, binary, json,blob, video etc)
  • 7. OLAP or OLTP? ● Simply put: ○ One query running on a lot of data for reporting/analytics ○ Many queries running quickly in parralel applications (e.g user login to a website, credentials are stored on a DB)
  • 8. Type of Big Data products ● Databases ○ OLAP, OLTP ○ Sql, NO SQL ○ In Memory, Disk based. ○ Hadoop Ecosystem, Spark ● Ecosystem tools ○ ETL tools ○ Visualizations tools ● General Application ● Analytics Based products ○ Froud ○ Cyber ○ Finances and etc.
  • 9. Expectation Matching for today’s lecture... ● OLAP Database ● Structured data only, Synthetic data was used in testing. ● (not about QA of analytics based products/solutions). ● (not about hadoop , event streaming cluster ). ● (not try to sell anything) ● In a nutshell: ○ How to test an SQL based database? ○ What sort of challenges were met?
  • 10. How Hard Can it be (aggregation)? ● Select Count(*) from t; ○ Assume ■ 1 Trillion records ■ ad-hoc query (no indexes) ■ Full Scan (no cache) ○ Challenges ■ Time it takes to compute? ■ IO bottleneck? What is IO pattern? ■ CPU bottleneck? ■ Scale up limitations?
  • 11. How Hard Can it be (join)? ● Select * from t1 join t2 where t1.id =t2.id ○ Assume ■ 1 Trillion records both tables. ■ ad-hoc query (no indexes) ■ Full Scan (no cache) ○ Challenges ■ Time it takes to compute? ■ IO bottleneck? What is IO pattern? ■ CPU bottleneck? ■ Scale out limitations?
  • 12. 3 Pain Points in Big Data DBMS Big Data System Ingress data rate Egress data rate Maintenance rate
  • 13. Big Data ingress challenges ● Parsing of Data type ○ Strings ○ Dates ○ Floats ● Rate of data coming into the system ● ACID: Atomicity, Consistency, Isolation, Durability ● Compression on the fly (non unique values) ● amount of data ● On the fly analytics ● Time constraints (target: x rows per sec/hour)
  • 14. Big Data egress challenges ● Sort ● Group by ● Reduce ● Join (sortMergeJoin) ● Data distribution ● Compression ● Theoretical Bottlenecks of hardware.
  • 15. Methodological approach for QA Automation road map
  • 16. Business constraints? ● Budget? ● Time to market? ● Resources available? ● Skill set available ?
  • 17. Product requirements? ● What are the product’s supported use cases? ● Supported Scale? ● Supported rate of ingress? ● Supported rate on egress? ● Complexity of Cluster? ● High availability?
  • 18. The Automation: Must have requirements? ● Scale out/up? ● Fault tolerance! ● Reproducibility from scratch! ● Reporting! ● “Views” to avoid duplication for testing? ● Orchestration of same environment per Developer! ● Debugging options ● versioning!!!
  • 19. The method● Phase 0: get ready ○ Product requirements & business constraints ○ Automation requirements ○ Creating a testing matrix based on insight and product features. ● Phase 1: Get insights (some baselines) ○ Test Ingress separately, find your baseline. ○ Test Egress separately, find your baseline. ○ Test Ingress while Egress in progress., find your baseline. ○ Stability and Performance ● Phase 2: Agile: Design an automation system that satisfies the requirements ○ Prioritize automation features based on your insights from phase 1. ○ Implement an automation infrastructure as soon as possible. ○ Update your testing matrix as you go (insight will keep coming) ○ Analyze the test results daily. ● Phase 3: Cost reduction ○ How to reduce compute time/ IO pattern/ network pattern/storage footprint ○ Maintenance time (build/package/deploy/monitor) ○ Hardware costs
  • 20. True Story Testing GPU based DBMS (Peta Scale)
  • 21. So why build a DBMS from scratch? ● Most compiler are designed for CPU → execution tree for CPU runtime engine → new compiler with GPU resource in mind. ● Most Algorithm were written for CPU → same algorithm logic, redesign for parallelism philosophy of GPU → new runtime with GPU resource in mind → performance increase by order of magnitude. ● VISION: fastest DBMS, cost effective, true scalability.
  • 22. True story ● Product: Big Data gpu based DBMS system for analytics. [peta scale] ○ Structed data only, designed for OLAP use cases only ● Technical Constraints ○ How to test SQL syntax?(infinite input) SQL Coverage from 0% to 20% to 90% to 100%?. ○ What is the expected impact of big data? ○ How to debug performance issues? ○ Can you virtualize? Should you virtualize? ○ Running time - how much it takes to run all the tests? ● Company Challenges ○ Cost of Hardware? High end 2U Server ○ Expertise ? skillset? ○ Human resources: Size of QA Automation team? ○ How do you manage automation? ○ What is your MVP? (Chicken and Egg) ○ TIME TO MARKET!!! The product needs to be released.
  • 23. The baseline concept ● Requirements ○ CSV , DDL , query ○ Competing DB ○ Your DB ○ Synthetic data used mostly. ○ (real data in peta scale is hard to comeby) ● Steps ○ Insert same data to same table on both DB’s ○ Run same query on both DB’s ○ Compare results on both. ○ If equal → test pass.
  • 24. The Solution: SQL testing framework. ● Lightweight python test suite per feature ○ DDL ○ Set of Queries ○ Json for data generation requirements ○ Data generation utilities (genUtil) ○ test results report - in CSV format. ○ Expected error mechanism. ○ Results compare mechanism ○ Command line arguments for advanced testing/config/tuning/views. ● Wrapper test suite ○ group set of test suites by logic - e.g Daily ○ Aggregate reporting mechanism ● Scheduling ,reporting,monitoring,deployment, alerting mechanism
  • 25. Testing concept of SQL syntax Simple Aggregation Joins Simple x x x Aggregation x x x Joins x x x
  • 26. Challenges with SQL syntax testing ● (Binary data not supported in the product) ● Repetition of queries. ● Different data type names. ● Different data ranges per data types. ● Different accuracy per data types. ● The competition DB that supports big data is expensive... ● Accuracy of results. (different DB’s return different accuracy) ● SQL has some extreme cases (differ per vendor). ● Datetime format. ● Unsupported features. ● Duplication of testing. ● Very hard to predict which queries are useful (negative and positive testing).
  • 27. Challenges with Small Data testing ● Generating Random data ○ Reproducible every time -->Strict data ranges per test (length, numeric range, format, accuracy) ○ What is the amount of unique values? Different histogram, different bottlenecks: ■ Unique values challenges: ● Per chunk? ● Per Column ■ Non unique values challenges: ● String? ○ Lengths ○ Compressible? ● Numeric? ○ Overflow? ○ Floating point?
  • 28. Testing matrix: Data Integrity on Big data. Data (no null) Compression Null Data Lookup. Partitions JDBC/ODBC 1 row 1M Rows 10M rows 100M Rows 1B rows 10B rows 100B rows 1 Trillion rows 10 Trillion rows
  • 29. Challenges with Big Data testing. ● Generating Time of data → genUtil with json for user input ● Reproducibility → genUtil again. ● Disk space for testing data → peta scale product → peta scale testing data ● Results compared mechanism on big data require a big data solution→ SMJ ● Network bottlenecks on a scale out system... ● ORDER IS NOT GUARANTEED!
  • 30. Insights on the fly: User Defined Test Flow ● A series of queries in a specific order - crushed the system ● Solution: Automation infrastructure on a group of queries: User defined Test Flow. ● Good for: pre and post condition of , custom testing, system perspective testing. (partitions)
  • 31. The solution: some Metrics 1. Extra large Sanity Testing per commit (!) 2. Hourly testing on latest commit (24 X 7) a. 800 queries b. Zero data 3. Daily test: a. 400,000 queries b. Zero data for 90% , 10% testing upto 1B records. 4. Weekly testing: a. Big data testing 10B and above. 5. Monthly testing: a. Full regression per version.
  • 32. The solution: Hardware perspective ● 50 x Desktops: Optiplex 7040, I7 4 cores, 32GB. 1X 4Tb. ● 10 x High End Servers: Dell R730/R720: 20 cores, 128GB. 16X 1TB disks. Nvidia K40 ● DDN storage with 200TB+ GPFS ● Mellanox FDR switch for the storage ● 1GB switches for the desktop compute nodes.
  • 34. Performance Challenges: app perspective ● Architecture bottlenecks [ count(*), group by, compression, sort Merge Join] ● What is IO pattern? ○ OLTP VS OLAP. ○ Columnar or Row based? ○ How big is your data? ○ READ % vs WRITE %. ○ Sequential? random? ○ Temporary VS permanent ○ Latency VS. throughput. ○ Multi threaded ? or single thread? ○ Power query? Or production? ○ Cold data? Host data?
  • 35. Performance Challenges: OPS perspective ● Metrics ○ What is Expected Total RAM required per Query? ○ OS RAM cache , CPU Cache hits ? ○ SWAP ? yes/no how much? ○ OS metrics - open files, stack size, realtime priority? ● Theoretical limits? ○ Disk type selection -limitation, expected throughput ○ Raid controller, RAID type? Configuration? caching? ○ File system used? DFS? PFS? Ext4? XFS? ○ Recommended file system filesystem Block size? File system metadata? ○ Recommended File size on disk? ○ CPU selection - number of threads , cache level, hyper threading. Cilicon ○ Ram Selection - and placement on chassi ○ Network - the difference b/w 40Gbit and 25Gbit. NIC Bonding is good for? ○ PCI Express 3, 16 Langes. PCI Switch.
  • 36. Maintenance challenges: OPS Perspective How to Optimize your hardware selection ? ( Analytics on your testing data) a. Compute intensive i. GPU CORE intensive ii. CPUcores intensive 1. Frequency 2. Amount of thread for concurrency b. IO intensive? i. SSD? ii. NearLine sata? iii. Partitions? c. Hardware uniformity - use same hardware all over. d. Get rid of weak links
  • 37. Maintenance challenges: DevOps Perspective ● Lightweight - python code ○ Advantage → focus on generic skill set (python coding) ○ Disadvantage → reinventing the wheel ● Fault tolerance - Scale out topology ○ Cheap desktops → quick recovery from Image when hardware crushes ■ Advantage - > ● Cheap, low risk, pay as you go. ● Gets 90 % of the job DONE! ■ Disadvantage ● footprint is hell. ● Management of desktops - requires creativity ● Rely heavily on automation feature: reproducibility from scratch ● Validation automation on infrastructure & Deployment mechanism.
  • 38. Maintenance challenges: DevOps Perspective ● Infrastructure ○ Continuous integration: continuous build, continuous packaging, installer. ○ Continuous Deployment: pre flight check, remote machine (on site, private cloud, cloud) ○ Continuous testing: Sanity, Hourly, Daily, weekly ● Reporting and Monitoring: ○ Extensive report server and analytics on current testing. ○ Monitoring: Green/Red and system real time metrics.
  • 39. Maintenance challenges: Innovation Perspective 1. Innovation on Data generation (using GPU to generate data) 2. Innovation on Competitor DB running time (hashing) 3. Innovation on Workload (one dispatcher, many compute nodes) 4. Innovation on saving testing data /compute time a. Hashing b. Cacheing c. Smart Data Creating - no need to create 100TB CSV to generate 100TB test. d. Logs: Test results were managed on The DBMS itself :) 5. File system ZFS Cluster (utilizing unused disk space, redundancy, deduplication) 6. Nvidia TK1 /GRID GPU → Footprint!!! 7. “Virtualizing GPU” → Footprint!!! a. Dockers b. rCUDA 8. Amazon p2 gpu instances did not exist at that time (even so still very expensive)
  • 40. Building the team Challenge ● Engineering skills required ■ Hardware: Servers ■ IO (storage, disks, RAID, IO patten) ■ FileSystems ■ Linux , CLI, CMD, installing, compiling, GIT ■ Basic network understanding ■ High Availability, Redundancy ■ HPC ● Coding: python. ● QA Big Data mindset ● DevOps mindset ● Automation Mindset ● Systematic Innovation methodologies
  • 41. Lesson learned (insights)● Very hard to guarantee 100% QA coverage ● Quality in a reasonable cost is a huge challenge ● Very easy to miss milestones with QA of big data, ● each mistake create a huge ripple effect in timelines ● Main costs: ○ Employees : training them. ○ Running Time: Coding of innovative algorithms inside testing framework, ○ Maintenance: Automation, Monitoring, Analyzing test results, Environments setup ○ Hardware: innovative DevOps , innovative OPS, research ● Most of our innovation was spent on: ○ Doing more tests with less resources ○ agile improvement of backbone ○ Avoiding big data complexity problems in our testing! ● Industry tools ? great, but not always enough. Know their philosophy... ● Sometimes you need to get your hands dirty ● Keep a fine balance between business and technology
  • 42. Take away message: ● Testing simplifications: ○ Ingress only ○ Egress only ○ Ingress while egress ○ Testing matrix - useful on complex big data systems ● The QA automation team ○ coding/scripting skills is a MUST ○ Engineering skill is a MUST ○ DevOps understanding is a MUST ○ Allocating time for innovation is a MUST ○ Allocating time for cost reduction is a MUST ● The Automation infrastructure ○ KISS ○ MVP ○ IS A PRODUCT by itself. With its own Product manager and Architect
  • 43. Special Thanks... ● Citi innovation Center ● Asaf - Birenzvieg Big Data & Data Science - Israel meetup ● PayPal Risk and Data solution Group
  • 44. Stay in touch... ● Omid Vahdaty ● +972-54-2384178