SlideShare a Scribd company logo
Lessons learned from
designing a QA automation
for analytics databases
(Big Data)
Omid Vahdaty, Big Data ninja.
Agenda
● Introduction to Big Data Testing
● Methodological approach for QA Automation road map
● True Story
● The Solution?
● Challenges of Big Data Testing.
● Performance & Maintenance.
Introduction to Big Data Testing
● 100M ?
● 1B ?
● 10B?
● 100B?
● 1Tr?
How Big is Big Data?
Scale OUT VS. Scale UP systems
Big Data ?
● Structured (SQL, tabular, strings, numbers)
● Unstructured (logs, pictures, binary, json,blob, video etc)
OLAP or OLTP?
● Simply put:
○ One query running on a lot of data for reporting/analytics
○ Many queries running quickly in parralel applications (e.g user login to a website, credentials
are stored on a DB)
Type of Big Data products
● Databases
○ OLAP, OLTP
○ Sql, NO SQL
○ In Memory, Disk based.
○ Hadoop Ecosystem, Spark
● Ecosystem tools
○ ETL tools
○ Visualizations tools
● General Application
● Analytics Based products
○ Froud
○ Cyber
○ Finances and etc.
Expectation Matching for today’s lecture...
● OLAP Database
● Structured data only, Synthetic data was used in testing.
● (not about QA of analytics based products/solutions).
● (not about hadoop , event streaming cluster ).
● (not try to sell anything)
● In a nutshell:
○ How to test an SQL based database?
○ What sort of challenges were met?
How Hard Can it be (aggregation)?
● Select Count(*) from t;
○ Assume
■ 1 Trillion records
■ ad-hoc query (no indexes)
■ Full Scan (no cache)
○ Challenges
■ Time it takes to compute?
■ IO bottleneck? What is IO pattern?
■ CPU bottleneck?
■ Scale up limitations?
How Hard Can it be (join)?
● Select * from t1 join t2 where t1.id =t2.id
○ Assume
■ 1 Trillion records both tables.
■ ad-hoc query (no indexes)
■ Full Scan (no cache)
○ Challenges
■ Time it takes to compute?
■ IO bottleneck? What is IO pattern?
■ CPU bottleneck?
■ Scale out limitations?
3 Pain Points in Big Data DBMS
Big Data System
Ingress data rate Egress data rate
Maintenance rate
Big Data ingress challenges
● Parsing of Data type
○ Strings
○ Dates
○ Floats
● Rate of data coming into the system
● ACID: Atomicity, Consistency, Isolation, Durability
● Compression on the fly (non unique values)
● amount of data
● On the fly analytics
● Time constraints (target: x rows per sec/hour)
Big Data egress challenges
● Sort
● Group by
● Reduce
● Join (sortMergeJoin)
● Data distribution
● Compression
● Theoretical Bottlenecks of hardware.
Methodological approach
for QA Automation road map
Business constraints?
● Budget?
● Time to market?
● Resources available?
● Skill set available ?
Product requirements?
● What are the product’s supported use cases?
● Supported Scale?
● Supported rate of ingress?
● Supported rate on egress?
● Complexity of Cluster?
● High availability?
The Automation: Must have requirements?
● Scale out/up?
● Fault tolerance!
● Reproducibility from scratch!
● Reporting!
● “Views” to avoid duplication for testing?
● Orchestration of same environment per Developer!
● Debugging options
● versioning!!!
The method● Phase 0: get ready
○ Product requirements & business constraints
○ Automation requirements
○ Creating a testing matrix based on insight and product features.
● Phase 1: Get insights (some baselines)
○ Test Ingress separately, find your baseline.
○ Test Egress separately, find your baseline.
○ Test Ingress while Egress in progress., find your baseline.
○ Stability and Performance
● Phase 2: Agile: Design an automation system that satisfies the requirements
○ Prioritize automation features based on your insights from phase 1.
○ Implement an automation infrastructure as soon as possible.
○ Update your testing matrix as you go (insight will keep coming)
○ Analyze the test results daily.
● Phase 3: Cost reduction
○ How to reduce compute time/ IO pattern/ network pattern/storage footprint
○ Maintenance time (build/package/deploy/monitor)
○ Hardware costs
True Story
Testing GPU based DBMS
(Peta Scale)
So why build a DBMS from scratch?
● Most compiler are designed for CPU → execution tree for CPU runtime
engine → new compiler with GPU resource in mind.
● Most Algorithm were written for CPU → same algorithm logic, redesign for
parallelism philosophy of GPU → new runtime with GPU resource in mind →
performance increase by order of magnitude.
● VISION: fastest DBMS, cost effective, true scalability.
True story
● Product: Big Data gpu based DBMS system for analytics. [peta scale]
○ Structed data only, designed for OLAP use cases only
● Technical Constraints
○ How to test SQL syntax?(infinite input) SQL Coverage from 0% to 20% to 90% to 100%?.
○ What is the expected impact of big data?
○ How to debug performance issues?
○ Can you virtualize? Should you virtualize?
○ Running time - how much it takes to run all the tests?
● Company Challenges
○ Cost of Hardware? High end 2U Server
○ Expertise ? skillset?
○ Human resources: Size of QA Automation team?
○ How do you manage automation?
○ What is your MVP? (Chicken and Egg)
○ TIME TO MARKET!!! The product needs to be released.
The baseline concept
● Requirements
○ CSV , DDL , query
○ Competing DB
○ Your DB
○ Synthetic data used mostly.
○ (real data in peta scale is hard to comeby)
● Steps
○ Insert same data to same table on both DB’s
○ Run same query on both DB’s
○ Compare results on both.
○ If equal → test pass.
The Solution: SQL testing framework.
● Lightweight python test suite per feature
○ DDL
○ Set of Queries
○ Json for data generation requirements
○ Data generation utilities (genUtil)
○ test results report - in CSV format.
○ Expected error mechanism.
○ Results compare mechanism
○ Command line arguments for advanced testing/config/tuning/views.
● Wrapper test suite
○ group set of test suites by logic - e.g Daily
○ Aggregate reporting mechanism
● Scheduling ,reporting,monitoring,deployment, alerting mechanism
Testing concept of SQL syntax
Simple Aggregation Joins
Simple x x x
Aggregation x x x
Joins x x x
Challenges with SQL syntax testing
● (Binary data not supported in the product)
● Repetition of queries.
● Different data type names.
● Different data ranges per data types.
● Different accuracy per data types.
● The competition DB that supports big data is expensive...
● Accuracy of results. (different DB’s return different accuracy)
● SQL has some extreme cases (differ per vendor).
● Datetime format.
● Unsupported features.
● Duplication of testing.
● Very hard to predict which queries are useful (negative and positive testing).
Challenges with Small Data testing
● Generating Random data
○ Reproducible every time -->Strict data ranges per test (length, numeric range, format, accuracy)
○ What is the amount of unique values? Different histogram, different bottlenecks:
■ Unique values challenges:
● Per chunk?
● Per Column
■ Non unique values challenges:
● String?
○ Lengths
○ Compressible?
● Numeric?
○ Overflow?
○ Floating point?
Testing matrix: Data Integrity on Big data.
Data (no null) Compression Null Data Lookup. Partitions JDBC/ODBC
1 row
1M Rows
10M rows
100M Rows
1B rows
10B rows
100B rows
1 Trillion rows
10 Trillion
rows
Challenges with Big Data testing.
● Generating Time of data → genUtil with json for user input
● Reproducibility → genUtil again.
● Disk space for testing data → peta scale product → peta scale testing data
● Results compared mechanism on big data require a big data solution→ SMJ
● Network bottlenecks on a scale out system...
● ORDER IS NOT GUARANTEED!
Insights on the fly: User Defined Test Flow
● A series of queries in a specific order - crushed the system
● Solution: Automation infrastructure on a group of queries: User defined Test
Flow.
● Good for: pre and post condition of , custom testing, system perspective
testing. (partitions)
The solution: some Metrics
1. Extra large Sanity Testing per commit (!)
2. Hourly testing on latest commit (24 X 7)
a. 800 queries
b. Zero data
3. Daily test:
a. 400,000 queries
b. Zero data for 90% , 10% testing upto 1B records.
4. Weekly testing:
a. Big data testing 10B and above.
5. Monthly testing:
a. Full regression per version.
The solution: Hardware perspective
● 50 x Desktops: Optiplex 7040, I7 4 cores, 32GB. 1X 4Tb.
● 10 x High End Servers: Dell R730/R720: 20 cores, 128GB. 16X 1TB disks.
Nvidia K40
● DDN storage with 200TB+ GPFS
● Mellanox FDR switch for the storage
● 1GB switches for the desktop compute nodes.
Performance and Maintenance
Performance Challenges: app perspective
● Architecture bottlenecks [ count(*), group by, compression, sort Merge Join]
● What is IO pattern?
○ OLTP VS OLAP.
○ Columnar or Row based?
○ How big is your data?
○ READ % vs WRITE %.
○ Sequential? random?
○ Temporary VS permanent
○ Latency VS. throughput.
○ Multi threaded ? or single thread?
○ Power query? Or production?
○ Cold data? Host data?
Performance Challenges: OPS perspective
● Metrics
○ What is Expected Total RAM required per Query?
○ OS RAM cache , CPU Cache hits ?
○ SWAP ? yes/no how much?
○ OS metrics - open files, stack size, realtime priority?
● Theoretical limits?
○ Disk type selection -limitation, expected throughput
○ Raid controller, RAID type? Configuration? caching?
○ File system used? DFS? PFS? Ext4? XFS?
○ Recommended file system filesystem Block size? File system metadata?
○ Recommended File size on disk?
○ CPU selection - number of threads , cache level, hyper threading. Cilicon
○ Ram Selection - and placement on chassi
○ Network - the difference b/w 40Gbit and 25Gbit. NIC Bonding is good for?
○ PCI Express 3, 16 Langes. PCI Switch.
Maintenance challenges: OPS Perspective
How to Optimize your hardware selection ? ( Analytics on your testing data)
a. Compute intensive
i. GPU CORE intensive
ii. CPUcores intensive
1. Frequency
2. Amount of thread for concurrency
b. IO intensive?
i. SSD?
ii. NearLine sata?
iii. Partitions?
c. Hardware uniformity - use same hardware all over.
d. Get rid of weak links
Maintenance challenges: DevOps Perspective
● Lightweight - python code
○ Advantage → focus on generic skill set (python coding)
○ Disadvantage → reinventing the wheel
● Fault tolerance - Scale out topology
○ Cheap desktops → quick recovery from Image when hardware crushes
■ Advantage - >
● Cheap, low risk, pay as you go.
● Gets 90 % of the job DONE!
■ Disadvantage
● footprint is hell.
● Management of desktops - requires creativity
● Rely heavily on automation feature: reproducibility from scratch
● Validation automation on infrastructure & Deployment mechanism.
Maintenance challenges: DevOps Perspective
● Infrastructure
○ Continuous integration: continuous build, continuous packaging, installer.
○ Continuous Deployment: pre flight check, remote machine (on site, private cloud, cloud)
○ Continuous testing: Sanity, Hourly, Daily, weekly
● Reporting and Monitoring:
○ Extensive report server and analytics on current testing.
○ Monitoring: Green/Red and system real time metrics.
Maintenance challenges: Innovation Perspective
1. Innovation on Data generation (using GPU to generate data)
2. Innovation on Competitor DB running time (hashing)
3. Innovation on Workload (one dispatcher, many compute nodes)
4. Innovation on saving testing data /compute time
a. Hashing
b. Cacheing
c. Smart Data Creating - no need to create 100TB CSV to generate 100TB test.
d. Logs: Test results were managed on The DBMS itself :)
5. File system ZFS Cluster (utilizing unused disk space, redundancy, deduplication)
6. Nvidia TK1 /GRID GPU → Footprint!!!
7. “Virtualizing GPU” → Footprint!!!
a. Dockers
b. rCUDA
8. Amazon p2 gpu instances did not exist at that time (even so still very expensive)
Building the team Challenge
● Engineering skills required
■ Hardware: Servers
■ IO (storage, disks, RAID, IO patten)
■ FileSystems
■ Linux , CLI, CMD, installing, compiling, GIT
■ Basic network understanding
■ High Availability, Redundancy
■ HPC
● Coding: python.
● QA Big Data mindset
● DevOps mindset
● Automation Mindset
● Systematic Innovation methodologies
Lesson learned (insights)● Very hard to guarantee 100% QA coverage
● Quality in a reasonable cost is a huge challenge
● Very easy to miss milestones with QA of big data,
● each mistake create a huge ripple effect in timelines
● Main costs:
○ Employees : training them.
○ Running Time: Coding of innovative algorithms inside testing framework,
○ Maintenance: Automation, Monitoring, Analyzing test results, Environments setup
○ Hardware: innovative DevOps , innovative OPS, research
● Most of our innovation was spent on:
○ Doing more tests with less resources
○ agile improvement of backbone
○ Avoiding big data complexity problems in our testing!
● Industry tools ? great, but not always enough. Know their philosophy...
● Sometimes you need to get your hands dirty
● Keep a fine balance between business and technology
Take away message:
● Testing simplifications:
○ Ingress only
○ Egress only
○ Ingress while egress
○ Testing matrix - useful on complex big data systems
● The QA automation team
○ coding/scripting skills is a MUST
○ Engineering skill is a MUST
○ DevOps understanding is a MUST
○ Allocating time for innovation is a MUST
○ Allocating time for cost reduction is a MUST
● The Automation infrastructure
○ KISS
○ MVP
○ IS A PRODUCT by itself. With its own Product manager and Architect
Special Thanks...
● Citi innovation Center
● Asaf - Birenzvieg Big Data & Data Science - Israel meetup
● PayPal Risk and Data solution Group
Stay in touch...
● Omid Vahdaty
● +972-54-2384178

More Related Content

What's hot

Everything I Ever Learned About JVM Performance Tuning @Twitter
Everything I Ever Learned About JVM Performance Tuning @TwitterEverything I Ever Learned About JVM Performance Tuning @Twitter
Everything I Ever Learned About JVM Performance Tuning @Twitter
Attila Szegedi
 
OpenTSDB for monitoring @ Criteo
OpenTSDB for monitoring @ CriteoOpenTSDB for monitoring @ Criteo
OpenTSDB for monitoring @ Criteo
Nathaniel Braun
 
Chronix Poster for the Poster Session FAST 2017
Chronix Poster for the Poster Session FAST 2017Chronix Poster for the Poster Session FAST 2017
Chronix Poster for the Poster Session FAST 2017
Florian Lautenschlager
 
Efficient and Fast Time Series Storage - The missing link in dynamic software...
Efficient and Fast Time Series Storage - The missing link in dynamic software...Efficient and Fast Time Series Storage - The missing link in dynamic software...
Efficient and Fast Time Series Storage - The missing link in dynamic software...
Florian Lautenschlager
 
Apache Solr as a compressed, scalable, and high performance time series database
Apache Solr as a compressed, scalable, and high performance time series databaseApache Solr as a compressed, scalable, and high performance time series database
Apache Solr as a compressed, scalable, and high performance time series database
Florian Lautenschlager
 
Debugging Your Production JVM
Debugging Your Production JVMDebugging Your Production JVM
Debugging Your Production JVM
kensipe
 
Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...
Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...
Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...
NoSQLmatters
 
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
NETWAYS
 
Garbage collection in JVM
Garbage collection in JVMGarbage collection in JVM
Garbage collection in JVM
aragozin
 
Writing Applications for Scylla
Writing Applications for ScyllaWriting Applications for Scylla
Writing Applications for Scylla
ScyllaDB
 
OpenTSDB: HBaseCon2017
OpenTSDB: HBaseCon2017OpenTSDB: HBaseCon2017
OpenTSDB: HBaseCon2017
HBaseCon
 
Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSLSebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
Flink Forward
 
Golang in TiDB (GopherChina 2017)
Golang in TiDB  (GopherChina 2017)Golang in TiDB  (GopherChina 2017)
Golang in TiDB (GopherChina 2017)
PingCAP
 
Kubernetes as data platform
Kubernetes as data platformKubernetes as data platform
Kubernetes as data platform
Lars Albertsson
 
Indexing Complex PostgreSQL Data Types
Indexing Complex PostgreSQL Data TypesIndexing Complex PostgreSQL Data Types
Indexing Complex PostgreSQL Data Types
Jonathan Katz
 

What's hot (15)

Everything I Ever Learned About JVM Performance Tuning @Twitter
Everything I Ever Learned About JVM Performance Tuning @TwitterEverything I Ever Learned About JVM Performance Tuning @Twitter
Everything I Ever Learned About JVM Performance Tuning @Twitter
 
OpenTSDB for monitoring @ Criteo
OpenTSDB for monitoring @ CriteoOpenTSDB for monitoring @ Criteo
OpenTSDB for monitoring @ Criteo
 
Chronix Poster for the Poster Session FAST 2017
Chronix Poster for the Poster Session FAST 2017Chronix Poster for the Poster Session FAST 2017
Chronix Poster for the Poster Session FAST 2017
 
Efficient and Fast Time Series Storage - The missing link in dynamic software...
Efficient and Fast Time Series Storage - The missing link in dynamic software...Efficient and Fast Time Series Storage - The missing link in dynamic software...
Efficient and Fast Time Series Storage - The missing link in dynamic software...
 
Apache Solr as a compressed, scalable, and high performance time series database
Apache Solr as a compressed, scalable, and high performance time series databaseApache Solr as a compressed, scalable, and high performance time series database
Apache Solr as a compressed, scalable, and high performance time series database
 
Debugging Your Production JVM
Debugging Your Production JVMDebugging Your Production JVM
Debugging Your Production JVM
 
Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...
Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...
Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...
 
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
 
Garbage collection in JVM
Garbage collection in JVMGarbage collection in JVM
Garbage collection in JVM
 
Writing Applications for Scylla
Writing Applications for ScyllaWriting Applications for Scylla
Writing Applications for Scylla
 
OpenTSDB: HBaseCon2017
OpenTSDB: HBaseCon2017OpenTSDB: HBaseCon2017
OpenTSDB: HBaseCon2017
 
Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSLSebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
 
Golang in TiDB (GopherChina 2017)
Golang in TiDB  (GopherChina 2017)Golang in TiDB  (GopherChina 2017)
Golang in TiDB (GopherChina 2017)
 
Kubernetes as data platform
Kubernetes as data platformKubernetes as data platform
Kubernetes as data platform
 
Indexing Complex PostgreSQL Data Types
Indexing Complex PostgreSQL Data TypesIndexing Complex PostgreSQL Data Types
Indexing Complex PostgreSQL Data Types
 

Viewers also liked

Designing of databases
Designing of databasesDesigning of databases
Designing of databases
Ansh Jhanji
 
Getting Started With QA Automation
Getting Started With QA AutomationGetting Started With QA Automation
Getting Started With QA Automation
Giovanni Scerra ☃
 
Aws s3 security
Aws s3 securityAws s3 security
Aws s3 security
Omid Vahdaty
 
QA automation
QA automationQA automation
QA automation
Strategybeach
 
Introduction to DevOps
Introduction to DevOpsIntroduction to DevOps
Introduction to DevOps
Omid Vahdaty
 
Graph Databases for SQL Server Professionals
Graph Databases for SQL Server ProfessionalsGraph Databases for SQL Server Professionals
Graph Databases for SQL Server Professionals
Stéphane Fréchette
 
Automation presentation
Automation presentationAutomation presentation
Automation presentation
AKANSHA GURELE
 
BDD with Cucumber
BDD with CucumberBDD with Cucumber
BDD with Cucumber
Knoldus Inc.
 
Ppt on automation
Ppt on automation Ppt on automation
Ppt on automation
harshaa
 

Viewers also liked (9)

Designing of databases
Designing of databasesDesigning of databases
Designing of databases
 
Getting Started With QA Automation
Getting Started With QA AutomationGetting Started With QA Automation
Getting Started With QA Automation
 
Aws s3 security
Aws s3 securityAws s3 security
Aws s3 security
 
QA automation
QA automationQA automation
QA automation
 
Introduction to DevOps
Introduction to DevOpsIntroduction to DevOps
Introduction to DevOps
 
Graph Databases for SQL Server Professionals
Graph Databases for SQL Server ProfessionalsGraph Databases for SQL Server Professionals
Graph Databases for SQL Server Professionals
 
Automation presentation
Automation presentationAutomation presentation
Automation presentation
 
BDD with Cucumber
BDD with CucumberBDD with Cucumber
BDD with Cucumber
 
Ppt on automation
Ppt on automation Ppt on automation
Ppt on automation
 

Similar to Lessons learned from designing a QA Automation for analytics databases (big data)

Ledingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @LendingkartLedingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @Lendingkart
Mukesh Singh
 
Machine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systemsMachine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systems
Zhenxiao Luo
 
Production ready big ml workflows from zero to hero daniel marcous @ waze
Production ready big ml workflows from zero to hero daniel marcous @ wazeProduction ready big ml workflows from zero to hero daniel marcous @ waze
Production ready big ml workflows from zero to hero daniel marcous @ waze
Ido Shilon
 
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
Omid Vahdaty
 
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3  Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Omid Vahdaty
 
Monitoring Big Data Systems - "The Simple Way"
Monitoring Big Data Systems - "The Simple Way"Monitoring Big Data Systems - "The Simple Way"
Monitoring Big Data Systems - "The Simple Way"
Demi Ben-Ari
 
Introduction to Apache Tajo: Future of Data Warehouse
Introduction to Apache Tajo: Future of Data WarehouseIntroduction to Apache Tajo: Future of Data Warehouse
Introduction to Apache Tajo: Future of Data Warehouse
Gruter
 
Introduction to Apache Tajo: Future of Data Warehouse
Introduction to Apache Tajo: Future of Data WarehouseIntroduction to Apache Tajo: Future of Data Warehouse
Introduction to Apache Tajo: Future of Data Warehouse
Jihoon Son
 
Production-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to heroProduction-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to hero
Daniel Marcous
 
Counting Unique Users in Real-Time: Here's a Challenge for You!
Counting Unique Users in Real-Time: Here's a Challenge for You!Counting Unique Users in Real-Time: Here's a Challenge for You!
Counting Unique Users in Real-Time: Here's a Challenge for You!
DataWorks Summit
 
Elasticsearch Performance Testing and Scaling @ Signal
Elasticsearch Performance Testing and Scaling @ SignalElasticsearch Performance Testing and Scaling @ Signal
Elasticsearch Performance Testing and Scaling @ Signal
Joachim Draeger
 
A Kaggle Talk
A Kaggle TalkA Kaggle Talk
A Kaggle Talk
Lex Toumbourou
 
Benchmarks, performance, scalability, and capacity what s behind the numbers...
Benchmarks, performance, scalability, and capacity  what s behind the numbers...Benchmarks, performance, scalability, and capacity  what s behind the numbers...
Benchmarks, performance, scalability, and capacity what s behind the numbers...
james tong
 
Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Quick dive into the big data pool without drowning - Demi Ben-Ari @ PanoraysQuick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Demi Ben-Ari
 
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB AtlasMongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB
 
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
Omid Vahdaty
 
PGConf APAC 2018 - High performance json postgre-sql vs. mongodb
PGConf APAC 2018 - High performance json  postgre-sql vs. mongodbPGConf APAC 2018 - High performance json  postgre-sql vs. mongodb
PGConf APAC 2018 - High performance json postgre-sql vs. mongodb
PGConf APAC
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
Databricks
 
My benchmarks brings all the boys to the yard
My benchmarks brings all the boys to the yardMy benchmarks brings all the boys to the yard
My benchmarks brings all the boys to the yard
Ion Dormenco
 
Machine Learning for Capacity Management
 Machine Learning for Capacity Management Machine Learning for Capacity Management
Machine Learning for Capacity Management
EDB
 

Similar to Lessons learned from designing a QA Automation for analytics databases (big data) (20)

Ledingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @LendingkartLedingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @Lendingkart
 
Machine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systemsMachine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systems
 
Production ready big ml workflows from zero to hero daniel marcous @ waze
Production ready big ml workflows from zero to hero daniel marcous @ wazeProduction ready big ml workflows from zero to hero daniel marcous @ waze
Production ready big ml workflows from zero to hero daniel marcous @ waze
 
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
 
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3  Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3
 
Monitoring Big Data Systems - "The Simple Way"
Monitoring Big Data Systems - "The Simple Way"Monitoring Big Data Systems - "The Simple Way"
Monitoring Big Data Systems - "The Simple Way"
 
Introduction to Apache Tajo: Future of Data Warehouse
Introduction to Apache Tajo: Future of Data WarehouseIntroduction to Apache Tajo: Future of Data Warehouse
Introduction to Apache Tajo: Future of Data Warehouse
 
Introduction to Apache Tajo: Future of Data Warehouse
Introduction to Apache Tajo: Future of Data WarehouseIntroduction to Apache Tajo: Future of Data Warehouse
Introduction to Apache Tajo: Future of Data Warehouse
 
Production-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to heroProduction-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to hero
 
Counting Unique Users in Real-Time: Here's a Challenge for You!
Counting Unique Users in Real-Time: Here's a Challenge for You!Counting Unique Users in Real-Time: Here's a Challenge for You!
Counting Unique Users in Real-Time: Here's a Challenge for You!
 
Elasticsearch Performance Testing and Scaling @ Signal
Elasticsearch Performance Testing and Scaling @ SignalElasticsearch Performance Testing and Scaling @ Signal
Elasticsearch Performance Testing and Scaling @ Signal
 
A Kaggle Talk
A Kaggle TalkA Kaggle Talk
A Kaggle Talk
 
Benchmarks, performance, scalability, and capacity what s behind the numbers...
Benchmarks, performance, scalability, and capacity  what s behind the numbers...Benchmarks, performance, scalability, and capacity  what s behind the numbers...
Benchmarks, performance, scalability, and capacity what s behind the numbers...
 
Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Quick dive into the big data pool without drowning - Demi Ben-Ari @ PanoraysQuick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
 
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB AtlasMongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
 
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
 
PGConf APAC 2018 - High performance json postgre-sql vs. mongodb
PGConf APAC 2018 - High performance json  postgre-sql vs. mongodbPGConf APAC 2018 - High performance json  postgre-sql vs. mongodb
PGConf APAC 2018 - High performance json postgre-sql vs. mongodb
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
My benchmarks brings all the boys to the yard
My benchmarks brings all the boys to the yardMy benchmarks brings all the boys to the yard
My benchmarks brings all the boys to the yard
 
Machine Learning for Capacity Management
 Machine Learning for Capacity Management Machine Learning for Capacity Management
Machine Learning for Capacity Management
 

More from Omid Vahdaty

Data Pipline Observability meetup
Data Pipline Observability meetup Data Pipline Observability meetup
Data Pipline Observability meetup
Omid Vahdaty
 
Couchbase Data Platform | Big Data Demystified
Couchbase Data Platform | Big Data DemystifiedCouchbase Data Platform | Big Data Demystified
Couchbase Data Platform | Big Data Demystified
Omid Vahdaty
 
Machine Learning Essentials Demystified part2 | Big Data Demystified
Machine Learning Essentials Demystified part2 | Big Data DemystifiedMachine Learning Essentials Demystified part2 | Big Data Demystified
Machine Learning Essentials Demystified part2 | Big Data Demystified
Omid Vahdaty
 
Machine Learning Essentials Demystified part1 | Big Data Demystified
Machine Learning Essentials Demystified part1 | Big Data DemystifiedMachine Learning Essentials Demystified part1 | Big Data Demystified
Machine Learning Essentials Demystified part1 | Big Data Demystified
Omid Vahdaty
 
The technology of fake news between a new front and a new frontier | Big Dat...
The technology of fake news  between a new front and a new frontier | Big Dat...The technology of fake news  between a new front and a new frontier | Big Dat...
The technology of fake news between a new front and a new frontier | Big Dat...
Omid Vahdaty
 
Making your analytics talk business | Big Data Demystified
Making your analytics talk business | Big Data DemystifiedMaking your analytics talk business | Big Data Demystified
Making your analytics talk business | Big Data Demystified
Omid Vahdaty
 
BI STRATEGY FROM A BIRD'S EYE VIEW (How to become a trusted advisor) | Omri H...
BI STRATEGY FROM A BIRD'S EYE VIEW (How to become a trusted advisor) | Omri H...BI STRATEGY FROM A BIRD'S EYE VIEW (How to become a trusted advisor) | Omri H...
BI STRATEGY FROM A BIRD'S EYE VIEW (How to become a trusted advisor) | Omri H...
Omid Vahdaty
 
AI and Big Data in Health Sector Opportunities and challenges | Big Data Demy...
AI and Big Data in Health Sector Opportunities and challenges | Big Data Demy...AI and Big Data in Health Sector Opportunities and challenges | Big Data Demy...
AI and Big Data in Health Sector Opportunities and challenges | Big Data Demy...
Omid Vahdaty
 
Aerospike meetup july 2019 | Big Data Demystified
Aerospike meetup july 2019 | Big Data DemystifiedAerospike meetup july 2019 | Big Data Demystified
Aerospike meetup july 2019 | Big Data Demystified
Omid Vahdaty
 
ALIGNING YOUR BI OPERATIONS WITH YOUR CUSTOMERS' UNSPOKEN NEEDS, by Eyal Stei...
ALIGNING YOUR BI OPERATIONS WITH YOUR CUSTOMERS' UNSPOKEN NEEDS, by Eyal Stei...ALIGNING YOUR BI OPERATIONS WITH YOUR CUSTOMERS' UNSPOKEN NEEDS, by Eyal Stei...
ALIGNING YOUR BI OPERATIONS WITH YOUR CUSTOMERS' UNSPOKEN NEEDS, by Eyal Stei...
Omid Vahdaty
 
AWS Big Data Demystified #4 data governance demystified [security, networ...
AWS Big Data Demystified #4   data governance demystified   [security, networ...AWS Big Data Demystified #4   data governance demystified   [security, networ...
AWS Big Data Demystified #4 data governance demystified [security, networ...
Omid Vahdaty
 
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
Omid Vahdaty
 
AWS Big Data Demystified #2 | Athena, Spectrum, Emr, Hive
AWS Big Data Demystified #2 |  Athena, Spectrum, Emr, Hive AWS Big Data Demystified #2 |  Athena, Spectrum, Emr, Hive
AWS Big Data Demystified #2 | Athena, Spectrum, Emr, Hive
Omid Vahdaty
 
Amazon aws big data demystified | Introduction to streaming and messaging flu...
Amazon aws big data demystified | Introduction to streaming and messaging flu...Amazon aws big data demystified | Introduction to streaming and messaging flu...
Amazon aws big data demystified | Introduction to streaming and messaging flu...
Omid Vahdaty
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
Omid Vahdaty
 
Emr spark tuning demystified
Emr spark tuning demystifiedEmr spark tuning demystified
Emr spark tuning demystified
Omid Vahdaty
 
Emr zeppelin & Livy demystified
Emr zeppelin & Livy demystifiedEmr zeppelin & Livy demystified
Emr zeppelin & Livy demystified
Omid Vahdaty
 
Zeppelin and spark sql demystified
Zeppelin and spark sql demystifiedZeppelin and spark sql demystified
Zeppelin and spark sql demystified
Omid Vahdaty
 
Introduction to AWS Big Data
Introduction to AWS Big Data Introduction to AWS Big Data
Introduction to AWS Big Data
Omid Vahdaty
 
Introduction to streaming and messaging flume,kafka,SQS,kinesis
Introduction to streaming and messaging  flume,kafka,SQS,kinesis Introduction to streaming and messaging  flume,kafka,SQS,kinesis
Introduction to streaming and messaging flume,kafka,SQS,kinesis
Omid Vahdaty
 

More from Omid Vahdaty (20)

Data Pipline Observability meetup
Data Pipline Observability meetup Data Pipline Observability meetup
Data Pipline Observability meetup
 
Couchbase Data Platform | Big Data Demystified
Couchbase Data Platform | Big Data DemystifiedCouchbase Data Platform | Big Data Demystified
Couchbase Data Platform | Big Data Demystified
 
Machine Learning Essentials Demystified part2 | Big Data Demystified
Machine Learning Essentials Demystified part2 | Big Data DemystifiedMachine Learning Essentials Demystified part2 | Big Data Demystified
Machine Learning Essentials Demystified part2 | Big Data Demystified
 
Machine Learning Essentials Demystified part1 | Big Data Demystified
Machine Learning Essentials Demystified part1 | Big Data DemystifiedMachine Learning Essentials Demystified part1 | Big Data Demystified
Machine Learning Essentials Demystified part1 | Big Data Demystified
 
The technology of fake news between a new front and a new frontier | Big Dat...
The technology of fake news  between a new front and a new frontier | Big Dat...The technology of fake news  between a new front and a new frontier | Big Dat...
The technology of fake news between a new front and a new frontier | Big Dat...
 
Making your analytics talk business | Big Data Demystified
Making your analytics talk business | Big Data DemystifiedMaking your analytics talk business | Big Data Demystified
Making your analytics talk business | Big Data Demystified
 
BI STRATEGY FROM A BIRD'S EYE VIEW (How to become a trusted advisor) | Omri H...
BI STRATEGY FROM A BIRD'S EYE VIEW (How to become a trusted advisor) | Omri H...BI STRATEGY FROM A BIRD'S EYE VIEW (How to become a trusted advisor) | Omri H...
BI STRATEGY FROM A BIRD'S EYE VIEW (How to become a trusted advisor) | Omri H...
 
AI and Big Data in Health Sector Opportunities and challenges | Big Data Demy...
AI and Big Data in Health Sector Opportunities and challenges | Big Data Demy...AI and Big Data in Health Sector Opportunities and challenges | Big Data Demy...
AI and Big Data in Health Sector Opportunities and challenges | Big Data Demy...
 
Aerospike meetup july 2019 | Big Data Demystified
Aerospike meetup july 2019 | Big Data DemystifiedAerospike meetup july 2019 | Big Data Demystified
Aerospike meetup july 2019 | Big Data Demystified
 
ALIGNING YOUR BI OPERATIONS WITH YOUR CUSTOMERS' UNSPOKEN NEEDS, by Eyal Stei...
ALIGNING YOUR BI OPERATIONS WITH YOUR CUSTOMERS' UNSPOKEN NEEDS, by Eyal Stei...ALIGNING YOUR BI OPERATIONS WITH YOUR CUSTOMERS' UNSPOKEN NEEDS, by Eyal Stei...
ALIGNING YOUR BI OPERATIONS WITH YOUR CUSTOMERS' UNSPOKEN NEEDS, by Eyal Stei...
 
AWS Big Data Demystified #4 data governance demystified [security, networ...
AWS Big Data Demystified #4   data governance demystified   [security, networ...AWS Big Data Demystified #4   data governance demystified   [security, networ...
AWS Big Data Demystified #4 data governance demystified [security, networ...
 
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
 
AWS Big Data Demystified #2 | Athena, Spectrum, Emr, Hive
AWS Big Data Demystified #2 |  Athena, Spectrum, Emr, Hive AWS Big Data Demystified #2 |  Athena, Spectrum, Emr, Hive
AWS Big Data Demystified #2 | Athena, Spectrum, Emr, Hive
 
Amazon aws big data demystified | Introduction to streaming and messaging flu...
Amazon aws big data demystified | Introduction to streaming and messaging flu...Amazon aws big data demystified | Introduction to streaming and messaging flu...
Amazon aws big data demystified | Introduction to streaming and messaging flu...
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
 
Emr spark tuning demystified
Emr spark tuning demystifiedEmr spark tuning demystified
Emr spark tuning demystified
 
Emr zeppelin & Livy demystified
Emr zeppelin & Livy demystifiedEmr zeppelin & Livy demystified
Emr zeppelin & Livy demystified
 
Zeppelin and spark sql demystified
Zeppelin and spark sql demystifiedZeppelin and spark sql demystified
Zeppelin and spark sql demystified
 
Introduction to AWS Big Data
Introduction to AWS Big Data Introduction to AWS Big Data
Introduction to AWS Big Data
 
Introduction to streaming and messaging flume,kafka,SQS,kinesis
Introduction to streaming and messaging  flume,kafka,SQS,kinesis Introduction to streaming and messaging  flume,kafka,SQS,kinesis
Introduction to streaming and messaging flume,kafka,SQS,kinesis
 

Recently uploaded

Generative AI leverages algorithms to create various forms of content
Generative AI leverages algorithms to create various forms of contentGenerative AI leverages algorithms to create various forms of content
Generative AI leverages algorithms to create various forms of content
Hitesh Mohapatra
 
官方认证美国密歇根州立大学毕业证学位证书原版一模一样
官方认证美国密歇根州立大学毕业证学位证书原版一模一样官方认证美国密歇根州立大学毕业证学位证书原版一模一样
官方认证美国密歇根州立大学毕业证学位证书原版一模一样
171ticu
 
132/33KV substation case study Presentation
132/33KV substation case study Presentation132/33KV substation case study Presentation
132/33KV substation case study Presentation
kandramariana6
 
2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf
2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf
2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf
Yasser Mahgoub
 
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
IJECEIAES
 
Heat Resistant Concrete Presentation ppt
Heat Resistant Concrete Presentation pptHeat Resistant Concrete Presentation ppt
Heat Resistant Concrete Presentation ppt
mamunhossenbd75
 
Literature Review Basics and Understanding Reference Management.pptx
Literature Review Basics and Understanding Reference Management.pptxLiterature Review Basics and Understanding Reference Management.pptx
Literature Review Basics and Understanding Reference Management.pptx
Dr Ramhari Poudyal
 
5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...
5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...
5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...
ihlasbinance2003
 
A SYSTEMATIC RISK ASSESSMENT APPROACH FOR SECURING THE SMART IRRIGATION SYSTEMS
A SYSTEMATIC RISK ASSESSMENT APPROACH FOR SECURING THE SMART IRRIGATION SYSTEMSA SYSTEMATIC RISK ASSESSMENT APPROACH FOR SECURING THE SMART IRRIGATION SYSTEMS
A SYSTEMATIC RISK ASSESSMENT APPROACH FOR SECURING THE SMART IRRIGATION SYSTEMS
IJNSA Journal
 
Recycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part IIIRecycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part III
Aditya Rajan Patra
 
CSM Cloud Service Management Presentarion
CSM Cloud Service Management PresentarionCSM Cloud Service Management Presentarion
CSM Cloud Service Management Presentarion
rpskprasana
 
TIME DIVISION MULTIPLEXING TECHNIQUE FOR COMMUNICATION SYSTEM
TIME DIVISION MULTIPLEXING TECHNIQUE FOR COMMUNICATION SYSTEMTIME DIVISION MULTIPLEXING TECHNIQUE FOR COMMUNICATION SYSTEM
TIME DIVISION MULTIPLEXING TECHNIQUE FOR COMMUNICATION SYSTEM
HODECEDSIET
 
Engineering Drawings Lecture Detail Drawings 2014.pdf
Engineering Drawings Lecture Detail Drawings 2014.pdfEngineering Drawings Lecture Detail Drawings 2014.pdf
Engineering Drawings Lecture Detail Drawings 2014.pdf
abbyasa1014
 
Iron and Steel Technology Roadmap - Towards more sustainable steelmaking.pdf
Iron and Steel Technology Roadmap - Towards more sustainable steelmaking.pdfIron and Steel Technology Roadmap - Towards more sustainable steelmaking.pdf
Iron and Steel Technology Roadmap - Towards more sustainable steelmaking.pdf
RadiNasr
 
spirit beverages ppt without graphics.pptx
spirit beverages ppt without graphics.pptxspirit beverages ppt without graphics.pptx
spirit beverages ppt without graphics.pptx
Madan Karki
 
学校原版美国波士顿大学毕业证学历学位证书原版一模一样
学校原版美国波士顿大学毕业证学历学位证书原版一模一样学校原版美国波士顿大学毕业证学历学位证书原版一模一样
学校原版美国波士顿大学毕业证学历学位证书原版一模一样
171ticu
 
Embedded machine learning-based road conditions and driving behavior monitoring
Embedded machine learning-based road conditions and driving behavior monitoringEmbedded machine learning-based road conditions and driving behavior monitoring
Embedded machine learning-based road conditions and driving behavior monitoring
IJECEIAES
 
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
insn4465
 
Engine Lubrication performance System.pdf
Engine Lubrication performance System.pdfEngine Lubrication performance System.pdf
Engine Lubrication performance System.pdf
mamamaam477
 
Eric Nizeyimana's document 2006 from gicumbi to ttc nyamata handball play
Eric Nizeyimana's document 2006 from gicumbi to ttc nyamata handball playEric Nizeyimana's document 2006 from gicumbi to ttc nyamata handball play
Eric Nizeyimana's document 2006 from gicumbi to ttc nyamata handball play
enizeyimana36
 

Recently uploaded (20)

Generative AI leverages algorithms to create various forms of content
Generative AI leverages algorithms to create various forms of contentGenerative AI leverages algorithms to create various forms of content
Generative AI leverages algorithms to create various forms of content
 
官方认证美国密歇根州立大学毕业证学位证书原版一模一样
官方认证美国密歇根州立大学毕业证学位证书原版一模一样官方认证美国密歇根州立大学毕业证学位证书原版一模一样
官方认证美国密歇根州立大学毕业证学位证书原版一模一样
 
132/33KV substation case study Presentation
132/33KV substation case study Presentation132/33KV substation case study Presentation
132/33KV substation case study Presentation
 
2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf
2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf
2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf
 
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
 
Heat Resistant Concrete Presentation ppt
Heat Resistant Concrete Presentation pptHeat Resistant Concrete Presentation ppt
Heat Resistant Concrete Presentation ppt
 
Literature Review Basics and Understanding Reference Management.pptx
Literature Review Basics and Understanding Reference Management.pptxLiterature Review Basics and Understanding Reference Management.pptx
Literature Review Basics and Understanding Reference Management.pptx
 
5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...
5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...
5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...
 
A SYSTEMATIC RISK ASSESSMENT APPROACH FOR SECURING THE SMART IRRIGATION SYSTEMS
A SYSTEMATIC RISK ASSESSMENT APPROACH FOR SECURING THE SMART IRRIGATION SYSTEMSA SYSTEMATIC RISK ASSESSMENT APPROACH FOR SECURING THE SMART IRRIGATION SYSTEMS
A SYSTEMATIC RISK ASSESSMENT APPROACH FOR SECURING THE SMART IRRIGATION SYSTEMS
 
Recycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part IIIRecycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part III
 
CSM Cloud Service Management Presentarion
CSM Cloud Service Management PresentarionCSM Cloud Service Management Presentarion
CSM Cloud Service Management Presentarion
 
TIME DIVISION MULTIPLEXING TECHNIQUE FOR COMMUNICATION SYSTEM
TIME DIVISION MULTIPLEXING TECHNIQUE FOR COMMUNICATION SYSTEMTIME DIVISION MULTIPLEXING TECHNIQUE FOR COMMUNICATION SYSTEM
TIME DIVISION MULTIPLEXING TECHNIQUE FOR COMMUNICATION SYSTEM
 
Engineering Drawings Lecture Detail Drawings 2014.pdf
Engineering Drawings Lecture Detail Drawings 2014.pdfEngineering Drawings Lecture Detail Drawings 2014.pdf
Engineering Drawings Lecture Detail Drawings 2014.pdf
 
Iron and Steel Technology Roadmap - Towards more sustainable steelmaking.pdf
Iron and Steel Technology Roadmap - Towards more sustainable steelmaking.pdfIron and Steel Technology Roadmap - Towards more sustainable steelmaking.pdf
Iron and Steel Technology Roadmap - Towards more sustainable steelmaking.pdf
 
spirit beverages ppt without graphics.pptx
spirit beverages ppt without graphics.pptxspirit beverages ppt without graphics.pptx
spirit beverages ppt without graphics.pptx
 
学校原版美国波士顿大学毕业证学历学位证书原版一模一样
学校原版美国波士顿大学毕业证学历学位证书原版一模一样学校原版美国波士顿大学毕业证学历学位证书原版一模一样
学校原版美国波士顿大学毕业证学历学位证书原版一模一样
 
Embedded machine learning-based road conditions and driving behavior monitoring
Embedded machine learning-based road conditions and driving behavior monitoringEmbedded machine learning-based road conditions and driving behavior monitoring
Embedded machine learning-based road conditions and driving behavior monitoring
 
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
 
Engine Lubrication performance System.pdf
Engine Lubrication performance System.pdfEngine Lubrication performance System.pdf
Engine Lubrication performance System.pdf
 
Eric Nizeyimana's document 2006 from gicumbi to ttc nyamata handball play
Eric Nizeyimana's document 2006 from gicumbi to ttc nyamata handball playEric Nizeyimana's document 2006 from gicumbi to ttc nyamata handball play
Eric Nizeyimana's document 2006 from gicumbi to ttc nyamata handball play
 

Lessons learned from designing a QA Automation for analytics databases (big data)

  • 1. Lessons learned from designing a QA automation for analytics databases (Big Data) Omid Vahdaty, Big Data ninja.
  • 2. Agenda ● Introduction to Big Data Testing ● Methodological approach for QA Automation road map ● True Story ● The Solution? ● Challenges of Big Data Testing. ● Performance & Maintenance.
  • 3. Introduction to Big Data Testing
  • 4. ● 100M ? ● 1B ? ● 10B? ● 100B? ● 1Tr? How Big is Big Data?
  • 5. Scale OUT VS. Scale UP systems
  • 6. Big Data ? ● Structured (SQL, tabular, strings, numbers) ● Unstructured (logs, pictures, binary, json,blob, video etc)
  • 7. OLAP or OLTP? ● Simply put: ○ One query running on a lot of data for reporting/analytics ○ Many queries running quickly in parralel applications (e.g user login to a website, credentials are stored on a DB)
  • 8. Type of Big Data products ● Databases ○ OLAP, OLTP ○ Sql, NO SQL ○ In Memory, Disk based. ○ Hadoop Ecosystem, Spark ● Ecosystem tools ○ ETL tools ○ Visualizations tools ● General Application ● Analytics Based products ○ Froud ○ Cyber ○ Finances and etc.
  • 9. Expectation Matching for today’s lecture... ● OLAP Database ● Structured data only, Synthetic data was used in testing. ● (not about QA of analytics based products/solutions). ● (not about hadoop , event streaming cluster ). ● (not try to sell anything) ● In a nutshell: ○ How to test an SQL based database? ○ What sort of challenges were met?
  • 10. How Hard Can it be (aggregation)? ● Select Count(*) from t; ○ Assume ■ 1 Trillion records ■ ad-hoc query (no indexes) ■ Full Scan (no cache) ○ Challenges ■ Time it takes to compute? ■ IO bottleneck? What is IO pattern? ■ CPU bottleneck? ■ Scale up limitations?
  • 11. How Hard Can it be (join)? ● Select * from t1 join t2 where t1.id =t2.id ○ Assume ■ 1 Trillion records both tables. ■ ad-hoc query (no indexes) ■ Full Scan (no cache) ○ Challenges ■ Time it takes to compute? ■ IO bottleneck? What is IO pattern? ■ CPU bottleneck? ■ Scale out limitations?
  • 12. 3 Pain Points in Big Data DBMS Big Data System Ingress data rate Egress data rate Maintenance rate
  • 13. Big Data ingress challenges ● Parsing of Data type ○ Strings ○ Dates ○ Floats ● Rate of data coming into the system ● ACID: Atomicity, Consistency, Isolation, Durability ● Compression on the fly (non unique values) ● amount of data ● On the fly analytics ● Time constraints (target: x rows per sec/hour)
  • 14. Big Data egress challenges ● Sort ● Group by ● Reduce ● Join (sortMergeJoin) ● Data distribution ● Compression ● Theoretical Bottlenecks of hardware.
  • 15. Methodological approach for QA Automation road map
  • 16. Business constraints? ● Budget? ● Time to market? ● Resources available? ● Skill set available ?
  • 17. Product requirements? ● What are the product’s supported use cases? ● Supported Scale? ● Supported rate of ingress? ● Supported rate on egress? ● Complexity of Cluster? ● High availability?
  • 18. The Automation: Must have requirements? ● Scale out/up? ● Fault tolerance! ● Reproducibility from scratch! ● Reporting! ● “Views” to avoid duplication for testing? ● Orchestration of same environment per Developer! ● Debugging options ● versioning!!!
  • 19. The method● Phase 0: get ready ○ Product requirements & business constraints ○ Automation requirements ○ Creating a testing matrix based on insight and product features. ● Phase 1: Get insights (some baselines) ○ Test Ingress separately, find your baseline. ○ Test Egress separately, find your baseline. ○ Test Ingress while Egress in progress., find your baseline. ○ Stability and Performance ● Phase 2: Agile: Design an automation system that satisfies the requirements ○ Prioritize automation features based on your insights from phase 1. ○ Implement an automation infrastructure as soon as possible. ○ Update your testing matrix as you go (insight will keep coming) ○ Analyze the test results daily. ● Phase 3: Cost reduction ○ How to reduce compute time/ IO pattern/ network pattern/storage footprint ○ Maintenance time (build/package/deploy/monitor) ○ Hardware costs
  • 20. True Story Testing GPU based DBMS (Peta Scale)
  • 21. So why build a DBMS from scratch? ● Most compiler are designed for CPU → execution tree for CPU runtime engine → new compiler with GPU resource in mind. ● Most Algorithm were written for CPU → same algorithm logic, redesign for parallelism philosophy of GPU → new runtime with GPU resource in mind → performance increase by order of magnitude. ● VISION: fastest DBMS, cost effective, true scalability.
  • 22. True story ● Product: Big Data gpu based DBMS system for analytics. [peta scale] ○ Structed data only, designed for OLAP use cases only ● Technical Constraints ○ How to test SQL syntax?(infinite input) SQL Coverage from 0% to 20% to 90% to 100%?. ○ What is the expected impact of big data? ○ How to debug performance issues? ○ Can you virtualize? Should you virtualize? ○ Running time - how much it takes to run all the tests? ● Company Challenges ○ Cost of Hardware? High end 2U Server ○ Expertise ? skillset? ○ Human resources: Size of QA Automation team? ○ How do you manage automation? ○ What is your MVP? (Chicken and Egg) ○ TIME TO MARKET!!! The product needs to be released.
  • 23. The baseline concept ● Requirements ○ CSV , DDL , query ○ Competing DB ○ Your DB ○ Synthetic data used mostly. ○ (real data in peta scale is hard to comeby) ● Steps ○ Insert same data to same table on both DB’s ○ Run same query on both DB’s ○ Compare results on both. ○ If equal → test pass.
  • 24. The Solution: SQL testing framework. ● Lightweight python test suite per feature ○ DDL ○ Set of Queries ○ Json for data generation requirements ○ Data generation utilities (genUtil) ○ test results report - in CSV format. ○ Expected error mechanism. ○ Results compare mechanism ○ Command line arguments for advanced testing/config/tuning/views. ● Wrapper test suite ○ group set of test suites by logic - e.g Daily ○ Aggregate reporting mechanism ● Scheduling ,reporting,monitoring,deployment, alerting mechanism
  • 25. Testing concept of SQL syntax Simple Aggregation Joins Simple x x x Aggregation x x x Joins x x x
  • 26. Challenges with SQL syntax testing ● (Binary data not supported in the product) ● Repetition of queries. ● Different data type names. ● Different data ranges per data types. ● Different accuracy per data types. ● The competition DB that supports big data is expensive... ● Accuracy of results. (different DB’s return different accuracy) ● SQL has some extreme cases (differ per vendor). ● Datetime format. ● Unsupported features. ● Duplication of testing. ● Very hard to predict which queries are useful (negative and positive testing).
  • 27. Challenges with Small Data testing ● Generating Random data ○ Reproducible every time -->Strict data ranges per test (length, numeric range, format, accuracy) ○ What is the amount of unique values? Different histogram, different bottlenecks: ■ Unique values challenges: ● Per chunk? ● Per Column ■ Non unique values challenges: ● String? ○ Lengths ○ Compressible? ● Numeric? ○ Overflow? ○ Floating point?
  • 28. Testing matrix: Data Integrity on Big data. Data (no null) Compression Null Data Lookup. Partitions JDBC/ODBC 1 row 1M Rows 10M rows 100M Rows 1B rows 10B rows 100B rows 1 Trillion rows 10 Trillion rows
  • 29. Challenges with Big Data testing. ● Generating Time of data → genUtil with json for user input ● Reproducibility → genUtil again. ● Disk space for testing data → peta scale product → peta scale testing data ● Results compared mechanism on big data require a big data solution→ SMJ ● Network bottlenecks on a scale out system... ● ORDER IS NOT GUARANTEED!
  • 30. Insights on the fly: User Defined Test Flow ● A series of queries in a specific order - crushed the system ● Solution: Automation infrastructure on a group of queries: User defined Test Flow. ● Good for: pre and post condition of , custom testing, system perspective testing. (partitions)
  • 31. The solution: some Metrics 1. Extra large Sanity Testing per commit (!) 2. Hourly testing on latest commit (24 X 7) a. 800 queries b. Zero data 3. Daily test: a. 400,000 queries b. Zero data for 90% , 10% testing upto 1B records. 4. Weekly testing: a. Big data testing 10B and above. 5. Monthly testing: a. Full regression per version.
  • 32. The solution: Hardware perspective ● 50 x Desktops: Optiplex 7040, I7 4 cores, 32GB. 1X 4Tb. ● 10 x High End Servers: Dell R730/R720: 20 cores, 128GB. 16X 1TB disks. Nvidia K40 ● DDN storage with 200TB+ GPFS ● Mellanox FDR switch for the storage ● 1GB switches for the desktop compute nodes.
  • 34. Performance Challenges: app perspective ● Architecture bottlenecks [ count(*), group by, compression, sort Merge Join] ● What is IO pattern? ○ OLTP VS OLAP. ○ Columnar or Row based? ○ How big is your data? ○ READ % vs WRITE %. ○ Sequential? random? ○ Temporary VS permanent ○ Latency VS. throughput. ○ Multi threaded ? or single thread? ○ Power query? Or production? ○ Cold data? Host data?
  • 35. Performance Challenges: OPS perspective ● Metrics ○ What is Expected Total RAM required per Query? ○ OS RAM cache , CPU Cache hits ? ○ SWAP ? yes/no how much? ○ OS metrics - open files, stack size, realtime priority? ● Theoretical limits? ○ Disk type selection -limitation, expected throughput ○ Raid controller, RAID type? Configuration? caching? ○ File system used? DFS? PFS? Ext4? XFS? ○ Recommended file system filesystem Block size? File system metadata? ○ Recommended File size on disk? ○ CPU selection - number of threads , cache level, hyper threading. Cilicon ○ Ram Selection - and placement on chassi ○ Network - the difference b/w 40Gbit and 25Gbit. NIC Bonding is good for? ○ PCI Express 3, 16 Langes. PCI Switch.
  • 36. Maintenance challenges: OPS Perspective How to Optimize your hardware selection ? ( Analytics on your testing data) a. Compute intensive i. GPU CORE intensive ii. CPUcores intensive 1. Frequency 2. Amount of thread for concurrency b. IO intensive? i. SSD? ii. NearLine sata? iii. Partitions? c. Hardware uniformity - use same hardware all over. d. Get rid of weak links
  • 37. Maintenance challenges: DevOps Perspective ● Lightweight - python code ○ Advantage → focus on generic skill set (python coding) ○ Disadvantage → reinventing the wheel ● Fault tolerance - Scale out topology ○ Cheap desktops → quick recovery from Image when hardware crushes ■ Advantage - > ● Cheap, low risk, pay as you go. ● Gets 90 % of the job DONE! ■ Disadvantage ● footprint is hell. ● Management of desktops - requires creativity ● Rely heavily on automation feature: reproducibility from scratch ● Validation automation on infrastructure & Deployment mechanism.
  • 38. Maintenance challenges: DevOps Perspective ● Infrastructure ○ Continuous integration: continuous build, continuous packaging, installer. ○ Continuous Deployment: pre flight check, remote machine (on site, private cloud, cloud) ○ Continuous testing: Sanity, Hourly, Daily, weekly ● Reporting and Monitoring: ○ Extensive report server and analytics on current testing. ○ Monitoring: Green/Red and system real time metrics.
  • 39. Maintenance challenges: Innovation Perspective 1. Innovation on Data generation (using GPU to generate data) 2. Innovation on Competitor DB running time (hashing) 3. Innovation on Workload (one dispatcher, many compute nodes) 4. Innovation on saving testing data /compute time a. Hashing b. Cacheing c. Smart Data Creating - no need to create 100TB CSV to generate 100TB test. d. Logs: Test results were managed on The DBMS itself :) 5. File system ZFS Cluster (utilizing unused disk space, redundancy, deduplication) 6. Nvidia TK1 /GRID GPU → Footprint!!! 7. “Virtualizing GPU” → Footprint!!! a. Dockers b. rCUDA 8. Amazon p2 gpu instances did not exist at that time (even so still very expensive)
  • 40. Building the team Challenge ● Engineering skills required ■ Hardware: Servers ■ IO (storage, disks, RAID, IO patten) ■ FileSystems ■ Linux , CLI, CMD, installing, compiling, GIT ■ Basic network understanding ■ High Availability, Redundancy ■ HPC ● Coding: python. ● QA Big Data mindset ● DevOps mindset ● Automation Mindset ● Systematic Innovation methodologies
  • 41. Lesson learned (insights)● Very hard to guarantee 100% QA coverage ● Quality in a reasonable cost is a huge challenge ● Very easy to miss milestones with QA of big data, ● each mistake create a huge ripple effect in timelines ● Main costs: ○ Employees : training them. ○ Running Time: Coding of innovative algorithms inside testing framework, ○ Maintenance: Automation, Monitoring, Analyzing test results, Environments setup ○ Hardware: innovative DevOps , innovative OPS, research ● Most of our innovation was spent on: ○ Doing more tests with less resources ○ agile improvement of backbone ○ Avoiding big data complexity problems in our testing! ● Industry tools ? great, but not always enough. Know their philosophy... ● Sometimes you need to get your hands dirty ● Keep a fine balance between business and technology
  • 42. Take away message: ● Testing simplifications: ○ Ingress only ○ Egress only ○ Ingress while egress ○ Testing matrix - useful on complex big data systems ● The QA automation team ○ coding/scripting skills is a MUST ○ Engineering skill is a MUST ○ DevOps understanding is a MUST ○ Allocating time for innovation is a MUST ○ Allocating time for cost reduction is a MUST ● The Automation infrastructure ○ KISS ○ MVP ○ IS A PRODUCT by itself. With its own Product manager and Architect
  • 43. Special Thanks... ● Citi innovation Center ● Asaf - Birenzvieg Big Data & Data Science - Israel meetup ● PayPal Risk and Data solution Group
  • 44. Stay in touch... ● Omid Vahdaty ● +972-54-2384178