Applications of
HBase/Phoenix @ 23andMe
Tulasi Paradarami, Engineering Tech Lead
Marcell Ortutay, Sr. Software Engineer
Copyright © 2017 23andMe, Inc. All rights reserved.
Agenda
● Phoenix Use Cases
○ Identical By Descent (IBD)
○ Ancestry Composition
● Hash Joins
● Cluster Configuration
● Lessons Learnt
● Q & A
BigData @ 23andMe
● Over 5 million customers, with ~1 million genotyped and tens of millions of
imputed SNPs per customer
● Close to a trillion shared segments and relationships
● Billions of customer-reported survey answers
● Support real-time access to data at scale
● Simple arbitrary joins and aggregations
● Enable internal research through fast random access to any genetic
location across the database
HBase/Phoenix Cluster
● HBase 1.3, Phoenix 4.11
● Mirror cluster for load balancing and hot failover
● Live and mirror clusters run in different AWS availability zones
● Data Pipeline
1. Spark-based computation job stages data to HDFS
2. MR cluster generates incremental HFiles from the staged data
3. HFiles are transferred to the live and mirror HBase clusters
4. HFiles are backed up to S3 for disaster recovery
Architecture
[Diagram: live cluster in us-west-2a and mirror cluster in us-west-2b; API hosts connect through phoenixdb to a PQS cluster in front of each Phoenix/HBase cluster; an import cluster transfers HFiles to both clusters and backs them up to an S3 bucket (99.999999999% durability)]
HBase
● 60 region servers (d2.2xlarge)
● 8 vCPUs, 60G RAM (30G RS heap)
● 6 * 2T HDD
● enhanced networking
PQS cluster
● 10 * c5.4xlarge
● 16 vCPUs, 32G RAM
● EBS only
● enhanced networking
Phoenix Optimizations
● FAST_DIFF block encoding
● GZ compression
● Enable ROW-level bloom filters
● Salted tables
● Tune queries for RANGE SCAN or SKIP SCAN (HBase SEEK_NEXT_USING_HINT for
intra-region scans) instead of FULL SCAN
● Apply filters to leverage predicate pushdown
Query Performance
● RELATION
● SEGMENT
[Charts: query latency for the RELATION and SEGMENT tables]
Close Relatives
SELECT r1.person_id_2
FROM relation r1
WHERE r1.person_id_1 = ?
AND r1.half > ?;
Shared Segments
SELECT *
FROM segment
WHERE person_id_2 IN (?, ?) AND
person_id_1 = ? AND
cm > ?
ORDER BY chromosome, start_
Relatives in Common
SELECT r1.person_id_2
FROM relation r1 JOIN relation r2
ON r1.person_id_2 = r2.person_id_2
WHERE r1.person_id_1 = <your-id>
AND r2.person_id_1 = <relative’s-id>
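The self-join above can be exercised end to end against a toy dataset. This sketch uses SQLite in place of Phoenix, and the ids are invented, so it illustrates the query shape rather than the production schema:

```python
import sqlite3

# Toy in-memory stand-in for the RELATION table described in the slides;
# the real data lives in Phoenix/HBase, and these ids are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE relation (person_id_1 INTEGER, person_id_2 INTEGER)")
conn.executemany(
    "INSERT INTO relation VALUES (?, ?)",
    [(1, 10), (1, 11), (1, 12),   # person 1's relatives
     (2, 11), (2, 12), (2, 13)],  # person 2's relatives
)

# Same self-join as the slide: relatives shared by person 1 and person 2.
rows = conn.execute("""
    SELECT r1.person_id_2
    FROM relation r1 JOIN relation r2
      ON r1.person_id_2 = r2.person_id_2
    WHERE r1.person_id_1 = 1
      AND r2.person_id_1 = 2
""").fetchall()

print(sorted(r[0] for r in rows))  # relatives in common: [11, 12]
```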
Ancestry Composition (use case II)
● 23andMe's Ancestry Composition report is a powerful and well-tested
system for analyzing ancestry based on DNA
● Your Ancestry Composition report shows the percentage of your DNA that
comes from 31 populations
● We use Phoenix to join data from segments, relations, and
customer-reported survey answers to build the report
Ancestry Composition
[Diagram: Ancestry Composition query, with subqueries common across queries over tables holding billions of rows each]
Joins
Eye Color ID Color
1 blue
2 green
3 brown
Person ID Eye Color ID
1 2
2 1
2 2
4 3
Joins
Eye Color ID Color
1 blue
2 green
3 brown
Person ID Eye Color ID Color
1 2 green
2 1 blue
2 2 green
4 3 brown
Hash Joins
Person ID Eye Color ID
1 2
2 1
2 2
4 3
1 => “blue”
2 => “green”
3 => “brown”
- Iterate over rows
- Use hashmap to look up
joined values
- O(N) performance
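The bullets above can be sketched in a few lines of plain Python, using the eye-color tables from the previous slides (this is the textbook hash-join idea, not Phoenix internals):

```python
# Minimal hash-join sketch: build a hashmap from the small table, then
# make one pass over the big table with O(1) lookups, giving O(N) overall.
eye_color = {1: "blue", 2: "green", 3: "brown"}   # small table -> hashmap
person = [(1, 2), (2, 1), (2, 2), (4, 3)]          # (person_id, eye_color_id)

joined = [(pid, cid, eye_color[cid]) for pid, cid in person]
print(joined)
# [(1, 2, 'green'), (2, 1, 'blue'), (2, 2, 'green'), (4, 3, 'brown')]
```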
Hash Joins in Phoenix
[Diagram: client and region servers]
Example query:
SELECT * FROM person
JOIN eye_color ON eye_color.id = person.eye_color_id
Hash Joins in Phoenix
[Diagram: client and region servers]
Step 1: Build the hash join table
Step 2: Distribute the hash join table
Step 3: Execute join algorithm on Region Servers
Step 4: Release the hash join tables, return results
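The four steps can be simulated with region servers modeled as plain list partitions; the row values and partitioning below are illustrative, not Phoenix's actual wire protocol:

```python
# Toy simulation of the broadcast hash join steps above.
eye_color_rows = [(1, "blue"), (2, "green"), (3, "brown")]
region_partitions = [   # person table split across three "region servers"
    [(1, 2), (2, 1)],
    [(2, 2)],
    [(4, 3)],
]

# Step 1: client builds the hash join table from the smaller side
hash_table = dict(eye_color_rows)

# Step 2: broadcast a copy to every region server
caches = [dict(hash_table) for _ in region_partitions]

# Step 3: each region server joins its local rows against its copy
results = []
for cache, partition in zip(caches, region_partitions):
    results.extend((pid, cid, cache[cid]) for pid, cid in partition)

# Step 4: release the hash join tables and return the merged results
caches.clear()
print(sorted(results))
```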
Hash Joins in Phoenix
Step 1: Build the hash join table
Step 2: Distribute the hash join table
Step 3: Execute join algorithm on Region Servers
Step 4: Release the hash join tables, return results
Timing breakdown:
Phase 1 (generate hash join table): Step 1 is compute/IO-bound, Step 2 is network-bound
Phase 2 (execute join): Step 3 is compute/IO-bound, Step 4 returns results
Hash Joins in Phoenix - Pros/Cons
Good:
- Hash joins are fast, O(N) performance
Bad:
- Broadcasting can be slow for large hash tables
- The hash table must be rebuilt and re-broadcast on every execution
SELECT * FROM person
WHERE person.some_value IN (SELECT value FROM large_expensive_table)
AND person_id = 1234;
- In this example, only person_id changes between executions
Hash Joins in Phoenix - Pros/Cons
Performance breakdown for example query
[Chart: per-region-server timing for the query below]
SELECT * FROM person
WHERE person.some_value IN (SELECT value FROM large_expensive_table)
AND person_id = 1234;
Subquery caching to improve hash joins
SELECT * FROM person
WHERE person.some_value IN (SELECT value FROM large_expensive_table)
AND person_id = 1234;
- The subquery against “large_expensive_table” often repeats across queries
- In the example, only person_id changes
- Application-level caching is not possible when the full query is not identical
What if we keep the subquery results around?
Subquery caching to improve hash joins
New hint! /*+ USE_PERSISTENT_CACHE */
SELECT /*+ USE_PERSISTENT_CACHE */ * FROM person
WHERE person.some_value IN (SELECT value FROM large_expensive_table)
AND person_id = 1234;
- Each subquery is assigned a cache key based on the hash of its query
statement
- E.g. Hash(“SELECT value FROM large_expensive_table”) = 0xa034bf...
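A cache key of this kind can be sketched with a standard hash function; Phoenix's actual key derivation is internal and may differ from this illustration:

```python
import hashlib

# Illustrative only: derive a stable cache key from the subquery text.
def cache_key(subquery_sql: str) -> str:
    return hashlib.sha256(subquery_sql.encode("utf-8")).hexdigest()

key = cache_key("SELECT value FROM large_expensive_table")
print(key[:8])  # same statement always yields the same key
```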
Subquery caching to improve hash joins
SELECT /*+ USE_PERSISTENT_CACHE */ * FROM person
WHERE person.some_value IN (SELECT value FROM large_expensive_table)
AND person_id = 1234;
First execution:
1. Generate hash join table (query on large_expensive_table)
2. Scan person table and execute join
O(large_expensive_table) + O(person)
Second execution:
1. Scan person table and execute join
O(person)
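The first-versus-second execution cost can be demonstrated with a toy memoized subquery; the counter stands in for real query cost, and none of this is Phoenix's implementation:

```python
# Sketch of the caching effect: the expensive subquery runs once, and
# later executions only pay for the person-table scan.
subquery_runs = 0
cache = {}

def expensive_subquery():
    global subquery_runs
    subquery_runs += 1
    return {"a", "b", "c"}  # pretend result of large_expensive_table

def run_query(person_id, person_table):
    # Reuse the cached hash join table when present (USE_PERSISTENT_CACHE idea).
    key = "SELECT value FROM large_expensive_table"
    if key not in cache:
        cache[key] = expensive_subquery()
    values = cache[key]
    return [row for row in person_table if row[0] == person_id and row[1] in values]

people = [(1234, "a"), (1234, "z"), (5678, "b")]
run_query(1234, people)   # first execution: subquery + scan
run_query(5678, people)   # second execution: scan only
print(subquery_runs)      # the expensive subquery ran once
```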
Hash Joins in Phoenix - with subquery cache
Performance breakdown for example query
[Chart: per-region-server timing for the query below, without the hint]
SELECT * FROM person
WHERE person.some_value IN (SELECT value FROM large_expensive_table)
AND person_id = 1234;
Hash Joins in Phoenix - with subquery cache
Performance breakdown for example query
[Chart: per-region-server timing for the query below, with the hint]
SELECT /*+ USE_PERSISTENT_CACHE */ * FROM person
WHERE person.some_value IN (SELECT value FROM large_expensive_table)
AND person_id = 1234;
Subquery Cache at 23andMe
- Countries of Ancestry speedup:
- Before: ~3 seconds/query
- After: ~1 second/query
- Your DNA Family speedup:
- Before: 30+ seconds/query
- After: ~0.2 seconds/query
New config for subquery cache
phoenix.coprocessor.maxServerCachePersistenceTimeToLiveMs
- Sets TTL for items in persistent subquery cache
Future work for subquery cache
- Add cache key hint
- /*+ USE_PERSISTENT_CACHE(“2017-01-01”) */
- Treated as a suffix, so the cache key becomes:
Key = Hash(“...”) + “2017-01-01”
- Allows explicit cache eviction and cache warming
- Efficiently handle large caches (off-heap?)
- E.g. 1GB+
- Plug into caching systems like memcached or Redis?
Cluster Configuration
HDFS Properties
● dfs.datanode.max.transfer.threads = 4096 (formerly dfs.datanode.max.xcievers)
○ Default value of 256 is too low for production
○ Server-side threads used for data-transfer connections
○ Each thread uses about 1 MB, so tune it!
● dfs.datanode.handler.count = 36
● dfs.client.read.shortcircuit = true
○ Read directly from local disk
● dfs.namenode.avoid.read.stale.datanode = true
● dfs.namenode.avoid.write.stale.datanode = true
OS settings
● Disable transparent hugepages
● vm.swappiness = 0
Phoenix Properties
● phoenix.query.queueSize = 5000
● phoenix.coprocessor.maxServerCacheTimeToLiveMs = 90s
● phoenix.query.force.rowkeyorder = false
● phoenix.query.maxServerCacheBytes = 256M
HBase Properties
● Disable the default major compaction schedule (run major compactions on demand)
● zookeeper.session.timeout = 30s
● hbase.regionserver.handler.count = 48
○ Default value of 10 is too low for production loads
○ Size as a factor of the number of cores and disks
● hbase.hregion.memstore.flush.size
Phoenix Query Timeouts
● phoenix.query.timeoutMs
● hbase.client.scanner.caching
● hbase.client.scanner.timeout.period
● hbase.rpc.timeout
● hbase.regionserver.lease.period
Lessons learnt in production
● Enable enhanced networking in AWS
● Host all region servers and PQS hosts within the same AZ for consistent
performance
● Disable statistics collection
○ Duplicate rows
○ Impact on query performance
● Monitoring!
JIRA tickets
● PHOENIX-4666: Subquery caching for hash join
● PHOENIX-4751: Client side hash aggregation for
sort merge join
Q & A

PhoenixCon 2018 - Applications of HBase/Phoenix @ 23andMe

  • 1. Applications of HBase/Phoenix @ 23andMe Tulasi Paradarami, Engineering Tech Lead Marcell Ortutay, Sr. Software Engineer
  • 2. 2Copyright © 2017 23andMe, Inc. All rights reserved. Agenda ● Phoenix Use Cases ○ Identical By Descent (IBD) ○ Ancestry Composition ● Hash Joins ● Cluster Configuration ● Lessons Learnt ● Q & A
  • 3. 3Copyright © 2017 23andMe, Inc. All rights reserved. BigData @ 23andMe ● Over 5 million customers, 1 million genotyped and 10s of millions of imputed SNPs ● Close to trillion shared segments and relationships ● Billions of customer reported survey answers ● Support real-time access to data at scale ● Simple arbitrary joins and aggregations ● Enable internal research through fast random access to any genetic location across the database
  • 4. 4Copyright © 2017 23andMe, Inc. All rights reserved. HBase/Phoenix Cluster ● HBase 1.3, Phoenix 4.11 ● Mirror cluster for load balancing and hot failover ● Live and mirror clusters run in different AWS availability zones ● Data Pipeline 1. Spark based computation job stages data to HDFS 2. MR cluster for generating incremental HFiles from staged data 3. Transfer HFiles to live and mirror HBase clusters 4. Backup HFiles to S3 for disaster recovery
  • 5. 5Copyright © 2017 23andMe, Inc. All rights reserved. Phoenix HBase ● 60 region servers ● 8 vCPUs, 60G RAM ● 30G RS (d2.2xlarge) ● 6 * 2T HDD ● enhanced networking PQS cluster ● 10 c5.4xlarge ● 16 vCPUs, 32G RAM ● EBS only ● enhanced networking Import Cluster HFile transfer HFile transfer HFile Backup PQS cluster Phoenix HBase ... API hosts/ phoenixdb Architecture us-west-2a us-west-2b S3 bucket (99.999999999% durability) live mirror
  • 6. 6Copyright © 2017 23andMe, Inc. All rights reserved. Phoenix Optimizations ● FAST_DIFF block encoding ● GZ compression ● Enable ROW level bloomfilters ● Salted tables ● Tune query for RANGE SCAN or SKIP SCAN (HBase SEEK_NEXT_USING_HINT for intra region scan) over FULL SCAN ● Apply filters to leverage predicate push down
  • 7. 7Copyright © 2017 23andMe, Inc. All rights reserved. Query Performance ● RELATION ● SEGMENT
  • 8. 8Copyright © 2017 23andMe, Inc. All rights reserved. Close Relatives SELECT r1.person_id_2 FROM relation r1 WHERE r1.person_id_1 = ? AND r1.half > ?;
  • 9. 9Copyright © 2017 23andMe, Inc. All rights reserved. Shared Segments SELECT * FROM segment WHERE person_id_2 IN (?, ?) AND person_id_1 = ? AND cm > ? ORDER BY chromosome, start_
  • 10. 10Copyright © 2017 23andMe, Inc. All rights reserved. Relatives in Common SELECT r1.person_id_2 FROM relation r1 JOIN relation r2 ON r1.person_id_2 = r2.person_id_2 WHERE r1.person_id_1 = <your-id> AND r2.person_id_1 = <relative’s-id>
  • 11. 11Copyright © 2017 23andMe, Inc. All rights reserved. Ancestry Composition (use case II) ● 23andMe's Ancestry Composition report is a powerful and well-tested system for analyzing ancestry based on DNA ● Your Ancestry Composition report shows the percentage of your DNA that comes from 31 populations ● We use Phoenix to join data from segments, relations and customer-reported survey answers to build the report
  • 12. 12Copyright © 2017 23andMe, Inc. All rights reserved. Ancestry Composition (diagram annotations): common across queries; billions of rows per table
  • 13. 13Copyright © 2017 23andMe, Inc. All rights reserved. Joins — two input tables. Eye Color (ID, Color): (1, blue), (2, green), (3, brown). Person (ID, Eye Color ID): (1, 2), (2, 1), (2, 2), (4, 3)
  • 14. 14Copyright © 2017 23andMe, Inc. All rights reserved. Joins — joined result (Person ID, Eye Color ID, Color): (1, 2, green), (2, 1, blue), (2, 2, green), (4, 3, brown)
  • 15. 15Copyright © 2017 23andMe, Inc. All rights reserved. Hash Joins — Person (ID, Eye Color ID): (1, 2), (2, 1), (2, 2), (4, 3); hashmap: 1 => “blue”, 2 => “green”, 3 => “brown”. Algorithm: iterate over rows, use the hashmap to look up joined values; O(N) performance
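The hash join on this slide can be sketched in a few lines of Python. This is an illustrative toy, not Phoenix internals; the table and column names mirror the eye-color example from the preceding slides:

```python
# Toy hash join over the eye-color example: build a hashmap from the small
# table, then stream the large table and probe the map -- O(N) overall.
def hash_join(small, large, key_small, key_large):
    # Phase 1: build the hash table from the smaller relation
    lookup = {row[key_small]: row for row in small}
    # Phase 2: probe the hash table while scanning the larger relation
    joined = []
    for row in large:
        match = lookup.get(row[key_large])
        if match is not None:
            joined.append({**row, **match})
    return joined

eye_color = [{"id": 1, "color": "blue"},
             {"id": 2, "color": "green"},
             {"id": 3, "color": "brown"}]
person = [{"person_id": 1, "eye_color_id": 2},
          {"person_id": 2, "eye_color_id": 1},
          {"person_id": 2, "eye_color_id": 2},
          {"person_id": 4, "eye_color_id": 3}]

result = hash_join(eye_color, person, "id", "eye_color_id")
```

Running this yields the joined rows shown on slide 14, with each person row annotated with its color.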
  • 16. 16Copyright © 2017 23andMe, Inc. All rights reserved. Hash Joins in Phoenix Client Region servers Example query: SELECT * FROM person JOIN eye_color ON eye_color.id = person.eye_color_id
  • 17. 17Copyright © 2017 23andMe, Inc. All rights reserved. Hash Joins in Phoenix Client Region servers Step 1: Build the hash join table
  • 18. 18Copyright © 2017 23andMe, Inc. All rights reserved. Hash Joins in Phoenix Client Region servers Step 1: Build the hash join table
  • 19. 19Copyright © 2017 23andMe, Inc. All rights reserved. Hash Joins in Phoenix Client Region servers Step 1: Build the hash join table
  • 20. 20Copyright © 2017 23andMe, Inc. All rights reserved. Hash Joins in Phoenix Client Region servers Step 1: Build the hash join table Step 2: Distribute the hash join table
  • 21. 21Copyright © 2017 23andMe, Inc. All rights reserved. Hash Joins in Phoenix Client Region servers Step 1: Build the hash join table Step 2: Distribute the hash join table Step 3: Execute join algorithm on Region Servers
  • 22. 22Copyright © 2017 23andMe, Inc. All rights reserved. Hash Joins in Phoenix Client Region servers Step 1: Build the hash join table Step 2: Distribute the hash join table Step 3: Execute join algorithm on Region Servers Step 4: Release the hash join tables, return results
  • 23. 23Copyright © 2017 23andMe, Inc. All rights reserved. Hash Joins in Phoenix Step 1: Build the hash join table Step 2: Distribute the hash join table Step 3: Execute join algorithm on Region Servers Step 4: Release the hash join tables, return results Step 1: compute/IO Step 2: network Step 3: compute/IO Step 4 Phase 1: generate hash join table Phase 2: execute join Timing breakdown
  • 24. 24Copyright © 2017 23andMe, Inc. All rights reserved. Hash Joins in Phoenix - Pros/Cons Good: - Hash joins are fast, O(N) performance Bad: - Broadcasting can be slow for large hash tables - Needs to re-run every time SELECT * FROM person WHERE person.some_value IN (SELECT value FROM large_expensive_table) AND person_id = 1234; - Only person_id changes
  • 25. 25Copyright © 2017 23andMe, Inc. All rights reserved. Hash Joins in Phoenix - Pros/Cons Performance breakdown for example query Region server 1 Region server 2 Region server 3 SELECT * FROM person WHERE person.some_value IN (SELECT value FROM large_expensive_table) AND person_id = 1234;
  • 26. 26Copyright © 2017 23andMe, Inc. All rights reserved. Subquery caching to improve hash joins SELECT * FROM person WHERE person.some_value IN (SELECT value FROM large_expensive_table) AND person_id = 1234; - Sometimes subquery to “large_expensive_table” repeats across queries - In the example, only person_id changes - Application level caching not possible if query is not identical What if we keep the subquery results around?
  • 27. 27Copyright © 2017 23andMe, Inc. All rights reserved. Subquery caching to improve hash joins New hint! /*+ USE_PERSISTENT_CACHE */ SELECT /*+ USE_PERSISTENT_CACHE */ * FROM person WHERE person.some_value IN (SELECT value FROM large_expensive_table) AND person_id = 1234; - Each subquery gets assigned a cache key based on the hash of its query statement - Eg. Hash(“SELECT value FROM large_expensive_table”) = 0xa034bf...
  • 28. 28Copyright © 2017 23andMe, Inc. All rights reserved. Subquery caching to improve hash joins SELECT /*+ USE_PERSISTENT_CACHE */ * FROM person WHERE person.some_value IN (SELECT value FROM large_expensive_table) AND person_id = 1234; First execution: 1. Generate hash join table (query on large_expensive_table) 2. Scan person table and execute join O(large_expensive_table) + O(person) Second execution: 1. Scan person table and execute join O(person)
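The caching behavior above can be modeled as follows. This is a toy sketch, not the Phoenix implementation: a plain dict stands in for the server-side persistent cache, and `expensive_execute` stands in for actually running the subquery:

```python
import hashlib

cache = {}  # stands in for the region-server-side persistent cache

def cache_key(subquery_sql):
    # Key is a hash of the subquery statement, e.g.
    # Hash("SELECT value FROM large_expensive_table") -> "a034bf..."
    return hashlib.sha256(subquery_sql.encode("utf-8")).hexdigest()

def run_subquery_cached(subquery_sql, execute):
    key = cache_key(subquery_sql)
    if key not in cache:            # first execution: pay the full cost
        cache[key] = execute(subquery_sql)
    return cache[key]               # later executions: served from cache

calls = []
def expensive_execute(sql):
    calls.append(sql)               # count how often the subquery really runs
    return {"v1", "v2"}

sql = "SELECT value FROM large_expensive_table"
first = run_subquery_cached(sql, expensive_execute)
second = run_subquery_cached(sql, expensive_execute)
```

The second call returns the same hash join table without touching `large_expensive_table`, which is exactly the O(large_expensive_table) term that drops out on repeat executions.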
  • 29. 29Copyright © 2017 23andMe, Inc. All rights reserved. Hash Joins in Phoenix - with subquery cache Performance breakdown for example query Region server 1 Region server 2 Region server 3 SELECT * FROM person WHERE person.some_value IN (SELECT value FROM large_expensive_table) AND person_id = 1234;
  • 30. 30Copyright © 2017 23andMe, Inc. All rights reserved. Hash Joins in Phoenix - with subquery cache Performance breakdown for example query Region server 1 Region server 2 Region server 3 SELECT /*+ USE_PERSISTENT_CACHE */ * FROM person WHERE person.some_value IN (SELECT value FROM large_expensive_table) AND person_id = 1234;
  • 31. 31Copyright © 2017 23andMe, Inc. All rights reserved. Hash Joins in Phoenix - with subquery cache Performance breakdown for example query Region server 1 Region server 2 Region server 3 SELECT /*+ USE_PERSISTENT_CACHE */ * FROM person WHERE person.some_value IN (SELECT value FROM large_expensive_table) AND person_id = 1234;
  • 32. 32Copyright © 2017 23andMe, Inc. All rights reserved. Hash Joins in Phoenix - with subquery cache Performance breakdown for example query Region server 1 Region server 2 Region server 3 SELECT /*+ USE_PERSISTENT_CACHE */ * FROM person WHERE person.some_value IN (SELECT value FROM large_expensive_table) AND person_id = 1234;
  • 33. 33Copyright © 2017 23andMe, Inc. All rights reserved. Subquery Cache at 23andMe - Countries of Ancestry speedup: - Before: ~3 seconds/query - After: ~1 second/query - Your DNA Family speedup: - Before: 30+ seconds/query - After: ~0.2 seconds/query
  • 34. 34Copyright © 2017 23andMe, Inc. All rights reserved. New config for subquery cache phoenix.coprocessor.maxServerCachePersistenceTimeToLiveMs - Sets TTL for items in persistent subquery cache
  • 35. 35Copyright © 2017 23andMe, Inc. All rights reserved. Future work for subquery cache - Add cache key hint - /*+ USE_PERSISTENT_CACHE(“2017-01-01”) */ - Treated as a suffix, so cache key is now: Key = Hash(“...”) + “2017-01-01” - Allows explicit cache eviction, cache warming - Efficiently handle large caches (offheap?) - Eg. 1GB+ - Plug into caching systems like memcache, redis?
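The proposed cache-key hint could work along these lines (hypothetical, mirroring the proposal above — the suffix is appended to the statement hash, so rotating it evicts the old entry):

```python
import hashlib

def cache_key(subquery_sql, suffix=None):
    # Default key: hash of the statement text
    key = hashlib.sha256(subquery_sql.encode("utf-8")).hexdigest()
    # Proposed hint: user-supplied suffix appended to the key, enabling
    # explicit cache eviction/warming by rotating the suffix (e.g. a date)
    if suffix is not None:
        key = key + suffix
    return key

sql = "SELECT value FROM large_expensive_table"
k_default = cache_key(sql)
k_jan = cache_key(sql, "2017-01-01")
k_feb = cache_key(sql, "2017-02-01")
```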
  • 36. 36Copyright © 2017 23andMe, Inc. All rights reserved. Cluster Configuration. HDFS properties: ● dfs.datanode.max.xceivers = 4096 ○ Default value of 256 is too low for production ○ Server-side threads used for connections ○ 1 MB per thread, so tune it! ● dfs.datanode.handler.count = 36 ● dfs.client.read.shortcircuit = true ○ Read directly from local disk ● dfs.namenode.avoid.read.stale.datanode = true ● dfs.namenode.avoid.write.stale.datanode = true. OS settings: ● Disable transparent hugepages ● vm.swappiness = 0. Phoenix properties: ● phoenix.query.queueSize = 5000 ● phoenix.coprocessor.maxServerCacheTimeToLiveMs = 90s ● phoenix.query.force.rowkeyorder = false ● phoenix.query.maxServerCacheBytes = 256M. HBase properties: ● Disable the default major compaction schedule (run on demand) ● zookeeper.session.timeout = 30s ● hbase.regionserver.handler.count = 48 ○ Default value of 10 is too low for production loads ○ Factor of the number of cores and disks ● hbase.hregion.memstore.flush.size
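A few of the HBase/Phoenix settings above, expressed as an hbase-site.xml fragment (a sketch using the values from this slide; millisecond and byte conversions are ours — tune for your own cluster):

```xml
<!-- hbase-site.xml fragment: values from the slide above, converted to
     the units HBase expects (milliseconds, bytes) -->
<property>
  <name>hbase.regionserver.handler.count</name>
  <value>48</value>
</property>
<property>
  <name>zookeeper.session.timeout</name>
  <value>30000</value> <!-- 30s, in milliseconds -->
</property>
<property>
  <name>phoenix.query.queueSize</name>
  <value>5000</value>
</property>
<property>
  <name>phoenix.coprocessor.maxServerCacheTimeToLiveMs</name>
  <value>90000</value> <!-- 90s, in milliseconds -->
</property>
<property>
  <name>phoenix.query.maxServerCacheBytes</name>
  <value>268435456</value> <!-- 256M -->
</property>
```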
  • 37. 37Copyright © 2017 23andMe, Inc. All rights reserved. Phoenix Query Timeouts ● phoenix.query.timeoutMs ● hbase.client.scanner.caching ● hbase.client.scanner.timeout.period ● hbase.rpc.timeout ● hbase.regionserver.lease.period
  • 38. 38Copyright © 2017 23andMe, Inc. All rights reserved. Lessons learnt in production ● Enable enhanced networking in AWS ● Host all regionservers and PQS within the same AZ for consistent performance ● Disable statistics collection ○ Duplicate rows ○ Impact on query performance ● Monitoring!
  • 39. 39Copyright © 2017 23andMe, Inc. All rights reserved. JIRA tickets ● PHOENIX-4666: Subquery caching for hash join ● PHOENIX-4751: Client side hash aggregation for sort merge join
  • 40. 40Copyright © 2017 23andMe, Inc. All rights reserved. Q & A