1
Enterprise Distributed Query Service powered by Presto
& Alluxio across clouds @ WalmartLabs
Ashish Tadose
Principal Engineer
2
Agenda
• Data stores @ Walmart Labs
• Motivation for Presto as Distributed Query service
• Multi-tenant managed Distributed Query service
• Alluxio caching to optimize the performance
• Architectural components
• Alluxio to support query federation in hybrid cloud
3
Data stores @ Walmart Labs
Access needs vary from team to team – one solution does not fit all.
4
Motivation for Presto..
• DataLake cluster - powered by on-prem Hadoop/HDFS
• Compute storage colocation – GOOD
• Need to ingest data from all diverse sources – CHALLENGING
• Scaling out compute with growing needs – CHALLENGING
• Need to separate storage & compute / support federated query capability – PRESTO..
• Isolated clusters in private cloud powering dedicated data-marts
Data journey
5
Multi-tenant Query service - requirements
• Simplified query access layer
• Leverage cloud elastic compute
• Better scalability & effective cluster utilization by auto-scaling
• Performant query response times
• Security
– Authentication – LDAP
– Authorization – work with existing policies
• Handle sensitive data – encryption at rest & over the wire
• Efficient monitoring & alerting
• Dedicated quotas – SLA guarantees
• Flexibility to tune query configuration per tenant
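One way the dedicated-quota requirement could be expressed is with Presto's file-based resource groups. The sketch below builds such a configuration in Python and prints it as JSON. The tenant names, the limits, and the convention that a client's `source` string starts with the tenant name are all assumptions for illustration, not values from the talk.

```python
import json

# Hypothetical tenants and limits; none of these values come from the talk.
TENANTS = {
    "marketing": {"memory": "30%", "concurrency": 20, "queued": 100},
    "finance": {"memory": "20%", "concurrency": 10, "queued": 50},
}


def build_resource_groups(tenants):
    """Build a Presto resource group configuration (as a dict) with one
    sub-group per tenant under a shared "global" root group."""
    sub_groups = []
    selectors = []
    for name, limits in sorted(tenants.items()):
        sub_groups.append({
            "name": name,
            "softMemoryLimit": limits["memory"],
            "hardConcurrencyLimit": limits["concurrency"],
            "maxQueued": limits["queued"],
        })
        # Assumption: clients set a source string such as "marketing-tableau".
        selectors.append({"source": f"{name}-.*", "group": f"global.{name}"})
    root = {
        "name": "global",
        "softMemoryLimit": "80%",
        "hardConcurrencyLimit": 100,
        "maxQueued": 1000,
        "subGroups": sub_groups,
    }
    return {"rootGroups": [root], "selectors": selectors}


config = build_resource_groups(TENANTS)
print(json.dumps(config, indent=2))
```

The printed JSON could be saved as the file referenced by `resource-groups.config-file`. Routing by Unix group is not shown here, since it would need custom logic.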
6
Architectural components
• Authentication
– Presto LDAP
– Custom authentication service
• Authorization
– Custom Presto Ranger plugin
– Hadoop impersonation support
• Quota
– Presto resource groups
• Query configuration tuning
– Session property managers
– Customizations to make it work for Unix groups
• Query audit
– Presto’s event listener framework
• Auto-scaling in GCP
– GCP instance group auto-scaling
– Auto-scaling based on CPU load and queued queries
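The auto-scaling signal described here (CPU load plus queued queries) can be illustrated with a small decision function. This is a minimal sketch under assumed thresholds: the target CPU percentage, the queued-queries-per-worker ratio, and the worker bounds are hypothetical and would be tuned per cluster. It does not call any GCP API.

```python
from dataclasses import dataclass


@dataclass
class ClusterStats:
    workers: int          # current number of Presto workers
    avg_cpu_pct: int      # average worker CPU utilization, 0-100
    queued_queries: int   # queries waiting in the coordinator queue


def ceil_div(a, b):
    """Integer ceiling division, avoiding floating-point rounding."""
    return (a + b - 1) // b


def desired_workers(stats, min_workers=3, max_workers=50,
                    target_cpu_pct=60, queued_per_worker=5):
    """Return the worker count suggested by CPU load and queue depth.

    All default thresholds are illustrative assumptions.
    """
    # Size the cluster so average CPU would fall back to the target.
    by_cpu = ceil_div(stats.workers * stats.avg_cpu_pct, target_cpu_pct)
    # Add one worker per `queued_per_worker` waiting queries, if any.
    by_queue = 0
    if stats.queued_queries > 0:
        by_queue = stats.workers + ceil_div(stats.queued_queries, queued_per_worker)
    wanted = max(by_cpu, by_queue)
    # Clamp to the allowed range; the floor keeps a minimum set of workers.
    return max(min_workers, min(max_workers, wanted))


print(desired_workers(ClusterStats(workers=10, avg_cpu_pct=90, queued_queries=0)))
print(desired_workers(ClusterStats(workers=10, avg_cpu_pct=30, queued_queries=12)))
```

Taking the larger of the two signals means a long queue can trigger scale-out even when CPU looks idle, for example when queries are blocked on IO.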
7
Presto & Alluxio
Works well together…
[Charts comparing Presto with Presto + Alluxio: small range query response time (lower is better), large scan query response time (lower is better), concurrency (higher is better)]
• Query performance bottlenecks
• Unpredictable network IO
• Query pattern - datasets modelled in star schema could benefit from dimension table caching
• Presto + Alluxio
• Avoids unpredictable network IO
• Consistent query latency
• Higher throughput and better concurrency
8
Presto + Alluxio – architectural components
• Presto + Alluxio collocated cluster
• Metadata sync components to automatically create Alluxio-backed tables and Alluxio mount points
• Tweak auto-scaling to keep a minimum number of Alluxio workers
• Pin frequently used dimension tables to avoid cache evictions
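Why pinning helps can be seen with a toy cache. The sketch below is not Alluxio code. It is a small LRU cache with a pin set, using made-up table and partition names, showing that a pinned dimension table survives a scan over fact partitions that would otherwise evict it. In Alluxio itself, pinning is done from the command line with `alluxio fs pin <path>`.

```python
from collections import OrderedDict


class PinnableLRUCache:
    """Toy LRU cache in which pinned keys are never chosen for eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()  # least recently used first
        self._pinned = set()

    def pin(self, key):
        self._pinned.add(key)

    def read(self, key):
        """Return True on a cache hit; on a miss, load the key and return False."""
        if key in self._entries:
            self._entries.move_to_end(key)
            return True
        self._entries[key] = None
        self._evict_if_needed()
        return False

    def _evict_if_needed(self):
        while len(self._entries) > self.capacity:
            # Evict the least recently used entry that is not pinned.
            victim = next((k for k in self._entries if k not in self._pinned), None)
            if victim is None:
                break  # everything is pinned, nothing can be evicted
            del self._entries[victim]


def dimension_hit_after_scan(pin_dimension):
    """Read a dimension table, scan three fact partitions, read it again."""
    cache = PinnableLRUCache(capacity=3)
    if pin_dimension:
        cache.pin("dim_store")          # hypothetical dimension table
    cache.read("dim_store")
    for partition in ("fact_p1", "fact_p2", "fact_p3"):  # hypothetical partitions
        cache.read(partition)
    return cache.read("dim_store")


print("hit without pin:", dimension_hit_after_scan(False))
print("hit with pin:   ", dimension_hit_after_scan(True))
```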
9
Presto + Alluxio – hybrid cloud
• Ability to query datasets that couldn’t make it to public clouds
• Alluxio greatly improved the performance of recurrent queries by avoiding network hops
• Avoids creating copies of datasets in clouds – Alluxio mounts reflect file and metadata changes
• Enabled query guards in Presto to avoid abuse of these connectors
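The slides do not say what the query guards check, so the following is only a sketch of what such a guard might look like. The guarded catalog name, the partition-filter rule, and the scan-size limit are all assumptions, and the function is standalone rather than wired into any Presto plugin interface.

```python
from dataclasses import dataclass

# Hypothetical settings; the talk does not specify the guard rules.
GUARDED_CATALOGS = frozenset({"onprem_hive"})
MAX_SCAN_BYTES = 500 * 1024 ** 3  # 500 GiB


@dataclass(frozen=True)
class QueryInfo:
    user: str
    catalogs: frozenset          # catalogs referenced by the query
    estimated_scan_bytes: int    # planner estimate of bytes to read
    has_partition_filter: bool   # True if the query filters on a partition key


def check_query(query):
    """Return (allowed, reason) for a query touching cross-cloud catalogs."""
    if not (query.catalogs & GUARDED_CATALOGS):
        return True, "no guarded catalog involved"
    if not query.has_partition_filter:
        return False, "queries on guarded catalogs need a partition filter"
    if query.estimated_scan_bytes > MAX_SCAN_BYTES:
        return False, "estimated scan exceeds the guard limit"
    return True, "within guard limits"


full_scan = QueryInfo("analyst", frozenset({"onprem_hive"}), 10 * 1024 ** 3, False)
print(check_query(full_scan))
```

A guard along these lines would reject unbounded scans over the on-prem link before they consume cross-cloud bandwidth.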
10
Query service backed by Presto + Alluxio
We provide analysts with a query tool for interactive ad-hoc analysis over different source systems through a unified SQL query interface.
Query service hosted in GCP & on-prem is powered by Presto + Alluxio and is offered as a managed distributed service.
We also help business teams optimize their SQL queries to make sure they run within the expected time.
• Put ALL your data to work
• SQL on Anything
• Optimized performance
• Improve data’s time to value
• Increase Your Optionality
• Never get deprived of cluster resources
• Using the platform’s federated query capabilities, data can be queried and joined from multiple data sources
• No fancy technology needed to query data. All you need is ANSI SQL
• Performance boost compared to Hive
• Queries can be executed across any number of data resources regardless of where the data resides
• With auto-scaling in place, queries always get enough resources to perform fast
• Choose the BI tool of your own choice
11
THANKS!