SlideShare a Scribd company logo
1 of 11
Download to read offline
1
Enterprise Distributed Query Service powered by Presto
& Alluxio across clouds @ WalmartLabs
Ashish Tadose
Principal Engineer
2
Agenda
• Data stores @ Walmart Labs
• Motivation for Presto as Distributed Query service
• Multi-tenant managed Distributed Query service
• Alluxio caching to optimize the performance
• Architectural components
• Alluxio to support query federation in hybrid
Footer
3
Data stores @ Walmart Labs
Access needs are varied from team to team – one solution does not fit all….
4
Motivation for Presto..
• DataLake cluster - powered by on-prem Hadoop/HDFS
• Compute storage colocation – GOOD
• Need to ingest data from all diverse sources – CHALLENGING
• Scaling out compute with growing needs – CHALLENGING
• Need to separate storage & compute / support federated query capability – PRESTO..
• Isolated clusters in private cloud powering dedicated data-marts
Datajourney
5
• Simplified query access layer
• Leverage cloud elastic compute
• Better scalability & Effective cluster utilization by auto-scaling
• Performant query response times
• Security
– Authentication – LDAP
– Authorization – work with existing policies
• Handle sensitive data – encryption at rest & over the wire
• Efficient Monitoring & alerting
• Dedicated quotas – SLA guarantees
• Flexibility to configure query configuration per tenant
Multi-tenant Query service - requirements
6
• Authentication
– Presto LDAP
– Custom authentication service
• Authorization
– Custom Presto ranger plugin
– Hadoop impersonation support
• Quota
– Presto resource groups
• Query configuration tuning
– Session property managers
– Customizations to make it for Unix groups
• Query audit
– Presto’s event listener framework
• Auto-scaling in GCP
– GCP instance group auto-scaling
– Auto scaling based on CPU load and queued
queries
Architectural components
7
Presto & Alluxio
Works well together…
Small range query response time
Lower is better
Large scan query response time
Lower is better
Concurrency
Higher is better
Presto Presto + Alluxio
• Query performance bottlenecks
• Un-predictable network IO
• Query pattern - Datasets modelled in star
schema could benefit by dimension table
caching
• Presto + Alluxio
• Avoids unpredictable network
• Consistent query latency
• Higher throughput and better concurrency
8
• Presto + Alluxio collocated
cluster
• Meta synch components to
automatically crate alluxio
backed tables and create
alluxio mount points
• Tweak auto scaling to keep
the min number of alluxio
workers
• Pin frequently used dimension
tables to avoid cache
evictions
Presto + Alluxio – architectural components
9
• Ability to query datasets couldn’t make it
to public clouds
• Alluxio greatly improved query
performance to avoid network hops
recurrent queries
• Avoids creating data copies in clouds of
datasets – alluxio mounts file meta
changes
• Enabled query guards in Presto to avoid
abuse of this connectors
Presto + Alluxio – hybrid cloud
10
Page 10
unlimited
Query service backed by Presto + Alluxio
We provide the analyst with a query tool for interactive ad-hoc analysis over different source system through a unified SQL query interface.
Put ALL your data to
work
SQL on Anything
Optimized performance Improve data’s time to
value
Increase Your
Optionality
Never get
deprived of
cluster resources
Query service hosted in GCP & On-
prem is powered by Presto + Alluxio
and is offered as a managed
distributed service.
We also help business in optimizing
their SQL queries to make sure they
run within expected time.
Using the platform’s
Federated Query
Capabilities, data can be
queries and joined from
multiple data sources
No fancy technology
needed to query data.
All you need is ANSI SQL
Performance boost
compared to Hive
Queries can be executed
across any number of
data resources
regardless of where the
data resides
With auto-scaling in place,
queries always get enough
resources to perform fast
Choose BI tool
of your own choice
11
THANKS!
11

More Related Content

What's hot

What's hot (20)

Enable Fast Big Data Analytics on Ceph with Alluxio at Ceph Days 2017
Enable Fast Big Data Analytics on Ceph with Alluxio at Ceph Days 2017 Enable Fast Big Data Analytics on Ceph with Alluxio at Ceph Days 2017
Enable Fast Big Data Analytics on Ceph with Alluxio at Ceph Days 2017
 
Accelerate Cloud Training with Alluxio
Accelerate Cloud Training with AlluxioAccelerate Cloud Training with Alluxio
Accelerate Cloud Training with Alluxio
 
Best Practices for Using Alluxio with Spark
Best Practices for Using Alluxio with SparkBest Practices for Using Alluxio with Spark
Best Practices for Using Alluxio with Spark
 
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stack
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stackAccelerating analytics in the cloud with the Starburst Presto + Alluxio stack
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stack
 
What's New in Alluxio 2.3
What's New in Alluxio 2.3What's New in Alluxio 2.3
What's New in Alluxio 2.3
 
Alluxio + Spark: Accelerating Auto Data Tagging in WeRide
Alluxio + Spark: Accelerating Auto Data Tagging in WeRideAlluxio + Spark: Accelerating Auto Data Tagging in WeRide
Alluxio + Spark: Accelerating Auto Data Tagging in WeRide
 
Hands-on with Alluxio Structured Data Management
Hands-on with Alluxio Structured Data ManagementHands-on with Alluxio Structured Data Management
Hands-on with Alluxio Structured Data Management
 
Accelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud EraAccelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud Era
 
StorageQuery: federated querying on object stores, powered by Alluxio and Presto
StorageQuery: federated querying on object stores, powered by Alluxio and PrestoStorageQuery: federated querying on object stores, powered by Alluxio and Presto
StorageQuery: federated querying on object stores, powered by Alluxio and Presto
 
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey
Modernizing Global Shared Data Analytics Platform and our Alluxio JourneyModernizing Global Shared Data Analytics Platform and our Alluxio Journey
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey
 
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & AlluxioAlluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
 
Data Orchestration for the Hybrid Cloud Era
Data Orchestration for the Hybrid Cloud EraData Orchestration for the Hybrid Cloud Era
Data Orchestration for the Hybrid Cloud Era
 
Building Fast SQL Analytics on Anything with Presto, Alluxio
Building Fast SQL Analytics on Anything with Presto, AlluxioBuilding Fast SQL Analytics on Anything with Presto, Alluxio
Building Fast SQL Analytics on Anything with Presto, Alluxio
 
Optimizing Latency-sensitive queries for Presto at Facebook: A Collaboration ...
Optimizing Latency-sensitive queries for Presto at Facebook: A Collaboration ...Optimizing Latency-sensitive queries for Presto at Facebook: A Collaboration ...
Optimizing Latency-sensitive queries for Presto at Facebook: A Collaboration ...
 
Burst Presto & Spark workloads to AWS EMR with no data copies
Burst Presto & Spark workloads to AWS EMR with no data copiesBurst Presto & Spark workloads to AWS EMR with no data copies
Burst Presto & Spark workloads to AWS EMR with no data copies
 
Hybrid data lake on google cloud with alluxio and dataproc
Hybrid data lake on google cloud  with alluxio and dataprocHybrid data lake on google cloud  with alluxio and dataproc
Hybrid data lake on google cloud with alluxio and dataproc
 
Speeding Up Spark Performance using Alluxio at China Unicom
Speeding Up Spark Performance using Alluxio at China UnicomSpeeding Up Spark Performance using Alluxio at China Unicom
Speeding Up Spark Performance using Alluxio at China Unicom
 
How to Develop and Operate Cloud Native Data Platforms and Applications
How to Develop and Operate Cloud Native Data Platforms and ApplicationsHow to Develop and Operate Cloud Native Data Platforms and Applications
How to Develop and Operate Cloud Native Data Platforms and Applications
 
Accelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud EraAccelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud Era
 
Orchestrate a Data Symphony
Orchestrate a Data SymphonyOrchestrate a Data Symphony
Orchestrate a Data Symphony
 

Similar to Enterprise Distributed Query Service powered by Presto & Alluxio across clouds at WalmartLabs

Presentation Presentation Presentation Presentation Presentation
Presentation Presentation Presentation Presentation PresentationPresentation Presentation Presentation Presentation Presentation
Presentation Presentation Presentation Presentation Presentation
bangel105
 

Similar to Enterprise Distributed Query Service powered by Presto & Alluxio across clouds at WalmartLabs (20)

Enterprise Distributed Query Service powered by Presto & Alluxio across cloud...
Enterprise Distributed Query Service powered by Presto & Alluxio across cloud...Enterprise Distributed Query Service powered by Presto & Alluxio across cloud...
Enterprise Distributed Query Service powered by Presto & Alluxio across cloud...
 
Enterprise Presto PaaS offering in Google Cloud
Enterprise Presto PaaS offering in Google Cloud Enterprise Presto PaaS offering in Google Cloud
Enterprise Presto PaaS offering in Google Cloud
 
C* Summit 2013: Time for a New Relationship - Intuit's Journey from RDBMS to ...
C* Summit 2013: Time for a New Relationship - Intuit's Journey from RDBMS to ...C* Summit 2013: Time for a New Relationship - Intuit's Journey from RDBMS to ...
C* Summit 2013: Time for a New Relationship - Intuit's Journey from RDBMS to ...
 
Interactive Analytics with the Starburst Presto + Alluxio stack for the Cloud
Interactive Analytics with the Starburst Presto + Alluxio stack for the CloudInteractive Analytics with the Starburst Presto + Alluxio stack for the Cloud
Interactive Analytics with the Starburst Presto + Alluxio stack for the Cloud
 
Azure synapse by usama whaba khan
Azure synapse by usama whaba khanAzure synapse by usama whaba khan
Azure synapse by usama whaba khan
 
Big Data in the Cloud - The What, Why and How from the Experts
Big Data in the Cloud - The What, Why and How from the ExpertsBig Data in the Cloud - The What, Why and How from the Experts
Big Data in the Cloud - The What, Why and How from the Experts
 
Building A Self Service Analytics Platform on Hadoop
Building A Self Service Analytics Platform on HadoopBuilding A Self Service Analytics Platform on Hadoop
Building A Self Service Analytics Platform on Hadoop
 
Big Data in the Cloud with Azure Marketplace Images
Big Data in the Cloud with Azure Marketplace ImagesBig Data in the Cloud with Azure Marketplace Images
Big Data in the Cloud with Azure Marketplace Images
 
Deliver Best-in-Class HPC Cloud Solutions Without Losing Your Mind
Deliver Best-in-Class HPC Cloud Solutions Without Losing Your MindDeliver Best-in-Class HPC Cloud Solutions Without Losing Your Mind
Deliver Best-in-Class HPC Cloud Solutions Without Losing Your Mind
 
20160331 sa introduction to big data pipelining berlin meetup 0.3
20160331 sa introduction to big data pipelining berlin meetup   0.320160331 sa introduction to big data pipelining berlin meetup   0.3
20160331 sa introduction to big data pipelining berlin meetup 0.3
 
Solving Office 365 Big Challenges using Cassandra + Spark
Solving Office 365 Big Challenges using Cassandra + Spark Solving Office 365 Big Challenges using Cassandra + Spark
Solving Office 365 Big Challenges using Cassandra + Spark
 
Apache Ignite vs Alluxio: Memory Speed Big Data Analytics
Apache Ignite vs Alluxio: Memory Speed Big Data AnalyticsApache Ignite vs Alluxio: Memory Speed Big Data Analytics
Apache Ignite vs Alluxio: Memory Speed Big Data Analytics
 
Case Study: Implementing Hadoop and Elastic Map Reduce on Scale-out Object S...
Case Study: Implementing Hadoop and Elastic Map Reduce on Scale-out Object S...Case Study: Implementing Hadoop and Elastic Map Reduce on Scale-out Object S...
Case Study: Implementing Hadoop and Elastic Map Reduce on Scale-out Object S...
 
BigDataBx #1 - Atelier 1 Cloudera Datawarehouse Optimisation
BigDataBx #1 - Atelier 1 Cloudera Datawarehouse OptimisationBigDataBx #1 - Atelier 1 Cloudera Datawarehouse Optimisation
BigDataBx #1 - Atelier 1 Cloudera Datawarehouse Optimisation
 
MySQL :What's New #GIDS16
MySQL :What's New #GIDS16MySQL :What's New #GIDS16
MySQL :What's New #GIDS16
 
Presentation Presentation Presentation Presentation Presentation
Presentation Presentation Presentation Presentation PresentationPresentation Presentation Presentation Presentation Presentation
Presentation Presentation Presentation Presentation Presentation
 
HPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journeyHPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journey
 
Apereo OAE - Architectural overview
Apereo OAE - Architectural overviewApereo OAE - Architectural overview
Apereo OAE - Architectural overview
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
 
What is Cloud computing?
What is Cloud computing?What is Cloud computing?
What is Cloud computing?
 

More from Alluxio, Inc.

More from Alluxio, Inc. (20)

Alluxio Monthly Webinar | Simplify Data Access for AI in Multi-Cloud
Alluxio Monthly Webinar | Simplify Data Access for AI in Multi-CloudAlluxio Monthly Webinar | Simplify Data Access for AI in Multi-Cloud
Alluxio Monthly Webinar | Simplify Data Access for AI in Multi-Cloud
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
Optimizing Data Access for Analytics And AI with Alluxio
Optimizing Data Access for Analytics And AI with AlluxioOptimizing Data Access for Analytics And AI with Alluxio
Optimizing Data Access for Analytics And AI with Alluxio
 
Speed Up Presto at Uber with Alluxio Caching
Speed Up Presto at Uber with Alluxio CachingSpeed Up Presto at Uber with Alluxio Caching
Speed Up Presto at Uber with Alluxio Caching
 
Correctly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleCorrectly Loading Incremental Data at Scale
Correctly Loading Incremental Data at Scale
 
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/MLBig Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
 
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...
 
Alluxio Monthly Webinar | Five Disruptive Trends that Every Data & AI Leader...
Alluxio Monthly Webinar | Five Disruptive Trends that Every  Data & AI Leader...Alluxio Monthly Webinar | Five Disruptive Trends that Every  Data & AI Leader...
Alluxio Monthly Webinar | Five Disruptive Trends that Every Data & AI Leader...
 
Data Infra Meetup | FIFO Queues are All You Need for Cache Eviction
Data Infra Meetup | FIFO Queues are All You Need for Cache EvictionData Infra Meetup | FIFO Queues are All You Need for Cache Eviction
Data Infra Meetup | FIFO Queues are All You Need for Cache Eviction
 
Data Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio Edge
Data Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio EdgeData Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio Edge
Data Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio Edge
 
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the CloudData Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
 
Data Infra Meetup | ByteDance's Native Parquet Reader
Data Infra Meetup | ByteDance's Native Parquet ReaderData Infra Meetup | ByteDance's Native Parquet Reader
Data Infra Meetup | ByteDance's Native Parquet Reader
 
Data Infra Meetup | Uber's Data Storage Evolution
Data Infra Meetup | Uber's Data Storage EvolutionData Infra Meetup | Uber's Data Storage Evolution
Data Infra Meetup | Uber's Data Storage Evolution
 
Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...
Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...
Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...
 
AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...
AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...
AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...
 
AI Infra Day | The AI Infra in the Generative AI Era
AI Infra Day | The AI Infra in the Generative AI EraAI Infra Day | The AI Infra in the Generative AI Era
AI Infra Day | The AI Infra in the Generative AI Era
 
AI Infra Day | Hands-on Lab: CV Model Training with PyTorch & Alluxio on Kube...
AI Infra Day | Hands-on Lab: CV Model Training with PyTorch & Alluxio on Kube...AI Infra Day | Hands-on Lab: CV Model Training with PyTorch & Alluxio on Kube...
AI Infra Day | Hands-on Lab: CV Model Training with PyTorch & Alluxio on Kube...
 
AI Infra Day | The Generative AI Market And Intel AI Strategy and Product Up...
AI Infra Day | The Generative AI Market  And Intel AI Strategy and Product Up...AI Infra Day | The Generative AI Market  And Intel AI Strategy and Product Up...
AI Infra Day | The Generative AI Market And Intel AI Strategy and Product Up...
 
AI Infra Day | Composable PyTorch Distributed with PT2 @ Meta
AI Infra Day | Composable PyTorch Distributed with PT2 @ MetaAI Infra Day | Composable PyTorch Distributed with PT2 @ Meta
AI Infra Day | Composable PyTorch Distributed with PT2 @ Meta
 
AI Infra Day | Model Lifecycle Management Quality Assurance at Uber Scale
AI Infra Day | Model Lifecycle Management Quality Assurance at Uber ScaleAI Infra Day | Model Lifecycle Management Quality Assurance at Uber Scale
AI Infra Day | Model Lifecycle Management Quality Assurance at Uber Scale
 

Recently uploaded

Recently uploaded (20)

WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
 
WSO2CON 2024 - Not Just Microservices: Rightsize Your Services!
WSO2CON 2024 - Not Just Microservices: Rightsize Your Services!WSO2CON 2024 - Not Just Microservices: Rightsize Your Services!
WSO2CON 2024 - Not Just Microservices: Rightsize Your Services!
 
AzureNativeQumulo_HPC_Cloud_Native_Benchmarks.pdf
AzureNativeQumulo_HPC_Cloud_Native_Benchmarks.pdfAzureNativeQumulo_HPC_Cloud_Native_Benchmarks.pdf
AzureNativeQumulo_HPC_Cloud_Native_Benchmarks.pdf
 
The mythical technical debt. (Brooke, please, forgive me)
The mythical technical debt. (Brooke, please, forgive me)The mythical technical debt. (Brooke, please, forgive me)
The mythical technical debt. (Brooke, please, forgive me)
 
Wired_2.0_CREATE YOUR ULTIMATE LEARNING ENVIRONMENT_JCON_16052024
Wired_2.0_CREATE YOUR ULTIMATE LEARNING ENVIRONMENT_JCON_16052024Wired_2.0_CREATE YOUR ULTIMATE LEARNING ENVIRONMENT_JCON_16052024
Wired_2.0_CREATE YOUR ULTIMATE LEARNING ENVIRONMENT_JCON_16052024
 
WSO2CON 2024 - Unlocking the Identity: Embracing CIAM 2.0 for a Competitive A...
WSO2CON 2024 - Unlocking the Identity: Embracing CIAM 2.0 for a Competitive A...WSO2CON 2024 - Unlocking the Identity: Embracing CIAM 2.0 for a Competitive A...
WSO2CON 2024 - Unlocking the Identity: Embracing CIAM 2.0 for a Competitive A...
 
[GeeCON2024] How I learned to stop worrying and love the dark silicon apocalypse
[GeeCON2024] How I learned to stop worrying and love the dark silicon apocalypse[GeeCON2024] How I learned to stop worrying and love the dark silicon apocalypse
[GeeCON2024] How I learned to stop worrying and love the dark silicon apocalypse
 
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
 
WSO2Con2024 - Low-Code Integration Tooling
WSO2Con2024 - Low-Code Integration ToolingWSO2Con2024 - Low-Code Integration Tooling
WSO2Con2024 - Low-Code Integration Tooling
 
BusinessGPT - Security and Governance for Generative AI
BusinessGPT  - Security and Governance for Generative AIBusinessGPT  - Security and Governance for Generative AI
BusinessGPT - Security and Governance for Generative AI
 
WSO2CON 2024 - Navigating API Complexity: REST, GraphQL, gRPC, Websocket, Web...
WSO2CON 2024 - Navigating API Complexity: REST, GraphQL, gRPC, Websocket, Web...WSO2CON 2024 - Navigating API Complexity: REST, GraphQL, gRPC, Websocket, Web...
WSO2CON 2024 - Navigating API Complexity: REST, GraphQL, gRPC, Websocket, Web...
 
WSO2CON2024 - Why Should You Consider Ballerina for Your Next Integration
WSO2CON2024 - Why Should You Consider Ballerina for Your Next IntegrationWSO2CON2024 - Why Should You Consider Ballerina for Your Next Integration
WSO2CON2024 - Why Should You Consider Ballerina for Your Next Integration
 
Evolving Data Governance for the Real-time Streaming and AI Era
Evolving Data Governance for the Real-time Streaming and AI EraEvolving Data Governance for the Real-time Streaming and AI Era
Evolving Data Governance for the Real-time Streaming and AI Era
 
WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...
WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...
WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...
 
WSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go PlatformlessWSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go Platformless
 
Novo Nordisk: When Knowledge Graphs meet LLMs
Novo Nordisk: When Knowledge Graphs meet LLMsNovo Nordisk: When Knowledge Graphs meet LLMs
Novo Nordisk: When Knowledge Graphs meet LLMs
 
WSO2CON 2024 Slides - Open Source to SaaS
WSO2CON 2024 Slides - Open Source to SaaSWSO2CON 2024 Slides - Open Source to SaaS
WSO2CON 2024 Slides - Open Source to SaaS
 
WSO2Con2024 - Facilitating Broadband Switching Services for UK Telecoms Provi...
WSO2Con2024 - Facilitating Broadband Switching Services for UK Telecoms Provi...WSO2Con2024 - Facilitating Broadband Switching Services for UK Telecoms Provi...
WSO2Con2024 - Facilitating Broadband Switching Services for UK Telecoms Provi...
 
WSO2Con2024 - Hello Choreo Presentation - Kanchana
WSO2Con2024 - Hello Choreo Presentation - KanchanaWSO2Con2024 - Hello Choreo Presentation - Kanchana
WSO2Con2024 - Hello Choreo Presentation - Kanchana
 
WSO2Con204 - Hard Rock Presentation - Keynote
WSO2Con204 - Hard Rock Presentation - KeynoteWSO2Con204 - Hard Rock Presentation - Keynote
WSO2Con204 - Hard Rock Presentation - Keynote
 

Enterprise Distributed Query Service powered by Presto & Alluxio across clouds at WalmartLabs

  • 1. 1 Enterprise Distributed Query Service powered by Presto & Alluxio across clouds @ WalmartLabs Ashish Tadose Principal Engineer
  • 2. 2 Agenda • Data stores @ Walmart Labs • Motivation for Presto as Distributed Query service • Multi-tenant managed Distributed Query service • Alluxio caching to optimize the performance • Architectural components • Alluxio to support query federation in hybrid Footer
  • 3. 3 Data stores @ Walmart Labs Access needs are varied from team to team – one solution does not fit all….
  • 4. 4 Motivation for Presto.. • DataLake cluster - powered by on-prem Hadoop/HDFS • Compute storage colocation – GOOD • Need to ingest data from all diverse sources – CHALLENGING • Scaling out compute with growing needs – CHALLENGING • Need to separate storage & compute / support federated query capability – PRESTO.. • Isolated clusters in private cloud powering dedicated data-marts Datajourney
  • 5. 5 • Simplified query access layer • Leverage cloud elastic compute • Better scalability & Effective cluster utilization by auto-scaling • Performant query response times • Security – Authentication – LDAP – Authorization – work with existing policies • Handle sensitive data – encryption at rest & over the wire • Efficient Monitoring & alerting • Dedicated quotas – SLA guarantees • Flexibility to configure query configuration per tenant Multi-tenant Query service - requirements
  • 6. 6 • Authentication – Presto LDAP – Custom authentication service • Authorization – Custom Presto ranger plugin – Hadoop impersonation support • Quota – Presto resource groups • Query configuration tuning – Session property managers – Customizations to make it for Unix groups • Query audit – Presto’s event listener framework • Auto-scaling in GCP – GCP instance group auto-scaling – Auto scaling based on CPU load and queued queries Architectural components
  • 7. 7 Presto & Alluxio Works well together… Small range query response time Lower is better Large scan query response time Lower is better Concurrency Higher is better Presto Presto + Alluxio • Query performance bottlenecks • Un-predictable network IO • Query pattern - Datasets modelled in star schema could benefit by dimension table caching • Presto + Alluxio • Avoids unpredictable network • Consistent query latency • Higher throughput and better concurrency
  • 8. 8 • Presto + Alluxio collocated cluster • Meta synch components to automatically crate alluxio backed tables and create alluxio mount points • Tweak auto scaling to keep the min number of alluxio workers • Pin frequently used dimension tables to avoid cache evictions Presto + Alluxio – architectural components
  • 9. 9 • Ability to query datasets couldn’t make it to public clouds • Alluxio greatly improved query performance to avoid network hops recurrent queries • Avoids creating data copies in clouds of datasets – alluxio mounts file meta changes • Enabled query guards in Presto to avoid abuse of this connectors Presto + Alluxio – hybrid cloud
  • 10. 10 Page 10 unlimited Query service backed by Presto + Alluxio We provide the analyst with a query tool for interactive ad-hoc analysis over different source system through a unified SQL query interface. Put ALL your data to work SQL on Anything Optimized performance Improve data’s time to value Increase Your Optionality Never get deprived of cluster resources Query service hosted in GCP & On- prem is powered by Presto + Alluxio and is offered as a managed distributed service. We also help business in optimizing their SQL queries to make sure they run within expected time. Using the platform’s Federated Query Capabilities, data can be queries and joined from multiple data sources No fancy technology needed to query data. All you need is ANSI SQL Performance boost compared to Hive Queries can be executed across any number of data resources regardless of where the data resides With auto-scaling in place, queries always get enough resources to perform fast Choose BI tool of your own choice