Solving low latency query over big data with Spark SQL
Julien Pierre
Spark Summit 2015 talk
4. Client Data Fluency; Office, Skype, Bing; Modern Data Capability; Instrumentation & Ingestion, Processing & Storage, Reporting & Analytics, Information Management; Mobile-First Analytics Experience; Experimentation
7. [Chart axes: Data Size, Query Latency]
9. Get results inline in Zeppelin / Need to open the results in Excel
10. [Bar chart, axis from 0 to 200. Systems compared: Cosmos, SparkSQL, SparkSQL with Cache. Stages: Write and Compile Query, Submit and Wait in Job Queue, Job Run Time]
14. [Architecture diagram] Mesos Cluster/HDFS; Job Manager; Zookeeper; Job Frontend Web API; Spark Driver Host Pool; Spark Hive Thrift Server; Zeppelin Server; Avocado (Hive Query + Schedule Task); Rover (Drag & Drop BI tool with Hive Code Gen); Zeppelin Web UI; MetastoreDB; Hive Loader; Cosmos Storage
15. [Data loading diagram] Partition 1, Partition 2, ..., Partition n; Export Cosmos Partition; Partition 1, Partition 2, ..., Partition n; HDFS.copyFromLocalFile (Task 2, ..., Task n); Partition 1, Partition 2, ..., Partition n; saveAsParquetFile (Task 2, ..., Task n)
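Slide 15 appears to describe a load path in which each exported partition is handled by its own task. The following is only a minimal sketch of that one-task-per-partition idea, not the speaker's implementation: a local directory copy stands in for the HDFS upload, the Parquet conversion step is left out because it needs a Spark runtime, and the names `copy_partition`, `load_partitions`, and `demo` and the file names are hypothetical.

```python
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def copy_partition(src: Path, dest_dir: Path) -> Path:
    """Copy one exported partition file into dest_dir (stand-in for an HDFS upload)."""
    dest = dest_dir / src.name
    shutil.copyfile(src, dest)
    return dest


def load_partitions(partition_files, dest_dir: Path, max_workers: int = 4):
    """Run one copy task per partition; results come back in input order."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda p: copy_partition(p, dest_dir), partition_files))


def demo():
    # Hypothetical demo data: three tiny "exported partitions" in a temp directory.
    with tempfile.TemporaryDirectory() as tmp:
        export_dir = Path(tmp) / "export"
        export_dir.mkdir()
        files = []
        for i in range(1, 4):
            f = export_dir / f"partition_{i}.tsv"
            f.write_text(f"row from partition {i}\n")
            files.append(f)
        copied = load_partitions(files, Path(tmp) / "staging")
        return [p.name for p in copied]
```

Calling `demo()` copies the three sample files in parallel and returns their names. In a real cluster the per-partition work would presumably run as Spark tasks rather than local threads.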
16. [Metastore diagram] <Database1>, <Database2>; <Table1>, <Table2>; <Partition1>, <Partition2>; MetastoreDB; Hive Thrift Server; Hive Loader; Zeppelin Server; User Query; Query
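Slide 16 shows a database, table, partition hierarchy held in the MetastoreDB and a Hive Loader next to it. One plausible reading is that the loader registers each newly written partition so that queries through the Thrift server can see it. The sketch below only builds the HiveQL text for such a registration. The function name `add_partition_ddl`, the partition key `ds`, and the HDFS path are assumptions for illustration, and values are not escaped.

```python
def add_partition_ddl(database: str, table: str, partition: dict, location: str) -> str:
    """Build a HiveQL statement that registers one partition directory."""
    # Render the partition spec as key='value' pairs, e.g. ds='2015-06-01'.
    spec = ", ".join(f"{key}='{value}'" for key, value in partition.items())
    return (
        f"ALTER TABLE {database}.{table} "
        f"ADD IF NOT EXISTS PARTITION ({spec}) "
        f"LOCATION '{location}'"
    )


# Hypothetical names and path, chosen only to mirror the slide's placeholders.
statement = add_partition_ddl(
    "database1",
    "table1",
    {"ds": "2015-06-01"},
    "hdfs:///warehouse/database1/table1/ds=2015-06-01",
)
```

The resulting string could then be submitted through whatever Hive or Spark SQL client the loader uses. `IF NOT EXISTS` keeps a repeated load from failing on a partition that is already registered.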
22. Data Ingest Services Clients Transform Compute Transform Compute Data Streams Data Sets Store Event Processing HDFS Data Transportation Spark Streaming Receiver Analyst Zeppelin Notebooks Avocado Simple query Query language “Analyze” “Debug” “Mine” “Glance” Data Unified platform Intelligence Interactive analytics Data Products Better Digital Experiences Dual users “Bing” “Office”