Solving low latency query over big data with Spark SQL

Julien Pierre

Spark Summit 2015 talk

Client Data Fluency
Office
Skype
Bing
Modern Data Capability
Instrumentation & Ingestion
Processing & Storage
Reporting & Analytics
Information Management
Mobile-First Analytics Experience
Experimentation

Get results inline in Zeppelin
Need to open the results in Excel

[Chart: bars for Cosmos, SparkSQL, and SparkSQL with Cache on an axis from 0 to 200. Each bar is split into three segments: Write and Compile Query, Submit and Wait in Job Queue, Job Run Time.]

Mesos Cluster/HDFS
Job Manager
Zookeeper
Job Frontend Web API
Spark Driver Host Pool
Spark Hive Thrift Server
Zeppelin Server
Avocado (Hive Query + Schedule Task)
Rover (Drag & Drop BI tool with Hive Code Gen)
Zeppelin Web UI
MetastoreDB
Hive Loader
Cosmos Storage

[Diagram: ingestion pipeline in three stages, each drawn over Partition 1 ... Partition n. Stage 1: Export Cosmos Partition. Stage 2: HDFS.copyFromLocalFile, run as parallel tasks (Task 2 ... Task n). Stage 3: saveAsParquetFile, run as parallel tasks (Task 2 ... Task n).]

[Diagram: MetastoreDB containing <Database1> and <Database2>, <Table1> and <Table2>, <Partition1> and <Partition2>, connected to the Hive Loader, the Hive Thrift Server, and the Zeppelin Server. Arrow labels: User Query, Query.]
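
The slides name a Hive Loader next to the MetastoreDB but do not show what it runs. One plausible reading is that it registers each Parquet directory as a table partition so the Hive Thrift Server can see it. The sketch below only builds the HiveQL text for that step. The class and function names, the `dt` partition column, and the HDFS path are hypothetical and do not come from the talk; the `ALTER TABLE ... ADD IF NOT EXISTS PARTITION ... LOCATION` form should be checked against the Hive or Spark SQL version in use.

```python
import re
from dataclasses import dataclass

# Conservative pattern for database, table, and column names (an assumption,
# chosen so that no identifier quoting is needed in the generated statement).
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


@dataclass(frozen=True)
class ParquetPartition:
    """One Parquet directory to expose as a partition (hypothetical model)."""
    database: str
    table: str
    partition_key: str    # hypothetical partition column, e.g. "dt"
    partition_value: str  # e.g. "2015-06-01"
    location: str         # directory that holds the Parquet files


def _check_identifier(name: str) -> str:
    """Return name unchanged if it is a plain identifier, else raise."""
    if not _IDENTIFIER.match(name):
        raise ValueError(f"unsupported identifier: {name!r}")
    return name


def _check_literal(value: str) -> str:
    """Reject quotes and backslashes rather than trying to escape them."""
    if "'" in value or "\\" in value:
        raise ValueError(f"unsupported characters in literal: {value!r}")
    return value


def add_partition_ddl(p: ParquetPartition) -> str:
    """Build the statement that registers one partition in the metastore."""
    table = f"{_check_identifier(p.database)}.{_check_identifier(p.table)}"
    spec = f"{_check_identifier(p.partition_key)}='{_check_literal(p.partition_value)}'"
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({spec}) LOCATION '{_check_literal(p.location)}'"
    )


if __name__ == "__main__":
    example = ParquetPartition(
        database="database1",
        table="table1",
        partition_key="dt",
        partition_value="2015-06-01",
        location="hdfs:///warehouse/database1/table1/dt=2015-06-01",
    )
    print(add_partition_ddl(example))
```

`IF NOT EXISTS` keeps the statement safe to re-run if the loader retries. The resulting string would be submitted through whatever SQL entry point the deployment exposes, which the slides do not specify.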

Data Ingest
Services
Clients
Transform
Compute
Transform
Compute
Data Streams
Data Sets
Store
Event Processing
HDFS
Data Transportation
Spark Streaming Receiver
Analyst
Zeppelin Notebooks
Avocado
Simple query
Query language
“Analyze”, “Debug”, “Mine”, “Glance”
Data
Unified platform
Intelligence
Interactive analytics
Data Products
Better Digital Experiences
Dual users: “Bing”, “Office”
