Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal
Spark Summit East talk
1.
Succinct Spark: Interactive Queries on Compressed RDD
Rachit Agarwal, AMPLab
ragarwal@berkeley.edu
Twitter: @_ragarwal_
2.
Succinct: A distributed compressed data store
No secondary indexes, no data scans, no data decompression
Point queries: search • random access • range queries • regular expressions
Unified Interface: Unstructured data • Key-value store • Document store • Tables
3.
Interactive point queries • Random access • Search • Range Queries • Regular Expressions • Aggregate queries • Updates • Graph queries
4.
Existing systems, e.g., search( )
• Data Scans: Low storage, High Latency
• Indexes: High storage, Low Latency
[Figure: an index listing the offsets of each symbol: {0, 10, 14, 16, 19, 26, 29}, {1, 4, 5, 8, 20, 22, 24}, {2, 15, 17, 27}, {3, 6, 7, 9, 12, 13, 18, 23}, .., {11, 21}]
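The scan-versus-index tradeoff on this slide can be sketched in a few lines of Scala. This is only an illustration under stated assumptions: `ScanVsIndex`, `scanSearch`, `buildIndex`, and `indexSearch` are hypothetical names, the index is a simple word-to-offsets map, and none of it is taken from any existing system or from Succinct itself.

```scala
object ScanVsIndex {
  // Data scan: no extra storage, but every query reads the whole input.
  def scanSearch(data: String, query: String): Seq[Int] =
    if (query.isEmpty) Seq.empty[Int]
    else (0 to data.length - query.length).filter(i => data.startsWith(query, i))

  // Index: one pass up front stores the offsets of every whitespace-separated
  // word (extra storage), so later lookups avoid touching the data.
  def buildIndex(data: String): Map[String, Seq[Int]] =
    "\\S+".r.findAllMatchIn(data).toList
      .groupBy(_.matched)
      .map { case (word, matches) => word -> matches.map(_.start) }

  // Index lookup: a single map access, but only for whole words.
  def indexSearch(index: Map[String, Seq[Int]], word: String): Seq[Int] =
    index.getOrElse(word, Seq.empty[Int])
}
```

On the input "the cat and the hat", both approaches locate the word "the", but only the scan finds the substring "at", because this toy index stores whole words only.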
5.
Existing systems “at scale” (qualitatively)
[Figure: Query Latency vs. Input size for Data scans and Indexes. Regions: Scans in faster storage, Scans in slower storage, Indexes in faster storage, Indexes in slower storage. Annotation: executing queries off slower storage]
6.
What makes Succinct unique
Succinct: Low storage, Low Latency. Queries executed directly on the compressed representation
• No additional indexes: Query responses embedded within the compressed representation
• No data scans: Functionality of indexes
• No decompression: Queries directly on the compressed representation (except for data access queries)
7.
Succinct tradeoffs
[Figure: Query Latency vs. Input size for Data scans, Indexes, and Succinct. Annotations: Avoiding data scans, Avoiding queries off slower storage]
8.
Succinct: Data model and Functionality
Input: flat (unstructured) files
• Search: returns offsets of arbitrary strings in uncompressed file. Search( ) = {0, 10, 14, 16, 19, 26, 29}
• Count: returns count of arbitrary strings in uncompressed file. Count( ) = 7
• Extract: returns data at arbitrary offsets in uncompressed file. Extract(0, 5) = { , , , , }
• Append( , , , , )
• Range queries
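As a reference for what these operations return, here is a naive model over an uncompressed byte array. It is a sketch of the semantics only: `FlatFileModel` and its method names are hypothetical, and it scans the raw data, which is exactly what Succinct is described as avoiding.

```scala
object FlatFileModel {
  // Search: every offset in the uncompressed data where `query` occurs.
  def search(data: Array[Byte], query: Array[Byte]): Seq[Int] =
    if (query.isEmpty) Seq.empty[Int]
    else (0 to data.length - query.length).filter { i =>
      query.indices.forall(j => data(i + j) == query(j))
    }

  // Count: the number of occurrences, i.e. the size of the search result.
  def count(data: Array[Byte], query: Array[Byte]): Int =
    search(data, query).size

  // Extract: `length` bytes starting at `offset` (shorter at the end of the data).
  def extract(data: Array[Byte], offset: Int, length: Int): Array[Byte] =
    data.slice(offset, offset + length)

  // Append: new data is added at the end; existing offsets stay valid.
  def append(data: Array[Byte], extra: Array[Byte]): Array[Byte] =
    data ++ extra
}
```

For the bytes of "banana", searching for "ana" yields offsets 1 and 3, and extracting 3 bytes at offset 2 yields "nan". Overlapping occurrences are reported separately in this model; whether Succinct does the same is not stated on the slide.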
9.
Succinct tradeoffs: What do we lose?
• Preprocessing time
• CPU (data access)
• Sequential scan throughput
• “In-place” updates
Supported, but traded-off in favor of point queries on compressed data
10.
Succinct: A distributed compressed data store
No secondary indexes, no data scans, no data decompression
Point queries: search • random access • range queries • regular expressions
Unified Interface: Unstructured data • Key-value store • Document store • Tables
11.
Succinct Data Model: Flat File Interface
Unified Interface
• Unstructured data
• Key-value stores (Voldemort, Dynamo)
• Document store (Elasticsearch, MongoDB)
• Tables (Cassandra, BigTable)
• And many more ...
With all the powerful queries on values, documents, columns
12.
Succinct Flat File Interface: Unification
Search(Column1, ) Search( )
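One way a column search could reduce to a flat-file search is to serialize each row with a distinct end marker per column, then search for the value wrapped in the matching markers. The sketch below assumes that encoding; the marker characters, the `TableAsFlatFile` object, and its methods are all hypothetical, and the slide does not say that Succinct encodes tables this way. It also assumes cell values never contain a marker or a newline.

```scala
object TableAsFlatFile {
  // Hypothetical markers: one per column, plus a row-start marker.
  private val columnEnd: Vector[String] = Vector("\u0001", "\u0002", "\u0003")
  private val rowStart: String = "\n"

  // Flatten a table (at most three columns here) into a single string.
  def serialize(rows: Seq[Seq[String]]): String =
    rows.map { row =>
      rowStart + row.zipWithIndex.map { case (value, col) => value + columnEnd(col) }.mkString
    }.mkString

  // Search(column, value) becomes a plain substring search on the flat file.
  // Returns the ids of rows whose `col` cell equals `value`.
  def searchColumn(flat: String, col: Int, value: String): Seq[Int] = {
    val before = if (col == 0) rowStart else columnEnd(col - 1)
    val pattern = before + value + columnEnd(col)
    (0 to flat.length - pattern.length)
      .filter(i => flat.startsWith(pattern, i))
      // The row id is the number of row-start markers up to the match, minus one.
      .map(i => flat.substring(0, i + 1).count(_ == '\n') - 1)
  }
}
```

Because the pattern includes the markers on both sides, a search for "Berkeley" in column 1 does not match the same string stored in column 0, and a prefix such as "Berk" does not match at all.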
13.
Succinct: A distributed compressed data store
Where are we?
• Succinct
• Succinct Spark
Where are we going?
• Industry collaboration
• Succinct++
14.
Succinct: Where are we?
Succinct
• System (prototyped & tested)
• As a library
• C++, Java, Scala
• for ease of integration
• All functionalities supported
15.
Succinct: Where are we?
Succinct Spark: Queries on compressed RDDs
• A Spark package
• Enables new functionalities
• Document stores
• Point queries
• Faster filters
• Compressed RDDs: More in-memory
• Dataframes API not so mature
16.
Succinct Spark: If you are already using Spark
• New functionalities: Document store, Key-Value store; search on documents, values
• Faster operations into RDDs: random access, filters; avoid scans
• More in-memory: Compressed RDDs; no decompression overheads
17.
Succinct Spark: SuccinctRDD (unstructured data)
    import edu.berkeley.cs.succinct._                // Import classes
    val rdd = ctx.textFile(...).map(_.getBytes)      // Create an RDD
    val succinctRDD = rdd.succinct                   // Compress using Succinct
    val bytes = succinctRDD.extract(50, 100)         // Extract 100 bytes from offset 50
    val count = succinctRDD.count("Berkeley")        // Count #occurrences of “Berkeley”
    val offsets = succinctRDD.search("Berkeley")     // Find all occurrences of “Berkeley”
18.
Succinct Spark: SuccinctKVRDD (document store)
    import edu.berkeley.cs.succinct.kv._                                  // Import classes
    val kvRDD = rdd.zipWithIndex.map(t => (t._2, t._1.getBytes))          // Load data
    val succinctKVRDD = kvRDD.succinctKV                                  // Compress using Succinct
    val value = succinctKVRDD.get(0)                                      // Get value for key 0
    val valueData = succinctKVRDD.extract(0, 50, 100)                     // Extract 100 bytes at offset 50 in the value for key 0
    val keys = succinctKVRDD.search("Berkeley")                           // Find all keys for values that contain “Berkeley”
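The key-value operations above can be modeled with an ordinary in-memory map, which makes the expected results easy to check. This is a plain-Scala sketch and not the Succinct API: `KVModel` and its methods are hypothetical, the `Option` return types are an assumption (the slide does not show what a missing key returns), and nothing here is compressed.

```scala
import java.nio.charset.StandardCharsets.UTF_8

object KVModel {
  type Store = Map[Long, Array[Byte]]

  // Mirrors the slide's zipWithIndex step: the key is the record's position.
  def fromLines(lines: Seq[String]): Store =
    lines.zipWithIndex.map { case (line, i) => (i.toLong, line.getBytes(UTF_8)) }.toMap

  // get(key): the whole value, if the key exists.
  def get(store: Store, key: Long): Option[Array[Byte]] = store.get(key)

  // extract(key, offset, length): a slice of the value stored under `key`.
  def extract(store: Store, key: Long, offset: Int, length: Int): Option[Array[Byte]] =
    store.get(key).map(_.slice(offset, offset + length))

  // search(query): the keys of all values that contain `query`.
  def search(store: Store, query: String): Set[Long] =
    store.collect { case (key, value) if text(value).contains(query) => key }.toSet

  // Helper for reading results back as text.
  def text(bytes: Array[Byte]): String = new String(bytes, UTF_8)
}
```

With the three records "UC Berkeley AMPLab", "Stanford", and "Berkeley, CA", a search for "Berkeley" returns keys 0 and 2, and extracting 8 bytes at offset 3 of key 0 returns "Berkeley".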
19.
Succinct Evaluation
• 5x Amazon EC2 servers, 30GB RAM each
• Wikipedia dataset, 40GB
• Spark, Elasticsearch
• search queries
• #occurrences 1-10k
20.
Succinct Spark Evaluation (search latency)
Take-away: Succinct Spark is 2.75x faster than Elasticsearch while being 2.5x more space efficient (data fits in memory for all systems)
21.
Succinct Spark now supports Regular Expressions!
SuccinctRDD
    val matches = succinctRDD.regexSearch("William.*Clinton")        // Find all matches for the RegEx “William.*Clinton”
SuccinctKVRDD
    val matchKeys = succinctKVRDD.regexSearch("William.*Clinton")    // Find all keys for values that contain matches for the RegEx “William.*Clinton”
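To make the expected matches concrete, the same pattern can be run over uncompressed text with Scala's standard regex support. This is a reference sketch only: `RegexModel` is a hypothetical name, the (offset, length) result shape is an assumption, and the matching rules are those of `java.util.regex`, which may differ from what Succinct implements.

```scala
object RegexModel {
  // Every match of `pattern` in `data`, as (start offset, match length) pairs.
  def regexSearch(data: String, pattern: String): Seq[(Int, Int)] =
    pattern.r.findAllMatchIn(data).map(m => (m.start, m.end - m.start)).toList
}
```

Under `java.util.regex` rules, `.` does not match a newline by default, so "William.*Clinton" never spans two lines. In the text "William J. Clinton\nWilliam Shakespeare\nWilliam Jefferson Clinton" it matches the first and third lines only.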
22.
Succinct Spark Evaluation (RegEx latency)
Take-away: Succinct significantly speeds up RegEx queries even when all the data fits in memory for all systems
23.
Succinct Spark now supports JSON documents!
    val jsonDoc = succinctJsonRDD.get(0)                      // Get JSON document with id 0
    val ids1 = succinctJsonRDD.filter("city", "Berkeley")     // Filter JSON documents where “city = Berkeley”
    val ids2 = succinctJsonRDD.search("AMPLab")               // Search for JSON documents containing “AMPLab”
24.
Where are we going?
• More testing, benchmarking
• Succinct Spark Dataframes
• New functionalities
25.
New functionalities
• BlowFish
• Succinct Encryption: Queries on compressed and encrypted data
• Succinct Graphs: Queries on compressed graphs
[Figure: Query Latency vs. Storage, showing Succinct, BlowFish, and Indexes]
26.
AND MANY MORE! succinct.cs.berkeley.edu