Copyright 1995-2020 Arm Limited (or its affiliates). All rights reserved.
Taro L. Saito
Arm Treasure Data
July 31st, 2020
Spark Meetup Tokyo #3
td-spark internals
Extending Spark with Airframe
About Me: Taro L. Saito
● Ph.D., Principal Software Engineer at Arm Treasure Data
● Living in the US for 5 years
● Created Presto as a service
● Processing 1 million SQL queries/day on the cloud (Presto Webinar)
● OSS:
● Airframe, snappy-java (used in Parquet, Spark core), sbt-sonatype, etc.
● Books:
WIP
Challenge: Adding Treasure Data Support to Spark
● PlazmaDB: Cloud Data Store of Treasure Data
● MessagePack-based columnar format (MPC1)
● Each table column is represented as a sequence of MessagePack values
● What was necessary for supporting Spark?
● td-spark driver (td-spark-assembly.jar)
■ MPC1 <-> DataFrame conversion
● Plazma Public API
■ APIs for reading and writing MPC1 files from PlazmaDB
● Created these two components with Airframe OSS
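The columnar idea behind MPC1 (one value sequence per table column) can be illustrated with a minimal sketch. This is not the MPC1 file layout and the values are not MessagePack-encoded here. `ColumnarSketch`, `toColumns`, and `toRows` are hypothetical names, and rows are modeled as plain Scala maps.

```scala
object ColumnarSketch {
  // A row is a simple column-name -> value map (stand-in for a table record).
  type Row = Map[String, Any]

  // Transpose rows into one value sequence per column.
  // Missing values become None so that every column has the same length.
  def toColumns(schema: Seq[String], rows: Seq[Row]): Map[String, Seq[Option[Any]]] =
    schema.map { name => name -> rows.map(_.get(name)) }.toMap

  // Rebuild rows from the column sequences (inverse of toColumns).
  def toRows(schema: Seq[String], columns: Map[String, Seq[Option[Any]]]): Seq[Row] = {
    val n = columns.values.headOption.map(_.size).getOrElse(0)
    (0 until n).map { i =>
      schema.flatMap(name => columns(name)(i).map(v => name -> v)).toMap
    }
  }
}
```

Reading a single column then touches only that column's sequence, which is what makes a parallel, column-wise download practical.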
Airframe: Core Scala Modules of Treasure Data
● Airframe
● Scala OSS assets built from our knowledge, production experience, and design decisions
● 20+ Common Utilities for Scala
● Dependency Injection (DI)
● Airframe RPC
■ HTTP Server, Client builder (ScalaMatsuri, Tokyo, October 2020)
● AirSpec
■ Testing framework for Scala (ScalaDays, Seattle, May 2021)
[Figure labels: Knowledge, Experiences, Design Decisions; Products, 24/7 Services, Business Values; Programming, OSS, Outcome; Airframe]
Airframe Modules Used Inside td-spark
[Diagram: td-spark components and the Airframe modules they use]
● DataFrame <-> MPC1: airframe-codec, airframe-msgpack
● Plazma Public API: airframe-http, airframe-finagle, Airframe DI, Airframe RPC, airframe-fluentd
● td-spark.jar (Master and Worker): Design, SparkContext, TDSparkContext, TDSparkService, MPC1 Reader/Writer, IO Manager
● Other modules shown: Airframe DI, airframe-http, airframe-config, airframe-launcher, airframe-jmx, airframe-metrics, airframe-control, airframe-log, airframe-codec, airframe-json
Reading MPC1 Partitions and Column Blocks
[Diagram: Table Data (mpc1 partitions) -> Plazma Public API -> Table Data (column blocks; Columnar Data Download) -> td-spark (Parallel Read) -> Table Data (DataFrames) -> td-pyspark / User Programs]
Uploading DataFrame as MPC1 Partitions
[Diagram: User Programs / td-pyspark -> td-spark -> DataFrame Format Conversion -> Table Data (mpc1 partitions) -> Parallel Upload -> Amazon S3 -> Copy, Transaction (Plazma Public API) -> Table Data (mpc1 partitions)]
Airframe RPC
[Diagram: An RPC Interface written in Scala is the source for generated artifacts (HTTP/gRPC Client, Scala.js Client, Open API Spec, API Documentation). A Router and the RPC Impl create the RPC Web Server, which receives RPC calls as JSON. Uses shown: Cross-Language RPC Client, Scala.js Web Application, Micro Service. Modules shown: sbt-airframe, airframe-http, airframe-http-finagle, airframe-http-rx, airframe-codec, airframe-gRPC.]
● Use Scala As An RPC Interface
● Generate HTTP Server/Client (REST or gRPC)
● HTTP calls -> JSON/MessagePack data -> Remote Scala function calls
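The flow in the last bullet (an HTTP call carries encoded arguments, which the server turns into a Scala function call) can be sketched in plain Scala. This is a hand-written illustration of the dispatch idea, not Airframe's actual API: `RpcSketch`, `Greeter`, `GreeterImpl`, `Router`, and the "Interface/method" path format are all hypothetical, and the arguments are assumed to be already decoded into a map.

```scala
object RpcSketch {
  // The RPC interface is an ordinary Scala trait.
  trait Greeter {
    def hello(name: String): String
  }

  // Server-side implementation of the interface.
  class GreeterImpl extends Greeter {
    def hello(name: String): String = s"Hello, $name!"
  }

  // Hypothetical router: maps "Interface/method" to a function over decoded arguments.
  // A real server would decode the arguments from a JSON or MessagePack request body.
  class Router(greeter: Greeter) {
    private val routes: Map[String, Map[String, String] => String] = Map(
      "Greeter/hello" -> ((args: Map[String, String]) => greeter.hello(args("name")))
    )

    // Look up the route and invoke the matching Scala method.
    def call(path: String, args: Map[String, String]): Either[String, String] =
      routes.get(path) match {
        case Some(f) => Right(f(args))
        case None    => Left(s"Unknown RPC method: $path")
      }
  }
}
```

Because the trait is the only contract, a generated client can expose the same method signature and hide the encoding step from callers.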
td-spark: Adding More Functions to Spark
● Using an implicit class to extend SparkSession (spark variable)
● Adding TD-specific functionalities
● Time series data queries
■ e.g., spark.td.table("(TD table name)").within("-1h").df
● Predicate pushdown for time-series data
● etc.
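The implicit-class technique named above can be sketched without Spark on the classpath. `Session` stands in for SparkSession, and `TableRef`, `TDOps`, `SessionExtension`, and `describe` are hypothetical names chosen for illustration; the real td-spark classes may be organized differently.

```scala
object ImplicitExtensionSketch {
  // Stand-in for SparkSession, so the sketch runs without Spark.
  class Session(val appName: String)

  // Hypothetical table handle; `within` records a relative time range such as "-1h".
  case class TableRef(name: String, timeRange: Option[String] = None) {
    def within(range: String): TableRef = copy(timeRange = Some(range))

    // Stand-in for `.df`: describes the scan instead of building a DataFrame.
    def describe: String = {
      val range = timeRange.getOrElse("all")
      s"scan($name, within=$range)"
    }
  }

  // Entry point that groups the added functions under `session.td`.
  class TDOps(session: Session) {
    def table(name: String): TableRef = TableRef(name)
  }

  // The implicit class adds a `td` member to every Session where it is imported.
  implicit class SessionExtension(val session: Session) {
    def td: TDOps = new TDOps(session)
  }
}
```

After `import ImplicitExtensionSketch._`, a call such as `new Session("demo").td.table("db.events").within("-1h").describe` compiles even though `Session` itself has no `td` member.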
Tips: Avoid Task Serialization Errors with Airframe DI
● Serializable
● spark.conf key-value properties inside SparkContext
● Non-Serializable
● Complex service objects
● Solution: Airframe DI (Dependency Injection)
● Distribute the service design (= how to construct objects) with the jar file
● Build service objects from the design (20+ components and config objects)
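A minimal sketch of this idea, assuming plain Java serialization stands in for shipping a Spark task: only a small config object crosses the wire, while the construction logic (the "design") is code that is already present in the jar on each worker. `DesignSketch`, `ServiceConfig`, `Service`, and `ServiceDesign` are hypothetical names, and this hand-rolled design is a stand-in rather than Airframe DI's API.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, NotSerializableException}
import java.io.{ObjectInputStream, ObjectOutputStream}

object DesignSketch {
  // Plain key-value style configuration: a case class, hence serializable.
  case class ServiceConfig(endpoint: String, retries: Int)

  // Not Serializable: stands in for a complex service object
  // (connections, thread pools, and so on).
  class Service(val config: ServiceConfig) {
    def describe: String = s"service for ${config.endpoint}"
  }

  // The "design": code that knows how to construct the service.
  // It ships inside the jar, so only the config has to be serialized.
  object ServiceDesign {
    def build(config: ServiceConfig): Service = new Service(config)
  }

  // Returns false when Java serialization rejects the value.
  def canSerialize(value: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      out.writeObject(value)
      out.close()
      true
    } catch { case _: NotSerializableException => false }

  // Serialize and deserialize, standing in for the master -> worker hop.
  def roundTrip[A <: java.io.Serializable](value: A): A = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(value)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[A] finally in.close()
  }
}
```

On the worker side, `ServiceDesign.build(roundTrip(config))` yields a working service, while capturing a `Service` instance directly in a task closure would fail to serialize.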
[Diagram: Sending TDSparkContext / TDSparkService objects from Master to Worker causes a serialization error. With Airframe DI, td-spark.jar carries the Design on both Master and Worker, and TDSparkService is built from the Design and Config on each side (Build OK!).]
Flexible Format Conversion with MessagePack
[Diagram: DataFrame, MPC1 (Plazma Public API), and JDBC ResultSet are converted to one another with Airframe Codec by packing to and unpacking from MessagePack]
Spark 3.0 and PySpark Support
Resources
● Airframe: https://wvlet.org/airframe/
● Airframe Meetup #1 ~ #3 reports
● ScalaMatsuri 2019 presentation
■ And more!!
● td-spark documentation: https://treasure-data.github.io/td-spark/
● See Also: Spark with Airframe (@smdmts)
● Spark to Spark data transfer with MessagePack-based airframe-codec
● Spark -> AWS service call management with airframe-control
Appendix
Treasure Data: A Ready-to-Use Cloud Data Platform
[Diagram: Data Collection (Logs, Device Data, Batch Data) -> Cloud Storage (PlazmaDB, Table Schema) -> Distributed Data Processing (Jobs); Job Management: SQL Editor, Scheduler, Workflows, Machine Learning; legend: Treasure Data OSS, Third Party OSS]
TDSparkContext
td-pyspark
● Supporting PySpark
● Access Scala methods of td-spark:
■ sparkContext._jvm.(jvm package name).method(...)
● Conversion to PySpark’s DataFrame
● DataFrame(Scala DataFrame, sqlContext)
Predicate Pushdown
● Traverse DataFrame Column Filters
● Extract time conditions (e.g., -1d, -1w, -7d, etc.)
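The traversal described above can be sketched with a small filter tree. This is a simplified illustration, not td-spark's implementation: the `Filter` case classes mimic the shape of data-source filter expressions, the column name `time` and epoch-second values are assumptions, and relative expressions such as -1d are assumed to have been resolved to absolute bounds beforehand.

```scala
object TimePushdownSketch {
  // Minimal filter expression tree (stand-in for Spark's filter expressions).
  sealed trait Filter
  case class And(left: Filter, right: Filter) extends Filter
  case class GreaterThanOrEqual(column: String, value: Long) extends Filter
  case class LessThan(column: String, value: Long) extends Filter
  case class EqualTo(column: String, value: Any) extends Filter

  // Half-open time range [from, until) in epoch seconds; None means unbounded.
  case class TimeRange(from: Option[Long] = None, until: Option[Long] = None) {
    // Keep the latest lower bound and the earliest upper bound.
    def intersect(other: TimeRange): TimeRange = {
      val lows  = from.toList ++ other.from.toList
      val highs = until.toList ++ other.until.toList
      TimeRange(
        if (lows.isEmpty) None else Some(lows.max),
        if (highs.isEmpty) None else Some(highs.min)
      )
    }
  }

  // Walk the filter tree: conditions on `time` narrow the range to push down,
  // everything else is returned as residual filters to evaluate after the scan.
  def extract(filter: Filter): (TimeRange, List[Filter]) = filter match {
    case And(l, r) =>
      val (leftRange, leftRest)   = extract(l)
      val (rightRange, rightRest) = extract(r)
      (leftRange.intersect(rightRange), leftRest ++ rightRest)
    case GreaterThanOrEqual("time", v) => (TimeRange(from = Some(v)), Nil)
    case LessThan("time", v)           => (TimeRange(until = Some(v)), Nil)
    case other                         => (TimeRange(), List(other))
  }
}
```

The extracted range can then limit which partitions are requested from storage, while the residual filters are still applied to the rows that come back.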
Class Loader Hierarchy of Databricks
● Base class loader
● User library class loader
● td-spark.jar will be loaded here
● Shared between multiple notebooks
■ Static variables used inside td-spark.jar will be shared by multiple notebooks!
● REPL class loader
● Shared between multiple notebooks
● Spark-library class loader
● Notebook-local
● Notebook-local class loader
● Caching notebook-local instances in static variables of td-spark caused ClassNotFound errors in other notebooks
Using Presto with Spark
● presto-jdbc
● Submit "select * from (original SQL) limit 0" to obtain the query result schema
● JDBC ResultSet => Airframe Codec => DataFrame
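The schema probe in the first step can be sketched as a small helper that wraps the user's SQL. `SchemaProbeSketch` and `schemaProbeQuery` are hypothetical names, and stripping trailing semicolons is an added assumption so that the statement remains valid as a subquery.

```scala
object SchemaProbeSketch {
  // Wrap the user's SQL so the engine plans the query but returns no rows.
  // Trailing semicolons and whitespace are removed before wrapping.
  def schemaProbeQuery(sql: String): String = {
    val body = sql.trim.reverse.dropWhile(c => c == ';' || c.isWhitespace).reverse
    s"select * from ($body) limit 0"
  }
}
```

Running the wrapped query through JDBC returns an empty result whose metadata (for example via `java.sql.ResultSetMetaData`) gives the column names and types needed to define the DataFrame schema.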
td-prestobase: A Proxy Gateway to Presto Clusters
● td-prestobase is a proxy gateway to Presto clusters that speaks the standard Presto protocol, so it supports any Presto client (e.g., presto-cli, JDBC, ODBC)
● td-spark uses presto-jdbc and td-prestobase APIs for making Presto queries
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
Dorra BARTAGUIZ
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 

Recently uploaded (20)

SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
Assure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyesAssure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyes
 
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 

td-spark internals: Extending Spark with Airframe - Spark Meetup Tokyo #3 2020

  • 1. td-spark internals: Extending Spark with Airframe
      Taro L. Saito, Arm Treasure Data
      July 31st, 2020, Spark Meetup Tokyo #3
      Copyright 1995-2020 Arm Limited (or its affiliates). All rights reserved.
  • 2. About Me: Taro L. Saito
      ● Ph.D., Principal Software Engineer at Arm Treasure Data
      ● Living in the US for 5 years
      ● Created Presto as a service
      ● Processing 1 million SQL queries / day on the cloud. Presto Webinar
      ● OSS: Airframe, snappy-java (used in Parquet, Spark core), sbt-sonatype, etc.
      ● Books: WIP
  • 3. Challenge: Adding Treasure Data Support to Spark
      ● PlazmaDB: Cloud Data Store of Treasure Data
      ● MessagePack-based columnar format (MPC1)
      ● Each table column is represented as a sequence of MessagePack values
      ● What was necessary for supporting Spark?
      ● td-spark driver (td-spark-assembly.jar)
        ■ MPC1 <-> DataFrame conversion
      ● Plazma Public API
        ■ APIs for reading and writing MPC1 files from PlazmaDB
      ● Created these two components with Airframe OSS
  • 4. Airframe: Core Scala Modules of Treasure Data
      ● Airframe
      ● Scala OSS assets built from our knowledge, production experience, and design decisions
      ● 20+ Common Utilities for Scala
      ● Dependency Injection (DI)
      ● Airframe RPC
        ■ HTTP Server, Client builder (ScalaMatsuri, Tokyo, October 2020)
      ● AirSpec
        ■ Testing framework for Scala (ScalaDays, Seattle, May 2021)
      [Diagram labels: Knowledge, Experiences, Design Decisions, Programming, OSS, Outcome, Products, 24/7 Services, Business Values]
  • 5. Airframe Modules Used Inside td-spark
      [Diagram labels, components: Master, Worker, Design, SparkContext, TDSparkContext, TDSparkService, td-spark.jar, DataFrame, MPC1, Plazma Public API, MPC1 Reader/Writer, IO Manager]
      [Diagram labels, Airframe modules: Airframe DI, Airframe RPC, airframe-codec, airframe-msgpack, airframe-http, airframe-finagle, airframe-fluentd, airframe-config, airframe-launcher, airframe-jmx, airframe-metrics, airframe-control, airframe-log, airframe-json]
  • 6. Reading MPC1 Partitions and Column Blocks
      [Diagram labels: Table Data (mpc1 partitions), Plazma Public API (column blocks), Columnar Data Download, td-spark, td-pyspark, Parallel Read, DataFrame, User Programs]
  • 7. Uploading DataFrame as MPC1 Partitions
      [Diagram labels: User Programs, DataFrame, td-spark, td-pyspark, Format Conversion, mpc1 partitions, Parallel Upload, Amazon S3, Plazma Public API, Copy, Transaction, Table Data]
  • 8. Airframe RPC
      ● Use Scala As An RPC Interface
      ● Generate HTTP Server/Client (REST or gRPC)
      ● HTTP calls -> JSON/MessagePack data -> Remote Scala function calls
      [Diagram labels: RPC Interface, Router, RPC Impl, RPC Web Server, Micro Service, Generate HTTP/gRPC Client, Scala.js Client, Scala.js Web Application, Create RPC Calls, JSON, Cross-Language RPC Client, Open API Spec, API Documentation, sbt-airframe, airframe-http, airframe-http-finagle, airframe-http-rx, airframe-codec, airframe-gRPC]
  • 9. td-spark: Adding More Functions to Spark
      ● Using an implicit class to extend SparkSession (spark variable)
      ● Adding TD-specific functionality
      ● Time series data queries
        ■ e.g., spark.td.table("TD's table").within("-1h").df
      ● Predicate pushdown for time-series data
      ● etc.
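The relative time expression on slide 9 ("-1h") selects a recent time window. The sketch below is a simplified Python illustration, not td-spark code: `within` and `truncate` are hypothetical helpers, only the units m, h, and d are handled, and it assumes the window ends at the start of the current unit (so "-1h" evaluated at 01:23:45 covers 00:00 to 01:00). That alignment rule is an assumption here and should be checked against the td-spark documentation.

```python
import re
from datetime import datetime, timedelta
from typing import Tuple

# Hypothetical table, not part of td-spark: length of each supported unit.
_UNIT = {"m": timedelta(minutes=1), "h": timedelta(hours=1), "d": timedelta(days=1)}


def truncate(t: datetime, unit: str) -> datetime:
    """Round t down to the start of its minute, hour, or day."""
    if unit == "m":
        return t.replace(second=0, microsecond=0)
    if unit == "h":
        return t.replace(minute=0, second=0, microsecond=0)
    return t.replace(hour=0, minute=0, second=0, microsecond=0)


def within(expr: str, now: datetime) -> Tuple[datetime, datetime]:
    """Return the half-open range [start, end) for expressions such as "-1h"."""
    m = re.fullmatch(r"-(\d+)([mhd])", expr)
    if m is None:
        raise ValueError(f"unsupported time expression: {expr!r}")
    n, unit = int(m.group(1)), m.group(2)
    end = truncate(now, unit)          # assumed: window ends at the unit boundary
    return end - n * _UNIT[unit], end


start, end = within("-1h", datetime(2020, 7, 31, 1, 23, 45))
print(start, end)
```

Passing `now` explicitly keeps the function deterministic, which makes the window easy to test.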
  • 10. Tips: Avoid Task Serialization Errors with Airframe DI
      ● Serializable
      ● spark.conf key-value properties inside SparkContext
      ● Non-Serializable
      ● Complex service objects
      ● Solution: Airframe DI (Dependency Injection)
      ● Distribute the service design (= how to construct objects) with the jar file
      ● Build service objects from the design (20+ components and config objects)
      [Diagram labels: Master (TDSparkContext, TDSparkService, Design, Config, td-spark.jar); Worker (TDSparkContext, TDSparkService, Design, Config, td-spark.jar); Serialization Error (!); Airframe DI; Build OK!]
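The idea on slide 10 (ship the design, build the service on each worker) can be illustrated with a small Python analogy. This is not Airframe's API: `Service`, `build_service`, and the dict-based `design` are hypothetical stand-ins, and `pickle` plays the role of Spark's task serialization.

```python
import pickle
import threading


class Service:
    """Stand-in for a complex service object. The lock makes it unpicklable."""

    def __init__(self, endpoint, retries):
        self.endpoint = endpoint
        self.retries = retries
        self._lock = threading.Lock()

    def describe(self):
        with self._lock:
            return f"{self.endpoint} (retries={self.retries})"


def build_service(design):
    """Construct a fresh Service from a design (plain, serializable settings)."""
    return Service(design["endpoint"], design["retries"])


def can_pickle(obj):
    """Return True if obj survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, AttributeError, pickle.PicklingError):
        return False


# The design only says how to build the service; it holds no live resources.
design = {"endpoint": "https://api.example.com", "retries": 3}
driver_service = build_service(design)

service_ok = can_pickle(driver_service)   # the live object cannot be shipped
design_ok = can_pickle(design)            # the design can

# "Worker side": deserialize the design and build a local service from it.
worker_service = build_service(pickle.loads(pickle.dumps(design)))
print(service_ok, design_ok, worker_service.describe())
```

Because only plain settings cross the process boundary, each worker ends up with its own service instance and no shared live resources.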
  • 11. Flexible Format Conversion with MessagePack
      [Diagram labels: DataFrame, Airframe Codec (Pack/Unpack), MPC1, JDBC ResultSet, Plazma Public API]
  • 12. Spark 3.0 and PySpark Support
  • 13. Resources
      ● Airframe: https://wvlet.org/airframe/
      ● Airframe Meetup #1 ~ #3 reports
      ● ScalaMatsuri 2019 presentation
        ■ And more!!
      ● td-spark documentation: https://treasure-data.github.io/td-spark/
      ● See Also: Spark with Airframe (@smdmts)
      ● Spark to Spark data transfer with MessagePack-based airframe-codec
      ● Spark -> AWS service call management with airframe-control
  • 14. Appendix
  • 15. Treasure Data: A Ready-to-Use Cloud Data Platform
      [Diagram labels: Logs, Device Data, Batch Data, Data Collection, Cloud Storage, PlazmaDB, Table Schema, Distributed Data Processing, Jobs, Job Management, SQL Editor, Scheduler, Workflows, Machine Learning, Treasure Data OSS, Third Party OSS, Data]
  • 16. TDSparkContext
  • 17. td-pyspark
      ● Supporting PySpark
      ● Access Scala methods of td-spark:
        ■ sparkContext._jvm.(jvm package name).method(...)
      ● Conversion to PySpark’s DataFrame
      ● DataFrame(Scala DataFrame, sqlContext)
  • 18. Predicate Pushdown
      ● Traverse DataFrame Column Filters
      ● Extract time conditions (e.g., -1d, -1w, -7d, etc.)
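The two steps on slide 18 (walk the column filters, pull out the time conditions) can be sketched in Python as follows. This is an illustrative model, not td-spark's implementation: `Filter` and `split_time_filters` are hypothetical names, and it assumes the time column is called `time` and holds integer Unix timestamps.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Filter:
    """Hypothetical stand-in for a pushed-down column filter: column OP value."""
    column: str
    op: str
    value: int


def split_time_filters(
    filters: List[Filter], time_column: str = "time"
) -> Tuple[Optional[int], Optional[int], List[Filter]]:
    """Separate time bounds from the remaining filters.

    Returns (start_inclusive, end_exclusive, other_filters). A bound is None
    when the filters do not constrain that side of the range.
    """
    start: Optional[int] = None
    end: Optional[int] = None
    rest: List[Filter] = []
    for f in filters:
        if f.column == time_column and f.op in (">=", ">"):
            lower = f.value if f.op == ">=" else f.value + 1
            start = lower if start is None else max(start, lower)
        elif f.column == time_column and f.op in ("<", "<="):
            upper = f.value if f.op == "<" else f.value + 1
            end = upper if end is None else min(end, upper)
        else:
            rest.append(f)  # not a time range condition: evaluate after the scan
    return start, end, rest


filters = [
    Filter("time", ">=", 1596153600),
    Filter("status", "=", 200),
    Filter("time", "<", 1596157200),
]
print(split_time_filters(filters))
```

The extracted range can then be used to limit which partitions are read, while the remaining filters are applied to the rows that come back.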
  • 19. Class Loader Hierarchy of Databricks
      ● Base class loader
      ● User library class loader
      ● td-spark.jar will be loaded here
      ● Shared between multiple notebooks
        ■ Static variables used inside td-spark.jar will be shared by multiple notebooks!
      ● REPL class loader
      ● Shared between multiple notebooks
      ● Spark-library class loader
      ● Notebook-local
      ● Notebook-local class loader
      ● Caching local instances in static variables in td-spark caused ClassNotFound errors in other notebooks
  • 20. Using Presto with Spark
      ● presto-jdbc
      ● Submit select * from (Original SQL) limit 0 => Query result schema
      ● JDBC ResultSet => Airframe Codec => DataFrame
  • 21. td-prestobase: A Proxy Gateway to Presto Clusters
      ● td-prestobase is a proxy gateway to Presto clusters that speaks the standard Presto protocol, so it supports any Presto client (e.g., presto-cli, jdbc, odbc, etc.)
      ● td-spark uses presto-jdbc and td-prestobase APIs for making Presto queries
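The schema lookup described on slide 20 (wrap the original SQL in `select * from (...) limit 0` and read the result columns) can be tried out with Python's built-in `sqlite3` module. SQLite is only a stand-in for Presto here, and `schema_probe_sql` and `result_columns` are hypothetical helper names; td-spark itself goes through presto-jdbc.

```python
import sqlite3
from typing import List


def schema_probe_sql(original_sql: str) -> str:
    """Wrap a query so that it returns no rows but keeps its result columns."""
    inner = original_sql.strip().rstrip(";")
    return f"select * from ({inner}) limit 0"


def result_columns(conn: sqlite3.Connection, original_sql: str) -> List[str]:
    """Run the zero-row probe and read column names from the cursor metadata."""
    cur = conn.execute(schema_probe_sql(original_sql))
    return [col[0] for col in cur.description]


conn = sqlite3.connect(":memory:")
conn.execute("create table events (time integer, user_id text, amount real)")

columns = result_columns(
    conn, "select user_id, sum(amount) as total from events group by user_id;"
)
print(columns)
```

The probe asks the engine for zero rows, so the column metadata is available before the full query result is fetched.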