T R E A S U R E D A T A
Scala at Treasure Data
Taro L. Saito - GitHub:@xerial
Ph.D., Software Engineer at Treasure Data, Inc.
Treasure Data Tech Talk @ Tokyo, June 13, 2017
1
Why Scala?
• Scala is not an official programming language of Treasure Data
• Three years ago, I was the only engineer at TD who could write Scala
• Now all of my team members can write Scala
• Fact: Java experts can quickly learn Scala
https://www.treasuredata.com/company/careers/
2
Challenge: Increased Presto Usage at Treasure Data (2017)
Processing 15 Trillion Rows / Day (= 173 Million Rows / sec.)
150,000~ Queries / Day
1,500~ Users
• How do we improve the service by utilizing this massive amount of query logs?
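The rows-per-second figure above follows from dividing the daily row count by the 86,400 seconds in a day. A minimal check of that arithmetic (the object name is only illustrative):

```scala
object PrestoThroughput {
  // 15 trillion rows per day, as stated on the slide.
  val rowsPerDay: Long = 15L * 1000L * 1000L * 1000L * 1000L
  // 24 hours * 60 minutes * 60 seconds = 86,400.
  val secondsPerDay: Long = 24L * 60L * 60L
  // Integer division gives 173,611,111, i.e. roughly 173 million rows per second.
  val rowsPerSecond: Long = rowsPerDay / secondsPerDay
}
```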
[Diagram labels: Query Logs, Store, Analyze, SQL, Improve & Optimize]
3
A Success Story: Using Scala in Genome Science
4
Scala Use Cases in TD
• Analyzing Query Engine Logs
• Data analytics workflows written in Scala
• For finding effective optimization approaches
• Prestobase
• Management Base of Presto
• Gateway to access Presto (Finagle + Presto)
• Monitoring + Runtime Analysis
• Spark Integration
• Accessing Treasure Data from Spark
5
Open-Source Scala Libraries Developed at TD
• Libraries that make Scala programming fun
• wvlet-log: handy logging library: https://github.com/wvlet/log
• Airframe: Dependency Injection Library http://wvlet.org/airframe
• Airframe Config: YAML-based configuration library (a module in Airframe)
• Heavy use of meta-programming via Scalamacros
• sbt plugins
• Data analytics
• sbt-sql: https://github.com/xerial/sbt-sql
• Deployment
• sbt-pack: https://github.com/xerial/sbt-pack
• sbt-sonatype: https://github.com/xerial/sbt-sonatype
6
What is Scalamacros?
• Generates Scala code at compile-time
• Meta-programming (Writing a program that writes programs)
• Experimental in Scala 2.10, 2.11, and 2.12
• Scalamacros will no longer be experimental
• Productization within 2017
• https://github.com/scalamacros/scalamacros
• Developed by the Scala macro author (@xeno-by), the IntelliJ team, and an EPFL Ph.D. student
• Support Scala 2.12 (and maybe Scala 2.11) and 3.0
• Announced at Scala Meetup at Twitter HQ, San Francisco
7
What is Scala 3.x?
• Scala 3.x
• Replaces the compiler with Dotty for faster compilation and better IDE integration
• Dotty: Compilers Are Databases (Martin Odersky, Scala’s creator)
• https://www.youtube.com/watch?v=WxyyJyB_Ssc
• Because the compiler needs to answer …
• Q: What is the signature of method A.f at a given point in time?
• class A[T] { def f(x: T): T = … }
• Compiler itself, IDE (e.g., IntelliJ), etc.
• Need to know these temporal types (Denotation)
8
Open-Source Scala Libraries in TD
9
Logging Library: Hard to Use
• Logging configuration is hard
• slf4j, log4j, logback-classic, etc.
• XML configuration, etc.
• Need to have redundant getLogger calls
[Figure: Embulk log configuration with logback-classic]
10
Dependency Hell of slf4j
• slf4j (Simple Logging Facade for Java)
• The de facto standard logging API for Java
• scala-logging: slf4j wrapper for Scala
• Switches log outputs
• Using a binding library in classpath
• slf4j-nop (no output)
• slf4j-simple (console output)
• slf4j-log4j (output to log4j)
• Pitfall
• Cannot have multiple binders
• But must have 1 binder (!!!)
• de facto = many bad users
• e.g., hadoop
• Doesn’t consider downstream users: includes slf4j-log4j as a direct dependency
• Need to exclude the slf4j-log4j binding from all Hadoop-related dependencies
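The exclusion described above can be sketched in an sbt build file. This is only an illustration: the Hadoop and slf4j versions are assumptions, and `slf4j-log4j12` is the artifact ID under which the log4j binding is commonly published.

```scala
// build.sbt (sketch): the versions below are illustrative assumptions.
libraryDependencies ++= Seq(
  // Drop the log4j binding that Hadoop pulls in transitively.
  ("org.apache.hadoop" % "hadoop-common" % "2.7.3")
    .exclude("org.slf4j", "slf4j-log4j12"),
  // Keep exactly one binding of your own choosing on the classpath.
  "org.slf4j" % "slf4j-simple" % "1.7.25"
)
```

The same exclusion has to be repeated for every dependency that brings the binding in, which is the maintenance burden the slide complains about.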
11
wvlet-log github.com/wvlet/log
• Favors Simplicity
• Uses Scalamacros to simplify user code
• Only need to extend LogSupport trait
• No getLogger call
• Using standard java.util.logging
• No other dependency required
• Features
• Show source code locations of logs
• Log format is configurable in the code (no XML or plugins!)
• Changing log levels with files or JMX
• log.properties
• log-test.properties
• Built-in log handlers
• log-rotate handler, async handler
• Works with Scala.js to show logs in Web browser console
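The idea of mixing in a trait instead of calling getLogger can be sketched with plain java.util.logging. This is not wvlet-log's implementation (which generates the logging code with macros); `SimpleLogSupport` and `QueryAnalyzer` are hypothetical names, and by-name parameters stand in for the macro-generated level check.

```scala
import java.util.logging.{Level, Logger}

// Hypothetical stand-in for a LogSupport-style trait; not wvlet-log's actual code.
trait SimpleLogSupport {
  // Logger named after the concrete class, created once on first use.
  protected lazy val logger: Logger = Logger.getLogger(getClass.getName)

  // By-name parameters defer building the message until the level check passes.
  protected def info(message: => String): Unit =
    if (logger.isLoggable(Level.INFO)) logger.info(message)

  protected def debug(message: => String): Unit =
    if (logger.isLoggable(Level.FINE)) logger.fine(message)
}

// User code only extends the trait; there is no getLogger call here.
class QueryAnalyzer extends SimpleLogSupport {
  def run(queryCount: Int): Int = {
    info(s"analyzing $queryCount queries")
    // With the default INFO level, this message string is never built.
    debug(s"expensive detail: ${(1 to queryCount).sum}")
    queryCount
  }
}
```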
12
wvlet-log: Logging code generation with Scalamacros
• Generate low-overhead logging code
• Quasiquote
• q"… Scala code …"
• Just writing Scala code template in macros
13
Airframe: wvlet.org/airframe/
• Dependency Injection Library for Scala
• Best practices of building objects in Scala
• We needed Google Guice for Scala
• But there is no good alternative
• Guice, Dagger2, Scaldi, Macwire, etc.
• http://wvlet.org/airframe/docs/comparison.html
• Using Google Guice in Scala
• PlayFramework
• Weird syntax
• Airframe uses Scalamacros to simplify DI in Scala
14
Airframe
• Three-step DI in Scala
• Bind
• Design
• Build
• Built-in life cycle manager
• Session start/shutdown
• e.g., connection open/close
• Session
• Manages singletons and binding rules
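The bind / design / build steps and the session life cycle can be illustrated with a plain-Scala miniature. This is a sketch of the pattern only, not Airframe's API: `MiniDesign`, `MiniSession`, `ConnectionPool`, and `QueryService` are hypothetical names, and Airframe itself uses macros rather than explicit factory functions.

```scala
import scala.collection.mutable
import scala.reflect.ClassTag

// Hypothetical miniature of the bind / design / build idea; this is not Airframe's API.
final class MiniDesign(bindings: Map[Class[_], MiniSession => Any]) {
  // Design: register how to build an A. Returns a new immutable design.
  def bind[A](factory: MiniSession => A)(implicit tag: ClassTag[A]): MiniDesign = {
    val entry: (Class[_], MiniSession => Any) = (tag.runtimeClass, factory)
    new MiniDesign(bindings + entry)
  }
  def factoryFor(cls: Class[_]): MiniSession => Any = bindings(cls)
  def newSession: MiniSession = new MiniSession(this)
}

object MiniDesign { val empty: MiniDesign = new MiniDesign(Map.empty) }

// Session: holds singletons and runs shutdown hooks (life cycle management).
final class MiniSession(design: MiniDesign) {
  private val singletons = mutable.Map.empty[Class[_], Any]
  private var shutdownHooks = List.empty[() => Unit]

  // Build: create the object on first request, then reuse it.
  def build[A](implicit tag: ClassTag[A]): A = {
    val key = tag.runtimeClass
    val instance = singletons.get(key) match {
      case Some(existing) => existing
      case None =>
        val created = design.factoryFor(key)(this)
        singletons.update(key, created)
        created
    }
    instance.asInstanceOf[A]
  }
  def onShutdown(hook: () => Unit): Unit = shutdownHooks = hook :: shutdownHooks
  def shutdown(): Unit = shutdownHooks.foreach(hook => hook())
}

final class ConnectionPool {
  var isOpen = true
  def close(): Unit = { isOpen = false }
}
final class QueryService(val pool: ConnectionPool)

object DiSketch {
  // How to build objects lives in the design, separate from the code that uses them.
  val design: MiniDesign = MiniDesign.empty
    .bind[ConnectionPool] { session =>
      val pool = new ConnectionPool
      session.onShutdown(() => pool.close())
      pool
    }
    .bind[QueryService](session => new QueryService(session.build[ConnectionPool]))
}
```

A caller would create a session with `DiSketch.design.newSession`, obtain objects with `session.build[QueryService]`, and call `session.shutdown()` to close the pool.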
15
Clear Separation of Concerns
• Traditional Service Building:
• With Airframe:
• Clear separation of concerns:
• How to build objects (design)
• How to use objects (bind)
• Simplest DI pattern for Scala
• Callouts on the slide's code samples: How to build dependencies; Just use components!; Need to remember argument order
16
Airframe Internals (Advanced)
• Code generation with Scalamacros
• Passing a Session when building App and A
• http://wvlet.org/airframe/docs/internals.html
17
Customizing Prestobase Filters with Airframe
• Prestobase Proxy: Gateway to access Presto
• Adding TD-specific bindings
• Finagle filters -> Injecting TD-specific filters
18
VCR Record/Replay for Testing Presto
• Launching Presto requires a lot of memory (e.g., 2GB or more)
• Often crashes CI service containers (TravisCI, CircleCI, etc.)
• Recording Presto responses (prestobase-vcr)
• with sqlite-jdbc: https://github.com/xerial/sqlite-jdbc
• DB file for each test suite
• Enables testing with a small memory footprint
• Can run many Presto tests in CI
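The record/replay idea can be sketched as a small wrapper. This is an assumption-laden illustration rather than prestobase-vcr itself: `VcrClient` is a hypothetical name, and an in-memory map stands in for the per-test-suite SQLite file.

```scala
import scala.collection.mutable

// Hypothetical record/replay wrapper; the real tool stores recordings in SQLite.
final class VcrClient(
    backend: Option[String => Seq[String]],   // live query runner, absent in replay mode
    tape: mutable.Map[String, Seq[String]]    // recorded responses keyed by SQL text
) {
  def query(sql: String): Seq[String] = tape.get(sql) match {
    // Replay: answer from the recording without touching the backend.
    case Some(rows) => rows
    case None =>
      backend match {
        // Record: run the query once and store the response.
        case Some(run) =>
          val rows = run(sql)
          tape.update(sql, rows)
          rows
        case None =>
          throw new NoSuchElementException(s"no recording for: $sql")
      }
  }
}
```

In CI, a client built with `backend = None` and a pre-recorded tape never needs a running Presto process.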
19
Airframe Config
• YAML is useful for configuring applications
• Embedding YAML configurations inside docker images
• Provide credentials in a separate manner
• password, API keys, instance specific param, etc.
• properties file, environment variables, etc.
• YAML + overrides + object mapping
• http://wvlet.org/airframe/docs/config.html
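The "YAML + overrides + object mapping" flow can be sketched without any library. This is not Airframe Config's API: `ServerConfig`, `ConfigSketch`, and the key names are hypothetical, and plain maps stand in for the parsed YAML file and the override sources.

```scala
// Hypothetical target object for the mapping step.
final case class ServerConfig(host: String, port: Int)

object ConfigSketch {
  // Later sources win: packaged defaults first, then each override source in order
  // (for example a properties file, then environment variables).
  def load(defaults: Map[String, String], overrides: Map[String, String]*): ServerConfig = {
    val merged = overrides.foldLeft(defaults)(_ ++ _)
    ServerConfig(host = merged("server.host"), port = merged("server.port").toInt)
  }
}
```

Keeping credentials and instance-specific values in the override sources means the defaults embedded in a Docker image never need to contain secrets.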
20
Airframe Internal: Surface
• Surface: Object surface (shape) inspector library
• https://github.com/wvlet/airframe/tree/master/surface
• case class A(id:Int, name:String)
• surface.of[A]
• => Surface("A", Seq(Param("id", surface.of[Int]), Param("name", surface.of[String])))
• Extract object type parameters with Scala Runtime Reflection
• Scala generates this type information at compile time
• Used as Type Identifiers of Airframe and Airframe Config
• e.g., [A], [Seq[B]], [Map[Int, String]], [A @@ Tag], etc.
• Generating serializer/deserializer of Scala classes
• Surface => Serialize object parameters => Encoding in MessagePack.gz => Embulk
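To see why compile-time type information matters, here is a sketch that inspects a case class with plain Java reflection instead of Surface. `Account` and `ReflectionSketch` are hypothetical names; the point is that the element type of `tags` is erased at runtime, which is the kind of detail a Surface-style description preserves.

```scala
final case class Account(id: Int, name: String, tags: Seq[String])

object ReflectionSketch {
  // Lists (field name, erased type name) pairs using plain Java reflection.
  // The JVM does not guarantee field order, so treat the result as a set.
  def describe(cls: Class[_]): Seq[(String, String)] =
    cls.getDeclaredFields.toSeq
      .filterNot(_.isSynthetic)
      .map(field => field.getName -> field.getType.getSimpleName)
}
```

Calling `ReflectionSketch.describe(classOf[Account])` reports `tags` only as `Seq`; the `String` element type is no longer visible.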
21
td-spark
• Access TD from Spark
• Binding components with Airframe
• IO Manager, Presto Client, etc.
• Passing Design through SparkContext
• Integration
• TD -> Spark DataFrame
• TD Presto Query -> DataFrame
22
Data Analytics with Scala
23
New Directions Explored By Presto
• Traditional Database Usage
• Required Database Administrator (DBA)
• DBA designs the schema and queries
• DBA tunes query performance
• After Presto
• Schema is designed by data providers
• 1st-party data (user’s customer data)
• 3rd-party data sources
• Analysts or Marketers explore the data with Presto
• Don’t know the schema in advance
• Many Analytical SQL queries
24
Bridging Gaps Between SQL and Programming Language
• Traditional Approach
• O/R mapper: app developers design objects and schema, then generate SQL
• New Approach: SQL First
• Need to manage various SQL results inside programming language
• But How?
25
An Instinct
26
sbt-sql: https://github.com/xerial/sbt-sql
• sbt plugin for generating Scala model classes from SQL files
• src/main/sql/presto/*.sql (Presto Queries)
• Using SQL as a function
• Read Presto SQL Results as Objects
• Enabled managing SQL queries in GitHub
• Type-safe data analysis
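"Using SQL as a function" and "reading results as objects" can be pictured with a hand-written stand-in for a generated class. Everything here is hypothetical, including the class name, the SQL text, and the method names; the actual code generated by sbt-sql may look quite different.

```scala
// Hypothetical hand-written stand-in for a generated model class.
final case class TopUser(userId: String, queryCount: Long)

object TopUser {
  // The SQL template, as it might live in a file under src/main/sql/presto/,
  // exposed as a function of its parameter.
  def sql(minCount: Long): String =
    "select user_id, count(*) as query_count from query_log " +
      s"group by user_id having count(*) >= $minCount"

  // Maps one result row (column name -> value) to a typed object.
  def fromRow(row: Map[String, Any]): TopUser =
    TopUser(row("user_id").toString, row("query_count").toString.toLong)
}
```

Because the query lives in a versioned file and its result has a named type, changes to either show up in code review and at compile time.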
27
Scala in Production
28
Packaging
• Do you need to install Scala?
• No. Only JDK is required
• sbt-pack
• https://github.com/xerial/sbt-pack
• Creates distributable packages of Scala code
• In the ./target/pack folder
• Folder structure:
• bin/ - launch scripts
• lib/ - Scala/Java libraries
• Makes it easier to create Docker images
• Also used for creating distributable packages of td-spark
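A minimal sbt-pack setup might look like the following sketch. It assumes a recent sbt-pack release (older releases were enabled differently); the version placeholder, script name, and main class are illustrative and must be replaced.

```scala
// project/plugins.sbt: replace x.y.z with a released sbt-pack version.
addSbtPlugin("org.xerial.sbt" % "sbt-pack" % "x.y.z")

// build.sbt
enablePlugins(PackPlugin)
// Maps a launch script name to its main class; both names are placeholders.
packMain := Map("my-app" -> "example.Main")
```

Running `sbt pack` then produces the `bin/` and `lib/` folders described above under `./target/pack`.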
29
Deploying to Maven Central
• Necessary Steps
• Upload artifacts -> Close -> Release -> Drop
• Painful
• Need to log in to the Nexus Web UI
• Many manual steps
• Bintray?
• Uploading to Bintray -> Automatic sync to Maven Central
30
sbt-sonatype plugin
• Enable one-command release to Maven Central
• Using REST APIs of Sonatype NEXUS Repository Manager
• Developed during the 2015 New Year holiday
• Jan 5: Test Nexus REST API
• Jan 20: First release (Just 1 day effort)
• Released sbt-sonatype using sbt-sonatype
• 2,000+ projects are using sbt-sonatype
• Supporting sbt 0.13.x and 1.0.0
• And can be used for Java projects too
• Nexus to Maven Central sync is now fast
• Less than 10 minutes (June 2017)
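A build configured for a one-command release could be sketched as follows. The version placeholder and profile name are assumptions, and `publishSigned` comes from a separate signing plugin (sbt-pgp) that is not shown.

```scala
// project/plugins.sbt: replace x.y.z with a released sbt-sonatype version.
addSbtPlugin("org.xerial.sbt" % "sbt-sonatype" % "x.y.z")

// build.sbt
// Sonatype staging profile; "com.example" is a placeholder for your own group ID.
sonatypeProfileName := "com.example"
publishMavenStyle := true
// Snapshots go to the snapshot repository, releases to the staging repository.
publishTo := Some(
  if (isSnapshot.value) Opts.resolver.sonatypeSnapshots
  else Opts.resolver.sonatypeStaging
)
```

With this in place, `sbt publishSigned sonatypeRelease` uploads the artifacts and performs the close and release steps without the Nexus Web UI.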
31
Summary
• TD is a heavy user of Scala
• Analytics pipelines
• Production services
• Many libraries helping development
• Airframe, wvlet-log
• sbt plugins
• For details about Presto analysis
• Join Presto Meetup on Thursday!
32
Presto Meetup Tokyo: June 15, 2017 (Thu)
33

Scala at Treasure Data

  • 1. T R E A S U R E D A T A Scala at Treasure Data Taro L. Saito - GitHub:@xerial Ph.D., Software Engineer at Treasure Data, Inc. Treasure Data Tech Talk @ Tokyo, June 13, 2017 1
  • 2. Why Scala? • Scala is not an official programming language of Treasure Data • I was the only engineer who can write Scala in TD • 3 years ago • Now all of my team members can write Scala • Fact: Java experts can quickly learn Scala https://www.treasuredata.com/company/careers/ 2
  • 3. Challenge: Increased Presto Usage at Treasure Data (2017) Processing 15 Trillion Rows / Day 
 (= 173 Million Rows / sec.) 150,000~ Queries / Day 1,500~ Users • How do we improve the service by utilizing this massive amount of query logs? 3 Query Logs Store Analyze SQL Improve & Optimize
  • 4. A Success Story: Using Scala in Genome Science 4
  • 5. Scala Use Cases in TD • Analyzing Query Engine Logs • Data analytics workflows written in Scala • For finding effective optimization approaches • Prestobase • Management Base of Presto • Gateway to access Presto (Finagle + Presto) • Monitoring + Runtime Analysis • Spark Integration • Accessing to Treasure Data from Spark 5
  • 6. Open-Source Scala Libraries Developed at TD • Libraries that make Scala programming fun • wvlet-log: handy logging library: https://github.com/wvlet/log • Airframe: Dependency Injection Library http://wvlet.org/airframe • Airframe Config: YAML-based configuration library (a module in Airframe) • Heavy use of meta-programing via Scalamacros • sbt plugins • Data analytics • sbt-sql: https://github.com/xerial/sbt-sql • Deployment • sbt-pack: https://github.com/xerial/sbt-pack • sbt-sonatype: https://github.com/xerial/sbt-sonatype 6
  • 7. What is Scalamacros? • Generates Scala code at compile-time • Meta-programming (Writing a program that writes programs) • Experimental State at Scala 2.10, 2.11, and 2.12 • Scalamacros will no longer be experimental • Productization within 2017 • https://github.com/scalamacros/scalamacros • Scala Macro author (@xeno-by), IntelliJ team, EPFL Ph.D student • Support Scala 2.12 (and maybe Scala 2.11) and 3.0 • Announced at Scala Meetup at Twitter HQ, San Francisco 7
  • 8. What is Scala 3.x? • Scala 3.x • Replaces the compiler to Dotty for faster compilation and better integration with IDE • Dotty: Compilers Are Databases (Martin Odersky, Scala’s creator) • https://www.youtube.com/watch?v=WxyyJyB_Ssc • Because compiler needs to answer … • Q: What is the signature of 
 method A.f at a given point of time? • class A[T] { def f(x: T): T = … } • Compiler itself, IDE (e.g., IntelliJ), etc. • Need to know these temporal types (Denotation) 8
  • 10. Logging Library: Hard to Use • Logging configuration is hard • slf4j, log4j, logback-classic, etc. • XML configuration, etc. • Need to have redundant getLogger calls embulk log configuration with logback-classic 10
  • 11. Dependency Hell of slf4j • slf4j (simple logger for Java) • The de facto standard of Java logging library • scala-logging: slf4j wrapper for Scala • Switches log outputs • Using a binding library in classpath • slf4j-nop (no output) • slf4j-simple (console output) • slf4j-log4j (output to log4j) • Pitfall • Cannot have multiple binders • But must have 1 binder (!!!) • de facto = many bad users • e.g., hadoop • Doesn’t care the other people: Including slf4j-log4j in the direct dependency • Need to exclude slf4j-log4j bindings from all of hadoop-related projects 11
  • 12. wvlet-log github.com/wvlet/log • Favors Simplicity • Use Scalamacros to simplify user codes • Only need to extend LogSupport trait • No getLogger call • Using standard java.util.logging • No other dependency required • Features • Show source code locations of logs • Log format is configurable in the code (No XML nor plugin!) • Changing log levels with files or JMX • log.properties • log-test.properties • Built-in log handlers • log-rotate handler, async handler • Works with Scala.js to show logs in Web browser console 12
  • 13. wvlet-log: Logging code generation with Scalamacros • Generates low-overhead logging code • Quasiquotes • q"… Scala code" • Just write a Scala code template inside the macro
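To make the "extend a trait, no getLogger call, low overhead" idea concrete, here is a minimal sketch in plain Scala on top of java.util.logging. It is not wvlet-log's code: `SimpleLogSupport` and `QueryRunner` are hypothetical names, and by-name parameters stand in for the macro-generated level check, so source code locations are not captured.

```scala
import java.util.logging.{Level, Logger}

object LogSketch {
  // Hypothetical stand-in for the LogSupport idea; not wvlet-log's implementation.
  trait SimpleLogSupport {
    // One logger per concrete class, created lazily, so callers never write getLogger
    protected lazy val logger: Logger = Logger.getLogger(getClass.getName)

    // By-name message: the string is built only if the level is enabled
    protected def info(message: => Any): Unit =
      if (logger.isLoggable(Level.INFO)) logger.info(message.toString)

    protected def debug(message: => Any): Unit =
      if (logger.isLoggable(Level.FINE)) logger.fine(message.toString)
  }

  // Example user class: mixing in the trait is the only logging setup
  class QueryRunner extends SimpleLogSupport {
    def run(sql: String): Int = {
      info(s"Running query: $sql")
      debug(s"Normalized form: ${sql.toUpperCase}") // skipped unless FINE is enabled
      sql.length
    }
  }
}
```

A macro-based version can inline the same guard at each call site and record the file and line number at compile time, which a by-name parameter cannot do.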
  • 14. Airframe: wvlet.org/airframe/ • Dependency injection library for Scala • Best practices for building objects in Scala • We needed Google Guice for Scala • But there was no good alternative • Guice, Dagger2, Scaldi, Macwire, etc. • http://wvlet.org/airframe/docs/comparison.html • Using Google Guice in Scala • Play Framework • Weird syntax • Airframe uses Scalamacros to simplify DI in Scala
  • 15. Airframe • Three-step DI in Scala • Bind • Design • Build • Built-in lifecycle manager • Session start/shutdown • e.g., connection open/close • Session • Manages singletons and binding rules
  • 16. Clear Separation of Concerns • Traditional service building: • With Airframe: • Clear separation of concerns: • How to build objects (design) • How to use objects (bind) • Simplest DI pattern for Scala • Slide callouts: "How to build dependencies", "Just use components!", "Need to remember argument orders"
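The design/bind/build split can be illustrated with a tiny hand-written container. This is a sketch under stated assumptions, not Airframe's API: `MiniDI`, `Design`, `Session`, and the example classes are hypothetical, and factories are passed explicitly where Airframe relies on Scalamacros.

```scala
import scala.collection.mutable
import scala.reflect.ClassTag

object MiniDI {
  // Design: immutable rules describing how to build each type (hypothetical, not Airframe)
  final class Design(val rules: Map[String, Session => Any] = Map.empty) {
    def bind[T](factory: Session => T)(implicit ct: ClassTag[T]): Design =
      new Design(rules + (ct.runtimeClass.getName -> factory))
    def newSession: Session = new Session(this)
  }

  // Session: holds the singletons created from the design's binding rules
  final class Session(design: Design) {
    private val singletons = mutable.Map.empty[String, Any]

    def build[T](implicit ct: ClassTag[T]): T = {
      val key = ct.runtimeClass.getName
      val instance = singletons.get(key) match {
        case Some(existing) => existing // reuse the session-scoped singleton
        case None =>
          val factory = design.rules.getOrElse(key, sys.error(s"No binding for $key"))
          val created = factory(this)
          singletons.put(key, created)
          created
      }
      instance.asInstanceOf[T]
    }
  }

  // Example components: each one only states what it uses
  final case class DbConfig(url: String)
  final class Db(val config: DbConfig)
  final class Service(val db: Db)

  // How to build objects lives in one place, separate from how they are used
  val design: Design = new Design()
    .bind[DbConfig](_ => DbConfig("jdbc:sqlite::memory:"))
    .bind[Db](s => new Db(s.build[DbConfig]))
    .bind[Service](s => new Service(s.build[Db]))
}
```

Calling `MiniDI.design.newSession.build[MiniDI.Service]` wires the whole graph, and callers never pass constructor arguments by hand, so there is no argument order to remember.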
  • 17. Airframe Internals (Advanced) • Code generation with Scalamacros • Passes a Session when building App and A • http://wvlet.org/airframe/docs/internals.html
  • 18. Customizing Prestobase Filters with Airframe • Prestobase Proxy: gateway to access Presto • Adding TD-specific bindings • Finagle filters -> injecting TD-specific filters
  • 19. VCR Record/Replay for Testing Presto • Launching Presto requires a lot of memory (e.g., 2GB or more) • Often crashes CI service containers (TravisCI, CircleCI, etc.) • Recording Presto responses (prestobase-vcr) • With sqlite-jdbc: https://github.com/xerial/sqlite-jdbc • One DB file for each test suite • Enables testing with a small memory footprint • Can run many Presto tests in CI
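The record/replay idea can be sketched in a few lines. This is not prestobase-vcr itself: `Vcr` is a hypothetical class, an in-memory map stands in for the per-suite SQLite file, and query results are simplified to rows of strings.

```scala
import scala.collection.mutable

object VcrSketch {
  // Hypothetical recorder; the real tool stores responses in SQLite via sqlite-jdbc.
  final class Vcr(runQuery: String => Seq[Seq[String]]) {
    private val cassette = mutable.LinkedHashMap.empty[String, Seq[Seq[String]]]
    var backendCalls = 0 // how many times the real backend was actually hit

    def query(sql: String): Seq[Seq[String]] =
      cassette.get(sql) match {
        case Some(recorded) => recorded // replay: no Presto needed
        case None =>
          backendCalls += 1
          val rows = runQuery(sql)      // record: ask the real backend once
          cassette.put(sql, rows)
          rows
      }
  }
}
```

Once the cassette is persisted, a CI run only replays recorded responses and never has to start a memory-hungry Presto process.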
  • 20. Airframe Config • YAML is useful for configuring applications • Embed YAML configurations inside Docker images • Provide credentials separately • Passwords, API keys, instance-specific parameters, etc. • Via properties files, environment variables, etc. • YAML + overrides + object mapping • http://wvlet.org/airframe/docs/config.html
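The "YAML + overrides + object mapping" flow can be sketched without any library. This is an assumption-laden illustration, not Airframe Config's API: `ConfigSketch`, `ServerConfig`, and the key names are hypothetical, and the YAML file is represented by an already-parsed map.

```scala
object ConfigSketch {
  // Target object that the application code actually consumes
  final case class ServerConfig(host: String, port: Int, password: Option[String])

  // Later maps win: defaults (e.g. parsed from YAML baked into the image) are
  // overridden by properties files or environment variables supplied at deploy time
  def load(defaults: Map[String, String], overrides: Map[String, String]*): ServerConfig = {
    val merged = overrides.foldLeft(defaults)(_ ++ _)
    ServerConfig(
      host = merged.getOrElse("server.host", "localhost"),
      port = merged.get("server.port").map(_.toInt).getOrElse(8080),
      password = merged.get("server.password") // credentials never live in the image
    )
  }
}
```

Keeping secrets in the override layer lets the same Docker image run in every environment.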
  • 21. Airframe Internal: Surface • Surface: object surface (shape) inspector library • https://github.com/wvlet/airframe/tree/master/surface • case class A(id:Int, name:String) • surface.of[A] • => Surface("A", Seq(Param("id", surface.of[Int]), Param("name", surface.of[String]))) • Extracts object type parameters with Scala runtime reflection • Scala generates this type information at compile time • Used as type identifiers in Airframe and Airframe Config • e.g., [A], [Seq[B]], [Map[Int, String]], [A @@ Tag], etc. • Generating serializers/deserializers for Scala classes • Surface => serialize object parameters => encode in MessagePack.gz => Embulk
  • 22. td-spark • Access TD from Spark • Binding components with Airframe • IO Manager, Presto Client, etc. • Passing the Design through SparkContext • Integration • TD -> Spark DataFrame • TD Presto query -> DataFrame
  • 24. New Directions Explored by Presto • Traditional database usage • Required a database administrator (DBA) • The DBA designs the schema and queries • The DBA tunes query performance • After Presto • The schema is designed by data providers • 1st-party data (the user’s customer data) • 3rd-party data sources • Analysts or marketers explore the data with Presto • They don’t know the schema in advance • Many analytical SQL queries
  • 25. Bridging Gaps Between SQL and Programming Language • Traditional approach • O/R mapper: app developers design objects and schema, then generate SQL • New approach: SQL first • Need to manage various SQL results inside the programming language • But how?
  • 27. sbt-sql: https://github.com/xerial/sbt-sql • sbt plugin for generating Scala model classes from SQL files • src/main/sql/presto/*.sql (Presto queries) • Using SQL as a function • Reads Presto SQL results as objects • Enables managing SQL queries on GitHub • Type-safe data analysis
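"Using SQL as a function" and "results as objects" can be pictured with a hand-written model of the kind a generator might emit. Everything here is hypothetical: the file name, the `TopUser` class, the `query_logs` table, and the method names are illustrative and do not reproduce sbt-sql's actual generated code.

```scala
object SqlAsFunction {
  // Hypothetical model for a file such as src/main/sql/presto/top_users.sql
  final case class TopUser(userId: Long, queries: Long)

  object TopUser {
    // The query text is exposed as a function of its parameters
    def sql(limit: Int): String =
      s"select user_id, count(*) as queries from query_logs group by 1 order by 2 desc limit $limit"

    // Maps one untyped result row (column name -> value) to a typed object
    def fromRow(row: Map[String, Any]): TopUser =
      TopUser(
        userId = row("user_id").asInstanceOf[Number].longValue,
        queries = row("queries").asInstanceOf[Number].longValue
      )
  }
}
```

With result types checked by the compiler, renaming a column in the SQL file surfaces as a compile error in the analysis code instead of a runtime failure.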
  • 29. Packaging • Do you need to install Scala? • No. Only the JDK is required • sbt-pack • https://github.com/xerial/sbt-pack • Creates Scala code packages for release • In the ./target/pack folder • Folder structure: • bin/ - launch scripts • lib/ - Scala/Java libraries • Makes it easier to create Docker images • Also used for creating distributable packages of td-spark
  • 30. Deploying to Maven Central • Necessary steps • Upload artifacts -> Close -> Release -> Drop • Painful • Need to log in to the Nexus web UI • Many manual steps • Bintray? • Upload to Bintray -> automatic sync to Maven Central
  • 31. sbt-sonatype plugin • Enables one-command release to Maven Central • Uses the REST APIs of Sonatype Nexus Repository Manager • Developed during the 2015 New Year holiday • Jan 5: Tested the Nexus REST API • Jan 20: First release (just 1 day of effort) • Released sbt-sonatype using sbt-sonatype • 2,000+ projects are using sbt-sonatype • Supports sbt 0.13.x and 1.0.0 • Can be used for Java projects too • Nexus to Maven Central sync is now fast • Less than 10 minutes (June 2017)
  • 32. Summary • TD is a heavy user of Scala • Analytics pipelines • Production services • Many libraries helping development • Airframe, wvlet-log • sbt plugins • For details about Presto analysis • Join the Presto Meetup on Thursday! • Presto Meetup Tokyo: June 15, 2017 (Thu)