Scala at Treasure Data
1. TREASURE DATA
Scala at Treasure Data
Taro L. Saito - GitHub:@xerial
Ph.D., Software Engineer at Treasure Data, Inc.
Treasure Data Tech Talk @ Tokyo, June 13, 2017
2. Why Scala?
• Scala is not an official programming language of Treasure Data
• Three years ago, I was the only engineer at TD who could write Scala
• Now all of my team members can write Scala
• Fact: Java experts can quickly learn Scala
https://www.treasuredata.com/company/careers/
3. Challenge: Increased Presto Usage at Treasure Data (2017)
Processing 15 Trillion Rows / Day
(= 173 Million Rows / sec.)
150,000~ Queries / Day
1,500~ Users
• How do we improve the service by utilizing this massive amount of query logs?
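The per-second figure follows directly from the daily row count. A minimal check of that arithmetic (the object and value names are only illustrative):

```scala
// Back-of-the-envelope check of the throughput figure quoted above.
object Throughput {
  val rowsPerDay: Long    = 15L * 1000 * 1000 * 1000 * 1000 // 15 trillion rows
  val secondsPerDay: Long = 24L * 60 * 60                   // 86,400 seconds
  val rowsPerSecond: Long = rowsPerDay / secondsPerDay      // integer division
}
```

Dividing by one million gives 173, which matches the rounded number on the slide.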
[Figure: Query Logs -> Store -> Analyze (SQL) -> Improve & Optimize]
5. Scala Use Cases in TD
• Analyzing Query Engine Logs
• Data analytics workflows written in Scala
• For finding effective optimization approaches
• Prestobase
• Management Base of Presto
• Gateway to access Presto (Finagle + Presto)
• Monitoring + Runtime Analysis
• Spark Integration
• Accessing Treasure Data from Spark
6. Open-Source Scala Libraries Developed at TD
• Libraries that make Scala programming fun
• wvlet-log: handy logging library: https://github.com/wvlet/log
• Airframe: Dependency Injection Library http://wvlet.org/airframe
• Airframe Config: YAML-based configuration library (a module in Airframe)
• Heavy use of meta-programming via Scalamacros
• sbt plugins
• Data analytics
• sbt-sql: https://github.com/xerial/sbt-sql
• Deployment
• sbt-pack: https://github.com/xerial/sbt-pack
• sbt-sonatype: https://github.com/xerial/sbt-sonatype
7. What is Scalamacros?
• Generates Scala code at compile-time
• Meta-programming (Writing a program that writes programs)
• Experimental in Scala 2.10, 2.11, and 2.12
• Scalamacros will no longer be experimental
• Productization within 2017
• https://github.com/scalamacros/scalamacros
• Scala Macro author (@xeno-by), IntelliJ team, EPFL Ph.D student
• Support Scala 2.12 (and maybe Scala 2.11) and 3.0
• Announced at Scala Meetup at Twitter HQ, San Francisco
8. What is Scala 3.x?
• Scala 3.x
• Replaces the compiler with Dotty for faster compilation and better IDE integration
• Dotty: Compilers Are Databases (Martin Odersky, Scala’s creator)
• https://www.youtube.com/watch?v=WxyyJyB_Ssc
• Because compiler needs to answer …
• Q: What is the signature of method A.f at a given point in time?
• class A[T] { def f(x: T): T = … }
• Compiler itself, IDE (e.g., IntelliJ), etc.
• Need to know these temporal types (Denotation)
10. Logging Library: Hard to Use
• Logging configuration is hard
• slf4j, log4j, logback-classic, etc.
• XML configuration, etc.
• Need to have redundant getLogger calls
[Figure: Embulk log configuration with logback-classic]
11. Dependency Hell of slf4j
• slf4j (Simple Logging Facade for Java)
• The de facto standard of Java logging library
• scala-logging: slf4j wrapper for Scala
• Switches log outputs
• Using a binding library in classpath
• slf4j-nop (no output)
• slf4j-simple (console output)
• slf4j-log4j (output to log4j)
• Pitfall
• Cannot have multiple binders
• But must have 1 binder (!!!)
• de facto = many bad users
• e.g., hadoop
• Doesn't care about other users: includes slf4j-log4j as a direct dependency
• Need to exclude slf4j-log4j bindings from all hadoop-related projects
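The exclusion described above could look roughly like this in an sbt build. The artifact names and versions are assumptions chosen for illustration, not taken from the talk:

```scala
// build.sbt (fragment). Artifact names and versions are illustrative assumptions.
libraryDependencies ++= Seq(
  // Drop the slf4j binding that Hadoop pulls in transitively
  ("org.apache.hadoop" % "hadoop-client" % "2.7.3")
    .exclude("org.slf4j", "slf4j-log4j12"),
  // Then add exactly one binding of your own choice
  "org.slf4j" % "slf4j-simple" % "1.7.25"
)
```

Every Hadoop-related dependency that brings in the binding needs the same exclusion, which is why the slide calls this a pitfall.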
12. wvlet-log github.com/wvlet/log
• Favors Simplicity
• Use Scalamacros to simplify user codes
• Only need to extend LogSupport trait
• No getLogger call
• Using standard java.util.logging
• No other dependency required
• Features
• Show source code locations of logs
• Log format is configurable in the code (no XML or plugins!)
• Changing log levels with files or JMX
• log.properties
• log-test.properties
• Built-in log handlers
• log-rotate handler, async handler
• Works with Scala.js to show logs in Web browser console
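To illustrate the "extend a trait, no getLogger call" idea on top of java.util.logging, here is a minimal sketch. `SimpleLogSupport` and `QueryRunner` are hypothetical names; this is a stand-in for the concept, not wvlet-log's actual implementation, which relies on macros:

```scala
import java.util.logging.{Level, Logger}

// Hypothetical stand-in for a LogSupport-style trait; not wvlet-log's actual code.
trait SimpleLogSupport {
  // One logger per concrete class, created lazily, so user code never calls getLogger
  protected lazy val logger: Logger = Logger.getLogger(getClass.getName)

  // By-name message: the string is only built when the level is enabled
  protected def info(message: => String): Unit =
    if (logger.isLoggable(Level.INFO)) logger.log(Level.INFO, message)

  protected def debug(message: => String): Unit =
    if (logger.isLoggable(Level.FINE)) logger.log(Level.FINE, message)
}

class QueryRunner extends SimpleLogSupport {
  var expensiveCalls = 0
  private def describe(): String = { expensiveCalls += 1; "query plan details" }

  def run(): Unit = {
    info("running query")
    debug(describe()) // not evaluated while FINE is disabled (the JDK default)
  }
}
```

With the JDK's default logging configuration, `new QueryRunner().run()` emits the INFO line and never calls `describe()`.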
13. wvlet-log: Logging code generation with Scalamacros
• Generate low-overhead logging code
• Quasiquote
• q"… scala code"
• Just writing Scala code template in macros
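A quasiquote can be tried outside a macro through runtime reflection. The sketch below assumes Scala 2.x with scala-reflect on the classpath; `logger` and `level` are free names that a real macro would resolve at the call site, and the guard shape is an assumption about what low-overhead logging code looks like rather than wvlet-log's generated output:

```scala
import scala.reflect.runtime.universe._

// Assumes Scala 2.x with scala-reflect on the classpath.
// `logger` and `level` are free names that would resolve at the macro call site.
object QuasiquoteSketch {
  val message: String = "hello"

  // The template is ordinary Scala; $message is spliced in as a string literal
  val guardedLog: Tree =
    q"if (logger.isLoggable(level)) logger.log(level, $message)"

  // Render the generated tree back to source text
  val generated: String = showCode(guardedLog)
}
```

Printing `QuasiquoteSketch.generated` shows the code the template expands to, which is a convenient way to inspect a template before moving it into a macro.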
14. Airframe: wvlet.org/airframe/
• Dependency Injection Library for Scala
• Best practices of building objects in Scala
• We needed Google Guice for Scala
• But there is no good alternative
• Guice, Dagger2, Scaldi, Macwire, etc.
• http://wvlet.org/airframe/docs/comparison.html
• Using Google Guice in Scala
• PlayFramework
• Weird syntax
• Airframe uses Scalamacros to simplify DI in Scala
15. Airframe
• Three step DI in Scala
• Bind
• Design
• Build
• Built-in life cycle manager
• Session start/shutdown
• e.g., connection open/close
• Session
• Manage singletons and binding rules
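The bind / design / build steps and the session life cycle can be sketched in plain Scala. Everything below (`Design`, `Session`, `onShutdown`, `Connection`, `App`) is a hypothetical miniature written for this illustration; it is not Airframe's API, which uses macros to avoid the explicit factory functions:

```scala
import scala.collection.mutable
import scala.reflect.ClassTag

// Hypothetical miniature of bind / design / build. This is not Airframe's API.
final class Design(factories: Map[Class[_], Session => Any] = Map.empty) {
  // Design: declare how an object of type A is built
  def bind[A](factory: Session => A)(implicit tag: ClassTag[A]): Design =
    new Design(factories + (tag.runtimeClass -> factory))

  def factoryFor(cls: Class[_]): Session => Any = factories(cls)
  def newSession: Session = new Session(this)
}

final class Session(design: Design) {
  private val singletons = mutable.Map.empty[Class[_], Any]
  private var shutdownHooks: List[() => Unit] = Nil

  // Build: create each bound type once and reuse it within this session
  def build[A](implicit tag: ClassTag[A]): A = {
    val cls = tag.runtimeClass
    val instance = singletons.get(cls) match {
      case Some(existing) => existing
      case None =>
        val created = design.factoryFor(cls)(this)
        singletons.put(cls, created)
        created
    }
    instance.asInstanceOf[A]
  }

  def onShutdown(hook: () => Unit): Unit = { shutdownHooks = hook :: shutdownHooks }
  def shutdown(): Unit = shutdownHooks.foreach(hook => hook())
}

class Connection { var isOpen = true; def close(): Unit = { isOpen = false } }
class App(val connection: Connection)

object DiSketch {
  val design: Design = new Design()
    .bind[Connection] { session =>
      val c = new Connection
      session.onShutdown(() => c.close()) // life cycle: close at session shutdown
      c
    }
    .bind[App](session => new App(session.build[Connection]))
}
```

Calling `DiSketch.design.newSession.build[App]` yields an `App` whose connection is the session's singleton, and `shutdown()` on that session closes it.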
16. Clear Separation of Concerns
• Traditional Service Building:
• With Airframe:
• Clear separation of concerns:
• How to build objects (design)
• How to use objects (bind)
• Simplest DI pattern for Scala
[Code callouts on this slide: "How to build dependencies"; "Just use components!"; "Need to remember argument orders"]
17. Airframe Internals (Advanced)
• Code generation with Scalamacros
• Passing a Session when building App and A
• http://wvlet.org/airframe/docs/internals.html
19. VCR Record/Replay for Testing Presto
• Launching Presto requires a lot of memory (e.g., 2GB or more)
• Often crashes CI service containers (TravisCI, CircleCI, etc.)
• Recording Presto responses (prestobase-vcr)
• with sqlite-jdbc: https://github.com/xerial/sqlite-jdbc
• DB file for each test suite
• Enabled small-memory footprint testing
• Can run many Presto tests in CI
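The record/replay idea can be sketched as a thin wrapper around the query function. `QueryRecorder` and `fakePresto` are hypothetical names, and an in-memory map stands in for the per-suite SQLite file; none of this is prestobase-vcr's actual code:

```scala
import scala.collection.mutable

// Hypothetical record/replay wrapper. An in-memory map stands in for the
// per-suite SQLite file mentioned above; nothing here is prestobase-vcr's API.
final class QueryRecorder(backend: String => Seq[String]) {
  private val recorded = mutable.LinkedHashMap.empty[String, Seq[String]]
  var backendCalls = 0

  // Record on first sight of a query, replay the stored response afterwards
  def run(sql: String): Seq[String] =
    recorded.get(sql) match {
      case Some(rows) => rows
      case None =>
        backendCalls += 1
        val rows = backend(sql)
        recorded.put(sql, rows)
        rows
    }
}

object VcrSketch {
  // Stand-in for a real Presto round trip
  def fakePresto(sql: String): Seq[String] = Seq(s"result of: $sql")
}
```

Once the responses are persisted, a test run in replay mode never needs to start Presto, which is what keeps the memory footprint small in CI.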
20. Airframe Config
• YAML is useful for configuring applications
• Embedding YAML configurations inside docker images
• Provide credentials in a separate manner
• password, API keys, instance specific param, etc.
• properties file, environment variables, etc.
• YAML + overrides + object mapping
• http://wvlet.org/airframe/docs/config.html
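A minimal sketch of "YAML + overrides + object mapping", with plain maps standing in for the parsed YAML file and for a properties or environment source. `ServerConfig`, its fields, and `ConfigSketch` are hypothetical; Airframe Config's real API is described at the link above:

```scala
// Hypothetical sketch of "defaults + overrides + object mapping".
// Plain maps stand in for the parsed YAML file and for a properties/env source.
case class ServerConfig(host: String, port: Int, apiKey: String)

object ConfigSketch {
  // Values that would ship inside the Docker image as YAML
  val defaults: Map[String, String] =
    Map("host" -> "localhost", "port" -> "8080", "apiKey" -> "")

  // Later sources win, so credentials can be supplied separately at deploy time
  def load(overrides: Map[String, String]): ServerConfig = {
    val merged = defaults ++ overrides
    ServerConfig(merged("host"), merged("port").toInt, merged("apiKey"))
  }
}
```

Keeping secrets out of the defaults means the image itself never contains credentials; only the override source does.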
21. Airframe Internal: Surface
• Surface: Object surface (shape) inspector library
• https://github.com/wvlet/airframe/tree/master/surface
• case class A(id:Int, name:String)
• surface.of[A]
• => Surface("A", Seq(Param("id", surface.of[Int]), Param("name", surface.of[String])))
• Extract object type parameters with Scala Runtime Reflection
• Scala generates this type information at compile time
• Used as Type Identifiers of Airframe and Airframe Config
• e.g., [A], [Seq[B]], [Map[Int, String]], [A @@ Tag], etc.
• Generating serializer/deserializer of Scala classes
• Surface => Serialize object parameters => Encoding in MessagePack.gz => Embulk
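To make the "Surface => serialize object parameters" step concrete, here is a hand-written data model that mirrors the output shown above. The case classes and `toFields` are assumptions for illustration; the real library derives surfaces automatically:

```scala
// Hypothetical data model echoing the Surface output shown above; the real
// library derives this automatically, here it is written by hand.
case class Param(name: String, surface: Surface)
case class Surface(name: String, params: Seq[Param] = Seq.empty)

case class A(id: Int, name: String)

object SurfaceSketch {
  val intSurface: Surface    = Surface("Int")
  val stringSurface: Surface = Surface("String")
  val surfaceOfA: Surface =
    Surface("A", Seq(Param("id", intSurface), Param("name", stringSurface)))

  // Pair each parameter name with the object's value, in declaration order,
  // as a first step toward a generic serializer
  def toFields(surface: Surface, obj: Product): Seq[(String, Any)] =
    surface.params.map(_.name).zip(obj.productIterator.toSeq)
}
```

A serializer can then walk the name/value pairs and pick an encoding per parameter surface.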
22. td-spark
• Access TD from Spark
• Binding components with Airframe
• IO Manager, Presto Client, etc.
• Passing Design through SparkContext
• Integration
• TD -> Spark Dataframe
• TD Presto Query -> DataFrame
24. New Directions Explored By Presto
• Traditional Database Usage
• Required Database Administrator (DBA)
• DBA designs the schema and queries
• DBA tunes query performance
• After Presto
• Schema is designed by data providers
• 1st party data (users' customer data)
• 3rd party data sources
• Analysts or Marketers explore the data with Presto
• Don’t know the schema in advance
• Many Analytical SQL queries
25. Bridging Gaps Between SQL and Programming Language
• Traditional Approach
• OR mapper: app developers design objects and schema, then generate SQL
• New Approach: SQL First
• Need to manage various SQL results inside programming language
• But How?
27. sbt-sql: https://github.com/xerial/sbt-sql
• sbt plugin for generating Scala model classes from SQL files
• src/main/sql/presto/*.sql (Presto Queries)
• Using SQL as a function
• Read Presto SQL Results as Objects
• Enabled managing SQL queries in GitHub
• Type-safe data analysis
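As a rough picture of "SQL as a function", a class generated from a SQL file might resemble the following. The object name, query, columns, and `read` method are all assumptions made for this sketch; the code sbt-sql actually generates may differ:

```scala
// Hypothetical shape of a class generated from a file under src/main/sql/presto/.
// The query, columns, and method names are assumptions for illustration.
object TopUsers {
  val sql: String =
    "SELECT user_id, count(*) AS query_count FROM query_logs GROUP BY user_id"

  // One field per column of the query result
  case class Row(user_id: String, query_count: Long)

  // Map untyped result rows into typed objects
  def read(rows: Seq[Seq[Any]]): Seq[Row] =
    rows.map { case Seq(userId: String, count: Long) => Row(userId, count) }
}
```

Because the row type comes from the query, a renamed or retyped column surfaces as a compile error in the analysis code instead of a runtime failure.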
29. Packaging
• Do you need to install Scala?
• No. Only JDK is required
• sbt-pack
• https://github.com/xerial/sbt-pack
• Create Scala code packages for releasing
• At ./target/pack folder
• Folder structure:
• bin/ - launch scripts
• lib/ - Scala/Java libraries
• Makes it easier to create Docker images
• Also used for creating distributable packages of td-spark
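A possible sbt setup for producing the ./target/pack folder is sketched below. The plugin version and the main class `example.Hello` are assumptions, and older plugin releases were configured differently, so check the README linked above:

```scala
// project/plugins.sbt (version is an assumption; see the sbt-pack README)
addSbtPlugin("org.xerial.sbt" % "sbt-pack" % "0.13")

// build.sbt
enablePlugins(PackPlugin)

// Launch script name -> main class ("example.Hello" is a hypothetical class)
packMain := Map("hello" -> "example.Hello")
```

Running `sbt pack` then writes the bin/ and lib/ folders, which can be copied into a Docker image as-is.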
30. Deploying to Maven Central
• Necessary Steps
• Upload artifacts -> Close -> Release -> Drop
• Painful
• Need to login to Nexus Web UI
• Many manual steps
• Bintray?
• Uploading to Bintray -> Automatic sync to Maven Central
31. sbt-sonatype plugin
• Enable one-command release to Maven Central
• Using REST APIs of Sonatype NEXUS Repository Manager
• Developed during the 2015 New Year holiday
• Jan 5: Test Nexus REST API
• Jan 20: First release (Just 1 day effort)
• Released sbt-sonatype using sbt-sonatype
• 2,000+ projects are using sbt-sonatype
• Supporting sbt 0.13.x and 1.0.0
• And can be used for Java projects too
• Nexus to Maven Central sync is now fast
• Less than 10 minutes (June 2017)
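A one-command release might be configured along these lines. The plugin versions and the profile name `org.example` are assumptions for illustration; consult the sbt-sonatype README for current settings:

```scala
// project/plugins.sbt (versions are assumptions)
addSbtPlugin("org.xerial.sbt" % "sbt-sonatype" % "2.0")
addSbtPlugin("com.jsuereth" % "sbt-pgp" % "1.1.0")

// build.sbt
sonatypeProfileName := "org.example" // hypothetical Sonatype profile name

// Snapshots go to the snapshot repository, releases to the staging repository
publishTo := Some(
  if (isSnapshot.value) Opts.resolver.sonatypeSnapshots
  else Opts.resolver.sonatypeStaging
)
```

With this in place, `sbt publishSigned sonatypeRelease` uploads the signed artifacts and then performs the close and release steps without a visit to the Nexus web UI.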
32. Summary
• TD is a heavy user of Scala
• Analytics pipelines
• Production services
• Many libraries helping development
• Airframe, wvlet-log
• sbt plugins
• For details about Presto analysis
• Join Presto Meetup on Thursday!
Presto Meetup Tokyo: June 15, 2017 (Thu)