TREASURE DATA
Tips for Maintaining Open Source Projects
Lunch Session @ Treasure Data Tokyo Office
1
Taro L. Saito - GitHub:@xerial
Ph.D., Software Engineer at Treasure Data, Inc.
Target Scope
• OSS projects that can be maintained by a single person = You!
• Medium/large OSS projects
• TD’s OSS projects: fluentd, embulk, digdag, etc.
• Sada’s strategy:
• Build a pluggable framework
• Quickly delegate the future extension and maintenance to other people
• Apache projects: Hadoop, Spark, etc.
• Need some funding
• Need to find paid contributors
• Big projects essentially require company support = out of scope of this talk.
2
3
sqlite-jdbc
• JDBC driver for using SQLite in Java
• SQLite: tiny database engine. 1 database = 1 file
• Maintained for more than 10 years
• Why?
• There was no handy database for Genome Science data management.
• Installing PostgreSQL or MySQL was cumbersome
• OS differences
• Windows (development) / Linux (production)
• How?
• Embed pre-compiled SQLite binaries into a JAR file
• For multiple CPU architectures
• JDBC + SQLite + JNI + Runtime BinaryLoader
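The "embed binaries in a JAR, load at runtime" idea can be sketched roughly as follows. This is a minimal illustration, not the actual sqlite-jdbc loader: the object name `NativeLoaderSketch`, the `/native/<os>/<arch>/<lib>` resource layout, and the library file name are all assumptions made for the example.

```scala
import java.io.InputStream
import java.nio.file.{Files, Path, StandardCopyOption}

object NativeLoaderSketch {
  // Hypothetical resource layout inside the JAR: /native/<os>/<arch>/<libName>
  def resourcePath(os: String, arch: String, libName: String): String =
    s"/native/$os/$arch/$libName"

  // Copy the bundled binary to a temporary file, because the OS loader
  // can only open real files, not entries inside a JAR.
  def extractToTemp(in: InputStream, libName: String): Path = {
    val dir    = Files.createTempDirectory("native-lib-")
    val target = dir.resolve(libName)
    try Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING)
    finally in.close()
    target.toFile.deleteOnExit()
    target
  }

  // Look up the binary on the classpath, extract it, and load it for JNI.
  def load(os: String, arch: String, libName: String): Unit = {
    val path = resourcePath(os, arch, libName)
    val in   = getClass.getResourceAsStream(path)
    if (in == null)
      throw new UnsatisfiedLinkError(s"No native library bundled at $path")
    System.load(extractToTemp(in, libName).toAbsolutePath.toString)
  }
}
```

With this shape, users only add the JAR to the classpath; no separate native installation step is needed on Windows or Linux.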
4
snappy-java
• Snappy compressor/decompressor for Java
• Released just 1 week after Google open-sourced Snappy (C++)
• Used the same technique as sqlite-jdbc: pre-compile Snappy -> embed in a JAR
• Used in Parquet => Apache Spark, Presto, etc. => 1.6M+ downloads / month
• Building native libraries for 20+ CPU architecture x OS type combinations
• os.arch: x86, x86_64, arm, ppc, os.name: Windows, Mac, Linux, etc.
• Previously, VMware was used to run native/cross-compilers
• Now:
• Using Docker
• Linux images for cross compilers
• Custom built GCC
• Building native libraries in a single command
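Choosing the right binary at runtime depends on mapping the JVM's `os.name` and `os.arch` system properties to one of the prebuilt targets. A hedged sketch of such a mapping is below; the object name `Platform` and the normalized folder names are assumptions for illustration, and the real snappy-java mapping covers more cases.

```scala
import java.util.Locale

object Platform {
  // Collapse the many os.name spellings into a few folder names.
  def normalizeOs(osName: String): String = {
    val n = osName.toLowerCase(Locale.ROOT)
    if (n.contains("windows")) "Windows"
    else if (n.contains("mac") || n.contains("darwin")) "Mac"
    else if (n.contains("linux")) "Linux"
    else osName.replaceAll("\\W", "")
  }

  // Different JVMs report the same CPU under different names (e.g. amd64 vs x86_64).
  def normalizeArch(osArch: String): String =
    osArch.toLowerCase(Locale.ROOT) match {
      case "x86" | "i386" | "i686"  => "x86"
      case "x86_64" | "amd64"       => "x86_64"
      case a if a.startsWith("arm") => "arm"
      case a if a.startsWith("ppc") => "ppc"
      case other                    => other
    }

  // The (os, arch) pair for the JVM running this code.
  def current: (String, String) =
    (normalizeOs(System.getProperty("os.name")), normalizeArch(System.getProperty("os.arch")))
}
```

Each distinct `(os, arch)` pair corresponds to one native build, which is why the number of combinations grows so quickly.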
5
Tip 1: Automate Release Process
6
Traditional Release Process
• Release Steps for JVM projects
• Binary releases to Maven Central (hosted by Sonatype)
• Compile -> Test -> Package + GPG Sign -> Deploy to Maven Central (staging) -> Check Maven Central Requirements -> Promote from staging to release
• Java: mvn release
• Don’t do that. It ruins your life. Instead, use:
• mvn deploy -DperformRelease=true
• Scala: sbt release
• Follows the same practice as mvn release.
• This also wastes your life
• It runs the compile -> test -> publish steps sequentially
• Too slow, especially for cross-building multi-module projects for Scala 2.11, 2.12, 2.13, etc.
7
Deploying to Maven Central (Sonatype)
• Painful Operation at Sonatype UI
• Upload artifacts -> Close -> Release -> Drop
• Need to login to Nexus Web UI
• Many manual steps
• Using Bintray?
• Uploading to Bintray -> Automatic sync to Maven Central
• Suffered from many incidents
8
sbt-sonatype plugin
• Enables one-command release to Maven Central
• Uses the REST APIs of Sonatype Nexus Repository Manager
• Developed during the 2015 New Year holiday
• Jan 5: Test Nexus REST API
• Jan 20: First release (Just 1 day effort)
• Released sbt-sonatype using sbt-sonatype
• 3,500+ projects are using sbt-sonatype
• Can be used for Java project release
• Maven Central sync is faster now
• Less than 10 minutes (Since June 2017)
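A minimal sbt setup for this kind of one-command release might look like the sketch below. The version strings are placeholders, `org.example` is a hypothetical Sonatype profile, and setting names have changed between plugin versions, so check them against the sbt-sonatype README for the version in use.

```scala
// project/plugins.sbt
// "x.y.z" is a placeholder: use the current plugin versions.
addSbtPlugin("org.xerial.sbt" % "sbt-sonatype" % "x.y.z")
addSbtPlugin("com.github.sbt" % "sbt-pgp"      % "x.y.z") // provides publishSigned (GPG signing)

// build.sbt
// Hypothetical profile name: normally the groupId registered at Sonatype.
sonatypeProfileName := "org.example"
// Send artifacts to the Sonatype staging repository.
publishTo := sonatypePublishTo.value

// Then, from the command line (assuming Sonatype credentials are configured):
//   sbt publishSigned sonatypeRelease
```

The `sonatypeRelease` command stands in for the manual Close -> Release -> Drop clicks in the Nexus Web UI.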
9
Full Release Automation
• Triggering a release process with git tag
• Automatic versioning (sbt-dynver)
• 0.51+3-99dc3f68 (snapshot)
• 0.52 (release)
• Separate test and release processes
• Tag only commits that have passed CI
• Run the release process on Travis CI
• Packaging
• GPG signature
• Publishing to Maven Central
• with sbt-sonatype
• Finishes in about 10 minutes.
• Scala 2.11, 2.12, 2.13-M3, Scala.js cross build for 15+ modules
• With sbt-release, it took 2 to 3 hours or more
• Airframe has 3 or more releases every month
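The two version shapes shown above can be told apart mechanically, which is what lets a CI job decide whether a build is a snapshot or a tagged release. The sketch below assumes only the two formats listed on this slide (`<tag>+<distance>-<commit>` and a bare tag); `VersionKind` and its case classes are hypothetical names, and sbt-dynver itself produces additional variants that are not handled here.

```scala
object VersionKind {
  sealed trait Version
  final case class Release(tag: String) extends Version
  final case class Snapshot(tag: String, commitsSinceTag: Int, commit: String) extends Version

  // <tag>+<commits since tag>-<short commit hash>, e.g. 0.51+3-99dc3f68
  private val SnapshotPattern = """(.+)\+(\d+)-([0-9a-f]+)""".r

  def parse(version: String): Version = version match {
    case SnapshotPattern(tag, distance, commit) => Snapshot(tag, distance.toInt, commit)
    case _                                      => Release(version)
  }
}
```

For example, `VersionKind.parse("0.51+3-99dc3f68")` describes a build 3 commits after tag 0.51, while `VersionKind.parse("0.52")` is a release built exactly at a tag.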
10
sbt-pack
• Plugin for Packaging JVM Projects
• With command line launch scripts
• Collect all dependencies into a folder
• Good for building Docker images
• Folder Structure
• bin/ - launch scripts
• lib/ - Scala/Java libraries
• Used for TD internal Scala projects
• prestobase, prestop, presto-conductor, etc.
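A small sbt-pack configuration might look like the following sketch. The plugin version is a placeholder, and the program name `hello` and main class `org.example.Hello` are hypothetical; the setting names are taken from the plugin's documented usage but should be verified for the version in use.

```scala
// project/plugins.sbt
// "x.y.z" is a placeholder: use the current plugin version.
addSbtPlugin("org.xerial.sbt" % "sbt-pack" % "x.y.z")

// build.sbt
enablePlugins(PackPlugin)
// Map of launch script name -> main class (both hypothetical here).
packMain := Map("hello" -> "org.example.Hello")

// Running `sbt pack` is then expected to create:
//   target/pack/bin/hello   (launch script)
//   target/pack/lib/*.jar   (project and dependency JARs)
```

Because everything ends up in one folder, a Docker image can be built by copying that folder and pointing the entry point at the launch script.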
11
Tip 2: Think of Your Project Maintenance as a Learning Opportunity
12
Airframe
• Lightweight library collection for Scala
• Logging
• App Configuration
• Dependency Injection (DI)
• Object Serialization
• msgpack based codec
• MessagePack reader/writer for Scala
• JMX monitoring
• Human-readable date/time units
• Object shape inspectors
• etc.
• It already has 15+ modules.
13
Scala Version Upgrades = History of My Experiments
• Scala 2.7 (2009)
• Almost useless for production use cases => Learned just for fun
• Scala 2.8
• Improved compatibility with Java collections => a better Java => migrated my Java projects to Scala => airframe-opts
• Scala 2.9
• Parallel collections (= easy MapReduce, multi-threaded programming) => built a distributed engine
• Scala 2.10
• String interpolation (embed expressions into Strings) => airframe-log
• s"Hello ${world}"
• Scala 2.11
• Meta-programming with Scala Macros. => airframe-surface, airframe-codec
• Scala 2.12
• Java 8 support => using Airframe with Presto libraries (which only support Java 8)
• Presto experience => Guice -> airframe DI
• Scala 2.13
• Compiler performance improvement
• Enhancements to the collection library (groupMap, etc.)
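Two of the language features named on this slide can be seen in a few lines. This is only an illustration with made-up values (`world`, `words`); `groupMap` requires Scala 2.13 or later.

```scala
object VersionFeatures {
  // Scala 2.10: string interpolation embeds expressions directly in a string.
  val world    = "Scala"
  val greeting = s"Hello ${world}"

  // Scala 2.13: groupMap groups by a key and transforms the values in one pass.
  val words = Seq("apple", "avocado", "banana")
  val lengthsByInitial: Map[Char, Seq[Int]] =
    words.groupMap(_.head)(_.length)
}
```

Here `greeting` is `"Hello Scala"`, and `lengthsByInitial` maps `'a'` to `Seq(5, 7)` and `'b'` to `Seq(6)`.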
14
Scala Ninja
• Scala Ninja living on GitHub
• Upgrading Java/Scala versions, sbt/gradle/mvn versions, library versions, etc.
• Fixing documentation
• etc.
• Learning new technologies through small PRs
15
Tip 3: Build What You Actually Need
16
GitHub Stars Tell Nothing About Project Usability
• No Real or Active Users
• xerial/larray
• Large off-heap arrays and mmap files for Scala and Java (287 stars)
• Needed to manage human genomes (3GB) + FM-indexes (20GB or more) in JVM
• No longer used or maintained since I left academia
• Airpal
• Web UI for PrestoDB (2346 stars) created by Airbnb
• Nobody has maintained it for 2 years
• Only supports old versions of Presto
• These tools look cool at first, but in reality, no use cases exist
17
What You Usually Need as a Software Engineer
• Daily Task Automation
• Packaging & Release
• sbt-pack
• sbt-sonatype
• Daily Debugging
• airframe-log (easy to configure and start logging)
• Application Development
• airframe DI (Helping service composition)
• airframe-codec (Data serialization)
• airframe-config (App configuration)
• etc.
• If you can save 1 minute on a daily task, spending 6 hours developing such a library will pay off within a year
• 1 minute/day x 365 days = 365 minutes ≒ 6 hours
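The break-even arithmetic behind this rule of thumb is simple enough to write down. The helper below is a hypothetical illustration (`PayOff` and `breakEvenDays` are not from any library): it divides the development time by the time saved per day.

```scala
object PayOff {
  // Days of daily use needed before the time saved equals the time invested.
  def breakEvenDays(devHours: Double, savedMinutesPerDay: Double): Double =
    devHours * 60 / savedMinutesPerDay
}
```

`PayOff.breakEvenDays(6, 1)` gives 360 days, just under a year, and saving 5 minutes a day brings the same 6-hour investment down to 72 days.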
18
Summary: Tips for Maintaining OSS Projects
• Automate Release Process
• Think of Your Project Maintenance as a Learning Opportunity
• Build What You Actually Need
19
Related: Blog Articles
• 3 Tips for Maintaining Your Scala Projects
• https://medium.com/@taroleo/3-tips-for-maintaining-your-scala-projects-e54a2feea9c4
• Airframe: Lightweight Building Blocks for Scala
• https://medium.com/@taroleo/airframe-c5d044a97ec
• Airframe Log: A Modern Logging Library for Scala
• https://medium.com/@taroleo/airframe-log-a-modern-logging-library-for-scala-56fbc2f950bc
20