Continuous Integration
for Spark Apps
Hi, I’m Sean!
© 2015 Uncharted Software Inc.
It’s hard to test Spark Apps :(
© 2015 Uncharted Software Inc.
Case Study: Uncharted Spark Pipeline
© 2015 Uncharted Software Inc.
Case Study: Uncharted Spark Pipeline
Some key issues:
● Ensure reliability
● Prevent regressions
● Maintain compatibility with multiple versions of Spark
● Open-source - need a quick and easy way to evaluate PRs
© 2015 Uncharted Software Inc.
What is Continuous Integration?
© 2015 Uncharted Software Inc.
“Continuous Integration (CI) is a development practice that requires developers to integrate
code into a shared repository several times a day. Each check-in is then verified by an
automated build, allowing teams to detect problems early.”
-- ThoughtWorks
© 2015 Uncharted Software Inc.
“Continuous Integration (CI) is a development practice that is pretty damned
important for writing quality software.”
-- Me
© 2015 Uncharted Software Inc.
So, What is Continuous Integration?
© 2015 Uncharted Software Inc.
Best Practices, courtesy of Wikipedia
1. Maintain a code repository (Git)
© 2015 Uncharted Software Inc.
Best Practices, courtesy of Wikipedia
1. Maintain a code repository (Git)
2. Automate the build (Gradle)
© 2015 Uncharted Software Inc.
Best Practices, courtesy of Wikipedia
1. Maintain a code repository (Git)
2. Automate the build (Gradle)
3. Tests should be part of the build (ScalaTest)
© 2015 Uncharted Software Inc.
Best Practices, courtesy of Wikipedia
1. Maintain a code repository (Git)
2. Automate the build (Gradle)
3. Tests should be part of the build (ScalaTest)
4. Commit/push feature branches often
© 2015 Uncharted Software Inc.
Best Practices, courtesy of Wikipedia
1. Maintain a code repository (Git)
2. Automate the build (Gradle)
3. Tests should be part of the build (ScalaTest)
4. Commit/push feature branches often
5. Build (and test) All The Branches
© 2015 Uncharted Software Inc.
Best Practices, courtesy of Wikipedia
1. Maintain a code repository (Git)
2. Automate the build (Gradle)
3. Tests should be part of the build (ScalaTest)
4. Commit/push feature branches often
5. Build (and test) All The Branches
6. Test in a clone of the production environment
© 2015 Uncharted Software Inc.
Best Practices, courtesy of Wikipedia
1. Maintain a code repository (Git)
2. Automate the build (Gradle)
3. Tests should be part of the build (ScalaTest)
4. Commit/push feature branches often
5. Build (and test) All The Branches
6. Test in a clone of the production environment
7. Keep the build fast
© 2015 Uncharted Software Inc.
Best Practices, courtesy of Wikipedia
1. Maintain a code repository (Git)
2. Automate the build (Gradle)
3. Tests should be part of the build (ScalaTest)
4. Commit/push feature branches often
5. Build (and test) All The Branches
6. Test in a clone of the production environment
7. Keep the build fast
8. Everyone can see the results of builds
© 2015 Uncharted Software Inc.
Best Practices, courtesy of Wikipedia
1. Maintain a code repository (Git)
2. Automate the build (Gradle)
3. Tests should be part of the build (ScalaTest)
4. Commit/push feature branches often
5. Build (and test) All The Branches
6. Test in a clone of the production environment
7. Keep the build fast
8. Everyone can see the results of builds
}duh.
© 2015 Uncharted Software Inc.
Best Practices, courtesy of Wikipedia
1. Maintain a code repository (Git)
2. Automate the build (Gradle)
3. Tests should be part of the build (ScalaTest)
4. Commit/push feature branches often
5. Build (and test) All The Branches
6. Test in a clone of the production environment
7. Keep the build fast
8. Everyone can see the results of builds
}...less duh.
© 2015 Uncharted Software Inc.
Why are these difficult with Apache Spark?
5. Build (and test) All The Branches
6. Test in a clone of the production
environment
7. Keep the build fast
8. Everyone can see the results of builds
© 2015 Uncharted Software Inc.
What is a Spark App?
© 2015 Uncharted Software Inc.
What is a Spark app?
Source
JAR
Spark ?
This thing.
JAR
© 2015 Uncharted Software Inc.
And...
Source
JAR
Spark ?
We need to test this
JAR
© 2015 Uncharted Software Inc.
But...
Source
JAR
ScalaTest
Scala RE
By default, we have this
JAR
(boom)
© 2015 Uncharted Software Inc.
v1: Squish Spark inside ScalaTest
Source
JAR
ScalaTest
with
SparkContext
So, we try this
JAR
it works!
(sort of)
© 2015 Uncharted Software Inc.
it works!
(sort of)
© 2015 Uncharted Software Inc.
6. Test in a clone of the production environment
© 2015 Uncharted Software Inc.
v2: Squish ScalaTest into Spark
Source
Test
JAR
Tests Main.scala
Spark
JAR
Test
JAR
Test
Output
JAR
© 2015 Uncharted Software Inc.
Main.scala
© 2015 Uncharted Software Inc.
6. Test in a clone of the production environment
© 2015 Uncharted Software Inc.
Progress?
5. Build (and test) All The Branches
6. Test in a clone of the production environment
7. Keep the build fast
8. Everyone can see the results of builds
© 2015 Uncharted Software Inc.
What now?
5. Build (and test) All The Branches
6. Test in a clone of the production environment
7. Keep the build fast
8. Everyone can see the results of builds
© 2015 Uncharted Software Inc.
Docker Container
(uncharted/sparklet)
v3: Squish Spark and Test JAR into Docker
Test
Output
Source
Test
JAR
Tests Main.scala
Spark
JAR
JAR
Test
JAR
© 2015 Uncharted Software Inc.
test.sh
© 2015 Uncharted Software Inc.
build.gradle (excerpt)
© 2015 Uncharted Software Inc.
Progress?
5. Build (and test) All The Branches
6. Test in a clone of the production environment
7. Keep the build fast
8. Everyone can see the results of builds
© 2015 Uncharted Software Inc.
Travis CI VM
Docker Container
v4: Squish Docker into Travis CI
Test
Output
Source
Test
JAR
Tests Main.scala
Spark
JAR
JAR
Test
JAR
© 2015 Uncharted Software Inc.
.travis.yml
© 2015 Uncharted Software Inc.
Voilà!
© 2015 Uncharted Software Inc.
Progress?
5. Build (and test) All The Branches
6. Test in a clone of the production environment
7. Keep the build fast
8. Everyone can see the results of builds
© 2015 Uncharted Software Inc.
© 2015 Uncharted Software Inc.
© 2015 Uncharted Software Inc.
All done!
5. Build (and test) All The Branches
6. Test in a clone of the production environment
7. Keep the build fast
8. Everyone can see the results of builds
© 2015 Uncharted Software Inc.
Next Steps?
Alpine Linux
docker-compose
Windows (dev environment) support
python
© 2015 Uncharted Software Inc.
Questions?
https://github.com/unchartedsoftware/sparkpipe-core
https://github.com/Ghnuberath
@Ghnuberath
https://hub.docker.com/r/uncharted/sparklet/
smcintyre@unchartedsoftware.com

Continuous Integration for Spark Apps by Sean McIntyre

  • 1.
  • 2.
    Hi, I’m Sean! ©2015 Uncharted Software Inc.
  • 3.
    It’s hard totest Spark Apps :( © 2015 Uncharted Software Inc.
  • 4.
    Case Study: UnchartedSpark Pipeline © 2015 Uncharted Software Inc.
  • 5.
    Case Study: UnchartedSpark Pipeline Some key issues: ● Ensure reliability ● Prevent regressions ● Maintain compatibility with multiple versions of Spark ● Open-source - need a quick and easy way to evaluate PRs © 2015 Uncharted Software Inc.
  • 6.
    What is ContinuousIntegration? © 2015 Uncharted Software Inc.
  • 7.
    “Continuous Integration (CI)is a development practice that requires developers to integrate code into a shared repository several times a day. Each check-in is then verified by an automated build, allowing teams to detect problems early.” -- ThoughtWorks © 2015 Uncharted Software Inc.
  • 8.
    “Continuous Integration (CI)is a development practice that is pretty damned important for writing quality software.” -- Me © 2015 Uncharted Software Inc.
  • 9.
    So, What isContinuous Integration? © 2015 Uncharted Software Inc.
  • 10.
    Best Practices, courtesyof Wikipedia 1. Maintain a code repository (Git) © 2015 Uncharted Software Inc.
  • 11.
    Best Practices, courtesyof Wikipedia 1. Maintain a code repository (Git) 2. Automate the build (Gradle) © 2015 Uncharted Software Inc.
  • 12.
    Best Practices, courtesyof Wikipedia 1. Maintain a code repository (Git) 2. Automate the build (Gradle) 3. Tests should be part of the build (ScalaTest) © 2015 Uncharted Software Inc.
  • 13.
    Best Practices, courtesyof Wikipedia 1. Maintain a code repository (Git) 2. Automate the build (Gradle) 3. Tests should be part of the build (ScalaTest) 4. Commit/push feature branches often © 2015 Uncharted Software Inc.
  • 14.
    Best Practices, courtesyof Wikipedia 1. Maintain a code repository (Git) 2. Automate the build (Gradle) 3. Tests should be part of the build (ScalaTest) 4. Commit/push feature branches often 5. Build (and test) All The Branches © 2015 Uncharted Software Inc.
  • 15.
    Best Practices, courtesyof Wikipedia 1. Maintain a code repository (Git) 2. Automate the build (Gradle) 3. Tests should be part of the build (ScalaTest) 4. Commit/push feature branches often 5. Build (and test) All The Branches 6. Test in a clone of the production environment © 2015 Uncharted Software Inc.
  • 16.
    Best Practices, courtesyof Wikipedia 1. Maintain a code repository (Git) 2. Automate the build (Gradle) 3. Tests should be part of the build (ScalaTest) 4. Commit/push feature branches often 5. Build (and test) All The Branches 6. Test in a clone of the production environment 7. Keep the build fast © 2015 Uncharted Software Inc.
  • 17.
    Best Practices, courtesyof Wikipedia 1. Maintain a code repository (Git) 2. Automate the build (Gradle) 3. Tests should be part of the build (ScalaTest) 4. Commit/push feature branches often 5. Build (and test) All The Branches 6. Test in a clone of the production environment 7. Keep the build fast 8. Everyone can see the results of builds © 2015 Uncharted Software Inc.
  • 18.
    Best Practices, courtesyof Wikipedia 1. Maintain a code repository (Git) 2. Automate the build (Gradle) 3. Tests should be part of the build (ScalaTest) 4. Commit/push feature branches often 5. Build (and test) All The Branches 6. Test in a clone of the production environment 7. Keep the build fast 8. Everyone can see the results of builds }duh. © 2015 Uncharted Software Inc.
  • 19.
    Best Practices, courtesyof Wikipedia 1. Maintain a code repository (Git) 2. Automate the build (Gradle) 3. Tests should be part of the build (ScalaTest) 4. Commit/push feature branches often 5. Build (and test) All The Branches 6. Test in a clone of the production environment 7. Keep the build fast 8. Everyone can see the results of builds }...less duh. © 2015 Uncharted Software Inc.
  • 20.
    Why are thesedifficult with Apache Spark? 5. Build (and test) All The Branches 6. Test in a clone of the production environment 7. Keep the build fast 8. Everyone can see the results of builds © 2015 Uncharted Software Inc.
  • 21.
    What is aSpark App? © 2015 Uncharted Software Inc.
  • 22.
    What is aSpark app? Source JAR Spark ? This thing. JAR © 2015 Uncharted Software Inc.
  • 23.
    And... Source JAR Spark ? We needto test this JAR © 2015 Uncharted Software Inc.
  • 24.
    But... Source JAR ScalaTest Scala RE By default,we have this JAR (boom) © 2015 Uncharted Software Inc.
  • 25.
    v1: Squish Sparkinside ScalaTest Source JAR ScalaTest with SparkContext So, we try this JAR it works! (sort of) © 2015 Uncharted Software Inc.
  • 26.
    it works! (sort of) ©2015 Uncharted Software Inc.
  • 27.
    6. Test ina clone of the production environment © 2015 Uncharted Software Inc.
  • 28.
    v2: Squish ScalaTestinto Spark Source Test JAR Tests Main.scala Spark JAR Test JAR Test Output JAR © 2015 Uncharted Software Inc.
  • 29.
  • 30.
    6. Test ina clone of the production environment © 2015 Uncharted Software Inc.
  • 31.
    Progress? 5. Build (andtest) All The Branches 6. Test in a clone of the production environment 7. Keep the build fast 8. Everyone can see the results of builds © 2015 Uncharted Software Inc.
  • 32.
    What now? 5. Build(and test) All The Branches 6. Test in a clone of the production environment 7. Keep the build fast 8. Everyone can see the results of builds © 2015 Uncharted Software Inc.
  • 33.
    Docker Container (uncharted/sparklet) v3: SquishSpark and Test JAR into Docker Test Output Source Test JAR Tests Main.scala Spark JAR JAR Test JAR © 2015 Uncharted Software Inc.
  • 34.
  • 35.
    build.gradle (excerpt) © 2015Uncharted Software Inc.
  • 36.
    Progress? 5. Build (andtest) All The Branches 6. Test in a clone of the production environment 7. Keep the build fast 8. Everyone can see the results of builds © 2015 Uncharted Software Inc.
  • 37.
    Travis CI VM DockerContainer v4: Squish Docker into Travis CI Test Output Source Test JAR Tests Main.scala Spark JAR JAR Test JAR © 2015 Uncharted Software Inc.
  • 38.
  • 39.
  • 40.
    Progress? 5. Build (andtest) All The Branches 6. Test in a clone of the production environment 7. Keep the build fast 8. Everyone can see the results of builds © 2015 Uncharted Software Inc.
  • 41.
    © 2015 UnchartedSoftware Inc.
  • 42.
    © 2015 UnchartedSoftware Inc.
  • 43.
    All done! 5. Build(and test) All The Branches 6. Test in a clone of the production environment 7. Keep the build fast 8. Everyone can see the results of builds © 2015 Uncharted Software Inc.
  • 44.
    Next Steps? Alpine Linux docker-compose Windows(dev environment) support python © 2015 Uncharted Software Inc.
  • 45.