
Gerrit + Jenkins = Continuous Delivery For Big Data


Big Data is now everywhere: in mobile media analytics, banking, industry, avionics, and even in medicine, where it is used to monitor the spread of epidemics.

We show how code review can be integrated with Continuous Integration and Continuous Delivery in a Big Data scenario that poses new challenges to the existing Jenkins framework. We describe how we implemented our agile build and deployment process with distributed teams on Big Data software development projects for media and financial organizations in London. The talk starts with a presentation of our workflow, then explains how we leveraged Gerrit and Jenkins and how we integrated them with Docker, Mesos and the Hadoop ecosystem.




  1. Gerrit + Jenkins = Continuous Delivery for Big Data
     Real-life case study and future developments
     Stefano Galarraga, GerritForge
     stefano@gerritforge.com, http://www.gerritforge.com
     Mountain View, CA, November 2015
  2. The Team
     Luca Milanesio
     • Co-founder and Director of GerritForge
     • Over 20 years in Agile Development and ALM
     • Open source contributor to many projects (Big Data, Continuous Integration, Git/Gerrit)
     Antonios Chalkiopoulos
     • Author of Programming MapReduce with Scalding
     • Open source contributor to many Big Data projects
     • Working on the "land of Hadoop" (landoop.com)
     Tiago Palma
     • Data Warehouse and Big Data development
     • Senior Data Modeler
     • Big Data infrastructure specialist
     Stefano Galarraga
     • 20 years of Agile Development
     • Middleware, Big Data, reactive distributed systems
     • Open source contributor to Big Data projects
  3. Agenda
     • What's special in Big Data
       – General lack of support for unit/integration testing
       – Testing the "real thing" (aka the cluster)
     • Why Gerrit for continuous deployment on Big Data?
     • Our development lifecycle ingredients
       – Gerrit, Jenkins, Mesos, Marathon, CDH / Spark
     • Gerrit role and components
       – What we used, why, and what we would like to have
     • New developments
       – Using topics with microservices for "atomic" multi-service changes
     • Live (minimised) demo
     • Open points and discussion
  4. Why Gerrit?
     • Fast paced
     • Distributed team
     • A relatively "niche" technology
       – A lot of "junior" developers
       – Need for strong ownership
       – Validation rules
       – CD means we need green builds and consistent code quality
  5. Code-Review Lifecycle
     • Git used by distributed teams (UK, Israel, India)
     • Topics and code review
     • Jenkins build on every patch set
     • Commits reviewed / approved via Gerrit submit
     • Submitting a topic automatically:
       – merges all its patch sets (semi-atomically)
       – triggers a longer chain of CI steps
       – promotes a release candidate if everything passes
     • Jenkins automation via the Gerrit Trigger plugin
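The Jenkins automation described here hinges on Gerrit's SSH stream-events feed: the Gerrit Trigger plugin listens for events such as "patchset-created" and starts a build for the matching project and branch. A minimal sketch of that filtering step, assuming the documented Gerrit stream-events JSON shape (the project and branch names are hypothetical, not from the talk):

```python
import json

# Gerrit emits one JSON object per line on:
#   ssh -p 29418 <gerrit-host> gerrit stream-events
# This sketch shows the kind of filtering the Gerrit Trigger plugin
# performs before scheduling a Jenkins build.

def should_trigger_build(event_line, project="bigdata-etl", branch="master"):
    """Return True when the event is a new patch set on the watched project/branch."""
    event = json.loads(event_line)
    if event.get("type") != "patchset-created":
        return False
    change = event.get("change", {})
    return change.get("project") == project and change.get("branch") == branch

# Example event, shaped like Gerrit's "patchset-created" stream event:
sample = json.dumps({
    "type": "patchset-created",
    "change": {"project": "bigdata-etl", "branch": "master", "topic": "oracle-ingest"},
    "patchSet": {"number": 2},
})
```

In the workflow above, one such event per uploaded patch set is what drives the "Jenkins build on every patch set" step.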
  6. Build Steps and Solutions
     • Unit tests abstracting from dependencies
     • Integration tests:
       – Using Docker to run dependencies on the CI: a "micro" Hadoop cluster or other dependencies (DBs, messaging) via the Jenkins Docker plugin
       – When possible, "dockerizing" just the required components and driving them from the test framework
     • Performance/acceptance tests required a real cluster
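When the test framework drives dockerized dependencies itself, it has to wait for each container to actually accept connections before the integration tests run. A small readiness-check sketch (the host, port and timeout values are illustrative):

```python
import socket
import time

def wait_for_port(host, port, timeout=60.0, interval=0.5):
    """Poll a TCP port (e.g. a dockerized Oracle or HDFS NameNode)
    until it accepts connections, or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means the containerized service is listening.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

A test suite would typically call this once per dependency after `docker run`, and fail fast with a clear message if a container never comes up.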
  7. Fitting CDH Into This Picture
     • Acceptance / performance tests with short-lived CDH clusters
     • Solution: Mesos, Marathon and Docker
       – Ephemeral clusters with defined capacity
       – Automatic cluster configuration
       – All controlled via Docker/Mesos
     • This was quite a long process, mostly because of CDH cluster configuration
  8. Mesos + Marathon
     • Apache Mesos
       – Abstracts CPU, memory, storage and other compute resources away from individual machines
     • Marathon framework
       – Runs on top of Mesos
       – Guarantees that long-running applications never stop
       – REST API for managing and scaling services
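Marathon's REST API accepts an app definition as JSON POSTed to `/v2/apps`. As a sketch of the kind of payload involved when launching a dockerized Cloudera Manager through Marathon (the app id, image name and resource figures are illustrative assumptions, not the talk's actual values):

```python
import json

def cdh_manager_app(app_id="/cdh/cloudera-manager",
                    image="registry.example.com/cdh-manager:5.4.1"):
    """Build a Marathon app definition for one long-running Docker container."""
    return {
        "id": app_id,
        "cpus": 2.0,
        "mem": 4096,
        "instances": 1,
        "container": {
            "type": "DOCKER",
            "docker": {"image": image, "network": "HOST"},
        },
    }

# POSTing this payload (e.g. with urllib.request) to
# http://<marathon-host>:8080/v2/apps asks Marathon to schedule the
# container on a Mesos slave and keep it running.
payload = json.dumps(cdh_manager_app())
```

Scaling the Cloudera agents up to N containers is then just a matter of a second app definition with `"instances": N`.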
  9. CDH Components
     • CDH 5.4.1 distribution
       – Apache Spark
       – Hadoop HDFS
       – YARN
  10. Integration/Performance Test Flow on CDH Cluster
      Components: Jenkins Master, Mesos Master, Marathon, private Docker registry, Mesos slaves
      • Jenkins POSTs to the Marathon REST API to start one Docker container with Cloudera Manager and N Docker containers with Cloudera agents
      • The Marathon framework receives resource offers from the Mesos Master and submits the tasks
      • Each task is sent to a Mesos slave, which starts the Docker container
      • The Docker image is fetched from the Docker registry if not already present on the slave host
      • Once the containers are up, Cloudera packages are installed via the Cloudera Manager API using Python
      • Deploy the ETL, run the ETL and the acceptance tests
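The "install Cloudera packages via the Cloudera Manager API" step talks to CM's REST API over HTTP with basic auth. A sketch of building such a request with the standard library (the host, credentials, API version and endpoint path are illustrative; the real automation would call the specific package-install and cluster-config endpoints):

```python
import base64
import urllib.request

def cm_request(host, path, user="admin", password="admin", api_version="v10"):
    """Build an authenticated request for the Cloudera Manager REST API.

    Cloudera Manager listens on port 7180 by default and exposes its
    API under /api/<version>/. Sending the request (urlopen) is left
    to the caller; this only constructs it.
    """
    url = "http://%s:7180/api/%s/%s" % (host, api_version, path.lstrip("/"))
    token = base64.b64encode(("%s:%s" % (user, password)).encode()).decode()
    return urllib.request.Request(url, headers={"Authorization": "Basic " + token})

# Hypothetical example: listing clusters on a freshly started CM container.
req = cm_request("cm-host.example.com", "clusters")
```

The flow above would issue a series of such calls, polling CM until the ephemeral cluster reports healthy before the ETL is deployed.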
  11. Unit and Integration Tests Sample
      • Test project:
        – Test Spark project
        – ETL from Oracle to HDFS
      • Unit tests directly on the Spark logic
      • Integration tests for every patch set:
        – VERY small dataset, just for this demo
        – CDH and Oracle Docker images
  12. Unit and Integration Tests
      (Diagram: the Jenkins build job initialises and submits a job to Spark Standalone, which initialises and reads HDFS on Hadoop in pseudo-distributed mode)
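Unit-testing "directly on the Spark logic" works best when the core transformation is kept free of Spark imports, so the same function can be covered by a plain unit test and then mapped over an RDD in the real job. An illustrative sketch (the records and the normalisation rule are invented for this example, not the talk's actual Oracle-to-HDFS ETL):

```python
def normalise_record(row):
    """Core ETL transformation: trim the name and parse the amount.

    In the real pipeline this would be applied via Spark, e.g.
    rdd.map(normalise_record); keeping it Spark-free lets a unit
    test exercise it without any cluster.
    """
    name, amount = row
    return (name.strip().upper(), float(amount))

def total_by_name(rows):
    """Aggregation step, mirroring a reduceByKey over normalised records."""
    totals = {}
    for name, amount in map(normalise_record, rows):
        totals[name] = totals.get(name, 0.0) + amount
    return totals
```

The integration test then runs the same logic through Spark against the dockerized Oracle and HDFS, asserting on the tiny demo dataset.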
  13. DEMO
  14. Open Points and Discussion
      • Topic-based build of multiple artifacts
        – The demo implementation is naïve and difficult to maintain
        – Race conditions on builds of dependent artifacts
          • A more advanced triggering system is needed (Zuul might fit)
        – Race condition on submit of a topic
          • Stream event: a single "topic-submitted" event instead of, or in addition to, many per-patch submit events
          • The Gerrit Trigger plugin should listen to this event to coordinate
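Until Gerrit emits a topic-level event, one workaround is to approximate it on the listener side: aggregate the per-change merge events and fire once every change in the topic is in. A rough sketch (the event shape follows Gerrit's stream-events JSON; the completion check via a precomputed topic-size map is a simplifying assumption, since a real implementation would query Gerrit for the topic's changes):

```python
class TopicSubmitAggregator:
    """Collect per-change 'change-merged' events and report when every
    change expected in a topic has been merged.

    expected_sizes maps topic name -> number of changes in that topic.
    """

    def __init__(self, expected_sizes):
        self.expected = expected_sizes
        self.merged = {}  # topic -> set of merged change numbers

    def on_event(self, event):
        """Feed one stream event; return the topic name once it completes."""
        if event.get("type") != "change-merged":
            return None
        topic = event.get("change", {}).get("topic")
        if not topic:
            return None
        seen = self.merged.setdefault(topic, set())
        seen.add(event["change"]["number"])
        if len(seen) == self.expected.get(topic, float("inf")):
            return topic  # synthetic "topic-submitted"
        return None
```

This still races if the topic-size query runs while the topic is being amended, which is exactly why a native "topic-submitted" stream event would be the cleaner fix.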
  15. Questions?
