How we use the unified log and Samza for real-time stream processing at Skyscanner, and how we deploy and monitor the stream processing jobs.
4. Introduction
• Skyscanner is a travel search company with over 50m unique monthly visitors (UMVs) and over 700 employees globally.
• Joseph Francis, Senior Software Engineer at Skyscanner
• Some use cases in Skyscanner
• Make Samza jobs easily deployable and operable in a multi-tenant cluster
6. Past
• One (big) monolithic SQL database for reporting and monitoring
• Central team to deliver data needs for the organization
• Had not yet jumped on the bandwagon of large-scale batch processing
9. Key Points
• Samza consumes one message at a time with an at-least-once delivery guarantee
• Single thread of execution
• API offers init(), process() and window() methods
• State management with embedded key-value store
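The key points above can be illustrated with a toy model. This is a minimal sketch in Python, not the Samza API (which is Java): the class `CountingTask`, the driver `run`, and the sample airport-code messages are all hypothetical, and a plain dict stands in for the embedded key-value store.

```python
class CountingTask:
    """Toy stand-in for a stream task; hypothetical, not the Samza API."""

    def init(self, store):
        # Called once before any message is processed. `store` plays the
        # role of the embedded key-value store (here just a dict).
        self.store = store
        self.seen_since_window = 0

    def process(self, key, value):
        # Called with exactly one message at a time, on a single thread.
        self.store[key] = self.store.get(key, 0) + 1
        self.seen_since_window += 1

    def window(self):
        # Called periodically; returns a summary and resets the counter.
        summary = {"messages": self.seen_since_window, "keys": len(self.store)}
        self.seen_since_window = 0
        return summary


def run(task, messages, window_every):
    """Single-threaded driver: one message per call, window every N messages."""
    task.init({})
    outputs = []
    for i, (key, value) in enumerate(messages, start=1):
        task.process(key, value)
        if i % window_every == 0:
            outputs.append(task.window())
    return outputs


messages = [("EDI", "search"), ("LHR", "search"), ("EDI", "search"), ("GLA", "search")]
task = CountingTask()
outputs = run(task, messages, window_every=2)
print(outputs)     # [{'messages': 2, 'keys': 2}, {'messages': 2, 'keys': 3}]
print(task.store)  # {'EDI': 2, 'LHR': 1, 'GLA': 1}
```

Under at-least-once delivery a message can be replayed after a failure, so a counter like this one may over-count unless `process()` is made idempotent.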
18. Current Deployment
• No centralised configuration
• Restrictive source folder structure
• Ansible deployment scripts were embedded with the Samza job
27. Application Logs
• Application logs forwarded to Elasticsearch through Logstash
• Requires a shared format for logging (log4j.xml)
• YARN UI is not the most intuitive!
29. Future
• More generic jobs
• Developers should only worry about writing code
• Fully automated production deployment
• Cross the boundaries of Batch vs Streaming?
How to offer stream processing jobs as a service in a large organization with as little boilerplate as possible
* We need something that is scalable and self-service
40k msg/sec approximately 30 MB/sec
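As a rough back-of-the-envelope check, the two figures above imply an average message size. The sketch below assumes decimal megabytes (1 MB = 1,000,000 bytes) and a sustained rate over a full day, neither of which is stated in the talk; the function name is hypothetical.

```python
def average_message_bytes(messages_per_sec, megabytes_per_sec, bytes_per_mb=1_000_000):
    """Average message size implied by a message rate and a byte rate."""
    return megabytes_per_sec * bytes_per_mb / messages_per_sec


SECONDS_PER_DAY = 60 * 60 * 24

avg_bytes = average_message_bytes(40_000, 30)
daily_messages = 40_000 * SECONDS_PER_DAY
daily_terabytes = 30 * SECONDS_PER_DAY / 1_000_000  # MB per day -> TB per day

print(avg_bytes)        # 750.0
print(daily_messages)   # 3456000000
print(daily_terabytes)  # 2.592
```

So, if the rate held all day, that would be roughly 750 bytes per message and about 3.5 billion messages per day.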
No single central team to write all the stream processing jobs. Make it self-service enough
Car hire uses the embedded key-value store
TC-based deployment pipeline
All pipelines inherit from a templated structure and share common parameters
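One way to picture templated pipelines is as a shared set of default parameters that each job copies and selectively overrides. The sketch below is only an illustration of that idea: `TEMPLATE`, `build_pipeline`, and every parameter name and value in it are hypothetical and do not come from the actual pipeline configuration.

```python
# Hypothetical common parameters that every pipeline would inherit.
TEMPLATE = {"deploy_tool": "ansible", "cluster": "yarn", "containers": 1}


def build_pipeline(job_name, overrides=None, template=TEMPLATE):
    """Return a job's pipeline parameters: template defaults plus overrides."""
    params = dict(template)          # copy, so the shared template is never mutated
    params.update(overrides or {})   # job-specific values win over defaults
    params["job_name"] = job_name
    return params


default_job = build_pipeline("clickstream-counter")
bigger_job = build_pipeline("car-hire-enricher", {"containers": 4})

print(default_job["containers"])  # 1
print(bigger_job["containers"])   # 4
print(TEMPLATE["containers"])     # 1
```

Keeping the defaults in one place means a change to a common parameter reaches every pipeline, while individual jobs still control the few values they need to differ on.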
Ansible is used under the hood to kill and start new jobs
Why not a REST-based deployment system?
It is hard to enforce centralised changes while also giving users flexibility
The restrictive folder structure limited flexibility in how jobs could be developed
Samza job users had to understand Ansible to test Samza locally.
We are migrating all systems deployed through Ansible on VMs to Docker containers, so this aligns with our migration plan
Local dev is very important so that developers have everything they need to fully test their jobs
Identical production and development environments
Vagrant local dev environment with Kafka, ZooKeeper, YARN
Package and compile jars locally
Jobs deployed with Ansible scripts to the local YARN cluster
Commit and push through CI environments