In the last few years, the DevOps movement has introduced groundbreaking approaches to the way we manage the lifecycle of software development and deployment. Today organisations aspire to fully automate the deployment of microservices and web applications with tools such as Chef, Puppet and Ansible. However, the deployment of data-processing pipelines remains a relic from the dark ages of software development.
Processing large-scale data pipelines is the main engineering task of the Big Data era, and it should be treated with the same respect and craftsmanship as any other piece of software. That is why we created Apache Amaterasu (Incubating), an open source framework that takes care of the specific needs of Big Data applications in the world of continuous delivery.
In this session, we will take a close look at Apache Amaterasu (Incubating), a simple and powerful framework for building and deploying data pipelines. Amaterasu aims to help data engineers and data scientists compose, configure, test, package, deploy and execute data pipelines written using multiple tools, languages and frameworks.
We will see what Amaterasu provides today, how it can help existing Big Data applications, and demo some of the new features that are coming in the near future.
Speaker:
Yaniv Rodenski, Senior Solutions Architect, Couchbase
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data Pipelines
1. A CD Framework For Data Pipelines
Yaniv Rodenski
@YRodenski
yaniv@apache.org
2. Archetypes of Data Pipeline Builders
Data People (Data Scientists/Analysts/BI Devs) - the “Scientists”
• Exploratory workloads
• Data centric
• Simple deployment
Software Developers
• Code centric
• Heavy on methodologies
• Heavy tooling
• Very complex deployment
5. Making Big Data Teams Scale
• Scaling teams is hard
• Scaling Big Data teams is harder
• Different mentality between data professionals/engineers
• Mixture of technologies
• Data as integration point
• Often schema-less
• Lack of tools
6. What Do We Need to Deploy Our Apps?
• Source control system: Git, Hg, etc.
• A CI process to integrate code, run tests and package the app
• A repository to store the packaged app
• A repository to store configuration
• An API/DSL to configure the underlying framework
• A mechanism to monitor the behaviour and performance of the app
7. How can we apply these techniques to Big Data applications?
8. Who are we?
Software developers with years of Big Data experience
What do we want?
A simple and robust way to deploy Big Data pipelines
How will we get it?
Write tens of thousands of lines of code in Scala
9. Amaterasu - Simple, Continuously Deployed Data Apps
• Big Data apps in Multiple Frameworks
• Multiple Languages
• Scala
• Python
• SQL
• Pipeline deployments are defined as YAML
• Simple to write, easy to deploy
• Reliable execution
• Multiple Environments
10. Amaterasu Repositories
• Jobs are defined in repositories
• Current implementation - git repositories
• tarball support is planned for a future release
• Repo structure (sketched below):
• maki.yml - the workflow definition
• src - a folder containing the actions (Spark scripts, etc.) to be executed
• env - a folder containing configuration per environment
• deps - dependencies configuration
• Benefits of using git:
• Tooling
• Branching
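As a rough sketch, a job repository following the structure above might look like this (file names other than maki.yml are illustrative, taken from the pipeline example that follows):

amaterasu-job/
├── maki.yml          # the workflow definition
├── src/
│   ├── file.scala    # action scripts (Spark Scala, PySpark, SQL, ...)
│   └── file2.py
├── env/
│   ├── dev/
│   │   └── job.yml   # configuration per environment
│   └── production/
│       └── job.yml
└── deps/
    └── jars.yml      # dependencies configuration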
11. Pipeline DSL - maki.yml (Version 0.2.0)

---
job-name: amaterasu-test
flow:
  - name: start
    runner:
      group: spark
      type: scala
    file: file.scala
    exports:
      odd: parquet
  - name: step2
    runner:
      group: spark
      type: pyspark
    file: file2.py
    error:
      name: handle-error
      runner:
        group: spark
        type: scala
      file: cleanup.scala
...

• Actions are components of the pipeline
• exports - data structures to be used in downstream actions
• error - error handling actions
12. Amaterasu is not a workflow engine, it’s a deployment tool that understands that Big Data applications are rarely deployed independently of other Big Data applications
14. Pipeline DSL (Version 0.3.0)

---
job-name: amaterasu-test
type: long-running
def:
  - name: start
    type: long-running
    runner:
      group: spark
      type: scala
    file: file.scala
    exports:
      odd: parquet
  - name: step2
    type: scheduled
    schedule: 10 * * * *
    runner:
      group: spark
      type: pyspark
    artifact:
      - groupId: io.shonto
        artifactId: mySparkStreaming
        version: 0.1.0
...

• Scheduling is defined using cron format
• In version 0.3.0, pipelines and actions can be either long-running or scheduled
• Actions can be pulled from other applications or git repositories
15. Actions DSL (Spark)
• Your Scala/Python/SQL Spark code (more languages, such as R, are in the works)
• Few changes:
• Don’t create a new sc/sqlContext; use the one in scope, or access it via AmaContext.spark, AmaContext.sc and AmaContext.sqlContext
• AmaContext.getDataFrame is used to access data from previously executed actions (see the sketch below)
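For illustration, a minimal Scala action that consumes the odd dataset exported by the start action in the maki.yml above might look like this (the transformation, the output path and the env.workingDir usage are assumptions for this sketch; AmaContext and env are provided in scope by the Amaterasu runtime):

import org.apache.spark.sql.SaveMode

// Pull the DataFrame exported as "odd" by the previously executed
// "start" action - no new SparkContext/sqlContext is created here.
val odd = AmaContext.getDataFrame("start", "odd")

// Illustrative transformation: keep odd values and persist the result
// under the environment's working directory for downstream actions.
odd.where("_1 % 2 = 1")
  .write
  .mode(SaveMode.Overwrite)
  .parquet(s"${env.workingDir}/step2/odd-filtered")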
19. Environments
• Configuration is stored per environment
• Stored as YAML files in an environment folder
• Contains (sketched below):
• Input/output paths
• Work dir
• User-defined key-values
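A hedged sketch of what such an environment file might contain (key names and values are illustrative, not the framework's exact schema):

name: production
inputRootPath: hdfs:///amaterasu/input
outputRootPath: hdfs:///amaterasu/output
workingDir: hdfs:///amaterasu/work
configuration:
  retention-days: "30"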
24. Version 0.2.0-incubating main features
• YARN support
• Spark SQL and PySpark support
• Extend environments to support:
• Pure YAML support (configuration used to be JSON)
• Full Spark configuration
• spark.yml - supports all Spark configuration options (sketched below)
• spark_exec_env.yml - for configuring the Spark executors’ environment
• SDK Preview - for building framework integrations
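For example, a per-environment spark.yml could carry any standard Spark property; the keys below are real Spark configuration options, but their values here are purely illustrative:

spark.executor.memory: 2g
spark.executor.cores: 2
spark.dynamicAllocation.enabled: true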
25. Future Development
• Long running pipelines and streaming support
• Better tooling
• ama-cli
• Web console
• Other frameworks: Presto, TensorFlow, Apache Flink, Apache Beam, Hive
• SDK improvements