Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data Pipelines
In the last few years, the DevOps movement has introduced groundbreaking approaches to the way we manage the lifecycle of software development and deployment. Today organisations aspire to fully automate the deployment of microservices and web applications with tools such as Chef, Puppet and Ansible. However, the deployment of data-processing pipelines remains a relic from the dark ages of software development.

Processing large-scale data pipelines is the main engineering task of the Big Data era, and it should be treated with the same respect and craftsmanship as any other piece of software. That is why we created Apache Amaterasu (Incubating) - an open source framework that takes care of the specific needs of Big Data applications in the world of continuous delivery.

In this session, we will take a close look at Apache Amaterasu (Incubating), a simple and powerful framework to build and deploy pipelines. Amaterasu aims to help data engineers and data scientists to compose, configure, test, package, deploy and execute data pipelines written using multiple tools, languages and frameworks.
We will see what Amaterasu provides today, how it can help existing Big Data applications, and demo some of the new bits that are coming in the near future.

Speaker:
Yaniv Rodenski, Senior Solutions Architect, Couchbase

1. A CD Framework For Data Pipelines
Yaniv Rodenski
@YRodenski
yaniv@apache.org
2. Archetypes of Data Pipeline Builders
Data People (Data Scientists/Analysts/BI Devs) - "Scientists":
• Exploratory workloads
• Data centric
• Simple deployment
Software Developers:
• Code centric
• Heavy on methodologies
• Heavy tooling
• Very complex deployment
3. Data scientist deploying to production
4. Making Big Data Teams Scale
• Scaling teams is hard
• Scaling Big Data teams is harder
• Different mentality between data professionals/engineers
• Mixture of technologies
• Data as integration point
• Often schema-less
• Lack of tools
5. What Do We Need for Deploying Our Apps?
• Source control system: Git, Hg, etc.
• A CI process to integrate code, run tests and package the app
• A repository to store the packaged app
• A repository to store configuration
• An API/DSL to configure the underlying framework
• A mechanism to monitor the behaviour and performance of the app
6. How can we apply these techniques to Big Data applications?
7. Who are we?
Software developers with years of Big Data experience
What do we want?
A simple and robust way to deploy Big Data pipelines
How will we get it?
Write tens of thousands of lines of code in Scala
8. Amaterasu - Simple Continuously Deployed Data Apps
• Big Data apps in multiple frameworks
• Multiple languages:
  • Scala
  • Python
  • SQL
• Pipeline deployments are defined as YAML
• Simple to write, easy to deploy
• Reliable execution
• Multiple environments
9. Amaterasu Repositories
• Jobs are defined in repositories
• Current implementation - git repositories
• tarball support is planned for a future release
• Repo structure (see the sketch below):
  • maki.yml - the workflow definition
  • src - a folder containing the actions (Spark scripts, etc.) to be executed
  • env - a folder containing configuration per environment
  • deps - dependencies configuration
• Benefits of using git:
  • Tooling
  • Branching
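For illustration only, a job repository with this structure might look like the sketch below. The file names are taken from the examples on the following slides; the exact layout is an assumption, not Amaterasu documentation:

amaterasu-test/
  maki.yml              # the workflow definition
  src/
    file.scala          # Spark Scala action
    file2.py            # PySpark action
    cleanup.scala       # error handling action
  env/
    dev/
      job.yml           # dev environment configuration
    production/
      job.yml           # production environment configuration
  deps/                 # dependencies configuration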
10. Pipeline DSL - maki.yml (Version 0.2.0)
---
job-name: amaterasu-test
flow:
  - name: start
    runner:
      group: spark
      type: scala
    file: file.scala
    exports:
      odd: parquet
  - name: step2
    runner:
      group: spark
      type: pyspark
    file: file2.py
    error:
      name: handle-error
      runner:
        group: spark
        type: scala
      file: cleanup.scala
…
Slide callouts: actions are the components of the pipeline; exports declare data structures to be used in downstream actions; error blocks define error handling actions.
11. Amaterasu is not a workflow engine, it's a deployment tool that understands that Big Data applications are rarely deployed independently of other Big Data applications
12. Pipeline != Workflow
13. Pipeline DSL (Version 0.3.0)
---
job-name: amaterasu-test
type: long-running
def:
  - name: start
    type: long-running
    runner:
      group: spark
      type: scala
    file: file.scala
    exports:
      odd: parquet
  - name: step2
    type: scheduled
    schedule: 10 * * * *
    runner:
      group: spark
      type: pyspark
    artifact:
      - groupId: io.shonto
        artifactId: mySparkStreaming
        version: 0.1.0
…
Slide callouts: in version 0.3.0, pipelines and actions can be either long-running or scheduled; scheduling is defined using cron format (10 * * * * fires at minute 10 of every hour); actions can be pulled from other applications or git repositories.
14. Actions DSL (Spark)
• Your Scala/Python/SQL (and future languages - R is in the works) Spark code
• Few changes:
  • Don't create a new sc/sqlContext - use the ones in scope, or access them via AmaContext.spark, AmaContext.sc and AmaContext.sqlContext
  • AmaContext.getDataFrame is used to access data from previously executed actions
15. Actions DSL - Spark Scala
maki.yml entry:
- name: start
  runner:
    group: spark
    type: scala
  file: file.scala
  exports:
    odd: parquet
Action 1 ("start"), file.scala:
import org.apache.amaterasu.runtime._

val data = Array(1, 2, 3, 4, 5)
val rdd = AmaContext.sc.parallelize(data)
val odd = rdd.filter(n => n % 2 != 0).toDF()
Action 2:
import org.apache.amaterasu.runtime._

val highNoDf = AmaContext.getDataFrame("start", "odd")
  .where("_1 > 3")
highNoDf.write.json("file:///tmp/test1")
16. Actions DSL - PySpark
maki.yml entry:
- name: start
  runner:
    group: spark
    type: pyspark
  file: file.py
  exports:
    odd: parquet
Action 1 ("start"), file.py:
data = range(1, 1000)
rdd = ama_context.sc.parallelize(data)
odd = (rdd.filter(lambda n: n % 2 != 0)
          .map(lambda n: (n,))  # wrap in a tuple so the column is named _1
          .toDF())
Action 2:
high_no_df = (ama_context
    .get_dataframe("start", "odd")
    .where("_1 > 100"))
high_no_df.write.save("file:///tmp/test1", format="json")
17. Actions DSL - SparkSQL
maki.yml entry:
- name: action2
  runner:
    group: spark
    type: sql
  file: file.sql
  exports:
    high_no: parquet
file.sql:
select * from ama_context.start_odd
where _1 > 100
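Note how the table is addressed: the odd dataset exported by the start action surfaces in SparkSQL as ama_context.start_odd, apparently following the pattern of action name and export name joined with an underscore.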
18. Environments
• Configuration is stored per environment
• Stored as YAML files in an environment folder
• Contains:
  • Input/output paths
  • Work dir
  • User defined key-values
19. env/production/job.yml
name: default
master: mesos://prdmsos:5050
inputRootPath: hdfs://prdhdfs:9000/user/amaterasu/input
outputRootPath: hdfs://prdhdfs:9000/user/amaterasu/output
workingDir: alluxio://prdalluxio:19998/
configuration:
  spark.cassandra.connection.host: cassandraprod
  sourceTable: documents
20. env/dev/job.yml
name: test
master: local[*]
inputRootPath: file:///tmp/input
outputRootPath: file:///tmp/output
workingDir: file:///tmp/work/
configuration:
  spark.cassandra.connection.host: 127.0.0.1
  sourceTable: documents
21. Environments in the Actions DSL
import org.apache.amaterasu.runtime._

val highNoDf = AmaContext.getDataFrame("start", "x")
  .where("_1 > 3")
highNoDf.write.json(Env.outputPath)
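The action code stays environment-agnostic: Env.outputPath is resolved from the configuration of the environment the job is deployed with (presumably derived from outputRootPath), so the same action writes under file:///tmp/output with the dev environment above and to HDFS with the production one.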
22. Demo time
23. Version 0.2.0-incubating main features
• YARN support
• Spark SQL, PySpark support
• Extend environments to support:
  • Pure YAML support (configuration used to be JSON)
  • Full Spark configuration (see the sketch below):
    • spark.yml - supports all Spark configurations
    • spark_exec_env.yml - for configuring Spark executor environments
• SDK preview - for building framework integrations
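As a rough sketch only - these files are not shown in the slides, so the keys and layout below are assumptions - spark.yml would map standard Spark configuration keys to values, while spark_exec_env.yml would hold environment variables for the Spark executors:

spark.yml (hypothetical):
  spark.executor.memory: 4g
  spark.executor.cores: 2
  spark.serializer: org.apache.spark.serializer.KryoSerializer

spark_exec_env.yml (hypothetical):
  SPARK_LOCAL_DIRS: /tmp/spark
  PYTHONPATH: /opt/python/libs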
24. Future Development
• Long-running pipelines and streaming support
• Better tooling:
  • ama-cli
  • Web console
• Other frameworks: Presto, TensorFlow, Apache Flink, Apache Beam, Hive
• SDK improvements
25. Getting started
Website: http://amaterasu.incubator.apache.org
GitHub: https://github.com/apache/incubator-amaterasu
Mailing list: dev@amaterasu.incubator.apache.org
Slack: http://apacheamaterasu.slack.com
Twitter: @ApacheAmaterasu
26. Thank you!
@YRodenski
yaniv@apache.org
