A CD Framework For Data Pipelines
Yaniv Rodenski
@YRodenski
yaniv@apache.org
Archetypes of Data Pipelines Builders

Data People (Data Scientists/Analysts/BI Devs)
• Exploratory workloads
• Data centric
• Simple deployment

Software Developers
• Code centric
• Heavy on methodologies
• Heavy tooling
• Very complex deployment

"Scientists": data scientists deploying to production
Making Big Data Teams Scale
• Scaling teams is hard
• Scaling Big Data teams is harder
• Different mentality between data professionals and engineers
• Mixture of technologies
• Data as the integration point
• Often schema-less
• Lack of tools
What Do We Need to Deploy Our Apps?
• A source control system: Git, Hg, etc.
• A CI process to integrate code, run tests and package the app
• A repository to store the packaged app
• A repository to store configuration
• An API/DSL to configure the underlying framework
• A mechanism to monitor the behaviour and performance of the app
How can we apply these techniques to Big Data applications?
Who are we?
Software developers with years of Big Data experience

What do we want?
A simple and robust way to deploy Big Data pipelines

How will we get it?
Write tens of thousands of lines of code in Scala
Amaterasu - Simple Continuously Deployed Data Apps
• Big Data apps in multiple frameworks
• Multiple languages:
  • Scala
  • Python
  • SQL
• Pipeline deployments are defined as YAML
• Simple to write, easy to deploy
• Reliable execution
• Multiple environments
Amaterasu Repositories
• Jobs are defined in repositories
• Current implementation - git repositories
  • tarball support is planned for a future release
• Repo structure:
  • maki.yml - the workflow definition
  • src - a folder containing the actions (Spark scripts, etc.) to be executed
  • env - a folder containing configuration per environment
  • deps - dependencies configuration
• Benefits of using git:
  • Tooling
  • Branching
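
An illustrative layout of such a repository (the action file names are hypothetical, echoing the examples later in the deck):

amaterasu-job/
├── maki.yml           # the workflow definition
├── src/
│   ├── file.scala     # action scripts
│   └── file2.py
├── env/
│   ├── dev/
│   │   └── job.yml    # per-environment configuration
│   └── production/
│       └── job.yml
└── deps/              # dependencies configuration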
Pipeline DSL - maki.yml (Version 0.2.0)

---
job-name: amaterasu-test
flow:
  - name: start
    runner:
      group: spark
      type: scala
    file: file.scala
    exports:
      odd: parquet
  - name: step2
    runner:
      group: spark
      type: pyspark
    file: file2.py
    error:
      name: handle-error
      runner:
        group: spark
        type: scala
      file: cleanup.scala
…

exports: data-structures to be used in downstream actions
Actions are components of the pipeline
error: error handling actions
Pipeline != Workflow

Amaterasu is not a workflow engine; it's a deployment tool that understands that Big Data applications are rarely deployed independently of other Big Data applications.
Pipeline DSL (Version 0.3.0)

---
job-name: amaterasu-test
type: long-running
def:
  - name: start
    type: long-running
    runner:
      group: spark
      type: scala
    file: file.scala
    exports:
      odd: parquet
  - name: step2
    type: scheduled
    schedule: "10 * * * *"
    runner:
      group: spark
      type: pyspark
    artifact:
      groupId: io.shonto
      artifactId: mySparkStreaming
      version: 0.1.0
…

Scheduling is defined using cron format ("10 * * * *" runs at minute 10 of every hour)
In version 0.3.0, pipelines and actions can be either long-running or scheduled
Actions can be pulled from other applications or git repositories
Actions DSL (Spark)
• Your Scala/Python/SQL Spark code, with more languages to come (R is in the works)
• A few changes:
  • Don't create a new sc/sqlContext; use the ones in scope, or access them via AmaContext.spark, AmaContext.sc and AmaContext.sqlContext
  • AmaContext.getDataFrame is used to access data from previously executed actions
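
As a minimal sketch of these rules in an action (the parquet input path is hypothetical; AmaContext.spark, AmaContext.sc and AmaContext.getDataFrame are the entry points listed above):

import org.apache.amaterasu.runtime._

// Don't construct a new SparkContext inside an action:
// val sc = new SparkContext(conf)
// Reuse the managed contexts instead:
val session = AmaContext.spark                          // shared Spark session
val input = session.read.parquet("hdfs://prdhdfs:9000/user/amaterasu/input") // hypothetical path
val fromStart = AmaContext.getDataFrame("start", "odd") // exported by the "start" action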
Actions DSL - Spark Scala

Action 1 ("start") - file.scala:

import org.apache.amaterasu.runtime._
import AmaContext.sqlContext.implicits._ // needed for rdd.toDF()

val data = Array(1, 2, 3, 4, 5)
val rdd = AmaContext.sc.parallelize(data)
val odd = rdd.filter(n => n % 2 != 0)
             .map(Tuple1(_)) // wrap in a tuple so the column is named _1
             .toDF()

Action 2:

import org.apache.amaterasu.runtime._

val highNoDf = AmaContext.getDataFrame("start", "odd")
  .where("_1 > 3")
highNoDf.write.json("file:///tmp/test1")

maki.yml:

- name: start
  runner:
    group: spark
    type: scala
  file: file.scala
  exports:
    odd: parquet
Actions DSL - PySpark

Action 1 ("start") - file.py:

from pyspark.sql import Row

data = range(1, 1000)
rdd = ama_context.sc.parallelize(data)
odd = rdd.filter(lambda n: n % 2 != 0) \
         .map(lambda n: Row(n)) \
         .toDF()  # a single unnamed field, so the column is _1

Action 2:

high_no_df = ama_context \
    .get_dataframe("start", "odd") \
    .where("_1 > 100")
high_no_df.write.save("file:///tmp/test1", format="json")

maki.yml:

- name: start
  runner:
    group: spark
    type: pyspark
  file: file.py
  exports:
    odd: parquet
Actions DSL - SparkSQL

file.sql:

select * from ama_context.start_odd
where _1 > 100

(Tables are exposed as ama_context.<action>_<export>; start_odd is the odd dataframe exported by the start action.)

maki.yml:

- name: action2
  runner:
    group: spark
    type: sql
  file: file.sql
  exports:
    high_no: parquet
Environments
• Configuration is stored per environment
• Stored as YAML files in an environment folder
• Contains:
• Input/output paths
• Working directory
• User-defined key-values
env/production/job.yml

name: default
master: mesos://prdmsos:5050
inputRootPath: hdfs://prdhdfs:9000/user/amaterasu/input
outputRootPath: hdfs://prdhdfs:9000/user/amaterasu/output
workingDir: alluxio://prdalluxio:19998/
configuration:
  spark.cassandra.connection.host: cassandraprod
  sourceTable: documents
env/dev/job.yml

name: test
master: local[*]
inputRootPath: file:///tmp/input
outputRootPath: file:///tmp/output
workingDir: file:///tmp/work/
configuration:
  spark.cassandra.connection.host: 127.0.0.1
  sourceTable: documents
Environments in the Actions DSL

import org.apache.amaterasu.runtime._

val highNoDf = AmaContext.getDataFrame("start", "x")
  .where("_1 > 3")
highNoDf.write.json(Env.outputPath)
Demo time
Version 0.2.0-incubating main features
• YARN support
• Spark SQL and PySpark support
• Extended environments to support:
  • Pure YAML configuration (configuration used to be JSON)
  • Full Spark configuration (see the sketch below):
    • spark.yml - supports all Spark configurations
    • spark_exec_env.yml - for configuring Spark executor environments
• SDK preview - for building framework integrations
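
The deck doesn't show the contents of these files; as a hypothetical sketch, assuming spark.yml simply maps standard Spark property names to values, it might look like:

spark.executor.memory: 2g
spark.executor.cores: 2
spark.driver.memory: 1g
spark.serializer: org.apache.spark.serializer.KryoSerializer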
Future Development
• Long running pipelines and streaming support
• Better tooling
• ama-cli
• Web console
• Other frameworks: Presto, TensorFlow, Apache Flink, Apache Beam, Hive
• SDK improvements
Getting started

Website: http://amaterasu.incubator.apache.org
GitHub: https://github.com/apache/incubator-amaterasu
Mailing list: dev@amaterasu.incubator.apache.org
Slack: http://apacheamaterasu.slack.com
Twitter: @ApacheAmaterasu

Thank you!
@YRodenski
yaniv@apache.org
