Simplifying the Creation of Spark Pipelines with Yaetos
Meetup @ Spark Barcelona
Arthur Prévot - 2022-06-28
Table of Contents
● Spark
● What is Yaetos?
● Jobs, Pipelines and Manifest
● Setup and Test
● From Prototype to Production
● Job Parameters
● Workflow
● Other Features
● Demo
Spark is Amazing
● Pushing for data lake architecture
○ ↘💰 storage, ↗ … to organize data
● Strong support for various programming languages and SQL
○ Allowing software dev best practices
● Ability to run locally
● Pushing towards more open source
More work needed to operationalize it
● It needs external tooling
○ Computer resources
○ Storage
○ Scheduling
● It needs extra code to deal with data-eng problems
○ Dataset dependencies
○ Idempotence
○ Unit-testing
○ Dataset cataloging…
● It is often seen as overkill for small jobs
○ -> pandas good enough
What is YAETOS?
An open source tool for data engineers, scientists, and analysts to easily create data pipelines in Python and SQL and put them in production in AWS.
It integrates all tools necessary to create a data stack, relying only on open source engines (Spark or Pandas) and hosted services (AWS). It can be set up in minutes.
Used in prod at Adevinta and The Hotels Network, 100+ datasets updated daily for 2+ years
Like a Swiss Army Knife
Input
● Dataset on disk (S3, local)
● Unstructured data on disk
● MySQL DB tables
● Postgres DB tables
● Redshift DB tables
● API services (Salesforce, Stripe, or other)
Engine/Res./Sched.
Engine
● Spark
● Pandas
Resources
● AWS EMR
Scheduling
● AWS Data Pipeline
Output
● Dataset on disk (S3, local)
● Unstructured data on disk
● MySQL DB tables
● Postgres DB tables
● Redshift DB tables
● API services (Salesforce, Stripe, or other)
With Room for Expansion
Good candidates for addition:
Engine
● Spark
● Pandas
Resources
● EMR
● Kubernetes
● AWS Lambda
Scheduling
● AWS Data Pipeline
● Airflow
Engine/Res./Sched.
Definition of a Job
More flexibility, less simplicity (from first to last):
● ex1_sql_job.sql
python jobs/launcher.py --job_name=examples/ex1_sql_job.sql
● ex1_pyspark_job.py
python jobs/ex1_pyspark_job.py
● ex1_unframed_job.py
python jobs/ex1_unframed_job.py
Job Details
ex1_pyspark_job.py
Framework part: loading dfs + more
Transform code
Params + link to more params
Framework: Command line + exec
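The four parts labelled above can be pictured with a stand-alone sketch. It deliberately does not import Yaetos: `MiniJob`, `load_inputs`, `run`, and the keys of `PARAMS` are hypothetical stand-ins, chosen only to show how framework code and transform code stay separate.

```python
import csv
import io

# Params: a hypothetical inline parameter set (not the Yaetos schema).
PARAMS = {
    "inputs": {"events": "session_id,action\ns1,search\ns1,click\ns2,search\n"},
    "output": "actions_per_session",
}


class MiniJob:
    """Hypothetical stand-in for a framework base class."""

    def __init__(self, params):
        self.params = params

    def load_inputs(self):
        # Framework part: turn each declared input into a list of row dicts.
        return {
            name: list(csv.DictReader(io.StringIO(text)))
            for name, text in self.params["inputs"].items()
        }

    def transform(self, **datasets):
        raise NotImplementedError

    def run(self):
        # Framework part: load the inputs, then call the user transform.
        return self.transform(**self.load_inputs())


class Job(MiniJob):
    def transform(self, events):
        # Transform code: the only part a job author writes.
        counts = {}
        for row in events:
            counts[row["session_id"]] = counts.get(row["session_id"], 0) + 1
        return [
            {"session_id": key, "count_events": value}
            for key, value in sorted(counts.items())
        ]


if __name__ == "__main__":
    # Command line + exec: run the job when the file is executed directly.
    print(Job(PARAMS).run())
```

The job author only touches `transform` and the parameters; everything else could live in a shared library.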
The Jobs Manifest
● List of jobs with their metadata (IO, dependencies, scheduling info)
● Can contain hundreds of jobs
● Can be split across several files if needed (per company dept.)
● Human readable, computer parse-able
jobs_metadata.yml
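To make the idea of a manifest concrete, here is a minimal sketch of what such a file could contain once parsed, written directly as a Python dict. The key names (`inputs`, `output`, `dependencies`, `frequency`) and the paths are assumptions for illustration, not the confirmed layout of jobs_metadata.yml.

```python
# Hypothetical manifest content, written as the dict a YAML parser would return.
# Key names and paths are illustrative only.
MANIFEST = {
    "jobs": {
        "examples/ex0_extraction_job.py": {
            "inputs": {},
            "output": {"path": "data/raw_events/", "type": "csv"},
            "dependencies": [],
        },
        "examples/ex1_sql_job.sql": {
            "inputs": {"events": {"path": "data/raw_events/", "type": "csv"}},
            "output": {"path": "data/sessions/", "type": "csv"},
            "dependencies": ["examples/ex0_extraction_job.py"],
            "frequency": "1 day",
        },
    }
}


def undeclared_dependencies(manifest):
    """Return (job, dependency) pairs whose dependency is not a declared job."""
    jobs = manifest["jobs"]
    return [
        (name, dep)
        for name, meta in jobs.items()
        for dep in meta.get("dependencies", [])
        if dep not in jobs
    ]


def job_metadata(manifest, job_name):
    """Look up one job's metadata, failing loudly on an unknown name."""
    try:
        return manifest["jobs"][job_name]
    except KeyError:
        raise KeyError(f"job not found in manifest: {job_name}") from None
```

Keeping all job metadata in one parse-able structure is what makes checks like `undeclared_dependencies` cheap to write.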
From the Manifest to Job Files
ex_sql_job.sql
jobs_metadata.yml
ex_pyspark_job.py
ex_pandas_job.py
Definition of a Pipeline
Pipeline = job running with its dependencies, as defined in the job manifest
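One way to read this definition: given a job name, collect its upstream jobs from the manifest and run them in dependency order. The sketch below assumes a plain mapping from job name to its dependencies; the job names and the `pipeline_for` helper are hypothetical, and the ordering relies on the standard library's `graphlib` (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: job name -> jobs it depends on.
DEPENDENCIES = {
    "extract": [],
    "clean": ["extract"],
    "report": ["clean"],
    "unrelated": [],
}


def pipeline_for(job_name, dependencies):
    """Return job_name and all its upstream jobs in a valid execution order."""
    needed = {}
    stack = [job_name]
    while stack:
        current = stack.pop()
        if current not in needed:
            needed[current] = dependencies[current]
            stack.extend(dependencies[current])
    # static_order() yields each job after all of its dependencies.
    return list(TopologicalSorter(needed).static_order())


print(pipeline_for("report", DEPENDENCIES))
```

Jobs that the requested job does not depend on (here `unrelated`) are left out of the pipeline.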
How to get it set up?
Run this in a terminal
$ pip install yaetos
$ yaetos setup --project=my_yaetos_jobs
To run sample jobs
host $ cd my_yaetos_jobs/
host $ yaetos launch_docker_bash
# Running 1 job
guest $ python jobs/example/ex0_extraction_job.py
# Running 1 pipeline
guest $ python jobs/example/ex1_framework_job.py --dependencies
Where does it live?
… in a folder, ready for GitHub -> shareable
The files to create pipelines:
● Manifest
● Job code (Python or SQL)
● Job unit-tests (optional)
Jobs can run locally or be pushed to the cloud
How to Inspect a job in Jupyter
From prototype to production
Same command, different argument. No updates to job code
host $ yaetos launch_docker_bash
guest $ python path/to/some_job.py # i.e. local
guest $ python path/to/some_job.py --deploy=EMR
guest $ python path/to/some_job.py --deploy=EMR_Scheduled
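The "same command, different argument" idea can be sketched with a small argument parser. This is not the Yaetos implementation: the `EMR` and `EMR_Scheduled` values come from the commands above, while `none` as the name for a local run, `parse_args`, and `describe_target` are assumptions made for the example.

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Run a job locally or in the cloud.")
    parser.add_argument(
        "--deploy",
        # "none" is an assumed name for the default local run.
        choices=["none", "EMR", "EMR_Scheduled"],
        default="none",
    )
    return parser.parse_args(argv)


def describe_target(deploy):
    """Map the --deploy value to a human-readable execution target."""
    return {
        "none": "run locally",
        "EMR": "run once on an EMR cluster",
        "EMR_Scheduled": "register a schedule, then run on EMR at each scheduled time",
    }[deploy]


print(describe_target(parse_args(["--deploy=EMR"]).deploy))
```

Because the target is chosen by an argument, the job's transform code does not need to know where it runs.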
Running a Pipeline Locally
Components: FS, CPU
On execution:
1. Load input dataset from FS
2. Load secrets from FS
3. Process data
4. Save output to FS
5. Repeat with next dependent job if any
$ python path/to/some_job.py
Running a Pipeline in the Cloud
Components: EMR Cluster, S3 FS, AWS Secrets
On execution:
1. Zip repo files (without secrets)
2. Send zip to S3
3. Send secrets to AWS Secrets
4. Create EMR cluster in AWS if required
5. Load input datasets from S3
6. Load secrets from AWS
7. Process data in EMR
8. Save output to S3
9. Repeat with next dependent job if any
10. Kill cluster
$ python path/to/some_job.py --deploy=EMR
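Step 1 above (zip the repo files without secrets) can be sketched with the standard library alone. The helper name `zip_repo` is hypothetical, and treating `conf/connections.cfg` as the only secrets file is an assumption; the upload and cluster steps are not shown because they depend on AWS credentials and client libraries.

```python
import zipfile
from pathlib import Path

# Assumed location of the secrets file, relative to the repo root.
SECRET_FILES = {"conf/connections.cfg"}


def zip_repo(repo_dir, zip_path, excluded=SECRET_FILES):
    """Zip every file under repo_dir except the excluded relative paths."""
    repo_dir = Path(repo_dir)
    zip_path = Path(zip_path)
    added = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(repo_dir.rglob("*")):
            relative = path.relative_to(repo_dir).as_posix()
            is_archive_itself = path.resolve() == zip_path.resolve()
            if path.is_file() and relative not in excluded and not is_archive_itself:
                archive.write(path, arcname=relative)
                added.append(relative)
    return added
```

Returning the list of archived paths makes it easy to log or assert that no secrets file slipped into the package.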
Running a Pipeline in the Cloud on a Schedule
Components: EMR Cluster, S3 FS, AWS Secrets, AWS Data Pipeline
On execution:
1. Zip repo files (without secrets)
2. Send zip to S3
3. Send secrets to AWS Secrets
4. Configure schedule in AWS Data Pipeline (deactivate previous if any)
When scheduled time reached:
5. Create EMR cluster in AWS
6. Load input datasets from S3
7. Load secrets from AWS
8. Process data in EMR
9. Save output to S3
10. Repeat with next dependent job if any
11. Kill cluster
Repeat at next scheduled time
$ python path/to/some_job.py --deploy=EMR_Scheduled
Running a Pipeline in the Cloud on a Schedule, cont’d
Where do I track my jobs in the cloud?
In the standard AWS UIs for each service
● Resources (EMR)
● Storage (S3)
● Scheduling (AWS Data Pipeline)
Job Parameters
Input/Output
Copy Redshift
Dependencies
Size and # machines
Scheduling info
Email if failing
Custom param
Workflow
● (Optional) Write unit-tests
● Write transform locally
○ Test on unit-tests, or on a local dataset copy
○ Put all parameters in the job for faster iterations.
● Test in the cloud
● PR in GitHub, merge, deploy to the scheduling tool
○ Suggest putting important parameters in the manifest (jobs_metadata.yml)
○ Use "--mode=prod_EMR" to use the parameters associated with production in the cloud (such as the base_path, the database schema to use…)
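The interplay between parameters in the job file, parameters in the manifest, and mode-specific values can be pictured as layered dictionaries. This is a sketch under assumptions: the precedence order (job file over manifest over mode defaults), the `dev_local` mode name, and all values below are hypothetical and only illustrate the mechanism.

```python
# Hypothetical parameter layers; names and values are illustrative only.
MODE_PARAMS = {
    "dev_local": {"base_path": "./data", "schema": "sandbox"},
    "prod_EMR": {"base_path": "s3://some-bucket/data", "schema": "analytics"},
}


def resolve_params(mode, manifest_params, job_params):
    """Merge parameter layers; later layers override earlier ones."""
    resolved = dict(MODE_PARAMS[mode])  # environment-specific values
    resolved.update(manifest_params)    # values declared in the manifest
    resolved.update(job_params)         # values set directly in the job file
    return resolved


print(resolve_params("prod_EMR", {"frequency": "1 day"}, {}))
```

With this layering, switching the mode changes paths and schemas without touching the job code.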
Pipeline Unit-Test
Other Features
● Fairly clean logs
● Secret management (see conf/connections.cfg)
● Automation of folder structure to store previous versions with timestamps
● Support for idempotent incremental pipelines (daily)
● Support gitops: Git hash in logs + prompt if code not clean before publishing
● Inferred schemas documented automatically in YAML file in repo
● Unit-testing
● Saving and loading ML model files instead of datasets
● More example jobs available in main repo
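As an illustration of what "idempotent incremental (daily)" can mean, the sketch below processes only the days that have no output yet, so rerunning it over the same date range changes nothing. It is a generic pattern, not the Yaetos mechanism; `days_to_process`, `run_incremental`, and the in-memory `outputs` dict are hypothetical.

```python
from datetime import date, timedelta


def days_to_process(already_done, start, end):
    """Return the days in [start, end] that have no output yet, in order."""
    total = (end - start).days + 1
    return [
        start + timedelta(days=offset)
        for offset in range(total)
        if start + timedelta(days=offset) not in already_done
    ]


def run_incremental(outputs, start, end, process_day):
    """Fill in missing days; rerunning with the same range changes nothing."""
    for day in days_to_process(set(outputs), start, end):
        outputs[day] = process_day(day)
    return outputs
```

In a real pipeline the `outputs` dict would be replaced by a check on what is already stored (for example dated folders), which is an assumption beyond what the slides state.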
Demo
More Details
● https://medium.com/@arthurprevot/yaetos-data-framework-description-ddc71caf6ce
● Standalone repo (framework + sample jobs):
https://github.com/arthurprevot/yaetos
● Jobs only repo (sample jobs only, framework from pip installed yaetos)
https://github.com/arthurprevot/yaetos_jobs
Found it interesting?
● Please help make it more visible -> add "star"
https://github.com/arthurprevot/yaetos
● Get in touch if you have questions, at arthur@yaetos.com
● Lots of room for improvement, any help welcome!
Thank you!
Questions?
