Building Data Science Pipelines in Python using Luigi
1. Engineering @ Exzeo
Building Data Science Pipelines in Python
PyData Delhi Meetup
Exzeo, Noida
Feb 10, 2018
Shivam Bansal
Shwet Kamal Mishra
2. Contents
● Introduction
● Typical Data Science Workflow
● Challenges in the Data Science Workflow
● Data Science Pipelines
● Why use a Data Science Pipeline
● Luigi - Pipeline in python
● Luigi Features
● Luigi Demo
3. Who We Are
Exzeo is a software development company specializing in core tech products
and services that optimize human capital.
It was registered with the Registrar of Companies on 9th August 2012.
We are part of the HCI Group (NYSE: HCI), a multinational conglomerate based
in Tampa, FL, USA.
The key focus of Exzeo is to improve the insurance sector using technology,
analytics, and data science.
4. Our Products and Services
ATLAS VIEWER
A data visualization product to view real-time feeds
and massive datasets on a map.
EXZEO HQ
Cloud based process management and Intelligent
automation for the insurance industry.
PROPLET
Innovative policy quoting application leveraging
multiple proprietary data sources.
TYPTAP
A complete, quick, and secure platform to access
users’ insurance policies and loss information.
JUSTER
An intelligent app which helps to organize the claim
inspections and sync information with Exzeo Cloud.
HARMONY
Project Harmony offers insurance solutions, right
from buying a policy to filing a claim.
13. Why use a pipeline
- Reuse models and components
- Quick implementation of ideas
- Focus more on science than on engineering
- Production-ready products
14. Pipelines in Python - Luigi
● A Python tool for workflow and task management
● Developed and maintained by Spotify
● Open source: https://github.com/spotify/luigi
pip install luigi
15. What’s so special about Luigi
● Task Templating
● Task Scheduling
● Task Monitoring
● Command-Line Integration
● Batch and Parallel Processing
● Dependency Graphs
● Failure Recovery and Error Emails
20. Problem Statement:
Building a pipeline to predict the Performance Score of a mobile game user.
The game consists of 120 different characters (heroes), and each hero has a set of capabilities.
Input Data
Training Data: User scores for given characters
Independent Variables: User ID, Character ID, User-Character ID, Num Tries, Boost Used (0/1),
Attack Duration
Dependent Variable: Performance Score
Character Metadata: Data for each character
Variables: Character ID, Character Type, Hitpoints
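The two input tables described above might look like the following sketch; all rows and values are invented for illustration, only the column names follow the slide:

```python
import pandas as pd

# Hypothetical training rows: one row per user-character attempt.
train = pd.DataFrame({
    "user_id": [1, 1, 2],
    "character_id": [10, 11, 10],
    "user_character_id": ["1_10", "1_11", "2_10"],
    "num_tries": [3, 1, 5],
    "boost_used": [0, 1, 0],
    "attack_duration": [12.5, 8.0, 20.1],
    "performance_score": [0.61, 0.87, 0.42],  # dependent variable
})

# Hypothetical character metadata: one row per hero.
characters = pd.DataFrame({
    "character_id": [10, 11],
    "character_type": ["tank", "mage"],
    "hitpoints": [500, 220],
})

# The metadata joins onto the training rows by character_id.
merged = train.merge(characters, on="character_id", how="left")
```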
21. Solution Pipeline
● Load Data
● Aggregate Data
● Preprocess Data
● Model Training
  ○ Linear Regression
  ○ Random Forest
● Model Selection
● Model Prediction
● Data science tasks are repetitive; there needs to be a workflow that can reproduce a set of tasks.
● DS involves a long chain of sequential processes, and failure can happen at any step.
● There needs to be a framework that can help us resume work from the point of failure.
● Tasks should generalise across different sets of parameters.
● When running a large process, monitoring is required to track the progress of tasks and find errors at the exact point of failure.