SlideShare a Scribd company logo
Engineering @ Exzeo
Building Data Science
Pipelines in Python
Pydata Delhi Meetup
Exzeo, Noida
Feb 10, 2018
Shivam Bansal
Shwet Kamal Mishra
Contents
● Introduction
● Typical Data Science Workflow
● Challenges in the Data Science Workflow
● Data Science Pipelines
● Why use a Data Science Pipeline
● Luigi - Pipeline in python
● Luigi Features
● Luigi Demo
Who We are
Exzeo is a software development company specialized in core tech products
and services that optimize human capital
It was registered with Registrar of Companies on 9th August 2012.
We are a part of HCI group (NYSE: HCI) , a multinational conglomerate based
at Tampa, FL,USA.
The key focus of Exzeo is to improve the Insurance Sector using technology,
analytics and data science
Our Products and Services
ATLAS VIEWER
A data visualization product to view real-time feeds
and massive datasets on a map.
EXZEO HQ
Cloud based process management and Intelligent
automation for the insurance industry.
PROPLET
Innovative policy quoting application leveraging
multiple proprietary data sources.
TYPTAP
A complete, quick and secure platform to access
user’s insurance policies, and loss information
JUSTER
An intelligent app which helps to organize the claim
inspections and sync information with Exzeo Cloud.
HARMONY
Project Harmony offers insurance solutions; right
from buying a policy to filing a claim.
Our Tech Stack
Backend Frontend DataStorage DataScience DevOps /
Platforms
Data Science Problems @ Exzeo
● Property Risk Scoring from Multidimensional Data
● Detecting Roof Shape from Satellite Images
● Fraud Detection in Insurance Claims
● Claim Cause and Cost Prediction
● Knowledge Graph : Root Claim Cause Detection using NLP
● Climate Risk Forecasting
● Insurance Price Quoting Chatbot
● Object Detection from Property Interior Images
Typical Data Science Workflow
$ python procure_data.py
$ python clean_data.py
$ python feature_engineering.py
$ python exploratory_data_analysis.py
$ python modelling.py
$ python visualize_results.py
Too many tasks
procure_data()
clean_data()
feature_engineering()
exploratory_data_anaysis
()
<<--
Error
modelling()
visualize_results()
Failure Recovery
Reproducibility
generic_data_cleaning()
generic_data_processing()
generic_data_analysis()
generic_data_modeling()
Too Much Boilerplate Code
If __name__ == ‘__main__’
Solution - Pipeline
Continuous Integration of data processing steps and analysis tasks
Why use a pipeline
- Reuse the models
- Quick Implementation of Ideas
- Focus more on science instead of engineering
- Production ready products
Pipelines in Python - Luigi
● Python tool for workflow task management
● Developed and maintained by Spotify
● Open Source: https://github.com/spotify/luigi
pip install luigi
What’s so special about Luigi
● Tasks Templating
● Tasks Scheduling
● Tasks Monitoring
● Command Line Integration
● Batch and Parallel Processing
● Dependency Graphs
● Failure Recovery and Error Emails
Luigi Tasks
Monitoring Tasks
Visualizing Tasks Workflow
Central Scheduler
Problem Statement:
Building a Pipeline to predict the Performance Score of a mobile game user.
The game consists of 120 different characters(heroes) and every hero has some capabilities.
Input Data
Training Data: User score for given characters
Independent Variables: User ID, Character ID, User-Character ID, Num Tries, Boost Used(0/1),
Attack Duration
Dependent Variable: Performance Score
Character Metadata: Data of each character
Variables: Character ID, Character Type, Hitpoints
Solution Pipeline
● Load Data
● Aggregate Data
● PreProcess Data
● Model Training
● Linear Regression
● Random Forest
● Model Selection
● Model Prediction
Luigi Pipeline Demo
- Not ideal for Streaming Data
- No built in triggering(crontab or message broker is used)
Limitations of Luigi
Shivam Bansal | shivam5992@gmail.com | www.shivambansal.com
Shwet Kamal Mishra | shwetmishraa@gmail.com | www.shwetkmishra.com
Thanks !

More Related Content

What's hot

Ml master class
Ml master classMl master class
Ml master class
QuantUniversity
 
Synthetic data in finance
Synthetic data in financeSynthetic data in finance
Synthetic data in finance
QuantUniversity
 
Frontiers in Alternative Data : Techniques and Use Cases
Frontiers in Alternative Data : Techniques and Use CasesFrontiers in Alternative Data : Techniques and Use Cases
Frontiers in Alternative Data : Techniques and Use Cases
QuantUniversity
 
Machine Learning in Finance: 10 Things You Need to Know in 2021
Machine Learning in Finance: 10 Things You Need to Know in 2021Machine Learning in Finance: 10 Things You Need to Know in 2021
Machine Learning in Finance: 10 Things You Need to Know in 2021
QuantUniversity
 
Quant university MRM and machine learning
Quant university MRM and machine learningQuant university MRM and machine learning
Quant university MRM and machine learning
QuantUniversity
 
Ml master class cfa poland
Ml master class   cfa polandMl master class   cfa poland
Ml master class cfa poland
QuantUniversity
 
Fintech in the Post-Covid Age
Fintech in the Post-Covid AgeFintech in the Post-Covid Age
Fintech in the Post-Covid Age
QuantUniversity
 
Machine Learning: Considerations for Fairly and Transparently Expanding Acces...
Machine Learning: Considerations for Fairly and Transparently Expanding Acces...Machine Learning: Considerations for Fairly and Transparently Expanding Acces...
Machine Learning: Considerations for Fairly and Transparently Expanding Acces...
QuantUniversity
 
Synthetic data in finance
Synthetic data in financeSynthetic data in finance
Synthetic data in finance
QuantUniversity
 
Careers in analytics
Careers in analyticsCareers in analytics
Careers in analytics
QuantUniversity
 
Ml conference slides
Ml conference slidesMl conference slides
Ml conference slides
QuantUniversity
 
Ml and AI for financial professionals
Ml and AI for financial professionalsMl and AI for financial professionals
Ml and AI for financial professionals
QuantUniversity
 
Trends and practical applications of AI/ML in Fin Tech industry - Milos Kosan...
Trends and practical applications of AI/ML in Fin Tech industry - Milos Kosan...Trends and practical applications of AI/ML in Fin Tech industry - Milos Kosan...
Trends and practical applications of AI/ML in Fin Tech industry - Milos Kosan...
Institute of Contemporary Sciences
 
Machine learning for factor investing
Machine learning for factor investingMachine learning for factor investing
Machine learning for factor investing
QuantUniversity
 
CFA-NY Workshop - Final slides
CFA-NY Workshop - Final slidesCFA-NY Workshop - Final slides
CFA-NY Workshop - Final slides
QuantUniversity
 
Machine Learning and AI in Risk Management
Machine Learning and AI in Risk ManagementMachine Learning and AI in Risk Management
Machine Learning and AI in Risk Management
QuantUniversity
 
An introduction to ML, AI and Analytics
An introduction to ML, AI and AnalyticsAn introduction to ML, AI and Analytics
An introduction to ML, AI and Analytics
Spotle.ai
 
Rapid prototyping quant research ml models using the qu sandbox
Rapid prototyping quant research ml models using the qu sandboxRapid prototyping quant research ml models using the qu sandbox
Rapid prototyping quant research ml models using the qu sandbox
QuantUniversity
 
Synthetic data generation for machine learning
Synthetic data generation for machine learningSynthetic data generation for machine learning
Synthetic data generation for machine learning
QuantUniversity
 
AI and ML Disruption in Finance
AI and ML Disruption in FinanceAI and ML Disruption in Finance
AI and ML Disruption in Finance
Gopi Suvanam
 

What's hot (20)

Ml master class
Ml master classMl master class
Ml master class
 
Synthetic data in finance
Synthetic data in financeSynthetic data in finance
Synthetic data in finance
 
Frontiers in Alternative Data : Techniques and Use Cases
Frontiers in Alternative Data : Techniques and Use CasesFrontiers in Alternative Data : Techniques and Use Cases
Frontiers in Alternative Data : Techniques and Use Cases
 
Machine Learning in Finance: 10 Things You Need to Know in 2021
Machine Learning in Finance: 10 Things You Need to Know in 2021Machine Learning in Finance: 10 Things You Need to Know in 2021
Machine Learning in Finance: 10 Things You Need to Know in 2021
 
Quant university MRM and machine learning
Quant university MRM and machine learningQuant university MRM and machine learning
Quant university MRM and machine learning
 
Ml master class cfa poland
Ml master class   cfa polandMl master class   cfa poland
Ml master class cfa poland
 
Fintech in the Post-Covid Age
Fintech in the Post-Covid AgeFintech in the Post-Covid Age
Fintech in the Post-Covid Age
 
Machine Learning: Considerations for Fairly and Transparently Expanding Acces...
Machine Learning: Considerations for Fairly and Transparently Expanding Acces...Machine Learning: Considerations for Fairly and Transparently Expanding Acces...
Machine Learning: Considerations for Fairly and Transparently Expanding Acces...
 
Synthetic data in finance
Synthetic data in financeSynthetic data in finance
Synthetic data in finance
 
Careers in analytics
Careers in analyticsCareers in analytics
Careers in analytics
 
Ml conference slides
Ml conference slidesMl conference slides
Ml conference slides
 
Ml and AI for financial professionals
Ml and AI for financial professionalsMl and AI for financial professionals
Ml and AI for financial professionals
 
Trends and practical applications of AI/ML in Fin Tech industry - Milos Kosan...
Trends and practical applications of AI/ML in Fin Tech industry - Milos Kosan...Trends and practical applications of AI/ML in Fin Tech industry - Milos Kosan...
Trends and practical applications of AI/ML in Fin Tech industry - Milos Kosan...
 
Machine learning for factor investing
Machine learning for factor investingMachine learning for factor investing
Machine learning for factor investing
 
CFA-NY Workshop - Final slides
CFA-NY Workshop - Final slidesCFA-NY Workshop - Final slides
CFA-NY Workshop - Final slides
 
Machine Learning and AI in Risk Management
Machine Learning and AI in Risk ManagementMachine Learning and AI in Risk Management
Machine Learning and AI in Risk Management
 
An introduction to ML, AI and Analytics
An introduction to ML, AI and AnalyticsAn introduction to ML, AI and Analytics
An introduction to ML, AI and Analytics
 
Rapid prototyping quant research ml models using the qu sandbox
Rapid prototyping quant research ml models using the qu sandboxRapid prototyping quant research ml models using the qu sandbox
Rapid prototyping quant research ml models using the qu sandbox
 
Synthetic data generation for machine learning
Synthetic data generation for machine learningSynthetic data generation for machine learning
Synthetic data generation for machine learning
 
AI and ML Disruption in Finance
AI and ML Disruption in FinanceAI and ML Disruption in Finance
AI and ML Disruption in Finance
 

Similar to Data Science Pipelines in Python using Luigi

Meg Mude, Intel - Data Engineering Lifecycle Optimized on Intel - H2O World S...
Meg Mude, Intel - Data Engineering Lifecycle Optimized on Intel - H2O World S...Meg Mude, Intel - Data Engineering Lifecycle Optimized on Intel - H2O World S...
Meg Mude, Intel - Data Engineering Lifecycle Optimized on Intel - H2O World S...
Sri Ambati
 
How Data Virtualization Puts Machine Learning into Production (APAC)
How Data Virtualization Puts Machine Learning into Production (APAC)How Data Virtualization Puts Machine Learning into Production (APAC)
How Data Virtualization Puts Machine Learning into Production (APAC)
Denodo
 
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
Denodo
 
Bitrock manufacturing
Bitrock manufacturing Bitrock manufacturing
Bitrock manufacturing
cosma_r
 
Data science tools of the trade
Data science tools of the tradeData science tools of the trade
Data science tools of the trade
Fangda Wang
 
Top 10 Most Demand IT Certifications Course in 2020 - MildainTrainings
Top 10 Most Demand IT Certifications Course in 2020 - MildainTrainingsTop 10 Most Demand IT Certifications Course in 2020 - MildainTrainings
Top 10 Most Demand IT Certifications Course in 2020 - MildainTrainings
Mildain Solutions
 
Nadine Schöne, Dataiku. The Complete Data Value Chain in a Nutshell
Nadine Schöne, Dataiku. The Complete Data Value Chain in a NutshellNadine Schöne, Dataiku. The Complete Data Value Chain in a Nutshell
Nadine Schöne, Dataiku. The Complete Data Value Chain in a Nutshell
IT Arena
 
Comparing the performance of a business process: using Excel & Python
Comparing the performance of a business process: using Excel & PythonComparing the performance of a business process: using Excel & Python
Comparing the performance of a business process: using Excel & Python
IRJET Journal
 
Second Review GTU intern ship about plant disease.pptx
Second Review GTU intern ship about plant disease.pptxSecond Review GTU intern ship about plant disease.pptx
Second Review GTU intern ship about plant disease.pptx
royromeo560
 
London atlassian meetup 31 jan 2016 jira metrics-extract slides
London atlassian meetup 31 jan 2016 jira metrics-extract slidesLondon atlassian meetup 31 jan 2016 jira metrics-extract slides
London atlassian meetup 31 jan 2016 jira metrics-extract slides
Rudiger Wolf
 
R, Spark, Tensorflow, H20.ai Applied to Streaming Analytics
R, Spark, Tensorflow, H20.ai Applied to Streaming AnalyticsR, Spark, Tensorflow, H20.ai Applied to Streaming Analytics
R, Spark, Tensorflow, H20.ai Applied to Streaming Analytics
Kai Wähner
 
How to Leverage Machine Learning (R, Hadoop, Spark, H2O) for Real Time Proces...
How to Leverage Machine Learning (R, Hadoop, Spark, H2O) for Real Time Proces...How to Leverage Machine Learning (R, Hadoop, Spark, H2O) for Real Time Proces...
How to Leverage Machine Learning (R, Hadoop, Spark, H2O) for Real Time Proces...
Codemotion
 
2018 Oracle Impact 발표자료: Oracle Enterprise AI
2018  Oracle Impact 발표자료: Oracle Enterprise AI2018  Oracle Impact 발표자료: Oracle Enterprise AI
2018 Oracle Impact 발표자료: Oracle Enterprise AI
Taewan Kim
 
Session 2023-11.pptx
Session 2023-11.pptxSession 2023-11.pptx
Session 2023-11.pptx
AndreeaTom
 
Maximize Big Data ROI via Best of Breed Patterns and Practices
Maximize Big Data ROI via Best of Breed Patterns and PracticesMaximize Big Data ROI via Best of Breed Patterns and Practices
Maximize Big Data ROI via Best of Breed Patterns and Practices
Jeff Bertman
 
Pinterest - Big Data Machine Learning Platform at Pinterest
Pinterest - Big Data Machine Learning Platform at PinterestPinterest - Big Data Machine Learning Platform at Pinterest
Pinterest - Big Data Machine Learning Platform at Pinterest
Alluxio, Inc.
 
Voxxed Days Cluj - Powering interactive data analysis with Google BigQuery
Voxxed Days Cluj - Powering interactive data analysis with Google BigQueryVoxxed Days Cluj - Powering interactive data analysis with Google BigQuery
Voxxed Days Cluj - Powering interactive data analysis with Google BigQuery
Márton Kodok
 
hari_duche_updated
hari_duche_updatedhari_duche_updated
hari_duche_updatedHari Duche
 
28022017 Simen Munter Mindfields
28022017 Simen Munter Mindfields28022017 Simen Munter Mindfields
28022017 Simen Munter Mindfields
Mohit Sharma (GAICD)
 
Artificial Intelligence and Machine Learning with the Oracle Data Science Cloud
Artificial Intelligence and Machine Learning with the Oracle Data Science CloudArtificial Intelligence and Machine Learning with the Oracle Data Science Cloud
Artificial Intelligence and Machine Learning with the Oracle Data Science Cloud
Juarez Junior
 

Similar to Data Science Pipelines in Python using Luigi (20)

Meg Mude, Intel - Data Engineering Lifecycle Optimized on Intel - H2O World S...
Meg Mude, Intel - Data Engineering Lifecycle Optimized on Intel - H2O World S...Meg Mude, Intel - Data Engineering Lifecycle Optimized on Intel - H2O World S...
Meg Mude, Intel - Data Engineering Lifecycle Optimized on Intel - H2O World S...
 
How Data Virtualization Puts Machine Learning into Production (APAC)
How Data Virtualization Puts Machine Learning into Production (APAC)How Data Virtualization Puts Machine Learning into Production (APAC)
How Data Virtualization Puts Machine Learning into Production (APAC)
 
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
 
Bitrock manufacturing
Bitrock manufacturing Bitrock manufacturing
Bitrock manufacturing
 
Data science tools of the trade
Data science tools of the tradeData science tools of the trade
Data science tools of the trade
 
Top 10 Most Demand IT Certifications Course in 2020 - MildainTrainings
Top 10 Most Demand IT Certifications Course in 2020 - MildainTrainingsTop 10 Most Demand IT Certifications Course in 2020 - MildainTrainings
Top 10 Most Demand IT Certifications Course in 2020 - MildainTrainings
 
Nadine Schöne, Dataiku. The Complete Data Value Chain in a Nutshell
Nadine Schöne, Dataiku. The Complete Data Value Chain in a NutshellNadine Schöne, Dataiku. The Complete Data Value Chain in a Nutshell
Nadine Schöne, Dataiku. The Complete Data Value Chain in a Nutshell
 
Comparing the performance of a business process: using Excel & Python
Comparing the performance of a business process: using Excel & PythonComparing the performance of a business process: using Excel & Python
Comparing the performance of a business process: using Excel & Python
 
Second Review GTU intern ship about plant disease.pptx
Second Review GTU intern ship about plant disease.pptxSecond Review GTU intern ship about plant disease.pptx
Second Review GTU intern ship about plant disease.pptx
 
London atlassian meetup 31 jan 2016 jira metrics-extract slides
London atlassian meetup 31 jan 2016 jira metrics-extract slidesLondon atlassian meetup 31 jan 2016 jira metrics-extract slides
London atlassian meetup 31 jan 2016 jira metrics-extract slides
 
R, Spark, Tensorflow, H20.ai Applied to Streaming Analytics
R, Spark, Tensorflow, H20.ai Applied to Streaming AnalyticsR, Spark, Tensorflow, H20.ai Applied to Streaming Analytics
R, Spark, Tensorflow, H20.ai Applied to Streaming Analytics
 
How to Leverage Machine Learning (R, Hadoop, Spark, H2O) for Real Time Proces...
How to Leverage Machine Learning (R, Hadoop, Spark, H2O) for Real Time Proces...How to Leverage Machine Learning (R, Hadoop, Spark, H2O) for Real Time Proces...
How to Leverage Machine Learning (R, Hadoop, Spark, H2O) for Real Time Proces...
 
2018 Oracle Impact 발표자료: Oracle Enterprise AI
2018  Oracle Impact 발표자료: Oracle Enterprise AI2018  Oracle Impact 발표자료: Oracle Enterprise AI
2018 Oracle Impact 발표자료: Oracle Enterprise AI
 
Session 2023-11.pptx
Session 2023-11.pptxSession 2023-11.pptx
Session 2023-11.pptx
 
Maximize Big Data ROI via Best of Breed Patterns and Practices
Maximize Big Data ROI via Best of Breed Patterns and PracticesMaximize Big Data ROI via Best of Breed Patterns and Practices
Maximize Big Data ROI via Best of Breed Patterns and Practices
 
Pinterest - Big Data Machine Learning Platform at Pinterest
Pinterest - Big Data Machine Learning Platform at PinterestPinterest - Big Data Machine Learning Platform at Pinterest
Pinterest - Big Data Machine Learning Platform at Pinterest
 
Voxxed Days Cluj - Powering interactive data analysis with Google BigQuery
Voxxed Days Cluj - Powering interactive data analysis with Google BigQueryVoxxed Days Cluj - Powering interactive data analysis with Google BigQuery
Voxxed Days Cluj - Powering interactive data analysis with Google BigQuery
 
hari_duche_updated
hari_duche_updatedhari_duche_updated
hari_duche_updated
 
28022017 Simen Munter Mindfields
28022017 Simen Munter Mindfields28022017 Simen Munter Mindfields
28022017 Simen Munter Mindfields
 
Artificial Intelligence and Machine Learning with the Oracle Data Science Cloud
Artificial Intelligence and Machine Learning with the Oracle Data Science CloudArtificial Intelligence and Machine Learning with the Oracle Data Science Cloud
Artificial Intelligence and Machine Learning with the Oracle Data Science Cloud
 

Recently uploaded

Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
correoyaya
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
yhkoc
 
tapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive datatapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive data
theahmadsaood
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
enxupq
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
NABLAS株式会社
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
ukgaet
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
ewymefz
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
ocavb
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
ewymefz
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
nscud
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Subhajit Sahu
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
Opendatabay
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
enxupq
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
axoqas
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
Tiktokethiodaily
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
NABLAS株式会社
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
AbhimanyuSinha9
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Boston Institute of Analytics
 
Investigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesInvestigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_Crimes
StarCompliance.io
 

Recently uploaded (20)

Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
 
tapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive datatapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive data
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
 
Investigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesInvestigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_Crimes
 

Data Science Pipelines in Python using Luigi

  • 1. Engineering @ Exzeo Building Data Science Pipelines in Python Pydata Delhi Meetup Exzeo, Noida Feb 10, 2018 Shivam Bansal Shwet Kamal Mishra
  • 2. Contents ● Introduction ● Typical Data Science Workflow ● Challenges in the Data Science Workflow ● Data Science Pipelines ● Why use a Data Science Pipeline ● Luigi - Pipeline in python ● Luigi Features ● Luigi Demo
  • 3. Who We are Exzeo is a software development company specialized in core tech products and services that optimize human capital It was registered with Registrar of Companies on 9th August 2012. We are a part of HCI group (NYSE: HCI) , a multinational conglomerate based at Tampa, FL,USA. The key focus of Exzeo is to improve the Insurance Sector using technology, analytics and data science
  • 4. Our Products and Services ATLAS VIEWER A data visualization product to view real-time feeds and massive datasets on a map. EXZEO HQ Cloud based process management and Intelligent automation for the insurance industry. PROPLET Innovative policy quoting application leveraging multiple proprietary data sources. TYPTAP A complete, quick and secure platform to access user’s insurance policies, and loss information JUSTER An intelligent app which helps to organize the claim inspections and sync information with Exzeo Cloud. HARMONY Project Harmony offers insurance solutions; right from buying a policy to filing a claim.
  • 5. Our Tech Stack Backend Frontend DataStorage DataScience DevOps / Platforms
  • 6. Data Science Problems @ Exzeo ● Property Risk Scoring from Multidimensional Data ● Detecting Roof Shape from Satellite Images ● Fraud Detection in Insurance Claims ● Claim Cause and Cost Prediction ● Knowledge Graph : Root Claim Cause Detection using NLP ● Climate Risk Forecasting ● Insurance Price Quoting Chatbot ● Object Detection from Property Interior Images
  • 8. $ python procure_data.py $ python clean_data.py $ python feature_engineering.py $ python exploratory_data_analysis.py $ python modelling.py $ python visualize_results.py Too many tasks
  • 11. Too Much Boilerplate Code If __name__ == ‘__main__’
  • 12. Solution - Pipeline Continuous Integration of data processing steps and analysis tasks
  • 13. Why use a pipeline - Reuse the models - Quick Implementation of Ideas - Focus more on science instead of engineering - Production ready products
  • 14. Pipelines in Python - Luigi ● Python tool for workflow task management ● Developed and maintained by Spotify ● Open Source: https://github.com/spotify/luigi pip install luigi
  • 15. What’s so special about Luigi ● Tasks Templating ● Tasks Scheduling ● Tasks Monitoring ● Command Line Integration ● Batch and Parallel Processing ● Dependency Graphs ● Failure Recovery and Error Emails
  • 20. Problem Statement: Building a Pipeline to predict the Performance Score of a mobile game user. The game consists of 120 different characters(heroes) and every hero has some capabilities. Input Data Training Data: User score for given characters Independent Variables: User ID, Character ID, User-Character ID, Num Tries, Boost Used(0/1), Attack Duration Dependent Variable: Performance Score Character Metadata: Data of each character Variables: Character ID, Character Type, Hitpoints
  • 21. Solution Pipeline ● Load Data ● Aggregate Data ● PreProcess Data ● Model Training ● Linear Regression ● Random Forest ● Model Selection ● Model Prediction
  • 23. - Not ideal for Streaming Data - No built in triggering(crontab or message broker is used) Limitations of Luigi
  • 24. Shivam Bansal | shivam5992@gmail.com | www.shivambansal.com Shwet Kamal Mishra | shwetmishraa@gmail.com | www.shwetkmishra.com Thanks !