This document provides an overview of a project to predict product demand for a Mexican bakery company using a Kaggle competition dataset. It describes three models: a naive baseline, an NLTK-based model using text features, and a comprehensive XGBoost model with extensive feature engineering and parameter tuning. The XGBoost model achieved the best scores on the validation and test sets. The document also discusses the tools, platforms, data, and technical challenges involved in building accurate demand prediction models for this large dataset.
3. Project Description
Kaggle Contest
Grupo Bimbo: a Mexican company of fresh bakery products.
Products are shipped from storage facilities to stores.
The following week, unsold products are returned.
The goal is to predict the correct demand for shipping to stores.
4. Why Did We Pick This Project?
Why Kaggle?
● Test our “Data Science” abilities in an international field
● Kaggle forum
● Clean data and a clear goal
● More time for feature engineering and modelling
Why This Project?
● Very common problem
● Chance to work with a very large dataset
● Deadline of the competition (30 August)
6. Platforms
● Ubuntu (16 GB RAM)
● Macbook Pro (16 GB RAM)
● EC2 Instance on Amazon (100 GB RAM, 16-core CPU)
○ $150 for 2 days and an extra $50 for backup
● Google Cloud (100 GB RAM, 16-core CPU)
○ $50 for 1 day
● Google Cloud Preemptible (208 GB RAM, 32-core CPU)
7. Data Provided
Train.csv (3.2 GB): the training set, covering weeks 3-9
Test.csv (251 MB): the test set, covering weeks 10-11
Sample_submission.csv (69 MB): a sample submission file in the correct format
cliente_tabla.csv: client names (can be joined with train/test on Cliente_ID)
11. Dealing with the Large Data
To optimize RAM use and speed up XGBoost performance, we:
● Set data types explicitly
● Converted integers to unsigned types
● Reduced floating-point precision as far as possible
Memory usage was reduced from 6.1 GB to 2.1 GB.
An alternative approach would have been reading and processing the data in chunks.
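The downcasting described above can be sketched with pandas; the column names follow the Kaggle schema, but the rows here are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical slice of the training data (real file has 70M+ rows).
df = pd.DataFrame({
    "Semana": [3, 4, 5],                   # week number: fits in uint8
    "Cliente_ID": [15766, 15766, 1250],    # fits in uint32
    "Demanda_uni_equil": [3.0, 4.0, 2.0],  # target: float64 by default
})

# Force compact types: unsigned integers and reduced float precision.
df["Semana"] = df["Semana"].astype(np.uint8)
df["Cliente_ID"] = df["Cliente_ID"].astype(np.uint32)
df["Demanda_uni_equil"] = df["Demanda_uni_equil"].astype(np.float32)

print(df.memory_usage(deep=True).sum(), "bytes")
```

On the full dataset, shrinking each column this way is what brings usage from roughly 6 GB down to 2 GB.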
13. Model 1 - Naive Prediction
We first created a naive approach without using Machine Learning:
● Group the training data by Product ID, Client ID, Agency ID and Route ID and take the mean demand
● If this specific grouping doesn't exist, fall back to the product's mean demand
● If that doesn't exist either, simply take the overall mean demand
This method scored 0.73 on the public leaderboard when submitted.
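The fallback chain above can be sketched as follows; the toy rows and IDs are invented for illustration:

```python
import pandas as pd

# Toy training data; IDs and demands are hypothetical.
train = pd.DataFrame({
    "Producto_ID": [1, 1, 2, 2],
    "Cliente_ID":  [10, 10, 20, 30],
    "Agencia_ID":  [5, 5, 6, 6],
    "Ruta_SAK":    [7, 7, 8, 8],
    "Demanda_uni_equil": [4.0, 6.0, 2.0, 10.0],
})

keys = ["Producto_ID", "Cliente_ID", "Agencia_ID", "Ruta_SAK"]
group_mean = train.groupby(keys)["Demanda_uni_equil"].mean()
product_mean = train.groupby("Producto_ID")["Demanda_uni_equil"].mean()
global_mean = train["Demanda_uni_equil"].mean()

def predict(producto, cliente, agencia, ruta):
    """Fall back from the most specific mean to the global mean."""
    key = (producto, cliente, agencia, ruta)
    if key in group_mean.index:
        return group_mean[key]
    if producto in product_mean.index:
        return product_mean[producto]
    return global_mean

print(predict(1, 10, 5, 7))   # exact grouping exists -> its mean, 5.0
print(predict(2, 99, 5, 7))   # unseen client -> product 2 mean, 6.0
print(predict(42, 99, 5, 7))  # unseen product -> global mean, 5.5
```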
15. Feature Engineering
We used the NLTK library to extract the following information from the Producto file:
● Weight: in grams
● Pieces
● Brand Name: extracted from a three-letter acronym
● Short Name: extracted from the Product Name field; we first removed the Spanish “stop words” and then applied stemming
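A minimal sketch of this extraction: the regexes and the hand-picked stopword list are assumptions (the project used NLTK's full Spanish stopword corpus), but the Snowball stemmer is NLTK's real Spanish stemmer:

```python
import re
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("spanish")
# Tiny hand-picked list; substitute nltk.corpus.stopwords.words("spanish").
SPANISH_STOPWORDS = {"de", "la", "el", "con", "y", "en"}

def short_name(product_name):
    """Keep alphabetic tokens, drop stopwords and single letters, stem."""
    words = re.findall(r"[a-z]+", product_name.lower())
    words = [w for w in words if w not in SPANISH_STOPWORDS and len(w) > 1]
    return " ".join(stemmer.stem(w) for w in words)

def extract_weight(product_name):
    """Weight in grams, parsed from tokens like '460g'."""
    m = re.search(r"(\d+)\s*g\b", product_name.lower())
    return int(m.group(1)) if m else None

# Hypothetical row in the style of producto_tabla.csv
print(short_name("Pan Blanco 460g BIM 2025"))
print(extract_weight("Pan Blanco 460g BIM 2025"))  # -> 460
```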
16. Modeling
1. Separate x and y (label) of the training data;
2. Drop training columns that don't exist in the test data;
3. Align the train and test column order and append test to train vertically;
4. Merge the table with the Product table;
5. Use Scikit-learn's CountVectorizer on the brand and short_name columns to create sparse word-count matrices;
6. Use the CountVectorizer output to create dummy variables;
7. Separate the appended train and test data again;
8. Train XGBoost with default parameters on the train data and predict the test data.
17. Technical Problems
● Garbage Collection
○ We had to remove unused objects and trigger the garbage collector manually to free memory
● Data size due to sparsity
○ 70+ million rows and 577 columns would need ~161 GB of memory as a dense matrix
○ We solved this with sparse matrices from the SciPy library, bringing memory use down to 5 GB
○ In the example below, the “COO” sparse format stores only the entries different from 0
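A toy-sized illustration of the COO format from SciPy, which keeps only the non-zero coordinates and values:

```python
import numpy as np
from scipy import sparse

# A mostly-zero dummy-variable matrix like the one above, toy-sized.
dense = np.zeros((4, 5), dtype=np.float32)
dense[0, 1] = 1.0
dense[3, 4] = 1.0

coo = sparse.coo_matrix(dense)  # stores (row, col, value) triplets only
print(coo.nnz, "non-zeros out of", dense.size, "cells")
print(coo.row, coo.col, coo.data)
```

At 70M rows x 577 mostly-zero columns, storing only the non-zeros is what turns ~161 GB into ~5 GB.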
18. Score and Conclusion
The RMSLE scores were as follows:

Validation: 0.764
Test (week 10): 0.775
Test (week 11): 0.781

These scores were worse than the naive approach, so we started to think about a new model.
20. Digging Deeper: Data Exploration
Train - Test difference
● We analyzed products, clients, agencies and routes that exist in train but not in test
● There were 9663 clients, 34 products, 0 agencies and 1012 routes that don't exist in the test data
● The important outcome of this analysis: we should build a general model that can handle new products, clients and routes that appear in the test data but not in the train data.
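This kind of train/test ID comparison can be done with simple set differences; the IDs below are toy values:

```python
import pandas as pd

# Toy ID columns; the real analysis compared Cliente_ID, Producto_ID,
# Agencia_ID and Ruta_SAK between the full train and test files.
train = pd.DataFrame({"Cliente_ID": [1, 2, 3, 4]})
test = pd.DataFrame({"Cliente_ID": [2, 3, 5]})

only_in_train = set(train["Cliente_ID"]) - set(test["Cliente_ID"])
only_in_test = set(test["Cliente_ID"]) - set(train["Cliente_ID"])
print("only in train:", sorted(only_in_train))
print("only in test:", sorted(only_in_test))
```

IDs in `only_in_test` are exactly the unseen entities the general model must handle.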
21. Feature Engineering - 1
● Agencia
○ The Agencia file maps each agency to its town id and state name. We merged this file with the train and test data on the Agencia_ID column and encoded the state column as integers.
● Producto
○ We reused the features from the NLTK model: weights and pieces. In addition, we included the product's short name and brand id.
○ An example of a product's short name and brand id can be seen below:
○ 2025,Pan Blanco 460g BIM 2025
2027,Pan Blanco 567g WON 2027
22. Feature Engineering - 2
We want to predict how many units of a product a client sells when shipped from a given agency.
● Why not look at the past demand for this product at this client from this agency?
● If that combination doesn't exist, why not look at the past demand for this product at this client?
We applied the same logic to the products' short names and brand ids.
We named these features: Lag0, Lag1, Lag2 and Lag3.
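One plausible reading of these lag features, sketched on toy data (the exact grouping keys and lag naming in the real pipeline are assumptions here): the mean demand of the same (product, client) pair some number of weeks earlier, merged back onto each row.

```python
import pandas as pd

# Toy weekly demand rows; the real data spans weeks 3-9.
train = pd.DataFrame({
    "Semana":      [3, 4, 4, 5],
    "Producto_ID": [1, 1, 1, 1],
    "Cliente_ID":  [10, 10, 20, 10],
    "Demanda_uni_equil": [4.0, 6.0, 2.0, 5.0],
})

def add_lag(df, lag):
    """Mean demand of the same (product, client) `lag` weeks earlier."""
    past = (df.groupby(["Semana", "Producto_ID", "Cliente_ID"])
              ["Demanda_uni_equil"].mean().reset_index())
    past["Semana"] += lag  # align week w-lag onto week w
    past = past.rename(columns={"Demanda_uni_equil": f"Lag{lag}"})
    return df.merge(past, on=["Semana", "Producto_ID", "Cliente_ID"],
                    how="left")

train = add_lag(train, 1)
print(train)  # rows with no prior week get NaN for Lag1
```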
23. Feature Engineering - 3
Other features:
● Total $ amount per client / product name / product id
● Total units sold per client / product name / product id
● Price per unit per client / product name / product id
● Ratio of goods sold per client / product name / product id
More features:
● Clients per town
● Sum of returns per product
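Aggregates like these are plain groupby reductions; Venta_uni_hoy and Venta_hoy are the dataset's unit-sales and $-sales columns, while the rows below are toy values:

```python
import pandas as pd

# Toy sales rows; Venta_hoy is the $ amount, Venta_uni_hoy the units sold.
train = pd.DataFrame({
    "Cliente_ID":    [10, 10, 20],
    "Venta_uni_hoy": [4, 6, 2],
    "Venta_hoy":     [8.0, 12.0, 5.0],
})

agg = train.groupby("Cliente_ID").agg(
    total_amount=("Venta_hoy", "sum"),
    total_units=("Venta_uni_hoy", "sum"))
agg["price_per_unit"] = agg["total_amount"] / agg["total_units"]
print(agg)
```

The same pattern, keyed on product name or product id instead of client, yields the other aggregate features.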
24. Validation Technique
● We built 2 separate models, one for week 10 and one for week 11
● We left the “Lag1” variable out of the model that predicts week 11, since week 10 demand is itself unknown at prediction time
● We dropped the first 3 weeks after the feature engineering phase, as they lack the history the lag features need
25. XGBoost & Parameter Tuning
Why did we pick XGBoost?
● Boosting tree algorithm
● Both regression and classification
● Compiled C++ code
● Multi-threaded
Parameter tuning:
● Max depth
● Subsample
26. Technical Problems
● Storing Data
○ Picked HDF5 over pickle and csv
● Memory and CPU
○ Max 32-core CPU, 75 GB RAM
● Code Reuse and Automation
○ Object-oriented Python programming
○ Most of the work was automated, driven by configuration. For example:
parameterDict = { "ValidationStart": 8, "ValidationEnd": 9, "maxLag": 3, "trainHdfPath": '../../input/train_wz.h5',
"testHdfPath1": "../../input/test1_wz.h5".. }...
ConfigElements(1, [ ("SPClR0_mean", ["Producto_ID", "Cliente_ID", "Agencia_SAK"], ["mean"]),
("SPCl_mean", ["Producto_ID", "Cliente_ID"], ["mean"])...
27. Model Comparison

Model                          Validation 1  Validation 2  Public Score  Private Score
Naive                          0.736         0.734         0.754         -
NLTK                           0.764         0.775         0.781         -
XGBoost (default parameters)   0.476226      0.498475      0.46949       0.49596
XGBoost (parameter tuning)     0.469628      0.489799      0.46257       0.48666