Technical debt in machine learning - Data Natives Berlin 2018 - Jaroslaw Szymczak
A presentation from the Data Natives Berlin 2018 conference about technical debt in machine learning, with examples from the author's experience in productionizing machine learning models and maintaining their quality over time.
The document discusses automatic image moderation in classified ads. It outlines an approach using machine learning to classify images as appropriate or inappropriate. Key aspects include using convolutional neural networks to extract image features, combining image and listing metadata, dealing with class imbalance, developing batch processing pipelines, and monitoring a live classification system. The overall goal is to automatically moderate millions of images uploaded daily to classified ad platforms.
Presented at #H2OWorld 2017 in Mountain View, CA.
Enjoy the video: https://youtu.be/WKAuXlsq6xw.
Learn more about H2O.ai: https://www.h2o.ai/.
Follow @h2oai: https://twitter.com/h2oai.
- - -
Leaderboard shake-up and overfitting are well-known problems in Kaggle competitions. In his talk, Dmitry shares an approach to model performance validation that has proven useful in Kaggle competitions with noisy data.
Dmitry Larko's Bio:
Senior Data Scientist at H2O.ai, Dmitry is a Kaggle Grandmaster, formerly ranked #25, and loves to use his machine learning and data science skills in Kaggle competitions and in predictive analytics software development.
He has more than 15 years of experience in information technology. After earning his master's in computer information systems from Krasnoyarsk State Technical University (KSTU), he started his career in data warehousing and business intelligence and gradually moved into big data and data science.
He has extensive experience in predictive analytics across a wide array of domains and tasks. Prior to H2O.ai, Dmitry was an SAP BW Developer at Chevron, a Data Scientist at EPAM, and a Lead Software Engineer in the Russian Federation.
Reproducibility and experiments management in Machine Learning Mikhail Rozhkov
Machine learning is becoming common practice in many companies. ML teams are growing, and collaboration extends beyond the office and personal laptops. The complexity of ML projects leads teams to adopt distributed collaboration, cloud-based infrastructure and distributed machine learning, so a well-defined, manageable process for ML experiments becomes a central issue. Automated pipelines and versioning of models and datasets help establish a manageable process in a project and provide reproducible results.
This talk shows how to get started with model and dataset versioning using open-source tools: DVC, MLflow, Luigi, etc.
Gradient boosting in practice: a deep dive into xgboost - Jaroslaw Szymczak
The document discusses tuning parameters for the XGBoost gradient boosting algorithm. It explores different parameters like max_depth, learning_rate, and n_estimators using a news article classification dataset. Experiments are performed to evaluate the effect of these parameters on model accuracy and training time. The learning curves are also plotted to analyze model performance over iterations.
2017 holiday survey: An annual analysis of the peak shopping season - Deloitte United States
Holiday retail spending is bucking trends this season with only one-third of holiday budgets going toward gifts. Online spending is expected to exceed in-store for the first time. In addition to gifts for others this year, spending on experiences and self-gifting increased. Explore more consumer spending trends in our 32nd annual holiday survey. For more: http://deloi.tt/2yH1VAn.
The document discusses machine learning models used to moderate classified ads on OLX's platform. It covers the scale of OLX's business with over 60 million monthly listings, feature engineering to better represent the underlying moderation problem, building a model generation pipeline using tools like Scikit and XGBoost, measuring model performance, the system architecture, validating models on sample predictions, and managing models over time.
When setting up a new project we have some tips and tricks to help you do this in the best way possible, incl. infrastructure, database, standard attributes, logging, code alignment, and service center.
"Impact of front-end architecture on development cost", Viktor TurskyiFwdays
I have heard many times that architecture is not important for the front-end. I have also often seen developers implement front-end features just by following a framework's standard rules, assuming this is enough to launch the project successfully, and then the project fails. How can this be prevented, and which approach should you choose? I have launched dozens of complex projects, and in this talk we will analyze which approaches have worked for me and which have not.
Predictive Analytics Project in Automotive Industry - Matouš Havlena
Original article: http://www.havlena.net/en/business-analytics-intelligence/predictive-analytics-project-in-automotive-industry/
I had a chance to work on a predictive analytics project for a US car manufacturer. The goal of the project was to evaluate the feasibility of using Big Data analysis solutions in manufacturing to address different operational needs. The objective was to determine a business case and identify a technical solution (vendor). Our task was to analyze production history data and predict car inspection failures on the production line. We obtained historical data on car defects, on how each car moved along the assembly line, and car-specific information such as engine type, model, color and transmission type. The data covered a full year of manufacturing history. We used IBM BigInsights and SPSS Modeler to make the predictions.
MOPs & ML Pipelines on GCP - Session 6, RGDCgdgsurrey
MLOps Lifecycle
ML problem framing
ML solution architecture
Data preparation and processing
ML model development
ML pipeline automation and orchestration
ML solution monitoring, optimization, and maintenance
Live predictions with schemaless data at scale. MLMU Kosice, Exponea - Data Science Club
Imagine you have huge amounts of data about your customers. All this data is schemaless and represents everything a customer does in your e-shop, from page visits and banner impressions to purchases and registrations. Having all this data is a data scientist's dream and a nightmare at the same time: the data is schemaless, and every project you track can send different attributes and event types. Now comes the hard work: build a universal data preprocessing engine that can turn all of this data into something reasonable and useful for machine learning algorithms, for any project you have.
We will show you how this is done at Exponea, and much more: how to connect this data to the Spark ML library and then translate the model into a sequence of mathematical functions and aggregation methods for our in-memory database, so it can be evaluated on all customers in real time.
Ondrej Brichta is currently working at Exponea as an AI Engineer. He is studying logic and computability at Vienna University of Technology and is an alumnus of Nexteria Leadership Academy and of Matfyz in Bratislava.
This document discusses moving from traditional business intelligence (BI) tools to adopting machine learning (ML). It provides an overview of common BI workflows and limitations. It then introduces ML concepts like supervised, unsupervised, and reinforcement learning. The document outlines the typical ML pipeline including data wrangling, modeling, validation, and deployment. Finally, it discusses challenges of adopting ML and provides recommendations for getting started with ML using Python libraries and optimizing infrastructure costs.
The document compares the leading data visualization tools Tableau, Power BI, and Qlik. It reviews the strengths, weaknesses, and unique features of each tool. Tableau is seen as the gold standard but is very expensive. Power BI is easy to use and affordable but best for Excel users. Qlik Sense has powerful scripting but a confusing licensing model. The document recommends tools based on cost, capabilities, and intended users.
This document discusses moving from traditional business intelligence (BI) tools to adopting machine learning. It begins with an overview of common BI workflows and their limitations. It then provides introductions to machine learning, deep learning, and artificial intelligence. The machine learning pipeline is explained along with examples of adopting machine learning in products. Challenges of adopting machine learning are discussed as well as cost optimization strategies. Real world use cases are presented and open source options are mentioned.
Strangle The Monolith: A Data Driven Approach - VMware Tanzu
The document discusses the "data driven strangler" pattern for iteratively rewriting a large monolithic system. It involves:
1. Logging requests and responses between the monolith and downstream systems to understand the system's logic.
2. Building replacement services that mimic the monolith's responses while also logging any differences.
3. Analyzing the logs to identify areas of the monolith that can be "strangled" by redirecting traffic to the new services instead.
The approach allows decomposing the monolith incrementally with real data and minimal risk of breaking the system's behavior. It provides near real-time feedback to guide the rewrite process.
MLOps is the process of taking machine learning models into production and maintaining and monitoring them. It addresses issues like lack of reproducibility, inability to identify new trends, and lack of scalability that can occur without proper processes. The machine learning lifecycle includes scoping a project, collecting and preparing data, developing and evaluating models, deploying models into production, and ongoing monitoring. MLOps aims to operationalize this lifecycle to ensure models can be deployed and updated efficiently and reliably at scale.
As machine learning permeates more and more industries and businesses, the need for audit professionals to provide assurance over machine learning is growing. Andrew's presentation provides an audit-centric overview of machine learning and presents a framework for how to begin auditing machine learning in your organization.
Enterprise PHP Architecture through Design Patterns and Modularization (Midwe... - Aaron Saray
Object Oriented Programming in enterprise level PHP is incredibly important. In this presentation, concepts like MVC architecture, data mappers, services, and domain and data models will be discussed. Simple demonstrations will be used to show patterns and best practices. In addition, using tools like Doctrine or integration with Salesforce or the AS/400 will also be discussed. There will be an emphasis on the practical application of these techniques as well - this isn't just a theoretical talk! This presentation is great for those just beginning to create enterprise applications as well as those who have had years of experience.
Transforming B2B Sales with Spark-Powered Sales Intelligence with Songtao Guo... - Databricks
B2B sales intelligence has become an integral part of LinkedIn’s business to help companies optimize resource allocation and design effective sales and marketing strategies. This new trend of data-driven approaches has “sparked” a new wave of AI and ML needs in companies large and small. Given the tremendous complexity that arises from the multitude of business needs across different verticals and product lines, Apache Spark, with its rich machine learning libraries, scalable data processing engine and developer-friendly APIs, has been proven to be a great fit for delivering such intelligence at scale.
See how Linkedin is utilizing Spark for building sales intelligence products. This session will introduce a comprehensive B2B intelligence system built on top of various open source stacks. The system puts advanced data science to work in a dynamic and complex scenario, in an easily controllable and interpretable way. Balancing flexibility and complexity, the system can deal with various problems in a unified manner and yield actionable insights to empower successful business. You will also learn about some impactful Spark-ML powered applications such as prospect prediction and prioritization, churn prediction, model interpretation, as well as challenges and lessons learned at LinkedIn while building such platform.
Spark summit 2017 - Transforming B2B sales with Spark powered sales intelligence - Wei Di
B2B sales intelligence has become an integral part of LinkedIn’s business to help companies optimize resource allocation and design effective sales and marketing strategies. This new trend of data-driven approaches has “sparked” a new wave of AI and ML needs in companies large and small. Given the tremendous complexity that arises from the multitude of business needs across different verticals and product lines, Apache Spark, with its rich machine learning libraries, scalable data processing engine and developer-friendly APIs, has been proven to be a great fit for delivering such intelligence at scale.
See how Linkedin is utilizing Spark for building sales intelligence products. This session will introduce a comprehensive B2B intelligence system built on top of various open source stacks. The system puts advanced data science to work in a dynamic and complex scenario, in an easily controllable and interpretable way. Balancing flexibility and complexity, the system can deal with various problems in a unified manner and yield actionable insights to empower successful business. You will also learn about some impactful Spark-ML powered applications such as prospect prediction and prioritization, churn prediction, model interpretation, as well as challenges and lessons learned at LinkedIn while building such platform.
Jaroslaw Szymczak presented an approach for automatic image moderation in classified listings. The approach uses machine learning techniques including convolutional neural networks (CNNs) to extract image features and eXtreme Gradient Boosting (XGBoost) to combine image and listing features. To address class imbalance between acceptable and unacceptable images, the training data was undersampled from a 99:1 ratio to a 9:1 ratio. Key evaluation metrics for the imbalanced data include ROC AUC, PR AUC, and precision or recall at fixed thresholds of the other. The trained models are deployed into a live service using Flask, containerized with Docker, and monitored for performance using Grafana.
With so much noise and so many buzzwords floating around data analytics, it can be rather difficult to separate the signal (what is worthwhile) from what is only talk. Sometimes the rhetoric even starts within your organization, confounding the issue further. During Andrew's session, he will give attendees the knowledge they need to tune out the bogus information while gleaning valuable insights for developing and deploying their audit analytics program. The presentation will conclude with tangible examples of a successful Manufacturing Audit Analytics program and recommendations for how to get yours up and running. After attending, participants will be able to articulate the steps for setting up an analytics program within their departments, and will be armed with knowledge for educating senior leadership on the fundamental changes in technology that are occurring and on what is just marketing.
Transforming B2B Sales with Spark Powered Sales Intelligence - Songtao Guo
This is the presentation we delivered at Spark Summit 2017 in San Francisco.
Title: Transforming B2B Sales with Spark Powered Sales Intelligence
Presenters: Songtao Guo and Wei Di
It gives an overview of the Apache Spark-powered B2B intelligence engine we developed at LinkedIn and its use cases.
RPA is a technology that enables software programs called robots to mimic human actions like mouse clicks and keyboard inputs to automate repetitive tasks. Some common uses of RPA include processing invoices, data entry, and report generation. The document discusses the UiPath platform which is made up of Studio for designing workflows visually, Orchestrator for deploying and managing robots, and robots that can operate attended by a human or unattended. It provides examples of RPA jobs and discusses the growth of the RPA market and talent shortage. The document aims to introduce RPA and the UiPath platform.
The document discusses how AppDynamics helped a healthcare software company successfully integrate two different codebases and architectures during a major project. AppDynamics identified performance bottlenecks that were addressed, improving response times. It also increased trust between engineering, QA and operations by providing a shared view of metrics. The company plans to implement additional monitoring tools like AppDynamics EUM and Sumologic going forward.
Using the Business Process Technology Workflow Engine for Advanced Modeling - OutSystems
Some business processes follow very linear steps and are, therefore, fairly simple to build. Others can be extremely difficult to model and implement. In this session, we will look at using Workflows in OutSystems for simplified modeling and ease of process building, analysis, and editing. Not an “intro” session, rather, we will cover some of the more difficult and advanced use cases you’re likely to need for your own organization.
"Impact of front-end architecture on development cost", Viktor TurskyiFwdays
I have heard many times that architecture is not important for the front-end. Also, many times I have seen how developers implement features on the front-end just following the standard rules for a framework and think that this is enough to successfully launch the project, and then the project fails. How to prevent this and what approach to choose? I have launched dozens of complex projects and during the talk we will analyze which approaches have worked for me and which have not.
Predictive Analytics Project in Automotive IndustryMatouš Havlena
Original article: http://www.havlena.net/en/business-analytics-intelligence/predictive-analytics-project-in-automotive-industry/
I had a chance to work on a predictive analytics project for a US car manufacturer. The goal of the project was to evaluate the feasibility to use Big Data analysis solutions for manufacturing to solve different operational needs. The objective was to determine a business case and identify a technical solution (vendor). Our task was to analyze production history data and predict car inspection failures from the production line. We obtained historical data on defects on the car, how the car moved along the assembly line and car specific information like engine type, model, color, transmission type, and so on. The data covered the whole manufacturing history for one year. We used IBM BigInsights and SPSS Modeler to make the predictions.
MOPs & ML Pipelines on GCP - Session 6, RGDCgdgsurrey
MLOps Lifecycle
ML problem framing
ML solution architecture
Data preparation and processing
ML model development
ML pipeline automation and orchestration
ML solution monitoring, optimization, and maintenance
Live predictions with schemaless data at scale. MLMU Kosice, ExponeaData Science Club
Imagine you have huge amounts of data about your customers. All this data is schemaless and represents everything a customer is doing in your e-shop. From page visits and banner showings to purchases or registrations. Having all this data is a data scientists wet dream but also a nightmare at the same time. The data is schemaless and every project you track can send you different attributes and event types. Now, here comes the hard work. Create some universal data preprocessing engine which can turn all of this data into something that is reasonable and useful for machine learning algorithms for any project you have.
We will show you, how this is done at Exponea and much more. How to connect this data to Spark ML library and then translate the model into a sequence of mathematical functions and aggregation methods for our in memory database to evaluate it on all customers in real time.Ondrej Brichta – currently working at Exponea as AI Engineer. Studying Logic and computability at Vienna University of Technology, alumni of Nexteria Leadership Academy and Matfyz in Bratislava
This document discusses moving from traditional business intelligence (BI) tools to adopting machine learning (ML). It provides an overview of common BI workflows and limitations. It then introduces ML concepts like supervised, unsupervised, and reinforcement learning. The document outlines the typical ML pipeline including data wrangling, modeling, validation, and deployment. Finally, it discusses challenges of adopting ML and provides recommendations for getting started with ML using Python libraries and optimizing infrastructure costs.
The document compares the leading data visualization tools Tableau, Power BI, and Qlik. It reviews the strengths, weaknesses, and unique features of each tool. Tableau is seen as the gold standard but is very expensive. Power BI is easy to use and affordable but best for Excel users. Qlik Sense has powerful scripting but a confusing licensing model. The document recommends tools based on cost, capabilities, and intended users.
This document discusses moving from traditional business intelligence (BI) tools to adopting machine learning. It begins with an overview of common BI workflows and their limitations. It then provides introductions to machine learning, deep learning, and artificial intelligence. The machine learning pipeline is explained along with examples of adopting machine learning in products. Challenges of adopting machine learning are discussed as well as cost optimization strategies. Real world use cases are presented and open source options are mentioned.
Strangle The Monolith: A Data Driven ApproachVMware Tanzu
The document discusses the "data driven strangler" pattern for iteratively rewriting a large monolithic system. It involves:
1. Logging requests and responses between the monolith and downstream systems to understand the system's logic.
2. Building replacement services that mimic the monolith's responses while also logging any differences.
3. Analyzing the logs to identify areas of the monolith that can be "strangled" by redirecting traffic to the new services instead.
The approach allows decomposing the monolith incrementally with real data and minimal risk of breaking the system's behavior. It provides near real-time feedback to guide the rewrite process.
MLOps is the process of taking machine learning models into production and maintaining and monitoring them. It addresses issues like lack of reproducibility, inability to identify new trends, and lack of scalability that can occur without proper processes. The machine learning lifecycle includes scoping a project, collecting and preparing data, developing and evaluating models, deploying models into production, and ongoing monitoring. MLOps aims to operationalize this lifecycle to ensure models can be deployed and updated efficiently and reliably at scale.
As machine learning has is permeating more and more industries and businesses, the need for audit professionals to provide assurance over machine learning is growing. Andrew's presentation will provide an audit-centric overview of machine learning and present a framework for how to begin auditing machine learning in your organization.
Enterprise PHP Architecture through Design Patterns and Modularization (Midwe...Aaron Saray
Object Oriented Programming in enterprise level PHP is incredibly important. In this presentation, concepts like MVC architecture, data mappers, services, and domain and data models will be discussed. Simple demonstrations will be used to show patterns and best practices. In addition, using tools like Doctrine or integration with Salesforce or the AS/400 will also be discussed. There will be an emphasis on the practical application of these techniques as well - this isn't just a theoretical talk! This presentation is great for those just beginning to create enterprise applications as well as those who have had years of experience.
Transforming B2B Sales with Spark-Powered Sales Intelligence with Songtao Guo...Databricks
B2B sales intelligence has become an integral part of LinkedIn’s business to help companies optimize resource allocation and design effective sales and marketing strategies. This new trend of data-driven approaches has “sparked” a new wave of AI and ML needs in companies large and small. Given the tremendous complexity that arises from the multitude of business needs across different verticals and product lines, Apache Spark, with its rich machine learning libraries, scalable data processing engine and developer-friendly APIs, has been proven to be a great fit for delivering such intelligence at scale.
See how Linkedin is utilizing Spark for building sales intelligence products. This session will introduce a comprehensive B2B intelligence system built on top of various open source stacks. The system puts advanced data science to work in a dynamic and complex scenario, in an easily controllable and interpretable way. Balancing flexibility and complexity, the system can deal with various problems in a unified manner and yield actionable insights to empower successful business. You will also learn about some impactful Spark-ML powered applications such as prospect prediction and prioritization, churn prediction, model interpretation, as well as challenges and lessons learned at LinkedIn while building such platform.
Spark summit 2017- Transforming B2B sales with Spark powered sales intelligenceWei Di
B2B sales intelligence has become an integral part of LinkedIn’s business to help companies optimize resource allocation and design effective sales and marketing strategies. This new trend of data-driven approaches has “sparked” a new wave of AI and ML needs in companies large and small. Given the tremendous complexity that arises from the multitude of business needs across different verticals and product lines, Apache Spark, with its rich machine learning libraries, scalable data processing engine and developer-friendly APIs, has been proven to be a great fit for delivering such intelligence at scale.
See how Linkedin is utilizing Spark for building sales intelligence products. This session will introduce a comprehensive B2B intelligence system built on top of various open source stacks. The system puts advanced data science to work in a dynamic and complex scenario, in an easily controllable and interpretable way. Balancing flexibility and complexity, the system can deal with various problems in a unified manner and yield actionable insights to empower successful business. You will also learn about some impactful Spark-ML powered applications such as prospect prediction and prioritization, churn prediction, model interpretation, as well as challenges and lessons learned at LinkedIn while building such platform.
Jaroslaw Szymczak presented an approach for automatic image moderation in classified listings. The approach uses machine learning techniques including convolutional neural networks (CNNs) to extract image features and eXtreme Gradient Boosting (XGBoost) to combine image and listing features. To address class imbalance between acceptable and unacceptable images, the training data was undersampled from a 99:1 ratio to a 9:1 ratio. Key evaluation metrics for the imbalanced data include ROC AUC, PR AUC, and precision or recall at fixed thresholds of the other. The trained models are deployed into a live service using Flask, containerized with Docker, and monitored for performance using Grafana.
With so much noise and buzzwords floating around regarding data analytics, it can be rather difficult to decipher between the signal (what is worthwhile) and what is only talk. Sometimes the rhetoric even starts within your organization, confounding the issue further. During Andrew’s session, he will provide attendees with the knowledge they need to tune out the bogus information while gleaning valuable insights for developing and deploying their audit analytics program. The presentation will conclude with tangible examples of a successful Manufacturing Audit Analytics program, and recommendations for how to get yours up and running. After attending, participants will be able to articulate how steps for setting up an analytics program within their departments, as well be armed with knowledge for educating senior leadership on the fundamental changes in technology that are occurring, and what is just marketing.
Transforming B2B Sales with Spark Powered Sales IntelligenceSongtao Guo
This is the presentation we delivered in Spark Summit 2017, San Francisco
Title: Transforming B2B Sales with Spark Powered Sales Intelligence
Presenters: Songtao Guo and Wei Di
It gives an overview of our Apache Spark powered B2B intelligence engine we developed at Linkedin and its use cases.
RPA is a technology that enables software programs called robots to mimic human actions like mouse clicks and keyboard inputs to automate repetitive tasks. Some common uses of RPA include processing invoices, data entry, and report generation. The document discusses the UiPath platform which is made up of Studio for designing workflows visually, Orchestrator for deploying and managing robots, and robots that can operate attended by a human or unattended. It provides examples of RPA jobs and discusses the growth of the RPA market and talent shortage. The document aims to introduce RPA and the UiPath platform.
The document discusses how AppDynamics helped a healthcare software company successfully integrate two different codebases and architectures during a major project. AppDynamics identified performance bottlenecks that were addressed, improving response times. It also increased trust between engineering, QA and operations by providing a shared view of metrics. The company plans to implement additional monitoring tools like AppDynamics EUM and Sumologic going forward.
Using the Business Process Technology Workflow Engine for Advanced ModelingOutSystems
Some business processes follow very linear steps and are, therefore, fairly simple to build. Others can be extremely difficult to model and implement. In this session, we will look at using Workflows in OutSystems for simplified modeling and ease of process building, analysis, and editing. Not an “intro” session, rather, we will cover some of the more difficult and advanced use cases you’re likely to need for your own organization.
Similar to Machine Learning to moderate ads in real world classified's business
Unleashing the Power of Data: Choosing a Trusted Analytics Platform - Enterprise Wired
In this guide, we'll explore the key considerations and features to look for when choosing a trusted analytics platform that meets your organization's needs and delivers actionable intelligence you can trust.
Enhanced Enterprise Intelligence with your personal AI Data Copilot - GetInData
Recently we have observed the rise of open-source Large Language Models (LLMs) that are community-driven or developed by AI market leaders such as Meta (Llama3), Databricks (DBRX) and Snowflake (Arctic). On the other hand, there is growing interest in specialized, carefully fine-tuned yet relatively small models that can efficiently assist programmers in day-to-day tasks. Finally, Retrieval-Augmented Generation (RAG) architectures have gained a lot of traction as the preferred approach to LLM context and prompt augmentation for building conversational SQL data copilots, code copilots and chatbots.
In this presentation, we will show how we built a robust Data Copilot on these three concepts, one that can help democratize access to company data assets and boost the performance of everyone working with data platforms.
Why do we need yet another (open-source) Copilot?
How can we build one?
Architecture and evaluation
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W... - Social Samosa
The Modern Marketing Reckoner (MMR) is a comprehensive resource packed with POVs from 60+ industry leaders on how AI is transforming the 4 key pillars of marketing – product, place, price and promotions.
Learn SQL from basic queries to advanced queries - manishkhaire30
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
Machine Learning to moderate ads in real world classified's business
1. Machine Learning to moderate ads in real world classified's business
by Vaibhav Singh & Jaroslaw Szymczak
2. Agenda
● Moderation problem
● Offline model creation
○ feature generation
○ feature selection
○ data leakage
○ the algorithm
● Model evaluation
● Going live with the product
○ is your data really big?
○ automatic model creation pipeline
○ consistent development and production environments
○ platform architecture
○ performance monitoring
4. What do moderators look for?
Avoidance of payment:
● Sell another item in a paid listing by changing its content
● Flood the site with duplicate posts to increase visibility
● Create multiple accounts to bypass the free-ads-per-user limit
Violation of ToS:
● Add phone numbers or company information on the image rather than in the description or dedicated fields
● Try to sell forbidden items, very often with a title and description that try to evade keyword filters
Miscategorized listings:
● Item is placed in the wrong category
● Item comes from a legitimate business but is marked as coming from an individual
● The 'Seek' problem in job offers
9. Feature hashing
➔ Good when dealing with high-dimensional, sparse features: acts as dimensionality reduction
➔ Memory efficient
➔ Cons: getting back to feature names is difficult
➔ Cons: hash collisions can have negative effects
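A minimal sketch of feature hashing with scikit-learn's FeatureHasher (the listing fields and values below are hypothetical): each ad is reduced to "field=value" tokens and hashed into a fixed-width sparse matrix, so no vocabulary has to be kept in memory.

from sklearn.feature_extraction import FeatureHasher

# Fixed output width; collisions become rarer as n_features grows.
hasher = FeatureHasher(n_features=2**20, input_type="string")

# Hypothetical ads, each reduced to a list of "field=value" tokens.
ads = [
    ["category=electronics", "city=berlin", "word=iphone"],
    ["category=jobs", "city=hamburg", "word=driver"],
]
X = hasher.transform(ads)  # scipy.sparse matrix of shape (2, 2**20)
print(X.shape, X.nnz)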
10. Data Leakage
➔ Remove obvious fields, e.g. id, account numbers
➔ Check the importance of the features for any unusual observations
➔ Keep a hold-out set that you do not process with respect to the target variable
➔ Closely monitor live performance
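To illustrate the hold-out advice, here is a minimal sketch (with hypothetical column names): a target-dependent statistic, here a per-category rejection rate, is fitted on the training split only and merely applied to the hold-out set, never refitted on it.

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical moderation data: listing category and moderator decision.
df = pd.DataFrame({
    "category": ["a", "a", "b", "b", "b", "c"],
    "is_rejected": [1, 0, 0, 1, 1, 0],
})
train, holdout = train_test_split(df, test_size=0.33, random_state=42)

# Target-dependent encoding is learned on the training split only...
rates = train.groupby("category")["is_rejected"].mean()
# ...and only applied to the hold-out set.
holdout_encoded = holdout["category"].map(rates).fillna(rates.mean())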
15. Beyond accuracy
● ROC AUC (area under the Receiver Operating Characteristic curve):
○ can be interpreted as a concordance probability (i.e. with probability equal to the AUC, a random positive example scores higher than a random negative one)
○ is too abstract to use as a standalone quality metric
○ does not depend on the class ratio
● PR AUC (area under the Precision-Recall curve):
○ depends on the data balance
○ is not intuitively interpretable
● Precision @ fixed Recall, Recall @ fixed Precision:
○ can be found by thresholding
○ heavily depend on the data balance
○ are the best at reflecting the business requirements and at taking processing capabilities into account (then Precision @ k is actually more accurate)
● Choose one, and only one, as your KPI, and treat the others as constraints
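A minimal sketch of these metrics with scikit-learn on toy labels and scores; precision at a fixed recall is read off the precision-recall curve by taking the best precision among operating points that still meet the recall requirement.

import numpy as np
from sklearn.metrics import (average_precision_score,
                             precision_recall_curve, roc_auc_score)

y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.2, 0.15, 0.3, 0.8, 0.4, 0.7, 0.9, 0.35, 0.6])

print("ROC AUC:", roc_auc_score(y_true, y_score))
print("PR AUC :", average_precision_score(y_true, y_score))  # PR-curve summary

# Precision @ recall >= 0.75: best precision among points meeting the recall bar.
precision, recall, thresholds = precision_recall_curve(y_true, y_score)
print("P@R>=0.75:", precision[recall >= 0.75].max())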
22. SVM Light Data Format
➔ Memory efficient: features can be created on one machine and do not require huge clusters
➔ Cons: the number of features is not stored in the file, so store it separately
Sample rows (label, then sparse index:value pairs):
1 191:-0.44 87214:-0.44 200004:0.20 200012:1 206976:1 206983:-1 207015:1 207017:1 226201:1
1 1738:0.57 130440:-0.57 206999:0.32 207000:28 207001:6 207013:1 207015:1 207017:1 226300:1
0 2812:-0.63 34755:-0.31 206995:2.28 206997:1 206998:2 206999:0.00 207000:1 207001:28 226192:1
1 4019:0.35 206999:0.43 207000:40 207001:18 207013:1 207014:1 207016:1 226261:1
0 8903:0.37 207000:4 207001:14 207013:1 207014:1 207016:1 226262:1
1 5878:-0.27 206995:2.28 206998:1 206999:5.80 207000:1 207001:24 226187:1
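A minimal sketch of round-tripping this format with scikit-learn; note the n_features argument on load, which is exactly the number the slide says you must store separately because the file itself does not carry it.

from scipy.sparse import csr_matrix
from sklearn.datasets import dump_svmlight_file, load_svmlight_file

X = csr_matrix([[0.0, -0.44, 0.2], [0.57, 0.0, 0.0]])  # toy sparse features
y = [1, 0]
dump_svmlight_file(X, y, "train.svmlight")

# Without n_features, trailing all-zero columns would be silently dropped.
X2, y2 = load_svmlight_file("train.svmlight", n_features=3)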
23. Lessons Learnt
➔ Do not go for distributed learning if you don't need to
➔ Choose your tech depending on data size; do not go for hype-driven development
➔ Your machine is not the limit: there's the cloud
➔ Ask yourself: what's the most difficult problem to scale? → People
28. Lessons Learnt
➔ When you use the output path on your own, create your output at the very end of the task
➔ You can dynamically create dependencies by yielding the task
➔ Adding the workers parameter to your command parallelizes tasks that are ready to be run (e.g. python run.py Task ... --workers 15)
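These tips concern Luigi, which the deck names in its learning module. A minimal sketch with hypothetical task names: output is written through the target's own open(), so the file only appears once the task finishes, and a dynamic dependency is declared by yielding tasks from run().

import luigi

class ExtractFeatures(luigi.Task):
    country = luigi.Parameter()

    def output(self):
        return luigi.LocalTarget(f"features_{self.country}.csv")

    def run(self):
        # Writing via the target is atomic: the file appears only on success.
        with self.output().open("w") as f:
            f.write("feature_1,feature_2\n")

class TrainModel(luigi.Task):
    def output(self):
        return luigi.LocalTarget("model.txt")

    def run(self):
        # Dynamic dependencies: decided at run time by yielding tasks.
        targets = yield [ExtractFeatures(country=c) for c in ("pl", "de")]
        with self.output().open("w") as f:
            f.write(f"trained on {len(targets)} feature files\n")

# Independent ready tasks run in parallel with e.g.:
#   python run.py TrainModel --workers 15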
30. Model Serving Architecture
[Diagram] Components: Flask API; queue; prediction module; Mongo; monitoring & stats (Graphite, Grafana); learning module (Scikit, XGBoost, Luigi); flows labelled "ask prediction", "return prediction" and "learning ads".
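As a hedged sketch of the diagram's Flask front end (the endpoint name, model file name and payload schema are assumptions, and the pickled pipeline is assumed to vectorize raw feature dicts itself, e.g. via a DictVectorizer first step):

import pickle
from flask import Flask, jsonify, request

app = Flask(__name__)
with open("model.pkl", "rb") as f:  # hypothetical pickled scikit-learn pipeline
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    ads = request.get_json()  # assumed: a JSON list of feature dicts
    scores = model.predict_proba(ads)[:, 1]
    return jsonify(scores.tolist())

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)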
31. Image Model Serving Architecture
[Diagram] Components: AWS Kinesis stream of incoming pictures; hash generation; country-specific image moderation; general moderation (NSFW); tag and category prediction; Mongo; OLX site; models on S3; GPU clusters; learning cluster (TF, Keras, MxNet).
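The summary earlier notes that CNNs extract image features which XGBoost then combines with listing features. A minimal sketch of such a feature extractor with Keras (the choice of ResNet50 and the file name are assumptions):

import numpy as np
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input
from tensorflow.keras.preprocessing import image

# Pretrained network with the classification head removed; global average
# pooling turns each picture into a fixed-length 2048-dim vector.
extractor = ResNet50(weights="imagenet", include_top=False, pooling="avg")

img = image.load_img("ad_picture.jpg", target_size=(224, 224))  # hypothetical file
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
features = extractor.predict(x)  # shape (1, 2048), fed to e.g. XGBoost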
34. Lessons Learnt
➔ Always batch: batching will reduce CPU utilization, and the same machines will be able to handle many more requests (a minimal sketch follows this list)
➔ Modularize, Dockerize and orchestrate: containerize your code so that it is transparent to machine configurations
➔ Monitoring: use a monitoring service
➔ Choose simple and easy tech
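A minimal sketch of the batching idea under stated assumptions (the queue carries already-vectorized feature rows, model is a fitted classifier, and all names are hypothetical): drain whatever has accumulated, up to a cap, and score it in one vectorized call instead of one predict per request.

import queue

def serve_batches(requests: queue.Queue, model, max_batch=256, wait_s=0.05):
    while True:
        batch = [requests.get()]  # block until at least one request arrives
        while len(batch) < max_batch:
            try:
                batch.append(requests.get(timeout=wait_s))
            except queue.Empty:
                break  # nothing more waiting: score what we have
        # One vectorized call amortizes model overhead across the batch.
        scores = model.predict_proba(batch)[:, 1]
        # ... push scores back to the response channel (not shown)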
35. Acknowledgements
● Andrzej Prałat
● Wojciech Rybicki
Vaibhav Singh
vaibhav.singh@olx.com
Jaroslaw Szymczak
jaroslaw.szymczak@olx.com
PYDATA BERLIN 2017
July 2nd, 2017