Winning Kaggle competitions involves getting a good score as fast as possible using versatile machine learning libraries and models like Scikit-learn, XGBoost, and Keras. It also involves model ensembling techniques like voting, averaging, bagging and boosting to improve scores. The document provides tips for approaches like feature engineering, algorithm selection, and stacked generalization/stacking to develop strong ensemble models for competitions.
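As a concrete illustration of the averaging technique mentioned above, here is a minimal sketch; the three "model" outputs are invented probability predictions, not real trained models:

```python
# Minimal sketch of prediction averaging, one of the ensembling
# techniques mentioned above. The three "model" outputs below are
# hypothetical predicted probabilities, not real trained models.

def average_predictions(*prediction_lists):
    """Element-wise mean of several models' predicted probabilities."""
    n_models = len(prediction_lists)
    return [sum(preds) / n_models for preds in zip(*prediction_lists)]

model_a = [0.9, 0.2, 0.6]
model_b = [0.7, 0.4, 0.8]
model_c = [0.8, 0.3, 0.7]

ensemble = average_predictions(model_a, model_b, model_c)
print(ensemble)  # approximately [0.8, 0.3, 0.7]
```

Averaging tends to help most when the base models make uncorrelated errors, which is why competition ensembles favor diverse model families.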
Winning data science competitions, presented by Owen Zhang (Vivian S. Zhang)
Meetup event hosted by NYC Open Data Meetup, NYC Data Science Academy. Speaker: Owen Zhang. Event info: http://www.meetup.com/NYC-Open-Data/events/219370251/
Winning Kaggle 101: Introduction to Stacking (Ted Xiao)
This document provides an introduction to stacking, an ensemble machine learning method. Stacking involves training a "metalearner" to optimally combine the predictions from multiple "base learners". The stacking algorithm was developed in the 1990s and improved upon with techniques like cross-validation and the "Super Learner" which combines models in a way that is provably asymptotically optimal. H2O implements an efficient stacking method called H2O Ensemble which allows for easily finding the best combination of algorithms like GBM, DNNs, and more to improve predictions.
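The metalearner-over-base-learners idea can be sketched in a few lines of NumPy. This toy version (deliberately weak least-squares base learners on invented data, not H2O Ensemble's implementation) shows the key step: the metalearner is fit on out-of-fold predictions produced by cross-validation:

```python
import numpy as np

# Toy sketch of stacking: generate out-of-fold predictions from two
# simple base learners, then fit a linear "metalearner" on them.
# The data and base learners are deliberately trivial stand-ins.

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

def base_learner_fit_predict(X_tr, y_tr, X_te, cols):
    """Least-squares fit using only a subset of columns (a weak learner)."""
    w, *_ = np.linalg.lstsq(X_tr[:, cols], y_tr, rcond=None)
    return X_te[:, cols] @ w

# Build out-of-fold (OOF) predictions with 5-fold cross-validation,
# so the metalearner never sees predictions made on training folds.
folds = np.array_split(np.arange(100), 5)
oof = np.zeros((100, 2))
for te in folds:
    tr = np.setdiff1d(np.arange(100), te)
    oof[te, 0] = base_learner_fit_predict(X[tr], y[tr], X[te], [0, 1])
    oof[te, 1] = base_learner_fit_predict(X[tr], y[tr], X[te], [1, 2])

# Metalearner: a linear combination of the base-learner predictions.
meta_w, *_ = np.linalg.lstsq(oof, y, rcond=None)
stacked = oof @ meta_w
print("stacked MSE:", np.mean((stacked - y) ** 2))
```

Because the metalearner minimizes error over all linear combinations of the base predictions, the stacked MSE can never exceed that of either base learner alone on the same OOF matrix.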
Jeong-Yoon Lee has extensive experience winning data science competitions, taking first place in KDD Cup 2012 and 2015 and placing in the top 10 in several others. He competes for fun, experience, learning, and networking. Some best practices for competitions include thorough feature engineering, using diverse machine learning algorithms, cross-validation, ensemble methods, and collaboration. While competitions may seem limited, they provide valuable experience in data wrangling, exploration, and pipeline development applicable to real-world work.
Feature Engineering - Getting most out of data for predictive models - TDC 2017 (Gabriel Moreira)
How should data be preprocessed for use in machine learning algorithms? How do you identify the most predictive attributes of a dataset? What features can be generated to improve the accuracy of a model?
Feature Engineering is the process of extracting and selecting, from raw data, features that can be used effectively in predictive models. As the quality of the features greatly influences the quality of the results, knowing the main techniques and pitfalls will help you to succeed in the use of machine learning in your projects.
In this talk, we will present methods and techniques that allow us to extract the maximum potential from the features of a dataset, increasing the flexibility, simplicity, and accuracy of models. Topics include the analysis of feature distributions and their correlations, and the transformation of numeric attributes (scaling, normalization, log-based transformation, binning), categorical attributes (one-hot encoding, feature hashing), temporal attributes (date/time), and free-text attributes (text vectorization, topic modeling).
Examples in Python, Scikit-learn, and Spark SQL will be presented, along with how to use domain knowledge and intuition to select and generate features relevant to predictive models.
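To make a few of these transformations concrete, here is a hand-rolled sketch on invented data; in practice Scikit-learn's preprocessing module provides robust equivalents:

```python
import math

# Hand-rolled versions of several transformations mentioned in the
# talk. The data below is made up purely for illustration.

prices = [10.0, 100.0, 1000.0]
cities = ["nyc", "sp", "nyc"]

# Scaling: zero mean, unit variance.
mean = sum(prices) / len(prices)
std = (sum((p - mean) ** 2 for p in prices) / len(prices)) ** 0.5
scaled = [(p - mean) / std for p in prices]

# Log-based transformation: compresses a skewed numeric range.
logged = [math.log1p(p) for p in prices]

# Binning: map a continuous value to a discrete bucket index.
bins = [0, 50, 500]  # bucket edges
binned = [sum(p >= edge for edge in bins) - 1 for p in prices]

# One-hot encoding for a categorical attribute.
vocab = sorted(set(cities))
one_hot = [[int(c == v) for v in vocab] for c in cities]

print(scaled, logged, binned, one_hot)
```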
The document discusses interactive machine learning (IML), which aims to make machine learning more accessible to non-experts by allowing iterative human feedback. IML is defined as an iterative process where users can provide feedback to control model behavior, unlike classical machine learning which involves a single pass with no user input. The document outlines categories of IML including interaction perspectives for supplying training data, choosing algorithms, and evaluating models. Examples of IML systems are provided for each category.
Netflix talk at ML Platform meetup Sep 2019 (Faisal Siddiqi)
In this talk at the Netflix Machine Learning Platform Meetup on 12 Sep 2019, Fernando Amat and Elliot Chow from Netflix talk about the Bandit infrastructure for Personalized Recommendations
Data pre-processing involves cleaning raw data by filling in missing values, removing noise, and resolving inconsistencies. It also includes integrating, transforming, and reducing data through techniques like normalization, aggregation, dimensionality reduction, and discretization. The goal of data pre-processing is to convert raw data into a clean, organized format suitable for modeling and analysis tasks like data mining and machine learning.
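Two of the cleaning steps described above, filling in missing values and normalization, can be sketched as follows (the records are invented for illustration):

```python
# Small sketch of two pre-processing steps: filling missing values
# with a column mean, then min-max normalizing. Data is invented.

raw = [3.0, None, 7.0, None, 5.0]

# Impute: replace missing entries with the mean of the observed values.
observed = [v for v in raw if v is not None]
mean = sum(observed) / len(observed)
imputed = [mean if v is None else v for v in raw]

# Normalize to [0, 1] (min-max scaling).
lo, hi = min(imputed), max(imputed)
normalized = [(v - lo) / (hi - lo) for v in imputed]

print(imputed)     # [3.0, 5.0, 7.0, 5.0, 5.0]
print(normalized)  # [0.0, 0.5, 1.0, 0.5, 0.5]
```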
In this talk, Dmitry shares his approach to feature engineering which he used successfully in various Kaggle competitions. He covers common techniques used to convert your features into numeric representation used by ML algorithms.
One of the most important, yet often overlooked, aspects of predictive modeling is the transformation of data to create model inputs, better known as feature engineering (FE). This talk will go into the theoretical background behind FE, showing how it leverages existing data to produce better modeling results. It will then detail some important FE techniques that should be in every data scientist’s tool kit.
Property graph vs. RDF Triplestore comparison in 2020 (Ontotext)
This presentation goes all the way from an introduction ("what graph databases are") to a table comparing RDF vs. PG, plus two different diagrams presenting the market circa 2020.
Feature Engineering for ML - Dmitry Larko, H2O.ai (Sri Ambati)
This talk was given at H2O World 2018 NYC and can be viewed here: https://youtu.be/wcFdmQSX6hM
Description:
In this talk, Dmitry shares his approach to feature engineering which he used successfully in various Kaggle competitions. He covers common techniques used to convert your features into numeric representation used by ML algorithms.
Speaker's Bio:
Dmitry has more than 10 years of experience in IT. Starting with data warehousing and BI, now in big data and data science. He has a lot of experience in predictive analytics software development for different domains and tasks. He is also a Kaggle Grandmaster who loves to use his machine learning and data science skills on Kaggle competitions.
Spark 2019: Equifax's SVP Data & Analytics, Peter Maynard, discusses the notion (and importance) of explainable AI in the financial services sector. He looks at the work Equifax have done to crack open the black box by creating patented AI technology that helps companies make smarter, explainable decisions using AI.
Recommender systems support the decision making processes of customers with personalized suggestions. These widely used systems influence the daily life of almost everyone across domains like ecommerce, social media, and entertainment. However, the efficient generation of relevant recommendations in large-scale systems is a very complex task. In order to provide personalization, engines and algorithms need to capture users’ varying tastes and find mostly nonlinear dependencies between them and a multitude of items. Enormous data sparsity and ambitious real-time requirements further complicate this challenge. At the same time, deep learning has been proven to solve complex tasks like object or speech recognition where traditional machine learning failed or showed mediocre performance.
Join Marcel Kurovski to explore a use case for vehicle recommendations at mobile.de, Germany’s biggest online vehicle market. Marcel shares a novel regularization technique for the optimization criterion and evaluates it against various baselines. To achieve high scalability, he combines this method with strategies for efficient candidate generation based on user and item embeddings—providing a holistic solution for candidate generation and ranking.
The proposed approach outperforms collaborative filtering and hybrid collaborative-content-based filtering by 73% and 143% for MAP@5. It also scales well to millions of items and users, returning recommendations in tens of milliseconds.
Event: O'Reilly Artificial Intelligence Conference, New York, 18.04.2019
Speaker: Marcel Kurovski, inovex GmbH
More tech talks: inovex.de/vortraege
More tech articles: inovex.de/blog
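For reference, MAP@5, the metric cited in the results above, can be computed as follows. This is one common definition; details such as the normalizing denominator vary between implementations:

```python
def average_precision_at_k(recommended, relevant, k=5):
    """AP@k for one user: mean precision at each hit within the top-k."""
    hits, precision_sum = 0, 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / min(len(relevant), k) if relevant else 0.0

def map_at_k(all_recommended, all_relevant, k=5):
    """MAP@k: average of per-user AP@k scores."""
    scores = [average_precision_at_k(rec, rel, k)
              for rec, rel in zip(all_recommended, all_relevant)]
    return sum(scores) / len(scores)

# Toy check: a fully relevant top list scores 1.0.
print(map_at_k([["a", "b"]], [{"a", "b"}]))  # 1.0
```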
Word embeddings are common for NLP tasks, but embeddings can also be used to learn relations among categorical data. Deep learning can be useful also for structured data, and entity embeddings is one reason why it makes sense. These are slides from a seminar held in Sbanken.
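A minimal sketch of the entity-embedding idea, with no deep learning framework: each category gets a learned dense vector, trained here by plain gradient descent on an invented regression target (the output weights are kept fixed for simplicity, so the embedding table is the only learned parameter):

```python
import numpy as np

# Entity embeddings in miniature: a lookup table of dense vectors,
# one per category, updated by gradient descent. Target values and
# the fixed output weights are invented for illustration.

rng = np.random.default_rng(0)
n_categories, dim = 4, 2
emb = rng.normal(scale=0.1, size=(n_categories, dim))  # embedding table
w = np.array([1.0, 0.5])  # output weights, kept fixed for simplicity

cats = np.array([0, 1, 2, 3, 0, 1, 2, 3])
target = np.array([1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0])

lr = 0.1
for _ in range(300):
    for c, t in zip(cats, target):
        err = emb[c] @ w - t
        emb[c] -= lr * err * w  # gradient step on this category's vector

preds = emb[cats] @ w
print(np.round(preds, 3))  # close to the targets after training
```

In a real network the output layer is learned jointly and the trained vectors can be reused as features, which is what makes embeddings attractive for structured data.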
Opinion Dynamics on Generalized Networks (Mason Porter)
This is a talk on opinion dynamics (especially bounded-confidence models) on generalized networks.
It is part of the MIX-NEXT III (Multiscale & Integrative compleX Networks: EXperiments & Theories) satellite at NetSci 2022.
(Thursday 14 July 2022)
Exploration and diversity in recommender systems (Jaya Kawale)
The document discusses exploration and diversity in recommender systems at Tubi. It provides an overview of Tubi as an AVOD platform and describes how machine learning is used for personalization, content, and ads. It then discusses the importance of exploration in recommender systems to address cold starts, changing user tastes, and item popularities. Various exploration techniques like epsilon-greedy, optimism in the face of uncertainty, Thompson sampling, and contextual bandits are covered. The document also discusses how diversity is important to maximize utility and utilization of recommendations and describes methods to increase diversity like determinantal point processes. It concludes that exploration can help achieve diversity and vice versa.
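Epsilon-greedy, the simplest of the exploration techniques listed above, can be sketched as a Bernoulli bandit simulation (the arm reward probabilities below are invented):

```python
import random

# Epsilon-greedy on a 3-armed Bernoulli bandit: with probability
# epsilon pick a random arm (explore), otherwise pick the arm with
# the best running mean reward (exploit). Reward rates are invented.

random.seed(0)
true_probs = [0.2, 0.5, 0.8]  # hidden reward rate per item/arm
counts = [0] * 3
values = [0.0] * 3            # running mean reward per arm
epsilon = 0.1

for _ in range(5000):
    if random.random() < epsilon:
        arm = random.randrange(3)                       # explore
    else:
        arm = max(range(3), key=lambda a: values[a])    # exploit
    reward = 1.0 if random.random() < true_probs[arm] else 0.0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean

print(counts)   # the best arm should dominate the pull counts
print([round(v, 2) for v in values])
```

Thompson sampling and contextual bandits refine the same loop by sampling from a posterior over reward rates and by conditioning on user context, respectively.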
Understanding the difference between Data, information and knowledge (Neeti Naag)
In the decision-making process it is very important to use past and present data. This presentation will help in understanding what data is, how it is converted into information, and how information becomes knowledge.
1st Place in EY Data Science Challenge (Hyunju Shim)
1st Place Presentation of EY Next Wave Data Science Challenge in 2019.
The following presentation contains background, methods, deep learning models, parameters, and other miscellaneous details about our model used for the data science challenge.
Model overview: LSTM variants including Multiplicative LSTM, LSTM-CNN, and simple LSTM were used to tackle a variable-sequence-length binary classification problem.
This describes the model that took first place in the EY NextWave data science competition for the Hong Kong region. Various LSTM models (Multiplicative LSTM, LSTM-CNN, Simple LSTM) were used to approach the variable-sequence-length binary classification problem.
Knowledge graphs for knowing more and knowing for sure (Steffen Staab)
Knowledge graphs have been conceived to collect heterogeneous data and knowledge about large domains, e.g. medical or engineering domains, and to allow versatile access to such collections by means of querying and logical reasoning. A surge of methods has responded to additional requirements in recent years. (i) Knowledge graph embeddings use similarity and analogy of structures to speculatively add to the collected data and knowledge. (ii) Queries with shapes and schema information can be typed to provide certainty about results. We survey both developments and find that the development of techniques happens in disjoint communities that mostly do not understand each other, thus limiting the proper and most versatile use of knowledge graphs.
The document describes Dropbox's machine learning infrastructure and platform. It discusses how the platform provides scalable access to Dropbox's large data sources for offline and online ML use cases. The platform aims to accelerate ML development at Dropbox by standardizing workflows, automating processes, and making ML deployment and experimentation easy. It utilizes various services like Antenna for activity data and dbxlearn for distributed training across Dropbox and AWS resources. The platform supports all stages of the ML lifecycle from data preparation to model deployment and monitoring.
Apache Calcite is a dynamic data management framework. Think of it as a toolkit for building databases: it has an industry-standard SQL parser, validator, highly customizable optimizer (with pluggable transformation rules and cost functions, relational algebra, and an extensive library of rules), but it has no preferred storage primitives. In this tutorial, the attendees will use Apache Calcite to build a fully fledged query processor from scratch with very few lines of code. This processor is a full implementation of SQL over an Apache Lucene storage engine. (Lucene does not support SQL queries and lacks a declarative language for performing complex operations such as joins or aggregations.) Attendees will also learn how to use Calcite as an effective tool for research.
Understanding how high-powered ML models arrive at their predictions is an important aspect of machine learning, and SHAP is a powerful tool that enables practitioners to understand how different features combine to help a model arrive at a prediction.
This slide deck is from a presentation given at PyData Global on the theoretical foundations of SHAP as well as how to use its library. The presentation can be found here: https://pydata.org/global2021/schedule/presentation/3/behind-the-black-box-how-to-understand-any-ml-model-using-shap/
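The additive attribution idea behind SHAP can be illustrated by computing exact Shapley values for a toy three-feature "model" by brute force over all feature coalitions; the shap library approximates this efficiently for real models, and the payoff function below is invented:

```python
from itertools import combinations
from math import factorial

# Exact Shapley values by brute force over all coalitions, for a toy
# 3-feature payoff function. Illustrates the attribution idea behind
# SHAP; the payoff function below is invented.

def model(present):
    """Toy 'value function': payoff of a coalition of features."""
    score = 0.0
    if "a" in present:
        score += 2.0
    if "b" in present:
        score += 1.0
    if "a" in present and "c" in present:
        score += 1.0  # interaction between a and c
    return score

features = ["a", "b", "c"]

def shapley(feature):
    """Weighted average of the feature's marginal contributions."""
    n = len(features)
    others = [f for f in features if f != feature]
    total = 0.0
    for r in range(len(others) + 1):
        for subset in combinations(others, r):
            s = len(subset)
            weight = factorial(s) * factorial(n - s - 1) / factorial(n)
            total += weight * (model(set(subset) | {feature})
                               - model(set(subset)))
    return total

values = {f: shapley(f) for f in features}
print(values)  # attributions sum to model(all) - model(none)
```

The efficiency property shown in the final comment is what makes SHAP attributions add up to the difference between a prediction and the baseline.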
The document provides an introduction to competitive data science. It outlines the data science process for competitions, which includes data cleaning, exploratory data analysis, feature engineering, modeling, and ensemble techniques. It explains that the goal of competitive data science is to improve performance on predefined metrics for a given task. Participants can enhance their skills, showcase their work, and learn from others in a challenging environment.
Workshop presented at Webdagene 2013 (http://webdagene.no/en/) September 9, 2013; UX Lisbon (http://www.ux-lx.com), May 12, 2011; UX Hong Kong (http://www.uxhongkong.com/), February 17, 2011.
This presentation was held at ISC 2014 on June 26, 2014 in Leipzig, Germany.
More information available at:
http://msrg.org/papers/ISC2014-Rabl
Abstract:
The Workshops for Big Data Benchmarking (http://clds.sdsc.edu/bdbc/workshops), which have been underway since May 2012, have identified a set of characteristics of big data applications that apply to industry as well as scientific application scenarios involving pipelines of processing with steps that include aggregation, cleaning, and annotation of large volumes of data; filtering, integration, fusion, subsetting, and compaction of data; and subsequent analysis, including visualization, data mining, predictive analytics and, eventually, decision making. One of the outcomes of the WBDB workshops has been the formation of a Transaction Processing Performance Council (TPC) subcommittee on Big Data, which is initially defining a Hadoop systems benchmark, TPCx-HS, based on TeraSort. TPCx-HS would be a simple, functional benchmark that would assist in determining basic resiliency and scalability features of large-scale systems. Other proposals are also actively under development, including BigBench, which extends the TPC-DS benchmark for big data scenarios; Big Decision Benchmark from HP; HiBench from Intel; and the Deep Analytics Pipeline (DAP), which defines a sequence of end-to-end processing steps consisting of some of the operations mentioned above. Pipeline benchmarks reveal the need for different processing modalities and system characteristics at different steps in the pipeline. For example, early processing steps may process very large volumes of data and may benefit from a Hadoop and MapReduce style of computing, while later steps may operate on more structured data and may require, say, SMP-style architectures or very large memory systems. This talk will provide an overview of these benchmark activities and discuss opportunities for collaboration and future work with industry partners.
Neo4j Theory and Practice - Tareq Abedrabbo @ GraphConnect London 2013 (Neo4j)
In this talk Tareq will discuss graph solutions based on his experiences building a varied mix of graph-based systems. He will be sharing techniques and approaches that he has learned and will focus on a number of concepts that may be applied to a wider context.
The document outlines five rules for transforming big data into decisions: 1) Start with the question, not the data, 2) Write down your fitness function, 3) Experiment by launching and learning, 4) Respect and empower your customers, and 5) Embrace transparency. It also suggests collaborating with people and machines as a bonus rule. The document proposes a thought experiment about what could be done with all of Google's data and concludes by emphasizing making the implicit explicit.
This document discusses large scale modeling and data analysis. It defines large scale modeling as building models that can process very large datasets that are difficult for traditional tools. It provides examples of large scale recommendation models at LinkedIn and discusses how more data allows for better accuracy, deeper insights through exploration, and more flexible feature engineering. Challenges include ensuring infrastructure can handle the data volume and complexities of online versus offline modeling.
Hawaii Machine Learning - Our Inaugural Meetup (Michael Motoki)
This is the presentation from our first meetup. We talked about what machine learning is, applications of machine learning, what to expect from this meetup, and ended by predicting members' ages using a picture of their face and their responses to speed-dating questions.
NYC Open Data Meetup -- ThoughtWorks chief data scientist talk (Vivian S. Zhang)
This document summarizes a presentation on data science consulting. It discusses:
1) The Agile Analytics group at ThoughtWorks which does data science consulting projects using probabilistic modeling, machine learning, and big data technologies.
2) Two case studies are described, including developing a machine learning model to improve matching of healthcare product data and using logistic regression for retail recommendation systems.
3) The origins and future of the field are discussed, noting that while not entirely new, data science has grown due to improvements in technology, programming languages, and libraries that have increased productivity and driven new career opportunities in the field.
Best Practices for Hyperparameter Tuning with MLflowDatabricks
Hyperparameter tuning and optimization is a powerful tool in the area of AutoML, for both traditional statistical learning models as well as for deep learning. There are many existing tools to help drive this process, including both blackbox and whitebox tuning. In this talk, we'll start with a brief survey of the most popular techniques for hyperparameter tuning (e.g., grid search, random search, Bayesian optimization, and parzen estimators) and then discuss the open source tools which implement each of these techniques. Finally, we will discuss how we can leverage MLflow with these tools and techniques to analyze how our search is performing and to productionize the best models.
Speaker: Joseph Bradley
Machine Learning From Raw Data To The PredictionsLuca Zavarella
The document discusses machine learning concepts and Azure Machine Learning. It begins with an introduction to Azure ML services and environment. It then covers important steps for data preparation, including checking for missing values and outliers, performing feature engineering and selection. Common machine learning algorithms are also outlined, such as regression, classification, clustering and anomaly detection. The presentation concludes with a demonstration of how to build a predictive model for price elasticity using Azure ML Studio.
The list of failed big data projects is long. They leave end-users, data analysts and data scientists frustrated with long lead times for changes. This case study will illustrate how to make changes to big data, models, and visualizations quickly, with high quality, using the tools teams love. We synthesize techniques from devOps, Demming, and direct experience.
Data Refinement: The missing link between data collection and decisionsVivastream
The document discusses the importance of data refinement between data collection and decision making. It emphasizes the need to transform raw data into useful insights through techniques like data summarization, categorization, and predictive modeling in order to provide accurate marketing answers and improve targeting, costs, and results. Specifically, it recommends structuring data into a model-ready environment, creating descriptive variables from transaction histories, matching data to the appropriate analytical goals and levels, and categorizing non-numeric attributes.
From Labelling Open data images to building a private recommender systemPierre Gutierrez
Recommender systems are paramount for e-business companies. There is an increasing need to take into account all the user information to tailor the best product proposition. One of them is the content that the user actually sees: the visual of the product.
When it comes to hostels, some people can be more attracted by pictures of the room, the building or even the nearby beach.
In this talk, we will describe how we improved an e-business vacation retailer recommender system using the content of images. We’ll explain how to leverage open dataset and pre-trained deep learning models to derive user taste information. This transfer learning approach enables companies to use state of the art machine learning methods without having deep learning expertise.
The document outlines training offerings from Hudson Data Corp in algorithms, big data, and machine learning. For big data training, it describes a 4-week evening class curriculum covering Hadoop, Hive, and MapReduce. For machine learning training, it lists a 6-week evening class curriculum teaching techniques like decision trees, random forests, PCA, clustering and recommendation engines. It provides details on topics, assignments, and projects for each week of both programs.
This document discusses search and data enhancement projects at a company. It outlines four phases of the project:
1) Data standardization by creating product description templates and establishing workflows for consistent data entry.
2) Cleaning up legacy data and creating processes for standardizing new item data.
3) Expanding keyword data by adding new keywords and scrubbing existing keywords.
4) Continuing enhancements by adding more metadata fields, building customer sentiment data, and ongoing improvements.
The document also describes how an external company, Stibo, can contribute to the project by providing additional product attributes, improving product images, scrubbing and adding metadata, and enhancing media on the website.
1. The document discusses future directions for software engineering research, including tools to support "citizen scientists" and proposed services for next-generation data repositories.
2. It suggests that data mining tools could provide more services beyond data repositories, such as supporting verification, compression, privacy, and streaming of data.
3. The talk outlines several topics, including software tools for citizen scientists, issues around decision software, and lessons learned regarding certification envelopes, goals, locality, and the need for repair and verification tools.
U Unit 6 [MT355] Page 1 of 3 Unit 6 Assignment.docxouldparis
U
Unit 6 [MT355]
Page 1 of 3
Unit 6 Assignment: Design Appropriate Data Collection
Methods
In this Assignment, you will be assessed based on the following outcome:
MT355-3: Design appropriate data collection methods.
Marketing researchers must become highly skilled at designing appropriate methods of data collection,
especially when it comes to designing a data collection methodology and developing survey questions
used in the construction of a data collection form.
In Part 1 of this Assignment, you will demonstrate your ability to design a data collection methodology
using the knowledge obtained from Chapters 6 and 7 in your textbook.
In Part 2 of this Assignment, you will demonstrate your ability to develop a viable research study
questionnaire using the knowledge you obtain from Chapter 8 in your textbook. Be sure to conduct
additional research to learn best practices used in marketing research questionnaire development.
Directions for completing this Assignment
While completing this Assignment, it is essential that you consider the ethics behind your data collection
methodology in Part 1, and in every question you develop in Part 2. Always consider the validity of your
questions before making them ready for data collection.
Part 1: Situation Analysis – Research Problem Defined
It is essential for marketing researchers to develop an ability to design appropriate data collection
methods. Follow the theoretical and conceptual methods you learned about in Chapters 6 and 7 in the
“Essentials of Marketing Research” textbook.
Using the Random Scenario Generator (RSG), select a scenario for your Assignment. (Reminder: The
RSG will prompt you to select 1 of 3 options for each of the variables. Once you have selected from each
variable category, the resulting scenario is to be the basis for your work on this Assignment. Each
student’s scenario will be documented.)
Part 1 will include:
A 1,000-words (4-5 pages), in addition to the title, reference, and appendix pages, informative
essay to define a research problem.
Separate title and reference pages, standard paragraph structure, double-spacing, 12-point Times
New Roman font, and it should follow all other APA 6th edition formatting and citation guidelines.
Secondary research to support the need for a study (minimum of three academic resources) and
how this research will be useful in solving the research problem.
Discussion of the sampling design chosen to conduct the research and explanation of how you will
choose your sample and collect your data.
Survey validation methods.
Scales used in data collection.
https://kapextmediassl-a.akamaihd.net/business/Media/MT355/MT355_1904C/RSG/story.html
U
Unit 6 [MT355]
Page 2 of 3
Part 2: Research Survey Design
Part 2 will include the following sections:
Include a brief description of the seven steps in the questionnaire design, and a 10- ...
Using SigOpt to Tune Deep Learning Models with Nervana CloudSigOpt
This document discusses using SigOpt to tune deep learning models. It notes that tuning deep learning systems is non-intuitive and expert-intensive using traditional random search or grid search methods. SigOpt provides a more efficient approach using Bayesian optimization to suggest optimal hyperparameters after each trial, reducing wasted expert time and computation. The document provides examples applying SigOpt to tune convolutional neural networks on CIFAR10, demonstrating a 1.6% reduction in error rate over expert tuning with no wasted trials.
Starting data science with kaggle.com
1. Starting Data Science with Kaggle.com
Nathaniel Shimoni
6/25/2017
2. Talk outline
• What is Kaggle?
• Why is Kaggle so great? The everyone-wins approach
• Kaggle tiers & top Kagglers
• Frequently used terms and the main rules
• The benefits of starting with Kaggle
• Common Kaggle data science process
3. What is Kaggle?
• An online platform that runs data science competitions
• Declares itself to be the home of data science
• Has over 1M registered users & over 60k active users
• One of the most vibrant communities for data scientists
• A great place to meet other “data people”
• A great place to learn and test your data & modeling skills
4. Why is Kaggle so great? (the everyone-wins approach)
• Competitors receive prizes, knowledge, exposure & a portfolio showcase
• Rapid development & adoption of highly performing platforms
• Kaggle receives money from competition sponsors, influence on the community, and knowledge of platform & algorithm trends
• Sponsors have data & a business task but no data scientists – they receive state-of-the-art models quickly and without hiring data scientists
6. Kaggle tiers
• Novice – a new Kaggle user
• Contributor – participated in one or more competitions, ran a kernel, and is active in the forums
• Expert – 2 top-25% finishes
• Master – 2 top-10% finishes & 1 top-10 (places) finish
• Grandmaster – 5 top-10 finishes & 1 solo top-10 finish
8. Frequently used terms
• Leaderboard (public & private)
The competition data, available once you have approved the rules, is split into training data and testing data; the test set is further split into a public leaderboard (LB) portion and a private LB portion:
• Public LB – used for ranking submissions throughout the competition; it can serve as an additional validation frame, but can also be a source of overfitting
• Private LB – used for final scoring (the only score that truly matters)
9. Frequently used terms
• Leakage – the introduction of information about the target that is not a legitimate predictor (usually by a mistake within the data preparation process)
• Team merger – 2 or more participants competing together
10. Frequently used terms
• LB shuffle – the re-ranking that occurs at the end of the competition (upon moving from the public to the private LB)
11. Main rules for Kaggle competitions
• One account per user
• No private sharing outside teams (public sharing is usually allowed and endorsed)
• Limited number of entries per day & per competition
• Winning solutions must be written in open-source code
• Winners must hand in well-documented source code in order to be eligible for the prize
• You usually select 2 solutions for final evaluation
12. Why start with Kaggle?
• Project-based learning – learn by doing
• Solve real-world challenges
• Great supporting community
• Benchmark solutions & shared code samples
• Clear business objective and modeling task
• Develop a work portfolio and rank yourself against other competitors (and get recognition)
• Compete against state-of-the-art solutions
• Learn (a lot!!!) when the competition ends
13. Why start with Kaggle?
• Ability to team up with others:
  learn from better Kagglers
  learn how to collaborate effectively
  merge different solutions to achieve a score boost
  meet new, exciting people
• Answer the questions of others – you only truly learn something when you teach it to someone else
• Ability to apply new ideas at work with little effort
• Varied areas of activity (verticals)
14. Why start with Kaggle?
• The ability to follow many experts, each specializing in a particular area (a sample from my list):
  Ensemble learning – Mathias Müller
  Feature extraction – Darius Barušauskas
  Validation – Gert Jacobusse
  Super-fast draft modeling – ZFTurbo (identity unknown)
  Inspiration (no minimal age for data science) – Mikel Bober-Irizar
15. Common Kaggle data science process
Set the correct validation method, then iterate through:
• Exploratory data analysis (EDA)
• Data cleaning & augmentation – adding external data (not always allowed, yet good practice to consider when possible)
• Feature engineering
• Diverse single models
• Ensemble learning
• Final prediction
Approximate share of total time spent in each activity: 20% EDA, 40% feature generation, 30% modeling, 10% ensemble learning.
16. Data cleaning
• Impute missing values (mean, median, most common value, or a separate prediction task)
• Remove zero-variance features
• Remove duplicated features
• Outlier removal – caution, this can be harmful; at the cleaning stage we only remove clearly irrelevant values (e.g. a negative price)
• NA encoding / imputing
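The imputation step above can be sketched with scikit-learn's SimpleImputer (a minimal illustration of my own, not from the slides; the toy data is made up):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrix with missing values (np.nan)
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# Median imputation is robust to outliers; "mean" and
# "most_frequent" are the other strategies listed on the slide.
imputer = SimpleImputer(strategy="median")
X_clean = imputer.fit_transform(X)  # no NaNs remain
```

For categorical columns, `strategy="most_frequent"` plays the role of the "most common value" option above.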
17. Data augmentation & external data
• External data sources:
  OpenStreetMap
  weather measurement data
  online calendars
• APIs
• Scraping (using Scrapy / Beautiful Soup)
18. Feature engineering
• Rescaling / standardization of existing features
• Performing data transformations: TF-IDF, log1p, min-max scaling, binning of numeric features
• Turning categorical features into numeric ones (label encoding / one-hot encoding)
• Creating count features
• Parsing textual features to get more generalizable features
• The hashing trick
• Extracting date/time features, e.g. dayOfWeek, month, year, dayOfMonth, isHoliday, etc.
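Two of the transformations above (one-hot encoding and date-part extraction) sketched with pandas; the column names and data are illustrative only:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["NYC", "TLV", "NYC"],
    "sale_date": pd.to_datetime(["2017-06-25", "2017-06-26", "2017-07-01"]),
})

# One-hot encode the categorical column (city -> city_NYC, city_TLV)
df = pd.get_dummies(df, columns=["city"])

# Extract date/time features, as on the slide (Monday = 0)
df["day_of_week"] = df["sale_date"].dt.dayofweek
df["month"] = df["sale_date"].dt.month
df["year"] = df["sale_date"].dt.year
```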
19. Feature selection
• Remove near-zero-variance features
• Use feature importances and eliminate the least important features
• Recursive feature elimination (RFE)
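Recursive feature elimination, as named above, is available in scikit-learn; a minimal sketch on synthetic data (the estimator and sizes are my choices, not the presenter's):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, only 3 of them informative
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)

# Recursively drop the weakest features until 3 remain
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
selector.fit(X, y)
X_selected = selector.transform(X)  # keeps only the surviving columns
```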
20. Hyperparameter optimization
• Grid search CV (exhaustive, rarely better than the alternatives)
• Random search CV
• Hyperopt
• Bayesian optimization
* Hyperparameter tuning will usually yield better results, but not as much as the other activities
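Random search CV, the second option above, sketched with scikit-learn (the model, parameter ranges, and budget are illustrative assumptions):

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=200, random_state=0)

# Sample 5 random configurations instead of an exhaustive grid
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": randint(10, 100),
                         "max_depth": randint(2, 8)},
    n_iter=5, cv=3, random_state=0)
search.fit(X, y)
best = search.best_params_
```

With a fixed trial budget, random search usually covers the space better than a grid of the same size, which is why the slide calls grid search "rarely better than alternatives".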
21. Validation
• Train/test split
• Shuffle split
• K-fold (the most commonly used)
• Time-based separation
• Group K-fold
• Leave-one-group-out
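A quick sketch of K-fold, the most common choice above: every sample lands in a validation fold exactly once (the data here is a placeholder):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 toy samples

kf = KFold(n_splits=5, shuffle=True, random_state=0)
val_counts = np.zeros(len(X), dtype=int)
for train_idx, val_idx in kf.split(X):
    # train on train_idx, evaluate on val_idx
    val_counts[val_idx] += 1
```

`GroupKFold` and `TimeSeriesSplit` from the same module cover the group-wise and time-based options.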
22. Ensemble learning
• Simple/weighted average of previous best models
• Bagging of the same type of model (i.e. different RNG seeds, different hyperparameters)
• Majority vote
• Using out-of-fold predictions as meta-features, a.k.a. stacking
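The first and third options above (weighted averaging and majority vote) are simple enough to show directly; the probabilities below stand in for three hypothetical base models:

```python
import numpy as np

# Predicted probabilities from three hypothetical base models
p1 = np.array([0.9, 0.2, 0.6])
p2 = np.array([0.8, 0.4, 0.7])
p3 = np.array([0.6, 0.1, 0.4])

# Weighted average, giving more weight to the strongest model
weights = np.array([0.5, 0.3, 0.2])
blended = weights[0] * p1 + weights[1] * p2 + weights[2] * p3

# Majority vote on the thresholded class predictions
votes = (np.stack([p1, p2, p3]) > 0.5).sum(axis=0)
majority = (votes >= 2).astype(int)
```

In practice the weights are usually chosen on a validation set or the public LB.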
23. Out-of-fold predictions – a.k.a. meta-features
Divide the training data into folds (fold1–fold4 in this example). For each fold, train on the other 3 folds, then predict both the held-out fold (giving the out-of-fold predictions oof1–oof4) and the testing data. Averaging the 4 per-fold test predictions gives the averaged test predictions.
25. Out-of-fold predictions – a.k.a. meta-features
After training several models using this method (3 different models in this example – e.g. kNN, a neural network, and a GBM), each contributes its out-of-fold predictions (oof1–oof4) and its averaged test predictions. We can now train a new model on these newly formed meta-features against the train labels.
* Note that we can either train our meta-model using only these new features, or use the new features along with our original training data.
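The out-of-fold scheme described here can be sketched with scikit-learn's cross_val_predict (a minimal illustration with two base models and a logistic-regression meta-model; all model choices are mine, not the presenter's):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, random_state=0)

# Out-of-fold predictions: each row is predicted by a model that
# never saw it during training (4 folds, as on the slide)
base_models = [KNeighborsClassifier(), RandomForestClassifier(random_state=0)]
oof = np.column_stack([
    cross_val_predict(m, X, y, cv=4, method="predict_proba")[:, 1]
    for m in base_models
])

# The meta-model ("stacker") is trained on the meta-features
meta_model = LogisticRegression().fit(oof, y)
```

To score the test set, each base model is refit on all the training data, its test predictions are stacked the same way, and the meta-model predicts from those columns.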
26. Disadvantages of Kaggle
• Large focus on modeling relative to the rest of the steps in the process
• Little weight given to runtime and scalability
• Little reasoning given for the selection of a specific evaluation metric
• Competing for the last few percentage points isn’t always valuable
• The “click and submit” phenomenon
27. Additional reading resources
• MOOCs:
  Machine Learning – Stanford, Coursera
  Data Science track – Johns Hopkins, Coursera
  Udacity deep learning course
• Documentation:
  scikit-learn documentation
  Keras documentation
  R caret package documentation
28. Links to sources
• This presentation draws heavily from the following sources:
  Mark Peng’s presentation “Tips for participating Kaggle challenges”
  Darius Barušauskas’s presentation “Tips and tricks to win Kaggle data science competitions”
  Kaggle discussion forums and blog