GRU4Rec v2 - Recurrent Neural Networks with Top-k Gains for Session-based Recommendations, by Balázs Hidasi
Slides of my presentation at CIKM2018 about version 2 of the GRU4Rec algorithm, a recurrent neural network based algorithm for the session-based recommendation task.
We discuss sampling strategies and introduce additional sampling to the algorithm. We also redesign the loss function to cope with the additional sampling. The resulting BPR-max loss function can efficiently handle many negative samples without running into the vanishing-gradient problem. We also introduce constrained embeddings, which speed up the convergence of item representations and reduce memory usage by a factor of 4. These improvements increase offline measures by up to 52%.
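As a rough illustration, here is a minimal PyTorch sketch of a BPR-max-style loss as described above (the tensor shapes, the `reg_lambda` name, and the numerical-stability epsilon are my assumptions, not code from the talk):

```python
import torch

def bpr_max_loss(r_pos, r_neg, reg_lambda=1.0):
    """BPR-max sketch: softmax-weighted BPR over the negative samples,
    plus a softmax-weighted regularization of the negative scores.
    r_pos: (batch,) target-item scores; r_neg: (batch, n_neg) negative scores."""
    # softmax weights over the negative-sample scores
    w = torch.softmax(r_neg, dim=-1)
    # weighted average of the pairwise sigmoid(r_pos - r_neg) terms
    pairwise = torch.sigmoid(r_pos.unsqueeze(-1) - r_neg)
    loss = -torch.log(torch.sum(w * pairwise, dim=-1) + 1e-24)
    # regularization on the negative scores, also softmax-weighted
    reg = reg_lambda * torch.sum(w * r_neg ** 2, dim=-1)
    return (loss + reg).mean()
```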
In the talk we also discuss online A/B tests and the implications of long-term observations. Most of these observations are exclusive to this talk and do not appear in the paper.
You can access the preprint version of the paper on arXiv: https://arxiv.org/abs/1706.03847
The code is available on GitHub: https://github.com/hidasib/GRU4Rec
In decision tree learning, ID3 (Iterative Dichotomiser 3) is an algorithm invented by Ross Quinlan used to generate a decision tree from a dataset. ID3 is the precursor to the C4.5 algorithm, and is typically used in the machine learning and natural language processing domains.
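To make the split criterion concrete, here is a small Python sketch of the entropy and information-gain computation ID3 uses to choose attributes (the function names and the row-dictionary format are illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Gain from splitting (rows, labels) on attribute `attr`;
    ID3 greedily picks the attribute with the highest gain."""
    by_value = {}
    for row, y in zip(rows, labels):
        by_value.setdefault(row[attr], []).append(y)
    remainder = sum(len(ys) / len(labels) * entropy(ys)
                    for ys in by_value.values())
    return entropy(labels) - remainder
```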
A tutorial on LDA that first builds intuition for the algorithm, followed by a numerical example solved in MATLAB. The presentation is an audio-slide deck that becomes self-explanatory when downloaded and viewed in slideshow mode.
Tutorial on Sequence Aware Recommender Systems - ACM RecSys 2018, by Massimo Quadrana
Slides of the Tutorial on Sequence Aware Recommenders held at ACM RecSys 2018 in Vancouver.
Link to the website: https://sites.google.com/view/seq-recsys-tutorial
Link to the hands-on: https://github.com/mquad/sars_tutorial
Amazon DynamoDB is a fully managed NoSQL database service for applications that need consistent, single-digit millisecond latency at any scale. This talk explores DynamoDB capabilities and benefits in detail and discusses how to get the most out of your DynamoDB database. We go over schema design best practices with DynamoDB across multiple use cases, including gaming, AdTech, IoT, and others. We also explore designing efficient indexes, scanning, and querying, and go into detail on a number of recently released features, including JSON document support, Streams, and more.
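As a flavor of the API the talk covers, here is a hedged boto3 sketch of a composite-key schema and a query that avoids a full table scan (the table and attribute names are hypothetical, not from the talk):

```python
import boto3
from boto3.dynamodb.conditions import Key

# Hypothetical schema-design example: partition key = user_id, sort key = ts,
# a common pattern for time-ordered per-user data such as game events.
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("GameEvents")  # table name is an assumption

table.put_item(Item={"user_id": "u42", "ts": 1700000000, "score": 991})

# Query one user's events, newest first, instead of scanning the whole table
resp = table.query(
    KeyConditionExpression=Key("user_id").eq("u42"),
    ScanIndexForward=False,  # descending sort-key order
)
print(resp["Items"])
```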
Data Mining Seminar - Graph Mining and Social Network Analysis, by vwchu
Delivered a formal presentation on course material for the Data Mining (EECS 4412) course at York University, Canada, about graph mining. Graphs have become increasingly important in modeling sophisticated structures and their interactions, with broad applications including chemical informatics, bioinformatics, computer vision, video indexing, text retrieval, and Web analysis. The formal seminar was 50 to 60 minutes followed by 10 to 20 minutes for questions.
https://wiki.eecs.yorku.ca/course_archive/2014-15/F/4412
https://wiki.eecs.yorku.ca/course_archive/2014-15/F/4412/lectures
These slides provide an overview of the most popular approaches to date for the task of object detection with deep neural networks. They review both two-stage approaches such as R-CNN, Fast R-CNN, and Faster R-CNN, and one-stage approaches such as YOLO and SSD. They also contain pointers to relevant datasets (Pascal, COCO, ILSVRC, OpenImages) and the definition of the Average Precision (AP) metric.
Full program:
https://www.talent.upc.edu/ing/estudis/formacio/curs/310400/postgraduate-course-artificial-intelligence-deep-learning/
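For reference on the AP metric mentioned above, a small NumPy sketch of one common way to compute AP from a precision-recall curve (benchmarks differ in interpolation details: 11-point VOC, 101-point COCO, etc., so treat this all-points variant as illustrative):

```python
import numpy as np

def average_precision(recalls, precisions):
    """AP as the area under an interpolated precision-recall curve."""
    # make precision monotonically non-increasing from right to left
    p = np.maximum.accumulate(np.asarray(precisions)[::-1])[::-1]
    r = np.concatenate(([0.0], np.asarray(recalls)))
    # sum precision over each recall increment
    return float(np.sum((r[1:] - r[:-1]) * p))

print(average_precision([0.2, 0.4, 0.4, 0.8], [1.0, 1.0, 0.67, 0.75]))  # 0.7
```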
Slides for talk delivered at the Python Pune meetup on 31st Jan 2014.
Categorical data is a challenge many data scientists face. This talk is about how to tame it.
Fast detection of Android malware: machine learning approach, by Yury Leonychev
This is my presentation from YaC 2013 about a machine-learning-based system for the fast classification of Android applications. Topics covered: how to find malware among thousands of applications in a store.
Tong is a data scientist at Supstat Inc and a master's student in Data Mining. He has been an active R programmer and developer for 5 years. He is the author of the R package for XGBoost, one of the most popular and contest-winning tools on kaggle.com today. (A short usage sketch follows the agenda below.)
Agenda:
Introduction to XGBoost
Real World Application
Model Specification
Parameter Introduction
Advanced Features
Kaggle Winning Solution
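As mentioned above, a minimal usage sketch of XGBoost's Python API on synthetic data (the parameter values are arbitrary examples, not recommendations from the talk):

```python
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# toy binary-classification data
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

dtrain = xgb.DMatrix(X_tr, label=y_tr)
dtest = xgb.DMatrix(X_te, label=y_te)
params = {"objective": "binary:logistic", "max_depth": 4, "eta": 0.1}
booster = xgb.train(params, dtrain, num_boost_round=100,
                    evals=[(dtest, "test")], early_stopping_rounds=10)
print(booster.best_iteration)
```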
"Optimisation of closed loop supply chain decisions using integrated game theoretic particle swarm algorithm"
Kalpit Patne, Visiting Fellow, SMART Infrastructure Facility presented a summary of his research as part of the SMART Seminar Series on 8 July 2016.
For more information, visit the event page at: http://smart.uow.edu.au/events/UOW217694.
• Explored and cleaned a huge volume of user activity logs (JSON) from a movies website using MapReduce jobs in Python.
• Classified user accounts into adults and children for targeted advertising by implementing a similarity-ranking algorithm.
• Grouped user sessions by user behavior using K-means clustering to observe outliers and find distinctive groups.
• Predicted movie ratings with user-user and item-item based recommendation algorithms using Mahout.
This talk was prepared for the November 2013 DataPhilly Meetup: Data in Practice ( http://www.meetup.com/DataPhilly/events/149515412/ )
Map Reduce: Beyond Word Count by Jeff Patti
Have you ever wondered what map reduce can be used for beyond the word-count example you see in all the introductory articles about map reduce? Using Python and mrjob, this talk covers a few simple map reduce algorithms that in part power Monetate's information pipeline.
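In that spirit, a toy mrjob job that goes one step beyond word count by summing purchase amounts per session (the tab-separated input format and class name are my assumptions, not Monetate's actual pipeline):

```python
from mrjob.job import MRJob

class SessionRevenue(MRJob):
    """Sum revenue per session from lines of 'session_id<TAB>amount'."""

    def mapper(self, _, line):
        session_id, amount = line.split("\t")
        yield session_id, float(amount)

    def reducer(self, session_id, amounts):
        yield session_id, sum(amounts)

if __name__ == "__main__":
    SessionRevenue.run()
```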
Bio: Jeff Patti is a backend engineer at Monetate with a passion for algorithms, big data, and long walks on the beach. Prior to working at Monetate he performed software R&D for Lockheed Martin, where he worked on projects ranging from social network analysis to robotics.
It appears that you've provided a set of instructions or an input format for a machine learning task, particularly clustering using K-Means. Let's break down what each component means:
(number of clusters):
This is a placeholder for an actual numerical value that represents the desired number of clusters into which you want to divide your training data. In K-Means clustering, you need to specify in advance how many clusters (K) you want the algorithm to find in your data.
Training set:
The "training set" is your dataset, which contains the data points that you want to cluster. Each data point represents an observation or sample in your dataset.
(drop convention):
It's not clear from this input what "(drop convention)" refers to. It could be related to a specific data preprocessing or handling instruction, but without additional context or information, it's challenging to provide a precise explanation for this part.
In summary, you are expected to provide the number of clusters (K) that you want to discover in your training data, and the training data itself contains the observations or samples that will be used for clustering. The "(drop convention)" part may require further clarification or context to provide a meaningful explanation.

Clustering is a fundamental concept in the field of machine learning and data analysis that involves grouping similar data points together based on certain criteria or patterns. It is a technique used to discover inherent structures, relationships, or similarities within a dataset when there are no predefined labels or categories. Clustering is widely employed in various domains, including marketing, biology, image analysis, recommendation systems, and more. In this comprehensive explanation of clustering, we will explore its principles, methods, applications, and key considerations.
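For concreteness, a plain NumPy sketch of the K-Means loop described above (it ignores the empty-cluster edge case for brevity):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain K-Means: alternate nearest-centroid assignment and centroid update."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # assign each point to its nearest centroid
        labels = np.argmin(((X[:, None] - centroids) ** 2).sum(-1), axis=1)
        # move each centroid to the mean of its assigned points
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):
            break  # converged
        centroids = new
    return labels, centroids
```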
Table of Contents
Introduction to Clustering
Key Concepts and Terminology
Types of Clustering
3.1. Partitioning Clustering
3.2. Hierarchical Clustering
3.3. Density-Based Clustering
3.4. Model-Based Clustering
Distance Metrics and Similarity Measures
Common Clustering Algorithms
5.1. K-Means Clustering
5.2. Hierarchical Agglomerative Clustering
5.3. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
5.4. Gaussian Mixture Models (GMM)
Evaluation of Clusters
Applications of Clustering
7.1. Customer Segmentation
7.2. Image Segmentation
7.3. Anomaly Detection
7.4. Document Clustering
7.5. Recommender Systems
7.6. Genomic Clustering
Challenges and Considerations
8.1. Determining the Number of Clusters (K)
8.2. Handling High-Dimensional Data
8.3. Initial Centroid Selection
8.4. Scaling and Normalization
8.5. Interpretation of Results
Best Practices in Clustering
Future Trends and Advances
Conclusion
1. Introduction to Clustering
Clustering, in the context of data analysis and machine learning, refers to the process of grouping a set of data points into subsets, called clusters, such that points in the same cluster are more similar to one another than to points in other clusters.
Machine Learning: why we should know it and how it works, by Kevin Lee
The most popular buzz word nowadays in the technology world is “Machine Learning (ML).” Most economists and business experts foresee Machine Learning changing every aspect of our lives in the next 10 years through automating and optimizing processes such as: self-driving vehicles; online recommendation on Netflix and Amazon; fraud detection in banks; image and video recognition; natural language processing; question answering machines (e.g., IBM Watson); and many more. This is leading many organizations to seek experts who can implement Machine Learning into their businesses.
Statistical programmers and statisticians in the pharmaceutical industry are in a very interesting position. We have backgrounds very similar to those of Machine Learning experts, such as programming, statistics, and data expertise, and thus embody the essential technical skill sets. This similarity leads many individuals to ask us about Machine Learning; if you lead a biometrics group, you get asked even more often.
The paper is intended for statistical programmers and statisticians who are interested in learning and applying Machine Learning to lead innovation in the pharmaceutical industry. The paper starts with an introduction to the basic concepts of Machine Learning: the hypothesis, the cost function, and gradient descent. It then introduces supervised ML (e.g., Support Vector Machines, Decision Trees, Logistic Regression), unsupervised ML (e.g., clustering), and the most powerful ML algorithm, the Artificial Neural Network (ANN). The paper also introduces some popular SAS® ML procedures and SAS Visual Data Mining and Machine Learning. Finally, it discusses current ML implementations, future implementations, and how programmers and statisticians could lead this exciting and disruptive technology in the pharmaceutical industry.
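To ground the "hypothesis, cost function, gradient descent" trio the paper opens with, here is a tiny illustrative sketch for linear regression (not code from the paper):

```python
import numpy as np

# Hypothesis h(x) = w*x + b, cost J(w, b) = mean squared error,
# parameters updated along the negative gradient of J.
def gradient_descent(x, y, lr=0.01, steps=1000):
    w, b = 0.0, 0.0
    n = len(x)
    for _ in range(steps):
        err = (w * x + b) - y
        w -= lr * (2 / n) * np.dot(err, x)  # dJ/dw
        b -= lr * (2 / n) * err.sum()       # dJ/db
    return w, b

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 3 * x + 1
print(gradient_descent(x, y))  # approaches (3, 1)
```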
A business level introduction to Artificial Intelligence - Louis Dorard @ PAP..., by PAPIs.io
Artificial Intelligence and Machine Learning are becoming increasingly accessible. Starting from example use cases, I’ll aim at demystifying how they work and how they improve businesses in 3 areas: increasing the number of customers, serving them better, and serving them more efficiently. I’ll show how machines can use data to automatically learn business rules and make predictions, that can then be used to make better decisions. I’ll introduce the main concepts of ML, its possibilities, its limitations, and I’ll give tips on framing the right problems for your company to tackle.
Louis Dorard is the author of Bootstrapping Machine Learning, a co-founder of PAPIs, and an independent consultant. His goal is to help people use new machine learning technologies to make their apps and businesses smarter. He does this by writing, speaking and teaching.
Week 2 iLab TCO 2 — Given a simple problem, design a solutio.docx, by melbruce90096
Week 2 iLab
TCO 2 — Given a simple problem, design a solution algorithm that uses arithmetic expressions and built-in functions.
Scenario
Your goal is to solve the following simple programming exercise. You have been contracted by a local antique store to design an algorithm determining the total purchases and sales tax. According to the store owner, the user will need to see the subtotal, the sales tax amount, and the total purchase amount. A customer is purchasing four items from the antique store. Design an algorithm where the user will enter the price of each of the four items. The algorithm will determine the subtotal, the sales tax, and the total purchase amount. Assume the sales tax is 7%.
Be sure to think about the logic and design first (input-process-output (IPO) chart, flowchart, and pseudocode). Display all output using currency formatting.
Advanced (optional): Use a constant for the 7% sales tax.
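A quick Python sketch of the required logic is below; the course asks for pseudocode or C#, so this is only to illustrate the input-process-output flow, including the optional tax constant:

```python
TAX_RATE = 0.07  # constant for the 7% sales tax (the "advanced" option)

# Input: prices of the four items
prices = [float(input(f"Enter price of item {i + 1}: ")) for i in range(4)]

# Process: subtotal, sales tax, total purchase amount
subtotal = sum(prices)
tax = subtotal * TAX_RATE
total = subtotal + tax

# Output: currency formatting
print(f"Subtotal:  ${subtotal:,.2f}")
print(f"Sales tax: ${tax:,.2f}")
print(f"Total:     ${total:,.2f}")
```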
Rubric
Point distribution for this activity (iLab Activity):

Document              Points possible    Points received
Variable list         10
IPO chart             10
Flowchart             10
Pseudocode/C# code    10
Desk-check            10
Total Points          50
Name:_________________
(1) Variable List With Data Type
List all the variables you will use (use valid variable names). Indicate whether the data type is string, integer, or double, and so on.
(2) IPO Model
List the inputs, any processes, calculations, and outputs. Use the same valid variable names you used in Step 1.
Inputs
Process (calculations)
Outputs
(3) Flowchart
Use MS Visio to create a flowchart. Paste the flowchart here, or attach as separate document. Use the same valid variable names you used in Step 1.
(4) Pseudocode or C# Code
Describe your solution using pseudocode or actual C# code. Use the same valid variable names you selected in Step 1.
(5) Desk-Check
Desk-check your solution by selecting appropriate test data.
Test data: List the values for your test data.
Expected output: What is the expected output of your program?
Step (enter step numbers)    Variables (write variable names in the first line below)    Output
1
2
3
Week 2 Activity—Game Seating Charges
TCO 2—Given a simple problem, design a solution algorithm that uses arithmetic expressions and built-in functions.
Assignment
Your goal is to solve the following simple programming exercise. You have been contracted by a local stadium to design an algorithm determining the total seating charges for any game held at the stadium. Lower-level seats cost $25 per seat, mid-level seats cost $15 per seat, and upper-level seats cost $10 per seat. The algorithm should ask the user for the number of seats being purchased in each seating level. Then, the algorithm will determine the total for each level and a grand total for the entire purchase.
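Again purely as an illustration of the logic (the assignment asks for pseudocode or C#), a short Python sketch:

```python
# Seat prices per level
PRICES = {"lower": 25.0, "mid": 15.0, "upper": 10.0}

grand_total = 0.0
for level, price in PRICES.items():
    seats = int(input(f"Number of {level}-level seats: "))
    level_total = seats * price
    grand_total += level_total
    print(f"{level.capitalize()}-level total: ${level_total:,.2f}")
print(f"Grand total: ${grand_total:,.2f}")
```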
The ABC of Implementing Supervised Machine Learning with Python.pptx, by Ruby Shrestha
Machine learning has reached significant heights, but knowing and understanding how small problems can be solved from a machine learning perspective is necessary to form a good base, appreciate the implementation process, and get started in this domain. Therefore, in this post, I would like to talk about the ABC of implementing supervised machine learning with Python by walking through a simple example: adding two numbers. To put it simply, I would like to make a machine learn to add; in other words, I would like to develop a predictive model that can add. Sounds simple, right? View the presentation for more details.
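A hedged sketch of how such a model might look with scikit-learn (the post's exact approach may differ):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# "Teach a machine to add": pairs (a, b) as features, a + b as the target.
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(1000, 2))
y = X.sum(axis=1)

model = LinearRegression().fit(X, y)
print(model.predict([[3.0, 4.0]]))  # ≈ [7.0]
```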
Factorization Models in Recommender Systems, by romovpa
Factorization models, i.e. matrix decomposition models for collaborative filtering in recommender systems. The presentation covers the theoretical aspects and algorithms.
From a talk at the "Machine Learning & Information Retrieval" special seminar at the Yandex School of Data Analysis.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN, by Sérgio Sacani
The return of a sample of near-surface atmosphere from Mars would facilitate answers to several first-order science questions surrounding the formation and evolution of the planet. One of the important aspects of terrestrial planet formation in general is the role that primary atmospheres played in influencing the chemistry and structure of the planets and their antecedents. Studies of the martian atmosphere can be used to investigate the role of a primary atmosphere in its history. Atmosphere samples would also inform our understanding of the near-surface chemistry of the planet, and ultimately the prospects for life. High-precision isotopic analyses of constituent gases are needed to address these questions, requiring that the analyses are made on returned samples rather than in situ.
Slide 1: Title Slide
Extrachromosomal Inheritance
Slide 2: Introduction to Extrachromosomal Inheritance
Definition: Extrachromosomal inheritance refers to the transmission of genetic material that is not found within the nucleus.
Key Components: Involves genes located in mitochondria, chloroplasts, and plasmids.
Slide 3: Mitochondrial Inheritance
Mitochondria: Organelles responsible for energy production.
Mitochondrial DNA (mtDNA): Circular DNA molecule found in mitochondria.
Inheritance Pattern: Maternally inherited, meaning it is passed from mothers to all their offspring.
Diseases: Examples include Leber’s hereditary optic neuropathy (LHON) and mitochondrial myopathy.
Slide 4: Chloroplast Inheritance
Chloroplasts: Organelles responsible for photosynthesis in plants.
Chloroplast DNA (cpDNA): Circular DNA molecule found in chloroplasts.
Inheritance Pattern: Often maternally inherited in most plants, but can vary in some species.
Examples: Variegation in plants, where leaf color patterns are determined by chloroplast DNA.
Slide 5: Plasmid Inheritance
Plasmids: Small, circular DNA molecules found in bacteria and some eukaryotes.
Features: Can carry antibiotic resistance genes and can be transferred between cells through processes like conjugation.
Significance: Important in biotechnology for gene cloning and genetic engineering.
Slide 6: Mechanisms of Extrachromosomal Inheritance
Non-Mendelian Patterns: Do not follow Mendel’s laws of inheritance.
Cytoplasmic Segregation: During cell division, organelles like mitochondria and chloroplasts are randomly distributed to daughter cells.
Heteroplasmy: Presence of more than one type of organellar genome within a cell, leading to variation in expression.
Slide 7: Examples of Extrachromosomal Inheritance
Four O’clock Plant (Mirabilis jalapa): Shows variegated leaves due to different cpDNA in leaf cells.
Petite Mutants in Yeast: Result from mutations in mitochondrial DNA affecting respiration.
Slide 8: Importance of Extrachromosomal Inheritance
Evolution: Provides insight into the evolution of eukaryotic cells.
Medicine: Understanding mitochondrial inheritance helps in diagnosing and treating mitochondrial diseases.
Agriculture: Chloroplast inheritance can be used in plant breeding and genetic modification.
Slide 9: Recent Research and Advances
Gene Editing: Techniques like CRISPR-Cas9 are being used to edit mitochondrial and chloroplast DNA.
Therapies: Development of mitochondrial replacement therapy (MRT) for preventing mitochondrial diseases.
Slide 10: Conclusion
Summary: Extrachromosomal inheritance involves the transmission of genetic material outside the nucleus and plays a crucial role in genetics, medicine, and biotechnology.
Future Directions: Continued research and technological advancements hold promise for new treatments and applications.
Slide 11: Questions and Discussion
Invite Audience: Open the floor for any questions or further discussion on the topic.
Toxic effects of heavy metals: Lead and Arsenic, by sanjana502982
Heavy metals are naturally occurring metallic chemical elements that have a relatively high density and are toxic even at low concentrations. All toxic metals are termed heavy metals irrespective of their atomic mass and density, e.g. arsenic, lead, mercury, cadmium, thallium, and chromium.
The ability to recreate computational results with minimal effort and actionable metrics provides a solid foundation for scientific research and software development. When people can replicate an analysis at the touch of a button using open-source software, open data, and methods to assess and compare proposals, it significantly eases verification of results, engagement with a diverse range of contributors, and progress. However, we have yet to fully achieve this; there are still many sociotechnical frictions.
Inspired by David Donoho's vision, this talk aims to revisit the three crucial pillars of frictionless reproducibility (data sharing, code sharing, and competitive challenges) with the perspective of deep software variability.
Our observation is that multiple layers — hardware, operating systems, third-party libraries, software versions, input data, compile-time options, and parameters — are subject to variability that exacerbates frictions but is also essential for achieving robust, generalizable results and fostering innovation. I will first review the literature, providing evidence of how the complex variability interactions across these layers affect qualitative and quantitative software properties, thereby complicating the reproduction and replication of scientific studies in various fields.
I will then present some software engineering and AI techniques that can support the strategic exploration of variability spaces. These include the use of abstractions and models (e.g., feature models), sampling strategies (e.g., uniform, random), cost-effective measurements (e.g., incremental build of software configurations), and dimensionality reduction methods (e.g., transfer learning, feature selection, software debloating).
I will finally argue that deep variability is both the problem and solution of frictionless reproducibility, calling the software science community to develop new methods and tools to manage variability and foster reproducibility in software systems.
Invited talk at the Journées Nationales du GDR GPL 2024.
Professional air quality monitoring systems provide immediate, on-site data for analysis, compliance, and decision-making.
They monitor common gases, weather parameters, and particulates.
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V..., by Wasswaderrick3
In this book, we use conservation-of-energy techniques on a fluid element to derive the Modified Bernoulli equation of flow with viscous or friction effects. We derive the general equation of flow/velocity and from it derive the Poiseuille flow equation, the transition flow equation, and the turbulent flow equation. In situations where there are no viscous effects, the equation reduces to the Bernoulli equation. From experimental results, we are able to include other terms in the Bernoulli equation. We also look at cases where pressure gradients exist. We use the Modified Bernoulli equation to derive equations of flow rate for pipes of different cross-sectional areas connected together. We also extend our energy-conservation techniques to a sphere falling in a viscous medium under the effect of gravity. We demonstrate Stokes' equation of terminal velocity and the turbulent flow equation. We look at a way of calculating the time taken for a body to fall in a viscous medium, and at the general equation of terminal velocity.
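For orientation, one common textbook form of Bernoulli's equation extended with a viscous head-loss term, with the laminar (Poiseuille) loss spelled out; the book's exact derivation and notation may differ:

```latex
% Bernoulli with a head-loss term (a sketch, not the book's derivation):
\[
  \frac{p_1}{\rho g} + \frac{v_1^2}{2g} + z_1
  = \frac{p_2}{\rho g} + \frac{v_2^2}{2g} + z_2 + h_L ,
\]
% where for laminar (Poiseuille) pipe flow the viscous head loss is
\[
  h_L = \frac{32\,\mu\,L\,v}{\rho\,g\,D^2} .
\]
```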
Richard's adventures in two entangled wonderlands, by Richard Gill
Since the loophole-free Bell experiments of 2020 and the Nobel prizes in physics of 2022, critics of Bell's work have retreated to the fortress of super-determinism. Now, super-determinism is a derogatory word - it just means "determinism". Palmer, Hance and Hossenfelder argue that quantum mechanics and determinism are not incompatible, using a sophisticated mathematical construction based on a subtle thinning of allowed states and measurements in quantum mechanics, such that what is left appears to make Bell's argument fail, without altering the empirical predictions of quantum mechanics. I think however that it is a smoke screen, and the slogan "lost in math" comes to my mind. I will discuss some other recent disproofs of Bell's theorem using the language of causality based on causal graphs. Causal thinking is also central to law and justice. I will mention surprising connections to my work on serial killer nurse cases, in particular the Dutch case of Lucia de Berk and the current UK case of Lucy Letby.
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ..., by Sérgio Sacani
We characterize the earliest galaxy population in the JADES Origins Field (JOF), the deepest imaging field observed with JWST. We make use of the ancillary Hubble optical images (5 filters spanning 0.4–0.9 µm) and novel JWST images with 14 filters spanning 0.8–5 µm, including 7 medium-band filters, and reaching total exposure times of up to 46 hours per filter. We combine all our data at > 2.3 µm to construct an ultradeep image, reaching as deep as ≈ 31.4 AB mag in the stack and 30.3–31.0 AB mag (5σ, r = 0.1″ circular aperture) in individual filters. We measure photometric redshifts and use robust selection criteria to identify a sample of eight galaxy candidates at redshifts z = 11.5–15. These objects show compact half-light radii of R_1/2 ∼ 50–200 pc, stellar masses of M⋆ ∼ 10^7–10^8 M⊙, and star-formation rates of SFR ∼ 0.1–1 M⊙ yr⁻¹. Our search finds no candidates at 15 < z < 20, placing upper limits at these redshifts. We develop a forward-modeling approach to infer the properties of the evolving luminosity function without binning in redshift or luminosity that marginalizes over the photometric redshift uncertainty of our candidate galaxies and incorporates the impact of non-detections. We find a z = 12 luminosity function in good agreement with prior results, and that the luminosity function normalization and UV luminosity density decline by a factor of ∼ 2.5 from z = 12 to z = 14. We discuss the possible implications of our results in the context of theoretical models for evolution of the dark matter halo mass function.
RecSys Challenge 2015: ensemble learning with categorical features

1. RecSys Challenge 2015: ensemble learning with categorical features
   Peter Romov, Evgeny Sokolov
2. Problem statement
• Logs from an e-commerce website: a collection of sessions
• Session
  • a sequence of clicks on item pages
  • could end with or without a purchase
• Click
  • Timestamp
  • ItemID (≈54k unique IDs)
  • CategoryID (≈350 unique IDs)
• Purchase
  • the set of bought items, with price and quantity
• Purchases are known for the train set; we need to predict them on the test set
3. Problem statement
Clicks from session s: c(s) = (c_1(s), \ldots, c_{n(s)}(s))
Purchase (actual): y(s) = \emptyset if there is no purchase, and y(s) = \{i_1, \ldots, i_{m(s)}\} (the bought items) otherwise
Purchase (predicted): h(s) \approx y(s)

Evaluation measure:

Q(h, S_{\text{test}}) = \sum_{s \in S_{\text{test}} : |h(s)| > 0} \begin{cases} \dfrac{|S^b_{\text{test}}|}{|S_{\text{test}}|} + J(y(s), h(s)), & \text{if } y(s) \neq \emptyset \\ -\dfrac{|S^b_{\text{test}}|}{|S_{\text{test}}|}, & \text{otherwise} \end{cases}

where J(A, B) = |A \cap B| / |A \cup B| is the Jaccard similarity between two sets, S_{\text{test}} is the set of all sessions from the test set, and S^b_{\text{test}} is the set of test sessions with a purchase.
4. Problem statement
First observations (from the task):
• the task is uncommon (set prediction with a specific loss function)
• the evaluation measure can be rewritten as

Q(h, S_{\text{test}}) = \underbrace{\frac{|S^b_{\text{test}}|}{|S_{\text{test}}|}\,(TP - FP)}_{\text{purchase score}} + \underbrace{\sum_{s \in S_{\text{test}}} J(y(s), h(s))}_{\text{Jaccard score}}

• the original problem can be divided into two well-known binary classification problems:
  1. predict purchase given the session, i.e. model P(y(s) \neq \emptyset \mid s), to optimize the purchase score
  2. predict the bought items given a session with a purchase, i.e. model P(i \in y(s) \mid s, y(s) \neq \emptyset), to optimize the Jaccard score
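A small Python sketch of this evaluation measure as reconstructed above (the dictionary-based data layout is my choice, not the organizers' code):

```python
def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def challenge_score(y_true, y_pred):
    """For every session with a predicted purchase, add +|S_b|/|S| plus the
    Jaccard similarity if the session really bought, and -|S_b|/|S| otherwise.
    Both arguments map session id -> set of items (empty set = no purchase)."""
    ratio = sum(1 for items in y_true.values() if items) / len(y_true)
    score = 0.0
    for s, pred in y_pred.items():
        if not pred:
            continue  # sessions with empty predictions are skipped
        score += ratio + jaccard(y_true[s], pred) if y_true[s] else -ratio
    return score

# toy example: two sessions, one real buyer
print(challenge_score({"a": {1, 2}, "b": set()}, {"a": {2}, "b": set()}))  # 1.0
```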
5. Solution schema
• Two-stage prediction
• Two binary classification models learned on the train set S_train, which is split into S_learn (90%) and S_valid (10%):
  • purchase classifier: s \mapsto P(\text{purchase} \mid s)
  • bought-item classifier: (s, i) \mapsto P(i \in y(s) \mid s, y(s) \neq \emptyset)
• Both classifiers require thresholds
• The classifier thresholds are set to optimize the purchase score and the Jaccard score on the 10% hold-out S_valid
6. Some relationships from the data
Next observations (from the data):
• Buying rate strongly depends on time features
• Buying rate varies highly across categorical features
(Buying rate: the fraction of buyer sessions in some subset of sessions.)

[Figure 1: Dynamics of the buying rate in time]
[Figure 2: Buying rate versus the number of clicked items (left) and the ID of the item with the maximum number of clicks in the session (right)]

The slide also embeds an excerpt from the accompanying paper:

"The total number of item IDs and category IDs is 54,287 and 347 correspondingly. Both training and test sets belong to an interval of 6 months. The target function y(s) corresponds to the set of items that were bought in the session s. In other words, the target function gives some subset of the universal item set I for each user session s. We are given the sets of bought items y(s) for all sessions s in the training set S_train, and are required to predict these sets for test sessions s \in S_{\text{test}}.

2.2 Evaluation Measure. Denote by h(s) a hypothesis that predicts a set of bought items for any user session s. The score of this hypothesis is measured by the following formula:

Q(h, S_{\text{test}}) = \sum_{s \in S_{\text{test}} : |h(s)| > 0} \left( (-1)^{\mathrm{isEmpty}(y(s))} \frac{|S^b_{\text{test}}|}{|S_{\text{test}}|} + J(y(s), h(s)) \right),

where S^b_{\text{test}} is the set of all test sessions with at least one purchase event, and J(A, B) = |A \cap B| / |A \cup B| is the Jaccard similarity measure. It is easy to rewrite this expression as

Q(h, S_{\text{test}}) = \underbrace{\frac{|S^b_{\text{test}}|}{|S_{\text{test}}|}\,(TP - FP)}_{\text{purchase score}} + \underbrace{\sum_{s \in S_{\text{test}}} J(y(s), h(s))}_{\text{Jaccard score}}, \quad (1)

where TP is the number of sessions with |y(s)| > 0 and |h(s)| > 0 (i.e. true positives), and FP is the number of sessions with |y(s)| = 0 and |h(s)| > 0 (i.e. false positives). Now it is easy to see that the score consists of two parts. The first one gives a reward for each correctly guessed session with buy events and a penalty for each false alarm; the absolute values of penalty and reward are both equal to |S^b_{\text{test}}| / |S_{\text{test}}|. The second part calculates the total similarity of the predicted sets of bought items to the real sets.

2.3 Purchase statistics. […] a higher number of items clicked during the session leads to a higher chance of a purchase. Lots of information could be extracted from the data by considering item identifiers and categories clicked during the…"
7. Feature extraction
• Purchase classifier: features from the session (a sequence of clicks)
• Bought-item classifier: features from the pair session + ItemID
  • Observation: a bought item is always a clicked item
• We use two types of features:
  • Numerical: a real number, e.g. the seconds between two clicks
  • Categorical: an element of an unordered set of values (levels), e.g. ItemID
8. Feature extraction: session
1. Start/end of the session (month, day, hour, etc.) [numerical + categorical with few levels]
2. Number of clicks, unique items, categories, and item-category pairs [numerical]
3. Top 10 items and categories by the number of clicks [categorical with ≈50k levels]
4. ID of the first/last item clicked at least k times [categorical with ≈50k levels]
5. Click counts for the 100 items and 50 categories that were most popular in the whole training set [sparse numerical]
9. Feature extraction: session + ItemID
1. All session features
2. ItemID [categorical with ≈50k levels]
3. Timestamp of the first/last click on the item (month, day, hour, etc.) [numerical + categorical with few levels]
4. Number of clicks on the given item within the session [numerical]
5. Total duration (by analogy with dwell time) of the clicks on the item [numerical]
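To make the two feature groups concrete, a hedged sketch of a few of the listed session features (the click-tuple layout and feature names are illustrative, not the authors' code):

```python
from collections import Counter

def session_features(clicks):
    """A few session features; each click is assumed to be a
    (unix_timestamp, item_id, category_id) tuple."""
    times = [t for t, _, _ in clicks]
    items = [i for _, i, _ in clicks]
    cats = [c for _, _, c in clicks]
    return {
        "n_clicks": len(clicks),                              # numerical
        "n_unique_items": len(set(items)),                    # numerical
        "n_unique_categories": len(set(cats)),                # numerical
        "n_item_category_pairs": len(set(zip(items, cats))),  # numerical
        "start_hour": times[0] // 3600 % 24,                  # categorical, few levels
        "top_items": [i for i, _ in Counter(items).most_common(10)],  # ≈50k levels
    }
```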
10. Classification method
• GBM and similar ensemble learning techniques
  • useful with numerical features
  • one-hot encoding of categorical features doesn't perform well
• Matrix decompositions, FM (factorization machines)
  • useful with categorical features
  • hard to incorporate numerical features because of the rough (bi-linear) model
• We used our internal learning algorithm: MatrixNet
  • GBM with oblivious decision trees
  • the trees properly handle categorical features (multi-split decision trees)
  • SVD-like decompositions for new feature-value combinations
11. Classification method
Oblivious decision tree with categorical features:
[Diagram: a tree whose first level splits on the numerical test "duration > 20" (yes/no) and whose deeper levels split on the categorical features "user" and "item"; every node at the same depth applies the same test.]
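A tiny sketch of why oblivious trees are fast: because every node at a given depth applies the same test, prediction reduces to indexing a table of 2^D leaf values (illustrative only; MatrixNet's categorical multi-splits are more involved):

```python
import numpy as np

def oblivious_predict(x, level_tests, leaf_values):
    """Depth-D oblivious tree = D binary tests indexing 2^D leaves."""
    index = 0
    for test in level_tests:
        index = (index << 1) | int(test(x))
    return leaf_values[index]

# toy depth-2 tree: one numerical test, one categorical test
tests = [lambda x: x["duration"] > 20, lambda x: x["item"] == "i1"]
leaves = np.array([0.1, 0.4, 0.2, 0.9])
print(oblivious_predict({"duration": 25, "item": "i1"}, tests, leaves))  # 0.9
```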
12. Classification method: speed
• Training classifiers
  • GB with 10k trees for each classifier
  • ≈12 hours to train both models on 150 machines
• Making predictions
  • We made 4000 predictions per second per thread
13. Threshold optimization
We optimized the thresholds using the validation set (the 10% hold-out from the train set):
1) Maximize the Jaccard score
2) Maximize the purchase + Jaccard scores using the fixed bought-item threshold

Q(h, S_{\text{valid}}) = \underbrace{\frac{|S^b_{\text{valid}}|}{|S_{\text{valid}}|}\,(TP - FP)}_{\text{purchase score}} + \underbrace{\sum_{s \in S_{\text{valid}}} J(y(s), h(s))}_{\text{Jaccard score}}

[Figure 3: Item detection threshold (above) and purchase detection threshold (below) quality on the validation set.]

The slide also embeds an excerpt from the accompanying paper:

"…we train purchase detection and purchased item detection classifiers. The purchase detection classifier h_p(s) predicts the outcome of the function y_p(s) = isNotEmpty(y(s)) and uses the entire training set in the learning phase. The item detection classifier h_i(s, j) approximates the indicator function y_i(s, j) = I(j \in y(s)) and uses only sessions with bought items in the learning phase. Of course, it would be wise to use classifiers that output probabilities rather than binary predictions, because in this case we will be able to select thresholds that directly optimize evaluation metric (1) instead of the classifier's internal quality measure. So, our final expression for the hypothesis can be written as

h(s) = \begin{cases} \emptyset, & \text{if } h_p(s) < \alpha_p \\ \{\, j \in I \mid h_i(s, j) > \alpha_i \,\}, & \text{if } h_p(s) \geq \alpha_p \end{cases} \quad (2)

3.2 Feature Extraction. We have outlined two groups of features: one describes a session and the other describes a session-item pair. The purchase detection classifier uses only session features and the item detection classifier uses both groups. The full feature listing can be found in Table 1; for further details, please refer to our code. We give some comments on our feature extraction decisions below. One could use sophisticated aggregations to extract numerical features that describe items and categories. How…"
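A hedged sketch of the two-step grid search over thresholds on the validation hold-out (the data structures and helper names are mine, not the authors' code):

```python
import numpy as np

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def pick_thresholds(p_buy, p_item, y_true, ratio, grid=np.linspace(0, 1, 101)):
    """(1) pick the item threshold alpha_i maximizing the Jaccard score on
    buyer sessions, then (2) with alpha_i fixed, pick the purchase threshold
    alpha_p maximizing the full score. ratio = |S_b_valid| / |S_valid|."""
    buyers = {s for s, items in y_true.items() if items}

    def items_at(s, a_i):
        # predicted item set for session s at item threshold a_i
        return {i for i, p in p_item[s].items() if p > a_i}

    # step 1: Jaccard score only, over sessions that actually bought
    a_i = max(grid, key=lambda a: sum(jaccard(y_true[s], items_at(s, a))
                                      for s in buyers))

    # step 2: full score (purchase reward/penalty + Jaccard), alpha_i fixed
    def full_score(a_p):
        total = 0.0
        for s in y_true:
            pred = items_at(s, a_i) if p_buy[s] >= a_p else set()
            if not pred:
                continue  # empty prediction: no reward, no penalty
            total += ratio + jaccard(y_true[s], pred) if s in buyers else -ratio
        return total

    a_p = max(grid, key=full_score)
    return a_p, a_i
```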
14. Final results
• Leaderboard: 63102 (1st place)
• Purchase detection on validation (10% hold-out):
  • 16% precision
  • 77% recall
  • AUC 0.85
• Purchased item detection on validation:
  • Jaccard measure 0.765
• The features, the datasets used to learn the classifiers, and the evaluation process can be reproduced; see our code: https://github.com/romovpa/ydf-recsys2015-challenge
15. Summary / Questions?
1. Observations from the problem statement
› The task is complex but decomposable into two well-known ones: binary classification of sessions and of (session, ItemID) pairs
2. Observations from the data (user click sessions)
› Features from sessions and (session, ItemID) pairs
› Easy to develop many meaningful categorical features
3. The algorithm
› Gradient boosting on trees with categorical features
› No sophisticated mixtures of machine-learning techniques: one algorithm works with many numerical and categorical features