This document summarizes the key steps in developing a machine learning model to classify online shops into different vertical categories based on their product descriptions. The model uses a Naive Bayes classifier trained on 80,000 shops that were manually labeled by Amazon Mechanical Turk workers into 15 categories. Text preprocessing like removing HTML tags and stemming is done before extracting term frequencies to build the model. The model provides 60-80% accuracy on held-out data and is used in production at the company to categorize thousands of new shops daily and provide insights to different teams. There is still room for improvement by evaluating additional metrics and hyperparameters.
2. Background
• Large ecommerce platform
• 240K+ current customers
• Many more shops created (churned or didn’t make it to customer status)
4. Problem
● No information about their industry in most cases
1st solution
● ask them
2nd solution
● We have HTML product descriptions for each shop
● We have labelled data (Mechanical Turk)
● Together these allow training a classifier
5. Context
• Started during a Shopify Hack Day
• Pursued as a side project at work
• Used sk-learn and
• Moved to Spark MLlib for full-scale testing and production
• Now in production
7. Getting Label Data
• Asked Amazon Mechanical Turkers to assess 80K stores
• Having to choose among 15 verticals
• Involved hundreds of turkers
8. Input (80K shops)
Shop  Aggregated product data
1     “Nice octopolo shirt !…”
2     “Nice hat and nice shirt …”
3     “Set of <b> tires </b> …”
4     “Beef and more beef…”
5     “Tire set for bikes”
...   ...
9. Cleaning (80K shops)
Shop  Text
1     “nice octopolo shirt…”
2     “nice hat and nice shirt…”
3     “set tire…”
4     “beef beef…”
5     “tire set bike”
...   ...
• HTML code removed
• Stop words removed
• Words stemmed
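The three cleaning steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop-word list and the plural-stripping `naive_stem` helper are hypothetical stand-ins, not the stemmer or stop-word list used in the talk.

```python
import re

# Tiny illustrative stop-word list (an assumption, not the list from the talk).
STOP_WORDS = {"a", "an", "and", "for", "of", "the"}


def strip_html(text):
    """Replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def naive_stem(word):
    """Crude stand-in for a real stemmer: drop a trailing plural 's'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def clean(text):
    """Strip HTML, lowercase, drop stop words, then stem each token."""
    tokens = re.findall(r"[a-z]+", strip_html(text).lower())
    return " ".join(naive_stem(t) for t in tokens if t not in STOP_WORDS)


cleaned_tires = clean("Set of <b> tires </b>")   # "set tire"
cleaned_bikes = clean("Tire set for bikes")      # "tire set bike"
```

A production pipeline would more likely rely on an HTML parser and an established stemmer; the regex approach above is just the shortest self-contained version.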
10. Term Frequency (joined with Mechanical Turk labels)
Shops  nice  octopolo  shirt  hat  set  tires  beef  bike  ...  label
1      1     1         1      0    0    0      0     0     ...  Apparel
2      2     0         1      1    0    0      0     0     ...  Apparel
3      0     0         0      0    1    1      0     0     ...  Auto
4      0     0         0      0    0    0      2     0     ...  Food
5      0     0         0      0    1    1      0     1     ...  Auto
...    ...   ...       ...    ...  ...  ...    ...   ...   ...  ...
80K shops, 10K words (8 in the example)
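A term-frequency table like this one can be built with a word counter per shop. The sketch below is a minimal pure-Python version; the function name `term_frequencies` and the toy shop texts are illustrative assumptions, and the talk itself used library tooling rather than this code.

```python
from collections import Counter


def term_frequencies(cleaned_texts):
    """Return (vocabulary, {shop: count vector}) for cleaned shop texts."""
    vocab = sorted({w for text in cleaned_texts.values() for w in text.split()})
    rows = {}
    for shop, text in cleaned_texts.items():
        counts = Counter(text.split())
        # Counter returns 0 for words that never occur in this shop.
        rows[shop] = [counts[w] for w in vocab]
    return vocab, rows


# Toy input mirroring three of the example shops.
shops = {1: "nice octopolo shirt", 2: "nice hat nice shirt", 4: "beef beef"}
vocab, rows = term_frequencies(shops)
```

Each row can then be joined with the shop's label to form the training table.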
12. Naïve Bayes Model
Term-frequency table (80K shops):
Shops  nice  octopolo  shirt  hat  set  tires  beef  bike  label
1      1     1         1      0    0    0      0     0     Apparel
2      2     0         1      1    0    0      0     0     Apparel
3      0     0         0      0    1    1      0     0     Auto
4      0     0         0      0    0    0      2     0     Food
5      0     0         0      0    1    1      0     1     Auto
Model table (15 labels), one row per label with a likelihood per word and a prior:
Apparel (shops 1, 2): P(nice | apparel), P(octopolo | apparel), P(shirt | apparel), P(hat | apparel), P(set | apparel), P(tires | apparel), P(beef | apparel), P(bike | apparel); prior P(apparel)
Auto (shops 3, 5): P(nice | auto), P(octopolo | auto), P(shirt | auto), P(hat | auto), P(set | auto), P(tires | auto), P(beef | auto), P(bike | auto); prior P(auto)
Food (shop 4): P(nice | food), P(octopolo | food), P(shirt | food), P(hat | food), P(set | food), P(tires | food), P(beef | food), P(bike | food); prior P(food)
13. What and why
(Model table repeated from slide 12: per-label word likelihoods and priors.)
• These are the model parameters
• Needed as input to the prediction formula
$\text{predicted class} = \arg\max_{cl} P(cl \mid doc)$
14. What and why
(Model table repeated from slide 12.)
$P(cl \mid doc) = \frac{P(cl) \, P(doc \mid cl)}{P(doc)}$ (Bayes Theorem)
$\propto P(cl) \, P(doc \mid cl)$ (denominator not important to compare likelihoods)
$= P(cl) \, P(wd_1 \mid cl) \, P(wd_2 \mid cl) \cdots P(wd_n \mid cl)$ (with conditional independence assumption, actually violated)
$\text{predicted class} = \arg\max_{cl} P(cl \mid doc)$
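To see why the denominator can be dropped, the small sketch below compares the class picked from the unnormalized scores P(cl) * P(doc | cl) with the class picked from the fully normalized posterior. All numbers are made up for illustration, and the helper name `posterior` is hypothetical.

```python
def posterior(priors, likelihoods):
    """Return (unnormalized scores, normalized posteriors) per class."""
    joint = {cl: priors[cl] * likelihoods[cl] for cl in priors}
    evidence = sum(joint.values())  # P(doc), identical for every class
    normalized = {cl: score / evidence for cl, score in joint.items()}
    return joint, normalized


# Made-up priors P(cl) and document likelihoods P(doc | cl).
priors = {"apparel": 0.5, "auto": 0.3, "food": 0.2}
likelihoods = {"apparel": 0.02, "auto": 0.001, "food": 0.0005}

joint, post = posterior(priors, likelihoods)
best_unnormalized = max(joint, key=joint.get)
best_normalized = max(post, key=post.get)
```

Dividing every score by the same positive constant never changes their ranking, so both selections agree.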
15. Numerical Limitation
(Model table repeated from slide 12.)
$P(cl \mid doc) \propto P(cl) \, P(wd_1 \mid cl) \, P(wd_2 \mid cl) \cdots P(wd_n \mid cl)$
• Multiplying many values close to 0 -> float underflow
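The underflow is easy to reproduce. In the sketch below, 100 word probabilities of 1e-5 (arbitrary illustrative values) multiply to a number far below what a 64-bit float can represent, while the sum of their logarithms stays perfectly ordinary.

```python
import math

# 100 small word probabilities; the values are arbitrary, for illustration.
probs = [1e-5] * 100

product = 1.0
for p in probs:
    product *= p  # ends up at exactly 0.0 once the float underflows

log_sum = sum(math.log(p) for p in probs)  # stays finite
```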
16. Numerical limitation
(Same model table as slide 12, with every word likelihood and every prior replaced by its logarithm, Log(P(..)).)
From before:
$P(cl \mid doc) \propto P(cl) \, P(wd_1 \mid cl) \, P(wd_2 \mid cl) \cdots P(wd_n \mid cl)$
so:
$\log P(cl \mid doc) = \log P(cl) + \log P(wd_1 \mid cl) + \log P(wd_2 \mid cl) + \dots + \log P(wd_n \mid cl) + \text{const}$
where the constant, $-\log P(doc)$, is the same for every class.
• Way around: take log -> leads to summation instead of multiplication
• No impact on comparisons across classes
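Prediction in log space then amounts to summing log parameters and taking the largest score. The following is a minimal sketch with made-up parameters; the function name `predict` and the choice to skip words absent from the model are assumptions, not details from the talk.

```python
import math


def predict(tokens, log_priors, log_likelihoods):
    """Return (best class, scores) using summed log probabilities."""
    scores = {}
    for cl, log_prior in log_priors.items():
        word_logs = log_likelihoods[cl]
        # Assumption: words unknown to the model are simply ignored.
        scores[cl] = log_prior + sum(word_logs[w] for w in tokens if w in word_logs)
    return max(scores, key=scores.get), scores


# Made-up parameters for two classes.
log_priors = {"apparel": math.log(0.5), "auto": math.log(0.5)}
log_likelihoods = {
    "apparel": {"shirt": math.log(0.6), "tire": math.log(0.1), "set": math.log(0.3)},
    "auto": {"shirt": math.log(0.1), "tire": math.log(0.6), "set": math.log(0.3)},
}

label, scores = predict(["tire", "set", "bike"], log_priors, log_likelihoods)
```

Because the logarithm is monotonic, the class with the highest log score is also the class with the highest probability.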
17. Getting cell probabilities
(Model table repeated from slide 12.)
Priors, with $n_{cl}$ the number of shops labelled $cl$ and $n$ the total number of shops:
$P(cl) = \frac{n_{cl}}{n}$
Word likelihoods, with $n_{cl,wd}$ the count of word $wd$ in class $cl$:
$P(wd \mid cl) = \frac{n_{cl,wd}}{\sum_{wd' \in vocab} n_{cl,wd'}} \approx \frac{n_{cl,wd} + 1}{\sum_{wd' \in vocab} (n_{cl,wd'} + 1)} = \frac{n_{cl,wd} + 1}{\sum_{wd' \in vocab} n_{cl,wd'} + |vocab|}$
The +1 deals with P(wd|cl)=0, which makes P(cl|doc)=0 regardless of other words.
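The estimates above, including the add-one smoothing, can be sketched in plain Python. The function name `train` and the `alpha` parameter are illustrative assumptions; the five toy shops follow the running example, and the production model in the talk was trained with library tooling rather than this code.

```python
from collections import Counter, defaultdict


def train(docs, labels, alpha=1):
    """Estimate priors and add-alpha smoothed word likelihoods per class."""
    vocab = sorted({w for doc in docs for w in doc})
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, cl in zip(docs, labels):
        word_counts[cl].update(doc)

    priors = {cl: count / len(labels) for cl, count in class_counts.items()}
    likelihoods = {}
    for cl in class_counts:
        # Denominator: all word occurrences in the class plus alpha * |vocab|.
        total = sum(word_counts[cl].values()) + alpha * len(vocab)
        likelihoods[cl] = {w: (word_counts[cl][w] + alpha) / total for w in vocab}
    return priors, likelihoods


docs = [
    ["nice", "octopolo", "shirt"],
    ["nice", "hat", "nice", "shirt"],
    ["set", "tire"],
    ["beef", "beef"],
    ["tire", "set", "bike"],
]
labels = ["Apparel", "Apparel", "Auto", "Food", "Auto"]
priors, likelihoods = train(docs, labels)
```

With smoothing, a word never seen in a class (for example "beef" in Apparel) gets a small non-zero probability instead of zeroing out the whole product.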
21. Code
class LabeledDataFilter:
    ...

class Featurizer:
    ...

class Trainer:
    ...

class Evaluator:
    ...

class Predictor:
    ...

class verticalPredictor:
    # uses Featurizer() and Predictor()
    ...

Training job (every 7 days): product_data -> model, accuracy
Prediction job (every day): product_data, model -> shop+industry
22. Change in Training Set
• Start of home card
• Allowed asking for Industry in a voluntary way
• Quickly grew to 50K shops
• Advantage: growing over time
• Issue: training set is not fully random
23. Shop Dimension
In the Data Warehouse, updated daily
Shop Name
Shop URL
Shop Address
Shop City
…
Shop Predicted Industry
…
24. Results
Shops    top category  turker 1  turker 2  turker 3
Chive    Apparel       Apparel   Apparel   Art
Lackers  Sports        Sports    Apparel   Sports
Tesla    Auto          Auto      Auto      Sports
...      ...           ...       ...
60-80%
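In the three example rows, the top category matches the label chosen by at least two of the three turkers. Assuming it was derived by majority vote (the slides do not state the rule), a sketch could look like this; `top_category` is a hypothetical helper name.

```python
from collections import Counter


def top_category(votes):
    """Return the label chosen by a strict majority of voters, else None."""
    (label, count), = Counter(votes).most_common(1)
    return label if count > len(votes) / 2 else None


chive = top_category(["Apparel", "Apparel", "Art"])
lackers = top_category(["Sports", "Apparel", "Sports"])
tesla = top_category(["Auto", "Auto", "Sports"])
no_agreement = top_category(["Auto", "Food", "Sports"])
```

Returning `None` when no label has a majority is one possible policy; such shops could also be dropped from the training set or sent for more votes.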
25. Results
Shops    top category  turker 1  turker 2  turker 3  algo top1  algo top2  algo top3
Chive    Apparel       Apparel   Apparel   Art       Apparel    Sport      Art
Lackers  Sports        Sports    Apparel   Sports    Sports     Apparel    Food
Tesla    Auto          Auto      Auto      Sports    Fashion    Auto       Electro
...      ...           ...       ...
60-80% ~65%
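One common way to score ranked predictions such as algo top1 to top3 is top-k accuracy: a prediction counts as correct if the true label appears among the first k guesses. Whether the percentages on the slide were computed this way is an assumption; the sketch below only uses the three example rows, and `top_k_accuracy` is a hypothetical helper.

```python
def top_k_accuracy(truth, ranked_predictions, k):
    """Share of items whose true label is among the first k predictions."""
    hits = sum(
        1 for label, ranked in zip(truth, ranked_predictions) if label in ranked[:k]
    )
    return hits / len(truth)


# The three example rows: top category vs. the algorithm's ranked guesses.
truth = ["Apparel", "Sports", "Auto"]
ranked = [
    ["Apparel", "Sport", "Art"],
    ["Sports", "Apparel", "Food"],
    ["Fashion", "Auto", "Electro"],
]

top1 = top_k_accuracy(truth, ranked, k=1)
top2 = top_k_accuracy(truth, ranked, k=2)
```

On these rows the Tesla shop is missed at k=1 but recovered at k=2, which is why accuracy rises with k.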
26. Results
Shops    top category  turker 1  turker 2  turker 3  algo top1  algo top2  algo top3
Chive    Apparel       Apparel   Apparel   Art       Apparel    Sport      Art
Lackers  Sports        Sports    Apparel   Sports    Sports     Apparel    Food
Tesla    Auto          Auto      Auto      Sports    unknown    Auto       Electro
...      ...           ...       ...
90%
~75%
27. Business Use
Management or product teams:
• What are the biggest industries by shop count and by sales made?
• How does that evolve over time?
Theme team:
• We want to develop new themes for a given vertical; can we see the top stores in this vertical to understand trends?
Event team:
• We want to be part of an event in the music business; can we get interesting shops in this field?
28. Could be improved
● More metrics: add multiclass precision/recall
○ Now available in MLlib
● Better performance: rerun for combinations of parameters
○ Also added recently to MLlib, but missing some components
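As an illustration of the first improvement, per-class precision and recall can be computed directly from true and predicted labels. This is a plain-Python sketch with made-up labels, not the MLlib implementation the slide refers to; `per_class_precision_recall` is a hypothetical helper name.

```python
from collections import Counter


def per_class_precision_recall(y_true, y_pred):
    """Return {label: (precision, recall)} for a multiclass problem."""
    true_pos, predicted, actual = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        actual[t] += 1
        predicted[p] += 1
        if t == p:
            true_pos[t] += 1

    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        precision = true_pos[label] / predicted[label] if predicted[label] else 0.0
        recall = true_pos[label] / actual[label] if actual[label] else 0.0
        metrics[label] = (precision, recall)
    return metrics


# Made-up labels for four shops.
y_true = ["Apparel", "Sports", "Auto", "Auto"]
y_pred = ["Apparel", "Sports", "Apparel", "Auto"]
metrics = per_class_precision_recall(y_true, y_pred)
```

Per-class numbers show which verticals are confused with each other, something a single overall accuracy figure hides.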