This document summarizes a presentation about solutions for the Kaggle Avito Demand Prediction Challenge. It describes:
1. The challenge involved predicting demand for online advertisements based on description, context, and history. Scores were evaluated on a 0 to 1 scale for likelihood of an ad selling.
2. The presentation covered the competition pipeline, best single model using LightGBM, diverse models, the top neural network model, and an ensemble.
3. Feature engineering included text, image, aggregation, and interaction features. Text features included TF-IDF, SVD, and word embeddings. Image features included meta and pre-trained prediction values.
This deck focuses mainly on BERT; XLNet and RoBERTa are not covered in as much detail.
Also note that my own figures flow top to bottom, while figures borrowed from the papers flow bottom to top.
If you spot any mistakes, please let me know and I will fix them.
(In particular, I am a little worried that I may have misread parts of the RoBERTa paper's English.)
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
XLNet: Generalized Autoregressive Pretraining for Language Understanding
RoBERTa: A Robustly Optimized BERT Pretraining Approach
When machine learning is deployed in the real world, models often go unused despite high predictive accuracy because they are black boxes.
These slides explain the LIME paper, which was devised to provide the per-prediction explanations that machine learning models themselves are bad at giving.
Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. "'Why Should I Trust You?': Explaining the Predictions of Any Classifier." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016.
A rough overview of hyperparameter search with Bayesian optimization.
The paper this presentation is based on:
Bergstra, James, et al. "Algorithms for Hyper-Parameter Optimization." Advances in Neural Information Processing Systems 24 (NIPS 2011).
https://hal.inria.fr/hal-00642998/
Webinar: Schema Patterns and Your Storage Engine (MongoDB)
How do MongoDB’s different storage options change the way you model your data?
Each storage engine (WiredTiger, the In-Memory Storage Engine, MMAPv1, and other community-supported engines) persists data differently: each writes data to disk in a different format and handles memory resources in its own way.
This webinar covers how to design applications around different storage engines based on your use case and data access patterns. We will look at concrete examples of schema design practices that were previously applied on MMAPv1 and examine whether those practices still apply to other storage engines like WiredTiger.
Topics for review: Schema design patterns and strategies, real-world examples, sizing and resource allocation of infrastructure.
AngularJS is a modern JavaScript MVC framework built from the ground up by a team of Googlers and sponsored by Google itself. AngularJS gives web developers a clear separation between logic and view, and greatly improves code reuse through features such as Directives, Services, and Components. AngularJS's smart templating engine also helps minimize HTML code. During the presentation, you'll learn some medium-to-advanced uses of AngularJS, how to use it, and tips & tricks that will make your app amazing.
We have just rolled out the Blunt umbrella site for the US in conjunction with the Tile launch. https://www.thetileapp.com
This US build was also part of a larger infrastructure setup to cater for the other markets.
We deployed the US site first; the NZ and global sites followed shortly after.
USA: bluntumbrellas.com/us
NZ: bluntumbrellas.com/nz
Global: bluntumbrellas.com
The site had some interesting challenges with the type of content to be displayed within a single page. For example, every content page is a Basic Page.
Site Features:
Mega Menu
Background Video
jQuery UI
Several Sliders
Drupal Commerce
Picture and Breakpoints
In this talk I'll cover:
Business needs
Site and platform architecture
Theming
Picture and Breakpoints module
Hover state menu
Drupal Commerce
Memcache gotchas
How To Build a Multi-Field Search Page For Your XPages Application (Michael McGarel)
This is a five-minute presentation from the 2013 IBM Connect conference. I show one way to build a faceted search page using IBM's XPages platform as part of the annual SpeedGeeking session. It includes sample code and links to the project I posted on OpenNTF.org.
Development on the Salesforce platform continues to become more JavaScript-centric. One of the most popular JavaScript frameworks in use, AngularJS, has undergone major changes in the upcoming Angular 2 release.
Ibis: Seamless Transition Between Pandas and Apache Spark (Databricks)
Pandas is the de facto standard (single-node) Data Frame implementation in Python. However, as data grows larger, pandas no longer works very well due to performance reasons.
Oleh Zasadnyy, "Progressive Web Apps: the line between web and native apps becomes ..." (IT Event)
Over the years, developers grew used to thinking that the web is not as user-friendly, performant, or powerful as native apps. But things have changed: now you can build offline applications with notifications, Bluetooth and camera access, and more. Web development is great again.
- Quick startup - I will show how to prioritize content loading in the application to show users meaningful pixels as soon as possible
- Progressive enhancement - I will encourage you to use maximum of the platform but still support earlier browsers
- Offline application - here I will explain how you can easily make your web application working offline
- Push Notifications - one of the best ways to increase conversion of your application, and now it's possible on the web. I am going to show how to do it right in a few steps.
- Experimental APIs - I will show how to sign in once on all your devices with the Credential API, use the native share menu, and make payments in a few clicks
How We Built the Private AppExchange App (Apex, Visualforce, RWD), Salesforce Developers
The AppExchange and Success Community team built a brand new app this year: the Private AppExchange. Join us and learn how the team built this managed package, the choices we made and why. We will talk about the AppExchange Search Framework that all three of these products are built upon and we will talk about how we made a responsive UI that works on whatever device you choose.
GDG Cloud Southlake #11: Steve McGhee, Reliability Theory and Practice
Steve McGhee talks about how to build reliable things on top of unreliable things. Steve was a Google SRE for 10 years, then left to help move a company onto the cloud, and came back to Google to help more customers do the same.
Recording on YouTube: https://youtu.be/YnjsYzCwTQI
Check out presos here: https://gdg.community.dev/gdg-cloud-southlake/
Opendatabay - Open Data Marketplace (Opendatabay)
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
The first open hub for data enthusiasts to collaborate and innovate: a platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, Opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. It leverages cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex: Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay also breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits, Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay: the marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
Levelwise PageRank with Loop-Based Dead End Handling Strategy: Short Report ... (Subhajit Sahu)
Abstract: Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components and processes them in topological order, one level at a time. This enables calculation of ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It comes, however, with a precondition: the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by the submission of a large number of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
3. Agenda
- Avito Demand Prediction Overview
- Competition Pipeline
- Best Single Model (LightGBM)
- Diverse Models
- Ynktk's Best NN
- Kohei's Ensemble
- China Competitions/Kagglers
- Q & A
6. Description
Prediction: predict demand for an online advertisement based on its full description (title, description, images, etc.), its context (geographically where it was posted, similar ads already posted), and historical demand for similar ads in similar contexts.
Avito is Russia's largest classified advertisements website.
8. Data Description
- ID: item_id, user_id
- Numeric: price
- Category: region, city, parent_category_name, user_type, category_name, param_1, param_2, param_3, image_top_1
- Text: title, description
- Image: image
- Sequence: item_seq_number
- Date: activation_date, date_from, date_to

Tables:
- Train/Test: user_id, ..., item_id, target
- Active: user_id, ..., item_id (supplemental data: the train columns minus deal_probability, image, and image_top_1)
- Periods: item_id, date_from, date_to
10. Pipeline
Timeline: one week (reading the description, kernels, and discussion; baseline design), one week (baseline), one month (feature engineering and tuning).

[Baseline]
1. Table data model
2. Text data model (to reduce wait time)
[Validation]
- KFold: 5
- Feature validation: one by one
- Validation score: 5-fold

[Feature Engineering]
- LightGBM (Table + Text + Image)
- Feature save: 1 feature, 1 pickle file
- Teammates' features reused
- Diverse models' OOF predictions
[Validation]
- KFold: 5
- Feature validation: one by one or by group
- Validation score: 1 fold
[Parameter Tuning]
- Manually
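The "1 feature, 1 pickle file" convention can be sketched as follows; the helper names and directory layout are assumptions for illustration, not the team's actual code:

```python
import os
import pickle


def save_feature(feature_dir, name, values):
    """Persist one engineered feature as its own pickle file."""
    with open(os.path.join(feature_dir, name + '.pkl'), 'wb') as f:
        pickle.dump(values, f)


def load_feature(feature_dir, name):
    """Load a single feature back for model training."""
    with open(os.path.join(feature_dir, name + '.pkl'), 'rb') as f:
        return pickle.load(f)
```

Keeping each feature in its own file makes it cheap to add, drop, or validate features one by one or by group, as the validation steps above describe.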
15. Supplemental Data Feature
- Calculate each item's up days:
all_periods['days_up'] = all_periods['date_to'].dt.dayofyear - all_periods['date_from'].dt.dayofyear
- Count and sum of each item's up days:
{'groupby': ['item_id'], 'target': 'days_up', 'agg': 'count'},
{'groupby': ['item_id'], 'target': 'days_up', 'agg': 'sum'},
- Merge to the main table:
df_all = df_all.merge(all_periods, on='item_id', how='left')
※ Digging into the business-relevant parts of the supplemental data is important.
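Putting the three steps above together, here is a minimal runnable sketch on toy data (the column names follow the slide; the toy values are illustrative):

```python
import pandas as pd

# Toy periods data standing in for the Avito periods table
all_periods = pd.DataFrame({
    'item_id': ['a', 'a', 'b'],
    'date_from': pd.to_datetime(['2017-03-01', '2017-03-10', '2017-03-05']),
    'date_to': pd.to_datetime(['2017-03-04', '2017-03-12', '2017-03-06']),
})

# 1. Each listing period's length in days
all_periods['days_up'] = (all_periods['date_to'].dt.dayofyear
                          - all_periods['date_from'].dt.dayofyear)

# 2. Count and sum of up-days per item
agg = all_periods.groupby('item_id')['days_up'].agg(['count', 'sum'])
agg.columns = ['item_id_count_days_up', 'item_id_sum_days_up']
agg = agg.reset_index()

# 3. Merge back onto the main table (items without periods get NaN)
df_all = pd.DataFrame({'item_id': ['a', 'b', 'c']})
df_all = df_all.merge(agg, on='item_id', how='left')
```

Item 'a' has two periods of 3 and 2 days, so it gets count 2 and sum 5, while item 'c' has no period rows and is left as NaN for later imputation.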
16. Impute Null Values
- Fill NA with 0:
df_all['price'].fillna(0)
- Fill NA with the per-category median:
enc = df_all.groupby('category_name')['item_id_count_days_up'].agg('median').reset_index()
enc.columns = ['category_name', 'count_days_up_impute']
df_all = pd.merge(df_all, enc, how='left', on='category_name')
df_all['item_id_count_days_up_impute'].fillna(df_all['count_days_up_impute'], inplace=True)
- Fill NA with a model's prediction value:
RNN(text) -> image_top_1 (renamed: image_top_2)
※ The magic feature we never found: df['price'] - df[RNN(text) -> price]
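The per-category median imputation above can be exercised end to end on toy data (column names are shortened for the example):

```python
import pandas as pd

df = pd.DataFrame({
    'category_name': ['cars', 'cars', 'toys', 'toys', 'toys'],
    'count_days_up': [2.0, None, 5.0, 7.0, None],
})

# Per-category median of the feature to impute (NaNs are skipped)
enc = df.groupby('category_name')['count_days_up'].agg('median').reset_index()
enc.columns = ['category_name', 'count_days_up_impute']

# Broadcast the medians back onto every row and fill the gaps
df = df.merge(enc, how='left', on='category_name')
df['count_days_up'] = df['count_days_up'].fillna(df['count_days_up_impute'])
```

The missing 'cars' row is filled with the cars median 2.0, and the missing 'toys' row with the toys median 6.0.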
18. Text Feature
- SVD for Title:
tfidf_vec = TfidfVectorizer(ngram_range=(1, 1))
full_title_tfidf = tfidf_vec.fit_transform(df_all['title'])  # fit on train + test titles (not shown on the slide)
train_title_tfidf = tfidf_vec.transform(train['title'])
test_title_tfidf = tfidf_vec.transform(test['title'])
svd_title_obj = TruncatedSVD(n_components=40, algorithm='arpack')
svd_title_obj.fit(full_title_tfidf)
train_title_svd = pd.DataFrame(svd_title_obj.transform(train_title_tfidf))
test_title_svd = pd.DataFrame(svd_title_obj.transform(test_title_tfidf))
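A self-contained check of the TF-IDF + TruncatedSVD pattern, with a tiny corpus and component count chosen for illustration:

```python
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

titles = pd.Series([
    'iphone 6 16gb', 'iphone 7 32gb', 'детская коляска', 'коляска для двойни',
])

# Sparse unigram TF-IDF over the titles
tfidf_vec = TfidfVectorizer(ngram_range=(1, 1))
title_tfidf = tfidf_vec.fit_transform(titles)

# Compress the sparse TF-IDF matrix into a few dense SVD components
svd = TruncatedSVD(n_components=2, algorithm='arpack')
title_svd = pd.DataFrame(svd.fit_transform(title_tfidf))
```

The result is one dense row per title, which can be concatenated to the table features for LightGBM.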
19. Text Feature
- Count Unique Feature:
for cols in ['text', 'title', 'param_123']:
    df_all[cols + '_num_cap'] = df_all[cols].apply(lambda x: count_regexp_occ('[А-ЯA-Z]', x))
    df_all[cols + '_num_low'] = df_all[cols].apply(lambda x: count_regexp_occ('[а-яa-z]', x))
    df_all[cols + '_num_rus_cap'] = df_all[cols].apply(lambda x: count_regexp_occ('[А-Я]', x))
    df_all[cols + '_num_eng_cap'] = df_all[cols].apply(lambda x: count_regexp_occ('[A-Z]', x))
    df_all[cols + '_num_rus_low'] = df_all[cols].apply(lambda x: count_regexp_occ('[а-я]', x))
    df_all[cols + '_num_eng_low'] = df_all[cols].apply(lambda x: count_regexp_occ('[a-z]', x))
    df_all[cols + '_num_dig'] = df_all[cols].apply(lambda x: count_regexp_occ('[0-9]', x))
    df_all[cols + '_num_pun'] = df_all[cols].apply(lambda x: sum(c in punct for c in x))
    df_all[cols + '_num_space'] = df_all[cols].apply(lambda x: sum(c.isspace() for c in x))
    df_all[cols + '_num_emo'] = df_all[cols].apply(lambda x: sum(c in emoji for c in x))
    df_all[cols + '_num_row'] = df_all[cols].apply(lambda x: x.count('\n'))  # '/n' on the slide was a typo
    df_all[cols + '_num_chars'] = df_all[cols].apply(len)  # number of characters
    df_all[cols + '_num_words'] = df_all[cols].apply(lambda comment: len(comment.split()))
    df_all[cols + '_num_unique_words'] = df_all[cols].apply(lambda comment: len(set(comment.split())))
    df_all[cols + '_ratio_unique_words'] = df_all[cols + '_num_unique_words'] / (df_all[cols + '_num_words'] + 1)
    df_all[cols + '_num_stopwords'] = df_all[cols].apply(lambda x: len([w for w in x.split() if w in stopwords]))
    df_all[cols + '_num_words_upper'] = df_all[cols].apply(lambda x: len([w for w in str(x).split() if w.isupper()]))
    df_all[cols + '_num_words_lower'] = df_all[cols].apply(lambda x: len([w for w in str(x).split() if w.islower()]))
    df_all[cols + '_num_words_title'] = df_all[cols].apply(lambda x: len([w for w in str(x).split() if w.istitle()]))
20. Text Feature
- Ynktk's Word Embeddings
  - Self-trained FastText:
model = FastText(PathLineSentences(train+test+train_active+test_active), size=300, window=5, min_count=5, word_ngrams=1, seed=seed, workers=32)
  - Self-trained Word2Vec:
model = Word2Vec(PathLineSentences(train+test+train_active+test_active), size=300, window=5, min_count=5, seed=seed, workers=32)
※ Embeddings trained on the competition's own text were more effective than embeddings pre-trained on Wikipedia and the like, probably because proper nouns such as product names were predictive of the target variable.
21. Image Feature
- Meta Features
  - Image_size, Height, Width, Average_pixel_width, Average_blue, Average_red, Average_green, Blurrness, Whiteness, Dullness
  - Dullness - Whiteness (interaction feature)
- Pre-trained Prediction Features
  - VGG16 prediction value
  - ResNet50 prediction value
- Ynktk's Features
  - Top finishers extracted image features with VGG and similar networks, but hand-crafted features were effective as well
  - NIMA [1]
  - Brightness, Saturation, Contrast, Colorfulness, Dullness, Blurriness, Interest Points, Saliency Map, Human Faces, etc. [2]
[1] Talebi, H., & Milanfar, P. (2018). NIMA: Neural Image Assessment
[2] Cheng, H. et al. (2012). Multimedia Features for Click Prediction of New Ads in Display Advertising
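Meta features like these can be hand-computed from raw pixels. A minimal NumPy sketch follows; the whiteness/dullness thresholds are assumptions for illustration, not the competition's exact formulas:

```python
import numpy as np


def image_meta_features(img):
    """Simple hand-crafted meta features for an RGB image.

    `img` is an H x W x 3 uint8 array. Thresholds below are illustrative
    assumptions, not the kernels' exact definitions."""
    img = img.astype(np.float64)
    brightness = img.mean(axis=2)  # per-pixel average over channels
    return {
        'height': img.shape[0],
        'width': img.shape[1],
        'average_red': float(img[..., 0].mean()),
        'average_green': float(img[..., 1].mean()),
        'average_blue': float(img[..., 2].mean()),
        # fraction of near-white / near-black pixels
        'whiteness': float((brightness > 245).mean()),
        'dullness': float((brightness < 10).mean()),
    }
```

Applied to every ad image, each returned key becomes one numeric column in the table feature set.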
22. Parameter Tuning
- Manual tuning across multiple servers
params = {
    'boosting_type': 'gbdt',
    'objective': 'xentropy',  # target behaves like a binary classification probability
    'metric': 'rmse',
    'learning_rate': 0.02,
    'num_leaves': 600,
    'max_depth': -1,
    'max_bin': 256,
    'bagging_fraction': 1,
    'feature_fraction': 0.1,  # sparse text vector
    'verbose': 1,
}
24. Best LightGBM Summary
- Table features: ~250
- Text features: 1,500,000+
- Image features: 50+
- Total features: 1,503,424
- Public LB: better than 0.2174
- Private LB: better than 0.2210
25. Diversity
- LightGBM: loss = xentropy, regression, huber, fair, auc; data = with/without active data; features = Table, Table+Text, Table+Text+Image, Table+Text+Image+Ridge_meta; parameters = learning_rate, num_leaves
- XGBoost: loss = reg:linear, binary:logistic; data = with/without active data; features = Table+Text, Table+Text+Image
- CatBoost: loss = binary_crossentropy; data = with active data; features = Table+Image
- Random Forest: loss = regression; data = with active data; features = Table+Text+Image
- Ridge Regression: loss = regression; data = without active data; features = Text, Table+Text+Image; parameters = TF-IDF max_features
- Neural network: loss = regression, binary_crossentropy; data = with active data; features = Table+Text+Image+word embeddings; parameters = layer size, dropout, BatchNorm, pooling; structures = rnn-dnn, rnn-cnn-dnn, rnn-attention-dnn
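Diverse models like these are typically combined by blending their out-of-fold predictions; a minimal weighted-average sketch (the weights here are illustrative, not the team's actual ensemble):

```python
import numpy as np


def blend(predictions, weights):
    """Weighted average of per-model prediction arrays, clipped to [0, 1]
    since deal_probability is a probability."""
    weights = np.asarray(weights, dtype=np.float64)
    weights = weights / weights.sum()  # normalize so weights sum to 1
    stacked = np.stack(predictions)    # shape: (n_models, n_samples)
    return np.clip(np.tensordot(weights, stacked, axes=1), 0.0, 1.0)


lgb_pred = np.array([0.2, 0.8, 0.5])
nn_pred = np.array([0.4, 0.6, 0.7])
blended = blend([lgb_pred, nn_pred], weights=[0.7, 0.3])
```

In practice the weights would be chosen by optimizing RMSE on the OOF predictions rather than fixed by hand.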
26. Ynktk's Best NN
Four input branches, concatenated into a shared head:
- Numerical: Dense
- Categorical: Embedding
- Image
- Text: Embedding → SpatialDropout → LSTM / GRU / Conv1D with LeakyReLU → GAP / GMP → Concat
Head: Concat → BatchNorm → LeakyReLU → Dropout → Dense → LeakyReLU → BatchNorm → Dense
*callbacks
- EarlyStopping
- ReduceLROnPlateau
*optimizer
- Adam with clipvalue 0.5
- Learning rate 2e-03
*loss
- Binary cross-entropy
Priv LB: 0.2225
Pub LB: 0.2181
29. China Competitions & Platform
- Platform: Kaggle | China: Tianchi, Tencent, Jdata, Kesci, DataCastle, Biendata, DataFountain, ... [1]
- Rounds: Kaggle: Round 1, 2~3 months, Public/Private LB | China: Round 1: 1.5 months Public, 3 days Private; Round 2: 2 weeks Public, 3 days Private; Round 3: presentation
- Submissions/day: Kaggle: 5 | China: Public 3, Private 1
- Prize: Kaggle: top 3 | China: top 5/10/50
[1] https://github.com/iphysresearch/DataSciComp