How we won the 2nd place In KaggleDays Championship.pptx

jficwwc

How We Won 2nd Place
in the Kaggle-Days Championship
CKS Team - Cohen, Cohen, Katz & (Gold)Shlager

Kaggle-Days 2022 World Championship - the numbers
● 13 competitions (tournaments)
● 36 teams made it to the finals
● 2 simultaneous competitions
● 17 Grandmasters, including 5 from the top 10
● 1 team from the Middle East :)
● 2nd place!

First competition - Don't stop until you drop!
Predict the yoga pose correctly!
● Training dataset of 2,360 images
● 6 yoga pose targets
● 4 hours of competition - not enough time to clean the data
● Images from various sources, partly synthetic

How can a model understand an image?
● Classical methods (no machine learning)
○ Edge detection, contrast, RGB analysis, ...
● Traditional machine learning methods
○ Fully connected neural networks
● A little more advanced
○ CNNs
● State-of-the-art (SOTA)
○ Transformers

First competition - Don't stop until you drop!
Our approach - 1st place solution
● Strong augmentation
● SOTA models - SwinTransformer large and a hybrid EfficientNet + SwinTransformer
● Ensemble

First competition - Don't stop until you drop!
Our approach - results
model            Private Score   Public Score   CV score
swin-l-5-folds   0.96699         0.95364        0.9661
swin-b-5-folds   0.95379         0.94260        0.9597
ensemble         0.96039         0.95805        0.9632
https://github.com/OrKatz7/1st-place-Don-t-stop-until-you-drop
● Trust your CV!!

Competition #1
Data:
Tabular data of:
● Datetime
● Location
● Sensor Data
● Country
● Country’s population
Free text:
● Observer description
Target:
Sky clearness rate

Our approach
EDA & Data Cleaning
● Correcting the data
● Reasonably fill NaNs
Tabular Feature Extraction
● Time-extracted features
● Seeing the bigger picture
● Population’s temporal statistics

Country’s Population
Country Population’s Temporal Features

Free text analysis
● Initial Approach
Represent the observer description free text with TF-IDF
● Advanced Approach
Extract an embedding vector from an NLP transformer
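
As a rough illustration of the initial TF-IDF approach, the sketch below computes plain tf * idf weights with the Python standard library only. The tokenization (lowercase, whitespace split), the unsmoothed idf, and the sample descriptions are all assumptions made for this example; the slides do not say which implementation or settings the team used.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (plain tf * idf, no smoothing)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

# Hypothetical observer descriptions, invented for illustration.
descriptions = [
    "clear sky many stars",
    "cloudy sky no stars",
    "clear night bright moon",
]
vectors = tfidf(descriptions)
```

In this toy corpus, a word that appears in a single description ("many") gets a higher weight than a word shared by two descriptions ("sky"), which is the behaviour TF-IDF is meant to provide.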

Competition #2
Data:
Python notebook code, split by cells
Metadata: author, size,
Target:
Running time
(gaps in gaps?)

Analyzing the carbon footprint generated by training models

Our secret sauce - Killer dataset (before modeling)
AutoEDA and data cleansing
Massive feature extraction using human expertise and automated techniques
Different methods for encoding categorical features
Missing-value imputation
Combining different methods for feature selection
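
Two of the dataset-preparation steps listed above can be sketched in a few lines. The slides do not name the specific techniques, so median imputation and frequency encoding are used here purely as assumed examples, and the column values are invented.

```python
from collections import Counter
from statistics import median

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def frequency_encode(categories):
    """Map each category to its relative frequency in the column."""
    counts = Counter(categories)
    n = len(categories)
    return [counts[c] / n for c in categories]

# Hypothetical "size" and "author" columns, invented for illustration.
sizes = impute_median([10, None, 30, 20, None])
authors = frequency_encode(["a", "b", "a", "c"])
```

Trying several encodings and imputation strategies and keeping whichever improves the cross-validation score is one plausible reading of "different methods" on the slide.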


7. Attention

8. Transfer Learning

9. Layer 2

21. Stacking: Layer 1, Layer 2, Layer 3

Editor's Notes

Kaggle is an international platform and community for sharing knowledge in Data Science. It includes: sharing of public datasets, sharing of data science code, a forum, and data science competitions. A typical Kaggle competition lasts about 3 months, and the people who win are usually those who invest more or less all of their time in it.
Kaggle Days is a separate platform that collaborates with Kaggle and organizes DS competitions around the world (both online and in person). In 2022 they announced a Data Science world championship event, structured as follows:
● Starting in November, there is an online competition every month (13 preliminary competitions in total)
● Each month the competition is on a different topic (CV, NLP, Tabular, Time-Series...)
● The top 3 teams from each competition advance to the final
● The final takes place in October, in person, in Barcelona
36 teams reached the final, and among them were 17 Kaggle grandmasters!!! (briefly explain what Kaggle grandmaster means)
We put together a team of 4 people from the same lab at Ben-Gurion University: Sefi Cohen, a PhD student and my master's advisor; Nurit Cohen, a PhD student; and Or Katz, a master's student. Our team was the only team from the Middle East, and of course from Israel.
The day before the competition there was a conference, where they revealed the surprise that the final would consist of 2 competitions running in parallel for 11 hours.
We took part in the first competition, which was virtual.
Explain the complexity of the data we received:
● Image data is much heavier and requires enormous resources to do really interesting things with it
● A given pose sometimes looks different on different people
● Images that contain several people
● Different data sources (both synthesized images and "very" real images: non-uniform image quality, cropping, ...)
The model has to learn from all of this.
Explain the complexity of image processing and give the example: "You cannot write a fixed, predetermined algorithm made of if-else statements that identifies which digit is written in handwriting. We need to build a representation of what is given as input and analyze it." Explain a little about how the vision field developed. Transformers: explain the idea of Attention in general terms. Explain that it was originally studied in NLP and that the mechanism was later adapted to vision (splitting an image into patches).
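
The patch idea mentioned in this note can be sketched as follows: the image is cut into non-overlapping squares, and each square is flattened into one vector that a transformer would treat as a token. This is a minimal standard-library sketch on a toy single-channel image; the function name and the 2x2 patch size are assumptions for illustration, and real vision transformers also apply a learned projection and position information that are not shown here.

```python
def split_into_patches(image, patch_size):
    """Split a 2-D image (list of rows) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    # Walk the image in row-major order, one patch_size x patch_size block at a time.
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches

# A toy 4x4 single-channel "image" with pixel values 0..15.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = split_into_patches(image, patch_size=2)
```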
Explain very briefly what SwinTransformer is and how it improves on ViT, which was the first use of transformers in CV. Explain the architecture of our solution: extracting embeddings from EfficientNet and feeding them, as additional features together with the image, into a SwinTransformer model. Explain that the competition relied heavily on resources and on very fast implementation.
Explain what cross-validation means and, in general, how models are validated before a submission is uploaded. Explain how submissions work in Kaggle competitions: Public Leaderboard, Private Leaderboard... We won first place...
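
A minimal sketch of the cross-validation split this note refers to. The 5 folds and the 2,360 training images come from the slides ("5-folds", "2360 images"); the shuffling, the seed, and the function name are assumptions, and the actual solution may well have used a stratified split from a library instead.

```python
import random

def k_fold_indices(n_samples, n_folds, seed=0):
    """Yield (train_indices, valid_indices) pairs for shuffled k-fold CV."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Every n_folds-th shuffled index goes to the same fold.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, valid in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, valid

splits = list(k_fold_indices(n_samples=2360, n_folds=5))
```

Each sample is used for validation exactly once, so the averaged validation score is an estimate of the private leaderboard score that does not depend on the public leaderboard.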
We were given 2 competitions at the final event.
Explain the background of the first competition and its goal. Explain the varied data we received. Explain that this competition combines Tabular Data, Time-Series, and NLP.
First of all, do EDA. Studying the data reveals interesting things. Fill in missing values using common sense. Smart feature extraction sets you apart from the others, and there is plenty of room for creativity:
● Is the observation day a weekend?
● Time of day of the observation (we split it into bins of morning/noon/evening)
● Widening the GPS point of the observation to a larger area because of the nature of the problem (light pollution does not affect only a single point in space but a larger range). This feature contributed a lot for us.
We treat the whole thought process behind adding a feature as an experiment, and by managing the experiments properly we keep improving. The goal is to study the data in depth, raise hypotheses, prove them, and uncover hidden phenomena, in order to represent the distribution of the data in the problem domain as faithfully as possible. All of this comes only from the data we are given.
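
The three features named in this note could look roughly like the sketch below. The bin boundaries, the Saturday/Sunday weekend definition, the 1-degree grid size, and all function names are assumptions for illustration; the notes do not give the values the team actually used.

```python
import math
from datetime import datetime

def time_features(ts):
    """Weekend flag and coarse part-of-day bin for an observation timestamp."""
    hour = ts.hour
    # Bin boundaries are assumed, not taken from the source.
    if 5 <= hour < 12:
        part = "morning"
    elif 12 <= hour < 18:
        part = "noon"
    else:
        part = "evening"
    # weekday() returns 5 for Saturday and 6 for Sunday (assumed weekend definition).
    return {"is_weekend": ts.weekday() >= 5, "part_of_day": part}

def grid_cell(lat, lon, cell_deg=1.0):
    """Map a GPS point to the coarse grid cell that contains it."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))

features = time_features(datetime(2022, 10, 29, 21, 30))
cell = grid_cell(41.39, 2.17)
```

Observations that fall in the same grid cell can then share aggregated statistics, which is one way to capture the idea that light pollution affects an area rather than a single point.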
Making use of the time series data we received: smart feature extraction that takes the temporal nature of the data into account.
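
One way to read "temporal features" for a country's population series is lag, growth, and rolling-mean features, sketched below. The choice of these particular statistics, the window of 3, and the sample numbers are assumptions; the slides only say that temporal statistics of the population were used.

```python
def temporal_features(series, window=3):
    """Lag, growth, and rolling-mean features for values in chronological order."""
    rows = []
    for i, value in enumerate(series):
        prev = series[i - 1] if i > 0 else None
        # Trailing window that ends at the current point.
        recent = series[max(0, i - window + 1): i + 1]
        rows.append({
            "value": value,
            "lag_1": prev,
            "growth": None if prev is None else (value - prev) / prev,
            "rolling_mean": sum(recent) / len(recent),
        })
    return rows

# Hypothetical yearly population values (in millions), invented for illustration.
population = [100.0, 110.0, 121.0, 133.1]
features = temporal_features(population)
```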
At first we neglected this information completely and focused on other things. Just as we understood how a model can handle an image, let us now understand how a model can handle free text.
TF-IDF (term frequency-inverse document frequency): in information retrieval, tf-idf (also written TF*IDF, TFIDF, or Tf-idf) is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval searches, text mining, and user modeling. The tf-idf value increases in proportion to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general.
Transformers are the SOTA methods. Explain how transformer-based language models are trained and how we can then use them to our advantage (Transfer Learning). Explain how we used them in the competition. Explain very briefly what SVD is and why we used Truncated SVD.
The second competition, which took place in parallel with the one we just talked about. What data we received (both NLP and Time-Series) and what the goal was.
c-TF-IDF can best be explained as a TF-IDF formula adapted to multiple classes by joining all the documents of each class. Each class is thus converted into a single document instead of a set of documents. The frequency of each word x is extracted for each class c and is l1-normalized; this is the term frequency. The term frequency is then multiplied by the IDF, which is the logarithm of 1 plus the average number of words per class, A, divided by the frequency of word x across all classes.
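
Written out as a formula, the description above corresponds to the following, where the symbol names W, tf, and f are chosen here for illustration and A is the average number of words per class, as in the note:

```latex
W_{x,c} = \mathrm{tf}_{x,c} \cdot \log\left(1 + \frac{A}{f_x}\right)
```

Here tf_{x,c} is the l1-normalized frequency of word x in class c, and f_x is the frequency of word x across all classes.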
Only deep study of the data makes it possible to enrich it and make it much higher in quality.
The magic behind everything: aggressive ensembling. Explain a little about what an ensemble is and why it works. Explain a strong ensemble method: stacking. Explain the final step: manual weighting of the successful models based on CV and LB.
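
The final step described in this note, manual weighting of model outputs, can be sketched as a weighted average. This is not full stacking (which trains a meta-model on out-of-fold predictions); it only illustrates the blending step, and the model outputs and the 3:1 weights are invented for the example.

```python
def blend(predictions, weights):
    """Weighted average of per-model predictions (one list per model)."""
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    n_samples = len(predictions[0])
    return [
        sum(w * preds[i] for preds, w in zip(predictions, weights)) / total
        for i in range(n_samples)
    ]

# Hypothetical predictions from two models on three samples.
model_a = [0.2, 0.8, 0.5]
model_b = [0.4, 0.6, 0.9]
blended = blend([model_a, model_b], weights=[3, 1])
```

In practice the weights would be chosen by comparing the blended predictions against the CV score (and, more cautiously, the public leaderboard), rather than fixed in advance.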
