Data Science for Cars and Cattle - (Predictive Analytics / Data Science In A Nutshell) – by Zinayida Kensche
1. DATA SCIENCE FOR CARS AND CATTLE
ZINAYIDA KENSCHE, PHD, DATA SCIENTIST AT SPOTLIGHT @SAP
DATA SCIENCE IN A NUTSHELL
23.06.20
2. A DATA SCIENCE PROJECT
• WHAT IS THE PROBLEM?
• DATA COLLECTION
• WHAT FEATURES WE HAVE + WHAT FEATURES ARE REASONABLE
• BASE MODEL AND OTHER MODELS
• UPDATES / EVALUATIONS / ADAPTATIONS
• CELEBRATE VICTORIES AND LEARN FROM MISTAKES
4. CARS: DATA COLLECTION
• MANUFACTURER & MODEL & VERSION: VW PASSAT VARIANT 2.0 TDI
• FUEL: DIESEL
• MOTOR: 150 PS
• COLOR: BLACK
• MILEAGE: 200.150KM
• FIRST REGISTRATION: 02/2016
• …
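A minimal sketch of how one scraped ad could be held in code, assuming the fields listed on the slide. The class name `CarAd`, the field names, and both parsing helpers are hypothetical, and reading "200.150KM" as 200,150 km (German thousands separator) is an assumption.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CarAd:
    """One car advertisement; field names are illustrative, not from the talk."""
    manufacturer: str
    model: str
    version: str
    fuel: str
    power_ps: int
    color: str
    mileage_km: int
    first_registration: date  # ads give month/year only, so the day is set to 1


def parse_mileage(text: str) -> int:
    """Parse a mileage string such as '200.150KM' (assumed thousands separator)."""
    cleaned = text.upper().replace("KM", "").replace(".", "").replace(",", "")
    return int(cleaned.strip())


def parse_first_registration(text: str) -> date:
    """Parse 'MM/YYYY' into a date on the first day of that month."""
    month, year = text.split("/")
    return date(int(year), int(month), 1)


ad = CarAd(
    manufacturer="VW",
    model="Passat",
    version="Variant 2.0 TDI",
    fuel="Diesel",
    power_ps=150,
    color="Black",
    mileage_km=parse_mileage("200.150KM"),
    first_registration=parse_first_registration("02/2016"),
)
```

Normalizing mileage and registration date to typed values at collection time makes the later sorting and comparison steps straightforward.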
5. CARS: DO TWO ADS REPRESENT ONE CAR?
• PREPARE YOUR DATA: FILTER, SORT, PRIORITIZE
• SELECT ONLY THOSE CARS THAT HAVE A HIGH PROBABILITY TO REPRESENT ONE CAR
• SORT BY FIRST REGISTRATION, DATE OF APPEARANCE AND MILEAGE
• START WITH A SIMPLE MODEL
• CLASSIFICATION:
• A VECTOR WITH SIMILARITIES BETWEEN TWO ADS
• TWO ADS REPRESENT ONE CAR
• TWO ADS REPRESENT DIFFERENT CARS
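The classification input described above, a vector of similarities between two ads, might be built roughly as follows. This is a sketch under assumptions: the dictionary keys, the choice of exact-match features, and the relative-difference formula for mileage are illustrative and not taken from the talk.

```python
def similarity_vector(ad_a: dict, ad_b: dict) -> list[float]:
    """Return per-feature similarities in [0, 1] for a pair of ads."""

    def exact(key: str) -> float:
        # 1.0 when both ads agree on the field, else 0.0
        return 1.0 if ad_a[key] == ad_b[key] else 0.0

    # Mileage rarely matches exactly (the car is driven between ads),
    # so use relative closeness instead of equality.
    hi = max(ad_a["mileage_km"], ad_b["mileage_km"])
    if hi == 0:
        mileage_sim = 1.0
    else:
        mileage_sim = 1.0 - abs(ad_a["mileage_km"] - ad_b["mileage_km"]) / hi

    return [
        exact("model"),
        exact("fuel"),
        exact("power_ps"),
        exact("color"),
        exact("first_registration"),
        mileage_sim,
    ]


# Made-up example ads, for illustration only.
ad_1 = {"model": "VW Passat Variant 2.0 TDI", "fuel": "Diesel", "power_ps": 150,
        "color": "Black", "first_registration": "02/2016", "mileage_km": 200150}
ad_2 = {"model": "VW Passat Variant 2.0 TDI", "fuel": "Diesel", "power_ps": 150,
        "color": "Black", "first_registration": "02/2016", "mileage_km": 201000}
ad_3 = {"model": "VW Passat Variant 2.0 TDI", "fuel": "Diesel", "power_ps": 150,
        "color": "White", "first_registration": "07/2014", "mileage_km": 90000}

same_car_candidate = similarity_vector(ad_1, ad_2)
different_car_candidate = similarity_vector(ad_1, ad_3)
```

Each labeled pair of ads then becomes one training row: the similarity vector is the input and "one car" or "different cars" is the class.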
6. CARS: NEXT STEPS
• USE LOGISTIC REGRESSION AS BASE MODEL, TRY OTHER MODELS
• FEATURE ENGINEERING
• MORE DATA: FEATURES, LABELED ENTITIES
• AUTOMATE, MONITOR, ADAPT
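To make the base model concrete, here is a small from-scratch logistic regression trained with batch gradient descent on similarity vectors. It is a sketch, not the speaker's implementation: the training rows and labels are made up, the hyperparameters are arbitrary, and in practice a library implementation would normally be used.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, bias, x) -> float:
    """Probability that the pair of ads described by x is the same car."""
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)


def fit_logistic_regression(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, n_features = len(X), len(X[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(weights, bias, xi) - yi
            for j, v in enumerate(xi):
                grad_w[j] += err * v
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Made-up similarity vectors: [model, color, registration, mileage closeness]
X_train = [
    [1, 1, 1, 0.99], [1, 1, 1, 0.97], [1, 1, 0, 0.98],  # same car
    [1, 0, 1, 0.60], [0, 1, 0, 0.40], [1, 0, 0, 0.10],  # different cars
]
y_train = [1, 1, 1, 0, 0, 0]

weights, bias = fit_logistic_regression(X_train, y_train)
p_same = predict_proba(weights, bias, [1, 1, 1, 0.99])
p_diff = predict_proba(weights, bias, [1, 0, 0, 0.10])
```

The learned weights give a first indication of which similarities matter most, which is useful before trying other models or engineering more features.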
7. CARS: LESSONS LEARNED
• + START SIMPLE
• + QUICK SOLUTION
• + PRIORITIZING FEATURES
• - LABELLED DATA IS COSTLY
• - DATA NEED TO BE PRE-FILTERED
• - MORE INTUITIVE APPROACH?!
8. CATTLE: WHEN IS IT READY TO BE FERTILIZED?
• PERIOD TRACKING APPS: CLUE, FLOW, OVIA, EVE BY GLOW, …
• BIRTH – 11 MONTHS (FIRST HEAT) – HEAT – … – HEAT & INSEMINATION – PREGNANCY – CALVING
• REPEATS 3-6 TIMES
9. CATTLE: DS PROJECT SET-UP
• DATA COLLECTION AND SYNTHETIC DATA GENERATION
• SYSTEMATIC SAMPLING & FEATURE ENGINEERING: FREQUENCY OF PEAKS
• HUNTING OUTLIERS WITH SVM
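The first two set-up steps can be sketched as follows: generate a synthetic signal, thin it by systematic sampling, and count peaks per window as a feature. Everything specific here is an assumption for illustration: the slides do not say which sensor signal was used, so the "activity" signal, the spike probabilities, the threshold, and all function names are hypothetical.

```python
import random


def synthetic_activity(n_points, heat_start, heat_len, seed=0):
    """Made-up signal: noisy baseline with frequent spikes in a 'heat' window."""
    rng = random.Random(seed)
    signal = []
    for t in range(n_points):
        value = rng.gauss(1.0, 0.1)
        in_heat = heat_start <= t < heat_start + heat_len
        if rng.random() < (0.5 if in_heat else 0.02):
            value += 3.0  # spike on top of the baseline
        signal.append(value)
    return signal


def systematic_sample(values, step, offset=0):
    """Keep every step-th reading, starting at offset."""
    return values[offset::step]


def count_peaks(values, threshold):
    """Count local maxima that exceed the threshold."""
    return sum(
        1
        for i in range(1, len(values) - 1)
        if values[i] > threshold
        and values[i] > values[i - 1]
        and values[i] >= values[i + 1]
    )


def peak_frequency(values, window, threshold):
    """Peaks per consecutive, non-overlapping window of readings."""
    return [
        count_peaks(values[start:start + window], threshold)
        for start in range(0, len(values) - window + 1, window)
    ]


signal = synthetic_activity(n_points=300, heat_start=100, heat_len=50)
features = peak_frequency(signal, window=50, threshold=2.5)
thinned = systematic_sample(signal, step=5)
```

The per-window peak counts are the kind of feature an outlier detector could then be run on, with windows of unusually frequent peaks as the candidates of interest.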
10. CATTLE: LESSONS LEARNED
• + FILTER REAL-TIME DATA
• + NOT ENOUGH DATA – GENERATE IT
• - DO NOT TRUST YOUR DATA
• - ONE DATA SOURCE IS NOT ENOUGH
http://dilbert.com/strip/2004-12-26