A prototype of a real estate search engine. We leverage machine learning techniques to give users a better experience and help them make good decisions.
#WeAreAnts (http://weareants.fr/)
17. Immoviz - #WeAreAnts
Duplicate aggregation
Elasticsearch: text analysis, price comparison
PostgreSQL: ID comparison, price comparison
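The slide only names the comparisons used for duplicate aggregation. As a minimal sketch of how ID, price and text comparison could combine, the example below groups ads that are close in price and either share an ID or have near-identical text. The field names, the sample ads, the 2% price tolerance and the 0.85 similarity threshold are all assumptions made for illustration, and `difflib` stands in for whatever text analysis the real system runs in Elasticsearch.

```python
from difflib import SequenceMatcher

# Hypothetical ad records; field names and values are made up for illustration.
ADS = [
    {"id": "A1", "price": 250000, "text": "Bright 3-room flat near the station, 65 m2"},
    {"id": "A1", "price": 250000, "text": "Bright 3-room flat near the station, 65 m2"},
    {"id": "B7", "price": 251000, "text": "Bright 3 room flat near station, 65 m2"},
    {"id": "C3", "price": 480000, "text": "Detached house with garden, 120 m2"},
]


def similar_price(a, b, tolerance=0.02):
    """True when the two prices differ by at most `tolerance` (relative)."""
    return abs(a - b) <= tolerance * max(a, b)


def similar_text(a, b, threshold=0.85):
    """True when the difflib similarity ratio reaches `threshold` (assumed value)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def is_duplicate(ad, other):
    """Prices must be close, and either the IDs or the texts must match."""
    if not similar_price(ad["price"], other["price"]):
        return False
    return ad["id"] == other["id"] or similar_text(ad["text"], other["text"])


def aggregate_duplicates(ads):
    """Greedily put each ad in the first group whose first ad it duplicates."""
    groups = []
    for ad in ads:
        for group in groups:
            if is_duplicate(ad, group[0]):
                group.append(ad)
                break
        else:
            groups.append([ad])
    return groups


groups = aggregate_duplicates(ADS)
for group in groups:
    print(len(group), "ad(s):", group[0]["text"])
```

Comparing each ad only with the first member of a group keeps the sketch short. A production system would more likely push the text matching into the search engine and the ID matching into the database, as the slide suggests.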
18. Immoviz - #WeAreAnts
MACHINE LEARNING WORKFLOW
Data Cleaning
Check input format
Split data and hide holdout
Drop/impute null values
Filter outliers
…
Feature Engineering
Extract features
Scale/normalize data
Test contextual data
…
Data Modeling
Cross-validate for model selection
Optimize hyper-parameters
…
Error Analysis
Cross-validate to estimate test error
Locate sensitive zones
Visualize errors
…
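As a rough illustration of the modeling part of this workflow, the sketch below hides a holdout set, cross-validates on the remaining rows to pick a hyper-parameter, and only then touches the holdout. Everything concrete here is an assumption: the data is synthetic, the model is a toy nearest-neighbour regressor, and the candidate values are arbitrary. The deck does not say which tools were used.

```python
import random
from statistics import mean

random.seed(0)

# Synthetic stand-in data: (surface in m2, price). Not the deck's dataset.
DATA = []
for _ in range(200):
    surface = random.uniform(20, 150)
    DATA.append((surface, 3000 * surface + random.gauss(0, 20000)))


def holdout_split(rows, holdout_size=40):
    """Shuffle a copy and set the last `holdout_size` rows aside."""
    rows = rows[:]
    random.shuffle(rows)
    return rows[:-holdout_size], rows[-holdout_size:]


def knn_predict(train, x, k):
    """Average price of the k training rows whose surface is closest to x."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    return mean(price for _, price in nearest)


def mean_abs_error(train, test, k):
    """Mean absolute error of k-NN fitted on `train`, evaluated on `test`."""
    return mean(abs(knn_predict(train, x, k) - y) for x, y in test)


def cross_validate(rows, k, folds=5):
    """Mean validation error over `folds` interleaved folds."""
    errors = []
    for i in range(folds):
        validation = rows[i::folds]
        training = [row for j, row in enumerate(rows) if j % folds != i]
        errors.append(mean_abs_error(training, validation, k))
    return mean(errors)


train, holdout = holdout_split(DATA)

# Model selection uses the training rows only.
candidates = [1, 3, 5, 10, 20]
cv_errors = {k: cross_validate(train, k) for k in candidates}
best_k = min(cv_errors, key=cv_errors.get)

# The holdout is consulted once, after every choice has been made.
holdout_error = mean_abs_error(train, holdout, best_k)
print("best k:", best_k, "holdout MAE:", round(holdout_error))
```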
19. Immoviz - #WeAreAnts
Data snooping
If a data set has affected any step in the learning process, its ability to assess the outcome has been compromised.
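A small example may make the warning concrete. In the sketch below, scaling parameters computed on all rows let the holdout influence how the training data is prepared, whereas the clean variant learns them from the training rows alone. The prices and the train/holdout cut are invented for illustration.

```python
from statistics import mean, pstdev

# Hypothetical prices; the last two play the role of the hidden holdout.
prices = [100.0, 120.0, 110.0, 130.0, 500.0, 520.0]
train, holdout = prices[:4], prices[4:]


def fit_scaler(values):
    """Learn the scaling parameters (mean, std) from the given values only."""
    return mean(values), pstdev(values)


def transform(values, params):
    """Standardize values with previously learned parameters."""
    mu, sigma = params
    return [(v - mu) / sigma for v in values]


# Snooping: the holdout rows shift the mean and std used on the training data.
leaky_params = fit_scaler(prices)

# Clean: parameters come from the training rows, then are reused on the holdout.
clean_params = fit_scaler(train)
train_scaled = transform(train, clean_params)
holdout_scaled = transform(holdout, clean_params)

print("leaky mean:", round(leaky_params[0], 1), "clean mean:", clean_params[0])
```

The same rule applies to imputation values, outlier thresholds and any other quantity estimated from data.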
21. Immoviz - #WeAreAnts
MACHINE LEARNING WORKFLOW
Data Cleaning
Check input format
Drop/impute null values
Filter outliers
Split data and hide holdout
…
Feature Engineering
Extract features
Scale/normalize data
Test contextual data
…
Data Modeling
Cross-validate for model selection
Optimize hyper-parameters
…
Error Analysis
Cross-validate to estimate test error
Locate sensitive zones
Visualize errors
…
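The error analysis step could look roughly like the sketch below, which reads "sensitive zones" as the geographic zones where the model's relative error is highest. That reading is an assumption, as are the zone names and the sample predictions. In practice the predictions would come from cross-validation rather than a hard-coded list.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (zone, actual price, predicted price) rows.
PREDICTIONS = [
    ("north", 200000, 210000),
    ("north", 220000, 215000),
    ("center", 400000, 480000),
    ("center", 420000, 350000),
    ("south", 150000, 152000),
]


def relative_error_by_zone(rows):
    """Mean absolute percentage error per zone."""
    errors = defaultdict(list)
    for zone, actual, predicted in rows:
        errors[zone].append(abs(predicted - actual) / actual)
    return {zone: mean(values) for zone, values in errors.items()}


def rank_zones(rows):
    """Zones sorted from largest to smallest mean error."""
    by_zone = relative_error_by_zone(rows)
    return sorted(by_zone.items(), key=lambda item: item[1], reverse=True)


ranking = rank_zones(PREDICTIONS)
for zone, error in ranking:
    # A crude text "visualization": one '#' per percentage point of error.
    print(f"{zone:<8}{error:6.1%} {'#' * round(error * 100)}")
```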
23. Immoviz - #WeAreAnts
Data snooping
If you torture the data long enough, it will confess.
24. Immoviz - #WeAreAnts
Some key numbers
60 000 adverts, including 20 432 for-sale ads
12 839 unique for-sale ads with 61 features
10 883 for-sale ads remaining with 52 features after filtering
8 months of data
25. Immoviz - #WeAreAnts
Allocation of time
Feature Engineering: 50%
Data Modeling: 20%
Data Cleaning & EDA, Error Analysis: 20% and 10% (order unclear in the source)
26. Immoviz - #WeAreAnts
Feature engineering: what works?
Location features
Contextual data (Open Moulinette)
Imputing room features
Removing contextual outliers
Improving ES queries
Time series features
NLP on text data
Dimensionality reduction
Transforming/scaling numerical values
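The slide does not say how contextual outliers were removed. One possible reading, sketched below, is that an ad is an outlier relative to its context: a price per m2 that is ordinary in one zone can be implausible in another. The sample ads and the rule (keep ads whose price per m2 lies between 0.5 and 2 times the zone median) are assumptions chosen for illustration.

```python
from collections import defaultdict
from statistics import median

# Hypothetical ads: (zone, price, surface in m2).
ADS = [
    ("center", 500000, 50), ("center", 520000, 52), ("center", 480000, 49),
    ("center", 510000, 51), ("center", 100000, 50),   # cheap for the center
    ("suburb", 100000, 50), ("suburb", 105000, 52), ("suburb", 98000, 49),
    ("suburb", 102000, 51), ("suburb", 500000, 50),   # expensive for the suburb
]


def zone_medians(ads):
    """Median price per m2 for each zone."""
    per_zone = defaultdict(list)
    for zone, price, surface in ads:
        per_zone[zone].append(price / surface)
    return {zone: median(values) for zone, values in per_zone.items()}


def remove_contextual_outliers(ads, low=0.5, high=2.0):
    """Keep ads whose price per m2 is within [low, high] x the zone median."""
    medians = zone_medians(ads)
    kept = []
    for zone, price, surface in ads:
        ratio = (price / surface) / medians[zone]
        if low <= ratio <= high:
            kept.append((zone, price, surface))
    return kept


kept = remove_contextual_outliers(ADS)
print(len(ADS) - len(kept), "ad(s) removed")
```

Note that the two removed ads have prices that also appear, unremarkably, in the other zone, so a global price filter would not catch them.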
27. Immoviz - #WeAreAnts
Data Modeling: which algorithms to use?
Linear model
Tree-based model
Average ensemble method
Metamodel ensemble method
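To illustrate the averaging option, here is a minimal sketch that fits one linear model and one very small tree-based model (a depth-1 regression tree) and averages their predictions. The data is synthetic and both models are toy implementations written for this example. They are not the models used in the project.

```python
from statistics import mean

# Synthetic (surface, price) pairs, purely illustrative.
X = [30, 40, 50, 60, 70, 80, 90, 100]
Y = [95, 118, 160, 175, 230, 238, 270, 310]


def fit_linear(xs, ys):
    """Ordinary least squares for one feature; returns a predict function."""
    mx, my = mean(xs), mean(ys)
    covariance = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    variance = sum((x - mx) ** 2 for x in xs)
    slope = covariance / variance
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def fit_stump(xs, ys):
    """Depth-1 regression tree: pick the split that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, split, low_value, high_value = best
    return lambda x: low_value if x < split else high_value


def average_ensemble(models):
    """Predict with the unweighted mean of the base models' predictions."""
    return lambda x: mean(model(x) for model in models)


linear = fit_linear(X, Y)
stump = fit_stump(X, Y)
ensemble = average_ensemble([linear, stump])
print(round(linear(55), 1), round(stump(55), 1), round(ensemble(55), 1))
```

A metamodel ensemble would replace the fixed average with a second model trained on the base models' predictions.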
28. Immoviz - #WeAreAnts
“This is how you win ML competitions: you take other people’s work and ensemble them together.”
Vitaly Kuznetsov - NIPS 2014
39. Immoviz - #WeAreAnts
What’s next?
Metrics
Recommendation System
User Experience
Speed
40. Immoviz - #WeAreAnts
Conclusion
Better data beats cleverer algorithms
System monitoring is vital
There needs to be a coherent data flow between backend and ML engine