LEBONCOIN DE LA DATA
Stéphanie Baltus – Head of Data Engineering – @steph_baltus
Meetup Duchess France @ TheFamily – 01/19/2016
PLAN
■ About leboncoin
■ Data, data everywhere!
■ To infinity and beyond…
ABOUT LEBONCOIN
LEBONCOIN...AND FRIENDS
IN A FEW WORDS
■ A Schibsted Media Group company
■ Since 2006
■ 320+ people
■ Located in Paris, Montceau-les-Mines, Reims
■ 2014 revenue: 150+ M€
NOT JUST A WEBSITE
NOT JUST A CLASSIFIED ADS COMPANY
■ Classified ads:
■ Professional
■ Personal
■ Premium offer:
■ Highlight products
■ Ad import tools
■ Ad display
DATA, DATA EVERYWHERE
A BIT OF HISTORY
■ Building a team
■ Provide daily batch DWH
■ Website traffic (sort of)
■ Ad activity & validation
■ Sales & Coin usage
■ User information
■ Support
■ Try near-real-time processing
SO, WE DID SOME BI STUFF (2012-2015)
IT LOOKS LIKE THIS
IT WORKS! BUT…
■ A lot of uncovered scope
■ Incremental load only
■ Inability to load historical data (coverage stuck at 2013 to today)
■ A business team unable to query the database
■ A lot of "no!" when asked for changes
■ Vertical scalability only
■ No possibility of sharing data with the product (website, app, CRM, …)
TO INFINITY AND BEYOND!
THE FUTURE
■ Share data services with the website and apps
■ Build a single source of truth
■ Provide raw data to our analysts
■ Provide real-time data
■ Cover the whole data scope of leboncoin
FUNCTIONAL ARCHITECTURE
DATA ARCHITECTURE: DUMBO STYLE
ONE STACK TO RULE THEM ALL
SOME IMPLEMENTATIONS
■ Centralized data cleaning / streamlining
■ Extended analytics apps
■ Ads and customers indexes
■ Import ad web service
■ Data lake indexing through Bloom filters
■ Anomaly detection
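The "data lake indexing through Bloom filter" item can be illustrated with a minimal sketch. The deck does not describe how the index was built, so everything below is an assumption made for illustration: one filter per data lake file, ad identifiers as keys, and the file paths, the `BloomFilter` class and the `candidate_files` helper are all hypothetical. The idea is that a lookup only opens the files whose filter answers "maybe"; a file whose filter answers "no" is guaranteed not to contain the key.

```python
import hashlib
import math


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, tunable false-positive rate."""

    def __init__(self, expected_items: int, false_positive_rate: float = 0.01):
        # Standard sizing formulas: m = -n ln(p) / (ln 2)^2 and k = (m / n) ln 2.
        self.size = max(8, int(-expected_items * math.log(false_positive_rate)
                               / (math.log(2) ** 2)))
        self.hash_count = max(1, round(self.size / expected_items * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, key: str):
        # Double hashing: derive k bit positions from two halves of one SHA-256 digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.hash_count):
            yield (h1 + i * h2) % self.size

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


# Hypothetical data lake content: file path -> ad identifiers stored in that file.
files = {
    "ads/2016-01-18.json": ["ad-1001", "ad-1002", "ad-1003"],
    "ads/2016-01-19.json": ["ad-2001", "ad-2002"],
}

# Hypothetical index: one Bloom filter per file.
file_index = {}
for path, ad_ids in files.items():
    bloom = BloomFilter(expected_items=1000)
    for ad_id in ad_ids:
        bloom.add(ad_id)
    file_index[path] = bloom


def candidate_files(ad_id: str):
    """Files that may contain ad_id; files not listed definitely do not."""
    return [path for path, bloom in file_index.items() if ad_id in bloom]


print(candidate_files("ad-1002"))
```

The filters are small enough to keep in memory, which is what makes them usable as a cheap pre-check before reading large raw files.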
CATCH ’EM ALL!
■ Goal: help the SysAdmin team catch bots crawling our website and apps to steal our ads or people’s phone numbers => anomaly detection
■ How:
  ■ Use HTTP logs (150 GB per day)
  ■ Build KPIs and vectors
  ■ Apply a logistic regression to identify suspicious sessions
■ Next steps:
  ■ Test the K-Means algorithm
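The "build KPIs and vectors, then apply a logistic regression" step can be sketched as follows. This is only an illustration under stated assumptions: the deck names neither the KPIs nor the library used, so the three features, the training values and the function names (`train`, `bot_probability`) are invented here, and the model is fitted with plain gradient descent in pure Python rather than with any particular framework.

```python
import math

# Hypothetical per-session KPI vectors:
# (requests per minute, share of distinct ad pages, share of phone-number views).
# Label 1 = bot-like session, 0 = regular visitor. All values are made up.
TRAINING = [
    ((2.0, 0.30, 0.02), 0),
    ((1.5, 0.25, 0.00), 0),
    ((3.0, 0.40, 0.05), 0),
    ((0.8, 0.20, 0.01), 0),
    ((45.0, 0.98, 0.60), 1),
    ((60.0, 0.95, 0.80), 1),
    ((30.0, 0.90, 0.50), 1),
    ((52.0, 0.99, 0.70), 1),
]


def sigmoid(z: float) -> float:
    z = max(-60.0, min(60.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def fit_scaler(rows):
    """Per-feature mean and standard deviation, used to standardize the KPIs."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    stds = [math.sqrt(sum((v - m) ** 2 for v in col) / len(col)) or 1.0
            for col, m in zip(columns, means)]
    return means, stds


def scale(row, means, stds):
    return [(v - m) / s for v, m, s in zip(row, means, stds)]


def train(samples, epochs: int = 500, learning_rate: float = 0.5):
    """Fit logistic regression weights with batch gradient descent."""
    means, stds = fit_scaler([features for features, _ in samples])
    weights = [0.0] * len(means)
    bias = 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for features, label in samples:
            x = scale(features, means, stds)
            error = sigmoid(bias + sum(w * v for w, v in zip(weights, x))) - label
            grad_w = [g + error * v for g, v in zip(grad_w, x)]
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias, means, stds


def bot_probability(model, kpis) -> float:
    """Estimated probability that a session with these KPIs is a bot."""
    weights, bias, means, stds = model
    x = scale(kpis, means, stds)
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))


model = train(TRAINING)
print(bot_probability(model, (50.0, 0.97, 0.65)))  # crawler-like session
print(bot_probability(model, (1.0, 0.20, 0.00)))   # ordinary visitor
```

Sessions scoring above a chosen threshold (0.5 is the usual default) would be the ones handed to the sysadmins for review. Standardizing the KPIs first keeps the request rate, which is on a much larger scale than the two ratios, from dominating the gradient.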
DIVE INTO DATA SHARING
■ Unified view of the data
  ■ Home-built data extractor + Spark MDM jobs
■ Build a next-generation BI app
  ■ Spark ETL + Redshift
■ Share built information with other apps
  ■ Spark ETL + ES + Kafka
NOW IT LOOKS LIKE THIS
WHAT’S NEXT?
■ Being production-ready
■ New apps, new services
■ More machine-learning-oriented apps
■ Feeding the website
■ Recruiting :)
QUESTIONS?

Editor's Notes

  1. From Mexico to Malaysia, from Brazil to Norway – millions of people interact with Schibsted companies every day. We’re meeting our customers’ needs with our ever expanding range of smart products and services. Schibsted is increasingly international, and we’re moving forward. Fast. Through all this diversity, we provide similar solutions to make everyday life for millions of people a little bit easier, a little bit better. In doing this we are committed to always try to innovate and deliver new, even smarter services that will meet the needs of people today and tomorrow around the world.
  2. A little history. Beyond the fact that leboncoin is part of a group: leboncoin was born in 2006 from a joint venture between Schibsted and Spir (the group that owns 20 minutes).
  3. Leboncoin is mainly known for its website, but […]
  4. In the beginning there was the website, then the product teams needed stats. So: stats computed in batch by the Core team and sent as text files by email. Batches that could run for days and nights, and that ate up a fair amount of their time… And, we can agree, that is not their job.
  5. 1) LBC application landscape (WWW, Mobile, CP, CRM, OTRS).
     2) BATCH extraction of the required raw data + storage in a working database => no load on the source systems.
     3) Raw data to be cleaned/refined/joined => ETL.
     4) Hydrated data stored in a data warehouse: a database with so-called dimensional modeling and columnar storage, which give high aggregation performance => analyses.
     Technology-wise: PSQL, PDI, MonetDB. We used quite a few Postgres features that let us meet needs that are hard to satisfy with relational functions alone: hll and ranges (I can explain why outside the presentation).
     To give an idea of volume, we store roughly 6 TB of data in the working databases. Infrastructure-wise, we were rather well equipped; personally, I had never had machines like these in my previous assignments. Across the database and ETL servers: about 300 GB of RAM, 10 TB.
     All this to say that we gave ourselves every chance to meet the needs of the analysts and product managers.
  6. Despite all this goodwill: 1) Transformed data stays locked in: analyses, but nothing made directly available to the product side (in the broad sense). 2) Good tools, but vertical scalability only => persisting some information is complex + performance is complex. So we started to think bigger and to consider a so-called "big data" architecture.
  7. Building on these findings, we redefined the team's mission, with big data technologies serving as a lever to widen our scope.
  8. Functionally, what does it consist of?
     1) Still batch extraction, but instead of a database: extensible cloud file storage + all the data in raw format => data lake.
     2) Likewise, still cleaning/refining/joining our data => ETL, but one able to distribute its processing over a scalable cluster.
     3) Likewise, still a DWH with dimensional modeling and columnar storage, but on a distributed SQL database => at this stage we have "only" addressed the scalability problem of our BI architecture; the feedback problem remains.
     4) Constraint inherent to exchanging information between applications => a "real-time" communication system. Ideally it works in both directions => real-time ingestion + feedback of hydrated data as a stream. It must also be scalable and robust.
     5) While streaming suits synchronization and alerting needs well, it is poorly suited to looking up specific data => so we expose search services to address those needs. They must be scalable and robust.
     6) Finally, we do not want to settle for serving recycled data (however refined); we also want to create new services and products from it: machine learning. Many fields of application: anomaly detection, fraud prevention, content suggestion, detection of purchase intenders, ...
     Once this architecture is laid out, the concrete implementation remains to be chosen.
  9. When we started thinking about this question, the state of the art looked like this: a Hadoop stack for storage and batch + Storm and Kafka for real-time processing. It met the functional needs but conflicted with some of our design choices. 1) 6 months of technology watch => in the "big data" world things change very quickly => it seemed critical to stay agile, to be able to switch from one technology to another at low cost. Yet an architecture built on Hadoop introduces strong coupling between its components. 2) Second problem: Hadoop ecosystem = about twenty Apache projects => heterogeneous tools (Pig, HiveQL, Java, Scala, ...) => hard to rationalize/deploy/maintain => functional overlap between projects + longevity hard to anticipate => makes architecture choices complex and critical.
  10. => So we tried to keep the yellow elephant as far away as possible. We ended up with the following: S3 => elastic, distributed storage. Redshift => DWH database (=> consistency with the group). Kafka => streaming. ES => search services. Cassandra => persistence on the ML side. Spark => batch and real-time ETL + machine learning (code uniformity). In the end we get a modular architecture, built on components that look durable but that we can move away from at low cost. E.g.: Redshift --> Vertica or S3 --> HDFS (1-2 weeks of work). Implementation started in May. The paint is faaar from dry, but we have first lessons learned => transition to Nico. A few concrete applications of this architecture.
  11. Data = not just blocket data: audience logs and HTTP logs too. Goal => help the sysadmins pin down the vile competitors who steal our ads. We recently launched a project to learn user behavior. The sysadmins catch the bulk through Elasticsearch, but it is hard for them to identify sessions and their activity over time.
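
The last note points at the hard part: turning raw HTTP hits into sessions whose activity can be followed over time. A common way to do this is gap-based sessionization, sketched below. The notes do not say how sessions were actually defined, so the client key (IP plus user agent), the 30-minute inactivity threshold, the sample hits and the helper names are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # assumed inactivity threshold


def sessionize(hits):
    """Group hits by client and split each client's hits on inactivity gaps.

    hits: iterable of (client_key, timestamp, path) tuples.
    Returns a list of (client_key, [(timestamp, path), ...]) sessions.
    """
    by_client = defaultdict(list)
    for client, timestamp, path in hits:
        by_client[client].append((timestamp, path))

    sessions = []
    for client, events in by_client.items():
        events.sort()
        current = [events[0]]
        for event in events[1:]:
            if event[0] - current[-1][0] > SESSION_GAP:
                sessions.append((client, current))
                current = [event]
            else:
                current.append(event)
        sessions.append((client, current))
    return sessions


def session_kpis(events):
    """A few per-session indicators that could feed a detection model."""
    start, end = events[0][0], events[-1][0]
    minutes = max((end - start).total_seconds() / 60.0, 1.0)
    paths = [path for _, path in events]
    return {
        "hits": len(events),
        "hits_per_minute": len(events) / minutes,
        "distinct_path_ratio": len(set(paths)) / len(paths),
    }


# Made-up log extract: (IP|user agent, timestamp, requested path).
t0 = datetime(2016, 1, 19, 10, 0, 0)
HITS = [
    ("1.2.3.4|bot", t0, "/ad/1"),
    ("1.2.3.4|bot", t0 + timedelta(seconds=10), "/ad/2"),
    ("1.2.3.4|bot", t0 + timedelta(seconds=20), "/ad/3"),
    ("5.6.7.8|firefox", t0, "/"),
    ("5.6.7.8|firefox", t0 + timedelta(hours=2), "/ad/1"),
]

for client, events in sessionize(HITS):
    print(client, session_kpis(events))
```

At 150 GB of logs per day the same grouping would have to run on a distributed engine; the in-memory version only shows the logic.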