Meetup duchess 20160119 - Leboncoin de la data

This document summarizes the presentation "LEBONCOIN DE LA DATA" about the data engineering work at Leboncoin, a French classified ads website. It discusses how Leboncoin is building a centralized data warehouse and real-time processing pipeline to share data across its website and apps. This will create a single source of truth and allow for expanded analytics, anomaly detection using machine learning, and real-time data feeds. The goal is to help the engineering team scale to support new products and applications.

Data & Analytics

LEBONCOIN DE LA DATA
Stéphanie Baltus – Head of Data Engineering – @steph_baltus
Meetup Duchess France @ TheFamily – 01/19/2016

PLAN
■ About leboncoin
■ Data, data everywhere!
■ To infinity and beyond…

IN A FEW WORDS
■ A Schibsted Media Group company
■ Since 2006
■ 320+ people
■ Located in Paris, Montceau-les-Mines, Reims
■ 2014 revenue: €150M+

NOT JUST A CLASSIFIED ADS COMPANY
■ Classified ads:
  ■ Professional
  ■ Personal
■ Premium offer:
  ■ Highlight products
  ■ Ad import tools
  ■ Ad display

A BIT OF HISTORY
■ Building a team
■ Provide a daily batch DWH
  ■ Website traffic (sort of)
  ■ Ad activity & validation
  ■ Sales & Coin usage
  ■ User information
  ■ Support
■ Try near-real-time processing

IT WORKS! BUT…
■ Much of the scope is not covered
■ Incremental load only
■ Inability to load historical data, stuck from 2013 to today
■ A business team unable to query the database
■ A lot of "no!" when asked for changes
■ Vertical scalability only
■ No way to share data with the product (website, app, CRM, …)

THE FUTURE
■ Share data services with the website and apps
■ Build a single source of truth
■ Provide raw data to our analysts
■ Provide real-time data
■ Cover leboncoin's entire data scope

SOME IMPLEMENTATIONS
■ Centralized data cleaning / streamlining
■ Extended analytics apps
■ Ad and customer indexes
■ Ad import web service
■ Data lake indexing with Bloom filters
■ Anomaly detection
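
The deck does not say how the Bloom filter index was built, so the following is only a sketch of the general idea: keep one small Bloom filter per data lake file, and consult the filters to decide which files are worth opening when looking for a given ad. The file names, the ad ids, the filter size and the helper names (`build_index`, `candidate_files`) are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: str):
        # Double hashing: derive all bit positions from two halves of one digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


def build_index(files):
    """Map each (hypothetical) data lake file name to a filter over its ad ids."""
    index = {}
    for name, ad_ids in files.items():
        bloom = BloomFilter()
        for ad_id in ad_ids:
            bloom.add(ad_id)
        index[name] = bloom
    return index


def candidate_files(index, ad_id):
    """Files that may contain ad_id; any file left out certainly does not."""
    return [name for name, bloom in index.items() if ad_id in bloom]


# Hypothetical daily files and ad ids, for illustration only.
INDEX = build_index({
    "ads/2016-01-18.json": ["ad-1", "ad-2"],
    "ads/2016-01-19.json": ["ad-3"],
})
```

Calling `candidate_files(INDEX, "ad-3")` returns a list that always includes "ads/2016-01-19.json". A file that does not hold the ad can still show up now and then (a false positive), so the filter narrows the search rather than answering it.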

CATCH’EM ALL!
■ Goal: help the SysAdmin team catch bots that crawl our website and apps to steal our ads or people’s phone numbers => anomaly detection
■ How:
  ■ Use HTTP logs (150 GB per day)
  ■ Build KPIs and vectors
  ■ Apply a logistic regression to identify suspicious sessions
■ Next steps:
  ■ Test the K-Means algorithm
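
The slide gives no implementation details, so what follows is only a minimal sketch of the approach it names: summarise each session as a vector of KPIs, then fit a logistic regression that scores how suspicious a session looks. The three features, the toy sessions, the labels and every function name are assumptions made for illustration, and plain Python gradient descent stands in for whatever the team actually ran.

```python
import math


def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def suspicion_score(weights, bias, features) -> float:
    """Estimated probability that a session is a bot."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)


def train_logistic_regression(X, y, learning_rate=0.5, epochs=2000):
    """Batch gradient descent on the mean log-loss."""
    n_samples, n_features = len(X), len(X[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for features, label in zip(X, y):
            error = suspicion_score(weights, bias, features) - label
            for j, value in enumerate(features):
                grad_w[j] += error * value
            grad_b += error
        weights = [w - learning_rate * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n_samples
    return weights, bias


# Hypothetical per-session KPIs, each scaled to [0, 1]:
# [request rate, share of ad-detail pages, share of phone-number views]
SESSIONS = [
    [0.05, 0.30, 0.02], [0.10, 0.40, 0.05], [0.08, 0.25, 0.00], [0.15, 0.35, 0.03],
    [0.90, 0.95, 0.80], [0.80, 0.90, 0.60], [0.95, 0.85, 0.90], [0.70, 0.99, 0.70],
]
LABELS = [0, 0, 0, 0, 1, 1, 1, 1]  # made-up labels: 0 = human, 1 = bot

WEIGHTS, BIAS = train_logistic_regression(SESSIONS, LABELS)
```

A new session can then be scored with `suspicion_score(WEIGHTS, BIAS, [0.85, 0.90, 0.70])`; scores above a chosen threshold (0.5 is the usual default) would be flagged for the sysadmins to review. K-Means, listed under next steps, would instead group sessions without needing labels.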

DIVE INTO DATA SHARING
■ Unified view of the data
  ■ Home-built data extractor + Spark MDM jobs
■ Build a next-generation BI app
  ■ Spark ETL + Redshift
■ Share the resulting information with other apps
  ■ Spark ETL + ES + Kafka

WHAT’S NEXT?
■ Being production ready
■ New apps, new services
■ More machine-learning-oriented apps
■ Feeding the website
■ Recruiting

Meetup duchess 20160119 - Leboncoin de la data

1. LEBONCOIN DE LA DATA
2. PLAN
3. ABOUT LEBONCOIN
4. LEBONCOIN... AND FRIENDS
5. (no title)
6. IN A FEW WORDS
7. NOT JUST A WEBSITE
8. NOT JUST A CLASSIFIED ADS COMPANY
9. DATA, DATA EVERYWHERE
10. A BIT OF HISTORY
11. SO, WE DID SOME BI STUFF (2012-2015)
12. IT LOOKS LIKE THIS
13. IT WORKS! BUT…
14. TO INFINITY AND BEYOND!
15. THE FUTURE
16. FUNCTIONAL ARCHITECTURE
17. DATA ARCHITECTURE: DUMBO STYLE
18. ONE STACK TO RULE THEM ALL
19. SOME IMPLEMENTATIONS
20. CATCH’EM ALL!
21. DIVE INTO DATA SHARING
22. NOW IT LOOKS LIKE THIS
23. WHAT’S NEXT?
24. QUESTIONS?

Editor's Notes

From Mexico to Malaysia, from Brazil to Norway – millions of people interact with Schibsted companies every day. We’re meeting our customers’ needs with our ever expanding range of smart products and services. Schibsted is increasingly international, and we’re moving forward. Fast. Through all this diversity, we provide similar solutions to make everyday life for millions of people a little bit easier, a little bit better. In doing this we are committed to always try to innovate and deliver new, even smarter services that will meet the needs of people today and tomorrow around the world.
A little bit of history, beyond the fact that leboncoin is part of a group: leboncoin was born in 2006 from a joint venture between Schibsted and Spir (the group that owns 20 Minutes).
Leboncoin is mainly known for its website, but ...
At first there was the website; then the product teams needed stats. So stats were computed in batch by the Core team and sent by email as text files. These batches could run for days and nights and took up a good deal of the team's time… And we can all agree that this is not their job.
1) The LBC application landscape (WWW, Mobile, CP, CRM, OTRS).
2) Batch extraction of the required raw data + storage in a working database => no load on the source systems.
3) Raw data to be cleaned / refined / joined => ETL.
4) Enriched data stored in a data warehouse: a database with so-called dimensional modelling and columnar storage, which give high aggregation performance => analyses.
Technology: PSQL, PDI, MonetDB. We used quite a few Postgres features that let us meet needs that are hard to address with relational functions alone: hll and range types (I can explain why outside the presentation).
To give an idea of volume, we store roughly 6 TB of data in the working databases. Infrastructure-wise we were rather well off; personally, I had never had such machines on my previous assignments. Across the database and ETL servers: about 300 GB of RAM, 10 TB.
All this to say that we gave ourselves every chance of meeting the needs of the analysts and product managers.
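
To make the dimensional model mentioned in this note concrete, here is a small sketch of a star schema and a typical aggregation query. It uses Python's built-in sqlite3 module only so that it runs anywhere: SQLite is a row store, not a columnar engine like MonetDB, and every table name, column name and figure below is invented for illustration.

```python
import sqlite3


def build_warehouse() -> sqlite3.Connection:
    """Create a tiny in-memory star schema: one fact table, two dimensions."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, day TEXT, month TEXT);
        CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE fact_ad_activity (
            date_id INTEGER REFERENCES dim_date(date_id),
            category_id INTEGER REFERENCES dim_category(category_id),
            ads_posted INTEGER,
            ads_validated INTEGER
        );
        """
    )
    conn.executemany(
        "INSERT INTO dim_date VALUES (?, ?, ?)",
        [
            (20151230, "2015-12-30", "2015-12"),
            (20160118, "2016-01-18", "2016-01"),
            (20160119, "2016-01-19", "2016-01"),
        ],
    )
    conn.executemany(
        "INSERT INTO dim_category VALUES (?, ?)",
        [(1, "vehicles"), (2, "real_estate")],
    )
    conn.executemany(
        "INSERT INTO fact_ad_activity VALUES (?, ?, ?, ?)",
        [
            (20151230, 1, 10, 8),
            (20160118, 1, 12, 11),
            (20160119, 1, 9, 9),
            (20160118, 2, 5, 4),
            (20160119, 2, 7, 6),
        ],
    )
    return conn


def monthly_activity(conn: sqlite3.Connection):
    """Aggregate the fact table along the date and category dimensions."""
    return conn.execute(
        """
        SELECT d.month, c.name, SUM(f.ads_posted), SUM(f.ads_validated)
        FROM fact_ad_activity AS f
        JOIN dim_date AS d ON d.date_id = f.date_id
        JOIN dim_category AS c ON c.category_id = f.category_id
        GROUP BY d.month, c.name
        ORDER BY d.month, c.name
        """
    ).fetchall()
```

The query only reads the two measure columns and the join keys of the fact table, which is the access pattern a columnar store is designed to serve quickly.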
Despite all this goodwill:
1) The transformed data stays locked in: there are analyses, but nothing is made directly available to the product (in the broad sense).
2) Good tools, but vertical scalability only => persisting some information is complex, and so is performance.
So we started to think bigger and to consider a so-called "big data" architecture.
Building on these findings, we redefined the team's mission, with big data technologies serving as a lever to widen our scope.
What does it consist of, functionally?
1) Still batch extraction, but instead of a database: extensible cloud file storage holding all the data in raw format => data lake.
2) Likewise, we still clean / refine / join our data => ETL, but one able to distribute its processing over a scalable cluster.
3) Likewise, still a DWH with dimensional modelling and columnar storage, but on a distributed SQL database. => At this stage we have "only" addressed the scalability problem of our BI architecture; the feedback problem remains.
4) A constraint inherent to exchanging information between applications => a "real-time" communication system. Ideally it works in both directions => real-time ingestion + streaming feedback of enriched data. It must also be scalable and robust.
5) While streaming suits synchronisation and alerting needs, it is poorly suited to looking up specific data. => We therefore expose search services to address these needs. They must be scalable and robust.
6) Finally, we do not want to merely make recycled data available (however refined); we also want to create new services and products from it: machine learning. There are many fields of application: anomaly detection, fraud prevention, content suggestions, detecting users with purchase intent, ...
Once this architecture is laid out, the concrete implementation remains to be chosen.
When we started thinking about this question, the state of the art looked like this: a Hadoop stack for storage and batch + Storm and Kafka for real-time processing. It met the functional needs but conflicted with some of our design choices.
1) 6 months of technology watch => in the "big data" world things change very quickly => it seemed critical to keep some agility and be able to switch from one technology to another at low cost. Yet an architecture built on Hadoop introduces strong coupling between its components.
2) Second problem: the Hadoop ecosystem = around twenty Apache projects => heterogeneous tools (Pig, HiveQL, Java, Scala, ...) => hard to rationalise / deploy / maintain => functional overlap between projects + longevity that is hard to anticipate => makes architecture choices complex and critical.
=> So we tried to keep the yellow elephant as far away as possible. We ended up with the following:
S3 => elastic, distributed storage
Redshift => DWH database (=> consistency with the group)
Kafka => streaming
ES => search services
Cassandra => persistence on the ML side
Spark => batch and real-time ETL + machine learning (uniform code base)
In the end we get a modular architecture, built on components that should in principle last but that we can move away from at low cost. E.g. Redshift --> Vertica or S3 --> HDFS (1-2 weeks of work). Implementation started in May. The paint is far from dry, but we have some first feedback. => Hand over to Nico. A few concrete applications of this architecture.
Data = not only blocket data: also audience logs and HTTP logs. Goal => help the sysadmins pin down the vile competitors who steal our ads. We recently launched a project to learn user behaviour. The sysadmins spot the bulk of it through Elasticsearch, but it is hard for them to identify sessions and their activity over time.
