
Open-Source Analytics Stack on MongoDB, with Schema, Pierre-Alain Jachiet and Aurélien Gervasi

PyParis 2017
http://pyparis.org

Published in: Technology

  1. Open Source Analytics on MongoDB with Schema. By OCTO & The Refiners: Pierre-Alain Jachiet, Aurélien Gervasi
  2. Pierre-Alain Jachiet, Aurélien Gervasi: Data Scientist
  3. Data Scientist: data strategist, applied mathematician, analysts with developer skills
  4. Data Processor: data strategist, applied mathematician, analysts with developer skills
  5. "The major activity in the data science process is identifying, accessing and preparing data for analysis"
  6. From MongoDB data … to Superset colors
  7. OCTO TECHNOLOGY > THERE IS A BETTER WAY. So, what's the point of MongoDB?
  8. MongoDB: the leading NoSQL database (shown alongside Cassandra, Redis and HBase)
  9. MongoDB: a NoSQL database in the big leagues of RDBMS. Popularity score by db-engines.com, 2013 to 2017: https://db-engines.com/en/ranking
  10. Why MongoDB? Semi-structured data? Performance? Scalability? Yes! And more generally because it is natural for developers: "a pleasure to use from the developer perspective", "MongoDB is fast to get started"
  11. Developers speak JSON (= document with schema), the modern data exchange format. [Chart: XML vs JSON, 2008 to 2017]
  12. Developers speak JSON (= document with schema), the modern data exchange format … and MongoDB eats JSON. [Chart: XML vs JSON, 2008 to 2017]
  13. MongoDB, a common technology to store data
  14. So far, so good
  15. So far, so good. But time goes on and data comes in. And, one day, someone has a dream… AI
  16. Hey! Please! An analyst for this data!
  17. Please! An analyst for this data! But NoSQL / JSON data is not natural for analysts?
  18. Analysts use SQL. MongoDB: aggregation framework
  19. Analyst land: analysts work with tables … and relations. Developer land: developers like JSON … and nesting
  20. Relational database. Code, model layer, application = API to data. Map + contract: data schema. Developer land: developers have a data model in the code. Analyst land: analysts work with a data schema
  21. MongoDB. Code, model layer, application = API to data. Map + contract: data schema. Developer land: developers have a data model in the code. Analyst land: analysts work with a data schema > But MongoDB is schema-less
  22. The usual reaction… Hack a pipeline to flatten the MongoDB data (MongoDB; pymongo + scripts; Python notebooks; CSV file; Excel, Access, SAS). Difficulties: ☉ Hard job for the analyst ☉ Batch / no real time ☉ Not robust to changes => Difficult to industrialize
  23. MongoDB enterprise solution
  24. Mongo BI Connector: developed for integration with SQL-based BI tools; an SQL compatibility layer to MongoDB. Diagram: MongoDB (JSON data), Mongo SQLD (data model in DRDL*, SQL translator, post-processor), MySQL wire protocol, Tableau (table data). * DRDL = Document-Relational Definition Language
  25. Mongo BI Connector: pros & cons. Pros: ☉ Official & supported ☉ Install it and go. Cons: ☉ Commercial → MongoDB Enterprise license ☉ Closed-source → black box ☉ Limited performance? ☉ Mandatory use of the SQL wire protocol
  26. An open-source solution?
  27. Open-source bricks put together! Stack: MongoDB, Mongo Connector, Pymongo-Schema, Doc-manager (PostgreSQL), PostgreSQL. Streaming data from MongoDB to PostgreSQL
  28. Mongo Connector: connect them all! Synchronizes a MongoDB database with another database: ☉ MongoDB ☉ Solr ☉ Elasticsearch. Developed by MongoDB Labs. Python 2.6, 2.7, 3.3+. MongoDB 2.4, 2.6, 3.0, 3.2 and 3.4. Apache License 2.0. https://github.com/mongodb-labs/mongo-connector
  29. Mongo Connector: connect them all! Changes in the DB write new events to the oplog file, which drives (differential) replication from the primary to the secondaries. Mongo Connector propagates the same changes to the other DB
  30. Doc-manager: do you speak PostgreSQL? ☉ Translates a modification request from Mongo Connector to the target database ☉ Speaks the target database language. Developed by Hopwork. Python 2.7, 3.4+. PostgreSQL 9.5. Apache License 2.0. https://github.com/Hopwork/mongo-connector-postgresql
  31. Doc-manager: do you speak PostgreSQL? MongoDB world: { _id: "12", f1: "fu", f2: true, f3: 42, f4: { sf1: "pyparis", sf2: 2017 }, f5: [ "fu", "bar", "fubar" ] }. SQL world: columns (_id, f1, f2, f3) with row (12, "fu", true, 42); columns (f4.sf1, f4.sf2) with row ('pyparis', 2017); array table (_id, value, id_parent) with rows (1, 'fu', 12), (2, 'bar', 12), (3, 'fubar', 12)
  32. Pymongo-Schema: a mapping to rule them all. ☉ Scans the entire database to define its data model schema ☉ Generates a mapping file flattening the MongoDB schema into an SQL-compatible schema. "Homemade". Python 2.7. Apache License 2.0. https://github.com/pajachiet/pymongo-schema
  33. Demo
  34. ☉ MongoDB example dataset: restaurants in New York > Address & coordinates > Cuisine type > List of grades ☉ Nested data structure
  35. (no text on this slide)
  36. EXTRACT: reads the entire database to extract its data model schema. Returns: ☉ Field name and field nesting ☉ Field completion (frequency and ratio) ☉ Field type
  37. (no text on this slide)
  38. TOSQL: reads a schema to generate a MongoDB/SQL mapping. Returns: ☉ Mapping file used by the doc-manager
  39. Same table: column "cuisine". New table "restaurants__address__coord
  40. Mongo Connector checks for updates in the oplog file and sends update commands with data. The doc-manager translates each command and makes SQL requests
  41. Stack: MongoDB, Mongo Connector, Pymongo-Schema, Doc-manager, PostgreSQL
  42. Time to play with your analytics tools
  43. Adding an open-source BI tool...
  44. Now, in Superset colors! "Superset is a data exploration platform designed to be visual, intuitive and interactive." Developed by Airbnb. Python 2.7, 3.4, 3.5. Apache License 2.0. https://github.com/airbnb/superset
  45. SQL Lab
  46. Wrap up
  47. Take-home message ☉ Issues for analysts with NoSQL frameworks > Developer-oriented languages > Nested data structure > Schema-less ☉ An open-source stack to unlock analysis of MongoDB data > Extract a MongoDB schema > Normalize the data model > Real-time synchronization to PostgreSQL ☉ Currently running in production environments
  48. Come, use and contribute! :) pajachiet@octo.com agervasi@octo.com https://github.com/mongodb-labs/mongo-connector https://github.com/Hopwork/mongo-connector-postgresql https://github.com/pajachiet/pymongo-schema https://github.com/airbnb/superset
  49. Remember that this is an open-source stack: ☉ Collaborative ☉ Free. But hey! It's open source!
  50. "I analyze my data to understand myself." "I automatically learn to perform complex tasks from data." "I equip myself with advanced tools for complex, interactive analyses." Dataviz, Search, Statistics, Learning: a data-driven organization
  51. MongoDB popularity: https://db-engines.com/en/ranking
  52. Analysts use SQL. The Mongo way: aggregation framework
  53. Superset: architecture of visualizations. PostgreSQL tables, datasources (SQLA tables), visualizations, dashboard
  54. Please! An analyst for this data! But analysts don't speak JSON… Should we call the developer?
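
Slides 29 and 40 describe Mongo Connector reading changes from the oplog and propagating them to another database. A minimal sketch of that idea follows, replaying simplified oplog-style entries against an in-memory dictionary. It is not mongo-connector's actual code: the function names and the sample entries are hypothetical, and only the `op`, `o` and `o2` fields with `$set` updates are handled, whereas real oplog entries carry more fields and more operation types.

```python
from typing import Any

Document = dict[str, Any]


def apply_oplog_entry(target: dict[Any, Document], entry: Document) -> None:
    """Apply one simplified oplog-style entry to an in-memory copy of a collection."""
    op = entry["op"]
    if op == "i":  # insert: "o" holds the full document
        target[entry["o"]["_id"]] = dict(entry["o"])
    elif op == "u":  # update: "o2" identifies the document, "o" carries the change
        doc_id = entry["o2"]["_id"]
        changes = entry["o"].get("$set", {})
        target.setdefault(doc_id, {"_id": doc_id}).update(changes)
    elif op == "d":  # delete: "o" holds the _id of the removed document
        target.pop(entry["o"]["_id"], None)
    # Any other operation type is ignored in this sketch.


def replay(entries: list[Document], target: dict[Any, Document]) -> dict[Any, Document]:
    """Apply entries in order, as a process following the oplog would."""
    for entry in entries:
        apply_oplog_entry(target, entry)
    return target


# Hypothetical entries for a "restaurants" collection (assumption, not from the talk).
example_entries = [
    {"op": "i", "ns": "demo.restaurants", "o": {"_id": 1, "cuisine": "Bakery"}},
    {"op": "u", "ns": "demo.restaurants", "o2": {"_id": 1}, "o": {"$set": {"cuisine": "Pizza"}}},
    {"op": "i", "ns": "demo.restaurants", "o": {"_id": 2, "cuisine": "Thai"}},
    {"op": "d", "ns": "demo.restaurants", "o": {"_id": 2}},
]
replica = replay(example_entries, {})
```

In the stack presented in the talk, the role played here by the dictionary is taken by PostgreSQL, with the doc-manager turning each change into SQL requests.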

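The document-to-table translation shown on slide 31 can be sketched in a few lines of Python. This is an illustration only, not the doc-manager's implementation: `flatten_document` is a hypothetical helper, nested sub-document fields are assumed to become dotted columns of the parent row, and each array is assumed to become a child table named with double underscores (modeled on the table name visible on slide 39). Arrays of sub-documents are not handled.

```python
from typing import Any

Row = dict[str, Any]


def flatten_document(doc: dict[str, Any], table: str, pk: str = "_id") -> dict[str, list[Row]]:
    """Split one nested document into rows for a parent table and its child tables."""
    parent_row: Row = {}
    tables: dict[str, list[Row]] = {table: [parent_row]}

    def walk(value: dict[str, Any], path: tuple[str, ...]) -> None:
        for key, item in value.items():
            field_path = path + (key,)
            if isinstance(item, dict):
                # Sub-document: keep walking, columns get dotted names.
                walk(item, field_path)
            elif isinstance(item, list):
                # Array: one row per element, linked back to the parent document.
                child_table = "__".join((table,) + field_path)
                tables[child_table] = [
                    {"_id": index, "value": element, "id_parent": doc[pk]}
                    for index, element in enumerate(item, start=1)
                ]
            else:
                # Scalar: a column of the parent row.
                parent_row[".".join(field_path)] = item

    walk(doc, ())
    return tables


# The document from slide 31; the table name "demo" is an assumption.
slide_document = {
    "_id": "12",
    "f1": "fu",
    "f2": True,
    "f3": 42,
    "f4": {"sf1": "pyparis", "sf2": 2017},
    "f5": ["fu", "bar", "fubar"],
}
flattened = flatten_document(slide_document, "demo")
```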
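The EXTRACT step on slide 36 returns field names and nesting, field completion and field types. A minimal sketch of such a scan over documents already loaded in memory is shown below. It does not reproduce pymongo-schema's API or output format: `extract_schema` and the sample documents are hypothetical, fields inside arrays are not descended into, and types are reported as Python type names.

```python
from collections import Counter, defaultdict
from typing import Any, Iterable


def extract_schema(documents: Iterable[dict[str, Any]]) -> dict[str, dict[str, Any]]:
    """Summarize, per dotted field path, how often a field appears and with which types."""
    documents = list(documents)
    counts: Counter = Counter()
    types: defaultdict = defaultdict(Counter)

    def walk(value: dict[str, Any], prefix: str) -> None:
        for key, item in value.items():
            path = f"{prefix}{key}"
            counts[path] += 1
            types[path][type(item).__name__] += 1
            if isinstance(item, dict):
                # Nested sub-document: record its fields under a dotted path.
                walk(item, f"{path}.")

    for document in documents:
        walk(document, "")

    total = len(documents)
    return {
        path: {"count": count, "ratio": count / total, "types": dict(types[path])}
        for path, count in counts.items()
    }


# Hypothetical documents loosely shaped like the restaurants demo dataset.
sample_documents = [
    {"name": "A", "cuisine": "Bakery", "address": {"zipcode": "10001"}},
    {"name": "B", "cuisine": "Thai"},
]
schema = extract_schema(sample_documents)
```

A field that appears in only some documents, such as `address.zipcode` here, gets a ratio below 1, which is the kind of completion information an analyst needs before mapping the field to an SQL column.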