
Real-Time Integration Between MongoDB and SQL Databases


Many companies have a huge investment in Data Warehouse and BI tools and want to leverage those investments to process data collected by applications in MongoDB. For example, a company may need to blend clickstream data collected by distributed MongoDB data storage with personal data from Oracle into the Data Warehouse system or Analytics platform to provide timely marketing reports. Most of the time the job requires converting a MongoDB JSON document structure into a traditional relational model. A traditional ETL (Extract Transform Load) process still needs to be developed for loading and conversion of unstructured data into traditional analytical tools or Hadoop. In this talk we discuss how to develop a real-time, scalable, fault-tolerant ETL process to integrate MongoDB with traditional RDBMS storage using the open-source Twitter Storm project. We will be capturing data streamed by the MongoDB oplog or capped collections, transforming it into tables, rows and columns, and loading it into a SQL database. We will discuss the MongoDB oplog and the Storm architecture. The principles discussed in the talk can be used for many other applications, like advanced analytics, continuous computations and so on. We will be using Java as our language of choice, but you can use the same software stack with any language.

  • Leading source of health and medical information.
  • Data is raw; data is immutable; data is true. Dynamic, personalized marketing campaigns.
  • Main data structure in Storm: a named list of values, where each value can be of any type. Tuples know how to serialize primitive data types, strings and byte arrays. For any other type, register a serializer for that type.
  • The oplog is a capped collection that lives in a database called local on every replicating node and records all changes to the data. Every time a client writes to the primary, an entry with enough information to reproduce the write is automatically added to the primary’s oplog. Once the write is replicated to a given secondary, that secondary’s oplog also stores a record of the write. Each oplog entry is identified with a BSON timestamp, and all secondaries use the timestamp to keep track of the latest entry they’ve applied.
  • How do you know if you are connected to a sharded cluster?
  • Use the MongoDB oplog as a queue.
  • The spout extends Storm's spout interface.
  • The awards array in the Person document is converted into 2 documents, each with the _id of the parent document.
  • The awards array is converted into 2 documents, each with the _id of the parent document. The namespace will be used later to insert data into the correct table on the SQL side.
  • Instance of BasicDBList in Java
  • Flatten out your document structure: use a loop or recursion to flatten it out. Hopefully you don't have deeply nested documents, which is against MongoDB guidelines for schema design.
  • Use tick tuples and update in batches.
  • Local mode vs prod mode
  • Increasing parallelization of a bolt. Say you want 5 bolts to process your array because it is a more time-consuming operation, or you want more SQLWriterBolts because it takes a long time to insert data; then use the parallelism-hint parameter in the bolt definition. The system will create a corresponding number of executors to process your requests.
  • Local mode vs prod mode
  • Local mode vs prod mode

    1. Distributed, fault-tolerant, transactional. Real-Time Integration: MongoDB and SQL Databases. Eugene Dvorkin, Architect, WebMD.
    2. WebMD: a lot of data, a lot of traffic. ~900 million page views a month; ~100 million unique visitors a month.
    3. How We Use MongoDB: user activity.
    4. Why move data to an RDBMS? To preserve the existing investment in BI and the data warehouse; to use an analytical database such as Vertica; to use SQL.
    5. Why move data in real time? Batch processing is slow: no ad-hoc queries, no real-time reports.
    6. Challenges in moving data: transform documents to a relational structure; insert into the RDBMS at a high rate.
    7. Challenges in moving data: scale easily as data volume and velocity increase.
    8. Our solution to move data in real time: Storm. Storm is an open-source distributed real-time computation system, developed by Nathan Marz and acquired by Twitter.
    9. Our solution to move data in real time: Storm (Hadoop vs. Storm).
    10. Why Storm? JVM-based framework; guaranteed data processing; supports development in multiple languages; scalable and transactional; easy to learn and use.
    11. Overview of a Storm cluster: the master node (Nimbus), cluster coordination (ZooKeeper), and supervisor nodes that run worker processes.
    12. Storm abstractions: tuples, streams, spouts, bolts and topologies.
    13. Tuple: an ordered list of elements, e.g. ("ns:events", "email:edvorkin@gmail.com").
    14. Stream: an unbounded sequence of tuples. Example: a stream of messages from a message queue.
    15. Spout: a source of streams. Reads from a stream of data (queues, web logs, API calls, the MongoDB oplog) and emits documents as tuples.
    16. Bolts: process tuples and create new streams.
    17. Bolts: apply functions/transforms; calculate and aggregate data (word count!); access DBs, APIs, etc.; filter data; map/reduce. Process tuples and create new streams.
    18. Topology.
    19. Topology: Storm is transforming and moving data.
    20. MongoDB: how to read all incoming data from MongoDB?
    21. MongoDB: how to read all incoming data from MongoDB? Use the MongoDB oplog.
    22. What is the oplog? MongoDB's replication mechanism; it is a capped collection.
    23. Spout: reading from the oplog. Located in the local database, in the oplog.rs collection.
    24. Spout: reading from the oplog. Operations: insert, update, delete.
    25. Spout: reading from the oplog. Namespace (ns): maps the collection name to the target table.
    26. Spout: reading from the oplog. The data object:
    27. Sharded cluster.
    28. Automatic discovery of a sharded cluster.
    29. Example: shard vs. replica-set discovery.
    30. Example: shard discovery.
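The discovery code itself is not shown on the slide. One way to tell a mongos from a plain replica-set member, matching the question in the speaker notes, is the msg field of the isMaster (hello) command response, which a mongos reports as "isdbgrid". A minimal sketch of that check, with a plain java.util.Map standing in for the driver's command-result object (the class and method names here are assumptions, not the talk's actual code):

```java
import java.util.Map;

public class ClusterDiscovery {
    // A mongos reports "msg": "isdbgrid" in its isMaster/hello response;
    // a plain replica-set member does not. We inspect only that field.
    public static boolean isMongos(Map<String, Object> isMasterResult) {
        return "isdbgrid".equals(isMasterResult.get("msg"));
    }

    public static void main(String[] args) {
        System.out.println(isMongos(Map.of("msg", "isdbgrid")));  // a mongos
        System.out.println(isMongos(Map.of("ismaster", true)));   // a replica-set node
    }
}
```

In the real spout you would obtain the result map by running the isMaster command through the MongoDB driver against each seed address.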
    31. Spout: reading data from the oplog. How to read data continuously from the oplog?
    32. Spout: reading data from the oplog. How to read data continuously from the oplog? Use a tailable cursor.
    33. Example: a tailable cursor works like tail -f.
    34. Manage timestamps: use the ts field (the timestamp in the oplog entry) to track processed records. If the system restarts, start from the recorded ts.
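A sketch of the restart logic this slide describes: a BSON timestamp packs (seconds, increment) into 64 bits, and on reconnect the spout issues a { ts: { $gt: lastTs } } filter against the oplog. Plain longs and Maps stand in for the driver's BSONTimestamp and BasicDBObject types, and packTs/resumeQuery are hypothetical helper names, not the talk's code:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class OplogResume {
    // A BSON timestamp is 64 bits: high 32 = seconds, low 32 = increment.
    public static long packTs(long seconds, long increment) {
        return (seconds << 32) | (increment & 0xFFFFFFFFL);
    }

    // The filter the spout issues on restart: { ts: { $gt: <last processed ts> } }
    public static Map<String, Object> resumeQuery(long lastTs) {
        Map<String, Object> gt = new LinkedHashMap<>();
        gt.put("$gt", lastTs);
        Map<String, Object> query = new LinkedHashMap<>();
        query.put("ts", gt);
        return query;
    }

    public static void main(String[] args) {
        long ts = packTs(1370000000L, 7); // last processed entry, persisted somewhere durable
        System.out.println(resumeQuery(ts));
    }
}
```

The recorded ts must be persisted outside the topology (the deck's opslog_progress argument suggests a progress store) so a restarted worker can pick up where the previous one stopped.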
    35. Spout: reading from the oplog.
    36. Spout: code example.
    37. Topology.
    38. Working with embedded arrays: an array represents a one-to-many relationship in an RDBMS.
    39. Example: working with embedded arrays.
    40. Example: working with embedded arrays. The awards array becomes two documents:
        { _id: 1, ns: "person_awards", o: { award: "National Medal of Science", year: 1975, by: "National Science Foundation" } }
        { _id: 1, ns: "person_awards", o: { award: "Turing Award", year: 1977, by: "ACM" } }
    41. Example: working with embedded arrays.
        public void execute(Tuple tuple) {
            .........
            if (field instanceof BasicDBList) {
                BasicDBObject arrayElement = processArray(field);
                ......
                outputCollector.emit("documents", tuple, arrayElement);
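The processArray helper is referenced but not shown. A sketch of what it plausibly does, based on the person_awards example above: emit one document per array element, carrying the parent _id as a foreign key and a synthetic namespace naming the child table. Plain java.util collections stand in for BasicDBList/BasicDBObject, and this signature is an assumption:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ArrayFieldExtractor {
    // One emitted "document" per array element: the parent _id plus a
    // synthetic namespace such as "people_awards".
    public static List<Map<String, Object>> processArray(
            Object parentId, String parentNs, String field,
            List<Map<String, Object>> array) {
        List<Map<String, Object>> out = new ArrayList<>();
        for (Map<String, Object> element : array) {
            Map<String, Object> doc = new LinkedHashMap<>();
            doc.put("_id", parentId);               // foreign key back to the parent row
            doc.put("ns", parentNs + "_" + field);  // target table, e.g. people_awards
            doc.put("o", element);                  // the array element itself
            out.add(doc);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> awards = List.of(
                Map.of("award", "Turing Award", "year", 1977, "by", "ACM"));
        System.out.println(processArray(1, "people", "awards", awards));
    }
}
```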
    42. Parse documents with a bolt.
    43. Parse documents with a bolt. Input document:
        { "ns": "people", "op": "i", o: { _id: 1, name: { first: "John", last: "Backus" }, birth: "Dec 03, 1924" } }
        Output tuple:
        [ "ns": "people", "op": "i", "_id": 1, "name_first": "John", "name_last": "Backus", "birth": "Dec 03, 1924" ]
    44. Parse documents with a bolt.
        @Override
        public void execute(Tuple tuple) {
            ......
            final BasicDBObject oplogObject = (BasicDBObject) tuple.getValueByField("document");
            final BasicDBObject document = (BasicDBObject) oplogObject.get("o");
            ......
            outputValues.add(flattenDocument(document));
            outputCollector.emit(tuple, outputValues);
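flattenDocument is referenced but not shown. A sketch of one way to implement it, turning nested sub-documents into underscore-joined column names as in the name_first/name_last example above; plain Maps stand in for the driver's BasicDBObject, and the recursion depth is assumed to be small per MongoDB schema-design guidance:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DocumentFlattener {
    // Recursively flatten nested sub-documents into underscore-joined columns:
    // {name: {first: "John"}} becomes {name_first: "John"}; key order is kept.
    public static Map<String, Object> flattenDocument(Map<String, Object> doc) {
        Map<String, Object> flat = new LinkedHashMap<>();
        flatten("", doc, flat);
        return flat;
    }

    @SuppressWarnings("unchecked")
    private static void flatten(String prefix, Map<String, Object> doc, Map<String, Object> out) {
        for (Map.Entry<String, Object> e : doc.entrySet()) {
            String key = prefix.isEmpty() ? e.getKey() : prefix + "_" + e.getKey();
            if (e.getValue() instanceof Map) {
                flatten(key, (Map<String, Object>) e.getValue(), out);
            } else {
                out.put(key, e.getValue());
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Object> name = new LinkedHashMap<>();
        name.put("first", "John");
        name.put("last", "Backus");
        Map<String, Object> person = new LinkedHashMap<>();
        person.put("_id", 1);
        person.put("name", name);
        person.put("birth", "Dec 03, 1924");
        System.out.println(flattenDocument(person));
        // {_id=1, name_first=John, name_last=Backus, birth=Dec 03, 1924}
    }
}
```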
    45. Write to SQL with the SQLWriterBolt.
    46. Write to SQL with the SQLWriterBolt. From the flattened tuple
        [ "ns": "people", "op": "i", "_id": 1, "name_first": "John", "name_last": "Backus", "birth": "Dec 03, 1924" ]
        generate the inserts:
        insert into people (_id, name_first, name_last, birth) values (1, 'John', 'Backus', 'Dec 03, 1924');
        insert into people_awards (_id, awards_award, awards_year, awards_by) values (1, 'Turing Award', 1977, 'ACM');
        insert into people_awards (_id, awards_award, awards_year, awards_by) values (1, 'National Medal of Science', 1975, 'National Science Foundation');
    47. Write to SQL with the SQLWriterBolt.
        @Override
        public void prepare(.....) {
            ....
            Class.forName("com.vertica.jdbc.Driver");
            con = DriverManager.getConnection(dBUrl, username, password);
        @Override
        public void execute(Tuple tuple) {
            String insertStatement = createInsertStatement(tuple);
            try {
                Statement stmt = con.createStatement();
                stmt.execute(insertStatement);
                stmt.close();
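createInsertStatement is not shown either. A sketch of a safer variant that builds a parameterized insert rather than concatenating values into the SQL string (concatenation, as on the slide, invites quoting and injection problems). The table name and column set come from a flattened row; this signature and approach are assumptions, not the talk's code:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class InsertStatementBuilder {
    // Build "insert into <table> (cols) values (?, ?, ...)" for a flattened row;
    // the values are then bound via PreparedStatement.setObject before executing.
    public static String createInsertStatement(String table, Map<String, Object> row) {
        String cols = String.join(",", row.keySet());
        String params = row.keySet().stream().map(k -> "?").collect(Collectors.joining(","));
        return "insert into " + table + " (" + cols + ") values (" + params + ")";
    }

    public static void main(String[] args) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("_id", 1);
        row.put("name_first", "John");
        row.put("name_last", "Backus");
        System.out.println(createInsertStatement("people", row));
        // insert into people (_id,name_first,name_last) values (?,?,?)
    }
}
```

Batching these through PreparedStatement.addBatch/executeBatch, driven by tick tuples as the speaker notes suggest, keeps the insert rate high on the Vertica side.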
    48. Topology definition:
        TopologyBuilder builder = new TopologyBuilder();
        // define our spout
        builder.setSpout(spoutId, new MongoOpLogSpout("mongodb://", opslog_progress));
        builder.setBolt(arrayExtractorId, new ArrayFieldExtractorBolt(), 5).shuffleGrouping(spoutId);
        builder.setBolt(mongoDocParserId, new MongoDocumentParserBolt()).shuffleGrouping(arrayExtractorId, documentsStreamId);
        builder.setBolt(sqlWriterId, new SQLWriterBolt(rdbmsUrl, rdbmsUserName, rdbmsPassword)).shuffleGrouping(mongoDocParserId);
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("test", conf, builder.createTopology());
    49. Topology definition (same code as slide 48).
    50. Topology definition (same code as slide 48).
    51. Topology definition: the same topology, submitted to a production cluster instead of LocalCluster:
        StormSubmitter.submitTopology("OfflineEventProcess", conf, builder.createTopology());
    52. Lessons learned: by leveraging the MongoDB oplog (or another capped collection), tailable cursors and the Storm framework, you can build a fast, scalable, real-time data processing pipeline.
    53. Resources: the book Getting Started with Storm; the Storm project wiki; the Storm starter project; the Storm contributions project; Running a Multi-Node Storm Cluster tutorial; Implementing Real-Time Trending Topics; A Hadoop Alternative: Building a Real-Time Data Pipeline with Storm; Storm use cases.
    54. Resources (cont'd): Understanding the Parallelism of a Storm Topology; Trident, a high-level Storm abstraction; a practical Storm's Trident API; the Storm online forum; the Mongo Connector from 10gen Labs; MoSQL, a streaming translator in Ruby; the project source code; the New York City Storm Meetup.
    55. Questions? Eugene Dvorkin, Architect, WebMD. edvorkin@webmd.net; Twitter: @edvorkin; LinkedIn: eugenedvorkin.
