Real-Time Integration Between MongoDB and SQL Databases

Speaker notes:
  • Leading source of health and medical information.
  • Data is raw. Data is immutable. Data is true. Dynamic personalized marketing campaigns.
  • The oplog is a capped collection that lives in a database called local on every replicating node and records all changes to the data. Every time a client writes to the primary, an entry with enough information to reproduce the write is automatically added to the primary’s oplog. Once the write is replicated to a given secondary, that secondary’s oplog also stores a record of the write. Each oplog entry is identified with a BSON timestamp, and all secondaries use the timestamp to keep track of the latest entry they’ve applied.
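    For reference, an oplog entry looks roughly like this (the ts, op, ns and o fields all appear later in the deck; the values here are illustrative):

        {
          "ts" : Timestamp(1370987400, 1),  // BSON timestamp; secondaries track the latest applied ts
          "op" : "i",                       // operation type: i = insert, u = update, d = delete
          "ns" : "test.people",             // namespace: database.collection
          "o"  : { "_id" : 1, "name" : { "first" : "John", "last" : "Backus" } }  // the write itself
        }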
  • How do you know if you are connected to a sharded cluster?
  • Use the MongoDB oplog as a queue.
  • The spout extends Storm's spout interface.
  • Awards array in the Person document – converted into two documents, each with the id of the parent document.
  • Awards array – converted into two documents, each with the id of the parent document. The namespace will be used later to insert the data into the correct table on the SQL side.
  • Instance of BasicDBList in Java
  • Flatten out your document structure – use a loop or recursion to flatten it out. Hopefully you don’t have deeply nested documents, which would go against MongoDB schema-design guidelines.
  • Use tick tuples and update in batches (see the sketch after slide 47 in the transcript below).
  • Local mode vs. production mode.
  • Increasing parallelization of a bolt. Say you want 5 bolts to process your array because it is a more time-consuming operation, or you want more SQLWriterBolts because inserting data takes a long time; use the parallelism hint parameter in the bolt definition. The system will create the corresponding number of workers to process your requests.
  • Local mode vs. production mode.
  • Local mode vs. production mode.
  • Transcript

    • 1. Distributed, fault-tolerant, transactional. Real-Time Integration: MongoDB and SQL Databases. Eugene Dvorkin, Architect, WebMD.
    • 2. WebMD: a lot of data, a lot of traffic. ~900 million page views a month; ~100 million unique visitors a month.
    • 3. How We Use MongoDB: user activity.
    • 4. Why Move Data to an RDBMS? Preserve the existing investment in BI and the data warehouse; use an analytical database such as Vertica; use SQL.
    • 5. Why Move Data in Real Time? Batch processing is slow; no ad-hoc queries; no real-time reports.
    • 6. Challenges in moving data: transform documents to a relational structure; insert into the RDBMS at a high rate.
    • 7. Challenges in moving data: scale easily as data volume and velocity increase.
    • 8. Our solution to move data in real time: Storm. Storm is an open-source distributed real-time computation system, developed by Nathan Marz and acquired by Twitter.
    • 9. Hadoop vs. Storm – our solution to move data in real time: Storm.
    • 10. Why Storm? JVM-based framework; guaranteed data processing; supports development in multiple languages; scalable and transactional.
    • 11. Overview of a Storm cluster: master node; cluster coordination; worker processes.
    • 12. Storm abstractions: tuples, streams, spouts, bolts and topologies.
    • 13. Tuples: an ordered list of elements, e.g. (“ns:events”, ”email:edvorkin@gmail.com”).
    • 14. Stream: an unbounded sequence of tuples. Example: a stream of messages from a message queue.
    • 15. Spout: a source of streams. Reads from a stream of data – queues, web logs, API calls, the MongoDB oplog – and emits documents as tuples.
    • 16. Bolts: process tuples and create new streams.
    • 17. Bolts: process tuples and create new streams. Apply functions/transforms; calculate and aggregate data (word count!); access DBs, APIs, etc.; filter data; map/reduce.
    • 18. Topology.
    • 19. Topology: Storm is transforming and moving data.
    • 20. MongoDB: how to read all incoming data from MongoDB?
    • 21. MongoDB: how to read all incoming data from MongoDB? Use the MongoDB oplog.
    • 22. What is the oplog? The replication mechanism in MongoDB; it is a capped collection.
    • 23. Spout: reading from the oplog. Located in the local database, in the oplog.rs collection.
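      As a minimal sketch (not from the deck), opening that collection with the 2.x Java driver of the era looks like this; the host and port are illustrative:

          MongoClient mongo = new MongoClient("localhost", 27017);   // illustrative connection
          DB local = mongo.getDB("local");                           // the oplog lives in the local database
          DBCollection oplog = local.getCollection("oplog.rs");      // capped collection of replicated writes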
    • 24. Spout: reading from the oplog. Operations: insert, update, delete.
    • 25. Spout: reading from the oplog. Namespace: table – collection name.
    • 26. Spout: reading from the oplog. Data object: the o field of the entry.
    • 27. Sharded cluster.
    • 28. Automatic discovery of a sharded cluster.
    • 29. Example: shard vs. replica-set discovery.
    • 30. Example: shard discovery.
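      The discovery code itself is not preserved in this transcript. One hedged sketch, not necessarily the deck's method, of telling a mongos (sharded cluster) apart from a plain replica set, and of enumerating the shards, with the same-era Java driver:

          // On a mongos, the isMaster command reports msg: "isdbgrid"
          CommandResult res = mongo.getDB("admin").command("isMaster");
          boolean isShardedCluster = "isdbgrid".equals(res.getString("msg"));

          // If sharded, each shard and its hosts can be read from config.shards
          DBCursor shards = mongo.getDB("config").getCollection("shards").find();
          while (shards.hasNext()) {
              DBObject shard = shards.next();  // e.g. { _id: "shard0000", host: "rs0/host1:27017,host2:27017" }
              System.out.println(shard.get("host"));
          }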
    • 31. Spout: reading data from the oplog. How to read data continuously from the oplog?
    • 32. Spout: reading data from the oplog. How to read data continuously from the oplog? Use a tailable cursor.
    • 33. Example: a tailable cursor – like tail -f.
    • 34. Manage timestamps: use the ts field (the timestamp in each oplog entry) to track processed records; if the system restarts, start from the recorded ts.
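      Continuing the sketch above, a tailable cursor that resumes from the last recorded ts could look like this; loadProgress and saveProgress are hypothetical helpers around the persisted progress record, not from the deck:

          BSONTimestamp lastTs = loadProgress();  // hypothetical: read the last processed ts
          DBObject query = new BasicDBObject("ts", new BasicDBObject("$gt", lastTs));
          DBCursor cursor = oplog.find(query)
                  .addOption(Bytes.QUERYOPTION_TAILABLE)    // cursor stays open at the end, like tail -f
                  .addOption(Bytes.QUERYOPTION_AWAITDATA);  // block briefly waiting for new entries
          while (cursor.hasNext()) {
              DBObject entry = cursor.next();
              // ... emit the entry downstream ...
              saveProgress((BSONTimestamp) entry.get("ts"));  // hypothetical: persist progress
          }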
    • 35. Spout: reading from the oplog.
    • 36. Spout – code example.
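      The slide's code is not reproduced in this transcript. A skeleton of such a spout in the Storm API of the era (backtype.storm) might look like the following; the class name MongoOpLogSpout and the "document" field come from later slides, everything else is an assumption, and imports are omitted:

          public class MongoOpLogSpout extends BaseRichSpout {
              private SpoutOutputCollector collector;
              private DBCursor cursor;  // tailable cursor over local.oplog.rs, opened in open()

              @Override
              public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
                  this.collector = collector;
                  // connect to MongoDB and open the tailable cursor here
              }

              @Override
              public void nextTuple() {
                  if (cursor != null && cursor.hasNext()) {
                      DBObject entry = cursor.next();
                      collector.emit(new Values(entry));  // one oplog entry per tuple
                  }
              }

              @Override
              public void declareOutputFields(OutputFieldsDeclarer declarer) {
                  declarer.declare(new Fields("document"));  // read later via tuple.getValueByField("document")
              }
          }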
    • 37. Topology.
    • 38. Working with embedded arrays: an array represents a one-to-many relationship in an RDBMS.
    • 39. Example: working with embedded arrays.
    • 40. Example: working with embedded arrays.
          { _id: 1, ns: “person_awards”,
            o: { award: National Medal of Science, year: 1975, by: National Science Foundation } }
          { _id: 1, ns: “person_awards”,
            o: { award: Turing Award, year: 1977, by: ACM } }
    • 41. Example: working with embedded arrays.
          public void execute(Tuple tuple) {
              ...
              if (field instanceof BasicDBList) {
                  BasicDBObject arrayElement = processArray(field);
                  ...
                  outputCollector.emit("documents", tuple, arrayElement);
    • 42. Parse documents with a bolt.
    • 43. Parse documents with a bolt.
          {"ns": "people", "op": "i",
           o: { _id: 1, name: { first: John, last: Backus }, birth: Dec 03, 1924 } }
          becomes
          ["ns": "people", "op": "i",
           [“id”: 1, "name_first": "John", "name_last": "Backus", "birth": "Dec 03, 1924"]]
    • 44. Parse documents with a bolt.
          @Override
          public void execute(Tuple tuple) {
              ...
              final BasicDBObject oplogObject = (BasicDBObject) tuple.getValueByField("document");
              final BasicDBObject document = (BasicDBObject) oplogObject.get("o");
              ...
              outputValues.add(flattenDocument(document));
              outputCollector.emit(tuple, outputValues);
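      flattenDocument itself is not shown in the deck; following the speaker note about flattening with a loop or recursion, one possible sketch:

          // Recursively flatten nested sub-documents into prefixed keys,
          // e.g. { name: { first: "John" } } -> { name_first: "John" }
          private BasicDBObject flattenDocument(BasicDBObject document) {
              BasicDBObject flat = new BasicDBObject();
              flatten("", document, flat);
              return flat;
          }

          private void flatten(String prefix, BasicDBObject document, BasicDBObject flat) {
              for (String key : document.keySet()) {
                  Object value = document.get(key);
                  String flatKey = prefix.isEmpty() ? key : prefix + "_" + key;
                  if (value instanceof BasicDBObject) {
                      flatten(flatKey, (BasicDBObject) value, flat);  // recurse into sub-documents
                  } else {
                      flat.put(flatKey, value);
                  }
              }
          }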
    • 45. Write to SQL with SQLWriterBolt.
    • 46. Write to SQL with SQLWriterBolt.
          ["ns": "people", "op": "i",
           [“id”: 1, "name_first": "John", "name_last": "Backus", "birth": "Dec 03, 1924"]]
          becomes
          insert into people (_id, name_first, name_last, birth) values (1, 'John', 'Backus', 'Dec 03, 1924');
          insert into people_awards (_id, awards_award, awards_year, awards_by) values (1, 'Turing Award', 1977, 'ACM');
          insert into people_awards (_id, awards_award, awards_year, awards_by) values (1, 'National Medal of Science', 1975, 'National Science Foundation');
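      The createInsertStatement helper used on the next slide is not shown in the deck. A simplified sketch of the mapping above (the tuple field names are assumptions; real code should escape values or use PreparedStatement):

          // Build "insert into <table> (k1, k2, ...) values ('v1', 'v2', ...)" from a flattened document
          private String createInsertStatement(Tuple tuple) {
              String table = tuple.getStringByField("ns");  // the namespace names the target table
              BasicDBObject doc = (BasicDBObject) tuple.getValueByField("document");
              StringBuilder cols = new StringBuilder();
              StringBuilder vals = new StringBuilder();
              for (String key : doc.keySet()) {
                  if (cols.length() > 0) { cols.append(", "); vals.append(", "); }
                  cols.append(key);
                  vals.append("'").append(doc.get(key)).append("'");
              }
              return "insert into " + table + " (" + cols + ") values (" + vals + ")";
          }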
    • 47. Write to SQL with SQLWriterBolt.
          @Override
          public void prepare(.....) {
              ...
              Class.forName("com.vertica.jdbc.Driver");
              con = DriverManager.getConnection(dBUrl, username, password);

          @Override
          public void execute(Tuple tuple) {
              String insertStatement = createInsertStatement(tuple);
              try {
                  Statement stmt = con.createStatement();
                  stmt.execute(insertStatement);
                  stmt.close();
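      The speaker notes mention tick tuples and batched updates; a hedged sketch of that pattern in the same-era Storm API (flushBatch and buffer are hypothetical):

          // Ask Storm to deliver a tick tuple to this bolt every 5 seconds,
          // then flush buffered inserts as one batch instead of row by row.
          @Override
          public Map<String, Object> getComponentConfiguration() {
              Config conf = new Config();
              conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 5);
              return conf;
          }

          private boolean isTickTuple(Tuple tuple) {
              return Constants.SYSTEM_COMPONENT_ID.equals(tuple.getSourceComponent())
                      && Constants.SYSTEM_TICK_STREAM_ID.equals(tuple.getSourceStreamId());
          }

          @Override
          public void execute(Tuple tuple) {
              if (isTickTuple(tuple)) {
                  flushBatch();  // hypothetical: execute buffered statements as a JDBC batch
              } else {
                  buffer.add(createInsertStatement(tuple));  // hypothetical buffer of pending inserts
              }
          }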
    • 48. Topology definition.
          TopologyBuilder builder = new TopologyBuilder();
          // define our spout
          builder.setSpout(spoutId, new MongoOpLogSpout("mongodb://", opslog_progress));
          builder.setBolt(arrayExtractorId, new ArrayFieldExtractorBolt(), 5).shuffleGrouping(spoutId);
          builder.setBolt(mongoDocParserId, new MongoDocumentParserBolt())
                 .shuffleGrouping(arrayExtractorId, documentsStreamId);
          builder.setBolt(sqlWriterId, new SQLWriterBolt(rdbmsUrl, rdbmsUserName, rdbmsPassword))
                 .shuffleGrouping(mongoDocParserId);
          LocalCluster cluster = new LocalCluster();
          cluster.submitTopology("test", conf, builder.createTopology());
    • 49. Topology definition (same code as slide 48).
    • 50. Topology definition (same code as slide 48).
    • 51. Topology definition – the same topology as slide 48, but submitted to a production cluster instead of a LocalCluster:
          StormSubmitter.submitTopology("OfflineEventProcess", conf, builder.createTopology());
    • 52. Lesson learned: by leveraging the MongoDB oplog (or another capped collection), tailable cursors and the Storm framework, you can build a fast, scalable, real-time data-processing pipeline.
    • 53. Resources: the book Getting Started with Storm; the Storm project wiki; the Storm starter project; the Storm contributions project; the Running a Multi-Node Storm Cluster tutorial; Implementing Real-Time Trending Topics; A Hadoop Alternative: Building a Real-Time Data Pipeline with Storm; Storm use cases.
    • 54. Resources (cont’d): Understanding the Parallelism of a Storm Topology; Trident – a high-level Storm abstraction; A Practical Storm’s Trident API; the Storm online forum; Mongo Connector from 10gen Labs; MoSQL streaming translator in Ruby; the project source code; the New York City Storm Meetup.
    • 55. Questions. Eugene Dvorkin, Architect, WebMD. edvorkin@webmd.net. Twitter: @edvorkin. LinkedIn: eugenedvorkin.
    • 58. Next sessions at 2:50.
          5th floor – West Side Ballroom 3 & 4: Data Modeling Examples from the Real World. West Side Ballroom 1 & 2: Growing Up MongoDB. Juilliard Complex: Business Track: MetLife Leapfrogs Insurance Industry with MongoDB-Powered Big Data Application. Lyceum Complex: Ask the Experts: MongoDB Monitoring and Backup Service Session.
          7th floor – Empire Complex: How We Fixed Our MongoDB Problems. SoHo Complex: High Performance, High Scale MongoDB on AWS: A Hands-On Guide.
