Real-Time Integration Between MongoDB and SQL Databases

Many companies have a huge investment in data warehouse and BI tools and want to leverage those investments to process data collected by applications in MongoDB. For example, a company may need to blend clickstream data collected by distributed MongoDB data storage with personal data from Oracle into the data warehouse system or analytics platform to provide timely marketing reports. Most of the time the job requires converting a MongoDB JSON document structure into a traditional relational model. A traditional ETL (Extract, Transform, Load) process still needs to be developed to load and convert unstructured data for traditional analytical tools or Hadoop. In this talk we discuss how to develop a real-time, scalable, fault-tolerant ETL process that integrates MongoDB with traditional RDBMS storage using the open-source Twitter Storm project. We will capture data streamed by the MongoDB oplog or capped collections, transform it into tables, rows and columns, and load it into a SQL database. We will discuss the MongoDB oplog and the Storm architecture. The principles discussed in the talk can be used for many other applications, such as advanced analytics and continuous computations. We will use Java as our language of choice, but you can use the same software stack with any language.

  • WebMD: leading source of health and medical information.
  • Data is raw. Data is immutable. Data is true. Dynamic, personalized marketing campaigns.
  • The main data structure in Storm is the tuple: a named list of values, where each value can be of any type. Tuples know how to serialize primitive data types, strings and byte arrays; for any other type, register a serializer for that type (see the sketch below).
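    For custom types, Storm lets you register a Kryo serializer on the topology configuration. A minimal sketch, assuming hypothetical MyCustomType and MyCustomSerializer classes:

        import backtype.storm.Config;

        Config conf = new Config();
        // Tuples serialize primitives, Strings and byte[] out of the box;
        // any other type carried in a tuple needs a registered serializer.
        conf.registerSerialization(MyCustomType.class, MyCustomSerializer.class);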
  • The oplog is a capped collection that lives in a database called local on every replicating node and records all changes to the data. Every time a client writes to the primary, an entry with enough information to reproduce the write is automatically added to the primary’s oplog. Once the write is replicated to a given secondary, that secondary’s oplog also stores a record of the write. Each oplog entry is identified with a BSON timestamp, and all secondaries use the timestamp to keep track of the latest entry they’ve applied.
  • How do you know if you are connected to a sharded cluster?
  • Use the MongoDB oplog as a queue.
  • A spout extends Storm's base spout class (or implements its spout interface).
  • The awards array in the Person document is converted into two documents, each carrying the _id of the parent document.
  • The awards array is converted into two documents, each carrying the _id of the parent document. The namespace will be used later to insert the data into the correct table on the SQL side.
  • An embedded array arrives as an instance of BasicDBList in Java.
  • Flatten out your document structure: use a loop or recursion to flatten it (see the sketch below). Hopefully you don't have deeply nested documents, which go against MongoDB guidelines for schema design.
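    A minimal sketch of such a flattening step inside the parser bolt, assuming key paths are joined with underscores. The flattenDocument name matches the call on slide 44; the body here is an illustration, not the talk's actual code:

        import com.mongodb.BasicDBObject;

        // Recursively flatten nested sub-documents into one flat document
        // whose keys are the concatenated path, e.g. name_first, name_last.
        BasicDBObject flattenDocument(BasicDBObject doc) {
            BasicDBObject flat = new BasicDBObject();
            flatten("", doc, flat);
            return flat;
        }

        void flatten(String prefix, BasicDBObject doc, BasicDBObject flat) {
            for (String key : doc.keySet()) {
                Object value = doc.get(key);
                String name = prefix.isEmpty() ? key : prefix + "_" + key;
                if (value instanceof BasicDBObject) {
                    flatten(name, (BasicDBObject) value, flat);  // recurse into sub-document
                } else {
                    flat.put(name, value);  // leaf value: copy under the flattened key
                }
            }
        }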
  • Use tick tuples and update in batches (sketched below).
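    Storm can deliver a system "tick" tuple to a bolt on a fixed interval, which the bolt can use to flush batched writes. A sketch of the pattern inside a writer bolt, with hypothetical addToBatch/flushBatch helpers:

        import java.util.Map;
        import backtype.storm.Config;
        import backtype.storm.Constants;
        import backtype.storm.tuple.Tuple;

        // Ask Storm to send this bolt a tick tuple every 5 seconds.
        @Override
        public Map<String, Object> getComponentConfiguration() {
            Config conf = new Config();
            conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 5);
            return conf;
        }

        @Override
        public void execute(Tuple tuple) {
            boolean isTick = Constants.SYSTEM_COMPONENT_ID.equals(tuple.getSourceComponent())
                    && Constants.SYSTEM_TICK_STREAM_ID.equals(tuple.getSourceStreamId());
            if (isTick) {
                flushBatch();       // hypothetical: execute one batched INSERT
            } else {
                addToBatch(tuple);  // hypothetical: buffer the row in memory
            }
        }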
  • Local mode vs. production mode.
  • Increasing parallelization of a bolt: say you want 5 bolts to process your array because it is a more time-consuming operation, or you want more SQLWriterBolts because inserting data takes a long time. Use the parallelism-hint parameter in the bolt definition, and the system will create a corresponding number of executors to process your request.

Real-Time Integration Between MongoDB and SQL Databases: Presentation Transcript

  • 1. Real-Time Integration: MongoDB and SQL Databases (distributed, fault-tolerant, transactional). Eugene Dvorkin, Architect, WebMD
  • 2. WebMD: a lot of data; a lot of traffic. ~900 million page views a month; ~100 million unique visitors a month.
  • 3. How We Use MongoDB: user activity.
  • 4. Why move data to an RDBMS? Preserve the existing investment in BI and data warehouse tools; use an analytical database such as Vertica; use SQL.
  • 5. Why move data in real time? A batch process is slow; no ad-hoc queries; no real-time reports.
  • 6. Challenges in moving data: transform documents to a relational structure; insert into the RDBMS at a high rate.
  • 7. Challenges in moving data: scale easily as data volume and velocity increase.
  • 8. Our solution for moving data in real time: Storm. Storm is an open-source distributed real-time computation system, developed by Nathan Marz and acquired by Twitter.
  • 9. Hadoop vs. Storm: our solution for moving data in real time is Storm.
  • 10. Why Storm? JVM-based framework; guaranteed data processing; supports development in multiple languages; scalable and transactional; easy to learn and use.
  • 11. Overview of a Storm cluster: a master node (Nimbus), cluster coordination (ZooKeeper), and worker processes running on supervisor nodes.
  • 12. Storm abstractions: tuples, streams, spouts, bolts and topologies.
  • 13. Tuples: an ordered list of elements, e.g. ("ns:events", "email:edvorkin@gmail.com").
  • 14. Stream: an unbounded sequence of tuples. Example: a stream of messages from a message queue.
  • 15. Spout: a source of streams. Reads from a stream of data (queues, weblogs, API calls, the MongoDB oplog) and emits documents as tuples.
  • 16. Bolts: process tuples and create new streams.
  • 17. Bolts process tuples and create new streams: apply functions/transforms; calculate and aggregate data (word count!); access DBs, APIs, etc.; filter data; map/reduce. A minimal bolt skeleton is sketched below.
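    Not from the deck, but as a minimal illustration of the bolt contract: a bolt receives tuples in execute and emits new tuples downstream.

        import backtype.storm.topology.BasicOutputCollector;
        import backtype.storm.topology.OutputFieldsDeclarer;
        import backtype.storm.topology.base.BaseBasicBolt;
        import backtype.storm.tuple.Fields;
        import backtype.storm.tuple.Tuple;
        import backtype.storm.tuple.Values;

        public class UppercaseBolt extends BaseBasicBolt {
            @Override
            public void execute(Tuple tuple, BasicOutputCollector collector) {
                // transform the incoming value and emit it on the default stream
                collector.emit(new Values(tuple.getString(0).toUpperCase()));
            }

            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("word"));
            }
        }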
  • 18. Topology
  • 19. Topology: Storm transforms and moves data.
  • 20. MongoDB: how to read all incoming data from MongoDB?
  • 21. MongoDB: how to read all incoming data from MongoDB? Use the MongoDB oplog.
  • 22. What is the oplog? The replication mechanism in MongoDB; it is a capped collection.
  • 23. Spout: reading from the oplog. Located in the local database, in the oplog.rs collection.
  • 24. Spout: reading from the oplog. Operations: insert, update, delete.
  • 25. Spout: reading from the oplog. Namespace (ns): the collection name, which maps to a table on the SQL side.
  • 26. Spout: reading from the oplog. Data object (the o field); a representative entry is shown below.
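    The slide's sample object is not in the transcript; a representative insert entry in the MongoDB 2.x oplog looks roughly like this (shell notation):

        {
          "ts" : Timestamp(1370000000, 1),  // BSON timestamp; used to resume reads
          "h"  : NumberLong("8123456789"),  // unique operation id
          "op" : "i",                       // i = insert, u = update, d = delete
          "ns" : "test.people",             // namespace: database.collection
          "o"  : { "_id" : 1, "name" : { "first" : "John", "last" : "Backus" } }
        }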
  • 27. Sharded cluster
  • 28. Automatic discovery of a sharded cluster
  • 29. Example: shard vs. replica set discovery
  • 30. Example: shard discovery (a sketch of the approach follows below)
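    The example code from slides 29-30 is not in the transcript; one workable sketch with the 2.x Java driver is to ask the server whether it is a mongos and, if so, enumerate config.shards to find each shard's replica set (connection details here are illustrative):

        import com.mongodb.*;

        MongoClient client = new MongoClient("localhost", 27017);
        // A mongos answers the isMaster command with msg: "isdbgrid".
        CommandResult isMaster = client.getDB("admin").command("isMaster");
        boolean sharded = "isdbgrid".equals(isMaster.getString("msg"));

        if (sharded) {
            // Each config.shards document holds "rsName/host1:port,host2:port".
            DBCursor shards = client.getDB("config").getCollection("shards").find();
            while (shards.hasNext()) {
                DBObject shard = shards.next();
                System.out.println(shard.get("_id") + " -> " + shard.get("host"));
                // connect to each shard's replica set and tail its oplog
            }
        }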
  • 31. Spout: reading data from the oplog. How do we read data continuously from the oplog?
  • 32. Spout: reading data from the oplog. How do we read data continuously from the oplog? Use a tailable cursor.
  • 33. Example: a tailable cursor works like tail -f (sketched below).
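    The slide's code is not in the transcript; a sketch of tailing the oplog with the 2.x Java driver, resuming after the last processed timestamp (loadLastProcessedTs is a hypothetical helper backed by the spout's progress store):

        import com.mongodb.*;
        import org.bson.types.BSONTimestamp;

        DBCollection oplog = client.getDB("local").getCollection("oplog.rs");
        BSONTimestamp lastTs = loadLastProcessedTs();
        DBObject query = new BasicDBObject("ts", new BasicDBObject("$gt", lastTs));

        DBCursor cursor = oplog.find(query)
                .addOption(Bytes.QUERYOPTION_TAILABLE)    // keep the cursor open, like tail -f
                .addOption(Bytes.QUERYOPTION_AWAITDATA);  // block briefly waiting for new entries

        while (cursor.hasNext()) {
            DBObject entry = cursor.next();
            lastTs = (BSONTimestamp) entry.get("ts");     // record progress for restarts
            // hand the entry to the topology ...
        }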
  • 34. Manage timestamps: use the ts field (the timestamp in each oplog entry) to track processed records. If the system restarts, resume from the recorded ts.
  • 35. Spout: reading from the oplog
  • 36. Spout: code example (the slide's code is not in the transcript; a sketch follows below)
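    A minimal sketch of what the MongoOpLogSpout can look like, assuming a background thread tails the oplog into an in-memory queue (everything beyond the Storm API is hypothetical; the "document" field matches the parser bolt on slide 44):

        import java.util.Map;
        import java.util.concurrent.LinkedBlockingQueue;
        import backtype.storm.spout.SpoutOutputCollector;
        import backtype.storm.task.TopologyContext;
        import backtype.storm.topology.OutputFieldsDeclarer;
        import backtype.storm.topology.base.BaseRichSpout;
        import backtype.storm.tuple.Fields;
        import backtype.storm.tuple.Values;
        import com.mongodb.DBObject;

        public class MongoOpLogSpout extends BaseRichSpout {
            private SpoutOutputCollector collector;
            private LinkedBlockingQueue<DBObject> queue;

            @Override
            public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
                this.collector = collector;
                this.queue = new LinkedBlockingQueue<DBObject>();
                // start a thread that tails local.oplog.rs and enqueues entries ...
            }

            @Override
            public void nextTuple() {
                DBObject entry = queue.poll();
                if (entry != null) {
                    // anchor on ts so a failed tuple can be replayed
                    collector.emit(new Values(entry), entry.get("ts").toString());
                }
            }

            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("document"));
            }
        }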
  • 37. Topology
  • 38. Working with embedded arrays: an array represents a one-to-many relationship in an RDBMS.
  • 39. Example: working with embedded arrays
  • 40. Example: working with embedded arrays. The awards array becomes two documents:
        { _id: 1, ns: "person_awards",
          o: { award: "National Medal of Science", year: 1975, by: "National Science Foundation" } }
        { _id: 1, ns: "person_awards",
          o: { award: "Turing Award", year: 1977, by: "ACM" } }
  • 41. Example: working with embedded arrays
        public void execute(Tuple tuple) {
            .........
            if (field instanceof BasicDBList) {
                BasicDBObject arrayElement = processArray(field);
                ......
                outputCollector.emit("documents", tuple, arrayElement);
  • 42. Parse documents with a bolt
  • 43. Parse documents with a bolt. The oplog document
        { "ns": "people", "op": "i",
          "o": { "_id": 1, "name": { "first": "John", "last": "Backus" }, "birth": "Dec 03, 1924" } }
        becomes the flattened tuple
        [ "ns": "people", "op": "i", "_id": 1, "name_first": "John", "name_last": "Backus", "birth": "Dec 03, 1924" ]
  • 44. Parse documents with a bolt:
        @Override
        public void execute(Tuple tuple) {
            ......
            final BasicDBObject oplogObject = (BasicDBObject) tuple.getValueByField("document");
            final BasicDBObject document = (BasicDBObject) oplogObject.get("o");
            ......
            outputValues.add(flattenDocument(document));
            outputCollector.emit(tuple, outputValues);
  • 45. Write to SQL with the SQLWriterBolt
  • 46. Write to SQL with the SQLWriterBolt. The tuple
        [ "ns": "people", "op": "i", "_id": 1, "name_first": "John", "name_last": "Backus", "birth": "Dec 03, 1924" ]
        becomes
        insert into people (_id, name_first, name_last, birth)
            values (1, 'John', 'Backus', 'Dec 03, 1924');
        insert into people_awards (_id, awards_award, awards_year, awards_by)
            values (1, 'Turing Award', 1977, 'ACM');
        insert into people_awards (_id, awards_award, awards_year, awards_by)
            values (1, 'National Medal of Science', 1975, 'National Science Foundation');
  • 47. Write to SQL with the SQLWriterBolt:
        @Override
        public void prepare(.....) {
            ....
            Class.forName("com.vertica.jdbc.Driver");
            con = DriverManager.getConnection(dBUrl, username, password);

        @Override
        public void execute(Tuple tuple) {
            String insertStatement = createInsertStatement(tuple);
            try {
                Statement stmt = con.createStatement();
                stmt.execute(insertStatement);
                stmt.close();
  • 48. Topology definition (local mode):
        TopologyBuilder builder = new TopologyBuilder();
        // define our spout
        builder.setSpout(spoutId, new MongoOpLogSpout("mongodb://", opslog_progress));
        builder.setBolt(arrayExtractorId, new ArrayFieldExtractorBolt(), 5).shuffleGrouping(spoutId);
        builder.setBolt(mongoDocParserId, new MongoDocumentParserBolt()).shuffleGrouping(arrayExtractorId, documentsStreamId);
        builder.setBolt(sqlWriterId, new SQLWriterBolt(rdbmsUrl, rdbmsUserName, rdbmsPassword)).shuffleGrouping(mongoDocParserId);
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("test", conf, builder.createTopology());
  • 51. Topology definition (production mode): the same spout and bolt wiring as slide 48, but submitted to a real cluster:
        StormSubmitter.submitTopology("OfflineEventProcess", conf, builder.createTopology());
  • 52. Lessons learned: by leveraging the MongoDB oplog (or another capped collection), tailable cursors, and the Storm framework, you can build a fast, scalable, real-time data-processing pipeline.
  • 53. Resources:
        Book: Getting Started with Storm
        Storm project wiki
        Storm starter project
        Storm contributions project
        Running a Multi-Node Storm Cluster tutorial
        Implementing Real-Time Trending Topics
        A Hadoop Alternative: Building a Real-Time Data Pipeline with Storm
        Storm use cases
  • 54. Resources (cont'd):
        Understanding the Parallelism of a Storm Topology
        Trident: a high-level Storm abstraction
        A Practical Storm's Trident API
        Storm online forum
        Mongo Connector from 10gen Labs
        MoSQL streaming translator in Ruby
        Project source code
        New York City Storm Meetup
  • 55. Questions? Eugene Dvorkin, Architect, WebMD. edvorkin@webmd.net. Twitter: @edvorkin. LinkedIn: eugenedvorkin