Hands on Big Data Analysis with MongoDB - Cloud Expo Bootcamp NYC

One of the most popular NoSQL databases, MongoDB is one of the building blocks for big data analysis. MongoDB can store unstructured data and makes it easy to analyze files with commonly available tools. This session covers how big data analytics can improve sales outcomes by processing information from social networks to identify users with a propensity to buy. All attendees will have a MongoDB instance on a public cloud, plus sample code to run big data analytics.

Published in: Technology, Business


  • 1. We offer MongoDB-as-a-Service on any cloud of your choice. You can read more about our MongoDB-as-a-Service in our white paper on our website:
  • 2. The goal of this boot camp is to give you hands-on experience with MongoDB database-as-a-service: how to load data, and a sample application that analyzes it. We will use a small sample Twitter application for our hands-on lab, which will help you write a MongoDB application. We will also briefly discuss a few performance-related topics so you can analyze and tune the performance of your databases. Along the way, you will see how easily you can launch a fully managed MongoDB instance in the cloud.
  • 3. About a decade ago, business applications were transactional in nature, and most issues were related to executing transactions (e.g., credit card processing) with low latency. As a result, enterprise data was more “relational” in nature and was therefore “structured”.
    The nature of business applications has changed. Enterprises are trying to figure out how to use all the data in their enterprise systems, social media, machine logs, etc. to understand how that data impacts their business and how they can gain a competitive advantage by leveraging the nuggets in it.
    Fast forward to today, and businesses are trying to solve a different problem. With the diverse nature of data sources and data formats, we need newer technologies that scale and that provide answers, or identify those nuggets in the data, faster and at lower cost than traditional SQL database or data warehouse systems. Hence we see a slew of new database technologies that promise to help solve these problems.
    Depending on the nature of the data or the problem they solve, we can group these new database technologies into three major categories: (1) document-oriented databases, which store and crunch data in document formats, (2) key-value databases such as Riak and Redis, and (3) graph databases. Depending on the type of data, you could use one of these to solve your data analytics problems. Today, we focus on MongoDB.
  • 4. When should we use a NoSQL database vs. a SQL database, and which NoSQL database?
    As I mentioned before, the problems that NoSQL databases solve are related to the nature and amount of data we want to process in our next-generation applications. We need databases that can scale to petabytes of data at a fraction of the cost of a relational database. We need database systems that can help us quickly analyze petabytes of data and provide results in real time, so the speed of data access is critical.
    NoSQL database systems can provide high-speed, low-latency access to large amounts of data. One key criterion when choosing a NoSQL database is the nature of your applications and the main issues with them: are they operational or analytical? For example, for batch-processing analytical apps you may be better off with Hadoop, while for operational issues of scalability and real-time processing you may want to choose MongoDB. So consider these criteria in making your decisions, run some experiments, and find the one that best fits your application's needs.
  • 5. Let’s take a look at the key feature set of MongoDB at a very high level.
    1. MongoDB is a document-oriented database server. It stores objects as BSON (pronounced “bee-son”), a binary version of the JSON format, and it supports dynamic schemas, which essentially means it is a schema-less database. There is no rigid SQL-like schema for storing the data. This gives flexibility in choosing the data types from different data sources such as social networks, machine logs, or CRM systems.
    2. MongoDB supports indexing much like traditional SQL indexing, which means you can index data on any field with high fidelity to improve query performance. (FYI: high fidelity here means a field whose value varies across records, often called high cardinality. For example, if we are storing data about employees, the field that varies most is the phone number, not the city name or company name.)
    3. Most of you may be familiar with the concept of database sharding. MongoDB is a horizontally scalable database and supports sharding, which means it stores data in smaller chunks on several data nodes for low-latency access to the data. Hence MongoDB is widely used in the cloud: you can scale the database by adding shards as your data grows and still maintain low-latency data access.
    4. MongoDB is designed to be resilient for data durability and supports replica sets, which can be geographically distributed.
    5. MongoDB supports map-reduce operations and provides fast updates to the data.
    FAQs:
    Question: When do you want to use Hadoop vs. MongoDB for map-reduce?
    Answer: Use Hadoop for batch jobs, where you can fire up analytics on offline data; use MongoDB for real-time data analytics.
    Question: How does sharding work in MongoDB?
    Answer: MongoDB sharding works by spreading writes across multiple data nodes. mongos, the MongoDB routing process, directs each read or write to the appropriate data node. (Show the slide and refer to the sharding diagram.)
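The indexing note on slide 5 (prefer fields whose values vary across records) can be checked on a sample of documents before choosing an index. The following is a minimal sketch in plain Python, not MongoDB code: the `field_cardinality` helper and the employee records are hypothetical and only illustrate the idea, and the documents deliberately differ in shape to mirror a dynamic schema.

```python
from typing import Any, Hashable, Iterable, Mapping


def field_cardinality(docs: Iterable[Mapping[str, Any]], field: str) -> float:
    """Return (distinct values) / (documents that have the field), or 0.0 if none do.

    Assumes the field's values are hashable (strings, numbers, ...).
    """
    values: list[Hashable] = [doc[field] for doc in docs if field in doc]
    if not values:
        return 0.0
    return len(set(values)) / len(values)


# Hypothetical employee documents; one has an extra field (dynamic schema).
employees = [
    {"name": "Ann", "phone": "212-555-0101", "city": "New York", "company": "Acme"},
    {"name": "Bob", "phone": "212-555-0102", "city": "New York", "company": "Acme"},
    {"name": "Cy", "phone": "212-555-0103", "city": "Boston", "company": "Acme", "twitter": "@cy"},
    {"name": "Di", "phone": "212-555-0104", "city": "New York", "company": "Acme"},
]

# Rank candidate fields from most to least varied.
ranked = sorted(
    ["phone", "city", "company"],
    key=lambda f: field_cardinality(employees, f),
    reverse=True,
)
print(ranked)
```

In this sample the phone number varies in every record while the company never does, which matches the slide's point that the phone number is the better index candidate.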
  • 6. Since MongoDB scales very well horizontally, it is widely used in the cloud. Given the complexity of managing MongoDB to maintain availability, data durability, and performance, you may want to leverage platforms that provide MongoDB-as-a-Service: a web service call provisions a dedicated MongoDB server, fully sharded and replicated, that scales automatically. You will get a chance to use the MongoDB service on our platform shortly.
  • 7. The specific MongoDB architecture you choose will impact performance, availability, and data durability. MongoDB is flexible and supports high-availability and sharding architectures to provide the level of redundancy, performance, and SLA you want for your service.
    MongoDB supports replica set and sharded deployment architectures. Replica sets provide high availability and data durability, while sharding provides scalability. You can run each shard as a replica set to achieve the best of both: reliability and scalability.
  • 8. This is a replica set with three replica nodes in two datacenters or two regions of a public cloud.
    MongoDB replicates to secondaries with “eventual consistency”, which means data on the replicas may be out of sync with the primary node. You may want to use this architecture for data redundancy rather than scaling. In this architecture you still send reads and writes to the primary node, which means that even with multiple nodes your application won’t necessarily scale better. To maintain this level of redundancy yet improve scalability, you can use sharding, as on the next slide.
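To make the “eventual consistency” point on slide 8 concrete, here is a toy model in plain Python. It is not how MongoDB implements replication: `ToyReplicaSet` and its methods are hypothetical names, and the model only assumes what the slide states, namely that writes land on the primary first and reach a replica later.

```python
from collections import deque


class ToyReplicaSet:
    """Toy model: one primary, one secondary, and a queue of pending changes."""

    def __init__(self) -> None:
        self.primary: dict = {}
        self.secondary: dict = {}
        self._pending: deque = deque()  # changes not yet applied to the secondary

    def write(self, key, value) -> None:
        # Writes always go to the primary; the secondary sees them later.
        self.primary[key] = value
        self._pending.append((key, value))

    def replicate(self) -> int:
        # Apply all pending changes to the secondary; return how many were applied.
        applied = 0
        while self._pending:
            key, value = self._pending.popleft()
            self.secondary[key] = value
            applied += 1
        return applied

    def read(self, key, from_secondary: bool = False):
        node = self.secondary if from_secondary else self.primary
        return node.get(key)


rs = ToyReplicaSet()
rs.write("user:1", {"followers": 10})
stale = rs.read("user:1", from_secondary=True)      # secondary has not caught up yet
fresh = rs.read("user:1")                           # primary already has the write
rs.replicate()
caught_up = rs.read("user:1", from_secondary=True)  # now in sync with the primary
```

Reading from the primary always returns the latest write in this model, which is why the slide recommends this layout for redundancy rather than for scaling reads.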
  • 9. This is a three-shard deployment architecture that uses three replica sets and can be in a single region or datacenter or distributed geographically.
    With this architecture you get the benefit of both: data redundancy from replica sets and high scalability from shards. Each shard can itself be a replica set, which provides data redundancy for each shard. But keep in mind that there is an overhead to sharding and replication, and you should choose what’s best for your database.
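The idea of spreading data across three shards can be illustrated with a simplified hash-based router. This is a sketch under stated assumptions, not MongoDB's algorithm: MongoDB assigns ranges of shard-key values (chunks) to shards rather than using a modulo, and `shard_for` and `distribute` are hypothetical helper names.

```python
import hashlib


def shard_for(shard_key: str, num_shards: int) -> int:
    """Map a shard key to a shard index using a stable hash (toy routing rule)."""
    digest = hashlib.sha256(shard_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def distribute(keys, num_shards: int):
    """Group keys by the shard they would be routed to."""
    shards = [[] for _ in range(num_shards)]
    for key in keys:
        shards[shard_for(key, num_shards)].append(key)
    return shards


# Hypothetical user ids routed across the three shards from the slide.
keys = [f"user{i}" for i in range(3000)]
shards = distribute(keys, 3)
sizes = [len(s) for s in shards]
print(sizes)
```

Because the hash is stable, the same key always routes to the same shard, and the keys end up spread roughly evenly, so each shard handles about a third of the writes.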
  • 10. Now let’s take a look at a sample application. We have a sample Twitter app for hands-on experimentation. We will use MongoDB-as-a-Service in the cloud and use the sample app to analyze Twitter data.
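The lab's sample app is not reproduced in these notes. As a stand-in, the sketch below shows the kind of map-reduce analysis one might run over tweet documents, written in plain Python so it runs without a database. The tweet documents, field names, and function names are all hypothetical.

```python
import re
from collections import defaultdict

HASHTAG = re.compile(r"#(\w+)")


def map_tweet(tweet):
    """Map step: emit (hashtag, 1) for every hashtag in the tweet's text."""
    for tag in HASHTAG.findall(tweet.get("text", "")):
        yield tag.lower(), 1


def reduce_counts(key, values):
    """Reduce step: sum the counts emitted for one hashtag."""
    return sum(values)


def map_reduce(tweets):
    """Group mapped values by key, then reduce each group."""
    grouped = defaultdict(list)
    for tweet in tweets:
        for key, value in map_tweet(tweet):
            grouped[key].append(value)
    return {key: reduce_counts(key, values) for key, values in grouped.items()}


# Hypothetical tweet documents; the last one has no "text" field at all.
tweets = [
    {"user": "a", "text": "Loving #MongoDB at #CloudExpo"},
    {"user": "b", "text": "#mongodb sharding demo"},
    {"user": "c", "text": "no tags here"},
    {"user": "d", "lang": "en"},
]
counts = map_reduce(tweets)
print(counts)
```

The map step tolerates documents with missing fields, which is typical when the collection has no fixed schema.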
  • 11. Just like any database, MongoDB’s performance must be monitored and optimized for a given workload or application type.
    These are the key metrics to watch in MongoDB: (1) CPU; (2) memory; (3) ops counters: the total number of operations over a period of time, which shows you the number of active and pending operations; (4) background flush: the number of disk writes made when MongoDB flushes in-memory data to disk. Keep an eye on this number and tune it if you wish to reduce the frequency of disk writes. There are other metrics that we will see during our hands-on lab.
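Ops counters are cumulative totals, so a monitoring script usually turns two readings into a per-second rate. The sketch below assumes the counters were already fetched, for example from the `opcounters` section of MongoDB's `serverStatus` output. The sample numbers are made up and `ops_per_second` is a hypothetical helper.

```python
def ops_per_second(before: dict, after: dict, interval_seconds: float) -> dict:
    """Convert two cumulative counter snapshots into per-second rates."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return {
        op: (after[op] - before.get(op, 0)) / interval_seconds
        for op in after
    }


# Made-up counter snapshots taken 60 seconds apart.
sample_t0 = {"insert": 1000, "query": 5000, "update": 200, "delete": 10}
sample_t1 = {"insert": 1600, "query": 8000, "update": 260, "delete": 10}

rates = ops_per_second(sample_t0, sample_t1, 60)
print(rates)
```

A negative rate in this calculation would suggest the counters were reset between readings (for instance after a restart), so a real script should handle that case explicitly.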