For some, Hadoop is synonymous with “Big Data,” but Hadoop is just one component of a successful Big Data architecture. Depending on one’s application, it may not even be the most important part.
NoSQL solutions like MongoDB also play a dominant role in storage and real-time data processing, helping companies keep pace with the scale of their data requirements. But NoSQL figures even more prominently in helping enterprises consume a wide variety of data sources at speeds not currently possible in Hadoop. NoSQL, then, offers a useful complement both to Hadoop and to the transaction-based data of traditional RDBMSs.
Tackling Big Data is not a one-tool job, so orchestrating the appropriate NoSQL database with Hadoop and an RDBMS is essential. In this session, we’ll dig deep into the different types of NoSQL, identifying how they differ and the types of Big Data workloads for which they’re best suited. We’ll also explore the trade-offs one makes in choosing NoSQL databases like MongoDB or Neo4j over an RDBMS like MySQL, and identify when it makes sense to pair NoSQL with Hadoop and when it’s more appropriate to use NoSQL on its own.
3. Top Big Data Challenges?
Translation? Most struggle to know what Big Data is, how to manage it, and who can manage it.
Source: Gartner
4. Understanding Big Data – It’s Not Very “Big”
• 64% - Ingest diverse, new data in real-time
• 15% - More than 100TB of data
• 20% - Less than 100TB (average of all: <20TB)
Source: Big Data Executive Summary – 50+ top executives from government and F500 firms
16. RDBMS Is Expensive To Scale
“Clients can also opt to run zEC12 without a raised datacenter floor -- a first for high-end IBM mainframes.” (IBM Press Release, 28 Aug 2012)
21. Ask the Right Questions…
“Organizations already have people who know their own data better than mystical data scientists…. Learning Hadoop [or MongoDB] is easier than learning the company’s business.” (Gartner, 2012)
25. Enterprise Big Data Stack
[Diagram: the enterprise Big Data stack. Applications (CRM, ERP, Collaboration, Mobile, BI) sit atop the Data Management layer, which splits into Online Data (RDBMS) and Offline Data (Hadoop, EDW), all running on Infrastructure (OS & Virtualization, Compute, Storage, Network), with Security & Auditing and Management & Monitoring spanning every layer.]
26. Consideration – Online vs. Offline
Online:
• Real-time
• Low-latency
• High availability

Offline:
• Long-running
• High-latency
• Availability is a lower priority
28. Hadoop Is Good for…
• Risk Modeling
• Recommendation Engine
• Ad Targeting
• Transaction Analysis
• Trade Surveillance
• Network Failure Prediction
• Churn Analysis
• Search Quality
• Data Lake
29. MongoDB/NoSQL Is Good for…
• 360° View of the Customer
• Fraud Detection
• User Data Management
• Content Management & Delivery
• Reference Data
• Product Catalogs
• Mobile & Social Apps
• Machine-to-Machine Apps
• Data Hub
32. Customer example: Online Travel
Travel (MongoDB):
• Flights, hotels and cars
• Real-time offers
• User profiles, reviews
• User metadata (previous purchases, clicks, views)

Algorithms (Hadoop, linked via the MongoDB Connector for Hadoop):
• User segmentation
• Offer recommendation engine
• Ad serving engine
• Bundling engine
33. Predictive Analytics
Government (MongoDB):
• Predictive analytics system for crime, health issues
• Diverse, unstructured (incl. geospatial) data from 30+ agencies
• Correlate data in real-time

Algorithms (Hadoop):
• Long-form trend analysis
• MongoDB data dumped into Hadoop, analyzed, and re-inserted into MongoDB for better real-time response
36. MongoDB + Hadoop Connector
• Makes MongoDB a Hadoop-enabled file system
• Read and write to live data, in-place
• Copy data between Hadoop and MongoDB
• Full support for data processing
– Hive
– MapReduce
– Pig
– Streaming
– EMR
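To make the bullets above concrete, here is a minimal MapReduce sketch using the mongo-hadoop connector: it reads live documents from one MongoDB collection, counts word frequencies, and writes the results straight back into another collection, with no HDFS copy step. The travel.reviews / travel.review_counts collections and the review_text field are hypothetical, and the sketch assumes the connector jar is on the job classpath.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.bson.BSONObject;
    import org.bson.BasicBSONObject;

    import com.mongodb.hadoop.MongoInputFormat;
    import com.mongodb.hadoop.MongoOutputFormat;
    import com.mongodb.hadoop.io.BSONWritable;
    import com.mongodb.hadoop.util.MongoConfigUtil;

    public class ReviewWordCount {

        // Maps each live MongoDB document to (token, 1) pairs.
        public static class TokenMapper extends Mapper<Object, BSONObject, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(Object key, BSONObject doc, Context ctx)
                    throws IOException, InterruptedException {
                Object text = doc.get("review_text"); // hypothetical field name
                if (text == null) return;
                for (String token : text.toString().toLowerCase().split("\\W+")) {
                    if (!token.isEmpty()) { word.set(token); ctx.write(word, ONE); }
                }
            }
        }

        // Sums the counts and writes a BSON document back to MongoDB.
        public static class SumReducer extends Reducer<Text, IntWritable, Text, BSONWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                ctx.write(key, new BSONWritable(new BasicBSONObject("count", sum)));
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Read from and write to live collections in place -- no copy step.
            MongoConfigUtil.setInputURI(conf, "mongodb://localhost:27017/travel.reviews");
            MongoConfigUtil.setOutputURI(conf, "mongodb://localhost:27017/travel.review_counts");

            Job job = Job.getInstance(conf, "review word count");
            job.setJarByClass(ReviewWordCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setReducerClass(SumReducer.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(IntWritable.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(BSONWritable.class);
            job.setInputFormatClass(MongoInputFormat.class);
            job.setOutputFormatClass(MongoOutputFormat.class);
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The same configuration keys drive the Hive, Pig, and Streaming integrations listed above; only the job wrapper changes.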
Big Data is new, and you’re likely going to fail as you start. It’s almost guaranteed that you won’t know which data to capture, or how to leverage it, without trial and error. So if you were to “design for failure,” what would you need? You’d need to reduce the cost of failure, in both time and money. You’d need to build on data infrastructure that supports your iterations toward success and then rewards you by making it easy and cost-effective to scale.
IBM designed IMS with Rockwell and Caterpillar starting in 1966 for the Apollo program. IMS's challenge was to inventory the very large bill of materials (BOM) for the Saturn V moon rocket and Apollo space vehicle.
[Image: loading a paper tape reader on the KDF9 computer.]
This is helpful because as much as 95% of enterprise information is unstructured and doesn’t fit neatly into tidy rows and columns. NoSQL and Hadoop allow for dynamic schemas.
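As a minimal sketch of what a dynamic schema means in practice, using the modern MongoDB Java driver (the demo database, events collection, and all field names are invented for illustration): two very differently shaped records can live side by side in the same collection, each carrying only the fields that make sense for it.

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import org.bson.Document;

    public class DynamicSchemaDemo {
        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoCollection<Document> events =
                        client.getDatabase("demo").getCollection("events");

                // A web click event: small, flat document.
                events.insertOne(new Document("type", "click")
                        .append("userId", 42)
                        .append("button", "watch-movie"));

                // A support email: different fields entirely, no schema migration needed.
                events.insertOne(new Document("type", "email")
                        .append("from", "customer@example.com")
                        .append("subject", "Billing question")
                        .append("attachments", java.util.List.of("invoice.pdf")));
            }
        }
    }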
The industry is talking about Hadoop and MongoDB for Big Data. So should you.
Why not HBase?
• MongoDB is dramatically more popular
• Much easier to use
• Works from small scale to large scale
• Far closer to the functionality available in an RDBMS, including geospatial queries, secondary indexes, and text search
• Drivers for 40+ languages mean you can work in your preferred programming language
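To make the functionality point concrete, here is a short sketch with the MongoDB Java driver showing all three index types on a hypothetical hotels collection (the collection, fields, and coordinates are invented; note MongoDB does not allow combining $text and $near in one query, so the lookups run separately):

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.model.Filters;
    import com.mongodb.client.model.Indexes;
    import com.mongodb.client.model.geojson.Point;
    import com.mongodb.client.model.geojson.Position;
    import org.bson.Document;

    public class IndexDemo {
        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoCollection<Document> hotels =
                        client.getDatabase("travel").getCollection("hotels");

                hotels.createIndex(Indexes.geo2dsphere("location")); // geospatial
                hotels.createIndex(Indexes.text("description"));     // full-text search
                hotels.createIndex(Indexes.ascending("stars"));      // plain secondary index

                // Hotels within 5 km of a point (longitude, latitude):
                Point here = new Point(new Position(-73.98, 40.75));
                hotels.find(Filters.near("location", here, 5000.0, 0.0))
                      .forEach(d -> System.out.println(d.toJson()));

                // Hotels whose description mentions "spa":
                hotels.find(Filters.text("spa"))
                      .forEach(d -> System.out.println(d.toJson()));
            }
        }
    }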
The industry is not betting on RDBMS for Big Data. Neither should you.
This is where MongoDB fits into the existing enterprise IT stack. MongoDB is an operational data store used for online data, in the same way that Oracle is an operational data store: it supports applications that ingest, store, manage, and even analyze data in real time. (Compare Hadoop and data warehouses, which are used for offline, batch analytical workloads.)
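For instance, a real-time rollup can run directly against the operational store with the aggregation framework, with no batch export. A hedged sketch with the MongoDB Java driver; the shop database, orders collection, and field names are invented:

    import java.util.Arrays;

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.model.Accumulators;
    import com.mongodb.client.model.Aggregates;
    import com.mongodb.client.model.Filters;
    import org.bson.Document;

    public class RealtimeRollup {
        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoCollection<Document> orders =
                        client.getDatabase("shop").getCollection("orders");

                // Revenue per product over live operational data.
                orders.aggregate(Arrays.asList(
                        Aggregates.match(Filters.eq("status", "paid")),
                        Aggregates.group("$productId", Accumulators.sum("revenue", "$amount"))
                )).forEach(doc -> System.out.println(doc.toJson()));
            }
        }
    }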
So not everyone would agree with the term “offline” Big Data.
What each of these has in common is that they’re retrospective: they’re about looking at the past to help predict the future. The insights from these Hadoop applications end up being applied by a different technology. This is where MongoDB comes in.
Marketing has long broken people down into segments (Hadoop: the user base as a whole), while new marketing needs to focus on individuals (the user base as a single user), as Criteo does. If you only optimize on aggregate data, you miss out on personalization; if you only look at the individual, you miss patterns across your entire user base. You want to do both, but the systems needed to do this (on a single data set) tend to be architecturally at odds with each other. It is one dataset, yet it lives in two different databases, each taking care of different processing needs. For Hadoop’s column-level processing to be useful, you want lots of columns, so your real-time data store needs to be very rich or you’ll lose information; you don’t want to simplify that data from the start by putting it into an RDBMS or key-value store. As a specific example, people talk about using Hadoop for log file analysis. Those log files lack context coming out of your web server: an IP address and a timestamp, but no real sense of what the entry means (say, that it records a click on the “watch movie” button). You can capture a much richer version of that interaction by keeping it in a document database that describes what is actually happening behind that log line.
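A hedged sketch of that contrast (the log line and every field name are invented for illustration): the flat log line records who and when, while the document preserves what the interaction actually meant, for both real-time serving and later Hadoop analysis.

    import java.util.Date;
    import org.bson.Document;

    public class RichEvent {
        public static void main(String[] args) {
            // Flat web-server log line: who and when, but not what it meant.
            String logLine =
                "203.0.113.7 - - [28/Aug/2013:10:14:55] \"GET /play?id=831 HTTP/1.1\" 200";

            // The same interaction as a rich document: the button, the movie,
            // and where in the UI it happened are all preserved.
            Document event = new Document("type", "click")
                    .append("button", "watch-movie")
                    .append("movieId", 831)
                    .append("userId", 42)
                    .append("ip", "203.0.113.7")
                    .append("ts", new Date())
                    .append("ui", new Document("page", "home")
                            .append("section", "recommended"));
            System.out.println(event.toJson());
        }
    }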