Your Big Data Arsenal - Strata 2013


Matt Asay presents at Strata 2013 on how NoSQL fits into the Big Data landscape, particularly how MongoDB and Hadoop work well together. Not an infomercial.

Published in: Technology
  • Big Data is new, and you’re likely to fail as you start. It’s almost guaranteed that you won’t know which data to capture, or how to leverage it, without trial and error. So if you were to “design for failure,” what key things would you need? You need to reduce the cost of failure, in both time and money. You need to build on data infrastructure that supports your iterations toward success and then rewards you by making it easy and cost-effective to scale.
  • IBM designed IMS with Rockwell and Caterpillar starting in 1966 for the Apollo program. IMS's challenge was to inventory the very large bill of materials (BOM) for the Saturn V moon rocket and Apollo space vehicle.
  • Loading a paper tape reader on the KDF9 computer.
  • This is helpful because as much as 95% of enterprise information is unstructured, and doesn’t fit neatly into tidy rows and columns. NoSQL and Hadoop allow for dynamic schema.
  • The industry is talking about Hadoop and MongoDB for Big Data. So should you.
  • Why not HBase? MongoDB is dramatically more popular. It is much easier to use. It works from small scale to large scale. It is far closer to the functionality available in an RDBMS, including geospatial, secondary indexes, and text search. 40+ languages mean you can work in your preferred programming language.
  • The industry is not betting on RDBMS for Big Data. Neither should you.
  • This is where MongoDB fits into the existing enterprise IT stack. MongoDB is an operational data store used for online data, in the same way that Oracle is an operational data store. It supports applications that ingest, store, manage and even analyze data in real-time. (Compare Hadoop and data warehouses, which are used for offline, batch analytical workloads.)
  • Or. So not everyone would agree with the term “offline big data.”
  • What each of these has in common is that they’re retrospective: they’re about looking at the past to help predict the future. The learnings from these Hadoop applications end up being applied by a different technology. This is where MongoDB comes in.
  • Marketing has been breaking people down into segments (Hadoop - user base as a whole) for a long time, while new marketing needs to focus on individuals (user base as a user). Criteo. If you're only optimizing on the aggregate data, you're missing out on the personalization, but if you only do the individual, you're missing out on patterns across your entire user base. You want to do both, but the systems needed to do this (on a single data set) tend to be architecturally at odds with each other. One dataset. That system lives in two different databases, which take care of the different processing needs on the data. In order for Hadoop/column-level processing to be useful, you want to have lots of columns. You need your real-time data store to be very rich, or you'll lose information. You don't want to be simplifying that data from the start by putting it into an RDBMS or key-value store. As a specific example, people talk about Hadoop being used for log file analysis. Those log files lack a lot of context coming out of your web server: they have the IP address, time stamp, etc., but no real context as to what the log entries mean (“this is the watch movie button”). You can have a much richer version of that interaction by keeping that data in a document database that describes what is actually happening in that log.
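
The log-file point in the last note can be made concrete with a minimal Python sketch. It is an illustration under stated assumptions, not anything shown in the talk: the log line, the field names (`action`, `button`, `movie`, `device`) and the helper functions are all hypothetical, and plain dictionaries stand in for documents in a document database. It contrasts what a bare web-server log line yields with what a richer event document supports, for both the aggregate view and the individual view.

```python
import re
from collections import Counter

# Hypothetical raw access-log line: it has an IP, a timestamp and a URL,
# but nothing that says what the request meant to the user.
RAW_LOG_LINE = (
    '203.0.113.7 - - [26/Feb/2013:10:15:32 +0000] '
    '"POST /api/play HTTP/1.1" 200 512'
)

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+)'
)


def parse_log_line(line):
    """Return the only fields a bare access-log line can give us."""
    match = LOG_PATTERN.match(line)
    if match is None:
        raise ValueError("unrecognized log line: %r" % line)
    return match.groupdict()


# Hypothetical rich event documents (plain dicts standing in for documents
# in a document store). Events need not share the same fields.
EVENTS = [
    {"user_id": "u1", "action": "watch_movie", "button": "Watch Movie",
     "movie": {"title": "Movie A", "genre": "drama"}, "device": "tablet"},
    {"user_id": "u2", "action": "watch_movie", "button": "Watch Movie",
     "movie": {"title": "Movie B", "genre": "comedy"}, "device": "phone"},
    {"user_id": "u1", "action": "watch_movie", "button": "Watch Movie",
     "movie": {"title": "Movie C", "genre": "drama"}, "device": "tv"},
    {"user_id": "u2", "action": "search", "query": "space documentaries"},
]


def watches_by_genre(events):
    """Aggregate view: patterns across the whole user base."""
    return Counter(
        e["movie"]["genre"] for e in events if e["action"] == "watch_movie"
    )


def watch_history(events, user_id):
    """Individual view: what one particular user did."""
    return [
        e["movie"]["title"]
        for e in events
        if e["user_id"] == user_id and e["action"] == "watch_movie"
    ]


if __name__ == "__main__":
    print(parse_log_line(RAW_LOG_LINE))
    print(watches_by_genre(EVENTS))
    print(watch_history(EVENTS, "u1"))
```

The parsed log line can answer "which IP hit which URL, and when," while the event documents can answer both "which genres are watched most" and "what did this user watch" from the same data set.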

    1. Essential Tools For Your Big Data Arsenal. Matt Asay (@mjasay), VP, Business Development & Strategy, MongoDB
    2. The Big Data Unknown
    3. Top Big Data Challenges? Translation? Most struggle to know what Big Data is, how to manage it and who can manage it. Source: Gartner
    4. Understanding Big Data – It’s Not Very “Big”. 64% - Ingest diverse, new data in real-time; 15% - More than 100TB of data; 20% - Less than 100TB (average of all? <20TB). From Big Data Executive Summary – 50+ top executives from Government and F500 firms
    5. Innovation As Iteration
    6. “I have not failed. I've just found 10,000 ways that won't work.” ― Thomas A. Edison
    7. Back in 1970…Cars Were Great!
    8. So Were Computers!
    9. Lots of Great Innovations Since 1970
    10. Including the Relational Database
    11. RDBMS Makes Development Hard: Code, DB Schema, Application, XML Config, Object Relational Mapping, Relational Database
    12. And Even Harder To Iterate: New Table, New Column, New Table, Name, Pet, Phone, New Column, 3 months later…, Email
    13. From Complexity to Simplicity: RDBMS, MongoDB { _id : ObjectId("4c4ba5e5e8aabf3"), employee_name: "Dunham, Justin", department : "Marketing", title : "Product Manager, Web", report_up: "Neray, Graham", pay_band: "C", benefits : [ { type : "Health", plan : "PPO Plus" }, { type : "Dental", plan : "Standard" } ] }
    14. So…Use Open Source
    15. Big Data != Big Upfront Payment
    16. RDBMS Is Expensive To Scale. “Clients can also opt to run zEC12 without a raised datacenter floor -- a first for high-end IBM mainframes.” IBM Press Release, 28 Aug, 2012
    17. Spoiled for choice. Database Ranking (rank, database, type, then the two numeric columns as they appear on the slide):
        1. Oracle, Relational DBMS, 1583.84, 54.23
        2. MySQL, Relational DBMS, 1331.34, 25.58
        3. Microsoft SQL Server, Relational DBMS, 1207, -106.78
        4. PostgreSQL, Relational DBMS, 177.01, -5.22
        5. DB2, Relational DBMS, 175.83, 3.58
        6. MongoDB, NoSQL Document Store, 149.48, -2.71
        7. Microsoft Access, Relational DBMS, 142.49, -4.21
        8. SQLite, Relational DBMS, 77.88, -4.9
        9. Sybase, Relational DBMS, 73.66, -1.68
        10. Teradata, Relational DBMS, 54.41, 3.32
    18. Remember the Long Tail?
    19. It Didn’t Work Out So Well
    20. Use Popular, Well-Known Technologies. Source: Silicon Angle, 2012
    21. Ask the Right Questions… “Organizations already have people who know their own data better than mystical data scientists….Learning Hadoop [or MongoDB] is easier than learning the company’s business.” (Gartner, 2012)
    22. Leverage Existing Skills
    23. Search as a Sign?
    24. When To Use Hadoop, NoSQL
    25. Enterprise Big Data Stack. Applications: CRM, ERP, Collaboration, Mobile, BI. Data Management: Online Data (RDBMS, RDBMS); Offline Data (Hadoop, EDW). Infrastructure: OS & Virtualization, Compute, Storage, Network. Security & Auditing. Management & Monitoring.
    26. Consideration – Online vs. Offline. Online: Real-time; Low-latency; High availability. Offline: Long-running; High-latency; Availability is lower priority
    27. Consideration – Online vs. Offline
    28. Hadoop Is Good for…: Risk Modeling, Recommendation Engine, Ad Targeting, Transaction Analysis, Trade Surveillance, Network Failure Prediction, Churn Analysis, Search Quality, Data Lake
    29. MongoDB/NoSQL Is Good for…: 360° View of the Customer, Fraud Detection, User Data Management, Content Management & Delivery, Reference Data, Product Catalogs, Mobile & Social Apps, Machine to Machine Apps, Data Hub
    30. How To Use The Two Together?
    31. Finding Waldo
    32. Customer example: Online Travel. Travel: Flights, hotels and cars; Real-time offers; User profiles, reviews; User metadata (previous purchases, clicks, views). Algorithms: User segmentation; Offer recommendation engine; Ad serving engine; Bundling engine. MongoDB Connector for Hadoop.
    33. Predictive Analytics. Government: Predictive analytics system for crime, health issues; Diverse, unstructured (incl. geospatial) data from 30+ agencies; Correlate data in real-time. Algorithms: Long-form trend analysis; MongoDB data dumped into Hadoop, analyzed, re-inserted into MongoDB for better real-time response. MongoDB + Hadoop.
    34. Data Hub. Insurance: Insurance policies; Demographic data; Customer web data; Call center data; Real-time churn detection. Churn Analysis: Customer action analysis; Churn prediction algorithms. MongoDB Connector for Hadoop.
    35. Machine Learning. Ad-Serving: Catalogs and products; User profiles; Clicks; Views; Transactions. Algorithms: User segmentation; Recommendation engine; Prediction engine. MongoDB Connector for Hadoop.
    36. MongoDB + Hadoop Connector: Makes MongoDB a Hadoop-enabled file system; Read and write to live data, in-place; Copy data between Hadoop and MongoDB; Full support for data processing (Hive, MapReduce, Pig, Streaming, EMR)
    37. @mjasay
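
The round trip described in the customer examples (operational data dumped into a batch system, analyzed, and re-inserted for better real-time response) can be sketched in a few lines of Python. This is a toy illustration under loud assumptions: it does not use MongoDB, Hadoop, or the MongoDB Connector for Hadoop, and every name in it (`user_profiles`, `export_snapshot`, `segment_for`, the 100.0 threshold) is hypothetical. An in-memory dictionary stands in for the online store, and two small functions stand in for the map and reduce phases of an offline job.

```python
from collections import defaultdict

# In-memory stand-in for the online, operational store (hypothetical data).
user_profiles = {
    "u1": {"_id": "u1", "name": "Ada", "purchases": [120.0, 80.0]},
    "u2": {"_id": "u2", "name": "Ben", "purchases": [15.0]},
    "u3": {"_id": "u3", "name": "Cy", "purchases": []},
}


def export_snapshot(profiles):
    """Step 1: dump a copy of the online data for offline processing."""
    return [dict(doc, purchases=list(doc["purchases"])) for doc in profiles.values()]


def map_phase(docs):
    """Step 2a: emit (user_id, amount) pairs, one per purchase."""
    for doc in docs:
        for amount in doc["purchases"]:
            yield doc["_id"], amount


def reduce_phase(pairs):
    """Step 2b: sum the amounts per user."""
    totals = defaultdict(float)
    for user_id, amount in pairs:
        totals[user_id] += amount
    return dict(totals)


def segment_for(total_spend, threshold=100.0):
    """Hypothetical segmentation rule; the threshold is an assumption."""
    return "high_value" if total_spend >= threshold else "standard"


def write_back(profiles, totals):
    """Step 3: re-insert the batch results so online reads can use them."""
    for user_id, doc in profiles.items():
        total = totals.get(user_id, 0.0)
        doc["total_spend"] = total
        doc["segment"] = segment_for(total)


def run_batch_cycle(profiles):
    """Run one export -> analyze -> write-back cycle."""
    snapshot = export_snapshot(profiles)
    totals = reduce_phase(map_phase(snapshot))
    write_back(profiles, totals)
    return profiles


if __name__ == "__main__":
    run_batch_cycle(user_profiles)
    for doc in user_profiles.values():
        print(doc["_id"], doc["segment"], doc["total_spend"])
```

After one cycle, each profile carries a `segment` field that an online request could read with a single lookup, which is the division of labor the slides describe: the batch side looks across the whole user base, and the operational side serves the individual.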