Big Data Warehousing Meetup: Intro to NoSQL databases


We talked about NoSQL databases: What they are, how they’re used and where they fit in existing enterprise data ecosystems.

Mike O’Brian from 10gen introduced the syntax and usage patterns for a new aggregation system in MongoDB and gave some demonstrations of aggregation using the new system. The new MongoDB aggregation framework makes it simple to do tasks such as counting, averaging, and finding minima or maxima while grouping by keys in a collection, complementing MongoDB’s built-in map/reduce capabilities.
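
As a rough illustration of the grouping tasks described above, the sketch below computes a count, average, minimum and maximum per key in plain Python. The sample documents and the field names `state` and `amount` are assumptions made for this example and do not come from the talk. The `pipeline` value only shows the general shape of a `$group` stage; it is not executed against a server here.

```python
from collections import defaultdict

# Hypothetical sample documents; the field names are illustrative only.
sales = [
    {"state": "NJ", "amount": 20.0},
    {"state": "NJ", "amount": 40.0},
    {"state": "NY", "amount": 15.0},
]

# General shape of an aggregation pipeline with one $group stage.
# Shown for comparison only; nothing below sends it to MongoDB.
pipeline = [
    {"$group": {
        "_id": "$state",
        "count": {"$sum": 1},
        "avg": {"$avg": "$amount"},
        "min": {"$min": "$amount"},
        "max": {"$max": "$amount"},
    }}
]


def group_stats(docs, key, field):
    """Group docs by `key` and compute count/avg/min/max of `field`."""
    buckets = defaultdict(list)
    for doc in docs:
        buckets[doc[key]].append(doc[field])
    return {
        k: {
            "count": len(values),
            "avg": sum(values) / len(values),
            "min": min(values),
            "max": max(values),
        }
        for k, values in buckets.items()
    }


stats = group_stats(sales, "state", "amount")
print(stats)
```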

For more information, visit our website at or email us at

Published in: Technology

  1. Sponsored By:
     Big Data Warehousing Meetup
     Today’s Topic: Introduction to NoSQL with 10gen
  2. WELCOME!
     Joe Caserta, Founder & President, Caserta Concepts
  3. Agenda
     7:00 Networking: Grab a slice of pizza and a drink...
     7:15 Joe Caserta, President, Caserta Concepts; Author, Data Warehouse ETL Toolkit: Welcome; About the Meetup and about Caserta Concepts
     7:30 Elliott Cordo, Principal Consultant, Caserta Concepts: Intro to NoSQL
     7:50 Mike O’Brian, 10gen: MongoDB
     8:10 - 9:00 More Networking: Tell us what you’re up to…
  4. About BDW Meetup
     • Big Data is a complex, rapidly changing landscape
     • We want to share our stories and hear about yours
     • Great networking opportunity for like-minded data nerds
     • Opportunities to collaborate on exciting projects
     • Next BDW Meetup: June 10.
     • Topic: TBD (What would you like to see?) Send ideas to
  5. About Caserta Concepts
     Founded in 2001
     • President: Joe Caserta, industry thought leader, consultant, educator and co-author, The Data Warehouse ETL Toolkit (Wiley, 2004)
     Focused Expertise
     • Big Data Analytics
     • Data Warehousing
     • Business Intelligence
     • Strategic Data Ecosystems
     Industries Served
     • Financial Services
     • Healthcare / Insurance
     • Retail / eCommerce
     • Digital Media / Marketing
     • K-12 / Higher Education
  6. Client Portfolio
     Finance & Insurance; Retail/eCommerce & Manufacturing; Education & Services
  7. Expertise & Offerings
     Strategic Roadmap/Assessment/Consulting; Database; BI/Visualization/Analytics; Master Data Management; Big Data Analytics; Storm
  8. Opportunities
     Does this word cloud excite you? Speak with us about our open positions:
  9. Contacts
     Joe Caserta, President & Founder, Caserta Concepts. P: (855) 755-2246 x227. E: joe@casertaconcepts.com
     Dana Canavan, Director, Sales & Marketing. P: (855) 755-2246 x226. E: dana@casertaconcepts.com
     Elliott Cordo, Principal Consultant, Caserta Concepts. P: (855) 755-2246 x267. E: elliott@casertaconcepts.com
     info@casertaconcepts.com
     1(855)
  10. ANALYZING DATA: INTRO TO NOSQL
      Elliott Cordo, Principal Consultant, Caserta Concepts
  11. Soo.. No More SQL?
      • Relational databases still have their place
        • Flexible/General Purpose
        • Rich Query Syntax
        • Familiar
      • However, there are some interesting alternatives for analytic databases
        • Columnar/Key Value
        • Document
        • Graph
      • PS. Many NoSQL databases have SQL-like interfaces. Think Not Only SQL!
  12. Why are we doing this?
      Not all data is efficiently stored in a relational DB.
      • Sparse Data
      • Data with a lot of variation
      • Relationships -> funny how relational databases are not great at relations
  13. Scale and Performance
      Performance:
      • Relational databases have a lot of features and overhead that we don’t need in many cases. Although we will miss some…
      Scaling:
      • Most relational databases scale vertically, which limits how large they can get. Federation and sharding are awkward manual processes.
      • Most NoSQL databases scale horizontally on commodity hardware
      Note: Graph database architecture lends itself to a single graph existing on one server. Several vendors have overcome this: Titan, InfiniteGraph.
  14. Object Impedance Mismatch
      Relational databases rarely look the way our applications want them to. So much time is spent assembling and disassembling relational data.
      GetSale: Select * From Sales_Header Join Sales_Detail Join Sales_Tender Join User Join Order Type Join Tender Type Join Product Join Channel Join User_Account etc, etc
      CreateSale: Insert into Sales Header; Insert into Sales Detail; Insert/Update User_Account; Insert into Sales Tender; etc, etc
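
The assembling and disassembling described on this slide can be sketched with a much smaller schema than the one listed. The example below is a simplified illustration, not the presenter's code: the two tables, their columns, and the in-memory `document_store` dictionary are all assumptions. It contrasts rebuilding a sale from joined rows with reading it back as a single stored document.

```python
import json
import sqlite3

# Hypothetical two-table schema standing in for the longer join chain.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales_header (sale_id INTEGER PRIMARY KEY, customer TEXT);
    CREATE TABLE sales_detail (sale_id INTEGER, product TEXT, qty INTEGER);
    INSERT INTO sales_header VALUES (1, 'Bobby');
    INSERT INTO sales_detail VALUES (1, 'Chainsaw', 1);
    INSERT INTO sales_detail VALUES (1, 'Hair bows', 2);
""")


def get_sale_relational(sale_id):
    """Join the tables, then reassemble the flat rows into one nested object."""
    rows = conn.execute(
        """SELECT h.sale_id, h.customer, d.product, d.qty
           FROM sales_header h
           JOIN sales_detail d ON d.sale_id = h.sale_id
           WHERE h.sale_id = ?
           ORDER BY d.product""",
        (sale_id,),
    ).fetchall()
    if not rows:
        return None
    sale = {"sale_id": rows[0][0], "customer": rows[0][1], "lines": []}
    for _, _, product, qty in rows:
        sale["lines"].append({"product": product, "qty": qty})
    return sale


# Document style: the whole sale is written and read as one unit.
# A plain dict stands in for a document database here.
document_store = {}


def create_sale_document(sale):
    document_store[sale["sale_id"]] = json.dumps(sale)


def get_sale_document(sale_id):
    return json.loads(document_store[sale_id])


sale = get_sale_relational(1)
create_sale_document(sale)
print(get_sale_document(1))
```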
  15. But what will we sacrifice?
      • NoSQL DBs have fairly simple query languages. Limited support for the following:
        • Joins
        • Aggregation
        • Secondary indexes
      Why? NoSQL databases were born to be high performance
      • Data is stored as it is to be used (tuned to a query) rather than modeled around entities. So a sophisticated query language is not needed.
  16. So what about NoSQL as the Data Warehouse?
      • NoSQL databases are generally not as flexible as relational databases for ad-hoc questions.
      • Secondary indexes provide some flexibility, but the lack of joins requires denormalization
      • Materialized views: Joins and aggregates can be implemented via Map Reduce. Even using our animal friends:
      • However, materializing the world has its drawbacks!
  17. NoSQL can be a good fit for certain analytic applications
      • High volume/low latency analytic environments
      • Queries are largely known and can be precomputed in-stream (via the application itself or Storm) or in batch using Map Reduce
      • Cassandra also has counter functions which can be helpful in pre-computing aggregates.
      • Sweet spot is very high volumes with relatively static analytic requirements.
      [Chart: RDBMS vs. NoSQL; axes: Volume, Query Flexibility]
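
A minimal sketch of in-stream precomputation, assuming the known query is "events per sensor per minute". The class name, the one-minute bucket size and the use of an in-memory `Counter` are all assumptions for illustration; a real deployment would increment a durable counter in the data store rather than a Python object.

```python
from collections import Counter


def minute_bucket(ts_seconds):
    """Round a Unix timestamp down to the start of its minute."""
    return ts_seconds - ts_seconds % 60


class StreamAggregator:
    """Maintains per-sensor, per-minute counts as events arrive."""

    def __init__(self):
        self.counts = Counter()

    def ingest(self, sensor_id, ts_seconds):
        # The aggregate is updated at write time, so reads need no scan.
        self.counts[(sensor_id, minute_bucket(ts_seconds))] += 1

    def count(self, sensor_id, minute_start):
        # Reading a missing key from a Counter yields 0.
        return self.counts[(sensor_id, minute_start)]


agg = StreamAggregator()
for sensor, ts in [("s1", 0), ("s1", 59), ("s1", 60), ("s2", 61)]:
    agg.ingest(sensor, ts)

print(agg.count("s1", 0), agg.count("s1", 60), agg.count("s2", 60))
```

The trade-off is the one the slide names: the answer is cheap to read, but only for questions that were anticipated when the counters were defined.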
  18. Columnar
      • Platforms: Cassandra, HBase
      • Column families are the equivalent of a table in an RDBMS
      • Primary unit of storage is a column; columns are stored contiguously
      Skinny Rows: Most like a relational database, except columns are optional and not stored if omitted:
      Wide Rows: Rows can be billions of columns wide, used for time series, relationships, secondary indexes:
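
The two row shapes can be modeled roughly in plain Python. This is a conceptual sketch only, not how Cassandra or HBase store data internally; the row keys, column names and the `WideRow` class are assumptions for the example. Skinny rows are dictionaries with optional columns, and a wide row keeps its columns sorted by name so a time range can be read as one contiguous slice.

```python
import bisect

# Skinny rows: each row holds only the columns that were actually set.
users = {
    "bobby": {"channel": "Web", "state": "NJ"},
    "susie": {"channel": "In-Store"},  # no "state" column is stored at all
}


class WideRow:
    """One row whose columns are kept sorted by column name."""

    def __init__(self):
        self._names = []   # sorted column names, e.g. timestamps
        self._values = {}  # column name -> value

    def put(self, name, value):
        if name not in self._values:
            bisect.insort(self._names, name)
        self._values[name] = value

    def slice(self, start, end):
        """Return (name, value) pairs with start <= name <= end, in order."""
        lo = bisect.bisect_left(self._names, start)
        hi = bisect.bisect_right(self._names, end)
        return [(n, self._values[n]) for n in self._names[lo:hi]]


# Wide row for a time series: one row per sensor, one column per reading.
readings = WideRow()
for ts, value in [(30, 7.1), (10, 6.8), (20, 7.0), (40, 7.3)]:
    readings.put(ts, value)

print(readings.slice(15, 35))
```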
  19. Document
      • Platforms: MongoDB, CouchDB
      • Collections are the equivalent of a table in an RDBMS
      • Primary unit of storage is a document
      { "User": "Bobby", "Email": , "Channel": "Web", "State": "NJ" }
      { "User": "Susie", "Email": "",
        "PreferredCategories": [
          { "Category": "Fashion", "CategoryAdded": "2012-01-01" },
          { "Category": "Outdoor Equipment", "CategoryAdded": "2013-01-01" } ],
        "Channel": "In-Store" }
  20. Graph
      • Platforms: Neo4j, Titan
      • Relationships are front and center! Relationships can have properties of their own.
      [Diagram: nodes Bobby, Jillian, Frank, Susie, Hair bows, Chainsaw; relationship labels Friends, Likes, Purchased (Date: 2013-02-14, Channel: In-Store), Purchased (Date: 2013-01-31), Likes Profile Date: 2013-01-01. Recommendation: Maybe Jillian wants a Chainsaw too!]
      Gremlin query language:
      • Find all of Frank’s outgoing relationships
      • Find all products related to Jillian
      • Find shortest path from Frank to Susie
      • Cool collaborative filtering functions too!
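
Two of the queries listed on this slide can be sketched in plain Python rather than Gremlin. The edge list below is hypothetical: the diagram's actual connections did not survive in the transcript, so only the node and label names are taken from the slide. The shortest-path function is an ordinary breadth-first search that treats relationships as undirected.

```python
from collections import deque

# Hypothetical edges (source, label, target); the real diagram layout is unknown.
EDGES = [
    ("Frank", "Friends", "Jillian"),
    ("Jillian", "Friends", "Bobby"),
    ("Bobby", "Friends", "Susie"),
    ("Frank", "Purchased", "Chainsaw"),
    ("Susie", "Likes", "Hair bows"),
]


def outgoing(node):
    """All relationships that start at `node`, as (label, target) pairs."""
    return [(label, dst) for src, label, dst in EDGES if src == node]


def shortest_path(start, goal):
    """Breadth-first search over the edges, ignoring direction."""
    neighbors = {}
    for src, _, dst in EDGES:
        neighbors.setdefault(src, []).append(dst)
        neighbors.setdefault(dst, []).append(src)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in neighbors.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no path between the two nodes


print(outgoing("Frank"))
print(shortest_path("Frank", "Susie"))
```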
  21. Our Use Case: High Volume Sensor Analytics
      • Ingestion and analytics of sensor data
      • 6 to 12 BILLION records being ingested daily (average 140k records per second at peak load)!
      • Ingested data must be stored to disk and highly available
      • Pre-defined aggregates and event monitors must be near real-time
      • Ad-hoc query capabilities required on historical data
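
A quick arithmetic check of the volumes on this slide, assuming records arrive evenly over a 24-hour day (an assumption; real sensor traffic is likely bursty). Under that assumption, 6 billion records per day works out to roughly 69k per second and 12 billion to roughly 139k per second, which is in line with the 140k figure quoted for the upper end.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_rate(records_per_day):
    """Average records per second if ingestion were spread evenly over a day."""
    return records_per_day / SECONDS_PER_DAY


low = average_rate(6_000_000_000)
high = average_rate(12_000_000_000)
print(f"{low:,.0f} to {high:,.0f} records per second")
```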
  22. How do we hope to accomplish this?
      [Diagram labels: Sensor Data, Kafka, Storm Cluster, Cassandra Cluster, Hadoop Cluster, Atomic data, Aggregates, Event Monitors, Low Latency Analytics, d3.js Analytics]
      • The Kafka messaging system is used for ingestion
      • Storm is used for real-time ETL and outputs atomic data and derived data needed for analytics
      • Real-time analytics are produced from the aggregated data.
      • Higher latency ad-hoc analytics are done in Hadoop using Pig and Hive
  23. Parting Thought
      Polyglot Persistence – “where any decent sized enterprise will have a variety of different data storage technologies for different kinds of data. There will still be large amounts of it managed in relational stores, but increasingly we’ll be first asking how we want to manipulate the data and only then figuring out what technology is the best bet for it.”
      -- Martin Fowler