Big Data Warehousing Meetup with Riak
 

Elliott Cordo, Principal Consultant at Caserta Concepts, delivered a talk on NoSQL data storage architectures at our most recent Big Data Warehousing Meetup: what they are, how they're used and why you can't ignore them in the context of existing enterprise data ecosystems.

For more information, check out our website at http://www.casertaconcepts.com/.

    Presentation Transcript

    About the BDW Meetup
      • Big Data is a complex, rapidly changing landscape
      • We want to share our stories and hear about yours
      • A great networking opportunity for like-minded data nerds
      • Opportunities to collaborate on exciting projects
      • Founded by Caserta Concepts, a DW, BI & Big Data analytics consultancy
      • Next BDW Meetup: September 23
    About Caserta Concepts
      • Founded in 2001; President: Joe Caserta, industry thought leader, consultant, educator, and co-author of The Data Warehouse ETL Toolkit (Wiley, 2004)
      • Industries served: Financial Services; Healthcare / Insurance; Retail / eCommerce; Digital Media / Marketing; K-12 / Higher Education
      • Focused expertise: Big Data Analytics; Data Warehousing; Business Intelligence; Strategic Data Ecosystems
    Implementation Expertise & Offerings
      • Strategic Roadmap / Assessment / Consulting
      • Database
      • BI / Visualization / Analytics
      • Master Data Management
      • Big Data Analytics (Storm)
    WHY NOSQL MATTERS… FOR ANALYTICS
    Elliott Cordo, Principal Consultant, Caserta Concepts
    NoSQL: So What?
      • NoSQL is one of the most exciting movements in Big Data.
      • NoSQL is changing the way many people think about application development, especially analytic applications.
      • Not all data is efficiently stored or processed in a relational DB:
        • High data volumes
        • Data that does not fit, or does not require, the relational model
      • We have new tools in our arsenal for processing, storing, and analyzing data under these new challenges.
    But We Love SQL
      • Relational databases still have their place:
        • Flexible
        • Rich query syntax
        • THEY HAVE JOINS AND AGGREGATION!
      • The relational DB is great at being general purpose: you build a nicely normalized structure, establish the logical relationships, and then you can build any query your application needs.
      • This has kept us happy in the data warehousing (and app-dev) world for decades!
    Scale and Performance
      • Performance: relational databases carry many features, and much overhead, that we often don't need.
      • Scale-out: most relational databases scale vertically, which limits how large they can get; federation and sharding is an awkward, manual process.
      • Most NoSQL databases scale horizontally on commodity hardware.
    But What Will We Sacrifice?
      • Query features: NoSQL DBs have fairly simple query languages, with limited or no support for the following outside of MapReduce:
        • Joins
        • Aggregation
      • Why? NoSQL databases were born to be high performance. Data is stored as it is to be used (tuned to a query) rather than modeled around entities, so a sophisticated query language is not needed.
      • BI and ETL tool support is limited.
    So What About NoSQL for Analytics?
      • NoSQL databases are generally not as flexible as relational databases for ad-hoc questions.
      • Secondary indexes provide some flexibility, but the lack of joins generally requires denormalization.
      • Materialized views: joins and aggregates can be implemented via MapReduce; however, materializing the world has its drawbacks!
      • A different way of doing things:
        • Client-side join: query the "dimensions" to get a key, then query the "fact" on a secondary index.
        • Link walking: leverage metadata links between entities (Riak!).
        • Ad-hoc MapReduce jobs.
        • Aggregate navigation: navigate between aggregate entities of different grain.
        • Search!
      • Native BI tool support is limited.
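The client-side join mentioned above can be sketched in a few lines. This is a hypothetical illustration using plain Python dicts as stand-ins for two key-value "buckets" (a real store such as Riak would expose get and secondary-index lookups instead); the bucket names and fields are invented for the example.

```python
# "Dimension" bucket: customer attributes keyed by customer id.
customers = {
    "c1": {"name": "Bobby", "state": "NJ"},
    "c2": {"name": "Susie", "state": "NY"},
}

# "Fact" bucket: order records keyed by order id.
orders = {
    "o1": {"customer_id": "c1", "amount": 120.0},
    "o2": {"customer_id": "c1", "amount": 35.5},
    "o3": {"customer_id": "c2", "amount": 80.0},
}

def orders_for_state(state):
    """Client-side join: find the dimension keys first, then scan the
    fact bucket on the foreign key (a secondary index in a real store)."""
    keys = {cid for cid, c in customers.items() if c["state"] == state}
    return [o for o in orders.values() if o["customer_id"] in keys]

print(orders_for_state("NJ"))  # both of Bobby's orders
```

The trade-off is clear from the sketch: the application makes two round trips and does the matching itself, which is why this pattern suits known, pre-planned queries rather than ad-hoc exploration.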
    NoSQL Can Be a Great Fit for Analytic Applications!
      • High-volume, low-latency analytic environments.
      • Queries are largely known and can be precomputed in-stream (via the application itself or Storm) or in batch using MapReduce.
      • The sweet spot is very high volumes with relatively static analytic requirements.
      • Common design pattern:
        • Compute aggregates and events in-line and store them to aggregate entities in NoSQL.
        • Write enriched detail records to NoSQL or Hadoop for further processing.
      [Chart: RDBMS vs. NoSQL positioned along volume and query-flexibility axes]
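The "compute aggregates in-line" half of that design pattern can be sketched as follows. This is a minimal, assumed-names illustration: as each event arrives, a pre-computed aggregate is updated under the key the queries will read, instead of aggregating at query time; the `defaultdict` stands in for an aggregate bucket in a NoSQL store.

```python
from collections import defaultdict

# Stands in for an aggregate entity/bucket in a NoSQL store,
# keyed by (sensor, hour) so reads are a single key lookup.
agg_store = defaultdict(lambda: {"count": 0, "total": 0.0})

def ingest(event):
    """Update the per-(sensor, hour) aggregate as the event streams in."""
    key = (event["sensor"], event["ts"][:13])  # e.g. ('s1', '2013-09-23T10')
    row = agg_store[key]
    row["count"] += 1
    row["total"] += event["value"]

events = [
    {"sensor": "s1", "ts": "2013-09-23T10:01:00", "value": 2.0},
    {"sensor": "s1", "ts": "2013-09-23T10:45:00", "value": 3.0},
    {"sensor": "s1", "ts": "2013-09-23T11:02:00", "value": 5.0},
]
for e in events:
    ingest(e)

print(agg_store[("s1", "2013-09-23T10")])  # {'count': 2, 'total': 5.0}
```

In the architecture on this slide, the same update logic would run inside a Storm bolt (real time) or a MapReduce job (batch), with the writes going to the NoSQL aggregate entities.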
    BIG ETL
      • One of the promises of Big Data is being able to enrich and process enormous data volumes.
      • The processing engines:
        • Storm: in-line, real-time processing
        • Hadoop: batch processing
      • NoSQL can play an integral part in this architecture, as a:
        • Distributed lookup cache
        • Shared-state store
        • Queueing mechanism
      [Diagram: data sources feeding a Storm topology, which writes to a relational EDW and analytic data stores]
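The "distributed lookup cache" role can be sketched like this. It is a hedged illustration with invented field names: a plain dict stands in for the shared reference store (Redis in the case study later in this deck), and each streaming record is enriched from it before being written onward.

```python
# Reference data, preloaded into the shared cache by a batch process.
reference_cache = {"dev42": {"site": "NYC", "unit": "celsius"}}

def enrich(record, cache):
    """Join reference attributes onto a streaming record, if present."""
    meta = cache.get(record["device"], {})
    return {**record, **meta}

raw = {"device": "dev42", "reading": 21.5}
print(enrich(raw, reference_cache))
# {'device': 'dev42', 'reading': 21.5, 'site': 'NYC', 'unit': 'celsius'}
```

Because the cache is external and shared, every worker in the Storm topology sees the same reference data without each one holding a private copy.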
    NoSQL databases are like snowflakes or Smurfs: they are all special, and no two are alike! Let's review the main categories and determine their general fit for analytic applications.
    Key-Value
      • Platforms: Riak, Redis
      • Buckets/collections are the equivalent of a table in an RDBMS.
      • The primary unit of storage is a key-value pair; the value can be anything from a number to a JSON document.
      • Key-value stores are super fast and simple.
      • Analytic capabilities:
        • Although many have very spartan feature sets, some platforms like Riak offer analytics-friendly links, tags, metadata, and powerful MapReduce capabilities!
        • Writes and reads are generally ultra-fast, making them good candidates as a BIG ETL component.
    Columnar
      • Platforms: Cassandra, HBase
      • Column families are the equivalent of a table in an RDBMS.
      • The primary unit of storage is a column; columns are stored contiguously.
      • Skinny rows: most like a relational database, except columns are optional and not stored if omitted.
      • Wide rows: rows can be billions of columns wide; used for time series, relationships, and analytics!
      • Analytic capabilities:
        • Widely used in analytic applications.
        • A typical analytic design pattern is to use skinny rows for detail records and wide rows for aggregates.
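The wide-row time-series pattern above can be sketched with plain Python structures (this is not a real Cassandra/HBase client, and the key layout is an assumption for illustration): one row per sensor per day, with one column per timestamped reading, so a whole day's data lives in one contiguous row.

```python
# row key -> {column name (timestamp) -> value}
wide_rows = {}

def write_reading(sensor, ts, value):
    """Append a reading as a new column on the sensor's daily wide row."""
    row_key = f"{sensor}:{ts[:10]}"                # e.g. 's1:2013-09-23'
    wide_rows.setdefault(row_key, {})[ts] = value  # column name = timestamp

write_reading("s1", "2013-09-23T10:01:00", 2.0)
write_reading("s1", "2013-09-23T10:45:00", 3.0)
write_reading("s1", "2013-09-24T09:00:00", 7.0)

# Reading a day's time series is a single row fetch; sorting the column
# names (ISO timestamps) yields time order, as a columnar store would.
day = wide_rows["s1:2013-09-23"]
print(sorted(day))
```

In a real columnar store, the contiguity of a row's columns on disk is what makes scanning one sensor-day so cheap compared with a row-per-reading relational layout.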
    Document
      • Platforms: MongoDB, CouchDB
      • Collections are the equivalent of a table in an RDBMS.
      • The primary unit of storage is a document, e.g.:

        { "User": "Bobby",
          "Email": "bobby@db-lover.com",
          "Channel": "Web",
          "State": "NJ" }

        { "User": "Susie",
          "Email": "Susie@sql-enthusiast.com",
          "PreferredCategories": [
            { "Category": "Fashion", "CategoryAdded": "2012-01-01" },
            { "Category": "Outdoor Equipment", "CategoryAdded": "2013-01-01" } ],
          "Channel": "In-Store" }

      • Analytic capabilities: the most similar to relational in function, but requires denormalization. Secondary index support and MapReduce; Mongo has a cool new aggregation framework.
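The kind of grouping that a document store's aggregation framework performs server-side can be mimicked in plain Python. This is a sketch only, using the example documents' shape, not the actual MongoDB pipeline syntax: it groups documents by "Channel" and counts them, as a pipeline group stage would.

```python
from collections import Counter

# A small collection of documents, mirroring the examples above.
docs = [
    {"User": "Bobby", "Channel": "Web", "State": "NJ"},
    {"User": "Susie", "Channel": "In-Store"},
    {"User": "Ann", "Channel": "Web", "State": "NY"},
]

# Equivalent in spirit to grouping documents by "Channel" and counting
# each group; a real aggregation framework runs this inside the store.
by_channel = Counter(d["Channel"] for d in docs)
print(dict(by_channel))  # {'Web': 2, 'In-Store': 1}
```

Note that documents are free to omit fields ("Susie" has no "State"), which is exactly why grouping is done on the field you know every relevant document carries.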
    In the Real World: High-Volume Sensor Analytics
      • Ingestion and analytics of sensor data.
      • 6 to 12 BILLION records ingested daily (an average of 500K records per second at peak load)!
      • Ingested data must be stored to disk and highly available.
      • Pre-defined aggregates and event monitors must be near real-time.
      • Ad-hoc query capabilities are required on historical data.
    One Way to Do It... That Worked
      [Diagram: sensor data flows through Kafka into a Storm cluster, which writes atomic data to a Hadoop cluster and aggregates/event monitors to a low-latency store feeding d3.js analytics]
      • The Kafka messaging system is used for ingestion.
      • Storm is used for real-time ETL and outputs atomic data plus the derived data needed for analytics.
      • Redis is used as a reference-data lookup cache.
      • Real-time analytics are produced from the aggregated data.
      • Higher-latency ad-hoc analytics are done in Hadoop using Pig and Hive.
    Parting Thought
      Polyglot persistence: "where any decent sized enterprise will have a variety of different data storage technologies for different kinds of data. There will still be large amounts of it managed in relational stores, but increasingly we'll be first asking how we want to manipulate the data and only then figuring out what technology is the best bet for it." -- Martin Fowler
    Contact
      Elliott Cordo, Principal Consultant, Caserta Concepts
      P: (855) 755-2246 x267
      E: elliott@casertaconcepts.com
      info@casertaconcepts.com | 1 (855) 755-2246 | www.casertaconcepts.com