
Choosing the Right Big Data Tools for the Job - A Polyglot Approach


Published in: Technology

  1. 1. Choosing The Right Big Data Tools For The Job – A Polyglot Approach
     A webinar presented by Leon Guzenda on August 9, 2012
  2. 2. Overview
     • The Problem
     • Current Big Data Analytics
     • Relationship Analytics
     • Leveraging Alternative Technologies – NoSQL
     • The Polyglot Approach
  3. 3. About Objectivity Inc.
     Company
       • Objectivity, Inc. is headquartered in Sunnyvale, CA.
       • Established in 1988 to tackle database problems that network/hierarchical/relational and file-based technologies struggle with.
       • Objectivity has over two decades of Big Data and NoSQL experience.
     Products
       • Develops NoSQL platforms for managing and discovering relationships and patterns in complex data:
         – Objectivity/DB: an object database that manages localized, centralized or distributed databases.
         – InfiniteGraph: a massively scalable graph database built on Objectivity/DB that enables organizations to find, store and exploit the relationships in their data.
     Markets
       • The Big Data market is projected to be around $12B in 2012, with a CAGR of 28% over the next five years.
       • 40% per year data growth, cloud adoption, mobile usage and improved real-time analytics underpin Objectivity’s growth opportunities as a Big Data analytics enabler.
     Customers
       • Embedded in hundreds of enterprises, government organizations and products, with millions of deployments.
     Financials
       • Consistently generates increased revenues.
       • Privately held by the employees and a few venture capital companies.
     Copyright © Objectivity, Inc. 2012
  4. 4. The Problem
     Information overload! Making sense of it all takes time and $$$.
     Current “Big Data” Analytics
  5. 5. A Typical “Big Data” Analytics Setup
     [Diagram: data aggregation and analytics applications run on commodity Linux platforms and/or high-performance computing clusters, on top of a mix of repositories (RDBMS, data warehouse, column store, key-value store, document database, object database, graph database and Hadoop) that hold structured, semi-structured and unstructured data.]
  6. 6. Leveraging Alternative Technologies
  7. 7. Not Only SQL – a group of 4 primary technologies
     • Users choose between four different primary technologies for different purposes:
       – Key-Value Stores
       – “Big Table” Clones
       – Document Databases
       – Object and Graph databases (including InfiniteGraph)
     • Many implementations sacrifice consistency (ACID transactions, CAP – eventual consistency) for performance.
     • Technologies such as Objectivity/DB and InfiniteGraph offer ACID transactions, with consistency and performance.
  8. 8. The NoSQL Market
  9. 9. Key-Value Stores
     “Dynamo: Amazon’s Highly Available Key-value Store” [2007]
     • Data model:
       – Global key-value mapping
       – Scalable (sharded) HashMap
       – Highly fault tolerant (typically)
     • Examples:
       – Riak, Redis and Voldemort
  10. 10. Key-Value Stores: Pros & Cons
      • Strengths:
        – Simple data model
        – Great at scaling out horizontally
        – Scalable
        – Available
      • Weaknesses:
        – Simplistic data model
        – Poor for complex data
        – Unsuited for interconnected data
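The "scalable (sharded) HashMap" idea on the slide above can be sketched in a few lines of Python. This is only an illustrative toy, not how Riak, Redis or Voldemort are implemented: the class name `ShardedKVStore`, the fixed shard count and the CRC32-modulo placement are all assumptions made for the example, and it omits replication and fault tolerance entirely.

```python
import zlib


class ShardedKVStore:
    """Toy key-value store that spreads keys over a fixed number of in-memory shards."""

    def __init__(self, num_shards=4):
        # Each shard is a plain dict; in a real system each would live on its own node.
        self.shards = [dict() for _ in range(num_shards)]

    def _shard_for(self, key):
        # crc32 is stable across runs, unlike Python's built-in hash() for strings.
        index = zlib.crc32(key.encode("utf-8")) % len(self.shards)
        return self.shards[index]

    def put(self, key, value):
        self._shard_for(key)[key] = value

    def get(self, key, default=None):
        return self._shard_for(key).get(key, default)


store = ShardedKVStore(num_shards=4)
store.put("user:42", {"name": "Ada"})
store.put("user:43", {"name": "Grace"})
print(store.get("user:42"))
```

The only operations are `put` and `get` by key, which is what makes the model easy to scale out and also why it is a poor fit for interconnected data: a value cannot be reached except through its key.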
  11. 11. Big Table Clones – Column Family
      • Google’s “Bigtable: A Distributed Storage System for Structured Data” [2006]
      • Column-family stores are essentially Bigtable clones.
      • Data model:
        – A big table, with column families.
        – Map-reduce for parallel query/processing.
      • Examples:
        – HBase, Hypertable and Cassandra.
      [Diagram: a key maps to a set of columns, each with a column name, a value and a date/time.]
  12. 12. Big Table Clones – Pros & Cons
      • Strengths:
        – Data model supports semi-structured data
        – Naturally indexed (columns)
        – Good at scaling out horizontally
      • Weaknesses:
        – Complex data model
        – Unsuited for highly interconnected data
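The column-family layout described above (a key, then named columns, each holding a value and a date/time) can be pictured as a nested map. The sketch below is a simplified in-memory illustration under that assumption; the class `ColumnFamily` and its methods are hypothetical names and do not correspond to the HBase, Hypertable or Cassandra APIs.

```python
import time
from collections import defaultdict


class ColumnFamily:
    """Toy column family: row key -> column name -> list of (timestamp, value) versions."""

    def __init__(self):
        self.rows = defaultdict(lambda: defaultdict(list))

    def put(self, row_key, column, value, timestamp=None):
        # Every write is kept as a new timestamped version of the column.
        ts = time.time() if timestamp is None else timestamp
        self.rows[row_key][column].append((ts, value))

    def get(self, row_key, column):
        # Return the value with the newest timestamp, or None if the column is absent.
        versions = self.rows.get(row_key, {}).get(column)
        if not versions:
            return None
        return max(versions, key=lambda version: version[0])[1]


cf = ColumnFamily()
cf.put("user:1", "email", "old@example.com", timestamp=1)
cf.put("user:1", "email", "new@example.com", timestamp=2)
cf.put("user:2", "city", "Sunnyvale", timestamp=1)  # rows need not share columns
print(cf.get("user:1", "email"))
```

Rows with different column sets coexist in the same table, which is the sense in which the model supports semi-structured data.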
  13. 13. Document Databases
      • Data model:
        – A collection of unstructured or semi-structured documents.
        – Each document is referenced using a key-value pair.
        – The “value” can range from unstructured text to a collection of key-value pairs or a group of XML objects.
        – Index-centric to support queries based on content.
      • Examples:
        – CouchDB and MongoDB.
  14. 14. Document Databases – Pros & Cons
      • Strengths:
        – Simple, powerful data model
        – Good scalability if sharding is supported
      • Weaknesses:
        – Unsuited for interconnected data
        – Query model is limited to keys and indexes
        – Generally uses Map-Reduce (designed for batch operations) for larger queries
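To make "index-centric" concrete, here is a minimal sketch of a document collection that maintains secondary indexes on chosen fields so that content queries avoid a full scan. The `DocumentStore` class is a hypothetical illustration, not the CouchDB or MongoDB API, and it assumes indexed field values are hashable.

```python
from collections import defaultdict


class DocumentStore:
    """Toy document collection with secondary indexes on selected top-level fields."""

    def __init__(self, indexed_fields):
        self.docs = {}
        self.indexed_fields = set(indexed_fields)
        # field -> value -> set of document keys holding that value
        self.indexes = {field: defaultdict(set) for field in self.indexed_fields}

    def put(self, key, doc):
        self.delete(key)  # drop stale index entries if the key is being overwritten
        self.docs[key] = doc
        for field in self.indexed_fields:
            if field in doc:
                self.indexes[field][doc[field]].add(key)

    def delete(self, key):
        old = self.docs.pop(key, None)
        if old is None:
            return
        for field in self.indexed_fields:
            if field in old:
                self.indexes[field][old[field]].discard(key)

    def find(self, field, value):
        # Only indexed fields can be queried, mirroring the "keys and indexes" limit.
        if field not in self.indexes:
            raise KeyError(f"no index on field {field!r}")
        return [self.docs[k] for k in sorted(self.indexes[field].get(value, ()))]


docs = DocumentStore(indexed_fields=["city"])
docs.put("p1", {"name": "Ada", "city": "London"})
docs.put("p2", {"name": "Grace", "city": "New York", "tags": ["navy"]})
docs.put("p3", {"name": "Alan", "city": "London"})
print(docs.find("city", "London"))
```

Documents may carry different fields (`tags` appears only on `p2`), but a query on an unindexed field is rejected, which is the trade-off the pros and cons slide points at.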
  15. 15. Object Databases
      • Data model [ODMG93]:
        – Objects have a Class (type) and a group of Values
        – Each Object instance has a unique Object Identifier [OID]
        – Connections use Object Identifiers for efficiency
        – Supports class inheritance and polymorphism
      • Examples:
        – Objectivity/DB and db4objects
  16. 16. Object Databases – Pros & Cons
      • Strengths:
        – Simple, powerful data model that includes inheritance and polymorphism
        – Every object has a class (type) and a unique Object Identifier
        – Good scalability if sharding is supported
        – Uses Object Identifiers instead of JOIN tables to support very fast navigational operations
      • Weaknesses:
        – The query language never became a standard
        – Supports standard object-oriented languages but isn’t supported by a wide range of third-party tools in the way that SQL is.
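The object model above (classes, unique OIDs, connections by OID, inheritance and polymorphism) can be mimicked with ordinary Python objects. This is a rough in-memory sketch only; `ObjectStore`, `Person`, `Physician` and `Company` are made-up names and this is not the Objectivity/DB or db4objects API.

```python
import itertools


class ObjectStore:
    """Toy object store that assigns each stored object a unique OID."""

    def __init__(self):
        self._objects = {}
        self._oids = itertools.count(1)

    def add(self, obj):
        obj.oid = next(self._oids)
        self._objects[obj.oid] = obj
        return obj.oid

    def get(self, oid):
        # Following a connection is a direct lookup by OID, with no join table involved.
        return self._objects[oid]


class Company:
    def __init__(self, name):
        self.name = name


class Person:
    def __init__(self, name, employer_oid=None):
        self.name = name
        self.employer_oid = employer_oid  # connection stored as an OID

    def title(self):
        return self.name


class Physician(Person):  # inheritance
    def title(self):  # polymorphic override
        return "Dr. " + self.name


store = ObjectStore()
clinic_oid = store.add(Company("Example Clinic"))
people = [store.add(Person("Ada", clinic_oid)), store.add(Physician("Grace", clinic_oid))]
titles = [store.get(oid).title() for oid in people]
employer = store.get(store.get(people[1]).employer_oid).name
print(titles, employer)
```

Navigating from a person to an employer takes one lookup per hop, which is the "navigational" access that OIDs make cheap.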
  17. 17. Graph Databases
      • Data model:
        – Node (Vertex) and Relationship (Edge) objects
        – Directed
        – May be a hypergraph (edges with multiple endpoints)
      • Examples:
        – InfiniteGraph, Neo4j, OrientDB, AllegroGraph, TitanDB and Dex
  18. 18. Graph Databases – Pros & Cons
      • Strengths:
        – Extremely fast for connected data
        – Scales out, typically
        – Easy to query (navigation)
        – Simple data model
      • Weaknesses:
        – May not support distribution or sharding
        – Requires conceptual shift... a different way of thinking
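A directed property graph of vertices and edges, as described above, can be sketched as follows. The `Graph`, `Vertex` and `Edge` names and the sample data are illustrative assumptions, not the API of InfiniteGraph or any other product listed, and hypergraph edges are left out for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    vid: str
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    source: str  # directed: source -> target
    target: str
    kind: str
    properties: dict = field(default_factory=dict)


class Graph:
    """Toy directed graph storing outgoing edges per vertex."""

    def __init__(self):
        self.vertices = {}
        self.out_edges = {}

    def add_vertex(self, vid, **properties):
        self.vertices[vid] = Vertex(vid, properties)
        self.out_edges.setdefault(vid, [])

    def add_edge(self, source, target, kind, **properties):
        if source not in self.vertices or target not in self.vertices:
            raise KeyError("both endpoints must be added before the edge")
        self.out_edges[source].append(Edge(source, target, kind, properties))

    def neighbors(self, vid, kind=None):
        # Navigation: follow outgoing edges, optionally filtered by relationship kind.
        return [e.target for e in self.out_edges[vid] if kind is None or e.kind == kind]


g = Graph()
for name in ("alice", "bob", "acme"):
    g.add_vertex(name)
g.add_edge("alice", "bob", "knows", since=2010)
g.add_edge("alice", "acme", "works_at")
print(g.neighbors("alice"), g.neighbors("alice", kind="knows"))
```

Edges are first-class objects with their own properties (`since=2010`), so the data in the connections can be queried as well as the data in the nodes.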
  19. 19. Competing “Big Data” Analytics Solutions
  20. 20. Typical “Big Data” Analytics Phases
      [Diagram: Front-End Processing, Repository, Analytics and Visualization Tools.]
      The strategic competitors are all moving in the same direction.
  21. 21. Incremental Improvements Aren’t Enough
      All current solutions use the same basic architectural model.
      • None of the current solutions have a way to store connections between entities in different silos.
      • Most analytic technology focuses on the content of the data nodes, rather than the many kinds of connections between the nodes and the data in those connections.
      • Why? Because relational and most NoSQL solutions are bad at handling relationships.
      • Object and Graph databases can efficiently store, manage and query the many kinds of relationships hidden in the data.
  22. 22. Relationship Analytics
  23. 23. Example 1 - Market Analysis
      The 10 companies that control a majority of U.S. consumer goods brands
  24. 24. Example 2 - Demographics
      Used in social network analysis, marketing, medical research etc.
  25. 25. Example 3 - Seed To Consumer Tracking ?
  26. 26. Example 4 - Ad Placement Networks
      Smartphone ad placement, based on the user’s profile and location data captured by opt-in applications.
      • The location data can be stored and distilled in a key-value and column store hybrid database, such as Cassandra.
      • The locations are matched with geospatial data to deduce user interests.
      • As ad placement orders arrive, an application built on a graph database such as InfiniteGraph matches groups of users with ads. This:
        – Maximizes relevance for the user.
        – Yields maximum value for the advertiser and the placer.
  27. 27. Example 5 - Healthcare Informatics
      • Problem: Physicians need better electronic records for managing patient data on a global basis and for matching symptoms, causes, treatments and interdependencies to improve diagnoses and outcomes.
      • Solution: Create a database capable of leveraging existing architecture using NoSQL tools such as Objectivity/DB and InfiniteGraph that can handle data capture, symptoms, diagnoses, treatments, reactions to medications, interactions and progress.
      • Result: It works:
        – Diagnosis is faster and more accurate.
        – The knowledge base tracks similar medical cases.
        – Treatment success rates have improved.
  28. 28. Relationship (Connection) Analytics... Relational Database
      Think about the SQL query for finding all links between the two “blue” rows... Good luck!
      [Diagram: linked rows spread across Table_A, Table_B, Table_C, Table_D, Table_E, Table_F and Table_G.]
      Relational databases aren’t good at handling complex relationships!
  29. 29. Relationship (Connection) Analytics... Relational Database
      Think about the SQL query for finding all links between the two “blue” rows... Good luck!
      [Diagram: linked rows spread across Table_A, Table_B, Table_C, Table_D, Table_E, Table_F and Table_G; labels A3 and G4.]
      Objectivity/DB or InfiniteGraph - The solution can be found with a few lines of code.
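As a rough idea of what "a few lines of code" might look like once rows and their links are held as a graph, the sketch below enumerates every simple path between two rows with a depth-first walk. It is generic Python, not InfiniteGraph or Objectivity/DB code. The endpoint labels A3 and G4 come from the slide; treating them as the two "blue" rows, and every other row name and link in the sample data, is an assumption made for illustration.

```python
def all_simple_paths(links, start, goal):
    """Return every path from start to goal that visits no row twice."""
    # Build an undirected adjacency map from the (row, row) link pairs.
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    paths = []

    def walk(node, path):
        if node == goal:
            paths.append(list(path))
            return
        for nxt in sorted(adjacency.get(node, ())):
            if nxt not in path:  # skip rows already on the current path
                path.append(nxt)
                walk(nxt, path)
                path.pop()

    walk(start, [start])
    return paths


# Hypothetical links between rows of different tables.
links = [
    ("A3", "B1"), ("B1", "C2"), ("C2", "G4"),
    ("A3", "D5"), ("D5", "G4"),
    ("E1", "F2"),  # unrelated rows that should not appear in any path
]
paths = all_simple_paths(links, "A3", "G4")
for p in paths:
    print(" -> ".join(p))
```

The equivalent SQL would need a join (or a recursive query) per hop and per table pairing, whereas the traversal above does not care how many tables the path crosses.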
  30. 30. Visual Analytics
  31. 31. The Polyglot Approach
  32. 32. Lesson 1 – The Repository Matters A Lot
      | Need | RDBMS | Key-Value | Column Family | Document Database | ODBMS | Graph Database |
      |---|---|---|---|---|---|---|
      | OLTP | YES | No | Maybe | No | Maybe | No |
      | Text Handling | No | No | No | YES | Maybe | No |
      | Multimedia | No | Maybe | No | Maybe | YES | Maybe |
      | Engineering/Scientific | No | No | No | No | YES | Maybe |
      | Business Intelligence | YES | No | Maybe | No | Maybe | Maybe |
      | Log Processing | Maybe | No | Maybe | No | YES | Maybe |
      | Connection Handling/Analysis | No | No | No | No | Maybe | YES |
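One way to read the Lesson 1 matrix is as a lookup table: for each need, pick the repository rated YES. The sketch below encodes the slide's ratings and does exactly that. The function name `best_repositories` and the choice to treat only "YES" as a match are assumptions for illustration.

```python
REPOSITORIES = [
    "RDBMS", "Key-Value", "Column Family",
    "Document Database", "ODBMS", "Graph Database",
]

# Ratings per need, in the same column order as REPOSITORIES (values from the slide).
FIT = {
    "OLTP": ["YES", "No", "Maybe", "No", "Maybe", "No"],
    "Text Handling": ["No", "No", "No", "YES", "Maybe", "No"],
    "Multimedia": ["No", "Maybe", "No", "Maybe", "YES", "Maybe"],
    "Engineering/Scientific": ["No", "No", "No", "No", "YES", "Maybe"],
    "Business Intelligence": ["YES", "No", "Maybe", "No", "Maybe", "Maybe"],
    "Log Processing": ["Maybe", "No", "Maybe", "No", "YES", "Maybe"],
    "Connection Handling/Analysis": ["No", "No", "No", "No", "Maybe", "YES"],
}


def best_repositories(needs):
    """Map each need to the repositories rated YES for it."""
    return {
        need: [repo for repo, rating in zip(REPOSITORIES, FIT[need]) if rating == "YES"]
        for need in needs
    }


choice = best_repositories(["OLTP", "Text Handling", "Connection Handling/Analysis"])
print(choice)
```

An application with these three needs ends up with three different repositories, which is the polyglot argument in miniature.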
  33. 33. Lesson 2 – Languages and Tools Matter Too
      | Need | Repository | Language | BI Tools | Visual Analytics |
      |---|---|---|---|---|
      | OLTP | RDBMS | SQL, Java | YES | Maybe |
      | Text | Document Database | Java, XML | No | Maybe |
      | Multimedia | ODBMS | Java, C++ | No | Maybe |
      | Eng/Science | ODBMS | C, C++, R, Fortran | Maybe | YES |
      | Business Intelligence | RDBMS | Java, SQL, R | YES | YES |
      | Log Processing | NoSQL, ODBMS | C++, R, Java, SQL | Maybe | YES |
      | Connection Handling/Analysis | Graph Database | Java, C++, SPARQL | Maybe | YES |
  35. 35. ...SUMMARY: A Polyglot Approach Works Best
  37. 37. SPARE SLIDES
  38. 38. InfiniteGraph - The Enterprise Graph Database
      • A high performance distributed database engine that supports analyst-time decision support and actionable intelligence.
      • Cost effective link analysis – flexible deployment on commodity resources (hardware and OS).
      • Efficient, scalable, risk averse technology – enterprise proven.
      • High speed parallel ingest to load graph data quickly.
      • Parallel, distributed queries.
      • Flexible plugin architecture.
      • Complementary technology.
      • Fast proof of concept – easy to use Graph API.
  39. 39. Objectivity/DB
      A distributed object database built for handling data with many complex relationships.
      • Reliable - Deployed in process control, telecom and medical equipment, Big Science, complex financial, defense and Intelligence Community applications.
      • Provably scalable - Used to build the world’s first Petabyte+ database at Stanford Linear Accelerator in the year 2000.
      • Advanced query capabilities - Parallel Query Engine.
      • Interoperable - Across languages and platforms:
        – C++, C#, Java, Python and SQL++
        – Linux, Mac OS X and Windows (32 and 64-bit)
  40. 40. The Big Data Connection Platform
      [Diagram: a layered stack with the layers Data Visualization & Analytics, Big Data Connection Platform, Processing Platform, Connectors / Integration, and Servers / File Storage. Entries in several layers carry acquisition notes: "*Now HP", "*Now IBM", "*Now EMC", "*Now Teradata", "*Now SAP" and "*Now Oracle".]
  42. 42. Thank You!
      Please take a look at objectivity.com for online demos, white papers, free downloads, samples and tutorials.
      You can also see us at NoSQL Now! in San Jose, CA on August 22.