Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small Clusters with Doradus OLAP

Slides from my Strata+Hadoop 2015 Conference session titled: One Billion Objects in 2GB: Big Data Analytics on Small Clusters with Doradus OLAP. This talk describes the Doradus OLAP query/storage engine, an open source module that runs on top of the Cassandra NoSQL DB. Among the benefits of this service are fast data loading, a rich query language with full-text and graph query features, and very dense data storage. See the Notes section for details on each slide.

Published in: Software

  1. One Billion Objects in 2GB: Big Data Analytics on Small Clusters with Doradus OLAP
     Randy Guck, Principal Engineer, Dell Software Group
  2. What is Doradus?
     • Storage and query service
     • Leverages Cassandra NoSQL DB
     • Pure Java
       - Stateless
       - Embeddable or standalone
     • Open source: Apache 2.0 License
     [Image: 30 Doradus, the Tarantula Nebula. Source: Hubble Space Telescope]
  3. Why Use Doradus?
     • Easy to use: no client driver
     • Spider storage manager
       - Good for unstructured data
     • OLAP storage manager
       - Near real time data warehousing
     • Compared to Cassandra alone:
       - Data model, searching, analytics
     • Compared to Hadoop:
       - Fast data loads and queries
       - Dense storage: less hardware
     [Diagram labels: Applications, REST API, Doradus (OLAP, Spider), Cassandra, Data]
  4. A Multi-Node Cluster
     [Diagram labels: Applications, REST API; Node 1, Node 2, Node 3, each with Cassandra and Data; Doradus instances]
     Secondary Doradus instances are optional
  5. Why Did We Build Doradus OLAP?
     • Some tough customer requirements:
       - Statistical queries most important
       - Need to scan millions of objects/second
       - User-customizable “insights” = millions of possible queries
     • Couldn’t use indexes, pre-computed queries, etc.
     • Disk physics
       - ~100s of random reads/second
       - ~1000s of serial reads/second
     • Needed a radically new approach!
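The disk-physics constraint can be made concrete with a little arithmetic. The sketch below is illustrative only: the rate of 100 random reads/second is picked from the low end of the slide's "~100s" range, and the packing factor of one million values per physical read is a hypothetical figure, not a measured Doradus number.

```python
def scan_seconds(num_objects: int, reads_per_second: float, objects_per_read: int = 1) -> float:
    """Seconds needed to touch num_objects, given a read rate and a packing factor."""
    reads_needed = -(-num_objects // objects_per_read)  # ceiling division
    return reads_needed / reads_per_second


# Assumed figures for illustration, not benchmarks.
one_per_read = scan_seconds(1_000_000, reads_per_second=100)
packed = scan_seconds(1_000_000, reads_per_second=100, objects_per_read=1_000_000)

print(f"one object per random read: {one_per_read:,.0f} s")
print(f"one million values per read: {packed:.2f} s")
```

Under these assumptions, scanning a million objects one random read at a time takes hours, which is why packing many values into each physical read matters more than anything an index could offer.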
  6. Doradus OLAP
     • Combines ideas from:
       - Online Analytical Processing: data arranged in static cubes
       - Columnar databases: column-oriented storage and compression
       - NoSQL databases: sharding
     • Features:
       - Fast loading: up to 500K objects/second/node
       - Dense storage: 1 billion objects in 2 GB!
       - Fast cube merging: typically seconds
       - No indexes!
  7. Example: Message Tracking Schema
     [Schema diagram. Tables: Message, Participant, Address, Person, Attachment. Link labels: Attachments/Message, Participants/Message, Address/Participants, Person/Address, Manager/Employees]
  8. DQL Object Queries
     • Builds on Lucene syntax
       - Full text queries
     • Adds link paths
       - Directed graph searches
       - Quantifiers and filters
       - Transitive searches
     • Other features
       - Stateless paging
       - Sorting
     • Examples:
       - LastName = Smith AND NOT (FirstName : Jo*) AND BirthDate = [1986 TO 1992]
       - ALL(Participants).ANY(Address.WHERE (Email='*.gmail.com')).Person.Department : support
       - Employees^(4).Office='San Jose'
  9. DQL Aggregate Queries
     • Metric functions
       - COUNT, AVERAGE, MIN, MAX, DISTINCT, ...
     • Multi-level grouping
     • Grouping functions
       - BATCH, BOTTOM, FIRST, LAST, LOWER, SETS, TERMS, TOP, TRUNCATE, UPPER, WHERE, ...
     • Examples:
       - metric=COUNT(*), AVERAGE(Size), MIN(Participants.Address.Person.Birthdate)
       - metric=DISTINCT(Attachments.Extension); groups=Tags, Participants.Address.Person.Department; query=Attachments.Size > 100000
       - metric=AVERAGE(Size); groups=TOP(10,Participants.Address.Email)
  10. OLAP Data Loading
      [Diagram: Sources (Events, People, Computers, Domains)]
  11. OLAP Data Loading
      [Diagram: Sources (Events, People, Computers, Domains); Batches (Batch 1, Batch 2, Batch 3, Batch 4, ...)]
  12. OLAP Data Loading
      [Diagram: Sources (Events, People, Computers, Domains); Batches (Batch 1, Batch 2, Batch 3, Batch 4, ...); Merge; Shards (2014-03-01, 2014-02-28, 2014-02-27)]
  13. OLAP Data Loading
      [Diagram: Sources (Events, People, Computers, Domains); Batches (Batch 1, Batch 2, Batch 3, Batch 4, ...); Merge; Shards (2014-03-01, 2014-02-28, 2014-02-27); OLAP Store]
  14. Storing Batches
      OLAP Table: Field Value Arrays
        ID:          5amhvv7J2otBu48Z6PE5cA, 7CgvDf5mOU78jNVc58eu, cZpz2q4Jf8Rc2HK9Cg08, ...
        Size:        48120, 5435, 24220, ...
        SendDate:    1280246462000, 1279354872112, 1279357261413, ...
        Priority:    0, 0, 1, ...
        Subject.txt: ballades encash nautch colloquy geared nettlier outdoors culvert hypothec winder stolons ungot guiding rupiahs outgone ...
        Subject:     1, 2, 0, ...
        ...
      Data is sorted by object ID and stored as columnar, compressed blobs
      Compressed rows
        Key                                                 Columns
        Email/Message/2014-03-01/{Batch GUID}/ID            [compressed data]
        Email/Message/2014-03-01/{Batch GUID}/Size          [compressed data]
        Email/Message/2014-03-01/{Batch GUID}/SendDate      [compressed data]
        ...                                                 ...
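The batch layout on this slide (objects sorted by ID, one compressed value array per field, keyed by application/table/shard/batch/field) can be sketched as follows. This is a minimal illustration, not the Doradus implementation: the JSON-plus-zlib encoding is a stand-in because the slide does not say how value arrays are encoded, and `batch_to_rows`, `read_column`, and the batch ID `batch-1` are hypothetical names.

```python
import json
import zlib


def batch_to_rows(app, table, shard, batch_id, objects):
    """Turn a list of dicts (each with an 'ID') into {row key: compressed column blob}."""
    ordered = sorted(objects, key=lambda o: o["ID"])  # sort by object ID
    fields = sorted({name for obj in ordered for name in obj})
    rows = {}
    for field in fields:
        values = [obj.get(field) for obj in ordered]  # one value array per field
        key = f"{app}/{table}/{shard}/{batch_id}/{field}"
        # Assumption: JSON + zlib stands in for the real (unstated) encoding.
        rows[key] = zlib.compress(json.dumps(values).encode("utf-8"))
    return rows


def read_column(rows, key):
    """Decompress one column back into a Python list."""
    return json.loads(zlib.decompress(rows[key]).decode("utf-8"))


messages = [
    {"ID": "cZpz2q4Jf8Rc2HK9Cg08", "Size": 24220, "Priority": 1},
    {"ID": "5amhvv7J2otBu48Z6PE5cA", "Size": 48120, "Priority": 0},
    {"ID": "7CgvDf5mOU78jNVc58eu", "Size": 5435, "Priority": 0},
]
rows = batch_to_rows("Email", "Message", "2014-03-01", "batch-1", messages)
print(sorted(rows))
print(read_column(rows, "Email/Message/2014-03-01/batch-1/Size"))
```

Because every column is written in the same ID order, position i in each array refers to the same object, so no per-value object reference needs to be stored.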
  15. Merging Batches
      Batch #1: Shard 2014-03-01
        Message Table: ID ..., Size ..., SendDate ..., ...
        Person Table:  ID ..., FirstName ..., LastName ..., ...
        Address Table: ID ..., Person ..., Messages ..., ...
      Batch #2: Shard 2014-03-01
        Message Table: ID ..., Size ..., SendDate ..., ...
        Person Table:  ID ..., FirstName ..., LastName ..., ...
        Address Table: ID ..., Person ..., Messages ..., ...
      ...
      OLAP Store
        Key                                   Columns
        Message table data, Shard 2014-03-01:
        Email/Message/2014-03-01/ID           [compressed data]
        Email/Message/2014-03-01/Size         [compressed data]
        Email/Message/2014-03-01/SendDate     [compressed data]
        ...                                   ...
        Person table data, Shard 2014-03-01:
        Email/Person/2014-03-01/ID            [compressed data]
        Email/Person/2014-03-01/FirstName     [compressed data]
        Email/Person/2014-03-01/LastName      [compressed data]
        ...                                   ...
        Address table data, Shard 2014-03-01:
        Email/Address/2014-03-01/ID           [compressed data]
        Email/Address/2014-03-01/Person       [compressed data]
        Email/Address/2014-03-01/Message      [compressed data]
        ...                                   ...
        Data for other shards:
        Email/Message/2014-02-28/ID           [compressed data]
        Email/Message/2014-02-28/Size         [compressed data]
        ...
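A merge of this kind can be pictured as combining the per-batch value arrays of a shard into one array per field, again ordered by object ID, with the batch component dropped from the row key. The sketch below is a simplified illustration under stated assumptions: compression is omitted, object IDs are assumed unique across batches (the slide does not say how repeated IDs are resolved), and `merge_batches` and `shard_key` are hypothetical helper names.

```python
def shard_key(batch_key: str) -> str:
    """Drop the batch component: app/table/shard/batch/field -> app/table/shard/field."""
    app, table, shard, _batch, field = batch_key.split("/")
    return f"{app}/{table}/{shard}/{field}"


def merge_batches(batches):
    """Merge per-batch columns into shard-level columns ordered by object ID.

    batches: list of dicts mapping field name -> value list; each has an "ID" list.
    """
    records = []
    for batch in batches:
        for i in range(len(batch["ID"])):
            # Rebuild each object from position i of every column in its batch.
            records.append({field: values[i] for field, values in batch.items()})
    records.sort(key=lambda r: r["ID"])
    all_fields = sorted({field for r in records for field in r})
    return {field: [r.get(field) for r in records] for field in all_fields}


batch1 = {"ID": ["a1", "c3"], "Size": [100, 300]}
batch2 = {"ID": ["b2", "d4"], "Size": [200, 400]}
merged = merge_batches([batch1, batch2])

print(merged)
print(shard_key("Email/Message/2014-03-01/batch-1/Size"))
```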
  16. Does Merging Take Long?
  17. OLAP Query Execution
      • Example query:
        - Count messages with Size between 1000-10000 and HasBeenSent=false in shards 2014-03-01 to 2014-03-31
      • How many rows are read?
        - 2 fields x 31 shards = 62 rows
        - Typically represents millions of values
      • Value arrays are scanned in memory
      • Physical rows are read on “cold” start only
        - Multiple caching levels for “warm” and “hot” data
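The row arithmetic and the in-memory scan from this slide can be sketched as below. This is an illustration, not Doradus code: the `Email/Message/{shard}/{field}` key layout is borrowed from the storage slides, the range 1000-10000 is treated as inclusive (an assumption), and the per-shard toy data is invented for the example.

```python
from datetime import date, timedelta


def shard_names(first: date, last: date):
    """One shard per day, named by ISO date, inclusive of both ends."""
    days = (last - first).days + 1
    return [(first + timedelta(days=i)).isoformat() for i in range(days)]


def rows_to_read(fields, shards):
    """One physical row per (shard, field) pair."""
    return [f"Email/Message/{shard}/{field}" for shard in shards for field in fields]


def count_matches(sizes, sent_flags):
    """Scan two parallel value arrays; position i in each refers to the same object."""
    return sum(1 for size, sent in zip(sizes, sent_flags)
               if 1000 <= size <= 10000 and not sent)


shards = shard_names(date(2014, 3, 1), date(2014, 3, 31))
keys = rows_to_read(["Size", "HasBeenSent"], shards)
print(len(keys))  # 2 fields x 31 shards

# Toy data: the same four messages in every shard (invented for illustration).
store = {s: {"Size": [500, 1500, 9000, 20000],
             "HasBeenSent": [False, False, True, False]} for s in shards}
total = sum(count_matches(store[s]["Size"], store[s]["HasBeenSent"]) for s in shards)
print(total)
```

Only the two fields named in the query are touched; every other column of the Message table stays on disk.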
  18. 1 Billion Objects in 2GB?
      Example Security Event (CSV format):
      Fixed Fields
        Computer Name: MAILSERVER18
        Log Name:      Security
        Time Stamp:    Sun, 22 Jan 2013 08:09:50 UTC
        Type:          Success Audit
        Source:        Security
        Category:      Logon/Logoff
        Event ID:      540
        User Domain:   NT AUTHORITY
        User Name:     SYSTEM
        User SID:      S-1-5-18
      Variable Fields
        1: MAILSERVER18$
        2: (empty)
        3: Workstation
        4: (0x0,0x142999A)
        5: 3
        6: Kerberos
        7: Kerberos
      CSV record:
        MAILSERVER18,Security,"Sun, 22 Jan 2013 08:09:50 UTC","Success Audit",Security,"Logon/Logoff",540,"NT AUTHORITY",SYSTEM,S-1-5-18,7,MAILSERVER18$,,Workstation,"(0x0,0x142999A)",3,Kerberos,Kerberos
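Splitting such a record into its fixed and variable parts can be sketched with Python's standard `csv` module. The sketch assumes that the value following the ten fixed fields (7 in this record) is the number of variable fields that follow; that reading matches the example but is an inference from the slide, and the field names and `parse_event` are illustrative.

```python
import csv

FIXED_FIELDS = ["ComputerName", "LogName", "Timestamp", "Type", "Source",
                "Category", "EventID", "UserDomain", "UserName", "UserSID"]

RECORD = ('MAILSERVER18,Security,"Sun, 22 Jan 2013 08:09:50 UTC","Success Audit",'
          'Security,"Logon/Logoff",540,"NT AUTHORITY",SYSTEM,S-1-5-18,7,'
          'MAILSERVER18$,,Workstation,"(0x0,0x142999A)",3,Kerberos,Kerberos')


def parse_event(line: str):
    """Return (fixed field dict, list of variable-field values) for one CSV record."""
    values = next(csv.reader([line], skipinitialspace=True))
    fixed = dict(zip(FIXED_FIELDS, values[:len(FIXED_FIELDS)]))
    # Assumption: the next value is the count of variable fields that follow.
    count = int(values[len(FIXED_FIELDS)])
    start = len(FIXED_FIELDS) + 1
    return fixed, values[start:start + count]


fixed, variable = parse_event(RECORD)
print(fixed["EventID"], fixed["Timestamp"])
for index, value in enumerate(variable, start=1):
    print(index, repr(value))
```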
  19. Events Schema
      Events (Count: 115 Million)
        Fields:
        • ComputerName (text)
        • LogName (text)
        • Timestamp (timestamp)
        • Type (text)
        • Source (text)
        • Category (text)
        • EventID (integer)
        • UserDomain (text)
        • UserSID (text)
        • Params (link)
      Insertion Strings (Count: 880 Million)
        Fields:
        • Index (integer)
        • Value (text)
        • Event (link)
      Link: Params, with Event as its inverse
  20. Event Schema Load
      • Load stats:
          Total shards:      860
          Total events:      114,572,247
          Total ins strings: 879,529,753
          Total objects:     994,102,000
          Total load time:   2 hours, 2 minutes, 36 seconds (MacBook Air)
      • Space usage:
          nodetool -h localhost status
          Datacenter: datacenter1
          =======================
          Status=Up/Down
          |/ State=Normal/Leaving/Joining/Moving
          --  Address    Load     Owns    Host ID                               Token                 Rack
          UN  127.0.0.1  1.96 GB  100.0%  860887ef-2027-431a-a425-c67a9445d0e6  -9176223118562734495  rack1
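The load statistics on this slide can be turned into per-object figures with simple arithmetic. The sketch below uses only the numbers shown; the one assumption is the unit of the reported 1.96 GB, so both the decimal (10^9) and binary (2^30) interpretations are computed.

```python
events = 114_572_247
insertion_strings = 879_529_753
total_objects = events + insertion_strings

# 2 hours, 2 minutes, 36 seconds
load_seconds = 2 * 3600 + 2 * 60 + 36
objects_per_second = total_objects / load_seconds

# Assumption: the unit behind "1.96 GB" is not stated, so try both readings.
bytes_per_object_decimal = 1.96 * 10**9 / total_objects
bytes_per_object_binary = 1.96 * 2**30 / total_objects

print(f"total objects:      {total_objects:,}")
print(f"objects per second: {objects_per_second:,.0f}")
print(f"bytes per object:   {bytes_per_object_decimal:.2f} (decimal GB), "
      f"{bytes_per_object_binary:.2f} (binary GB)")
```

Either way the store averages roughly two bytes per object, and the single-laptop load rate is on the order of a hundred thousand objects per second.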
  21. Demo
      1) Count all Events in all shards
         - 860 shards => 115M events
      2) Find the top 5 hours-of-the-day when certain privileged events fail:
         - Event IDs are any of 577, 681, 529
         - Event type is ‘Failure Audit’
         - Insertion string 8 is (0x0,0x3E7)
         - Event occurred in first half of 2005 (181 shards)
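The logic of the second demo query can be expressed in plain Python over in-memory records. This is not DQL and not how Doradus evaluates the query; it only spells out the four conditions and the group-by-hour step. The record shape (a dict with `Params` mapping insertion-string index to value) and `top_failure_hours` are hypothetical, and the sample events are invented.

```python
from collections import Counter
from datetime import datetime

PRIVILEGED_IDS = {577, 681, 529}
WINDOW_START = datetime(2005, 1, 1)
WINDOW_END = datetime(2005, 7, 1)  # exclusive: first half of 2005


def top_failure_hours(events, n=5):
    """Count matching events per hour of day and return the n most frequent hours."""
    hours = Counter()
    for event in events:
        ts = event["Timestamp"]
        if (event["EventID"] in PRIVILEGED_IDS
                and event["Type"] == "Failure Audit"
                and event["Params"].get(8) == "(0x0,0x3E7)"
                and WINDOW_START <= ts < WINDOW_END):
            hours[ts.hour] += 1
    return hours.most_common(n)


params = {8: "(0x0,0x3E7)"}
events = [
    {"EventID": 529, "Type": "Failure Audit", "Timestamp": datetime(2005, 3, 4, 9, 15), "Params": params},
    {"EventID": 529, "Type": "Failure Audit", "Timestamp": datetime(2005, 5, 20, 9, 40), "Params": params},
    {"EventID": 681, "Type": "Failure Audit", "Timestamp": datetime(2005, 2, 1, 23, 5), "Params": params},
    # Excluded: wrong event ID and type.
    {"EventID": 540, "Type": "Success Audit", "Timestamp": datetime(2005, 2, 1, 23, 5), "Params": params},
    # Excluded: outside the first half of 2005.
    {"EventID": 577, "Type": "Failure Audit", "Timestamp": datetime(2005, 8, 1, 9, 0), "Params": params},
]
result = top_failure_hours(events)
print(result)
```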
  22. Doradus OLAP Summary
      • Advantages:
        - Simple REST API
        - All fields are searchable without indexes
        - Ad-hoc statistical searches
        - Support for graph-based queries
        - Near real time data warehousing
        - Dense storage = less hardware
        - Horizontally scalable when needed
  23. Doradus OLAP Summary
      • Good for applications where data:
        - Is continuous/streaming
        - Is structured to semi-structured
        - Can be loaded in batches
        - Is partitionable, especially by time
        - Is typically queried in a subset of shards
        - Emphasizes statistical queries
  24. Thank You!
      • Where to find Doradus
        - Source: github.com/dell-oss/Doradus
        - Downloads: search.maven.org
      • Contact me
        - Randy.Guck@software.dell.com
        - @randyguck
