Sqrrl October Webinar: Data Modeling and Indexing

This webinar provides a technical deep dive into the NoSQL database Apache Accumulo. Sqrrl extends Accumulo with additional security, analytical, and data modeling tools. Topics include data modeling techniques, secondary indices, JSON and Graph capabilities for Accumulo.

Published in: Data & Analytics, Technology

  1. Securely explore your data
     DATA MODELING AND INDEXING FOR APACHE ACCUMULO
     Sqrrl Webinar Series, October 2013
     Adam Fuchs, CTO, Sqrrl Data, Inc.
  2. RECAP
     In our September webinar, "Sqrrl, Apache Accumulo, and Cell-Level Security":
       1. Introduction to Sqrrl and Accumulo
       2. Security In The Wild
       3. Sqrrl and Accumulo Technology
       4. The Data-Centric Security Ecosystem
     Sqrrl Data, Inc. Confidential and Proprietary
  3. TODAY’S DISCUSSION
     Data Modeling and Indexing for Apache Accumulo
       1. Sqrrl and Accumulo Technology Review
       2. Table Designs
          1. Dynamic Documents
          2. Graphs
          3. Inverted Indexes
       3. Putting It All Together with Sqrrl
  4. LAYERED ARCHITECTURE
     Turtles all the way down...
       • Application
       • Sqrrl API over Apache Thrift RPC (JSON, Graph, Aggregation, Search, etc.)
       • Sqrrl Enterprise
       • Accumulo RPC (Sorted Key/Value I/O)
       • Hadoop RPC (File I/O)
  5. ACCUMULO DATA FORMAT
     An Accumulo key is a 5-tuple, consisting of:
       • Row: Controls Atomicity
       • Column Family: Controls Locality
       • Column Qualifier: Controls Uniqueness
       • Visibility Label: Controls Access
       • Timestamp: Controls Versioning
     Accumulo Key/Value Example:
       Row       Col. Fam.     Col. Qual.     Visibility   Timestamp  Value
       John Doe  Notes         PCP            PCP_JD       20120912   Patient suffers from an acute …
       John Doe  Test Results  Cholesterol    JD|PCP_JD    20120912   183
       John Doe  Test Results  Mental Health  JD|PSYCH_JD  20120801   Pass
       John Doe  Test Results  X-Ray          JD|PHYS_JD   20120513   1010110110100…
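As a rough illustration of the 5-tuple above, the sketch below models keys in plain Python rather than through the Accumulo API. The names `Key`, `sort_key`, `visible`, and `scan` are hypothetical. It orders keys ascending by row, column family, column qualifier, and visibility, with the newest timestamp first, and evaluates the simple labels from the example table. It assumes labels built only from `|` and `&`; parentheses are not supported in this sketch.

```python
from typing import NamedTuple

class Key(NamedTuple):
    row: str
    family: str
    qualifier: str
    visibility: str
    timestamp: int

def sort_key(k: Key):
    # Ascending on the first four fields, descending on timestamp (newest first).
    return (k.row, k.family, k.qualifier, k.visibility, -k.timestamp)

def visible(expression: str, authorizations: set) -> bool:
    # Simplified evaluation: '|' separates alternatives, '&' joins required labels.
    if not expression:
        return True
    return any(all(label in authorizations for label in alternative.split("&"))
               for alternative in expression.split("|"))

# A few cells from the slide's example table.
cells = {
    Key("John Doe", "Test Results", "Cholesterol", "JD|PCP_JD", 20120912): "183",
    Key("John Doe", "Notes", "PCP", "PCP_JD", 20120912): "Patient suffers from an acute …",
    Key("John Doe", "Test Results", "Mental Health", "JD|PSYCH_JD", 20120801): "Pass",
}

def scan(authorizations: set):
    # Return the cells in sorted key order, keeping only those the reader may see.
    return [(k, cells[k]) for k in sorted(cells, key=sort_key)
            if visible(k.visibility, authorizations)]

for key, value in scan({"PCP_JD"}):
    print(key.family, "/", key.qualifier, "->", value)
```

A reader holding only `PSYCH_JD` would see just the Mental Health cell, which is the kind of per-cell filtering the Visibility column is meant to express.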
  6. THE ACCUMULO CLIENT API
       • Instance: new ZooKeeperInstance(...), new MockInstance()
       • Connector: getConnector(...)
       • TableOperations, InstanceOperations, SecurityOperations
       • Scanner, BatchScanner: createScanner(...), createBatchScanner(...); Range, IteratorOption; iterator() over Map.Entry of Key and Value
       • BatchWriter: createBatchWriter(...); Mutation, addMutation(...)
  7. ACCUMULO TECHNOLOGY
     [Figure: Tablet Data Flow. Labels: Writes, In-Memory Map, Write-Ahead Log (For Recovery), Minor Compaction, Sorted Indexed File (x3), Merging / Major Compaction, Iterator Tree (x3), Reads, Scan]
     [Figure: cluster layout. Labels: Application (x3), Tablet Server and Tablet (x3), Master, Zookeeper (x3), HDFS; Read/Write, Store/Replicate, Assign/Balance, Delegate Authority]
     Strengths
       • Shared-Nothing => Scalability
       • Micro-Batching for Efficient Random I/O
       • High Concurrency, Low Latency for Denormalized Data
       • Sparse, Flexible Schema supports dynamic and diverse data models
       • Cell-level Security promotes sharing
     Weaknesses
       • Sorting induces write multiplication factor
       • Sparse schema support induces additional storage overhead
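The tablet data flow named on this slide can be sketched as a toy log-structured store. This is a simplified model under stated assumptions, not Accumulo's implementation: the `Tablet` class, its `flush_threshold`, and the `entries_written` counter are hypothetical, and the write-ahead log is omitted. Writes land in an in-memory map, a minor compaction flushes that map to an immutable sorted run, reads merge all runs, and a major compaction rewrites the runs into one.

```python
import heapq

class Tablet:
    def __init__(self, flush_threshold=3):
        self.memory = {}            # in-memory map (a real server also logs each write)
        self.files = []             # immutable sorted runs, oldest first
        self.flush_threshold = flush_threshold
        self.entries_written = 0    # entries written to "disk", to show write multiplication

    def write(self, key, value):
        self.memory[key] = value
        if len(self.memory) >= self.flush_threshold:
            self.minor_compact()

    def minor_compact(self):
        # Flush the in-memory map to a new sorted run.
        if self.memory:
            run = sorted(self.memory.items())
            self.entries_written += len(run)
            self.files.append(run)
            self.memory = {}

    def major_compact(self):
        # Merge every run into a single sorted run, rewriting all surviving entries.
        self.minor_compact()
        merged = list(self._merged(self.files))
        self.entries_written += len(merged)
        self.files = [merged]

    def scan(self):
        # Reads merge the sorted runs with the current in-memory map.
        return list(self._merged(self.files + [sorted(self.memory.items())]))

    @staticmethod
    def _merged(runs):
        # Tag entries with the negated run age so the newest version of a key sorts first.
        tagged = [[(k, -age, v) for k, v in run] for age, run in enumerate(runs)]
        last = object()
        for k, _, v in heapq.merge(*tagged):
            if k != last:
                yield k, v
                last = k

tablet = Tablet(flush_threshold=2)
for k, v in [("b", 1), ("a", 2), ("b", 3), ("c", 4), ("d", 5)]:
    tablet.write(k, v)
print(tablet.scan())
tablet.major_compact()
print(tablet.entries_written)
```

In this toy run five logical writes cause more than five entries to be written to the sorted runs, which is one way to read the "write multiplication factor" listed under Weaknesses.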
  8. TODAY’S DISCUSSION
     Data Modeling and Indexing for Apache Accumulo
       1. Sqrrl and Accumulo Technology Review
       2. Table Designs
          1. Dynamic Documents
          2. Graphs
          3. Inverted Indexes
       3. Putting It All Together with Sqrrl
  9. PROXY/NETFLOW EXAMPLE
       Source    Destination   Port   Bytes In     Bytes Out    Protocol
       10.1.2.3  google.com    80     73,824       15,632       http
       10.1.2.4  facebook.com  443    10,328       13,284,129   https
       10.1.2.4  google.com    80     623,249      93,125       http
       10.1.2.3  abcd1234.ru   31337  158          523,698,104  unknown
       10.1.2.3  netflix.com   443    434,855,357  1,392,994    https
       10.1.2.4  google.com    443    23,084       583,331      https
       10.1.2.3  10.1.2.5      22     204          158          ssh
  10. INDEXES AND QFDS
      [Figure: Input (Logs/Observations) -> Transformation -> Indexes and Question-Focused Datasets]
      Properties listed on the slide:
        • Immutable
        • Append-Only
        • Real-Time
        • Online
        • Sorted
        • Grouped
        • Aggregated
  11. QFD KEY GENERATION
        Source    Destination  Port  Bytes In  Bytes Out  Protocol
        10.1.2.3  google.com   80    73,824    15,632     http
      Generated contributions:
        Key                       -> Value
        10.1.2.3, Bytes In        -> +73,824
        10.1.2.3, Bytes Out       -> +15,632
        10.1.2.3, Ports Used      -> +{80}
        10.1.2.3, Protocols Used  -> +{http}
      Hosts QFD: 0x00 . . . 0xFF
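The key generation shown on this slide can be sketched as a small function that turns one log record into its contributions. The function name `qfd_contributions` and the dictionary field names are assumptions made for illustration; only the four source-side keys shown on the slide are produced.

```python
def qfd_contributions(record):
    """Turn one proxy/netflow record into (key, contribution) pairs for the Hosts QFD."""
    source = record["source"]
    return [
        ((source, "Bytes In"), record["bytes_in"]),            # numeric: added to a running total
        ((source, "Bytes Out"), record["bytes_out"]),
        ((source, "Ports Used"), {record["port"]}),            # set: unioned into the stored set
        ((source, "Protocols Used"), {record["protocol"]}),
    ]

# The first row of the proxy/netflow example.
record = {
    "source": "10.1.2.3",
    "destination": "google.com",
    "port": 80,
    "bytes_in": 73824,
    "bytes_out": 15632,
    "protocol": "http",
}

for key, contribution in qfd_contributions(record):
    print(key, "-> +", contribution)
```

Each input record stays immutable. The record only emits deltas, so the same record can feed several question-focused datasets with different key layouts.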
  12. HOSTS QFD WITH AGGREGATION
        IP        Ports Used            Protos Used                  Total Bytes In  Total Bytes Out  Ports Hosted  Protos Hosted
        10.1.2.3  {22, 80, 443, 31337}  {http, https, ssh, unknown}  434,931,543     525,106,888      -             -
        10.1.2.4  {80, 443}             {http, https}                656,661         13,960,585       -             -
        10.1.2.5  -                     -                            158             204              {22}          {ssh}
      New Contribution: (10.1.2.5, Total Bytes In -> +3,215)
      158 + 3,215 = 3,373
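The aggregation step on this slide amounts to merging each new contribution into the value already stored under the same key. The sketch below shows that merge in plain Python; in Accumulo this kind of merge is usually performed server-side by a combiner iterator, which this sketch does not use. The names `combine` and `apply_contributions` are hypothetical.

```python
def combine(current, contribution):
    """Merge one contribution into the stored value: sets are unioned, numbers are added."""
    if isinstance(contribution, (set, frozenset)):
        return (current or set()) | contribution
    return (current or 0) + contribution

def apply_contributions(qfd, contributions):
    # qfd maps (host, column) -> aggregated value; missing keys start empty.
    for key, contribution in contributions:
        qfd[key] = combine(qfd.get(key), contribution)
    return qfd

# Two existing cells, followed by the slide's new contribution and a set-valued one.
hosts_qfd = {
    ("10.1.2.5", "Total Bytes In"): 158,
    ("10.1.2.4", "Ports Used"): {80},
}
apply_contributions(hosts_qfd, [
    (("10.1.2.5", "Total Bytes In"), 3215),
    (("10.1.2.4", "Ports Used"), {443}),
])
print(hosts_qfd[("10.1.2.5", "Total Bytes In")])  # 3373
```

Because addition and set union are associative and commutative, contributions can be applied in any order or in batches and still reach the same aggregate.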
  13. CONNECTIVITY GRAPH
      [Figure: graph with nodes facebook.com, google.com, abcd1234.ru, netflix.com, 10.1.2.3, 10.1.2.4, 10.1.2.5]
      Outgoing edges:
        Row       Col. Fam.  Col. Qual.    Val.
        10.1.2.3  Contacts   10.1.2.5      -
        10.1.2.3  Contacts   abcd1234.ru   -
        10.1.2.3  Contacts   google.com    -
        10.1.2.3  Contacts   netflix.com   -
        10.1.2.4  Contacts   facebook.com  -
        10.1.2.4  Contacts   google.com    -
      Incoming edges:
        Row           Col. Fam.  Col. Qual.  Val.
        10.1.2.5      Serves     10.1.2.3    -
        abcd1234.ru   Serves     10.1.2.3    -
        facebook.com  Serves     10.1.2.4    -
        google.com    Serves     10.1.2.3    -
        google.com    Serves     10.1.2.4    -
        netflix.com   Serves     10.1.2.3    -
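The two edge tables above can be sketched as a single sorted list of (row, column family, column qualifier) entries, where each connection is stored once per direction. Keeping both directions in one list is an assumption of this sketch, and `edge_entries` and `neighbors` are hypothetical names. A neighbor lookup is then a short range scan starting at the (row, family) prefix.

```python
import bisect

def edge_entries(source, destination):
    # Store each edge twice so both endpoints can be looked up by row.
    return [(source, "Contacts", destination), (destination, "Serves", source)]

# Source/destination pairs from the proxy/netflow example.
connections = [
    ("10.1.2.3", "google.com"), ("10.1.2.4", "facebook.com"), ("10.1.2.4", "google.com"),
    ("10.1.2.3", "abcd1234.ru"), ("10.1.2.3", "netflix.com"), ("10.1.2.3", "10.1.2.5"),
]

# A set removes repeated connections; sorting mimics the table's key order.
table = sorted({entry for s, d in connections for entry in edge_entries(s, d)})

def neighbors(table, row, family):
    # Seek to the first entry for (row, family), then read until the prefix changes.
    start = bisect.bisect_left(table, (row, family, ""))
    found = []
    for r, f, qualifier in table[start:]:
        if (r, f) != (row, family):
            break
        found.append(qualifier)
    return found

print(neighbors(table, "google.com", "Serves"))
print(neighbors(table, "10.1.2.3", "Contacts"))
```

Storing the reverse edge doubles the write volume but lets a question such as "which hosts contacted google.com" be answered with one contiguous scan instead of a full table pass.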
  14. INVERTED INDEXING
        Table           Row      Column Family  Column Qualifier  Value
        Forward Index   <UUID>   <Type>         <Field>           <Term>
        Inverted Index  <Field>  <Term>         <UUID>            <Digest of Event>
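The forward and inverted layouts in this table can be sketched by generating both sets of entries from one event. This is a minimal sketch under stated assumptions: the slide does not say how the digest is computed, so a short SHA-1 prefix stands in for it here, event IDs are supplied by the caller rather than generated as UUIDs, and the names `index_event` and `lookup` are hypothetical.

```python
import hashlib

def index_event(event_id, event_type, event):
    """Return (forward, inverted) entries as (row, family, qualifier, value) tuples."""
    fields = sorted(event.items())
    summary = ",".join("%s=%s" % item for item in fields)
    digest = hashlib.sha1(summary.encode("utf-8")).hexdigest()[:8]  # stand-in digest
    forward = [(event_id, event_type, field, str(term)) for field, term in fields]
    inverted = [(field, str(term), event_id, digest) for field, term in fields]
    return forward, inverted

def lookup(inverted_table, field, term):
    # Entries for one (field, term) pair are adjacent in a sorted table;
    # a linear filter keeps the sketch short.
    return [event_id for f, t, event_id, _ in inverted_table if (f, t) == (field, str(term))]

events = {
    "e1": {"Source": "10.1.2.3", "Destination": "google.com", "Port": 80},
    "e2": {"Source": "10.1.2.4", "Destination": "google.com", "Port": 443},
}

forward_table, inverted_table = [], []
for event_id, event in events.items():
    fwd, inv = index_event(event_id, "netflow", event)
    forward_table.extend(fwd)
    inverted_table.extend(inv)
forward_table.sort()
inverted_table.sort()

print(lookup(inverted_table, "Destination", "google.com"))
```

The inverted table answers "which events have this term" and the forward table, keyed by event ID, returns the full event once the IDs are known.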
  15. INVERTED INDEXING
  16. ADVANCED INDEXING
      Table: Shard Table
      Row: <Partition ID>
      Column Families: "Docs", "Inv. Index", "Field Index", "Geo"
      Column Qualifier (Tuples) and Value entries: <UUID>, <Value>; <Term>, <UUID>; <Field:Term>, <UUID> <Field>; <Hash>, <UUID>
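One way to read the shard table is as a document-partitioned index, where a document and its index entries share a row so that each partition can answer a query locally. The sketch below illustrates that idea only. The exact qualifier layout is an assumption rather than the slide's definition, as are the CRC-based `partition_id`, the `NUM_PARTITIONS` value, and the `shard_entries` and `search` helpers.

```python
import zlib

NUM_PARTITIONS = 4  # hypothetical partition count

def partition_id(doc_id):
    # Assumed scheme: hash the document ID so documents spread evenly over partitions.
    return "%02d" % (zlib.crc32(doc_id.encode("utf-8")) % NUM_PARTITIONS)

def shard_entries(doc_id, doc):
    """Emit the document and its field-index entries under the same partition row."""
    row = partition_id(doc_id)
    entries = [(row, "Docs", doc_id, repr(doc))]
    for field, term in doc.items():
        qualifier = "%s:%s\x00%s" % (field, term, doc_id)   # assumed <Field:Term>, <UUID> tuple
        entries.append((row, "Field Index", qualifier, ""))
    return entries

def search(table, field, term):
    # A real query would run one range scan per partition in parallel;
    # a single pass over the sorted entries stands in for that here.
    prefix = "%s:%s\x00" % (field, term)
    return [(row, qualifier[len(prefix):])
            for row, family, qualifier, _ in table
            if family == "Field Index" and qualifier.startswith(prefix)]

docs = {
    "d1": {"Destination": "google.com", "Protocol": "http"},
    "d2": {"Destination": "google.com", "Protocol": "https"},
}
table = sorted(entry for doc_id, doc in docs.items() for entry in shard_entries(doc_id, doc))

print(search(table, "Destination", "google.com"))
```

Because every hit carries the partition row of its document, the matching "Docs" entry can be fetched from the same row without a second cross-partition lookup.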
  17. TODAY’S DISCUSSION
      Data Modeling and Indexing for Apache Accumulo
        1. Sqrrl and Accumulo Technology Review
        2. Table Designs
           1. Dynamic Documents
           2. Graphs
           3. Inverted Indexes
        3. Putting It All Together with Sqrrl
  18. SQRRL ENTERPRISE
      Simple API for Advanced Accumulo Usage
        • Dynamic Documents
            • JSON I/O support
            • Cell-level Security and Efficient Aggregation Extensions
        • Dynamic Graphs
            • Co-partitioned with Documents for Integrated Search and Discovery
        • Search
            • Lucene Query Syntax
            • Accumulo Indexes Preserve Security Model
        • Processing
            • SQL-Like Language for Transforming and Aggregating Results
            • Parallel Slicing and Extraction
  19. REAL-TIME OPERATIONAL APPS
      Contact us for a demo
  20. HOW TO LEARN MORE
        • Download our White Paper: www.sqrrl.com/whitepaper
        • Watch a video: www.sqrrl.com/downloads#videos
        • Request a demo or one-on-one workshop: www.sqrrl.com/contact
        • Come meet us:
            • Accumulo Meetup (October 28, New York)
            • Strata + Hadoop World (October 28-30, New York)
            • IBM IOD (November 4-7, Las Vegas)
            • SC13 (November 18-21, Denver)
  21. THANK YOU
      Thanks for attending! To keep up to date with Sqrrl, check out our social media sites:
        • www.twitter.com/sqrrl_inc
        • www.linkedin.com/company/sqrrl
