Securely explore your data
DATA MODELING AND INDEXING FOR APACHE ACCUMULO
Sqrrl Webinar Series
October, 2013
Adam Fuchs, CTO
Sqrrl Data, Inc.
RECAP
In our September webinar, "Sqrrl, Apache Accumulo, and Cell-Level Security":
1.  Introduction to Sqrrl and Accumulo
2.  Security In The Wild
3.  Sqrrl and Accumulo Technology
4.  The Data-Centric Security Ecosystem
Sqrrl Data, Inc. Confidential and Proprietary
TODAY’S DISCUSSION
Data Modeling and Indexing for Apache Accumulo
1.  Sqrrl and Accumulo Technology Review
2.  Table Designs
    1.  Dynamic Documents
    2.  Graphs
    3.  Inverted Indexes
3.  Putting It All Together with Sqrrl
LAYERED ARCHITECTURE
Turtles all the way down...
Application
Sqrrl API over Apache Thrift RPC (JSON, Graph, Aggregation, Search, etc.)
Sqrrl Enterprise
Accumulo RPC (Sorted Key/Value I/O)
Hadoop RPC (File I/O)
ACCUMULO DATA FORMAT
An Accumulo key is a 5-tuple, consisting of:
•  Row: Controls Atomicity
•  Column Family: Controls Locality
•  Column Qualifier: Controls Uniqueness
•  Visibility Label: Controls Access
•  Timestamp: Controls Versioning

Accumulo Key/Value Example
Row       Col. Fam.     Col. Qual.     Visibility   Timestamp  Value
John Doe  Notes         PCP            PCP_JD       20120912   Patient suffers from an acute …
John Doe  Test Results  Cholesterol    JD|PCP_JD    20120912   183
John Doe  Test Results  Mental Health  JD|PSYCH_JD  20120801   Pass
John Doe  Test Results  X-Ray          JD|PHYS_JD   20120513   1010110110100…
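
To make the role of each key component concrete, here is a minimal Python sketch that models the John Doe entries as plain tuples, sorts them the way a sorted key/value store would, and filters them by visibility label. It is not the Accumulo API: `Entry`, `sort_key`, `visible`, and `scan` are hypothetical names for this illustration, and the label check handles only the `|` (OR) operator, whereas real Accumulo visibility expressions also support `&` and parentheses.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    row: str
    family: str
    qualifier: str
    visibility: str
    timestamp: int
    value: str


def sort_key(entry):
    # Row, family, qualifier, and visibility sort ascending;
    # timestamp sorts descending so the newest version comes first.
    return (entry.row, entry.family, entry.qualifier,
            entry.visibility, -entry.timestamp)


def visible(label, authorizations):
    # Simplified check (assumption): the label is a list of alternatives
    # joined by "|", and holding any one of them grants access.
    return any(term in authorizations for term in label.split("|"))


def scan(entries, authorizations):
    # Return entries in sorted key order, keeping only those the caller may see.
    return [e for e in sorted(entries, key=sort_key)
            if visible(e.visibility, authorizations)]


ENTRIES = [
    Entry("John Doe", "Test Results", "X-Ray", "JD|PHYS_JD", 20120513, "1010110110100…"),
    Entry("John Doe", "Test Results", "Mental Health", "JD|PSYCH_JD", 20120801, "Pass"),
    Entry("John Doe", "Notes", "PCP", "PCP_JD", 20120912, "Patient suffers from an acute …"),
    Entry("John Doe", "Test Results", "Cholesterol", "JD|PCP_JD", 20120912, "183"),
]

if __name__ == "__main__":
    for entry in scan(ENTRIES, {"PCP_JD"}):
        print(entry.family, entry.qualifier, entry.value)
```

With the `PCP_JD` authorization the scan returns the notes and the cholesterol result; with `JD` it returns the three test results and not the notes.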
THE ACCUMULO CLIENT API
Instance: new ZooKeeperInstance(...) or new MockInstance()
  getConnector(...) -> Connector
    TableOperations, InstanceOperations, SecurityOperations
    createScanner(...) -> Scanner
    createBatchScanner(...) -> BatchScanner
      Range, IteratorOption
      iterator() -> Map.Entry of Key, Value
    createBatchWriter(...) -> BatchWriter
      addMutation(...) <- Mutation
ACCUMULO TECHNOLOGY
[Figure: Tablet Data Flow. Writes go to the In-Memory Map and the Write Ahead Log (For Recovery). Minor Compaction flushes the In-Memory Map to a Sorted, Indexed File; Merging / Major Compaction combines Sorted, Indexed Files. Reads (Scan) merge the In-Memory Map and the files. An Iterator Tree runs at each of Minor Compaction, Merging / Major Compaction, and Scan.]
[Figure: Cluster. Applications, Tablet Servers hosting Tablets, Master, ZooKeeper, and HDFS, with arrows labeled Read/Write, Store/Replicate, Assign/Balance, and Delegate Authority.]
Strengths
•  Shared-Nothing => Scalability
•  Micro-Batching for Efficient Random I/O
•  High Concurrency, Low Latency for Denormalized Data
•  Sparse, Flexible Schema supports dynamic and diverse data models
•  Cell-level Security promotes sharing
Weaknesses
•  Sorting induces write multiplication factor
•  Sparse schema support induces additional storage overhead
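
The write path and the "write multiplication" weakness can be illustrated with a toy log-structured store in Python. This is a sketch under simplifying assumptions, not Accumulo's implementation: the `Tablet` class, its `flush_threshold`, and the `entries_written_to_files` counter are hypothetical, files are plain in-memory lists, and iterators, HDFS, and recovery are left out.

```python
import heapq


class Tablet:
    def __init__(self, flush_threshold=2):
        self.memtable = {}   # in-memory map: key -> latest value
        self.wal = []        # write-ahead log, kept until the next flush
        self.files = []      # sorted, immutable "files", oldest first
        self.flush_threshold = flush_threshold
        self.entries_written_to_files = 0

    def write(self, key, value):
        self.wal.append((key, value))
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.minor_compact()

    def minor_compact(self):
        # Flush the in-memory map to a new sorted file and clear the log.
        if not self.memtable:
            return
        self.files.append(sorted(self.memtable.items()))
        self.entries_written_to_files += len(self.memtable)
        self.memtable = {}
        self.wal = []

    def major_compact(self):
        # Merge all files into one; entries from newer files win.
        merged = {}
        for sorted_file in self.files:
            merged.update(sorted_file)
        output = sorted(merged.items())
        self.files = [output]
        self.entries_written_to_files += len(output)

    def scan(self):
        # Merge-read every sorted source; for equal keys the newest source wins.
        sources = self.files + [sorted(self.memtable.items())]
        tagged = [[(k, -age, v) for k, v in src] for age, src in enumerate(sources)]
        result, last_key = [], object()
        for key, _, value in heapq.merge(*tagged):
            if key != last_key:
                result.append((key, value))
                last_key = key
        return result


if __name__ == "__main__":
    tablet = Tablet()
    for key, value in [("b", "1"), ("a", "2"), ("a", "3"), ("c", "4")]:
        tablet.write(key, value)
    tablet.major_compact()
    print(tablet.scan(), tablet.entries_written_to_files)
```

In this run four logical writes cause seven entries to be written to files (two flushes of two entries, then a three-entry merge), which is the kind of overhead that keeping data sorted introduces.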
PROXY/NETFLOW EXAMPLE
Source    Destination   Port   Bytes In     Bytes Out    Protocol
10.1.2.3  google.com    80     73,824       15,632       http
10.1.2.4  facebook.com  443    10,328       13,284,129   https
10.1.2.4  google.com    80     623,249      93,125       http
10.1.2.3  abcd1234.ru   31337  158          523,698,104  unknown
10.1.2.3  netflix.com   443    434,855,357  1,392,994    https
10.1.2.4  google.com    443    23,084       583,331      https
10.1.2.3  10.1.2.5      22     204          158          ssh
INDEXES AND QFDS
[Figure: Input (Logs / Observations) -> Transformation -> Indexes and Question-Focused Datasets (QFDs)]
•  Immutable
•  Append-Only
•  Real-Time
•  Online
•  Sorted
•  Grouped
•  Aggregated
QFD KEY GENERATION
Source Destination Port Bytes In Bytes Out Protocol
10.1.2.3 google.com 80 73,824 15,632 http

Key                       ->  Value
10.1.2.3, Bytes In        ->  +73,824
10.1.2.3, Bytes Out       ->  +15,632
10.1.2.3, Ports Used      ->  +{80}
10.1.2.3, Protocols Used  ->  +{http}

[Figure: Hosts QFD, with rows spanning 0x00 ... 0xFF]
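
The key generation step above can be sketched as a small Python function that turns one flow record into per-host contributions. The record field names (`source`, `port`, `bytes_in`, and so on) and the function name `qfd_contributions` are assumptions for this illustration; the slide only shows the resulting keys and values.

```python
def qfd_contributions(record):
    """Turn one proxy/netflow record into (key, contribution) pairs for the Hosts QFD.

    Numeric contributions are meant to be summed and set contributions to be
    unioned when they are later merged into the table.
    """
    host = record["source"]
    return [
        ((host, "Bytes In"), record["bytes_in"]),
        ((host, "Bytes Out"), record["bytes_out"]),
        ((host, "Ports Used"), {record["port"]}),
        ((host, "Protocols Used"), {record["protocol"]}),
    ]


RECORD = {
    "source": "10.1.2.3",
    "destination": "google.com",
    "port": 80,
    "bytes_in": 73824,
    "bytes_out": 15632,
    "protocol": "http",
}

if __name__ == "__main__":
    for key, contribution in qfd_contributions(RECORD):
        print(key, "->", contribution)
```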
HOSTS QFD WITH AGGREGATION
IP        Ports Used            Protos Used                  Total Bytes In  Total Bytes Out  Ports Hosted  Protos Hosted
10.1.2.3  {22, 80, 443, 31337}  {http, https, ssh, unknown}  434,929,543     525,106,888      -             -
10.1.2.4  {80, 443}             {http, https}                656,661         13,960,585       -             -
10.1.2.5  -                     -                            158             204              {22}          {ssh}

New Contribution: (10.1.2.5, Total Bytes In -> +3,215)
158 + 3,215 = 3,373
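
A minimal Python sketch of the aggregation step follows: contributions for the same key are merged by summing numbers and unioning sets. The `combine` and `apply_contributions` helpers are hypothetical names, and a dictionary stands in for the table. In Accumulo this kind of merge is typically performed server-side by a combiner iterator rather than in client code.

```python
def combine(current, contribution):
    # Sets are unioned, numbers are summed; a missing cell takes the contribution.
    if current is None:
        return contribution
    if isinstance(current, (set, frozenset)):
        return current | contribution
    return current + contribution


def apply_contributions(table, contributions):
    # Merge each (key, contribution) pair into the table in place.
    for key, contribution in contributions:
        table[key] = combine(table.get(key), contribution)
    return table


if __name__ == "__main__":
    hosts = {
        ("10.1.2.5", "Total Bytes In"): 158,
        ("10.1.2.4", "Ports Used"): {80},
    }
    apply_contributions(hosts, [
        (("10.1.2.5", "Total Bytes In"), 3215),
        (("10.1.2.4", "Ports Used"), {443}),
    ])
    print(hosts[("10.1.2.5", "Total Bytes In")])  # 158 + 3215
    print(sorted(hosts[("10.1.2.4", "Ports Used")]))
```

Because both operations are associative and commutative, contributions can be applied in any order or pre-merged in batches without changing the result.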
CONNECTIVITY GRAPH
[Figure: graph with nodes 10.1.2.3, 10.1.2.4, 10.1.2.5, facebook.com, google.com, abcd1234.ru, netflix.com]
Row       Col. Fam.  Col. Qual.    Val.
10.1.2.3  Contacts   10.1.2.5      -
10.1.2.3  Contacts   abcd1234.ru   -
10.1.2.3  Contacts   google.com    -
10.1.2.3  Contacts   netflix.com   -
10.1.2.4  Contacts   facebook.com  -
10.1.2.4  Contacts   google.com    -

Row           Col. Fam.  Col. Qual.  Val.
10.1.2.5      Serves     10.1.2.3    -
abcd1234.ru   Serves     10.1.2.3    -
facebook.com  Serves     10.1.2.4    -
google.com    Serves     10.1.2.3    -
google.com    Serves     10.1.2.4    -
netflix.com   Serves     10.1.2.4    -
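
The two edge tables can be derived from the flow records with a few lines of Python. This is an illustrative sketch: `edge_entries` and `neighbors` are hypothetical helpers, entries are plain `(row, family, qualifier)` tuples, and the flow list repeats the source/destination pairs from the proxy/netflow example.

```python
FLOWS = [
    ("10.1.2.3", "google.com"),
    ("10.1.2.4", "facebook.com"),
    ("10.1.2.4", "google.com"),
    ("10.1.2.3", "abcd1234.ru"),
    ("10.1.2.3", "netflix.com"),
    ("10.1.2.4", "google.com"),
    ("10.1.2.3", "10.1.2.5"),
]


def edge_entries(flows):
    # Store each edge twice: once keyed by source, once keyed by destination.
    # Sets remove duplicate flows; sorting mimics the store's key order.
    contacts = sorted({(src, "Contacts", dst) for src, dst in flows})
    serves = sorted({(dst, "Serves", src) for src, dst in flows})
    return contacts, serves


def neighbors(entries, row):
    # Entries for one row are contiguous in sorted order; a real scan would
    # seek to that range instead of filtering the whole list.
    return [qualifier for r, _, qualifier in entries if r == row]


if __name__ == "__main__":
    contacts, serves = edge_entries(FLOWS)
    print(neighbors(contacts, "10.1.2.3"))
    print(neighbors(serves, "google.com"))
```

Keeping both directions means "whom did this host contact?" and "who contacted this server?" are each answered by reading a single row.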
INVERTED INDEXING
Table:             Forward Index  Inverted Index
Row:               <UUID>         <Field>
Column Family:     <Type>         <Term>
Column Qualifier:  <Field>        <UUID>
Value:             <Term>         <Digest of Event>
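
As a sketch of how one event could populate a forward index (row = UUID, family = type, qualifier = field, value = term) and an inverted index (row = field, family = term, qualifier = UUID, value = digest of the event), consider the Python below. The helper names are hypothetical, and because the slide does not define the digest, a truncated SHA-1 hash of the event is used here purely as a stand-in.

```python
import hashlib
import json


def index_event(uuid, event_type, event):
    # Stand-in digest (assumption): a short hash of the serialized event.
    digest = hashlib.sha1(
        json.dumps(event, sort_keys=True).encode("utf-8")
    ).hexdigest()[:8]
    fields = sorted(event.items())
    # Forward index: (row=UUID, family=type, qualifier=field, value=term)
    forward = [(uuid, event_type, field, str(term)) for field, term in fields]
    # Inverted index: (row=field, family=term, qualifier=UUID, value=digest)
    inverted = [(field, str(term), uuid, digest) for field, term in fields]
    return forward, inverted


def lookup(inverted, field, term):
    # Find the UUIDs of all events whose field holds the given term.
    return [uuid for f, t, uuid, _ in inverted if f == field and t == term]


if __name__ == "__main__":
    _, inv_a = index_event("uuid-1", "flow", {"destination": "google.com", "port": 80})
    _, inv_b = index_event("uuid-2", "flow", {"destination": "google.com", "port": 443})
    inverted = sorted(inv_a + inv_b)
    print(lookup(inverted, "destination", "google.com"))
```

The forward index answers "what does event X contain?", while the inverted index answers "which events contain this term in this field?" without scanning every event.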
ADVANCED INDEXING
Table: Shard Table
Row: <Partition ID>

Column Family  Column Qualifier (Tuples)  Value
"Docs"         <UUID>, <Field>            <Value>
"Inv. Index"   <Term>, <UUID>
"Field Index"  <Field:Term>, <UUID>
"Geo"          <Hash>, <UUID>
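
A rough Python sketch of a shard-table layout follows, assuming the slide reads as: row = partition ID, with "Docs" entries qualified by (UUID, field), "Inv. Index" entries by (term, UUID), and "Field Index" entries by (field:term, UUID). The partitioning function (CRC32 of the UUID modulo a fixed partition count), the whitespace tokenizer, and the empty index values are all assumptions for illustration; the "Geo" family is omitted because the slide does not say how its hash is computed.

```python
import zlib

NUM_PARTITIONS = 4  # hypothetical partition count


def partition_id(uuid):
    # Assumption: documents are spread over partitions by hashing the UUID.
    return "%02d" % (zlib.crc32(uuid.encode("utf-8")) % NUM_PARTITIONS)


def shard_entries(uuid, doc):
    # Every entry for a document shares one row, so the document and its
    # index entries live in the same partition.
    row = partition_id(uuid)
    entries = []
    for field, value in doc.items():
        entries.append((row, "Docs", (uuid, field), value))
        for term in str(value).lower().split():
            entries.append((row, "Inv. Index", (term, uuid), ""))
            entries.append((row, "Field Index", (field + ":" + term, uuid), ""))
    return sorted(entries, key=lambda e: (e[0], e[1], e[2]))


if __name__ == "__main__":
    for entry in shard_entries("uuid-1", {"title": "Hello World"}):
        print(entry)
```

Co-locating a document with its index entries lets a query be evaluated inside each partition independently, at the cost of sending every query to all partitions.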
SQRRL ENTERPRISE
Simple API for Advanced Accumulo Usage
•  Dynamic Documents
    •  JSON I/O support
    •  Cell-level Security and Efficient Aggregation Extensions
•  Dynamic Graphs
    •  Co-partitioned with Documents for Integrated Search and Discovery
•  Search
    •  Lucene Query Syntax
    •  Accumulo Indexes Preserve Security Model
•  Processing
    •  SQL-Like Language for Transforming and Aggregating Results
    •  Parallel Slicing and Extraction
REAL-TIME OPERATIONAL APPS
Contact us for a demo
HOW TO LEARN MORE
Download our White Paper
•  www.sqrrl.com/whitepaper
Watch a video
•  www.sqrrl.com/downloads#videos
Request a demo or one-on-one workshop
•  www.sqrrl.com/contact
Come meet us
•  Accumulo Meetup (October 28, New York)
•  Strata + Hadoop World (October 28-30, New York)
•  IBM IOD (November 4-7, Las Vegas)
•  SC13 (November 18-21, Denver)
THANK YOU
Thanks for attending!
To keep up to date with Sqrrl, check out our social media sites:
www.twitter.com/sqrrl_inc
www.linkedin.com/company/sqrrl

Sqrrl October Webinar: Data Modeling and Indexing