Securely explore your data

SQRRL ENTERPRISE +
APACHE ACCUMULO:
A secure, scalable, real-time
analysis framework

Adam Fuc...
OUTLINE
Two Halves of “Real-Time”
Accumulo and Sqrrl Technology
Data-Centric Security
Table Designs
Performance Benchmarks...
TWO HALVES OF REAL-TIME
Data-Driven
Real-Time  reduce event to reaction time

© 2013 Sqrrl | All Rights Reserved | Propri...
Data-Driven + Query-Driven Real-Time Ecosystem
Actions
3

SPE
4

Data

1

Dashboards

2
5

NoSQL+
6

1.
2.
3.
4.
5.
6.

In...
This talk focuses on the database.
Actions
3

SPE
4

Data

1

Dashboards

2
5

NoSQL+
6

Interactive
Analysis Tools
(Disco...
OUTLINE
Two Halves of “Real-Time”
Accumulo and Sqrrl Technology
Data-Centric Security
Table Designs
Performance Benchmarks...
ACCUMULO DATA FORMAT
An Accumulo key is a 5-tuple, consisting of:
- Row: Controls Atomicity
- Column Family: Controls Loca...
ACCUMULO TABLETS
Well-Known
Location
(zookeeper)

Collections of KV pairs form Tables
Tables are partitioned into Tablets
...
ACCUMULO PROCESSES
Zookeeper

Tablet Server

Zookeeper
Zookeeper

Tablet
Read/Write

Delegate
Authority

Assign/Balance

T...
TABLET DATA FLOW
Tablet
Writes

In-Memory
Map

Iterator
Tree

Iterator
Tree

Reads

Minor
Compaction

Sorted,
Indexed
File...
WORD COUNT:
Summing Aggregating Iterator

Input Corpus

© 2013 Sqrrl | All Rights Reserved | Proprietary and Confidential
...
ITERATOR FRAMEWORK
Iterator Operations:
- File Reads
- Block Caching
- Merging
- Deletion
- Isolation
- Locality Groups
- ...
ACCUMULO LATENCIES
~ms

~ms

Ingesters

Tablet Servers
InInInMemory
Memory
Memory
Map
Map
Map

Batch
Writer

ms - min

Inp...
ACCUMULO THROUGHPUT
Scan:
up to 1M entries/s
per node

Ingest:
up to 500K entries/s
per node
~ms

~ms

Ingesters

Tablet S...
SQRRL ENTERPRISE
Bulk Processing
Integration

Exploratory /
Operational Apps

Built on Apache Accumulo
Graph +
Document I/...
OUTLINE
Two Halves of “Real-Time”
Accumulo and Sqrrl Technology
Data-Centric Security
Table Designs
Performance Benchmarks...
DATA-CENTRIC SECURITY
Definition: Data carries with it information that is required
to make policy decisions on its releas...
SECURITY
Example Accumulo Key/Value Pairs

Accumulo is the only
NoSQL database with
cell-level access
controls
© 2013 Sqrr...
DATA-CENTRIC SECURITY ECOSYSTEM
Key
Mgmt

Audits
End Users

Labeler

Sqrrl
Enterprise

Policies

Data

Policy
Engine

Apps...
OUTLINE
Two Halves of “Real-Time”
Accumulo and Sqrrl Technology
Data-Centric Security
Table Designs
Performance Benchmarks...
HIERARCHICAL DECOMPOSITION
<person>

Row:

Column Family:

Column Qualifier:

Value:

attribute

purchases

returns

age

...
MATERIALIZED TABLE
Key/Value Pair

Row:

bill

Column
Family:

george

attribute purchases

attribute purchases

Column
Qu...
FORWARD AND INVERTED INDEX
Table:

Forward Index

Inverted Index

Row:

<UUID>

<Term>

Column Family:

<Type>

<UUID>

Co...
FORWARD AND INVERTED INDEX

© 2013 Sqrrl | All Rights Reserved | Proprietary and Confidential

24
CUSTOM INDEXING
Table:

Geo Index

Latitude

Longitude

10110101001 00111010010

11010110110

<GeoHash>

Row:

10100111011...
D4M 2.0 SCHEMA FOR TWITTER DATA
Table:

Tedge

TedgeT

Row:

<UUID>

<value>

Column Family:

“stat”

“time”

“user”

“wor...
D4M 2.0 SCHEMA FOR TWITTER DATA
Table:

TedgeDegT

Ttext

Row:

<value>

<UUID>

Column Family:

“stat”

“time”

“user”

“...
D4M 2.0 SCHEMA FOR TWITTER DATA

Source: D4M 2.0 Schema: A General Purpose High Performance Schema for the Accumulo Databa...
OUTLINE
Two Halves of “Real-Time”
Accumulo and Sqrrl Technology
Data-Centric Security
Table Designs
Performance Benchmarks...
ACCUMULO WITH D4M 2.0 SCHEMA PERFORMANCE
Maximizing throughput on an 8-node, 192-core cluster:

Source: D4M 2.0 Schema: A ...
ACCUMULO SCALABILITY: GRAPH500 BENCHMARK

source: http://www.pdl.cmu.edu/SDI/2013/slides/big_graph_nsa_rd_2013_56002v1.pdf...
ATOMIC INCREMENT PERFORMANCE COMPARISON
Read/Modify/Write (HBase) vs. Iterators/Combiners (Accumulo)

© 2013 Sqrrl | All R...
QUESTIONS?

Adam Fuchs, CTO
Sqrrl Data, Inc.

© 2013 Sqrrl | All Rights Reserved | Proprietary and Confidential

33
Upcoming SlideShare
Loading in …5
×

Adam Fuchs' Accumulo Talk at NoSQL Now! 2013

1,656 views
1,491 views

Published on

Adam Fuch provides an overview of Accumulo and Sqrrl Enterprise at the 2013 NoSQL Now! conference

Published in: Technology, Business
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
1,656
On SlideShare
0
From Embeds
0
Number of Embeds
257
Actions
Shares
0
Downloads
74
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Adam Fuchs' Accumulo Talk at NoSQL Now! 2013

  1. 1. Securely explore your data SQRRL ENTERPRISE + APACHE ACCUMULO: A secure, scalable, real-time analysis framework Adam Fuchs, CTO Sqrrl Data, Inc. August 21, 2013
  2. 2. OUTLINE Two Halves of “Real-Time” Accumulo and Sqrrl Technology Data-Centric Security Table Designs Performance Benchmarks © 2013 Sqrrl | All Rights Reserved | Proprietary and Confidential
  3. 3. TWO HALVES OF REAL-TIME Data-Driven Real-Time  reduce event to reaction time © 2013 Sqrrl | All Rights Reserved | Proprietary and Confidential Query-Driven Real-Time  reduce ingest to query latency
  4. 4. Data-Driven + Query-Driven Real-Time Ecosystem Actions 3 SPE 4 Data 1 Dashboards 2 5 NoSQL+ 6 1. 2. 3. 4. 5. 6. Interactive Analysis Tools (Discovery + Forensics) SPE queries NoSQL to enrich streaming data SPE persists results in NoSQL for future query SPE takes action automatically SPE issues data-driven alerts Sqrrl provides context for dashboards Analysis tools query use Sqrrl to search and manipulate historical data © 2013 Sqrrl | All Rights Reserved | Proprietary and Confidential
  5. 5. This talk focuses on the database. Actions 3 SPE 4 Data 1 Dashboards 2 5 NoSQL+ 6 Interactive Analysis Tools (Discovery + Forensics) 1. 2. 3. 4. 5. 6. SPE queries NoSQL to enrich streaming data SPE persists results in NoSQL for future query SPE takes action automatically SPE issues data-driven alerts Sqrrl provides context for dashboards Analysis tools query use Sqrrl to search and manipulate historical data © 2013 Sqrrl | All Rights Reserved | Proprietary and Confidential 5
  6. 6. OUTLINE Two Halves of “Real-Time” Accumulo and Sqrrl Technology Data-Centric Security Table Designs Performance Benchmarks © 2013 Sqrrl | All Rights Reserved | Proprietary and Confidential
  7. 7. ACCUMULO DATA FORMAT An Accumulo key is a 5-tuple, consisting of: - Row: Controls Atomicity - Column Family: Controls Locality - Column Qualifier: Controls Uniqueness - Visibility Label: Controls Access - Timestamp: Controls Versioning Accumulo Key/Value Example © 2013 Sqrrl | All Rights Reserved | Proprietary and Confidential 7
  8. 8. ACCUMULO TABLETS Well-Known Location (zookeeper) Collections of KV pairs form Tables Tables are partitioned into Tablets Metadata tablets hold info about other tablets, forming a 3-level hierarchy A Tablet is a unit of work for a Tablet Server Root Tablet -∞ to ∞ Metadata Tablet 1 Metadata Tablet 2 -∞ to “Encyclopedia:Ocelot” “Encyclopedia:Ocelot” to ∞ Table: Adam’s Table Data Tablet -∞ : thing Data Tablet thing : ∞ Table: Encyclopedia Data Tablet -∞ : Ocelot © 2013 Sqrrl | All Rights Reserved | Proprietary and Confidential Data Tablet Ocelot : Yak Data Tablet Yak : ∞ Table: Foo Data Tablet -∞ to ∞ 8
  9. 9. ACCUMULO PROCESSES Zookeeper Tablet Server Zookeeper Zookeeper Tablet Read/Write Delegate Authority Assign/Balance Tablet Server Master Application Application Tablet Store/Replicate Tablet Server HDFS © 2013 Sqrrl | All Rights Reserved | Proprietary and Confidential Application Tablet 9
  10. 10. TABLET DATA FLOW Tablet Writes In-Memory Map Iterator Tree Iterator Tree Reads Minor Compaction Sorted, Indexed File Write Ahead Log (For Recovery) Scan Merging / Major Compaction © 2013 Sqrrl | All Rights Reserved | Proprietary and Confidential Sorted, Indexed File Iterator Tree Sorted, Indexed File 10
  11. 11. WORD COUNT: Summing Aggregating Iterator Input Corpus © 2013 Sqrrl | All Rights Reserved | Proprietary and Confidential 11
  12. 12. ITERATOR FRAMEWORK Iterator Operations: - File Reads - Block Caching - Merging - Deletion - Isolation - Locality Groups - Range Selection - Column Selection - Cell-level Security - Versioning - Filtering - Aggregation - Partitioned Joins © 2013 Sqrrl | All Rights Reserved | Proprietary and Confidential 12
  13. 13. ACCUMULO LATENCIES ~ms ~ms Ingesters Tablet Servers InInInMemory Memory Memory Map Map Map Batch Writer ms - min Input ~ms Scan Scan Scan Iterators Iterators Iterators Queriers Scanner/ Batch Scanner Output Compactio Compactio n Compaction n Iterators Iterators Iterators RFile RFile RFiles © 2013 Sqrrl | All Rights Reserved | Proprietary and Confidential 13
  14. 14. ACCUMULO THROUGHPUT Scan: up to 1M entries/s per node Ingest: up to 500K entries/s per node ~ms ~ms Ingesters Tablet Servers InInInMemory Memory Memory Map Map Map Batch Writer ms - min Input ~ms Scan Scan Scan Iterators Iterators Iterators Queriers Scanner /Batch Scanner Output Compacti Compacti on Compaction on Iterators Iterators Iterators RFile RFile RFiles Read-Modify-Write Latency: ~ms  >1K entries/s challenging with R-M-W © 2013 Sqrrl | All Rights Reserved | Proprietary and Confidential 14
  15. 15. SQRRL ENTERPRISE Bulk Processing Integration Exploratory / Operational Apps Built on Apache Accumulo Graph + Document I/O Sqrrl API over Apache Thrift RPC (JSON, Graph, Aggregation, Search, etc.) • • • • • Sqrrl proprietary Automated indexing Custom iterators Lucene integration Security extensions Sqrrl Server Accumulo RPC (Sorted Key/Value I/O) • Open source (including Sqrrl contributions) Hadoop RPC (File I/O) • Open source or commercial distributions © 2013 Sqrrl | All Rights Reserved | Proprietary and Confidential 15
  16. 16. OUTLINE Two Halves of “Real-Time” Accumulo and Sqrrl Technology Data-Centric Security Table Designs Performance Benchmarks © 2013 Sqrrl | All Rights Reserved | Proprietary and Confidential 16
  17. 17. DATA-CENTRIC SECURITY Definition: Data carries with it information that is required to make policy decisions on its releasability. User 1 Sqrrl/ Accumul o © 2013 Sqrrl | All Rights Reserved | Proprietary and Confidential User 2 17
  18. 18. SECURITY Example Accumulo Key/Value Pairs Accumulo is the only NoSQL database with cell-level access controls © 2013 Sqrrl | All Rights Reserved | Proprietary and Confidential 18
  19. 19. DATA-CENTRIC SECURITY ECOSYSTEM Key Mgmt Audits End Users Labeler Sqrrl Enterprise Policies Data Policy Engine Apps Auth. Service User Attributes © 2013 Sqrrl | All Rights Reserved | Proprietary and Confidential 19
  20. 20. OUTLINE Two Halves of “Real-Time” Accumulo and Sqrrl Technology Data-Centric Security Table Designs Performance Benchmarks © 2013 Sqrrl | All Rights Reserved | Proprietary and Confidential 20
  21. 21. HIERARCHICAL DECOMPOSITION <person> Row: Column Family: Column Qualifier: Value: attribute purchases returns age discount sneakers hat <age> <rate> <cost> <cost> © 2013 Sqrrl | All Rights Reserved | Proprietary and Confidential 21
  22. 22. MATERIALIZED TABLE Key/Value Pair Row: bill Column Family: george attribute purchases attribute purchases Column Qualifier: age discount sneakers age sneakers Value: 49 40% $100 27 $83 © 2013 Sqrrl | All Rights Reserved | Proprietary and Confidential 22
  23. 23. FORWARD AND INVERTED INDEX Table: Forward Index Inverted Index Row: <UUID> <Term> Column Family: <Type> <UUID> Column Qualifier: <Field> <Type+Field> <Term> <Digest of Event> Value: © 2013 Sqrrl | All Rights Reserved | Proprietary and Confidential 23
  24. 24. FORWARD AND INVERTED INDEX © 2013 Sqrrl | All Rights Reserved | Proprietary and Confidential 24
  25. 25. CUSTOM INDEXING Table: Geo Index Latitude Longitude 10110101001 00111010010 11010110110 <GeoHash> Row: 101001110111010101011100001011100 Column Family: <Event Type> Column Qualifier: Value: Depth <UUID> <Digest of Event> © 2013 Sqrrl | All Rights Reserved | Proprietary and Confidential 25
  26. 26. D4M 2.0 SCHEMA FOR TWITTER DATA Table: Tedge TedgeT Row: <UUID> <value> Column Family: “stat” “time” “user” “word ” “stat” “time” “user” “word ” Column Qualifier: <stat> <time> <user > <word > <UUID > <UUID > <UUID > <UUID > “1” “1” “1” “1” “1” “1” “1” “1” Value: © 2013 Sqrrl | All Rights Reserved | Proprietary and Confidential 26
  27. 27. D4M 2.0 SCHEMA FOR TWITTER DATA Table: TedgeDegT Ttext Row: <value> <UUID> Column Family: “stat” “time” “user” “word ” “text” Column Qualifier: “degre e” “degre e” “degre e” “degre e” - Value: <count> <count> <count> © 2013 Sqrrl | All Rights Reserved | Proprietary and Confidential <count> <text> 27
  28. 28. D4M 2.0 SCHEMA FOR TWITTER DATA Source: D4M 2.0 Schema: A General Purpose High Performance Schema for the Accumulo Database, Kepner et. al., HPEC 2013 © 2013 Sqrrl | All Rights Reserved | Proprietary and Confidential 28
  29. 29. OUTLINE Two Halves of “Real-Time” Accumulo and Sqrrl Technology Data-Centric Security Table Designs Performance Benchmarks © 2013 Sqrrl | All Rights Reserved | Proprietary and Confidential 29
  30. 30. ACCUMULO WITH D4M 2.0 SCHEMA PERFORMANCE Maximizing throughput on an 8-node, 192-core cluster: Source: D4M 2.0 Schema: A General Purpose High Performance Schema for the Accumulo Database, Kepner et. al., HPEC 2013 © 2013 Sqrrl | All Rights Reserved | Proprietary and Confidential 30
  31. 31. ACCUMULO SCALABILITY: GRAPH500 BENCHMARK source: http://www.pdl.cmu.edu/SDI/2013/slides/big_graph_nsa_rd_2013_56002v1.pdf © 2013 Sqrrl | All Rights Reserved | Proprietary and Confidential 31
  32. 32. ATOMIC INCREMENT PERFORMANCE COMPARISON Read/Modify/Write (HBase) vs. Iterators/Combiners (Accumulo) © 2013 Sqrrl | All Rights Reserved | Proprietary and Confidential 32
  33. 33. QUESTIONS? Adam Fuchs, CTO Sqrrl Data, Inc. © 2013 Sqrrl | All Rights Reserved | Proprietary and Confidential 33

×