Adam Fuchs' Accumulo Talk at NoSQL Now! 2013


Adam Fuchs provides an overview of Accumulo and Sqrrl Enterprise at the 2013 NoSQL Now! conference.

Published in: Technology, Business

  1. SQRRL ENTERPRISE + APACHE ACCUMULO: A secure, scalable, real-time analysis framework
     Securely explore your data
     Adam Fuchs, CTO, Sqrrl Data, Inc.
     August 21, 2013
  2. OUTLINE
     - Two Halves of “Real-Time”
     - Accumulo and Sqrrl Technology
     - Data-Centric Security
     - Table Designs
     - Performance Benchmarks
     © 2013 Sqrrl | All Rights Reserved | Proprietary and Confidential
  3. TWO HALVES OF REAL-TIME
     - Data-Driven Real-Time → reduce event to reaction time
     - Query-Driven Real-Time → reduce ingest to query latency
  4. Data-Driven + Query-Driven Real-Time Ecosystem
     [Diagram] Components: Data, SPE, Actions, Dashboards, NoSQL+, Interactive Analysis Tools (Discovery + Forensics). Numbered flows:
     1. SPE queries NoSQL to enrich streaming data
     2. SPE persists results in NoSQL for future query
     3. SPE takes action automatically
     4. SPE issues data-driven alerts
     5. Sqrrl provides context for dashboards
     6. Analysis tools use Sqrrl to search and manipulate historical data
  5. This talk focuses on the database.
     [Diagram] Components: Data, SPE, Actions, Dashboards, NoSQL+, Interactive Analysis Tools (Discovery + Forensics). Numbered flows:
     1. SPE queries NoSQL to enrich streaming data
     2. SPE persists results in NoSQL for future query
     3. SPE takes action automatically
     4. SPE issues data-driven alerts
     5. Sqrrl provides context for dashboards
     6. Analysis tools use Sqrrl to search and manipulate historical data
  6. OUTLINE
     - Two Halves of “Real-Time”
     - Accumulo and Sqrrl Technology
     - Data-Centric Security
     - Table Designs
     - Performance Benchmarks
  7. ACCUMULO DATA FORMAT
     An Accumulo key is a 5-tuple, consisting of:
     - Row: Controls Atomicity
     - Column Family: Controls Locality
     - Column Qualifier: Controls Uniqueness
     - Visibility Label: Controls Access
     - Timestamp: Controls Versioning
     [Figure: Accumulo Key/Value Example]
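The 5-tuple key on this slide can be sketched in a few lines of Python. This is an illustrative model only, not the Accumulo API: the `Key` class, the `sort_key` helper, the visibility labels "public" and "private", and the older value "48" are all made up for the example. It assumes the usual Accumulo ordering, ascending on the first four fields and newest timestamp first.

```python
from typing import NamedTuple


class Key(NamedTuple):
    """Illustrative stand-in for the 5-tuple key (not the real Accumulo class)."""
    row: str
    family: str
    qualifier: str
    visibility: str
    timestamp: int


def sort_key(k: Key):
    # Ascending on row, family, qualifier, visibility; descending on timestamp,
    # so the newest version of a cell sorts first.
    return (k.row, k.family, k.qualifier, k.visibility, -k.timestamp)


# Hypothetical entries: two versions of one cell plus a second column.
entries = {
    Key("bill", "attribute", "age", "public", 1): "48",
    Key("bill", "attribute", "age", "public", 2): "49",
    Key("bill", "purchases", "sneakers", "private", 1): "$100",
}

ordered = sorted(entries, key=sort_key)
for k in ordered:
    print(k, "->", entries[k])
```

Because the timestamp sorts in reverse, a reader that wants only the current value of a cell can stop at the first entry it sees for that row and column.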
  8. ACCUMULO TABLETS
     - Collections of KV pairs form Tables
     - Tables are partitioned into Tablets
     - Metadata tablets hold info about other tablets, forming a 3-level hierarchy
     - A Tablet is a unit of work for a Tablet Server
     [Diagram: tablet hierarchy]
     Well-Known Location (zookeeper)
       Root Tablet: -∞ to ∞
         Metadata Tablet 1: -∞ to “Encyclopedia:Ocelot”
         Metadata Tablet 2: “Encyclopedia:Ocelot” to ∞
           Table: Adam’s Table
             Data Tablet: -∞ : thing
             Data Tablet: thing : ∞
           Table: Encyclopedia
             Data Tablet: -∞ : Ocelot
             Data Tablet: Ocelot : Yak
             Data Tablet: Yak : ∞
           Table: Foo
             Data Tablet: -∞ to ∞
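Locating the tablet that holds a given row amounts to a binary search over the table's split points. The sketch below uses the Encyclopedia table from the slide. `tablet_for` is a hypothetical helper, not an Accumulo call, and it assumes each tablet covers the range (previous end row, end row], with the end row inclusive.

```python
from bisect import bisect_left


def tablet_for(row, split_points):
    """Return (prev_end_row, end_row) for the tablet containing `row`.

    None stands for -infinity or +infinity. Assumes each tablet covers
    (prev_end_row, end_row], i.e. the end row belongs to the tablet.
    """
    i = bisect_left(split_points, row)
    prev_end = split_points[i - 1] if i > 0 else None
    end = split_points[i] if i < len(split_points) else None
    return prev_end, end


# Split points of the Encyclopedia table shown on the slide.
encyclopedia_splits = ["Ocelot", "Yak"]

for row in ["Aardvark", "Ocelot", "Panda", "Zebra"]:
    print(row, "->", tablet_for(row, encyclopedia_splits))
```

The same lookup applies one level up: a metadata tablet is itself a sorted range, so finding a data tablet is two such searches starting from the root tablet.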
  9. ACCUMULO PROCESSES
     [Diagram] Components: Zookeeper, Master, Tablet Servers (each hosting Tablets), Applications, HDFS
     Edge labels: Read/Write, Delegate Authority, Assign/Balance, Store/Replicate
  10. TABLET DATA FLOW
      [Diagram] Labels, within a Tablet:
      - Writes → In-Memory Map, Write Ahead Log (For Recovery)
      - Minor Compaction: In-Memory Map → Sorted, Indexed File
      - Merging / Major Compaction: Sorted, Indexed Files → Sorted, Indexed File
      - Reads / Scan: through Iterator Trees
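The data flow on this slide can be approximated with a toy model. The `Tablet` class below is a teaching sketch under simplifying assumptions, not Accumulo code: files are plain sorted lists, the write-ahead log is never truncated, and every key is written only once, so no versioning or deletion logic is needed.

```python
import heapq


class Tablet:
    """Toy model of a tablet: in-memory map, sorted files, write-ahead log."""

    def __init__(self, flush_threshold=2):
        self.memory_map = {}
        self.files = []            # each file is a sorted list of (key, value)
        self.write_ahead_log = []  # kept for recovery; truncation is omitted
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.write_ahead_log.append((key, value))
        self.memory_map[key] = value
        if len(self.memory_map) >= self.flush_threshold:
            self.minor_compact()

    def minor_compact(self):
        # Flush the in-memory map to a new sorted file.
        self.files.append(sorted(self.memory_map.items()))
        self.memory_map = {}

    def major_compact(self):
        # Merge all sorted files into a single sorted file.
        self.files = [list(heapq.merge(*self.files))]

    def scan(self):
        # A read merges the in-memory map with every file, in key order.
        sources = [sorted(self.memory_map.items())] + self.files
        return list(heapq.merge(*sources))


tablet = Tablet(flush_threshold=2)
for key, value in [("c", "3"), ("a", "1"), ("b", "2")]:
    tablet.write(key, value)
print(tablet.scan())
```

Since every source is already sorted, a scan is a streaming merge and never needs to re-sort the data, which is what lets iterators sit on top of the merged stream.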
  11. WORD COUNT: Summing Aggregating Iterator
      [Figure: Input Corpus]
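A summing iterator can be pictured as a pass over sorted entries that collapses all values sharing a key into their sum. The sketch below is a plain-Python analogy, not the Accumulo iterator interface; `ingest`, `summing_combiner`, and the three-line corpus are made up for illustration.

```python
from itertools import groupby


def ingest(corpus):
    # Each word occurrence becomes a (word, 1) entry; sorting mimics the
    # sorted order in which a tablet stores its keys.
    return sorted((word, 1) for line in corpus for word in line.split())


def summing_combiner(sorted_entries):
    # Collapse adjacent entries with the same key into one summed entry.
    for key, group in groupby(sorted_entries, key=lambda entry: entry[0]):
        yield key, sum(value for _, value in group)


corpus = ["the quick fox", "the lazy dog", "the fox"]
counts = dict(summing_combiner(ingest(corpus)))
print(counts)
```

The writer only ever emits "1" for a word and never reads the current count; the aggregation happens when entries are scanned or compacted.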
  12. ITERATOR FRAMEWORK
      Iterator Operations:
      - File Reads
      - Block Caching
      - Merging
      - Deletion
      - Isolation
      - Locality Groups
      - Range Selection
      - Column Selection
      - Cell-level Security
      - Versioning
      - Filtering
      - Aggregation
      - Partitioned Joins
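Several of the operations listed above compose naturally as a stack, each stage consuming the sorted stream produced by the one below it. The sketch models two of them, versioning and column selection, as Python generators. The function names and sample entries are hypothetical, and the real framework's interface differs.

```python
def versioning(entries, max_versions=1):
    """Keep at most `max_versions` entries per (row, column).

    Expects entries sorted by row and column ascending, timestamp descending.
    """
    last, seen = None, 0
    for row, column, timestamp, value in entries:
        if (row, column) != last:
            last, seen = (row, column), 0
        seen += 1
        if seen <= max_versions:
            yield row, column, timestamp, value


def column_filter(entries, wanted_columns):
    """Pass through only entries whose column is in `wanted_columns`."""
    return (entry for entry in entries if entry[1] in wanted_columns)


# Hypothetical (row, column, timestamp, value) entries in tablet sort order.
entries = sorted(
    [
        ("bill", "age", 2, "49"),
        ("bill", "age", 1, "48"),
        ("bill", "discount", 1, "40%"),
        ("george", "age", 1, "27"),
    ],
    key=lambda e: (e[0], e[1], -e[2]),
)

# Stack the iterators: versioning first, then column selection on top.
result = list(column_filter(versioning(entries), {"age"}))
print(result)
```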
  13. ACCUMULO LATENCIES
      [Diagram] Input → Ingesters (Batch Writer) → Tablet Servers (In-Memory Map, Compaction Iterators, RFiles, Scan Iterators) → Queriers (Scanner / Batch Scanner) → Output
      Latency labels on the diagram: ~ms (three places), ms - min (one place)
  14. ACCUMULO THROUGHPUT
      - Scan: up to 1M entries/s per node
      - Ingest: up to 500K entries/s per node
      - Read-Modify-Write Latency: ~ms → >1K entries/s challenging with R-M-W
      [Diagram: same ingest/scan pipeline and latency labels as slide 13]
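The read-modify-write remark follows from simple arithmetic: if each update must wait for a round trip of about a millisecond before the next one can start, a serialized stream of updates tops out near a thousand per second. The sketch below works through that reasoning. The 1 ms figure is an assumed value for the slide's "~ms", and the model ignores batching and parallel clients.

```python
def max_serial_ops_per_second(round_trip_ms):
    """Upper bound on strictly sequential operations for a given round trip."""
    return 1000 / round_trip_ms


rmw_round_trip_ms = 1               # assumed value for the slide's "~ms"
ingest_rate_per_node = 500_000      # "up to 500K entries/s per node" from the slide

rmw_rate = max_serial_ops_per_second(rmw_round_trip_ms)
print("serialized R-M-W bound:", rmw_rate, "entries/s")
print("blind-write ingest is up to", ingest_rate_per_node / rmw_rate, "times higher")
```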
  15. SQRRL ENTERPRISE
      [Architecture diagram, top to bottom]
      - Exploratory / Operational Apps; Bulk Processing Integration
      - Sqrrl API over Apache Thrift RPC (JSON, Graph, Aggregation, Search, etc.): Graph + Document I/O
      - Sqrrl Server, built on Apache Accumulo:
        - Sqrrl proprietary
        - Automated indexing
        - Custom iterators
        - Lucene integration
        - Security extensions
      - Accumulo RPC (Sorted Key/Value I/O): open source (including Sqrrl contributions)
      - Hadoop RPC (File I/O): open source or commercial distributions
  16. OUTLINE
      - Two Halves of “Real-Time”
      - Accumulo and Sqrrl Technology
      - Data-Centric Security
      - Table Designs
      - Performance Benchmarks
  17. DATA-CENTRIC SECURITY
      Definition: Data carries with it information that is required to make policy decisions on its releasability.
      [Diagram: User 1 and User 2 accessing Sqrrl/Accumulo]
  18. SECURITY
      [Figure: Example Accumulo Key/Value Pairs]
      Accumulo is the only NoSQL database with cell-level access controls
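Cell-level access control means every key carries a visibility label, and an entry is returned only if the reader's authorizations satisfy that label. The sketch below evaluates a simplified label grammar (terms joined by `&` or `|`, with parentheses for grouping). It is an illustration, not Accumulo's parser: the `visible` function and the labels "admin", "audit", and "ops" are hypothetical, operators are applied left to right, and malformed labels are not validated.

```python
import re


def visible(expression, authorizations):
    """Return True if `authorizations` satisfy the visibility `expression`.

    Simplified grammar: terms joined by & or |, parentheses for grouping.
    An empty expression is treated as visible to everyone.
    """
    tokens = re.findall(r"[A-Za-z0-9_.:/-]+|[&|()]", expression)
    position = 0

    def parse_expression():
        nonlocal position
        result = parse_term()
        while position < len(tokens) and tokens[position] in ("&", "|"):
            operator = tokens[position]
            position += 1
            right = parse_term()
            result = (result and right) if operator == "&" else (result or right)
        return result

    def parse_term():
        nonlocal position
        token = tokens[position]
        position += 1
        if token == "(":
            value = parse_expression()
            position += 1  # skip the closing parenthesis
            return value
        return token in authorizations

    return parse_expression() if tokens else True


print(visible("admin&(audit|ops)", {"admin", "ops"}))
print(visible("admin&audit", {"admin"}))
```

Because the label travels with each key/value pair, two users issuing the same scan can receive different result sets without any change to the query.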
  19. DATA-CENTRIC SECURITY ECOSYSTEM
      [Diagram] Components: Key Mgmt, Audits, End Users, Labeler, Sqrrl Enterprise, Policies, Data, Policy Engine, Apps, Auth. Service, User Attributes
  20. OUTLINE
      - Two Halves of “Real-Time”
      - Accumulo and Sqrrl Technology
      - Data-Centric Security
      - Table Designs
      - Performance Benchmarks
  21. HIERARCHICAL DECOMPOSITION
      Row        Column Family   Column Qualifier   Value
      <person>   attribute       age                <age>
      <person>   attribute       discount           <rate>
      <person>   purchases       sneakers           <cost>
      <person>   returns         hat                <cost>
  22. MATERIALIZED TABLE
      Key/Value Pair
      Row      Column Family   Column Qualifier   Value
      bill     attribute       age                49
      bill     attribute       discount           40%
      bill     purchases       sneakers           $100
      george   attribute       age                27
      george   purchases       sneakers           $83
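The step from the hierarchical decomposition to the materialized table can be sketched as flattening a nested structure into sorted key/value pairs. The values below are the ones on the slide. The `materialize` helper and the nested-dictionary input shape are assumptions made for the example.

```python
def materialize(people):
    """Flatten person -> column family -> column qualifier -> value
    into sorted ((row, family, qualifier), value) pairs."""
    pairs = []
    for person, families in people.items():
        for family, columns in families.items():
            for qualifier, value in columns.items():
                pairs.append(((person, family, qualifier), value))
    return sorted(pairs)


people = {
    "george": {"attribute": {"age": "27"}, "purchases": {"sneakers": "$83"}},
    "bill": {
        "attribute": {"age": "49", "discount": "40%"},
        "purchases": {"sneakers": "$100"},
    },
}

table = materialize(people)
for key, value in table:
    print(key, "->", value)
```

Sorting places all of a person's entries next to each other, so one range scan over a row returns the whole record.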
  23. FORWARD AND INVERTED INDEX
      Table              Forward Index   Inverted Index
      Row                <UUID>          <Term>
      Column Family      <Type>          <UUID>
      Column Qualifier   <Field>         <Type+Field>
      Value              <Term>          <Digest of Event>
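The two layouts can be sketched by generating both sets of entries from a single event. The column assignment below follows one reading of the flattened slide (forward: UUID / Type / Field → Term; inverted: Term / UUID / Type+Field → Digest) and should be treated as an assumption. `index_event`, `lookup`, the "+" separator, and the sample event are hypothetical.

```python
def index_event(uuid, event_type, fields, digest):
    """Return (forward, inverted) entries as (row, family, qualifier, value)."""
    forward, inverted = [], []
    for field, term in fields.items():
        # Forward index: look up an event's fields by its UUID.
        forward.append((uuid, event_type, field, term))
        # Inverted index: look up the events containing a term.
        inverted.append((term, uuid, f"{event_type}+{field}", digest))
    return forward, inverted


def lookup(inverted, term):
    """UUIDs of events that contain `term`, read from the inverted index."""
    return sorted({family for row, family, _, _ in inverted if row == term})


forward, inverted = index_event(
    uuid="u1",
    event_type="purchase",
    fields={"item": "sneakers", "buyer": "bill"},
    digest="bill bought sneakers",
)
print(lookup(inverted, "sneakers"))
```

Storing a digest in the inverted index lets a search show a summary of each hit without a second lookup in the forward index.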
  24. FORWARD AND INVERTED INDEX
  25. CUSTOM INDEXING
      <GeoHash> construction (bits of the three dimensions interleaved):
        Latitude:  10110101001
        Longitude: 00111010010
        Depth:     11010110110
        <GeoHash>: 101001110111010101011100001011100
      Table: Geo Index
        Row: <GeoHash>
        Column Family: <Event Type>
        Column Qualifier: <UUID>
        Value: <Digest of Event>
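The row value on this slide is consistent with interleaving the three 11-bit strings one bit at a time, taking a bit from the first string, then the second, then the third. The sketch below reproduces that row from the slide's bit strings. `interleave` is a hypothetical helper, and reading the third string as the Depth dimension is an inference from the slide's labels.

```python
def interleave(*bit_strings):
    """Interleave equal-length bit strings, one bit from each in turn."""
    if len({len(bits) for bits in bit_strings}) != 1:
        raise ValueError("all bit strings must have the same length")
    return "".join("".join(column) for column in zip(*bit_strings))


latitude = "10110101001"
longitude = "00111010010"
depth = "11010110110"

row = interleave(latitude, longitude, depth)
print(row)
```

Interleaving keeps points that are close in all dimensions close in sort order, so a range scan over a row prefix corresponds to a bounded region of space.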
  26. D4M 2.0 SCHEMA FOR TWITTER DATA
      Table              Tedge                             TedgeT
      Row                <UUID>                            <value>
      Column Family      “stat”, “time”, “user”, “word”    “stat”, “time”, “user”, “word”
      Column Qualifier   <stat>, <time>, <user>, <word>    <UUID>
      Value              “1”                               “1”
  27. D4M 2.0 SCHEMA FOR TWITTER DATA
      Table              TedgeDegT                         Ttext
      Row                <value>                           <UUID>
      Column Family      “stat”, “time”, “user”, “word”    “text”
      Column Qualifier   “degree”                          -
      Value              <count>                           <text>
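The four tables on these two slides can be sketched by generating their entries from a handful of tweets. This follows the layout as shown on the slides rather than the D4M library itself. `build_tables` and the sample tweets are hypothetical, and the degree counts are summed in memory here, whereas a deployment would typically leave that to a summing iterator.

```python
from collections import Counter


def build_tables(tweets):
    """Return (Tedge, TedgeT, TedgeDegT, Ttext) entries for the given tweets.

    Each tweet is (uuid, {column family: [values]}, text). Entries are
    (row, column family, column qualifier, value) tuples, except TedgeDegT,
    which is a Counter keyed by (row, column family, "degree").
    """
    tedge, tedge_t, ttext = [], [], []
    tedge_deg_t = Counter()
    for uuid, fields, text in tweets:
        for family, values in fields.items():
            for value in values:
                tedge.append((uuid, family, value, "1"))        # tweet -> value
                tedge_t.append((value, family, uuid, "1"))      # value -> tweet
                tedge_deg_t[(value, family, "degree")] += 1     # value frequency
        ttext.append((uuid, "text", "", text))                  # raw tweet text
    return tedge, tedge_t, tedge_deg_t, ttext


tweets = [
    ("t1", {"user": ["alice"], "word": ["accumulo", "rocks"]}, "accumulo rocks"),
    ("t2", {"user": ["bob"], "word": ["accumulo"]}, "accumulo"),
]

tedge, tedge_t, tedge_deg_t, ttext = build_tables(tweets)
print(tedge_deg_t[("accumulo", "word", "degree")])
```

The degree table lets a query planner check how common a value is before scanning the transpose table for it.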
  28. D4M 2.0 SCHEMA FOR TWITTER DATA
      Source: D4M 2.0 Schema: A General Purpose High Performance Schema for the Accumulo Database, Kepner et al., HPEC 2013
  29. OUTLINE
      - Two Halves of “Real-Time”
      - Accumulo and Sqrrl Technology
      - Data-Centric Security
      - Table Designs
      - Performance Benchmarks
  30. ACCUMULO WITH D4M 2.0 SCHEMA PERFORMANCE
      Maximizing throughput on an 8-node, 192-core cluster:
      Source: D4M 2.0 Schema: A General Purpose High Performance Schema for the Accumulo Database, Kepner et al., HPEC 2013
  31. ACCUMULO SCALABILITY: GRAPH500 BENCHMARK
      Source: http://www.pdl.cmu.edu/SDI/2013/slides/big_graph_nsa_rd_2013_56002v1.pdf
  32. ATOMIC INCREMENT PERFORMANCE COMPARISON
      Read/Modify/Write (HBase) vs. Iterators/Combiners (Accumulo)
  33. QUESTIONS?
      Adam Fuchs, CTO
      Sqrrl Data, Inc.