D4M and Apache Accumulo
Vijay Gadepally, Lauren Edwards,
Dylan Hutchison, Jeremy Kepner
Accumulo Summit
College Park, MD
April 29, 2015
This work is sponsored by the Assistant Secretary of Defense for Research
and Engineering under Air Force Contract #FA8721-05-C-0002. Opinions,
interpretations, recommendations and conclusions are those of the authors
and are not necessarily endorsed by the United States Government
Giving away the punch line
• D4M is a popular open source software tool that
connects scientists with Big Data technologies
• D4M-Accumulo binding provides high performance
connectivity to Apache Accumulo for quick analytic
prototyping
• Graphulo: Implement GraphBLAS server-side
iterators and operators on Accumulo tables
Outline
• Introduction
• D4M Overview
• D4M Details
• Demonstration
• Conclusions
Common Big Data Challenge
[Figure: users (commanders, operators, analysts) and data sources (maritime, ground, space, C2, cyber, OSINT, HTML, air, HUMINT, weather) shown on a timeline from 2000 to 2015 and beyond, with a gap between data and users.]
Rapidly increasing
- Data volume
- Data velocity
- Data variety
- Data veracity (security)
Common Big Data Architecture
[Figure: architecture diagram. Users: warfighters, operators, analysts. Data sources: maritime, ground, space, C2, cyber, OSINT, HTML, air, HUMINT, weather. Layers: analytics (A to E), computing (web, files, scheduler), ingest and enrichment, databases.]
Common Big Data Architecture
- Data Volume: Cloud Computing -
MIT SuperCloud merges four clouds: Enterprise Cloud, Big Data Cloud, Database Cloud, and Compute Cloud.
Common Big Data Architecture
- Data Velocity: Accumulo Database -
Lincoln benchmarking validated Accumulo performance
Common Big Data Architecture
- Data Variety: D4M Schema -
D4M demonstrated a universal approach to diverse data
[Figure labels: rows, columns, Σ, raw]
Intel reports, DNA, health records, publication citations, web logs, social media, building alarms, cyber, … are all handled by a common 4 table schema
Common Big Data Architecture
- Data Veracity: Security Tools -
Using cryptography to protect sensitive data
- Verifiable Query Results -
- Computing on Masked Data (CMD) -
[Figure: a plaintext query is encrypted into a masked query for the Big Data Cloud; the masked analytic result is decrypted into a plaintext analytic result.]
High Level Language: D4M
http://d4m.mit.edu
[Figure labels: Accumulo distributed database; query for Alice, Bob, Cathy, David, Earl; D4M (Dynamic Distributed Dimensional Data Model) associative arrays; numerical computing environment; result graph with nodes A to E.]
A D4M query returns a sparse matrix or a graph for statistical signal processing or graph analysis in MATLAB.
D4M binds associative arrays to databases, enabling rapid prototyping of data-intensive cloud analytics and visualization.
What is D4M?
• The Dynamic Distributed Dimensional Data Model:
– Support for a mathematical foundation: associative arrays
– A schema to represent most unstructured data as associative arrays
– Software tools to connect with a variety of databases such as Apache Accumulo, SciDB, MySQL, PostgreSQL, …
• Software tools currently implemented in MATLAB/Octave and Julia (v1)
• Connect to databases via JDBC (relational), SHIM (SciDB), or a custom Java API (Accumulo)
Mathematical Foundation: Associative Arrays
• Key innovation: mathematical closure
– All associative array operations return associative arrays
• Enables composable mathematical operations
A + B    A - B    A & B    A | B    A * B
• Enables composable query operations via array indexing
A('alice bob ',:)    A('alice ',:)    A('al* ',:)
A('alice : bob ',:)    A(1:2,:)    A == 47.0
• Simple to implement in a library in programming environments with first-class support for 2D arrays, operator overloading, and sparse linear algebra
• Complex queries with ~50x less effort than Java/SQL
• Naturally leads to high-performance parallel implementation
• Need a schema to convert arbitrary data to associative arrays
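The closure property above can be illustrated with a minimal sketch in plain Python. This is not the D4M API: the class name `Assoc` is borrowed from the slides, but the methods `rows` and `equal_to` and the sample data are hypothetical, chosen only to show that every operation returns another associative array and can therefore be chained.

```python
from fnmatch import fnmatchcase


class Assoc:
    """Toy associative array: a sparse map from (row key, column key) to a number."""

    def __init__(self, triples=()):
        # Store only nonzero entries, as a sparse matrix would.
        self.data = {(r, c): v for r, c, v in triples if v != 0}

    def triples(self):
        return [(r, c, v) for (r, c), v in self.data.items()]

    def __add__(self, other):
        # Element-wise sum over the union of keys; the result is again an Assoc.
        keys = set(self.data) | set(other.data)
        return Assoc((r, c, self.data.get((r, c), 0) + other.data.get((r, c), 0))
                     for r, c in keys)

    def rows(self, pattern):
        # Row selection by exact key or glob pattern such as 'al*'.
        return Assoc(t for t in self.triples() if fnmatchcase(t[0], pattern))

    def equal_to(self, value):
        # Value filter, in the spirit of A == 47.0 on the slide.
        return Assoc(t for t in self.triples() if t[2] == value)


# Hypothetical sample data.
A = Assoc([("alice", "age", 47.0), ("bob", "age", 30.0)])
B = Assoc([("alice", "age", 1.0), ("carl", "age", 5.0)])

# Because each step returns an Assoc, operations and queries compose freely.
C = (A + B).rows("al*").equal_to(48.0)
```

Here `C` is itself an `Assoc`, so it could be fed into further sums or queries without any conversion step.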
D4M Data Schema
• A schema is a structure described in a language supported by the database management system.
• Use the D4M schema to represent heterogeneous data types in a common data format
– The schema converts structured or unstructured raw text to a tuple representation supported by Accumulo
• Usually uses a 4 table representation
– The Edge Table, the Transpose Table, the Degree Table, and the Raw Table
Raw records:
33659254179712  2013-05-20 21:21:42  20798128  kiefpief  web  3b77caf94bfc81fe  I am sending love to Oklahoma. And actually -- to everyone who may need it. You are loved. And you are not alone. Promise. #PrayforOklahoma
33660010027264  2013-05-20 21:54:56  35.99894978  -78.90660222  -8783842.7781526  4300476.86376416  22435220  RyanBLeslie  Twitter for iPad  348803787  bced47a0c99c71d0  @HaydenBigCntry RT @jiminhofe: The devastation in Oklahoma is
…
After applying the D4M schema:
(33659254179712, time|2013-05-20 21:21:42, 1)
(33659254179712, user|kiefpief, 1)
(33659254179712, text, Sending love to OK #PrayforOklahoma)
(33659254179712, word|Sending, 1)
(33660010027264, time|2013-05-20 21:54:56, 1)
(33660010027264, lat|35.99894978, 1)
(33660010027264, lon|-78.90660222, 1)
(33660010027264, user|RyanBLeslie, 1)
(33660010027264, RT|@HaydenBigCntry, 1)
(33660010027264, word|Oklahoma, 1)
…
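The conversion from a record to triples can be sketched as follows. This is a plain Python illustration, not D4M code; the function name `explode` and the choice to split the text field on whitespace are assumptions made for the example.

```python
def explode(record_id, fields, text_field="text"):
    """Turn one record into (row, column, value) triples.

    Each field becomes a column key 'name|value' with value 1. The free-text
    field is stored whole and also split into one 'word|...' column per word.
    """
    triples = []
    for name, value in fields.items():
        if name == text_field:
            triples.append((record_id, name, value))
            # dict.fromkeys drops repeated words while keeping their order.
            for word in dict.fromkeys(value.split()):
                triples.append((record_id, "word|" + word, 1))
        else:
            triples.append((record_id, name + "|" + value, 1))
    return triples


# Shortened version of the first record on the slide.
triples = explode(
    "33659254179712",
    {"time": "2013-05-20 21:21:42", "user": "kiefpief", "text": "Sending love"},
)
```

Putting the value into the column key is what lets a single sparse table hold fields of any type: a lookup on `user|kiefpief` becomes a column selection.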
4 Table D4M Schema
Source table:
row_num  col1      col2      col3
001      row1col1  row1col2  word1 word2 word3
002      row2col1  row2col2  word2 word3
003      …         …         word1 word3

Tedge:
             col1|row1col1  col1|row2col1  col2|row1col2  col2|row2col2  col3|word1  col3|word2  col3|word3
row_num|001  1                             1                             1           1           1
row_num|002                 1                             1                          1           1
row_num|003                                                              1                       1

TedgeDeg:
        col1|row1col1  col1|row2col1  col2|row1col2  col2|row2col2  col3|word1  col3|word2  col3|word3
Degree  1              1              1              1              2           2           3

TedgeT:
               row_num|001  row_num|002  row_num|003
col1|row1col1  1
col1|row2col1               1
col2|row1col2  1
col2|row2col2               1
col3|word1     1                         1
col3|word2     1            1
col3|word3     1            1            1

TedgeTxt:
             text
row_num|001  word1 word2 word3
row_num|002  word2 word3
row_num|003  word1 word3
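As a rough sketch of how the four tables relate, the following plain Python builds all of them from the source table above. It is not D4M code: the function name `build_tables` is hypothetical, and row 003 is given only its col3 value because the slide elides the other two fields.

```python
from collections import Counter


def build_tables(records, text_column="col3"):
    """Build edge, transpose, degree, and raw-text tables from simple records."""
    tedge, tedge_t, tedge_txt = {}, {}, {}
    tedge_deg = Counter()
    for row_num, fields in records.items():
        row_key = "row_num|" + row_num
        for name, value in fields.items():
            if name == text_column:
                tedge_txt[row_key] = value  # raw text is kept whole
                col_keys = [name + "|" + w for w in dict.fromkeys(value.split())]
            else:
                col_keys = [name + "|" + value]
            for col_key in col_keys:
                tedge[(row_key, col_key)] = 1    # edge table
                tedge_t[(col_key, row_key)] = 1  # transpose: fast column lookup
                tedge_deg[col_key] += 1          # degree: count per column key
    return tedge, tedge_t, dict(tedge_deg), tedge_txt


records = {
    "001": {"col1": "row1col1", "col2": "row1col2", "col3": "word1 word2 word3"},
    "002": {"col1": "row2col1", "col2": "row2col2", "col3": "word2 word3"},
    "003": {"col3": "word1 word3"},  # col1 and col2 are elided on the slide
}
tedge, tedge_t, tedge_deg, tedge_txt = build_tables(records)
```

The degree table gives the size of a column's result before any scan is issued, which is useful for ordering the terms of a multi-column query.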
D4M Software Library
• The associative array representation works very well as an interface among databases.
• D4M is currently implemented in languages with first-class support for sparse matrices:
– MATLAB
– GNU Octave
– Julia (in progress)
• Implemented in ~2000 lines of MATLAB code
Download the D4M source from d4m.mit.edu:
– d4m_api.zip: matlab_src/ and d4m_api_java.jar
– libext.zip: dependency JARs
D4M: What a user sees
Layers, from user data to storage:
(row, col, val) Matlab strings → d4m (Matlab API) → d4m_api_java (Java API) → Accumulo (Java API) → Accumulo Table
% D4M Associative Array API
row = 'r1,r2,'; col = 'c1,c1,'; val = '7,3,';
A = Assoc(row,col,val,@min);
% D4M Accumulo API
DB = DBserver('zoohost.edu:2181', 'Accumulo', ...
    'instance', 'user', 'password');
T = DB('Table');   % Create table if it doesn't exist.
put(T,A);          % Put associative array in T.
Aret = T(:,:);     % Scan all of T.
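Two conventions in the Assoc call above are easy to miss: the last character of each string is its separator, and the fourth argument (`@min`) decides what happens when two triples share the same row and column. A minimal Python sketch of both, not the D4M implementation; the function names are hypothetical and values are parsed as numbers for simplicity:

```python
def split_keys(s):
    """Split a separator-terminated list string: the final character is the separator."""
    return s[:-1].split(s[-1])


def assoc(row, col, val, collide=min):
    """Build a {(row, col): value} map, resolving duplicate keys with `collide`."""
    out = {}
    for r, c, v in zip(split_keys(row), split_keys(col), split_keys(val)):
        number = float(v)
        key = (r, c)
        out[key] = collide(out[key], number) if key in out else number
    return out


no_collision = assoc("r1,r2,", "c1,c1,", "7,3,")   # two distinct cells
collision = assoc("r1,r1,", "c1,c1,", "7,3,")      # same cell twice: min wins
```

In the slide's own example the two triples land in different rows, so the collision function is never invoked; the second call shows the case where it matters.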
D4M: What a developer sees
Type              Matlab/Julia File             Java Class            Use
Table management  DBcreate.m                    D4mDbTableOperations  Create table
                  @DBserver/ls.m                D4mDbInfo             List tables
                  @DBtable/nnz.m                D4mDbTableOperations  Number of entries in table, summed from table's tablets
                  DBdelete.m                    D4mDbTableOperations  Delete table
Write             DBinsert.m                    D4mDbInsert           Insert
Scan              @DBtable/DBtable.m            D4mDataSearch         Create query holder
                  @DBtable/subsref.m            D4mDataSearch         Do query, possibly holding batches
                  @DBtable/close.m              D4mDataSearch         Reset query
Delete            @DBtable/deleteTriple.m       AccumuloDelete        Delete entries
                  @DBtable/deleteAssoc.m        AccumuloDelete        Delete entries
Iterators         @DBtable/ColCombiner.m        D4mDbTableOperations  List table iterators
                  @DBtable/addColCombiner.m     D4mDbTableOperations  Add all-scope table iterator
                  @DBtable/deleteColCombiner.m  D4mDbTableOperations  Remove iterator
Splits            @DBtable/Splits.m             D4mDbTableOperations  Return splits, number of entries in each tablet, tablet server addresses
                  @DBtable/addSplits.m          D4mDbTableOperations  Add new table split
                  @DBtable/putSplits.m          D4mDbTableOperations  Replace table splits, merging old splits
                  @DBtable/mergeSplits.m        D4mDbTableOperations  Remove splits by merging tablets
• Source code released and available!
D4M Write
More details on Batched Insert – 500 kB by default
• putNumBytes() controls the number of entries to insert in one batch, on the MATLAB side
• Independent batches: each creates, flushes, and closes a separate BatchWriter
• Guarantees BatchWriters are correctly closed
• No need to maintain BatchWriter lifecycle in MATLAB
• 30 ms maximum latency before flushing
• 50 write threads
• 1 MB maximum memory on BatchWriter, plenty for default batch size
Mapping of an associative array entry to an Accumulo key-value pair:
Key
– Row ID: Assoc Row
– Column Family: set with putColumnFamily()
– Column Qualifier: Assoc Col
– Column Visibility: set with putSecurity()
– Timestamp
Value: Assoc Val
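The byte-bounded batching described on this slide can be sketched in plain Python. This is an illustration of the idea only, not the D4M or Accumulo code: the function name is hypothetical, the size of an entry is approximated as the UTF-8 length of its three strings, and 500 kB is taken to mean 500,000 bytes.

```python
def batches_by_bytes(triples, max_bytes=500_000):
    """Group (row, col, val) string triples into batches of at most max_bytes.

    Each yielded batch would be written independently, with its own writer
    that is created, flushed, and closed before the next batch starts.
    """
    batch, size = [], 0
    for row, col, val in triples:
        entry_bytes = len(row.encode()) + len(col.encode()) + len(val.encode())
        # Start a new batch when adding this entry would exceed the limit.
        if batch and size + entry_bytes > max_bytes:
            yield batch
            batch, size = [], 0
        batch.append((row, col, val))
        size += entry_bytes
    if batch:
        yield batch


# Ten 4-byte entries with a 10-byte limit: two entries fit per batch.
entries = [("r%d" % i, "c", "1") for i in range(10)]
batch_sizes = [len(b) for b in batches_by_bytes(entries, max_bytes=10)]
```

Keeping batches independent trades some throughput for simplicity: no writer state has to survive between calls from the numerical environment.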
D4M Scan Example
1. Translate Matlab queries into ranges for BatchScanner
T(:,:)                      % Scan all
T('r1;r5;:;r7;', :)         % Scan given row ranges
T(:, 'c1;')                 % Use fetchColumn(), or row scan on the transpose table
T('r5;:;r9;', 'c1;:;c3;')   % Complicated; break into simpler queries
2. Hold state of Scanner iterator as state of MATLAB object
T_it = Iterator(T, 'elements', 1e5);   % 100k entry batch size
A = T_it(:,:);                         % Initial query
while nnz(A)                           % While there is another batch
    handleBatch(A);
    A = T_it();                        % Get next batch
end
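The second pattern, an object that remembers where the scan stopped and returns an empty result when the data is exhausted, can be mirrored in a few lines of Python. The class name `BatchIterator` and the integer stand-in for table entries are hypothetical; this shows the control flow only, not the D4M Iterator.

```python
from itertools import islice


class BatchIterator:
    """Callable that returns the next batch on each call and [] when done."""

    def __init__(self, entries, batch_size):
        self._it = iter(entries)      # scan position lives inside the object
        self.batch_size = batch_size

    def __call__(self):
        return list(islice(self._it, self.batch_size))


# 25 stand-in entries read in batches of 10.
t_it = BatchIterator(range(25), 10)
sizes = []
batch = t_it()             # initial query
while batch:               # while there is another batch
    sizes.append(len(batch))
    batch = t_it()         # get next batch
```

As in the MATLAB loop, the caller never sees the underlying scanner, only a sequence of bounded results that fit in memory.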
Parallel Accumulo Access
Sample script writing files to Accumulo in parallel:
T = DB('Tedge','TedgeT');
myFiles = global_ind(zeros(Nfile,1,map([Np 1],{},0:Np-1)));
for i = myFiles
    fname = ['data/' num2str(i)];   % Create filename.
    load([fname '.A.mat']);         % Load file data.
    put(T,num2str(A));              % Insert to Accumulo.
end
Run on 4 local processors:
eval(pRUN('Script',4,{}));
• D4M combined with pMatlab enables high-performance parallel ingest
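The script above gives each process its own subset of file indices. A plain Python sketch of one common way to do that, a contiguous block split, is shown below. The function name `my_block` is hypothetical and pMatlab's exact partitioning may differ; the point is only that every file index is owned by exactly one process.

```python
def my_block(n_items, n_procs, rank):
    """Return the 1-based item indices owned by process `rank` (0-based)."""
    base, extra = divmod(n_items, n_procs)
    # The first `extra` ranks take one additional item each.
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return range(start + 1, stop + 1)


# 10 files over 4 processes.
blocks = [list(my_block(10, 4, rank)) for rank in range(4)]
```

Because the subsets are disjoint, each process can load and insert its files without coordinating with the others.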
Accumulo Scaling on MIT SuperCloud
• Scales linearly with ingest processes, server nodes, and data size
[Figure: scaling plot; recoverable axis label: server nodes]
115,000,000 inserts per second
• Using supercomputing techniques allows the peak insert rate to be achieved within seconds of launch
[Figure annotations: 1M edge Graph500 graph; 43K; 43B edges in 5 minutes]
D4M Twitter Demo
• August 24, 2014: Earthquake in Northern California
• Tweets from August 24-25
• Using D4M for:
– Exploration
– Analytics
– Visualization
Demo steps (one slide each; screenshots not reproduced):
1. Set Table Bindings
2. Query Tweets
3. Find Common Locations
4. Filter Tweets
5. Query for Full Tweets
6. Load Stopwords
7. Remove Stopwords
8. Find Co-Occurring Words
9. Remove Diagonal
10. See Words Most Used Together
11. Display on Map
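The demo code itself is not preserved in this transcript. As a rough illustration of the stopword, co-occurrence, and diagonal-removal steps, here is a plain Python sketch; the function name, the sample tweets, and the stopword list are all hypothetical. Counting only pairs of distinct words gives the off-diagonal part of the word-by-word co-occurrence matrix, which is what removing the diagonal leaves behind.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(docs, stopwords=frozenset()):
    """Count, for each pair of distinct words, how many documents contain both."""
    counts = Counter()
    for text in docs:
        words = sorted({w.lower() for w in text.split()} - set(stopwords))
        # Pairs of distinct words only, so a word never pairs with itself.
        for a, b in combinations(words, 2):
            counts[(a, b)] += 1
    return counts


# Hypothetical tweets and stopwords, for illustration only.
tweets = ["earthquake in napa", "napa earthquake damage", "the napa valley"]
pairs = cooccurrence(tweets, stopwords={"in", "the"})
top_pair, top_count = pairs.most_common(1)[0]
```

Sorting the counts then gives the words most used together, the quantity shown in the tenth demo step.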
Summary
• D4M is a popular software tool that connects
scientists with Big Data technologies
• D4M-Accumulo binding provides high performance
connectivity to Apache Accumulo for quick analytic
prototyping
• Current research expands this connection to support
high performance graph analytics
Graphulo
http://graphulo.mit.edu
• Graphulo: Implement GraphBLAS server-side iterators and operators on Accumulo tables
• Use case: Queued analytics = Localized within a neighborhood
• Aim for Accumulo Contrib
• Released:
– Design Document
• Upcoming:
– Beta version of tools in late May/early June
• Future:
– Scalability
– Schemas
– More example algorithms
Contact Dylan Hutchison if you have any thoughts! dhutchis@mit.edu
Acknowledgements
• Bill Arcand
• Bill Bergeron
• David Bestor
• Chansup Byun
• Matt Hubbell
• Jeremy Kepner
• Jake Bolewski
• Pete Michaleas
• Julie Mullen
• Andy Prout
• Albert Reuther
• Tony Rosa
• Charles Yee
• Dylan Hutchison
And many more …
Thank you!
• Contact:
– Vijay Gadepally (vijayg@ll.mit.edu)
– Lauren Edwards (lauren.edwards@ll.mit.edu)
– Jeremy Kepner (kepner@ll.mit.edu)

More Related Content

What's hot

Optimizing Terascale Machine Learning Pipelines with Keystone ML
Optimizing Terascale Machine Learning Pipelines with Keystone MLOptimizing Terascale Machine Learning Pipelines with Keystone ML
Optimizing Terascale Machine Learning Pipelines with Keystone ML
Spark Summit
 
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...
Spark Summit
 
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S OptimizerDeep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Spark Summit
 
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui MengChallenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Databricks
 
Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...
Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...
Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...
Databricks
 
Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache Spark
Cloudera, Inc.
 
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...
Spark Summit
 
Enhancements on Spark SQL optimizer by Min Qiu
Enhancements on Spark SQL optimizer by Min QiuEnhancements on Spark SQL optimizer by Min Qiu
Enhancements on Spark SQL optimizer by Min Qiu
Spark Summit
 
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
Spark Summit
 
Productionizing your Streaming Jobs
Productionizing your Streaming JobsProductionizing your Streaming Jobs
Productionizing your Streaming Jobs
Databricks
 
Spark Summit EU talk by Elena Lazovik
Spark Summit EU talk by Elena LazovikSpark Summit EU talk by Elena Lazovik
Spark Summit EU talk by Elena Lazovik
Spark Summit
 
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...
DB Tsai
 
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
Spark Summit
 
Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...
Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...
Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...
Databricks
 
A Graph-Based Method For Cross-Entity Threat Detection
 A Graph-Based Method For Cross-Entity Threat Detection A Graph-Based Method For Cross-Entity Threat Detection
A Graph-Based Method For Cross-Entity Threat Detection
Jen Aman
 
Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the CloudsGreg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Flink Forward
 
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Spark Summit
 
Interactive Graph Analytics with Spark-(Daniel Darabos, Lynx Analytics)
Interactive Graph Analytics with Spark-(Daniel Darabos, Lynx Analytics)Interactive Graph Analytics with Spark-(Daniel Darabos, Lynx Analytics)
Interactive Graph Analytics with Spark-(Daniel Darabos, Lynx Analytics)
Spark Summit
 
Generalized Linear Models in Spark MLlib and SparkR
Generalized Linear Models in Spark MLlib and SparkRGeneralized Linear Models in Spark MLlib and SparkR
Generalized Linear Models in Spark MLlib and SparkR
Databricks
 
Random Walks on Large Scale Graphs with Apache Spark with Min Shen
Random Walks on Large Scale Graphs with Apache Spark with Min ShenRandom Walks on Large Scale Graphs with Apache Spark with Min Shen
Random Walks on Large Scale Graphs with Apache Spark with Min Shen
Databricks
 

What's hot (20)

Optimizing Terascale Machine Learning Pipelines with Keystone ML
Optimizing Terascale Machine Learning Pipelines with Keystone MLOptimizing Terascale Machine Learning Pipelines with Keystone ML
Optimizing Terascale Machine Learning Pipelines with Keystone ML
 
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...
 
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S OptimizerDeep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
 
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui MengChallenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
 
Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...
Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...
Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...
 
Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache Spark
 
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...
 
Enhancements on Spark SQL optimizer by Min Qiu
Enhancements on Spark SQL optimizer by Min QiuEnhancements on Spark SQL optimizer by Min Qiu
Enhancements on Spark SQL optimizer by Min Qiu
 
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
 
Productionizing your Streaming Jobs
Productionizing your Streaming JobsProductionizing your Streaming Jobs
Productionizing your Streaming Jobs
 
Spark Summit EU talk by Elena Lazovik
Spark Summit EU talk by Elena LazovikSpark Summit EU talk by Elena Lazovik
Spark Summit EU talk by Elena Lazovik
 
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...
 
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
 
Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...
Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...
Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...
 
A Graph-Based Method For Cross-Entity Threat Detection
 A Graph-Based Method For Cross-Entity Threat Detection A Graph-Based Method For Cross-Entity Threat Detection
A Graph-Based Method For Cross-Entity Threat Detection
 
Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the CloudsGreg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
 
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
 
Interactive Graph Analytics with Spark-(Daniel Darabos, Lynx Analytics)
Interactive Graph Analytics with Spark-(Daniel Darabos, Lynx Analytics)Interactive Graph Analytics with Spark-(Daniel Darabos, Lynx Analytics)
Interactive Graph Analytics with Spark-(Daniel Darabos, Lynx Analytics)
 
Generalized Linear Models in Spark MLlib and SparkR
Generalized Linear Models in Spark MLlib and SparkRGeneralized Linear Models in Spark MLlib and SparkR
Generalized Linear Models in Spark MLlib and SparkR
 
Random Walks on Large Scale Graphs with Apache Spark with Min Shen
Random Walks on Large Scale Graphs with Apache Spark with Min ShenRandom Walks on Large Scale Graphs with Apache Spark with Min Shen
Random Walks on Large Scale Graphs with Apache Spark with Min Shen
 

Viewers also liked

An Introduction to Accumulo
An Introduction to AccumuloAn Introduction to Accumulo
An Introduction to Accumulo
Donald Miner
 
Accumulo Summit 2015: Event-Driven Big Data with Accumulo - Leveraging Big Da...
Accumulo Summit 2015: Event-Driven Big Data with Accumulo - Leveraging Big Da...Accumulo Summit 2015: Event-Driven Big Data with Accumulo - Leveraging Big Da...
Accumulo Summit 2015: Event-Driven Big Data with Accumulo - Leveraging Big Da...
Accumulo Summit
 
Accumulo: A Quick Introduction
Accumulo: A Quick IntroductionAccumulo: A Quick Introduction
Accumulo: A Quick Introduction
James Salter
 
Hortonworks sqrrl webinar v5.pptx
Hortonworks sqrrl webinar v5.pptxHortonworks sqrrl webinar v5.pptx
Hortonworks sqrrl webinar v5.pptx
Hortonworks
 
Introduction to Apache Accumulo
Introduction to Apache AccumuloIntroduction to Apache Accumulo
Introduction to Apache Accumulo
Jared Winick
 
Rapid prototyping seminar
Rapid prototyping seminarRapid prototyping seminar
Rapid prototyping seminar
avwhysoserious
 

Viewers also liked (6)

An Introduction to Accumulo
An Introduction to AccumuloAn Introduction to Accumulo
An Introduction to Accumulo
 
Accumulo Summit 2015: Event-Driven Big Data with Accumulo - Leveraging Big Da...
Accumulo Summit 2015: Event-Driven Big Data with Accumulo - Leveraging Big Da...Accumulo Summit 2015: Event-Driven Big Data with Accumulo - Leveraging Big Da...
Accumulo Summit 2015: Event-Driven Big Data with Accumulo - Leveraging Big Da...
 
Accumulo: A Quick Introduction
Accumulo: A Quick IntroductionAccumulo: A Quick Introduction
Accumulo: A Quick Introduction
 
Hortonworks sqrrl webinar v5.pptx
Hortonworks sqrrl webinar v5.pptxHortonworks sqrrl webinar v5.pptx
Hortonworks sqrrl webinar v5.pptx
 
Introduction to Apache Accumulo
Introduction to Apache AccumuloIntroduction to Apache Accumulo
Introduction to Apache Accumulo
 
Rapid prototyping seminar
Rapid prototyping seminarRapid prototyping seminar
Rapid prototyping seminar
 

Similar to Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache Accumulo [Frameworks]

Spark streaming
Spark streamingSpark streaming
Spark streaming
Noam Shaish
 
Spark ml streaming
Spark ml streamingSpark ml streaming
Spark ml streaming
Adam Doyle
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFrames
Spark Summit
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFrames
Databricks
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Databricks
 
GraphQL & DGraph with Go
GraphQL & DGraph with GoGraphQL & DGraph with Go
GraphQL & DGraph with Go
James Tan
 
How to use Impala query plan and profile to fix performance issues
How to use Impala query plan and profile to fix performance issuesHow to use Impala query plan and profile to fix performance issues
How to use Impala query plan and profile to fix performance issues
Cloudera, Inc.
 
Multiplatform Spark solution for Graph datasources by Javier Dominguez
Multiplatform Spark solution for Graph datasources by Javier DominguezMultiplatform Spark solution for Graph datasources by Javier Dominguez
Multiplatform Spark solution for Graph datasources by Javier Dominguez
Big Data Spain
 
DOWNSAMPLING DATA
DOWNSAMPLING DATADOWNSAMPLING DATA
DOWNSAMPLING DATA
InfluxData
 
Big Data Analytics With MATLAB
Big Data Analytics With MATLABBig Data Analytics With MATLAB
Big Data Analytics With MATLAB
CodeOps Technologies LLP
 
PGQL: A Language for Graphs
PGQL: A Language for GraphsPGQL: A Language for Graphs
PGQL: A Language for Graphs
Jean Ihm
 
WS-VLAM workflow
WS-VLAM workflowWS-VLAM workflow
WS-VLAM workflowguest6295d0
 
Deploying Data Science Engines to Production
Deploying Data Science Engines to ProductionDeploying Data Science Engines to Production
Deploying Data Science Engines to Production
Mostafa Majidpour
 
Deep_dive_on_Amazon_Neptune_DAT361.pdf
Deep_dive_on_Amazon_Neptune_DAT361.pdfDeep_dive_on_Amazon_Neptune_DAT361.pdf
Deep_dive_on_Amazon_Neptune_DAT361.pdf
ShaikAsif83
 
Speed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudSpeed up R with parallel programming in the Cloud
Speed up R with parallel programming in the Cloud
Revolution Analytics
 
Big Events, Mob Scale - Darach Ennis (Push Technology)
Big Events, Mob Scale - Darach Ennis (Push Technology)Big Events, Mob Scale - Darach Ennis (Push Technology)
Big Events, Mob Scale - Darach Ennis (Push Technology)
jaxLondonConference
 
Big Data, Mob Scale.
Big Data, Mob Scale.Big Data, Mob Scale.
Big Data, Mob Scale.
darach
 
RAPIDS cuGraph – Accelerating all your Graph needs
RAPIDS cuGraph – Accelerating all your Graph needsRAPIDS cuGraph – Accelerating all your Graph needs
RAPIDS cuGraph – Accelerating all your Graph needs
Connected Data World
 
Large-scale Recommendation Systems on Just a PC
Large-scale Recommendation Systems on Just a PCLarge-scale Recommendation Systems on Just a PC
Large-scale Recommendation Systems on Just a PC
Aapo Kyrölä
 
Intro to Spark development
 Intro to Spark development  Intro to Spark development
Intro to Spark development
Spark Summit
 

Similar to Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache Accumulo [Frameworks] (20)

Spark streaming
Spark streamingSpark streaming
Spark streaming
 
Spark ml streaming
Spark ml streamingSpark ml streaming
Spark ml streaming
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFrames
 

Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache Accumulo [Frameworks]

  • 1. D4M and Apache Accumulo Vijay Gadepally, Lauren Edwards, Dylan Hutchison, Jeremy Kepner Accumulo Summit College Park, MD April 29, 2015 This work is sponsored by the Assistant Secretary of Defense for Research and Engineering under Air Force Contract #FA8721-05-C-0002. Opinions, interpretations, recommendations and conclusions are those of the authors and are not necessarily endorsed by the United States Government
  • 2. Giving away the punch line
      • D4M is a popular open source software tool that connects scientists with Big Data technologies
      • D4M-Accumulo binding provides high performance connectivity to Apache Accumulo for quick analytic prototyping
      • Graphulo: Implement GraphBLAS server-side iterators and operators on Accumulo tables
  • 3. Outline • Introduction • D4M Overview • D4M Details • Demonstration • Conclusions
  • 4. Common Big Data Challenge
      [Figure: a widening gap between data and users, 2000 to 2015 and beyond. Users: Commanders, Operators, Analysts. Data: OSINT, Weather, HUMINT, C2, Ground, Maritime, Air, Space, Cyber.]
      Rapidly increasing:
      - Data volume
      - Data velocity
      - Data variety
      - Data veracity (security)
  • 5. Common Big Data Architecture
      [Figure: Users (Warfighters, Operators, Analysts); Data (OSINT, Weather, HUMINT, C2, Ground, Maritime, Air, Space, Cyber); components: Web, Analytics, Computing, Scheduler, Files, Databases, Ingest & Enrichment.]
  • 6. Common Big Data Architecture - Data Volume: Cloud Computing -
      [Figure: the Common Big Data Architecture diagram, with MIT SuperCloud alongside Enterprise Cloud, Big Data Cloud, Database Cloud, and Compute Cloud.]
      MIT SuperCloud merges four clouds
  • 7. Common Big Data Architecture - Data Velocity: Accumulo Database -
      [Figure: the Common Big Data Architecture diagram.]
      Lincoln benchmarking validated Accumulo performance
  • 8. Common Big Data Architecture - Data Variety: D4M Schema -
      [Figure: the Common Big Data Architecture diagram, with a schema sketch labeled rows, columns, Σ, raw.]
      D4M demonstrated a universal approach to diverse data: intel reports, DNA, health records, publication citations, web logs, social media, building alarms, cyber, … all handled by a common 4 table schema
  • 9. Common Big Data Architecture - Data Veracity: Security Tools -
      [Figure: the Common Big Data Architecture diagram. Labels: Big Data Cloud, Masked Query, Plaintext Query, Encrypt, CMD, Masked Analytic Result, Decrypt, Plaintext Analytic Result.]
      Using cryptography to protect sensitive data
      - Verifiable Query Results
      - Computing on Masked Data
  • 10. Outline • Introduction • D4M Overview • D4M Details • Demonstration • Conclusions
  • 11. High Level Language: D4M (http://d4m.mit.edu)
      D4M: Dynamic Distributed Dimensional Data Model
      [Figure: a query (Alice, Bob, Cathy, David, Earl) passes between the Accumulo distributed database, associative arrays, and the numerical computing environment.]
      A D4M query returns a sparse matrix or a graph for statistical signal processing or graph analysis in MATLAB
      D4M binds associative arrays to databases, enabling rapid prototyping of data-intensive cloud analytics and visualization
  • 12. What is D4M?
      • The Dynamic Distributed Dimensional Data Model:
          – A mathematical foundation: associative arrays
          – A schema to represent most unstructured data as associative arrays
          – Software tools to connect with a variety of databases such as Apache Accumulo, SciDB, MySQL, PostgreSQL, …
      • Software tools currently implemented in MATLAB/Octave, and Julia (v1)
      • Connect to databases via JDBC (relational), SHIM (SciDB) or custom Java API (Accumulo)
  • 13. Mathematical Foundation: Associative Arrays
      • Key innovation: mathematical closure
          – All associative array operations return associative arrays
      • Enables composable mathematical operations
          A + B    A - B    A & B    A|B    A*B
      • Enables composable query operations via array indexing
          A('alice bob ',:)    A('alice ',:)    A('al* ',:)    A('alice : bob ',:)    A(1:2,:)    A == 47.0
      • Simple to implement in a library in programming environments with: 1st class support of 2D arrays, operator overloading, sparse linear algebra
      • Complex queries with ~50x less effort than Java/SQL
      • Naturally leads to high performance parallel implementation
      • Need a schema to convert arbitrary data to associative array
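The closure property on slide 13 can be illustrated with a toy class. This is a minimal sketch in plain Python, not the D4M library (which is MATLAB/Octave/Julia): the class name `Assoc` mirrors the slide, but the methods `rows` and the use of `@` for multiplication are assumptions made for illustration only. The point it shows is that every operation returns another `Assoc`, so additions, products, and selections can be chained.

```python
from collections import defaultdict


class Assoc:
    """Toy associative array: a sparse map from (row key, column key) to a number."""

    def __init__(self, triples=()):
        self.data = {}
        for row, col, val in triples:
            self.data[(row, col)] = val

    def __add__(self, other):
        # Element-wise sum over the union of keys; the result is again an Assoc.
        out = Assoc()
        out.data = dict(self.data)
        for key, val in other.data.items():
            out.data[key] = out.data.get(key, 0) + val
        return out

    def __matmul__(self, other):
        # Sparse matrix product: column keys of self are matched to row keys of other.
        by_row = defaultdict(list)
        for (row, col), val in other.data.items():
            by_row[row].append((col, val))
        out = Assoc()
        for (row, inner), val in self.data.items():
            for col, val2 in by_row.get(inner, ()):
                out.data[(row, col)] = out.data.get((row, col), 0) + val * val2
        return out

    def rows(self, *keys):
        # Row selection by exact key or by prefix pattern such as 'al*'.
        def match(row):
            return any(
                row.startswith(k[:-1]) if k.endswith("*") else row == k
                for k in keys
            )
        return Assoc((r, c, v) for (r, c), v in self.data.items() if match(r))


A = Assoc([("alice", "bob", 1), ("alice", "carl", 1), ("bob", "carl", 1)])
B = Assoc([("alice", "bob", 2)])

# Because each step returns an Assoc, operations compose in a single expression.
C = (A + B).rows("al*")
print(sorted(C.data.items()))
```

A dictionary keyed by string pairs stands in for the sparse matrix here; a real implementation would map the sorted keys to integer indices and reuse a sparse linear algebra library, which is the design the slide describes.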
  • 14. D4M Data Schema
      • Schema: a structure described in a language supported by the database management system.
      • Use D4M schema to represent heterogeneous data types in common data format
          – Schema converts structured or unstructured raw text to a tuple representation supported by Accumulo
      • Usually use a 4 table representation
          – The Edge Table, the Transpose Table, the Degree Table, and the Raw Table
      Raw data:
          33659254179712 2013-05-20 21:21:42 20798128 kiefpief web 3b77caf94bfc81fe I am sending love to Oklahoma. And actually -- to everyone who may need it. You are loved. And you are not alone. Promise. #PrayforOklahoma
          33660010027264 2013-05-20 21:54:56 35.99894978 -78.90660222 -8783842.7781526 4300476.86376416 22435220 RyanBLeslie Twitter for iPad 348803787 bced47a0c99c71d0 @HaydenBigCntry RT @jiminhofe: The devastation in Oklahoma is …
      D4M Schema:
          (33659254179712, time|2013-05-20 21:21:42, 1)
          (33659254179712, user|kiefpief, 1)
          (33659254179712, text, Sending love to OK #PrayforOklahoma)
          (33659254179712, word|Sending, 1)
          (33660010027264, time|2013-05-20 21:54:56, 1)
          (33660010027264, lat|35.99894978, 1)
          (33660010027264, lon|-78.90660222, 1)
          (33660010027264, user|RyanBLeslie, 1)
          (33660010027264, RT|@HaydenBigCntry, 1)
          (33660010027264, word|Oklahoma, 1)
          …
  • 15. 4 Table D4M Schema
      Input table:
          row_num   col1       col2       col3
          001       row1col1   row1col2   word1 word2 word3
          002       row2col1   row2col2   word2 word3
          003       …          …          word1 word3
      Tedge (each listed entry has value 1):
          row_num|001: col1|row1col1, col2|row1col2, col3|word1, col3|word2, col3|word3
          row_num|002: col1|row2col1, col2|row2col2, col3|word2, col3|word3
          row_num|003: col3|word1, col3|word3
      TedgeDeg (column: Degree):
          col1|row1col1: 1
          col1|row2col1: 1
          col2|row1col2: 1
          col2|row2col2: 1
          col3|word1: 2
          col3|word2: 2
          col3|word3: 3
      TedgeT (transpose of Tedge; each listed entry has value 1):
          col1|row1col1: row_num|001
          col1|row2col1: row_num|002
          col2|row1col2: row_num|001
          col2|row2col2: row_num|002
          col3|word1: row_num|001, row_num|003
          col3|word2: row_num|001, row_num|002
          col3|word3: row_num|001, row_num|002, row_num|003
      TedgeTxt (column: text):
          row_num|001: word1 word2 word3
          row_num|002: word2 word3
          row_num|003: word1 word3
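The four-table layout on slide 15 can be reproduced with a few lines of code. The following is a sketch in plain Python rather than D4M; the function name `explode` and the use of dictionaries as tables are assumptions for illustration, and the elided cells of row 003 ("…" on the slide) are treated as absent, which matches the two entries the slide shows for that row in Tedge.

```python
from collections import Counter


def explode(records, text_field):
    """Build Tedge, TedgeT, TedgeDeg and TedgeTxt from a list of dict records.

    Each record needs a 'row_num' key. text_field names the free-text column,
    which is split into words. Tables are plain dicts keyed by (row, column).
    """
    tedge, tedge_t, tedge_txt = {}, {}, {}
    tedge_deg = Counter()
    for rec in records:
        row = "row_num|" + rec["row_num"]
        for field, value in rec.items():
            if field == "row_num" or value is None:
                continue
            # The text column contributes one column key per distinct word.
            values = sorted(set(value.split())) if field == text_field else [value]
            for v in values:
                col = field + "|" + v
                tedge[(row, col)] = 1      # Tedge: row to exploded column
                tedge_t[(col, row)] = 1    # TedgeT: transpose, for column lookups
                tedge_deg[col] += 1        # TedgeDeg: number of rows per column key
        if rec.get(text_field) is not None:
            tedge_txt[(row, "text")] = rec[text_field]  # TedgeTxt: raw text
    return tedge, tedge_t, dict(tedge_deg), tedge_txt


records = [
    {"row_num": "001", "col1": "row1col1", "col2": "row1col2", "col3": "word1 word2 word3"},
    {"row_num": "002", "col1": "row2col1", "col2": "row2col2", "col3": "word2 word3"},
    {"row_num": "003", "col1": None, "col2": None, "col3": "word1 word3"},
]
tedge, tedge_t, tedge_deg, tedge_txt = explode(records, "col3")
for col in sorted(tedge_deg):
    print(col, tedge_deg[col])
```

Storing the transpose costs a second write per entry, but it turns a column query into a row scan, which is the access pattern a sorted key-value store such as Accumulo serves efficiently. The degree table lets a query planner check how many rows a column key touches before scanning it.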
  • 16. Outline • Introduction • D4M Overview • D4M Details • Demonstration • Conclusions
  • 17. D4M Software Library
      • Associative Array representation works very well as an interface among databases.
      • D4M currently implemented in languages with first class support of sparse matrices:
          – MATLAB
          – GNU Octave
          – Julia (in progress)
      • Implemented in ~2000 lines of MATLAB code
      Download D4M Source from d4m.mit.edu: d4m_api.zip, matlab_src/, d4m_api_java.jar, libext.zip (dependency JARs)
  • 18. D4M: What a user sees
      Layers: (row, col, val) MATLAB strings; d4m MATLAB API; d4m_api_java Java API; Accumulo Java API; Accumulo Table
          % D4M Associative Array API
          row = 'r1,r2,'; col = 'c1,c1,'; val = '7,3,';
          A = Assoc(row,col,val,@min);
          % D4M Accumulo API
          DB = DBserver('zoohost.edu:2181', 'Accumulo', 'instance', 'user', 'password');
          T = DB('Table');   % Create table if doesn't exist.
          put(T,A);          % Put associative array in T.
          Aret = T(:,:);     % Scan all of T.
  • 19. D4M: What a developer sees
      Each entry below reads: Matlab/Julia File (Java Class): Use
      Table management:
          DBcreate.m (D4mDbTableOperations): Create table
          @DBserver/ls.m (D4mDbInfo): List tables
          @DBtable/nnz.m (D4mDbTableOperations): Number of entries in table, summed from table's tablets
          DBdelete.m (D4mDbTableOperations): Delete table
      Write:
          DBinsert.m (D4mDbInsert): Insert
      Scan:
          @DBtable/DBtable.m (D4mDataSearch): Create query holder
          @DBtable/subsref.m (D4mDataSearch): Do query, possibly holding batches
          @DBtable/close.m (D4mDataSearch): Reset query
      Delete:
          @DBtable/deleteTriple.m (AccumuloDelete): Delete entries
          @DBtable/deleteAssoc.m (AccumuloDelete): Delete entries
      Iterators:
          @DBtable/ColCombiner.m (D4mDbTableOperations): List table iterators
          @DBtable/addColCombiner.m (D4mDbTableOperations): Add all-scope table iterator
          @DBtable/deleteColCombiner.m (D4mDbTableOperations): Remove iterator
      Splits:
          @DBtable/Splits.m (D4mDbTableOperations): Return splits, number of entries in each tablet, tablet server addresses
          @DBtable/addSplits.m (D4mDbTableOperations): Add new table split
          @DBtable/putSplits.m (D4mDbTableOperations): Replace table splits, merging old splits
          @DBtable/mergeSplits.m (D4mDbTableOperations): Remove splits by merging tablets
      • Source code released and available!
  • 20. D4M Write
      More details on Batched Insert
          – 500 kB by default
      • putNumBytes() controls #entries to insert in one batch, on MATLAB side
      • Independent batches: each creates, flushes and closes separate BatchWriters
      • Guarantee BatchWriters correctly closed
      • No need to maintain BatchWriter lifecycle in MATLAB
      • 30 ms maximum latency before flushing
      • 50 Write threads
      • 1 MB maximum memory on BatchWriter, plenty for default batch size
      Mapping of associative array triples to Accumulo entries:
          Key: Row ID = Assoc Row; Column Family = putColumnFamily(); Column Qualifier = Assoc Col; Column Visibility = putSecurity(); Timestamp
          Value = Assoc Val
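The client-side batching described on slide 20 amounts to cutting a stream of triples into chunks under a byte budget, then writing each chunk independently. The sketch below shows only the chunking step in plain Python; the function name `batches_by_bytes` is hypothetical, the 500,000-byte default is an assumption (the slide says "500 kB" without defining the unit), and no Accumulo call is made.

```python
def batches_by_bytes(triples, max_bytes=500_000):
    """Yield lists of (row, col, val) string triples whose total size stays within max_bytes.

    A single triple larger than max_bytes is still emitted, alone in its own batch.
    """
    batch, size = [], 0
    for triple in triples:
        n = sum(len(s.encode("utf-8")) for s in triple)
        if batch and size + n > max_bytes:
            yield batch
            batch, size = [], 0
        batch.append(triple)
        size += n
    if batch:
        yield batch


# Ten 5-byte triples with a 12-byte budget give five batches of two triples each.
triples = [("r%d" % i, "c1", "1") for i in range(10)]
batch_sizes = [len(b) for b in batches_by_bytes(triples, max_bytes=12)]
print(batch_sizes)
```

Because each batch is self-contained, a writer can be opened, flushed, and closed per batch, which is the lifecycle the slide describes and the reason the MATLAB side never has to track an open writer.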
  • 21. D4M Scan Example
      1. Translate MATLAB queries into ranges for BatchScanner
          T(:,:)                       % Scan all
          T('r1;r5;:;r7;', :)          % Scan given row ranges
          T(:, 'c1;')                  % Use fetchColumn(), or row scan Transpose table
          T('r5;:;r9;', 'c1;:;c3;')    % Complicated; break into simpler queries
      2. Hold state of Scanner iterator as state of MATLAB object
          T_it = Iterator(T, 'elements', 1e5);   % 100k entry batch size
          A = T_it(:,:);                         % Initial query
          while nnz(A)                           % While there is another batch
              handleBatch(A);
              A = T_it();                        % Get next batch
          end
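The second pattern on slide 21, an object that remembers where a scan stopped and returns the next batch on each call, can be sketched in plain Python. The class name `BatchIterator` and the in-memory list standing in for a table scan are assumptions for illustration; the loop shape follows the slide's `while nnz(A)` loop.

```python
from itertools import islice


class BatchIterator:
    """Holds the position of a scan so that each call returns the next batch."""

    def __init__(self, entries, batch_size):
        self._entries = iter(entries)
        self._batch_size = batch_size

    def __call__(self):
        # An empty list signals that the scan is exhausted.
        return list(islice(self._entries, self._batch_size))


table = [("r%d" % i, "c1", i) for i in range(7)]  # stand-in for a table scan
next_batch = BatchIterator(table, batch_size=3)

sizes = []
batch = next_batch()          # initial query
while batch:                  # while there is another batch
    sizes.append(len(batch))  # placeholder for per-batch processing
    batch = next_batch()      # get next batch
print(sizes)
```

Bounding the batch size keeps client memory flat regardless of table size, at the cost of one round trip per batch.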
  • 22. Parallel Accumulo Access
      Sample script writing files to Accumulo in parallel:
          T = DB('Tedge','TedgeT');
          myFiles = global_ind(zeros(Nfile,1,map([Np 1],{},0:Np-1)));
          for i = myFiles
              fname = ['data/' num2str(i)];   % Create filename.
              load([fname '.A.mat']);         % Load file data.
              put(T,num2str(A));              % Insert to Accumulo.
          end
      Run on 4 local processors:
          eval(pRUN('Script',4,{}));
      • D4M + pMATLAB gives rise to high performance
  • 23. Accumulo Scaling on MIT SuperCloud
      • Scales linearly with ingest processes, server nodes, and data size
      [Figure: scaling plot; axis label: Server nodes.]
  • 24. 115,000,000 inserts per second
      • Using supercomputing techniques allows the peak insert rate to be achieved within seconds of launch
      [Figure labels: 1M edge Graph500 graph; 43K; 43B edges in 5 minutes.]
  • 25. Outline • Introduction • D4M Overview • D4M Details • Demonstration • Conclusions
  • 26. D4M Twitter Demo
      • August 24, 2014: Earthquake in Northern California
      • Tweets from August 24-25
      • Using D4M for:
          – Exploration
          – Analytics
          – Visualization
  • 27. Set Table Bindings
  • 28. Query Tweets
  • 29. Find Common Locations
  • 30. Filter Tweets
  • 31. Query for Full Tweets
  • 32. Load Stopwords
  • 33. Remove Stopwords
  • 34. Find Co-Occurring Words
  • 35. Remove Diagonal
  • 36. See Words Most Used Together
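Slides 34 to 36 survive only as titles, so the demo's actual code is not available here. As a hedged illustration of what those steps usually mean: if A is a tweet-by-word incidence array, the product of its transpose with itself counts how often two words share a tweet, and its diagonal (each word paired with itself) is discarded. The sketch below computes the same off-diagonal counts directly in plain Python; the function name `cooccurrence` and the three sample tweets are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(docs):
    """Count, for each unordered pair of distinct words, the documents containing both."""
    counts = Counter()
    for words in docs:
        # Pairs of distinct words only, so the diagonal is never produced.
        for a, b in combinations(sorted(set(words)), 2):
            counts[(a, b)] += 1
    return counts


# Made-up sample data, already tokenized and with stopwords removed.
docs = [
    ["earthquake", "napa", "awake"],
    ["earthquake", "napa"],
    ["napa", "wine"],
]
counts = cooccurrence(docs)
print(counts.most_common(1))
```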
  • 37. Display on Map
  • 38. Outline • Introduction • D4M Overview • D4M Details • Demonstration • Conclusions
  • 39. Summary
      • D4M is a popular software tool that connects scientists with Big Data technologies
      • D4M-Accumulo binding provides high performance connectivity to Apache Accumulo for quick analytic prototyping
      • Current research expands this connection to support high performance graph analytics
  • 40. Graphulo (http://graphulo.mit.edu)
      • Graphulo: Implement GraphBLAS server-side iterators and operators on Accumulo tables
      • Use case: Queued analytics = Localized within a neighborhood
      • Aim for Accumulo Contrib
      • Released:
          – Design Document
      • Upcoming:
          – Beta version of tools in late May/early June
      • Future:
          – Scalability
          – Schemas
          – More example algorithms
      Contact Dylan Hutchison if you have any thoughts! dhutchis@mit.edu
  • 41. Acknowledgements • Bill Arcand • Bill Bergeron • David Bestor • Chansup Byun • Matt Hubbell • Jeremy Kepner • Jake Bolewski • Pete Michaleas • Julie Mullen • Andy Prout • Albert Reuther • Tony Rosa • Charles Yee • Dylan Hutchison And many more …
  • 42. Thank you!
      • Contact:
          – Vijay Gadepally (vijayg@ll.mit.edu)
          – Lauren Edwards (lauren.edwards@ll.mit.edu)
          – Jeremy Kepner (kepner@ll.mit.edu)

Editor's Notes

  1. Title Page
  2. Conclusion Slide.
  3. Outline slide.
  4. Increasing data volume, velocity, and variety has created a growing gap between data and users. Image Sources: Operators: http://en.wikipedia.org/wiki/Air_and_Space_Operations_Center Analysts: © Comstock http://www.gettyimages.com/detail/photo/businessman-at-computer-royalty-free-image/78479774. Commanders: US Forces Korea General News http://www.usfk.mil/usfk/%28A%28dL8DLge1ywEkAAAAYzg4NWY4MzMtM2I0OS00YWI5LTljYjctMWQ0NDM4MGUwYzVmgU4GPacw1yQ4-d8XCgyTu_0lbjQ1%29S%28aw5nfc45hpuvapn5pihn0o45%29%29/news.annual.command.post.exercise.winds.down.in.korea.printview.648 OSINT: Acrobat logo is © Adobe Systems Inc., © Twitter and Office 2011 logos are © Microsoft Weather: © Rebecca van Ommen : http://www.gettyimages.com/detail/photo/paper-craft-weather-royalty-free-image/180478515 HUMINT: U.S. Dept of State http://www.rewardsforjustice.net/index.cfm?page=zulkifli&language=english C2: http://www.disa.mil/Services/Command-and-Control/GCCS-J Ground: Staff Sgt. William Tremblay/U.S. Army via Wired : http://www.wired.com/dangerroom/2010/09/afghan-biometric-dragnet-could-snag-millions/ Maritime: This file is a work of a sailor or employee of the U.S. Navy, taken or made as part of that person's official duties. As a work of the U.S. federal government, the image is in the public domain. http://en.wikipedia.org/wiki/File:USS_Lake_Champlain_%28CG-57%29.JPG Air: This image or file is a work of a U.S. Air Force Airman or employee, taken or made as part of that person's official duties. As a work of the U.S. federal government, the image or file is in the public domain. http://en.wikipedia.org/wiki/File:MQ-9_Reaper_in_flight_%282007%29.jpg Space: This file is in the public domain because it was solely created by NASA. NASA copyright policy states that "NASA material is not protected by copyright unless noted". 
http://en.wikipedia.org/wiki/File:CloudSat_-_Artist_Concept.jpg Cyber: © derrrek : http://www.gettyimages.com/detail/illustration/abstract-backgrounds-royalty-free-illustration/185548379 This graphic was previously approved for public release as MS-77705
  5. The Common Big Data Architecture shows the components that are common to many big data and supercomputing systems. The platform is designed to support standardized data access and dynamic composition of functionalities.
  6. Addressing data volume requires a large computing cloud. MIT SuperCloud merges four main clouds.
  7. Data velocity requires high performance databases. Lincoln has confirmed the claims of the Accumulo database.
  8. Data variety requires new schemas that allow data to be stored flexibly. The D4M Schema provides a simple approach that has been validated on a wide range of data.
  9. The Common Big Data Architecture shows the components that are common to many big data and supercomputing systems. The platform is designed to support standardized data access and dynamic composition of functionalities.
  10. Outline slide.
  11. We make use of the high-level language D4M to enable the construction of graph representations of large-scale data.
  12. D4M is made up of 3 components: a mathematical foundation, the D4M schema, and software tools.
  13. Associative array operations are composable, enabling complex queries to be constructed in a few lines. Shorter code -> easier to audit, better for security!
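To make the composability point concrete, here is a minimal sketch in Python rather than D4M itself. The `Assoc` class, its method names, the `field|value` column keys, and the sample data are all assumptions made for illustration; none of this is the D4M API.

```python
class Assoc:
    """Toy associative array: (row, col) string keys mapped to numbers.

    Illustrative only; this is not the D4M API.
    """

    def __init__(self, triples=()):
        self.data = {}
        for row, col, val in triples:
            self.data[(row, col)] = val

    def __add__(self, other):
        # Element-wise sum over the union of the two key sets.
        out = Assoc()
        out.data = dict(self.data)
        for key, val in other.data.items():
            out.data[key] = out.data.get(key, 0) + val
        return out

    def cols_starting_with(self, prefix):
        # Keep only entries whose column key begins with the prefix.
        out = Assoc()
        out.data = {k: v for k, v in self.data.items() if k[1].startswith(prefix)}
        return out

    def col_sums(self):
        # Collapse rows: total value per column key.
        totals = {}
        for (_, col), val in self.data.items():
            totals[col] = totals.get(col, 0) + val
        return totals


# Two hypothetical ingest batches.
batch1 = Assoc([("t1", "word|big", 1), ("t1", "word|data", 1), ("t1", "user|alice", 1)])
batch2 = Assoc([("t2", "word|data", 1), ("t2", "user|bob", 1)])

# One composed expression: merge, select word columns, count per word.
word_counts = (batch1 + batch2).cols_starting_with("word|").col_sums()
```

Because each operation returns another array of the same kind, the merge, the column selection, and the count chain into a single expression.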
  14. Introduction to schemas and how they map onto triple stores. The D4M schema allows one to use the mathematics behind associative arrays to perform mathematical operations on big data sets. The D4M schema converts structured or unstructured raw data to the 3-tuple representation supported by Accumulo: row is a unique identifier (often some variation of a time stamp), column is a unique representation of the data, and value is typically just ‘1’.
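The conversion to 3-tuples described in the note above can be sketched roughly as follows. This is Python, not D4M; the function name `explode`, the `|` separator between field and value, and the sample record are assumptions for illustration.

```python
def explode(row_id, record, sep="|"):
    """Turn one record (field -> value or list of values) into (row, column, value) triples.

    Each field/value pair becomes its own column key, and the stored value is "1".
    """
    triples = []
    for field, values in record.items():
        if not isinstance(values, (list, tuple)):
            values = [values]
        for value in values:
            triples.append((row_id, f"{field}{sep}{value}", "1"))
    return triples


# Hypothetical record keyed by a time-stamp-like row identifier.
triples = explode("20150429-0001", {"user": "alice", "word": ["big", "data"]})
```

Folding the data value into the column key is what lets fields that were never declared in advance be stored without changing the table layout.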
  15. The standard “exploded” D4M Schema, used with many databases. The 4-table schema is introduced.
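One way to picture a 4-table layout is the sketch below, in Python with plain dictionaries standing in for database tables. The table roles used here (an edge table, its transpose, a degree table, and a raw-text table) follow the layout commonly described for the D4M schema and are an assumption about what the slide shows; the function and variable names are hypothetical.

```python
from collections import Counter


def build_tables(records):
    """records: iterable of (row_id, raw_text, [column keys]).

    Returns four dict-based stand-ins: edge, transposed edge, degree, raw text.
    """
    tedge, tedge_t, tedge_txt = {}, {}, {}
    tedge_deg = Counter()
    for row_id, raw_text, columns in records:
        tedge_txt[row_id] = raw_text
        # dict.fromkeys drops repeated column keys while keeping their order.
        for col in dict.fromkeys(columns):
            tedge.setdefault(row_id, {})[col] = 1
            tedge_t.setdefault(col, {})[row_id] = 1
            tedge_deg[col] += 1
    return tedge, tedge_t, tedge_deg, tedge_txt


records = [
    ("t1", "big data", ["word|big", "word|data"]),
    ("t2", "data rocks", ["word|data", "word|rocks"]),
]
tedge, tedge_t, tedge_deg, tedge_txt = build_tables(records)

# A keyword lookup becomes a row lookup in the transposed table.
rows_with_data = sorted(tedge_t["word|data"])
```

The transposed table turns a column query into a fast row query, the degree table tells you how many rows a key will return before you scan, and the text table holds the original record.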
  16. Outline slide.
  17. @min is the collision function. From the user's perspective, D4M appears as a software library.
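A collision function decides what to keep when the same (row, column) pair appears more than once. The sketch below shows the idea in Python; `build_assoc` and its `collide` parameter are hypothetical names, not D4M calls.

```python
import operator


def build_assoc(triples, collide=min):
    """Build a {(row, col): value} map, resolving duplicate keys with collide."""
    out = {}
    for row, col, val in triples:
        key = (row, col)
        out[key] = collide(out[key], val) if key in out else val
    return out


triples = [("r1", "c1", 5), ("r1", "c1", 2), ("r1", "c2", 7)]

keep_min = build_assoc(triples)                 # duplicates resolved with min
keep_max = build_assoc(triples, max)            # duplicates resolved with max
keep_sum = build_assoc(triples, operator.add)   # duplicates summed
```

Swapping the collision function changes the meaning of a repeated entry (smallest value, largest value, or a running count) without touching the rest of the code.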
  18. From an Accumulo developer's point of view, one sees the connection between the library and Java calls to the Accumulo library.
  19. Justification for 50 write threads and 30 ms latency? (Chosen after performance tuning.) Max memory could create trouble if the chunk size is increased. Set Column Family and Visibility with a separate call. Design choice: we could have used a handle object with an explicit destructor that closes the BatchWriter when the object is destroyed. Instead we are synchronous, which is easier.
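The synchronous design choice can be illustrated with a short Python sketch: the put call opens a writer, sends the data in chunks, and closes the writer before returning, so no handle outlives the call. `RecordingWriter`, `chunks`, `put_sync`, and the character-based chunk limit are all hypothetical stand-ins; this is not the Accumulo BatchWriter API or the D4M implementation.

```python
class RecordingWriter:
    """Hypothetical stand-in for a database batch writer; it records what it is given."""

    def __init__(self):
        self.batches = []
        self.closed = False

    def add_batch(self, batch):
        if self.closed:
            raise RuntimeError("writer is closed")
        self.batches.append(list(batch))

    def close(self):
        self.closed = True


def chunks(triples, max_chars):
    """Group triples so each chunk holds at most max_chars characters of key/value text."""
    batch, size = [], 0
    for triple in triples:
        n = sum(len(part) for part in triple)
        if batch and size + n > max_chars:
            yield batch
            batch, size = [], 0
        batch.append(triple)
        size += n
    if batch:
        yield batch


def put_sync(make_writer, triples, max_chars=1000):
    """Open a writer, send every chunk, and close it before returning."""
    writer = make_writer()
    try:
        for batch in chunks(triples, max_chars):
            writer.add_batch(batch)
    finally:
        # Closing here, not in a destructor, is the synchronous choice.
        writer.close()
    return writer


data = [("r1", "c1", "1"), ("r2", "c2", "1"), ("r3", "c3", "1")]
writer = put_sync(RecordingWriter, data, max_chars=10)
```

The trade-off is that every put pays the cost of opening and closing a writer, in exchange for never leaving an open writer behind if the caller forgets to clean up.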
  20. In a D4M scan, the underlying Java library can be called in many ways.
  21. In order to achieve high performance, one can combine D4M with pMATLAB.
  22. Accumulo demonstrates linear scaling across data sizes and hardware.
  23. Achieved a peak performance of 115,000,000 inserts per second.
  24. Outline slide.
  25. Now, we will show a demonstration of D4M in action on a Twitter dataset. This shows how quickly these tools can be used to prototype algorithms.
  26. Step 1: Set table bindings to the Accumulo database.
  27. Now you can query tweets using the keyword of interest. Recall that this calls a scanner on the server side.
  28. To find common locations, we can make use of the tweet geohash.
  29. You can filter tweets based on location; in this case, ww corresponds to Northern California.
  30. You can use the D4M schema txt table to get the full tweets.
  31. Remove all common words – referred to as stop words
  32. Remove all common words – referred to as stop words
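Stop-word removal amounts to dropping a fixed set of column keys. A minimal Python sketch follows; the stop-word list, the `word|` column prefix, and the sample tweets are assumptions for illustration, not taken from the demonstration.

```python
STOP_WORDS = {"the", "a", "and", "of", "to", "in"}  # illustrative list, not from the slides


def drop_stop_words(assoc, stop_words=STOP_WORDS, prefix="word|"):
    """Return a copy of assoc (row -> {column: value}) without stop-word columns."""
    cleaned = {}
    for row, cols in assoc.items():
        kept = {
            col: val
            for col, val in cols.items()
            if not (col.startswith(prefix) and col[len(prefix):] in stop_words)
        }
        if kept:
            # Rows left with no columns are dropped entirely.
            cleaned[row] = kept
    return cleaned


tweets = {
    "t1": {"word|the": 1, "word|summit": 1, "user|alice": 1},
    "t2": {"word|a": 1, "word|of": 1},
}
filtered = drop_stop_words(tweets)
```

Only columns under the word prefix are checked, so a user or location key that happens to match a stop word is left alone.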
  33. Find words that occur together in the same tweet.
  34. Get rid of self loops, i.e. the entries that pair a word with itself.
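Word co-occurrence and self-loop removal can be sketched in a few lines of Python. If A is the tweet-by-word incidence array, the co-occurrence counts are the entries of A-transpose times A, and the self loops are its diagonal. The function names and sample tweets below are assumptions for illustration.

```python
def cooccurrence(tweets):
    """tweets: dict of tweet_id -> set of words.

    Returns counts[(w1, w2)] = number of tweets containing both words,
    which equals the (w1, w2) entry of A-transpose times A.
    """
    counts = {}
    for words in tweets.values():
        for w1 in words:
            for w2 in words:
                counts[(w1, w2)] = counts.get((w1, w2), 0) + 1
    return counts


def drop_self_loops(counts):
    """Remove the diagonal entries, where a word is paired with itself."""
    return {pair: n for pair, n in counts.items() if pair[0] != pair[1]}


tweets = {"t1": {"big", "data"}, "t2": {"data", "cloud"}}
counts = cooccurrence(tweets)
pairs = drop_self_loops(counts)
```

The diagonal only records how many tweets contain each word, so removing it leaves just the word-to-word edges that make up the graph.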
  35. Find common words used together in a tweet. Any surprises?
  36. Plot onto a map. Of course, there are many different visualization tools that can be used.
  37. Outline slide.
  38. Conclusion Slide.
  39. Non-Lincoln Work. Contact Dylan Hutchison for more information!
  40. Acknowledgements page. Many people who have made this possible.
  41. Interns, Rotations, Jobs, etc.