The Power of Declarative Analytics

August 30, 2014
Yunyao Li
IBM Almaden Research Center
Acknowledgement: Shiv, Sekar, Fred, Laura,
Berthold, and many more to list here.
© 2014 IBM Corporation

Unlocking the value from big data

Case Study: Sentiment Analysis
Customer 360º
Inputs: Social Media; Product catalog, Customer Master Data, …
Text Analytics builds a 360º Profile:
• Relationships
• Products
• Personal Attributes
• Interests
• Life Events
Statistical Analysis, Report Gen.
[Chart: Bank 1 15%, Bank 2 15%, Bank 3 5%, Bank 4 21%, Bank 5 20%, Bank 6 23%]
Who can we cross/up sell?
What are our customers thinking of our brand?
What do our customers want?

FDA approval
1 compound
Drug Development Pipeline
Intervention should happen at critical transition bottlenecks between stages
(most likely to impact outcome)
Avg cost to develop a drug: 1.2 billion
Phases: 0-I, II, III, IV
Laboratory
50,000 +
Compounds
Pre-Clinical
250 Compounds
Clinical
5 Compounds
Time to develop a drug: 12-15 years
“..Toxicity and Serious Adverse Events in Late Stage Drug Development
are the Major Causes of Drug Failure”
Adapted from PhRMA (Pharmaceutical Research and Manufacturers of America) 2013 profile

Structure
Indication
Contraindication
Mode of action
/ target
Effect level
Side Effects
Case Study: Drug Discovery
Structured and unstructured data sources
Data-driven decision making
More efficient clinical trial design, data analytics and drug success / failure predictions

Case Study: Water Cost Index
Who cares about the cost of water?
• Water agencies: Improve credit profile for water infrastructure projects
• Lenders: Better estimate cost and profits of such projects
• Insurers: Better understand underlying risk of such projects
• Consumers: Access water at an affordable price despite increasing population and demand for water
• Financial reports
• News feeds
• Websites
• …
What is the cost of water
in different regions?
Financial Analytics
• Provides market
benchmark
• Spurs growth of financial
products for both water
producers and investors

• Wall Street Journal article
on WCI
Case Study: Water Cost Index
• Financial reports
• News feeds
• Websites
• …
Statistical
Analysis
Water Cost Index
• Uganda signed up as 1st
customer for WCI
Text
Analytics
Financial Analytics
• WCI published on ongoing
basis starting end of Sep.
2013

What is this talk about?
What makes analytics tasks difficult and what can be learnt from the
success of relational systems
Brief description of declarative systems being built at IBM for
√ Information Extraction (SystemT)
√ Machine Learning (SystemML)
X Entity Resolution (DeeR)
9/10/2014 IBM Research – Almaden
Data integration
Statistical Analysis / Machine Learning
Information
Extraction
Databases
Semi-
/Unstructured
Documents

Challenges in Information Extraction
Example: Named Entity Recognition
Named Entity Hierarchy: 200+ types: Person, Organization, Location, …
Input text: … Laura Haas works for IBM in San Jose, CA. …
Information Extraction output:
Person        Org    Loc
Laura Haas    IBM    San Jose, CA

Breadth
Wide varieties of extraction tasks
Named Entity Hierarchy: 200+ types: Person, Organization, Location, …
Complexity
In development customization
IE development takes effort!!
• Collecting dictionaries
• Writing regular expressions
• Collecting other word-level features
• Labeling + training/tuning machine learning models, or writing + testing rules
Domain customization is usually required!!
Entity Boundary: Person or Position + Person?
… Pres. Barack Obama arrived today at the White House …
Entity Definition: Location/Facility/Organization?

State-of-the-art Open-Source Rule-based
System
• 80,000+ dictionary entries
• 4,800 lines of JAPE and Java code
• Accuracy (English): 50%-80%
• Performance: 20KB/sec, 8GB RAM
State-of-the-art Machine-learning system
• Combination of 4 classifiers
• 150,000+ dictionary entries
• 15+ regexes for word features
• Accuracy: 89%
• Throughput: ~ 10 KB/sec
Scale
450M+ tweets per day, …

Challenges in Scalable Machine Learning
Example: matrix factorization V ≈ W H
(V: documents x words, W: documents x topics, H: topics x words)
Parallel implementation [Liu, WWW 2010]
• Billions of non-zeros within tens of hours
• Careful partitioning of data
• Maximize data locality and parallelism
• ~1500 lines of Java code
Basic algorithm:
% initialize W, H
while (~converged)
W = W*(V%*%t(H))/(W%*%H%*%t(H))
H = H*(t(W)%*%V)/(t(W)%*%W%*%H)
end
Wide varieties of ML models
Regularizers JW, JH:
W = W*max(V%*%t(H) - alphaW*JW, 0)/(W%*%H%*%t(H))
H = H*max(t(W)%*%V - alphaH*JH, 0)/(t(W)%*%W%*%H)
Weighted Sq Loss / Matrix Completion Setting:
W = W*((S*V)%*%t(H))/((S*(W%*%H))%*%t(H))
H = H*(t(W)%*%(S*V))/(t(W)%*%(S*(W%*%H)))
Different loss function (KL-divergence):
W = W*((V/(W%*%H))%*%t(H))/(E%*%t(H))
H = H*(t(W)%*%(V/(W%*%H)))/(t(W)%*%E)
Parallel implementation is half the story!
Typical application requires experimenting with multiple variants
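The basic multiplicative updates for V ≈ W H can be tried on a toy matrix. The following is a minimal single-machine sketch in plain Python, not the parallel implementation the slide cites; the helper names, the sample matrix, the random initialization and the small EPS constant are assumptions added for illustration.

```python
import random

EPS = 1e-12  # assumption: guards against division by zero (not on the slide)

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    """Dense product of two list-of-lists matrices."""
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]

def multiplicative_step(m, numerator, denominator):
    """Element-wise m * numerator / denominator."""
    return [[x * n / (d + EPS) for x, n, d in zip(rm, rn, rd)]
            for rm, rn, rd in zip(m, numerator, denominator)]

def squared_loss(v, w, h):
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rwh in zip(v, wh) for x, y in zip(rv, rwh))

def gnmf(v, rank, max_iteration=50, seed=0):
    rng = random.Random(seed)
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in v]
    h = [[rng.random() + 0.1 for _ in v[0]] for _ in range(rank)]
    losses = [squared_loss(v, w, h)]
    for _ in range(max_iteration):
        # W = W*(V%*%t(H))/(W%*%H%*%t(H))
        ht = transpose(h)
        w = multiplicative_step(w, matmul(v, ht), matmul(matmul(w, h), ht))
        # H = H*(t(W)%*%V)/(t(W)%*%W%*%H)
        wt = transpose(w)
        h = multiplicative_step(h, matmul(wt, v), matmul(matmul(wt, w), h))
        losses.append(squared_loss(v, w, h))
    return w, h, losses

# Hypothetical 4 x 3 "documents x words" matrix, factorized with 2 topics.
V = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [1.0, 0.0, 1.0], [3.0, 2.0, 5.0]]
W, H, LOSSES = gnmf(V, rank=2)
```

Each variant on the slide (regularizers, weighted loss, KL-divergence) changes only the numerator and denominator passed to the update step, which is why a hand-written parallel implementation of one variant does not carry over to the others.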

What is common across these analytics tasks?
Variety of problems and solutions
–Every customer’s data problems are unique in some
way
–Need to quickly implement new business logic
–Need to experiment with multiple algorithms for a
particular analytic problem
Quality of answers is very important !!
–High quality analytics requires “complex” programs
–Skilled developers + domain experts
Performance is critical
–Bigger data demands faster execution
• Social Media:
Twitter alone has 400M+ messages / day; 1TB+ per day
• Financial Data:
SEC alone has 20M+ filings, several TBs of data, with documents ranging
from a few KBs to a few MBs
• Machine Data:
One application server under moderate load at medium logging level
produces 1GB of logs per day

Declarative Systems: The Relational World
Task: Compute average salary for each department
SQL Query:
select D.did, avg(E.salary)
from Employee E, Department D
where E.did = D.did
group by D.did
Optimization: the Query Optimizer picks an Execution Strategy over Tables, Indices, …
Declarative High-level Language
User specifies tasks in a high-level language, w/o specifying algorithms for
data processing
Query Optimization
System uses optimization strategies to choose from alternate execution plans
Physical Data Independence
User does not have to worry about physical data representation and
access aids while writing queries; system manages the physical layer
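The query on this slide runs unchanged on any SQL engine, which is the point of the declarative contract: the user never says how to join or aggregate. A minimal sketch using Python's built-in sqlite3 module; the table layouts, the sample rows and the added "order by" (for a stable result order) are assumptions for illustration.

```python
import sqlite3

def average_salaries(employees, departments):
    """Run the slide's query on an in-memory SQLite database.

    employees: list of (eid, did, salary); departments: list of (did, name).
    Both layouts are hypothetical; the query only uses did and salary.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("create table Department (did integer, name text)")
    conn.execute("create table Employee (eid integer, did integer, salary real)")
    conn.executemany("insert into Department values (?, ?)", departments)
    conn.executemany("insert into Employee values (?, ?, ?)", employees)
    rows = conn.execute(
        "select D.did, avg(E.salary) "
        "from Employee E, Department D "
        "where E.did = D.did "
        "group by D.did "
        "order by D.did"  # added only to make the output order deterministic
    ).fetchall()
    conn.close()
    return rows

SAMPLE_DEPARTMENTS = [(1, "Research"), (2, "Sales")]
SAMPLE_EMPLOYEES = [(10, 1, 100.0), (11, 1, 200.0), (12, 2, 90.0)]
RESULT = average_salaries(SAMPLE_EMPLOYEES, SAMPLE_DEPARTMENTS)
```

Nothing in the code names a join method or an index; the engine chooses the execution plan.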

Why did Relational Systems succeed?
Pat Selinger
Boeing said “We can ask questions we could
never find the answers to before. We’re now able
to do more than we could ever do before.”
SIGMOD Record, December 2003
Bruce Lindsay
The invention of nonprocedural specification was
a tremendous simplification that made it much
easier to specify applications. No longer did you
have to say which index to use and which join
method to use to get the job done.
SIGMOD Record, June 2005
Michael Stonebraker
Query optimizers can beat all but the best
DBMS application programmers.
“What Goes Around Comes Around”,
Readings in Database Systems, 4th Edition, 2005

What is common across these analytics tasks?
Variety of problems and solutions
–Every customer’s data problems are unique
in some way
–Need to quickly implement new business logic
–Need to experiment with multiple algorithms
for a particular analytic problem
Quality of answers is very important !!
–High quality analytics requires “complex” programs
–Skilled developers + domain experts
Performance is critical
–Bigger data demands faster execution

What am I going to talk about ?
What makes analytics tasks difficult and what can be
learnt from the success of relational systems
Brief description of declarative systems built at IBM and
the design choices made along the way
SystemT
(Information Extraction)
SystemML
(Machine Learning)
Design
Choices
Analytics
Systems
Data Model
Operations
Language Syntax
Platform

Information Extraction - SystemT

Informal Music Band Reviews from Blogs
Start with Concert Mention
At least 3 occurrences of Music Review Snippet and Generic Review Snippet
Review ends with one of these.
Complete review is
within 200 tokens
Concert Mention Pattern
Consecutive Review Snippets are within 25 tokens
I went … to the OTIS concert last night
a bunch of other bands also playing
The sax player in that band…
They played … “I Will Survive”…
Music
Review
Snippet
Review within
200 tokens

State-of-the-art: Common Pattern Specification Language (CPSL)
A common language to specify and represent extraction rules as cascading grammars
Developed jointly between SRI and Department of Defense (1999)
Example Rule: Band Member name followed within 5 tokens by Instrument clue is a Music Review Snippet
〈Token〉[~ “([A-Z]\w+)\s+[A-Z]\w+”] → 〈BandMember〉
〈Token〉[~ “pipe | guitarist | …”] → 〈Instrument〉
Level 2:
〈BandMember〉 〈Token〉{0,5} 〈Instrument〉 → 〈MusicReviewSnippet〉
[Figure: placeholder text in which “Jon Foreman their lead vocal/guitarist” is first tagged as BandMember … Instrument and then as a ReviewSnippet]

Why is this not sufficient?
Consecutive Review Snippets are within 25 tokens
Start with Concert Mention
At least 3 occurrences of Music Review Snippet or Generic Review Snippet
Review ends with one of these.
Complete review is
within 200 tokens
Counting and aggregations are not natural primitives in grammar and have to be handled in
custom code [Chiticariu, ACL 2010]
Finely tuned grammar-based extraction system, with custom code for counting and
aggregation, took ~ 6 hours to extract reviews from a million web logs

SystemT – Declarative Approach to Information Extraction
AQL: rule language with familiar SQL-like syntax; specify annotator semantics declaratively
SystemT Optimizer: choose an efficient execution plan that implements the semantics; produces a Compiled Operator Graph
SystemT Runtime: highly scalable, embeddable Java runtime; turns an Input Document Stream into an Annotated Document Stream
See SIGMOD 2010 tutorial [Chiticariu et al., 2010]
for details on other recent declarative IE systems

Expressing Music Review Snippet Rule in AQL
BandMember Instrument
0-5 tokens
create view MusicReviewSnippet as
select B.name as member, I.value as instrument,
CombineSpans(B.name,I.value) as review
from BandMember B, Instrument I
where FollowsTok(B.name, I.value, 0, 5);
create view BandMember as
extract regex /[A-Z]\w+\s+[A-Z]\w+/ on D.text as name
from Document D;
Choice of SQL-like syntax for AQL motivated by wider adoption of SQL
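The join in MusicReviewSnippet can be mimicked outside SystemT to see what the predicate does. A minimal sketch in Python, assuming FollowsTok(a, b, min, max) means that span b starts between min and max tokens after span a ends; the whitespace tokenizer, the Span type and the instrument dictionary are hypothetical simplifications, not SystemT's.

```python
import re
from typing import List, NamedTuple

class Span(NamedTuple):
    begin: int  # index of the first token
    end: int    # index one past the last token
    text: str

CAPITALIZED = re.compile(r"[A-Z]\w+")
INSTRUMENT_CLUES = {"guitarist", "drummer", "sax"}  # hypothetical dictionary

def band_members(tokens: List[str]) -> List[Span]:
    """Two consecutive capitalized tokens, like the BandMember regex."""
    return [Span(i, i + 2, " ".join(tokens[i:i + 2]))
            for i in range(len(tokens) - 1)
            if CAPITALIZED.fullmatch(tokens[i]) and CAPITALIZED.fullmatch(tokens[i + 1])]

def instruments(tokens: List[str]) -> List[Span]:
    """Single tokens found in the instrument dictionary."""
    return [Span(i, i + 1, tok) for i, tok in enumerate(tokens)
            if tok.lower().strip(".,!?") in INSTRUMENT_CLUES]

def follows_tok(first: Span, second: Span, min_gap: int, max_gap: int) -> bool:
    """True if second starts min_gap..max_gap tokens after first ends."""
    return min_gap <= second.begin - first.end <= max_gap

def music_review_snippets(text: str, min_gap: int = 0, max_gap: int = 5) -> List[Span]:
    tokens = text.split()  # assumption: plain whitespace tokenization
    return [Span(b.begin, i.end, " ".join(tokens[b.begin:i.end]))  # CombineSpans
            for b in band_members(tokens)
            for i in instruments(tokens)
            if follows_tok(b, i, min_gap, max_gap)]

SNIPPETS = music_review_snippets("Jon Foreman their lead guitarist was great")
```

The nested loop here is just one possible execution plan; in SystemT the optimizer, not the rule writer, decides how the join is evaluated.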

What makes AQL expressive?
Extraction primitives
–Regular Expressions
–Dictionary
Text-specific primitives
–Multi-lingual tokenization and parts-of-speech
–Sentence and paragraph boundary detection
–Span-based predicates
Set-level primitives
–Join
–Block
–Consolidation
–Group By

How will the Music Band Review extractor work in SystemT?
Operator graph:
• Union: Music Review Snippet and Generic Review Snippet patterns form Review Snippet
• Block: find blocks of three or more “Review Snippet”
• Join: blocks of Review Snippet joined with Concert Mention; join predicates enforce additional constraints
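The counting that grammars handle poorly is a set-level operation here. A minimal sketch of the blocking step in Python, assuming snippets are given as (begin, end) token offsets that do not overlap and that a block is a maximal run of snippets whose consecutive gaps stay within the limit; the exact semantics of AQL's block operator may differ.

```python
def find_blocks(spans, max_gap=25, min_size=3):
    """Group spans into maximal runs with gaps <= max_gap tokens.

    Returns the (begin, end) extent of every run with at least min_size spans.
    """
    blocks, run = [], []

    def close_run():
        if len(run) >= min_size:
            blocks.append((run[0][0], max(end for _, end in run)))

    for span in sorted(spans):
        # A gap larger than max_gap ends the current run.
        if run and span[0] - run[-1][1] > max_gap:
            close_run()
            run = []
        run.append(span)
    close_run()
    return blocks

# Hypothetical review snippets: three close together, then two far away.
SNIPPET_SPANS = [(0, 5), (10, 14), (30, 33), (100, 104), (110, 112)]
BLOCKS = find_blocks(SNIPPET_SPANS)
```

The remaining constraints from the task (starts with a Concert Mention, complete review within 200 tokens) would then be join predicates over these blocks.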

Class of Optimizations in SystemT
Rewrite-based: rewrite algebraic operator graph
–Shared Dictionary Matching
–Shared Regular Expression Evaluation
–On-demand tokenization
Tokenization overhead is paid only once
Cost-based: relies on novel selectivity estimation for text-specific operators
–Standard transformations
• E.g., push down selections
–Restricted Span Evaluation
• Evaluate expensive operators on restricted regions of the document
Example: BandMember followed within 5 tokens by Instrument
Plan A: Join BandMember and Instrument
Plan B (Restricted Span Evaluation): identify BandMember, extract text to the right, identify Instrument starting within 5 tokens
Plan C (Restricted Span Evaluation): identify Instrument, extract text to the left, identify BandMember ending within 5 tokens

Performance benefits using SystemT
[Chiticariu et al. ACL’10]
Music Band Review extraction task over a million web logs
– SystemT vs. the grammar implementation
• 10 minutes vs. ~ 6 hours
Named-entity extraction task over multiple document corpora
– SystemT throughput ranges from 400 – 900 KB/sec/core (depending on the size of
the document)
– SystemT vs. State-of-the-Art Learning-based System [Florian et al, CoNLL’03]
~ 50 times higher throughput
– SystemT vs. State-of-the-Art Grammar-based System [ANNIE, Cunningham et al,
ACL’02]
~ 10 - 50 times higher throughput
~ 60 - 90% less memory consumption
Revisiting the Twitter example, for keeping up with today’s tweets with 18 cores
– SystemT takes 30 minutes per day as opposed to running 24/7 for the state-of-the-art
system

Runs fast ! But is SystemT expressive enough to compare on quality ?
[Chiticariu et al. ACL’10, EMNLP’10]
SystemT outperforms current best results on multiple benchmark datasets
– CoNLL 2003
• F-measure between 89% and 92% for Person, Organization and Location
tasks
• Beats the state-of-the-art results consistently by up to 4%
– Enron Email
• F-measure 85% for Person task
• Better than the state-of-the-art result by 7%

What design choices did we make for SystemT ?
Design Choices (SystemT):
Data Model: Document-at-a-time model; data types: Span, Tuple, Relation
Operations: Feature extraction primitives
Language Syntax: SQL-like syntax
Platform: Embeddable runtime deployed in a wide range of execution environments

SystemML

Status Quo of Machine Learning Algorithms
Machine Learning algorithm implementations today
– Specialized languages for Machine Learning
• R, Matlab
• Execution strategy for programs is determined by user
– Low-level implementations
• Directly implement ML algorithms on specific platforms
–Hand-tuned implementations on specialized hardware GPU, BlueGene etc.
But the programmer has to handle
– Performance optimizations due to data and compute platform characteristics
– Parallelization for specific platforms

SystemML Goals
GNMF: V ≈ U = W H
V=readMM(in/V, rows=1e8, cols=1e5);
W=readMM(in/W, rows=1e8, cols=10);
H=readMM(in/H, rows=10, cols=1e5);
max_iteration=20;
i=0;
while(i<max_iteration){
H=H*(t(W)%*%V)/(t(W)%*%W%*%H);
W=W*(V%*%t(H))/(W%*%H%*%t(H));
i=i+1;}
Stack: Higher level Optimizations, Operator Implementations, MapReduce Platform (jobs MR1, MR2, …, MRn)
Provide language to implement ML algorithms
Support specific ML constructs such as cross
validation, bootstrapping, ensembles as first class
citizens
Optimizations based on data and system
characteristics
Scalable operator implementations

SystemML Architecture
DML: Declarative Machine Learning Language
– Retain expressivity of current ML languages including
procedural constructs like while and for loops
High-Level Operator (HOP) Component
– Represent dataflow in DAGs of matrices and scalar operations
– Choose from alternative execution plans using algebraic
rewrites and cost-based optimization
Low-Level Operator (LOP) Component
– Low-level physical execution plan over key-value pairs
– “Piggyback” operations to reduce number of MapReduce jobs
Runtime
– Efficient data representation and implementation of individual
operations in MapReduce framework
– Control module to orchestrate MR jobs

Simple Example of how SystemML works
Statement: A = B * (C / D)
Language: Input DML parsed into statement blocks with typed variables
HOP Component: Binary hop Multiply over B and Binary hop Divide over C, D.
HOP represents the logical flow of the program as DAGs for each statement block.
HOP operates on matrices and scalars
LOP Component: Group lops and Binary lops (Divide, Multiply) over B, C, D.
LOP represents the physical plan for the program with a DAG for each statement block.
LOP operates on key-value pairs and scalars
Runtime: Multiple low-level operators combined in a MapReduce job (MR Job with M1, R1)

Declarative Machine Learning Language
Syntax borrowed from R
What is supported
– Data Types: matrix, vector, scalar
– Statements
• Input/Output, Assignment, Control Structures (while, for), Rand
– Expressions
• Operators : Arithmetic, Comparative, Boolean, Matrix Multiplication
• Built-in Functions : Linear Algebra (transpose, …), Matrix aggregation (colSum, ...) ,
Mathematical (ln, sqrt, …)
– External Functions
– Machine Learning specific constructs : Cross validation, Ensemble learning

Categories of Optimization in SystemML
HOP component
– Algebraic rewrites (e.g., matrix computation reordering)
– Cost-based optimization (e.g., choosing between different plans for matrix multiplication)
– Selection of physical representation of matrices (e.g., cell versus block representation)
LOP component
– Piggybacking (packing lops that can be evaluated together in a single MapReduce job)
Runtime
– Data representation (e.g., sparse versus dense)
– Sparsity-aware operator implementations
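Matrix computation reordering matters because the same product can differ in cost by orders of magnitude depending on the parenthesization. A minimal sketch in Python that counts scalar multiplications for dense matrices, using the GNMF shapes from the SystemML Goals slide (W is 1e8 x 10, H is 10 x 1e5); ignoring sparsity and using this count as the cost model are simplifying assumptions, not SystemML's actual optimizer.

```python
def multiply_cost(left, right):
    """Scalar multiplications and result shape for a dense (n x k) by (k x m) product."""
    rows, inner = left
    inner_right, cols = right
    if inner != inner_right:
        raise ValueError("inner dimensions do not match")
    return rows * inner * cols, (rows, cols)

def cost_left_first(a, b, c):
    """Cost of (A %*% B) %*% C."""
    first, ab = multiply_cost(a, b)
    second, _ = multiply_cost(ab, c)
    return first + second

def cost_right_first(a, b, c):
    """Cost of A %*% (B %*% C)."""
    first, bc = multiply_cost(b, c)
    second, _ = multiply_cost(a, bc)
    return first + second

# Denominator of the H update: t(W) %*% W %*% H
T_W = (10, 10**8)
W_SHAPE = (10**8, 10)
H_SHAPE = (10, 10**5)
LEFT_FIRST = cost_left_first(T_W, W_SHAPE, H_SHAPE)    # (t(W) %*% W) %*% H
RIGHT_FIRST = cost_right_first(T_W, W_SHAPE, H_SHAPE)  # t(W) %*% (W %*% H)
```

Under these assumptions, computing t(W) %*% W first is roughly four orders of magnitude cheaper, and it also avoids materializing the 1e8 x 1e5 intermediate W %*% H.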

Performance Numbers
Gaussian NMF:
V = readMM (example.GNMF.V, rows= 1000, cols=100, nnzs= 2000, format=text);
W = readMM (example.GNMF.W, rows= 1000, cols=20, nnzs= 20000, format=text);
H = readMM (example.GNMF.H, rows= 20, cols=100, nnzs= 2000, format=text);
max_iteration = 10
i = 0
while (i < max_iteration) {
H = H * ((t(W) %*% V) / ( (t(W) %*% W) %*% H))
W = W * ((V %*% t(H)) / ( W %*% (H %*% t(H))))
i = i + 1
}
writeMM (W,example.GNMF.W.result, format=text);
writeMM (H,example.GNMF.H.result, format=text);
Comparison (Data Size; Time per iteration; Lines of Code; Runtime Platform):
In SystemML: 5 billion non zeros (50m X 100k, sparsity 1x10^-3); 1.2 hours; 11 lines of DML code; 40 cores, 4 GB RAM per core
WWW 2010: 4.38 billion non zeros (43.9m X 768m, sparsity 1.3x10^-7); 7 hours; 1500 lines of Java code; SCOPE cluster

Additional Algorithms
PageRank
[Plot: Execution Time (sec) vs. #rows and #columns in G (thousand); G: n x n, sparsity=0.001]
DML PageRank:
G=readMM(in/G, rows=1e6, cols=1e6);
p=readMM(in/p, rows=1e6, cols=1);
e=readMM(in/e, rows=1e6, cols=1);
ut=readMM(in/ut, rows=1, cols=1e6);
alpha=0.85;
max_iteration=20;
i=0;
while(i<max_iteration){
p=alpha*(G%*%p)+(1-alpha)*(e%*%ut%*%p);
i=i+1}
writeMM(p, out/p);
Sparse Linear Regression
[Plot: Execution Time (sec) vs. #rows in V (million); V: d x 100000, sparsity=0.001]
DML Linear Regression:
V=readMM(in/V, rows=1e8, cols=1e5);
b=readMM(in/b, rows=1e8, cols=1);
lambda = 1e-6;
r=-b ;
p=-r ;
norm_r2=sum(r*r);
max_iteration=20;
i=0;
while(i<max_iteration){
q=((t(V) %*% (V %*% p)) + lambda*p)
alpha= norm_r2/(t(p)%*%q);
w=w+alpha*p;
old_norm_r2=norm_r2;
r=r+alpha*q;
norm_r2=sum(r*r);
beta=norm_r2/old_norm_r2;
p=-r+beta*p;
i=i+1;}
writeMM(w, out/w);
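The PageRank update on this slide, p = alpha*(G %*% p) + (1-alpha)*(e %*% ut %*% p), can be checked on a toy graph. A minimal single-machine sketch in plain Python, assuming G is column-stochastic, e is a vector of ones and ut is a uniform row vector with entries 1/n; the 3-node graph is a made-up example.

```python
def pagerank(g, alpha=0.85, max_iteration=20):
    """Power iteration following the DML update on the slide.

    g: n x n column-stochastic matrix as a list of rows,
       g[i][j] = probability of moving from node j to node i.
    """
    n = len(g)
    p = [1.0 / n] * n    # initial rank vector
    e = [1.0] * n        # column vector of ones (assumption)
    ut = [1.0 / n] * n   # uniform row vector (assumption)
    for _ in range(max_iteration):
        gp = [sum(g[i][j] * p[j] for j in range(n)) for i in range(n)]
        ut_p = sum(u * x for u, x in zip(ut, p))  # ut %*% p is a scalar
        p = [alpha * gp[i] + (1 - alpha) * e[i] * ut_p for i in range(n)]
    return p

# Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
G = [
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
]
RANKS = pagerank(G, max_iteration=50)
```

Computing ut %*% p before multiplying by e keeps the teleport term a scalar times a vector; multiplying e %*% ut first would build a dense n x n matrix, which is the kind of ordering decision left to the optimizer in the DML version.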

What design choices did we make for SystemML ?
Design Choices (SystemT | SystemML):
Data Model: Document-at-a-time model; data types: Span, Tuple, Relation | Data types: Matrix, Vector, Scalar
Operations: Feature extraction primitives | Procedural constructs (e.g., while, for); Linear Algebra operations; External Functions; Machine Learning specific constructs (e.g., Cross Validation, Ensemble Learning)
Language Syntax: SQL-like syntax | R-like syntax
Platform: Embeddable runtime deployed in a wide range of execution environments | MapReduce Runtime

Summary

Lessons Learned
– SystemT
• Ships with eight IBM products
• To date have not encountered a request that is not expressible in AQL
– SystemML
• Ships with IBM BigInsights August beta this year
• Declarative is the goal, but procedural constructs are needed to express Machine Learning algorithms
• Users naturally gravitate to procedural constructs. Limiting their use to what is
required to specify “what needs to be done” may need a lot of training
– SystemT
• Choice of SQL-like syntax and Eclipse-based tooling quickly enabled hundreds of users with varied
backgrounds
• But traditional NLP-trainees prompted us to provide a layer on top of AQL with grammar-like syntax
• Business users demand even simpler and more usable tooling
– SystemML
• Early days, but multiple users inside IBM, and almost all are previous R / Matlab users
• Familiar R syntax gets ML users up and running almost immediately
– SystemT
• Document-at-a-time model and all in-memory optimizations
• Demonstrates that an order-of-magnitude throughput improvement can be obtained
• Hardware acceleration further speeds up the execution
– SystemML
• Computation on a large-scale distributed platform
• Initial experiences reinforce the argument “Query optimizers can beat all but the best programmers”

Maintenance
Tooling Research for the Development Life-Cycle
Development
[ACL’11,12,13,CHI’13]
Develop
Analyze Test
Deploy
Refine
Test
Task Analysis
• Concordance Viewer
• Active labeling
• Labeling tool
• Extraction plan
• Track provenance [VLDB’10]
• Contextual clue discovery[CIKM’11]
• Regex learning [EMNLP’08]
• Suggest rule changes [VLDB’10]
• Rule induction [EMNLP’12]
• Dictionary refinement [SIGMOD’13]
• Rule learning
• NE Interface [EMNLP’10]
• Tagger UI [SIGMOD’07]

Eclipse Tools Overview
Categories: Ease of Programming, Automatic Discovery, Performance Tuning
AQL Editor: syntax highlighting, auto-complete, hyperlink navigation
Result Viewer: visualize/compare/evaluate
Explain: show how each result was generated
Workflow UI: end-to-end development wizard
Regex Generator (Regex Learner): generate regular expressions from examples
Pattern Discovery: identify patterns in the data
Profiler: identify performance bottlenecks to be hand tuned

Web Tools Overview
Even for non-programmers
Programming
Canvas:
• Visual construction of extractors
• Customization of existing extractors
Result Viewer: visualize/compare/evaluate
Sharing
Concept catalog: share concepts
Project: share extractor development

Don has the last word …
Don Chamberlin
We set out to help non-programmers interact with
databases to open up access to data to a whole new
class of people who could do things that were
never possible before. The problem that we didn't
think we were working on at all was how to embed
query languages into host languages, or how to
make a language that would serve as an
interchange medium between different systems -
those are the ways in which SQL ultimately turned
out to be very successful,
SQL Reunion, 1995
Maybe ..
Don observed that the success of SQL was due to the language serving as an
interchange medium between systems. In contrast, declarative systems
for analytics may indeed succeed at the purpose SQL was originally
intended for: opening up access to analytics to a whole new class of people

Thank You!