August 30, 2014 
The Power of 
Declarative Analytics 
Yunyao Li 
IBM Almaden Research Center 
Acknowledgement: Shiv, Sekar, Fred, Laura, 
Berthold, and many more to list here. 
© 2014 IBM Corporation
Unlocking the value from big data 
Bank 1: 15%
Bank 2: 15%
Case Study: Sentiment Analysis 
Product catalog, Customer Master Data, … 
Text 
Analytics 
Social Media 
360º Profile
• Relationships
• Product Interests
• Personal Attributes
• Life Events
Statistical 
Analysis, 
Report Gen. 
Bank 3: 5%
Bank 4: 21%
Bank 5: 20%
Bank 6: 23%
Customer 360º
Who can we cross/up sell? 
What are our customers 
thinking of our brand? 
What do our customers want?
FDA approval 
1 compound 
Drug Development Pipeline 
Intervention should happen at critical transition bottlenecks between stages 
(most likely to impact outcome) 
Average cost to develop a drug: $1.2 billion
Phases: 0-I, II, III, IV
Laboratory 
50,000 + 
Compounds 
Pre-Clinical 
250 Compounds 
Clinical 
5 Compounds 
Time to develop a drug: 12-15 years
“..Toxicity and Serious Adverse Events in Late Stage Drug Development 
are the Major Causes of Drug Failure” 
Adapted from PhRMA (Pharmaceutical Research and Manufacturers of America) 2013 profile
Structure
Indication
Contraindication
Mode of action / target
Effect level
Side Effects
Case Study: Drug Discovery
Structured and unstructured data sources
Data-driven decision making 
More efficient clinical trial design, data analytics and drug success / failure predictions
Case Study: Water Cost Index
Who cares about the cost of water?
• Water agencies: Improve credit profile for water infrastructure projects
• Lenders: Better estimate cost and profits of such projects
• Insurers: Better understand underlying risk of such projects
• Consumers: Access water at an affordable price despite increasing population and demand for water
• Financial reports 
• News feeds 
• Websites 
• … 
What is the cost of water 
in different regions? 
Financial Analytics 
• Provides market 
benchmark 
• Spurs growth of financial 
products for both water 
producers and investors
• Wall Street Journal article 
on WCI 
Case Study: Water Cost Index 
• Financial reports 
• News feeds 
• Websites 
• … 
Statistical 
Analysis 
Water Cost Index 
• Uganda signed up as 1st 
customer for WCI 
Text 
Analytics 
Financial Analytics 
• WCI published on ongoing 
basis starting end of Sep. 
2013 
What is this talk about?
What makes analytics tasks difficult and what can be learnt from the success of relational systems
Brief description of declarative systems being built at IBM for 
√ Information Extraction (SystemT) 
√ Machine Learning (SystemML) 
X Entity Resolution (DeeR) 
Data integration
Statistical Analysis / Machine Learning
Information 
Extraction 
Databases 
Semi- 
/Unstructured 
Documents
Challenges in Information Extraction 
Example: Named Entity Recognition
Named Entity Hierarchy: 200+ types: Person, Organization, Location, …
.....………………… 
…….………………. 
Laura Haas 
works for 
IBM in 
San Jose, CA. 
….…………………. 
…..………………… 
Information 
Extraction 
Person: Laura Haas
Org: IBM
Loc: San Jose, CA
Challenges in Information Extraction 
Named Entity Hierarchy: 200+ types: Person, Organization, Location, …
Breadth
Wide varieties of extraction tasks
IE development takes effort!!
• Collecting dictionaries
• Writing regular expressions
• Collecting other word-level features
Labeling + training/tuning machine learning models
or
Writing + testing rules
Challenges in Information Extraction 
Named Entity Hierarchy: 200+ types: Person, Organization, Location, …
Breadth
Wide varieties of extraction tasks
IE development takes effort!!
Domain customization is usually required!!
Entity Boundary: Person or Position + Person?
… Pres. Barack Obama arrived today at the White House …
Entity Definition: Location/Facility/Organization?
Complexity
In development and customization
Challenges in Information Extraction 
Named Entity Hierarchy: 200+ types: Person, Organization, Location, …
Breadth
Wide varieties of extraction tasks
IE development takes effort!!
State-of-the-art Open-Source Rule-based System
• 80,000+ dictionary entries
• 4,800 lines of JAPE and Java code
• Accuracy (English): 50%-80%
• Performance: 20KB/sec, 8GB RAM
State-of-the-art Machine-learning system
• Combination of 4 classifiers
• 150,000+ dictionary entries
• 15+ regexes for word features
• Accuracy: 89%
• Throughput: ~ 10 KB/sec
Domain customization is usually required!!
Entity Boundary: Person or Position + Person?
… Pres. Barack Obama arrived today at the White House …
Entity Definition: Location/Facility/Organization?
Complexity
In development and customization
Scale
450M+ tweets per day, …
Challenges in Scalable Machine Learning
V ≈ W H, with V (documents x words), W (documents x topics), H (topics x words)
Scale
450M+ tweets per day, …
[Liu, WWW 2010]
• ~1500 lines of Java code
• Billions of non-zeros within tens of hours
• Careful partitioning of data
• Maximize data locality and parallelism
Parallel implementation is half the story!
Typical application requires experimenting with multiple variants
Breadth
Wide varieties of ML models
Base algorithm:
% initialize W, H
while (~converged)
W = W*(V%*%t(H))/(W%*%H%*%t(H))
H = H*(t(W)%*%V)/(t(W)%*%W%*%H)
end
Regularizers JW, JH:
W = W*max(V%*%t(H) - alphaW*JW, 0)/(W%*%H%*%t(H))
H = H*max(t(W)%*%V - alphaH*JH, 0)/(t(W)%*%W%*%H)
Weighted Sq Loss / Matrix Completion Setting:
W = W*((S*V)%*%t(H))/((S*(W%*%H))%*%t(H))
H = H*(t(W)%*%(S*V))/(t(W)%*%(S*(W%*%H)))
Different Loss function (KL-divergence):
W = W*(V/(W%*%H) %*% t(H))/(E%*%t(H))
H = H*(t(W)%*%(V/(W%*%H)))/(t(W)%*%E)
Complexity
In implementation
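The base multiplicative updates shown on this slide can be tried on a toy matrix. The following is a minimal sketch in plain Python, not the implementation from [Liu, WWW 2010] or from SystemML: the helper names (`matmul`, `transpose`, `frobenius_error`, `gnmf`), the random initialization, the fixed iteration count and the small `eps` added to each denominator are all assumptions made for illustration.

```python
import random


def matmul(a, b):
    """Dense product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - W H."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def gnmf(v, rank, iterations=50, seed=0, eps=1e-12):
    """Multiplicative updates for V ~ W H under squared loss (toy scale)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    # Assumed initialization: strictly positive random entries.
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    errors = [frobenius_error(v, w, h)]
    for _ in range(iterations):
        # W = W * (V %*% t(H)) / (W %*% H %*% t(H))
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(matmul(w, h), ht)
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n)]
        # H = H * (t(W) %*% V) / (t(W) %*% W %*% H)
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(m)]
             for k in range(rank)]
        errors.append(frobenius_error(v, w, h))
    return w, h, errors


if __name__ == "__main__":
    v = [[5.0, 3.0, 0.0], [4.0, 0.0, 1.0], [1.0, 1.0, 5.0], [0.0, 1.0, 4.0]]
    w, h, errors = gnmf(v, rank=2)
    print("initial error:", errors[0], "final error:", errors[-1])
```

Each variant on the slide (regularizers, weighted loss, KL-divergence) changes only the two update lines, which is the point of the slide: the algorithm is short, while the hand-written parallel implementation is not.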
What is common across these analytics tasks?
• Breadth: Variety of problems and solutions
–Every customer’s data and problems are unique in some way
–Need to quickly implement new business logic
–Need to experiment with multiple algorithms for a particular analytic problem
• Complexity: Quality of answers is very important!!
–High quality analytics requires “complex” programs
–Skilled developers + domain experts
• Scale: Performance is critical
–Bigger data demands faster execution
  • Social Media: Twitter alone has 400M+ messages / day; 1TB+ per day
  • Financial Data: SEC alone has 20M+ filings, several TBs of data, with documents ranging from a few KBs to a few MBs
  • Machine Data: One application server under moderate load at medium logging level produces 1GB of logs per day
Declarative Systems: The Relational World
Task
Compute average salary for each department
SQL Query
select D.did, avg(E.salary)
from Employee E, Department D
where E.did = D.did
group by D.did
Pipeline: SQL Query → Query Optimizer → Execution Strategy → Tables, Indices
Declarative High-level Language
User specifies tasks in a high-level language, w/o specifying algorithms for data processing
Query Optimization
System uses optimization strategies to choose from alternate execution plans
Physical Data Independence
User does not have to worry about physical data representation and access aids while writing queries; system manages the physical layer
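To make physical data independence concrete, here is a small sketch using Python's built-in `sqlite3` module. It runs the slide's query, adds an index, and runs the identical query again. The table schemas, the sample rows, the index name `emp_did` and the added `order by` clause (for a deterministic row order) are assumptions for illustration and are not taken from the slides.

```python
import sqlite3

# The query text from the slide, plus an "order by" for deterministic output.
QUERY = """
    select D.did, avg(E.salary)
    from Employee E, Department D
    where E.did = D.did
    group by D.did
    order by D.did
"""


def build_database():
    """Create a tiny in-memory database with hypothetical sample data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Department (did INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE Employee (eid INTEGER PRIMARY KEY, did INTEGER, salary REAL);
        INSERT INTO Department VALUES (1, 'Research');
        INSERT INTO Department VALUES (2, 'Sales');
        INSERT INTO Employee VALUES (1, 1, 100.0);
        INSERT INTO Employee VALUES (2, 1, 120.0);
        INSERT INTO Employee VALUES (3, 2, 80.0);
    """)
    return conn


def plan(conn):
    """Return the optimizer's plan description for QUERY."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + QUERY)]


conn = build_database()
before = conn.execute(QUERY).fetchall()
plan_before = plan(conn)

# Change the physical layer only; the query text stays untouched.
conn.execute("CREATE INDEX emp_did ON Employee(did)")
after = conn.execute(QUERY).fetchall()
plan_after = plan(conn)

print(before, after)
print(plan_before)
print(plan_after)
```

The user never names an index or a join method; whether the optimizer picks a different plan after the index is created is up to the engine, and the answers are the same either way.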
Why did Relational Systems succeed?
Pat Selinger 
Boeing said “We can ask questions we could 
never find the answers to before. We’re now able 
to do more than we could ever do before.” 
SIGMOD Record, December 2003 
Bruce Lindsay
The invention of nonprocedural specification was
a tremendous simplification that made it much 
easier to specify applications. No longer did you 
have to say which index to use and which join 
method to use to get the job done. 
SIGMOD Record, June 2005 
Michael Stonebraker 
Query optimizers can beat all but the best 
DBMS application programmers. 
“What Goes Around Comes Around”, 
Readings in Database Systems, 4th Edition, 2005 
What is common across these analytics tasks?
• Breadth: Variety of problems and solutions
–Every customer’s data and problems are unique in some way
–Need to quickly implement new business logic
–Need to experiment with multiple algorithms for a particular analytic problem
• Complexity: Quality of answers is very important!!
–High quality analytics requires “complex” programs
–Skilled developers + domain experts
• Scale: Performance is critical
–Bigger data demands faster execution
What am I going to talk about?
• What makes analytics tasks difficult and what can be learnt from the success of relational systems
• Brief description of declarative systems built at IBM and the design choices made along the way
Analytics Systems: SystemT (Information Extraction), SystemML (Machine Learning)
Design Choices: Data Model, Operations, Language Syntax, Platform
Information Extraction - SystemT
Informal Music Band Reviews from Blogs 
Start with Concert Mention 
At least 3 occurrences of Music Review Snippet and Generic Review Snippet 
Review ends with one of these. 
Complete review is 
within 200 tokens 
Concert Mention Pattern 
Consecutive Review Snippets are within 25 tokens 
I went … to the OTIS concert last night 
a bunch of other bands also playing 
The sax player in that band… 
They played … “I Will Survive”… 
Music 
Review 
Snippet 
Review within 
200 tokens 
State-of-the-art: Common Pattern Specification Language (CPSL)
A common language to specify and represent extraction rules as cascading grammars 
Developed jointly between SRI and Department of Defense (1999) 
Example Rule: Band Member name followed within 5 tokens by Instrument clue is a Music Review Snippet 
Level 2: 〈BandMember〉 〈Token〉{0,5} 〈Instrument〉 → 〈MusicReviewSnippet〉
… BandMember their lead vocal/Instrument … is rewritten to … ReviewSnippet …
Level 1: 〈Token〉[~ “([A-Z]\w+)\s+[A-Z]\w+”] → 〈BandMember〉, 〈Token〉[~ “pipe | guitarist | …”] → 〈Instrument〉
… Jon Foreman their lead vocal/guitarist … is rewritten to … BandMember their lead vocal/Instrument …
Level 0: input text
… Jon Foreman their lead vocal/guitarist …
Why is this not sufficient?
Start with Concert Mention
At least 3 occurrences of Music Review Snippet or Generic Review Snippet
Consecutive Review Snippets are within 25 tokens
Review ends with one of these.
Complete review is within 200 tokens
• Counting and aggregations are not natural primitives in grammar and have to be handled in custom code [Chiticariu, ACL 2010]
• Finely tuned grammar-based extraction system, with custom code for counting and aggregation, took ~ 6 hours to extract reviews from a million web logs
SystemT – Declarative Approach to Information Extraction
Pipeline: AQL → SystemT Optimizer → Compiled Operator Graph → SystemT Runtime
Input Document Stream → SystemT Runtime → Annotated Document Stream
AQL: Rule language with familiar SQL-like syntax; specify annotator semantics declaratively
SystemT Optimizer: Choose an efficient execution plan that implements the semantics
SystemT Runtime: Highly scalable, embeddable Java runtime
See SIGMOD 2010 tutorial [Chiticariu et al., 2010] for details on other recent declarative IE systems
Expressing Music Review Snippet Rule in AQL
Rule: BandMember followed within 0-5 tokens by Instrument
create view MusicReviewSnippet as
select B.name as member, I.value as instrument,
CombineSpans(B.name, I.value) as review
from BandMember B, Instrument I
where FollowsTok(B.name, I.value, 0, 5);
create view BandMember as
extract regex /[A-Z]\w+\s+[A-Z]\w+/ on D.text as name
from Document D;
Choice of SQL-like syntax for AQL motivated by wider adoption of SQL
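The semantics of this rule can be approximated outside AQL. The sketch below is plain Python rather than SystemT: it extracts BandMember and Instrument spans with regular expressions and joins them with a predicate modeled on FollowsTok. The tokenizer, the function names and the instrument word list beyond "pipe" and "guitarist" are assumptions made for illustration, and the real AQL tokenization may count tokens differently.

```python
import re
from itertools import product

# Assumed tokenizer: a token is a run of word characters or a single symbol.
TOKEN = re.compile(r"\w+|[^\w\s]")
BAND_MEMBER = re.compile(r"[A-Z]\w+\s+[A-Z]\w+")
# Hypothetical dictionary; only "pipe" and "guitarist" appear on the slides.
INSTRUMENT = re.compile(r"\b(?:pipe|guitarist|drummer|sax player)\b")


def spans(pattern, text):
    """Return (begin, end) character offsets of every match."""
    return [m.span() for m in pattern.finditer(text)]


def follows_tok(text, left, right, min_tok, max_tok):
    """True if `right` starts after `left` ends, min_tok..max_tok tokens later."""
    if right[0] < left[1]:
        return False
    gap = len(TOKEN.findall(text[left[1]:right[0]]))
    return min_tok <= gap <= max_tok


def music_review_snippets(text):
    """Join BandMember and Instrument spans, like the MusicReviewSnippet view."""
    results = []
    for b, i in product(spans(BAND_MEMBER, text), spans(INSTRUMENT, text)):
        if follows_tok(text, b, i, 0, 5):
            results.append({
                "member": text[b[0]:b[1]],
                "instrument": text[i[0]:i[1]],
                "review": text[b[0]:i[1]],  # analogue of CombineSpans
            })
    return results


if __name__ == "__main__":
    print(music_review_snippets("we saw Jon Foreman their lead vocal/guitarist play."))
```

This version evaluates both extractions over the whole document and then joins them, which is only one possible execution strategy; a declarative rule leaves that choice to the optimizer.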
What makes AQL expressive?
• Extraction primitives
–Regular Expressions
–Dictionary
• Text-specific primitives
–Multi-lingual tokenization and parts-of-speech
–Sentence and paragraph boundary detection
–Span-based predicates
• Set-level primitives
–Join
–Block
–Consolidation
–Group By
How will the Music Band Review extractor work in SystemT?
Operator graph (bottom to top):
• patterns … produce Music Review Snippet and Generic Review Snippet
• Union: Music Review Snippet and Generic Review Snippet form Review Snippet
• Block: find blocks of three or more “Review Snippet”, giving Blocks of Review Snippet
• Join: Concert Mention with Blocks of Review Snippet; join predicates enforce additional constraints
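The Block step is the part that needs counting, which grammars handle poorly. A simplified sketch of such an operator in plain Python follows. It works on token offsets, groups snippets whose gaps are at most 25 tokens, and keeps groups of at least three that fit within 200 tokens. The function name, the greedy grouping and the parameter names are assumptions; the actual AQL block operator may define its semantics differently.

```python
def find_blocks(snippets, max_gap=25, min_count=3, max_length=200):
    """Group review snippets into candidate review blocks.

    snippets: iterable of (begin_token, end_token) offsets, in any order.
    Returns a list of (begin_token, end_token, snippet_count) tuples.
    """
    runs = []
    current = []
    current_end = None
    for begin, end in sorted(snippets):
        # Start a new run when the gap to the previous run exceeds max_gap.
        if current and begin - current_end > max_gap:
            runs.append(current)
            current = []
        current.append((begin, end))
        current_end = end if len(current) == 1 else max(current_end, end)
    if current:
        runs.append(current)

    blocks = []
    for run in runs:
        begin = run[0][0]
        end = max(e for _, e in run)
        # Keep runs with enough snippets that fit in the length budget.
        if len(run) >= min_count and end - begin <= max_length:
            blocks.append((begin, end, len(run)))
    return blocks


if __name__ == "__main__":
    spans = [(10, 14), (20, 26), (40, 45), (120, 125), (130, 133)]
    print(find_blocks(spans))
```

With the sample offsets, the first three snippets form one block and the last two are dropped because there are fewer than three of them.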
Class of Optimizations in SystemT
• Rewrite-based: rewrite algebraic operator graph
–Shared Dictionary Matching
–Shared Regular Expression Evaluation
–On-demand tokenization
• Cost-based: relies on novel selectivity estimation for text-specific operators
–Standard transformations
  • E.g., push down selections
–Restricted Span Evaluation
  • Evaluate expensive operators on restricted regions of the document
Tokenization overhead is paid only once
Example plans for BandMember followed within 5 tokens by Instrument:
Plan A: Join of BandMember and Instrument (followed within 5 tokens)
Plan B (Restricted Span Evaluation): identify BandMember, extract text to the right within 5 tokens, identify Instrument starting within 5 tokens
Plan C (Restricted Span Evaluation): identify Instrument, extract text to the left within 5 tokens, identify BandMember ending within 5 tokens
Performance benefits using SystemT 
[Chiticariu et al. ACL’10] 
 Music Band Review extraction task over a million web logs 
– SystemT vs. the grammar implementation 
• 10 minutes vs. ~ 6 hours 
 Named-entity extraction task over multiple document corpora 
– SystemT throughput ranges from 400 – 900 KB/sec/core (depending on the size of 
the document) 
– SystemT vs. State-of-the-Art Learning-based System [Florian et al, CoNLL’03] 
~ 50 times higher throughput 
– SystemT vs. State-of-the-Art Grammar-based System [ANNIE, Cunningham et al, 
ACL’02] 
~ 10 - 50 times higher throughput 
~ 60 - 90% less memory consumption 
 Revisiting the Twitter example, for keeping up with today’s tweets with 18 cores 
– SystemT takes 30 minutes per day as opposed to running 24/7 for the state-of-the-art 
system
Runs fast ! But is SystemT expressive enough to compare on quality ? 
[Chiticariu et al. ACL’10, EMNLP’10] 
 SystemT outperforms current best results on multiple benchmark datasets 
– CoNLL 2003 
• F-measure between 89% and 92% for Person, Organization and Location 
tasks 
• Beats the state-of-the-art results consistently by up to 4% 
– Enron Email 
• F-measure 85% for Person task 
• Better than the state-of-the-art result by 7% 
What design choices did we make for SystemT?
Analytics Systems: SystemT (Information Extraction), SystemML (Machine Learning)
Design Choices for SystemT:
Data Model: Document-at-a-time model; data types: Span, Tuple, Relation
Operations: Feature extraction primitives; text-specific primitives; set-level primitives
Language Syntax: SQL-like syntax
Platform: Embeddable runtime deployed in a wide range of execution environments
SystemML
Status Quo of Machine Learning Algorithms
 Machine Learning algorithm implementations today 
– Specialized languages for Machine Learning 
• R, Matlab 
• Execution strategy for programs is determined by user 
– Low-level implementations 
• Directly implement ML algorithms on specific platforms 
–Hand-tuned implementations on specialized hardware GPU, BlueGene etc. 
 But the programmer has to handle 
– Performance optimizations due to data and compute platform characteristics 
– Parallelization for specific platforms 
SystemML Goals
GNMF: V ≈ U = W H
V=readMM(in/V, rows=1e8, cols=1e5);
W=readMM(in/W, rows=1e8, cols=10);
H=readMM(in/H, rows=10, cols=1e5);
max_iteration=20;
i=0;
while(i<max_iteration){
H=H*(t(W)%*%V)/(t(W)%*%W%*%H);
W=W*(V%*%t(H))/(W%*%H%*%t(H));
i=i+1;}
Layers: Higher level Optimizations, Operator Implementations, MapReduce Platform (MR1, MR2, …, MRn)
• Provide language to implement ML algorithms
• Support specific ML constructs such as cross validation, bootstrapping, ensembles as first class citizens
• Optimizations based on data and system characteristics
• Scalable operator implementations
SystemML Architecture 
 DML: Declarative Machine Learning Language 
– Retain expressivity of current ML languages including 
procedural constructs like while and for loops 
 High-Level Operator (HOP) Component 
– Represent dataflow in DAGs of matrices and scalar operations 
– Choose from alternative execution plans using algebraic 
rewrites and cost-based optimization 
 Low-Level Operator (LOP) Component 
– Low-level physical execution plan over key-value pairs 
– “Piggyback” operations to reduce number of MapReduce jobs 
 Runtime 
– Efficient data representation and implementation of individual 
operations in MapReduce framework 
– Control module to orchestrate MR jobs 
Simple Example of how SystemML works 
Example statement: A = B * (C / D)
Language: Input DML parsed into statement blocks with typed variables
HOP Component: HOP represents the logical flow of the program as DAGs for each statement block. HOP operates on matrices and scalars. For the example: Binary hop Multiply over B and the result of Binary hop Divide over C and D.
LOP Component: LOP represents the physical plan for the program with a DAG for each statement block. LOP operates on key-value pairs and scalars. For the example: Group lops feed Binary lop Divide (C, D) and Binary lop Multiply (B and the division result).
Runtime: Multiple low-level operators combined in a MapReduce job (MR Job: M1, R1)
Declarative Machine Learning Language
 Syntax borrowed from R 
 What is supported 
– Data Types: matrix, vector, scalar 
– Statements 
• Input/Output, Assignment, Control Structures (while, for), Rand 
– Expressions 
• Operators : Arithmetic, Comparative, Boolean, Matrix Multiplication 
• Built-in Functions : Linear Algebra (transpose, …), Matrix aggregation (colSum, ...) , 
Mathematical (ln, sqrt, …) 
– External Functions 
– Machine Learning specific constructs : Cross validation, Ensemble learning 
Categories of Optimization in SystemML
 HOP component 
– Algebraic rewrites (e.g., matrix computation reordering) 
– Cost-based optimization (e.g., choosing between different plans for matrix multiplication) 
– Selection of physical representation of matrices (e.g., cell versus block representation) 
 LOP component 
– Piggybacking (packing lops that can be evaluated together in a single MapReduce job) 
 Runtime 
– Data representation (e.g., sparse versus dense) 
– Sparsity-aware operator implementations 
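Matrix computation reordering is easy to motivate with a cost count. The sketch below compares two orderings of t(W) %*% W %*% H by counting scalar multiplications for dense matrices. It is an illustration only and not SystemML's cost model, which also has to consider sparsity and the MapReduce platform; the function names are assumptions, and the shapes are borrowed from the small Gaussian NMF example (W: 1000 x 20, H: 20 x 100).

```python
def multiply_cost(a_shape, b_shape):
    """Scalar multiplications and result shape for a dense product A %*% B."""
    rows, inner = a_shape
    inner_b, cols = b_shape
    if inner != inner_b:
        raise ValueError("incompatible shapes: %r and %r" % (a_shape, b_shape))
    return rows * inner * cols, (rows, cols)


def plan_costs(n, k, m):
    """Cost of two orderings of t(W) %*% W %*% H with W: n x k and H: k x m."""
    wt, w, h = (k, n), (n, k), (k, m)

    # Plan 1: (t(W) %*% W) %*% H, which goes through a small k x k intermediate.
    cost_a, wtw = multiply_cost(wt, w)
    cost_b, _ = multiply_cost(wtw, h)

    # Plan 2: t(W) %*% (W %*% H), which goes through a large n x m intermediate.
    cost_c, wh = multiply_cost(w, h)
    cost_d, _ = multiply_cost(wt, wh)

    return {
        "(t(W) %*% W) %*% H": cost_a + cost_b,
        "t(W) %*% (W %*% H)": cost_c + cost_d,
    }


if __name__ == "__main__":
    costs = plan_costs(n=1000, k=20, m=100)
    for plan_name, cost in costs.items():
        print(plan_name, cost)
    print("cheapest:", min(costs, key=costs.get))
```

For these shapes the first ordering needs 440,000 multiplications and the second 4,000,000, so the two algebraically equal expressions differ by roughly a factor of nine in work.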
Performance Numbers 
Gaussian NMF: 
V = readMM (example.GNMF.V, rows= 1000, cols=100, nnzs= 2000, format=text); 
W = readMM (example.GNMF.W, rows= 1000, cols=20, nnzs= 20000, format=text); 
H = readMM (example.GNMF.H, rows= 20, cols=100, nnzs= 2000, format=text); 
max_iteration = 10 
i = 0 
while (i < max_iteration) {
H = H * ((t(W) %*% V) / ( (t(W) %*% W) %*% H)) 
W = W * ((V %*% t(H)) / ( W %*% (H %*% t(H)))) 
i = i + 1 
} 
writeMM (W,example.GNMF.W.result, format=text); 
writeMM (H,example.GNMF.H.result, format=text); 
In SystemML
• Data Size: 5 billion non zeros (50M x 100K, sparsity 1x10^-3)
• Time per iteration: 1.2 hours
• Lines of Code: 11 lines of DML code
• Runtime Platform: 40 cores, 4 GB RAM per core
WWW 2010
• Data Size: 4.38 billion non zeros (43.9M x 768M, sparsity 1.3x10^-7)
• Time per iteration: 7 hours
• Lines of Code: 1500 lines of Java code
• Runtime Platform: SCOPE cluster
Additional Algorithms
PageRank
DML PageRank
G=readMM(in/G, rows=1e6, cols=1e6);
p=readMM(in/p, rows=1e6, cols=1);
e=readMM(in/e, rows=1e6, cols=1);
ut=readMM(in/ut, rows=1, cols=1e6);
alpha=0.85;
max_iteration=20;
i=0;
while(i<max_iteration){
p=alpha*(G%*%p)+(1-alpha)*(e%*%ut%*%p);
i=i+1;}
writeMM(p, out/p);
[Plot: Execution Time (sec) vs. #rows and #columns in G (thousand); G: n x n, sparsity=0.001]
Sparse Linear Regression
DML Linear Regression
V=readMM(in/V, rows=1e8, cols=1e5);
b=readMM(in/b, rows=1e8, cols=1);
lambda = 1e-6;
r=-b ;
p=-r ;
norm_r2=sum(r*r);
max_iteration=20;
i=0;
while(i<max_iteration){
q=((t(V) %*% (V %*% p)) + lambda*p);
alpha= norm_r2/(t(p)%*%q);
w=w+alpha*p;
old_norm_r2=norm_r2;
r=r+alpha*q;
norm_r2=sum(r*r);
beta=norm_r2/old_norm_r2;
p=-r+beta*p;
i=i+1;}
writeMM(w, out/w);
[Plot: Execution Time (sec) vs. #rows in V (million); V: d x 100000, sparsity=0.001]
What design choices did we make for SystemML?
Analytics Systems: SystemT (Information Extraction), SystemML (Machine Learning)
Data Model
  SystemT: Document-at-a-time model; data types: Span, Tuple, Relation
  SystemML: Data types: Matrix, Vector, Scalar
Operations
  SystemT: Feature extraction primitives; text-specific primitives; set-level primitives
  SystemML: Procedural constructs (e.g., while, for); Linear Algebra operations; External Functions; Machine Learning specific constructs (e.g., Cross Validation, Ensemble Learning)
Language Syntax
  SystemT: SQL-like syntax
  SystemML: R-like syntax
Platform
  SystemT: Embeddable runtime deployed in a wide range of execution environments
  SystemML: MapReduce Runtime
Summary
Lessons Learned
– SystemT 
• Ships with eight IBM products 
• To date have not encountered a request that is not expressible in AQL 
– SystemML 
• Ships with IBM BigInsights August beta this year 
• Declarative is the goal; but to express Machine Learning algorithms procedural constructs are needed 
• Users naturally gravitate to procedural constructs. Limiting usage of such constructs to only when required to specify “what needs to be done” may need a lot of training
– SystemT 
• Choice of SQL-like syntax and Eclipse-based tooling quickly enabled hundreds of users with varied 
background 
• But traditional NLP-trainees prompted us to provide a layer on top of AQL with grammar-like syntax 
• Business users demand even simpler and more usable tooling 
– SystemML 
• Early days, but there are multiple users inside IBM, and almost all are previous R / Matlab users.
• Familiar R syntax gets ML users up and running almost immediately
– SystemT 
• Document at a time model and all in-memory optimizations 
• Demonstrates that an order-of-magnitude throughput improvement can be obtained 
• Hardware acceleration further speeds up the execution
– SystemML 
• Computation on a large-scale distributed platform 
• Initial experiences reinforce the argument “Query optimizers can beat all but the best programmers”
Maintenance 
Tooling Research for the Development Life-Cycle 
Development 
[ACL’11,12,13,CHI’13] 
Develop 
Analyze Test 
Deploy 
Refine 
Test 
Task Analysis 
• Concordance Viewer 
• Active labeling 
• Labeling tool 
• Extraction plan 
• Track provenance [VLDB’10] 
• Contextual clue discovery[CIKM’11] 
• Regex learning [EMNLP’08] 
• Suggest rule changes [VLDB’10] 
• Rule induction [EMNLP’12] 
• Dictionary refinement [SIGMOD’13] 
• Rule learning 
• NE Interface [EMNLP’10] 
• Tagger UI [SIGMOD’07]
Eclipse Tools Overview
Goals: Ease of Programming, Automatic Discovery, Performance Tuning
Screenshots: AQL Editor, Result Viewer, Explain, Pattern Discovery, Regex Learner
AQL Editor: syntax highlighting, auto-complete, hyperlink navigation
Result Viewer: visualize/compare/evaluate
Explain: show how each result was generated
Workflow UI: end-to-end development wizard
Regex Generator: generate regular expressions from examples
Pattern Discovery: identify patterns in the data
Profiler: identify performance bottlenecks to be hand tuned
Web Tools Overview
Goals: Ease of Programming, Ease of Sharing
Even for non-programmers
Canvas:
• Visual construction of extractors
• Customization of existing extractors
Result Viewer: visualize/compare/evaluate
Concept catalog: share concepts
Project: share extractor development
Don has the last word …
Don Chamberlin 
We set out to help non-programmers interact with 
databases to open up access to data to a whole new 
class of people who could do things that were 
never possible before. The problem that we didn't 
think we were working on at all was how to embed 
query languages into host languages, or how to 
make a language that would serve as an 
interchange medium between different systems - 
those are the ways in which SQL ultimately turned 
out to be very successful, 
SQL Reunion, 1995 
Maybe ...
Don observed that the success of SQL was due to the language serving as an
interchange medium between systems. In contrast, declarative systems
for analytics may indeed succeed at the original purpose for which SQL
was intended: opening up access to analytics to a whole new class of people.
Thank You! 
Opendatabay - Open Data Marketplace.pptx
 

The Power of Declarative Analytics

  • 1. August 30, 2014. The Power of Declarative Analytics. Yunyao Li, IBM Almaden Research Center. Acknowledgement: Shiv, Sekar, Fred, Laura, Berthold, and too many more to list here. © 2014 IBM Corporation
  • 2. Unlocking the value from big data
  • 3. Case Study: Sentiment Analysis. Inputs: social media; product catalog, customer master data, … Text analytics builds a Customer 360º profile: relationships, products, personal attributes, interests, life events. Statistical analysis and report generation follow. Questions: Who can we cross/up sell? What are our customers thinking of our brand? What do our customers want? Chart labels: Bank 1 15%, Bank 2 15%, Bank 3 5%, Bank 4 21%, Bank 5 20%, Bank 6 23%.
  • 4. Drug Development Pipeline. Laboratory: 50,000+ compounds. Pre-Clinical: 250 compounds. Clinical (phases 0-I, II, III, IV): 5 compounds. FDA approval: 1 compound. Time to develop a drug: 12-15 years. Avg cost to develop a drug: 1.2 billion. Intervention should happen at critical transition bottlenecks between stages (most likely to impact outcome). “… Toxicity and Serious Adverse Events in Late Stage Drug Development are the Major Causes of Drug Failure.” Adapted from PhRMA (Pharmaceutical Research and Manufacturers of America) 2013 profile.
  • 5. Case Study: Drug Discovery. Structured and unstructured data sources provide: structure, indication, contraindication, mode of action / target, effect level, side effects. Data-driven decision making: more efficient clinical trial design, data analytics and drug success / failure predictions.
  • 6. Case Study: Water Cost Index. Who cares about the cost of water? • Water agencies: improve credit profile for water infrastructure projects • Lenders: better estimate cost and profits of such projects • Insurers: better understand underlying risk of such projects • Consumers: access water at an affordable price despite increasing population and demand for water. What is the cost of water in different regions? Sources: financial reports, news feeds, websites, … Financial analytics: • Provides market benchmark • Spurs growth of financial products for both water producers and investors
  • 7. Case Study: Water Cost Index (WCI). Sources: financial reports, news feeds, websites, … Components: text analytics, statistical analysis, financial analytics, producing the Water Cost Index. • WCI published on an ongoing basis starting end of Sep. 2013 • Wall Street Journal article on WCI • Uganda signed up as 1st customer for WCI
  • 8. What is this talk about? What makes analytics tasks difficult and what can be learnt from the success of relational systems. Brief description of declarative systems being built at IBM for: ✓ Information Extraction (SystemT) ✓ Machine Learning (SystemML) ✗ Entity Resolution (DeeR). Diagram: databases and semi-/unstructured documents feed information extraction, data integration, and statistical analysis / machine learning. 9/10/2014 IBM Research – Almaden
  • 9. Challenges in Information Extraction. Example: Named Entity Recognition. Named entity hierarchy: 200+ types (Person, Organization, Location, …). Input text: “… Laura Haas works for IBM in San Jose, CA. …” Information extraction output: Person = Laura Haas; Org = IBM; Loc = San Jose, CA. IBM Confidential
  • 10. Challenges in Information Extraction. Breadth: wide varieties of extraction tasks. Named entity hierarchy: 200+ types (Person, Organization, Location, …). IE development takes effort! • Collecting dictionaries • Writing regular expressions • Collecting other word-level features • Labeling + training/tuning machine learning models, or writing + testing rules
  • 11. Challenges in Information Extraction. Breadth: wide varieties of extraction tasks; named entity hierarchy with 200+ types (Person, Organization, Location, …). Complexity: in development and customization. IE development takes effort! Domain customization is usually required! Entity boundary: Person, or Position + Person? (“… Pres. Barack Obama arrived today at the White House …”) Entity definition: Location / Facility / Organization?
  • 12. Challenges in Information Extraction. Breadth: wide varieties of extraction tasks; named entity hierarchy with 200+ types (Person, Organization, Location, …). Complexity: in development and customization. IE development takes effort! Domain customization is usually required! Entity boundary: Person, or Position + Person? (“… Pres. Barack Obama arrived today at the White House …”) Entity definition: Location / Facility / Organization? Scale: 450M+ tweets per day, …
    State-of-the-art open-source rule-based system: • 80,000+ dictionary entries • 4,800 lines of JAPE and Java code • Accuracy (English): 50%-80% • Performance: 20 KB/sec, 8 GB RAM
    State-of-the-art machine-learning system: • Combination of 4 classifiers • 150,000+ dictionary entries • 15+ regexes for word features • Accuracy: 89% • Throughput: ~10 KB/sec
  • 13. Challenges in Scalable Machine Learning. Example: topic modeling by non-negative matrix factorization, V ≈ W × H (V: documents × words, W: documents × topics, H: topics × words). [Liu, WWW 2010]: ~1500 lines of Java code • Billions of non-zeros within tens of hours • Careful partitioning of data • Maximize data locality and parallelism
    Base algorithm: % initialize W, H; while (~converged) W = W*(V%*%t(H))/(W%*%H%*%t(H)); H = H*(t(W)%*%V)/(t(W)%*%W%*%H); end
    Parallel implementation is half the story! Typical application requires experimenting with multiple variants:
    Regularizers JW, JH: W = W*max(V%*%t(H) – alphaW*JW, 0)/(W%*%H%*%t(H)); H = H*max(t(W)%*%V – alphaH*JH, 0)/(t(W)%*%W%*%H)
    Weighted squared loss / matrix completion setting: W = W*((S*V)%*%t(H))/((S*(W%*%H))%*%t(H)); H = H*(t(W)%*%(S*V))/(t(W)%*%(S*(W%*%H)))
    Different loss function (KL-divergence): W = W*(V/(W%*%H) %*% t(H))/(E%*%t(H)); H = H*(t(W)%*%(V/(W%*%H)))/(t(W)%*%E)
    Breadth: wide varieties of ML models. Complexity: in implementation. Scale: 450M+ tweets per day, …
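The base multiplicative-update loop on this slide can be sketched in NumPy. This is only an illustrative single-machine version, not the distributed implementation the slide refers to: the function name `gnmf`, the random initialisation, the fixed iteration count (in place of a convergence test) and the small `eps` guard against division by zero are all assumptions added here.

```python
import numpy as np

def gnmf(V, rank, iterations=200, eps=1e-9, seed=0):
    """Multiplicative-update NMF for squared loss: V is approximated by W @ H."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = V.shape
    W = rng.random((n_docs, rank)) + eps   # documents x topics
    H = rng.random((rank, n_words)) + eps  # topics x words
    for _ in range(iterations):
        # Same order as the slide's loop: update W, then H.
        # `*` and `/` are element-wise, `@` is matrix multiplication.
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        H *= (W.T @ V) / (W.T @ W @ H + eps)
    return W, H

# Hypothetical test data: a non-negative matrix of exact rank 4.
data_rng = np.random.default_rng(1)
V = data_rng.random((30, 4)) @ data_rng.random((4, 20))
W, H = gnmf(V, rank=4)
rel_error = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

Each variant listed on the slide changes only the two update lines, which is the point being made: the algebra is short, while the parallel implementation behind it is not.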
  • 14. What is common across these analytics tasks? Breadth: variety of problems and solutions – Every customer’s data problems are unique in some way – Need to quickly implement new business logic – Need to experiment with multiple algorithms for a particular analytic problem. Complexity: quality of answers is very important! – High quality analytics requires “complex” programs – Skilled developers + domain experts. Scale: performance is critical – Bigger data demands faster execution • Social media: Twitter alone has 400M+ messages / day; 1 TB+ per day • Financial data: SEC alone has 20M+ filings, several TBs of data, with documents ranging from a few KBs to a few MBs • Machine data: one application server under moderate load at medium logging level produces 1 GB of logs per day
  • 15. Declarative Systems: The Relational World. Task: compute average salary for each department. SQL query: select D.did, avg(E.salary) from Employee E, Department D where E.did = D.did group by D.did. The query optimizer turns the SQL query into an execution strategy over tables and indices.
    Declarative high-level language: user specifies tasks in a high-level language, without specifying algorithms for data processing.
    Query optimization: system uses optimization strategies to choose from alternate execution plans.
    Physical data independence: user does not have to worry about physical data representation and access aids while writing queries; system manages the physical layer.
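The slide's query can be run as-is against an in-memory SQLite database from Python. The table contents, the index name `emp_did` and the added `ORDER BY` (only there to make the result order deterministic) are assumptions for illustration; the `SELECT` itself is the one on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Department (did INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Employee (eid INTEGER PRIMARY KEY, did INTEGER, salary REAL);
    INSERT INTO Department VALUES (1, 'Research'), (2, 'Sales');
    INSERT INTO Employee VALUES (10, 1, 100.0), (11, 1, 140.0), (12, 2, 90.0);
    CREATE INDEX emp_did ON Employee(did);
""")

QUERY = """
    SELECT D.did, AVG(E.salary)
    FROM Employee E, Department D
    WHERE E.did = D.did
    GROUP BY D.did
    ORDER BY D.did
"""

# The query says what to compute; it names no join method and no index.
averages = conn.execute(QUERY).fetchall()

# The engine's chosen execution strategy can be inspected separately.
plan = conn.execute("EXPLAIN QUERY PLAN " + QUERY).fetchall()
```

Dropping or adding the index changes the rows in `plan` but not the query text or `averages`, which is physical data independence in miniature.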
  • 16. Why did Relational Systems succeed?
    Pat Selinger: Boeing said, “We can ask questions we could never find the answers to before. We’re now able to do more than we could ever do before.” (SIGMOD Record, December 2003)
    Bruce Lindsay: “The invention of nonprocedural specification was a tremendous simplification that made it much easier to specify applications. No longer did you have to say which index to use and which join method to use to get the job done.” (SIGMOD Record, June 2005)
    Michael Stonebraker: “Query optimizers can beat all but the best DBMS application programmers.” (“What Goes Around Comes Around”, Readings in Database Systems, 4th Edition, 2005)
  • 17. What is common across these analytics tasks? Breadth: variety of problems and solutions – Every customer’s data problems are unique in some way – Need to quickly implement new business logic – Need to experiment with multiple algorithms for a particular analytic problem. Complexity: quality of answers is very important! – High quality analytics → “complex” programs – Skilled developers + domain experts. Scale: performance is critical – Bigger data → faster execution
  • 18. What am I going to talk about? What makes analytics tasks difficult and what can be learnt from the success of relational systems. Brief description of declarative systems built at IBM and the design choices made along the way. Analytics systems: SystemT (Information Extraction), SystemML (Machine Learning). Design choices: data model, operations, language syntax, platform.
  • 19. Information Extraction - SystemT
  • 20. Informal Music Band Reviews from Blogs. Rule: start with a Concert Mention; at least 3 occurrences of Music Review Snippet and Generic Review Snippet; consecutive Review Snippets are within 25 tokens; review ends with one of these; complete review is within 200 tokens. Example text: “I went … to the OTIS concert last night” (Concert Mention pattern); “a bunch of other bands also playing”; “The sax player in that band…”, “They played … ‘I Will Survive’…” (Music Review Snippets).
  • 21. State-of-the-art: Common Pattern Specification Language (CPSL). A common language to specify and represent extraction rules as cascading grammars. Developed jointly between SRI and Department of Defense (1999). Example rule: a Band Member name followed within 5 tokens by an Instrument clue is a Music Review Snippet.
    Level 2: 〈BandMember〉 〈Token〉{0,5} 〈Instrument〉 → 〈MusicReviewSnippet〉
    Level 1: 〈Token〉[~ “([A-Z]\w+)\s+[A-Z]\w+”] → 〈BandMember〉; 〈Token〉[~ “pipe | guitarist | …”] → 〈Instrument〉
    Level 0: tokenized text, e.g. “… Jon Foreman their lead vocal/guitarist …”
  • 22. Why is this not sufficient? The review rule requires: start with a Concert Mention; at least 3 occurrences of Music Review Snippet or Generic Review Snippet; consecutive Review Snippets are within 25 tokens; review ends with one of these; complete review is within 200 tokens. Counting and aggregations are not natural primitives in grammar and have to be handled in custom code. [Chiticariu, ACL 2010] A finely tuned grammar-based extraction system, with custom code for counting and aggregation, took ~6 hours to extract reviews from a million web logs.
  • 23. SystemT – Declarative Approach to Information Extraction. AQL: rule language with familiar SQL-like syntax; specify annotator semantics declaratively. SystemT Optimizer: choose an efficient execution plan that implements the semantics, producing a compiled operator graph. SystemT Runtime: highly scalable, embeddable Java runtime; takes an input document stream and emits an annotated document stream. See SIGMOD 2010 tutorial [Chiticariu et al., 2010] for details on other recent declarative IE systems.
  • 24. Expressing Music Review Snippet Rule in AQL (BandMember, then 0-5 tokens, then Instrument):
    create view MusicReviewSnippet as select B.name as member, I.value as instrument, CombineSpans(B.name, I.value) as review from BandMember B, Instrument I where FollowsTok(B.name, I.value, 0, 5);
    create view BandMember as extract regex /[A-Z]\w+\s+[A-Z]\w+/ on D.text as name from Document D;
    Choice of SQL-like syntax for AQL motivated by wider adoption of SQL.
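To make the semantics of the MusicReviewSnippet rule concrete, here is a rough Python emulation: match BandMember with a regular expression, match Instrument with a dictionary, and join the two on "Instrument follows BandMember within 0 to 5 tokens". This is not AQL and not SystemT's runtime; the tokenizer, the `INSTRUMENTS` dictionary and every function name are hypothetical stand-ins.

```python
import re

TOKEN = re.compile(r"\w+|[^\w\s]")                  # assumed tokenizer
BAND_MEMBER = re.compile(r"[A-Z]\w+\s+[A-Z]\w+")    # two capitalised words
INSTRUMENTS = {"guitarist", "drummer", "vocalist"}  # hypothetical dictionary

def token_spans(text):
    return [m.span() for m in TOKEN.finditer(text)]

def tokens_between(tokens, left_end, right_begin):
    """Count tokens lying entirely between two character offsets."""
    return sum(1 for b, e in tokens if b >= left_end and e <= right_begin)

def music_review_snippets(text, min_tok=0, max_tok=5):
    tokens = token_spans(text)
    members = [m.span() for m in BAND_MEMBER.finditer(text)]
    instruments = [(b, e) for b, e in tokens if text[b:e].lower() in INSTRUMENTS]
    result = []
    for mb, me in members:          # nested-loop join on the token-distance predicate
        for ib, ie in instruments:
            if ib >= me and min_tok <= tokens_between(tokens, me, ib) <= max_tok:
                # member, instrument, and the span covering both
                result.append((text[mb:me], text[ib:ie], text[mb:ie]))
    return result

text = "We saw Jon Foreman, their lead guitarist, play for an hour."
snippets = music_review_snippets(text)
```

The nested loop is just one way to evaluate the join; a declarative rule leaves the system free to pick a cheaper one.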
  • 25. What makes AQL expressive? Extraction primitives – Regular expressions – Dictionary. Text-specific primitives – Multi-lingual tokenization and parts-of-speech – Sentence and paragraph boundary detection – Span-based predicates. Set-level primitives – Join – Block – Consolidation – Group By
  • 26. How will the Music Band Review extractor work in SystemT? Union: Music Review Snippet and Generic Review Snippet are combined into Review Snippet. Block: find blocks of three or more Review Snippets. Join: Concert Mention patterns are joined with the blocks of Review Snippets; join predicates enforce additional constraints.
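The Block step (three or more Review Snippets, consecutive ones at most 25 tokens apart) is the kind of counting that the grammar approach pushed into custom code. A simplified Python sketch follows; it assumes snippets are given as (begin, end) token offsets and returns only maximal runs, which is an assumption of this sketch rather than the documented semantics of AQL's block operator.

```python
def blocks(spans, max_gap=25, min_count=3):
    """Group (begin, end) token spans into maximal runs whose consecutive
    members are at most max_gap tokens apart; keep runs with enough members."""
    runs, current = [], []
    for span in sorted(spans):
        # Start a new run when the gap to the previous span is too large.
        if current and span[0] - current[-1][1] > max_gap:
            runs.append(current)
            current = []
        current.append(span)
    if current:
        runs.append(current)
    # Report each qualifying run as (begin, end, number of snippets).
    return [(run[0][0], run[-1][1], len(run)) for run in runs if len(run) >= min_count]

# Hypothetical snippet positions: three close together, then two far away.
snippets = [(10, 14), (20, 26), (40, 45), (120, 124), (130, 133)]
review_blocks = blocks(snippets)
```

A further filter on `end - begin <= 200` would cover the "complete review is within 200 tokens" constraint.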
  • 27. Class of Optimizations in SystemT. Rewrite-based: rewrite algebraic operator graph – Shared dictionary matching – Shared regular expression evaluation – On-demand tokenization. Tokenization overhead is paid only once. Cost-based: relies on novel selectivity estimation for text-specific operators – Standard transformations (e.g., push down selections) – Restricted Span Evaluation: evaluate expensive operators on restricted regions of the document.
    Example (BandMember followed within 5 tokens by Instrument): Plan A: join BandMember with Instrument. Plan B (Restricted Span Evaluation): identify BandMember, extract text to the right, and identify Instrument starting within 5 tokens. Plan C (Restricted Span Evaluation): identify Instrument, extract text to the left, and identify BandMember ending within 5 tokens.
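Plan B can be sketched in a few lines of Python: find BandMember matches first, then run the Instrument dictionary only over the handful of tokens to the right of each match instead of over the whole document. The tokenizer, the `INSTRUMENTS` dictionary and the `lookups` counter are hypothetical and only illustrate why restricting the region reduces work; they are not SystemT internals.

```python
import re
from itertools import islice

TOKEN = re.compile(r"\w+|[^\w\s]")                  # assumed tokenizer
BAND_MEMBER = re.compile(r"[A-Z]\w+\s+[A-Z]\w+")
INSTRUMENTS = {"guitarist", "drummer", "vocalist"}  # hypothetical dictionary

def plan_b(text, max_tok=5):
    """Restricted evaluation: dictionary lookups only right of each BandMember."""
    results, lookups = [], 0
    for member in BAND_MEMBER.finditer(text):
        # With at most max_tok tokens in between, the instrument is one of
        # the next max_tok + 1 tokens after the member span.
        window = islice(TOKEN.finditer(text, member.end()), max_tok + 1)
        for tok in window:
            lookups += 1
            if tok.group().lower() in INSTRUMENTS:
                results.append((member.group(), tok.group()))
    return results, lookups

text = ("Great night. Jon Foreman, their lead guitarist, was superb, "
        "and the crowd sang along to every song.")
results, lookups = plan_b(text)
total_tokens = len(TOKEN.findall(text))   # lookups a full-document scan would need
```

Whether Plan B or Plan C wins depends on which side matches less often, which is where the selectivity estimates mentioned on the slide come in.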
  • 28. Performance benefits using SystemT [Chiticariu et al. ACL’10]. Music Band Review extraction task over a million web logs – SystemT vs. the grammar implementation: 10 minutes vs. ~6 hours. Named-entity extraction task over multiple document corpora – SystemT throughput ranges from 400 to 900 KB/sec/core (depending on the size of the document) – SystemT vs. state-of-the-art learning-based system [Florian et al., CoNLL’03]: ~50 times higher throughput – SystemT vs. state-of-the-art grammar-based system [ANNIE, Cunningham et al., ACL’02]: ~10-50 times higher throughput, ~60-90% less memory consumption. Revisiting the Twitter example, for keeping up with today’s tweets with 18 cores – SystemT takes 30 minutes per day as opposed to running 24/7 for the state-of-the-art system.
  • 29. Runs fast! But is SystemT expressive enough to compare on quality? [Chiticariu et al. ACL’10, EMNLP’10] SystemT outperforms current best results on multiple benchmark datasets – CoNLL 2003 • F-measure between 89% and 92% for Person, Organization and Location tasks • Beats the state-of-the-art results consistently by up to 4% – Enron Email • F-measure 85% for Person task • Better than the state-of-the-art result by 7%
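For readers unfamiliar with the metric, F-measure combines precision and recall. The sketch below assumes the balanced F1 variant (the slide does not say which weighting the benchmarks use), and the counts are made up for illustration, not taken from the cited experiments.

```python
def f_measure(true_positives, false_positives, false_negatives):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)

# Hypothetical extractor output: 90 correct spans, 10 spurious, 10 missed.
score = f_measure(true_positives=90, false_positives=10, false_negatives=10)
```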
  • 30. What design choices did we make for SystemT? Design-choices table, SystemT (Information Extraction) column: Data model: document-at-a-time model; data types: Span, Tuple, Relation. Operations: feature extraction primitives; text-specific primitives; set-level primitives. Language syntax: SQL-like syntax. Platform: embeddable runtime deployed in a wide range of execution environments.
  • 31. SystemML
  • 32. Status Quo of Machine Learning Algorithms. Machine learning algorithm implementations today – Specialized languages for machine learning • R, Matlab • Execution strategy for programs is determined by user – Low-level implementations • Directly implement ML algorithms on specific platforms – Hand-tuned implementations on specialized hardware (GPU, BlueGene, etc.). But the programmer has to handle – Performance optimizations due to data and compute platform characteristics – Parallelization for specific platforms
  • 33. SystemML Goals. Example, GNMF (V ≈ U = W H):
    V=readMM(in/V, rows=1e8, cols=1e5); W=readMM(in/W, rows=1e8, cols=10); H=readMM(in/H, rows=10, cols=1e5); max_iteration=20; i=0; while(i<max_iteration){ H=H*(t(W)%*%V)/(t(W)%*%W%*%H); W=W*(V%*%t(H))/(W%*%H%*%t(H)); i=i+1;}
    Stack: language, higher-level optimizations, operator implementations, MapReduce platform (jobs MR1, MR2, …, MRn).
    Goals: provide a language to implement ML algorithms; support specific ML constructs such as cross validation, bootstrapping, and ensembles as first-class citizens; optimizations based on data and system characteristics; scalable operator implementations.
  • 34. SystemML Architecture DML: Declarative Machine Learning Language – Retain expressivity of current ML languages including procedural constructs like while and for loops High-Level Operator (HOP) Component – Represent dataflow in DAGs of matrices and scalar operations – Choose from alternative execution plans using algebraic rewrites and cost-based optimization Low-Level Operator (LOP) Component – Low-level physical execution plan over key-value pairs – “Piggyback” operations to reduce number of MapReduce jobs Runtime – Efficient data representation and implementation of individual operations in MapReduce framework – Control module to orchestrate MR jobs
  • 35. Simple Example of how SystemML works: A = B * (C / D).
    Language: input DML parsed into statement blocks with typed variables.
    HOP Component: HOP represents the logical flow of the program as DAGs for each statement block and operates on matrices and scalars. Here: a binary hop Divide over C and D feeding a binary hop Multiply with B.
    LOP Component: LOP represents the physical plan for the program with a DAG for each statement block and operates on key-value pairs and scalars. Here: group lops on the inputs B, C, D, a binary lop Divide, and a binary lop Multiply.
    Runtime: multiple low-level operators combined in a MapReduce job. Here: one MR job (M1, R1).
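The logical (HOP-level) view of `A = B * (C / D)` is a small expression DAG over matrices. The toy below builds that DAG and evaluates it cell by cell on nested lists; the `Hop` class, its field names and the `evaluate` function are invented for this sketch and say nothing about how SystemML's compiler is actually structured.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Hop:
    op: str                        # "read", "*", or "/"
    name: Optional[str] = None     # variable name for "read" nodes
    left: Optional["Hop"] = None
    right: Optional["Hop"] = None

def evaluate(hop, env):
    """Evaluate a DAG of element-wise binary operators over list-of-lists matrices."""
    if hop.op == "read":
        return env[hop.name]
    a, b = evaluate(hop.left, env), evaluate(hop.right, env)
    combine = (lambda x, y: x * y) if hop.op == "*" else (lambda x, y: x / y)
    return [[combine(x, y) for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

# A = B * (C / D): a Divide node over C and D feeds a Multiply node with B.
B, C, D = (Hop("read", n) for n in "BCD")
A = Hop("*", left=B, right=Hop("/", left=C, right=D))

env = {"B": [[2.0, 4.0]], "C": [[9.0, 8.0]], "D": [[3.0, 2.0]]}
result = evaluate(A, env)
```

A physical plan would take the same DAG and decide how cells are keyed, grouped and packed into jobs, which is the LOP layer's role on the slide.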
  • 36. Declarative Machine Learning Language. Syntax borrowed from R. What is supported – Data types: matrix, vector, scalar – Statements • Input/Output, Assignment, Control Structures (while, for), Rand – Expressions • Operators: arithmetic, comparative, Boolean, matrix multiplication • Built-in functions: linear algebra (transpose, …), matrix aggregation (colSum, ...), mathematical (ln, sqrt, …) – External functions – Machine learning specific constructs: cross validation, ensemble learning
  • 37. Categories of Optimization in SystemML. HOP component – Algebraic rewrites (e.g., matrix computation reordering) – Cost-based optimization (e.g., choosing between different plans for matrix multiplication) – Selection of physical representation of matrices (e.g., cell versus block representation). LOP component – Piggybacking (packing lops that can be evaluated together in a single MapReduce job). Runtime – Data representation (e.g., sparse versus dense) – Sparsity-aware operator implementations
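Matrix computation reordering matters because matrix multiplication is associative but its cost is not. The sketch below counts scalar multiplications for the two ways of evaluating `t(W) %*% W %*% H`, using the GNMF shapes from the earlier DML example (W is 1e8 × 10, H is 10 × 1e5). It assumes dense matrices and ignores sparsity and data movement, so it is a simplified cost model rather than SystemML's.

```python
def multiply_cost(shape_a, shape_b):
    """Scalar multiplications for a dense (m x k) @ (k x n) product."""
    (m, k), (k2, n) = shape_a, shape_b
    if k != k2:
        raise ValueError("inner dimensions do not match")
    return m * k * n, (m, n)

def left_to_right(a, b, c):
    """Cost of (a @ b) @ c."""
    first, shape_ab = multiply_cost(a, b)
    second, _ = multiply_cost(shape_ab, c)
    return first + second

def right_to_left(a, b, c):
    """Cost of a @ (b @ c)."""
    first, shape_bc = multiply_cost(b, c)
    second, _ = multiply_cost(a, shape_bc)
    return first + second

rows, cols, rank = 10**8, 10**5, 10
tW, W, H = (rank, rows), (rows, rank), (rank, cols)

cost_left = left_to_right(tW, W, H)    # (t(W) %*% W) %*% H
cost_right = right_to_left(tW, W, H)   # t(W) %*% (W %*% H)
```

Under this model the first ordering is cheaper by roughly four orders of magnitude, and it also avoids materialising the 1e8 × 1e5 intermediate `W %*% H`.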
  • 38. Performance Numbers. Gaussian NMF:
    V = readMM (example.GNMF.V, rows= 1000, cols=100, nnzs= 2000, format=text); W = readMM (example.GNMF.W, rows= 1000, cols=20, nnzs= 20000, format=text); H = readMM (example.GNMF.H, rows= 20, cols=100, nnzs= 2000, format=text); max_iteration = 10; i = 0; while (i < max_iteration) { H = H * ((t(W) %*% V) / ( (t(W) %*% W) %*% H)); W = W * ((V %*% t(H)) / ( W %*% (H %*% t(H)))); i = i + 1 } writeMM (W, example.GNMF.W.result, format=text); writeMM (H, example.GNMF.H.result, format=text);
    In SystemML: data size 5 billion non-zeros (50M × 100K, sparsity 1×10^-3); time per iteration 1.2 hours; lines of code: 11 lines of DML code; runtime platform: 40 cores, 4 GB RAM per core.
    WWW 2010: data size 4.38 billion non-zeros (43.9M × 768M, sparsity 1.3×10^-7); time per iteration 7 hours; lines of code: 1500 lines of Java code; runtime platform: SCOPE cluster.
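The non-zero counts in the comparison follow directly from the stated dimensions and sparsities. A quick arithmetic check in Python (the helper name `nonzeros` is just for this sketch):

```python
def nonzeros(rows, cols, sparsity):
    """Expected number of non-zero cells in a rows x cols matrix."""
    return rows * cols * sparsity

systemml_nnz = nonzeros(50e6, 100e3, 1e-3)      # 50M x 100K at sparsity 1e-3
www2010_nnz = nonzeros(43.9e6, 768e6, 1.3e-7)   # 43.9M x 768M at sparsity 1.3e-7
```

Both come out in the 4 to 5 billion range, so the two runs process a similar number of non-zeros even though the matrix shapes differ widely.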
  • 39. Additional Algorithms.
    DML PageRank: G=readMM(in/G, rows=1e6, cols=1e6); p=readMM(in/p, rows=1e6, cols=1); e=readMM(in/e, rows=1e6, cols=1); ut=readMM(in/ut, rows=1, cols=1e6); alpha=0.85; max_iteration=20; i=0; while(i<max_iteration){ p=alpha*(G%*%p)+(1-alpha)*(e%*%ut%*%p); i=i+1} writeMM(p, out/p);
    Plot (PageRank): execution time (sec) vs. #rows and #columns in G (thousand); G: n × n, sparsity = 0.001.
    DML Linear Regression (sparse): V=readMM(in/V, rows=1e8, cols=1e5); b=readMM(in/b, rows=1e8, cols=1); lambda = 1e-6; r=-b; p=-r; norm_r2=sum(r*r); max_iteration=20; i=0; while(i<max_iteration){ q=((t(V) %*% (V %*% p)) + lambda*p) alpha= norm_r2/(t(p)%*%q); w=w+alpha*p; old_norm_r2=norm_r2; r=r+alpha*q; beta=norm_r2/old_norm_r2; p=-r+beta*p; i=i+1;} writeMM(w, out/w);
    Plot (Sparse Linear Regression): execution time (sec) vs. #rows in V (million); V: d × 100000, sparsity = 0.001.
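The PageRank script maps almost line for line onto NumPy. The version below is a small dense sketch under stated assumptions: `G` is taken to be column-stochastic, `e` is a column of ones, `ut` is a uniform row vector, and the 3-page link matrix is made up for illustration.

```python
import numpy as np

def pagerank(G, alpha=0.85, iterations=20):
    n = G.shape[0]
    p = np.full((n, 1), 1.0 / n)     # initial rank vector
    e = np.ones((n, 1))              # column of ones
    ut = np.full((1, n), 1.0 / n)    # uniform teleport row vector
    for _ in range(iterations):
        # Parenthesised so that ut @ p (a 1 x 1 value) is computed first,
        # instead of forming the dense n x n product e @ ut.
        p = alpha * (G @ p) + (1 - alpha) * (e @ (ut @ p))
    return p

# Hypothetical link matrix; column j holds page j's outgoing link weights.
G = np.array([[0.0, 0.0, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 1.0, 0.0]])
p = pagerank(G)
```

The parenthesisation of `e @ (ut @ p)` is a hand-applied instance of the matrix reordering that a declarative system is meant to choose automatically.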
  • 40. What design choices did we make for SystemML? Design-choices table by analytics system:
    Data model: SystemT (Information Extraction): document-at-a-time model; data types Span, Tuple, Relation. SystemML (Machine Learning): data types Matrix, Vector, Scalar.
    Operations: SystemT: feature extraction primitives; text-specific primitives; set-level primitives. SystemML: procedural constructs (e.g., while, for); linear algebra operations; external functions; machine learning specific constructs (e.g., cross validation, ensemble learning).
    Language syntax: SystemT: SQL-like syntax. SystemML: R-like syntax.
    Platform: SystemT: embeddable runtime deployed in a wide range of execution environments. SystemML: MapReduce runtime.
  • 42. Lessons Learned.
    – SystemT • Ships with eight IBM products • To date we have not encountered a request that is not expressible in AQL – SystemML • Ships with IBM BigInsights August beta this year • Declarative is the goal, but procedural constructs are needed to express machine learning algorithms • Users naturally gravitate to procedural constructs. Limiting usage of such constructs to only when required to specify “what needs to be done” may need a lot of training
    – SystemT • Choice of SQL-like syntax and Eclipse-based tooling quickly enabled hundreds of users with varied backgrounds • But traditional NLP trainees prompted us to provide a layer on top of AQL with grammar-like syntax • Business users demand even simpler and more usable tooling – SystemML • Early days, but multiple users inside IBM, and almost all are previous R / Matlab users • Familiar R syntax gets ML users up and running almost immediately
    – SystemT • Document-at-a-time model and all in-memory optimizations • Demonstrates that an order-of-magnitude throughput improvement can be obtained • Hardware acceleration further speeds up the execution – SystemML • Computation on a large-scale distributed platform • Initial experiences reinforce the argument “Query optimizers can beat all but the best programmers”
  • 43. Tooling Research for the Development Life-Cycle. Stages shown: Task Analysis, Develop, Test, Analyze, Refine, Deploy, Maintenance. Development [ACL’11,12,13, CHI’13]. Tools and techniques: • Concordance viewer • Active labeling • Labeling tool • Extraction plan • Track provenance [VLDB’10] • Contextual clue discovery [CIKM’11] • Regex learning [EMNLP’08] • Suggest rule changes [VLDB’10] • Rule induction [EMNLP’12] • Dictionary refinement [SIGMOD’13] • Rule learning • NE interface [EMNLP’10] • Tagger UI [SIGMOD’07]
  • 44. Eclipse Tools Overview. Themes: ease of programming, automatic discovery, performance tuning. AQL Editor: syntax highlighting, auto-complete, hyperlink navigation. Result Viewer: visualize/compare/evaluate. Explain: show how each result was generated. Workflow UI: end-to-end development wizard. Regex Generator: generate regular expressions from examples. Pattern Discovery: identify patterns in the data. Profiler: identify performance bottlenecks to be hand tuned.
  • 45. Web Tools Overview. Themes: ease of programming, ease of sharing, even for non-programmers. Canvas: • Visual construction of extractors • Customization of existing extractors. Result Viewer: visualize/compare/evaluate. Concept catalog: share concepts. Project: share extractor development.
  • 46. Don has the last word … Don Chamberlin: “We set out to help non-programmers interact with databases to open up access to data to a whole new class of people who could do things that were never possible before. The problem that we didn't think we were working on at all was how to embed query languages into host languages, or how to make a language that would serve as an interchange medium between different systems - those are the ways in which SQL ultimately turned out to be very successful.” (SQL Reunion, 1995) Maybe … Don observed that the success of SQL was due to the language serving as an interchange medium between systems. In contrast, declarative systems for analytics may indeed be successful for the original purpose that SQL was intended for: opening up access to analytics to a whole new class of people.
  • 47. Thank You!