SlideShare a Scribd company logo
Analytics Query Engine
Nantia Makrynioti and Vasilis Vassalos
Athens University of Economics and Business

MEDAL 2016, Bordeaux, France
TOWARDS AN
• Huge amount of available data

• Decrease of storage cost

• Great use of systems facilitating data analytics in a
distributed fashion
Towards an Analytics Query EngineNantia Makrynioti & Vasilis Vassalos
INTRODUCTION
CURRENT STATE OF THE ART
• Libraries of algorithms

• Systems that provide operators for developing
distributed algorithms

✦ Combination of procedural and declarative
programming

✦ Integration of declarative operators to imperative
languages

✦ Closer to the style of statistical computing languages
Towards an Analytics Query EngineNantia Makrynioti & Vasilis Vassalos
Implementations of machine learning algorithms on top
of a given distributed framework

final	
  LogisticRegressionModel	
  model	
  =	
  new	
  
LogisticRegressionWithLBFGS()	
  
	
  	
  	
  	
  	
  	
  .setNumClasses(10)	
  
	
  	
  	
  	
  	
  	
  .run(training.rdd());	
  
Examples: Apache Mahout, MLlib, MADlib
Towards an Analytics Query EngineNantia Makrynioti & Vasilis Vassalos
LIBRARIES OF ALGORITHMS
+ Easy to use
- Too coarse grained
A programming model between SQL and Map Reduce

good_urls	
  =	
  FILTER	
  urls	
  BY	
  page	
  rank	
  >	
  0.2;	
  
groups	
  =	
  GROUP	
  good_urls	
  BY	
  category;	
  
Examples: Jaql, Pig Latin, Stratosphere’s Sopremo layer, uSQL
Towards an Analytics Query EngineNantia Makrynioti & Vasilis Vassalos
SYSTEMS OF OPERATORS (1)
A sequence of steps, with
each step performing a single high-
level transformation
Use of variables to store
intermediate results
+ Closer to data analysis users, yet
fairly declarative
- Limited support for iteration
Extensions of the MapReduce model

val	
  textFile	
  =	
  sc.textFile("hdfs://...")	
  
val	
  df	
  =	
  textFile.toDF("line")	
  
val	
  errors	
  =	
  df.filter(col("line").like("%ERROR%"))	
  
errors.count()	
  
Examples: DryadLINQ, Stratosphere’s PACT, Spark,
Tupleware
Towards an Analytics Query EngineNantia Makrynioti & Vasilis Vassalos
SYSTEMS OF OPERATORS (2)
Richer set of operators
Arbitrary
data flow graphs
+ Iteration support
- Considerable amount of user
defined code
R-like programming models

while(i	
  <	
  20)	
  {	
  
	
  	
  	
  	
  H	
  =	
  H	
  *	
  (t(W)	
  %*%	
  V)/(	
  t(W)	
  %*%	
  W	
  %*%	
  H);	
  
	
  	
  	
  	
  W	
  =	
  W	
  *	
  (V	
  %*%	
  t(H)/(W	
  %*%	
  H	
  %*%	
  t(H));	
  
	
  	
  	
  	
  i	
  =	
  i	
  +	
  1;	
  
}	
  
Examples: MLI API, SystemML
SYSTEMS OF OPERATORS (3)
Towards an Analytics Query EngineNantia Makrynioti & Vasilis Vassalos
Support for matrix
operations
+ Easier to express ML algorithms due to linear
algebra operations
- Still closer to imperative programming
good_urls	
  =	
  FILTER	
  urls	
  BY	
  page	
  rank	
  >	
  0.2;	
  
groups	
  =	
  GROUP	
  good_urls	
  BY	
  category;	
  
val	
  points	
  =	
  spark.textFile(…).map(parsePoint).cache()	
  
var	
  w	
  =	
  Vector.random(D);	
  
for(i	
  <-­‐	
  1	
  to	
  ITERATIONS)	
  {	
  
val	
  grad	
  =	
  spark.accumulator(new	
  Vector(D));	
  
for(p<-­‐points)	
  {	
  
val	
  s	
  =	
  (1/(1+exp(-­‐p.y*(w	
  dot	
  p.x)))-­‐1)*p.y	
  
grad	
  +=	
  s	
  *	
  p.x	
  
}	
  
w	
  -­‐=	
  grad.value	
  
Towards an Analytics Query EngineNantia Makrynioti & Vasilis Vassalos
WHAT’S MISSING?
Machine Learning
Algorithms: very close
to imperative
languages
Relational
queries: very close to
SQL style
Can we increase declarativity in

specification of ML algorithms?
Towards an Analytics Query EngineNantia Makrynioti & Vasilis Vassalos
WHAT’S MISSING?
VISION
An analytics query engine that would enable
declarative specification, operator based execution
and cost-based optimization of machine learning
algorithms on distributed processing platforms.
Towards an Analytics Query EngineNantia Makrynioti & Vasilis Vassalos
• A language with a SQL-like syntax

• Use of Datalog to express ML algorithms in a
declarative manner

✦ Translation of existing programming models for analytics
into Datalog programs [13]

✦ Extension of Datalog to cover a wider range of machine
learning tasks [14]
Towards an Analytics Query EngineNantia Makrynioti & Vasilis Vassalos
DECLARATIVE SYNTAX
Hypothesis 

Cost function

Optimization function 

Repeat { }
} Towards an Analytics Query EngineNantia Makrynioti & Vasilis Vassalos
DECOMPOSITION OF
LINEAR REGRESSION
✓j := ✓j a
@
@✓j
J(✓)
h✓(x) = ✓T
x
J(✓) =
1
2m
mX
1
(h✓(x(i)
) y(i)
)2
DEFINE ML OPERATORS
Based on the example of Linear Regression we need:

✦ Linear algebra operators

✦ Aggregation operators

✦ Iteration

✦ Optimization functions are also good candidates to
define operators
• Query rewriting as in databases

✦ Operators for ML algorithms would result in clearer
semantics

• Compiler optimizations: function inlining, SIMD
vectorization
OPTIMIZATION TECHNIQUES
Towards an Analytics Query EngineNantia Makrynioti & Vasilis Vassalos
Mostly
applicable to relational
operators
Useful for user
defined code
More opportunities for
query rewriting
• Optimization techniques from machine learning

✦ Speed up training with large datasets using
stochastic gradient descent instead of batch
gradient descent

✦ Feature scaling for faster convergence

Note that these optimizations may produce slightly
different output
Towards an Analytics Query EngineNantia Makrynioti & Vasilis Vassalos
OPTIMIZING PROGRAMS
• Define a language with a declarative syntax

• Translate this language into a set of operators, which
express a wide range of machine learning tasks

• Based on their semantics, define algebraic properties
for query optimization

• Implement these operators on a distributed
processing framework
ACTION PLAN
Towards an Analytics Query EngineNantia Makrynioti & Vasilis Vassalos
Thank you!
Questions?
Towards an Analytics Query EngineNantia Makrynioti & Vasilis Vassalos
1. Apache mahout. http://mahout.apache.org/. 

2. X. Meng et al. MLlib: Machine learning in apache spark. CoRR, abs/1505.06807, 2015. 

3. J. Cohen et al. Mad skills: New analysis practices for big data. Proc. VLDB Endow., 2(2):1481–1492, Aug. 2009.

4. K. S. Beyer et al. Jaql: A scripting language for large scale semistructured data analysis. PVLDB, 4(12)
1272-1283, 2011

5. C.Olston et al. Pig latin: A not-so-foreign language for data processing. In Proceedings of the 2008 ACM
SIGMOD International Conference on Management of Data, SIGMOD ’08, pages 1099–1110, 2008.

6. A. Alexandrov et al. The stratosphere platform for big data analytics. The VLDB Journal, 23(6):939-964, Dec.
2014.

7. U-SQL. http://usql.io/. 

8. A. Crotty et al. Tupleware: Redefining modern analytics. CoRR, abs/1406.6667, 2014. 

9. A. Ghoting et al. SystemML: Declarative machine learning on MapReduce. In Proceedings of the 2011 IEEE
27th International Conference on Data Engineering, ICDE ’11, pages 231–242, 2011. 

10. E. R. Sparks et al. MLI: An API for distributed machine learning. 

11. Y. Yu et al. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level
language. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation,
OSDI’08, pages 1–14, 2008. 

12. M. Zaharia et al. Spark: Cluster computing with working sets. In Proceedings of the 2Nd USENIX Conference
on Hot Topics in Cloud Computing, HotCloud’10, pages 10–10, 2010.

13. Y. Bu et al. Scaling Datalog for Machine Learning on Big Data. CoRR, abs/1203.0160, 2012.

14. V. Barany et al., Declarative Statistical Modeling with Datalog. CoRR abs/1412.2221, 2014.

REFERENCES
Towards an Analytics Query EngineNantia Makrynioti & Vasilis Vassalos

More Related Content

What's hot

Signals from outer space
Signals from outer spaceSignals from outer space
Signals from outer space
GraphAware
 
(Big) Data Science
(Big) Data Science(Big) Data Science
(Big) Data Science
Michal Bachman
 
TPC-H analytics' scenarios and performances on Hadoop data clouds
TPC-H analytics' scenarios and performances on Hadoop data cloudsTPC-H analytics' scenarios and performances on Hadoop data clouds
TPC-H analytics' scenarios and performances on Hadoop data clouds
Rim Moussa
 
Machine Learning Powered by Graphs - Alessandro Negro
Machine Learning Powered by Graphs - Alessandro NegroMachine Learning Powered by Graphs - Alessandro Negro
Machine Learning Powered by Graphs - Alessandro Negro
GraphAware
 
Big Data Analytics: From SQL to Machine Learning and Graph Analysis
Big Data Analytics: From SQL to Machine Learning and Graph AnalysisBig Data Analytics: From SQL to Machine Learning and Graph Analysis
Big Data Analytics: From SQL to Machine Learning and Graph Analysis
Yuanyuan Tian
 
Graph Analytics
Graph AnalyticsGraph Analytics
Graph Analytics
Khalid Salama
 
Graph computation
Graph computationGraph computation
Graph computation
Sigmoid
 
Towards an Incremental Schema-level Index for Distributed Linked Open Data G...
Towards an Incremental Schema-level Index  for Distributed Linked Open Data G...Towards an Incremental Schema-level Index  for Distributed Linked Open Data G...
Towards an Incremental Schema-level Index for Distributed Linked Open Data G...
Till Blume
 
Hundreds of queries in the time of one - Gianmario Spacagna
Hundreds of queries in the time of one - Gianmario SpacagnaHundreds of queries in the time of one - Gianmario Spacagna
Hundreds of queries in the time of one - Gianmario Spacagna
Spark Summit
 
GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)
GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)
GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)
Ankur Dave
 
Spark graphx
Spark graphxSpark graphx
Spark graphx
Carol McDonald
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communities
Paco Nathan
 
Big Data, Bigger Analytics
Big Data, Bigger AnalyticsBig Data, Bigger Analytics
Big Data, Bigger Analytics
Itzhak Kameli
 
Employing Graph Databases as a Standardization Model towards Addressing Heter...
Employing Graph Databases as a Standardization Model towards Addressing Heter...Employing Graph Databases as a Standardization Model towards Addressing Heter...
Employing Graph Databases as a Standardization Model towards Addressing Heter...
Dippy Aggarwal
 
Vldb14
Vldb14Vldb14
Vldb14
hdbtracker
 
E05312426
E05312426E05312426
E05312426
IOSR-JEN
 
Analyzing Flight Delays with Apache Spark, DataFrames, GraphFrames, and MapR-DB
Analyzing Flight Delays with Apache Spark, DataFrames, GraphFrames, and MapR-DBAnalyzing Flight Delays with Apache Spark, DataFrames, GraphFrames, and MapR-DB
Analyzing Flight Delays with Apache Spark, DataFrames, GraphFrames, and MapR-DB
Carol McDonald
 
Big Graph : Tools, Techniques, Issues, Challenges and Future Directions
Big Graph : Tools, Techniques, Issues, Challenges and Future Directions Big Graph : Tools, Techniques, Issues, Challenges and Future Directions
Big Graph : Tools, Techniques, Issues, Challenges and Future Directions
csandit
 
Slide 1
Slide 1Slide 1
Slide 1
butest
 

What's hot (19)

Signals from outer space
Signals from outer spaceSignals from outer space
Signals from outer space
 
(Big) Data Science
(Big) Data Science(Big) Data Science
(Big) Data Science
 
TPC-H analytics' scenarios and performances on Hadoop data clouds
TPC-H analytics' scenarios and performances on Hadoop data cloudsTPC-H analytics' scenarios and performances on Hadoop data clouds
TPC-H analytics' scenarios and performances on Hadoop data clouds
 
Machine Learning Powered by Graphs - Alessandro Negro
Machine Learning Powered by Graphs - Alessandro NegroMachine Learning Powered by Graphs - Alessandro Negro
Machine Learning Powered by Graphs - Alessandro Negro
 
Big Data Analytics: From SQL to Machine Learning and Graph Analysis
Big Data Analytics: From SQL to Machine Learning and Graph AnalysisBig Data Analytics: From SQL to Machine Learning and Graph Analysis
Big Data Analytics: From SQL to Machine Learning and Graph Analysis
 
Graph Analytics
Graph AnalyticsGraph Analytics
Graph Analytics
 
Graph computation
Graph computationGraph computation
Graph computation
 
Towards an Incremental Schema-level Index for Distributed Linked Open Data G...
Towards an Incremental Schema-level Index  for Distributed Linked Open Data G...Towards an Incremental Schema-level Index  for Distributed Linked Open Data G...
Towards an Incremental Schema-level Index for Distributed Linked Open Data G...
 
Hundreds of queries in the time of one - Gianmario Spacagna
Hundreds of queries in the time of one - Gianmario SpacagnaHundreds of queries in the time of one - Gianmario Spacagna
Hundreds of queries in the time of one - Gianmario Spacagna
 
GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)
GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)
GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)
 
Spark graphx
Spark graphxSpark graphx
Spark graphx
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communities
 
Big Data, Bigger Analytics
Big Data, Bigger AnalyticsBig Data, Bigger Analytics
Big Data, Bigger Analytics
 
Employing Graph Databases as a Standardization Model towards Addressing Heter...
Employing Graph Databases as a Standardization Model towards Addressing Heter...Employing Graph Databases as a Standardization Model towards Addressing Heter...
Employing Graph Databases as a Standardization Model towards Addressing Heter...
 
Vldb14
Vldb14Vldb14
Vldb14
 
E05312426
E05312426E05312426
E05312426
 
Analyzing Flight Delays with Apache Spark, DataFrames, GraphFrames, and MapR-DB
Analyzing Flight Delays with Apache Spark, DataFrames, GraphFrames, and MapR-DBAnalyzing Flight Delays with Apache Spark, DataFrames, GraphFrames, and MapR-DB
Analyzing Flight Delays with Apache Spark, DataFrames, GraphFrames, and MapR-DB
 
Big Graph : Tools, Techniques, Issues, Challenges and Future Directions
Big Graph : Tools, Techniques, Issues, Challenges and Future Directions Big Graph : Tools, Techniques, Issues, Challenges and Future Directions
Big Graph : Tools, Techniques, Issues, Challenges and Future Directions
 
Slide 1
Slide 1Slide 1
Slide 1
 

Similar to towards_analytics_query_engine

Swift Parallel Scripting for High-Performance Workflow
Swift Parallel Scripting for High-Performance WorkflowSwift Parallel Scripting for High-Performance Workflow
Swift Parallel Scripting for High-Performance Workflow
Daniel S. Katz
 
Workflow Scheduling Techniques and Algorithms in IaaS Cloud: A Survey
Workflow Scheduling Techniques and Algorithms in IaaS Cloud: A Survey Workflow Scheduling Techniques and Algorithms in IaaS Cloud: A Survey
Workflow Scheduling Techniques and Algorithms in IaaS Cloud: A Survey
IJECEIAES
 
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
Jürgen Ambrosi
 
Svm Classifier Algorithm for Data Stream Mining Using Hive and R
Svm Classifier Algorithm for Data Stream Mining Using Hive and RSvm Classifier Algorithm for Data Stream Mining Using Hive and R
Svm Classifier Algorithm for Data Stream Mining Using Hive and R
IRJET Journal
 
A Survey on Data Mapping Strategy for data stored in the storage cloud 111
A Survey on Data Mapping Strategy for data stored in the storage cloud  111A Survey on Data Mapping Strategy for data stored in the storage cloud  111
A Survey on Data Mapping Strategy for data stored in the storage cloud 111
NavNeet KuMar
 
MAD skills for analysis and big data Machine Learning
MAD skills for analysis and big data Machine LearningMAD skills for analysis and big data Machine Learning
MAD skills for analysis and big data Machine Learning
Gianvito Siciliano
 
Apache Hadoop India Summit 2011 talk "Making Hadoop Enterprise Ready with Am...
Apache Hadoop India Summit 2011 talk  "Making Hadoop Enterprise Ready with Am...Apache Hadoop India Summit 2011 talk  "Making Hadoop Enterprise Ready with Am...
Apache Hadoop India Summit 2011 talk "Making Hadoop Enterprise Ready with Am...
Yahoo Developer Network
 
MAP-REDUCE IMPLEMENTATIONS: SURVEY AND PERFORMANCE COMPARISON
MAP-REDUCE IMPLEMENTATIONS: SURVEY AND PERFORMANCE COMPARISONMAP-REDUCE IMPLEMENTATIONS: SURVEY AND PERFORMANCE COMPARISON
MAP-REDUCE IMPLEMENTATIONS: SURVEY AND PERFORMANCE COMPARISON
ijcsit
 
MACHINE LEARNING ON MAPREDUCE FRAMEWORK
MACHINE LEARNING ON MAPREDUCE FRAMEWORKMACHINE LEARNING ON MAPREDUCE FRAMEWORK
MACHINE LEARNING ON MAPREDUCE FRAMEWORK
Abhi Jit
 
What's New in Revolution R Enterprise 6.2
What's New in Revolution R Enterprise 6.2What's New in Revolution R Enterprise 6.2
What's New in Revolution R Enterprise 6.2
Revolution Analytics
 
Mastering MapReduce: MapReduce for Big Data Management and Analysis
Mastering MapReduce: MapReduce for Big Data Management and AnalysisMastering MapReduce: MapReduce for Big Data Management and Analysis
Mastering MapReduce: MapReduce for Big Data Management and Analysis
Teradata Aster
 
Data Science on Google Cloud Platform
Data Science on Google Cloud PlatformData Science on Google Cloud Platform
Data Science on Google Cloud Platform
Virot "Ta" Chiraphadhanakul
 
Hadoop Integration with Microstrategy
Hadoop Integration with Microstrategy Hadoop Integration with Microstrategy
Hadoop Integration with Microstrategy
snehal parikh
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and Hadoop
Revolution Analytics
 
Implementation of p pic algorithm in map reduce to handle big data
Implementation of p pic algorithm in map reduce to handle big dataImplementation of p pic algorithm in map reduce to handle big data
Implementation of p pic algorithm in map reduce to handle big data
eSAT Publishing House
 
IRJET- Review of Existing Methods in K-Means Clustering Algorithm
IRJET- Review of Existing Methods in K-Means Clustering AlgorithmIRJET- Review of Existing Methods in K-Means Clustering Algorithm
IRJET- Review of Existing Methods in K-Means Clustering Algorithm
IRJET Journal
 
Data science technology overview
Data science technology overviewData science technology overview
Data science technology overview
Soojung Hong
 
Big data serving: Processing and inference at scale in real time
Big data serving: Processing and inference at scale in real timeBig data serving: Processing and inference at scale in real time
Big data serving: Processing and inference at scale in real time
Itai Yaffe
 
Data science and OSS
Data science and OSSData science and OSS
Data science and OSS
Kevin Crocker
 
Azure Databricks for Data Scientists
Azure Databricks for Data ScientistsAzure Databricks for Data Scientists
Azure Databricks for Data Scientists
Richard Garris
 

Similar to towards_analytics_query_engine (20)

Swift Parallel Scripting for High-Performance Workflow
Swift Parallel Scripting for High-Performance WorkflowSwift Parallel Scripting for High-Performance Workflow
Swift Parallel Scripting for High-Performance Workflow
 
Workflow Scheduling Techniques and Algorithms in IaaS Cloud: A Survey
Workflow Scheduling Techniques and Algorithms in IaaS Cloud: A Survey Workflow Scheduling Techniques and Algorithms in IaaS Cloud: A Survey
Workflow Scheduling Techniques and Algorithms in IaaS Cloud: A Survey
 
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
 
Svm Classifier Algorithm for Data Stream Mining Using Hive and R
Svm Classifier Algorithm for Data Stream Mining Using Hive and RSvm Classifier Algorithm for Data Stream Mining Using Hive and R
Svm Classifier Algorithm for Data Stream Mining Using Hive and R
 
A Survey on Data Mapping Strategy for data stored in the storage cloud 111
A Survey on Data Mapping Strategy for data stored in the storage cloud  111A Survey on Data Mapping Strategy for data stored in the storage cloud  111
A Survey on Data Mapping Strategy for data stored in the storage cloud 111
 
MAD skills for analysis and big data Machine Learning
MAD skills for analysis and big data Machine LearningMAD skills for analysis and big data Machine Learning
MAD skills for analysis and big data Machine Learning
 
Apache Hadoop India Summit 2011 talk "Making Hadoop Enterprise Ready with Am...
Apache Hadoop India Summit 2011 talk  "Making Hadoop Enterprise Ready with Am...Apache Hadoop India Summit 2011 talk  "Making Hadoop Enterprise Ready with Am...
Apache Hadoop India Summit 2011 talk "Making Hadoop Enterprise Ready with Am...
 
MAP-REDUCE IMPLEMENTATIONS: SURVEY AND PERFORMANCE COMPARISON
MAP-REDUCE IMPLEMENTATIONS: SURVEY AND PERFORMANCE COMPARISONMAP-REDUCE IMPLEMENTATIONS: SURVEY AND PERFORMANCE COMPARISON
MAP-REDUCE IMPLEMENTATIONS: SURVEY AND PERFORMANCE COMPARISON
 
MACHINE LEARNING ON MAPREDUCE FRAMEWORK
MACHINE LEARNING ON MAPREDUCE FRAMEWORKMACHINE LEARNING ON MAPREDUCE FRAMEWORK
MACHINE LEARNING ON MAPREDUCE FRAMEWORK
 
What's New in Revolution R Enterprise 6.2
What's New in Revolution R Enterprise 6.2What's New in Revolution R Enterprise 6.2
What's New in Revolution R Enterprise 6.2
 
Mastering MapReduce: MapReduce for Big Data Management and Analysis
Mastering MapReduce: MapReduce for Big Data Management and AnalysisMastering MapReduce: MapReduce for Big Data Management and Analysis
Mastering MapReduce: MapReduce for Big Data Management and Analysis
 
Data Science on Google Cloud Platform
Data Science on Google Cloud PlatformData Science on Google Cloud Platform
Data Science on Google Cloud Platform
 
Hadoop Integration with Microstrategy
Hadoop Integration with Microstrategy Hadoop Integration with Microstrategy
Hadoop Integration with Microstrategy
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and Hadoop
 
Implementation of p pic algorithm in map reduce to handle big data
Implementation of p pic algorithm in map reduce to handle big dataImplementation of p pic algorithm in map reduce to handle big data
Implementation of p pic algorithm in map reduce to handle big data
 
IRJET- Review of Existing Methods in K-Means Clustering Algorithm
IRJET- Review of Existing Methods in K-Means Clustering AlgorithmIRJET- Review of Existing Methods in K-Means Clustering Algorithm
IRJET- Review of Existing Methods in K-Means Clustering Algorithm
 
Data science technology overview
Data science technology overviewData science technology overview
Data science technology overview
 
Big data serving: Processing and inference at scale in real time
Big data serving: Processing and inference at scale in real timeBig data serving: Processing and inference at scale in real time
Big data serving: Processing and inference at scale in real time
 
Data science and OSS
Data science and OSSData science and OSS
Data science and OSS
 
Azure Databricks for Data Scientists
Azure Databricks for Data ScientistsAzure Databricks for Data Scientists
Azure Databricks for Data Scientists
 

towards_analytics_query_engine

  • 1. Analytics Query Engine Nantia Makrynioti and Vasilis Vassalos Athens University of Economics and Business MEDAL 2016, Bordeaux, France TOWARDS AN
  • 2. • Huge amount of available data • Decrease of storage cost • Great use of systems facilitating data analytics in a distributed fashion Towards an Analytics Query EngineNantia Makrynioti & Vasilis Vassalos INTRODUCTION
  • 3. CURRENT STATE OF THE ART • Libraries of algorithms • Systems that provide operators for developing distributed algorithms ✦ Combination of procedural and declarative programming ✦ Integration of declarative operators to imperative languages ✦ Closer to the style of statistical computing languages Towards an Analytics Query EngineNantia Makrynioti & Vasilis Vassalos
  • 4. Implementations of machine learning algorithms on top of a given distributed framework final  LogisticRegressionModel  model  =  new   LogisticRegressionWithLBFGS()              .setNumClasses(10)              .run(training.rdd());   Examples: Apache Mahout, MLlib, MADlib Towards an Analytics Query EngineNantia Makrynioti & Vasilis Vassalos LIBRARIES OF ALGORITHMS + Easy to use - Too coarse grained
  • 5. A programming model between SQL and Map Reduce good_urls  =  FILTER  urls  BY  page  rank  >  0.2;   groups  =  GROUP  good_urls  BY  category;   Examples: Jaql, Pig Latin, Stratosphere’s Sopremo layer, uSQL Towards an Analytics Query EngineNantia Makrynioti & Vasilis Vassalos SYSTEMS OF OPERATORS (1) A sequence of steps, with each step performing a single high- level transformation Use of variables to store intermediate results + Closer to data analysis users, yet fairly declarative - Limited support for iteration
  • 6. Extensions of the MapReduce model val  textFile  =  sc.textFile("hdfs://...")   val  df  =  textFile.toDF("line")   val  errors  =  df.filter(col("line").like("%ERROR%"))   errors.count()   Examples: DryadLINQ, Stratosphere’s PACT, Spark, Tupleware Towards an Analytics Query EngineNantia Makrynioti & Vasilis Vassalos SYSTEMS OF OPERATORS (2) Richer set of operators Arbitrary data flow graphs + Iteration support - Considerable amount of user defined code
  • 7. R-like programming models while(i  <  20)  {          H  =  H  *  (t(W)  %*%  V)/(  t(W)  %*%  W  %*%  H);          W  =  W  *  (V  %*%  t(H)/(W  %*%  H  %*%  t(H));          i  =  i  +  1;   }   Examples: MLI API, SystemML SYSTEMS OF OPERATORS (3) Towards an Analytics Query EngineNantia Makrynioti & Vasilis Vassalos Support for matrix operations + Easier to express ML algorithms due to linear algebra operations - Still closer to imperative programming
  • 8. good_urls  =  FILTER  urls  BY  page  rank  >  0.2;   groups  =  GROUP  good_urls  BY  category;   val  points  =  spark.textFile(…).map(parsePoint).cache()   var  w  =  Vector.random(D);   for(i  <-­‐  1  to  ITERATIONS)  {   val  grad  =  spark.accumulator(new  Vector(D));   for(p<-­‐points)  {   val  s  =  (1/(1+exp(-­‐p.y*(w  dot  p.x)))-­‐1)*p.y   grad  +=  s  *  p.x   }   w  -­‐=  grad.value   Towards an Analytics Query EngineNantia Makrynioti & Vasilis Vassalos WHAT’S MISSING? Machine Learning Algorithms: very close to imperative languages Relational queries: very close to SQL style
  • 9. Can we increase declarativity in
 specification of ML algorithms? Towards an Analytics Query EngineNantia Makrynioti & Vasilis Vassalos WHAT’S MISSING?
  • 10. VISION An analytics query engine that would enable declarative specification, operator based execution and cost-based optimization of machine learning algorithms on distributed processing platforms. Towards an Analytics Query EngineNantia Makrynioti & Vasilis Vassalos
  • 11. • A language with a SQL-like syntax • Use of Datalog to express ML algorithms in a declarative manner ✦ Translation of existing programming models for analytics into Datalog programs [13] ✦ Extension of Datalog to cover a wider range of machine learning tasks [14] Towards an Analytics Query EngineNantia Makrynioti & Vasilis Vassalos DECLARATIVE SYNTAX
  • 12. Hypothesis Cost function Optimization function Repeat { } } Towards an Analytics Query EngineNantia Makrynioti & Vasilis Vassalos DECOMPOSITION OF LINEAR REGRESSION ✓j := ✓j a @ @✓j J(✓) h✓(x) = ✓T x J(✓) = 1 2m mX 1 (h✓(x(i) ) y(i) )2
  • 13. DEFINE ML OPERATORS Based on the example of Linear Regression we need: ✦ Linear algebra operators ✦ Aggregation operators ✦ Iteration ✦ Optimization functions are also good candidates to define operators
  • 14. • Query rewriting as in databases ✦ Operators for ML algorithms would result in clearer semantics • Compiler optimizations: function inlining, SIMD vectorization OPTIMIZATION TECHNIQUES Towards an Analytics Query EngineNantia Makrynioti & Vasilis Vassalos Mostly applicable to relational operators Useful for user defined code More opportunities for query rewriting
  • 15. • Optimization techniques from machine learning ✦ Speed up training with large datasets using stochastic gradient descent instead of batch gradient descent ✦ Feature scaling for faster convergence Note that these optimizations may produce slightly different output Towards an Analytics Query EngineNantia Makrynioti & Vasilis Vassalos OPTIMIZING PROGRAMS
  • 16. • Define a language with a declarative syntax • Translate this language into a set of operators, which express a wide range of machine learning tasks • Based on their semantics, define algebraic properties for query optimization • Implement these operators on a distributed processing framework ACTION PLAN Towards an Analytics Query EngineNantia Makrynioti & Vasilis Vassalos
  • 17. Thank you! Questions? Towards an Analytics Query EngineNantia Makrynioti & Vasilis Vassalos
  • 18. 1. Apache mahout. http://mahout.apache.org/. 2. X. Meng et al. MLlib: Machine learning in apache spark. CoRR, abs/1505.06807, 2015. 3. J. Cohen et al. Mad skills: New analysis practices for big data. Proc. VLDB Endow., 2(2):1481–1492, Aug. 2009. 4. K. S. Beyer et al. Jaql: A scripting language for large scale semistructured data analysis. PVLDB, 4(12) 1272-1283, 2011 5. C.Olston et al. Pig latin: A not-so-foreign language for data processing. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD ’08, pages 1099–1110, 2008. 6. A. Alexandrov et al. The stratosphere platform for big data analytics. The VLDB Journal, 23(6):939-964, Dec. 2014. 7. U-SQL. http://usql.io/. 8. A. Crotty et al. Tupleware: Redefining modern analytics. CoRR, abs/1406.6667, 2014. 9. A. Ghoting et al. SystemML: Declarative machine learning on MapReduce. In Proceedings of the 2011 IEEE 27th International Conference on Data Engineering, ICDE ’11, pages 231–242, 2011. 10. E. R. Sparks et al. MLI: An API for distributed machine learning. 11. Y. Yu et al. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, OSDI’08, pages 1–14, 2008. 12. M. Zaharia et al. Spark: Cluster computing with working sets. In Proceedings of the 2Nd USENIX Conference on Hot Topics in Cloud Computing, HotCloud’10, pages 10–10, 2010. 13. Y. Bu et al. Scaling Datalog for Machine Learning on Big Data. CoRR, abs/1203.0160, 2012. 14. V. Barany et al., Declarative Statistical Modeling with Datalog. CoRR abs/1412.2221, 2014.
 REFERENCES Towards an Analytics Query EngineNantia Makrynioti & Vasilis Vassalos