MAD SKILLS FOR ANALYSIS 
AND 
BIG DATA MACHINE LEARNING 
University of Helsinki 
Gianvito Siciliano 
(2014 - Distributed Computing Frameworks for Big Data Seminar)
COMPARISON OF 
• APPROACHES 
• PLATFORMS 
• ALGORITHMS
AGENDA 
1. Analysis intro: 
• needed skills (MAD) 
• important areas (IS, ML) 
2. Big Data intensive approaches: 
• HPC, ABDS, BDAS 
3. Machine Learning tool generations 
• SAS, Weka, Hadoop, Mahout, HaLoop, Spark (…) 
4. Large scale (ML) algorithms comparison 
• K-means, LogReg
Why data analysis? 
“So, what’s getting ubiquitous and cheap? Data. And what is 
complementary to data? Analysis.” 
The value of data analysis has entered common culture: it helps uncover the 
unexpected in your data.
How to make sense of data? 
The MAD acronym covers three inherent aspects 
of big data analysis: 
Magnetic: it concerns attracting data from heterogeneous 
sources, regardless of the quality of the data.
Agile: it is about performing analysis quickly, to obtain 
actions that maximize the value for the business.
Deep: it enables analysts to apply both sophisticated 
statistical methods and the best-performing ML algorithms 
to study enormous datasets in distributed environments.
How to go deep? 
• Inferential statistics, which allows you to capture the underlying 
properties of the population (prediction, causality analysis and 
distributional comparison) 
• Machine Learning, “…is the unsung hero that powers many of the 
most sophisticated big data analytic applications”.
MAD skills, 2 key points 
DB design 
capture, modelling, manage, querying… 
(SQL) 
Programming Style 
extract, transform, process, investigate… 
(MapReduce)
MAD design for smart environment! 
Parallel DBMSs are substantially faster than the MR system once the 
data is loaded, but loading the data takes considerably longer in the 
database system. 
MapReduce has captured the interest of many developers because of its 
simple two-function paradigm, and it is widely viewed as a more attractive 
programming environment than SQL. 
The MR paradigm simplifies the schema-writing process for data: it just requires 
loading and copying data into the storage system.
MAD design for smart environment! 
As each approach has its own pros and cons, the proposal is a 
database-Hadoop hybrid approach to scalable machine learning, where 
batch learning is performed on the Hadoop platform and data are stored 
(and organised) with the help of parallel DBMSs. 
The critical skill for a MAD analyst becomes interoperability across a 
complex pipeline that includes some stages in SQL and some in 
MapReduce syntax.
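A hybrid pipeline of this kind can be sketched on one machine. This is only an illustration, not the setup used in the seminar: Python's built-in sqlite3 stands in for a parallel DBMS, a plain in-process map/shuffle/reduce stands in for Hadoop, and the table name, the toy rows and every function name are hypothetical.

```python
import sqlite3
from collections import defaultdict

# Hypothetical toy data: (user, page, seconds_on_page).
ROWS = [
    ("ann", "home", 12),
    ("ann", "docs", 40),
    ("bob", "home", 5),
    ("bob", "docs", 31),
    ("cat", "home", 7),
]


def sql_stage(rows, min_seconds):
    """SQL stage: load rows into a table and select the ones to process."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE visits (user TEXT, page TEXT, seconds INTEGER)")
    conn.executemany("INSERT INTO visits VALUES (?, ?, ?)", rows)
    selected = conn.execute(
        "SELECT user, page, seconds FROM visits "
        "WHERE seconds >= ? ORDER BY user, page",
        (min_seconds,),
    ).fetchall()
    conn.close()
    return selected


def map_phase(record):
    """Map: emit one (key, value) pair per input record."""
    _user, page, seconds = record
    yield page, seconds


def reduce_phase(key, values):
    """Reduce: combine all values that share a key."""
    return key, sum(values)


def mapreduce_stage(records):
    """MapReduce-style stage: map, group by key (shuffle), then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_phase(record):
            groups[key].append(value)
    return dict(reduce_phase(key, values) for key, values in groups.items())


selected = sql_stage(ROWS, min_seconds=7)
totals = mapreduce_stage(selected)
print(totals)
```

The SQL stage needs a schema before any data can be loaded, while the MapReduce stage only needs the two functions, which is the trade-off described in the slides above.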
How to deal with Big Data and Machine Learning? 
• parallelizing and distributing data analysis 
• large-scale data sets 
• cluster and data fault tolerance 
• iterative processing
BIG DATA INTENSIVE PARADIGMS 
High Performance Computing 
is the use of parallel processing for running advanced 
application programs efficiently, reliably and quickly 
• parallel processing (MPI) 
• advanced, high-performance applications (Molecular Dynamics) 
• separation of the cluster (VMs), compute (SLURM) and storage (LUSTRE) layers 
• supercomputing 
[Figure: HPC stack, with layers app, proc, comm, strg]
BIG DATA INTENSIVE PARADIGMS 
Apache Big Data Stack 
Based on integration of compute and data, it introduces application-level scheduling 
to facilitate heterogeneous application workloads and high cluster utilization. 
• MapReduce paradigm 
• integration of compute and data management 
• cheap hardware 
• low communication needs among clusters 
• many open-source implementations, support and docs 
• tight coupling between storage (HDFS) and resource management (YARN) 
• no shared memory 
• no support for iteration 
[Figure: ABDS stack, with layers app, proc, comm, strg]
BIG DATA INTENSIVE PARADIGMS 
Berkeley Data Analytics Stack 
It emerged in response to application requirements (short-running 
tasks) and to overcome the problems of its 
predecessor (data caching). 
• Transform and Act paradigm 
• multi-level scheduler (MESOS) 
• runtime iterative processing (SPARK) 
• distributed shared memory (RDD) 
• …young? 
[Figure: BDAS stack, with layers app, proc, comm, strg]
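The "Transform and Act" idea can be illustrated with a toy, single-machine sketch: transformations only describe work, actions run it, and a cached dataset is read from its source once. The TinyRDD class below is hypothetical and is not the Spark API; it only mimics the behaviour in plain Python.

```python
import functools


class TinyRDD:
    """Toy stand-in for an RDD (hypothetical, single machine, not Spark)."""

    def __init__(self, compute):
        self._compute = compute   # zero-argument function returning an iterator
        self._use_cache = False
        self._cached = None

    # Transformations: build a new dataset, run nothing yet.
    def map(self, fn):
        return TinyRDD(lambda: (fn(x) for x in self._iterate()))

    def filter(self, pred):
        return TinyRDD(lambda: (x for x in self._iterate() if pred(x)))

    def cache(self):
        """Ask for the results to be kept in memory after the first action."""
        self._use_cache = True
        return self

    # Actions: trigger the actual computation.
    def collect(self):
        return list(self._iterate())

    def reduce(self, fn):
        return functools.reduce(fn, self._iterate())

    def _iterate(self):
        if self._use_cache:
            if self._cached is None:
                self._cached = list(self._compute())
            return iter(self._cached)
        return self._compute()


reads = {"count": 0}


def read_source():
    """Pretend to scan an input file; counts how often it is read."""
    reads["count"] += 1
    return iter(range(1, 11))


squares = TinyRDD(read_source).map(lambda x: x * x)
reads_after_transform = reads["count"]          # nothing has run so far

evens = squares.filter(lambda x: x % 2 == 0).cache()
total = evens.reduce(lambda a, b: a + b)        # first action: reads the source
largest = evens.reduce(max)                     # second action: served from cache
reads_after_actions = reads["count"]
print(total, largest, reads_after_actions)
```

Without the cache() call, every action would scan the source again, which is the cost that iterative algorithms pay on a plain MapReduce stack.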
FROM 2 PARADIGMS TO A HYBRID TOOL 
HPC - data (intensive) parallel task workflows 
+ 
ABDS - compute-demanding jobs on clusters and MapReduce style for batch processing 
= 
BDAS - provides caching and shared memory 
… 
ML - remember that algorithms need iterative processing! 
=> SPARK - Distributed framework for (big) data preparation and machine learning, based 
on resilient (cached) datasets that are reused across iterations and recomputed if lost
BIG DATA FRAMEWORK SPACE 
[Figure: big data framework space, with an Age/Maturity axis and the regions Fast Data, Big Analytics and Big Application]
THREE GENERATIONS OF ML TOOLS 
First generation 
Traditional ML tools (SAS, SPSS, Weka, R) 
• wide set of ML algorithms 
• can facilitate deep analysis 
• vertically scalable 
• non-distributed 
• smaller data sets 
Second generation 
ML tools built over Hadoop (Mahout, Pentaho, RapidMiner) 
• scale to large data sets 
• distributed 
• no database connectivity (ODBC) 
• smaller subsets of algorithms 
• low performance with multi-stage applications (e.g. machine learning and graph processing) 
• inefficient primitives for data sharing 
• poor support for ad-hoc and interactive queries 
• slow iterative computations 
Third generation 
New purpose-built tools (HaLoop, Twister, Pregel, GraphLab, Spark) 
• modularity 
• shared memory 
• iterative ML algorithms 
• asynchronous graph processing 
• cached memory across iterations/interactions
ML ALGORITHMS 
K-means for clustering analysis. 
The iteration time of k-means is dominated by the compute-intensive task of calculating the centroids 
from a set of data points. 
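One k-means iteration can be sketched as an assignment step followed by a centroid update. The version below is a minimal single-machine sketch in plain Python, not the distributed implementation that was benchmarked; the toy points, the starting centroids and the function names are assumptions made for illustration.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centroids):
    """Assignment step: index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)),
            key=lambda j: squared_distance(p, centroids[j]))
        for p in points
    ]


def update(points, labels, centroids):
    """Update step: each centroid becomes the mean of its assigned points."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == j]
        if not members:
            new_centroids.append(old)   # keep the centroid of an empty cluster
            continue
        new_centroids.append(tuple(
            sum(m[d] for m in members) / len(members)
            for d in range(len(old))
        ))
    return new_centroids


def kmeans(points, initial_centroids, max_iterations=20):
    """Repeat assignment and update until the centroids stop moving."""
    centroids = list(initial_centroids)
    for _ in range(max_iterations):
        labels = assign(points, centroids)
        new_centroids = update(points, labels, centroids)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assign(points, centroids)


# Hypothetical toy data: two well-separated groups in the plane.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, labels = kmeans(points, [(1.0, 1.0), (8.0, 8.0)])
print(centroids, labels)
```

Each iteration visits every point once per centroid, so the cost grows with the number of points, the number of clusters and the dimension of the data.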
Logistic Regression, a type of probabilistic statistical classification model. 
For the comparison it is used for a binary classification task: it is less compute-intensive than k-means 
and more sensitive to time spent in deserialization and I/O.
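For comparison, a binary logistic regression trained by batch gradient descent can be sketched as follows. This is a minimal plain-Python sketch under assumed settings (toy data, learning rate and iteration count are all hypothetical), not the benchmarked implementation. The work per record is one dot product and one sigmoid, so on a cluster the time spent reading and deserializing records weighs more heavily than in k-means.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, bias, x):
    """Estimated probability that x belongs to class 1."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train(features, labels, learning_rate=0.5, iterations=200):
    """Batch gradient descent on the average logistic loss."""
    n, dim = len(features), len(features[0])
    weights, bias = [0.0] * dim, 0.0
    for _ in range(iterations):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):      # one full pass over the data
            error = predict_proba(weights, bias, x) - y
            for d in range(dim):
                grad_w[d] += error * x[d]
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


# Hypothetical toy data: two linearly separable classes.
features = [(-2.0, -1.0), (-1.0, -2.0), (-1.5, -0.5),
            (2.0, 1.0), (1.0, 2.0), (1.5, 0.5)]
labels = [0, 0, 0, 1, 1, 1]

weights, bias = train(features, labels)
predictions = [1 if predict_proba(weights, bias, x) >= 0.5 else 0
               for x in features]
print(predictions)
```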
K-MEANS
LOG REG 
[Figure: benchmark plots, panels a) to f): Times (s) against iterations, machines and input]
CONCLUSIONS 
• MAD design can help the analysis process, just as the Agile methodology helps the software 
development process. 
• The better performance of parallel DBMSs should be seen as complementary to MapReduce systems. 
• MapReduce provides the end user with powerful abstractions for data processing, analytics and machine learning, 
which evolve naturally into the new “transform and act” paradigm used in Spark. 
• Spark takes the best techniques from both ABDS and HPC. It is the core of BDAS and is the best 
framework in this scenario. 
• Resilient distributed datasets (RDDs) are an efficient, general-purpose and fault-tolerant 
abstraction for sharing data in cluster applications, and they are the added value of Spark. 
• Frameworks like Twister and HaLoop are good candidates as alternatives to Spark, but they 
do not appear to be mature enough.
Acknowledgements 
Dr. Sasu Tarkoma 
Dr. Mohammad Hoque 
Reviewers
Thank you! 
(gianvito.siciliano@gmail.com)

More Related Content

What's hot

mapreduce_presentation
mapreduce_presentationmapreduce_presentation
mapreduce_presentationAdam Martini
 
Hadoop - A big data initiative
Hadoop - A big data initiativeHadoop - A big data initiative
Hadoop - A big data initiative
Mansi Mehra
 
A New Multi-Dimensional Hyperbolic Structure for Cloud Service Indexing
A New Multi-Dimensional Hyperbolic Structure for Cloud Service IndexingA New Multi-Dimensional Hyperbolic Structure for Cloud Service Indexing
A New Multi-Dimensional Hyperbolic Structure for Cloud Service Indexing
ijdms
 
Workflow Scheduling Techniques and Algorithms in IaaS Cloud: A Survey
Workflow Scheduling Techniques and Algorithms in IaaS Cloud: A Survey Workflow Scheduling Techniques and Algorithms in IaaS Cloud: A Survey
Workflow Scheduling Techniques and Algorithms in IaaS Cloud: A Survey
IJECEIAES
 
EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMING
EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMINGEVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMING
EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMING
ijiert bestjournal
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software Architectures
Geoffrey Fox
 
51 Use Cases and implications for HPC & Apache Big Data Stack
51 Use Cases and implications for HPC & Apache Big Data Stack51 Use Cases and implications for HPC & Apache Big Data Stack
51 Use Cases and implications for HPC & Apache Big Data Stack
Geoffrey Fox
 
Big Data Analytics(Intro,Hadoop Map Reduce,Mahout,K-means clustering,H-base)
Big Data Analytics(Intro,Hadoop Map Reduce,Mahout,K-means clustering,H-base)Big Data Analytics(Intro,Hadoop Map Reduce,Mahout,K-means clustering,H-base)
Big Data Analytics(Intro,Hadoop Map Reduce,Mahout,K-means clustering,H-base)
MIT College Of Engineering,Pune
 
Big data
Big dataBig data
Big data
Mina Soltani
 
The Best of Both Worlds: Unlocking the Power of (big) Knowledge Graphs with S...
The Best of Both Worlds: Unlocking the Power of (big) Knowledge Graphs with S...The Best of Both Worlds: Unlocking the Power of (big) Knowledge Graphs with S...
The Best of Both Worlds: Unlocking the Power of (big) Knowledge Graphs with S...
Gezim Sejdiu
 
Cloud Services for Big Data Analytics
Cloud Services for Big Data AnalyticsCloud Services for Big Data Analytics
Cloud Services for Big Data Analytics
Geoffrey Fox
 
Hadoop mapreduce and yarn frame work- unit5
Hadoop mapreduce and yarn frame work-  unit5Hadoop mapreduce and yarn frame work-  unit5
Hadoop mapreduce and yarn frame work- unit5
RojaT4
 
HPC-ABDS: The Case for an Integrating Apache Big Data Stack with HPC
HPC-ABDS: The Case for an Integrating Apache Big Data Stack with HPC HPC-ABDS: The Case for an Integrating Apache Big Data Stack with HPC
HPC-ABDS: The Case for an Integrating Apache Big Data Stack with HPC
Geoffrey Fox
 
High Performance Processing of Streaming Data
High Performance Processing of Streaming DataHigh Performance Processing of Streaming Data
High Performance Processing of Streaming Data
Geoffrey Fox
 
Workshop on Real-time & Stream Analytics IEEE BigData 2016
Workshop on Real-time & Stream Analytics IEEE BigData 2016Workshop on Real-time & Stream Analytics IEEE BigData 2016
Workshop on Real-time & Stream Analytics IEEE BigData 2016
Sabri Skhiri
 
Comparing Big Data and Simulation Applications and Implications for Software ...
Comparing Big Data and Simulation Applications and Implications for Software ...Comparing Big Data and Simulation Applications and Implications for Software ...
Comparing Big Data and Simulation Applications and Implications for Software ...
Geoffrey Fox
 
Efficient Distributed In-Memory Processing of RDF Datasets - PhD Viva
Efficient Distributed In-Memory Processing of RDF Datasets - PhD VivaEfficient Distributed In-Memory Processing of RDF Datasets - PhD Viva
Efficient Distributed In-Memory Processing of RDF Datasets - PhD Viva
Gezim Sejdiu
 
Final Report_798 Project_Nithin_Sharmila
Final Report_798 Project_Nithin_SharmilaFinal Report_798 Project_Nithin_Sharmila
Final Report_798 Project_Nithin_SharmilaNithin Kakkireni
 

What's hot (18)

mapreduce_presentation
mapreduce_presentationmapreduce_presentation
mapreduce_presentation
 
Hadoop - A big data initiative
Hadoop - A big data initiativeHadoop - A big data initiative
Hadoop - A big data initiative
 
A New Multi-Dimensional Hyperbolic Structure for Cloud Service Indexing
A New Multi-Dimensional Hyperbolic Structure for Cloud Service IndexingA New Multi-Dimensional Hyperbolic Structure for Cloud Service Indexing
A New Multi-Dimensional Hyperbolic Structure for Cloud Service Indexing
 
Workflow Scheduling Techniques and Algorithms in IaaS Cloud: A Survey
Workflow Scheduling Techniques and Algorithms in IaaS Cloud: A Survey Workflow Scheduling Techniques and Algorithms in IaaS Cloud: A Survey
Workflow Scheduling Techniques and Algorithms in IaaS Cloud: A Survey
 
EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMING
EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMINGEVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMING
EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMING
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software Architectures
 
51 Use Cases and implications for HPC & Apache Big Data Stack
51 Use Cases and implications for HPC & Apache Big Data Stack51 Use Cases and implications for HPC & Apache Big Data Stack
51 Use Cases and implications for HPC & Apache Big Data Stack
 
Big Data Analytics(Intro,Hadoop Map Reduce,Mahout,K-means clustering,H-base)
Big Data Analytics(Intro,Hadoop Map Reduce,Mahout,K-means clustering,H-base)Big Data Analytics(Intro,Hadoop Map Reduce,Mahout,K-means clustering,H-base)
Big Data Analytics(Intro,Hadoop Map Reduce,Mahout,K-means clustering,H-base)
 
Big data
Big dataBig data
Big data
 
The Best of Both Worlds: Unlocking the Power of (big) Knowledge Graphs with S...
The Best of Both Worlds: Unlocking the Power of (big) Knowledge Graphs with S...The Best of Both Worlds: Unlocking the Power of (big) Knowledge Graphs with S...
The Best of Both Worlds: Unlocking the Power of (big) Knowledge Graphs with S...
 
Cloud Services for Big Data Analytics
Cloud Services for Big Data AnalyticsCloud Services for Big Data Analytics
Cloud Services for Big Data Analytics
 
Hadoop mapreduce and yarn frame work- unit5
Hadoop mapreduce and yarn frame work-  unit5Hadoop mapreduce and yarn frame work-  unit5
Hadoop mapreduce and yarn frame work- unit5
 
HPC-ABDS: The Case for an Integrating Apache Big Data Stack with HPC
HPC-ABDS: The Case for an Integrating Apache Big Data Stack with HPC HPC-ABDS: The Case for an Integrating Apache Big Data Stack with HPC
HPC-ABDS: The Case for an Integrating Apache Big Data Stack with HPC
 
High Performance Processing of Streaming Data
High Performance Processing of Streaming DataHigh Performance Processing of Streaming Data
High Performance Processing of Streaming Data
 
Workshop on Real-time & Stream Analytics IEEE BigData 2016
Workshop on Real-time & Stream Analytics IEEE BigData 2016Workshop on Real-time & Stream Analytics IEEE BigData 2016
Workshop on Real-time & Stream Analytics IEEE BigData 2016
 
Comparing Big Data and Simulation Applications and Implications for Software ...
Comparing Big Data and Simulation Applications and Implications for Software ...Comparing Big Data and Simulation Applications and Implications for Software ...
Comparing Big Data and Simulation Applications and Implications for Software ...
 
Efficient Distributed In-Memory Processing of RDF Datasets - PhD Viva
Efficient Distributed In-Memory Processing of RDF Datasets - PhD VivaEfficient Distributed In-Memory Processing of RDF Datasets - PhD Viva
Efficient Distributed In-Memory Processing of RDF Datasets - PhD Viva
 
Final Report_798 Project_Nithin_Sharmila
Final Report_798 Project_Nithin_SharmilaFinal Report_798 Project_Nithin_Sharmila
Final Report_798 Project_Nithin_Sharmila
 

Viewers also liked

Image Classification and Retrieval logic
Image Classification and Retrieval logicImage Classification and Retrieval logic
Image Classification and Retrieval logic
Gianvito Siciliano
 
Avanced Image Classification
Avanced Image ClassificationAvanced Image Classification
Avanced Image Classification
Bayes Ahmed
 
your browser, my storage
your browser, my storageyour browser, my storage
your browser, my storage
Francesco Fullone
 
VMworld 2013: Tech Preview: Accelerating Data Operations Using VMware VVols a...
VMworld 2013: Tech Preview: Accelerating Data Operations Using VMware VVols a...VMworld 2013: Tech Preview: Accelerating Data Operations Using VMware VVols a...
VMworld 2013: Tech Preview: Accelerating Data Operations Using VMware VVols a...
VMworld
 
Societal Impact of Applied Data Science on the Big Data Stack
Societal Impact of Applied Data Science on the Big Data StackSocietal Impact of Applied Data Science on the Big Data Stack
Societal Impact of Applied Data Science on the Big Data Stack
Stealth Project
 
Big Data Case study - caixa bank
Big Data Case study - caixa bankBig Data Case study - caixa bank
Big Data Case study - caixa bank
Chungsik Yun
 
Introduction to Machine Learning for Oracle Database Professionals
Introduction to Machine Learning for Oracle Database ProfessionalsIntroduction to Machine Learning for Oracle Database Professionals
Introduction to Machine Learning for Oracle Database Professionals
Alex Gorbachev
 
A brief history of "big data"
A brief history of "big data"A brief history of "big data"
A brief history of "big data"
Nicola Ferraro
 
A Non-Standard use Case of Hadoop: High Scale Image Processing and Analytics
A Non-Standard use Case of Hadoop: High Scale Image Processing and AnalyticsA Non-Standard use Case of Hadoop: High Scale Image Processing and Analytics
A Non-Standard use Case of Hadoop: High Scale Image Processing and Analytics
DataWorks Summit
 
[giip] A.I. Infrastructure Advisor (인공지능 인프라 어드바이저)
[giip] A.I. Infrastructure Advisor (인공지능 인프라 어드바이저)[giip] A.I. Infrastructure Advisor (인공지능 인프라 어드바이저)
[giip] A.I. Infrastructure Advisor (인공지능 인프라 어드바이저)
Lowy Shin
 
JEEConf 2015 - Introduction to real-time big data with Apache Spark
JEEConf 2015 - Introduction to real-time big data with Apache SparkJEEConf 2015 - Introduction to real-time big data with Apache Spark
JEEConf 2015 - Introduction to real-time big data with Apache Spark
Taras Matyashovsky
 
Teaching Recurrent Neural Networks using Tensorflow (Webinar: August 2016)
Teaching Recurrent Neural Networks using Tensorflow (Webinar: August 2016)Teaching Recurrent Neural Networks using Tensorflow (Webinar: August 2016)
Teaching Recurrent Neural Networks using Tensorflow (Webinar: August 2016)
Rajiv Shah
 
Business case for Big Data Analytics
Business case for Big Data AnalyticsBusiness case for Big Data Analytics
Business case for Big Data Analytics
Vijay Rao
 
Real-time Streaming Analytics: Business Value, Use Cases and Architectural Co...
Real-time Streaming Analytics: Business Value, Use Cases and Architectural Co...Real-time Streaming Analytics: Business Value, Use Cases and Architectural Co...
Real-time Streaming Analytics: Business Value, Use Cases and Architectural Co...
Impetus Technologies
 
Impact of big data on analytics
Impact of big data on analyticsImpact of big data on analytics
Impact of big data on analytics
Capgemini
 
HOW TO APPLY BIG DATA ANALYTICS AND MACHINE LEARNING TO REAL TIME PROCESSING ...
HOW TO APPLY BIG DATA ANALYTICS AND MACHINE LEARNING TO REAL TIME PROCESSING ...HOW TO APPLY BIG DATA ANALYTICS AND MACHINE LEARNING TO REAL TIME PROCESSING ...
HOW TO APPLY BIG DATA ANALYTICS AND MACHINE LEARNING TO REAL TIME PROCESSING ...
Big Data Spain
 
The Efficient Big data Platform - IDC 360, Copenhagen
The Efficient Big data Platform - IDC 360, CopenhagenThe Efficient Big data Platform - IDC 360, Copenhagen
The Efficient Big data Platform - IDC 360, Copenhagen
Petri Pekkarinen
 
Image Classification Done Simply using Keras and TensorFlow
Image Classification Done Simply using Keras and TensorFlow Image Classification Done Simply using Keras and TensorFlow
Image Classification Done Simply using Keras and TensorFlow
Rajiv Shah
 
Data Modeling for Big Data
Data Modeling for Big DataData Modeling for Big Data
Data Modeling for Big Data
DATAVERSITY
 
Deep Learning Use Cases - Data Science Pop-up Seattle
Deep Learning Use Cases - Data Science Pop-up SeattleDeep Learning Use Cases - Data Science Pop-up Seattle
Deep Learning Use Cases - Data Science Pop-up Seattle
Domino Data Lab
 

Viewers also liked (20)

Image Classification and Retrieval logic
Image Classification and Retrieval logicImage Classification and Retrieval logic
Image Classification and Retrieval logic
 
Avanced Image Classification
Avanced Image ClassificationAvanced Image Classification
Avanced Image Classification
 
your browser, my storage
your browser, my storageyour browser, my storage
your browser, my storage
 
VMworld 2013: Tech Preview: Accelerating Data Operations Using VMware VVols a...
VMworld 2013: Tech Preview: Accelerating Data Operations Using VMware VVols a...VMworld 2013: Tech Preview: Accelerating Data Operations Using VMware VVols a...
VMworld 2013: Tech Preview: Accelerating Data Operations Using VMware VVols a...
 
Societal Impact of Applied Data Science on the Big Data Stack
Societal Impact of Applied Data Science on the Big Data StackSocietal Impact of Applied Data Science on the Big Data Stack
Societal Impact of Applied Data Science on the Big Data Stack
 
Big Data Case study - caixa bank
Big Data Case study - caixa bankBig Data Case study - caixa bank
Big Data Case study - caixa bank
 
Introduction to Machine Learning for Oracle Database Professionals
Introduction to Machine Learning for Oracle Database ProfessionalsIntroduction to Machine Learning for Oracle Database Professionals
Introduction to Machine Learning for Oracle Database Professionals
 
A brief history of "big data"
A brief history of "big data"A brief history of "big data"
A brief history of "big data"
 
A Non-Standard use Case of Hadoop: High Scale Image Processing and Analytics
A Non-Standard use Case of Hadoop: High Scale Image Processing and AnalyticsA Non-Standard use Case of Hadoop: High Scale Image Processing and Analytics
A Non-Standard use Case of Hadoop: High Scale Image Processing and Analytics
 
[giip] A.I. Infrastructure Advisor (인공지능 인프라 어드바이저)
[giip] A.I. Infrastructure Advisor (인공지능 인프라 어드바이저)[giip] A.I. Infrastructure Advisor (인공지능 인프라 어드바이저)
[giip] A.I. Infrastructure Advisor (인공지능 인프라 어드바이저)
 
JEEConf 2015 - Introduction to real-time big data with Apache Spark
JEEConf 2015 - Introduction to real-time big data with Apache SparkJEEConf 2015 - Introduction to real-time big data with Apache Spark
JEEConf 2015 - Introduction to real-time big data with Apache Spark
 
Teaching Recurrent Neural Networks using Tensorflow (Webinar: August 2016)
Teaching Recurrent Neural Networks using Tensorflow (Webinar: August 2016)Teaching Recurrent Neural Networks using Tensorflow (Webinar: August 2016)
Teaching Recurrent Neural Networks using Tensorflow (Webinar: August 2016)
 
Business case for Big Data Analytics
Business case for Big Data AnalyticsBusiness case for Big Data Analytics
Business case for Big Data Analytics
 
Real-time Streaming Analytics: Business Value, Use Cases and Architectural Co...
Real-time Streaming Analytics: Business Value, Use Cases and Architectural Co...Real-time Streaming Analytics: Business Value, Use Cases and Architectural Co...
Real-time Streaming Analytics: Business Value, Use Cases and Architectural Co...
 
Impact of big data on analytics
Impact of big data on analyticsImpact of big data on analytics
Impact of big data on analytics
 
HOW TO APPLY BIG DATA ANALYTICS AND MACHINE LEARNING TO REAL TIME PROCESSING ...
HOW TO APPLY BIG DATA ANALYTICS AND MACHINE LEARNING TO REAL TIME PROCESSING ...HOW TO APPLY BIG DATA ANALYTICS AND MACHINE LEARNING TO REAL TIME PROCESSING ...
HOW TO APPLY BIG DATA ANALYTICS AND MACHINE LEARNING TO REAL TIME PROCESSING ...
 
The Efficient Big data Platform - IDC 360, Copenhagen
The Efficient Big data Platform - IDC 360, CopenhagenThe Efficient Big data Platform - IDC 360, Copenhagen
The Efficient Big data Platform - IDC 360, Copenhagen
 
Image Classification Done Simply using Keras and TensorFlow
Image Classification Done Simply using Keras and TensorFlow Image Classification Done Simply using Keras and TensorFlow
Image Classification Done Simply using Keras and TensorFlow
 
Data Modeling for Big Data
Data Modeling for Big DataData Modeling for Big Data
Data Modeling for Big Data
 
Deep Learning Use Cases - Data Science Pop-up Seattle
Deep Learning Use Cases - Data Science Pop-up SeattleDeep Learning Use Cases - Data Science Pop-up Seattle
Deep Learning Use Cases - Data Science Pop-up Seattle
 

Similar to MAD skills for analysis and big data Machine Learning

Iaetsd mapreduce streaming over cassandra datasets
Iaetsd mapreduce streaming over cassandra datasetsIaetsd mapreduce streaming over cassandra datasets
Iaetsd mapreduce streaming over cassandra datasets
Iaetsd Iaetsd
 
High Performance Computing and Big Data
High Performance Computing and Big Data High Performance Computing and Big Data
High Performance Computing and Big Data
Geoffrey Fox
 
Big Data Storage System Based on a Distributed Hash Tables System
Big Data Storage System Based on a Distributed Hash Tables SystemBig Data Storage System Based on a Distributed Hash Tables System
Big Data Storage System Based on a Distributed Hash Tables System
ijdms
 
Paper id 25201498
Paper id 25201498Paper id 25201498
Paper id 25201498IJRAT
 
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?
IJCSIS Research Publications
 
عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟
datastack
 
benchmarks-sigmod09
benchmarks-sigmod09benchmarks-sigmod09
benchmarks-sigmod09Hiroshi Ono
 
Big Data: RDBMS vs. Hadoop vs. Spark
Big Data: RDBMS vs. Hadoop vs. SparkBig Data: RDBMS vs. Hadoop vs. Spark
Big Data: RDBMS vs. Hadoop vs. Spark
Graisy Biswal
 
Cloud Services for Big Data Analytics
Cloud Services for Big Data AnalyticsCloud Services for Big Data Analytics
Cloud Services for Big Data Analytics
Geoffrey Fox
 
Eg4301808811
Eg4301808811Eg4301808811
Eg4301808811
IJERA Editor
 
Topic modeling using big data analytics
Topic modeling using big data analyticsTopic modeling using big data analytics
Topic modeling using big data analytics
Farheen Nilofer
 
Hadoop - A big data initiative
Hadoop - A big data initiativeHadoop - A big data initiative
Hadoop - A big data initiative
Mansi Mehra
 
Unstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus ModelUnstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus Model
Editor IJCATR
 
A data aware caching 2415
A data aware caching 2415A data aware caching 2415
A data aware caching 2415
SANTOSH WAYAL
 
Big data: Descoberta de conhecimento em ambientes de big data e computação na...
Big data: Descoberta de conhecimento em ambientes de big data e computação na...Big data: Descoberta de conhecimento em ambientes de big data e computação na...
Big data: Descoberta de conhecimento em ambientes de big data e computação na...
Rio Info
 
Generating Frequent Itemsets by RElim on Hadoop Clusters
Generating Frequent Itemsets by RElim on Hadoop ClustersGenerating Frequent Itemsets by RElim on Hadoop Clusters
Generating Frequent Itemsets by RElim on Hadoop Clusters
BRNSSPublicationHubI
 
Apache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big DataApache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big Data
Paco Nathan
 
1.demystifying big data & hadoop
1.demystifying big data & hadoop1.demystifying big data & hadoop
1.demystifying big data & hadoop
databloginfo
 
CouchBase The Complete NoSql Solution for Big Data
CouchBase The Complete NoSql Solution for Big DataCouchBase The Complete NoSql Solution for Big Data
CouchBase The Complete NoSql Solution for Big Data
Debajani Mohanty
 

Similar to MAD skills for analysis and big data Machine Learning (20)

Iaetsd mapreduce streaming over cassandra datasets
Iaetsd mapreduce streaming over cassandra datasetsIaetsd mapreduce streaming over cassandra datasets
Iaetsd mapreduce streaming over cassandra datasets
 
High Performance Computing and Big Data
High Performance Computing and Big Data High Performance Computing and Big Data
High Performance Computing and Big Data
 
Big Data Storage System Based on a Distributed Hash Tables System
Big Data Storage System Based on a Distributed Hash Tables SystemBig Data Storage System Based on a Distributed Hash Tables System
Big Data Storage System Based on a Distributed Hash Tables System
 
Paper id 25201498
Paper id 25201498Paper id 25201498
Paper id 25201498
 
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?
 
عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟
 
benchmarks-sigmod09
benchmarks-sigmod09benchmarks-sigmod09
benchmarks-sigmod09
 
Big Data: RDBMS vs. Hadoop vs. Spark
Big Data: RDBMS vs. Hadoop vs. SparkBig Data: RDBMS vs. Hadoop vs. Spark
Big Data: RDBMS vs. Hadoop vs. Spark
 
Cloud Services for Big Data Analytics
Cloud Services for Big Data AnalyticsCloud Services for Big Data Analytics
Cloud Services for Big Data Analytics
 
Eg4301808811
Eg4301808811Eg4301808811
Eg4301808811
 
Topic modeling using big data analytics
Topic modeling using big data analyticsTopic modeling using big data analytics
Topic modeling using big data analytics
 
Hadoop - A big data initiative
Hadoop - A big data initiativeHadoop - A big data initiative
Hadoop - A big data initiative
 
Unstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus ModelUnstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus Model
 
Big Data & Hadoop
Big Data & HadoopBig Data & Hadoop
Big Data & Hadoop
 
A data aware caching 2415
A data aware caching 2415A data aware caching 2415
A data aware caching 2415
 
Big data: Descoberta de conhecimento em ambientes de big data e computação na...
Big data: Descoberta de conhecimento em ambientes de big data e computação na...Big data: Descoberta de conhecimento em ambientes de big data e computação na...
Big data: Descoberta de conhecimento em ambientes de big data e computação na...
 
Generating Frequent Itemsets by RElim on Hadoop Clusters
Generating Frequent Itemsets by RElim on Hadoop ClustersGenerating Frequent Itemsets by RElim on Hadoop Clusters
Generating Frequent Itemsets by RElim on Hadoop Clusters
 
Apache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big DataApache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big Data
 
1.demystifying big data & hadoop
1.demystifying big data & hadoop1.demystifying big data & hadoop
1.demystifying big data & hadoop
 
CouchBase The Complete NoSql Solution for Big Data
CouchBase The Complete NoSql Solution for Big DataCouchBase The Complete NoSql Solution for Big Data
CouchBase The Complete NoSql Solution for Big Data
 

MAD skills for analysis and big data Machine Learning

  • 1. MAD SKILLS FOR ANALYSIS AND BIG DATA MACHINE LEARNING University of Helsinki Gianvito Siciliano (2014 - Distributed Computing Frameworks for Big Data Seminar)
  • 2. COMPARISON OF • APPROACHES • PLATFORMS • ALGORITHMS
  • 3. AGENDA 1. Analysis intro: • needed skills (MAD) • important areas (IS, ML) 2. Big Data intensive approaches: • HPC, ABDS, BDAS 3. Machine Learning tool generations • SAS, Weka, Hadoop, Mahout, HaLoop, Spark (…) 4. Large scale (ML) algorithms comparison • K-means, LogReg
  • 4. Why data analysis? “So, what’s getting ubiquitous and cheap? Data. And what is complementary to data? Analysis.” The value of data analysis has entered common culture, to uncover the unexpected in your data.
  • 5. How to make sense of data? The MAD acronym is made up of three inherent aspects of big data analysis: Magnetic: it concerns attracting data from heterogeneous sources, regardless of the quality of the data.
  • 6. How to make sense of data? The MAD acronym is made up of three inherent aspects of big data analysis: Agile: it is about performing analysis quickly, to obtain actions that maximize value for the business.
  • 7. How to make sense of data? The MAD acronym is made up of three inherent aspects of big data analysis: Deep: it is about enabling analysts to apply both sophisticated statistical methods and the best-performing ML algorithms to study enormous datasets in distributed environments.
  • 8. How to go deep? • Inferential statistics, which allows you to capture the underlying properties of the population (prediction, causality analysis and distributional comparison) • Machine Learning, “…is the unsung hero that powers many of the most sophisticated big data analytic applications”.
  • 9. DB design capture, modelling, manage, querying… (SQL) MAD skills, 2 key points Programming Style extract, transform, process, investigate… (MapReduce)
  • 10. MAD design for smart environment! Parallel DBMSs are substantially faster than the MR system once the data is loaded, but loading the data takes considerably longer in the DB system. MapReduce has captured the interest of many developers because of its simple two-function paradigm, and it is widely viewed as a more attractive programming environment than SQL. The MR paradigm simplifies the schema-writing process for data: it just requires loading and copying data into the storage system.
  • 11. MAD design for smart environment! As each approach has its own set of pros and cons, the proposal can be a database-Hadoop hybrid approach to scalable machine learning, where batch learning is performed on the Hadoop platform and data are stored (and organised) with the help of parallel DBMSs. The critical skill for a MAD analyst becomes interoperability across complex pipelines that include some stages in SQL and some in MapReduce syntax.
  • 12. How to deal with Big Data and Machine Learning? • parallelizing and distributing data analysis • large-scale data sets • cluster and data fault tolerance • iterative processing
  • 13. BIG DATA INTENSIVE PARADIGMS High Performance Computing is the use of parallel processing for running advanced application programs efficiently, reliably and quickly • parallel processing (MPI) • advanced, high-performance applications (Molecular Dynamics) • separating the cluster (VMs), compute (SLURM) and storage layer (LUSTRE) • supercomputing HPC stack: app, proc, comm, strg
  • 14. BIG DATA INTENSIVE PARADIGMS Apache Big Data Stack: based on the integration of compute and data, it introduces application-level scheduling to facilitate heterogeneous application workloads and high cluster utilization. • MapReduce paradigm • integration of compute and data management • cheap hardware • low need for communication among clusters • many open-source implementations, support and docs • tight coupling between storage (HDFS) and resource management (YARN) • no shared memory • no support for iteration ABDS stack: app, proc, comm, strg
  • 15. BIG DATA INTENSIVE PARADIGMS Berkeley Data Analytics Stack: it emerged in response to application requirements (short-running tasks) and to overcome the problems of its predecessor (data caching). • Transform and Act paradigm • multi-level scheduler (MESOS) • runtime iterative processing (SPARK) • distributed shared memory (RDD) • …young? BDAS stack: app, proc, comm, strg
  • 16. FROM 2 PARADIGMS TO A HYBRID TOOL HPC - data-intensive parallel tasks and workflows + ABDS - compute-demanding jobs on clusters and MapReduce style for batch processing = BDAS - provides caching and shared memory … ML - remember that algorithms need iterative processing! => SPARK - distributed framework for (big) data preparation and machine learning, based on a resilient (cache) system to recompute iterations
  • 17. BIG DATA FRAMEWORK SPACE Age/Maturity Fast Data Big Analytics Big Application
  • 18. THREE GENERATIONS OF ML TOOLS First generation: traditional machine learning tools (SAS, SPSS, Weka, R) • wide set of ML algorithms can facilitate deep analysis • vertically scalable • non distributed • smaller data sets Second generation: ML tools built over Hadoop (Mahout, Pentaho, RapidMiner) • scale to large data sets • distributed • no database connectivity (ODBC) • smaller subsets of algorithms • low performance with multi-stage applications (e.g. machine learning and graph processing) • inefficient primitives for data sharing • poor support for ad hoc and interactive queries • slow iterative computations Third generation: new purpose-built tools (HaLoop, Twister, Pregel, GraphLab, Spark) • modularity • shared memory • iterative ML algorithms • asynchronous graph processing • cached memory across iterations/interactions
  • 19. ML ALGORITHMS K-means for clustering analysis. The iteration time of k-means is dominated by the compute-intensive task of calculating the centroids from a set of data points. Logistic Regression, a type of probabilistic statistical classification model. For the comparison it is used for a binary classification task: it is less compute-intensive than k-means and more sensitive to time spent in deserialization and I/O.
  • 21. LOG REG [Figure: panels a) to f) plotting running time (s) against iterations, machines and input size]
  • 22. CONCLUSIONS • MAD design can help the analysis process, just as the Agile methodology helps the software development process. • The better performance of parallel DBMSs should be seen as complementary to MapReduce systems. • MapReduce provides the end user with powerful abstractions for data processing, analytics and machine learning, which naturally evolve into the new “transform and act” paradigm used in Spark. • Spark takes the best techniques from both ABDS and HPC. It is the core of BDAS and is the best framework in this scenario. • Resilient distributed datasets (RDDs) are an efficient, general-purpose and fault-tolerant abstraction for sharing data in cluster applications, and they are the added value of Spark. • Frameworks like Twister and HaLoop are good candidates as alternatives to Spark, but they do not appear to be mature enough.
  • 23. Acknowledgements Dr. Sasu Tarkoma Dr. Mohammad Hoque Reviewers
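
The slides describe k-means as an iterative algorithm whose iteration time is dominated by recomputing centroids, and MapReduce as a simple two-function paradigm. As a rough illustration of how the two fit together, the sketch below writes one k-means iteration as a map phase (assign each point to its nearest centroid) followed by a reduce phase (average each group). It is plain single-machine Python, not the Mahout or Spark implementation compared in the talk, and the names `closest`, `kmeans_iteration` and `kmeans` are hypothetical, chosen only for this example.

```python
from collections import defaultdict


def closest(point, centroids):
    """Map step: index of the nearest centroid (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def kmeans_iteration(points, centroids):
    """One k-means iteration as a map phase followed by a reduce phase."""
    # Map + shuffle: group every point under the index of its nearest centroid.
    groups = defaultdict(list)
    for point in points:
        groups[closest(point, centroids)].append(point)

    # Reduce: average each group to obtain the new centroid.
    # Assumption: a centroid with no assigned points is kept unchanged.
    new_centroids = []
    for i, old in enumerate(centroids):
        members = groups.get(i)
        if not members:
            new_centroids.append(old)
            continue
        new_centroids.append(
            tuple(sum(m[d] for m in members) / len(members) for d in range(len(old)))
        )
    return new_centroids


def kmeans(points, centroids, max_iter=20):
    """Repeat iterations until the centroids stop changing or max_iter is hit."""
    for _ in range(max_iter):
        updated = kmeans_iteration(points, centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
    print(kmeans(data, [(0.0, 0.0), (10.0, 10.0)]))
```

Every pass through the loop in `kmeans` rereads the full set of points. In a Hadoop-style batch system each such pass would be a separate job reading its input from storage, which is the cost the slides attribute to second-generation tools and the reason they give for caching data across iterations in Spark.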