RPig: A Scalable Framework for Machine Learning and Advanced Statistical Functionalities
MingXue Wang
Sidath B. Handurukande
Mohamed Nassar
Network Management Lab, Ericsson Ireland
CloudCom 2012
Ericsson | Page 2
Agenda
› Context and technology background
– Big data analytics for network management
– Hadoop, Pig, R
– Related work
› RPig
– RPig framework and RPig script
– Case study
› Conclusion
Big data analytics in network management
› Capabilities of big data analytics
– Service assurance
– Predictive analysis
› Large amounts of network data
– Thousands of cells, nodes
– Millions of connected devices, terminals
– Billions of sessions, events
› Machine learning and advanced statistical algorithms
– Network fault, KPI prediction
– CDR, traffic data analysis
RPig framework context
› VoIP use case (service assurance)
– Network KPIs -> Service KPIs -> Alarm events
– Input: network KPIs (packet loss, jitter, delay, etc.)
– RPig execution platform: VoIP QoE alarm models, SVM-based algorithm
– Output: VoIP QoE alarms, triggers
Hadoop and MapReduce
› Hadoop stack (components from the slide diagram)
– Hadoop DFS: Hadoop Distributed File System
– Hadoop MapReduce: distributed parallel programming framework
– Zookeeper: coordination
– Pig: data flow
– Mahout: ML/DM
– Hive: SQL
– HBase: NoSQL
– S4: streaming
– Hama: BSP
– Giraph: graphs
– Ambari: management
– Our framework: ML/DM
– …
› Big data management system
– terabytes/petabytes of data
– hundreds/thousands of nodes
› MapReduce
– map(k1,v1)-> list(k2,v2); reduce(k2,list(v2))->list(v3)
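A minimal sketch of the map/reduce signature above, written in plain Python rather than against Hadoop. The helper names (`map_reduce`, `count_by_client`, `total`) and the sample events are hypothetical and only illustrate how the two functions compose; everything runs in one process, so distribution and fault handling are not modelled.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Apply mapper to each (k1, v1), group values by k2, then reduce each group."""
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):      # map(k1, v1) -> list(k2, v2)
            groups[k2].append(v2)
    # reduce(k2, list(v2)) -> list(v3); keys are sorted only for stable output
    return {k2: reducer(k2, values) for k2, values in sorted(groups.items())}


def count_by_client(_offset, event):
    """Hypothetical mapper: emit (client, 1) for every event."""
    yield event["client"], 1


def total(_client, counts):
    """Hypothetical reducer: sum the counts emitted for one client."""
    return [sum(counts)]


# Hypothetical input: (record offset, event) pairs.
events = [
    (0, {"client": "Skype"}),
    (1, {"client": "SIP"}),
    (2, {"client": "Skype"}),
]
result = map_reduce(events, count_by_client, total)
```

Here `result` maps each client name to a one-element list holding its event count.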
Pig and Pig Latin
› Pig - Big data management system
– Similar to SQL in RDBMS
– Pig Latin - A high level data flow language
› Events = FILTER Events BY (client == 'Skype' OR ...);
– Define data processing flows on unstructured raw data
– Execution in MapReduce model
› Other similar systems
– JAQL from IBM, …
› Pro: Scalable; distributed parallel processing
› Con: Not designed for ML and advanced statistical functionalities
R and R packages
› R - Traditional statistical software
– A software environment and language for statistical computing and advanced data analysis
– Thousands of R packages
– EMA (exponential moving average) calculation using the TTR package
› library(TTR); results <- EMA(temp, 20)
› Other similar tools:
– Matlab, Weka, …
› Pro: Sophisticated statistical algorithms for advanced analysis
– Clustering, regression, etc.
› Con: Not scalable; data must be loaded into memory and processed on a single computer
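To make the EMA call on this slide concrete, here is a minimal Python sketch of an exponential moving average. The convention used (first n-1 outputs undefined, the n-th output seeded with a simple mean, smoothing factor 2/(n+1)) is an assumption made for illustration and may differ in detail from what `TTR::EMA` does; the function name `ema` and the sample series are hypothetical.

```python
from typing import List, Optional


def ema(values: List[float], n: int) -> List[Optional[float]]:
    """Exponential moving average with window n.

    Assumed convention: outputs before position n are None, the n-th output is
    the simple mean of the first n values, and later outputs use
    alpha = 2 / (n + 1).
    """
    if n < 1 or n > len(values):
        raise ValueError("n must be between 1 and len(values)")
    alpha = 2.0 / (n + 1)
    out: List[Optional[float]] = [None] * (n - 1)
    current = sum(values[:n]) / n          # seed with a simple moving average
    out.append(current)
    for x in values[n:]:
        current = alpha * x + (1 - alpha) * current
        out.append(current)
    return out


smoothed = ema([1.0, 2.0, 3.0, 4.0, 5.0], 3)
```

Like the R version, the whole series sits in memory on one machine, which is the limitation the slide points out.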
Related work - Extending R
› Extending traditional statistical software
› Scaling memory size
– Use hard disk as external memory
– E.g. RevoScaleR, bigmemory
› Scaling storage size
– Directly read/write data in large scale DMS
– E.g. Ricardo, RJDBC, RMySQL
› Scaling CPU power
– MapReduce based (e.g. RHIPE, RHadoop)
› Require manually designing complex map and reduce functions based on key-value pairs
– Non-MapReduce based (e.g. Rmpi, snow, cloudRmpi, Elastic-R)
› Do not support parallel data read/write as Hadoop does
› Require writing programs with complex MPI APIs
Related work - Other solutions
› Developing new frameworks
› E.g.
– Mahout
› Still at a preliminary stage
› Lacks many commonly used algorithms, e.g. SVM
› Does not provide a high-level language such as R or Pig
– SystemML
› DML (a new ML language) is not as flexible as the R language
› Lacks implementations of commonly used statistical algorithms
› Con: Lacking algorithm implementations; either no high-level language support or a new language to learn.
RPig framework
› Our approach: “RPig”
– Integrated framework
› R + Pig
– Integrated language
› Fast algorithm development
– Automatic distributed parallel execution
[Figure labels: Development; Execution]
RPig script
› Pig handles the data movement; R does the statistical tasks
› RPigEditor
[Figure labels: Pig operations; R function]
Forecasting with EMA – case 1
› Case scenario
– Forecasting VoIP traffic in the next time period
› Design: Reduce the data size, then apply the EMA calculation
› RPig implementation summary
– Pig operations are used as pre-processing steps to summarize data
– Any R statistical algorithm implementation can be used directly on the summarized data, as in the traditional single-machine R approach
[Figure: Raw events -> (Pig operations) -> Summarized events -> (R functions) -> output]
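The two-stage design of this case (summarize first, then run a single-machine statistical function on the much smaller result) can be sketched in Python as follows. This is an illustration under stated assumptions, not the RPig script from the talk: the event layout `(timestamp_seconds, client)`, the 60-second bucket, the function names, and the sample data are all hypothetical, and the EMA uses a simple-mean seed with alpha = 2/(n+1).

```python
from collections import Counter


def summarize(raw_events, bucket_seconds=60):
    """Pig-like step: keep Skype events and count them per time bucket."""
    counts = Counter(
        ts // bucket_seconds for ts, client in raw_events if client == "Skype"
    )
    if not counts:
        return []
    first, last = min(counts), max(counts)
    # Dense, time-ordered series; buckets without events count as 0.
    return [counts.get(bucket, 0) for bucket in range(first, last + 1)]


def forecast_next(series, n):
    """R-like step: the final EMA value serves as the next-period forecast."""
    if n < 1 or n > len(series):
        raise ValueError("n must be between 1 and len(series)")
    alpha = 2.0 / (n + 1)
    level = sum(series[:n]) / n            # seed with a simple mean
    for x in series[n:]:
        level = alpha * x + (1 - alpha) * level
    return level


# Hypothetical raw events: (timestamp in seconds, client name).
raw_events = [
    (5, "Skype"), (30, "Skype"), (42, "Other"), (70, "Skype"),
    (130, "Skype"), (140, "Skype"), (150, "Skype"), (200, "Skype"),
]
series = summarize(raw_events)
forecast = forecast_next(series, 3)
```

Only `summarize` touches the raw events, so it is the part that would need to scale out; `forecast_next` sees one number per time bucket.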
Reduced Development Effort
› 15 configured nodes, 128 MB/block
› Two approaches
– Pig: EMA implemented in Java to extend Pig
– RPig
› Small overhead
› Pig approach: more than 100 lines of code
› Our RPig approach: fewer than 10 lines of code
Prediction with SVM – case 2
› Case scenario
– Training a model for predicting Service KPIs based on Network KPIs
› Design: Split the data into small SVM training tasks, then execute them in parallel
› RPig implementation summary
– Parallel or iterative statistical algorithms are expressed as parallel R executions in a Pig data flow
[Figure: Training data -> (Pig operations) -> several splits of training data -> (R functions, one per split) -> output]
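The split-then-train-in-parallel design can be sketched in Python as below. Several parts are assumptions rather than anything stated on the slides: the round-robin split, the tiny linear SVM trained by hinge-loss subgradient descent (a stand-in for whichever R SVM implementation was used), the majority vote that combines the per-split models, and the toy data. Threads stand in for cluster nodes here and give no real CPU parallelism in CPython.

```python
from concurrent.futures import ThreadPoolExecutor


def split(samples, parts):
    """Pig-like step: deal samples round-robin into `parts` partitions."""
    return [samples[i::parts] for i in range(parts)]


def train_linear_svm(samples, epochs=20, lr=0.1, lam=0.01):
    """R-like step: linear SVM without bias, hinge-loss subgradient descent."""
    w = [0.0] * len(samples[0][0])
    for _ in range(epochs):
        for x, y in samples:
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            w = [wi * (1 - lr * lam) for wi in w]        # L2 shrinkage
            if margin < 1:                               # hinge loss is active
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
    return w


def predict(models, x):
    """Assumed combination rule: majority vote over the per-split models."""
    votes = sum(
        1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1 for w in models
    )
    return 1 if votes >= 0 else -1


# Hypothetical training data: (feature vector, label in {+1, -1}).
training = [
    ([2.0, 2.5], 1), ([-2.0, -1.5], -1), ([3.0, 1.0], 1),
    ([-1.0, -3.0], -1), ([1.5, 2.0], 1), ([-2.5, -2.0], -1),
]
partitions = split(training, 3)
with ThreadPoolExecutor(max_workers=3) as pool:
    models = list(pool.map(train_linear_svm, partitions))
```

Each partition is trained independently, so the training tasks share no state and can be scheduled on separate workers.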
ML Scalability
› Machine Learning (SVM training phase)
– CPU intensive rather than I/O intensive
– 6K training samples
Conclusions
› RPig
– Scalable ML and statistical functionalities while minimizing the development effort
› Big data analytics in a high-level language
– No need to learn new languages or APIs, or to rewrite complex statistical algorithms
› Executions are parallelized automatically
– The framework itself handles low-level operations (data transformation, fault handling, etc.)
› Future work
– Will focus on minimizing the overhead and increasing the usability of our framework
2012 CloudCom,  RPig: A Scalable Framework for Machine Learning and Advanced Statistical Functionalities

More Related Content

What's hot

Deploying R in BI and Real time Applications
Deploying R in BI and Real time ApplicationsDeploying R in BI and Real time Applications
Deploying R in BI and Real time ApplicationsLou Bajuk
 
Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...
Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...
Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...Databricks
 
The network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 finalThe network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 finalRevolution Analytics
 
Spark Summit EU talk by Miha Pelko and Til Piffl
Spark Summit EU talk by Miha Pelko and Til PifflSpark Summit EU talk by Miha Pelko and Til Piffl
Spark Summit EU talk by Miha Pelko and Til PifflSpark Summit
 
Big Data Analytics with R
Big Data Analytics with RBig Data Analytics with R
Big Data Analytics with RGreat Wide Open
 
Accelerating Production Machine Learning with MLflow with Matei Zaharia
Accelerating Production Machine Learning with MLflow with Matei ZahariaAccelerating Production Machine Learning with MLflow with Matei Zaharia
Accelerating Production Machine Learning with MLflow with Matei ZahariaDatabricks
 
Validating credit cards on mobile using deep learning
Validating credit cards on mobile using deep learningValidating credit cards on mobile using deep learning
Validating credit cards on mobile using deep learningDataWorks Summit
 
Plume - A Code Property Graph Extraction and Analysis Library
Plume - A Code Property Graph Extraction and Analysis LibraryPlume - A Code Property Graph Extraction and Analysis Library
Plume - A Code Property Graph Extraction and Analysis LibraryTigerGraph
 
Real time applications using the R Language
Real time applications using the R LanguageReal time applications using the R Language
Real time applications using the R LanguageLou Bajuk
 
Predictive Models at Scale
Predictive Models at ScalePredictive Models at Scale
Predictive Models at ScaleNikhil Ketkar
 
Porting R Models into Scala Spark
Porting R Models into Scala SparkPorting R Models into Scala Spark
Porting R Models into Scala Sparkcarl_pulley
 
Are You Ready for Big Data Big Analytics?
Are You Ready for Big Data Big Analytics? Are You Ready for Big Data Big Analytics?
Are You Ready for Big Data Big Analytics? Revolution Analytics
 
Intro to R for SAS and SPSS User Webinar
Intro to R for SAS and SPSS User WebinarIntro to R for SAS and SPSS User Webinar
Intro to R for SAS and SPSS User WebinarRevolution Analytics
 
NETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCE
NETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCENETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCE
NETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCEcsandit
 
Graph Analytics for big data
Graph Analytics for big dataGraph Analytics for big data
Graph Analytics for big dataSigmoid
 
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...Databricks
 
Basics of Digital Design and Verilog
Basics of Digital Design and VerilogBasics of Digital Design and Verilog
Basics of Digital Design and VerilogGanesan Narayanasamy
 

What's hot (18)

Deploying R in BI and Real time Applications
Deploying R in BI and Real time ApplicationsDeploying R in BI and Real time Applications
Deploying R in BI and Real time Applications
 
Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...
Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...
Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...
 
The network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 finalThe network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 final
 
Spark Summit EU talk by Miha Pelko and Til Piffl
Spark Summit EU talk by Miha Pelko and Til PifflSpark Summit EU talk by Miha Pelko and Til Piffl
Spark Summit EU talk by Miha Pelko and Til Piffl
 
Big Data Analytics with R
Big Data Analytics with RBig Data Analytics with R
Big Data Analytics with R
 
Accelerating Production Machine Learning with MLflow with Matei Zaharia
Accelerating Production Machine Learning with MLflow with Matei ZahariaAccelerating Production Machine Learning with MLflow with Matei Zaharia
Accelerating Production Machine Learning with MLflow with Matei Zaharia
 
Validating credit cards on mobile using deep learning
Validating credit cards on mobile using deep learningValidating credit cards on mobile using deep learning
Validating credit cards on mobile using deep learning
 
Plume - A Code Property Graph Extraction and Analysis Library
Plume - A Code Property Graph Extraction and Analysis LibraryPlume - A Code Property Graph Extraction and Analysis Library
Plume - A Code Property Graph Extraction and Analysis Library
 
Real time applications using the R Language
Real time applications using the R LanguageReal time applications using the R Language
Real time applications using the R Language
 
Predictive Models at Scale
Predictive Models at ScalePredictive Models at Scale
Predictive Models at Scale
 
Microsoft cosmos
Microsoft cosmosMicrosoft cosmos
Microsoft cosmos
 
Porting R Models into Scala Spark
Porting R Models into Scala SparkPorting R Models into Scala Spark
Porting R Models into Scala Spark
 
Are You Ready for Big Data Big Analytics?
Are You Ready for Big Data Big Analytics? Are You Ready for Big Data Big Analytics?
Are You Ready for Big Data Big Analytics?
 
Intro to R for SAS and SPSS User Webinar
Intro to R for SAS and SPSS User WebinarIntro to R for SAS and SPSS User Webinar
Intro to R for SAS and SPSS User Webinar
 
NETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCE
NETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCENETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCE
NETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCE
 
Graph Analytics for big data
Graph Analytics for big dataGraph Analytics for big data
Graph Analytics for big data
 
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
 
Basics of Digital Design and Verilog
Basics of Digital Design and VerilogBasics of Digital Design and Verilog
Basics of Digital Design and Verilog
 

Similar to 2012 CloudCom, RPig: A Scalable Framework for Machine Learning and Advanced Statistical Functionalities

High Performance Spatial-Temporal Trajectory Analysis with Spark
High Performance Spatial-Temporal Trajectory Analysis with Spark High Performance Spatial-Temporal Trajectory Analysis with Spark
High Performance Spatial-Temporal Trajectory Analysis with Spark DataWorks Summit/Hadoop Summit
 
Pivotal Real Time Data Stream Analytics
Pivotal Real Time Data Stream AnalyticsPivotal Real Time Data Stream Analytics
Pivotal Real Time Data Stream Analyticskgshukla
 
Analytics Beyond RAM Capacity using R
Analytics Beyond RAM Capacity using RAnalytics Beyond RAM Capacity using R
Analytics Beyond RAM Capacity using RAlex Palamides
 
NoSQL meetup July 2011
NoSQL meetup July 2011NoSQL meetup July 2011
NoSQL meetup July 2011Shay Hassidim
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...Debraj GuhaThakurta
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...Debraj GuhaThakurta
 
Big Data Analytics Platforms by KTH and RISE SICS
Big Data Analytics Platforms by KTH and RISE SICSBig Data Analytics Platforms by KTH and RISE SICS
Big Data Analytics Platforms by KTH and RISE SICSBig Data Value Association
 
Processing 19 billion messages in real time and NOT dying in the process
Processing 19 billion messages in real time and NOT dying in the processProcessing 19 billion messages in real time and NOT dying in the process
Processing 19 billion messages in real time and NOT dying in the processJampp
 
Data Analytics and Machine Learning: From Node to Cluster on ARM64
Data Analytics and Machine Learning: From Node to Cluster on ARM64Data Analytics and Machine Learning: From Node to Cluster on ARM64
Data Analytics and Machine Learning: From Node to Cluster on ARM64Ganesh Raju
 
BKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to ClusterBKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to ClusterLinaro
 
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to ClusterBKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to ClusterLinaro
 
NETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCE
NETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCENETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCE
NETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCEcscpconf
 
Grokking TechTalk #29: Building Realtime Metrics Platform at LinkedIn
Grokking TechTalk #29: Building Realtime Metrics Platform at LinkedInGrokking TechTalk #29: Building Realtime Metrics Platform at LinkedIn
Grokking TechTalk #29: Building Realtime Metrics Platform at LinkedInGrokking VN
 
SAP on pay as you go model
SAP on pay as you go modelSAP on pay as you go model
SAP on pay as you go modelAjay Kumar Uppal
 
Spark Summit EU talk by Ahsan Javed Awan
Spark Summit EU talk by Ahsan Javed AwanSpark Summit EU talk by Ahsan Javed Awan
Spark Summit EU talk by Ahsan Javed AwanSpark Summit
 
Conquering the Lambda architecture in LinkedIn metrics platform with Apache C...
Conquering the Lambda architecture in LinkedIn metrics platform with Apache C...Conquering the Lambda architecture in LinkedIn metrics platform with Apache C...
Conquering the Lambda architecture in LinkedIn metrics platform with Apache C...Khai Tran
 
th1330-1410effectenbeurszaal4-3v2-140424180955-phpapp01 (1).pdf
th1330-1410effectenbeurszaal4-3v2-140424180955-phpapp01 (1).pdfth1330-1410effectenbeurszaal4-3v2-140424180955-phpapp01 (1).pdf
th1330-1410effectenbeurszaal4-3v2-140424180955-phpapp01 (1).pdfTarekHassan840678
 
Analyzing Data at Scale with Apache Spark
Analyzing Data at Scale with Apache SparkAnalyzing Data at Scale with Apache Spark
Analyzing Data at Scale with Apache SparkNicola Ferraro
 
Monetizing Big Data at Telecom Service Providers
Monetizing Big Data at Telecom Service ProvidersMonetizing Big Data at Telecom Service Providers
Monetizing Big Data at Telecom Service ProvidersDataWorks Summit
 
Monitizing Big Data at Telecom Service Providers
Monitizing Big Data at Telecom Service ProvidersMonitizing Big Data at Telecom Service Providers
Monitizing Big Data at Telecom Service ProvidersDataWorks Summit
 

Similar to 2012 CloudCom, RPig: A Scalable Framework for Machine Learning and Advanced Statistical Functionalities (20)

High Performance Spatial-Temporal Trajectory Analysis with Spark
High Performance Spatial-Temporal Trajectory Analysis with Spark High Performance Spatial-Temporal Trajectory Analysis with Spark
High Performance Spatial-Temporal Trajectory Analysis with Spark
 
Pivotal Real Time Data Stream Analytics
Pivotal Real Time Data Stream AnalyticsPivotal Real Time Data Stream Analytics
Pivotal Real Time Data Stream Analytics
 
Analytics Beyond RAM Capacity using R
Analytics Beyond RAM Capacity using RAnalytics Beyond RAM Capacity using R
Analytics Beyond RAM Capacity using R
 
NoSQL meetup July 2011
NoSQL meetup July 2011NoSQL meetup July 2011
NoSQL meetup July 2011
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
 
Big Data Analytics Platforms by KTH and RISE SICS
Big Data Analytics Platforms by KTH and RISE SICSBig Data Analytics Platforms by KTH and RISE SICS
Big Data Analytics Platforms by KTH and RISE SICS
 
Processing 19 billion messages in real time and NOT dying in the process
Processing 19 billion messages in real time and NOT dying in the processProcessing 19 billion messages in real time and NOT dying in the process
Processing 19 billion messages in real time and NOT dying in the process
 
Data Analytics and Machine Learning: From Node to Cluster on ARM64
Data Analytics and Machine Learning: From Node to Cluster on ARM64Data Analytics and Machine Learning: From Node to Cluster on ARM64
Data Analytics and Machine Learning: From Node to Cluster on ARM64
 
BKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to ClusterBKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to Cluster
 
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to ClusterBKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
 
NETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCE
NETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCENETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCE
NETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCE
 
Grokking TechTalk #29: Building Realtime Metrics Platform at LinkedIn
Grokking TechTalk #29: Building Realtime Metrics Platform at LinkedInGrokking TechTalk #29: Building Realtime Metrics Platform at LinkedIn
Grokking TechTalk #29: Building Realtime Metrics Platform at LinkedIn
 
SAP on pay as you go model
SAP on pay as you go modelSAP on pay as you go model
SAP on pay as you go model
 
Spark Summit EU talk by Ahsan Javed Awan
Spark Summit EU talk by Ahsan Javed AwanSpark Summit EU talk by Ahsan Javed Awan
Spark Summit EU talk by Ahsan Javed Awan
 
Conquering the Lambda architecture in LinkedIn metrics platform with Apache C...
Conquering the Lambda architecture in LinkedIn metrics platform with Apache C...Conquering the Lambda architecture in LinkedIn metrics platform with Apache C...
Conquering the Lambda architecture in LinkedIn metrics platform with Apache C...
 
th1330-1410effectenbeurszaal4-3v2-140424180955-phpapp01 (1).pdf
th1330-1410effectenbeurszaal4-3v2-140424180955-phpapp01 (1).pdfth1330-1410effectenbeurszaal4-3v2-140424180955-phpapp01 (1).pdf
th1330-1410effectenbeurszaal4-3v2-140424180955-phpapp01 (1).pdf
 
Analyzing Data at Scale with Apache Spark
Analyzing Data at Scale with Apache SparkAnalyzing Data at Scale with Apache Spark
Analyzing Data at Scale with Apache Spark
 
Monetizing Big Data at Telecom Service Providers
Monetizing Big Data at Telecom Service ProvidersMonetizing Big Data at Telecom Service Providers
Monetizing Big Data at Telecom Service Providers
 
Monitizing Big Data at Telecom Service Providers
Monitizing Big Data at Telecom Service ProvidersMonitizing Big Data at Telecom Service Providers
Monitizing Big Data at Telecom Service Providers
 

Recently uploaded

Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxnull - The Open Security Community
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 

Recently uploaded (20)

E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 

2012 CloudCom, RPig: A Scalable Framework for Machine Learning and Advanced Statistical Functionalities

  • 1. RPig: A Scalable Framework for Machine Learning and Advanced Statistical Functionalities. MingXue Wang, Sidath B. Handurukande, Mohamed Nassar. Network Management Lab, Ericsson Ireland. CloudCom 2012
  • 2. Ericsson | Page 2 Agenda › Context and technology background – Big data analytic for network management – Hadoop, Pig, R – Related work › RPig – RPig framework and RPig script – Case study › Conclusion
  • 4. Big data analytic in network management
      › Capability of Big data analytics
        – Service assurance
        – Predictive analysis
      › Large amount of network data
        – Thousands of cells, nodes
        – Millions of connected devices, terminals
        – Billions of sessions, events
      › Machine learning and advanced statistical algorithms
        – Network fault, KPI prediction
        – CDR, traffic data analysis
  • 5. RPig framework context
      [Diagram labels: Service Assurance; RPig; RPig execution platform; VoIP QoE alarm models; SVM based algorithm]
      › VoIP use case:
        – Input: Network KPIs (packet loss, jitter, delay, etc.)
        – Output: VoIP QoE alarms, triggers
        – Network KPIs -> Service KPIs -> Alarm events
  • 7. Hadoop and MapReduce
      [Diagram: Hadoop ecosystem. Hadoop DFS (Hadoop Distributed File System); Hadoop MapReduce (distributed parallel programming framework); Pig (data flow); Mahout (ML/DM); Hive (SQL); HBase (NoSQL); S4 (streaming); Hama (BSP); Giraph (graphs); Zookeeper (coordination); Ambari (management); Our Framework (ML/DM); …]
      › Big data management system
        – terabytes/petabytes of data
        – hundreds/thousands of nodes
      › MapReduce
        – map(k1,v1) -> list(k2,v2); reduce(k2,list(v2)) -> list(v3)
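The map and reduce signatures on slide 7 can be illustrated with a minimal single-machine sketch in Python. This is not Hadoop code: the `map_reduce` helper, the event records, and the per-client counting task are all hypothetical, and the grouping step stands in for the shuffle that a real cluster performs across nodes.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Run map, shuffle (group by key), and reduce in one process."""
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):   # map(k1, v1) -> list(k2, v2)
            groups[k2].append(v2)       # shuffle: collect all v2 for each k2
    # reduce(k2, list(v2)) -> list(v3)
    return {k2: reducer(k2, v2s) for k2, v2s in groups.items()}

# Hypothetical task: count events per client.
def count_mapper(line_no, event):
    return [(event["client"], 1)]

def count_reducer(client, ones):
    return [sum(ones)]

events = [
    (0, {"client": "Skype"}),
    (1, {"client": "SIP"}),
    (2, {"client": "Skype"}),
]
counts = map_reduce(events, count_mapper, count_reducer)
print(counts)
```

In Hadoop the mapper and reducer calls run on different nodes, which is why both sides are expressed purely in terms of key-value pairs.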
  • 8. Pig and Pig Latin
      › Pig - Big data management system
        – Similar to SQL in RDBMS
        – Pig Latin - A high level data flow language
          › Events = FILTER Events BY (client == 'Skype' OR ...);
        – Define data processing flows on unstructured raw data
        – Execution in MapReduce model
      › Other similar
        – JAQL from IBM, …
      › Pro: Scalable; distributed parallel processing
      › Con: Not for ML and advanced statistical functionalities
  • 9. R and R packages
      › R - Traditional statistical software
        – A software and language for statistical computing and advanced data analysis
        – Thousands of R packages
        – EMA calculation using the TTR package
          › library(TTR); results <- EMA(temp, 20)
      › Other similar:
        – Matlab, Weka, …
      › Pro: Sophisticated statistical algorithms for advanced analysis
        – Clustering, Regression, etc.
      › Con: Not scalable; data must be loaded into memory and run on a single computer
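For readers unfamiliar with the EMA (exponential moving average) that slide 9 computes with the TTR package, the following Python sketch shows one common formulation. It is an illustration only, not TTR's implementation: the smoothing ratio 2/(n+1) and the seeding with the simple average of the first n values are assumptions, and the `ema` function and `temp` data are hypothetical.

```python
def ema(values, n):
    """Exponential moving average over a window of n periods.

    Assumptions (not taken from the slides): smoothing ratio 2/(n+1),
    seeded with the simple mean of the first n values; the first n-1
    outputs are None because no average is defined yet.
    """
    if n < 1 or len(values) < n:
        raise ValueError("need at least n values and n >= 1")
    alpha = 2.0 / (n + 1)
    current = sum(values[:n]) / n          # seed: simple moving average
    out = [None] * (n - 1) + [current]
    for x in values[n:]:
        current = alpha * x + (1 - alpha) * current
        out.append(current)
    return out

temp = [10.0, 11.0, 12.0, 13.0, 14.0]      # hypothetical series
results = ema(temp, 3)
print(results)  # [None, None, 11.0, 12.0, 13.0]
```

Like the R call on the slide, this needs the whole series in memory on one machine, which is the scalability limit the slide names.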
  • 11. Related work - Extending R
      › Extending traditional statistical software
      › Scaling memory size
        – Use hard disk as external memory
        – E.g. RevoScaleR, bigmemory
      › Scaling storage size
        – Directly read/write data in large scale DMS
        – E.g. Ricardo, RJDBC, RMySQL
      › Scaling CPU power
        – MapReduce based (e.g. RHIPE, RHadoop)
          › Require manually designing complex map and reduce functions based on key-value pairs
        – Non MapReduce based (e.g. Rmpi, snow, cloudRmpi, Elastic-R)
          › Do not support parallel data read/write as Hadoop does
          › Require writing programs with complex MPI APIs
  • 12. Related work - Other solutions
      › Developing new frameworks
      › E.g.
        – Mahout
          › In a preliminary stage
          › Lacking many commonly used algorithms, e.g. SVM
          › Does not provide a high level language, such as R and Pig
        – SystemML
          › DML (a new ML language) is not as flexible as the R language
          › Lacking implementations of commonly used statistical algorithms
      › Con: Lacking algorithm implementations; no high level language support, or else a new language must be learned
  • 14. RPig framework
      › Our approach: “RPig”
        – Integrated framework
          › R + Pig
        – Integrated language
          › Fast algorithms development
        – Auto distributed parallel execution
      [Diagram labels: Development, Execution]
  • 15. RPig script
      › Pig prepares the data movement; R does the statistical tasks
      › RPigEditor
      [Screenshot labels: Pig operations, R function]
  • 17. Forecasting with EMA – case 1
      › Case scenario
        – Forecasting VoIP traffic in next time period
      › Design: Reduce the data size then use the EMA calculation
      › RPig implementation summary
        – Pig operations are used as pre-processing steps to summarize data
        – Use any statistical algorithm implementations of R directly on the summarized data, similar to the traditional single machine approach of R
      [Diagram labels: Raw events; Pig operations; Summarized events; R functions; output]
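The "reduce the data size, then apply EMA" design of case 1 can be sketched in plain Python as two stages. This is not RPig code: the `summarize` stage stands in for the Pig pre-processing and `ema_forecast` for the R function, and the event tuples, the 60-second period, and the use of the last smoothed value as the forecast are all assumptions made for illustration.

```python
from collections import Counter

def summarize(events, period_s):
    """Pre-processing stage (Pig's role on the slide): events per period.

    `events` is a non-empty list of (timestamp_seconds, client) tuples.
    """
    counts = Counter(ts // period_s for ts, _client in events)
    first, last = min(counts), max(counts)
    # Fill periods without events with 0 so the series has no gaps.
    return [counts.get(p, 0) for p in range(first, last + 1)]

def ema_forecast(series, n):
    """Statistical stage (R's role on the slide).

    Assumption: the last exponentially smoothed value, with ratio
    2/(n+1) and the first observation as seed, serves as the forecast.
    """
    alpha = 2.0 / (n + 1)
    level = float(series[0])
    for x in series[1:]:
        level = alpha * x + (1 - alpha) * level
    return level

# Hypothetical raw VoIP events: (timestamp in seconds, client).
raw_events = [(5, "Skype"), (20, "Skype"), (70, "Skype"),
              (130, "Skype"), (140, "Skype"), (150, "Skype")]
summary = summarize(raw_events, 60)    # [2, 1, 3] events per minute
forecast = ema_forecast(summary, 3)    # 2.25
print(summary, forecast)
```

Only the first stage touches the raw events, so it is the part that benefits from distributed execution; the summarized series is small enough for a single-machine statistical routine.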
  • 18. Reduced Development Effort
      › 15 configured nodes, 128 MB/block
      › Two approaches
        – Pig - implemented EMA in Java to extend Pig
        – RPig
      › Small overhead
      [Chart labels: Pig approach: > 100 lines of code; Our RPig approach: less than 10 lines of code]
  • 19. Prediction with SVM – case 2
      › Case scenario
        – Training a model for predicting Service KPIs based on Network KPIs
      › Design: Split the data into small SVM training tasks, then execute them in parallel
      › RPig implementation summary
        – Parallel or iterative statistical algorithms are expressed as parallel R executions in a Pig data flow
      [Diagram labels: Training data; Pig operations; Split training data (×3); R functions; output]
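The split-then-train-in-parallel design of case 2 can be sketched in Python as follows. Several things here are assumptions rather than details from the slides: the learner is a deliberately simple per-class mean classifier standing in for SVM training, the per-split models are combined by majority vote, threads stand in for the parallel R executions that Pig would schedule, and the single-feature training rows are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def split(rows, parts):
    """Round-robin split of the training data into `parts` smaller tasks."""
    return [rows[i::parts] for i in range(parts)]

def train(rows):
    """Stand-in learner (NOT an SVM): mean feature value per class."""
    sums = {}
    for x, label in rows:
        total, count = sums.get(label, (0.0, 0))
        sums[label] = (total + x, count + 1)
    return {label: total / count for label, (total, count) in sums.items()}

def predict(models, x):
    """Assumed combination rule: majority vote over the per-split models."""
    votes = [min(m, key=lambda label: abs(m[label] - x)) for m in models]
    return max(set(votes), key=votes.count)

# Hypothetical rows: (network KPI value, service KPI class).
training = [(0.1, "ok"), (0.2, "ok"), (0.3, "ok"), (0.4, "ok"),
            (5.0, "alarm"), (6.0, "alarm"), (7.0, "alarm"), (8.0, "alarm")]

# Each split is trained independently, so the tasks can run side by side.
with ThreadPoolExecutor(max_workers=2) as pool:
    models = list(pool.map(train, split(training, 2)))

print(predict(models, 0.5), predict(models, 6.5))  # ok alarm
```

Because each training task sees only its own split, no task needs data from another, which is what lets a data flow engine distribute them without extra coordination.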
  • 20. ML Scalability
      › Machine Learning (SVM training phase)
        – CPU intensive rather than I/O intensive
        – 6K training samples
  • 22. Conclusions
      › RPig
        – Scalable ML and statistical functionalities while minimizing the development effort
      › Big data analytic in a high level language
        – Without needing to learn new languages or APIs, or rewrite complex statistical algorithms
      › Parallelize executions automatically
        – Handling low level operations (data transformation, fault handling, etc.) itself
      › Future work
        – Will focus on minimizing the overhead and increasing the usability of our framework

Editor's Notes

  1. Scaling statistical analysis and machine learning on Hadoop for service assurance.
  2. For example, IBM and Microsoft each have their own alternative to Pig, and IBM has its own alternative to S4 (deduct). HStreaming (). Foundation layer.
  3. Pig allows data analysis flows, similar to SQL, to be defined on unstructured raw data stored in HDFS. Pig can automatically generate MapReduce functions from Pig scripts for scalable data processing.
  4. Real experiment results. Same training dataset, 10-fold cross-validation, one kernel, …