2012 CloudCom, RPig: A Scalable Framework for Machine Learning and Advanced Statistical Functionalities

RPig: A Scalable Framework for Machine Learning and Advanced Statistical Functionalities
MingXue Wang
Sidath B. Handurukande
Mohamed Nassar
Network Management Lab, Ericsson Ireland
CloudCom 2012

Ericsson | Page 2
Agenda
› Context and technology background
– Big data analytics for network management
– Hadoop, Pig, R
– Related work
› RPig
– RPig framework and RPig script
– Case study
› Conclusion

Big data analytics in network management
› Capabilities of big data analytics
– Service assurance
– Predictive analysis
› Large amounts of network data
– Thousands of cells, nodes
– Millions of connected devices, terminals
– Billions of sessions, events
› Machine learning and advanced statistical algorithms
– Network fault, KPI prediction
– CDR, traffic data analysis

RPig framework context
› Layers (figure): Service Assurance; RPig; RPig execution platform
› VoIP use case: Network KPIs -> Service KPIs -> Alarm events
– Input: network KPIs (packet loss, jitter, delay, etc.)
– VoIP QoE alarm models, SVM-based algorithm
– Output: VoIP QoE alarms, triggers

Hadoop and MapReduce
› Hadoop stack (figure)
– Hadoop DFS: Hadoop Distributed File System
– Hadoop MapReduce: distributed parallel programming framework
– Zookeeper: coordination
– Pig: data flow
– Mahout: ML/DM
– Hive: SQL
– HBase: NoSQL
– S4: streaming
– Hama: BSP
– Giraph: graphs
– Ambari: management
– …
– Our framework: ML/DM
› Big data management system
– Terabytes/petabytes of data
– Hundreds/thousands of nodes
› MapReduce
– map(k1, v1) -> list(k2, v2); reduce(k2, list(v2)) -> list(v3)
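The map and reduce signatures above can be illustrated with a minimal single-process sketch in Python. This is not Hadoop code: `run_mapreduce`, `map_event`, `reduce_count` and the "timestamp,client" record layout are hypothetical names and data chosen only to show how the key/value types flow through the two phases.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Run map, shuffle (group by key) and reduce in one process."""
    groups = defaultdict(list)
    for k1, v1 in records:
        # map(k1, v1) -> list(k2, v2)
        for k2, v2 in mapper(k1, v1):
            groups[k2].append(v2)
    output = []
    for k2 in sorted(groups):
        # reduce(k2, list(v2)) -> list(v3)
        output.extend(reducer(k2, groups[k2]))
    return output

def map_event(offset, line):
    # Hypothetical record layout: "timestamp,client"
    _timestamp, client = line.split(",")
    yield client, 1

def reduce_count(client, counts):
    # Each v3 here is a (client, total) pair
    yield client, sum(counts)

events = [(0, "10:00,Skype"), (1, "10:01,SIP"), (2, "10:02,Skype")]
result = run_mapreduce(events, map_event, reduce_count)
print(result)
```

On a real cluster the grouping step is the distributed shuffle, and mappers and reducers run on many nodes; the sketch only preserves the shape of the two functions.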

Pig and Pig Latin
› Pig - Big data management system
– Similar to SQL in RDBMS
– Pig Latin - A high-level data flow language
› Events = FILTER Events BY (client == 'Skype' OR ...);
– Defines data processing flows on unstructured raw data
– Executed under the MapReduce model
› Other similar systems
– JAQL from IBM, …
› Pro: Scalable; distributed parallel processing
› Con: Not designed for ML and advanced statistical functionality

R and R packages
› R - Traditional statistical software
– A software environment and language for statistical computing and advanced data analysis
– Thousands of R packages
– EMA calculation using the TTR package
› library(TTR); results <- EMA(temp, 20)
› Other similar tools:
– Matlab, Weka, …
› Pro: Sophisticated statistical algorithms for advanced analysis
– Clustering, regression, etc.
› Con: Not scalable; data must be loaded into memory and processed on a single computer
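To make the EMA calculation concrete, here is a minimal Python sketch of an exponential moving average over an in-memory series. It is a stand-in for illustration, not the TTR implementation: the seeding rule (simple mean of the first n values) and the smoothing ratio 2/(n+1) are assumptions chosen to resemble the usual EMA convention, and the function name `ema` is hypothetical.

```python
def ema(values, n):
    """Exponential moving average with window n.

    Assumptions (not taken from the slides): the first n-1 outputs are
    undefined (None), the n-th output is the simple mean of the first n
    values, and later outputs use the smoothing ratio 2 / (n + 1).
    """
    if n < 1 or n > len(values):
        raise ValueError("n must be between 1 and len(values)")
    ratio = 2.0 / (n + 1)
    out = [None] * (n - 1)
    current = sum(values[:n]) / n
    out.append(current)
    for x in values[n:]:
        current = ratio * x + (1.0 - ratio) * current
        out.append(current)
    return out

smoothed = ema([1.0, 2.0, 3.0, 4.0, 5.0], 3)
print(smoothed)
```

Like the R one-liner, this needs the whole series in memory on one machine, which is the limitation the slide points out.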

Related work - Extending R
› Extending traditional statistical software
› Scaling memory size
– Use hard disk as external memory
– E.g. RevoScaleR, bigmemory
› Scaling storage size
– Directly read/write data in large scale DMS
– E.g. Ricardo, RJDBC, RMySQL
› Scaling CPU power
– MapReduce based (e.g. RHIPE, RHadoop)
› Require manually designing complex key-value-pair-based map and reduce functions
– Non-MapReduce based (e.g. Rmpi, snow, cloudRmpi, Elastic-R)
› Do not support parallel data read/write as Hadoop does
› Require writing programs against complex MPI APIs

Related work - Other solutions
› Developing new frameworks
› E.g.
– Mahout
› In a preliminary stage
› Lacking many commonly used algorithms, e.g. SVM
› Does not provide a high-level language such as R or Pig
– SystemML
› DML (a new ML language) is not as flexible as the R language
› Lacks implementations of commonly used statistical algorithms
› Con: Lacking algorithm implementations; no high-level language support, or else a new language must be learned

RPig framework
› Our approach: “RPig”
– Integrated framework
› R + Pig
– Integrated language
› Fast algorithm development
– Automatic distributed parallel execution
[Figure labels: Development, Execution]

RPig script
› Pig handles the data movement; R performs the statistical tasks
› RPigEditor
[Figure: RPig script, annotated with "Pig operations" and "R function"]

Forecasting with EMA – case 1
› Case scenario
– Forecasting VoIP traffic in the next time period
› Design: Reduce the data size, then apply the EMA calculation
› RPig implementation summary
– Pig operations are used as pre-processing steps to summarize the data
– Any R statistical algorithm implementation can be applied directly to the summarized data, as in the traditional single-machine use of R
Raw events -> (Pig operations) -> Summarized events -> (R functions) -> Output
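The two-stage design of this case (summarize first, then run a single-machine statistical routine on the small result) can be sketched in Python as follows. This is an illustration only, not RPig code: `summarize` plays the role of the Pig pre-processing step and `ema_forecast` the role of the R function; both names, the (timestamp, client) event layout, and the EMA seeding rule are assumptions.

```python
from collections import Counter

def summarize(raw_events, period_seconds):
    """Pre-processing stand-in: count events per time period, in time order."""
    if not raw_events:
        raise ValueError("no events to summarize")
    counts = Counter(ts // period_seconds for ts, _client in raw_events)
    first, last = min(counts), max(counts)
    # Periods with no events count as zero traffic
    return [counts.get(bucket, 0) for bucket in range(first, last + 1)]

def ema_forecast(series, n):
    """Statistical stand-in: the last EMA value is the next-period forecast.

    Assumes the EMA is seeded with the mean of the first n values and
    uses the smoothing ratio 2 / (n + 1).
    """
    if n < 1 or n > len(series):
        raise ValueError("n must be between 1 and len(series)")
    ratio = 2.0 / (n + 1)
    level = sum(series[:n]) / n
    for x in series[n:]:
        level = ratio * x + (1.0 - ratio) * level
    return level

# Hypothetical raw events: (timestamp in seconds, client)
raw_events = [(0, "Skype"), (10, "Skype"), (65, "Skype"), (70, "Skype"),
              (75, "Skype"), (130, "Skype"), (190, "Skype"), (200, "Skype"),
              (210, "Skype"), (215, "Skype")]

series = summarize(raw_events, 60)   # large input, small output
forecast = ema_forecast(series, 2)   # runs on the summarized data only
print(series, forecast)
```

The point of the split is that only `summarize` has to scale with the raw event volume; the forecasting step sees one number per time period.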

Reduced Development Effort
› 15 configured nodes, 128 MB/block
› Two approaches
– Pig: EMA implemented in Java to extend Pig (more than 100 lines of code)
– Our RPig approach: fewer than 10 lines of code
› Small overhead

Prediction with SVM – case 2
› Case scenario
– Training a model for predicting Service KPIs based on Network
KPIs
› Design: Split the data into small SVM training tasks, then execute them in parallel
› RPig implementation summary
– Parallel or iterative statistical algorithms are expressed as parallel R executions in a Pig data flow
Training data -> (Pig operations) -> Split training data (several splits) -> (R functions, one per split) -> Output
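The split-then-train-in-parallel design can be sketched in Python as below. Several things here are assumptions rather than anything stated on the slide: the trainer is a small linear SVM fitted by subgradient descent on the hinge loss, with no bias term (features are assumed centered); the per-split models are combined by majority vote; and the feature values are made-up standardized KPIs. Threads are used only to keep the sketch portable; on a cluster each split would be a separate task.

```python
from concurrent.futures import ThreadPoolExecutor

def train_linear_svm(samples, epochs=20, lr=0.01, lam=0.01):
    """Linear SVM without bias, trained by hinge-loss subgradient descent.

    samples: list of (features, label) with label in {-1, +1}.
    """
    w = [0.0] * len(samples[0][0])
    for _ in range(epochs):
        for x, y in samples:
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            # L2 regularization shrinks the weights on every step
            w = [wi * (1.0 - lr * lam) for wi in w]
            if margin < 1.0:
                # Margin violated: move the weights toward y * x
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
    return w

def split(data, k):
    """Round-robin split of the training data into k parts."""
    return [data[i::k] for i in range(k)]

def predict(models, x):
    """Majority vote over the per-split models (an assumed combination rule)."""
    votes = sum(1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1
                for w in models)
    return 1 if votes > 0 else -1

# Hypothetical standardized network KPIs -> service KPI label (+1 degraded, -1 ok)
training_data = [
    ([1.2, 0.8], 1), ([-1.0, -0.9], -1), ([0.9, 1.5], 1), ([-1.4, -0.6], -1),
    ([1.1, 1.1], 1), ([-0.7, -1.3], -1), ([1.6, 0.5], 1), ([-0.5, -1.6], -1),
    ([0.8, 0.9], 1), ([-1.2, -1.1], -1), ([1.3, 1.4], 1), ([-0.9, -0.8], -1),
]

# Each split is trained independently, so the tasks can run in parallel
with ThreadPoolExecutor(max_workers=3) as pool:
    models = list(pool.map(train_linear_svm, split(training_data, 3)))

print(predict(models, [1.0, 0.5]), predict(models, [-0.8, -1.2]))
```

Because the training tasks share no state, they parallelize without coordination; how the resulting models are combined is a separate design choice that the slide does not specify.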

ML Scalability
› Machine Learning (SVM training phase)
– CPU intensive rather than I/O intensive
– 6K training samples

Conclusions
› RPig
– Scalable ML and statistical functionality with minimal development effort
› Big data analytics in a high-level language
– No need to learn new languages or APIs, or to rewrite complex statistical algorithms
› Parallelizes execution automatically
– Handles low-level operations (data transformation, fault handling, etc.) itself
› Future work
– Will focus on minimizing the overhead and increasing the usability of our framework
