In many domains, such as Telecom, various scenarios require processing large amounts of data with statistical and machine learning algorithms. Considerable effort has gone into moving data management systems into MapReduce parallel processing environments, such as Hadoop and Pig. Nevertheless, these systems lack advanced machine learning and statistical analysis features. Frameworks such as Mahout, on top of Hadoop, support machine learning, but their implementations are at a preliminary stage. For example, Mahout does not provide Support Vector Machine (SVM) algorithms and it is difficult to use. On the other hand, traditional statistical software tools such as R, which contain comprehensive statistical algorithms for advanced analysis, are widely used. But such software can only run on a single computer, and therefore it is not scalable. In this paper, we propose an integrated solution, RPig, which combines the machine learning and statistical analysis capabilities of R with the parallel data processing capabilities of Pig. The RPig framework offers a scalable, advanced data analysis solution for machine learning and statistical analysis. Analysis jobs can be easily developed as RPig scripts in high-level languages. We describe the design and implementation of the RPig framework, together with an Eclipse-based RPigEditor. Using application scenarios from the Telecom domain, we show the usage of RPig and how the framework can significantly reduce the development effort. The results demonstrate the scalability of our framework and the simplicity of deployment for analysis jobs.
2012 CloudCom, RPig: A Scalable Framework for Machine Learning and Advanced Statistical Functionalities
1. RPig: A Scalable Framework for Machine Learning and Advanced Statistical Functionalities
MingXue Wang
Sidath B. Handurukande
Mohamed Nassar
Network Management Lab, Ericsson Ireland
CloudCom 2012
2. Ericsson | Page 2
Agenda
› Context and technology background
– Big data analytics for network management
– Hadoop, Pig, R
– Related work
› RPig
– RPig framework and RPig script
– Case study
› Conclusion
4. Ericsson | Page 4
Big data analytics in network management
› Capability of Big data analytics
– Service assurance
– Predictive analysis
› Large amount of network data
– Thousands of cells, nodes
– Millions of connected devices, terminals
– Billions of sessions, events
› Machine learning and advanced statistical algorithms
– Network fault, KPI prediction
– CDR, traffic data analysis
5. Ericsson | Page 5
RPig framework context
› [Diagram: Service Assurance applications on top of RPig and the RPig execution platform]
› VoIP use case: Network KPIs -> Service KPIs -> Alarm events
– Input: network KPIs (packet loss, jitter, delay, etc.)
– SVM-based algorithm with VoIP QoE alarm models
– Output: VoIP QoE alarms, triggers
7. Ericsson | Page 7
Hadoop and MapReduce
› Hadoop ecosystem (stack diagram)
– Our framework: ML/DM
– Pig: data flow
– Mahout: ML/DM
– Hive: SQL
– HBase: NoSQL
– S4: streaming
– Hama: BSP
– Giraph: graphs
– Ambari: management
– Zookeeper: coordination
– Hadoop MapReduce: distributed parallel programming framework
– Hadoop DFS: Hadoop Distributed File System
– …
› Big data management system
– terabytes/petabytes of data
– hundreds/thousands of nodes
› MapReduce
– map(k1, v1) -> list(k2, v2); reduce(k2, list(v2)) -> list(v3)
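The map/reduce signatures above can be illustrated with a minimal single-process sketch. This is not Hadoop code: `run_mapreduce`, `map_fn`, `reduce_fn` and the sample events are hypothetical names and data chosen for illustration, and the grouping step stands in for the shuffle phase that a real cluster performs across nodes.

```python
from collections import defaultdict

def map_fn(_key, event):
    # map(k1, v1) -> list(k2, v2): emit one (client, 1) pair per event.
    return [(event["client"], 1)]

def reduce_fn(_client, counts):
    # reduce(k2, list(v2)) -> list(v3): sum the counts for one client.
    return [sum(counts)]

def run_mapreduce(records, mapper, reducer):
    # Group intermediate pairs by key (the "shuffle" step), then reduce
    # each group. Everything runs in one process here.
    groups = defaultdict(list)
    for key, value in records:
        for k2, v2 in mapper(key, value):
            groups[k2].append(v2)
    return {k2: reducer(k2, values) for k2, values in groups.items()}

# Hypothetical input: (record id, event) pairs.
events = [
    (0, {"client": "Skype"}),
    (1, {"client": "Other"}),
    (2, {"client": "Skype"}),
]
counts = run_mapreduce(events, map_fn, reduce_fn)
```

Because each map call and each reduce call depends only on its own input, a framework is free to run them on different nodes, which is the property the slide refers to.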
8. Ericsson | Page 8
Pig and Pig Latin
› Pig - Big data management system
– Similar to SQL in RDBMS
– Pig Latin - A high level data flow language
› Events = FILTER Events BY (client == 'Skype' OR ...);
– Define data processing flows on unstructured raw data
– Execution in MapReduce model
› Other similar
– JAQL from IBM, …
› Pro: Scalable; Distributed parallel processing
› Con: Not for ML and advanced statistical functionalities
9. Ericsson | Page 9
R and R packages
› R - Traditional statistical software
– A software environment and language for statistical computing and advanced data analysis
– Thousands of R packages
– EMA calculation using the TTR package
› library(TTR); results <- EMA(temp, 20)
› Other similar:
– Matlab, Weka, …
› Pro: Sophisticated statistical algorithms for advanced analysis
– Clustering, regression, etc.
› Con: Not scalable; data must be loaded into memory and processed on a single computer
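For readers unfamiliar with the EMA (exponential moving average) call mentioned above, the calculation can be sketched as follows. This assumes the common convention of a smoothing factor 2/(n+1), seeded with the simple mean of the first n values; the TTR package may differ in details, and the function name `ema` and the sample series are illustrative only.

```python
def ema(values, n):
    """Exponential moving average with smoothing factor 2 / (n + 1).

    Assumption: the first n - 1 outputs are undefined (None) and the
    n-th output is the simple mean of the first n values.
    """
    if n < 1 or len(values) < n:
        raise ValueError("need at least n values and n >= 1")
    alpha = 2.0 / (n + 1)
    result = [None] * (n - 1)
    current = sum(values[:n]) / n  # seed with the simple moving average
    result.append(current)
    for x in values[n:]:
        # Each new value gets weight alpha; the history gets 1 - alpha.
        current = alpha * x + (1 - alpha) * current
        result.append(current)
    return result

series = [10.0, 12.0, 14.0, 16.0, 18.0]
smoothed = ema(series, 3)  # [None, None, 12.0, 14.0, 16.0]
```

The whole series has to be held in memory by one process, which is the limitation listed above.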
11. Ericsson | Page 11
Related work - Extending R
› Extending traditional statistical software
› Scaling memory size
– Use hard disk as external memory
– E.g. RevoScaleR, bigmemory
› Scaling storage size
– Directly read/write data in large scale DMS
– E.g. Ricardo, RJDBC, RMySQL
› Scaling CPU power
– MapReduce based (e.g. RHIPE, RHadoop)
› Require manually designing complex map and reduce functions based on key-value pairs
– Non-MapReduce based (e.g. Rmpi, snow, cloudRmpi, Elastic-R)
› Do not support parallel data reads/writes as Hadoop does
› Require writing programs with complex MPI APIs
12. Ericsson | Page 12
Related work - Other solutions
› Developing new frameworks
› E.g.
– Mahout
› In a preliminary stage
› Lacking many commonly used algorithms, e.g. SVM
› Does not provide a high-level language such as R or Pig
– SystemML
› DML (a new ML language) is not as flexible as the R language
› Lacks implementations of commonly used statistical algorithms
› Con: Lacking algorithm implementations; no high-level language support, or else a new language must be learned.
14. Ericsson | Page 14
RPig framework
› Our approach: “RPig”
– Integrated framework
› R + Pig
– Integrated language
› Fast algorithm development
– Automatic distributed parallel execution
› [Diagram: development and execution stages]
15. Ericsson | Page 15
RPig script
› Pig prepares the data movement; R does the statistical tasks
› RPigEditor
– [Screenshot: an RPig script with its Pig operations and R function labeled]
17. Ericsson | Page 17
Forecasting with EMA – case 1
› Case scenario
– Forecasting VoIP traffic in next time period
› Design: Reduce the data size then use the EMA calculation
› RPig Implementation summary
– Pig operations are used as pre-processing steps to summarize the data
– Any statistical algorithm implementation in R can be used directly on the summarized data, as in the traditional single-machine R approach
› [Diagram: Raw events -> Pig operations -> Summarized events -> R functions -> output]
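The "reduce the data size first" step can be sketched in plain Python. This is not RPig or Pig Latin: the function `summarize_events`, the (timestamp, client) event layout and the 60-second period are assumptions made for illustration. The point is that raw events are collapsed into one count per time period, so the series handed to the statistical function is small.

```python
from collections import Counter

def summarize_events(events, period_seconds, client="Skype"):
    """Collapse raw (timestamp, client) events into per-period counts.

    Plays the role the slide gives to Pig operations: filter, group by
    time period, count. Periods with no events are reported as 0.
    """
    counts = Counter(
        timestamp // period_seconds
        for timestamp, event_client in events
        if event_client == client
    )
    if not counts:
        return []
    first, last = min(counts), max(counts)
    return [counts.get(period, 0) for period in range(first, last + 1)]

# Hypothetical raw events: (timestamp in seconds, client name).
raw_events = [(5, "Skype"), (30, "Skype"), (61, "Other"),
              (65, "Skype"), (190, "Skype")]
series = summarize_events(raw_events, period_seconds=60)  # [2, 1, 0, 1]
```

The resulting `series` is what a single-machine statistical routine, such as an EMA forecast, would then consume.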
18. Ericsson | Page 18
Reduced Development Effort
› 15 configured nodes, 128 MB/block
› Two approaches
– Pig: EMA implemented in Java to extend Pig
– RPig
› Small overhead
Pig approach: > 100 lines of code
Our RPig approach: less than 10 lines of code
19. Ericsson | Page 19
Prediction with SVM – case 2
› Case scenario
– Training a model for predicting Service KPIs based on Network KPIs
› Design: Split the data into small SVM training tasks, then execute them in parallel
› RPig implementation summary
– Parallel or iterative statistical algorithms are expressed as parallel R executions in a Pig data flow
› [Diagram: Training data -> Pig operations -> several splits of the training data -> R functions (one per split) -> output]
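The split-then-train-in-parallel design can be sketched as below. Several parts are assumptions rather than the RPig implementation: a small linear SVM trained by subgradient descent on the hinge loss stands in for the R SVM routine, Python threads stand in for Pig/MapReduce tasks, the majority vote used to combine the sub-models is one possible choice that the slides do not specify, and the KPI values are made up.

```python
from concurrent.futures import ThreadPoolExecutor

def train_linear_svm(samples, epochs=300, lam=0.01, lr=0.05):
    # Stand-in learner: linear SVM via subgradient descent on the
    # L2-regularized hinge loss. Labels must be -1 or +1.
    w = [0.0] * len(samples[0][0])
    b = 0.0
    for _ in range(epochs):
        for x, y in samples:
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            w = [wi * (1 - lr * lam) for wi in w]  # regularization step
            if margin < 1:  # hinge loss is active for this sample
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def predict(model, x):
    w, b = model
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

def split(samples, parts):
    # Round-robin split so every part sees a mix of the data.
    return [samples[i::parts] for i in range(parts)]

def train_in_parallel(samples, parts):
    # One independent training task per split, run concurrently.
    with ThreadPoolExecutor(max_workers=parts) as pool:
        return list(pool.map(train_linear_svm, split(samples, parts)))

def vote(models, x):
    # Assumed combination rule: majority vote over the sub-models.
    return 1 if sum(predict(m, x) for m in models) >= 0 else -1

# Hypothetical network KPIs (packet loss, jitter); +1 = degraded service.
good = [(0.1, 0.2), (0.2, 0.1), (0.0, 0.3), (0.3, 0.0), (0.1, 0.1), (0.2, 0.2)]
bad = [(2.0, 2.1), (2.2, 1.9), (1.8, 2.3), (2.1, 2.0), (2.0, 1.8), (1.9, 2.2)]
samples = [s for g, d in zip(good, bad) for s in ((g, -1), (d, 1))]

models = train_in_parallel(samples, parts=3)
```

Each training task touches only its own split, so the tasks need no coordination while they run; only the final combination step sees all sub-models.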
20. Ericsson | Page 20
ML Scalability
› Machine Learning (SVM training phase)
– CPU intensive rather than I/O intensive
– 6K training samples
22. Ericsson | Page 22
Conclusions
› RPig
– Scalable ML and statistical functionalities while minimizing the development effort
› Big data analytics in a high-level language
– No need to learn new languages or APIs, or to rewrite complex statistical algorithms
› Executions are parallelized automatically
– The framework itself handles low-level operations (data transformation, fault handling, etc.)
› Future work
– Will focus on minimizing the overhead and increasing the usability of our framework
Editor's Notes
Scaling statistical analysis and machine learning on Hadoop for service assurance.
For example, IBM and Microsoft each have their own alternative to Pig, and IBM has its own alternative to S4 (cf. HStreaming).
Foundation layer
Pig allows defining data analysis flows, similar to SQL, on unstructured raw data stored in HDFS.
Pig can automatically generate MapReduce functions based on Pig scripts for scalable data processing.
Real experiment results.
Same training dataset, 10-fold cross-validation, one kernel, …