Hw09 Hadoop Based Data Mining Platform For The Telecom Industry
Presentation Transcript
Parallel Data Mining Platform in Telecom Industry -- Big Cloud based Parallel Data Mining Platform
Friday, Oct 2, 2009, NYC
Research Institute of China Mobile Communication Corporation
Feng Cao
Features compared between phase I and phase II
Conclusions and Future works
Large scale data in China Mobile Communication Corporation (CMCC)
Subscribers: 500 million
Subscribers' CDR (call detail record) data
5~8TB/day in CMCC
For a branch company (> 20 million subscribers)
Voice: 100 million * 1KB = 100GB/day
SMS: 100~200 million * 1KB = 100~200GB/day
Network signaling data, for a branch company (> 20 million subscribers)
GPRS signaling data: 48GB/day for a branch company
3G signaling data: 300GB/day for a branch company
voice, SMS signaling data, ……
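A quick check of the per-branch arithmetic above, assuming decimal units (1 KB = 1,000 bytes, 1 GB = 10^9 bytes) and the slide's round figure of 1 KB per record. The function and variable names are illustrative and do not come from the original slides.

```python
KB = 1_000            # bytes, decimal convention (assumption)
GB = 1_000_000_000    # bytes, decimal convention (assumption)

def daily_volume_gb(records_per_day, bytes_per_record=KB):
    """Return the daily data volume in GB for a given record count."""
    return records_per_day * bytes_per_record / GB

# Figures quoted on the slide for a branch company (> 20 million subscribers)
voice_gb = daily_volume_gb(100_000_000)      # 100 million voice CDRs
sms_low_gb = daily_volume_gb(100_000_000)    # lower bound of SMS records
sms_high_gb = daily_volume_gb(200_000_000)   # upper bound of SMS records
gprs_gb = 48                                 # GPRS signaling, as quoted
threeg_gb = 300                              # 3G signaling, as quoted

# Sum of the four listed sources only; other signaling data is not quantified
total_low_gb = voice_gb + sms_low_gb + gprs_gb + threeg_gb
total_high_gb = voice_gb + sms_high_gb + gprs_gb + threeg_gb
```

Summing only the four quantified sources gives roughly 548 to 648 GB/day for one branch company. The voice and SMS signaling data mentioned last is not quantified on the slide and is left out.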
Large Scale Data Applications and current solution
Analysis of User Behavior
Customer Churn Prediction
Service Association Analysis
Network QoS Analysis
Signaling Data Analysis
Service Optimization and Log Processing
Spam Message Filtering
Commercial database / data warehouse systems
Commercial Data Mining Tools
Most are running on Unix Servers, data stored in Storage Arrays
(Diagram labels: The Requirements; Current solution: Clementine, Enterprise Miner, Intelligent Miner)
BASS (Business Analysis Support System) is a BI system for CMCC to support enterprise decision-making, marketing management analysis, and sales.
BASS includes a data extraction layer, a data processing layer, a data display layer, and an application layer.
The main operations in the data processing layer are:
Data extraction from other systems
Processing based on the database system: most operations are performed inside the database, which amounts to ELT (Extract, Load, Transform) rather than ETL.
Challenges and limitations of BASS
Hardware investment is large, and expansion is costly.
62% of the investment is in hardware
Because specifications differ between Unix servers, expansion means buying entirely new Unix servers rather than simply adding a few more.
Management of the IT system is complex.
One Unix server cannot support a BASS; every branch subsystem has about 3-5 servers, such as an ETL server, database server, interface server, and display server.
The database is overloaded.
ELT puts heavy pressure on the database; in a branch company, one server cannot support all operations.
Data backup is not well supported
Offline data backup (5 branches) takes a lot of time, online data backup (8 branches) consumes a lot of resources, and file backup (18 branches) restores slowly
What is BC-PDM?
BC-PDM: Big Cloud based Parallel Data Mining Platform
A data mining solution for large-scale data analysis
Massive scalability - based on Hadoop
Low cost - commodity machines and free software
Customization - tailored to application requirements
Easy to use - similar user interface to commercial tools
Large Scale Storage
Large Scale Data Process
Large Scale Data Mining
Data mining App
Features of BC-PDM (I)
GUI - drag-and-drop operations for application model design
ETL (14 different ETL operations from 6 categories based on MapReduce)
Statistics, attribute processing, data sampling, query, data processing, redundant data processing
Data mining Algorithm (9 algorithms from 3 categories based on MapReduce)
Clustering, Classifier, Association Analysis
Simple data analysis and preview
ETL (25 more)
To simulate SQL operations: supports Join, Group by, Expression, Case when, Update, etc.
Data mining Algorithm (4 more)
Classifier, Sequence Association Analysis
Targeting general data analysis and data mining platform/tools
Features of BC-PDM(II)
Text, decision tree, pie chart, and histogram
Web based GUI
Provide SaaS mode for users
Data Transfer Tool
Provide data upload and download tools for SaaS
Multi-tenant and user groups for branches, ACLs for data access
Case I - MapReduce based ETL
Function - Redundancy Removal
Deletes duplicate records in the CDR data and keeps a single unique copy of each.
Input Data
Define the target fields (one or all)
MapTasker 1 ... MapTasker n: set the target fields to Key, other fields to Value
ReduceTasker 1 ... ReduceTasker m: reduce the same key, read from the value list and write once
Output Data
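The redundancy removal flow can be sketched as a small in-memory simulation of the map, shuffle, and reduce steps. This is a minimal sketch in plain Python, not the BC-PDM implementation (which the slides say runs on Hadoop MapReduce). The function names and the record layout (a dict of field names to values) are assumptions made for illustration.

```python
from collections import defaultdict

def map_record(record, target_fields):
    """Map step: target fields become the key, the remaining fields the value."""
    key = tuple(record[f] for f in target_fields)
    value = tuple((f, v) for f, v in sorted(record.items())
                  if f not in target_fields)
    return key, value

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_unique(key, values, target_fields):
    """Reduce step: read from the value list and write once (first value kept)."""
    record = dict(zip(target_fields, key))
    record.update(dict(values[0]))
    return record

def remove_redundancy(records, target_fields=None):
    """Keep one record per distinct combination of the target fields.

    With target_fields=None, all fields form the key, so only exact
    duplicates are removed.
    """
    if not records:
        return []
    if target_fields is None:
        target_fields = sorted(records[0])
    pairs = [map_record(r, target_fields) for r in records]
    return [reduce_unique(k, v, target_fields)
            for k, v in shuffle(pairs).items()]
```

Calling `remove_redundancy(cdrs)` removes exact duplicates, while `remove_redundancy(cdrs, ["caller"])` keeps one record per caller. Which record survives for a given key is a design choice; this sketch keeps the first one seen.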
To discover association rules in data. It iteratively generates candidate k-length item sets from frequent item sets of length k−1.
Input Data
MapTasker 1 ... MapTasker n: set the frequent (k-1)-length item sets to Key, occurrence count to Value
ReduceTasker 1 ... ReduceTasker m: reduce the same key, read from the value list and sum
Output Data
Output the rules that satisfy both the minimum support value and the minimum confidence value
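Key technical solution - parallel association rule algorithm - PApriori
Function: Apriori discovers associations between attributes with a strategy based on counting frequent item sets
Targets: 1) parallelize the search for frequent k-item sets; 2) correctness: results fully consistent with the serial results; 3) good scalability: TB-scale processing time on the order of thousands of seconds
Reference solution: serial Apriori algorithm
Our solution:
1) Use the Map/Reduce mechanism with a level-by-level iterative method to discover frequent item sets, parallelizing the search for each frequent k-item set;
2) Convert the data into intermediate Key/Value pairs for output: the key is a candidate k-item set and the value is the item set count; merge the data output by each processing node, and take the item sets that meet the minimum support threshold as frequent k-item sets;
3) Generate strong association rules from the frequent item sets, and output the association rules that meet the minimum confidence threshold.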
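The level-by-level Apriori iteration can be sketched as an in-memory simulation: each level runs one map step (emit a count of 1 for every candidate item set found in a transaction) and one reduce step (sum the counts per key), and rules are generated afterwards. This is an illustrative sketch in plain Python, not the PApriori code. The function names, the use of an absolute count as the support threshold, and the candidate generation strategy are all assumptions.

```python
from collections import Counter
from itertools import combinations

def map_count(transaction, candidates):
    """Map step: emit (candidate item set, 1) for each candidate in the transaction."""
    items = set(transaction)
    return [(c, 1) for c in candidates if set(c) <= items]

def reduce_sum(pairs):
    """Reduce step: sum the values emitted for each key."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return counts

def generate_candidates(frequent_prev, k):
    """Build candidate k-item sets whose (k-1)-subsets are all frequent."""
    items = sorted({i for s in frequent_prev for i in s})
    prev = set(frequent_prev)
    return [c for c in combinations(items, k)
            if all(sub in prev for sub in combinations(c, k - 1))]

def frequent_itemsets(transactions, min_support):
    """Iterate level by level; min_support is an absolute count (assumption)."""
    items = sorted({i for t in transactions for i in t})
    candidates = [(i,) for i in items]
    frequent = {}
    k = 1
    while candidates:
        pairs = [p for t in transactions for p in map_count(t, candidates)]
        counts = reduce_sum(pairs)
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        candidates = generate_candidates(level, k)
    return frequent

def association_rules(frequent, min_confidence):
    """Emit (lhs, rhs, confidence) for rules meeting the confidence threshold."""
    rules = []
    for itemset, support in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(itemset, r):
                confidence = support / frequent[lhs]
                if confidence >= min_confidence:
                    rhs = tuple(i for i in itemset if i not in lhs)
                    rules.append((lhs, rhs, confidence))
    return rules
```

Item sets are kept as sorted tuples so they can serve as keys, mirroring the Key/Value layout on the slide. In a real cluster each level would be a separate MapReduce job over the transaction data, with the candidate list distributed to every map task.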
Experiment Environment: Hardware and Software
Datanode: 1-way 4-core Xeon 2.5GHz / 8GB memory / 4 * 250GB SATA II
Namenode/JobTracker: 2-way 2-core AMD Opteron 2.6GHz / 16GB memory / 4 * 146GB SAS
Network: Gbps switch (now all 256 nodes connected on a 264-port switch)
OS: RHEL 5.2
Programming language: JDK 1.6 / Linux shell
Tools: Eclipse 3.3
Evaluation of BC-PDM(Phase I)
Compared to SPSS Clementine, it satisfies the application requirements
(16 nodes compared to a general Unix server)
Key Technology Evaluation
The performance of parallel ETL improves about 12 to 16 times
The performance of data mining ETL improves about 10 to 50 times
With 256 nodes, it can store, process, and mine data at the scale of hundreds of TB.
The performance of the 3 applications of Shanghai Branch Company improves 3 to 7 times
Parallel ETL has excellent scalability
Parallel data mining algorithm has good scalability
Data Mining Applications
User Cluster Analysis: find the differences between user groups and characterize the groups to support precision marketing
Service association: find the associations among new value-added services to work out how to recommend new services to customers
The BC-PDM framework integrates data mining applications on MapReduce and HDFS platform.
In BC-PDM phase I, we implemented 14 ETL operations and 9 data mining algorithms. Our practice and experimental results verified that data mining applications on MapReduce can handle large-scale data and effectively shorten response time.
To meet BASS's requirements, BC-PDM phase II supports SaaS mode and more features.
In phase II, we use Map chains to optimize performance, especially for sequences of operations.
BC-PDM phase II is under development and faces some challenges
Data privacy protection: if SaaS is offered to the public, security is most important.
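The slides do not describe the Map chain mechanism. One plausible reading is that consecutive record-level operations are composed and run back to back in a single map pass, rather than as separate jobs that each write intermediate output (Hadoop's ChainMapper class supports this pattern). The sketch below illustrates that reading in plain Python. The helper names, the example operations, and the record fields are all hypothetical.

```python
def chain_maps(*operations):
    """Compose record-level operations into one function for a single pass.

    Each operation takes a record and returns a record, or None to drop it.
    """
    def chained(record):
        for op in operations:
            record = op(record)
            if record is None:
                return None
        return record
    return chained

def run_map_pass(records, mapper):
    """Apply the mapper to every record once, skipping dropped records."""
    return [out for out in map(mapper, records) if out is not None]

# Hypothetical ETL operations on CDR-like records
def keep_voice(record):
    """Filter: keep only voice records."""
    return record if record["type"] == "voice" else None

def add_minutes(record):
    """Expression: derive call duration in minutes from seconds."""
    return {**record, "minutes": record["seconds"] / 60}

def project(record):
    """Projection: keep only the caller and the derived field."""
    return {"caller": record["caller"], "minutes": record["minutes"]}

etl_sequence = chain_maps(keep_voice, add_minutes, project)
```

Running `run_map_pass(records, etl_sequence)` reads the input once for all three operations. Only operations that work record by record can be chained this way; operations that need grouping, such as Join or Group by, still require a reduce step.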
The migration of online system from SQL to BC-PDM is difficult.
How to improve user-friendliness of BC-PDM
The workflow and API for designers are not as flexible as SQL.
ETL process, rather than ELT
Choose typical cases in BASS and use BC-PDM to implement the whole process from source data to the database layer
Output the results to the business database for the display layer, because the display layer needs SQL support
Use real data and compare with the real system
Apply the data mining results to the whole BI process
Check the results through marketing
People(cloud computing team from CMRI)
Collaborations are welcome! Thanks and Questions?
Cloud Computing E-Channel (in Chinese): http://labs.chinamobile.com/cloud