Hw09 Hadoop Based Data Mining Platform For The Telecom Industry

  • 1. Parallel Data Mining Platform in Telecom Industry: Big Cloud based Parallel Data Mining Platform
    • Friday, Oct 2, 2009, NYC
    • Research Institute of China Mobile Communication Corporation
    • Feng Cao
  • 2. Outline
    • Introduction
    • BC-PDM architecture
      • Architecture
      • Features compared between phase I and phase II
    • Conclusions and future work
      • Conclusions
      • Future work
  • 3. Large scale data in China Mobile Communication Corporation (CMCC)
    • Subscribers: 500 million
    • Subscribers’ CDR (call detail record) data
      • 5~8 TB/day in CMCC
      • For a branch company (> 20 million subscribers)
        • Voice: 100 million * 1 KB = 100 GB/day
        • SMS: 100~200 million * 1 KB = 100~200 GB/day
        • ……
    • Network signaling data, for a branch company (> 20 million subscribers)
      • GPRS signaling data: 48 GB/day
      • 3G signaling data: 300 GB/day
      • voice, SMS signaling data, ……
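
The per-branch CDR volumes on this slide follow from simple arithmetic. A minimal sketch that reproduces them, assuming decimal units (1 KB = 1000 bytes, 1 GB = 10^9 bytes) and the slide's roughly 1 KB per record; the function name `daily_volume_gb` is hypothetical:

```python
# Assumption: decimal units, which is what makes the slide's
# "100 million * 1 KB = 100 GB/day" come out exactly.
KB = 1000
GB = 1_000_000_000


def daily_volume_gb(records_per_day: int, bytes_per_record: int = KB) -> float:
    """Return the daily data volume in GB for a given record count."""
    return records_per_day * bytes_per_record / GB


# Figures taken from the slide for one branch company.
voice_gb = daily_volume_gb(100_000_000)     # voice CDRs
sms_low_gb = daily_volume_gb(100_000_000)   # SMS CDRs, lower bound
sms_high_gb = daily_volume_gb(200_000_000)  # SMS CDRs, upper bound

print(f"Voice: {voice_gb:.0f} GB/day")
print(f"SMS:   {sms_low_gb:.0f}~{sms_high_gb:.0f} GB/day")
```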
  • 4. Large Scale Data Applications and current solution
    • Precision marketing
      • Analysis of User Behavior
      • Customer Churn Prediction
      • Service Association Analysis
      • ……
    • Network Optimization
      • Network QoS Analysis
      • Signaling Data Analysis
      • ......
    • Service Optimization and Log Processing
      • Spam Message Filtering
      • ……
    • Current solution
      • Commercial database / data warehouse systems
      • Commercial data mining tools: Clementine, Enterprise Miner, Intelligent Miner
      • Most run on Unix servers, with data stored in storage arrays
  • 5. What’s BASS
    • BASS (Business Analysis Support System) is a BI system that supports CMCC's enterprise decision-making, marketing management analysis, and sales.
    • BASS consists of a data extraction layer, a data processing layer, a data display layer, and an application layer.
    • The main operations in the data processing layer are:
      • Data extraction from other systems
      • Data transformation
      • Data aggregation
      • Data statistics
    • BASS is built on a database system and most operations run inside the database, so it implements ELT (Extract, Load, Transform) rather than ETL.
  • 6. Challenges and limitations of BASS
    • Hardware investment is large, and expansion is expensive.
      • 62% of the investment goes to hardware
      • Because Unix servers differ in specifications, expanding capacity means buying entirely new Unix servers rather than simply adding a few more.
    • Managing the IT system is complex.
      • One Unix server cannot support a BASS; every branch subsystem has about 3-5 servers, such as an ETL server, a database server, an interface server, and a display server.
    • The database is overloaded.
      • ELT puts heavy pressure on the database; in a branch company, one server cannot support all operations.
    • Data backup is not well supported.
      • Offline backup (5 branches) takes a lot of time, online backup (8 branches) consumes a lot of resources, and file backup (18 branches) is slow to restore.
  • 7. What is the BC-PDM
    • BC-PDM: Big Cloud based Parallel Data Mining Platform
    • A data mining solution for large-scale data analysis
      • Massive scalability - based on Hadoop
      • Low cost - commodity machines and free software
      • Customization - tailored to application requirements
      • Easy to use - similar user interface to commercial tools
  • 8. BC-PDM Architecture
    • Large Scale Storage
    • High performance
    • High Availability
    • Low Price
    • Large Scale Data Process
    • Large Scale Data Mining
    • Excellent scalability
    [Architecture diagram; recoverable labels: DE, DT, Data mining App]
  • 9. Features of BC-PDM (I)
    • BC-PDM(phase I)
      • Workflow management
        • GUI - drag-and-drop design of application models
        • Job Monitoring
        • Flow Configuration
      • ETL (14 different ETL operations in 6 categories, based on MapReduce)
        • Statistics, attribute processing, data sampling, query, data processing, redundant data processing
      • Data mining algorithms (9 algorithms in 3 categories, based on MapReduce)
        • Clustering, Classification, Association Analysis
    • BC-PDM(phase II)
      • DE (Data Exploration)
        • Simple data analysis and preview
      • ETL (25 more)
        • Simulates SQL operations: Join, Group By, Expression, Case When, Update, etc.
      • Data mining algorithms (4 more)
        • Classification, Sequence Association Analysis
    • Targeting general data analysis and data mining platform/tools
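
The slide says the phase II ETL operations simulate SQL on MapReduce but gives no implementation detail. As an illustration only, here is a minimal single-process sketch of how a `GROUP BY ... SUM` could be expressed as a map step, a shuffle, and a reduce step. All function and field names are hypothetical and are not taken from BC-PDM:

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_group(row: dict, key_field: str, value_field: str) -> Tuple[str, int]:
    """Map step: emit (grouping key, value to aggregate) for one row."""
    return row[key_field], row[value_field]


def reduce_sum(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: sum all values that share one grouping key."""
    return key, sum(values)


def group_by_sum(rows: Iterable[dict], key_field: str, value_field: str) -> Dict[str, int]:
    """Simulate SELECT key, SUM(value) ... GROUP BY key in one process."""
    shuffled = defaultdict(list)  # stands in for the framework's shuffle phase
    for row in rows:
        key, value = map_group(row, key_field, value_field)
        shuffled[key].append(value)
    return dict(reduce_sum(key, values) for key, values in shuffled.items())


# Hypothetical call records: total minutes per user.
calls = [
    {"user": "u1", "minutes": 3},
    {"user": "u2", "minutes": 5},
    {"user": "u1", "minutes": 4},
]
print(group_by_sum(calls, "user", "minutes"))
```

On a real cluster the shuffle is done by the framework and each reducer sees only its own keys; the dictionary here merely stands in for that step.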
  • 10. Features of BC-PDM(II)
    • BC-PDM(phase I)
      • Visualization
        • Text, decision tree, pie chart, and histogram
    • BC-PDM(phase II)
      • Web based GUI
        • Provide SaaS mode for users
      • Data Transfer Tool
        • Provide data upload and download tools for SaaS
      • Security
        • Multi-tenant user groups for branches, ACLs for data access
    • Targeting general data analysis and data mining platform/tools
  • 11. Case I - MapReduce-based ETL
    • Function: Redundancy Removal
      • Deletes identical records in a CDR data set and keeps a single copy of each.
    • Data flow
      • Input data: define the target fields (one or all)
      • Map tasks 1..n: set the target fields as the key and the other fields as the value
      • Reduce tasks 1..m: for each key, read the value list and write one record
      • Output data
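  • 12. Key Technical Solution - Parallel ETL - Redundancy Removal
    • Function
      • The redundancy removal operation deletes records that are exactly identical to another record anywhere in the data set, keeping only one copy of each.
    • Targets
      • 1) Parallelize redundancy removal on data tables
      • 2) Results are exactly the same as the serial results
      • 3) Near-linear speedup; TB-scale data is processed in the order of a thousand seconds
    • Reference approach
      • Serial redundancy removal in a database
    • Our approach
      • 1) Map splits the data into blocks, one block per processing node. The map input key is the default (the offset of each line) and the value is the text of that line, so each line in a block is read in turn. The map task emits intermediate <key, value> pairs in which the key is the text of the whole line and the value is empty.
      • 2) For records with the same key, reduce outputs the whole line as the key with an empty value, so identical records are kept only once. The reduce output is stored in the distributed file system.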
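
The redundancy-removal flow described on slides 11 and 12 can be sketched in a single process as follows. This is an illustration under stated assumptions, not the BC-PDM code: records are assumed to be comma-separated text lines, the function names are hypothetical, and when target fields are given the value carries the whole record (a simplification of "other fields as value") so that the reducer can write it back unchanged.

```python
from collections import defaultdict
from typing import Iterable, List, Optional, Sequence, Tuple


def map_record(line: str, target_fields: Optional[Sequence[int]] = None,
               sep: str = ",") -> Tuple[str, str]:
    """Map step: emit (key, value) for one input line.

    With no target fields the whole line is the key and the value is empty,
    as on slide 12. Otherwise the key is built from the target fields and
    the value carries the full record (an assumption made for simplicity).
    """
    if target_fields is None:
        return line, ""
    fields = line.split(sep)
    return sep.join(fields[i] for i in target_fields), line


def reduce_key(key: str, values: List[str]) -> str:
    """Reduce step: write one record per key, ignoring the duplicates."""
    first = values[0]
    return first if first else key


def remove_redundancy(lines: Iterable[str],
                      target_fields: Optional[Sequence[int]] = None,
                      sep: str = ",") -> List[str]:
    """Run map, shuffle, and reduce in one process."""
    shuffled = defaultdict(list)  # stands in for the framework's shuffle phase
    for line in lines:
        key, value = map_record(line, target_fields, sep)
        shuffled[key].append(value)
    return [reduce_key(key, values) for key, values in shuffled.items()]


# Hypothetical records: field 0 is a subscriber ID.
records = ["a,1,x", "b,2,y", "a,1,x", "a,9,z"]
print(remove_redundancy(records))                     # exact duplicates removed
print(remove_redundancy(records, target_fields=[0]))  # one record per subscriber
```

Because the key alone decides which records meet at the same reducer, choosing all fields removes only exact duplicates, while choosing a subset keeps one record per distinct value of those fields.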
  • 13. Case II - MapReduce-based DM Algorithm
    • Function: Association
      • Discovers association rules in data. It iteratively generates candidate itemsets of length k from frequent itemsets of length k−1.
    • Data flow
      • Input data
      • Map tasks 1..n: set the frequent (k−1)-length itemsets as the key and their occurrence counts as the value
      • Reduce tasks 1..m: for each key, read the value list and sum the counts
      • Output data: rules that satisfy both the minimum support and the minimum confidence
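  • 14. Key Technical Solution - Parallel Association Rule Algorithm - PApriori
    • Function
      • Apriori discovers associations between attributes by counting frequent itemsets.
    • Targets
      • 1) Parallelize the search for frequent k-itemsets
      • 2) Results are exactly the same as the serial results
      • 3) Excellent scalability; TB-scale data is processed in the order of a thousand seconds
    • Reference approach
      • Serial Apriori algorithm
    • Our approach
      • 1) Use Map/Reduce in a level-by-level iteration to find frequent itemsets, parallelizing the search for the frequent k-itemsets at each level.
      • 2) Convert the data into intermediate key/value pairs: the key is a candidate k-itemset and the value is its count. Merge the output of all processing nodes; itemsets that meet the minimum support threshold become the frequent k-itemsets.
      • 3) Generate strong association rules from the frequent itemsets and output the rules that meet the minimum confidence threshold.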
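
The level-wise scheme on slides 13 and 14 can be sketched as below. This is a single-process illustration, not the PApriori implementation: each element of `splits` stands in for the data block of one map task, `min_support` is an absolute count, candidate generation is a brute-force version of the usual join-and-prune step, and all names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple

Itemset = FrozenSet[str]


def map_count(transactions: Iterable[Set[str]], candidates: List[Itemset]) -> Counter:
    """Map step for one split: count how often each candidate occurs."""
    counts = Counter()
    for transaction in transactions:
        for candidate in candidates:
            if candidate <= transaction:
                counts[candidate] += 1
    return counts


def reduce_sum(partials: Iterable[Counter], min_support: int) -> Dict[Itemset, int]:
    """Reduce step: sum the per-split counts and keep the frequent itemsets."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return {c: n for c, n in total.items() if n >= min_support}


def gen_candidates(frequent: Set[Itemset], k: int) -> List[Itemset]:
    """Build k-itemsets whose (k-1)-subsets are all frequent (brute force)."""
    items = sorted({item for itemset in frequent for item in itemset})
    return [frozenset(combo) for combo in combinations(items, k)
            if all(frozenset(sub) in frequent for sub in combinations(combo, k - 1))]


def papriori(splits: List[List[Set[str]]], min_support: int,
             min_confidence: float) -> Tuple[Dict[Itemset, int], list]:
    """Return (support counts of frequent itemsets, strong rules)."""
    items = {item for split in splits for t in split for item in t}
    candidates = [frozenset([item]) for item in sorted(items)]
    support: Dict[Itemset, int] = {}
    k = 1
    while candidates:
        partials = [map_count(split, candidates) for split in splits]  # one per map task
        frequent = reduce_sum(partials, min_support)
        support.update(frequent)
        k += 1
        candidates = gen_candidates(set(frequent), k)
    rules = []
    for itemset, count in support.items():
        for size in range(1, len(itemset)):
            for lhs_items in combinations(sorted(itemset), size):
                lhs = frozenset(lhs_items)
                confidence = count / support[lhs]
                if confidence >= min_confidence:
                    rules.append((lhs, itemset - lhs, confidence))
    return support, rules
```

A small usage example with hypothetical service names: calling `papriori([[{"sms", "gprs"}, {"sms"}], [{"sms", "gprs"}]], min_support=2, min_confidence=0.6)` finds `{sms}`, `{gprs}`, and `{sms, gprs}` as frequent itemsets and derives rules in both directions. Each pass over the data corresponds to one MapReduce job, so the number of jobs grows with the length of the longest frequent itemset.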
  • 15. Experiment Environment
    • Hardware
      • 256 nodes
      • Datanode: 1-way 4-core Xeon 2.5 GHz / 8 GB memory / 4 * 250 GB SATA II
      • Namenode/JobTracker: 2-way 2-core AMD Opteron 2.6 GHz / 16 GB memory / 4 * 146 GB SAS
      • Network: Gbps switch (all 256 nodes are now connected to one 264-port switch)
    • Software
      • OS: RHEL 5.2
      • Hadoop 0.19.1
      • Programming languages: JDK 1.6 / Linux shell
      • Tools: Eclipse 3.3
  • 16. Evaluation of BC-PDM(Phase I)
    • Correctness
      • Compared with SPSS Clementine, the results satisfy the application requirements
    • Performance (16 nodes compared with a general Unix server)
      • Key Technology Evaluation
        • The performance of parallel ETL improves by about 12 to 16 times
        • The performance of data mining ETL improves by about 10 to 50 times
        • With 256 nodes, it can store, process, and mine data at the scale of hundreds of TB.
      • Typical Application
        • The performance of the 3 applications of the Shanghai Branch Company improves by 3 to 7 times
    • Scalability
      • Parallel ETL has excellent scalability
      • Parallel data mining algorithms have good scalability
    • Data Mining Applications
      • User Cluster Analysis: find the differences between user groups and characterize the groups to support precision marketing
      • Service Association: find associations among new value-added services to work out how to recommend new services to customers
  • 17. Conclusions
    • The BC-PDM framework integrates data mining applications on the MapReduce and HDFS platform.
    • In BC-PDM phase I, we implemented 14 ETL operations and 9 data mining algorithms. Our practice and experimental results verify that data mining applications on MapReduce can handle large-scale data and effectively shorten response time.
    • To meet BASS’s requirements, BC-PDM phase II supports SaaS mode and more features.
    • In phase II, we use Map chains to optimize performance, especially for sequences of operations.
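
The slides do not describe how the Map chain is built. One plausible reading, similar in spirit to Hadoop's `ChainMapper`, is that several per-record map steps run back to back inside a single map task, so a sequence of operations does not write intermediate results between steps. A minimal sketch under that assumption; the three example steps and all names are hypothetical:

```python
from typing import Callable, Iterable, List, Tuple

Record = Tuple[str, str]
MapFn = Callable[[Record], Iterable[Record]]


def chain_maps(*steps: MapFn) -> MapFn:
    """Compose map steps so each record flows through all of them in one pass."""
    def chained(record: Record) -> List[Record]:
        current = [record]
        for step in steps:
            # Feed every output of the previous step into the next step.
            current = [out for rec in current for out in step(rec)]
        return current
    return chained


def drop_empty(rec: Record) -> List[Record]:
    """Filter step: discard records whose value is blank."""
    _, value = rec
    return [rec] if value.strip() else []


def normalize(rec: Record) -> List[Record]:
    """Transform step: trim and lower-case the value."""
    key, value = rec
    return [(key, value.strip().lower())]


def split_words(rec: Record) -> List[Record]:
    """Expand step: emit one (word, "1") pair per word in the value."""
    _, value = rec
    return [(word, "1") for word in value.split()]


pipeline = chain_maps(drop_empty, normalize, split_words)
print(pipeline(("0", "  Voice SMS ")))
```

Run as three separate jobs, the same sequence would write and re-read the intermediate data twice; chaining keeps it in memory within one task.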
  • 18. Future work
    • BC-PDM phase II is under development and faces some challenges
      • Data privacy protection: if the SaaS is offered to the public, security is the most important issue.
      • Migrating online systems from SQL to BC-PDM is difficult.
      • How to improve the user-friendliness of BC-PDM
      • The workflow and API for designers are not as flexible as SQL.
    • General evaluation
      • Correctness
      • Performance
      • Scalability
    • Application evaluation
      • ETL process, rather than ELT
        • Choose typical cases in BASS and use BC-PDM to implement the whole process from source data to the database layer
        • Output the results to the business database for the display layer, because the display layer needs SQL support
        • Use real data and compare with the real system
      • Data mining
        • Apply the data mining results to the whole BI process
        • Verify the results through marketing
  • 19. People(cloud computing team from CMRI)
    • Shaoling Sun
    • Zhiguo Luo
    • Meng Xu
    • Dan Gao
    • Chao Deng
    • Ling Qian
    • Jinyu Han
    • Leitao Guo
    • Xu Wang
    • Zhihong Zhang
    • Ji Qi
    • Min Hu
    • Hongwei Sun
    • Peng Zhao
  • 20. Collaborations are welcome! Thanks and Questions?
    • fengcao@chinamobile.com
    • luozhiguo@chinamobile.com
    • Cloud Computing E-Channel (in Chinese): http://labs.chinamobile.com/cloud