Hadoop World 2011: Large Scale Log Data Analysis for Marketing in NTT Communications
 

In this session we will talk about how we built a log analysis system for marketing using Hadoop. The system explores internet users' interests in, and feedback about, specified products or themes using access logs, query/click logs, and CGM data. It provides three features: 1) sentiment analysis, 2) co-occurring keyword extraction, and 3) user interest estimation. For large-scale analysis, we use Hadoop with customized functions that reduce the shuffle size by doing more of the processing on the map side. We also show the features of our Hadoop cluster.

    Presentation Transcript

    • Large-Scale Log Analysis for Marketing. Kenji Hara / Yukio Uematsu, Innovative IP Architecture Center, NTT Communications Corporation. Copyright © 2011 NTT Communications Co., Ltd. All Rights Reserved.
    • Company Overview
      • Name: NTT Communications
      • Headquarters: Tokyo, Japan
      • Revenue: USD 12.9B (March 2011; USD 1 = JPY 80)
      • Employees: 8,250 (June 2011)
      • Business areas
        – International communication
        – Internet provider
        – System integration
        – Cloud services
      • History
        – 1952: NTT is established
        – 1987: NTT goes public (Tokyo Stock Exchange: 9432)
        – 1999: NTT Communications is spun off from NTT and incorporated (May 28, 1999)
    • NTT Group, NTT Communications Corporate Structure
      • [Organization chart] Nippon Telegraph and Telephone group companies, shown with ownership share, revenue, and business:
        – 100%: US$ 12.9B revenue; global data, Internet access, voice, IT (NTT Communications)
        – 100%: US$ 24.4B revenue; local telecom
        – 100%: US$ 21.9B revenue; local telecom
        – 66.4%: US$ 52.8B revenue; mobile
        – 54.2%: US$ 14.5B revenue; system integration
      • [Organization chart] NTT Communications units, grouped on the slide under R&D, Product, Operation, and Staff: Innovative IP Architecture Center (R&D), First Sales Division, Second Sales Division, ..., Global Sales Division, Video & Voice Division, Network Services Division, Cloud Services Division, Applications and Content Division, Solutions Division, Customer Services Division, Service Infrastructure Division, Systems Division, Corporate Planning Division, Finance Division, ...
    • BizCITY: Cloud Services Provided by NTT Communications
      • [Service diagram] ICT outsourcing and Big Data analysis:
        – BizHosting: virtual server hosting
        – BizMail: WebMail, scheduler
        – SaaS: CRM/SFA
        – BizStorage: online storage
        – BizMarketing: multi-layer analysis
        – Big Data (user log)
      • High-speed backbone between datacenters
      • Secure connectivity: firewall, Internet, global network, VPN service, Internet/IP phone (guaranteed, burst, best effort), international and domestic
      • Ubiquitous office: mobile access, thin client, remote access, IP phone
    • Big Data in BizCITY
      • BizStorage (online storage): secure, high-capacity storage service for private data
      • BizMarketing (multi-layer analysis): feature mining for marketing on user logs (access log, query log, and CGM data such as blogs), using statistics and natural language processing
      • Hadoop is used for "enormous" user log analysis
      • Next target: private data analysis application
    • Hadoop in BizMarketing
      • CGM data analysis: increasing data!! [Chart: tweets per day, January 2009 to July 2011]
      • Web access analysis: many join operations
      • Requirement for scalability → Hadoop!!
    • CGM Data Analysis in BizMarketing
      • "Buzz Finder" supports marketing activity using customers' feedback in social media
      • [Diagram: Buzz Finder crawls and collects tweets and blogs and provides search. Labeled users and uses: marketer, company, advertiser, R&D; promotion, branding, reputations, ads' results, difference from other companies.]
    • Data Flow in BuzzFinder
      • PostgreSQL → Hadoop cluster (NLP and statistics by Map/Reduce) → PostgreSQL
    • Map/Reduce in BuzzFinder
      • Stages: Map (NLP) → Map (data extract) → Reduce (statistics)
      • [Diagram: CGM and user data are turned into linguistic data (keywords, topics, sentiments, locations) plus index data; the extract stage emits individual keyword, topic, sentiment, and location records; the reduce stage produces keyword, topic, sentiment, and location counts (data points).]
      • Rich data per record; small number of records (x million per day)
      • Map is costly (mainly because of NLP)
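The three stages on this slide can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not BuzzFinder's implementation: the word lists, the whitespace tokenizer, and the function names `nlp_map`, `extract_map`, `count_reduce`, and `run` are all hypothetical, and a real NLP stage would be far more expensive, which is the slide's point about map cost.

```python
from collections import Counter

# Hypothetical stand-in for the costly NLP stage: a real system would run
# morphological analysis and a trained sentiment classifier here.
POSITIVE = {"good", "great"}
NEGATIVE = {"bad", "scary"}


def nlp_map(post):
    """Map (NLP): turn one raw post into linguistic data."""
    words = post["text"].lower().split()
    if any(w in NEGATIVE for w in words):
        sentiment = "negative"
    elif any(w in POSITIVE for w in words):
        sentiment = "positive"
    else:
        sentiment = "neutral"
    return {"keywords": words, "sentiment": sentiment,
            "location": post["location"]}


def extract_map(record):
    """Map (data extract): emit one ((kind, value), 1) pair per feature."""
    for keyword in set(record["keywords"]):
        yield ("keyword", keyword), 1
    yield ("sentiment", record["sentiment"]), 1
    yield ("location", record["location"]), 1


def count_reduce(pairs):
    """Reduce (statistics): sum the counts for every (kind, value) key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return totals


def run(posts):
    """Chain the three stages over an iterable of posts."""
    pairs = (pair for post in posts for pair in extract_map(nlp_map(post)))
    return count_reduce(pairs)
```

Calling `run` on a list of dictionaries with `text` and `location` fields returns a `Counter` keyed by `(kind, value)` tuples, for example `("keyword", "earthquake")`.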
    • Results of BuzzFinder (1/4): trends of "Earthquake" and "Nuclear Power Plant" on Twitter
      • [Chart: tweets per day for the two keywords. Annotations: 18565 tweets / day; 65642 tweets / day; 95,271 tweets; "Heavy white smoke from Fukushima No.1 nuclear power plant."; y-axis ticks at 50,000 and 100,000.]
      • Many tweets about "Earthquake" on the 11th of each month
      • Trend overview of specified keywords on Twitter
    • Results of BuzzFinder (2/4): topics about "Nuclear Power Plant" in September
      • Topics shown: Tokyo Electric Power, Japan, Nuclear Accident, Fukushima, Noda
      • Popular topics about specified keywords on Twitter
    • Results of BuzzFinder (3/4): location analysis of "Nuclear Power Plant"
      • [Map: tweet volume by location on a scale from few to many; the disaster area and the Tokyo area are labeled.]
      • Many tweets from big cities and the disaster area
    • Results of BuzzFinder (4/4): sentiment analysis of "Nuclear Power Plant"
      • [Chart: positive and negative shares for 2011/04 and 2011/08. Values shown: 51.6%, 48.4%, 52.5%, 47.5%.]
      • The sentiment about "Nuclear Power Plant" became more negative from April (one month after the earthquake) to August.
      • The sentiment is more negative than the average sentiment (70% positive).
    • Hadoop in BizMarketing
      • CGM data analysis: increasing data!! [Chart: tweets per day, January 2009 to July 2011]
      • Web access analysis: many join operations
      • Requirement for scalability → Hadoop!!
    • Visualization of Internet Users' Behavior
      • A web access log consists of
        – time
        – url
        – userid
      • Other data
        – Location information
        – Referrer information
        – User attributes
      • Click stream based analysis, e.g. why did users leave without converting?
      • Two kinds of analysis: statistics and click stream analysis (OLAP)
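The click stream question on this slide (why users left without converting) can be illustrated with a small sketch over the three log fields named above. The tuple layout `(time, url, userid)`, the function name, and the idea of counting each non-converting user's last page are assumptions made for illustration; the slides do not describe the actual analysis.

```python
from collections import Counter, defaultdict


def exit_pages_without_conversion(log, conversion_url):
    """Count the last page seen by each user who never reached conversion_url.

    `log` is an iterable of (time, url, userid) tuples; this layout is an
    assumption based on the three fields listed on the slide.
    """
    visits_by_user = defaultdict(list)
    for time, url, userid in log:
        visits_by_user[userid].append((time, url))

    exits = Counter()
    for visits in visits_by_user.values():
        visits.sort()  # order each user's clicks by time
        urls = [url for _, url in visits]
        if conversion_url not in urls:
            exits[urls[-1]] += 1  # last page before the user left
    return exits
```

Pages with high counts are candidates for a closer look, since they are where non-converting sessions end.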
    • Fast Map/Reduce for PaaS Services
      • Shuffle is costly!
      • Map/Reduce speed-up technique: a high-speed Hadoop cluster reaches the same speed as a normal Hadoop cluster with fewer servers
      • Speed-up techniques
        1. Summation
        2. OLAP (multi-join processing)
    • Strategies for Shuffle Cost Reduction
      • Map Multi-Reduce* (statistics)
        – Record reduce: pre-reduce during the map function to reduce intermediate data (summation)
        – Local reduce: pre-reduce in the same server before the combiner function
      • PJoin** (OLAP join)
        – Join with semi-join view: pre-processing redundant data for multiple joins
      • *, ** "Map Multi-Reduce" and "PJoin" are techniques from NTT labs and are currently closed source.
    • Map Multi-Reduce / Record Reduce
      • Pre-reduce during the map function to reduce intermediate data
      • Normal map/reduce (map task): input → Map → MapOutputBuffer → sort & spill → spill files → mergeParts → output
      • Map/reduce with record reduce function (map task): input → Map → record reduce (pre-reduce function in map) → MapOutputBuffer → sort & spill → spill files → mergeParts → output
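Record reduce itself is closed source, so the sketch below shows the closest publicly known idea, often called in-mapper combining: the mapper sums values per key in memory and emits each key once, instead of emitting one record per input line. The tab-separated `time, url, userid` line format and all function names are assumptions for illustration.

```python
from collections import defaultdict


def plain_map(lines):
    """Ordinary mapper: one (url, 1) record per log line."""
    for line in lines:
        _time, url, _userid = line.split("\t")
        yield url, 1


def record_reduce_map(lines):
    """Mapper with pre-reduction: sum per key before anything is emitted."""
    partial = defaultdict(int)
    for line in lines:
        _time, url, _userid = line.split("\t")
        partial[url] += 1
    # Only one record per distinct key leaves the map function.
    yield from partial.items()


def reduce_sum(pairs):
    """Reducer: sum all values that share a key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)
```

Both mappers lead to the same reduce output, but the second one hands fewer intermediate records to the sort, spill, and shuffle steps whenever keys repeat within a split. The trade-off is that the partial sums must fit in the mapper's memory.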
    • Map Multi-Reduce / Local Reduce
      • Pre-reduce data in the same server before the combiner function
      • [Diagram: the user program forks a master and workers; the master assigns map, local reduce, and reduce tasks. Map workers read input splits 0 to 4, a local reduce task pre-reduces data within each server, and reduce workers write output files 0 and 1. Arrow labels: read, local sort, remote read, write.]
      • Achieved twice the speed of the normal cluster
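The effect of a local reduce can be illustrated by merging the outputs of all map tasks that ran on the same server before anything crosses the network. This is a simplified model, not NTT's implementation: the input shape (a dictionary from server name to a list of per-task count dictionaries) and the function names are hypothetical.

```python
from collections import Counter


def local_reduce(task_outputs_by_server):
    """Merge the per-key counts of all map tasks on each server.

    `task_outputs_by_server` maps a server name to a list of dicts,
    one dict of {key: count} per map task (a hypothetical layout).
    """
    merged = {}
    for server, task_outputs in task_outputs_by_server.items():
        combined = Counter()
        for output in task_outputs:
            combined.update(output)  # adds counts key by key
        merged[server] = dict(combined)
    return merged


def records_without_local_reduce(task_outputs_by_server):
    """Records shuffled when every map task ships its own output."""
    return sum(len(output)
               for outputs in task_outputs_by_server.values()
               for output in outputs)


def records_with_local_reduce(task_outputs_by_server):
    """Records shuffled when each server ships one merged output."""
    return sum(len(output)
               for output in local_reduce(task_outputs_by_server).values())
```

The saving grows with the number of map tasks per server and with how often the same keys appear across those tasks.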
    • OLAP in Click Stream Based Analysis
      • Click stream data analysis uses a star-join schema: click_stream is joined with page info, location info, user info, and click info
      • The unique key count is large
      • Scalable join is required!
    • Join Using Map/Reduce
      • Three ways to join with map/reduce
        – Memory-backed join and reduce-side join: implemented in Hive
        – Map-side join
      • Comparison
        – Memory-backed join: scalability △; shuffle cost high; speed fast
        – Reduce-side join: scalability ○; shuffle cost very high; speed slow
        – Map-side join: scalability depends on implementation; shuffle cost low; speed depends on implementation
      • Scalability is a requirement, so shuffle is costly!
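To make the shuffle trade-off concrete, here is a minimal sketch of two generic join styles in plain Python: a reduce-side join, where every record of both tables is grouped by key before joining, and an in-memory hash join done in the mapper, where the smaller table is loaded as a lookup and nothing is shuffled. The record layout (`pageseqid`, `clickseqid`, `url`) is hypothetical, and how these two functions line up with the three categories on the slide is an assumption.

```python
from collections import defaultdict


def reduce_side_join(clicks, pages):
    """Group both tables by join key (the shuffle), then join per key."""
    shuffled = defaultdict(lambda: {"page": [], "click": []})
    for page in pages:
        shuffled[page["pageseqid"]]["page"].append(page)
    for click in clicks:
        shuffled[click["pageseqid"]]["click"].append(click)

    joined = []
    for key in sorted(shuffled):
        group = shuffled[key]
        for click in group["click"]:
            for page in group["page"]:
                joined.append({**click, **page})
    return joined


def map_side_hash_join(clicks, pages):
    """Load the smaller table into memory and join while reading clicks."""
    lookup = {page["pageseqid"]: page for page in pages}
    return [{**click, **lookup[click["pageseqid"]]}
            for click in clicks
            if click["pageseqid"] in lookup]
```

The hash join avoids the shuffle but only works while the lookup table fits in memory, which is the scalability limit the slide marks with △.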
    • PJoin / Join Using Semi-Join View
      • Pre-processing redundant data for multiple joins
      • Join on the map side using a pre-generated view, and do only the rest of the join on the reduce side
      • [Diagram, pre-computation: siteinfo (site description data, with the siteinfo primary key and a foreign key to the accesses primary key) and accesses (access log) are partitioned by hash into semi-join views siteinfo_accesses 1 to n.]
      • [Diagram, query execution: each mapper reads accesses 1 to n from the DFS and processes them while joining with the semi-join siteinfo view; results are shuffled to the reducer.]
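PJoin is closed source, so the following is only a sketch of the general idea the slide names: precompute, for each partition of the access log, a view holding just the siteinfo rows that partition references (a semi-join), so each mapper can join locally without shuffling siteinfo. The field names `access_id` and `site_id`, the modulo partitioning, and both function names are assumptions, not details from the talk.

```python
def build_semi_join_views(accesses, siteinfo, num_partitions):
    """Pre-computation: partition accesses and build one view per partition.

    Each view contains only the siteinfo rows referenced by the accesses
    in the same partition. Partitioning by access_id modulo
    num_partitions is an assumption made for this sketch.
    """
    site_by_id = {row["site_id"]: row for row in siteinfo}
    partitions = [[] for _ in range(num_partitions)]
    for access in accesses:
        partitions[access["access_id"] % num_partitions].append(access)

    views = []
    for partition in partitions:
        needed = {access["site_id"] for access in partition}
        views.append({sid: site_by_id[sid]
                      for sid in needed if sid in site_by_id})
    return partitions, views


def map_join_partition(partition, view):
    """Query execution: join one partition with its view inside the mapper."""
    return [{**access, **view[access["site_id"]]}
            for access in partition
            if access["site_id"] in view]
```

Because each view is limited to the keys its partition needs, it stays small even when the full dimension table is large. The cost moves to the pre-computation step, which has to be repeated when the underlying data changes.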
    • Experimental Evaluation (PJoin)
      • 1 TB access log join processing using PJoin to verify its effectiveness
      • 50 servers (normal Hadoop cluster) = same speed as 20 servers (PJoin-applied cluster)
      • HiveQL query used for the comparison:
          insert overwrite table q1_result
          select count(distinct s_sessionseqid)
          from clckstrm c
          join page p
            on c.c_pageseqid = p.p_pageseqid
            and p.p_url like '%blog.goo.ne.jp%'
          join session_info s
            on s.s_clckstrmseqid = c.c_clckstrmseqid
            and s.s_referer like '%QUERY%';
      • [Chart: PJoin vs Hive, varying the number of servers, low selectivity. x-axis: number of servers (20 to 50); y-axis: processing time in minutes (0 to 6). Series: "6. pjoin -> distinct -> pjoin" plan, "7. pjoin -> rsjoin" plan, and Hive with 50 servers (fastest).]
    • Other Verification of Hadoop
      • 40 servers, 250 cores
      • Wide-area Ethernet
      • LACP 4G between racks
      • [Diagram: Hadoop cluster (250 cores) with Rack 1 (LOC1), Rack 2 (LOC1), and Rack 3 (LOC2). Labels: Namenode, LACP 4GB, WAN (30 miles).]
      • [Chart: processing time (0 to 80) versus number of servers (0 to 30), with a series labeled WAN.]
    • Conclusions
      • NTT Communications provides cloud services under the BizCITY name
      • Solved two problems using Hadoop in BizMarketing
        – NLP of big CGM data
        – Join operations on big web access logs
      • Reduced operation cost using speed-up techniques
        – Map Multi-Reduce
        – PJoin
      • Introduced our Hadoop cluster, which spans a wide area network
    • Contacts
      • Kenji Hara, @haracane, kenji.hara@ntt.com
      • Yukio Uematsu, @alfyukio, y.uematsu@ntt.com
      • BizCITY: http://www.ntt.com/bizcity/
        – BizStorage: http://www.ntt.com/bizstorage/
        – BizMarketing: http://www.ntt.com/marketing/