Hadoop World 2011: Large Scale Log Data Analysis for Marketing in NTT Communications
 


In this session we will talk about how we built a log analysis system for marketing using Hadoop, which explores Internet users' interests in, and feedback about, specified products or themes from access logs, query/click logs, and CGM data. Our system provides three features: 1) sentiment analysis, 2) co-occurring keyword extraction, and 3) user interest estimation. For large-scale analysis, we use Hadoop with customized functions that reduce the shuffle size by doing more of the processing on the map side. We also show the features of our Hadoop cluster.


Presentation Transcript

  • Large-Scale Log Analysis for Marketing. Kenji Hara / Yukio Uematsu, Innovative IP Architecture Center, NTT Communications Corporation. Copyright © 2011 NTT Communications Co., Ltd. All Rights Reserved.
  • Company Overview
    • Name: NTT Communications
    • Headquarters: Tokyo, Japan
    • Revenue: USD 12.9B (March 2011; USD 1 = JPY 80)
    • Employees: 8,250 (June 2011)
    • Business Areas
      • International communication
      • Internet service provider
      • System integration
      • Cloud services
    • History
      • 1952 NTT was established
      • 1987 NTT went public (Tokyo Stock Exchange: 9432)
      • 1999 NTT Communications was spun off from NTT and incorporated (May 28, 1999)
  • NTT Group, NTT Communications Corporate Structure
    • Holding company: Nippon Telegraph and Telephone (NTT); group companies by NTT ownership
      • 100%: US$ 12.9B revenue; global data, Internet access, voice, IT (NTT Communications)
      • 100%: US$ 24.4B revenue; local telecom
      • 100%: US$ 21.9B revenue; local telecom
      • 66.4%: US$ 52.8B revenue; mobile
      • 54.2%: US$ 14.5B revenue; system integration
    • NTT Communications divisions (grouped on the slide as Staff, Operation, Product, R&D)
      • Second Sales Division, First Sales Division, Global Sales Division, ...
      • Video & Voice Division, Network Services Division, Cloud Services Division, Applications and Content Division, Solutions Division
      • Customer Services Division, Service Infrastructure Division, Systems Division
      • Corporate Planning Division, Finance Division, ...
      • Innovative IP Architecture Center
  • NTT Group, NTT Communications Corporate Structure (same group diagram as the previous slide, with added labels: Technical Support, SI Partnership)
  • BizCITY: Cloud Services provided by NTT Communications
    • Network: high-speed backbone between datacenters; global network; secure connectivity (Internet/IP phone, VPN service); firewall; guaranteed, burst, and best-effort classes; domestic and international
    • ICT outsourcing
      • BizHosting: virtual server hosting
      • BizMail: WebMail, scheduler
      • SaaS: CRM/SFA
    • Access: mobile access, mobile thin client, ubiquitous office, remote access, IP phone
    • Big data analysis
      • BizStorage: online storage
      • BizMarketing: multi-layer analysis, big data (user log)
  • Big Data in BizCITY
    • BizStorage (online storage): secure, high-capacity storage service for private data
    • BizMarketing (multi-layer analysis): mining user-log data for marketing, using natural language processing and statistics
    • User logs: access log, CGM log, query log; Hadoop is used for "enormous" user log analysis
    • Next target for BizMarketing: private data analysis (application data)
  • We provide a “cloud” service for marketing!!! Hadoop in cloud!!!!
  • Hadoop in BizMarketing
    • Web access analysis: many join operations
    • CGM analysis: increasing data!! (chart: tweets per day, Jan 2009 to July 2011)
    • Requirement for scalability: Hadoop!!
  • CGM Analysis in BizMarketing: "BuzzFinder" supports marketing activity using customers' feedback in social media
    • Sources crawled and collected: tweets, blogs, search
    • Users: marketers, advertisers, promoters, R&D
    • Uses: branding, ad results, company reputation, differences from other companies
  • Data Flow in BuzzFinder: PostgreSQL → Hadoop cluster (NLP and statistics by Map/Reduce) → PostgreSQL
  • Map/Reduce in BuzzFinder
    • CGM data: each record is large, but the number of records is small (x million/day); Map is costly (mainly because of NLP)
    • Map (NLP): takes customer keywords plus linguistic and user data, and produces features (keywords, sentiment, locations, topics) and index data
    • Map (data extract): keyword, sentiment, location, topic, search index
    • Reduce (statistics): keyword count, topic count, sentiment count, location count
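The Map (NLP) and Reduce (statistics) split on this slide can be illustrated with a minimal sketch. It is not BuzzFinder's code: the tiny word lists stand in for the real NLP step, and the names `map_nlp`, `reduce_counts`, and `run` are hypothetical. The point is only that the expensive per-record feature extraction lives in the map function, while the reduce function does nothing more than count.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List, Tuple

# Assumption: a toy sentiment lexicon standing in for the real NLP step.
POSITIVE = {"good", "great", "safe"}
NEGATIVE = {"bad", "afraid", "danger"}

Feature = Tuple[str, str]  # (keyword, sentiment)


def map_nlp(post: str, keywords: List[str]) -> Iterator[Tuple[Feature, int]]:
    """Map: extract (keyword, sentiment) features from one CGM post."""
    tokens = post.lower().split()
    if any(t in NEGATIVE for t in tokens):
        sentiment = "negative"
    elif any(t in POSITIVE for t in tokens):
        sentiment = "positive"
    else:
        sentiment = "neutral"
    for keyword in keywords:
        if keyword in tokens:
            yield (keyword, sentiment), 1


def reduce_counts(pairs: Iterable[Tuple[Feature, int]]) -> Dict[Feature, int]:
    """Reduce: sum the counts emitted for each feature."""
    totals: Counter = Counter()
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


def run(posts: Iterable[str], keywords: List[str]) -> Dict[Feature, int]:
    """Run the map step over every post, then reduce all emitted pairs."""
    pairs = (pair for post in posts for pair in map_nlp(post, keywords))
    return reduce_counts(pairs)
```

Calling `run(posts, ["earthquake"])` returns one count per (keyword, sentiment) pair; a real job would emit topic and location features in the same way.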
  • Output of BuzzFinder: Keyword Trend (trends of specified keywords in Twitter)
    • Chart: trends of "Nuclear Power Plant" and "Earthquake" in Twitter (axis marks at 50,000 and 100,000; legend: Earthquake, Nuclear Power Plant; labels: 18,565 tweets/day and 65,642 tweets/day)
    • Many tweets about "Earthquake" on the 11th of each month
    • Annotated peak: "Heavy white smoke from Fukushima No.1 nuclear power plant", 95,271 tweets
  • Output of BuzzFinder: Topic Analysis (popular topics about specified keywords in Twitter)
    • Topics about "Nuclear Power Plant" in September: Tokyo Electric Power, Japan, Nuclear Accident, Fukushima, Noda
  • Output of BuzzFinder: Location Analysis
    • Location analysis of "Nuclear Power Plant" (map shaded from few to many tweets; the disaster area and the Tokyo area are marked)
    • Many tweets come from big cities and the disaster area
  • Output of BuzzFinder: Sentiment Analysis
    • Sentiment analysis of "Nuclear Power Plant" (Positive / Negative): APR 2011 48.4% / 51.6%; AUG 2011 47.5% / 52.5%
    • The sentiment toward "Nuclear Power Plant" became more negative from April (one month after the earthquake) to August
    • It is also more negative than the average sentiment (70% positive)
  • Hadoop in BizMarketing (recap)
    • CGM data analysis: increasing data!! (chart: tweets per day, Jan 2009 to July 2011)
    • Web access analysis: many join operations
    • Requirement for scalability: Hadoop!!
  • Web Access Analysis
    • Goal: find out Internet users' behavior inside the site
    • Click stream based analysis, e.g., why did users leave without converting?
  • Visualization of Internet users' behavior
    • Web access log consists of
      • time
      • url
      • userid
    • Other data
      • Location information
      • Referrer information
      • User attribute
    • Click stream based analysis, e.g., why did users leave without converting?
      • Statistics
      • Click stream analysis (OLAP)
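As an illustration of how the three log fields above (time, url, userid) become a click stream, here is a small single-machine sketch. The function names and the idea of a single `conversion_url` are assumptions made for the example; the slides do not describe how conversions or sessions are actually defined, and session boundaries are ignored here.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

LogRecord = Tuple[int, str, str]  # (time, url, userid)


def build_click_streams(log: Iterable[LogRecord]) -> Dict[str, List[str]]:
    """Group access-log records into one time-ordered URL sequence per user."""
    streams: Dict[str, List[str]] = defaultdict(list)
    for _time, url, userid in sorted(log):  # sorting tuples orders by time first
        streams[userid].append(url)
    return dict(streams)


def exit_pages_without_conversion(
    streams: Dict[str, List[str]], conversion_url: str
) -> Dict[str, int]:
    """Count the last page of every stream that never reached conversion_url."""
    exits: Counter = Counter()
    for urls in streams.values():
        if conversion_url not in urls:
            exits[urls[-1]] += 1
    return dict(exits)
```

The pages with the highest exit counts are candidate answers to the slide's question of where users leave without converting. Enriching each click with location, referrer, or user attributes is where the joins discussed later come in.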
  • Hadoop for PaaS Services
    • Want to reduce the cost!
    • Map/Reduce speed-up techniques for 1. summation and 2. OLAP (multi-join processing)
    • Normal Hadoop cluster vs. high-speed Hadoop cluster: server reduction at the same speed
  • Our Cluster vs. Normal Cluster: Elephant in Cloud runs FAST!!
  • Strategies for Cost Reduction
    • Statistics (summation): Map Multi-Reduce *
      • Record reduce: HashMap-based pre-combining before the combiner. Advantages: 1) efficient combining by HashMap, 2) fewer spill operations
      • Local reduce: combining mapper outputs on the same server. Advantage: less data to shuffle
    • OLAP (join): Pjoin **
      • Join with pre-partitioning and semi-join. Advantage: efficient for multi-table joins
    • *, ** "Map Multi-Reduce" and "Pjoin" were developed in NTT labs; the source code is currently closed.
  • Map Multi-Reduce/Record Reduce
    • Normal map/reduce (map task): input → Map → MapOutputBuffer → sort & spill → spill files → mergeParts → output
    • Map/reduce with record reduce (map task): the same pipeline, with record reduce applied in the map function and a smaller output buffer
    • Record reduce is a pre-combining function that runs before the combiner; pre-combining in the map function reduces the number of spill operations
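Since the Map Multi-Reduce source is closed, the following is only a generic sketch of HashMap-based pre-combining inside a map function, not NTT's implementation. It assumes a word-count style job, and it models the map output buffer by record count (`buffer_capacity`), whereas real Hadoop spills are triggered by buffer bytes. The names `record_reduce` and `count_spills` are hypothetical.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def record_reduce(words: Iterable[str], max_keys: int) -> Iterator[Tuple[str, int]]:
    """Pre-combine (word, 1) pairs in a hash map holding at most max_keys keys.

    The table is flushed to the output only when a new key would not fit,
    so repeated keys are summed before they ever reach the output buffer.
    """
    table: Counter = Counter()
    for word in words:
        if word not in table and len(table) >= max_keys:
            yield from table.items()
            table.clear()
        table[word] += 1
    yield from table.items()


def count_spills(records: Iterable[object], buffer_capacity: int) -> int:
    """Count spills for a buffer that spills every buffer_capacity records."""
    spills, used = 0, 0
    for _ in records:
        used += 1
        if used == buffer_capacity:
            spills += 1
            used = 0
    return spills + (1 if used else 0)
```

With the input `a b a a b c a` and `max_keys=2`, seven raw records shrink to four pre-combined ones, and a three-record buffer spills twice instead of three times. The totals per word are unchanged, which is what lets the later combiner and reducer stay as they are.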
  • Map Multi-Reduce/Local Reduce
    • Diagram: the standard Map/Reduce execution flow (the user program forks a master and workers; the master assigns map and reduce tasks; map workers read input splits 0 to 4 and write locally; reduce workers read remotely, sort, and write output files 0 and 1), extended with local reduce tasks that the master assigns on each server
    • Local reduce: pre-reduce data on the same server before the combiner function
    • Twice as fast as the normal cluster
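The effect of local reduce on shuffle volume can be shown with a small simulation. This is a sketch under stated assumptions, not the closed-source implementation: map task outputs are plain lists of (key, count) pairs, the reduce function is a sum, and the names `local_reduce` and `shuffle_records` are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

TaskOutput = List[Tuple[str, int]]


def local_reduce(outputs_by_server: Dict[str, List[TaskOutput]]) -> Dict[str, Dict[str, int]]:
    """Merge the outputs of all map tasks that ran on the same server."""
    merged_by_server: Dict[str, Dict[str, int]] = {}
    for server, task_outputs in outputs_by_server.items():
        merged: Counter = Counter()
        for output in task_outputs:
            for key, count in output:
                merged[key] += count
        merged_by_server[server] = dict(merged)
    return merged_by_server


def shuffle_records(outputs_by_server: Dict[str, List[TaskOutput]]) -> int:
    """Records sent over the network when every task output is shuffled as-is."""
    return sum(len(output) for outputs in outputs_by_server.values() for output in outputs)
```

In this model a key that appears in several map tasks on one server crosses the network once per server instead of once per task, which is the reduction in shuffle data the slide refers to.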
  • OLAP in Click Stream Based Analysis
    • Click stream-based analysis uses a star-join schema
      • Central table: click_stream
      • Joined tables: page info, location info, user info, click info
    • Scalable join is required! The number of unique keys is large
  • Join using Map/Reduce
    • 3 ways to join by map/reduce
      • Memory-backed join and reduce-side join: implemented in Hive
      • Map-side join
    • Comparison
      • Memory-backed join: scalability NG, shuffle cost low, disk space good
      • Reduce-side join: scalability good, shuffle cost high, disk space good
      • Map-side join: scalability good, shuffle cost low, disk space bad
    • Combine map-side join and reduce-side join to reduce shuffle cost and disk space while keeping scalability
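To make the trade-offs in the comparison concrete, here is a single-process sketch of two of the join styles. It is illustrative only: the (page id, value) tuple layout and the function names are assumptions, and the in-memory grouping in `reduce_side_join` stands in for the network shuffle of a real job.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Click = Tuple[int, str]   # (page_id, user)
Page = Tuple[int, str]    # (page_id, url)
Joined = Tuple[int, str, str]


def memory_backed_join(clicks: List[Click], pages: List[Page]) -> List[Joined]:
    """Every mapper holds the whole pages table in a dict.

    No shuffle is needed, but the table must fit in memory, which is
    why this approach does not scale to large key sets.
    """
    url_by_id: Dict[int, str] = dict(pages)
    return [(pid, user, url_by_id[pid]) for pid, user in clicks if pid in url_by_id]


def reduce_side_join(clicks: List[Click], pages: List[Page]) -> List[Joined]:
    """Tag both inputs, group every record by join key, then join per key.

    The grouping step models the shuffle: all records of both tables move,
    which scales but is expensive.
    """
    groups: Dict[int, Dict[str, List[str]]] = defaultdict(lambda: {"page": [], "click": []})
    for pid, url in pages:
        groups[pid]["page"].append(url)
    for pid, user in clicks:
        groups[pid]["click"].append(user)
    joined: List[Joined] = []
    for pid, group in groups.items():
        for url in group["page"]:
            for user in group["click"]:
                joined.append((pid, user, url))
    return joined
```

Both functions return the same rows (in a different order). A map-side join avoids both the memory limit and the shuffle by requiring the inputs to be partitioned on the join key in advance, which costs extra disk space for the pre-partitioned copies.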
  • Pjoin/Join using Semi-Join View
    • Pre-processing: pageinfo (site description data) is partitioned into pageinfo a ... pageinfo z, and semi-join views pageinfo_click_strm 1 ... pageinfo_click_strm n are built from the pageinfo primary key & foreign key (click_strm primary key); this pre-processes redundant data for multiple joins
    • Query execution, mapper: click_strm 1 ... click_strm n processing + semi-join, reading the pre-partitioned data from DFS
    • Query execution, reducer: joining with pageinfo after the shuffle (partitioned by hash(x), hash(y))
    • Join on the map side using pre-partitioning, and do only the rest of the join on the reduce side
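Pjoin itself is closed-source, so the sketch below is not its implementation. It shows one generic way to combine hash pre-partitioning with a semi-join filter, which is the idea the slide names. Everything here is an assumption for illustration: integer join keys, modulo partitioning, the "blog" URL predicate (borrowed loosely from the evaluation query's `like` filter), and all function names.

```python
from typing import Dict, Iterable, Iterator, List, Set, Tuple

Click = Tuple[int, str]  # (page_id, user)
Page = Tuple[int, str]   # (page_id, url)


def partition_of(key: int, n: int) -> int:
    """Deterministic partition for an integer join key."""
    return key % n


def prepartition(pages: Iterable[Page], n: int) -> List[Dict[int, str]]:
    """Pre-processing: split the pages table into n partitions by join key."""
    parts: List[Dict[int, str]] = [dict() for _ in range(n)]
    for pid, url in pages:
        parts[partition_of(pid, n)][pid] = url
    return parts


def semi_join_keys(parts: List[Dict[int, str]], needle: str) -> List[Set[int]]:
    """Pre-processing: per partition, the keys of pages that pass the filter."""
    return [{pid for pid, url in part.items() if needle in url} for part in parts]


def map_semi_join(clicks: Iterable[Click], keys: List[Set[int]], n: int) -> Iterator[Tuple[int, Click]]:
    """Map side: drop clicks with no matching page before the shuffle."""
    for pid, user in clicks:
        part = partition_of(pid, n)
        if pid in keys[part]:
            yield part, (pid, user)


def reduce_join(shuffled: Iterable[Tuple[int, Click]], parts: List[Dict[int, str]]) -> List[Tuple[int, str, str]]:
    """Reduce side: attach page columns to the clicks that survived."""
    return [(pid, user, parts[part][pid]) for part, (pid, user) in shuffled]
```

Only the clicks that pass the semi-join are shuffled, and each reducer needs just its own pages partition, so the shuffle carries far fewer records than a plain reduce-side join of the full tables.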
  • Experimental Evaluation (Pjoin)
    • 1TB access log join processing using Pjoin to verify the effectiveness
    • Chart: Pjoin vs Hive (reduce side join); processing time (min) by number of servers; series: Pjoin (50 servers), Hive (50 servers), Pjoin (20 servers)
    • 50 servers (normal Hadoop cluster) and 23 servers (Pjoin applied cluster) = same speed!!
    • HiveQL:
      insert overwrite table q1_result
      select count(distinct s_sessionseqid)
      from clckstrm c
      join page p on c.c_pageseqid = p.p_pageseqid and p.p_url like '%blog.goo.ne.jp%'
      join session_info s on s.s_clckstrmseqid = c.c_clckstrmseqid and s.s_referer like '%QUERY%';
  • Other Verification on Hadoop
    • 40 servers, 250 cores
    • Wide-area ethernet
    • LACP 4G between racks
    • Diagram: Hadoop cluster (250 cores) with the namenode; Rack 1 (LOC1), Rack 2 (LOC1), Rack 3 (LOC2); WAN (30 miles), 300Mb; LACP 4GB
    • Chart: processing time by number of servers, with WAN: NO significant loss over WAN
  • Conclusions
    • NTT Communications provides the BizCITY cloud services
    • Solved two problems using Hadoop in BizMarketing
      • NLP of big CGM data
      • OLAP in big web access logs
    • Reduced the number of servers using speed-up techniques
      • Map Multi-Reduce
      • Pjoin
    • Introduced our verification environment, which spans a wide-area network
  • Contacts
    • Kenji Hara, @harakenji, kenji.hara@ntt.com
    • Yukio Uematsu, @alfyukio, y.uematsu@ntt.com
    • BizCITY: http://www.ntt.com/bizcity/
      • BizStorage: http://www.ntt.com/bizstorage/
      • BizMarketing: http://www.ntt.com/marketing/