Large-Scale Log Analysis for Marketing. Kenji Hara / Yukio Uematsu, Innovative IP Architecture Center, NTT Communications Corporation. Copyright © 2011 NTT Communications Co., Ltd. All Rights Reserved.
Company Overview
NTT Group, NTT Communications Corporate Structure. Nippon Telegraph and Telephone group companies (ownership, revenue, business): 100%, US$ 12.9B revenue, global data, Internet access, voice, IT; 100%, US$ 24.4B revenue, local telecom; 100%, US$ 21.9B revenue, local telecom; 66.4%, US$ 52.8B revenue, mobile; 54.2%, US$ 14.5B revenue, system integration. Divisions: Second Sales Division, First Sales Division, Global Sales Division, ... Video & Voice Division, Network Services Division, Cloud Services Division, Applications and Content Division, Solutions Division, Customer Services Division, Service Infrastructure Division, Systems Division, Corporate Planning Division, Finance Division, ... Innovative IP Architecture Center. Functions: Staff, Operation, Product, R&D.
NTT Group, NTT Communications Corporate Structure (same diagram, with added labels): Technical Support, SI Partnership.
BizCITY: Cloud Services provided by NTT Communications. Diagram labels: High-Speed Backbone between Datacenters; Global NW; Secure Connectivity; Internet/IP Phone; VPN Service; ICT Outsourcing; Firewall; Guaranteed, Burst, Best Effort; Domestic, International; BizHosting (Virtual Server Hosting); BizMail (WebMail, Scheduler); SaaS (CRM/SFA); Internet; BizStorage (Online Storage); Multi Layer Analysis; BizMarketing (Big Data, user log); Ubiquitous Office (Mobile Access, Mobile Thin Client, Remote Access, IP Phone); Big Data Analysis.
Big Data in BizCITY. BizStorage (Online Storage): secure and high-capacity storage service for private data. Multi Layer Analysis: private data analysis (natural language processing, statistics). BizMarketing: mining data for marketing from user logs (access log, CGM log, query log); uses Hadoop for “enormous” user log analysis. Other diagram labels: B Application Data, Feature, Next target.
We provide a “cloud” service for marketing!!! Hadoop in cloud!!!!
Hadoop in BizMarketing. Hadoop!! is used for Web Access Analysis (many join operations) and CGM Analysis (increasing data!!). Both bring a requirement for scalability. [Chart: tweets per day, Jan 2009 to July 2011]
CGM Analysis in BizMarketing. “BuzzFinder” supports marketing activity using customers' feedback in social media. BuzzFinder crawls and collects tweets and blogs and makes them searchable. Users: marketer, advertiser, promoter, R&D. Uses: branding, ad results, company reputation, differences from other companies.
Data Flow in BuzzFinder. PostgreSQL -> Hadoop cluster (NLP and statistics by Map/Reduce) -> PostgreSQL.
Map/Reduce in BuzzFinder. CGM data: the size per record is large, the number of records is small (x million/day), and the map phase is costly (mainly because of NLP). Flow: linguistic and user data -> Map (NLP) -> features (keywords, sentiment, locations, topics, index data) -> Map (data extract) -> Reduce (statistics): keyword count, topic count, sentiment count, location count; a search index is also produced. Other diagram labels: Keywords, Customer.
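The Map (NLP) and Reduce (statistics) steps on this slide can be sketched in a few lines of Python. This is only an illustration of the pattern, not BuzzFinder's code: the lexicons `KEYWORDS` and `NEGATIVE_WORDS`, the record fields `text` and `location`, and the function names are all hypothetical stand-ins for the real NLP step.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical stand-ins for the NLP step: tiny keyword and sentiment lexicons.
KEYWORDS = {"earthquake", "nuclear"}
NEGATIVE_WORDS = {"afraid", "danger"}

Feature = Tuple[str, str]  # (feature type, feature value)


def map_nlp(record: Dict[str, str]) -> Iterator[Tuple[Feature, int]]:
    """Map: extract features from one post and emit ((type, value), 1) pairs."""
    words = record["text"].lower().split()
    for word in words:
        if word in KEYWORDS:
            yield ("keyword", word), 1
    is_negative = any(word in NEGATIVE_WORDS for word in words)
    yield ("sentiment", "negative" if is_negative else "positive"), 1
    yield ("location", record["location"]), 1


def reduce_counts(pairs: Iterable[Tuple[Feature, int]]) -> Counter:
    """Reduce: sum the emitted values per feature key."""
    counts: Counter = Counter()
    for key, value in pairs:
        counts[key] += value
    return counts


def run_job(records: Iterable[Dict[str, str]]) -> Counter:
    """Run map over every record, then reduce all emitted pairs."""
    return reduce_counts(pair for record in records for pair in map_nlp(record))


posts = [
    {"text": "Earthquake again I am afraid", "location": "Tokyo"},
    {"text": "Nuclear plant news", "location": "Fukushima"},
    {"text": "Earthquake drill today", "location": "Tokyo"},
]
counts = run_job(posts)
print(counts[("keyword", "earthquake")], counts[("location", "Tokyo")])
```

The map side does all the per-record work, which matches the slide's note that map is the costly phase; the reduce side is a plain sum per key.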
Output of BuzzFinder: Keyword Trend. Trends of specified keywords in Twitter, here “Nuclear Power Plant” and “Earthquake”. [Chart: tweets per day, axis marks at 50,000 and 100,000; labels: Earthquake, Nuclear Power Plant, 18565 tweets / day, 65642 tweets / day.] Many tweets about “Earthquake” on the 11th of each month. Peak annotation: “Heavy white smoke from Fukushima No.1 nuclear power plant.”, 95,271 tweets.
Output of BuzzFinder: Topic Analysis. Popular topics about specified keywords in Twitter. Topics about “Nuclear Power Plant” in September: Tokyo Electric Power, Japan, Nuclear Accident, Fukushima, Noda.
Output of BuzzFinder: Location Analysis. Location analysis of “Nuclear Power Plant”. [Map: tweet volume by area, from few to many; labels: Disaster Area, Tokyo Area.] Many tweets come from big cities and the disaster area.
Output of BuzzFinder: Sentiment Analysis. Sentiment analysis of “Nuclear Power Plant”. [Chart: positive and negative share for Apr 2011 and Aug 2011; values as extracted: 48.4%, 51.6%, 47.5%, 52.5%.] The sentiment for “Nuclear Power Plant” became more negative from April (one month after the earthquake) to August. It is also more negative than the average sentiment (70% positive).
Hadoop in BizMarketing. Hadoop!! is used for Web Access Analysis (many join operations) and CGM Data Analysis (increasing data!!). Both bring a requirement for scalability. [Chart: tweets per day, Jan 2009 to July 2011]
Web Access Analysis. Goal: find out internet users' behavior inside the site, e.g., why did users leave without converting? Method: click-stream-based analysis.
Visualization of internet users' behavior. Click-stream-based analysis, e.g., why did users leave without converting? Methods: statistics and click stream analysis (OLAP).
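One way to approach the question "why did users leave without converting?" is to count the last page of every session that never reached a conversion page. The sketch below assumes a hypothetical click record of `(session_id, timestamp, page)` and a hypothetical conversion page `/thanks`; neither the schema nor the function name comes from the slides.

```python
from collections import Counter, defaultdict
from typing import Iterable, Tuple

Click = Tuple[str, int, str]  # (session_id, timestamp, page), hypothetical schema


def exit_pages_without_conversion(clicks: Iterable[Click], conversion_page: str) -> Counter:
    """Count the last page of every session that never reached conversion_page."""
    sessions = defaultdict(list)
    for session_id, timestamp, page in clicks:
        sessions[session_id].append((timestamp, page))

    exits: Counter = Counter()
    for events in sessions.values():
        events.sort()  # order each session's clicks by timestamp
        pages = [page for _, page in events]
        if conversion_page not in pages:
            exits[pages[-1]] += 1
    return exits


clicks = [
    ("s1", 1, "/top"), ("s1", 2, "/cart"), ("s1", 3, "/thanks"),
    ("s2", 1, "/top"), ("s2", 2, "/cart"),
    ("s3", 2, "/cart"), ("s3", 1, "/item"),
    ("s4", 1, "/top"),
]
exits = exit_pages_without_conversion(clicks, "/thanks")
print(exits.most_common())
```

Grouping by session is the step that becomes a shuffle on a real cluster, which is why scalability matters once the click stream grows.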
Hadoop for PaaS Services. Want to reduce the cost! A Map/Reduce speeding-up technique lets a high-speed Hadoop cluster run at the same speed as a normal Hadoop cluster with fewer servers (server reduction). Speeding-up targets: 1. summation, 2. OLAP (multi-join processing).
Our Cluster vs. Normal Cluster: Elephant in Cloud runs FAST!!
Strategies for Cost Reduction.
Statistics (summation): Map Multi-Reduce *
  Record reduce: HashMap-based pre-combining before the combiner. Advantages: 1) efficient combining by HashMap, 2) fewer spill operations.
  Local reduce: combining mapper outputs on the same server. Advantage: less data to shuffle.
OLAP (join): Pjoin **
  Join with pre-partitioning and semi-join. Advantage: efficient for multi-table joins.
*, ** “Map Multi-Reduce” and “Pjoin” were developed in NTT labs; the source code is currently closed.
Map Multi-Reduce/Record Reduce. Normal map/reduce (map task): input -> Map -> MapOutputBuffer -> sort & spill -> spill files -> mergeParts -> output. Map/reduce with record reduce: the same flow plus a record reduce step, a pre-combining function that runs before the combiner. Pre-combining in the map function reduces the number of spill operations. Diagram labels: smaller output buffer; map task, reduce task. Legend: server, process, file.
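The general idea of HashMap-based pre-combining can be illustrated with a small Python sketch. It is not the Map Multi-Reduce implementation (whose source is closed); the class name `PreCombiningMapper`, the `emit` callback, and the `max_keys` flush threshold are assumptions made for the illustration.

```python
from collections import defaultdict
from typing import Callable, Dict


class PreCombiningMapper:
    """Sums values per key in an in-memory table and emits only on flush.

    Fewer records reach the output buffer, so the buffer fills (and spills)
    less often than when every (key, 1) pair is emitted directly.
    """

    def __init__(self, emit: Callable[[str, int], None], max_keys: int = 1000):
        self.emit = emit
        self.max_keys = max_keys  # hypothetical memory bound on the table
        self.table: Dict[str, int] = defaultdict(int)

    def map(self, key: str, value: int) -> None:
        self.table[key] += value
        if len(self.table) >= self.max_keys:
            self.flush()

    def flush(self) -> None:
        """Emit one combined record per key, then empty the table."""
        for key, total in self.table.items():
            self.emit(key, total)
        self.table.clear()


output_buffer = []
mapper = PreCombiningMapper(emit=lambda k, v: output_buffer.append((k, v)))
for page in ["/top", "/cart", "/top", "/top", "/cart"]:
    mapper.map(page, 1)
mapper.flush()  # must also be called once at the end of the map task
print(sorted(output_buffer))
```

Five input records become two buffered records here. The flush threshold bounds memory use; partial sums emitted by an early flush are still correct because the combiner and reducer add them up again.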
Map Multi-Reduce/Local Reduce. [Diagram: MapReduce execution overview. The user program forks a master and workers; the master assigns map, local reduce, and reduce tasks; map workers read input splits 0 to 4 and write locally; reduce workers read remotely, sort, and write output files 0 and 1. Legend: server, process, file.] Local reduce tasks pre-reduce data on the same server before the combiner function. Result: twice as fast as the normal cluster.
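The effect of merging map outputs per server before the shuffle can be shown with a toy example. This is a sketch of the general idea only, not the Local Reduce implementation; the server names, the `(server, Counter)` input shape, and both function names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def local_reduce(map_outputs: Iterable[Tuple[str, Counter]]) -> Dict[str, Counter]:
    """Merge the outputs of all map tasks that ran on the same server."""
    per_server: Dict[str, Counter] = defaultdict(Counter)
    for server, counts in map_outputs:
        per_server[server].update(counts)  # Counter.update adds the counts
    return dict(per_server)


def shuffled_records(outputs: Iterable[Counter]) -> int:
    """Number of (key, value) records that would be sent over the network."""
    return sum(len(counts) for counts in outputs)


# One entry per map task: (server it ran on, partial counts it produced).
map_outputs: List[Tuple[str, Counter]] = [
    ("server-a", Counter({"/top": 3, "/cart": 1})),
    ("server-a", Counter({"/top": 2, "/item": 4})),
    ("server-b", Counter({"/top": 1})),
    ("server-b", Counter({"/top": 5, "/cart": 2})),
]

merged = local_reduce(map_outputs)
before = shuffled_records(counts for _, counts in map_outputs)
after = shuffled_records(merged.values())
print(before, after)
```

The totals per key are unchanged; only the number of records crossing the network drops, which is the advantage the strategy slide lists for local reduce.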
OLAP in Click Stream Based Analysis. The click_stream table is joined with page info, location info, user info, and click info. A scalable join is required: the number of unique keys is large.
Join using Map/Reduce

                  Memory-backed join   Reduce-side join   Map-side join
Scalability       NG                   Good               Good
Shuffle cost      Low                  High               Low
Disk space        Good                 Good               Bad

Approach: combine map-side join and reduce-side join to reduce shuffle cost and disk space while keeping scalability.
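Two of the strategies in the table can be contrasted in plain Python. The sketch below is a single-process illustration, not Hadoop code: the table contents and function names are hypothetical, and the grouping dict in `reduce_side_join` stands in for the shuffle.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Click = Tuple[int, str]  # (click_id, page_id)
Page = Tuple[str, str]   # (page_id, url)


def memory_backed_join(clicks: List[Click], pages: List[Page]) -> List[Tuple[int, str]]:
    """Hold the whole page table in a dict and join while scanning clicks.

    No shuffle is needed, but the table must fit in memory on every mapper,
    which is why the slide rates its scalability as NG.
    """
    page_by_id: Dict[str, str] = dict(pages)
    return [(click_id, page_by_id[page_id])
            for click_id, page_id in clicks if page_id in page_by_id]


def reduce_side_join(clicks: List[Click], pages: List[Page]) -> List[Tuple[int, str]]:
    """Tag rows by source table, group them on the join key, join per key.

    Every row of both tables passes through the grouping step, which is
    the high shuffle cost in the table above.
    """
    groups = defaultdict(lambda: {"click": [], "page": []})
    for click_id, page_id in clicks:
        groups[page_id]["click"].append(click_id)
    for page_id, url in pages:
        groups[page_id]["page"].append(url)

    joined = []
    for group in groups.values():
        for click_id in group["click"]:
            for url in group["page"]:
                joined.append((click_id, url))
    return joined


clicks = [(1, "p1"), (2, "p2"), (3, "p1"), (4, "p9")]
pages = [("p1", "/top"), ("p2", "/cart")]
in_memory = memory_backed_join(clicks, pages)
shuffled = reduce_side_join(clicks, pages)
print(sorted(in_memory) == sorted(shuffled))
```

Both functions return the same rows; they differ only in where the cost falls, memory on every mapper versus data moved in the shuffle.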
Pjoin/Join using Semi-Join View. Join on the map side using pre-partitioning, and do only the rest of the join on the reduce side. [Diagram, as far as the labels can be recovered. Pre-processing: pageinfo (site description data; primary key and foreign key, the click_strm primary key) is split into partitions pageinfo a ... pageinfo z; redundant data is pre-processed for multiple joins. Query execution: mappers read click_strm 1 ... click_strm n and the pageinfo partitions (DFS read), perform click_strm processing plus the semi-join, and produce pageinfo_click_strm 1 ... pageinfo_click_strm n; after the shuffle, reducers perform the joining with pageinfo. Partitioning labels: hash(x), hash(y).]
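Pjoin's source is closed, so the following is only a generic sketch of the two ingredients the slide names, pre-partitioning and semi-join, under assumptions of our own. The function names, the use of `zlib.crc32` as the partitioning hash, and the sample tables are all hypothetical.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple

Click = Tuple[int, str]  # (click_id, page_id)
Page = Tuple[str, str]   # (page_id, url)


def partition(rows: list, key_index: int, n: int) -> Dict[int, list]:
    """Pre-partition rows by a stable hash of the join key."""
    parts: Dict[int, list] = defaultdict(list)
    for row in rows:
        parts[zlib.crc32(row[key_index].encode()) % n].append(row)
    return parts


def partitioned_semi_join(click_strm: List[Click], pageinfo: List[Page],
                          url_filter: str, n: int = 4) -> List[Tuple[int, str]]:
    """Join click_strm with the pageinfo rows whose url contains url_filter."""
    click_parts = partition(click_strm, 1, n)  # pre-processing, done once
    page_parts = partition(pageinfo, 0, n)

    joined = []
    for p in range(n):
        # Rows with equal keys share a partition number, so each partition
        # can be joined independently with a small in-memory table.
        wanted = {pid: url for pid, url in page_parts.get(p, [])
                  if url_filter in url}
        for click_id, pid in click_parts.get(p, []):
            if pid in wanted:  # semi-join filter: drop unmatched clicks early
                joined.append((click_id, wanted[pid]))
    return joined


pageinfo = [("p1", "blog.example/a"), ("p2", "shop.example/b"), ("p3", "blog.example/c")]
click_strm = [(1, "p1"), (2, "p2"), (3, "p3"), (4, "p1"), (5, "p7")]
result = partitioned_semi_join(click_strm, pageinfo, "blog")
print(sorted(result))
```

Because both tables are partitioned with the same hash, no row has to move between partitions at query time, and the key-set filter keeps unmatched click rows from travelling any further.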
Experimental Evaluation (Pjoin). Join processing of a 1 TB access log using Pjoin, to verify its effectiveness. [Chart: Pjoin vs. Hive (reduce-side join), processing time (min) against number of servers; series: Pjoin (50 servers), Hive (50 servers), Pjoin (20 servers).] Result: 50 servers (normal Hadoop cluster) and 23 servers (Pjoin-applied cluster) run at the same speed!! HiveQL used:
    insert overwrite table q1_result
    select count(distinct s_sessionseqid)
    from clckstrm c
    join page p on c.c_pageseqid = p.p_pageseqid and p.p_url like '%blog.goo.ne.jp%'
    join session_info s on s.s_clckstrmseqid = c.c_clckstrmseqid and s.s_referer like '%QUERY%';
Other Verification on Hadoop. [Diagram: Hadoop cluster (250 cores) with a namenode; Rack 1 (LOC1), Rack 2 (LOC1), Rack 3 (LOC2); WAN (30 miles), 300Mb; LACP 4GB.] [Chart: processing time against number of servers, including the WAN case.] Result: no significant loss over WAN.
Conclusions
Contacts

GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 

Hadoop World 2011: Large Scale Log Data Analysis for Marketing in NTT Communications

  • 1. Large-Scale Log Analysis for Marketing. Kenji Hara / Yukio Uematsu, Innovative IP Architecture Center, NTT Communications Corporation. Copyright © 2011 NTT Communications Co., Ltd. All Rights Reserved.
  • 2.
  • 3. NTT Group, NTT Communications Corporate Structure 100% 100% US$ 12.9B revenue Global data, Internet Access, Voice, IT US$ 24.4B revenue, Local Telecom Nippon Telegraph & Telephone 100% US$ 21.9B revenue, Local Telecom 66.4% US$ 52.8B revenue, Mobile 54.2% US$ 14.5B revenue, System Integration Second Sales Division First Sales Division Global Sales Division ... Video & Voice Division Network Services Division Cloud Services Division Applications and Content Division Solutions Division Customer Services Division Service Infrastructure Division Systems Division Corporate Planning Division Finance Division ... Innovative IP Architecture Center Staff Operation Product R&D
  • 4. NTT Group, NTT Communications Corporate Structure 100% 100% US$ 12.9B revenue Global data, Internet Access, Voice, IT US$ 24.4B revenue, Local Telecom Nippon Telegraph & Telephone 100% US$ 21.9B revenue, Local Telecom 66.4% US$ 52.8B revenue, Mobile 54.2% US$ 14.5B revenue, System Integration Technical Support, SI Partnership
  • 5. BizCITY: Cloud Services provided by NTT Communications. High-Speed Backbone between Datacenters Global NW Secure Connectivity Internet/IP Phone VPN Service ICT Outsourcing Firewall Guaranteed Burst Best Effort Domestic International BizHosting Virtual Server Hosting BizMail WebMail, Scheduler SaaS CRM/SFA Internet BizStorage Online Storage Multi Layer Analysis BizMarketing Big Data (user log) Mobile Access Mobile Thin Client Ubiquitous Office Remote Access Mobile Access IP Phone Big Data Analysis BizStorage Online Storage Multi Layer Analysis BizMarketing Big Data (user log)
  • 6. Big Data in BizCITY. Private Data Analysis Natural Language Processing Statistics Secure & High-Capacity Storage Service Mining Data for Marketing User Log Private Data BizStorage Online Storage Multi Layer Analysis BizMarketing Access Log Use Hadoop for "enormous" user log analysis CGM Log Query Log B Application Data Feature Next target BizMarketing
  • 7. We provide a “cloud” service for marketing!!! Hadoop in cloud!!!!
  • 8. Hadoop in BizMarketing. Web Access Analysis CGM Analysis Hadoop!! Many Join Operations Increasing Data!! Requirement for scalability Jan 2009 July 2009 Jan 2010 July 2010 Jan 2011 July 2011 Tweets Per Day
  • 9. CGM Analysis in BizMarketing. "BuzzFinder" supports marketing activity using customers' feedback in social media. Crawl Crawl Marketer Advertiser Promoter R&D Branding Ads' Result Company Reputations Difference with other companies Tweet Blog Search Collect BuzzFinder Blog
  • 10. Data Flow in BuzzFinder. PostgreSQL Hadoop Cluster PostgreSQL NLP and Statistics by Map/Reduce
  • 11. Map/Reduce in BuzzFinder. CGM Data size/record is large. Small number of records (x mil/day). Map is costly (mainly because of NLP). Keywords Customer Keywords Sentiment Locations Topics Index Data Keywords Sentiment Locations Topics Index Data Keyword Sentiment Location Topic Search Index Map (Data Extract) Keyword Count Topic Count Sentiment Count Location Count Reduce (Statistics) Features Map (NLP) Linguistic & User Data
  • 12. Output of BuzzFinder: Keyword Trend. Trends of "Nuclear Power Plant" and "Earthquake" on Twitter 100,000 50,000 Earthquake Nuclear Power Plant 18565 tweets / day 65642 tweets / day Many tweets about "Earthquake" on the 11th of each month. Trends of specified keywords in Twitter. Heavy white smoke from Fukushima No.1 nuclear power plant. 95,271 tweets
  • 13. Output of BuzzFinder: Topic Analysis. Topics about "Nuclear Power Plant" in September. Popular topics about specified keywords in Twitter. Topics about "Nuclear Power Plant": Tokyo Electric Power, Japan, Nuclear Accident, Fukushima, Noda
  • 14. Output of BuzzFinder: Location Analysis. Location analysis of "Nuclear Power Plant". Disaster Area Tokyo Area Many Few. Many tweets come from big cities and the disaster area.
  • 15. Output of BuzzFinder: Sentiment Analysis. Sentiment analysis of "Nuclear Power Plant" APR 2011 AUG 2011 48.4% 51.6% 47.5% 52.5% Positive Negative. The sentiment toward "Nuclear Power Plant" became more negative from April (1 month after the earthquake) to August. The sentiment is more negative than the average sentiment (70% positive).
  • 16. Hadoop in BizMarketing. Web Access Analysis CGM Data Analysis Hadoop!! Jan 2009 July 2009 Jan 2010 July 2010 Jan 2011 July 2011 Increasing Data!! Tweets Per Day Many Join Operations Requirement for scalability
  • 17. Web Access Analysis. Example question: why did users leave without converting? Goal: find out Internet users' behavior inside the site. Click-stream-based analysis
  • 18.
  • 19. Hadoop for PaaS Services. At the same speed. Server reduction. Speeding-up techniques: 1. Summation 2. OLAP (multi-join processing). Want to reduce the cost! Normal Hadoop Cluster High Speed Hadoop Cluster Map/Reduce speeding-up technique
  • 20. Our Cluster Normal Cluster Elephant in Cloud runs FAST!!
  • 21. Strategies for Cost Reduction. Statistics (summation): Map Multi-Reduce*. Record reduce: HashMap-based pre-combining before the combiner; advantages: 1) efficient combining by HashMap, 2) fewer spill operations. Local reduce: combining mapper outputs on the same server; advantage: less data to shuffle. OLAP (join): Pjoin**: join with pre-partitioning and semi-join; advantage: efficient for multi-table joins. *, ** "Map Multi-Reduce" and "Pjoin" are developed in NTT labs; the source code is currently closed.
  • 22. Map Multi-Reduce/Record Reduce. Normal map/reduce: Input Map MapOutputBuffer sort&spill Spill files mergeParts Output. Map/reduce with record reduce: Input Map MapOutputBuffer sort&spill Spill files mergeParts Output. Record reduce: pre-combining function before the combiner. Pre-combining in the map function to reduce the number of spill operations. Map Task Reduce Task Server Process File Smaller output buffer
  • 23. Map Multi-Reduce/Local Reduce. User Program worker worker worker Input Data fork fork fork Master worker worker assign map assign reduce local write remote read, sort Output File 0 Output File 1 Split 1 Split 0 Split 2 Split 3 Split 4 read worker worker worker worker worker assign local reduce Server Process File. Pre-reduce data on the same server before the combiner function. Local Reduce task Local Reduce task Local Reduce. Twice as fast as the normal cluster
  • 24.
  • 25.
  • 26. Pjoin/Join using Semi-Join View. Query execution pageinfo z Pre-processing pageinfo click_strm pageinfo primary key & foreign key (click_strm primary key) Site description data. Pre-processing redundant data for multiple joins. Join on the map side using pre-partitioning; only the rest of the join runs on the reduce side. click_strm processing + semi-join mapper … click_strm processing + semi-join pageinfo a pageinfo_click_strm 1 … pageinfo_click_strm n click_strm n click_strm 1 Joining with pageinfo reducer … Joining with pageinfo … pageinfo b pageinfo a pageinfo z click_strm 1 click_strm n pageinfo_click_strm n pageinfo_click_strm 1 … hash(x) hash(y) hash(y) DFS read shuffle
  • 27. Experimental Evaluation (Pjoin). 1 TB access log join processing using Pjoin to verify the effectiveness. HiveQL No. of servers Processing time (min) Pjoin vs Hive (reduce-side join) Pjoin (50 servers) Hive (50 servers) Pjoin (20 servers) 50 servers (normal Hadoop cluster) 23 servers (Pjoin applied cluster) = same speed!! insert overwrite table q1_result select count(distinct s_sessionseqid) from clckstrm c join page p on c.c_pageseqid = p.p_pageseqid and p.p_url like '%blog.goo.ne.jp%' join session_info s on s.s_clckstrmseqid = c.c_clckstrmseqid and s.s_referer like '%QUERY%';
  • 28.
  • 29.
  • 30.
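
The record reduce idea on slides 21 and 22 (HashMap-based pre-combining inside the map task, before the combiner) can be sketched roughly as follows. This is a minimal single-process Python illustration, not NTT's closed-source Map Multi-Reduce implementation: the function names, the `max_keys` flush threshold, and the sample records are all assumptions made for the example.

```python
from collections import defaultdict


def map_without_precombine(records):
    """Plain map step: emit one (keyword, 1) pair per keyword occurrence."""
    return [(kw, 1) for rec in records for kw in rec.split()]


def map_with_precombine(records, max_keys=1000):
    """Map step with hash-map pre-combining (hypothetical sketch).

    Counts are aggregated in an in-memory dict and emitted only when the
    dict holds max_keys distinct keys or the input ends, so fewer pairs
    reach the output buffer and fewer spills are needed.
    """
    buffer = defaultdict(int)
    emitted = []
    for rec in records:
        for kw in rec.split():
            buffer[kw] += 1
            if len(buffer) >= max_keys:
                # Flush partial counts; the reducer sums them later.
                emitted.extend(buffer.items())
                buffer.clear()
    emitted.extend(buffer.items())
    return emitted


def reduce_counts(pairs):
    """Reduce step: sum the (possibly partial) counts per keyword."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Sample records are invented for illustration only.
records = ["earthquake tokyo", "earthquake fukushima", "earthquake tokyo"]
plain = map_without_precombine(records)
combined = map_with_precombine(records)
print(len(plain), len(combined))
print(reduce_counts(combined))
```

Both map variants give the same totals after the reduce step; the pre-combined one simply hands fewer intermediate pairs to the sort-and-spill stage, which is the saving the slides attribute to record reduce.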