Data Analytics with Hadoop/Hive on Multiple Data Centers

Transcript

  • 1. Data Analytics with Hadoop/Hive on Multiple Data Centers. Hirotaka Niisato, GMO Internet, Inc.
  • 2. About myself
      ● Hirotaka Niisato (@hirotakaster)
      ● Programmer
      ● GMO Internet, SIProp Project
      ● Work: Robotics, Kinect, Android, Networking, MAKE:, Solr, Volunteer ...
  • 3. Data Analytics System
      ● KPI reporting system for the cloud system
      ● GMO Apps Cloud
      ● Over 500 titles: mobage, gree, mixi, Hangame, facebook, nikoniko … etc.
      ● Data centers: Japan, US (west coast)
  • 4. Analytics Specification
      ● Social game data: KPI, DAU/PV, Play Time, Sales, A/B Testing, Conversion … etc.
      ● Hourly, Daily, Weekly, Monthly
      ● Since 2010/06 ~
  • 5. System Architecture
      [Diagram. Labels: SNS Game User; SNS Platform; Master Cloud System (Management System, Monitoring System); per data center, from Data Center A to Data Center N: Cloud Server (Game Server), Logging Server, Scheduler, MySQL (for Hive), Hadoop/Hive]
  • 6. Specification, Statistics
      ● Multiple NameNodes per data center
      ● Hardware specification
          CPU: 8~16 CPUs (HT)
          MEM: 12~64 GB
          HD: RAID 1, 5, 1+0
      ● Statistics
          6,000,000 blocks / 44,000 jobs per day
          Logging from over 1,000 AP servers
  • 7. Data Flow
      ● Diagram labels: Cloud Server (Game Server), Logging Server, Management System, Scheduler, HiveDriver, Hadoop/Hive
      ● Load statement:
          load data local inpath 'hogehoge-access_log.*.log.gz'
          overwrite into table original_logs
          partition (log_date='2012-07-26', log_number=13);
      ● original_logs columns:
          host        string   from deserializer
          identity    string   from deserializer
          user        string   from deserializer
          time        string   from deserializer
          method      string   from deserializer
          request     string   from deserializer
          status      string   from deserializer
          size        string   from deserializer
          referer     string   from deserializer
          agent       string   from deserializer
          log_date    string
          log_number  tinyint
      ● Columns after filtering:
          host        string
          time        string
          method      string
          request     string
          userid      string
          log_date    string
          log_number  tinyint
      ● Filter → Hourly, Daily, Weekly, Monthly Report (A/B Testing, Conversion, DAU, etc.)
  • 8. Conversion Count HQL
        INSERT OVERWRITE TABLE conversion_click
        PARTITION (log_date=:logDate, log_number=:logNumber)
        SELECT
          regexp_extract(request, 'convid=([a-zA-Z0-9%])', 1),
          regexp_extract(request, 'convflg=(A|B){1}', 1),
          count(1),
          :logMonth,
          :logWeek
        FROM parsed_log
        WHERE request RLIKE 'convid=[a-zA-Z0-9%]'
          AND request RLIKE 'convflg=(A|B){1}'
          AND log_date = :logDate
          AND log_number = :logNumber
        GROUP BY
          regexp_extract(request, 'convid=([a-zA-Z0-9%])', 1),
          regexp_extract(request, 'convflg=(A|B){1}', 1)
  • 9. Monitoring/Management (Zabbix)
  • 10. Memory Management
      ● NameNode memory: File, Block, Directory
      ● Hadoop Archive
      ● Server memory
  • 11. Trouble
      ● Re-Analytics
      ● Backup and Recovery
      ● NameNode HA
      ● Hive vs MapReduce
  • 12. Thank you
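
Slide 7 lists two schemas: the raw access-log columns filled by a deserializer, and a narrower set of columns after filtering. As a rough illustration of that step, the Python sketch below parses one access-log line into the raw columns and projects it onto the filtered columns. The talk does not state the log format or where `userid` comes from, so the combined-log-format pattern, the `userid=` query parameter, the sample line, and the function names `parse_line` and `filter_row` are all assumptions made for this example.

```python
import re

# Assumed combined-log-format pattern. The group names follow the
# original_logs columns on slide 7; the actual format is not given in the talk.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<identity>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<request>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of string fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def filter_row(row, log_date, log_number):
    """Project a parsed row onto the narrower, filtered column set.

    Reading userid from a `userid=` query parameter is hypothetical.
    """
    match = re.search(r'[?&]userid=([^&\s]+)', row["request"])
    return {
        "host": row["host"],
        "time": row["time"],
        "method": row["method"],
        "request": row["request"],
        "userid": match.group(1) if match else None,
        "log_date": log_date,
        "log_number": log_number,
    }


# Made-up sample line for illustration only.
sample = ('203.0.113.7 - - [26/Jul/2012:13:05:42 +0900] '
          '"GET /game/top?userid=u123&convid=abc&convflg=A HTTP/1.1" '
          '200 512 "-" "Mozilla/5.0"')
parsed = parse_line(sample)
filtered = filter_row(parsed, "2012-07-26", 13)
print(filtered["userid"], filtered["method"], parsed["status"])
```

In the system described in the talk this work happens inside Hive itself; the sketch only mirrors the shape of the two schemas.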
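
The conversion query on slide 8 extracts a `convid` and an A/B flag from each request and counts rows per pair. The Python sketch below mirrors that grouping on a handful of made-up request strings, which may help when checking the regular expressions outside Hive. One assumption to note: the character class in the transcript has no quantifier, so it would capture a single character; this sketch uses `+` on the guess that IDs are longer. The function name `conversion_counts` and the sample requests are hypothetical.

```python
import re
from collections import Counter

# Assumption: the ID is one or more characters long, hence the "+".
CONVID = re.compile(r'convid=([a-zA-Z0-9%]+)')
CONVFLG = re.compile(r'convflg=(A|B)')


def conversion_counts(requests):
    """Count requests per (convid, convflg), skipping rows missing either."""
    counts = Counter()
    for request in requests:
        conv_id = CONVID.search(request)
        conv_flag = CONVFLG.search(request)
        # Mirrors the two RLIKE conditions in the WHERE clause.
        if conv_id and conv_flag:
            counts[(conv_id.group(1), conv_flag.group(1))] += 1
    return counts


# Made-up requests for illustration only.
requests = [
    "/lp?convid=camp1&convflg=A",
    "/lp?convid=camp1&convflg=A",
    "/lp?convid=camp1&convflg=B",
    "/lp?convid=camp2&convflg=B",
    "/top?userid=u1",
]
counts = conversion_counts(requests)
for key, n in sorted(counts.items()):
    print(key, n)
```

The last request carries neither parameter, so it is dropped, just as the WHERE clause would drop it.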
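
Slide 10 ties NameNode memory to the number of files, blocks, and directories, and names Hadoop Archive as a countermeasure. A back-of-the-envelope estimate can make that relationship concrete. The sketch below uses the commonly quoted rule of thumb of roughly 150 bytes of NameNode heap per namespace object; that figure does not come from the talk. Only the 6,000,000 block count is taken from slide 6. The file and directory counts, and the "after archiving" numbers, are hypothetical.

```python
# Rule-of-thumb heap cost per file, directory, or block object.
# This figure is an assumption, not a number from the talk.
BYTES_PER_OBJECT = 150


def namenode_heap_bytes(files, directories, blocks,
                        bytes_per_object=BYTES_PER_OBJECT):
    """Estimate NameNode heap used by namespace objects."""
    return (files + directories + blocks) * bytes_per_object


# 6,000,000 blocks is from slide 6; the other counts are hypothetical.
before = namenode_heap_bytes(files=5_500_000, directories=100_000,
                             blocks=6_000_000)

# Hypothetical counts after packing many small files into archives.
after = namenode_heap_bytes(files=50_000, directories=1_000,
                            blocks=400_000)

print(f"before: {before / 2**20:.0f} MiB, after: {after / 2**20:.0f} MiB")
```

The point of the estimate is the proportionality: fewer, larger files mean fewer objects for the NameNode to hold in memory.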