Data Analytics with
Hadoop/Hive on
Multiple Data Centers.

               Hirotaka Niisato
               GMO Internet, Inc.
about myself
●
    Hirotaka Niisato(@hirotakaster)
●
    Programmer
●
    GMO Internet, SIProp Project
●
    Work
    Robotics Kinect Android Networking MAKE: Solr Volunteer ...
Data Analytics System
●
    KPI reporting system for Cloud System
●
    GMO Apps Cloud
●
    Over 500 Titles
    mobage, gree, mixi, Hangame, facebook, nikoniko … etc
●
    Data Center
    Japan, US(west coast)
Analytics Specification
●
    Social Game Data KPI
    DAU/PV, Play Time, Sales
    A/B Testing, Conversion … etc


●
    Hourly, Daily, Weekly, Monthly


●
    Since 2010/06 ~
System Architecture
  SNS                                                         Game
  User                            SNS Platform                Master




Cloud System                                     Management   Monitoring
                                                   System      System


            Cloud Server
           (Game Server)



      Logging
                    Scheduler          ・・・・・・・・
       Server


                      MySQL
    Hadoop/Hive
                     (for Hive)

         Data Center A                                   Data Center N
Specification, Statistics
●
    Multiple NameNode per Data Center
●
    Hardware Spacification
    CPU : 8~16CPU(HT)
    MEM: 12~64Gbyte
    HD : RAID 1, 5, 1+0
●
    Statistics
    6,000,000 blocks/44,000 jobs/day
    1,000 over AP servers logging
Data Flow
load data local inpath 'hogehoge-access_log.*.log.gz'
overwrite into table original_logs
partition (log_date='2012-07-26', log_number=13);

host      string from deserializer
identity   string from deserializer
user       string from deserializer               Cloud Server
time      string from deserializer               (Game Server)
method     string from deserializer
request    string from deserializer
status    string from deserializer                  Logging
size      string from deserializer                                            Management
                                                     Server                     System
referer   string from deserializer
agent      string from deserializer
log_date        string
log_number      tinyint
                                          Hadoop/Hive         Scheduler
host     string
time     string
method   string                                                  HiveDriver
request  string
userid   string
log_date     string                          Filter → Hourly, Daily, Weekly, Monthly Report
log_number tinyint                           (AB Testing, Conversion, DAU..etc)
Conversion Count HQL
INSERT OVERWRITE TABLE conversion_click
 PARTITION (log_date= :logDate, log_number=:logNumber)
   SELECT regexp_extract(request, 'convid=([a-zA-Z0-9%])', 1),
             regexp_extract(request, 'convflg=(A|B){1}', 1),
             count(1),
             :logMonth,
             :logWeek
     FROM parsed_log
   WHERE request RLIKE 'convid=[a-zA-Z0-9%]'
      AND request RLIKE 'convflg=(A|B){1}'
      AND log_date = :logDate
      AND log_number = :logNumber
 GROUP BY regexp_extract(request, 'convid=([a-zA-Z0-9%])', 1),
           regexp_extract(request, 'convflg=(A|B){1}', 1)
Monitoring/Management(Zabbix)
Memory Management
●
    Namenode Memory
    File, Block, Directory



●
    Hadoop Archive


●
    Server Memory
Trouble
●
    Re-Analytics
●
    Backup and Recovery
●
    NameNode HA
●
    Hive vs MapReduce
Thank you

Data analytics with hadoop hive on multiple data centers

  • 1.
    Data Analytics with Hadoop/Hiveon Multiple Data Centers. Hirotaka Niisato GMO Internet, Inc.
  • 2.
    about myself ● Hirotaka Niisato(@hirotakaster) ● Programmer ● GMO Internet, SIProp Project ● Work Robotics Kinect Android Networking MAKE: Solr Volunteer ...
  • 3.
    Data Analytics System ● KPI reporting system for Cloud System ● GMO Apps Cloud ● Over 500 Titles mobage, gree, mixi, Hangame, facebook, nikoniko … etc ● Data Center Japan, US(west coast)
  • 4.
    Analytics Specification ● Social Game Data KPI DAU/PV, Play Time, Sales A/B Testing, Conversion … etc ● Hourly, Daily, Weekly, Monthly ● Since 2010/06 ~
  • 5.
    System Architecture SNS Game User SNS Platform Master Cloud System Management Monitoring System System Cloud Server (Game Server) Logging Scheduler ・・・・・・・・ Server MySQL Hadoop/Hive (for Hive) Data Center A Data Center N
  • 6.
    Specification, Statistics ● Multiple NameNode per Data Center ● Hardware Spacification CPU : 8~16CPU(HT) MEM: 12~64Gbyte HD : RAID 1, 5, 1+0 ● Statistics 6,000,000 blocks/44,000 jobs/day 1,000 over AP servers logging
  • 7.
    Data Flow load datalocal inpath 'hogehoge-access_log.*.log.gz' overwrite into table original_logs partition (log_date='2012-07-26', log_number=13); host string from deserializer identity string from deserializer user string from deserializer Cloud Server time string from deserializer (Game Server) method string from deserializer request string from deserializer status string from deserializer Logging size string from deserializer Management Server System referer string from deserializer agent string from deserializer log_date string log_number tinyint Hadoop/Hive Scheduler host string time string method string HiveDriver request string userid string log_date string Filter → Hourly, Daily, Weekly, Monthly Report log_number tinyint (AB Testing, Conversion, DAU..etc)
  • 8.
    Conversion Count HQL INSERTOVERWRITE TABLE conversion_click PARTITION (log_date= :logDate, log_number=:logNumber) SELECT regexp_extract(request, 'convid=([a-zA-Z0-9%])', 1), regexp_extract(request, 'convflg=(A|B){1}', 1), count(1), :logMonth, :logWeek FROM parsed_log WHERE request RLIKE 'convid=[a-zA-Z0-9%]' AND request RLIKE 'convflg=(A|B){1}' AND log_date = :logDate AND log_number = :logNumber GROUP BY regexp_extract(request, 'convid=([a-zA-Z0-9%])', 1), regexp_extract(request, 'convflg=(A|B){1}', 1)
  • 9.
  • 10.
    Memory Management ● Namenode Memory File, Block, Directory ● Hadoop Archive ● Server Memory
  • 11.
    Trouble ● Re-Analytics ● Backup and Recovery ● NameNode HA ● Hive vs MapReduce
  • 12.