Hadoop & Zing
Upload Details

Uploaded as Microsoft PowerPoint

Usage Rights

© All Rights Reserved

  • SequenceFile is a flat file consisting of binary key/value pairs. It is used extensively in MapReduce as an input/output format; internally, the temporary outputs of maps are also stored as SequenceFiles.
  • Persistent connection, one-way communication

Hadoop & Zing Presentation Transcript

  • HADOOP & ZING (2011-08). Presenter: HUNGVV. W: http://me.zing.vn/hung.vo E: [email_address]
  • AGENDA: 1. Introduction to Hadoop, Hive; 2. Using Hadoop in Zing Rank, a case study: Log Collecting, Analyzing & Reporting System, Cluster Estimate; 3. Conclusion
  • Hadoop & Zing
    • What
      • It’s a framework for large-scale data processing
      • Inspired by Google’s architecture: Map Reduce and GFS
      • A top-level Apache project – Hadoop is open source
    • Why
      • Fault-tolerant hardware is expensive
      • Hadoop is designed to run on cheap commodity hardware
      • It automatically handles data replication and node failure
      • It does the hard work – you can focus on processing data
  • Data Flow into Hadoop Web Servers Scribe MidTier Network Storage and Servers Hadoop Hive Warehouse MySQL
  • Hive – Data Warehouse
    • A system for managing and querying structured data, built on top of Hadoop
      • Map-Reduce for execution
      • HDFS for storage
      • Metadata in an RDBMS
    • Key building Principles:
      • SQL as a familiar data warehousing tool
      • Extensibility - Types, Functions, Formats, Scripts
      • Scalability and Performance
    • Efficient SQL to Map-Reduce Compiler
  • Hive Architecture (diagram): Web UI + Hive CLI + JDBC/ODBC for browse, query, and DDL; Hive QL compiler (Parser, Planner, Optimizer, Execution); SerDe (CSV, Thrift, Regex); UDF/UDAF (substr, sum, average); FileFormats (TextFile, SequenceFile, RCFile); user-defined Map-Reduce scripts; execution on Map Reduce over HDFS
  • Hive DDL
    • DDL
      • Complex columns
      • Partitions
      • Buckets
    • Example
      • CREATE TABLE stats_active_daily( username STRING, userid INT, last_login INT, num_login INT, num_longsession INT) PARTITIONED BY(dt STRING, app STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
  • Hive DML
    • Data loading
      • LOAD DATA LOCAL INPATH '/data/scribe_log/STATSLOGIN/STATSLOGIN-$YESTERDAY*' OVERWRITE INTO TABLE stats_login PARTITION(dt='$YESTERDAY', app='${APP}');
    • Insert data into Hive tables
      • INSERT OVERWRITE TABLE stats_active_daily PARTITION (dt='$YESTERDAY', app='${APP}') SELECT username, userid, MAX(login_time), COUNT(1), SUM(IF(login_type=3,1,0)) FROM stats_login WHERE dt='$YESTERDAY' and app='${APP}' GROUP BY username, userid;
  • Hive Query Language
    • SQL
      • Where
      • Group By
      • Equi-Join
      • Sub query in "From" clause
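These pieces compose in a single statement. A hypothetical sketch reusing the deck's stats_login and user_information tables (the dt value and the join on username are assumptions):

```sql
-- Hypothetical HiveQL combining WHERE, GROUP BY, an equi-join,
-- and a subquery in the FROM clause; joining on username is an assumption.
SELECT u.genderid, COUNT(1) AS num_active
FROM (
    SELECT DISTINCT username
    FROM stats_login
    WHERE dt = '2011-08-01'        -- partition pruning via WHERE
) l
JOIN user_information u
  ON (l.username = u.username)     -- Hive supports equi-joins only
GROUP BY u.genderid;
```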
  • Multi-table Group-By/Insert
        • FROM user_information
        • INSERT OVERWRITE TABLE log_user_gender PARTITION (dt='$YESTERDAY') SELECT '$YESTERDAY', genderid, COUNT(1) GROUP BY genderid
        • INSERT OVERWRITE TABLE log_user_age PARTITION (dt='$YESTERDAY') SELECT '$YESTERDAY', YEAR(dob), COUNT(1) GROUP BY YEAR(dob)
        • INSERT OVERWRITE TABLE log_user_education PARTITION (dt='$YESTERDAY') SELECT '$YESTERDAY', educationid, COUNT(1) GROUP BY educationid
        • INSERT OVERWRITE TABLE log_user_job PARTITION (dt='$YESTERDAY') SELECT '$YESTERDAY', jobid, COUNT(1) GROUP BY jobid
  • File Formats
    • TextFile:
      • Easy for other applications to write/read
      • Gzip text files are not splittable
    • SequenceFile:
      • http://wiki.apache.org/hadoop/SequenceFile
      • Only Hadoop can read it
      • Supports splittable compression
    • RCFile: Block-based columnar storage
      • https://issues.apache.org/jira/browse/HIVE-352
      • Use SequenceFile block format
      • Columnar storage inside a block
      • 25% smaller compressed size
      • On-par or better query performance depending on the query
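The format is picked per table in the DDL. A minimal sketch (table names are illustrative), plus the settings that make compressed SequenceFile output splittable:

```sql
-- Illustrative tables, one per format discussed above.
CREATE TABLE logs_text (line STRING) STORED AS TEXTFILE;
CREATE TABLE logs_seq  (line STRING) STORED AS SEQUENCEFILE;
CREATE TABLE logs_rc   (line STRING) STORED AS RCFILE;

-- Block-compressed SequenceFiles stay splittable, unlike gzipped text:
SET hive.exec.compress.output=true;
SET mapred.output.compression.type=BLOCK;
```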
  • SerDe
    • Serialization/Deserialization
    • Row Format
      • CSV (LazySimpleSerDe)
      • Thrift (ThriftSerDe)
      • Regex (RegexSerDe)
      • Hive Binary Format (LazyBinarySerDe)
    • LazySimpleSerDe and LazyBinarySerDe
      • Deserialize the field when needed
      • Reuse objects across different rows
      • Text and Binary format
  • UDF/UDAF
    • Features:
      • Use either Java or Hadoop Objects (int, Integer, IntWritable)
      • Overloading
      • Variable-length arguments
      • Partial aggregation for UDAF
    • Example UDF:
      • public class UDFExampleAdd extends UDF {
            public int evaluate(int a, int b) { return a + b; }
        }
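Once compiled into a JAR, a UDF like this is registered per session and then called like a built-in. A sketch in which the JAR path and function alias are illustrative:

```sql
-- Register and call the UDF (JAR path and alias are illustrative).
ADD JAR /tmp/udf_example.jar;
CREATE TEMPORARY FUNCTION example_add AS 'UDFExampleAdd';
SELECT example_add(num_login, num_longsession)
FROM stats_active_daily LIMIT 10;
```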
  • What do we use Hadoop for?
    • Storing Zing Me core log data
    • Storing Zing Me Game/App log data
    • Storing backup data
    • Processing/Analyzing data with HIVE
    • Storing social data (feed, comment, voting, chat messages, …) with HBase
  • Data Usage
    • Statistics per day:
      • ~ 300 GB of new data added per day
      • ~ 800 GB of data scanned per day
      • ~ 10,000 Hive jobs per day
  • Where is the data stored?
    • Hadoop/Hive Warehouse
      • 90 TB of data
      • 20 nodes, 16 cores/node
      • 16 TB per node
      • Replication = 2
  • Log Collecting, Analyzing & Reporting
    • Needs
      • Simple, high-performance framework for log collection
      • Central, highly available & scalable storage
      • Easy-to-use tools for data analysis (schema-based, SQL-like query, …)
      • Robust framework for developing reports
    • Version 1 (RDBMS-style)
      • Log data goes directly into a MySQL database (master)
      • Data is transformed into another MySQL database (off-load)
      • Statistics queries run there and export results into other MySQL tables
    • Performance problem
      • Slow log inserts, especially concurrent inserts
      • Slow query time on large datasets
    Log Collecting, Analyzing & Reporting
    • Version 2 (Scribe, Hadoop & Hive)
      • Fast log collection
      • Acceptable query-time on large dataset
      • Data replication
      • Distributed calculation
    Log Collecting, Analyzing & Reporting
    • Components
      • Log Collector
      • Log/Data Transformer
      • Data Analyzer
      • Web Reporter
    • Process
      • Define the log
      • Integrate logging (into the application)
      • Analyze the log/data
      • Develop the report
    Log Collecting, Analyzing & Reporting
    • Log Collector
      • Scribe:
        • a server for aggregating streaming log data
        • designed to scale to a very large number of nodes and be robust to network and node failures
        • hierarchical stores
        • Thrift service using the non-blocking C++ server
      • Thrift-client in C/C++, Java, PHP, …
    Log Collecting, Analyzing & Reporting
    • Log format (common)
      • Application-action log
        • server_ip server_domain client_ip username actionid createdtime appdata execution_time
      • Request log
        • server_ip request_domain request_uri request_time execution_time memory client_ip username application
      • Game action log
        • time username actionid gameid goldgain coingain expgain itemtype itemid userid_affect appdata
    Log Collecting, Analyzing & Reporting
    • Scribe – file store
        • port=1463
        • max_msg_per_second=2000000
        • max_queue_size=10000000
        • new_thread_per_category=yes
        • num_thrift_server_threads=10
        • check_interval=3
        • # DEFAULT - write all other categories to /data/scribe_log
        • <store>
        • category=default
        • type=file
        • file_path=/data/scribe_log
        • base_filename=default_log
        • max_size=8000000000
        • add_newlines=1
        • rotate_period=hourly
        • #rotate_hour=0
        • rotate_minute=1
        • </store>
    Log Collecting, Analyzing & Reporting
    • Scribe – buffer store
        • <store>
        • category=default
        • type=buffer
        • target_write_size=20480
        • max_write_interval=1
        • buffer_send_rate=1
        • retry_interval=30
        • retry_interval_range=10
        • <primary>
        • type=network
        • remote_host=xxx.yyy.zzz.ttt
        • remote_port=1463
        • </primary>
        • <secondary>
        • type=file
        • fs_type=std
        • file_path=/tmp
        • base_filename=zmlog_backup
        • max_size=30000000
        • </secondary>
        • </store>
    Log Collecting, Analyzing & Reporting
    • Log/Data Transformer
      • Helps import data from multiple types of sources into Hive
      • Semi-automated
      • Log files to Hive:
        • LOAD DATA LOCAL INPATH … OVERWRITE INTO TABLE…
      • MySQL data to Hive:
        • Data extract using SELECT … INTO OUTFILE …
        • Import using LOAD DATA
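Under those two steps, the MySQL-to-Hive path might look like this (file path, column list, and partition value are illustrative):

```sql
-- MySQL side: dump a tab-delimited extract (path is illustrative).
SELECT username, userid, genderid
INTO OUTFILE '/tmp/user_information.tsv'
FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
FROM user_information;

-- Hive side: load the extract into a partitioned table.
LOAD DATA LOCAL INPATH '/tmp/user_information.tsv'
OVERWRITE INTO TABLE user_information PARTITION (dt='2011-08-01');
```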
    Log Collecting, Analyzing & Reporting
    • Data Analyzer
      • Calculation using Hive query language (HQL): SQL-like
      • Data partitioning, query optimization:
        • very important for speed
        • distributed data reading
        • queries optimized for one-pass data reading
      • Automation
        • hive --service cli -f hql_file
        • Bash shell, crontab
      • Export data and import into MySQL for web report
        • Export with Hadoop command-line: hadoop fs -cat
        • Import using LOAD DATA LOCAL INFILE … INTO TABLE …
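Chaining the automation and export steps above, a cron-driven wrapper might be sketched as follows; every path, database, and table name here is an assumption:

```shell
#!/bin/bash
# Nightly sketch (all paths/names are illustrative): run the HQL,
# pull the result out of HDFS, and load it into MySQL for the reporter.
YESTERDAY=$(date -d yesterday +%Y-%m-%d)

# 1. Run the aggregation script, as on the slide.
hive --service cli -f /opt/reports/daily_active.hql

# 2. Export the Hive result files from HDFS to a local TSV.
hadoop fs -cat "/user/hive/warehouse/stats_active_daily/dt=$YESTERDAY/*" \
    > /tmp/stats_active_daily.tsv

# 3. Import into MySQL for the PHP web reporter.
mysql report_db -e "LOAD DATA LOCAL INFILE '/tmp/stats_active_daily.tsv'
    INTO TABLE stats_active_daily_report;"
```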
    Log Collecting, Analyzing & Reporting
    • Web Reporter
      • PHP web application
      • Modular
      • Standard format and template
      • JpGraph (PHP graph library)
    Log Collecting, Analyzing & Reporting
    • Applications
      • Summarization
        • User/Apps indicators: active, churn-rate, login, return…
        • User demographics: age, gender, education, job, location…
        • User interactions/Apps actions
      • Data mining
      • Spam Detection
      • Application performance
      • Ad-hoc Analysis
    Log Collecting, Analyzing & Reporting
  • THANK YOU!