Speaker notes:
  • SequenceFile is a flat file consisting of binary key/value pairs. It is used extensively in MapReduce as an input/output format. It is also worth noting that, internally, the temporary outputs of maps are stored using SequenceFile.
  • Persistent connection, one-way communication.

    1. PRESENTER: HUNGVV W: http://me.zing.vn/hung.vo E: [email_address] 2011-08 HADOOP & ZING
    2. AGENDA Using Hadoop in Zing Rank Introduction to Hadoop, Hive A case study: Log Collecting, Analyzing & Reporting System Cluster Estimate Conclusion
    3. Hadoop & Zing <ul><li>What </li></ul><ul><ul><li>It’s a framework for large-scale data processing </li></ul></ul><ul><ul><li>Inspired by Google’s architecture: Map Reduce and GFS </li></ul></ul><ul><ul><li>A top-level Apache project – Hadoop is open source </li></ul></ul><ul><li>Why </li></ul><ul><ul><li>Fault-tolerant hardware is expensive </li></ul></ul><ul><ul><li>Hadoop is designed to run on cheap commodity hardware </li></ul></ul><ul><ul><li>It automatically handles data replication and node failure </li></ul></ul><ul><ul><li>It does the hard work – you can focus on processing data </li></ul></ul>
    4. Data Flow into Hadoop Web Servers Scribe MidTier Network Storage and Servers Hadoop Hive Warehouse MySQL
    5. Hive – Data Warehouse <ul><li>A system for managing and querying structured data built on top of Hadoop </li></ul><ul><ul><li>Map-Reduce for execution </li></ul></ul><ul><ul><li>HDFS for storage </li></ul></ul><ul><ul><li>Metadata in an RDBMS </li></ul></ul><ul><li>Key building principles: </li></ul><ul><ul><li>SQL as a familiar data warehousing tool </li></ul></ul><ul><ul><li>Extensibility - Types, Functions, Formats, Scripts </li></ul></ul><ul><ul><li>Scalability and Performance </li></ul></ul><ul><li>Efficient SQL to Map-Reduce Compiler </li></ul>
    6. Hive Architecture HDFS Map Reduce Web UI + Hive CLI + JDBC/ODBC Browse, Query, DDL Hive QL Parser Planner Optimizer Execution SerDe CSV Thrift Regex UDF/UDAF substr sum average FileFormats TextFile SequenceFile RCFile User-defined Map-reduce Scripts
    7. Hive DDL <ul><li>DDL </li></ul><ul><ul><li>Complex columns </li></ul></ul><ul><ul><li>Partitions </li></ul></ul><ul><ul><li>Buckets </li></ul></ul><ul><li>Example </li></ul><ul><ul><li>CREATE TABLE stats_active_daily( username STRING, userid INT, last_login INT, num_login INT, num_longsession INT) PARTITIONED BY(dt STRING, app STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n' STORED AS TEXTFILE; </li></ul></ul>
    8. Hive DML <ul><li>Data loading </li></ul><ul><ul><li>LOAD DATA LOCAL INPATH '/data/scribe_log/STATSLOGIN/STATSLOGIN-$YESTERDAY*' OVERWRITE INTO TABLE stats_login PARTITION(dt='$YESTERDAY', app='${APP}'); </li></ul></ul><ul><li>Insert data into Hive tables </li></ul><ul><ul><li>INSERT OVERWRITE TABLE stats_active_daily PARTITION (dt='$YESTERDAY', app='${APP}') SELECT username, userid, MAX(login_time), COUNT(1), SUM(IF(login_type=3,1,0)) FROM stats_login WHERE dt='$YESTERDAY' and app='${APP}' GROUP BY username, userid; </li></ul></ul>
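The LOAD DATA statement above is parameterized with $YESTERDAY and ${APP}, which a nightly job would fill in before handing the script to Hive. A minimal Python sketch of that templating step is below; the table and path names come from the slide, but the helper itself is a hypothetical illustration, not Zing's actual tooling.

```python
from datetime import date, timedelta
from string import Template

# Template of the slide's LOAD DATA statement; Scribe rotates log files by
# date, so the daily job targets yesterday's files.
HQL_TEMPLATE = Template(
    "LOAD DATA LOCAL INPATH '/data/scribe_log/STATSLOGIN/STATSLOGIN-$yesterday*' "
    "OVERWRITE INTO TABLE stats_login PARTITION(dt='$yesterday', app='$app');"
)

def render_daily_load(app: str, today: date) -> str:
    """Substitute yesterday's date and the app name into the HQL template."""
    yesterday = (today - timedelta(days=1)).strftime("%Y-%m-%d")
    return HQL_TEMPLATE.substitute(yesterday=yesterday, app=app)

print(render_daily_load("zingme", date(2011, 8, 15)))
```

The date format (YYYY-MM-DD) is an assumption; the real rotation format depends on the Scribe store configuration shown later.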
    9. Hive Query Language <ul><li>SQL </li></ul><ul><ul><li>Where </li></ul></ul><ul><ul><li>Group By </li></ul></ul><ul><ul><li>Equi-Join </li></ul></ul><ul><ul><li>Subquery in &quot;From&quot; clause </li></ul></ul>
    10. Multi-table Group-By/Insert <ul><ul><ul><li>FROM user_information </li></ul></ul></ul><ul><ul><ul><li>INSERT OVERWRITE TABLE log_user_gender PARTITION (dt='$YESTERDAY') SELECT '$YESTERDAY', genderid, COUNT(1) GROUP BY genderid </li></ul></ul></ul><ul><ul><ul><li>INSERT OVERWRITE TABLE log_user_age PARTITION (dt='$YESTERDAY') SELECT '$YESTERDAY', YEAR(dob), COUNT(1) GROUP BY YEAR(dob) </li></ul></ul></ul><ul><ul><ul><li>INSERT OVERWRITE TABLE log_user_education PARTITION (dt='$YESTERDAY') SELECT '$YESTERDAY', educationid, COUNT(1) GROUP BY educationid </li></ul></ul></ul><ul><ul><ul><li>INSERT OVERWRITE TABLE log_user_job PARTITION (dt='$YESTERDAY') SELECT '$YESTERDAY', jobid, COUNT(1) GROUP BY jobid </li></ul></ul></ul>
    11. File Formats <ul><li>TextFile: </li></ul><ul><ul><li>Easy for other applications to write/read </li></ul></ul><ul><ul><li>Gzip text files are not splittable </li></ul></ul><ul><li>SequenceFile: </li></ul><ul><ul><li>http://wiki.apache.org/hadoop/SequenceFile </li></ul></ul><ul><ul><li>Only Hadoop can read it </li></ul></ul><ul><ul><li>Supports splittable compression </li></ul></ul><ul><li>RCFile: Block-based columnar storage </li></ul><ul><ul><li>https://issues.apache.org/jira/browse/HIVE-352 </li></ul></ul><ul><ul><li>Uses SequenceFile block format </li></ul></ul><ul><ul><li>Columnar storage inside a block </li></ul></ul><ul><ul><li>25% smaller compressed size </li></ul></ul><ul><ul><li>On-par or better query performance depending on the query </li></ul></ul>
    12. SerDe <ul><li>Serialization/Deserialization </li></ul><ul><li>Row Format </li></ul><ul><ul><li>CSV (LazySimpleSerDe) </li></ul></ul><ul><ul><li>Thrift (ThriftSerDe) </li></ul></ul><ul><ul><li>Regex (RegexSerDe) </li></ul></ul><ul><ul><li>Hive Binary Format (LazyBinarySerDe) </li></ul></ul><ul><li>LazySimpleSerDe and LazyBinarySerDe </li></ul><ul><ul><li>Deserialize the field when needed </li></ul></ul><ul><ul><li>Reuse objects across different rows </li></ul></ul><ul><ul><li>Text and Binary format </li></ul></ul>
    13. UDF/UDAF <ul><li>Features: </li></ul><ul><ul><li>Use either Java or Hadoop Objects (int, Integer, IntWritable) </li></ul></ul><ul><ul><li>Overloading </li></ul></ul><ul><ul><li>Variable-length arguments </li></ul></ul><ul><ul><li>Partial aggregation for UDAF </li></ul></ul><ul><li>Example UDF: </li></ul><ul><ul><li>public class UDFExampleAdd extends UDF { public int evaluate(int a, int b) { return a + b; } } </li></ul></ul>
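The "partial aggregation for UDAF" bullet above is the key idea: each mapper builds a partial state, the partial states are merged, and a final pass produces the result. A Python sketch of that scheme for an average aggregate is below; the function names are illustrative and do not correspond to Hive's actual Java UDAF interface.

```python
# Partial aggregation, the scheme a Hive UDAF uses, mimicked in Python.
# State is a (sum, count) pair.

def iterate(state, value):
    """Fold one input value into a (sum, count) partial state."""
    total, count = state
    return (total + value, count + 1)

def merge(a, b):
    """Combine two partial states, as happens across map outputs."""
    return (a[0] + b[0], a[1] + b[1])

def terminate(state):
    """Turn the merged state into the final average."""
    total, count = state
    return total / count if count else None

# Two "mappers" each aggregate a slice of the data independently.
part1 = (0, 0)
for v in [1, 2, 3]:
    part1 = iterate(part1, v)
part2 = (0, 0)
for v in [4, 5]:
    part2 = iterate(part2, v)

print(terminate(merge(part1, part2)))  # average of 1..5
```

Because `merge` only ever sees small (sum, count) pairs, the reducer never needs the raw rows, which is what makes aggregates like AVG scale in Map-Reduce.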
    14. What do we use Hadoop for? <ul><li>Storing Zing Me core log data </li></ul><ul><li>Storing Zing Me Game/App log data </li></ul><ul><li>Storing backup data </li></ul><ul><li>Processing/Analyzing data with HIVE </li></ul><ul><li>Storing social data (feed, comment, voting, chat messages, …) with HBase </li></ul>
    15. Data Usage <ul><li>Statistics per day: </li></ul><ul><ul><li>~ 300 GB of new data added per day </li></ul></ul><ul><ul><li>~ 800 GB of data scanned per day </li></ul></ul><ul><ul><li>~ 10,000 Hive jobs per day </li></ul></ul>
    16. Where is the data stored? <ul><li>Hadoop/Hive Warehouse </li></ul><ul><ul><li>90 TB of data </li></ul></ul><ul><ul><li>20 nodes, 16 cores/node </li></ul></ul><ul><ul><li>16 TB per node </li></ul></ul><ul><ul><li>Replication = 2 </li></ul></ul>
    17. Log Collecting, Analyzing & Reporting <ul><li>Need </li></ul><ul><ul><li>Simple & high-performance framework for log collection </li></ul></ul><ul><ul><li>Central, highly available & scalable storage </li></ul></ul><ul><ul><li>Easy-to-use tools for data analysis (schema-based, SQL-like queries, …) </li></ul></ul><ul><ul><li>Robust framework for developing reports </li></ul></ul>
    18. <ul><li>Version 1 (RDBMS-style) </li></ul><ul><ul><li>Log data goes directly into a MySQL database (master) </li></ul></ul><ul><ul><li>Data is transformed into another MySQL database (off-load) </li></ul></ul><ul><ul><li>Statistics queries run and export data into other MySQL tables </li></ul></ul><ul><li>Performance problems </li></ul><ul><ul><li>Slow log inserts, concurrent inserts </li></ul></ul><ul><ul><li>Slow query times on large datasets </li></ul></ul>Log Collecting, Analyzing & Reporting
    19. <ul><li>Version 2 (Scribe, Hadoop & Hive) </li></ul><ul><ul><li>Fast log writes </li></ul></ul><ul><ul><li>Acceptable query times on large datasets </li></ul></ul><ul><ul><li>Data replication </li></ul></ul><ul><ul><li>Distributed calculation </li></ul></ul>Log Collecting, Analyzing & Reporting
    20. <ul><li>Components </li></ul><ul><ul><li>Log Collector </li></ul></ul><ul><ul><li>Log/Data Transformer </li></ul></ul><ul><ul><li>Data Analyzer </li></ul></ul><ul><ul><li>Web Reporter </li></ul></ul><ul><li>Process </li></ul><ul><ul><li>Define logs </li></ul></ul><ul><ul><li>Integrate logging (into the application) </li></ul></ul><ul><ul><li>Analyze logs/data </li></ul></ul><ul><ul><li>Develop reports </li></ul></ul>Log Collecting, Analyzing & Reporting
    21. <ul><li>Log Collector </li></ul><ul><ul><li>Scribe: </li></ul></ul><ul><ul><ul><li>a server for aggregating streaming log data </li></ul></ul></ul><ul><ul><ul><li>designed to scale to a very large number of nodes and be robust to network and node failures </li></ul></ul></ul><ul><ul><ul><li>hierarchical stores </li></ul></ul></ul><ul><ul><ul><li>Thrift service using the non-blocking C++ server </li></ul></ul></ul><ul><ul><li>Thrift clients in C/C++, Java, PHP, … </li></ul></ul>Log Collecting, Analyzing & Reporting
    22. <ul><li>Log format (common) </li></ul><ul><ul><li>Application-action log </li></ul></ul><ul><ul><ul><li>server_ip server_domain client_ip username actionid createdtime appdata execution_time </li></ul></ul></ul><ul><ul><li>Request log </li></ul></ul><ul><ul><ul><li>server_ip request_domain request_uri request_time execution_time memory client_ip username application </li></ul></ul></ul><ul><ul><li>Game action log </li></ul></ul><ul><ul><ul><li>time username actionid gameid goldgain coingain expgain itemtype itemid userid_affect appdata </li></ul></ul></ul>Log Collecting, Analyzing & Reporting
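A short Python sketch of parsing one line of the game-action log above: the field list comes from the slide, and the tab delimiter matches the Hive DDL on slide 7; the parser itself and the sample values are illustrative.

```python
# Field names for the game-action log, in the order given on the slide.
GAME_ACTION_FIELDS = [
    "time", "username", "actionid", "gameid", "goldgain", "coingain",
    "expgain", "itemtype", "itemid", "userid_affect", "appdata",
]

def parse_game_action(line: str) -> dict:
    """Split one tab-separated log line into a field-name -> value dict."""
    values = line.rstrip("\n").split("\t")
    return dict(zip(GAME_ACTION_FIELDS, values))

# Hypothetical sample line (values made up for the example).
record = parse_game_action(
    "1312345678\thungvv\t12\t7\t100\t0\t50\t2\t9001\t0\tlevel_up"
)
print(record["username"], record["goldgain"])
```

In practice no such parser is needed on the Hadoop side: Hive's delimited-text SerDe does the same split when the table is declared with `FIELDS TERMINATED BY '\t'`.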
    23. <ul><li>Scribe – file store </li></ul><ul><ul><ul><li>port=1463 </li></ul></ul></ul><ul><ul><ul><li>max_msg_per_second=2000000 </li></ul></ul></ul><ul><ul><ul><li>max_queue_size=10000000 </li></ul></ul></ul><ul><ul><ul><li>new_thread_per_category=yes </li></ul></ul></ul><ul><ul><ul><li>num_thrift_server_threads=10 </li></ul></ul></ul><ul><ul><ul><li>check_interval=3 </li></ul></ul></ul><ul><ul><ul><li># DEFAULT - write all other categories to /data/scribe_log </li></ul></ul></ul><ul><ul><ul><li><store> </li></ul></ul></ul><ul><ul><ul><li>category=default </li></ul></ul></ul><ul><ul><ul><li>type=file </li></ul></ul></ul><ul><ul><ul><li>file_path=/data/scribe_log </li></ul></ul></ul><ul><ul><ul><li>base_filename=default_log </li></ul></ul></ul><ul><ul><ul><li>max_size=8000000000 </li></ul></ul></ul><ul><ul><ul><li>add_newlines=1 </li></ul></ul></ul><ul><ul><ul><li>rotate_period=hourly </li></ul></ul></ul><ul><ul><ul><li>#rotate_hour=0 </li></ul></ul></ul><ul><ul><ul><li>rotate_minute=1 </li></ul></ul></ul><ul><ul><ul><li></store> </li></ul></ul></ul>Log Collecting, Analyzing & Reporting
    24. <ul><li>Scribe – buffer store </li></ul><ul><ul><ul><li><store> </li></ul></ul></ul><ul><ul><ul><li>category=default </li></ul></ul></ul><ul><ul><ul><li>type=buffer </li></ul></ul></ul><ul><ul><ul><li>target_write_size=20480 </li></ul></ul></ul><ul><ul><ul><li>max_write_interval=1 </li></ul></ul></ul><ul><ul><ul><li>buffer_send_rate=1 </li></ul></ul></ul><ul><ul><ul><li>retry_interval=30 </li></ul></ul></ul><ul><ul><ul><li>retry_interval_range=10 </li></ul></ul></ul><ul><ul><ul><li><primary> </li></ul></ul></ul><ul><ul><ul><li>type=network </li></ul></ul></ul><ul><ul><ul><li>remote_host=xxx.yyy.zzz.ttt </li></ul></ul></ul><ul><ul><ul><li>remote_port=1463 </li></ul></ul></ul><ul><ul><ul><li></primary> </li></ul></ul></ul><ul><ul><ul><li><secondary> </li></ul></ul></ul><ul><ul><ul><li>type=file </li></ul></ul></ul><ul><ul><ul><li>fs_type=std </li></ul></ul></ul><ul><ul><ul><li>file_path=/tmp </li></ul></ul></ul><ul><ul><ul><li>base_filename=zmlog_backup </li></ul></ul></ul><ul><ul><ul><li>max_size=30000000 </li></ul></ul></ul><ul><ul><ul><li></secondary> </li></ul></ul></ul><ul><ul><ul><li></store> </li></ul></ul></ul>Log Collecting, Analyzing & Reporting
    25. <ul><li>Log/Data Transformer </li></ul><ul><ul><li>Helps import data from multiple source types into Hive </li></ul></ul><ul><ul><li>Semi-automated </li></ul></ul><ul><ul><li>Log files to Hive: </li></ul></ul><ul><ul><ul><li>LOAD DATA LOCAL INPATH … OVERWRITE INTO TABLE… </li></ul></ul></ul><ul><ul><li>MySQL data to Hive: </li></ul></ul><ul><ul><ul><li>Data extracted using SELECT … INTO OUTFILE … </li></ul></ul></ul><ul><ul><ul><li>Imported using LOAD DATA </li></ul></ul></ul>Log Collecting, Analyzing & Reporting
    26. <ul><li>Data Analyzer </li></ul><ul><ul><li>Calculation using Hive query language (HQL): SQL-like </li></ul></ul><ul><ul><li>Data partitioning, query optimization: </li></ul></ul><ul><ul><ul><li>very important to improve speed </li></ul></ul></ul><ul><ul><ul><li>distributed data reading </li></ul></ul></ul><ul><ul><ul><li>optimize queries for one-pass data reading </li></ul></ul></ul><ul><ul><li>Automation </li></ul></ul><ul><ul><ul><li>hive --service cli -f hql_file </li></ul></ul></ul><ul><ul><ul><li>Bash shell, crontab </li></ul></ul></ul><ul><ul><li>Export data and import into MySQL for web reports </li></ul></ul><ul><ul><ul><li>Export with Hadoop command-line: hadoop fs -cat </li></ul></ul></ul><ul><ul><ul><li>Import using LOAD DATA LOCAL INFILE … INTO TABLE … </li></ul></ul></ul>Log Collecting, Analyzing & Reporting
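The automation and export steps above can be sketched as command assembly: run an HQL file through the Hive CLI, then dump the result out of HDFS so MySQL can load it. In the original setup this is bash plus crontab; the Python wrapper below is a hypothetical illustration, and the file paths are made up for the example.

```python
import shlex

def build_hive_command(hql_file: str) -> str:
    """Run one analysis script through the Hive CLI, as on the slide."""
    return "hive --service cli -f " + shlex.quote(hql_file)

def build_export_command(hdfs_path: str, local_file: str) -> str:
    """Dump a query result from HDFS so MySQL can LOAD DATA LOCAL INFILE it."""
    return f"hadoop fs -cat {shlex.quote(hdfs_path)} > {shlex.quote(local_file)}"

cmd = build_hive_command("/opt/reports/active_daily.hql")
print(cmd)
print(build_export_command("/user/hive/out/part-00000", "/tmp/report.tsv"))
```

A crontab entry would then invoke such a wrapper shortly after midnight, once the previous day's Scribe files have rotated.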
    27. <ul><li>Web Reporter </li></ul><ul><ul><li>PHP web application </li></ul></ul><ul><ul><li>Modular </li></ul></ul><ul><ul><li>Standard format and template </li></ul></ul><ul><ul><li>jpgraph </li></ul></ul>Log Collecting, Analyzing & Reporting
    28. <ul><li>Applications </li></ul><ul><ul><li>Summarization </li></ul></ul><ul><ul><ul><li>User/Apps indicators: active, churn-rate, login, return… </li></ul></ul></ul><ul><ul><ul><li>User demographics: age, gender, education, job, location… </li></ul></ul></ul><ul><ul><ul><li>User interactions/Apps actions </li></ul></ul></ul><ul><ul><li>Data mining </li></ul></ul><ul><ul><li>Spam Detection </li></ul></ul><ul><ul><li>Application performance </li></ul></ul><ul><ul><li>Ad-hoc Analysis </li></ul></ul><ul><ul><li>… </li></ul></ul>Log Collecting, Analyzing & Reporting
    29. THANK YOU!