Hadoop & Zing

Slide notes
  • SequenceFile is a flat file consisting of binary key/value pairs. It is used extensively in MapReduce as an input/output format. It is also worth noting that, internally, the temporary outputs of maps are stored using SequenceFile.
  • Persistent connection, one-way communication.

Transcript

    • 1. HADOOP & ZING. Presenter: HUNGVV. W: http://me.zing.vn/hung.vo. E: [email_address]. 2011-08.
    • 2. AGENDA: Using Hadoop in Zing Rank; Introduction to Hadoop, Hive; A case study: Log Collecting, Analyzing & Reporting System; Cluster Estimate; Conclusion
    • 3. Hadoop & Zing
      • What
        • It’s a framework for large-scale data processing
        • Inspired by Google’s architecture: Map Reduce and GFS
        • A top-level Apache project – Hadoop is open source
      • Why
        • Fault-tolerant hardware is expensive
        • Hadoop is designed to run on cheap commodity hardware
        • It automatically handles data replication and node failure
        • It does the hard work – you can focus on processing data
    • 4. Data Flow into Hadoop: Web Servers → Scribe MidTier → Network Storage and Servers → Hadoop Hive Warehouse → MySQL
    • 5. Hive – Data Warehouse
      • A system for managing and querying structured data, built on top of Hadoop
        • Map-Reduce for execution
        • HDFS for storage
        • Metadata in an RDBMS
      • Key Building Principles:
        • SQL as a familiar data warehousing tool
        • Extensibility - Types, Functions, Formats, Scripts
        • Scalability and Performance
      • Efficient SQL to Map-Reduce Compiler
    • 6. Hive Architecture
      • Interfaces: Web UI + Hive CLI + JDBC/ODBC (Browse, Query, DDL)
      • Hive QL: Parser, Planner, Optimizer, Execution
      • SerDe: CSV, Thrift, Regex
      • UDF/UDAF: substr, sum, average
      • FileFormats: TextFile, SequenceFile, RCFile
      • User-defined Map-Reduce scripts
      • Execution and storage: Map Reduce, HDFS
    • 7. Hive DDL
      • DDL
        • Complex columns
        • Partitions
        • Buckets
      • Example
        • CREATE TABLE stats_active_daily(username STRING, userid INT, last_login INT, num_login INT, num_longsession INT) PARTITIONED BY (dt STRING, app STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
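      • The example above covers partitions only; a hedged sketch of complex columns and buckets (table and column names are illustrative, not from the talk):
        • CREATE TABLE user_actions_bucketed(userid INT, actionid INT, appdata MAP<STRING, STRING>, tags ARRAY<STRING>) PARTITIONED BY (dt STRING) CLUSTERED BY (userid) INTO 32 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' COLLECTION ITEMS TERMINATED BY ',' MAP KEYS TERMINATED BY ':' STORED AS TEXTFILE;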
    • 8. Hive DML
      • Data loading
        • LOAD DATA LOCAL INPATH '/data/scribe_log/STATSLOGIN/STATSLOGIN-$YESTERDAY*' OVERWRITE INTO TABLE stats_login PARTITION(dt='$YESTERDAY', app='${APP}');
      • Insert data into Hive tables
        • INSERT OVERWRITE TABLE stats_active_daily PARTITION (dt='$YESTERDAY', app='${APP}') SELECT username, userid, MAX(login_time), COUNT(1), SUM(IF(login_type=3,1,0)) FROM stats_login WHERE dt='$YESTERDAY' and app='${APP}' GROUP BY username, userid;
    • 9. Hive Query Language
      • SQL
        • Where
        • Group By
        • Equi-Join
        • Sub query in "From" clause
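      • Not in the original slides: a small HQL sketch combining an equi-join with a subquery in the FROM clause, reusing tables from this deck (the userid column on user_information is an assumption)
        • SELECT u.genderid, COUNT(1) FROM user_information u JOIN (SELECT userid FROM stats_active_daily WHERE dt='$YESTERDAY' AND app='${APP}') a ON (u.userid = a.userid) GROUP BY u.genderid;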
    • 10. Multi-table Group-By/Insert
          • FROM user_information
          • INSERT OVERWRITE TABLE log_user_gender PARTITION (dt='$YESTERDAY') SELECT '$YESTERDAY', genderid, COUNT(1) GROUP BY genderid
          • INSERT OVERWRITE TABLE log_user_age PARTITION (dt='$YESTERDAY') SELECT '$YESTERDAY', YEAR(dob), COUNT(1) GROUP BY YEAR(dob)
          • INSERT OVERWRITE TABLE log_user_education PARTITION (dt='$YESTERDAY') SELECT '$YESTERDAY', educationid, COUNT(1) GROUP BY educationid
          • INSERT OVERWRITE TABLE log_user_job PARTITION (dt='$YESTERDAY') SELECT '$YESTERDAY', jobid, COUNT(1) GROUP BY jobid
    • 11. File Formats
      • TextFile:
        • Easy for other applications to write/read
        • Gzip text files are not splittable
      • SequenceFile:
        • http://wiki.apache.org/hadoop/SequenceFile
        • Only Hadoop can read it
        • Supports splittable compression
      • RCFile: Block-based columnar storage
        • https://issues.apache.org/jira/browse/HIVE-352
        • Use SequenceFile block format
        • Columnar storage inside a block
        • 25% smaller compressed size
        • On-par or better query performance depending on the query
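      • Illustrative sketch (not from the talk): storing a compressed copy of an existing table as RCFile (the _rc table name is hypothetical)
        • SET hive.exec.compress.output=true;
        • CREATE TABLE stats_active_daily_rc(username STRING, userid INT, last_login INT, num_login INT, num_longsession INT) PARTITIONED BY (dt STRING, app STRING) STORED AS RCFILE;
        • INSERT OVERWRITE TABLE stats_active_daily_rc PARTITION (dt='$YESTERDAY', app='${APP}') SELECT username, userid, last_login, num_login, num_longsession FROM stats_active_daily WHERE dt='$YESTERDAY' AND app='${APP}';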
    • 12. SerDe
      • Serialization/Deserialization
      • Row Format
        • CSV (LazySimpleSerDe)
        • Thrift (ThriftSerDe)
        • Regex (RegexSerDe)
        • Hive Binary Format (LazyBinarySerDe)
      • LazySimpleSerDe and LazyBinarySerDe
        • Deserialize the field when needed
        • Reuse objects across different rows
        • Text and Binary format
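      • Illustrative sketch (not from the talk): declaring a table that parses raw log lines with RegexSerDe (the jar path, regex and columns are hypothetical)
        • ADD JAR /usr/lib/hive/lib/hive-contrib.jar;
        • CREATE TABLE raw_request_log(client_ip STRING, request_uri STRING, execution_time STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' WITH SERDEPROPERTIES ("input.regex" = "(\\S+) (\\S+) (\\S+)") STORED AS TEXTFILE;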
    • 13. UDF/UDAF
      • Features:
        • Use either Java or Hadoop Objects (int, Integer, IntWritable)
        • Overloading
        • Variable-length arguments
        • Partial aggregation for UDAF
      • Example UDF:
        • import org.apache.hadoop.hive.ql.exec.UDF; public class UDFExampleAdd extends UDF { public int evaluate(int a, int b) { return a + b; } }
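      • Registering and calling the UDF from Hive (a sketch; the jar path and package name are assumptions):
        • ADD JAR /path/to/udf_example.jar;
        • CREATE TEMPORARY FUNCTION example_add AS 'com.example.hive.udf.UDFExampleAdd';
        • SELECT example_add(num_login, num_longsession) FROM stats_active_daily WHERE dt='$YESTERDAY';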
    • 14. What do we use Hadoop for?
      • Storing Zing Me core log data
      • Storing Zing Me Game/App log data
      • Storing backup data
      • Processing/Analyzing data with HIVE
      • Storing social data (feed, comment, voting, chat messages, …) with HBase
    • 15. Data Usage
      • Statistics per day:
        • ~ 300 GB of new data added per day
        • ~ 800 GB of data scanned per day
        • ~ 10,000 Hive jobs per day
    • 16. Where is the data stored?
      • Hadoop/Hive Warehouse
        • 90 TB of data
        • 20 nodes, 16 cores/node
        • 16 TB per node
        • Replication factor = 2
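        • Back-of-the-envelope check (not in the slides): 20 nodes × 16 TB ≈ 320 TB raw capacity; if the 90 TB is counted before replication, a replication factor of 2 puts roughly 180 TB on disk, a little over half of the cluster.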
    • 17. Log Collecting, Analyzing & Reporting
      • Need
        • Simple & high performance framework for log collection
        • Central, highly available & scalable storage
        • Easy-to-use tools for data analysis (schema-based, SQL-like queries, …)
        • Robust framework for developing reports
    • 18. Log Collecting, Analyzing & Reporting
      • Version 1 (RDBMS-style)
        • Log data goes directly into a MySQL database (master)
        • Data is transformed into another MySQL database (off-load)
        • Statistics queries run there and export results into other MySQL tables
      • Performance problems
        • Slow log inserts, especially concurrent inserts
        • Slow query times on large datasets
    • 19. Log Collecting, Analyzing & Reporting
      • Version 2 (Scribe, Hadoop & Hive)
        • Fast log collection
        • Acceptable query times on large datasets
        • Data replication
        • Distributed calculation
    • 20. Log Collecting, Analyzing & Reporting
      • Components
        • Log Collector
        • Log/Data Transformer
        • Data Analyzer
        • Web Reporter
      • Process
        • Define the logs
        • Integrate logging into the application
        • Analyze the logs/data
        • Develop the reports
    • 21. Log Collecting, Analyzing & Reporting
      • Log Collector
        • Scribe:
          • a server for aggregating streaming log data
          • designed to scale to a very large number of nodes and be robust to network and node failures
          • hierarchical stores
          • Thrift service using the non-blocking C++ server
        • Thrift clients in C/C++, Java, PHP, …
    • 22. Log Collecting, Analyzing & Reporting
      • Log format (common)
        • Application-action log (mapped to a Hive table in the sketch below)
          • server_ip server_domain client_ip username actionid createdtime appdata execution_time
        • Request log
          • server_ip request_domain request_uri request_time execution_time memory client_ip username application
        • Game action log
          • time username actionid gameid goldgain coingain expgain itemtype itemid userid_affect appdata
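      • Not in the original slides: a hedged sketch of how the application-action log could be mapped to a partitioned Hive table (the table name and column types are assumptions; the columns follow the listed fields)
        • CREATE TABLE log_app_action(server_ip STRING, server_domain STRING, client_ip STRING, username STRING, actionid INT, createdtime INT, appdata STRING, execution_time INT) PARTITIONED BY (dt STRING, app STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n' STORED AS TEXTFILE;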
    • 23. Log Collecting, Analyzing & Reporting
      • Scribe – file store
          • port=1463
          • max_msg_per_second=2000000
          • max_queue_size=10000000
          • new_thread_per_category=yes
          • num_thrift_server_threads=10
          • check_interval=3
          • # DEFAULT - write all other categories to /data/scribe_log
          • <store>
          • category=default
          • type=file
          • file_path=/data/scribe_log
          • base_filename=default_log
          • max_size=8000000000
          • add_newlines=1
          • rotate_period=hourly
          • #rotate_hour=0
          • rotate_minute=1
          • </store>
    • 24. Log Collecting, Analyzing & Reporting
      • Scribe – buffer store
          • <store>
          • category=default
          • type=buffer
          • target_write_size=20480
          • max_write_interval=1
          • buffer_send_rate=1
          • retry_interval=30
          • retry_interval_range=10
          • <primary>
          • type=network
          • remote_host=xxx.yyy.zzz.ttt
          • remote_port=1463
          • </primary>
          • <secondary>
          • type=file
          • fs_type=std
          • file_path=/tmp
          • base_filename=zmlog_backup
          • max_size=30000000
          • </secondary>
          • </store>
    • 25. Log Collecting, Analyzing & Reporting
      • Log/Data Transformer
        • Helps import data from multiple types of sources into Hive
        • Semi-automated
        • Log files to Hive:
          • LOAD DATA LOCAL INPATH … OVERWRITE INTO TABLE …
        • MySQL data to Hive (sketched below):
          • Extract data using SELECT … INTO OUTFILE …
          • Import using LOAD DATA
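      • A minimal sketch of that MySQL-to-Hive path (illustrative only; the file path is hypothetical and the Hive partition column is an assumption)
        • -- MySQL side: dump a table to a tab-delimited file
        • SELECT * INTO OUTFILE '/tmp/user_information.tsv' FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n' FROM user_information;
        • -- Hive side: load the dump into the matching table
        • LOAD DATA LOCAL INPATH '/tmp/user_information.tsv' OVERWRITE INTO TABLE user_information PARTITION (dt='$YESTERDAY');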
    • 26. Log Collecting, Analyzing & Reporting
      • Data Analyzer
        • Calculations written in the Hive query language (HQL): SQL-like
        • Data partitioning, query optimization:
          • very important for improving speed
          • distributed data reading
          • optimize queries for one-pass data reading
        • Automation
          • hive --service cli -f hql_file
          • Bash shell, crontab
        • Export data and import into MySQL for the web reports
          • Export with the Hadoop command line: hadoop fs -cat
          • Import using LOAD DATA LOCAL INFILE … INTO TABLE … (sketched below)
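      • A hedged sketch of the final MySQL import step (the local file, report table and its columns are hypothetical; the Hive result is assumed to have been copied locally first, e.g. with hadoop fs -cat … > /tmp/log_user_gender.tsv)
        • LOAD DATA LOCAL INFILE '/tmp/log_user_gender.tsv' INTO TABLE report_user_gender FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n' (dt, genderid, num_users);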
    • 27. Log Collecting, Analyzing & Reporting
      • Web Reporter
        • PHP web application
        • Modular
        • Standard format and template
        • jpgraph
    • 28. Log Collecting, Analyzing & Reporting
      • Applications
        • Summarization
          • User/Apps indicators: active, churn-rate, login, return…
          • User demographics: age, gender, education, job, location…
          • User interactions/Apps actions
        • Data mining
        • Spam Detection
        • Application performance
        • Ad-hoc Analysis
    • 29. THANK YOU!
