Hadoop Presentation

2012

Presenter : Pham Thai Hoa
Email : thaihoabo@gmail.com
Web : http://mobion.com/hoa

4/14/2012 Pham Thai Hoa

Topics
 Introduction to Hadoop
 Introduction to Hive
 Introduction to Logging
 Using Hadoop at Mobion
 Warehouse at Mobion
 Q&A


What is Hadoop
 A framework for distributed processing of large data sets
 Inspired by Google’s architecture: MapReduce and GFS (Google File System)
 A top-level Apache project
 Open source
 Hadoop has two core components
+ MapReduce engine
+ Hadoop Distributed File System (HDFS)
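
The MapReduce model named above can be sketched in a few lines of plain Python. This is only an in-memory illustration of the three steps (map, shuffle, reduce) and is not Hadoop's API; the function names and the sample input are made up for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["hadoop stores data", "hadoop processes data"]))
```

On a real cluster the map and reduce calls run on many nodes in parallel and the shuffle moves data across the network; only the shape of the computation is the same.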

Why use Hadoop
 Fault-tolerant hardware is expensive
 Hadoop is designed to run on cheap
commodity hardware
 It automatically handles data
replication and node failure
 It does the hard work – you can focus
on processing data
 It supports three modes: Local (Standalone), Pseudo-Distributed, and Fully Distributed

Data Flow into Hadoop


Who uses Hadoop
 Amazon: builds product search indices using the streaming API and pre-existing C++, Perl, and Python tools
 Yahoo: more than 100,000 CPUs in >40,000 computers running Hadoop
 Facebook: uses Hadoop to store copies of internal log and dimension data sources, and uses it as a source for reporting/analytics and machine learning
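
The streaming API mentioned above lets any program act as a mapper or reducer by reading lines and writing tab-separated key/value lines. Below is a minimal sketch of that contract in Python. The "user action" log format is an assumption made for illustration, and run_local is a hypothetical stand-in for the framework's sort step; in a real streaming job the mapper and reducer would be separate scripts reading standard input.

```python
from itertools import groupby


def mapper(lines):
    """Turn raw 'user action' lines into tab-separated 'action<TAB>1' records."""
    for line in lines:
        parts = line.split()
        if len(parts) == 2:  # skip malformed lines
            yield f"{parts[1]}\t1"


def reducer(sorted_lines):
    """Sum the counts per key. Input must already be sorted by key."""
    keyed = (line.split("\t", 1) for line in sorted_lines)
    for key, group in groupby(keyed, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


def run_local(lines):
    """Local stand-in for the framework: map, sort by key, then reduce."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    for record in run_local(["u1 login", "u2 play", "u1 play"]):
        print(record)
```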

What is Hive
 Hive is a data warehouse system for
Hadoop
 Uses MapReduce for execution
 Uses HDFS for storage
 Stores metadata in an RDBMS
 Scalability and performance
 Interoperability
 Queried with a SQL-like language called HiveQL

Data Flow into Hive


Hive Data Model
 Tables
+ Typed columns (int, float, string, …)
+ Complex types (array, map, struct) for JSON-like data
 Partitions
+ e.g., to range-partition tables by
date
 Buckets
+ Hash partitions within ranges (useful
for sampling, join optimization)
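
How partitions and buckets decide where a row lands can be sketched as follows. Each partition value maps to its own subdirectory of the table, and the bucket is chosen by hashing the bucketing column modulo the bucket count. The table path, column names, and helper functions are hypothetical, and using an integer key as its own hash is a simplification.

```python
def partition_path(table_dir, partition_col, value):
    """Each partition value gets its own subdirectory, named column=value."""
    return f"{table_dir}/{partition_col}={value}"


def bucket_for(key_hash, num_buckets):
    """Pick a bucket: non-negative hash of the bucketing column, modulo the bucket count."""
    return (key_hash & 0x7FFFFFFF) % num_buckets


def place_row(table_dir, row, partition_col, bucket_col, num_buckets):
    """Return the (partition directory, bucket number) for one row."""
    part = partition_path(table_dir, partition_col, row[partition_col])
    # Simplification: the integer key itself is used as the hash value.
    bucket = bucket_for(row[bucket_col], num_buckets)
    return part, bucket


if __name__ == "__main__":
    row = {"dt": "2012-04-14", "user_id": 42}
    print(place_row("/warehouse/events", row, "dt", "user_id", 8))
```

A query that filters on the partition column only has to read the matching directories, and sampling can read a single bucket instead of the whole table.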

Hive Metastore
 Database: namespace containing a
set of tables
 Holds table/partition definitions (column types, mappings to HDFS directories)
 Statistics
 Implemented with the DataNucleus ORM; runs on Derby, MySQL, and many other relational databases

Introduction to Logging
 A logging system has three broad components
+ Client Code Interface
+ Distribution System
+ Do Something Usefullizer (whatever consumes the collected logs)
 Scribe is a server for aggregating streaming log data. It is designed to scale to a very large number of nodes and to be robust to network and node failures.
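
The robustness described above rests on store-and-forward: a node-local server keeps log entries while the upstream server is unreachable and delivers them once it returns. The sketch below illustrates the idea only. The class names are hypothetical, this is not Scribe's API, and the buffer is kept in memory here for brevity rather than on local disk.

```python
class Upstream:
    """Hypothetical central log server; `available` simulates network or node failure."""

    def __init__(self):
        self.available = True
        self.received = []

    def send(self, entries):
        if not self.available:
            raise ConnectionError("upstream unreachable")
        self.received.extend(entries)


class LocalAggregator:
    """Hypothetical per-node aggregator that buffers entries while upstream is down."""

    def __init__(self, upstream):
        self.upstream = upstream
        self.buffer = []

    def log(self, category, message):
        """Queue one (category, message) entry and try to forward everything queued."""
        self.buffer.append((category, message))
        self.flush()

    def flush(self):
        """Forward buffered entries; keep them if the upstream is unreachable."""
        try:
            self.upstream.send(self.buffer)
        except ConnectionError:
            return False  # entries stay buffered and are retried on the next flush
        self.buffer = []
        return True


if __name__ == "__main__":
    central = Upstream()
    node = LocalAggregator(central)
    central.available = False
    node.log("web", "request served")
    central.available = True
    node.flush()
    print(central.received)
```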

Why use Scribe
 Scalability and performance
 Built on an event notification library (libevent)
 Built on the Thrift framework
 Hadoop support is optional
 Client using
 Distributed scribe system
 Handles over 1 million log messages per second
 Hierarchical stores


Warehouse at Mobion
 Log Collector
 Log/Data Transformer
 Data Analyzer
 Web Reporter
 Log definition
 Log integration (into applications)
 Log/data analysis
 Report development (API, Mobion, Music …)

Warehouse at Mobion
 Data mining
 Music Recommendation
 Spam Detection
 Application performance
 Export data and import it into MySQL for web reports
 Analytics system
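
The export-and-import step above can be sketched as parsing an exported text file and loading it into a relational table. To keep the example self-contained it uses Python's built-in sqlite3 as a stand-in for MySQL. The song_plays table, its columns, and the sample rows are hypothetical; the Ctrl-A delimiter is the default field separator for Hive text output.

```python
import sqlite3

HIVE_FIELD_DELIMITER = "\x01"  # default field separator in Hive text output


def parse_export(lines):
    """Split each exported line into a (song, plays) row."""
    for line in lines:
        song, plays = line.rstrip("\n").split(HIVE_FIELD_DELIMITER)
        yield song, int(plays)


def load_report(conn, lines):
    """Create the report table if needed and bulk-insert the parsed rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS song_plays (song TEXT PRIMARY KEY, plays INTEGER)"
    )
    conn.executemany(
        "INSERT INTO song_plays (song, plays) VALUES (?, ?)", parse_export(lines)
    )
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # stand-in for the MySQL reporting database
    load_report(conn, ["song_a\x0110\n", "song_b\x013\n"])
    print(conn.execute("SELECT song, plays FROM song_plays ORDER BY plays DESC").fetchall())
```

The web report then only queries the small relational table and never touches the Hadoop cluster directly.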


Q&A
 Why use Hadoop?
 Why use Hive?
 Why do we need a logging system?
 What is the warehouse system architecture?
 Do we use these systems for voting, chat, messaging, and feeds?
 How can we use them for recommendations and suggestions?


Reference Links
 http://facebook.com
 http://highscalability.com/product-scribe-facebooks-scalable-logging-system
 http://hadoop.apache.org/
 http://hive.apache.org/
 http://wiki.apache.org/hadoop/PoweredBy
 http://www.apache.org/foundation/thanks.html

THANK YOU
