Hadoop Presentation
       2012

Presenter : Pham Thai Hoa
Email : thaihoabo@gmail.com
Web : http://mobion.com/hoa



                    4/14/2012   Pham Thai Hoa
Topics
 - Introduction to Hadoop
 - Introduction to Hive
 - Introduction to Logging
 - Using Hadoop at Mobion
 - Warehouse at Mobion
 - Q&A




What is Hadoop?
 - A framework for distributed processing of large data sets
 - Inspired by Google’s architecture: MapReduce and GFS (Google File System)
 - A top-level Apache project
 - Open source
 - Has two core components:
   + MapReduce, the processing engine
   + Hadoop Distributed File System (HDFS), the storage layer
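The MapReduce model named on this slide can be illustrated without a cluster. The following is a minimal single-process sketch in Python, not Hadoop's actual API: the function names `map_phase`, `shuffle`, and `reduce_phase` are made up for illustration, and word counting is simply the conventional example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all values by key, between the map and reduce steps."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["hadoop stores data", "hadoop processes data"]))
```

On a real cluster the map and reduce functions run in parallel on many nodes and read their input from HDFS; only the shape of the computation is shown here.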
Why use Hadoop?
 - Fault-tolerant hardware is expensive
 - Hadoop is designed to run on cheap commodity hardware
 - It handles data replication and node failure automatically
 - It does the hard work, so you can focus on processing data
 - It supports three modes: Local, Pseudo-Distributed, and Fully-Distributed
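To make the replication bullet concrete, here is a toy Python model of how keeping several copies of each block lets data survive node failures. It is only a sketch: the round-robin placement, the node and block names, and both helper functions are assumptions for illustration and do not reflect HDFS's real placement policy. A replication factor of 3 is used because that is the HDFS default.

```python
def place_blocks(block_ids, nodes, replication=3):
    """Assign each block to `replication` distinct nodes (simple round-robin)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = {
            nodes[(i + r) % len(nodes)] for r in range(replication)
        }
    return placement


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one replica on a live node."""
    failed = set(failed_nodes)
    return {block for block, holders in placement.items() if holders - failed}


if __name__ == "__main__":
    nodes = ["node1", "node2", "node3", "node4"]
    placement = place_blocks(["blk_0", "blk_1", "blk_2", "blk_3"], nodes)
    # With three replicas per block, losing two nodes leaves every block readable.
    print(sorted(readable_blocks(placement, ["node1", "node2"])))
```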
Data Flow into Hadoop




Who uses Hadoop?
 - Amazon: builds product search indices using the streaming API and pre-existing C++, Perl, and Python tools
 - Yahoo: more than 100,000 CPUs in more than 40,000 computers running Hadoop
 - Facebook: stores copies of internal log and dimension data sources in Hadoop and uses them as a source for reporting/analytics and machine learning
What is Hive?
 - A data warehouse system for Hadoop
 - Uses MapReduce for execution
 - Uses HDFS for storage
 - Keeps metadata in an RDBMS
 - Scalability and performance
 - Interoperability
 - Queried with a SQL-like language called HiveQL
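As a rough illustration of how a SQL-like query can execute as a MapReduce job, the sketch below evaluates the equivalent of a `SELECT page, COUNT(*) ... GROUP BY page` query in plain Python. The table name `page_views`, its columns, and the sample rows are invented for this example, and Hive's real query planner is far more involved; only the map, group, reduce shape is shown.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical query being modelled (table and column names are assumptions):
#   SELECT page, COUNT(*) FROM page_views GROUP BY page;


def map_rows(rows):
    """Map: emit the GROUP BY column as the key and 1 as the value."""
    for row in rows:
        yield row["page"], 1


def reduce_groups(pairs):
    """Sort by key (the shuffle), then sum the values of each group (the reduce)."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, sum(value for _, value in group)


if __name__ == "__main__":
    page_views = [
        {"user": "a", "page": "/home"},
        {"user": "b", "page": "/music"},
        {"user": "a", "page": "/home"},
    ]
    print(list(reduce_groups(map_rows(page_views))))
```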
Data Flow into Hive




Hive Data Model
 - Tables
   + Typed columns (int, float, string, …)
   + Also array/map/struct types for JSON-like data
 - Partitions
   + e.g., to range-partition tables by date
 - Buckets
   + Hash partitions within ranges (useful for sampling and join optimization)
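One way to picture partitions and buckets is as a rule that maps each row to a directory and a bucket number. The sketch below assumes a table partitioned by a date column `dt` and bucketed on an integer `user_id`. The warehouse path, the column names, and the use of a plain modulo as the bucket hash are all assumptions for illustration, not Hive's exact behaviour.

```python
def partition_dir(table_root, dt):
    """Partitions map to subdirectories named <column>=<value> under the table."""
    return f"{table_root}/dt={dt}"


def bucket_index(user_id, num_buckets):
    """Buckets split a partition by hashing a column; modulo stands in for the hash."""
    return user_id % num_buckets


def locate(row, table_root="/warehouse/events", num_buckets=4):
    """Return the (directory, bucket number) a row would be stored in."""
    directory = partition_dir(table_root, row["dt"])
    return directory, bucket_index(row["user_id"], num_buckets)


if __name__ == "__main__":
    print(locate({"dt": "2012-04-14", "user_id": 42}))
```

Because rows with the same `user_id` always land in the same bucket, a query can sample a single bucket or join two tables bucket by bucket, which is the benefit the slide mentions.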
Hive Metastore
 - Database: a namespace containing a set of tables
 - Holds table and partition definitions (column types, mappings to HDFS directories)
 - Holds statistics
 - Implemented with the DataNucleus ORM; runs on Derby, MySQL, and many other relational databases
Introduction to Logging
 - A logging system has three broad components:
   + Client code interface
   + Distribution system
   + "Do Something Usefullizer" (whatever consumes the logs and does something useful with them)
 - Scribe is a server for aggregating streaming log data. It is designed to scale to a very large number of nodes and to be robust to network and node failures.
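The three components can be sketched as a small in-memory pipeline. This is not Scribe's real API: the class names, the `log` and `flush` methods, and the buffering policy are hypothetical. The sketch borrows only two ideas from Scribe, namely that each message carries a category and that messages are buffered locally while the downstream store is unreachable.

```python
from collections import defaultdict


class Store:
    """The consumer: keeps messages per category and can be switched offline."""

    def __init__(self):
        self.online = True
        self.messages = defaultdict(list)

    def write(self, category, message):
        if not self.online:
            raise ConnectionError("store unreachable")
        self.messages[category].append(message)


class Forwarder:
    """The distribution system: forwards entries, buffering them on failure."""

    def __init__(self, store):
        self.store = store
        self.buffer = []

    def log(self, category, message):
        """The client code interface: one call per log entry."""
        self.buffer.append((category, message))
        self.flush()

    def flush(self):
        """Deliver buffered entries in order; keep the rest if delivery fails."""
        while self.buffer:
            try:
                self.store.write(*self.buffer[0])
            except ConnectionError:
                return
            self.buffer.pop(0)


if __name__ == "__main__":
    store = Store()
    forwarder = Forwarder(store)
    store.online = False
    forwarder.log("music", "play song=1")  # buffered while the store is down
    store.online = True
    forwarder.log("music", "play song=2")  # delivers both entries, in order
    print(dict(store.messages))
```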
Why use Scribe?
 - Scalability and performance
 - Uses an event notification library (libevent)
 - Uses the Thrift framework
 - Hadoop is optional
 - Client using
 - Distributed Scribe system
 - Over 1 million messages per second for logging
 - Hierarchical stores

Warehouse at Mobion
 - Log Collector
 - Log/Data Transformer
 - Data Analyzer
 - Web Reporter
 - Log definition
 - Log integration (into applications)
 - Log/data analysis
 - Report development (API, Mobion, Music, …)
Warehouse at Mobion (continued)
 - Data mining
 - Music recommendation
 - Spam detection
 - Application performance
 - Export data and import it into MySQL for web reports
 - Analytics system



Q&A
 - Why use Hadoop?
 - Why use Hive?
 - Why do we need a logging system?
 - What is the warehouse system architecture?
 - Do we use these systems for voting, chat, messages, and feeds?
 - How can we use them for recommendations and suggestions?

Links
 - http://facebook.com
 - http://highscalability.com/product-scribe-facebooks-scalable-logging-system
 - http://hadoop.apache.org/
 - http://hive.apache.org/
 - http://wiki.apache.org/hadoop/PoweredBy
 - http://www.apache.org/foundation/thanks.html
THANK YOU
