Hadoop Presentation


Published in: Technology, Education

  1. Hadoop Presentation 2012
     • Presenter: Pham Thai Hoa
     • Email: thaihoabo@gmail.com
     • Web: http://mobion.com/hoa
     • Date: 4/14/2012
  2. Topics
     • Introduction to Hadoop
     • Introduction to Hive
     • Introduction to Logger
     • Using Hadoop at Mobion
     • Warehouse at Mobion
     • Q&A
  3. What is Hadoop
     • A framework for distributed processing
     • Inspired by Google's architecture: MapReduce and GFS
     • A top-level Apache project
     • Open source
     • Has two core components:
       + The MapReduce engine
       + The Hadoop Distributed File System (HDFS)
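The MapReduce model named on this slide can be illustrated with a small in-memory word count. This is only a sketch of the programming model in plain Python, not Hadoop code: the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are made up for illustration, and a real cluster runs these steps in parallel across many nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, which the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

Calling `word_count(["hadoop hive hadoop", "hive scribe"])` counts each word across both lines. The user writes only the map and reduce functions; grouping by key is the framework's job.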
  4. Why use Hadoop
     • Fault-tolerant hardware is expensive
     • Hadoop is designed to run on cheap commodity hardware
     • It automatically handles data replication and node failure
     • It does the hard work, so you can focus on processing data
     • It supports three modes: Local, Pseudo-Distributed, and Fully Distributed
  5. Data Flow into Hadoop
  6. Who uses Hadoop
     • Amazon: builds its product search indices using the streaming API and pre-existing C++, Perl, and Python tools
     • Yahoo: more than 100,000 CPUs in more than 40,000 computers running Hadoop
     • Facebook: uses Hadoop to store copies of internal log and dimension data sources, and as a source for reporting/analytics and machine learning
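The streaming API mentioned for Amazon lets any program act as a mapper or reducer by exchanging lines of text, by default one tab-separated key/value pair per line. The sketch below shows that record format with two hypothetical generator functions, `mapper` and `reducer`. In a real streaming job each would read `sys.stdin` and print to standard output, and Hadoop would do the sorting between them. Here `sorted()` stands in for that step.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_records):
    """Sum the counts per word; records must arrive sorted by key."""
    parsed = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in groupby(parsed, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run_locally(lines):
    """Simulate the job on one machine: map, sort by key, then reduce."""
    return list(reducer(sorted(mapper(lines))))
```

Because the contract is only lines of text, the same job could be written with existing C++ or Perl tools, which is the point the slide makes.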
  7. What is Hive
     • A data warehouse system for Hadoop
     • Uses MapReduce for execution
     • Uses HDFS for storage
     • Keeps metadata in an RDBMS
     • Scalability and performance
     • Interoperability
     • Queried with a SQL-like language called HiveQL
  8. Data Flow into Hive
  9. Hive Data Model
     • Tables
       + Typed columns (int, float, string, …)
       + Also array/map/struct for JSON-like data
     • Partitions
       + e.g., to range-partition tables by date
     • Buckets
       + Hash partitions within ranges (useful for sampling, join optimization)
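How partitions and buckets decide where a row is stored can be sketched as follows. The table path `/warehouse/page_views` and the columns `dt` and `user_id` are hypothetical. The bucketing rule is also simplified: it takes a non-negative integer key modulo the bucket count, while the hash Hive actually applies depends on the column type.

```python
def partition_dir(table_root, partition_column, value):
    """Each partition lives in its own subdirectory named column=value."""
    return f"{table_root}/{partition_column}={value}"


def bucket_for(user_id, num_buckets):
    """Simplified bucketing: the integer key modulo the number of buckets."""
    return user_id % num_buckets


def place_row(row, table_root="/warehouse/page_views", num_buckets=4):
    """Return (partition directory, bucket number) for one row."""
    directory = partition_dir(table_root, "dt", row["dt"])
    return directory, bucket_for(row["user_id"], num_buckets)
```

A query filtered on `dt` only needs to read the matching directory, and rows with the same `user_id` always fall into the same bucket, which is what makes sampling and bucketed joins cheap.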
  10. Hive Metastore
     • Database: a namespace containing a set of tables
     • Holds table/partition definitions (column types, mappings to HDFS directories)
     • Statistics
     • Implemented with the DataNucleus ORM
     • Runs on Derby, MySQL, and many other relational databases
  11. Introduction to Logger
     • A logging system has three broad components:
       + Client code interface
       + Distribution system
       + "Do Something Usefullizer"
     • Scribe is a server for aggregating streaming log data. It is designed to scale to a very large number of nodes and to be robust to network and node failures
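The three components on this slide can be sketched with two small classes. This is not Scribe's API (Scribe speaks Thrift); `Aggregator` and `LocalForwarder` are hypothetical names. The sketch also assumes one common way to survive a network or node failure: buffer messages locally and resend them once the downstream server is reachable again.

```python
from collections import defaultdict, deque


class Aggregator:
    """Stand-in for the central store that does something useful with the logs."""

    def __init__(self):
        self.available = True
        self.store = defaultdict(list)  # category -> list of messages

    def receive(self, category, message):
        if not self.available:
            raise ConnectionError("aggregator unreachable")
        self.store[category].append(message)


class LocalForwarder:
    """Stand-in for the distribution system: forward, or buffer while the aggregator is down."""

    def __init__(self, aggregator):
        self.aggregator = aggregator
        self.pending = deque()

    def log(self, category, message):
        """Client code interface: a single call with a category and a message."""
        self.pending.append((category, message))
        self.flush()

    def flush(self):
        """Send buffered messages in order; stop at the first failure and keep the rest."""
        while self.pending:
            category, message = self.pending[0]
            try:
                self.aggregator.receive(category, message)
            except ConnectionError:
                return
            self.pending.popleft()
```

Application code only ever calls `log(category, message)`. Whether the message is delivered immediately or after an outage is hidden behind the forwarder.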
  12. Why use Scribe
     • Scalability and performance
     • Event notification library
     • Thrift framework
     • Hadoop is optional
     • Clients use a distributed Scribe system
     • Over 1 million messages per second for logging
     • Hierarchical stores
  13. Warehouse at Mobion
     • Log Collector
     • Log/Data Transformer
     • Data Analyzer
     • Web Reporter
     • Log definition
     • Log integration (into applications)
     • Log/data analysis
     • Report development (API, Mobion, Music …)
  14. Warehouse at Mobion
     • Data mining
     • Music recommendation
     • Spam detection
     • Application performance
     • Export data and import it into MySQL for web reports
     • Analytics system
  15. Q&A
     • Why use Hadoop?
     • Why use Hive?
     • Why do we need a logging system?
     • What is the warehouse system architecture?
     • Do we use these systems for voting, chat, messages, and feeds?
     • How can we use them for recommendations and suggestions?
  16. Links
     • http://facebook.com
     • http://highscalability.com/product-scribe-facebooks-scalable-logging-system
     • http://hadoop.apache.org/
     • http://hive.apache.org/
     • http://wiki.apache.org/hadoop/PoweredBy
     • http://www.apache.org/foundation/thanks.html
  17. THANK YOU