Hadoop Introduction - 24x7Coach.com
Hadoop - "An open-source software framework for storage and large-scale processing of data-sets on clusters of commodity hardware."

"Hadoop Introduction - 24x7Coach.com" is covering following topics:
* What is Big Data
* What is Hadoop?
* Why Hadoop?
* Key Breakthroughs of Hadoop
* Future of Hadoop
* Course Curriculum
* Course offering



  1. 1. Becoming Professional with Apache Hadoop. Hadoop, Apache, the Apache feather logo, and the Apache Hadoop project logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and other countries.
  2. 2. Agenda • What is Big Data • What is Hadoop? • Why Hadoop? • Key Breakthroughs of Hadoop • Future of Hadoop • Course Curriculum • Course Deliverables
  3. 3. BIG DATA 3
  4. 4. Big Data The term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
  5. 5. Big Data - Why • Large volume of data • Existing tools were not designed to handle such huge volumes of data • Gigabyte = 10^9 = 1,000,000,000 bytes • Terabyte = 10^12 = 1,000,000,000,000 bytes • Petabyte = 10^15 = 1,000,000,000,000,000 bytes • Exabyte = 10^18 = 1,000,000,000,000,000,000 bytes • Zettabyte = 10^21 = 1,000,000,000,000,000,000,000 bytes
  6. 6. Big Data - Journey • 1990: store 1,400 MB, transfer speed of 4.5 MB/s, read the entire drive in ~5 minutes • 2010: store 1 TB, transfer speed of 100 MB/s, read the entire drive in ~3 hours • Hadoop: 100 drives working at the same time can read 1 TB of data in ~2 minutes
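The drive-read times on the slide above can be checked with some back-of-the-envelope arithmetic. A minimal sketch in Python, assuming sequential reads at the quoted transfer speeds and data striped evenly across drives (the function name `read_time_seconds` is illustrative, not part of any Hadoop API):

```python
def read_time_seconds(size_mb, speed_mb_per_s, drives=1):
    """Time to read size_mb of data spread evenly across `drives` disks."""
    return size_mb / (speed_mb_per_s * drives)

# 1990: a 1,400 MB drive read at 4.5 MB/s
t_1990 = read_time_seconds(1_400, 4.5)

# 2010: a 1 TB drive (1,000,000 MB) read at 100 MB/s
t_2010 = read_time_seconds(1_000_000, 100)

# Hadoop-style parallelism: the same 1 TB striped across 100 drives
t_parallel = read_time_seconds(1_000_000, 100, drives=100)

print(f"1990: {t_1990 / 60:.1f} min")       # ~5.2 minutes
print(f"2010: {t_2010 / 3600:.1f} hours")   # ~2.8 hours
print(f"100 drives: {t_parallel / 60:.1f} min")  # ~1.7 minutes
```

This is exactly the observation that motivates Hadoop: single-drive capacity grew far faster than single-drive bandwidth, so the only way to read big data quickly is to read from many drives in parallel.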
  7. 7. Big Data - Statistics 2010 • International Data Corporation (IDC) estimates total data at 1.2 ZETTABYTES (1.2 Trillion Gigabytes) 2011 • Facebook ~ 6 billion messages per day, ~ 400 Petabytes per month • eBay ~ 2 billion page views a day, ~ 9 Petabytes of storage • Satellite images by Skybox Imaging ~ 1 Terabyte per day • Google creates 250 – 300 Petabytes of data per month
  8. 8. HADOOP OVERVIEW 8
  9. 9. Hadoop An open-source software framework for storage and large-scale processing of data-sets on clusters of commodity hardware.
  10. 10. History • 2008 - Hadoop became an Apache Top Level Project • 2006 - Yahoo! hires Doug Cutting to work on Hadoop with a dedicated team • 2005 - Doug Cutting and the Nutch team implement Google's frameworks in Nutch • 2004 - Google publishes the Google File System (GFS) and MapReduce framework papers
  11. 11. Hadoop Cluster Source: BRAD HEDLUND.com
  12. 12. Current Challenges & How Hadoop Addresses Them Current Challenges (3Vs): • Volume = amount of data • Velocity = speed of data in and out • Variety = range of data types and sources Advantages of Using Hadoop: • Scalable • Available • Reliable • Cost effective • Flexible • Fast • Resilient to failure
  13. 13. Key Breakthroughs Source: mapr.com
  14. 14. Industry Sectors • Search Engines • Social Networking • Finance/Banking • Retail Industries • E-Commerce • Email Services • Security • Government and Public Sectors
  15. 15. Who is using it?
  16. 16. The future • 33% of companies are using Hadoop • 20% of companies have started their development in Hadoop • 58% Compound Annual Growth Rate (CAGR) of Hadoop usage • By 2018, $2.18 billion will be spent on Hadoop Source: indeed.com
  17. 17. Course Offerings by
  18. 18. Course Curriculum Ecosystem Description HDFS A distributed filesystem that runs on large clusters of commodity machines. MapReduce A distributed data processing model and execution environment that runs on large clusters of commodity machines. Pig A data flow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters. Hive A distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL. Sqoop A tool for efficiently moving data between relational databases and HDFS. HBase A distributed, column-oriented database. HBase uses HDFS for its underlying storage, and supports both batch-style computations using MapReduce and point queries.
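The MapReduce model described in the curriculum above can be illustrated with the classic word-count job. Real Hadoop jobs implement Mapper and Reducer classes in Java and run on a cluster; the sketch below is a minimal single-machine simulation of the map, shuffle, and reduce phases in plain Python, and the phase names (`map_phase`, `shuffle`, `reduce_phase`) are illustrative, not Hadoop APIs:

```python
from collections import defaultdict

def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in an input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle/sort: group values by key, as the framework
    does between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reducer: sum all the counts emitted for one word."""
    return key, sum(values)

lines = ["Hadoop stores big data", "Hadoop processes big data"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)  # {'hadoop': 2, 'stores': 1, 'big': 2, 'data': 2, 'processes': 1}
```

The key idea is that the mapper and reducer each see only a small piece of the data, so the framework can run many copies of them in parallel across a cluster, with the shuffle phase moving intermediate pairs between machines.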
  19. 19. Course Curriculum Ecosystem Description ZooKeeper A distributed, highly available coordination service. ZooKeeper provides primitives such as distributed locks that can be used for building distributed applications. Oozie A workflow scheduler system to manage Hadoop jobs. Flume A distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Integrations • Hive <-> HBase • MapReduce -> Hive • MapReduce -> HBase
  20. 20. Course Deliverables  Course material  Practice exercises for each topic  Quiz for each topic  Workshop style coaching  Interactive approach  Tips and techniques for certification exam and interviews  Group activities for better understanding  Case Studies at the end of course
  21. 21. Workshop Formats Classroom workshops Corporate Workshops Virtual Online Workshops Consulting Services
  22. 22. Array of courses we offer Project Management Training Technical Training Soft skills Training Career Readiness Training
  23. 23. Interested? + 91 95057 44455 info@24x7coach.com Visit www.24x7Coach.com
