Hadoop Introduction - 24x7Coach.com

Hadoop - "An open-source software framework for storage and large-scale processing of data-sets on clusters of commodity hardware."

"Hadoop Introduction - 24x7Coach.com" covers the following topics:
* What is Big Data
* What is Hadoop?
* Why Hadoop?
* Key Breakthroughs of Hadoop
* Future of Hadoop
* Course Curriculum
* Course offering


Transcript

  • 1. Becoming Professional with Apache Hadoop. (Hadoop, Apache, the Apache feather logo, and the Apache Hadoop project logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and other countries.)
  • 2. Agenda • What is Big Data • What is Hadoop? • Why Hadoop? • Key Breakthroughs of Hadoop • Future of Hadoop • Course Curriculum • Course Deliverables
  • 3. BIG DATA
  • 4. Big Data The term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
  • 5. Big Data - Why • Large volume of data • Existing tools were not designed to handle such huge volumes of data • Gigabyte = 10^9 = 1,000,000,000 bytes • Terabyte = 10^12 = 1,000,000,000,000 bytes • Petabyte = 10^15 = 1,000,000,000,000,000 bytes • Exabyte = 10^18 = 1,000,000,000,000,000,000 bytes • Zettabyte = 10^21 = 1,000,000,000,000,000,000,000 bytes
  • 6. Big Data - Journey • 1990 - Store 1,400 MB - Transfer speed of 4.5 MB/s - Read the entire drive in ~5 minutes • 2010 - Store 1 TB - Transfer speed of 100 MB/s - Read the entire drive in ~3 hours • Hadoop - 100 drives working at the same time can read 1 TB of data in ~2 minutes
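The drive-read arithmetic on this slide can be checked with a quick back-of-the-envelope sketch (the capacities and transfer speeds are the slide's own figures; the helper function is illustrative):

```python
def read_time_seconds(capacity_mb, speed_mb_s, drives=1):
    # Time to read the full capacity when the data is split evenly
    # across `drives` disks reading in parallel.
    return capacity_mb / (speed_mb_s * drives)

# 1990: a 1,400 MB drive at 4.5 MB/s
print(read_time_seconds(1400, 4.5) / 60)        # ~5.2 minutes

# 2010: a 1 TB (1,000,000 MB) drive at 100 MB/s
print(read_time_seconds(1_000_000, 100) / 3600) # ~2.8 hours

# Hadoop-style parallelism: 100 drives sharing the same 1 TB
print(read_time_seconds(1_000_000, 100, drives=100) / 60)  # ~1.7 minutes
```

This is the core observation behind Hadoop: single-drive transfer speed grew far more slowly than capacity, so the only way to read a big dataset quickly is to spread it over many disks and read them in parallel.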
  • 7. Big Data - Statistics • 2010: International Data Corporation (IDC) estimates the world's data at 1.2 Zettabytes (1.2 trillion Gigabytes) • 2011: Facebook ~6 billion messages per day, ~400 Petabytes per month; eBay ~2 billion page views a day, ~9 Petabytes of storage; satellite images by Skybox Imaging ~1 Terabyte per day; Google creates 250–300 Petabytes of data per month
  • 8. HADOOP OVERVIEW
  • 9. Hadoop An open-source software framework for storage and large-scale processing of data-sets on clusters of commodity hardware.
  • 10. History • 2004 - Google publishes the Google File System (GFS) and MapReduce framework papers • 2005 - Doug Cutting and the Nutch team implement Google's frameworks in Nutch • 2006 - Yahoo! hires Doug Cutting to work on Hadoop with a dedicated team • 2008 - Hadoop becomes an Apache Top Level Project
  • 11. Hadoop Cluster Source: BRAD HEDLUND.com
  • 12. Current Challenges & How Hadoop Addresses Them Current Challenges (3Vs): • Volume = amount of data • Velocity = speed of data in and out • Variety = range of data types and sources Advantages of Using Hadoop: • Scalable • Available • Reliable • Cost effective • Flexible • Fast • Resilient to failure
  • 13. Key Breakthroughs Source: mapr.com
  • 14. Industry Sectors • Search Engines • Social Networking • Finance/Banking • Retail Industries • E-Commerce • Email Services • Security • Government and Public Sectors
  • 15. Who is using?
  • 16. The future • 33% of companies are already using Hadoop • 20% of companies have started their Hadoop development • 58% Compound Annual Growth Rate (CAGR) of Hadoop usage • By 2018, $2.18 billion will be spent on Hadoop Source: indeed.com
  • 17. Course Offerings by
  • 18. Course Curriculum • HDFS - A distributed filesystem that runs on large clusters of commodity machines. • MapReduce - A distributed data processing model and execution environment that runs on large clusters of commodity machines. • Pig - A data flow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters. • Hive - A distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL. • Sqoop - A tool for efficiently moving data between relational databases and HDFS. • HBase - A distributed, column-oriented database. HBase uses HDFS for its underlying storage, and supports both batch-style computations using MapReduce and point queries.
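The map/shuffle/reduce model that MapReduce implements can be illustrated with a minimal word-count sketch in plain Python. This runs on a single machine with no Hadoop cluster; the function names (`map_phase`, `shuffle`, `reduce_phase`) are illustrative, not Hadoop APIs, but they mirror the contract a Hadoop mapper and reducer would follow:

```python
from collections import defaultdict

def map_phase(lines):
    # Mapper: emit a (word, 1) pair for every word in every input line
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle/sort: group all emitted values by key, as the framework
    # does automatically between the map and reduce phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: sum the counts for each word
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data needs hadoop", "hadoop processes big data"]
result = reduce_phase(shuffle(map_phase(lines)))
print(result)  # {'big': 2, 'data': 2, 'needs': 1, 'hadoop': 2, 'processes': 1}
```

On a real cluster the mapper runs in parallel on each block of the input file stored in HDFS, the framework shuffles intermediate pairs across the network, and reducers aggregate each key independently, which is what lets the same program scale to Petabytes.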
  • 19. Course Curriculum (continued) • ZooKeeper - A distributed, highly available coordination service. ZooKeeper provides primitives such as distributed locks that can be used for building distributed applications. • Oozie - A workflow scheduler system to manage Hadoop jobs. • Flume - A distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. • Integrations: Hive <-> HBase, MapReduce -> Hive, MapReduce -> HBase
  • 20. Course Deliverables • Course material • Practice exercises for each topic • Quiz for each topic • Workshop-style coaching • Interactive approach • Tips and techniques for certification exams and interviews • Group activities for better understanding • Case studies at the end of the course
  • 21. Workshop Formats Classroom workshops Corporate Workshops Virtual Online Workshops Consulting Services
  • 22. Array of courses we offer Project Management Training Technical Training Soft skills Training Career Readiness Training
  • 23. Interested? Call +91 95057 44455, email info@24x7coach.com, or visit www.24x7Coach.com