Hadoop - An Introduction

Introduction to Hadoop

Published in: Technology

Transcript of "Hadoop - An Introduction"

  1. Hadoop – An Introduction. Shankar Radhakrishnan, HCL Technologies.
  2. Agenda: State of the Data; What is Hadoop; Hadoop Ecosystem; References.
  3. State of the data
     Data driven businesses: Businesses have been collecting information all the time; Mine more == Collect more (and vice-versa).
     Challenges: Application Complexities; Data growth; Infrastructure; Economics.
     Need of the day.
  4. State of the data
     Data driven business: Businesses have been collecting information all the time; Mine more == Collect more (and vice-versa).
     Challenges: Application Complexities; Data growth; Infrastructure; Economics.
  5. Data driven business
     Applications: Searches, Message posts, Comments, Emails, Blogs, Photos, Video Clips, Product Listings; ERP, CRM, Databases, Internal Applications, Customer/Consumer facing products; Mobile.
     Context: Web, Customers, Products, Business Systems, Processes, Services.
     Support Systems: CRM, SOA, Recommendation Systems/processes, Data warehouses, Business Intelligence, BPM.
  6. State of the data
     Data driven businesses: Businesses have been collecting information all the time; Mine more == Collect more (and vice-versa).
     Challenges: Application Complexities; Data growth; Infrastructure; Economics.
  7. Mine more
     Drivers: ROI; Customer Retention; Product Affinity; Market Trends; Research Analysis; Customer/Consumer Analytics.
     Process: Clustering; Classification; Build Relationships; Regression.
     Types: Structured; Semi-structured; Unstructured.
  8. State of the data
     Data driven businesses: Businesses have been collecting information all the time; Mine more == Collect more (and vice-versa).
     Challenges: Application Complexities; Data growth; Infrastructure; Economics.
  9. Challenges
     Complex Applications: Data integration is a good but complex problem to solve.
     Data Growth: Growth is exponential.
     Infrastructure: Availability; Unscalable hardware.
     Economics: Managing high data volume comes at a price; Failures are very costly.
  10. Need of the day
      System that can handle high volume data; System that can perform complex operations; Scalable; Robust; Highly Available; Fault Tolerant; Cheap.
  11. Top level Apache project; Open source; Inspired by Google’s white papers on Map/Reduce (MR) and the Google File System (GFS); Originally developed to support the Apache Nutch search engine; Software framework written in Java.
      Designed: For sophisticated analysis; To deal with structured and unstructured complex data.
  12. Why Hadoop?
      Runs on commodity hardware; Shared-nothing architecture; Scale hardware whenever you want; System compensates for hardware scaling and issues (if any); Run large-scale, high volume data processes; Scales well with complex analysis jobs; Handles failures; Ideal to consolidate data from both new and legacy data sources; Value to the business.
  13. Hadoop in an enterprise - Example
  14. Hadoop Ecosystem
      HDFS: Hadoop Distributed File System.
      Map/Reduce: Software framework for clustered, distributed data processing.
      ZooKeeper: Coordination service for distributed applications.
      Avro: Data serialization.
      Chukwa: Data collection system to monitor distributed systems.
      HBase: Data storage for distributed large tables.
      Hive: Data warehousing infrastructure.
      Pig: High-level query language.
  15. HDFS – Hadoop Distributed File System
      Master/Slave Architecture; Runs on commodity hardware; Fault Tolerant; Handles large volumes of data; Provides High Throughput; Streaming data-access; Simple file coherency model; Portable to heterogeneous hardware and software.
      Robust: Handles disk failures, replication (& re-replication); Performs cluster rebalancing, data integrity checks.
  16. HDFS – Example
      Name node: File system operations; Maps data-nodes.
      Data node: Processes read/write; Handles data-blocks; Replication.
  17. Hadoop Map/Reduce
      Tagged by a job; Splits input data-set into separate chunks; Processed by map tasks, in parallel; Sorts the output of the maps; Processed by reduce tasks, in parallel; Typically stored and processed in a file system.
      Framework takes care of: Scheduling tasks; Monitoring; Re-executing failed tasks.
  18. Example: Mapper Function
  19. Example: Reduce Function
  20. Who runs Hadoop?
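
The Map/Reduce flow listed in the transcript (split the input into chunks, run map tasks in parallel, sort the map output, run reduce tasks) can be illustrated with a small word count. The mapper and reducer examples from the original slides did not survive in the transcript, so what follows is only a minimal sketch in plain Java and not the code from the deck. It does not use the Hadoop API: the class `MapReduceSketch` and its methods `map`, `reduce`, and `run` are hypothetical names chosen for illustration, and the scheduling, monitoring, and re-execution that the framework provides are left out.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical in-memory illustration of the map -> sort -> reduce flow.
// This is NOT the Hadoop API; it only mimics the data flow on one machine.
public class MapReduceSketch {

    // Map step: turn one input chunk into (word, 1) pairs.
    public static List<Map.Entry<String, Integer>> map(String chunk) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : chunk.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Reduce step: combine all values collected for a single key.
    public static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    public static SortedMap<String, Integer> run(List<String> chunks) {
        // Each chunk is handled by its own map call; a parallel stream
        // stands in for map tasks running in parallel.
        List<Map.Entry<String, Integer>> mapped = chunks.parallelStream()
                .flatMap(chunk -> map(chunk).stream())
                .collect(Collectors.toList());

        // Sort and group the map output by key (a TreeMap keeps keys sorted).
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapped) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }

        // One reduce call per key; run sequentially here for simplicity.
        SortedMap<String, Integer> result = new TreeMap<>();
        grouped.forEach((key, values) -> result.put(key, reduce(key, values)));
        return result;
    }

    public static void main(String[] args) {
        List<String> chunks = Arrays.asList(
                "hadoop stores data",
                "hadoop processes data");
        // prints {data=2, hadoop=2, processes=1, stores=1}
        System.out.println(run(chunks));
    }
}
```

In a real Hadoop job the chunks would be splits of files stored in HDFS, and the map and reduce calls would run as tasks on different nodes; the sketch keeps everything in one process so the flow of data between the steps is easy to follow.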