Your SlideShare is downloading. ×
0
Data Processing in the Cloud Parand Tony Darugar http://parand.com/say/ [email_address]
What is Hadoop <ul><li>Flexible infrastructure for large scale computation and data processing on a network of commodity h...
Why? <ul><li>A common infrastructure pattern extracted from building distributed systems </li></ul><ul><li>Scale </li></ul...
Built-in Resilience to Failure <ul><li>When dealing with large numbers of commodity servers, failure is a fact of life </l...
Current State of Hadoop Project <ul><li>Top level Apache Foundation project </li></ul><ul><li>In production use at Yahoo, ...
Widely Adopted <ul><li>A valuable and reusable skill set </li></ul><ul><ul><li>Taught at major universities </li></ul></ul...
Plethora of Related Projects <ul><li>Pig </li></ul><ul><li>Hive </li></ul><ul><li>Hbase </li></ul><ul><li>Cascading </li><...
What is Hadoop <ul><li>The Linux of distributed processing. </li></ul>
How Does Hadoop Work?
Hadoop File System <ul><li>A distributed file system for large data </li></ul><ul><ul><li>Your data in triplicate </li></u...
Programming Model: Map/Reduce <ul><li>Very simple programming model: </li></ul><ul><ul><li>Map(anything)->key, value </li>...
Processing Model <ul><li>Create or allocate a cluster </li></ul><ul><li>Put data onto the file system: </li></ul><ul><ul><...
Processing Model <ul><ul><ul><li>Monitor workers, automatically restarting failed or slow tasks </li></ul></ul></ul><ul><u...
Hadoop on the Grid <ul><li>Managed Hadoop clusters </li></ul><ul><li>Shared resources </li></ul><ul><ul><li>improved utili...
Usage Patterns
ETL <ul><li>Put large data source (eg. Log files) onto the Hadoop File System </li></ul><ul><li>Perform aggregations, tran...
Reporting and Analytics <ul><li>Run canned and ad-hoc queries over large data </li></ul><ul><li>Run analytics and data min...
Data Processing Pipelines <ul><li>Multi-step pipelines for data processing </li></ul><ul><li>Coordination, scheduling, dat...
Machine Learning & Graph Algorithms <ul><li>Traverse large graphs and data sets, building models and classifiers </li></ul...
General Back-End Processing <ul><li>Implement significant portions of back-end, batch oriented processing on the grid </li...
What Next? <ul><li>Dowload Hadoop: </li></ul><ul><ul><li>http://hadoop.apache.org/ </li></ul></ul><ul><li>Try it on your l...
Upcoming SlideShare
Loading in...5
×

Cloud Computing: Hadoop

14,910

Published on

Data Processing in the Cloud with Hadoop from Data Services World conference.

Published in: Technology, Education
4 Comments
18 Likes
Statistics
Notes
  • thanks
    it helps to understand what is hadoop in cloud domain
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
  • Hope fully interesting topic
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
  • it clarifies many doubts... thank u
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
  • Hi dude,
    Thank u, It helped a lot.
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
No Downloads
Views
Total Views
14,910
On Slideshare
0
From Embeds
0
Number of Embeds
2
Actions
Shares
0
Downloads
882
Comments
4
Likes
18
Embeds 0
No embeds

No notes for slide

Transcript of "Cloud Computing: Hadoop"

  1. 1. Data Processing in the Cloud Parand Tony Darugar http://parand.com/say/ [email_address]
  2. 2. What is Hadoop <ul><li>Flexible infrastructure for large scale computation and data processing on a network of commodity hardware. </li></ul>
  3. 3. Why? <ul><li>A common infrastructure pattern extracted from building distributed systems </li></ul><ul><li>Scale </li></ul><ul><li>Incremental growth </li></ul><ul><li>Cost </li></ul><ul><li>Flexibility </li></ul>
  4. 4. Built-in Resilience to Failure <ul><li>When dealing with large numbers of commodity servers, failure is a fact of life </li></ul><ul><li>Assume failure, build protections and recovery into your architecture </li></ul><ul><ul><li>Data level redundancy </li></ul></ul><ul><ul><li>Job/Task level monitoring and automated restart and re-allocation </li></ul></ul>
  5. 5. Current State of Hadoop Project <ul><li>Top level Apache Foundation project </li></ul><ul><li>In production use at Yahoo, Facebook, Amazon, IBM, Fox, NY Times, Powerset, … </li></ul><ul><li>Large, active user base, mailing lists, user groups </li></ul><ul><li>Very active development, strong development team </li></ul>
  6. 6. Widely Adopted <ul><li>A valuable and reusable skill set </li></ul><ul><ul><li>Taught at major universities </li></ul></ul><ul><ul><li>Easier to hire for </li></ul></ul><ul><ul><li>Easier to train on </li></ul></ul><ul><ul><li>Portable across projects, groups </li></ul></ul>
  7. 7. Plethora of Related Projects <ul><li>Pig </li></ul><ul><li>Hive </li></ul><ul><li>Hbase </li></ul><ul><li>Cascading </li></ul><ul><li>Hadoop on EC2 </li></ul><ul><li>JAQL , X-Trace, Happy, Mahout </li></ul>
  8. 8. What is Hadoop <ul><li>The Linux of distributed processing. </li></ul>
  9. 9. How Does Hadoop Work?
  10. 10. Hadoop File System <ul><li>A distributed file system for large data </li></ul><ul><ul><li>Your data in triplicate </li></ul></ul><ul><ul><li>Built-in redundancy, resiliency to large scale failures </li></ul></ul><ul><ul><li>Intelligent distribution, striping across racks </li></ul></ul><ul><ul><li>Accommodates very large data sizes </li></ul></ul><ul><ul><li>On commodity hardware </li></ul></ul>
  11. 11. Programming Model: Map/Reduce <ul><li>Very simple programming model: </li></ul><ul><ul><li>Map(anything)->key, value </li></ul></ul><ul><ul><li>Sort, partition on key </li></ul></ul><ul><ul><li>Reduce(key,value)->key, value </li></ul></ul><ul><li>No parallel processing / message passing semantics </li></ul><ul><li>Programmable in Java or any other language (streaming) </li></ul>
  12. 12. Processing Model <ul><li>Create or allocate a cluster </li></ul><ul><li>Put data onto the file system: </li></ul><ul><ul><li>Data is split into blocks, stored in triplicate across your cluster </li></ul></ul><ul><li>Run your job: </li></ul><ul><ul><li>Your Map code is copied to the allocated nodes, preferring nodes that contain copies of your data </li></ul></ul><ul><ul><ul><li>Move computation to data, not data to computation </li></ul></ul></ul>
  13. 13. Processing Model <ul><ul><ul><li>Monitor workers, automatically restarting failed or slow tasks </li></ul></ul></ul><ul><ul><li>Gather output of Map, sort and partition on key </li></ul></ul><ul><ul><li>Run Reduce tasks </li></ul></ul><ul><ul><ul><li>Monitor workers, automatically restarting failed or slow tasks </li></ul></ul></ul><ul><li>Results of your job are now available on the Hadoop file system </li></ul>
  14. 14. Hadoop on the Grid <ul><li>Managed Hadoop clusters </li></ul><ul><li>Shared resources </li></ul><ul><ul><li>improved utilization </li></ul></ul><ul><li>Standard data sets, storage </li></ul><ul><li>Shared, standardized operations management </li></ul><ul><li>Hosted internally or externally (eg. on EC2) </li></ul>
  15. 15. Usage Patterns
  16. 16. ETL <ul><li>Put large data source (eg. Log files) onto the Hadoop File System </li></ul><ul><li>Perform aggregations, transformations, normalizations on the data </li></ul><ul><li>Load into RDBMS / data mart </li></ul>
  17. 17. Reporting and Analytics <ul><li>Run canned and ad-hoc queries over large data </li></ul><ul><li>Run analytics and data mining operations on large data </li></ul><ul><li>Produce reports for end-user consumption or loading into data mart </li></ul>
  18. 18. Data Processing Pipelines <ul><li>Multi-step pipelines for data processing </li></ul><ul><li>Coordination, scheduling, data collection and publishing of feeds </li></ul><ul><li>SLA carrying, regularly scheduled jobs </li></ul>
  19. 19. Machine Learning & Graph Algorithms <ul><li>Traverse large graphs and data sets, building models and classifiers </li></ul><ul><li>Implement machine learning algorithms over massive data sets </li></ul>
  20. 20. General Back-End Processing <ul><li>Implement significant portions of back-end, batch oriented processing on the grid </li></ul><ul><li>General computation framework </li></ul><ul><li>Simplify back-end architecture </li></ul>
  21. 21. What Next? <ul><li>Dowload Hadoop: </li></ul><ul><ul><li>http://hadoop.apache.org/ </li></ul></ul><ul><li>Try it on your laptop </li></ul><ul><li>Try Pig </li></ul><ul><ul><li>http://hadoop.apahe.org/pig/ </li></ul></ul><ul><li>Deploy to multiple boxes </li></ul><ul><li>Try it on EC2 </li></ul>
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×