Big data and Hadoop

http://www.linkedin.com/in/rahulaga

Published in: Technology

  1. 1. Big Data and Hadoop<br />Rahul Agarwal<br />irahul.com<br />
  2. 2. <ul><li>Amr Awadallah: http://www.sfbayacm.org/wp/wp-content/uploads/2010/01/amr-hadoop-acm-dm-sig-jan2010.pdf
  3. 3. Hadoop: http://hadoop.apache.org/
  4. 4. Computerworld: http://www.computerworld.com/s/article/350908/5_Indispensable_IT_Skills_of_the_Future
  5. 5. Ashish Thusoo: http://www.sfbayacm.org/wp/wp-content/uploads/2010/01/sig_2010_v21.pdf
  6. 6. Big data: http://en.wikipedia.org/wiki/Big_data
  7. 7. Chukwa: http://www.cca08.org/papers/Paper-13-Ariel-Rabkin.pdf
  8. 8. Dean, Ghemawat: http://labs.google.com/papers/mapreduce.html</li></ul>Attributions<br />
  9. 9. <ul><li>Big Data Problem
  10. 10. What is Hadoop
  11. 11. HDFS
  12. 12. MapReduce
  13. 13. HBase
  14. 14. PIG
  15. 15. HIVE
  16. 16. Chukwa
  17. 17. ZooKeeper
  18. 18. Q&A</li></ul>Agenda<br />
  19. 19. Why?<br />
  20. 20. Extremely large datasets that are hard to deal with using relational databases<br />Storage/Cost<br />Search/Performance<br />Analytics and Visualization<br />Need for parallel processing on hundreds of machines<br />ETL cannot complete within a reasonable time<br />Once a run takes longer than 24 hours, it can never catch up<br />Big Data<br />
  21. 21. System shall manage and heal itself<br />Automatically and transparently route around failure<br />Speculatively execute redundant tasks if certain nodes are detected to be slow<br />Performance shall scale linearly<br />Proportional change in capacity with resource change<br />Compute should move to data<br />Lower latency, lower bandwidth<br />Simple core, modular and extensible<br />Hadoop design principles<br />
  22. 22. A scalable, fault-tolerant grid operating system for data storage and processing<br />Commodity hardware<br />HDFS: Fault-tolerant high-bandwidth clustered storage<br />MapReduce: Distributed data processing<br />Works with structured and unstructured data<br />Open source, Apache license<br />Master (NameNode) – Slave architecture<br />What is Hadoop<br />
  23. 23. Hadoop Projects<br />BI Reporting<br />ETL Tools<br />Hive (SQL)<br />Pig (Data Flow)<br />MapReduce (Job Scheduling/Execution System)<br />ZooKeeper (Coordination)<br />(Streaming/Pipes APIs)<br />HBase (key-value store)<br />Chukwa (Monitoring)<br />HDFS (Hadoop Distributed File System)<br />
  24. 24. HDFS: Hadoop Distributed FS<br />Block Size = 64MB<br />Replication Factor = 3<br />
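The block size and replication factor on this slide determine how a file is laid out across the cluster. The following is a minimal sketch of that arithmetic, not HDFS code: the function name `hdfs_footprint` and the 1 GB example file are hypothetical, and the two constants are simply the values quoted on the slide.

```python
import math
from typing import Tuple

BLOCK_SIZE_MB = 64       # block size quoted on the slide
REPLICATION_FACTOR = 3   # replication factor quoted on the slide


def hdfs_footprint(file_size_mb: int) -> Tuple[int, int, int]:
    """Return (blocks, block replicas, raw MB stored) for one file.

    Hypothetical helper for illustration only; it does not talk to HDFS.
    """
    # A file is split into fixed-size blocks; the last one may be partial.
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    # Every block is stored on REPLICATION_FACTOR different nodes.
    replicas = blocks * REPLICATION_FACTOR
    # A partial last block only uses the space its data needs,
    # so raw usage is the file size times the replication factor.
    raw_mb = file_size_mb * REPLICATION_FACTOR
    return blocks, replicas, raw_mb


if __name__ == "__main__":
    # Hypothetical 1 GB (1024 MB) file.
    print(hdfs_footprint(1024))
```

With these settings, usable capacity is roughly one third of raw disk, which is the price paid for surviving node failures on commodity hardware.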
  25. 25. Patented Google framework<br />Distributed processing of large datasets<br />map (in_key, in_value) -> list(out_key, intermediate_value)<br />reduce (out_key, list(intermediate_value)) -> list(out_value)<br />MapReduce<br />
  26. 26. Example: count word occurrences<br />
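The word-count example can be sketched in plain Python using the `map` and `reduce` signatures from the previous slide. This is a single-process illustration under stated assumptions, not Hadoop's API: `map_fn`, `reduce_fn`, `run_word_count`, and the sample documents are hypothetical names, and the in-memory grouping step stands in for the shuffle that the framework performs between the two phases.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_fn(in_key: str, in_value: str) -> Iterator[Tuple[str, int]]:
    """map(in_key, in_value) -> list(out_key, intermediate_value).

    in_key is a document name (unused here), in_value is its text.
    """
    for word in in_value.lower().split():
        yield word, 1


def reduce_fn(out_key: str, values: Iterable[int]) -> List[int]:
    """reduce(out_key, list(intermediate_value)) -> list(out_value)."""
    return [sum(values)]


def run_word_count(documents: Dict[str, str]) -> Dict[str, int]:
    """Run map, group by key (the 'shuffle'), then reduce, all in memory."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for name, text in documents.items():
        for word, one in map_fn(name, text):
            grouped[word].append(one)
    return {word: reduce_fn(word, ones)[0] for word, ones in grouped.items()}


if __name__ == "__main__":
    # Hypothetical input documents.
    docs = {"a.txt": "the quick fox", "b.txt": "the lazy dog the end"}
    print(run_word_count(docs))
```

On a real cluster the map calls run in parallel on the nodes that hold the input blocks, and each reduce call receives all values for one key, so neither function needs to know how the data is distributed.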
  27. 27. “Project's goal is the hosting of very large tables - billions of rows X millions of columns - atop clusters of commodity hardware”<br />Hadoop database, open-source version of Google BigTable<br />Column-oriented<br />Random access, realtime read/write<br />“Random access performance on par with open source relational databases such as MySQL” <br />HBase<br />
  28. 28. High-level language (Pig Latin) for expressing data analysis programs<br />Compiled into a series of MapReduce jobs<br />Easier to program<br />Optimization opportunities<br />grunt> A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);<br />grunt> B = FOREACH A GENERATE name;<br />PIG<br />
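To make the two Pig Latin statements on this slide concrete, here is a rough single-machine Python equivalent: load records with the schema `(name, age, gpa)`, then project the `name` field. It is only a sketch of the data flow, not of how Pig executes it; the names `Student`, `load_students`, and `project_names` and the sample rows are hypothetical. The one Pig-specific fact it relies on is that `PigStorage()` with no argument splits fields on tabs.

```python
import csv
import io
from typing import Iterable, Iterator, List, NamedTuple, TextIO


class Student(NamedTuple):
    """Mirrors the schema (name:chararray, age:int, gpa:float)."""
    name: str
    age: int
    gpa: float


def load_students(source: TextIO) -> Iterator[Student]:
    """Rough analogue of: A = LOAD 'student' USING PigStorage() AS (...)."""
    # PigStorage() without arguments uses tab as the field delimiter.
    for name, age, gpa in csv.reader(source, delimiter="\t"):
        yield Student(name, int(age), float(gpa))


def project_names(students: Iterable[Student]) -> List[str]:
    """Rough analogue of: B = FOREACH A GENERATE name."""
    return [student.name for student in students]


if __name__ == "__main__":
    # Hypothetical tab-separated input standing in for the 'student' file.
    sample = io.StringIO("John\t18\t4.0\nMary\t19\t3.8\n")
    print(project_names(load_students(sample)))
```

Unlike this sketch, Pig evaluates the statements lazily and compiles them into MapReduce jobs, so the same two lines scale to inputs spread across many machines.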
  29. 29. Managing and querying structured data<br />MapReduce for execution<br />SQL-like syntax<br />Extensible with types, functions, scripts<br />Metadata stored in an RDBMS (MySQL)<br />Joins, Group By, Nesting<br />Optimizer to reduce the number of MapReduce jobs required<br />hive> SELECT a.foo FROM invites a WHERE a.ds='<DATE>';<br />HIVE<br />
  30. 30. A highly available, scalable, distributed service for configuration, consensus, group membership, leader election, naming, and coordination<br />Cluster Management<br />Load balancing<br />JMX monitoring<br />ZooKeeper<br />
  31. 31. <ul><li>Data collection system for monitoring distributed systems
  32. 32. Agents to collect and process logs
  33. 33. Monitoring and analysis
  34. 34. Hadoop Infrastructure Care Center</li></ul>Chukwa<br />
  35. 35. Data Flow at Facebook<br />
  36. 36. Choose the right tool<br /><ul><li>Hadoop
  37. 37. Affordable Storage/Compute
  38. 38. Structured or Unstructured
  39. 39. Resilient Auto Scalability
  40. 40. Relational Databases
  41. 41. Interactive response times
  42. 42. ACID
  43. 43. Structured data
  44. 44. Cost/Scale prohibitive</li></ul>