Big data and Hadoop


Published in: Technology

  1. Big Data and Hadoop
      Rahul Agarwal
  2. Attributions
      • Amr Awadallah:
      • Hadoop:
      • Computerworld:
      • Ashish Thusoo:
      • Big data:
      • Chukwa:
      • Dean, Ghemawat:
  9. Agenda
      • Big Data Problem
      • What is Hadoop
      • HDFS
      • MapReduce
      • HBase
      • PIG
      • HIVE
      • Chukwa
      • ZooKeeper
      • Q&A
  19. Why?
  20. Big Data
      • Extremely large datasets that are hard to handle with relational databases
          • Storage/Cost
          • Search/Performance
          • Analytics and visualization
      • Need for parallel processing on hundreds of machines
      • ETL cannot complete within a reasonable time
          • Beyond 24 hours, the job never catches up
  21. Hadoop design principles
      • System shall manage and heal itself
          • Automatically and transparently route around failure
          • Speculatively execute redundant tasks if certain nodes are detected to be slow
      • Performance shall scale linearly
          • Proportional change in capacity with resource change
      • Compute should move to data
          • Lower latency, lower bandwidth
      • Simple core, modular and extensible
  22. What is Hadoop
      • A scalable, fault-tolerant grid operating system for data storage and processing
      • Commodity hardware
      • HDFS: fault-tolerant, high-bandwidth clustered storage
      • MapReduce: distributed data processing
      • Works with structured and unstructured data
      • Open source, Apache license
      • Master (NameNode) / slave architecture
  23. Hadoop Projects
      • BI Reporting
      • ETL Tools
      • Hive (SQL)
      • Pig (Data Flow)
      • MapReduce (Job Scheduling/Execution System)
      • ZooKeeper (Coordination)
      • Streaming/Pipes APIs
      • HBase (key-value store)
      • Chukwa (Monitoring)
      • HDFS (Hadoop Distributed File System)
  24. HDFS: Hadoop Distributed FS
      • Block size = 64 MB
      • Replication factor = 3
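To make the two HDFS settings on this slide concrete, here is a minimal arithmetic sketch. It assumes a single file with a size given in megabytes, and it assumes that a partially filled last block uses only as much disk as the data it holds. The function name `hdfs_footprint` is made up for illustration and is not part of any Hadoop API.

```python
import math

BLOCK_SIZE_MB = 64       # block size quoted on the slide
REPLICATION_FACTOR = 3   # replication factor quoted on the slide


def hdfs_footprint(file_size_mb):
    """Return (blocks, block replicas, raw storage in MB) for one file.

    Hypothetical helper for illustration only.
    """
    # A file is split into fixed-size blocks; the last one may be partial.
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    # Every block is stored REPLICATION_FACTOR times across the cluster.
    replicas = blocks * REPLICATION_FACTOR
    # Assumption: a partial block occupies only its actual size on disk.
    raw_mb = file_size_mb * REPLICATION_FACTOR
    return blocks, replicas, raw_mb


print(hdfs_footprint(1024))  # (16, 48, 3072)
print(hdfs_footprint(100))   # (2, 6, 300)
```

Under these settings a 1 GB file becomes 16 blocks and 48 block replicas, and it consumes about 3 GB of raw cluster storage.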
  25. MapReduce
      • Patented Google framework
      • Distributed processing of large datasets
      • map (in_key, in_value) -> list(out_key, intermediate_value)
      • reduce (out_key, list(intermediate_value)) -> list(out_value)
  26. Example: count word occurrences
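The word-count slide is image-only in this transcript, so the following is a single-process sketch of how the example could look, not the deck's own code. It follows the map and reduce signatures from the MapReduce slide. The names `map_fn`, `reduce_fn`, and `run_job`, and the sample documents, are assumptions made for illustration. A real Hadoop job would distribute the map, shuffle, and reduce phases across machines.

```python
from collections import defaultdict


def map_fn(in_key, in_value):
    """map(in_key, in_value) -> list of (out_key, intermediate_value)."""
    # in_key is the document name (unused); in_value is the document text.
    return [(word.lower(), 1) for word in in_value.split()]


def reduce_fn(out_key, intermediate_values):
    """reduce(out_key, list(intermediate_value)) -> list(out_value)."""
    return [sum(intermediate_values)]


def run_job(documents):
    """Simulate the map, shuffle, and reduce phases in one process."""
    # Map phase: emit (word, 1) for every word in every document.
    intermediate = []
    for name, text in documents.items():
        intermediate.extend(map_fn(name, text))

    # Shuffle phase: group all intermediate values by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: one reduce call per distinct key, in sorted key order.
    return {key: reduce_fn(key, values)[0] for key, values in sorted(groups.items())}


docs = {
    "doc1": "the quick brown fox",
    "doc2": "the lazy dog and the fox",
}
counts = run_job(docs)
print(counts["the"], counts["fox"], counts["dog"])  # 3 2 1
```

The shuffle step is what lets each reduce call see all the counts for one word, regardless of which document produced them.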
  27. HBase
      • “Project's goal is the hosting of very large tables - billions of rows X millions of columns - atop clusters of commodity hardware”
      • Hadoop database, open-source version of Google BigTable
      • Column-oriented
      • Random access, realtime read/write
      • “Random access performance on par with open source relational databases such as MySQL”
  28. PIG
      • High-level language (Pig Latin) for expressing data analysis programs
      • Compiled into a series of MapReduce jobs
      • Easier to program
      • Optimization opportunities
      • Example:
          grunt> A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
          grunt> B = FOREACH A GENERATE name;
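For readers unfamiliar with Pig Latin, the two statements on this slide can be mirrored by a rough plain-Python analogue. It assumes tab-separated input, which is the default delimiter for `PigStorage()`. The sample rows and the `load` helper are invented for illustration. Pig itself compiles such statements into MapReduce jobs rather than running them in one process.

```python
import csv
import io

# Schema from the slide: (name:chararray, age:int, gpa:float)
SCHEMA = [("name", str), ("age", int), ("gpa", float)]


def load(lines):
    """Rough analogue of LOAD ... USING PigStorage() AS (...)."""
    reader = csv.reader(lines, delimiter="\t")
    for row in reader:
        # Cast each field to the type declared in the schema.
        yield {field: cast(value) for (field, cast), value in zip(SCHEMA, row)}


# Hypothetical stand-in for the 'student' file.
sample = io.StringIO("John\t18\t4.0\nMary\t19\t3.8\n")

A = list(load(sample))                # A = LOAD 'student' ...
B = [record["name"] for record in A]  # B = FOREACH A GENERATE name;
print(B)  # ['John', 'Mary']
```

The point of the comparison is that `FOREACH ... GENERATE` is a per-record projection, which is why it maps naturally onto the map phase of a MapReduce job.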
  29. HIVE
      • Managing and querying structured data
      • MapReduce for execution
      • SQL-like syntax
      • Extensible with types, functions, scripts
      • Metadata stored in an RDBMS (MySQL)
      • Joins, Group By, Nesting
      • Optimizer to minimize the number of MapReduce jobs required
      • Example:
          hive> SELECT * FROM invites a WHERE a.ds='<DATE>';
  30. ZooKeeper
      • A highly available, scalable, distributed service for configuration, consensus, group membership, leader election, naming, and coordination
      • Cluster management
      • Load balancing
      • JMX monitoring
  31. Chukwa
      • Data collection system for monitoring distributed systems
      • Agents to collect and process logs
      • Monitoring and analysis
      • Hadoop Infrastructure Care Center
  35. Data Flow at Facebook
  36. Choose the right tool
      • Hadoop
          • Affordable storage/compute
          • Structured or unstructured data
          • Resilient auto scalability
      • Relational Databases
          • Interactive response times
          • ACID
          • Structured data
          • Cost/scale prohibitive