
Hadoop_EcoSystem_Pradeep_MG



  1. Hadoop Eco System
  2. Agenda • Why Big Data? • Ingredients of the Big Data Eco System • Working with MapReduce • Phases of MR • HDFS • Hive • Use case • Conclusion
  3. Why Big Data!! • Big Data is NOT just about SIZE; it is about HOW IMPORTANT the data within a large chunk is • Data is CHANGING and getting MESSY • Previously structured, but now unstructured • Non-uniform • Many distributed contributors to the data • Mobile, PDA, tablet, sensors • Domains: financial, healthcare, social media
  4. Glimpse
  5. Ingredients of the Eco System • MapReduce – a programming model that solves big-data problems by applying map and reduce functions across clusters • HDFS – the distributed file system used by Hadoop • Hive – an SQL-based query engine for non-Java programmers • Pig – a data flow language and execution environment for exploring very large datasets
  6. Ingredients cont. • HBase – a distributed, column-oriented database • ZooKeeper – a distributed, highly available coordination service • Sqoop – a tool for efficiently moving data between relational databases and HDFS
  7. Hadoop Internals • Protocols used – RPC/HTTP for intercommunication between commodity hardware • Runs on a pseudo node or on clusters • Components (daemons) – NameNode, DataNode, JobTracker, TaskTracker
  8. Working with MapReduce • Map – a function applied to each item of the available data • Reduce – a function used for aggregation or reduction
  9. MR as a function • f(n) = Σ {n=0..10} n(n-1)/2 • map – apply n(n-1)/2 to each n from 0 to 10 • reduce – Σ([values]), the aggregation/reduction function • Hence parallelism can be achieved
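The summation above can be sketched in plain Java, with no Hadoop involved: the map step emits n(n-1)/2 for each n, and the reduce step sums the emitted values. The class and method names here are illustrative, not Hadoop API.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class MrAsFunction {
    // Map step: for each n in from..to, emit n(n-1)/2.
    static List<Integer> map(int from, int to) {
        return IntStream.rangeClosed(from, to)
                        .map(n -> n * (n - 1) / 2)
                        .boxed()
                        .collect(Collectors.toList());
    }

    // Reduce step: aggregate the mapped values with a sum.
    static int reduce(List<Integer> values) {
        return values.stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        // f(n) = sum over n = 0..10 of n(n-1)/2
        System.out.println(reduce(map(0, 10))); // prints 165
    }
}
```

Because each map call is independent, the map step can run in parallel across a cluster; only the final reduce needs the full list of values.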
  10. MR as representation • Map: <K1, V1> → <K2, V2> • V2 – list of values for key K2 • Reduce: <K2, V2> → <K3, V3> via ~ • ~ – the reduction operation • Reduced output with specific keys and values
  11. Phases of MR • Data on HDFS • Input partition – FileSplit, InputSplit • Map • Shuffle • Sort • Partition • Reducer • Aggregated data on HDFS
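The key/value shape and the shuffle-and-sort grouping described above can be simulated in-memory with a small word-count sketch in plain Java. This is a conceptual model of the phases, not actual Hadoop API code; the method names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MiniMapReduce {
    // Map phase: a line of input -> list of <word, 1> pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) out.add(Map.entry(word, 1));
        }
        return out;
    }

    // Shuffle & sort phase: group values by key; TreeMap keeps keys sorted.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> mapped) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> e : mapped) {
            grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }
        return grouped;
    }

    // Reduce phase: <word, list of counts> -> <word, total>.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> result = new TreeMap<>();
        grouped.forEach((k, vs) ->
            result.put(k, vs.stream().mapToInt(Integer::intValue).sum()));
        return result;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : new String[]{"big data", "big clusters"}) {
            mapped.addAll(map(line));
        }
        System.out.println(reduce(shuffle(mapped))); // prints {big=2, clusters=1, data=1}
    }
}
```

In real Hadoop the mapped pairs are partitioned across reducers and spilled to disk during the shuffle; the in-memory grouping here stands in for that machinery.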
  12. Phases of MR depicted
  13. Data flow in MR • MapReduce data flow with multiple reduce tasks
  14. Shuffle and Sort phase
  15. HDFS (Hadoop Distributed File System) • Architecture
  16. HDFS – Client Read
  17. HDFS – Client Write
  18. HDFS CLI • List all files and directories in HDFS recursively • $ hadoop fs -lsr • Put a file into HDFS • $ hadoop fs -put <from path> <to path> • Get files from HDFS • $ hadoop fs -get <from path> <to path> • Run a jar file • $ hadoop jar <jarfile> <className> <input path> <output path>
  19. MapReduce cont. • Job configuration • Key files: core-site.xml, mapred-site.xml • Job-specific configuration can be provided in the code
  20. MR job in action
  21. Job Scheduling • Fair scheduler • Capacity scheduler
  22. Fair Scheduler • A job is planned and placed in a job pool • Supports preemption • If no pools are created and only one job is available, that job runs as is
  23. Capacity Scheduler • Supports multi-user scheduling • Depends on the cluster, the number of queues, and the hierarchy in which jobs are scheduled • One queue may be a child of another queue • Enforces fair scheduling within each queue
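On Hadoop 1.x (the JobTracker/TaskTracker generation this deck describes), the scheduler was selected through a property in mapred-site.xml. A minimal sketch, assuming the Fair Scheduler jar is on the classpath; property names follow the Hadoop 1.x documentation:

```xml
<!-- mapred-site.xml fragment: switch the JobTracker to the Fair Scheduler
     (illustrative; verify property names against your Hadoop version) -->
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
```

Substituting the CapacityTaskScheduler class instead would enable the capacity scheduler and its queue hierarchy.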
  24. MapReduce Input Formats
  25. MR Joins • Map-side join – works for large inputs by performing the join before the data reaches the map function • Reduce-side join – the input datasets don't have to be structured in any particular way, but it is less efficient because both datasets have to go through the MapReduce shuffle
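The map-side join idea can be sketched in plain Java: the smaller dataset is held in an in-memory lookup table (analogous to replicating it to every mapper, e.g. via the distributed cache) and each record of the larger dataset is joined as it is "mapped", so no shuffle is needed. Names and record layout here are illustrative, not Hadoop API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MapSideJoin {
    // Join each (custid, score) record of the large dataset against a small
    // in-memory (custid -> location) table, mimicking a map-side join.
    static List<String> join(Map<String, String> locations,
                             List<String[]> scores) {
        List<String> out = new ArrayList<>();
        for (String[] rec : scores) {
            String loc = locations.get(rec[0]); // plain lookup, no shuffle
            if (loc != null) out.add(rec[0] + "," + rec[1] + "," + loc);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> locations = new HashMap<>();
        locations.put("c1", "Bangalore");
        locations.put("c2", "Mysore");
        List<String[]> scores = List.of(new String[]{"c1", "85"},
                                        new String[]{"c2", "92"});
        System.out.println(join(locations, scores));
        // prints [c1,85,Bangalore, c2,92,Mysore]
    }
}
```

A reduce-side join, by contrast, would tag records from both datasets with the join key, let the shuffle bring matching keys together, and combine them in the reducer, which is why it costs a full shuffle of both inputs.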
  26. Hive • Created to make it possible for analysts with strong SQL skills (but meager Java programming skills) to query large datasets • Developed at Facebook and later contributed to the Apache open source projects • Hive runs on your workstation and converts your SQL query into a series of MapReduce jobs for execution on a Hadoop cluster
  27. Hive Infrastructure • Unpack the tarball • % tar xzf hive-x.y.z-dev.tar.gz • Set convenient environment variables • % export HIVE_INSTALL=/home/tom/hive-x.y.z-dev • % export PATH=$PATH:$HIVE_INSTALL/bin • Launch the Hive shell • hive> SHOW TABLES;
  28. Hive Modules
  29. Hive Data Types
  30. Commands • Create a table • CREATE TABLE rank_customer(custid STRING, score STRING, location STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','; • Load data • LOAD DATA LOCAL INPATH 'input/dir/customerrank.dat' OVERWRITE INTO TABLE rank_customer; • Check the data in the warehouse • $ ls /user/hive/warehouse/rank_customer/
  31. Commands cont. • SELECT query • SELECT c.custid, c.score, c.location FROM rank_customer c ORDER BY c.custid ASC, c.location ASC, c.score DESC;
  32. Hive DDL Commands • hive> CREATE DATABASE financials WITH DBPROPERTIES ('creator' = 'MGP', 'date' = '2014-10-03'); • hive> DROP DATABASE IF EXISTS financials; • hive> ALTER DATABASE financials SET DBPROPERTIES ('edited-by' = 'Joe Dba'); • hive> DROP TABLE IF EXISTS employees; • hive> ALTER TABLE log_messages RENAME TO logmsgs;
  33. Use case • Determine the rank of a customer based on his id and the locality he belongs to; the highest scorer gains the higher rank • Input / Output
  34. Using MapReduce • Custom Writable
  35. CustomWritable cont. • Overridden CustomWritable methods
  36. Driver Code
  37. Mapper Code
  38. Partitioner Code
  39. Sort Comparator Class
  40. Reducer Code
  41. Hive Query • -- For obtaining the ranking on the basis of location and customer id, as per the requirement • hive> SELECT custid, score, location, rank() OVER (PARTITION BY custid, location ORDER BY score DESC) AS myrank FROM rank_customer;
  42. Hive Results
  43. Conclusion • The Hadoop eco system is designed mainly for a large number of large files • It is not well suited to a large number of small files • It achieves parallelism on huge data • Mapping and reducing are the core functions used to achieve parallelism
  44. Conclusion cont. • The Hadoop eco system works efficiently with commodity hardware • Distributed hardware can be utilized efficiently • Hadoop MapReduce code is written in Java • Hive gives SQL programmers an accessible interface even though Java MR jobs run internally
  45. References • Hadoop: The Definitive Guide, Third Edition, by Tom White • Programming Hive, by Edward Capriolo, Dean Wampler, and Jason Rutherglen • http://hadoop.apache.org/ • http://hive.apache.org/
  46. Thank You • Q&A • Pradeep M G
