Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoop Tutorial | Edureka

This Hadoop Tutorial on Hadoop Interview Questions and Answers ( Hadoop Interview Blog series: https://goo.gl/ndqlss ) will help you prepare for Big Data and Hadoop interviews. Learn the most important Hadoop interview questions and answers and know what will set you apart in the interview process. Below are the topics covered in this Hadoop Interview Questions and Answers Tutorial:

Hadoop Interview Questions on:

1) Big Data & Hadoop
2) HDFS
3) MapReduce
4) Apache Hive
5) Apache Pig
6) Apache HBase and Sqoop

Check our complete Hadoop playlist here: https://goo.gl/4OyoTW

#HadoopInterviewQuestions #BigDataInterviewQuestions #HadoopInterview


  1. 1. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
  2. 2. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Big Data & Hadoop Market  According to Forrester: a growth rate of 13% for the next 5 years, which is more than twice the predicted growth of general IT  U.S. and International Operations (29%) and Enterprises (27%) lead the adoption of Big Data globally  Asia Pacific to be the fastest growing Hadoop market with a CAGR of 59.2%  Companies are focusing on improving customer relationships (55%) and making the business more data-focused (53%) (Chart: Hadoop market size, 2013–2016, growing at a CAGR of 58.2%)
  3. 3. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Hadoop Job Trends
  4. 4. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Agenda for Today Hadoop Interview Questions  Big Data & Hadoop  HDFS  MapReduce  Apache Hive  Apache Pig  Apache HBase and Sqoop
  5. 5. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Big Data & Hadoop Interview Questions “The harder I practice, the luckier I get.” Gary Player
  6. 6. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Big Data & Hadoop Q. What are the five V’s associated with Big Data?
  7. 7. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Big Data & Hadoop Q. What are the five V’s associated with Big Data? Big Data
  8. 8. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Big Data & Hadoop Q. Differentiate between structured, semi-structured and unstructured data?
  9. 9. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Big Data & Hadoop  Structured  Semi-Structured  Unstructured  Organized data format  Data schema is fixed  Example: RDBMS data, etc.  Partially organized data  Lacks the formal structure of a data model  Example: XML & JSON files, etc.  Unorganized data  Unknown schema  Example: multimedia files, etc. Q. Differentiate between structured, semi-structured and unstructured data?
  10. 10. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Big Data & Hadoop Q. How does Hadoop differ from a traditional processing system using RDBMS?
  11. 11. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Big Data & Hadoop Q. How does Hadoop differ from a traditional processing system using RDBMS? RDBMS Hadoop RDBMS relies on structured data and the schema of the data is always known. Any kind of data can be stored in Hadoop, be it structured, unstructured or semi-structured. RDBMS provides limited or no processing capabilities. Hadoop allows us to process the data in a distributed, parallel fashion. RDBMS is based on ‘schema on write’ where schema validation is done before loading the data. On the contrary, Hadoop follows a ‘schema on read’ policy. In RDBMS, reads are fast because the schema of the data is already known. Writes are fast in HDFS because no schema validation happens during an HDFS write. Suitable for OLTP (Online Transaction Processing) Suitable for OLAP (Online Analytical Processing) Licensed software Hadoop is an open source framework.
  12. 12. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Big Data & Hadoop Q. Explain the components of Hadoop and their services.
  13. 13. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Big Data & Hadoop Q. Explain the components of Hadoop and their services.
  14. 14. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Big Data & Hadoop Q. What are the main Hadoop configuration files?
  15. 15. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Big Data & Hadoop Q. What are the main Hadoop configuration files? hadoop-env.sh core-site.xml hdfs-site.xml yarn-site.xml mapred-site.xml masters slaves
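     For reference, a minimal sketch of what entries in two of these files look like (the NameNode hostname and port below are illustrative assumptions, and 3 is simply the HDFS default replication factor):
        <!-- core-site.xml -->
        <property>
          <name>fs.defaultFS</name>
          <value>hdfs://namenode-host:9000</value>
        </property>

        <!-- hdfs-site.xml -->
        <property>
          <name>dfs.replication</name>
          <value>3</value>
        </property>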
  16. 16. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING HDFS Interview Questions “A person who never made a mistake never tried anything new.” Albert Einstein
  17. 17. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING HDFS Q. HDFS stores data using commodity hardware, which has higher chances of failure. So, how does HDFS ensure the fault tolerance of the system?
  18. 18. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING HDFS Q. HDFS stores data using commodity hardware, which has higher chances of failure. So, how does HDFS ensure the fault tolerance of the system?  HDFS replicates the blocks and stores them on different DataNodes  Default Replication Factor is set to 3
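     As a quick illustration (the file path is a placeholder), the replication factor of existing files can be checked and changed from the command line:
        # the second column of the listing shows the current replication factor
        hadoop fs -ls /user/edureka/data.txt
        # change the replication factor of an existing file to 3 and wait for it to complete
        hadoop fs -setrep -w 3 /user/edureka/data.txt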
  19. 19. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING HDFS Q. What is the problem in having lots of small files in HDFS? Provide one method to overcome this problem.
  20. 20. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING HDFS Q. What is the problem in having lots of small files in HDFS? Provide one method to overcome this problem. > hadoop archive -archiveName edureka_archive.har /input/location /output/location Problem:  Too many small files = too many blocks  Too many blocks = too much metadata  Managing this huge amount of metadata in the NameNode's memory is difficult  Increased seek cost Solution:  Hadoop Archive  It clubs small HDFS files into a single archive (.HAR file)
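     The files packed into the archive stay readable through the har:// URI scheme; a small usage sketch reusing the paths from the slide (the file name inside the archive is hypothetical):
        # list the original small files packed inside the archive
        hadoop fs -ls har:///output/location/edureka_archive.har
        # read one of the archived files directly (name shown is a placeholder)
        hadoop fs -cat har:///output/location/edureka_archive.har/sample_file.txt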
  21. 21. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING HDFS Q. Suppose there is a file of size 514 MB stored in HDFS (Hadoop 2.x) using the default block size configuration and the default replication factor. Then, how many blocks will be created in total and what will be the size of each block?
  22. 22. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING HDFS Q. Suppose there is a file of size 514 MB stored in HDFS (Hadoop 2.x) using the default block size configuration and the default replication factor. Then, how many blocks will be created in total and what will be the size of each block?  Default Block Size = 128 MB  514 MB / 128 MB ≈ 4.02, so 5 blocks are needed: 4 blocks of 128 MB and 1 block of 2 MB  Replication Factor = 3  Total Blocks = 5 * 3 = 15  Total size = 514 * 3 = 1542 MB
  23. 23. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING HDFS Q. How do you copy a file into HDFS with a block size different from the existing block size configuration?
  24. 24. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING HDFS Q. How do you copy a file into HDFS with a block size different from the existing block size configuration?  Desired block size: 32 MB = 33554432 bytes (default block size: 128 MB)  Command: hadoop fs -Ddfs.blocksize=33554432 -copyFromLocal /local/test.txt /sample_hdfs  Check the block size of test.txt: hadoop fs -stat %o /sample_hdfs/test.txt (Diagram: existing HDFS files use 128 MB blocks; test.txt copied from the local file system with -Ddfs.blocksize=33554432 is stored in /sample_hdfs as 32 MB blocks)
  25. 25. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING HDFS Q. What is a block scanner in HDFS?
  26. 26. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING HDFS Q. What is a block scanner in HDFS?  The block scanner maintains the integrity of the data blocks  It runs periodically on every DataNode to verify whether the data blocks stored are correct or not Steps: 1. The DataNode reports the corrupted block to the NameNode 2. The NameNode schedules the creation of new replicas using the good replicas 3. Once the replication factor (of uncorrupted replicas) reaches the required level, deletion of the corrupted blocks takes place Note: This question is generally asked for Hadoop Admin positions
  27. 27. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING HDFS Q. Can multiple clients write into an HDFS file concurrently?
  28. 28. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING HDFS Q. Can multiple clients write into an HDFS file concurrently?  HDFS follows a Single Writer, Multiple Reader model  The client which opens a file for writing is granted a lease by the NameNode  The NameNode rejects write requests of other clients for a file which is currently being written by someone else
  29. 29. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING HDFS Q. What do you mean by the High Availability of a NameNode? How is it achieved?
  30. 30. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING HDFS Q. What do you mean by the High Availability of a NameNode? How is it achieved?  The NameNode used to be a Single Point of Failure in Hadoop 1.x  High Availability refers to keeping the NameNode service available to the cluster at all times  The HDFS HA Architecture in Hadoop 2.x allows us to have two NameNodes in an Active/Passive configuration.
  31. 31. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING MapReduce Interview Questions “Never tell me the sky’s the limit when there are footprints on the moon.” –Author Unknown
  32. 32. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING MapReduce Q. Explain the process of spilling in MapReduce?
  33. 33. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING MapReduce Q. Explain the process of spilling in MapReduce?  The output of a map task is written into a circular memory buffer (RAM)  The default buffer size is 100 MB, as specified in mapreduce.task.io.sort.mb  Spilling is the process of copying the data from the memory buffer to the local disk after a certain threshold is reached  The default spilling threshold is 0.8, as specified in mapreduce.map.sort.spill.percent (Diagram: the buffer in the NodeManager's RAM spills its data to local disk once it is 80% full)
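     Both thresholds are ordinary job properties, so they can be tuned per job or cluster-wide; a minimal sketch as it might appear in mapred-site.xml (the values shown are just the defaults quoted on the slide):
        <property>
          <name>mapreduce.task.io.sort.mb</name>
          <value>100</value>   <!-- size of the in-memory map output buffer, in MB -->
        </property>
        <property>
          <name>mapreduce.map.sort.spill.percent</name>
          <value>0.80</value>  <!-- buffer fill ratio at which spilling to local disk starts -->
        </property>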
  34. 34. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING MapReduce Q. What is the difference between blocks, input splits and records?
  35. 35. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING MapReduce Q. What is the difference between blocks, input splits and records?  Blocks: data in HDFS is physically stored as blocks (physical division)  Input Splits: logical chunks of data to be processed by an individual mapper (logical division)  Records: each input split is composed of records, e.g. in a text file each line is a record
  36. 36. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING MapReduce Q. What is the role of RecordReader in Hadoop MapReduce?
  37. 37. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING MapReduce Q. What is the role of RecordReader in Hadoop MapReduce?  RecordReader converts the data present in a file into (key, value) pairs suitable for reading by the Mapper task  The RecordReader instance is defined by the InputFormat (Example: input lines "1 David", "2 Cassie", "3 Remo", "4 Ramesh", … become the pairs (0, "1 David"), (57, "2 Cassie"), (122, "3 Remo"), (171, "4 Ramesh"), … where each key is the byte offset of the line, and are then passed to the Mapper)
  38. 38. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING MapReduce Q. What is the significance of counters in MapReduce?
  39. 39. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING MapReduce Q. What is the significance of counters in MapReduce?  Used for gathering statistics about the job:  for quality control  for application-level statistics  Easier to retrieve counters than log messages for a large distributed job  For example: counting the number of invalid records, etc. (Diagram: malformed input records such as "2%^&%d" and "5$*&!#$" increment an "invalid records" counter as the job runs)
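     A hedged sketch of how such a counter might be maintained in a mapper (the class name, the enum and the "id name" record format being validated are illustrative assumptions):
        import java.io.IOException;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.NullWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;

        // Counts records that do not match an expected "id name" layout while passing valid ones through.
        public class RecordValidationMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

          // Enum-based counters show up in the job's counter output, grouped by the enum class name.
          public enum RecordQuality { VALID, INVALID }

          @Override
          protected void map(LongWritable key, Text value, Context context)
              throws IOException, InterruptedException {
            if (value.toString().matches("\\d+\\s+[A-Za-z]+")) {
              context.getCounter(RecordQuality.VALID).increment(1);
              context.write(value, NullWritable.get());
            } else {
              context.getCounter(RecordQuality.INVALID).increment(1);  // e.g. "2%^&%d"
            }
          }
        }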
  40. 40. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING MapReduce Q. Why is the output of map tasks stored (spilled) to local disk and not to HDFS?
  41. 41. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING MapReduce Q. Why is the output of map tasks stored (spilled) to local disk and not to HDFS?  The outputs of map tasks are intermediate key-value pairs which are then processed by the reducer  The intermediate output is not required after completion of the job  Storing this intermediate output in HDFS and replicating it would create unnecessary overhead (Diagram: the Mapper writes to the NodeManager's local disk; only the Reducer's output is written to HDFS)
  42. 42. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING MapReduce Q. Define Speculative Execution
  43. 43. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING MapReduce Q. Define Speculative Execution  If a task is detected to be running slower, an equivalent task is launched so as to maintain the critical path of the job  The scheduler tracks the progress of all tasks (map and reduce) and launches speculative duplicates for the slower ones  After a task completes, all of its running duplicates are killed (Diagram: the scheduler monitors a slow MR task on one NodeManager and launches a speculative duplicate on another)
  44. 44. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING MapReduce Q. How will you prevent a file from splitting in case you want the whole file to be processed by the same mapper?
  45. 45. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING MapReduce Q. How will you prevent a file from splitting in case you want the whole file to be processed by the same mapper? Method 1: Increase the minimum split size to be larger than the largest file, in the driver: i. conf.set("mapred.min.split.size", "size_larger_than_file_size"); ii. Input split computation formula: max ( minimumSize, min ( maximumSize, blockSize ) ) Method 2: Modify the InputFormat class that you want to use: i. Subclass the concrete subclass of FileInputFormat and override the isSplitable() method to return false, as in the sketch below: public class NonSplittableTextInputFormat extends TextInputFormat { @Override protected boolean isSplitable (JobContext context, Path file) { return false; } }
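     Cleaned up, the two methods look roughly like this (a sketch; mapreduce.input.fileinputformat.split.minsize is the newer name of mapred.min.split.size, and the 10 GB value is an arbitrary illustrative choice):
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.mapreduce.JobContext;
        import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

        // Method 1 (in the driver): make the minimum split size larger than the largest input file,
        // so max(minimumSize, min(maximumSize, blockSize)) yields a single split per file.
        //   conf.set("mapreduce.input.fileinputformat.split.minsize", "10737418240"); // ~10 GB

        // Method 2: an InputFormat whose files are never split.
        public class NonSplittableTextInputFormat extends TextInputFormat {
          @Override
          protected boolean isSplitable(JobContext context, Path file) {
            return false;  // every file becomes exactly one input split, hence one mapper
          }
        }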
  46. 46. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING MapReduce Q. Is it legal to set the number of reduce tasks to zero? Where will the output be stored in this case?
  47. 47. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING MapReduce Q. Is it legal to set the number of reduce tasks to zero? Where will the output be stored in this case?  It is legal to set the number of reduce tasks to zero  It is done when there is no need for a reducer, e.g. when the input only needs to be transformed into a particular format, in a map-side join, etc.  The map output is stored directly in HDFS, at the output path specified by the client (Diagram: with the reducer set to zero, the flow becomes HDFS input → Map → HDFS output, skipping the Reduce phase)
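     In the driver this is a single call to setNumReduceTasks(0); a minimal sketch (the identity Mapper and the command-line paths are placeholders):
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

        public class MapOnlyJobDriver {
          public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "map-only-job");
            job.setJarByClass(MapOnlyJobDriver.class);
            job.setMapperClass(Mapper.class);  // identity mapper used purely as a placeholder
            job.setNumReduceTasks(0);          // zero reducers: map output is written straight to HDFS
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
          }
        }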
  48. 48. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING MapReduce Q. What is the role of Application Master in a MapReduce Job?
  49. 49. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING MapReduce Q. What is the role of the Application Master in a MapReduce Job?  Acts as a helper process for the ResourceManager  Initializes the job and keeps track of the job's progress  Retrieves the input splits computed by the client  Negotiates the resources needed for running the job with the ResourceManager  Creates a map task object for each split (Diagram: the client submits the job to the ResourceManager, which launches the ApplicationMaster on a NodeManager; the ApplicationMaster asks for resources, runs the tasks, reports status and unregisters when done)
  50. 50. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING MapReduce Q. What do you mean by MapReduce task running in uber mode?
  51. 51. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING MapReduce Q. What do you mean by MapReduce task running in uber mode?  If a job is small, the ApplicationMaster chooses to run its tasks in its own JVM; these are called uber tasks  This reduces the overhead of allocating new containers for running the tasks  A MapReduce job is run as an uber task if:  It requires fewer than 10 mappers  It requires only one reducer  The input size is less than the HDFS block size  Parameters for deciding an uber task:  mapreduce.job.ubertask.maxmaps  mapreduce.job.ubertask.maxreduces  mapreduce.job.ubertask.maxbytes  To enable uber tasks, set mapreduce.job.ubertask.enable to true.
  52. 52. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING MapReduce (Diagram: the client JVM submits the MR job to the ResourceManager, job resources are copied to HDFS, the ApplicationMaster is launched on a NodeManager and runs the uber MR task inside its own JVM, then writes the output) Criteria:  It requires fewer than 10 mappers  It requires only one reducer  The input size is less than the HDFS block size
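     A minimal sketch of the corresponding properties as they might appear in mapred-site.xml (the threshold values shown are the usual defaults; treat them as assumptions):
        <property>
          <name>mapreduce.job.ubertask.enable</name>
          <value>true</value>
        </property>
        <property>
          <name>mapreduce.job.ubertask.maxmaps</name>
          <value>9</value>   <!-- the job must need fewer than 10 mappers -->
        </property>
        <property>
          <name>mapreduce.job.ubertask.maxreduces</name>
          <value>1</value>   <!-- at most one reducer -->
        </property>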
  53. 53. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING MapReduce Q. How will you enhance the performance of MapReduce job when dealing with too many small files?
  54. 54. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING MapReduce Q. How will you enhance the performance of a MapReduce job when dealing with too many small files?  CombineFileInputFormat can be used to solve this problem  CombineFileInputFormat packs many small files into each input split, so each split is processed by a single mapper  It takes node and rack locality into account when deciding which blocks to place in the same split  This lets a typical MapReduce job process the input files efficiently, as the sketch below shows
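     A hedged sketch of wiring this into a driver (CombineTextInputFormat is the ready-made text-oriented subclass of CombineFileInputFormat; the 128 MB cap and the helper class are illustrative):
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

        public class SmallFilesJobConfig {
          // Configure an existing Job so that many small files are grouped into larger splits.
          public static void useCombinedSplits(Job job) {
            job.setInputFormatClass(CombineTextInputFormat.class);
            // cap each combined split at roughly one HDFS block (128 MB, illustrative value)
            CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
          }
        }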
  55. 55. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Apache Hive Interview Questions “Generally, the question that seems to be complicated have simple answers.” – Anonymous
  56. 56. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Apache Hive Q. Where does the data of a Hive table get stored? Q. Why is HDFS not used by the Hive metastore for storage?
  57. 57. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Apache Hive Q. Where does the data of a Hive table get stored?  By default, the Hive table data is stored in an HDFS directory: /user/hive/warehouse  It is specified in the hive.metastore.warehouse.dir configuration parameter present in hive-site.xml Q. Why is HDFS not used by the Hive metastore for storage?
  58. 58. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Apache Hive Q. Where does the data of a Hive table get stored?  By default, the Hive table data is stored in an HDFS directory: /user/hive/warehouse  It is specified in the hive.metastore.warehouse.dir configuration parameter present in hive-site.xml Q. Why is HDFS not used by the Hive metastore for storage?  Files in HDFS cannot be edited in place (write-once), whereas metadata needs frequent updates  The metastore stores metadata in an RDBMS to provide low query latency  HDFS read/write operations are time-consuming processes
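     For reference, a minimal hive-site.xml sketch (the warehouse directory is the default value; the MySQL connection URL for the metastore database is an illustrative assumption):
        <property>
          <name>hive.metastore.warehouse.dir</name>
          <value>/user/hive/warehouse</value>  <!-- where managed table data lives in HDFS -->
        </property>
        <property>
          <name>javax.jdo.option.ConnectionURL</name>
          <value>jdbc:mysql://metastore-host/metastore_db?createDatabaseIfNotExist=true</value>
        </property>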
  59. 59. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Apache Hive Scenario: Suppose, I have installed Apache Hive on top of my Hadoop cluster using default metastore configuration. Then, what will happen if we have multiple clients trying to access Hive at the same time?
  60. 60. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Apache Hive Scenario: Suppose, I have installed Apache Hive on top of my Hadoop cluster using the default metastore configuration. Then, what will happen if we have multiple clients trying to access Hive at the same time?  Multiple client access is not allowed in the default metastore configuration (embedded mode)  One may use one of the following two metastore configurations instead: 1. Local Metastore Configuration 2. Remote Metastore Configuration
  61. 61. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Apache Hive Q. What is the difference between external table and managed table?
  62. 62. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Apache Hive Q. What is the difference between an external table and a managed table? Managed Table:  Hive is responsible for managing the table data  While dropping the table, the metadata along with the table data is deleted from the Hive warehouse External Table:  Hive is responsible for managing only the table metadata, not the table data  While dropping the table, Hive just deletes the metadata, leaving the table data untouched
  63. 63. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Apache Hive Q. When should we use SORT BY instead of ORDER BY ?
  64. 64. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Apache Hive Q. When should we use SORT BY instead of ORDER BY ?  The SORT BY clause sorts the data using multiple reducers, so each reducer's output is sorted but the result is not globally ordered  ORDER BY sorts all of the data together using a single reducer  SORT BY should be used to sort huge datasets (Diagram: with SORT BY the dataset fans out to Reducer 1..n, each writing its own sorted output; with ORDER BY the whole dataset passes through one reducer)
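     A quick usage sketch, reusing the transaction_details table from the scenario slide further below purely as an illustration:
        -- global ordering: every row funnels through a single reducer
        SELECT cust_id, amount FROM transaction_details ORDER BY amount DESC;

        -- per-reducer ordering: scales to huge datasets, output is sorted within each reducer only
        SELECT cust_id, amount FROM transaction_details SORT BY amount DESC;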
  65. 65. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Apache Hive Q. What is the difference between partition and bucket in Hive?
  66. 66. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Apache Hive Q. What is the difference between partition and bucket in Hive?
  67. 67. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Apache Hive Scenario: CREATE TABLE transaction_details (cust_id INT, amount FLOAT, month STRING, country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘,’ ; Now, after inserting 50,000 tuples in this table, I want to know the total revenue generated for the month - January. But, Hive is taking too much time in processing this query. How will you solve this problem?
  68. 68. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING  Create a partitioned table:  CREATE TABLE partitioned_transaction (cust_id INT, amount FLOAT, country STRING) PARTITIONED BY (month STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘,’ ;  Enable dynamic partitioning in Hive:  SET hive.exec.dynamic.partition = true;  SET hive.exec.dynamic.partition.mode = nonstrict;  Transfer the data :  INSERT OVERWRITE TABLE partitioned_transaction PARTITION (month) SELECT cust_id, amount, country, month FROM transaction_details;  Run the query :  SELECT SUM(amount) FROM partitioned_transaction WHERE month= ‘January’; Apache Hive
  69. 69. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Q. What is dynamic partitioning and when is it used? Apache Hive
  70. 70. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Q. What is dynamic partitioning and when is it used?  The values of the partition columns are known only during runtime  One may use dynamic partitioning in the following cases:  Loading data from an existing non-partitioned table, to improve query latency  The values of the partitions are not known beforehand, so finding these unknown partition values manually from huge data sets is a tedious task Apache Hive
  71. 71. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Q. How Hive distributes the rows into buckets? Apache Hive
  72. 72. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Q. How Hive distributes the rows into buckets?  Bucket number is determined for a row by using the formula: hash_function (bucketing_column) modulo (num_of_buckets)  hash_function depends on the column data type i.e. for int type it is equal to value of column  hash_function for other data types is complex to calculate Id Name 1 John 2 Mike 3 Shawn 2, Mike 1, John 3, Shawn Bucket 1 Bucket 2  hash_function (1) = 1  hash_function (2) = 2  hash_function (3) = 3 hash_function (id) = id  1 mod 2 = 1  2 mod 2 = 0  3 mod 2 = 1 id mod 2 = bucket num Apache Hive
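     For completeness, a sketch of creating and populating a bucketed table (table and column names are illustrative; on older Hive versions SET hive.enforce.bucketing = true may also be needed before the insert):
        CREATE TABLE bucketed_users (id INT, name STRING)
        CLUSTERED BY (id) INTO 2 BUCKETS
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

        -- rows are routed to buckets using hash_function(id) mod 2, as described above
        INSERT OVERWRITE TABLE bucketed_users SELECT id, name FROM users;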
  73. 73. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Scenario: Suppose, I have a CSV file – ‘sample.csv’ present in ‘/temp’ directory with the following entries: id first_name last_name e-mail gender ip 1 Hugh Jackman hugh32@sun.co Male 136.90.241.52 2 David Lawrence dlawrence@gmail.co Male 101.177.15.130 3 Andy Hall anyhall@yahoo.co Female 114.123.153.64 4 Samuel Jackson samjackson@rediff.co Male 91.121.145.67 5 Emily Rose rosemily@edureka.co Female 117.123.108.98 How will you consume this CSV file into the Hive warehouse using built-in SerDe? Apache Hive
  74. 74. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING  A SerDe allows us to convert the unstructured bytes into a record that we can process using Hive.  CREATE EXTERNAL TABLE sample (id INT, first_name STRING, last_name STRING, email STRING, gender STRING, ip_address STRING) ROW FORMAT SERDE ‘org.apache.hadoop.hive.serde2.OpenCSVSerde’ STORED AS TEXTFILE LOCATION ‘/temp’;  SELECT first_name FROM sample WHERE gender = ‘Male’; Note:  Hive provides several built-in SerDes, e.g. for JSON, TSV, etc.  Useful in cases where you have embedded commas in delimited fields Apache Hive
  75. 75. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Scenario:  I have a lot of small CSV files present in /input directory in HDFS and I want to create a single Hive table corresponding to these files.  The data in these files are in the format: {id, name, e-mail, country} Now, as we know, Hadoop performance degrades when we use lots of small files. So, how will you solve this problem? Apache Hive
  76. 76. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING  Create a temporary table: CREATE TABLE temp_table (id INT, name STRING, email STRING, country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘,’ STORED AS TEXTFILE;  Load the data from the input directory into temp_table: LOAD DATA INPATH ‘/input’ INTO TABLE temp_table;  Create a table that will store data in SequenceFile format: CREATE TABLE sample_seqfile (id INT, name STRING, email STRING, country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘,’ STORED AS SEQUENCEFILE;  Transfer the data from the temporary table into the sample_seqfile table: INSERT OVERWRITE TABLE sample_seqfile SELECT * FROM temp_table; Apache Hive  When Hive converts queries to MapReduce jobs, it decides on the appropriate key-value pairs to be used for a given record  Sequence files are flat files consisting of binary key-value pairs  Using sequence files, one can club two or more smaller files into one single sequence file
  77. 77. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Apache Pig Interview Questions “Whenever you are asked if you can do a job, tell them, 'Certainly I can!' , Then get busy and find out how to do it.” –Theodore Roosevelt
  78. 78. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Apache Pig Q. What is the difference between logical and physical plans?
  79. 79. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Apache Pig Q. What is the difference between logical and physical plans? Logical Plan:  Created for each line in pig script if no syntax error is found by interpreter  No data processing happens during creation of logical plan Physical Plan:  Physical plan is basically a series of map reduce jobs  Describes the physical operators to execute the script, without reference to how they will be executed in MapReduce
  80. 80. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Apache Pig Q. What is a bag in Pig Latin?
  81. 81. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Apache Pig Q. What is a bag in Pig Latin?  Unordered collection of tuples  Duplicate tuples are allowed  Tuples with differing numbers of fields are allowed  For example: { (Linkin Park, 7, California), (Metallica, 8), (Mega Death, Los Angeles) }
  82. 82. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Apache Pig Q. How does Apache Pig handle unstructured data, which is difficult in the case of Apache Hive?
  83. 83. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Apache Pig Q. How does Apache Pig handle unstructured data, which is difficult in the case of Apache Hive?  No data type declared: fields default to byte array, and a data type can still be defined at runtime  Missing schema: fields are addressed by positional notation, e.g. $2 refers to c, the 3rd field of {a, b, c}  Missing schema in JOIN, COGROUP, etc.: the schema is treated as NULL
  84. 84. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Apache Pig Q. What are the different execution modes available in Pig?
  85. 85. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Apache Pig Q. What are the different execution modes available in Pig? MapReduce Mode:  Default mode  Requires access to a Hadoop cluster  Input and output data are present on HDFS Local Mode:  Requires access to a single machine  ‘-x ’ flag is used to specify the local mode environment (pig -x local)  Input and output data are present on local file system
  86. 86. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Apache Pig Q. What does Flatten do in Pig?
  87. 87. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Apache Pig Q. What does Flatten do in Pig?  Flatten un-nests bags and tuples.  For tuples, the Flatten operator substitutes the fields of a tuple in place of the tuple  For example: given (a, (b, c)), GENERATE $0, flatten($1) produces (a, b, c)  Un-nesting bags is a little more complex as it requires creating new tuples
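     A small Pig Latin sketch of the tuple case (the file name, relation names and field layout are illustrative assumptions; the nested tuple is parsed because its schema is declared in the LOAD statement):
        -- data.txt holds tab-separated lines such as:  a   (b,c)
        A = LOAD 'data.txt' AS (f1:chararray, t:(f2:chararray, f3:chararray));
        -- flattening the inner tuple turns (a,(b,c)) into (a,b,c)
        B = FOREACH A GENERATE f1, FLATTEN(t);
        DUMP B;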
  88. 88. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING HBase & Sqoop Interview Questions “Take risks: if you win, you will be happy; if you lose, you will be wise.” –Anonymous
  89. 89. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING HBase Q. What are the key components of HBase?
  90. 90. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING HBase Q. What are the key components of HBase?  HMaster manages the Region Servers  A Region Server manages a group of regions  ZooKeeper acts as a coordinator inside the HBase environment
  91. 91. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING HBase Q. How do we back up a HBase cluster?
  92. 92. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING HBase Q. How do we back up a HBase cluster? 1. Full Shutdown Backup  Useful for cases where HBase cluster shutdown is possible  Steps: • Stop HBase: Stop the HBase services first • Distcp: Copy the contents of the HBase directory into another HDFS directory in different or same cluster 2. Live Cluster Backup  Useful for live cluster that cannot afford downtime  Steps: • CopyTable: Copy data from one table to another on the same or different cluster • Export: Dumps the content of a table into HDFS on the same cluster
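     The corresponding commands look roughly as follows (cluster addresses, table names and paths are placeholders, and exact invocations can vary slightly between HBase versions):
        # Full shutdown backup: after stopping HBase, copy its root directory with distcp
        hadoop distcp hdfs://active-cluster/hbase hdfs://backup-cluster/hbase-backup

        # Live cluster backup: copy a table to a peer cluster with CopyTable
        hbase org.apache.hadoop.hbase.mapreduce.CopyTable --peer.adr=backup-zk-host:2181:/hbase my_table

        # Live cluster backup: dump a table's contents into HDFS on the same cluster with Export
        hbase org.apache.hadoop.hbase.mapreduce.Export my_table /backups/my_table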
  93. 93. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING HBase Q. What is a Bloom filter and how does it help in searching rows?
  94. 94. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING HBase Q. What is a Bloom filter and how does it help in searching rows?  Used to improve the overall throughput of the cluster  A space-efficient mechanism to test whether an HFile contains a specific row or row-column cell  Saves time by skipping non-relevant blocks when scanning for a given row key
  95. 95. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING HBase Q. What is the role of JDBC driver in a Sqoop set up?
  96. 96. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Sqoop Q. What is the role of the JDBC driver in a Sqoop setup?  To connect to different relational databases, Sqoop needs a connector  Almost every DB vendor makes this connector available as a JDBC driver specific to that DB  Sqoop needs the JDBC driver of each database that it needs to interact with
  97. 97. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Q. When to use --target-dir and when to use --warehouse-dir while importing data? Sqoop
  98. 98. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Q. When to use --target-dir and when to use --warehouse-dir while importing data?  --target-dir is used for specifying a particular directory in HDFS  --warehouse-dir is used for specifying the parent directory of all the Sqoop jobs  In the latter case, Sqoop will create a directory with the same name as the table under that parent directory Sqoop
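     Illustrative commands (the connection string, table name and HDFS paths are placeholders):
        # import into an exact HDFS directory
        sqoop import --connect jdbc:mysql://db-host/dbname --table EMPLOYEES --target-dir /user/edureka/employees

        # import under a parent directory; Sqoop creates /user/edureka/sqoop_imports/EMPLOYEES
        sqoop import --connect jdbc:mysql://db-host/dbname --table EMPLOYEES --warehouse-dir /user/edureka/sqoop_imports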
  99. 99. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Q. What does the following query do: $ sqoop import --connect jdbc:mysql://host/dbname --table EMPLOYEES --where "start_date > '2012-11-09'" Sqoop
  100. 100. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING It imports the employees who have joined after 9-Nov-2012 Sqoop Q. What does the following query do: $ sqoop import --connect jdbc:mysql://host/dbname --table EMPLOYEES --where "start_date > '2012-11-09'"
  101. 101. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Scenario: In a Sqoop import command you have mentioned to run 8 parallel MapReduce tasks but Sqoop runs only 4 What can be the reason? Sqoop
  102. 102. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Scenario: In a Sqoop import command you have mentioned to run 8 parallel MapReduce tasks but Sqoop runs only 4. What can be the reason? In this case, the MapReduce cluster is configured to run only 4 parallel tasks. Therefore, the number of parallel tasks in the Sqoop command must be less than or equal to what the MapReduce cluster allows Sqoop
  103. 103. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Q. Give a Sqoop command to show all the databases in a MySQL server. Sqoop
  104. 104. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Q. Give a Sqoop command to show all the databases in a MySQL server.  Issue the command given below: $ sqoop list-databases --connect jdbc:mysql://database.example.com/ Sqoop
  105. 105. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Learning Resources  Top 50 Hadoop Interview Questions: www.edureka.co/blog/interview-questions/top-50-hadoop-interview-questions-2016  HDFS Interview Questions: www.edureka.co/blog/interview-questions/hadoop-interview-questions-hdfs-2  MapReduce Interview Questions: www.edureka.co/blog/interview-questions/hadoop-interview-questions-mapreduce  Apache Hive Interview Questions: www.edureka.co/blog/interview-questions/hive-interview-questions  Apache Pig Interview Questions: www.edureka.co/blog/interview-questions/hadoop-interview-questions-pig  Apache HBase Interview Questions: www.edureka.co/blog/interview-questions/hbase-interview-questions
  106. 106. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Thank You… Questions/Queries/Feedback
