2. Hadoop Concept
■ HDFS: Hadoop Distributed File System
■ Cluster roles: NameNode and DataNode
■ DataNode: stores the actual data in HDFS blocks
■ NameNode: knows which DataNodes each block exists on
■ Three-way block replication makes HDFS fault tolerant
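A minimal sketch of how these pieces fit together, using the standard `hdfs` command-line tools (the file name and HDFS path are hypothetical):

```shell
# Copy a local file into HDFS; the NameNode decides which
# DataNodes receive each block (3 replicas by default)
hdfs dfs -put ratings.csv /data/ratings.csv

# Ask the NameNode where the blocks and their replicas live
hdfs fsck /data/ratings.csv -files -blocks -locations
```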
6. Hive
■ SQL-like query language that generates MapReduce code
■ hive -f select_aggregate.hql
■ sort_limit.hql:
SELECT userid, SUM(score) AS total FROM user_db.comments
GROUP BY userid SORT BY total DESC LIMIT 10;
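A hedged sketch of how such scripts are typically run from the shell (the script name follows the slide; the inline query is illustrative):

```shell
# Run an HQL script in batch mode
hive -f sort_limit.hql

# Or run a one-off query inline
hive -e "SELECT COUNT(*) FROM user_db.comments;"
```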
■ Filter on partition columns so queries scan only the partitions where your data exists
SHOW PARTITIONS table_name;
SHOW PARTITIONS table_name PARTITION (ds='2013-01-01');
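Assuming a table partitioned by a date string `ds` (as in the slide), a partition-pruned query might look like this; the other column names are illustrative:

```sql
-- Only the 2013-01-01 partition is scanned, not the whole table
SELECT userid, score
FROM user_db.comments
WHERE ds = '2013-01-01';
```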
■ Join types: INNER JOIN, LEFT [OUTER] JOIN, RIGHT [OUTER] JOIN (LEFT JOIN is shorthand for LEFT OUTER JOIN, and likewise for RIGHT)
tab1 LEFT JOIN tab2 ON (tab1.id = tab2.id)
tab1 RIGHT JOIN tab2 ON (tab1.id = tab2.id)
tab1 INNER JOIN tab2 ON (tab1.id = tab2.id)
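A fuller sketch of one of these joins in context; the tables and columns are hypothetical:

```sql
-- Keep every row of tab1; tab2 columns come back NULL where no id matches
SELECT tab1.id, tab1.name, tab2.score
FROM tab1
LEFT OUTER JOIN tab2 ON (tab1.id = tab2.id);
```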
7. Hive Metadata
■ USE database;
■ SHOW DATABASES;
■ SHOW TABLES;
■ DESCRIBE [FORMATTED|EXTENDED] table;
■ CREATE DATABASE db_name;
■ DROP DATABASE db_name [CASCADE];
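Put together, a typical metadata session might look like this; the database and table names are made up:

```sql
CREATE DATABASE user_db;
USE user_db;
SHOW TABLES;
DESCRIBE FORMATTED comments;
-- CASCADE also drops any tables the database still contains
DROP DATABASE user_db CASCADE;
```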
8. UDF- User Defined Function
■ SHOW FUNCTIONS;
■ DESCRIBE FUNCTION <function_name>;
■ DESCRIBE FUNCTION EXTENDED <function_name>;
■ Create Function
– CREATE FUNCTION [db_name.]function_name AS class_name
– [USING JAR|FILE|ARCHIVE 'file_uri' [, JAR|FILE|ARCHIVE 'file_uri'] ];
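A concrete instance of the syntax above; the function name, JAR path, and Java class name are all assumptions for illustration:

```sql
CREATE FUNCTION user_db.to_upper AS 'com.example.hive.udf.ToUpper'
  USING JAR 'hdfs:///udfs/my-udfs.jar';

-- Then call it like any built-in function
SELECT user_db.to_upper(name) FROM user_db.users;
```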
9. Others
■ Pipeline: chaining multiple jobs together
■ Load data into HDFS
■ Compressed storage
– Columnar file format: Parquet
– Compression codec: Snappy
■ Distributed computing
■ Impala, Spark SQL, etc.
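A sketch of creating a Parquet table with Snappy compression in Hive; the table name is illustrative, and the exact TBLPROPERTIES form can vary by Hive version:

```sql
CREATE TABLE comments_parquet
STORED AS PARQUET
TBLPROPERTIES ('parquet.compression'='SNAPPY')
AS SELECT * FROM user_db.comments;
```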
10. Linux Shell Commands
Command              Description
ls                   List folder contents
cat                  Display (read) a file's contents
mkdir                Make a directory
cd                   Change to a directory
sudo <command>       Run a command as administrator
chmod <mode> <file>  Change the permissions of a file (view them with ls -l)
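The commands in the table can be exercised in a short session like the one below (sudo is omitted since it requires privileges; the directory and file names are made up):

```shell
mkdir demo_dir                 # make a directory
cd demo_dir                    # change into it
echo "hello" > note.txt        # create a file to work with
cat note.txt                   # display the file's contents
ls                             # list folder contents
chmod 600 note.txt             # restrict the file's permissions
ls -l note.txt                 # verify the new permissions
```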
12. Sqoop-1.4.5
■ Command-line utility for transferring data between relational databases (RDBMS) and Hadoop
sqoop import --connect <JDBC connection string> --table <tablename> --username <un> --password <pw> --hive-import
sqoop export --connect <JDBC connection string> --table <tablename> --username <un> --password <pw> --export-dir <path>
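A concrete (entirely hypothetical) example of the import form above, pulling a MySQL table into Hive; the host, database, table, user, and password-file path are all placeholders:

```shell
sqoop import \
  --connect jdbc:mysql://dbhost:3306/shop \
  --table orders \
  --username etl_user \
  --password-file /user/etl/.sqoop_pw \
  --hive-import
```

Using --password-file (a file readable only by you) avoids putting the password on the command line, where it would be visible in the shell history.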
13. Apache Mahout vs. R (Single Machine)
■ Library with common machine learning algorithms/ Data Mining / Analysis
■ Mahout is designed for Hadoop Scale
■ Support for multiple distributed backends (including Apache Spark)
■ R, by contrast, is designed for single-machine, in-memory analysis
■ Many data-mining algorithms
– Recommendation (predicting what a user will like, e.g. Pandora)
– Classification (assigning new data to known categories, e.g. spam identification)
– Clustering (finding new groups of similar data, e.g. Google News)
14. Apache Spark
■ In-memory distributed data analysis
– Goal is to speed up jobs
– Batch processing
– Machine learning
■ Interactive queries
■ YARN-ready
■ Runs Hive-style queries in memory (Spark SQL)
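Since the slide mentions running Hive-style SQL in memory, a minimal sketch with the spark-sql shell; the query and table are illustrative:

```shell
# Run a Hive-style query on Spark's in-memory engine
spark-sql -e "SELECT userid, SUM(score) FROM user_db.comments GROUP BY userid LIMIT 10;"
```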
15. Beyond Hadoop
■ Hadoop is over ten years old; it grew out of Google white papers such as Bigtable (2006)
– https://static.googleusercontent.com/media/research.google.com/zh-CN//archive/bigtable-osdi06.pdf
■ Google has released a next-generation product
– BigQuery: query-as-a-service
– Scans petabytes in seconds via SQL
■ Watch for other implementations
– Open-source version of Dremel: Apache Drill
– Scalable stream and batch data processing: Apache Flink