2. Hadoop Concept
■ HDFS: Hadoop Distributed File System
■ Cluster roles: NameNode and DataNode
■ DataNode: stores the actual data in HDFS blocks
■ NameNode: knows which DataNodes each block exists on
■ Three-way block replication makes HDFS fault tolerant
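A minimal sketch of how these pieces fit together, using the standard `hdfs` command-line tools (the file name and HDFS path are hypothetical):

```shell
# Copy a local file into HDFS; the NameNode decides which
# DataNodes receive each block (3 replicas by default)
hdfs dfs -put ratings.csv /data/ratings.csv

# Ask the NameNode where the blocks and their replicas live
hdfs fsck /data/ratings.csv -files -blocks -locations
```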
6. Hive
■ SQL-like query language that generates MapReduce code
■ hive -f select_aggregate.hql
■ sort_limit.hql:
SELECT userid, SUM(score) AS total FROM user_db.comments
GROUP BY userid SORT BY total DESC LIMIT 10;
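A hedged sketch of how such scripts are typically run from the shell (the script name follows the slide; the inline query is illustrative):

```shell
# Run an HQL script in batch mode
hive -f sort_limit.hql

# Or run a one-off query inline
hive -e "SELECT COUNT(*) FROM user_db.comments;"
```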
■ Filter on partition columns so queries scan only the partitions where your data exists
SHOW PARTITIONS table_name;
SHOW PARTITIONS table_name PARTITION (ds='2013-01-01');
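Assuming a table partitioned by a date string `ds` (as in the slide), a partition-pruned query might look like this; the other column names are illustrative:

```sql
-- Only the 2013-01-01 partition is scanned, not the whole table
SELECT userid, score
FROM user_db.comments
WHERE ds = '2013-01-01';
```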
■ Join types: INNER JOIN, LEFT [OUTER] JOIN, RIGHT [OUTER] JOIN (LEFT JOIN is shorthand for LEFT OUTER JOIN, and likewise for RIGHT)
tab1 LEFT JOIN tab2 ON (tab1.id = tab2.id)
tab1 RIGHT JOIN tab2 ON (tab1.id = tab2.id)
tab1 INNER JOIN tab2 ON (tab1.id = tab2.id)
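A fuller sketch of one of these joins in context; the tables and columns are hypothetical:

```sql
-- Keep every row of tab1; tab2 columns come back NULL where no id matches
SELECT tab1.id, tab1.name, tab2.score
FROM tab1
LEFT OUTER JOIN tab2 ON (tab1.id = tab2.id);
```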
7. Hive Metadata
■ USE database;
■ SHOW DATABASES;
■ SHOW TABLES;
■ DESCRIBE [FORMATTED|EXTENDED] table;
■ CREATE DATABASE db_name;
■ DROP DATABASE db_name [CASCADE];
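Put together, a typical metadata session might look like this; the database and table names are made up:

```sql
CREATE DATABASE user_db;
USE user_db;
SHOW TABLES;
DESCRIBE FORMATTED comments;
-- CASCADE also drops any tables the database still contains
DROP DATABASE user_db CASCADE;
```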
8. UDF- User Defined Function
■ SHOW FUNCTIONS;
■ DESCRIBE FUNCTION <function_name>;
■ DESCRIBE FUNCTION EXTENDED <function_name>;
■ Create Function
– CREATE FUNCTION [db_name.]function_name AS class_name
– [USING JAR|FILE|ARCHIVE 'file_uri' [, JAR|FILE|ARCHIVE 'file_uri'] ];
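A concrete instance of the syntax above; the function name, JAR path, and Java class name are all assumptions for illustration:

```sql
CREATE FUNCTION user_db.to_upper AS 'com.example.hive.udf.ToUpper'
  USING JAR 'hdfs:///udfs/my-udfs.jar';

-- Then call it like any built-in function
SELECT user_db.to_upper(name) FROM user_db.users;
```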
9. Others
■ Pipeline: chaining multiple jobs together
■ Load data into HDFS
■ Compressed storage
– Columnar file format: Parquet
– Compression codec: Snappy
■ Distributed computing
■ Impala, Spark SQL, etc.
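A sketch of creating a Parquet table with Snappy compression in Hive; the table name is illustrative, and the exact TBLPROPERTIES form can vary by Hive version:

```sql
CREATE TABLE comments_parquet
STORED AS PARQUET
TBLPROPERTIES ('parquet.compression'='SNAPPY')
AS SELECT * FROM user_db.comments;
```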
10. Linux Shell Commands
Command              Description
ls                   List folder contents
cat                  Display (read) a file's contents
mkdir                Make a directory
cd                   Change to a directory
sudo <command>       Run a command as administrator
chmod <mode> <file>  Change the permissions of a file (view them with ls -l)
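The commands in the table can be exercised in a short session like the one below (sudo is omitted since it requires privileges; the directory and file names are made up):

```shell
mkdir demo_dir                 # make a directory
cd demo_dir                    # change into it
echo "hello" > note.txt        # create a file to work with
cat note.txt                   # display the file's contents
ls                             # list folder contents
chmod 600 note.txt             # restrict the file's permissions
ls -l note.txt                 # verify the new permissions
```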
12. Sqoop-1.4.5
■ Command-line utility for transferring data between relational databases (RDBMS) and Hadoop
sqoop import --connect <JDBC connection string> --table <tablename> --username <un> --password <pw> --hive-import
sqoop export --connect <JDBC connection string> --table <tablename> --username <un> --password <pw> --export-dir <path>
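A concrete (entirely hypothetical) example of the import form above, pulling a MySQL table into Hive; the host, database, table, user, and password-file path are all placeholders:

```shell
sqoop import \
  --connect jdbc:mysql://dbhost:3306/shop \
  --table orders \
  --username etl_user \
  --password-file /user/etl/.sqoop_pw \
  --hive-import
```

Using --password-file (a file readable only by you) avoids putting the password on the command line, where it would be visible in the shell history.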
13. Apache Mahout vs. R (Single Machine)
■ Library with common machine learning algorithms/ Data Mining / Analysis
■ Mahout is designed for Hadoop Scale
■ Support for multiple distributed backends (including Apache Spark)
■ R, by contrast, is designed for single-machine, in-memory analysis
■ Many data-mining algorithms
– Recommendation (predicting what a user will like, e.g. Pandora)
– Classification (assigning new data to known categories, e.g. spam identification)
– Clustering (finding new groups of similar data, e.g. Google News)
14. Apache Spark
■ In-memory distributed data analysis
– Goal is to speed up jobs
– Batch processing
– Machine learning
■ Interactive queries
■ YARN-ready
■ Runs Hive-style queries in memory (Spark SQL)
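Since the slide mentions running Hive-style SQL in memory, a minimal sketch with the spark-sql shell; the query and table are illustrative:

```shell
# Run a Hive-style query on Spark's in-memory engine
spark-sql -e "SELECT userid, SUM(score) FROM user_db.comments GROUP BY userid LIMIT 10;"
```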
15. Beyond Hadoop
■ Hadoop is over ten years old; it grew out of Google white papers such as Bigtable (2006)
– https://static.googleusercontent.com/media/research.google.com/zh-CN//archive/bigtable-osdi06.pdf
■ Google has released a next-generation product
– BigQuery: query-as-a-service
– Scans petabytes in seconds via SQL
■ Watch for other implementations
– Open-source version of Dremel: Apache Drill
– Scalable stream and batch data processing: Apache Flink