MapReduce Execution Samples (K-mer Counting, K-means Clustering)


Transcript

  • 1. MapReduce sample code and how to run it (2nd seminar for the Bioinformatics lab)
    Author: 송주영 (bt22dr@gmail.com)

    1. K-mer Counting
    - K-mer: a run of k consecutive bases within a nucleotide sequence
        Sequence:            ACGTACGTACGTA
        K-mers (length = 6): ACGTAC CGTACG GTACGT TACGTA ACGTAC CGTACG GTACGT TACGTA
    - Usage
        Usage: KmerCounting <in> <out> <k-mer length>
    - Detailed run
        [hadoop@cudatest hadoop]$ cd $HADOOP_HOME
        [hadoop@cudatest hadoop]$ bin/hadoop dfs -ls Bio/input
        Found 1 items
        -rw-r--r-- 1 hadoop supergroup 33 2012-07-10 11:27 /user/hadoop/Bio/input/kmer_counting.txt
        [hadoop@cudatest hadoop]$
        [hadoop@cudatest hadoop]$ bin/hadoop dfs -cat Bio/input/kmer_counting.txt
        ATGAACCTTAGAACAACTTATTTAGGCAAC
        [hadoop@cudatest hadoop]$ bin/hadoop jar example.jar KmerCounting Bio/input Bio/output 3
        ****hdfs://cudatest:9000/user/hadoop/Bio/input
        12/07/10 18:08:55 INFO input.FileInputFormat: Total input paths to process : 1
        12/07/10 18:08:55 INFO util.NativeCodeLoader: Loaded the native-hadoop library
        12/07/10 18:08:55 WARN snappy.LoadSnappy: Snappy native library not loaded
        12/07/10 18:08:55 INFO mapred.JobClient: Running job: job_201207051452_0021
        12/07/10 18:08:56 INFO mapred.JobClient: map 0% reduce 0%
        12/07/10 18:09:09 INFO mapred.JobClient: map 100% reduce 0%
        12/07/10 18:09:21 INFO mapred.JobClient: map 100% reduce 100%
        12/07/10 18:09:27 INFO mapred.JobClient: Job complete: job_201207051452_0021
        12/07/10 18:09:27 INFO mapred.JobClient: Counters: 29
        … (omitted) …
        12/07/10 18:09:27 INFO mapred.JobClient: Combine output records=0
        12/07/10 18:09:27 INFO mapred.JobClient: Physical memory (bytes) snapshot=275869696
  • 2.
        12/07/10 18:09:27 INFO mapred.JobClient: Reduce output records=16
        12/07/10 18:09:27 INFO mapred.JobClient: Virtual memory (bytes) snapshot=1186914304
        12/07/10 18:09:27 INFO mapred.JobClient: Map output records=24
        [hadoop@cudatest hadoop]$ bin/hadoop dfs -cat Bio/output/part-r-00000
        AAC 4
        ACA 1
        ACC 1
        ACT 1
        AGG 1
        ATG 1
        CAA 2
        CCT 1
        CTT 2
        GAA 2
        GCA 1
        GGC 1
        TAG 1
        TGA 1
        TTA 3
        TTT 1
    - Comparison and verification
        <Source: http://schatzlab.cshl.edu/presentations/2010-03-15.XGen-Scalable%20Solutions.pdf>
        ※ Error in the source material above: [CTT : 1] and [GAA : 1] should be corrected to [CTT : 2] and [GAA : 2].
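The counts above can be reproduced outside Hadoop with a small plain-Python sketch of the map and reduce steps. This is not the author's Java code: the function names `kmer_map`, `kmer_reduce`, and `count_kmers` are hypothetical, and the input layout is an assumption. The transcript shows the 30 bases on one line, but the 33-byte file size and the 24 map output records are consistent with three 10-base lines that are each mapped separately, so the sketch assumes that layout.

```python
from collections import defaultdict


def kmer_map(line, k):
    """Map step: emit (k-mer, 1) for every length-k window in one input line."""
    seq = line.strip()
    return [(seq[i:i + k], 1) for i in range(len(seq) - k + 1)]


def kmer_reduce(pairs):
    """Shuffle and reduce step: group pairs by k-mer and sum the counts."""
    counts = defaultdict(int)
    for kmer, n in pairs:
        counts[kmer] += n
    return dict(sorted(counts.items()))


def count_kmers(lines, k):
    """Run the map step on every line, then reduce all emitted pairs."""
    pairs = [pair for line in lines for pair in kmer_map(line, k)]
    return kmer_reduce(pairs)


# Assumed file layout: three 10-base lines (30 bases + 3 newlines = 33 bytes).
lines = ["ATGAACCTTA", "GAACAACTTA", "TTTAGGCAAC"]
counts = count_kmers(lines, 3)
for kmer, n in counts.items():
    print(kmer, n)
```

Under this assumption the sketch emits 24 pairs and 16 distinct 3-mers, matching the `Map output records` and `Reduce output records` counters in the log, and it yields CTT 2 and GAA 2 as stated in the correction note.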
  • 3. K-means Clustering
    An algorithm that groups the given data into k clusters. It gathers the data points that are closest to one another into k clusters and iterates so that the sum of the distances between every data point and the centroid of its cluster is minimized.
    - Usage
        Usage: KmeansClustering <in> <out> <k: # of cluster> <x: maximum iteration>
    - Detailed run
      1. Generate the input data
        [hadoop@cudatest hadoop]$ R
        > library(rhdfs)
        > library(rmr)
        > x <- rbind(matrix(rnorm(10, sd = 0.3), ncol = 2),
        +            matrix(rnorm(10, mean = 1, sd = 0.3), ncol = 2))
        > to.dfs(x, "/user/hadoop/Kmeans/input/kmeans_data_10.txt", format="text")
        [1] "/user/hadoop/Kmeans/input/kmeans_test.txt"
        > plot(x, col = cl$cluster)
      2. Run the clustering
        [hadoop@cudatest hadoop]$ bin/hadoop dfs -ls Kmeans
        Found 1 items
        drwxr-xr-x - hadoop supergroup 0 2012-07-13 17:42 /user/hadoop/Kmeans/input
        [hadoop@cudatest hadoop]$ bin/hadoop dfs -ls Kmeans/input
        Found 1 items
        -rw-r--r-- 1…2012-07-13 17:41 /user/hadoop/Kmeans/input/kmeans_data_10.txt
        [hadoop@cudatest hadoop]$ bin/hadoop jar BQR.jar com.skcc.services.BQR.KmeansClustering Kmeans/input Kmeans/output 2 5
        12/07/13 17:55:29 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
        ****hdfs://cudatest:9000/user/hadoop/Kmeans/input
        12/07/13 17:55:29 INFO input.FileInputFormat: Total input paths to process : 1
        12/07/13 17:55:29 INFO util.NativeCodeLoader: Loaded the native-hadoop library
        12/07/13 17:55:29 WARN snappy.LoadSnappy: Snappy native library not loaded
        12/07/13 17:55:29 INFO mapred.JobClient: Running job: job_201207131700_0019
        12/07/13 17:55:30 INFO mapred.JobClient: map 0% reduce 0%
        … (omitted) …
        [hadoop@cudatest hadoop]$ bin/hadoop dfs -cat Kmeans/output/5/part-r-00000
        0 -0.220571493559915 0.0174640983069702 0.191775203854135 -0.0324478456055529 0.30730078757325 -0.446549547688165 0.12718989634021 0.133229105022227
        1 0.635665082648627 0.466306965120643 1.24647562794383 1.08065598756178 0.841377542119699 1.04601249480335 1.28393866595644 1.6448881770345 0.987943522251124 0.619578600258074 0.57360383756511 1.38138384087598
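To make the objective described above concrete, the following plain-Python sketch takes the two clusters printed in `part-r-00000`, computes each centroid as the mean of its points, sums the point-to-centroid distances, and checks that every point is already nearest to its own centroid (so another iteration would change nothing). The helper names `centroid`, `nearest`, and `total_distance` are hypothetical and not taken from the original code. The sketch sums plain Euclidean distances, as the description does; the usual textbook formulation of k-means minimizes squared distances.

```python
import math

# Cluster contents copied from the part-r-00000 output shown above.
clusters = {
    0: [(-0.220571493559915, 0.0174640983069702),
        (0.191775203854135, -0.0324478456055529),
        (0.30730078757325, -0.446549547688165),
        (0.12718989634021, 0.133229105022227)],
    1: [(0.635665082648627, 0.466306965120643),
        (1.24647562794383, 1.08065598756178),
        (0.841377542119699, 1.04601249480335),
        (1.28393866595644, 1.6448881770345),
        (0.987943522251124, 0.619578600258074),
        (0.57360383756511, 1.38138384087598)],
}


def centroid(points):
    """Mean of a list of (x, y) points."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def nearest(point, centroids):
    """ID of the centroid closest to the point (Euclidean distance)."""
    return min(centroids, key=lambda cid: math.dist(point, centroids[cid]))


def total_distance(clusters, centroids):
    """Sum of distances from every point to the centroid of its cluster."""
    return sum(math.dist(p, centroids[cid])
               for cid, pts in clusters.items() for p in pts)


centroids = {cid: centroid(pts) for cid, pts in clusters.items()}
# True when no point would move to a different cluster in another iteration.
stable = all(nearest(p, centroids) == cid
             for cid, pts in clusters.items() for p in pts)

print("centroids:", centroids)
print("sum of distances:", total_distance(clusters, centroids))
print("assignment is stable:", stable)
```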
  • 4. - Source code walkthrough
    The code consists of four Mapper and Reducer classes in total, and the clustering process is made up of three MapReduce jobs:
        generateSeed( ) : generates the centroids used as the initial seeds
        kmeansIter( )   : clusters the data while updating the centroids; run repeatedly
        resultKmeans( ) : produces the final result
    Overall structure diagram
    Key-Value design
        map    : (k1, v1) -> list(k2, v2)
        reduce : (k2, list(v2)) -> list(v3)
    GenSeedMapper
        - Input : [ key: line offset, value: <x y> ]
        - Output: [ key: center ID, value: <x y> ]
    KmeansReducer
        - Input : [ key: center ID, value: list(<x y>) ]
        - Output: [ key: center ID, value: mean <x y> of that cluster ]
    KmeansMapper
        - Setup : [ read the result of the previous step's Reducer ]
        - Input : [ key: line offset, value: <x y> ]
        - Output: [ key: ID of the center closest to <x y>, value: x y ]
    ResultReducer
  • 5.
        - Input : [ key: center ID, value: list(<x y>) ]
        - Output: [ key: center ID, value: list of the <x y> belonging to that ID ]
    - Input/Output
        Input:
            -0.220571493559915 0.0174640983069702
            0.191775203854135 -0.0324478456055529
            0.30730078757325 -0.446549547688165
            0.12718989634021 0.133229105022227
            0.635665082648627 0.466306965120643
            1.24647562794383 1.08065598756178
            0.841377542119699 1.04601249480335
            1.28393866595644 1.6448881770345
            0.987943522251124 0.619578600258074
            0.57360383756511 1.38138384087598
        Output:
            0 -0.220571493559915 0.0174640983069702 0.191775203854135 -0.0324478456055529 0.30730078757325 -0.446549547688165 0.12718989634021 0.133229105022227
            1 0.635665082648627 0.466306965120643 1.24647562794383 1.08065598756178 0.841377542119699 1.04601249480335 1.28393866595644 1.6448881770345 0.987943522251124 0.619578600258074 0.57360383756511 1.38138384087598
    - Comparison and verification
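The key-value design from the source code walkthrough can be sketched as a single-process Python simulation of the three jobs. The original Java classes are not shown, so several details here are assumptions: which mapper and reducer each job pairs together, the seeding rule (record index modulo k stands in for the unspecified way GenSeedMapper picks a center ID), and the use of a record index instead of a byte offset as the input key. All function names are hypothetical stand-ins for the classes named above.

```python
import math
from collections import defaultdict


def gen_seed_mapper(index, point, k):
    """GenSeedMapper stand-in. Assumption: center ID = record index mod k."""
    return (index % k, point)


def kmeans_mapper(point, centers):
    """KmeansMapper stand-in: key each point by its nearest center ID."""
    cid = min(centers, key=lambda c: math.dist(point, centers[c]))
    return (cid, point)


def kmeans_reducer(cid, points):
    """KmeansReducer stand-in: emit the mean <x y> of one cluster."""
    n = len(points)
    return (cid, (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n))


def result_reducer(cid, points):
    """ResultReducer stand-in: emit the list of points for one center ID."""
    return (cid, list(points))


def group_by_key(pairs):
    """Simulated shuffle: collect values under their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def run_kmeans(points, k, max_iter):
    # Job 1, generateSeed (assumed pairing: GenSeedMapper + KmeansReducer).
    groups = group_by_key(gen_seed_mapper(i, p, k) for i, p in enumerate(points))
    centers = dict(kmeans_reducer(c, ps) for c, ps in groups.items())
    # Job 2, kmeansIter (assumed pairing: KmeansMapper + KmeansReducer), repeated.
    # A center that attracts no points is dropped in this sketch.
    for _ in range(max_iter):
        groups = group_by_key(kmeans_mapper(p, centers) for p in points)
        centers = dict(kmeans_reducer(c, ps) for c, ps in groups.items())
    # Job 3, resultKmeans (assumed pairing: KmeansMapper + ResultReducer).
    groups = group_by_key(kmeans_mapper(p, centers) for p in points)
    return dict(result_reducer(c, ps) for c, ps in sorted(groups.items()))


# Small made-up data set with two obvious groups, k = 2, at most 5 iterations.
demo_points = [(0, 0), (0, 1), (10, 10), (10, 11)]
for cid, members in run_kmeans(demo_points, 2, 5).items():
    print(cid, members)
```

The final grouping depends on the seeding rule, so running this sketch on the ten points listed above is not guaranteed to reproduce the grouping printed by the original job.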