MapReduce Execution Samples (K-mer Counting, K-means Clustering)
 


MapReduce Sample Code and How to Run It (second seminar for the Bioinformatics lab)
Author: 송주영 (bt22dr@gmail.com)

1. K-mer Counting

- K-mer
A run of k consecutive bases within a nucleotide sequence.

Sequence: ACGTACGTACGTA
K-mers (length = 6): ACGTAC CGTACG GTACGT TACGTA ACGTAC CGTACG GTACGT TACGTA

- How to run
Usage: KmerCounting <in> <out> <k-mer length>

- Detailed run
    [hadoop@cudatest hadoop]$ cd $HADOOP_HOME
    [hadoop@cudatest hadoop]$ bin/hadoop dfs -ls Bio/input
    Found 1 items
    -rw-r--r-- 1 hadoop supergroup 33 2012-07-10 11:27 /user/hadoop/Bio/input/kmer_counting.txt
    [hadoop@cudatest hadoop]$
    [hadoop@cudatest hadoop]$ bin/hadoop dfs -cat Bio/input/kmer_counting.txt
    ATGAACCTTA
    GAACAACTTA
    TTTAGGCAAC
    [hadoop@cudatest hadoop]$ bin/hadoop jar example.jar KmerCounting Bio/input Bio/output 3
    ****hdfs://cudatest:9000/user/hadoop/Bio/input
    12/07/10 18:08:55 INFO input.FileInputFormat: Total input paths to process : 1
    12/07/10 18:08:55 INFO util.NativeCodeLoader: Loaded the native-hadoop library
    12/07/10 18:08:55 WARN snappy.LoadSnappy: Snappy native library not loaded
    12/07/10 18:08:55 INFO mapred.JobClient: Running job: job_201207051452_0021
    12/07/10 18:08:56 INFO mapred.JobClient: map 0% reduce 0%
    12/07/10 18:09:09 INFO mapred.JobClient: map 100% reduce 0%
    12/07/10 18:09:21 INFO mapred.JobClient: map 100% reduce 100%
    12/07/10 18:09:27 INFO mapred.JobClient: Job complete: job_201207051452_0021
    12/07/10 18:09:27 INFO mapred.JobClient: Counters: 29
    … (middle omitted) …
    12/07/10 18:09:27 INFO mapred.JobClient: Combine output records=0
    12/07/10 18:09:27 INFO mapred.JobClient: Physical memory (bytes) snapshot=275869696
    12/07/10 18:09:27 INFO mapred.JobClient: Reduce output records=16
    12/07/10 18:09:27 INFO mapred.JobClient: Virtual memory (bytes) snapshot=1186914304
    12/07/10 18:09:27 INFO mapred.JobClient: Map output records=24
    [hadoop@cudatest hadoop]$ bin/hadoop dfs -cat Bio/output/part-r-00000
    AAC 4
    ACA 1
    ACC 1
    ACT 1
    AGG 1
    ATG 1
    CAA 2
    CCT 1
    CTT 2
    GAA 2
    GCA 1
    GGC 1
    TAG 1
    TGA 1
    TTA 3
    TTT 1

- Comparison and verification
<Source: http://schatzlab.cshl.edu/presentations/2010-03-15.XGen-Scalable%20Solutions.pdf>
※ The source material above contains an error: [CTT : 1] and [GAA : 1] should be corrected to [CTT : 2] and [GAA : 2].
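The counting logic behind this run can be sketched in plain Python, without Hadoop. This is only an illustration of the map, shuffle, and reduce steps, not the code inside example.jar. The function names are hypothetical, and the input is assumed to be three lines of ten bases each, which is consistent with the 33-byte file size and the 24 map output records in the log.

```python
from collections import defaultdict

def kmer_map(line, k):
    """Map step: emit (k-mer, 1) for every window of length k in one input line."""
    seq = line.strip()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k], 1

def kmer_reduce(kmer, counts):
    """Reduce step: sum the counts collected for one k-mer."""
    return kmer, sum(counts)

def count_kmers(lines, k):
    """Run map, shuffle (group by key), and reduce in memory."""
    groups = defaultdict(list)
    for line in lines:
        for kmer, one in kmer_map(line, k):
            groups[kmer].append(one)
    # Keys are processed in sorted order, as in the part-r-00000 listing.
    return dict(kmer_reduce(kmer, groups[kmer]) for kmer in sorted(groups))

# Assumed layout of kmer_counting.txt: three lines of ten bases each.
lines = ["ATGAACCTTA", "GAACAACTTA", "TTTAGGCAAC"]
counts = count_kmers(lines, 3)
for kmer, n in counts.items():
    print(kmer, n)
```

Because k-mers are extracted per line, windows that would span a line break are not counted; that is why ten bases per line yield eight 3-mers each.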
2. K-means Clustering

An algorithm that groups the given data into k clusters. Each data point is assigned to the nearest of the k clusters, and the computation is repeated so that the sum of the distances between every data point and the centroid of its cluster is minimized.

- How to run
Usage: KmeansClustering <in> <out> <k: # of cluster> <x: maximum iteration>

- Detailed run

1. Generate the input data
    [hadoop@cudatest hadoop]$ R
    > library(rhdfs)
    > library(rmr)
    > x <- rbind(matrix(rnorm(10, sd = 0.3), ncol = 2),
    + matrix(rnorm(10, mean = 1, sd = 0.3), ncol = 2))
    > to.dfs(x, "/user/hadoop/Kmeans/input/kmeans_data_10.txt", format="text")
    [1] "/user/hadoop/Kmeans/input/kmeans_test.txt"
    > plot(x, col = cl$cluster)

2. Run the clustering
    [hadoop@cudatest hadoop]$ bin/hadoop dfs -ls Kmeans
    Found 1 items
    drwxr-xr-x - hadoop supergroup 0 2012-07-13 17:42 /user/hadoop/Kmeans/input
    [hadoop@cudatest hadoop]$ bin/hadoop dfs -ls Kmeans/input
    Found 1 items
    -rw-r--r-- 1…2012-07-13 17:41 /user/hadoop/Kmeans/input/kmeans_data_10.txt
    [hadoop@cudatest hadoop]$ bin/hadoop jar BQR.jar com.skcc.services.BQR.KmeansClustering Kmeans/input Kmeans/output 2 5
    12/07/13 17:55:29 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
    ****hdfs://cudatest:9000/user/hadoop/Kmeans/input
    12/07/13 17:55:29 INFO input.FileInputFormat: Total input paths to process : 1
    12/07/13 17:55:29 INFO util.NativeCodeLoader: Loaded the native-hadoop library
    12/07/13 17:55:29 WARN snappy.LoadSnappy: Snappy native library not loaded
    12/07/13 17:55:29 INFO mapred.JobClient: Running job: job_201207131700_0019
    12/07/13 17:55:30 INFO mapred.JobClient: map 0% reduce 0%
    … (omitted) …
    [hadoop@cudatest hadoop]$ bin/hadoop dfs -cat Kmeans/output/5/part-r-00000
    0 -0.220571493559915 0.0174640983069702 0.191775203854135 -0.0324478456055529 0.30730078757325 -0.446549547688165 0.12718989634021 0.133229105022227
    1 0.635665082648627 0.466306965120643 1.24647562794383 1.08065598756178 0.841377542119699 1.04601249480335 1.28393866595644 1.6448881770345 0.987943522251124 0.619578600258074 0.57360383756511 1.38138384087598
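The assign-and-update loop described at the start of this section can be sketched in plain Python on the same ten points. This is a minimal in-memory illustration under stated assumptions, not the code in BQR.jar: the seeding rule (the first k points) is an assumption, since the handout does not say how seeds are chosen, and the cost tracked here is the sum of squared Euclidean distances, which is the quantity the standard k-means update is known to reduce.

```python
import math

POINTS = [
    (-0.220571493559915, 0.0174640983069702),
    (0.191775203854135, -0.0324478456055529),
    (0.30730078757325, -0.446549547688165),
    (0.12718989634021, 0.133229105022227),
    (0.635665082648627, 0.466306965120643),
    (1.24647562794383, 1.08065598756178),
    (0.841377542119699, 1.04601249480335),
    (1.28393866595644, 1.6448881770345),
    (0.987943522251124, 0.619578600258074),
    (0.57360383756511, 1.38138384087598),
]

def kmeans(points, k, max_iter):
    """Plain k-means: assign each point to its nearest centroid, then recompute means."""
    centroids = [points[i] for i in range(k)]  # assumption: seed with the first k points
    labels, history = None, []
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:  # assignments stopped changing: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
        # Cost after the update: sum of squared distances to the assigned centroid.
        history.append(
            sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
        )
    return labels, centroids, history

labels, centroids, history = kmeans(POINTS, k=2, max_iter=5)
print(labels)
print(centroids)
print(history)
```

With this seeding the loop settles on the same grouping as the part-r-00000 listing (four points in cluster 0, six in cluster 1); a different seeding rule can converge to a different grouping.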
- Source code description
The code consists of four Mapper and Reducer classes in total, and the clustering process is made up of three MapReduce jobs:
    generateSeed( ) : generates the centroids used as the initial seed
    kmeansIter( ) : clusters the data while updating the centroids; run repeatedly
    resultKmeans( ) : produces the final result

Overall structure diagram

Key-Value design
    map : (k1, v1) -> list(k2, v2)
    reduce : (k2, list(v2)) -> list(v3)

GenSeedMapper
    - Input : [ key: line offset, value: <x y> ]
    - Output : [ key: center ID, value: <x y> ]
KmeansReducer
    - Input : [ key: center ID, value: list(<x y>) ]
    - Output : [ key: center ID, value: mean <x y> of that cluster ]
KmeansMapper
    - Setup : [ read the result of the previous step's Reducer ]
    - Input : [ key: line offset, value: <x y> ]
    - Output : [ key: ID of the center nearest to <x y>, value: x y ]
ResultReducer
    - Input : [ key: center ID, value: list(<x y>) ]
    - Output : [ key: center ID, value: list of the <x y> belonging to that ID ]

- Input/Output

Input
    -0.220571493559915 0.0174640983069702
    0.191775203854135 -0.0324478456055529
    0.30730078757325 -0.446549547688165
    0.12718989634021 0.133229105022227
    0.635665082648627 0.466306965120643
    1.24647562794383 1.08065598756178
    0.841377542119699 1.04601249480335
    1.28393866595644 1.6448881770345
    0.987943522251124 0.619578600258074
    0.57360383756511 1.38138384087598

Output
    0 -0.220571493559915 0.0174640983069702 0.191775203854135 -0.0324478456055529 0.30730078757325 -0.446549547688165 0.12718989634021 0.133229105022227
    1 0.635665082648627 0.466306965120643 1.24647562794383 1.08065598756178 0.841377542119699 1.04601249480335 1.28393866595644 1.6448881770345 0.987943522251124 0.619578600258074 0.57360383756511 1.38138384087598

- Comparison and verification
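The key-value design above can be mimicked with a small in-memory stand-in for a MapReduce job. The sketch below is hypothetical throughout: `run_job` replaces Hadoop's shuffle, the record index stands in for the line offset, and the seeding rule in `gen_seed_mapper` (the first k records become centers 0 to k-1) is an assumption, because the handout does not show how GenSeedMapper picks center IDs. Only the mapper and reducer roles and the three-job structure follow the description.

```python
import math
from collections import defaultdict

POINTS = [
    (-0.220571493559915, 0.0174640983069702), (0.191775203854135, -0.0324478456055529),
    (0.30730078757325, -0.446549547688165), (0.12718989634021, 0.133229105022227),
    (0.635665082648627, 0.466306965120643), (1.24647562794383, 1.08065598756178),
    (0.841377542119699, 1.04601249480335), (1.28393866595644, 1.6448881770345),
    (0.987943522251124, 0.619578600258074), (0.57360383756511, 1.38138384087598),
]

def run_job(records, mapper, reducer):
    """One job in memory: map every record, group values by key, reduce each group."""
    groups = defaultdict(list)
    for offset, value in enumerate(records):  # record index stands in for line offset
        for k2, v2 in mapper(offset, value):
            groups[k2].append(v2)
    return {k2: reducer(k2, groups[k2]) for k2 in sorted(groups)}

def gen_seed_mapper(k):
    def mapper(offset, point):
        if offset < k:  # assumed seeding rule: first k records are the seeds
            yield offset, point
    return mapper

def kmeans_mapper(centers):
    # "Setup": the mapper is built from the previous reducer's output.
    def mapper(offset, point):
        yield min(centers, key=lambda c: math.dist(point, centers[c])), point
    return mapper

def kmeans_reducer(center_id, points):
    """Mean <x y> of the points assigned to one center."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def result_reducer(center_id, points):
    """List of the <x y> points that belong to one center."""
    return list(points)

def kmeans_mapreduce(points, k, max_iter):
    centers = run_job(points, gen_seed_mapper(k), kmeans_reducer)           # generateSeed
    for _ in range(max_iter):                                               # kmeansIter
        centers = run_job(points, kmeans_mapper(centers), kmeans_reducer)
    return run_job(points, kmeans_mapper(centers), result_reducer)          # resultKmeans

result = kmeans_mapreduce(POINTS, k=2, max_iter=5)
for center_id, members in result.items():
    print(center_id, " ".join(f"{x} {y}" for x, y in members))
```

Passing the previous centers into `kmeans_mapper` mirrors the Setup step, where each iteration's mapper reads the preceding Reducer's result before processing any record.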