20121215 DevLOVE2012 Mahout on AWS

Parade of the Yellow Elephant Drivers
~Mahout on AWS~
都元ダイスケ (Daisuke Miyamoto)
2012-12-15 @DevLOVE2012

About me
• 都元ダイスケ (@daisuke_m)
• I'm a Java developer
• I come from java-ja (you know the rest)
Keywords: Java, object-oriented programming, Eclipse, 恭ライセンス, medicine, Mahout, Spring, XML, Jiemamy, DDD, Hadoop, OSGi, Haskell, Scala, Maven, Wicket, AWS, alcohol

works
• Nikkei Software (日経ソフトウエア)
  • Introductory Java articles
  • Eclipse articles

What is Mahout?
• Implemented in Java
• Scalable
• Open source
• A machine learning library

Typical machine learning tasks
• Recommendation
• Clustering
• Classification
• And many others

Applications and machine learning
• CRUD (create, read, update, delete)
• FILTER (where)
• AGGREGATE (count, sum, avg, max, min...)
• SORT (order by)
• INTELLIGENCE (machine learning)

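Scalable
• The accuracy of machine learning depends on the amount of data
• The amount of computation grows exponentially with the amount of data
• Large-scale computing resources are required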
• Hadoop (MapReduce)
• AWS Elastic MapReduce

Recommendation
[input]
1,101,5.0
1,102,3.0
1,103,2.5
2,101,2.0
2,102,2.5
2,103,5.0
2,104,2.0
...
recommendation
[output]
1128 [
1179:5.0,
3160:4.6582785, ...,
797:4.0637455
]
1136 [
33493:4.8670673,
6934:4.86497, ...,
230:4.335819
]
...

Input data (intro.csv)
1,101,5.0
1,102,3.0
1,103,2.5
2,101,2.0
2,102,2.5
2,103,5.0
2,104,2.0
3,101,2.5
3,104,4.0
3,105,4.5
3,107,5.0
4,101,5.0
4,103,3.0
4,104,4.5
4,106,4.0
5,101,4.0
5,102,3.0
5,103,2.0
5,104,4.0
5,105,3.5
5,106,4.0

A simple recommender
import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.*;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

// Load the "userID,itemID,rating" triples from the CSV file.
DataModel model = new FileDataModel(new File("intro.csv"));
// Compare users by the Pearson correlation of their ratings.
UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
// Use the 2 most similar users as the neighborhood.
UserNeighborhood neighborhood =
    new NearestNUserNeighborhood(2, similarity, model);
Recommender recommender =
    new GenericUserBasedRecommender(model, neighborhood, similarity);
// Ask for up to 2 recommended items for user 1.
List<RecommendedItem> recommendations = recommender.recommend(1, 2);
for (RecommendedItem recommendation : recommendations) {
    System.out.println(recommendation);
}

Result!
RecommendedItem[item:104, value:4.257081]

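How the recommendation works
• "Users" 1 to 5
• "Items" 101 to 107
• And scores

• Users 1 and 5 look similar
• Users 1 and 4 also look somewhat similar
• User 2 seems to have the opposite taste?
• No relationship with user 3 is visible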

• 1 vs 5 = 0.94
• 1 vs 4 = 0.99
• 1 vs 2 = -0.76
• 1 vs 3 = NaN
• 1 vs 1 = 1.0

Correlation coefficient
• 1 vs 1 = 1.0
• 1 vs 2 = -0.7642652566278799
• 1 vs 3 = NaN
• 1 vs 4 = 0.9999999999999998
• 1 vs 5 = 0.944911182523068
This is how much each user influences the predicted rating for user 1.

http://ja.wikipedia.org/wiki/相関係数
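The coefficients above can be reproduced without Mahout. The following is a minimal sketch in plain Java, assuming the textbook Pearson formula computed only over the items both users have rated. The class name `PearsonSketch` and its helper methods are hypothetical, and Mahout's own `PearsonCorrelationSimilarity` may differ in the last floating-point digits.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration class; not part of Mahout.
class PearsonSketch {
    // userID -> (itemID -> rating), filled with the intro.csv triples.
    static final Map<Integer, Map<Integer, Double>> RATINGS = new LinkedHashMap<>();

    static void rate(int user, int item, double rating) {
        RATINGS.computeIfAbsent(user, u -> new LinkedHashMap<>()).put(item, rating);
    }

    static {
        rate(1, 101, 5.0); rate(1, 102, 3.0); rate(1, 103, 2.5);
        rate(2, 101, 2.0); rate(2, 102, 2.5); rate(2, 103, 5.0); rate(2, 104, 2.0);
        rate(3, 101, 2.5); rate(3, 104, 4.0); rate(3, 105, 4.5); rate(3, 107, 5.0);
        rate(4, 101, 5.0); rate(4, 103, 3.0); rate(4, 104, 4.5); rate(4, 106, 4.0);
        rate(5, 101, 4.0); rate(5, 102, 3.0); rate(5, 103, 2.0);
        rate(5, 104, 4.0); rate(5, 105, 3.5); rate(5, 106, 4.0);
    }

    // Pearson correlation over the items rated by both users.
    // Returns NaN when the shared ratings have no variance (e.g. one shared item).
    static double pearson(int userA, int userB) {
        Map<Integer, Double> a = RATINGS.get(userA);
        Map<Integer, Double> b = RATINGS.get(userB);
        int n = 0;
        double sumA = 0.0, sumB = 0.0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            Double rb = b.get(e.getKey());
            if (rb != null) {
                n++;
                sumA += e.getValue();
                sumB += rb;
            }
        }
        if (n == 0) {
            return Double.NaN;
        }
        double meanA = sumA / n, meanB = sumB / n;
        double cov = 0.0, varA = 0.0, varB = 0.0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            Double rb = b.get(e.getKey());
            if (rb != null) {
                double da = e.getValue() - meanA;
                double db = rb - meanB;
                cov += da * db;
                varA += da * da;
                varB += db * db;
            }
        }
        return cov / Math.sqrt(varA * varB); // 0.0 / 0.0 yields NaN
    }

    public static void main(String[] args) {
        for (int other = 1; other <= 5; other++) {
            System.out.println("1 vs " + other + " = " + pearson(1, other));
        }
    }
}
```

User 3 shares only item 101 with user 1, so there is no variance to correlate and the result is NaN, which matches the table.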

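Weighted average
(weights: 0.94 for user 5, 0.99 for user 4)
Item 104: 4.25 = (0.94 × 4.0 + 0.99 × 4.5) / 1.93
Item 105: 3.50 = (0.94 × 3.5) / 0.94
Item 106: 4.00 = (0.94 × 4.0 + 0.99 × 4.0) / 1.93
The ratings from users 2 and 3 are no longer relied on, because their correlation coefficient is low or NaN.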
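The weighted-average step can be sketched in plain Java as follows. This is an illustration under stated assumptions, not Mahout's actual estimator: the class `WeightedAverageSketch` and the method `predict` are hypothetical names, only users 4 and 5 are treated as neighbors (matching the neighborhood size of 2 used earlier), and Mahout may apply extra rules when deciding which estimates to keep.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration class; not part of Mahout.
class WeightedAverageSketch {
    // Similarity-weighted mean of the neighbors' ratings for one item.
    // Users without a positive, non-NaN similarity are skipped.
    // Returns NaN when no neighbor rated the item.
    static double predict(Map<Integer, Double> similarityByUser,
                          Map<Integer, Double> ratingByUser) {
        double weightedSum = 0.0;
        double totalWeight = 0.0;
        for (Map.Entry<Integer, Double> e : ratingByUser.entrySet()) {
            Double w = similarityByUser.get(e.getKey());
            if (w == null || w.isNaN() || w <= 0.0) {
                continue; // not a neighbor, or not a trustworthy one
            }
            weightedSum += w * e.getValue();
            totalWeight += w;
        }
        return totalWeight == 0.0 ? Double.NaN : weightedSum / totalWeight;
    }

    // Similarities of users 4 and 5 to user 1, taken from the correlation table.
    static Map<Integer, Double> neighborsOfUser1() {
        Map<Integer, Double> sims = new LinkedHashMap<>();
        sims.put(4, 0.9999999999999998);
        sims.put(5, 0.944911182523068);
        return sims;
    }

    // All ratings for item 104 in intro.csv (userID -> rating).
    static Map<Integer, Double> ratingsForItem104() {
        Map<Integer, Double> ratings = new LinkedHashMap<>();
        ratings.put(2, 2.0);
        ratings.put(3, 4.0);
        ratings.put(4, 4.5);
        ratings.put(5, 4.0);
        return ratings;
    }

    public static void main(String[] args) {
        System.out.println("item 104: " + predict(neighborsOfUser1(), ratingsForItem104()));
    }
}
```

With the unrounded coefficients the estimate for item 104 comes out at about 4.257, in line with the value Mahout printed.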

Result!
(shown again)

Distributed recommendation
[input] (S3)
1,101,5.0
1,102,3.0
1,103,2.5
2,101,2.0
2,102,2.5
2,103,5.0
2,104,2.0
...
recommendation (EMR)
[output] (S3)
1128 [
1179:5.0,
3160:4.6582785, ...,
797:4.0637455
]
1136 [
33493:4.8670673,
6934:4.86497, ...,
230:4.335819
]
...

MovieLens 10M
http://www.grouplens.org/node/73

• 10,000 items
• 72,000 users
• 10 million ratings
Honestly, I think even this is still small-scale.

Preparing the input on S3
• Create a bucket: mahoutinaction-jp
• Upload two files
  • mahout/mahout-core-0.7-job.jar
  • input10m/mahout-10m-ratings.dat
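MovieLens 10M distributes its ratings as `UserID::MovieID::Rating::Timestamp` lines, while the recommender input shown earlier is comma-separated. The slides do not say how mahout-10m-ratings.dat was produced, so the following is only a guess at a possible preprocessing step. The class name `MovieLensToCsvSketch`, the decision to drop the timestamp, and the sample line in `main` are all assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical preprocessing helper; not taken from the slides.
class MovieLensToCsvSketch {
    // Converts "UserID::MovieID::Rating::Timestamp" to "UserID,MovieID,Rating".
    static String toCsvLine(String line) {
        String[] fields = line.split("::");
        if (fields.length < 3) {
            throw new IllegalArgumentException("Unexpected line: " + line);
        }
        return fields[0] + "," + fields[1] + "," + fields[2];
    }

    // Usage: java MovieLensToCsvSketch <input ratings file> <output csv file>
    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            // No files given: just demonstrate on an illustrative sample line.
            System.out.println(toCsvLine("1::122::5::838985046"));
            return;
        }
        try (BufferedReader in = Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8);
             BufferedWriter out = Files.newBufferedWriter(Paths.get(args[1]), StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.isEmpty()) {
                    continue; // skip blank lines
                }
                out.write(toCsvLine(line));
                out.newLine();
            }
        }
    }
}
```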

upload by code
import java.io.File;
import com.amazonaws.auth.*;
import com.amazonaws.services.s3.*;
import com.amazonaws.services.s3.model.Region;

// Replace the placeholders with your own credentials.
AWSCredentials cred = new BasicAWSCredentials(
    "AccessKeyID",
    "SecretAccessKey");
AmazonS3 s3 = new AmazonS3Client(cred);
// Create the bucket in the Tokyo region.
s3.createBucket("mahoutinaction-jp", Region.AP_Tokyo);
// Upload the Mahout job JAR.
s3.putObject(
    "mahoutinaction-jp",
    "mahout/mahout-core-0.7-job.jar",
    new File("mahout-core-0.7-job.jar"));
// Upload the ratings data.
s3.putObject(
    "mahoutinaction-jp",
    "input10m/mahout-10m-ratings.dat",
    new File("mahout-10m-ratings.dat"));

Launching EMR
• JAR Location
  mahoutinaction-jp/mahout/mahout-core-0.7-job.jar
• JAR Arguments
  org.apache.mahout.cf.taste.hadoop.item.RecommenderJob
  -Dmapred.map.tasks=40
  -Dmapred.reduce.tasks=19
  -Dmapred.input.dir=s3n://mahoutinaction-jp/input10m
  -Dmapred.output.dir=s3n://mahoutinaction-jp/output10m
  --numRecommendations 100
  --similarityClassname SIMILARITY_PEARSON_CORRELATION

compute by code
import java.util.Arrays;
import com.amazonaws.auth.*;
import com.amazonaws.services.elasticmapreduce.*;
import com.amazonaws.services.elasticmapreduce.model.*;
import com.amazonaws.services.elasticmapreduce.util.*;

// Replace the placeholders with your own credentials.
AWSCredentials cred = new BasicAWSCredentials(
    "AccessKeyID", "SecretAccessKey");
AmazonElasticMapReduce emr = new AmazonElasticMapReduceClient(cred);
// Use the Tokyo region endpoint.
emr.setEndpoint("elasticmapreduce.ap-northeast-1.amazonaws.com");
RunJobFlowRequest runRequest = new RunJobFlowRequest()
    .withName("mahout-10m")
    .withSteps( ... ) // detailed on the next slide
    .withInstances( ... ) // detailed on the next slide
    .withAmiVersion("2.1.4")
    .withLogUri("s3n://mahoutinaction-jp/log");
RunJobFlowResult runResult = emr.runJobFlow(runRequest);

// Full version of the request. Needs java.util.Arrays in addition to the EMR imports.
RunJobFlowRequest runRequest = new RunJobFlowRequest()
    .withName("mahout-10m")
    .withSteps(
        // Step 1: enable Hadoop debugging.
        new StepConfig()
            .withName("Setup Hadoop Debugging")
            .withActionOnFailure(ActionOnFailure.TERMINATE_JOB_FLOW)
            .withHadoopJarStep(
                new StepFactory("ap-northeast-1.elasticmapreduce")
                    .newEnableDebuggingStep()),
        // Step 2: run Mahout's RecommenderJob from the uploaded JAR.
        new StepConfig()
            .withName("Custom Jar")
            .withActionOnFailure(ActionOnFailure.TERMINATE_JOB_FLOW)
            .withHadoopJarStep(new HadoopJarStepConfig()
                .withJar("s3n://mahoutinaction-jp/mahout/mahout-core-0.7-job.jar")
                .withMainClass("org.apache.mahout.cf.taste.hadoop.item.RecommenderJob")
                .withArgs(Arrays.asList(
                    "-Dmapred.map.tasks=40",
                    "-Dmapred.reduce.tasks=19",
                    "-Dmapred.input.dir=s3n://mahoutinaction-jp/input10m",
                    "-Dmapred.output.dir=s3n://mahoutinaction-jp/output10m",
                    "--numRecommendations", "100",
                    "--similarityClassname", "SIMILARITY_PEARSON_CORRELATION"))))
    // 20 m1.small instances; the job flow ends when the steps finish.
    .withInstances(new JobFlowInstancesConfig()
        .withPlacement(new PlacementType("ap-northeast-1a"))
        .withInstanceCount(20)
        .withMasterInstanceType("m1.small")
        .withSlaveInstanceType("m1.small")
        .withKeepJobFlowAliveWhenNoSteps(false)
        .withHadoopVersion("0.20.205"))
    .withAmiVersion("2.1.4")
    .withLogUri("s3n://mahoutinaction-jp/logs");
Take your time reading this later.

watch by code
AmazonElasticMapReduce emr = ...;
RunJobFlowResult runResult = ...;
String jobFlowId = runResult.getJobFlowId();
DescribeJobFlowsRequest describeRequest =
    new DescribeJobFlowsRequest().withJobFlowIds(jobFlowId);
DescribeJobFlowsResult describeResult =
    emr.describeJobFlows(describeRequest);
JobFlowDetail detail = describeResult.getJobFlows().get(0);
JobFlowExecutionStatusDetail statusDetail =
    detail.getExecutionStatusDetail();
JobFlowExecutionState state =
    JobFlowExecutionState.fromValue(statusDetail.getState());
// COMPLETED, FAILED, TERMINATED, RUNNING, SHUTTING_DOWN,
// STARTING, WAITING, BOOTSTRAPPING
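The snippet above reads the state once. In practice you would probably poll until the job flow reaches a terminal state. Below is a minimal sketch of such a loop, kept free of the AWS SDK so it runs on its own: the state is supplied through a `Supplier<String>`, which in real use would wrap the `describeJobFlows` call. The class name `JobFlowPollerSketch`, the method `waitForTerminalState`, and the choice of COMPLETED, FAILED and TERMINATED as the terminal states are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical polling helper; not part of the AWS SDK.
class JobFlowPollerSketch {
    // States after which the job flow is assumed not to change any more.
    static final Set<String> TERMINAL_STATES =
        new HashSet<>(Arrays.asList("COMPLETED", "FAILED", "TERMINATED"));

    // Polls stateSource until a terminal state is seen or maxPolls is reached.
    // Returns the last observed state, which may be non-terminal on timeout.
    static String waitForTerminalState(Supplier<String> stateSource,
                                       long intervalMillis, int maxPolls) {
        String state = null;
        for (int i = 0; i < maxPolls; i++) {
            state = stateSource.get();
            if (TERMINAL_STATES.contains(state)) {
                return state;
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return state;
            }
        }
        return state;
    }

    public static void main(String[] args) {
        // Simulated state sequence standing in for repeated describeJobFlows calls.
        Iterator<String> simulated = Arrays.asList(
            "STARTING", "BOOTSTRAPPING", "RUNNING", "SHUTTING_DOWN", "COMPLETED").iterator();
        String finalState = waitForTerminalState(simulated::next, 10L, 10);
        System.out.println("final state: " + finalState);
    }
}
```

A real job takes minutes, so the polling interval would be far longer than the 10 ms used in this simulation.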

Retrieving the results
Several files have been generated at the specified location.

download by code
import java.io.InputStream;
import java.util.List;
import com.amazonaws.auth.*;
import com.amazonaws.services.s3.*;
import com.amazonaws.services.s3.model.*;

// Replace the placeholders with your own credentials.
AWSCredentials cred = new BasicAWSCredentials(
    "AccessKeyID",
    "SecretAccessKey");
AmazonS3 s3 = new AmazonS3Client(cred);
// List everything under the output prefix.
ObjectListing listing = s3.listObjects(
    "mahoutinaction-jp", "output10m");
List<S3ObjectSummary> summaries = listing.getObjectSummaries();
for (S3ObjectSummary summary : summaries) {
    System.out.println(summary.getKey());
    // Skip Hadoop's success marker file.
    if (summary.getKey().endsWith("/_SUCCESS")) {
        continue;
    }
    S3Object obj = s3.getObject("mahoutinaction-jp", summary.getKey());
    InputStream in = obj.getObjectContent();
    // ... (read the stream, then close it)
}
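Once a result file has been downloaded, each line still has to be parsed. The sketch below assumes that every output line holds a user ID followed by a bracketed, comma-separated list of `itemID:score` pairs on a single line, as the output sample in these slides suggests. The exact separator between the user ID and the bracket is an assumption, and the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical parser for one line of recommender output; not part of Mahout.
class RecommendationLineParserSketch {
    // Extracts the user ID that precedes the opening bracket.
    static long parseUserId(String line) {
        int open = line.indexOf('[');
        if (open < 0) {
            throw new IllegalArgumentException("No '[' in line: " + line);
        }
        return Long.parseLong(line.substring(0, open).trim());
    }

    // Extracts itemID -> score pairs, preserving the order in the line.
    static Map<Long, Double> parseItems(String line) {
        int open = line.indexOf('[');
        int close = line.lastIndexOf(']');
        if (open < 0 || close < open) {
            throw new IllegalArgumentException("Malformed line: " + line);
        }
        Map<Long, Double> items = new LinkedHashMap<>();
        String body = line.substring(open + 1, close).trim();
        if (body.isEmpty()) {
            return items; // no recommendations for this user
        }
        for (String pair : body.split(",")) {
            String[] parts = pair.trim().split(":");
            if (parts.length != 2) {
                throw new IllegalArgumentException("Malformed pair: " + pair);
            }
            items.put(Long.parseLong(parts[0].trim()), Double.parseDouble(parts[1].trim()));
        }
        return items;
    }

    public static void main(String[] args) {
        // Sample built from values shown in the slides, with the elided items left out.
        String line = "1128\t[1179:5.0,3160:4.6582785,797:4.0637455]";
        System.out.println(parseUserId(line) + " -> " + parseItems(line));
    }
}
```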

Summary
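• Machine learning gives your application somewhat intelligent features
• Distributed and non-distributed algorithms
  • Non-distributed: run it online
  • Distributed: run it on AWS EMR
• These slides will be uploaded right after this talk. Check @daisuke_m on Twitter!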
