Lightning-fast cluster computing
A quick review
Problem
Solution
MapReduce?
Turn everything into MapReduce!
But how do you turn SQL like this into MapReduce?
SELECT LAT_N, CITY, TEMP_F
FROM STATS, STATION
WHERE MONTH = 7
AND STATS.ID = STATION.ID
ORDER BY TEMP_F;
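One way to picture the answer is a reduce-side join. The sketch below is plain Python, not Hadoop code, and only imitates the map, shuffle, and reduce phases in a single process. The sample rows and the column layout of STATS and STATION are assumptions made up for illustration, as are all the function names.

```python
from collections import defaultdict

# Hypothetical sample rows; the column layout is an assumption.
STATION = [  # (ID, CITY, LAT_N)
    (13, "Phoenix", 33),
    (44, "Denver", 40),
    (66, "Caribou", 47),
]
STATS = [  # (ID, MONTH, TEMP_F)
    (13, 1, 57.4), (13, 7, 91.7),
    (44, 1, 27.3), (44, 7, 74.8),
    (66, 1, 6.7), (66, 7, 65.8),
]


def map_phase():
    """Tag each row with its table and emit it under the join key (ID)."""
    for sid, city, lat_n in STATION:
        yield sid, ("station", (lat_n, city))
    for sid, month, temp_f in STATS:
        if month == 7:  # WHERE MONTH = 7 is applied on the map side
            yield sid, ("stats", temp_f)


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Join the station rows and stats rows that share an ID."""
    for values in groups.values():
        stations = [v for tag, v in values if tag == "station"]
        temps = [v for tag, v in values if tag == "stats"]
        for lat_n, city in stations:
            for temp_f in temps:
                yield lat_n, city, temp_f


def run_query():
    joined = reduce_phase(shuffle(map_phase()))
    # ORDER BY TEMP_F: on a real cluster this would typically be a second job.
    return sorted(joined, key=lambda row: row[2])


if __name__ == "__main__":
    for row in run_query():
        print(row)
```

Even in this toy form, a five-line query needs three hand-written phases plus a separate sort, which is the point of the question above.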
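Turn everything into MapReduce!
What about machine learning / data analysis work like this?
"From the nationwide real-estate transaction price data published every month since 2007, pick just the 5 meaningful variables out of the 140 that could have an effect."
"Oh, and the deadline is tomorrow."
And the code is this long? (For code that just counts words…)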
package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every token in the input line.
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the counts emitted for each word.
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Job.getInstance replaces the deprecated Job constructor.
        Job job = Job.getInstance(conf, "wordcount");
        job.setJarByClass(WordCount.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        // args[0] is the input path, args[1] the output path.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.waitForCompletion(true);
    }
}
Tools naturally get better as time goes by
Generality
Underneath its high-level tools, Spark lets you do every kind of job as it is.
Easy to use.
Supports Java, Scala, and Python.
text_file = spark.textFile("hdfs://...")
counts = (text_file.flatMap(lambda line: line.split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))
Word count in Spark's Python API
Runs in all kinds of distributed environments.
● Runs on Hadoop, Mesos, standalone, or in the cloud.
● Can read data from HDFS, Cassandra, HBase, S3, and more.
It is fast, too.
Runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
Logistic regression in Hadoop and Spark
It even has its own Web UI…
About Spark
● It is a tool, not a library.
○ You define the work you want done on top of this tool
○ and then run it.
Starting with standalone mode
Part 2: Let's try it!
vagrant up / vagrant ssh
spark-shell
pyspark - Python Spark shell
Wordcount : Scala
val f = sc.textFile("README.md")
val wc = f.flatMap(l => l.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
wc.saveAsTextFile("wc_out.txt")
Wordcount : Scala
val f = sc.textFile("README.md")
===================
def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String]
===================
Read a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and return it as an RDD of Strings.
Wordcount : Scala
val wc = f.flatMap(l => l.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
Wordcount : Scala
val wc = f.flatMap(l => l.split(" "))
: split each line into individual words
Wordcount : Scala
val wc = f.flatMap(l => l.split(" ")).map(word => (word, 1))
: build (Key, Value) pairs of the form (word, 1)
Wordcount : Scala
val wc = f.flatMap(l => l.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
: then add them all up by Key.
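To see the intermediate value after each of the three steps, the same pipeline can be imitated in plain Python on a tiny input. This is only a single-process sketch, not Spark: flat_map and reduce_by_key are hypothetical helper names standing in for the RDD methods, and the two input lines are made up.

```python
from functools import reduce
from itertools import chain


def flat_map(func, items):
    """Apply func to every item and flatten the resulting lists into one."""
    return list(chain.from_iterable(func(x) for x in items))


def reduce_by_key(func, pairs):
    """Combine all values that share a key using the two-argument func."""
    grouped = {}
    for key, value in pairs:
        grouped.setdefault(key, []).append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}


lines = ["to be or", "not to be"]

# Step 1: split each line into individual words.
words = flat_map(lambda l: l.split(" "), lines)

# Step 2: turn every word into a (word, 1) pair.
pairs = [(word, 1) for word in words]

# Step 3: add up the 1s for each distinct word.
counts = reduce_by_key(lambda a, b: a + b, pairs)

print(words)   # ['to', 'be', 'or', 'not', 'to', 'be']
print(pairs)
print(counts)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

The function passed to reduce_by_key plays the role of `_ + _` in the Scala version: it only needs to say how two values for the same key are merged.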
Wordcount : Scala
scala> wc.take(20)
…….
finished: take at <console>:26, took 0.081425 s
res6: Array[(String, Int)] = Array((package,1), (For,2), (processing.,1), (Programs,1), (Because,1), (The,1), (cluster.,1), (its,1), ([run,1), (APIs,1), (computation,1), (Try,1), (have,1), (through,1), (several,1), (This,2), ("yarn-cluster",1), (graph,1), (Hive,2), (storage,1))
Wordcount : Scala
wc.saveAsTextFile("wc_out.txt")
==========================
Save to a file
What if we run the code we just wrote like this?
Simplifying Big Data Analysis
with Apache Spark
Matei Zaharia
April 27, 2015
Let's move from disk to memory.
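A rough way to see why this matters for repeated passes over the same data: the plain-Python sketch below counts how often a slow source is read with and without keeping the data in memory. It is not Spark code. CountingSource is a hypothetical stand-in for a disk read, and the per-pass computation is arbitrary.

```python
class CountingSource:
    """Stand-in for a slow disk read; counts how often it is loaded."""

    def __init__(self, records):
        self._records = list(records)
        self.loads = 0

    def load(self):
        self.loads += 1
        return list(self._records)


def passes_without_cache(source, iterations):
    """Re-read the source on every pass, as a chain of separate jobs would."""
    results = []
    for i in range(iterations):
        data = source.load()
        results.append(sum(x * (i + 1) for x in data))
    return results


def passes_with_cache(source, iterations):
    """Read the source once and reuse the in-memory copy on every pass."""
    data = source.load()
    return [sum(x * (i + 1) for x in data) for i in range(iterations)]


disk_source = CountingSource([1, 2, 3])
cached_source = CountingSource([1, 2, 3])

print(passes_without_cache(disk_source, 3), disk_source.loads)
print(passes_with_cache(cached_source, 3), cached_source.loads)
```

Both versions compute the same results; only the number of reads differs, and that gap grows with the number of passes.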
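In other words, you just hand the work and the instructions to each node in the cluster like this.
Spark Model
● Write programs that transform data
● Resilient Distributed Datasets (RDDs)
○ Collections of objects, stored in memory or on disk, that are distributed across the cluster
○ Built through parallel transformations (map, filter, …)
○ Automatically rebuilt on failure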
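The "automatically rebuilt" idea can be sketched with a toy class that remembers how a dataset was derived rather than only its contents. This is a plain-Python illustration under simplifying assumptions (one process, no partitions), not how Spark is implemented. ToyRDD and lose_cache are hypothetical names.

```python
class ToyRDD:
    """Toy dataset that records how it was derived (its lineage)."""

    def __init__(self, source, lineage=()):
        self._source = source            # zero-argument function returning base records
        self._lineage = tuple(lineage)   # ordered (kind, func) steps
        self._cache = None               # in-memory copy of the computed result

    def map(self, func):
        return ToyRDD(self._source, self._lineage + (("map", func),))

    def filter(self, func):
        return ToyRDD(self._source, self._lineage + (("filter", func),))

    def collect(self):
        """Compute the result by replaying the lineage, unless a copy is cached."""
        if self._cache is None:
            data = list(self._source())
            for kind, func in self._lineage:
                if kind == "map":
                    data = [func(x) for x in data]
                else:
                    data = [x for x in data if func(x)]
            self._cache = data
        return list(self._cache)

    def lose_cache(self):
        """Simulate a failure that drops the in-memory copy."""
        self._cache = None


base = ToyRDD(lambda: range(1, 7))
evens_squared = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

first = evens_squared.collect()
evens_squared.lose_cache()        # the computed data is gone...
second = evens_squared.collect()  # ...but the lineage rebuilds it

print(first, second)  # [4, 16, 36] [4, 16, 36]
```

Because each transformation returns a new object and never mutates its parent, replaying the recorded steps always gives the same result.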
Making interactive Big Data
Applications Fast AND Easy
Holden Karau
Each line of code is an RDD!
val f = sc.textFile("README.md")
val wc = f.flatMap(l => l.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
wc.saveAsTextFile("wc_out.txt")
Supported operations
Built-in libraries
● A wide range of functionality is made available on top of RDDs
● The caching + DAG model is efficient enough to run all of it.
● Bundling all the libraries into a single program is faster.
MLlib
Vectors, Matrices = RDD[Vector]
Iterative computation
points = sc.textFile("data.txt").map(parsePoint)
model = KMeans.train(points, 10)
model.predict(newPoint)
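"Iterative computation" here means that training makes many passes over the same points. The loop can be sketched in plain Python for one-dimensional data. This is a minimal illustration of the k-means idea, not MLlib's implementation. The function names, the sample points, and the starting centers are all assumptions.

```python
def closest(point, centers):
    """Return the index of the center nearest to the point."""
    return min(range(len(centers)), key=lambda i: abs(point - centers[i]))


def kmeans_1d(points, centers, iterations=10):
    """Alternate between assigning points to centers and moving the centers."""
    centers = list(centers)
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:  # every iteration scans all points again
            clusters[closest(p, centers)].append(p)
        centers = [
            sum(c) / len(c) if c else centers[i]  # keep a center that lost all points
            for i, c in enumerate(clusters)
        ]
    return centers


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers = kmeans_1d(points, centers=[0.0, 5.0])

print(centers)                # [2.0, 11.0]
print(closest(4.0, centers))  # 0
```

Each iteration reads the full set of points, which is why keeping them in memory between passes pays off for this kind of workload.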
GraphX
Represents graphs as RDDs of vertices and edges.
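The representation itself is simple enough to sketch in plain Python: one collection of vertices and one of edges, with graph questions answered by ordinary operations over those collections. This is not the GraphX API. The sample graph and the helper names are made up for illustration.

```python
from collections import Counter

# Hypothetical sample graph.
vertices = [(1, "alice"), (2, "bob"), (3, "carol")]  # (vertex id, attribute)
edges = [  # (source id, destination id, attribute)
    (1, 2, "follows"),
    (1, 3, "follows"),
    (2, 3, "follows"),
]


def out_degrees(vertices, edges):
    """Count outgoing edges per vertex, including vertices with none."""
    counts = Counter(src for src, _dst, _attr in edges)
    return {vid: counts.get(vid, 0) for vid, _name in vertices}


def neighbors(edges, vid):
    """Return the ids that the given vertex points to."""
    return [dst for src, dst, _attr in edges if src == vid]


print(out_degrees(vertices, edges))  # {1: 2, 2: 1, 3: 0}
print(neighbors(edges, 1))           # [2, 3]
```

Since both collections are flat lists of tuples, the same map, filter, and reduce-by-key style operations used for word count apply to them directly.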
Conclusion
Spark wants to unify all of your data sources, workloads, and environments.
Q&A

Introduction to Spark, Part 2
