Seven Concurrency Models: Lambda Architecture

Seven Concurrency Models
Lambda Architecture
아꿈사 (Akkumsa), Sunggon Song

Lambda Architecture
Lambda Architecture is a term popularized by Nathan Marz for a scalable and fault-tolerant data processing
architecture.
It is based on his experience working on distributed data processing systems at BackType and Twitter.

Lambda Architecture
1. All data entering the system is dispatched to both the
batch layer and the speed layer for processing.
2. The batch layer has two functions: (i) managing the
master dataset (an immutable, append-only set of
raw data), and (ii) to pre-compute the batch views.
3. The serving layer indexes the batch views so that
they can be queried in low-latency, ad-hoc way.
4. The speed layer compensates for the high latency of
updates to the serving layer and deals with recent
data only.
5. Any incoming query can be answered by merging
results from batch views and real-time views.
Source: http://lambda-architecture.net/
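Step 5 above can be illustrated with a small sketch. It assumes, purely for illustration, that both views map a contributor id to a count; the class and method names are hypothetical and not part of any Lambda Architecture library.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of step 5: a query merges a batch view with a real-time view.
// The view layout (contributor id -> count) is an assumption for illustration.
class MergedViewQuery {
    // Sum the precomputed batch count and the recent real-time count for one key.
    static long query(Map<Integer, Long> batchView,
                      Map<Integer, Long> realtimeView, int key) {
        return batchView.getOrDefault(key, 0L) + realtimeView.getOrDefault(key, 0L);
    }

    public static void main(String[] args) {
        Map<Integer, Long> batchView = new HashMap<>();
        batchView.put(42, 100L);   // computed by the last batch run
        Map<Integer, Long> realtimeView = new HashMap<>();
        realtimeView.put(42, 3L);  // events that arrived since that run
        realtimeView.put(7, 1L);   // a key the batch layer has not seen yet
        System.out.println(query(batchView, realtimeView, 42)); // prints 103
        System.out.println(query(batchView, realtimeView, 7));  // prints 1
    }
}
```

Summation only works because counts are additive; other kinds of views would need their own merge rule.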

Batch layer and Speed layer
[Diagram labels: New data stream (Kafka); All Data; Hadoop MR; Storm; Cassandra, HBase, Riak; ElephantDB, Voldemort; Query]

Hadoop MapReduce
● Mapper :
○ Processes input key/value pair,
○ Produces set of intermediate
pairs.
● Reducer :
○ Combines all intermediate
values for a particular key,
○ Produces a set of merged
output values
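The mapper and reducer roles above can be tried without a cluster. The following is a minimal in-memory sketch, not Hadoop itself: the class name `MapReduceSketch` and its methods are hypothetical, and the grouping step stands in for the shuffle that Hadoop performs between the two phases.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory imitation of the map -> group by key -> reduce flow (word count).
class MapReduceSketch {
    // Map phase: one input line produces intermediate (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) pairs.add(Map.entry(word, 1));
        }
        return pairs;
    }

    // Reduce phase: all intermediate values for one key become one merged value.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    // Runs map on every line, groups the pairs by key, then reduces each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines)
            for (Map.Entry<String, Integer> pair : map(line))
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet())
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        return result;
    }

    public static void main(String[] args) {
        // prints {dog=1, fox=1, lazy=1, quick=1, the=2}
        System.out.println(run(List.of("the quick fox", "the lazy dog")));
    }
}
```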

Mapper
public static class Map extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);

    // Emit (word, 1) for every word in the input line.
    // Words (not shown on this slide) splits a line into words.
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        Iterable<String> words = new Words(line);
        for (String word: words)
            context.write(new Text(word), one);
    }
}

Reducer
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    // Sum all the counts emitted for one word and write (word, total).
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val: values)
            sum += val.get();
        context.write(key, new IntWritable(sum));
    }
}

Driver
public class WordCount extends Configured implements Tool {
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        Job job = Job.getInstance(conf, "wordcount");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory
        boolean success = job.waitForCompletion(true);           // block until the job finishes
        return success ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new WordCount(), args);
        System.exit(res);
    }
}

Running locally
$ hadoop jar target/wordcount-1.0-jar-with-dependencies.jar input output

Running on Amazon EMR
● EMR : Elastic MapReduce
● Input/Output : Amazon S3
● Uploading large files : use an Amazon EC2 instance → bandwidth between Amazon S3 and EC2 is
high

Creating a cluster - Amazon EMR
● $ elastic-mapreduce --create --name wordcount --num-instances 11 \
    --master-instance-type m1.large --slave-instance-type m1.large \
    --ami-version 3.0.2 --jar s3://pb7con-lambda/wordcount.jar \
    --arg s3://pb7con-wikipedia/text --arg s3://pb7con-wikipedia/Counts
  Created job flow j-2LSRGPBSR79ZV

Monitoring progress - Amazon EMR
● $ elastic-mapreduce --jobflow j-2LSRGPBSR79ZV --ssh
● $ tail -f /mnt/var/log/hadoop/steps/1/syslog

Inspecting the results - Amazon EMR
● When the job completes, you will find several files in the S3 bucket.
part-r-00000
part-r-00001
part-r-00002
….
part-r-00028

Processing XML - Driver
public int run(String[] args) throws Exception {
    Configuration conf = getConf();
    conf.set("xmlinput.start", "<text");    // each record starts at a <text ...> tag
    conf.set("xmlinput.end", "</text>");    // and ends at the matching </text>
    Job job = Job.getInstance(conf, "wordcount");
    job.setJarByClass(WordCount.class);
    job.setInputFormatClass(XmlInputFormat.class);
    job.setMapperClass(Map.class);
    job.setCombinerClass(Reduce.class);     // pre-sum counts on the map side
    job.setReducerClass(Reduce.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    boolean success = job.waitForCompletion(true);
    return success ? 0 : 1;
}

Processing XML - Mapper
public static class Map extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final static Pattern textPattern =
        Pattern.compile("^<text.*>(.*)</text>$", Pattern.DOTALL);

    // Extract the body of the <text> element, then emit (word, 1) for each word.
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String text = value.toString();
        Matcher matcher = textPattern.matcher(text);
        if (matcher.find()) {
            Iterable<String> words = new Words(matcher.group(1));
            for (String word: words)
                context.write(new Text(word), one);
        }
    }
}

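Is speed the only reason to use Hadoop?
Large-scale distributed processing
It can handle and recover from failures
● Node failure : the map and reduce tasks that were running on the failed node are retried (re-executed)
● Disk failure : a fault-tolerant distributed file system that stores data redundantly on multiple nodes
● Finite memory : instead of loading the whole dataset into memory, key/value pairs are stored in HDFS during processing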

Day 1 wrap-up
● Hadoop streaming API : https://hadoop.apache.org/docs/r1.2.1/streaming.html
● Hadoop Pipes API : https://hadoop.apache.org/docs/r1.2.1/api/org/apache/hadoop/mapred/pipes/package-summary.html
● Hadoop Java/Scala APIs
○ Scalding : http://www.cascading.org/projects/scalding/
○ Cascading : http://www.cascading.org/projects/cascading/
○ Cascalog : http://cascalog.org/

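Problems with traditional data systems
● Scale - scaling out with replication and sharding gets harder as the number of machines and the query volume grow
● Maintenance overhead - managing a database spread across many machines is not easy
● Complexity - replication and sharding require support from the application
● Human error - dealing with the mistakes people make
● Analysis and reporting that need access to historical information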

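Batch layer
● Eternally true data : raw data + derived data
● How can raw data be made immutable?
○ Just add a timestamp.
● If raw data is immutable ...
○ High parallelism that can process large volumes of data
○ Simplicity that makes the system easy to build and less prone to failure
○ The ability to tolerate technical failures and human error
○ The ability to support day-to-day operations as well as reports and analysis over historical
data
● The drawback is latency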
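The idea that a timestamp makes raw data immutable can be sketched as an append-only log of facts. This is an illustration under stated assumptions: `ImmutableFactLog`, `Fact`, and the user/city example are hypothetical names invented here, not part of the original material.

```java
import java.util.ArrayList;
import java.util.List;

// Append-only log: a change is recorded as a new timestamped fact,
// never as an update to an existing one.
class ImmutableFactLog {
    // A raw fact: "as of this timestamp, this user lived in this city".
    static final class Fact {
        final long timestamp;
        final String user;
        final String city;
        Fact(long timestamp, String user, String city) {
            this.timestamp = timestamp;
            this.user = user;
            this.city = city;
        }
    }

    private final List<Fact> facts = new ArrayList<>();

    // The only write operation is append.
    void append(long timestamp, String user, String city) {
        facts.add(new Fact(timestamp, user, city));
    }

    // Derived data: the city that was current for the user at the given time,
    // or null if no fact had been recorded yet.
    String cityAsOf(String user, long time) {
        Fact latest = null;
        for (Fact f : facts) {
            if (f.user.equals(user) && f.timestamp <= time
                    && (latest == null || f.timestamp >= latest.timestamp)) {
                latest = f;
            }
        }
        return latest == null ? null : latest.city;
    }

    public static void main(String[] args) {
        ImmutableFactLog log = new ImmutableFactLog();
        log.append(1L, "alice", "Seoul");
        log.append(5L, "alice", "Busan"); // a move is a new fact, not an overwrite
        System.out.println(log.cityAsOf("alice", 3L));  // prints Seoul
        System.out.println(log.cityAsOf("alice", 10L)); // prints Busan
    }
}
```

Because nothing is overwritten, a mistaken derived view can be thrown away and recomputed from the raw facts, and questions about the past remain answerable.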

WikiContributorsBatch
● Timestamp
● An identifier for the contribution itself
● An identifier for the user who made the contribution
● The user's name

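Mapper - WikiContributorsBatch
public static class Map extends Mapper<Object, Text, IntWritable, LongWritable> {
    // Emit (contributorId, timestamp) for each contribution log line.
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        Contribution contribution = new Contribution(value.toString());
        context.write(new IntWritable(contribution.contributorId),
            new LongWritable(contribution.timestamp));
    }
}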

Mapper - WikiContributorsBatch
class Contribution {
    // Line format: <ISO timestamp> <contribution id> <contributor id> <username>
    static final Pattern pattern = Pattern.compile("^([^\\s]*) (\\d*) (\\d*) (.*)$");
    static final DateTimeFormatter isoFormat = ISODateTimeFormat.dateTimeNoMillis();
    public long timestamp;
    public int id;
    public int contributorId;
    public String username;

    public Contribution(String line) {
        Matcher matcher = pattern.matcher(line);
        if(matcher.find()) {
            timestamp = isoFormat.parseDateTime(matcher.group(1)).getMillis();
            id = Integer.parseInt(matcher.group(2));
            contributorId = Integer.parseInt(matcher.group(3));
            username = matcher.group(4);
        }
    }
}
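To see what the contribution pattern captures, here is a standalone sketch. It assumes the pattern is meant to use the `\s` and `\d` character classes, uses `java.time.Instant` instead of Joda-Time so that it runs with the JDK alone, and parses a sample line that is invented for illustration; `ContributionLineDemo` is a hypothetical name.

```java
import java.time.Instant;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ContributionLineDemo {
    // Timestamp, two numeric ids, then the rest of the line as the username.
    static final Pattern PATTERN =
        Pattern.compile("^([^\\s]*) (\\d*) (\\d*) (.*)$");

    // Returns {timestamp, id, contributorId, username}, or null if the line does not match.
    static String[] fields(String line) {
        Matcher m = PATTERN.matcher(line);
        if (!m.find()) return null;
        return new String[] { m.group(1), m.group(2), m.group(3), m.group(4) };
    }

    public static void main(String[] args) {
        // Hypothetical sample line, invented for this sketch.
        String[] f = fields("2013-01-01T12:00:00Z 10001 42 dummyusername");
        long millis = Instant.parse(f[0]).toEpochMilli(); // ISO-8601 text to epoch millis
        System.out.println(millis + " " + f[1] + " " + f[2] + " " + f[3]);
    }
}
```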

Reducer - WikiContributorsBatch
public static class Reduce
        extends Reducer<IntWritable, LongWritable, IntWritable, Text> {
    static DateTimeFormatter dayFormat = ISODateTimeFormat.yearMonthDay();
    static DateTimeFormatter monthFormat = ISODateTimeFormat.yearMonth();

    public void reduce(IntWritable key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        // Contribution counts per day and per month for this contributor.
        HashMap<DateTime, Integer> days = new HashMap<DateTime, Integer>();
        HashMap<DateTime, Integer> months = new HashMap<DateTime, Integer>();
        for (LongWritable value: values) {
            DateTime timestamp = new DateTime(value.get());
            DateTime day = timestamp.withTimeAtStartOfDay();  // truncate to the day
            DateTime month = day.withDayOfMonth(1);           // truncate to the month
            incrementCount(days, day);
            incrementCount(months, month);
        }
        for (Entry<DateTime, Integer> entry: days.entrySet())
            context.write(key, formatEntry(entry, dayFormat));
        for (Entry<DateTime, Integer> entry: months.entrySet())
            context.write(key, formatEntry(entry, monthFormat));
    }
    // incrementCount and formatEntry follow on the next slide.

Reducer - WikiContributorsBatch
    // Add one to the count stored for key, starting from 1 if it is absent.
    private void incrementCount(HashMap<DateTime, Integer> counts, DateTime key) {
        Integer currentCount = counts.get(key);
        if (currentCount == null)
            counts.put(key, 1);
        else
            counts.put(key, currentCount + 1);
    }

    // Format an entry as "<period><TAB><count>".
    private Text formatEntry(Entry<DateTime, Integer> entry,
            DateTimeFormatter formatter) {
        return new Text(formatter.print(entry.getKey()) + "\t" + entry.getValue());
    }
}
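The bucketing done by the reducer can be tried outside Hadoop. The sketch below is an illustration, not the slide's code: it swaps Joda-Time for `java.time`, fixes the time zone to UTC (an assumption; the Joda-Time `DateTime` constructor uses the default zone), and uses the hypothetical class name `ContributionBuckets`.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ContributionBuckets {
    // Count timestamps (epoch millis) per calendar day, in UTC.
    static Map<LocalDate, Integer> countByDay(List<Long> millis) {
        Map<LocalDate, Integer> counts = new TreeMap<>();
        for (long ms : millis) {
            LocalDate day = Instant.ofEpochMilli(ms).atZone(ZoneOffset.UTC).toLocalDate();
            counts.merge(day, 1, Integer::sum);
        }
        return counts;
    }

    // Count timestamps per month, keyed by the first day of that month.
    static Map<LocalDate, Integer> countByMonth(List<Long> millis) {
        Map<LocalDate, Integer> counts = new TreeMap<>();
        for (long ms : millis) {
            LocalDate day = Instant.ofEpochMilli(ms).atZone(ZoneOffset.UTC).toLocalDate();
            counts.merge(day.withDayOfMonth(1), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Long> millis = List.of(
            Instant.parse("2012-09-04T14:57:05Z").toEpochMilli(),
            Instant.parse("2012-09-04T18:00:00Z").toEpochMilli(),
            Instant.parse("2012-09-20T08:30:00Z").toEpochMilli());
        System.out.println(countByDay(millis));   // prints {2012-09-04=2, 2012-09-20=1}
        System.out.println(countByMonth(millis)); // prints {2012-09-01=3}
    }
}
```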

Serving Layer: which serving-layer DB suits batch view updates?
ElephantDB (Clojure) ► A system created exactly for
this case. It is very simple.
Reference: http://www.slideshare.net/nathanmarz/elephantdb
Voldemort (Java) ► NoSQL, pluggable storage engines,
consistent hashing, eventual consistency, support for
batch-computed read-only stores.
An open source clone of Amazon's Dynamo.
Source:
http://www.slideshare.net/DavidGroozman/challenge-26788828

Speed Layer
● Overcomes the latency of the batch layer
● Uses an incremental approach
● Uses a traditional database that allows random writes

Speed Layer
Synchronous approach
Asynchronous approach

STORM
● A Storm system processes streams of tuples.
● Tuples are created by spouts and processed by bolts.

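STORM workers
● Spouts and bolts are distributed across several nodes for fault tolerance
● If one of the nodes in the cluster fails : the Storm topology sends tuples to nodes that are
working normally → tuples are guaranteed to be processed "at least once"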
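"At least once" means a tuple may be delivered more than once after a failure, so updates should tolerate duplicates. The sketch below contrasts a plain counter with a set of timestamps; it is plain Java rather than Storm code, and the class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class AtLeastOnceSketch {
    // Not idempotent: a redelivered tuple is counted again.
    static int countByIncrement(List<Long> deliveredTimestamps) {
        int count = 0;
        for (long ignored : deliveredTimestamps) count++;
        return count;
    }

    // Idempotent: adding the same timestamp to a set twice changes nothing.
    static int countByDistinctTimestamp(List<Long> deliveredTimestamps) {
        Set<Long> seen = new HashSet<>(deliveredTimestamps);
        return seen.size();
    }

    public static void main(String[] args) {
        // The tuple with timestamp 2000 was delivered twice after a retry.
        List<Long> delivered = List.of(1000L, 2000L, 2000L, 3000L);
        System.out.println(countByIncrement(delivered));         // prints 4
        System.out.println(countByDistinctTimestamp(delivered)); // prints 3
    }
}
```

The set-based version assumes that two distinct contributions by the same contributor never share a timestamp.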

Counting contributions with STORM
● A spout that reads the log
● A bolt that parses log entries
● A bolt that updates the database

Spout
public class RandomContributorSpout extends BaseRichSpout {
    private static final Random rand = new Random();
    private static final DateTimeFormatter isoFormat =
        ISODateTimeFormat.dateTimeNoMillis();
    private SpoutOutputCollector collector;
    private int contributionId = 10000;

    // Called once when the spout starts; keep the collector for emitting tuples.
    public void open(Map conf, TopologyContext context,
            SpoutOutputCollector collector) {
        this.collector = collector;
    }

    // Each emitted tuple has a single field named "line".
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("line"));
    }

    // Generate one random log line after a short random pause.
    public void nextTuple() {
        Utils.sleep(rand.nextInt(100));
        ++contributionId;
        String line = isoFormat.print(DateTime.now()) + " " + contributionId + " " +
            rand.nextInt(10000) + " " + "dummyusername";
        collector.emit(new Values(line));
    }
}

Parser Bolt
class ContributionParser extends BaseBasicBolt {
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("timestamp", "id", "contributorId", "username"));
    }

    // Parse the raw log line (field 0) and emit its four components.
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        Contribution contribution = new Contribution(tuple.getString(0));
        collector.emit(new Values(contribution.timestamp, contribution.id,
            contribution.contributorId, contribution.username));
    }
}

Record Bolt
class ContributionRecord extends BaseBasicBolt {
    // contributorId -> set of contribution timestamps
    private static final HashMap<Integer, HashSet<Long>> timestamps =
        new HashMap<Integer, HashSet<Long>>();

    // This bolt emits nothing, so it declares no output fields.
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
    }

    // Field 2 is contributorId and field 0 is timestamp.
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        addTimestamp(tuple.getInteger(2), tuple.getLong(0));
    }

    private void addTimestamp(int contributorId, long timestamp) {
        HashSet<Long> contributorTimestamps = timestamps.get(contributorId);
        if (contributorTimestamps == null) {
            contributorTimestamps = new HashSet<Long>();
            timestamps.put(contributorId, contributorTimestamps);
        }
        contributorTimestamps.add(timestamp);
    }
}

Topology
public class WikiContributorsTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // Each component runs with a parallelism hint of 4.
        builder.setSpout("contribution_spout", new RandomContributorSpout(), 4);
        builder.setBolt("contribution_parser", new ContributionParser(), 4)
            .shuffleGrouping("contribution_spout");
        builder.setBolt("contribution_recorder", new ContributionRecord(), 4)
            .fieldsGrouping("contribution_parser", new Fields("contributorId"));

        // Run in an in-process local cluster for 10 seconds, then shut down.
        LocalCluster cluster = new LocalCluster();
        Config conf = new Config();
        cluster.submitTopology("wiki-contributors", conf, builder.createTopology());
        Thread.sleep(10000);
        cluster.shutdown();
    }
}

Stream Groupings
A Storm stream grouping answers the question of which task receives which tuple.
(http://storm.apache.org/releases/current/Concepts.html)
Shuffle grouping
Tuples are distributed randomly across the bolt's tasks.
Fields grouping
The task is chosen according to specific fields of the tuple. For example, if the stream is
grouped by contributorId, tuples with the same contributorId always go to the same
task.
Global grouping
The entire stream goes to a single one of the bolt's tasks (the task with the lowest id).

Stream Groupings
All grouping
Each tuple is delivered to all of the bolt's tasks (it is broadcast to every task).
Direct grouping
The producer of the tuple decides which task will receive it. (1:1)
Local or shuffle grouping
If the target bolt has one or more tasks in the same worker process, tuples are shuffled to just
those in-process tasks. Otherwise, it behaves like a normal shuffle grouping.
Partial Key grouping
The task is chosen according to specific fields of the tuple, as with fields grouping, but the load is
balanced between two downstream bolts.
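The difference between the first three groupings can be sketched as a choice of task index. This is only an illustration of the idea: the hash-modulo rule and the names below are assumptions made for this sketch, not Storm's actual implementation.

```java
import java.util.Random;

class GroupingSketch {
    // Fields grouping idea: the same field value always maps to the same task.
    static int fieldsGrouping(Object fieldValue, int numTasks) {
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }

    // Shuffle grouping idea: any task may be picked, regardless of the tuple's content.
    static int shuffleGrouping(Random random, int numTasks) {
        return random.nextInt(numTasks);
    }

    // Global grouping idea: every tuple goes to the lowest-numbered task.
    static int globalGrouping() {
        return 0;
    }

    public static void main(String[] args) {
        int numTasks = 4;
        Random random = new Random();
        for (int contributorId : new int[] {42, 7, 42}) {
            System.out.println("contributorId=" + contributorId
                + " fields->" + fieldsGrouping(contributorId, numTasks)
                + " shuffle->" + shuffleGrouping(random, numTasks)
                + " global->" + globalGrouping());
        }
    }
}
```

In the run above both tuples with contributorId 42 get the same fields-grouping index, which is what lets a recording bolt keep all of one contributor's timestamps in a single task.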

Day 3 wrap-up
The speed layer provides a real-time view of recent data.
Storm is used to build an asynchronous speed layer.

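Wrapping up
● Lambda Architecture - brings several concepts together
○ Immutable raw data : reminiscent of how Clojure separates identity from state
○ Mapper/Reducer parallel processing : similar to parallel functional programming
○ Distributing work across a cluster : actors
○ Streams of Storm tuples : message passing in actors and CSP
● Strengths/weaknesses
○ The Lambda Architecture is built for handling huge amounts of data / handling huge amounts of data brings overhead
● Alternatives
○ Spark is a cluster computing framework that implements a DAG execution engine, which lets many
algorithms (especially graph algorithms) be expressed more naturally than with MapReduce
■ It includes a streaming API : both the batch layer and the speed layer can be implemented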
