The flashy web pages and polished mobile apps of the famous IT services we all know by name are built on solid, powerful distributed systems. If the backend is weak, the service or app is a house of cards. This seminar tackles twelve of the most fundamental problems and issues you have to solve when building such systems.
Explains how to build an online game server with Akka.NET, an open-source actor framework. Covers everything from the basics of the Actor Model to building a scale-out game server. Unity3D is used for the client in the examples.
Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual..., by Brandon O'Brien
Contact:
https://www.linkedin.com/in/brandonjobrien
@hakczar
Code examples available at https://github.com/br4nd0n/spark-streaming and https://github.com/br4nd0n/spark-viz
A demo and explanation of building a streaming application using Spark Streaming, Node.js and Redis, with a real-time visualization. Includes a discussion of Spark and Spark Streaming internals, including RDD partitioning, code and data distribution, and cluster resource allocation.
Test strategies for data processing pipelines, by Lars Albertsson
This talk will present recommended patterns and corresponding anti-patterns for testing data processing pipelines. We will suggest technology and architecture to improve testability, both for batch and streaming processing pipelines. We will primarily focus on testing for the purpose of development productivity and product iteration speed, but briefly also cover data quality testing.
Presented at highloadstrategy.com 2016 by Lars Albertsson (independent, www.mapflat.com), joint work with Øyvind Løkling (Schibsted Products & Technology).
AWS provides a broad platform of managed services to help you build, secure, and seamlessly scale end-to-end Big Data applications quickly and with ease. Want to get ramped up on how to use Amazon's big data web services? Wondering when to use which service? Want to write your first big data application on AWS? Join us in this session as we discuss reference architectures, design patterns, and best practices for pulling together various AWS services to meet your big data challenges.
The right architecture is key for any IT project. This is especially true for big data projects, where there are no standard architectures that have proven their suitability over the years. This session discusses the different big data architectures that have evolved over time, including the traditional Big Data architecture, the Streaming Analytics architecture, and the Lambda and Kappa architectures, and maps components from both the open-source world and the Oracle stack onto them.
Real-Time Analytics with Apache Cassandra and Apache Spark, by Guido Schmutz
Time series data is everywhere: IoT, sensor data, financial transactions. The industry has moved to databases like Cassandra to handle the high velocity and high volume of data that is now commonplace. However, data is pointless without being able to process it in near real time. That's where Spark combined with Cassandra comes in! What was once just your storage system (Cassandra) can be transformed into an analytics system, and it's really surprising how easy it is!
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E..., by Spark Summit
Elasticsearch provides native integration with Apache Spark through ES-Hadoop. However, especially during development, it is at best cumbersome to have Elasticsearch running on a separate machine/instance. Leveraging Spark Cluster with Elasticsearch Inside, it is possible to run an embedded instance of Elasticsearch in the driver node of a Spark cluster. This opens up new opportunities to develop cutting-edge applications. One such application is Dataset Search.
Oscar will give a demo of a Dataset Search Engine built on Spark Cluster with Elasticsearch Inside. Motivation is that once Elasticsearch is running on Spark it becomes possible and interesting to have the Elasticsearch in-memory instance join an (existing) Elasticsearch cluster. And this in turn enables indexing of Datasets that are processed as part of Data Pipelines running on Spark. Dataset Search and Data Management are R&D topics that should be of interest to Spark Summit East attendees who are looking for a way to organize their Data Lake and make it searchable.
100% Serverless big data scale production Deep Learning System, by hoondong kim
- BigData-scale Deep Learning training system (with GPU Docker PaaS on Azure Batch AI)
- Deep Learning serving layer (with auto scale-out mode on Web App for Linux Docker)
- BigDL, Keras, Tensorflow, Horovod, TensorflowOnAzure
Building an AI Serving Infra and AI DevOps cycle for auto-scalable Deep Learning Production..., by hoondong kim
[Slides from a Tensorflow-KR offline seminar]
A methodology for building an AI serving infrastructure and an AI DevOps cycle for auto-scalable deep learning production (sharing how to serve 10,000 TPS of Tensorflow inference on Azure Docker PaaS).
Spark machine learning & deep learning, by hoondong kim
Spark Machine Learning and Deep Learning Deep Dive.
Scenarios that use Spark in combination with other data analytics tools (MS R on Spark, Tensorflow (Keras) with Spark, Scikit-learn with Spark, etc.)
Data scientists, engineers, and marketers each look at data visualization from a different perspective.
These slides introduce the keywords engineers consider important, data visualization from a design standpoint,
and business intelligence.
This material was presented at 데이터 야놀자.
25. Aggregation - Mongo
Old method
● Take an LVM snapshot
● Upload the snapshot to HDFS
○ Tar the data files, upload to HDFS
● MongoDump to sequence files
○ Download and untar, start a mongod process
○ Scan all records, write them out to BSON sequence files in HDFS (sketched below)
○ One-time conversion of BSON → thrift sequence files
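As a rough sketch of that scan step, assuming pymongo (with its bundled bson module); the database, collection, and output path are invented, and the real job wrote sequence files to HDFS rather than a local file:

# Scan every record and append it as raw BSON (mongodump-style output).
import bson  # ships with pymongo
from pymongo import MongoClient

client = MongoClient("localhost", 27017)  # the mongod started from the snapshot
with open("/tmp/checkins.bson", "wb") as out:
    for doc in client["mydb"]["checkins"].find():
        out.write(bson.encode(doc))  # bson.encode requires pymongo >= 3.9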
39. Luigi is a Python package that helps build complex
pipelines and handles all the plumbing typically
associated with long-running batch jobs.
It handles:
● dependency resolution
● workflow management
● visualization
● failure handling
● command line integration
● and much more...
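A minimal sketch of what a Luigi pipeline looks like (the task names and paths here are invented, not from the deck); requires/output/run are the hooks Luigi uses for dependency resolution:

import luigi

class Tokenize(luigi.Task):
    """Upstream task: produce one token per line (contents invented)."""
    def output(self):
        return luigi.LocalTarget("/tmp/tokens.txt")
    def run(self):
        with self.output().open("w") as out:
            out.write("foo\nbar\nfoo\n")

class WordCount(luigi.Task):
    """Luigi resolves the dependency: Tokenize runs first, exactly once."""
    def requires(self):
        return Tokenize()
    def output(self):
        return luigi.LocalTarget("/tmp/wordcount.tsv")
    def run(self):
        counts = {}
        with self.input().open("r") as f:
            for line in f:
                word = line.strip()
                counts[word] = counts.get(word, 0) + 1
        with self.output().open("w") as out:
            for word, n in sorted(counts.items()):
                out.write("%s\t%d\n" % (word, n))

if __name__ == "__main__":
    luigi.build([WordCount()], local_scheduler=True)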
43. ● MapReduce
○ The Map/Reduce model is not a good fit for every kind of data processing
○ Implementing joins is very complex
● Cascading
○ A Java wrapper that lets you express a data flow instead of raw MapReduce
○ Once the data flow is written, the engine translates the work into MapReduce jobs
○ Still suffers from Java's trademark verbosity
● Scalding
○ Cascading implemented in Scala
○ Data processing written in a functional programming style
○ Code is concise and easy to maintain
45. ● Data flow frameworks allow data processing jobs to be expressed as a series of operations on streams of data.
● Pros
○ Composable: share series of operations between jobs
○ Simplifies: complex joins are much easier to write
○ Brevity: faster iteration
○ Functional programming style is great for writing data flows
● Cons
○ Adds complexity
○ Debugging may require looking behind the framework's "magic"
○ Impacts performance
■ Time
■ Memory pressure
46. Example: Word Count
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: tokenize each input line and emit (word, 1) for every token.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer (also used as the combiner): sum the counts for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Wire up the job: mapper, combiner, reducer, and the I/O paths.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
47. Example: Word Count
// Read lines from HDFS and write (word, count) pairs back out.
Scheme sourceScheme = new TextLine( new Fields( "line" ) );
Tap source = new Hfs( sourceScheme, inputPath );
Scheme sinkScheme = new TextLine( new Fields( "word", "count" ) );
Tap sink = new Hfs( sinkScheme, outputPath, SinkMode.REPLACE );

// Split each line into words (regex escapes restored: \pL matches a letter).
Pipe assembly = new Pipe( "wordcount" );
String regex = "(?<!\\pL)(?=\\pL)[^ ]*(?<=\\pL)(?!\\pL)";
Function function = new RegexGenerator( new Fields( "word" ), regex );
assembly = new Each( assembly, new Fields( "line" ), function );

// Group by word and count each group.
assembly = new GroupBy( assembly, new Fields( "word" ) );
Aggregator count = new Count( new Fields( "count" ) );
assembly = new Every( assembly, count );

// Plan and run the flow; Cascading compiles this into MapReduce jobs.
Properties properties = new Properties();
FlowConnector.setApplicationJarClass( properties, Main.class );
FlowConnector flowConnector = new FlowConnector( properties );
Flow flow = flowConnector.connect( "word-count", source, sink, assembly );
flow.complete();
56. HFile
● Immutable K/V storage format
● Typed via Thrift
● Files are sorted at write time, so seeks are fast
● Sharding gives high QPS on a single dataset
● Easy to manage with Foursquare's in-house file server
● Generated as the output of MapReduce or Scalding jobs
68. Redshift
● Relational DB from AWS
● Can be queried without going through Hadoop
● Columnar storage; with some attention to optimization you can get quite good performance
● Used mainly for experiment analysis and dashboard computation
● Data has to be moved between HDFS and Redshift (a sketch follows)
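A hedged sketch of one common way to do that move: stage files to S3, then issue a Redshift COPY over a regular PostgreSQL connection. Every name here (cluster endpoint, table, bucket, IAM role) is invented:

import psycopg2  # Redshift speaks the PostgreSQL wire protocol

conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="loader", password="...")
conn.autocommit = True
cur = conn.cursor()

# COPY ingests the staged S3 files in parallel across the cluster.
cur.execute("""
    COPY experiment_results
    FROM 's3://example-bucket/exports/experiment_results/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
    DELIMITER '\\t' GZIP;
""")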
70. Presto
● Open-source SQL engine from Facebook
● Can be queried without going through Hadoop
● In-memory computation
● Really, really, really fast
● Through the Hive connector, it can read the Thrift data in Foursquare's HDFS as-is
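For flavor, a hedged sketch of querying Presto from Python via the presto-python-client package; the coordinator host and the checkins table are assumptions, not Foursquare's actual schema:

import prestodb  # pip install presto-python-client

conn = prestodb.dbapi.connect(
    host="presto-coordinator.example.com", port=8080,
    user="analyst", catalog="hive", schema="default")
cur = conn.cursor()

# The same Hive-managed data, scanned by Presto's in-memory engine.
cur.execute(
    "SELECT venue_id, count(*) AS checkins "
    "FROM checkins GROUP BY venue_id ORDER BY checkins DESC LIMIT 10")
for row in cur.fetchall():
    print(row)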
72. Presto
● Dedicated Presto boxes
○ $$$; the machines sit idle when there are no queries
● Co-location on Hadoop boxes
○ Deployment is tricky, and iterating to find the right deployment process is painfully slow
○ The approach Netflix and Facebook use
● Yarn
○ Competes for resources with the work Hadoop should be doing
73. Presto
● Presto-Yarn
○ OSS that lets Yarn deploy and manage Presto via Apache Slider
○ Finding the right configuration is hard (though far easier than deploying by hand), but once it is set up, deployment and management are very easy
78. Backup
● HDFS space isn't free
○ Daily HDFS <-> S3 backup
○ Deleted from HDFS after a set period
● S3 also isn't free
○ Moved to Glacier after a set period (see the sketch below)
○ Glacier pricing: $0.007 per GB / month (so 100 TB runs about $700/month)
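One way to automate the S3 -> Glacier step is an S3 lifecycle rule; a boto3 sketch, with the bucket name, prefix, and day counts as assumptions:

import boto3

s3 = boto3.client("s3")
# After 90 days, objects under the backup prefix transition to Glacier;
# after two years they expire. The numbers are illustrative, not Foursquare's.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-hdfs-backup",
    LifecycleConfiguration={"Rules": [{
        "ID": "archive-hdfs-backups",
        "Filter": {"Prefix": "hdfs-backup/"},
        "Status": "Enabled",
        "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
        "Expiration": {"Days": 730},
    }]},
)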
79. Retention
● HFiles and Hive tables have a retention policy
○ As long as the collections themselves don't grow, HDFS usage also stays within a fixed bound
● Did retention delete data you still need?
○ Re-run the job and you get it back
80. Compression
● Default codec: Snappy
○ Fast reads, low compression
● Data that has to stay in HDFS but is rarely read any more: Gzip
○ Slow reads, high compression
● Log backup (toy comparison below)
○ Snappy -> Gzip, Gzip to S3, replace Snappy with Gzip after n days
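A toy illustration of the Snappy-vs-Gzip tradeoff (this is not the HDFS recompression job, just the two codecs side by side, assuming the python-snappy package):

import gzip
import time

import snappy  # pip install python-snappy

# Synthetic "log" data; real logs compress less dramatically than this.
data = b"2016-01-01T00:00:00 INFO checkin venue=1234 user=5678\n" * 200000

for name, compress in [("snappy", snappy.compress),
                       ("gzip", lambda d: gzip.compress(d, 9))]:
    t0 = time.time()
    out = compress(data)
    print("%-6s %5.1f%% of original, %.3fs"
          % (name, 100.0 * len(out) / len(data), time.time() - t0))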
86. Hardware Stats
Useful stats:
● Hadoop
○ CPU usage per role, per rack
○ Network stats (HDFS <-> AWS)
● Kafka
○ Bytes in / bytes out
○ Producer requests/s, consumer fetches/s
○ GC time
○ SSD read/write time
87. Hadoop Stats
Cloudera Manager
● HDFS alerts
○ HDFS Bytes/Blocks read/written
○ RPC Connections
● YARN alerts
○ RM health
○ Jobs that run too long
○ Failing tasks
89. Timberlake
● Problem: the JobTracker is slow.
● The biggest drag on productivity: long-running processes
● Run a job -> the JobTracker is too slow to make monitoring worthwhile -> the job burns resources unmonitored -> one job hurts everyone
96. Inviso
● Out of the box, Inviso supports ES 1.0+
● Porting to ES 2+ makes Kibana usable
○ Convert every . in field names to _ (see the sketch below)
○ Timestamp handling
○ The stats Inviso imports != all available stats
○ Add the stats you want, drop the ones you don't need
■ CPU time
■ Pool-based resource usage
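The dot-to-underscore conversion is easy to script; a sketch of a recursive rename you could apply to each stats document before indexing it into ES 2.x, which rejects dots in field names (the document shape here is hypothetical):

def underscore_keys(obj):
    """Recursively replace '.' with '_' in dict keys (ES 2.x forbids dots)."""
    if isinstance(obj, dict):
        return {k.replace(".", "_"): underscore_keys(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [underscore_keys(v) for v in obj]
    return obj

doc = {"mapreduce.job.name": "sessions", "counters": [{"cpu.ms": 1234}]}
print(underscore_keys(doc))
# {'mapreduce_job_name': 'sessions', 'counters': [{'cpu_ms': 1234}]}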
101. Wrapping Up
● Engineering grounded in a philosophy
● Solve problems
○ Better still, solve them before they become problems
● Always be monitoring
○ Monitoring isn't really fun
○ So make it easier and more fun to monitor!