9. Before I explain, one quick question...
Why is big data usually so hard to explain?
Isn't development basically just CRUD?
Why doesn't it click?
-> Isn't it that we don't actually know 'what' we are dealing with
and 'what we need to do' about it?
10. So I wrote down what I actually wanted to do...
Read large volumes of data split across many files
-> Spark
Map/reduce that large data into the desired output
-> Spark
Run and manage those large jobs as periodic batches
-> Oozie
11. Spark
A distributed platform for big data processing
- RDD (Resilient Distributed Dataset)
: Transformations, Actions
- DataFrame
: a distributed collection for processing tabular data
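To make the Transformation/Action distinction concrete, here is a minimal sketch against the Spark 1.x API this talk uses (SparkContext/SQLContext); the app name and data are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object RddVsDataFrame {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("rdd-vs-df").setMaster("local[*]"))

    // RDD transformations (map, filter) are lazy -- nothing executes yet
    val doubled = sc.parallelize(1 to 5).map(_ * 2).filter(_ > 4)

    // An action (collect, count, saveAsTextFile) triggers the actual computation
    println(doubled.collect().mkString(","))  // 6,8,10

    // DataFrame: a distributed collection with a schema, for tabular data
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._
    val df = Seq(("a", 1), ("b", 2)).toDF("key", "value")
    df.filter(df("value") > 1).show()

    sc.stop()
  }
}
```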
12. What I had to go through to get it running...
Getting started with Spark
Run Spark in standalone mode
Run Spark in cluster mode
Run Spark in a real distributed environment
Run Spark via a scheduler
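The progression above roughly maps to spark-submit invocations like these (a sketch; host names, ports, and MyApp.jar are placeholders):

```shell
# Getting started: everything in one local JVM
spark-submit --master "local[*]" --class com.example.MyApp MyApp.jar

# Standalone cluster: driver here, executors on the standalone master's workers
spark-submit --master spark://master-host:7077 --class com.example.MyApp MyApp.jar

# Real distributed environment, cluster deploy mode: the driver itself
# runs inside the cluster (e.g. on YARN)
spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp MyApp.jar
```

The last step, running via a scheduler, is what the Oozie slides that follow cover.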
14. Oozie
"A server-based workflow engine specialized in running
workflow jobs with actions that run Hadoop
Map/Reduce and Pig jobs."
In short: Oozie is a workflow scheduler for Hadoop.
15. Oozie workflow
<workflow-app xmlns='uri:oozie:workflow:0.1' name='processDir'>
<start to='getDirInfo' />
<!-- STEP ONE -->
<action name='getDirInfo'>
<!-- writes 2 properties:
     dir.num-files: -1 if dir doesn't exist, otherwise # of files in dir
     dir.age: -1 if dir doesn't exist, otherwise age of dir in days -->
<java>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<main-class>com.navteq.oozie.GetDirInfo</main-class>
<arg>${inputDir}</arg>
<capture-output />
</java>
<ok to="makeIngestDecision" />
<error to="fail" />
</action>
<kill name="fail">
<message>Java failed, error
message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name='end' />
</workflow-app>
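A workflow like this is driven by a job.properties file that supplies the ${...} parameters; a minimal sketch (host names and paths are placeholders):

```properties
nameNode=hdfs://localhost:9000
jobTracker=localhost:9001
inputDir=/user/tucu/input
oozie.wf.application.path=${nameNode}/user/tucu/apps/processDir
```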
16. Oozie command
$ oozie job -oozie http://localhost:8080/oozie -config examples/apps/map-reduce/job.properties -run
job: 14-20090525161321-oozie-tucu

Check the workflow job status:
$ oozie job -oozie http://localhost:8080/oozie -info 14-20090525161321-oozie-tucu
------------------------------------------------------------------------
Workflow Name : map-reduce-wf
App Path      : hdfs://localhost:9000/user/tucu/examples/apps/map-reduce
Status        : SUCCEEDED
Run           : 0
User          : tucu
Group         : users
Created       : 2009-05-26 05:01 +0000
Started       : 2009-05-26 05:01 +0000
Ended         : 2009-05-26 05:01 +0000

Actions
------------------------------------------------------------------------
Action Name  Type        Status  Transition  External Id            External Status  Error Code  Start Time              End Time
mr-node      map-reduce  OK      end         job_200904281535_0254  SUCCEEDED        -           2009-05-26 05:01 +0000  2009-05-26 05:01 +0000
------------------------------------------------------------------------
17. Java Client
...
// start local Oozie
LocalOozie.start();

// get an OozieClient for the local Oozie
OozieClient wc = LocalOozie.getClient();

// create a workflow job configuration and set the workflow application path
Properties conf = wc.createConfiguration();
conf.setProperty(OozieClient.APP_PATH, "hdfs://foo:9000/usr/tucu/my-wf-app");

// set workflow parameters
conf.setProperty("jobTracker", "foo:9001");
conf.setProperty("inputDir", "/usr/tucu/inputdir");
conf.setProperty("outputDir", "/usr/tucu/outputdir");
...

// submit and start the workflow job
String jobId = wc.run(conf);
System.out.println("Workflow job submitted");

// wait until the workflow job finishes, printing the status every 10 seconds
while (wc.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
    System.out.println("Workflow job running ...");
    Thread.sleep(10 * 1000);
}

// print the final status of the workflow job
System.out.println("Workflow job completed ...");
System.out.println(wc.getJobInfo(jobId));

// stop local Oozie
LocalOozie.stop();
…
19. Oozie: Curse or Blessing?
Unmanaged XML hell
Unmanaged dependencies across the different big data applications
- per-workflow lib dir: ${oozie.wf.application.path}/lib
- oozie.libpath=${oozie.wf.application.path}/lib
- ShareLib: /user/${user.name}/share/lib
<property>
  <name>oozie.hive.defaults</name>
  <value>${jobDir}/hive-conf.xml</value>
</property>
20. HiveMain
public class HiveMain extends LauncherMain {
    public static final String HIVE_SITE_CONF = "hive-site.xml";

    public static Configuration setUpHiveSite() throws Exception {
        Configuration hiveConf = initActionConf();

        // Write the action configuration out to hive-site.xml
        OutputStream os = new FileOutputStream(HIVE_SITE_CONF);
        hiveConf.writeXml(os);
        os.close();

        System.out.println();
        System.out.println("Hive Configuration Properties:");
        System.out.println("------------------------");
        for (Entry<String, String> entry : hiveConf) {
            System.out.println(entry.getKey() + "=" + entry.getValue());
        }
        System.out.println("------------------------");
        System.out.println();
        System.out.flush();
        return hiveConf;
    }
    // ...
}
23. Spark with DataFrame
- DataFrame Features
: passes data between nodes far more efficiently than
Java serialization, because Spark understands the schema
: performs transformations directly on data in off-heap
memory, avoiding garbage-collection costs
: exposes an API for building a relational query plan that
Spark's Catalyst optimizer can then execute
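The Catalyst point can be seen directly: explain() prints the plans Spark will actually run. A sketch against the Spark 1.x SQLContext API used in this talk:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object CatalystPlan {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("catalyst").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val df = Seq(("a", 1), ("b", 2), ("c", 3)).toDF("key", "value")

    // Catalyst sees the whole relational plan, so filter + select can be
    // optimized together, unlike opaque RDD lambdas
    val query = df.filter(df("value") > 1).select("key")
    query.explain(true)  // prints parsed, analyzed, optimized, and physical plans

    sc.stop()
  }
}
```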
24. Spark Too Slow…
After somehow getting it implemented with DataFrame…
Processing 100 segments over 30 days of data
(80 GB × 30 days)
-> processing time: 8 hours
-> couldn't meet the required spec
-> hoped that slow-but-running would be good enough…
25. Spark Failed
- DataFrame Performance
Heavy overhead
Not optimized for a distributed environment:
memory-risky
26. SqlContext Problems
2 cores × 256 executors
-> exhausted the Hive metastore DB connections (MySQL
connection limit: 2,000)
27. Fail Reason
Things I had neglected…
Sloppy about checking library dependencies
Had never faced a situation where local testing was impossible
No knowledge of programming for distributed environments
Editor's Notes
How do I set up the project?
How do I build the cluster?
How do I start the application?
Break the to-do list into small pieces
Why won't it run at all?
Why does it die suddenly mid-run?
Why is it so slow?
How do I make it fast?
Eliminate suspected causes one by one