9. Before I explain, one quick question...
Why is big data usually so hard to explain?
Isn't development basically just CRUD?
Why doesn't it click?
-> Isn't it that we don't actually know 'what' we are dealing with
and 'what we need to do' about it?
10. So I wrote down what I actually wanted to do...
Read large volumes of data split across many files
-> Spark
Map/reduce that large data into the desired output
-> Spark
Run and manage those large jobs as periodic batches
-> Oozie
11. Spark
A distributed platform for big data processing
- RDD (Resilient Distributed Dataset)
: Transformations, Actions
- DataFrame
: a distributed collection for processing tabular data
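To make the Transformation/Action distinction concrete, here is a minimal sketch against the Spark 1.x API this talk uses (SparkContext/SQLContext); the app name and data are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object RddVsDataFrame {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("rdd-vs-df").setMaster("local[*]"))

    // RDD transformations (map, filter) are lazy -- nothing executes yet
    val doubled = sc.parallelize(1 to 5).map(_ * 2).filter(_ > 4)

    // An action (collect, count, saveAsTextFile) triggers the actual computation
    println(doubled.collect().mkString(","))  // 6,8,10

    // DataFrame: a distributed collection with a schema, for tabular data
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._
    val df = Seq(("a", 1), ("b", 2)).toDF("key", "value")
    df.filter(df("value") > 1).show()

    sc.stop()
  }
}
```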
12. What I had to go through to get it running...
Getting started with Spark
Run Spark in standalone mode
Run Spark in cluster mode
Run Spark in a real distributed environment
Run Spark via a scheduler
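The progression above roughly maps to spark-submit invocations like these (a sketch; host names, ports, and MyApp.jar are placeholders):

```shell
# Getting started: everything in one local JVM
spark-submit --master "local[*]" --class com.example.MyApp MyApp.jar

# Standalone cluster: driver here, executors on the standalone master's workers
spark-submit --master spark://master-host:7077 --class com.example.MyApp MyApp.jar

# Real distributed environment, cluster deploy mode: the driver itself
# runs inside the cluster (e.g. on YARN)
spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp MyApp.jar
```

The last step, running via a scheduler, is what the Oozie slides that follow cover.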
14. Oozie
"A server-based workflow engine specialized in running
workflow jobs with actions that run Hadoop
Map/Reduce and Pig jobs."
In short: Oozie is a workflow scheduler for Hadoop.
15. Oozie workflow
<workflow-app xmlns='uri:oozie:workflow:0.1' name='processDir'>
<start to='getDirInfo' />
<!-- STEP ONE -->
<action name='getDirInfo'>
<!-- writes 2 properties:
     dir.num-files: -1 if dir doesn't exist, otherwise # of files in dir
     dir.age: -1 if dir doesn't exist, otherwise age of dir in days -->
<java>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<main-class>com.navteq.oozie.GetDirInfo</main-class>
<arg>${inputDir}</arg>
<capture-output />
</java>
<ok to="makeIngestDecision" />
<error to="fail" />
</action>
<kill name="fail">
<message>Java failed, error
message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name='end' />
</workflow-app>
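A workflow like this is driven by a job.properties file that supplies the ${...} parameters; a minimal sketch (host names and paths are placeholders):

```properties
nameNode=hdfs://localhost:9000
jobTracker=localhost:9001
inputDir=/user/tucu/input
oozie.wf.application.path=${nameNode}/user/tucu/apps/processDir
```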
16. Oozie command
$ oozie job -oozie http://localhost:8080/oozie -config examples/apps/map-reduce/job.properties -run
job: 14-20090525161321-oozie-tucu

Check the workflow job status:
$ oozie job -oozie http://localhost:8080/oozie -info 14-20090525161321-oozie-tucu
------------------------------------------------------------------------
Workflow Name : map-reduce-wf
App Path      : hdfs://localhost:9000/user/tucu/examples/apps/map-reduce
Status        : SUCCEEDED
Run           : 0
User          : tucu
Group         : users
Created       : 2009-05-26 05:01 +0000
Started       : 2009-05-26 05:01 +0000
Ended         : 2009-05-26 05:01 +0000

Actions
------------------------------------------------------------------------
Action Name  Type        Status  Transition  External Id            External Status  Error Code  Start Time              End Time
mr-node      map-reduce  OK      end         job_200904281535_0254  SUCCEEDED        -           2009-05-26 05:01 +0000  2009-05-26 05:01 +0000
------------------------------------------------------------------------
17. Java Client
...
// start local Oozie
LocalOozie.start();

// get an OozieClient for the local Oozie
OozieClient wc = LocalOozie.getClient();

// create a workflow job configuration and set the workflow application path
Properties conf = wc.createConfiguration();
conf.setProperty(OozieClient.APP_PATH, "hdfs://foo:9000/usr/tucu/my-wf-app");

// set workflow parameters
conf.setProperty("jobTracker", "foo:9001");
conf.setProperty("inputDir", "/usr/tucu/inputdir");
conf.setProperty("outputDir", "/usr/tucu/outputdir");
...

// submit and start the workflow job
String jobId = wc.run(conf);
System.out.println("Workflow job submitted");

// wait until the workflow job finishes, printing the status every 10 seconds
while (wc.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
    System.out.println("Workflow job running ...");
    Thread.sleep(10 * 1000);
}

// print the final status of the workflow job
System.out.println("Workflow job completed ...");
System.out.println(wc.getJobInfo(jobId));

// stop local Oozie
LocalOozie.stop();
…
19. Oozie: Curse or Blessing?
Unmanaged XML hell
Unmanaged dependencies across the different big data applications
- per-workflow lib dir: ${oozie.wf.application.path}/lib
- oozie.libpath=${oozie.wf.application.path}/lib
- ShareLib: /user/${user.name}/share/lib
<property>
  <name>oozie.hive.defaults</name>
  <value>${jobDir}/hive-conf.xml</value>
</property>
20. HiveMain
public class HiveMain extends LauncherMain {
    public static final String HIVE_SITE_CONF = "hive-site.xml";

    public static Configuration setUpHiveSite() throws Exception {
        Configuration hiveConf = initActionConf();

        // Write the action configuration out to hive-site.xml
        OutputStream os = new FileOutputStream(HIVE_SITE_CONF);
        hiveConf.writeXml(os);
        os.close();

        System.out.println();
        System.out.println("Hive Configuration Properties:");
        System.out.println("------------------------");
        for (Entry<String, String> entry : hiveConf) {
            System.out.println(entry.getKey() + "=" + entry.getValue());
        }
        System.out.println("------------------------");
        System.out.println();
        System.out.flush();
        return hiveConf;
    }
    // ...
}
23. Spark with DataFrame
- DataFrame Features
: passes data between nodes far more efficiently than
Java serialization, because Spark understands the schema
: performs transformations directly on data in off-heap
memory, avoiding garbage-collection costs
: exposes an API for building a relational query plan that
Spark's Catalyst optimizer can then execute
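The Catalyst point can be seen directly: explain() prints the plans Spark will actually run. A sketch against the Spark 1.x SQLContext API used in this talk:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object CatalystPlan {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("catalyst").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val df = Seq(("a", 1), ("b", 2), ("c", 3)).toDF("key", "value")

    // Catalyst sees the whole relational plan, so filter + select can be
    // optimized together, unlike opaque RDD lambdas
    val query = df.filter(df("value") > 1).select("key")
    query.explain(true)  // prints parsed, analyzed, optimized, and physical plans

    sc.stop()
  }
}
```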
24. Spark Too Slow…
After somehow getting it implemented with DataFrame…
Processing 100 segments over 30 days of data
(80 GB × 30 days)
-> processing time: 8 hours
-> couldn't meet the required spec
-> hoped that slow-but-running would be good enough…
25. Spark Failed
- DataFrame Performance
Heavy overhead
Not optimized for a distributed environment:
memory-risky
26. SqlContext Problems
2 cores × 256 executors
-> exhausted the Hive metastore DB connections (MySQL
connection limit: 2,000)
27. Fail Reason
Things I had neglected…
Sloppy about checking library dependencies
Had never faced a situation where local testing was impossible
No knowledge of programming for distributed environments
Editor's Notes
How do I set up the project?
How do I build the cluster?
How do I start the application?
Break the to-do list into small pieces
Why won't it run at all?
Why does it die suddenly mid-run?
Why is it so slow?
How do I make it fast?
Eliminate suspected causes one by one