Learn about Data Lake
Hadoop, Spark, ZooKeeper, Kafka, HBase, Solr,
Drill, Atlas, JupyterLab
© 2020 NetApp, Inc. All rights reserved.
SeungYong Baek
Senior Solutions Engineer / NetApp Korea
January 2022

Agenda
▪ Data lake overview
▪ Mimicking a data lake
  ▪ Hadoop, Spark, ZooKeeper, Kafka, HBase, Solr, Drill, Atlas
▪ Simple usage demo
  ▪ Drill, JupyterLab
▪ NetApp for data lakes

A Data Lake?
Data Lake Overview
https://kr.teradata.com/Glossary/What-is-a-Data-Lake
https://aws.amazon.com/ko/big-data/datalakes-and-analytics/what-is-a-data-lake/?nc=sn&loc=2
https://en.wikipedia.org/wiki/Data_lake
https://docs.microsoft.com/ko-kr/azure/architecture/data-guide/scenarios/data-lake
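What is a data lake?
Data lakes and data warehouses are both design patterns, but they are
opposites. A data warehouse structures and packages data for quality,
consistency, reuse, and performance with high concurrency. A data lake
complements the warehouse with a design pattern that focuses on original
raw data fidelity and long-term storage while providing a new form of
analytical agility.
- Teradata -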
What is a data lake?
A data lake is a centralized repository that allows you to store all your
structured and unstructured data at any scale. You can store your data
as-is, without having to first structure the data, and run different types
of analytics—from dashboards and visualizations to big data
processing, real-time analytics, and machine learning to guide better
decisions.
- AWS -
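Data lake
A data lake is a storage repository that holds a large amount of data in
its native, raw format. Data lake stores are optimized for scaling to
terabytes and petabytes of data. The data typically comes from multiple
sources and may be structured, semi-structured, or unstructured. The idea
of a data lake is to store everything in its original, untransformed
state. This approach differs from a traditional data warehouse, which
transforms and processes the data at the time of ingestion.
- Microsoft -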
Data lake
A data lake is a system or repository of data stored in its natural/raw
format,[1] usually object blobs or files. A data lake is usually a single store
of data including raw copies of source system data, sensor data, social
data etc.,[2] and transformed data used for tasks such
as reporting, visualization, advanced analytics and machine learning. A
data lake can include structured data from relational databases (rows and
columns), semi-structured data (CSV, logs, XML, JSON), unstructured
data (emails, documents, PDFs) and binary data (images, audio,
video).[3] A data lake can be established "on premises" (within an
organization's data centers) or "in the cloud" (using cloud services from
vendors such as Amazon, Microsoft, or Google).
Poorly managed data lakes have been facetiously called data swamps.[4]
- WIKIPEDIA -

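Data Lake vs. Data Warehouse
| Item | Data Lake (DL) | Data Warehouse (DW) | Notes |
| Data types | Structured, semi-structured, unstructured | Structured | |
| Schema | Schema on read | Schema on write | DL: structure is defined when data is read. DW: structure is defined when data is written. |
| Target users | All users | Some data users and analysts | DL: self-service oriented |
| Data scale | PB scale | TB scale | DL: store a lot, then discover value |
| Retention period | Medium to long term | Short to medium term | DL: store for a long time, then discover value |
| Scaling model | Scale-out | Scale-up or scale-out | |
| Underlying technology | Open-source big data technology | Commercial DBMS or appliance | DL: mainly the Hadoop ecosystem |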
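The schema-on-read row can be illustrated with a small shell sketch: the raw file is stored untouched, and a column meaning is applied only when the file is read. The helper name `read_with_schema` and the comma-separated layout are assumptions made for illustration and do not come from the deck.

```shell
#!/bin/sh
# Schema on read: the raw file stays as-is, and structure is imposed only
# at read time. The helper name and the CSV layout are hypothetical.

# read_with_schema FILE COLUMN_NUMBER
# Prints one column of a comma-separated raw file, skipping blank lines.
read_with_schema() {
    awk -F',' -v col="$2" 'NF > 0 { print $col }' "$1"
}
```

The same raw file can be read with different column choices later, for example `read_with_schema people.csv 1` today and `read_with_schema people.csv 3` once another question comes up, without rewriting the stored data.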

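A Source Repository for Large-Scale Data That Enables Data-Driven Innovation
Structured
- Relational data
- CSV, etc.
Semi-structured
- JSON, AVRO
- Parquet, ORC
- XML, HTML, etc.
Unstructured
- Images
- Audio
- Video
- Documents, etc.
Data users
- Analysts, general users
Data catalog
- Metadata, sample data, etc.
Data lake
Landing zone
- Raw data
Sensitive zone
Gold zone
- Refined data
Work zone
- Working data
1. Data ingestion
2. Data cataloging
3. Data search
4. Data use and analysis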
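As a rough illustration of the three data categories on this slide, the sketch below maps a file extension to a category. The grouping of CSV, JSON, AVRO, Parquet, ORC, XML, and HTML follows the slide; the remaining extensions (tsv, jpg, png, mp3, wav, mp4, pdf, docx) and the function name are assumptions.

```shell
#!/bin/sh
# classify_format FILENAME
# Maps a lowercase file extension to one of the slide's data categories.
# The extension lists are illustrative, not an exhaustive rule.
classify_format() {
    case "${1##*.}" in
        csv|tsv)                        echo "structured" ;;
        json|avro|parquet|orc|xml|html) echo "semi-structured" ;;
        jpg|png|mp3|wav|mp4|pdf|docx)   echo "unstructured" ;;
        *)                              echo "unknown" ;;
    esac
}
```

A catalog step could call `classify_format "$file"` for each newly landed file and record the result as metadata.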

Expected Governance Level by Data Zone
https://www.oreilly.com/library/view/the-enterprise-big/9781491931547/ch01.html

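Data Lake Maturity
Term: Description
▪ Data puddle: A single-purpose or single-project data mart built with big data technology.
▪ Data pond: A collection of data puddles.
▪ Data lake (enterprise data lake platform):
  - Supports self-service so that business users can find and use the data sets they need without help from the IT department, and aims to store data that business users may need later even if no project requires it right now. – The Enterprise Big Data Lake –
  - An enterprise-wide data platform that collects large volumes of diverse data from across the company with low latency (that is, big data) so that every member of the company can find, understand, obtain, and analyze the data they need on their own. – 차세대 빅데이터 플랫폼 DATA LAKE –
▪ Data ocean: All of a company's data, wherever it resides and whether or not it is stored in the data lake, can be used for self-service and data-driven decision making.
▪ Data swamp: A data pond that has grown to the size of a data lake but is not properly used by its users.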

Data Lake Maturity

Data Use Case – Public Data Portal

Hadoop Ecosystem for a Data Lake
Mimicking a Data Lake
https://kr.machbase.com/hadoop-ecosystem-%EC%97%B0%EB%8F%99%EC%9D%84-%EC%9C%84%ED%95%9C-kafka-%EC%9D%B4%EC%9A%A9/

Demo Environment – 3-Node Hadoop Cluster, 1 Client Node
| Item | Client | Node 1 | Node 2 | Node 3 | Notes |
| hostname | client | node-01 | node-02 | node-03 | Client node: optional |
| OS | Debian 11 | Debian 11 | Debian 11 | Debian 11 | |
| IP | 192.168.15.190 | 192.168.15.191 | 192.168.15.192 | 192.168.15.193 | |

Demo Environment – Detailed Configuration per Node
| Service | Version | node-01 | node-02 | node-03 | client | Service dependencies | Web UI |
| Hadoop | 3.3.1 | Primary NameNode, DataNode | Secondary NameNode, DataNode | DataNode | Client | N/A | http://192.168.15.191:9870, http://192.168.15.192:9868, http://192.168.15.191:8088, http://192.168.15.191:19888 |
| Spark | 3.2.0 | Master, Worker | Worker | Worker | N/A | Spark Standalone: N/A | http://192.168.15.191:8080 |
| Spark on YARN | 3.2.0 | Member | Member | Member | N/A | Spark on YARN: YARN | N/A |
| ZooKeeper | 3.7.0 | Member | Member | Member | N/A | N/A | N/A |
| Kafka | 2.13-3.0.0 | Member | Member | Member | N/A | ZooKeeper | N/A |
| HBase | 2.4.8 | Master | Backup Master, RegionServer | RegionServer | N/A | Hadoop, ZooKeeper | http://192.168.15.191:16010 |
| Solr | 8.11.1 | Member | Member | Member | N/A | ZooKeeper | http://192.168.15.191:8983 |
| Drill | 1.19.0 | Member | Member | Member | Client or Single Cluster | ZooKeeper | http://192.168.15.191:8047 |
| Atlas | 3.0.0 | Server | N/A | N/A | Web Client | Hadoop, ZooKeeper, HBase, Solr | http://192.168.15.191:21000 |
| Maven | 3.8.4 | N/A | N/A | N/A | Atlas Build | N/A | N/A |
| JupyterLab | 3.2.5 | N/A | N/A | N/A | JupyterLab | N/A | http://192.168.15.190:8888 |
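The service dependency column implies a start order. A minimal sketch, assuming the standard `tsort` utility is available, derives one valid order from the dependencies listed in the table. The function name `start_order` and the lowercase service labels are made up for illustration.

```shell
#!/bin/sh
# start_order
# Prints the services one per line so that every service appears after
# the services it depends on. Each input pair is "dependency dependent",
# taken from the service dependency column of the table.
start_order() {
    tsort <<'EOF'
zookeeper kafka
hadoop hbase
zookeeper hbase
zookeeper solr
zookeeper drill
hadoop atlas
zookeeper atlas
hbase atlas
solr atlas
EOF
}
```

Several orders satisfy the same constraints, so the exact output can differ between `tsort` implementations; stopping the services would follow the reverse order.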

Demo Environment – Checking, Starting, and Stopping Services

Processes reported by jps on each node
▪ node-01: Atlas, QuorumPeerMain, DataNode, Master, Jps, HMaster, Kafka, NameNode, ResourceManager, NodeManager, Worker, Drillbit
▪ node-02: DataNode, NodeManager, Drillbit, QuorumPeerMain, Kafka, Worker, HMaster, Jps, HRegionServer, SecondaryNameNode
▪ node-03: DataNode, Kafka, QuorumPeerMain, NodeManager, HRegionServer, Worker, Jps, Drillbit
▪ client: Jps, Drillbit

Start and stop commands per service
Hadoop – run on the NameNode
$HADOOP_HOME/sbin/start-dfs.sh && $HADOOP_HOME/sbin/start-yarn.sh
$HADOOP_HOME/sbin/stop-yarn.sh && $HADOOP_HOME/sbin/stop-dfs.sh
Spark – run on the Master
$SPARK_HOME/sbin/start-all.sh
$SPARK_HOME/sbin/stop-all.sh
Spark on YARN – N/A (it runs through YARN, so there is no separate start or stop)
ZooKeeper – run on all nodes
$ZK_HOME/bin/zkServer.sh start
$ZK_HOME/bin/zkServer.sh status
$ZK_HOME/bin/zkServer.sh stop
Kafka – run on all nodes
$KAFKA_HOME/bin/kafka-server-start.sh -daemon $KAFKA_HOME/config/server.properties
$KAFKA_HOME/bin/kafka-server-stop.sh
HBase – run on the Master
$HBASE_HOME/bin/start-hbase.sh
$HBASE_HOME/bin/stop-hbase.sh
Solr – run on all nodes
$SOLR_BIN/solr start -cloud
$SOLR_BIN/solr status
$SOLR_BIN/solr stop -all
Drill – run on all nodes
$DRILL_HOME/bin/drillbit.sh start
$DRILL_HOME/bin/drillbit.sh status
$DRILL_HOME/bin/drillbit.sh stop
Atlas – run on the Server
$ATLAS_HOME/bin/atlas_start.py
$ATLAS_HOME/bin/atlas_stop.py
JupyterLab – run on the client
$ jupyter-lab > .jupyter/jupyter.log 2>&1 &
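Checking the jps process lists by eye is error-prone, so a small helper can report which expected daemons are missing on a node. This is a sketch only; the function name `missing_daemons` is an assumption, and the expected names passed to it should come from the per-node process lists on this slide.

```shell
#!/bin/sh
# missing_daemons JPS_OUTPUT EXPECTED_NAME...
# Prints each expected daemon name that is absent from the jps output.
missing_daemons() {
    jps_output=$1
    shift
    for daemon in "$@"; do
        # jps prints "<pid> <name>", so compare against the second field.
        printf '%s\n' "$jps_output" |
            awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }' ||
            echo "$daemon"
    done
}
```

On node-03, for example, it could be called as `missing_daemons "$(jps)" DataNode NodeManager QuorumPeerMain Kafka HRegionServer Worker Drillbit`; empty output means every listed daemon is running.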

Apache Hadoop – Overview: Distributed Computing Framework and Distributed File System
Apache Hadoop
The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across
clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of
machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library
itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a
cluster of computers, each of which may be prone to failures.
Modules
• Hadoop Common: The common utilities that support the other Hadoop modules.
• Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application
data.
• Hadoop YARN: A framework for job scheduling and cluster resource management.
• Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
Related projects
Ambari™, Avro™, Cassandra™, Chukwa™, HBase™, Hive™, Mahout™, Ozone™, Pig™, Spark™, Submarine, Tez™,
ZooKeeper™
https://hadoop.apache.org/

Apache Hadoop – Configuration #1
https://hadoop.apache.org/docs/current/
1. Install and configure Hadoop – NameNode
1) Install Hadoop
seungyong@node-01:~$ sudo apt update
seungyong@node-01:~$ sudo apt upgrade
seungyong@node-01:~$ sudo apt install default-jdk git
seungyong@node-01:~$ git
seungyong@node-01:~$ java -version
seungyong@node-01:~$ wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
seungyong@node-01:~$ tar zvxf hadoop-3.3.1.tar.gz
seungyong@node-01:~$ ln -s hadoop-3.3.1 hadoop
2) $HADOOP_HOME/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/default-java
export HADOOP_HOME=/home/seungyong/hadoop
3) $HOME/.bashrc
# JAVA
# Hadoop
export HADOOP_HOME=/home/seungyong/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export PDSH_RCMD_TYPE=ssh
4) /etc/hosts
192.168.15.190 client
192.168.15.191 node-01
192.168.15.192 node-02
192.168.15.193 node-03

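2. Install Hadoop – DataNodes
1) Clone the NameNode virtual machine
2) Change the hostname and IP address
$ sudo hostnamectl set-hostname node-02
$ sudo hostnamectl set-hostname node-03
3) Distribute the SSH key
seungyong@node-01:~$ ssh-keygen -t rsa
seungyong@node-01:~$ ssh-copy-id seungyong@node-01
seungyong@node-01:~$ ssh seungyong@node-01
** Repeat ssh-copy-id for node-02 and node-03 so that node-01 can reach every node without a password
3. Configure Hadoop – all nodes
** Create the directories named in the .dir properties of the configuration files manually in advance
** /home/seungyong/namenode, namesecondary, datanode
** See Appendix #1 for the configuration files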
1) $HADOOP_HOME/etc/hadoop/core-site.xml
2) $HADOOP_HOME/etc/hadoop/hdfs-site.xml
3) $HADOOP_HOME/etc/hadoop/yarn-site.xml
4) $HADOOP_HOME/etc/hadoop/mapred-site.xml
5) $HADOOP_HOME/etc/hadoop/workers
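The note above asks for the .dir directories to be created by hand before the services start. A minimal sketch of that step follows; the three directory names are taken from the slide, while the function name `make_hdfs_dirs` is an assumption.

```shell
#!/bin/sh
# make_hdfs_dirs BASE_DIR
# Creates the namenode, namesecondary, and datanode directories under
# BASE_DIR, and stops at the first failure.
make_hdfs_dirs() {
    for d in namenode namesecondary datanode; do
        mkdir -p "$1/$d" || return 1
    done
}
```

In the demo environment this would be run as `make_hdfs_dirs /home/seungyong` on every node. Creating all three directories everywhere is harmless even though each node only uses the ones matching its role.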

4. Verify the Hadoop configuration and services
1) Format HDFS and start the services
seungyong@node-01:~/hadoop$ hdfs namenode -format cluster-01
seungyong@node-01:~/hadoop$ sbin/start-dfs.sh && sbin/start-yarn.sh
Starting namenodes on [node-01]
Starting datanodes
Starting secondary namenodes [node-02]
Starting resourcemanager
Starting nodemanagers
seungyong@node-01:~/hadoop$
seungyong@node-01:~$ jps
4755 ResourceManager
4292 NameNode
4858 NodeManager
4396 DataNode
5262 Jps
** Also check node-02 and node-03
seungyong@node-01:~/hadoop$ hdfs dfsadmin -report
Configured Capacity: 310706749440 (289.37 GB)
Present Capacity: 262988857344 (244.93 GB)
DFS Remaining: 262988783616 (244.93 GB)
DFS Used: 73728 (72 KB)
DFS Used%: 0.00%
(output omitted)
seungyong@node-01:~/hadoop$ hadoop fs -df -h
Filesystem Size Used Available Use%
hdfs://node-01:9000 289.4 G 72 K 244.9 G 0%
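The sizes in parentheses are the byte counts divided by 1024 three times, so the "GB" shown here is a binary unit. A small sketch reproduces the conversion; the function name `bytes_to_gib` is an assumption.

```shell
#!/bin/sh
# bytes_to_gib BYTES
# Converts a byte count to binary gigabytes (1024^3 bytes), rounded to
# two decimal places.
bytes_to_gib() {
    awk -v bytes="$1" 'BEGIN { printf "%.2f\n", bytes / (1024 * 1024 * 1024) }'
}
```

For the report above, `bytes_to_gib 310706749440` gives 289.37 and `bytes_to_gib 262988857344` gives 244.93, matching the Configured Capacity and Present Capacity lines.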
(Optional) Job History server
seungyong@node-01:~$ $HADOOP_HOME/bin/mapred --daemon start historyserver
seungyong@node-01:~$ $HADOOP_HOME/bin/mapred --daemon stop historyserver
2) Check the web UIs
NameNode http://nn_host:port/ Default HTTP port is 9870.
ResourceManager http://rm_host:port/ Default HTTP port is 8088.
MapReduce JobHistory Server http://jhs_host:port/ Default HTTP port is 19888.

5. MapReduce test
seungyong@node-01:~$ hdfs dfs -mkdir /user
seungyong@node-01:~$ hdfs dfs -mkdir /user/seungyong/
seungyong@node-01:~$ hdfs dfs -mkdir input
seungyong@node-01:~$ hdfs dfs -put hadoop/etc/hadoop/*.xml input
seungyong@node-01:~$ hdfs dfs -ls /user/seungyong/input
Found 10 items
-rw-r--r-- 3 seungyong supergroup 9213 2021-12-14 10:16 /user/seungyong/input/capacity-scheduler.xml
(output omitted)
seungyong@node-01:~$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.1.jar grep input output 'dfs[a-z.]+'
2021-12-19 14:53:00,031 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at node-01/192.168.15.191:8032
(output omitted)
File Output Format Counters
Bytes Written=119
seungyong@node-01:~$
seungyong@node-01:~$ hdfs dfs -cat output/*
1 dfsadmin
1 dfs.namenode.secondary.http
1 dfs.namenode.name.dir
1 dfs.namenode.checkpoint.dir
1 dfs.datanode.data.dir
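To sanity-check the job result without a cluster, the same count can be approximated locally on the files that were uploaded. This is a sketch, not the MapReduce implementation: the function name `local_grep_count` is an assumption, and it relies on the `-o` and `-h` options of GNU or BSD grep.

```shell
#!/bin/sh
# local_grep_count PATTERN FILE...
# Counts each distinct match of an extended regular expression and prints
# "<count> <match>" lines, most frequent first.
local_grep_count() {
    pattern=$1
    shift
    grep -ohE -- "$pattern" "$@" | sort | uniq -c | sort -k1,1nr -k2 |
        awk '{ print $1, $2 }'
}
```

Running `local_grep_count 'dfs[a-z.]+' hadoop/etc/hadoop/*.xml` on node-01 should list the same matches as `hdfs dfs -cat output/*`, although the order of matches with equal counts may differ.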
6. Stop the services
seungyong@node-01:~/hadoop$ sbin/stop-yarn.sh && sbin/stop-dfs.sh
7. (Optional) Client configuration
** Clone a separate client node
** Install Hadoop, then configure only .bashrc and core-site.xml
** HDFS can then be used with the hadoop and hdfs commands

Apache Hadoop – Resource Manager, NameNode UI

Apache Hadoop – Hadoop Distributed File System

Apache Spark – Overview: In-Memory Distributed Computing Framework
https://spark.apache.org/, https://docs.microsoft.com/ko-kr/dotnet/spark/what-is-spark, https://databricks.com/blog/2013/11/21/putting-spark-to-use.html
What is Apache Spark™?
Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on single-
node machines or clusters.
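Apache Spark is an open-source parallel processing framework that
supports in-memory processing to improve the performance of applications
that analyze big data. Big data solutions are designed to handle data
that is too large or too complex for traditional databases. Spark
processes large amounts of data in memory, which is much faster than
disk-based alternatives.
- Microsoft -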

Apache Spark – Standalone Configuration #1
https://spark.apache.org/docs/latest/quick-start.html, https://spark.apache.org/docs/latest/
1. Install and configure Spark
1) Install Spark – all nodes
$ sudo apt install scala
$ wget https://downloads.apache.org/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz
$ tar zvxf spark-3.2.0-bin-hadoop3.2.tgz
$ ln -s spark-3.2.0-bin-hadoop3.2 spark
2) $HOME/.bashrc – all nodes
# JAVA
# Spark
export SPARK_HOME=/home/seungyong/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
3) $SPARK_HOME/conf/workers – master node
seungyong@node-01:~$ cd $SPARK_HOME/conf
seungyong@node-01:~/spark/conf$ cp workers.template workers
seungyong@node-01:~/spark/conf$ vi workers
node-01
node-02
node-03
2. Start and verify the service
1) Start the service
seungyong@node-01:~$ cd spark
seungyong@node-01:~/spark$ ./sbin/start-all.sh
1669 Worker
1589 Master
1765 Jps
1717 Jps
1641 Worker
1757 Jps
1567 Worker
2) Check the web UI
http://192.168.15.191:8080/

Apache Spark – Standalone Configuration #2
https://spark.apache.org/docs/latest/quick-start.html, https://spark.apache.org/docs/latest/
3. Test and verify
1) Spark Shell
seungyong@node-01:~/spark$ bin/spark-shell --master spark://node-01:7077
(output omitted)
Welcome to
____ __
/ __/__ ___ _____/ /__
_ / _ / _ `/ __/ '_/
/___/ .__/_,_/_/ /_/_ version 3.2.0
/_/
Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 11.0.13)
Type in expressions to have them evaluated.
Type :help for more information.
scala> val textFile = spark.read.textFile("README.md")
textFile: org.apache.spark.sql.Dataset[String] = [value: string]
scala> textFile.count()
res0: Long = 109
scala> textFile.first()
res1: String = # Apache Spark
2) spark-submit
seungyong@node-01:~/spark$ spark-submit \
--master spark://node-01:7077 \
--class org.apache.spark.examples.SparkPi \
~/spark/examples/jars/spark-examples*.jar \
10
21/12/19 14:44:25 INFO SparkContext: Running Spark version 3.2.0
21/12/19 14:44:25 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
(output omitted)
21/12/19 14:44:33 INFO ShutdownHookManager: Deleting directory /tmp/spark-efff028e-a377-455f-bca3-ae037da1ec74
seungyong@node-01:~/spark$
4. Stop the service
seungyong@node-01:~/spark$ ./sbin/stop-all.sh

Apache Spark – Standalone Web UI

Apache Spark on YARN – Configuration #1
https://spark.apache.org/docs/latest/running-on-yarn.html
1. Configure Spark – all nodes
1) $SPARK_HOME/conf/spark-env.sh
export SPARK_HOME=/home/seungyong/spark
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
2. Test and verify
1) spark-submit
seungyong@node-01:~$ spark-submit \
--master yarn \
--deploy-mode cluster \
--class org.apache.spark.examples.SparkPi \
~/spark/examples/jars/spark-examples*.jar \
10
2021-12-19 14:57:20,965 INFO yarn.Client: Requesting a new application from cluster with 3 NodeManagers
(output omitted)
2021-12-19 14:58:03,395 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-60360f96-47e4-4ca9-ae36-37779a812269
2) Spark Shell
seungyong@node-01:~/spark$ bin/spark-shell --master yarn --deploy-mode client
(output omitted)
scala> val textFile = spark.read.textFile("/user/seungyong/input/core-site.xml")
textFile: org.apache.spark.sql.Dataset[String] = [value: string]
scala> textFile.show(30)
+--------------------+
| value|
+--------------------+
|<?xml version="1....|
(output omitted)
scala> val myfile = sc.textFile("hdfs://node-01:9000/user/seungyong/input")
myfile: org.apache.spark.rdd.RDD[String] = hdfs://node-01:9000/user/seungyong/input MapPartitionsRDD[4] at textFile at <console>:23
scala> val counts = myfile.flatMap(line => line.split(" ")).map(word => (word,1)).reduceByKey(_ + _)
counts: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[7] at reduceByKey at <console>:23
scala> counts.saveAsTextFile("hdfs://node-01:9000/user/seungyong/output")
scala> :quit
seungyong@node-01:~/spark$ hdfs dfs -cat output/*
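The word count in the Spark shell session follows a flatMap, map, reduceByKey pattern. As a rough local comparison, the sketch below does the same aggregation with an awk associative array. The function name `word_count` is an assumption, and unlike `line.split(" ")` it splits on runs of whitespace, so it never produces empty tokens.

```shell
#!/bin/sh
# word_count [FILE...]
# Splits every line into whitespace-separated words (the flatMap step),
# increments a per-word counter (map plus reduceByKey), and prints
# "<word> <count>" lines sorted by word.
word_count() {
    awk '{ for (i = 1; i <= NF; i++) counts[$i]++ }
         END { for (w in counts) print w, counts[w] }' "$@" | LC_ALL=C sort
}
```

It reads standard input when no file is given, for example `hdfs dfs -cat input/core-site.xml | word_count`.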

Apache ZooKeeper – Overview: Distributed System Coordination
https://zookeeper.apache.org/ , https://kr.cloudera.com/products/open-source/apache-hadoop/apache-zookeeper.html, https://developer-woong.tistory.com/11
What is ZooKeeper?
ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization,
and providing group services. All of these kinds of services are used in some form or another by distributed applications.
Each time they are implemented there is a lot of work that goes into fixing the bugs and race conditions that are inevitable.
Because of the difficulty of implementing these kinds of services, applications initially usually skimp on them, which make
them brittle in the presence of change and difficult to manage. Even when done correctly, different implementations of these
services lead to management complexity when the applications are deployed.
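Apache ZooKeeper
An open-source server that reliably coordinates distributed processes
Apache ZooKeeper provides operational services for a Hadoop cluster,
including a distributed configuration service, a synchronization
service, and a naming registry for distributed systems. Distributed
applications use ZooKeeper to store and coordinate updates to important
configuration information.
- Cloudera -
▪ Distributes a service across multiple nodes
▪ Shares and synchronizes the nodes' processing results to keep data consistent
▪ Hands the service over to another node when the primary node fails
▪ Centrally manages the configuration of the distributed nodes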

Apache ZooKeeper – Configuration #1
https://zookeeper.apache.org/doc/r3.7.0/zookeeperAdmin.html, https://zookeeper.apache.org/doc/r3.7.0/zookeeperStarted.html
1. Install and configure ZooKeeper – all nodes
1) Install ZooKeeper
$ wget https://dlcdn.apache.org/zookeeper/zookeeper-3.7.0/apache-zookeeper-3.7.0-bin.tar.gz
$ tar zvxf apache-zookeeper-3.7.0-bin.tar.gz
$ ln -s apache-zookeeper-3.7.0-bin zookeeper
2) $HOME/.bashrc
# ZooKeeper
export ZK_HOME=/home/seungyong/zookeeper
export PATH=$PATH:$ZK_HOME/bin
3) $ZK_HOME/conf/zoo.cfg
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/home/seungyong/zookeeper/data
clientPort=2181
server.1=node-01:2888:3888
server.2=node-02:2888:3888
server.3=node-03:2888:3888
4) Create the ID file
$ cd zookeeper; mkdir data; cd data
seungyong@node-01:~/zookeeper/data$ echo 1 > myid
** On node-02 write 2 and on node-03 write 3, matching the server.N entries in zoo.cfg
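Because the myid value must match the server.N line for the same host in zoo.cfg, it can be looked up instead of typed by hand. A minimal sketch follows, assuming short hostnames without dots as in this demo; the function name `myid_for` is an assumption.

```shell
#!/bin/sh
# myid_for HOST ZOO_CFG
# Prints N from the "server.N=HOST:port:port" line that names HOST.
# Prints nothing when the host is not listed in the file.
myid_for() {
    awk -F'[.=:]' -v host="$1" \
        '$1 == "server" && $3 == host { print $2 }' "$2"
}
```

On each node this could be used as `myid_for "$(hostname)" "$ZK_HOME/conf/zoo.cfg" > "$ZK_HOME/data/myid"`, so all three nodes run the same command.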
2. Start and verify the service – all nodes
$ $ZK_HOME/bin/zkServer.sh start
seungyong@node-01:~/zookeeper/bin$ ./zkServer.sh status
ZooKeeper JMX enabled by default
Using config: /home/seungyong/zookeeper/bin/../conf/zoo.cfg
Client port found: 2181. Client address: localhost. Client SSL: false.
Mode: leader
seungyong@node-03:~/zookeeper/bin$
3. CLI connection test
seungyong@node-01:~/zookeeper/bin$ ./zkCli.sh
(output omitted)
[zk: localhost:2181(CONNECTED) 1] ls /
[zookeeper]
[zk: localhost:2181(CONNECTED) 2]
4. Stop the service – all nodes
$ $ZK_HOME/bin/zkServer.sh stop

Apache Kafka – Overview: Distributed Messaging System
https://kafka.apache.org/, https://www.redhat.com/ko/topics/integration/what-is-apache-kafka
APACHE KAFKA
Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance
data pipelines, streaming analytics, data integration, and mission-critical applications.
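What is Apache Kafka?
Apache Kafka is a distributed data streaming platform that can publish,
subscribe to, store, and process streams of records in real time. It is
designed to handle data streams from multiple sources and deliver them
to multiple consumers. In short, it moves large volumes of data not only
from point A to point B, but from point A to point Z and anywhere else
it is needed, all at the same time.
Apache Kafka is an alternative to a traditional enterprise messaging
system. It started as an internal system developed by LinkedIn to handle
1.4 trillion messages per day, and it is now an open-source data
streaming solution with applications for a variety of enterprise needs.
- Red Hat -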

Apache Kafka – Configuration #1
https://kafka.apache.org/documentation/#quickstart
1. Install and configure Kafka – all nodes
1) Install Kafka
$ wget https://dlcdn.apache.org/kafka/3.0.0/kafka_2.13-3.0.0.tgz
$ tar zvxf kafka_2.13-3.0.0.tgz
$ ln -s kafka_2.13-3.0.0 kafka
2) $HOME/.bashrc
# Kafka
export KAFKA_HOME=/home/seungyong/kafka
export PATH=$PATH:$KAFKA_HOME/bin
3) $KAFKA_HOME/config/server.properties
$ cd $KAFKA_HOME
$ mkdir data
$ cd config
seungyong@node-01:~/kafka/config$ vi server.properties
broker.id=1
log.dirs=/home/seungyong/kafka/data
zookeeper.connect=node-01:2181,node-02:2181,node-03:2181/kafka
** Set broker.id=2 on node-02 and broker.id=3 on node-03
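Since only broker.id differs between the three nodes, the edit can be scripted rather than made in an editor. The sketch below rewrites that single line and leaves the rest of the file untouched. The function name `set_broker_id` is an assumption, and it expects the file to already contain a broker.id line.

```shell
#!/bin/sh
# set_broker_id ID PROPERTIES_FILE
# Replaces the value of the existing broker.id line. A temporary file is
# used instead of "sed -i" to stay portable across sed implementations.
set_broker_id() {
    tmp=$(mktemp) || return 1
    sed "s/^broker\.id=.*/broker.id=$1/" "$2" > "$tmp" && cat "$tmp" > "$2"
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

On node-02 the call would be `set_broker_id 2 "$KAFKA_HOME/config/server.properties"`.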
2. Start and verify the service – all nodes
$ cd $KAFKA_HOME
$ bin/kafka-server-start.sh -daemon config/server.properties
seungyong@node-01:~/kafka/bin$ jps
$ $ZK_HOME/bin/zkCli.sh
(output omitted)
[kafka, zookeeper]
[zk: localhost:2181(CONNECTED) 2] ls /kafka/brokers/ids
[1, 2, 3]

Apache Kafka – Configuration #2
https://kafka.apache.org/documentation/#quickstart
3. Produce and consume messages
seungyong@node-01:~/kafka$ bin/kafka-topics.sh --create --partitions 1 --replication-factor 1 --topic quickstart-events --bootstrap-server node-01:9092
Created topic quickstart-events.
seungyong@node-01:~/kafka$ bin/kafka-topics.sh --describe --topic quickstart-events --bootstrap-server node-01:9092
Topic: quickstart-events TopicId: uFw7D0r5Sy6FzLoAlyjlTA PartitionCount: 1 ReplicationFactor: 1 Configs: segment.bytes=1073741824
Topic: quickstart-events Partition: 0 Leader: 1 Replicas: 1 Isr: 1
seungyong@node-01:~/kafka$ bin/kafka-console-producer.sh --topic quickstart-events --bootstrap-server node-01:9092
>1
>2
>3
>^Cseungyong@node-01:~/kafka$
seungyong@node-01:~/kafka$ bin/kafka-console-consumer.sh --topic quickstart-events --from-beginning --bootstrap-server node-01:9092
1
2
3
^CProcessed a total of 3 messages
seungyong@node-01:~/kafka$ bin/kafka-topics.sh --list --bootstrap-server node-01:9092
__consumer_offsets
quickstart-events
seungyong@node-01:~/kafka$ bin/kafka-topics.sh --delete --bootstrap-server node-01:9092 --topic quickstart-events
4. Stop the service – all nodes
$ $KAFKA_HOME/bin/kafka-server-stop.sh

Apache HBase – Overview: Column-Oriented Distributed NoSQL Database
https://hbase.apache.org/, https://aws.amazon.com/ko/elasticmapreduce/details/hbase/, https://www.usenix.org/system/files/login/articles/login1210_khurana.pdf
Welcome to Apache HBase™
Apache HBase™ is the Hadoop database, a distributed, scalable, big data store.
Use Apache HBase™ when you need random, realtime read/write access to your Big Data. This project's goal is the hosting
of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware. Apache HBase is an
open-source, distributed, versioned, non-relational database modeled after Google's Bigtable: A Distributed Storage System
for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File
System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.
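Apache HBase is a highly scalable, distributed big data store in the
Apache Hadoop ecosystem. It is a versioned, non-relational, open-source
database that runs on top of the Hadoop Distributed File System (HDFS)
and is built to provide strictly consistent, real-time random access to
tables with billions of rows and millions of columns.
- AWS -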

Apache HBase – Configuration #1
https://hbase.apache.org/book.html#quickstart, https://hbase.apache.org/book.html#zookeeper
1. Install and configure HBase – all nodes
1) Install HBase
$ wget https://dlcdn.apache.org/hbase/stable/hbase-2.4.8-bin.tar.gz
$ tar zvxf hbase-2.4.8-bin.tar.gz
$ ln -s hbase-2.4.8 hbase
2) $HOME/.bashrc
# HBase
export HBASE_HOME=/home/seungyong/hbase
export PATH=$PATH:$HBASE_HOME/bin
3) $HBASE_HOME/conf/hbase-env.sh
export JAVA_HOME=/usr/lib/jvm/default-java/
export HBASE_MANAGES_ZK=false → set when ZooKeeper is configured separately
export HBASE_DISABLE_HADOOP_CLASSPATH_LOOKUP=true → set when Hadoop is installed on the same node
4) $HBASE_HOME/conf/hbase-site.xml
<configuration>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.rootdir</name>
<value>hdfs://node-01:9000/hbase</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>node-01,node-02,node-03</value>
</property>
<property>
<name>hbase.zookeeper.property.clientPort</name>
<value>2181</value>
</property>
</configuration>
5) $HBASE_HOME/conf/regionservers
node-02
node-03
6) $HBASE_HOME/conf/backup-masters
node-02

2. Start and verify the service
1) Start the service
seungyong@node-01:~/hbase$ bin/start-hbase.sh
running master, logging to /home/seungyong/hbase/bin/../logs/hbase-seungyong-master-node-01.out
node-03: running regionserver, logging to /home/seungyong/hbase/bin/../logs/hbase-seungyong-regionserver-node-03.out
node-02: running regionserver, logging to /home/seungyong/hbase/bin/../logs/hbase-seungyong-regionserver-node-02.out
node-02: running master, logging to /home/seungyong/hbase/bin/../logs/hbase-seungyong-master-node-02.out
seungyong@node-01:~/hbase$ jps
2) Check the web UI
http://192.168.15.191:16010/
3. Check HDFS and ZooKeeper
seungyong@node-02:~$ hadoop fs -ls /
Found 3 items
drwxr-xr-x - seungyong supergroup 0 2021-12-19 19:55 /hbase
drwx------ - seungyong supergroup 0 2021-12-19 14:53 /tmp
drwxr-xr-x - seungyong supergroup 0 2021-12-19 14:52 /user
seungyong@node-02:~$ hadoop fs -ls /hbase/
Found 12 items
drwxr-xr-x - seungyong supergroup 0 2021-12-19 15:35 /hbase/.hbck
drwxr-xr-x - seungyong supergroup 0 2021-12-19 15:35 /hbase/MasterData
(output omitted)
drwxr-xr-x - seungyong supergroup 0 2021-12-19 15:40 /hbase/oldWALs
drwx--x--x - seungyong supergroup 0 2021-12-19 15:35 /hbase/staging
seungyong@node-03:~/zookeeper/bin$ ./zkCli.sh
(output omitted)
[hbase, kafka, zookeeper]
[zk: localhost:2181(CONNECTED) 1] ls /hbase
[backup-masters, draining, flush-table-proc, hbaseid, master, master-maintenance,
meta-region-server, namespace, online-snapshot, rs, running, splitWAL, switch,
table]

4. Create and check a table
seungyong@node-01:~/hbase$ bin/hbase shell
(output omitted)
HBase Shell
Use "help" to get list of supported commands.
Use "exit" to quit this interactive shell.
For Reference, please visit: http://hbase.apache.org/2.0/book.html#shell
Version 2.4.8, rf844d09157d9dce6c54fcd53975b7a45865ee9ac, Wed Oct 27 08:48:57 PDT 2021
Took 0.0038 seconds
hbase:001:0> create 'test', 'cf'
Created table test
Took 2.8753 seconds
=> Hbase::Table - test
hbase:002:0> list 'test'
TABLE
test
1 row(s)
Took 0.0336 seconds
=> ["test"]
hbase:003:0> describe 'test'
Table test is ENABLED
test
COLUMN FAMILIES DESCRIPTION
{NAME => 'cf', BLOOMFILTER => 'ROW', IN_MEMORY => 'false', VERSIONS => '1', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', COMPRESSION => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
1 row(s)
Quota is disabled
Took 0.4140 seconds
hbase:004:0> put 'test', 'row1', 'cf:a', 'value1'
Took 0.3752 seconds
hbase:005:0> scan 'test'
ROW COLUMN+CELL
row1 column=cf:a, timestamp=2021-12-19T15:36:25.910, value=value1
1 row(s)
Took 0.0924 seconds

hbase:006:0> get 'test', 'row1'
COLUMN CELL
cf:a timestamp=2021-12-19T15:36:25.910, value=value1
1 row(s)
Took 0.0190 seconds
hbase:007:0> disable 'test'
Took 1.2562 seconds
hbase:008:0> drop 'test'
Took 0.7229 seconds
hbase:009:0>
seungyong@node-01:~/hbase$
5. Stop the service
seungyong@node-01:~/hbase$ bin/stop-hbase.sh
stopping hbase...............
seungyong@node-01:~/hbase$

Apache HBase – Web UI

Apache Solr – Overview: Search Engine
https://solr.apache.org/, https://solr.apache.org/guide/8_11/a-quick-overview.html
Apache Solr
Solr is highly reliable, scalable and fault tolerant, providing distributed indexing, replication and load-balanced querying,
automated failover and recovery, centralized configuration and more. Solr powers the search and navigation features of
many of the world's largest internet sites.
A Quick Overview
Solr is a search server built on top of Apache Lucene, an open
source, Java-based, information retrieval library. It is designed to
drive powerful document retrieval applications - wherever you
need to serve data to users based on their queries, Solr can work
for you.
Here is a example of how Solr could integrate with an application:

Apache Solr – Configuration #1
https://solr.apache.org/guide/8_11/solr-tutorial.html, https://solr.apache.org/guide/8_2/setting-up-an-external-zookeeper-ensemble.html
1. Install and configure Solr – all nodes
1) Install Solr
$ wget https://dlcdn.apache.org/lucene/solr/8.11.1/solr-8.11.1.tgz
$ tar zvxf solr-8.11.1.tgz
$ ln -s solr-8.11.1 solr
2) $HOME/.bashrc
# Solr
export SOLR_BIN=/home/seungyong/solr/bin
export PATH=$PATH:$SOLR_BIN
3) $SOLR_BIN/solr.in.sh
ZK_HOST=node-01:2181,node-02:2181,node-03:2181/solr
ZK_CREATE_CHROOT=true
4) $HOME/zookeeper/conf/zoo.cfg
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/home/seungyong/zookeeper/data
clientPort=2181
4lw.commands.whitelist=mntr,conf,ruok → added to the existing ZooKeeper configuration
server.1=node-01:2888:3888
server.2=node-02:2888:3888
server.3=node-03:2888:3888
2. Start and verify the services
1) Restart ZooKeeper – all nodes
$ $ZK_HOME/bin/zkServer.sh stop
$ $ZK_HOME/bin/zkServer.sh start
$ $ZK_HOME/bin/zkServer.sh status
(output omitted)
[hbase, kafka, solr, zookeeper]
[zk: localhost:2181(CONNECTED) 1] ls /solr
[aliases.json, autoscaling, autoscaling.json, clusterstate.json, collections, configs,
live_nodes, overseer, overseer_elect, security.json]
[zk: localhost:2181(CONNECTED) 2] ls /solr/live_nodes
[192.168.15.191:8983_solr, 192.168.15.192:8983_solr, 192.168.15.193:8983_solr]

2) Start and verify Solr – all nodes
$ $SOLR_BIN/solr start -cloud
$ $SOLR_BIN/solr status
seungyong@node-01:~/solr$ bin/solr status
Found 1 Solr nodes:
Solr process 15865 running on port 8983
{
"solr_home":"/home/seungyong/solr/server/solr",
"version":"8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:50:55",
(output omitted)
"memory":"90 MB (%17.6) of 512 MB",
"cloud":{
"ZooKeeper":"node-01:2181,node-02:2181,node-03:2181/solr",
"liveNodes":"3",
"collections":"0"}}
3) Check the web UI
http://192.168.15.191:8983
3. Create a collection and an index
1) Create a collection
seungyong@node-01:~/solr$ bin/solr create -c collection-01 -shards 3 -replicationFactor 3
(output omitted)
Created collection 'collection-01' with 3 shard(s), 3 replica(s) with config-set 'collection-01'
Found 1 Solr nodes:
Solr process 15865 running on port 8983
{
"solr_home":"/home/seungyong/solr/server/solr",
"version":"8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:50:55",
(output omitted)
"collections":"1"}}
2) Create an index
seungyong@node-01:~/solr$ bin/post -c collection-01 example/exampledocs/*
/usr/lib/jvm/default-java/bin/java -classpath /home/seungyong/solr/dist/solr-core-8.11.1.jar -Dauto=yes -Dc=collection-01 -Ddata=files example/exampledocs/books.json example/exampledocs/gb18030-example.xml
(output omitted)
POSTing file vidcard.xml (application/xml) to [base]
21 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/collection-01/update...
Time spent: 0:00:26.616

4. Query test
1) Web UI → collection-01 → Query → Execute Query
2) Web browser → http://192.168.15.192:8983/solr/collection-01/select?q=id:SP2514N&wt=xml
3) curl
seungyong@node-01:~/solr$ curl 'http://192.168.15.192:8983/solr/collection-01/select?q=id:SP2514N'
{
"responseHeader":{
"zkConnected":true,
"params":{
"q":"id:SP2514N"}},
(output omitted)
"store":["35.0752,-97.032"],
"_version_":1719672535376723968}]
}}
seungyong@node-01:~/solr$
5. Stop the service – all nodes
$ $SOLR_BIN/solr stop -all

Apache Solr – Web UI

Apache Drill – Overview: SQL Query Engine
https://drill.apache.org/
Apache Drill
Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage
Kiss the overhead goodbye and enjoy data agility
Query any non-relational datastore (well, almost...)
Treat your data like a table even when it's not
Keep using the BI tools you love

Apache Drill – Configuration #1
https://drill.apache.org/docs/install-drill-introduction/, https://drill.apache.org/docs/drill-in-10-minutes/
1. Install and configure Drill – all nodes
1) Install Drill
$ wget https://dlcdn.apache.org/drill/drill-1.19.0/apache-drill-1.19.0.tar.gz
$ tar zvxf apache-drill-1.19.0.tar.gz
$ ln -s apache-drill-1.19.0 drill
2) $HOME/.bashrc
# Drill
export DRILL_HOME=/home/seungyong/drill
export PATH=$PATH:$DRILL_HOME/bin
3) $DRILL_HOME/conf/drill-override.conf
drill.exec: {
cluster-id: "drill-cluster-01",
zk.root: "drill",
zk.connect: "node-01:2181,node-02:2181,node-03:2181"
}
$ $DRILL_HOME/bin/drillbit.sh start
$ $DRILL_HOME/bin/drillbit.sh status
drillbit is running.
........................... (omitted) ..............................
[drill, hbase, kafka, solr, zookeeper]
[zk: localhost:2181(CONNECTED) 1] ls /drill
[drill-cluster-01, running, sys.options, sys.storage_plugins, udf]
[zk: localhost:2181(CONNECTED) 2] ls /drill/drill-cluster-01
[8f4c9658-8cdc-486b-9f2c-903c27b3f0cd, 94c802f8-aab3-4266-b813-9e37dad38022, e627dd09-e837-4af4-a7ea-b7374e6477e3]
3) Check the web UI
http://192.168.15.191:8047/

45
$ $DRILL_HOME/bin/drillbit.sh stop
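5. (Optional) Client configuration
** On a separately cloned client node, you can install Drill and change the storage plugin settings
** to use HDFS and HBase through drill-embedded, but the settings are reset when the session ends
** Instead, Drill can be started as a single-node cluster for the client as shown below,
so that storage plugin changes are kept across sessions
seungyong@client:~$ cd drill/conf
seungyong@client:~/drill/conf$ vi drill-override.conf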
drill.exec: {
cluster-id: "drill-client",
zk.root: "drill-client",
zk.connect: "node-01:2181,node-02:2181,node-03:2181"
}
seungyong@client:~/drill$ drillbit.sh start
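The cluster and client configurations above differ only in `cluster-id` and `zk.root`, while sharing the same ZooKeeper quorum. The sketch below renders both variants from one template to make that difference explicit. The function name `render_drill_override` is hypothetical, and the output simply mirrors the `drill-override.conf` layout shown in this deck.

```python
def render_drill_override(cluster_id, zk_root, zk_hosts, zk_port=2181):
    """Render a drill-override.conf body (hypothetical helper for illustration)."""
    # Every Drillbit that should join the same cluster needs the same
    # cluster-id and zk.root; a different pair yields a separate cluster.
    connect = ",".join(f"{host}:{zk_port}" for host in zk_hosts)
    return (
        "drill.exec: {\n"
        f'  cluster-id: "{cluster_id}",\n'
        f'  zk.root: "{zk_root}",\n'
        f'  zk.connect: "{connect}"\n'
        "}\n"
    )


nodes = ["node-01", "node-02", "node-03"]
cluster_conf = render_drill_override("drill-cluster-01", "drill", nodes)
client_conf = render_drill_override("drill-client", "drill-client", nodes)
print(client_conf)
```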

46
Apache Drill – Web UI

47
Apache Drill – Storage Plugins

48
Apache Atlas – Overview, Data Catalog and Governance
https://atlas.apache.org/ , https://www.oreilly.com/library/view/the-enterprise-big/9781491931547/ch01.html , https://kr.cloudera.com/products/open-source/apache-hadoop/apache-atlas.html
Overview
Atlas is a scalable and extensible set of core foundational governance services – enabling enterprises to effectively and efficiently meet their
compliance requirements within Hadoop and allows integration with the whole enterprise data ecosystem.
Apache Atlas provides open metadata management and governance capabilities for organizations to build a catalog of their data assets,
classify and govern these assets and provide collaboration capabilities around these data assets for data scientists, analysts and the data
governance team.
Features
Metadata types & instances, Classification, Lineage, Search/Discovery, Security & Data Masking
Apache Atlas
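Atlas can exchange metadata with other tools and processes
both inside and outside the Hadoop stack, which enables
platform-independent governance controls and an effective
response to regulatory compliance requirements.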

49
Apache Atlas – Configuration #1
https://maven.apache.org/install.html, https://atlas.apache.org/#/BuildInstallation, https://github.com/apache/atlas.git
** Atlas does not provide a binary release, so it must be built from source
** Runs only on node-01, without an HA configuration
1. Install Maven
** Steps 1 and 2 were performed on the client node, but any node will do
seungyong@client:~$ wget https://downloads.apache.org/maven/maven-3/3.8.4/binaries/apache-maven-3.8.4-bin.tar.gz
seungyong@client:~$ tar zvxf apache-maven-3.8.4-bin.tar.gz
seungyong@client:~$ ln -s apache-maven-3.8.4 maven
seungyong@client:~$ export PATH=$PATH:$HOME/maven/bin
seungyong@client:~$ mvn -version
Apache Maven 3.8.4 (9b656c72d54e5bacbed989b64718c159fe39b537)
Maven home: /home/seungyong/maven
Java version: 11.0.13, vendor: Debian, runtime: /usr/lib/jvm/java-11-openjdk-amd64
Default locale: ko_KR, platform encoding: UTF-8
OS name: "linux", version: "5.10.0-9-amd64", arch: "amd64", family: "unix"
seungyong@client:~$
2. Build Atlas
seungyong@client:~$ git clone https://github.com/apache/atlas.git
seungyong@client:~$ cd atlas
seungyong@client:~/atlas$ export MAVEN_OPTS="-Xms2g -Xmx2g"
seungyong@client:~/atlas$ mvn clean install -DskipTests
........................... (omitted) ..............................
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 19:13 min
[INFO] Finished at: 2021-12-24T14:55:59+09:00
[INFO] ------------------------------------------------------------------------
seungyong@client:~/atlas$
seungyong@client:~/atlas$ mvn clean -DskipTests package -Pdist
........................... (omitted) ..............................
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 08:02 min
[INFO] Finished at: 2021-12-24T15:16:05+09:00
[INFO] ------------------------------------------------------------------------
seungyong@client:~/atlas$
seungyong@client:~/atlas$ cd distro/target
seungyong@client:~/atlas/distro/target$ ls *.tar.gz

50
https://atlas.apache.org/#/Installation
3. Atlas installation and environment setup
1) Copy the package
seungyong@client:~/atlas/distro/target$ scp apache-atlas-3.0.0-SNAPSHOT-server.tar.gz seungyong@node-01:/home/seungyong
2) Install Atlas
seungyong@node-01:~$ tar zvxf apache-atlas-3.0.0-SNAPSHOT-server.tar.gz
seungyong@node-01:~$ ln -s apache-atlas-3.0.0-SNAPSHOT atlas
3) $HOME/.bashrc
# Atlas
export ATLAS_HOME=/home/seungyong/atlas
export HBASE_CONF_DIR=$HBASE_HOME/conf
export PATH=$PATH:$ATLAS_HOME/bin
4) $ATLAS_HOME/conf/atlas-application.properties
atlas.graph.storage.hostname=node-01:2181,node-02:2181,node-03:2181
atlas.graph.index.search.solr.zookeeper-url=node-01:2181,node-02:2181,node-03:2181/solr
5) Create the Solr collections
seungyong@node-01:~$ solr create -c vertex_index -shards 3 -replicationFactor 3
seungyong@node-01:~$ solr create -c edge_index -shards 3 -replicationFactor 3
seungyong@node-01:~$ solr create -c fulltext_index -shards 3 -replicationFactor 3
1) Verify that the Hadoop, ZooKeeper, HBase, and Solr services are running
2) Start the service
seungyong@node-01:~$ sudo update-alternatives --install /usr/bin/python python /usr/bin/python3.9 1
seungyong@node-01:~/atlas$ bin/atlas_start.py
Starting Atlas server on host: localhost
Starting Atlas server on port: 21000
....................................................................................................
Apache Atlas Server started!!!
seungyong@node-01:~/atlas$
3) Verify the service

51
https://atlas.apache.org/#/Installation
4) Check with the web UI and curl
** admin/admin
** http://192.168.15.191:21000/
seungyong@node-01:~$ curl -u admin:admin http://localhost:21000/api/atlas/admin/version
{"Description":"Metadata Management and Data Governance Platform over Hadoop","Revision":"8e5dbabd1b06e0d8e1594e456fe112e7514ee1d1","Version":"3.0.0-SNAPSHOT","Name":"apache-atlas"}
5) Load the Quick Start data and check it in the web UI
seungyong@node-01:~/atlas$ bin/quick_start.py
ERROR StatusLogger Reconfiguration failed: No configuration found for '5ffd2b27' at 'null' in 'null'
Enter username for atlas :- admin
Enter password for atlas :-
Creating sample types:
Created type [DB]
........................... (omitted) ..............................
time_dim(Table) -> loadSalesDaily(LoadProcess)
Sample data added to Apache Atlas Server.
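6) (Optional) Create and verify a test entity
** The entity can be queried only after the data has been uploaded in the "Simple Usage Demo – Drill test"
in the next chapter and the entity has been created in Atlas
seungyong@node-01:~$ curl -u admin:admin "http://localhost:21000/api/atlas/v2/search/basic?typeName=hdfs_path"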
{"queryType":"BASIC","searchParameters":{"typeName":"hdfs_path","excludeDeletedEntities":false,"includeClassificationAttributes":false,"includeSubTypes":true,"includeSubClassifications":true,"limit":100,"offset":0},"entities":[{"typeName":"hdfs_path","attributes":{"createTime":1640876400000,"qualifiedName":"Seoul_2020_Real_Price","name":"Seoul_2020_Real_Price"},"guid":"9c2f11cc-7fd7-46c5-a337-ce4af1c1283e","status":"ACTIVE","displayText":"Seoul_2020_Real_Price","classificationNames":[],"meaningNames":[],"meanings":[],"isIncomplete":false,"labels":[]}],"approximateCount":1}
5. Stop the service
seungyong@node-01:~/atlas$ bin/atlas_stop.py
stopping atlas................................
did not stop gracefully after 30 seconds: killing process using SIGKILL
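The basic-search response returned by the `hdfs_path` query in step 6) is plain JSON, so the entity names can be pulled out with a few lines of Python. The sketch below works on a trimmed copy of that response (only the fields it reads are kept). The helper name `active_entity_names` is hypothetical.

```python
import json

# Trimmed copy of the search response shown in step 6) (only the fields used here).
response_text = """
{"queryType": "BASIC",
 "entities": [{"typeName": "hdfs_path",
               "attributes": {"qualifiedName": "Seoul_2020_Real_Price",
                              "name": "Seoul_2020_Real_Price"},
               "guid": "9c2f11cc-7fd7-46c5-a337-ce4af1c1283e",
               "status": "ACTIVE"}],
 "approximateCount": 1}
"""


def active_entity_names(search_result):
    """Return the names of ACTIVE entities in a basic-search result (hypothetical helper)."""
    return [
        entity["attributes"]["name"]
        for entity in search_result.get("entities", [])
        if entity.get("status") == "ACTIVE"
    ]


result = json.loads(response_text)
names = active_entity_names(result)
print(names)
```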

52
Apache Atlas – Web UI #1

53
Apache Atlas – Web UI #2

54
Apache Drill, JupyterLab Demo
Simple Usage Demo
https://www.data.go.kr/data/15052419/fileData.do, http://data.seoul.go.kr/dataList/OA-15548/S/1/datasetView.do

55
Apache Drill – Demo #1
http://guruble.com/apache-drill-%EA%B7%B8%EB%A6%AC%EA%B3%A0-sql-on-hadoop/, https://dataonair.or.kr/db-tech-reference/d-guide/data-practical/?mod=document&uid=403
1. Drill environment setup
1) Update the DFS storage plugin
http://192.168.15.191:8047/storage → dfs → Update
================================================
"type": "file",
"connection": "hdfs://node-01:9000/",
"workspaces": {
........................... (omitted) ..............................
"root": {
"location": "/",
........................... (omitted) ..............................
"allowAccessOutsideWorkspace": false
}
},
================================================
2) Update the HBase storage plugin
http://192.168.15.191:8047/storage → hbase Enable → Update
================================================
{
"type": "hbase",
"config": {
"hbase.zookeeper.quorum": "node-01,node-02,node-03",
"hbase.zookeeper.property.clientPort": "2181"
},
"enabled": true
}
================================================
3) Download the test data and upload it to HDFS
** Public Data Portal – Seoul real estate actual transaction price data
** https://www.data.go.kr/data/15052419/fileData.do
** http://data.seoul.go.kr/dataList/OA-15548/S/1/datasetView.do
** Korean character set conversion required: EUC-KR → UTF-8
seungyong@node-01:~$ iconv -c -f euc-kr -t utf-8 Seoul_2020_Real_Price.csv > Seoul_2020_Real_Price_UTF8.csv
seungyong@node-01:~$ head Seoul_2020_Real_Price_UTF8.csv
"실거래가아이디","지번코드","시군구코드","자치구명","법정동코드","법정동명","신고년도",
"11290-2020-4-0000066-1","1129013800101440024","11290","성북구","1129
........................... (omitted) ..............................
"11200-2020-4-0002987-1","1120010400100530000","11200","성동구","1120
seungyong@node-01:~$ hadoop fs -ls /user/seungyong
seungyong@node-01:~$ hadoop fs -mkdir realestate
seungyong@node-01:~$ hadoop fs -put Seoul_2020_Real_Price_UTF8.csv realestate
seungyong@node-01:~$ hadoop fs -ls /user/seungyong/realestate
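The EUC-KR → UTF-8 conversion done with `iconv` above can also be expressed in Python, which is convenient when the conversion is part of a larger ingest script. This is a minimal sketch with a hypothetical helper name, `euckr_to_utf8`. Using `errors="ignore"` is an assumption meant to approximate `iconv -c`, which silently discards characters that cannot be converted.

```python
def euckr_to_utf8(raw: bytes) -> bytes:
    """Re-encode EUC-KR bytes as UTF-8 (hypothetical helper for illustration)."""
    # errors="ignore" drops undecodable bytes instead of raising an exception.
    return raw.decode("euc-kr", errors="ignore").encode("utf-8")


# A small in-memory sample standing in for the downloaded CSV file.
sample = '"자치구명","성북구"\n'.encode("euc-kr")
converted = euckr_to_utf8(sample)
print(converted.decode("utf-8"))
```

For a real file, the same function can be applied to the bytes read with `open(path, "rb")` before the result is written out and uploaded with `hadoop fs -put`.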

56
https://drill.apache.org/docs/text-files-csv-tsv-psv/
2. Working with CSV files
1) Change the CSV options
http://192.168.15.191:8047/storage → dfs → Update
================================================
"csv": {
"type": "text",
"extensions": [
"csv"
],
"extractHeader": true → added
================================================
2) Drill Shell
seungyong@node-01:~$ drill-conf
........................... (omitted) ..............................
Apache Drill 1.19.0
"The only truly happy people are children, the creative minority and Drill users."
apache drill> use dfs;
+------+---------------------------------+
| ok | summary |
+------+---------------------------------+
| true | Default schema changed to [dfs] |
+------+---------------------------------+
apache drill (dfs)> select * from sys.drillbits;
+----------+-----------+--------------+-----------+-----------+---------+---------+--------+
| hostname | user_port | control_port | data_port | http_port | current | version | state |
+----------+-----------+--------------+-----------+-----------+---------+---------+--------+
| node-02 | 31010 | 31011 | 31012 | 8047 | true | 1.19.0 | ONLINE |
+----------+-----------+--------------+-----------+-----------+---------+---------+--------+
apache drill (dfs)>
apache drill (dfs)> show files in `/user/seungyong/realestate`;
+-------------------------------------------------+-------------+--------+----------+-----------+------------+-------------+------------------------+-------------------------+
| name | isDirectory | isFile | length | owner | group | permissions | accessTime | modificationTime |
+-------------------------------------------------+-------------+--------+----------+-----------+------------+-------------+------------------------+-------------------------+
| Active_Real_Estate_Salespersons_and_Brokers.csv | false | true | 21674958 | seungyong | supergroup | rw-r--r-- | 2021-12-31 09:33:00.17 | 2021-12-24 11:07:47.915 |
| Seoul_2020_Real_Price_UTF8.csv | false | true | 35736694 | seungyong | supergroup | rw-r--r-- | 2021-12-31 11:10:13.02 | 2021-12-31 11:10:13.88 |
+-------------------------------------------------+-------------+--------+----------+-----------+------------+-------------+------------------------+-------------------------+

57
apache drill (dfs)> select * from `/user/seungyong/realestate/Seoul_2020_Real_Price_UTF8.csv` limit 3;
+------------------------+---------------------+-------+------+------------+------+------+--------+------+------+----------+-----------+--------+-----+---------+-------+------------+------+-------------+
| 실거래가아이디 | 지번코드 | 시군구코드 | 자치구명 | 법정동코드 | 법정동명 | 신고년도 | 업무구분코드 | 업무구분 | 물건번호 | 대지권면적 | 건물면적 | 관리구분코드 | 층정보 | 건물주용도코드 | 건물주용도 | 물건금액 | 건축년도 | 건물명 |
+------------------------+---------------------+-------+------+------------+------+------+--------+------+------+----------+-----------+--------+-----+---------+-------+------------+------+-------------+
| 11290-2020-4-0000066-1 | 1129013800101440024 | 11290 | 성북구 | 1129013800 | 장위동 | 2020 | 4 | 신고 | 1 | 0.000000 | 59.920000 | 2 | 5 | 02001 | 아파트 | 749000000 | 0 | 래미안 장위포레카운티 |
| 11290-2020-4-0000628-1 | 1129013400105080016 | 11290 | 성북구 | 1129013400 | 길음동 | 2020 | 4 | 신고 | 1 | 0.000000 | 84.770000 | 2 | 18 | 02001 | 아파트 | 1200000000 | 0 | 롯데캐슬 클라시아 |
| 11530-2020-4-0014284-1 | 1153010200107400029 | 11530 | 구로구 | 1153010200 | 구로동 | 2020 | 4 | 신고 | 1 | | 36.900000 | 0 | 1 | 02001 | 아파트 | 105000000 | 1994 | 궁전아트빌라 |
+------------------------+---------------------+-------+------+------------+------+------+--------+------+------+----------+-----------+--------+-----+---------+-------+------------+------+-------------+
apache drill (dfs)> select `실거래가아이디`, `물건금액` from `/user/seungyong/realestate/Seoul_2020_Real_Price_UTF8.csv` limit 5;
+------------------------+------------+
| 실거래가아이디 | 물건금액 |
+------------------------+------------+
| 11290-2020-4-0000066-1 | 749000000 |
| 11290-2020-4-0000628-1 | 1200000000 |
| 11530-2020-4-0014284-1 | 105000000 |
| 11170-2020-4-0005040-1 | 830000000 |
| 11170-2020-4-0001553-1 | 735000000 |
+------------------------+------------+
apache drill (dfs)> select avg(cast(`물건금액` as float)) as `실거래가평균` from `/user/seungyong/realestate/Seoul_2020_Real_Price_UTF8.csv`;
+---------------------+
| 실거래가평균 |
+---------------------+
| 6.283287635788434E8 |
+---------------------+
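With `extractHeader` enabled, Drill addresses each CSV column by its header name and returns the values as text, which is why the query above casts `물건금액` to a number before averaging. The sketch below reproduces that logic with the Python standard library on the five rows listed by the earlier `limit 5` query. Because it uses this small sample rather than the full file, its result differs from the Drill average shown above. The helper name `average_column` is hypothetical.

```python
import csv
import io

# The five rows returned by the "limit 5" query above, as CSV text.
sample_csv = (
    '"실거래가아이디","물건금액"\n'
    '"11290-2020-4-0000066-1","749000000"\n'
    '"11290-2020-4-0000628-1","1200000000"\n'
    '"11530-2020-4-0014284-1","105000000"\n'
    '"11170-2020-4-0005040-1","830000000"\n'
    '"11170-2020-4-0001553-1","735000000"\n'
)


def average_column(text, column):
    """Average a numeric column of a headered CSV (hypothetical helper)."""
    # DictReader treats the first row as the header, like extractHeader in Drill.
    reader = csv.DictReader(io.StringIO(text))
    # Values arrive as strings, so convert explicitly; skip blank cells so float() does not fail.
    values = [float(row[column]) for row in reader if row[column] != ""]
    return sum(values) / len(values)


avg_price = average_column(sample_csv, "물건금액")
print(avg_price)
```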

58
3. Query HBase tables
1) Create test data and query it
https://drill.apache.org/docs/querying-hbase/
2) Or query the existing tables – the existing tables belong to Atlas
apache drill (dfs)> use hbase;
+------+-----------------------------------+
| ok | summary |
+------+-----------------------------------+
| true | Default schema changed to [hbase] |
+------+-----------------------------------+
apache drill (hbase)> show tables;
+--------------+---------------------------+
| TABLE_SCHEMA | TABLE_NAME |
+--------------+---------------------------+
| hbase | apache_atlas_janus |
| hbase | apache_atlas_entity_audit |
+--------------+---------------------------+
apache drill (hbase)> select * from apache_atlas_janus limit 1;
+----------------------------------+----+----+----+----+-----------------------------------------------------------------------+----+----+----+----+
| row_key | e | f | g | h | i | l | m | s | t |
+----------------------------------+----+----+----+----+-----------------------------------------------------------------------+----+----+----+----+
| x00x00x00x00x00x00x00x00 | {} | {} | {} | {} | {"��u0000u0005�u0007u000E��u0003c0a80fbf9083-node-012":""} | {} | {} | {} | {} |
+----------------------------------+----+----+----+----+-----------------------------------------------------------------------+----+----+----+----+
apache drill (hbase)>
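The queries run in the Drill shell above can also be submitted through Drill's REST interface, which listens on the web UI port (8047 in this setup) and accepts a POST to `/query.json` with a JSON body. The sketch below only builds the request, so it runs without a live cluster. The helper name `drill_query_request` is hypothetical and the host is the one used in this deck.

```python
import json
from urllib.request import Request


def drill_query_request(base_url, sql):
    """Build a POST request for Drill's /query.json endpoint (hypothetical helper)."""
    body = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return Request(
        f"{base_url}/query.json",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = drill_query_request("http://192.168.15.191:8047", "SELECT * FROM sys.drillbits")
payload = json.loads(req.data.decode("utf-8"))
print(req.full_url, payload)
```

With a Drillbit running, passing `req` to `urllib.request.urlopen` sends the query, and the JSON response can be parsed with `json.load`.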

59
JupyterLab with HDFS – Demo #1
https://jx2lee.github.io/python-connection_test/
1. JupyterLab installation and environment setup
1) Install packages
seungyong@client:~$ pip3 install jupyterlab pandas hdfs
2) Generate a password
seungyong@client:~$ ipython3
ipython> from jupyter_server.auth import passwd; passwd()
3) Configuration
seungyong@client:~$ jupyter-lab --generate-config
seungyong@client:~$ vi .jupyter/jupyter_lab_config.py
================================================
c.ServerApp.allow_origin = '*'
c.ServerApp.ip = '0.0.0.0'
c.ServerApp.root_dir = '/home/seungyong'
c.ServerApp.open_browser = False
c.ServerApp.password = 'encrypted password'
================================================
2. Start the service and test
seungyong@client:~$ jupyter-lab > .jupyter/jupyter.log 2>&1 &
http://192.168.15.190:8888/lab?
1) Write and read a CSV file on HDFS
================================================
import pandas as pd
from hdfs import InsecureClient
client_hdfs = InsecureClient('http://192.168.15.191:9870', user='seungyong')
create_df = pd.DataFrame([1000, 2000, 3000, 4000])
with client_hdfs.write('jupyter-test.csv', encoding = 'utf-8') as writer:
    create_df.to_csv(writer)
client_hdfs.list('')
with client_hdfs.read('jupyter-test.csv', encoding = 'utf-8') as reader:
    df = pd.read_csv(reader)
print(df)
================================================
seungyong@client:~$ hadoop fs -ls
Found 3 items
drwxr-xr-x - seungyong supergroup 0 2021-12-25 09:49 .sparkStaging
-rw-r--r-- 3 seungyong supergroup 31 2022-01-03 21:05 jupyter-test.csv
drwxr-xr-x - seungyong supergroup 0 2021-12-31 11:10 realestate
seungyong@client:~$
seungyong@client:~$ hadoop fs -cat jupyter-test.csv
,0
0,1000
1,2000
2,3000
3,4000
seungyong@client:~$
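The `InsecureClient` from the `hdfs` package used above talks to the NameNode over WebHDFS, which is why it is pointed at the HTTP port 9870 rather than the `hdfs://` RPC port. As a rough illustration of what such a read looks like on the wire, the sketch below builds a WebHDFS `OPEN` URL for the file written above. The helper name `webhdfs_url` is hypothetical, and the absolute path assumes the user's HDFS home directory is `/user/seungyong`, as in the earlier listings.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(namenode_http, hdfs_path, op, user):
    """Build a WebHDFS REST URL (hypothetical helper for illustration)."""
    # user.name identifies the caller when security is off (the "insecure" mode).
    query = urlencode({"op": op, "user.name": user})
    return f"{namenode_http}/webhdfs/v1{quote(hdfs_path)}?{query}"


url = webhdfs_url(
    "http://192.168.15.191:9870",
    "/user/seungyong/jupyter-test.csv",
    "OPEN",
    "seungyong",
)
print(url)
```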

60
https://jx2lee.github.io/python-connection_test/
2) Query the Seoul real estate actual transaction prices
================================================
import pandas as pd
from hdfs import InsecureClient
client_hdfs = InsecureClient('http://192.168.15.191:9870', user='seungyong')
client_hdfs.list('realestate')
with client_hdfs.read('realestate/Seoul_2020_Real_Price_UTF8.csv', encoding = 'utf-8') as reader:
    df = pd.read_csv(reader)
================================================
jupyter> df.head()
실거래가아이디 지번코드 시군구코드 자치구명 법정동코드 법정동명 신고년도 업무구분코드 업무구분 물건번호 대지권면적 건물면적
관리구분코드 층정보 건물주용도코드 건물주용도 물건금액 건축년도 건물명
0 11290-2020-4-0000066-1 1129013800101440024 11290 성북구 1129013800 장위동 2020 4 신고 1 0.0 59.92 2 5.0 2001 아파트 749000000
0.0 래미안 장위포레카운티
1 11290-2020-4-0000628-1 1129013400105080016 11290 성북구 1129013400 길음동 2020 4 신고 1 0.0 84.77 2 18.0 2001 아파트 1200000000
0.0 롯데캐슬 클라시아
2 11530-2020-4-0014284-1 1153010200107400029 11530 구로구 1153010200 구로동 2020 4 신고 1 NaN 36.90 0 1.0 2001 아파트 105000000
1994.0 궁전아트빌라
3 11170-2020-4-0005040-1 1117011500101930000 11170 용산구 1117011500 산천동 2020 4 신고 1 NaN 59.55 0 2.0 2001 아파트 830000000
2001.0 리버힐삼성
4 11170-2020-4-0001553-1 1117012900101930003 11170 용산구 1117012900 이촌동 2020 4 신고 1 NaN 64.43 0 5.0 2001 아파트 735000000
1971.0 강변

61
jupyter> df.describe()
시군구코드 법정동코드 신고년도 업무구분코드 물건번호 대지권면적 건물면적 층정보 건물주용도코드 물건금액 건축년도
count 176001.000000 1.760010e+05 176001.0 176001.0 176001.000000 89147.000000 176001.000000 163838.000000 176001.000000 1.760010e+05 175436.000000
mean 11448.537253 1.144865e+09 2020.0 4.0 1.868825 48.176158 71.489719 6.725540 2967.512037 6.283288e+08 1988.342336
std 169.978944 1.699767e+07 0.0 0.0 7.057148 58.385424 66.030144 5.754692 3430.224515 6.245526e+08 162.107028
min 11110.000000 1.111010e+09 2020.0 4.0 1.000000 0.000000 5.070000 -3.000000 1001.000000 1.700000e+07 0.000000
25% 11305.000000 1.130510e+09 2020.0 4.0 1.000000 23.020000 41.260000 3.000000 2001.000000 2.500000e+08 1993.000000
50% 11440.000000 1.144012e+09 2020.0 4.0 1.000000 30.570000 59.760000 5.000000 2001.000000 4.480000e+08 2002.000000
75% 11590.000000 1.159010e+09 2020.0 4.0 1.000000 45.500000 84.820000 10.000000 2002.000000 8.000000e+08 2011.000000
max 11740.000000 1.174011e+09 2020.0 4.0 175.000000 6086.000000 2804.970000 67.000000 14202.000000 2.900000e+10 2021.000000
jupyter> df.max()
실거래가아이디 11740-2020-4-9000004-1
시군구코드 11740
자치구명 중랑구
법정동코드 1174011000
법정동명 흥인동
신고년도 2020
업무구분코드 4
업무구분 신고
물건번호 175
대지권면적 6086.0
건물면적 2804.97
관리구분코드 B
층정보 67.0
건물주용도코드 14202
건물주용도 오피스텔
물건금액 29000000000
건축년도 2021.0

62

63
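Every component matters for a data lake.
NetApp for Data Lake
https://cloud.netapp.com/blog/cvo-blg-cloud-data-lake-in-5-steps, https://www.oreilly.com/library/view/the-enterprise-big/9781491931547/ch01.html
How do you keep it from becoming a data swamp?
▪ Naturally, the architecture of the entire data lake is critical
▪ Data sources are diverse, and HDFS alone cannot meet every
requirement for the data lake's storage layer
▪ Depending on the requirements, performance, data protection, and cost
must also be considered, and a variety of data storage technologies
such as NFS, S3, and cloud storage should be applicable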
▪ NameNode HA With NFS
▪ NFS Gateway
▪ Amazon S3
▪ Azure Blob Storage
▪ Azure Data Lake Storage
▪ OpenStack Swift

64
Apache Ozone
Ozone is a scalable, redundant, and distributed object store for Hadoop.
Apart from scaling to billions of objects of varying sizes, Ozone can function effectively in containerized environments like
Kubernetes.
Applications like Apache Spark, Hive and YARN, work without any modifications when using Ozone. Ozone comes with
a Java client library, S3 protocol support, and a command line interface which makes it easy to use Ozone.
Ozone consists of volumes, buckets, and keys:
• Volumes are similar to user accounts. Only administrators can create or delete volumes.
• Buckets are similar to directories. A bucket can contain any number of keys, but buckets cannot contain other buckets.
• Keys are similar to files.
https://ozone.apache.org/, https://ci-hadoop.apache.org/view/Hadoop%20Ozone/job/ozone-doc-master/lastSuccessfulBuild/artifact/hadoop-hdds/docs/public/index.html
A recent data store project
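The volume / bucket / key hierarchy described above can be pictured as a three-level path, where only the key part may itself contain slashes because buckets cannot nest. The sketch below is purely illustrative: the `OzoneAddress` type, the `parse_ozone_address` helper, and the sample names are hypothetical and are not part of any Ozone client library.

```python
from typing import NamedTuple


class OzoneAddress(NamedTuple):
    """Illustrative model of an Ozone object address (hypothetical type)."""
    volume: str
    bucket: str
    key: str


def parse_ozone_address(path):
    """Split '/volume/bucket/key' into its three levels (hypothetical helper)."""
    # Split at most twice: anything after the bucket belongs to the key,
    # since buckets cannot contain other buckets.
    parts = path.strip("/").split("/", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected /volume/bucket/key, got {path!r}")
    return OzoneAddress(*parts)


addr = parse_ozone_address("/vol1/bucket1/realestate/data.csv")
print(addr)
```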

65
NetApp Cloud Central – https://cloud.netapp.com/
Data Lake in Cloud with NetApp
▪ Cloud Data Lake in 5 Steps
▪ Azure Data Lake: 4 Building Blocks and Best Practices
▪ AWS Data Lake: End-to-End Workflow in the Cloud
▪ Google Cloud Data Lake: 4 Phases of the Data Lake Lifecycle

66
NetApp ONTAP – Unified Data Layer
[Diagram: ONTAP Unified Data Layer spanning Hybrid Flash FAS, All Flash FAS, ONTAP Select, and ONTAP Cloud, serving NVMe-oF, Fibre Channel, iSCSI, NFS, SMB/CIFS, and S3]

67
NetApp ONTAP – Unified Data Layer
https://www.netapp.com/pdf.html?item=/media/26877-nva-1157-deploy.pdf, https://www.netapp.com/pdf.html?item=/media/17082-tr4732pdf.pdf, https://www.netapp.com/blog/data-migration-xcp/

68
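NetApp StorageGRID – Industry-leading object storage solution
▪ Powerful policy-based data management
▪ Automatic data placement based on media, data center, protection method, retention period, metadata, and more
▪ Flexible and simple deployment options
▪ Supports appliance, virtualized, or software-based configurations, as well as a mix of them
▪ Flawless data durability
▪ Provides 15 nines of durability through two-level erasure coding
▪ Single-namespace scale-out architecture
▪ Supports up to 16 sites, 640 PB of physical capacity, and 260B objects
▪ Hybrid cloud
▪ Integration with AWS workflows and services: mirroring, AWS SNS, and so on
▪ Data tiering to Microsoft Azure and AWS Glacier/S3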
▪ Analyst reviews:
▪ IDC: Leader in the 2019 IDC Marketscape Worldwide Object-Based (OBS) Vendor Assessment:
https://blog.netapp.com/netapp-named-object-storage-leader
[Diagram: a user ingests data in San Francisco and a user reads the data in Seoul, across StorageGRID Site 1 (San Francisco), Site 2 (New York), Site 3 (Munich), and Site 4 (Tokyo); 16 sites, 640 PB capacity, 260B objects]

NetApp unlocks
the best of cloud

70
Appendix #1 – Hadoop configuration files
1) $HADOOP_HOME/etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://node-01:9000</value>
</property>
</configuration>
2) $HADOOP_HOME/etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///home/seungyong/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///home/seungyong/datanode</value>
</property>
<property>
<name>dfs.namenode.checkpoint.dir</name>
<value>file:///home/seungyong/namesecondary</value>
</property>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>node-02:9868</value>
</property>
</configuration>
3) $HADOOP_HOME/etc/hadoop/yarn-site.xml
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>node-01</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.env-whitelist</name>
<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_HOME,PATH,LANG,TZ,HADOOP_MAPRED_HOME</value>
</property>
</configuration>
4) $HADOOP_HOME/etc/hadoop/mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.application.classpath</name>
<value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
</property>
</configuration>
5) $HADOOP_HOME/etc/hadoop/workers
node-01
node-02
node-03
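Other components in this setup must agree with these files. For example, the `connection` value of Drill's dfs storage plugin (`hdfs://node-01:9000/`) corresponds to `fs.defaultFS` in core-site.xml. The following sketch reads the Hadoop `<configuration>` format with the Python standard library so that such values can be checked from a script. The helper name `hadoop_properties` is hypothetical, and the XML is an inline copy of the core-site.xml above.

```python
import xml.etree.ElementTree as ET

# Inline copy of the core-site.xml shown above.
core_site = """<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://node-01:9000</value>
  </property>
</configuration>"""


def hadoop_properties(xml_text):
    """Return the name/value pairs of a Hadoop *-site.xml document (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name"): prop.findtext("value")
        for prop in root.findall("property")
    }


props = hadoop_properties(core_site)
print(props["fs.defaultFS"])
```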

71
Appendix #2 – Starting and stopping services in parallel with PDSH
< Start and verify all services >
seungyong@node-01:~$ $HADOOP_HOME/sbin/start-dfs.sh && $HADOOP_HOME/sbin/start-yarn.sh
seungyong@node-01:~$ pdsh -w node-[01-03] $ZK_HOME/bin/zkServer.sh start
seungyong@node-01:~$ $HBASE_HOME/bin/start-hbase.sh
seungyong@node-01:~$ pdsh -w node-[01-03] $SOLR_BIN/solr start -cloud
seungyong@node-01:~$ pdsh -w node-[01-03] $DRILL_HOME/bin/drillbit.sh start
seungyong@node-01:~$ pdsh -w node-[01-03] $KAFKA_HOME/bin/kafka-server-start.sh -daemon $KAFKA_HOME/config/server.properties
seungyong@node-01:~$ $ATLAS_HOME/bin/atlas_start.py
seungyong@node-01:~$ pdsh -w node-[01-03] jps
< Stop and verify all services >
seungyong@node-01:~$ $ATLAS_HOME/bin/atlas_stop.py
seungyong@node-01:~$ pdsh -w node-[01-03] $DRILL_HOME/bin/drillbit.sh stop
seungyong@node-01:~$ pdsh -w node-[01-03] $SOLR_BIN/solr stop -all
seungyong@node-01:~$ $HBASE_HOME/bin/stop-hbase.sh
seungyong@node-01:~$ pdsh -w node-[01-03] $KAFKA_HOME/bin/kafka-server-stop.sh
seungyong@node-01:~$ pdsh -w node-[01-03] $ZK_HOME/bin/zkServer.sh stop
seungyong@node-01:~$ $HADOOP_HOME/sbin/stop-yarn.sh && $HADOOP_HOME/sbin/stop-dfs.sh
seungyong@node-01:~$ pdsh -w node-[01-03] jps
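The `-w node-[01-03]` argument used throughout this appendix is a host range that pdsh expands into node-01, node-02, and node-03 before running the command on each host in parallel. The sketch below imitates that expansion for the single zero-padded numeric range used here. The helper name `expand_hostlist` is hypothetical, and it deliberately ignores the richer syntax (such as comma-separated lists) that pdsh itself accepts.

```python
import re


def expand_hostlist(pattern):
    """Expand 'prefix[NN-MM]suffix' into host names (hypothetical, simplified helper)."""
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", pattern)
    if match is None:
        # No range present: treat the pattern as a single host name.
        return [pattern]
    prefix, start, end, suffix = match.groups()
    width = len(start)  # keep the zero padding of the range start
    return [
        f"{prefix}{number:0{width}d}{suffix}"
        for number in range(int(start), int(end) + 1)
    ]


hosts = expand_hostlist("node-[01-03]")
print(hosts)
```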
