Bigdata with Google Cloud

JeongChul Kim
Bigdata LAB, Kookmin University
@kimjc
kjc5443@gmail.com
http://jeongchul.tistory.com

Contents
- Introduction
- Spark Cluster
- GCP
- Cloud Dataproc
- GCP Machine Learning

Introduction
WSDM / ACM International Conference on Web Search and Data Mining
http://aminer.org/ranks/conf/

Introduction
WSDM - KKBox’s Music recommendation challenge

Bigdata dataset
Largest file in the dataset: 29 GB

Memory Error
Memory errors occur: out of memory (OOM), Java heap space, GC overhead limit exceeded
-> Fix: scale out or scale up

Spark
Lightning-fast unified analytics engine
Speed
Run workloads 100x faster.
Ease of Use
Write applications quickly in Java, Scala, Python.
Generality
Combine SQL, streaming, and complex analytics.

Spark AI
TensorFlow On Spark: Scalable TensorFlow Learning on Spark Clusters

Spark Cluster
Cluster Manager Types
- Standalone : a simple cluster manager included with Spark
- Apache Mesos : a general cluster manager that can also run Hadoop MapReduce
- Hadoop YARN : the resource manager in Hadoop 2
- Kubernetes : an open-source system for automating deployment, scaling, and management of containerized applications

Spark Cluster Setup
Installing a Spark cluster on 4 servers:
1. Install Hadoop (SSH config, network, HDFS)
2. Install Spark (Master, Worker)
* Installation takes a long time and involves difficult troubleshooting
[Figure: cluster layout across four servers (bd-1, bd-2, bd-3, gpu). Each server is a worker running a YARN NodeManager and a Spark Executor; one also hosts the Master with the Spark Application Master. The Spark Driver runs on bd-2.]

Spark Cluster Setup
Docker
A software platform for quickly building, testing, and deploying applications using containers

Google Cloud Platform
Why Google Cloud Platform?
Future-Proof Infrastructure
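Security, high performance, cost efficiency
Seriously Powerful
Data & Analytics
Use big data to build better products.
Serverless
A serverless environment where you do not have to worry about capacity, reliability, or performance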

Cloud Dataproc
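Cloud Dataproc is a fast, easy-to-use managed cloud service for running Apache Spark clusters efficiently.
Jobs that used to take hours or days finish in minutes or seconds, and you pay only for the resources you
use (billed per second). Cloud Dataproc also integrates easily with other Google Cloud Platform (GCP)
services, providing a powerful and complete platform for data processing, analytics, and machine learning.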

Cloud Dataproc
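Integrates easily with Google Cloud Platform (GCP) services to provide a platform for data processing,
analytics, and machine learning.
Fast & Scalable Data Processing
Scale from 3 nodes to hundreds, so your data pipeline never outgrows the cluster.
Affordable Pricing
Per-second pricing based on actual use. Clusters can include low-cost instances, so you get a powerful cluster
at low cost.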

GCP Cloud Dataproc
Google login -> GCP Console

GCP Cloud Dataproc
Google Dataproc -> Clusters

GCP Cloud Dataproc
Click the "Enable API" button

GCP Cloud Dataproc
Click the "Create cluster" button

GCP Cloud Dataproc
Set the cluster name and configure the master and worker nodes
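The console steps above can also be sketched as a Cloud Shell command. This is a minimal sketch rather than the presenter's procedure: the cluster name and zone are inferred from the SSH command later in these slides, while the region and worker count are assumptions. The command is only printed so it can be reviewed before running.

```shell
# Name and zone are inferred from the slides' SSH step; region and worker count are assumptions.
CLUSTER="kmubigdata-cluster"
REGION="asia-east1"
ZONE="asia-east1-a"
WORKERS=2

CMD="gcloud dataproc clusters create $CLUSTER --region=$REGION --zone=$ZONE --num-workers=$WORKERS"

# Print the command for review; to actually create the cluster, run: eval "$CMD"
echo "$CMD"
```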

GCP Cloud Dataproc
Check the created cluster

GCP Cloud Dataproc
Cluster Web UI setup
Networking (VPC network) -> Firewall rules -> Create firewall rule

GCP Cloud Dataproc
Firewall rule name
Specified protocols and ports
tcp:8088;tcp:9870;tcp:8080;tcp:4040;tcp:18080;tcp:19888
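For reference, the same rule could be created from the command line. The sketch below only builds and prints the command; the rule name dataproc-web-ui and the YOUR_IP placeholder are assumptions, not values from the slides. Note that gcloud expects a comma-separated list, whereas the console field above uses semicolons.

```shell
# Web UI ports listed on the slide
PORTS="8088 9870 8080 4040 18080 19888"

# Join them into the comma-separated form that --allow expects
RULES=""
for p in $PORTS; do
  RULES="${RULES:+$RULES,}tcp:$p"
done

# Hypothetical rule name; replace YOUR_IP with your own address before running
echo "gcloud compute firewall-rules create dataproc-web-ui --allow=$RULES --source-ranges=YOUR_IP/32"
```

Restricting --source-ranges to your own address is a deliberate choice here, since these web UIs are otherwise reachable by anyone who knows the master's external IP.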

GCP Cloud Dataproc
VM instances: click the master

GCP Cloud Dataproc
VM instance details: check the external IP

GCP Cloud Dataproc
View HDFS cluster information
http://cluster-master-ip:9870/

GCP Cloud Dataproc
View Hadoop cluster information

GCP Cloud Dataproc
Check the Hadoop cluster nodes

GCP Cloud Dataproc
Spark History Server

GCP Cloud Dataproc
Connect to the cluster master over SSH
Click the Cloud Shell button

GCP Cloud Dataproc
Google Cloud Shell
$ gcloud compute ssh kmubigdata-cluster-m --zone=asia-east1-a

GCP Cloud Dataproc
Run spark-shell
$ spark-shell

GCP Cloud Dataproc
Write Scala code

GCP Cloud Dataproc
Install SBT
$ sudo apt-get install apt-transport-https
$ echo "deb https://dl.bintray.com/sbt/debian /" | sudo tee -a /etc/apt/sources.list.d/sbt.list
$ sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 2EE0EA64E40A89B84B2DF73499E82A75642AC823
$ sudo apt-get update
$ sudo apt-get install sbt

GCP Cloud Dataproc
Scala compile
$ sbt package
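sbt package needs a build definition in the project directory, which the slides do not show. A minimal sketch follows; the project name, version, and Spark version are assumptions, and the Scala 2.11 line is chosen only to match the target/scala-2.11 directory on the next slide.

```shell
# Write a minimal build.sbt (project name, version, and Spark version are assumptions)
cat > build.sbt <<'EOF'
name := "matrixmultiplication"
version := "0.1"
scalaVersion := "2.11.12"

// "provided": the cluster already ships Spark, so it is not bundled into the JAR
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.0" % "provided"
EOF

# Show what was written
cat build.sbt
```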

GCP Cloud Dataproc
Check the JAR file created by the build
$ cd target/scala-2.11
$ ls

GCP Cloud Dataproc
Upload to HDFS
$ hdfs dfs -ls /
$ hdfs dfs -ls /user/
$ hdfs dfs -put matrixmultiplication_xxxx.jar /user/kjc5443

GCP Cloud Dataproc
Submit a Spark job
Click the "Submit job" button

GCP Cloud Dataproc
Enter the job details
- Job ID
- Cluster
- Job type: Spark
- Main class / JAR file
hdfs:///user/kjc5443/matrix_xx.jar
- Submit

GCP Cloud Dataproc
Job submitted -> succeeded

GCP Cloud Dataproc
Results of the completed job

GCP Cloud Dataproc
Check the application in the History Server

GCP Cloud Dataproc
Check the executed jobs

GCP Cloud Storage
Google Cloud Storage
Unified object storage for developers and enterprises
Provides availability suited to live data, plus storage, archiving, and lifecycle management

GCP Cloud Storage
Upload the JAR file to Cloud Storage
$ gsutil cp matrixmultiplication_xxxx.jar gs://dataproc-e3d4872e-99c3-4dba-a533-8a5c6d4a9e4a-asia

GCP Cloud Storage
JAR file uploaded to the bucket

GCP Cloud Dataproc
Job submission details
- Job ID
- Cluster
- Job type
- Main class
- JAR file: gs://[JAR file]
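The form fields above map onto flags of gcloud dataproc jobs submit spark. The following is a sketch under stated assumptions: the main class name is hypothetical, the JAR URI is a placeholder to be replaced with the object uploaded earlier, and the region is a guess. The command is printed rather than executed.

```shell
CLUSTER="kmubigdata-cluster"               # inferred from the SSH step in these slides
REGION="asia-east1"                        # assumption
MAIN_CLASS="MatrixMultiplication"          # hypothetical main class name
JAR_URI="gs://YOUR_BUCKET/YOUR_JAR.jar"    # placeholder for the uploaded JAR

CMD="gcloud dataproc jobs submit spark --cluster=$CLUSTER --region=$REGION --class=$MAIN_CLASS --jars=$JAR_URI"

# Review the command, then run it with: eval "$CMD"
echo "$CMD"
```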

GCP Cloud Dataproc
Check the result of the successful job

GCP Cloud Dataproc
Scaling clusters
Adjust the number of worker nodes
Deleting a cluster is simple!
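Scaling and deletion are also available from the command line. In this sketch the target of 4 workers and the region are assumptions chosen for illustration; both commands are printed only, since deletion is irreversible.

```shell
CLUSTER="kmubigdata-cluster"   # inferred from the SSH step in these slides
REGION="asia-east1"            # assumption

# Resize the worker pool (4 is an arbitrary example value)
SCALE_CMD="gcloud dataproc clusters update $CLUSTER --region=$REGION --num-workers=4"

# Remove the cluster once the work is done, so billing stops
DELETE_CMD="gcloud dataproc clusters delete $CLUSTER --region=$REGION"

echo "$SCALE_CMD"
echo "$DELETE_CMD"
```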

GCP Cloud Vision API
Cloud Vision API
Powerful image analysis
The Vision API wraps powerful machine learning models in an easy-to-use REST API, so developers can
understand the content of an image.
- Image Classification
- Object Detection
- OCR (optical character recognition)

Enable the Cloud Vision API
https://console.developers.google.com/apis/api/vision.googleapis.com/

GCP Cloud Storage
Create a Cloud Storage bucket
To have an image analyzed, pass the URL of the file stored in Google Cloud Storage

GCP Cloud Storage
Upload an image to the bucket
Click the "Upload files" button

GCP Cloud Storage
Set a public link for the file in the bucket
Under "Share publicly", click the "Public link" button => an accessible URL is created

Create a Vision API request
Create vision-request.json in the Cloud console and edit it in the Cloud Shell Code Editor
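The slide does not show the file's contents. A minimal sketch of a text-detection request might look like the following; the bucket and image names are placeholders, and TEXT_DETECTION is assumed because the result slide shows extracted text. The curl command on the next slide refers to ocr-request.json, so save the file under whichever name your command uses.

```shell
# Write a Vision API request body asking for text detection (OCR)
cat > vision-request.json <<'EOF'
{
  "requests": [
    {
      "image": {
        "source": { "gcsImageUri": "gs://YOUR_BUCKET/YOUR_IMAGE.jpg" }
      },
      "features": [
        { "type": "TEXT_DETECTION", "maxResults": 10 }
      ]
    }
  ]
}
EOF
```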

Send the API request from Google Cloud Shell
$ curl -s -X POST -H "Content-Type: application/json" --data-binary @ocr-request.json "https://vision.googleapis.com/v1/images:annotate?key=${API_KEY}"

Check the result
WELCOME\nTO\nNevada\nTHE SILVER STATE\nPacific\nTime Zone\n

GCP Cloud Translation API
Translation API
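Fast, dynamic translation
Uses state-of-the-art neural machine translation to translate arbitrary strings into any supported language
Highly responsive, so it can be integrated into websites and applications
Supports 100 languages in total

Enable the Translation API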
https://console.developers.google.com/apis/api/translate.googleapis.com/

Translate the sentence extracted from the image (English) into Korean
Create translation-request.json and set the target language
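As a sketch of what this file could contain, assuming the Translation API v2 request fields q (text to translate) and target (language code): your_text_here is the placeholder that the later sed step replaces, and ko selects Korean.

```shell
# Write a Translation API request body with a placeholder for the text
cat > translation-request.json <<'EOF'
{
  "q": "your_text_here",
  "target": "ko"
}
EOF
```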

Get the text from the Vision API response
Use the jq command to extract the desired text from the JSON file.
$ jq '.responses[0].textAnnotations[0].description' vision-response.json

Write the text into translation-request.json
$ STR=$(jq '.responses[0].textAnnotations[0].description' vision-response.json) && STR="${STR//\"}" && sed -i "s|your_text_here|$STR|g" translation-request.json

Run the Translation API
$ curl -s -X POST -H "Content-Type: application/json" --data-binary @translation-request.json "https://translation.googleapis.com/language/translate/v2?key=${API_KEY}" -o translation-response.json

GCP Cloud Natural Language API
Natural Language API
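Powerful text analysis: extracts information about the people, places, and events mentioned in text
documents, news articles, and blog posts, detects sentiment, and analyzes intent from customer conversations
Can be combined with the Google Cloud Speech API!

Enable the Natural Language API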
https://console.developers.google.com/apis/library/language.googleapis.com/

Feed the text from the Vision API into the NL API for analysis
Create nl-request.json
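The slide does not show this file either. A minimal sketch, assuming the plain-text document form of the Natural Language API request, with your_text_here as the placeholder that the following sed step fills in:

```shell
# Write a Natural Language API request body with a placeholder for the text
cat > nl-request.json <<'EOF'
{
  "document": {
    "type": "PLAIN_TEXT",
    "content": "your_text_here"
  },
  "encodingType": "UTF8"
}
EOF
```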

Copy the text translated by the Translation API into the NL API request
$ STR=$(jq '.data.translations[0].translatedText' translation-response.json) && STR="${STR//\"}" && sed -i "s|your_text_here|$STR|g" nl-request.json

Entity analysis request
$ curl "https://language.googleapis.com/v1/documents:analyzeEntities?key=${API_KEY}" -s -X POST -H "Content-Type: application/json" --data-binary @nl-request.json

fin.
