This document outlines the labs for a Google Cloud Dataflow workshop. Lab 1 covers setting up the Dataflow development environment and building a first project. Lab 2 covers deploying that project to Google Cloud Platform. Lab 3 builds a streaming Dataflow pipeline: it creates Pub/Sub topics and subscriptions and deploys streaming samples that read from Pub/Sub and write to BigQuery.
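Lab 1's project setup pulls in the Dataflow Java SDK. For reference, a minimal Maven dependency would look roughly like this (a sketch; the version shown is an assumption based on the 2016-era 1.x SDK):

<dependency>
  <groupId>com.google.cloud.dataflow</groupId>
  <artifactId>google-cloud-dataflow-java-sdk-all</artifactId>
  <version>1.8.0</version> <!-- assumption: any 1.x release current at the time -->
</dependency>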
@SuppressWarnings("serial")
public class TestMain {
    private static final Logger LOG = LoggerFactory.getLogger(TestMain.class);

    public static void main(String[] args) {
        Pipeline p = Pipeline.create(
                PipelineOptionsFactory.fromArgs(args).withValidation().create());
        p.apply(TextIO.Read.named("sample-book").from("gs://jcconf2016-dataflow-workshop/sample/book-sample.txt"))
         // Transform: upper-case every line.
         .apply(ParDo.of(new DoFn<String, String>() {
             @Override
             public void processElement(ProcessContext c) {
                 c.output(c.element().toUpperCase());
             }
         }))
         // Sink: log each element instead of writing it anywhere.
         .apply(ParDo.of(new DoFn<String, Void>() {
             @Override
             public void processElement(ProcessContext c) {
                 LOG.info(c.element());
             }
         }));
        p.run();
    }
}
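Run without extra arguments, this pipeline executes locally on the DirectPipelineRunner. To submit it to the Dataflow service instead (Lab 2), the project, staging bucket, and runner can also be set in code; a minimal sketch, where the project id and bucket are placeholders:

DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
        .withValidation().as(DataflowPipelineOptions.class);
options.setProject("my-project");                        // placeholder project id
options.setStagingLocation("gs://my-bucket/staging");    // placeholder staging bucket
options.setRunner(BlockingDataflowPipelineRunner.class); // blocks until the job completes
Pipeline p = Pipeline.create(options);

The same values can instead be passed on the command line as --project, --stagingLocation, and --runner, which is what fromArgs(args) parses.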
Next, modify the program so that the output is written to Google Cloud Storage…
@SuppressWarnings("serial")
public class TestMain {
    private static final Logger LOG = LoggerFactory.getLogger(TestMain.class);

    public static void main(String[] args) {
        Pipeline p = Pipeline.create(
                PipelineOptionsFactory.fromArgs(args).withValidation().create());
        p.apply(TextIO.Read.named("sample-book").from("gs://jcconf2016-dataflow-workshop/sample/book-sample.txt"))
         .apply(ParDo.of(new DoFn<String, String>() {
             @Override
             public void processElement(ProcessContext c) {
                 c.output(c.element().toUpperCase());
             }
         }))
         // Write the transformed lines back to Cloud Storage instead of logging them.
         .apply(TextIO.Write.named("output-book").to("gs://jcconf2016-dataflow-workshop/result/book-sample.txt"));
        p.run();
    }
}
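Note that TextIO.Write shards its output by default, so the result is a set of files named like book-sample.txt-00000-of-00003 rather than a single object. For the labs a single file is easier to inspect; a sketch of forcing one shard, assuming withNumShards and withSuffix are available in this SDK version:

.apply(TextIO.Write.named("output-book")
        .to("gs://jcconf2016-dataflow-workshop/result/book-sample")
        .withSuffix(".txt")  // file name becomes book-sample-00000-of-00001.txt
        .withNumShards(1));  // single output file, at the cost of write parallelism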
Add a transform function that splits the text into individual words
@SuppressWarnings("serial")
public class TestMain {
    private static final Logger LOG = LoggerFactory.getLogger(TestMain.class);

    public static void main(String[] args) {
        Pipeline p = Pipeline.create(
                PipelineOptionsFactory.fromArgs(args).withValidation().create());
        p.apply(TextIO.Read.named("sample-book").from("gs://jcconf2016-dataflow-workshop/sample/book-sample.txt"))
         .apply(ParDo.of(new DoFn<String, String>() {
             // Counts empty lines; the value surfaces in the Dataflow monitoring UI.
             private final Aggregator<Long, Long> emptyLines =
                     createAggregator("emptyLines", new Sum.SumLongFn());

             @Override
             public void processElement(ProcessContext c) {
                 if (c.element().trim().isEmpty()) {
                     emptyLines.addValue(1L);
                 }
                 // Split the line into words.
                 String[] words = c.element().split("[^a-zA-Z']+");
                 // Output each word encountered into the output PCollection.
                 for (String word : words) {
                     if (!word.isEmpty()) {
                         c.output(word);
                     }
                 }
             }
         }))
         .apply(TextIO.Write.named("output-book").to("gs://jcconf2016-dataflow-workshop/result/book-sample.txt"));
        p.run();
    }
}
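The split regex "[^a-zA-Z']+" treats every run of characters other than letters and apostrophes as a delimiter, so digits and punctuation never reach the output. A quick standalone illustration:

String[] words = "it's 2016, JCConf!".split("[^a-zA-Z']+");
// words -> ["it's", "JCConf"]; the digits and punctuation are consumed as delimiters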
Word Count Sample - count the number of times each word appears in the document
@SuppressWarnings("serial")
public class TestMain {
    static class MyExtractWordsFn extends DoFn<String, String> {
        private final Aggregator<Long, Long> emptyLines = createAggregator(
                "emptyLines", new Sum.SumLongFn());

        @Override
        public void processElement(ProcessContext c) {
            if (c.element().trim().isEmpty()) {
                emptyLines.addValue(1L);
            }
            // Split the line into words.
            String[] words = c.element().split("[^a-zA-Z']+");
            // Output each word encountered into the output PCollection.
            for (String word : words) {
                if (!word.isEmpty()) {
                    c.output(word);
                }
            }
        }
    }

    public static class MyCountWords extends
            PTransform<PCollection<String>, PCollection<KV<String, Long>>> {
        @Override
        public PCollection<KV<String, Long>> apply(PCollection<String> lines) {
            // Convert lines of text into individual words.
            PCollection<String> words = lines.apply(ParDo.of(new MyExtractWordsFn()));
            // Count the number of times each word occurs.
            PCollection<KV<String, Long>> wordCounts = words.apply(Count.<String>perElement());
            return wordCounts;
        }
    }

    public static class MyFormatAsTextFn extends DoFn<KV<String, Long>, String> {
        @Override
        public void processElement(ProcessContext c) {
            c.output(c.element().getKey() + ": " + c.element().getValue());
        }
    }

    private static final Logger LOG = LoggerFactory.getLogger(TestMain.class);

    public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args)
                .withValidation().create());
        p.apply(TextIO.Read.named("sample-book").from(
                "gs://jcconf2016-dataflow-workshop/sample/book-sample.txt"))
         .apply(new MyCountWords())
         .apply(ParDo.of(new MyFormatAsTextFn()))
         .apply(TextIO.Write.named("output-book")
                 .to("gs://jcconf2016-dataflow-workshop/result/book-sample.txt"));
        p.run();
    }
}
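With the word-extraction logic in a named DoFn, it can be checked locally without launching a pipeline. A minimal sketch using the SDK's DoFnTester (the test class name and input strings are made up):

import java.util.List;
import com.google.cloud.dataflow.sdk.transforms.DoFnTester;

public class MyExtractWordsFnCheck {
    public static void main(String[] args) throws Exception {
        DoFnTester<String, String> tester = DoFnTester.of(new TestMain.MyExtractWordsFn());
        List<String> words = tester.processBatch("Hello JCConf", "", "Dataflow!");
        // Prints [Hello, JCConf, Dataflow]; the empty line only increments the aggregator.
        System.out.println(words);
    }
}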
Streaming example 1
Read messages from a Pub/Sub subscription, upper-case them, and log them:

public static void main(String[] args) {
    Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
    options.setStreaming(true);  // run as an unbounded (streaming) pipeline
    Pipeline p = Pipeline.create(options);
    // Read from an existing Pub/Sub subscription instead of a bounded text file.
    p.apply(PubsubIO.Read.named("my-pubsub-input")
            .subscription("projects/sunny-573/subscriptions/jcconf2016-sub001"))
     .apply(ParDo.of(new DoFn<String, String>() {
         @Override
         public void processElement(ProcessContext c) {
             c.output(c.element().toUpperCase());
         }
     }))
     .apply(ParDo.of(new DoFn<String, Void>() {
         @Override
         public void processElement(ProcessContext c) {
             LOG.info(c.element());
         }
     }));
    p.run();
}
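This first streaming example reads from a pre-created subscription. PubsubIO.Read can also be pointed at the topic itself, as the second example below does, in which case the Dataflow service creates and manages a subscription for the job:

p.apply(PubsubIO.Read.named("my-pubsub-input")
        .topic("projects/sunny-573/topics/jcconf2016"))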
Streaming example 2
Integrate the Word Count example and write the data into a BigQuery dataset...
/*
* Copyright (C) 2015 Google Inc.
*
* Licensed under the Apache License, Version 2.0 (the "License"); you may not
* use this file except in compliance with the License. You may obtain a copy of
* the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
* WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
* License for the specific language governing permissions and limitations under
* the License.
*/
package com.jcconf2016.demo;
import java.util.ArrayList;
import java.util.List;
import org.joda.time.Duration;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableReference;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.io.PubsubIO;
import com.google.cloud.dataflow.sdk.options.Default;
import com.google.cloud.dataflow.sdk.options.Description;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.options.StreamingOptions;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.transforms.windowing.FixedWindows;
import com.google.cloud.dataflow.sdk.transforms.windowing.Window;
import com.google.cloud.dataflow.sdk.values.KV;
import com.google.cloud.dataflow.sdk.values.PCollection;
/**
 * A streaming Google Cloud Dataflow example.
 *
 * <p>
 * The pipeline reads messages from a Cloud Pub/Sub topic, applies fixed-size
 * windows, counts the words in each window, and writes the counts to BigQuery.
 *
 * <p>
 * To run this example using managed resources in Google Cloud Platform,
 * specify the following command-line options:
 * --project=<YOUR_PROJECT_ID>
 * --stagingLocation=<STAGING_LOCATION_IN_CLOUD_STORAGE>
 * --runner=BlockingDataflowPipelineRunner
 * In Eclipse, you can just modify the existing 'SERVICE' run configuration.
 */
@SuppressWarnings("serial")
public class StreamingPipeline {
    static final int WINDOW_SIZE = 1; // Default window duration in minutes

    public static interface Options extends StreamingOptions {
        @Description("Fixed window duration, in minutes")
        @Default.Integer(WINDOW_SIZE)
        Integer getWindowSize();
        void setWindowSize(Integer value);

        @Description("Whether to run the pipeline with unbounded input")
        boolean isUnbounded();
        void setUnbounded(boolean value);
    }

    private static TableReference getTableReference(Options options) {
        TableReference tableRef = new TableReference();
        tableRef.setProjectId("sunny-573");
        tableRef.setDatasetId("jcconf2016");
        tableRef.setTableId("pubsub");
        return tableRef;
    }

    private static TableSchema getSchema() {
        List<TableFieldSchema> fields = new ArrayList<>();
        fields.add(new TableFieldSchema().setName("word").setType("STRING"));
        fields.add(new TableFieldSchema().setName("count").setType("INTEGER"));
        fields.add(new TableFieldSchema().setName("window_timestamp").setType("TIMESTAMP"));
        TableSchema schema = new TableSchema().setFields(fields);
        return schema;
    }

    static class FormatAsTableRowFn extends DoFn<KV<String, Long>, TableRow> {
        @Override
        public void processElement(ProcessContext c) {
            TableRow row = new TableRow().set("word", c.element().getKey())
                    .set("count", c.element().getValue())
                    // include a field for the window timestamp
                    .set("window_timestamp", c.timestamp().toString());
            c.output(row);
        }
    }

    private static final Logger LOG = LoggerFactory.getLogger(StreamingPipeline.class);

    public static void main(String[] args) {
        Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
        options.setStreaming(true);
        Pipeline p = Pipeline.create(options);

        PCollection<String> input = p.apply(PubsubIO.Read.topic("projects/sunny-573/topics/jcconf2016"));
        // Bucket the unbounded stream into fixed windows so Count can emit per-window results.
        PCollection<String> windowedWords = input.apply(
                Window.<String>into(FixedWindows.of(Duration.standardMinutes(options.getWindowSize()))));
        PCollection<KV<String, Long>> wordCounts = windowedWords.apply(new TestMain.MyCountWords());
        wordCounts.apply(ParDo.of(new FormatAsTableRowFn()))
                  .apply(BigQueryIO.Write.to(getTableReference(options)).withSchema(getSchema()));
        p.run();
    }
}
Monitor the Dataflow streaming task from the dashboard
Open the GCP Web Console and use the Dataflow dashboard to check the execution status of each stage,
then use Cloud Logging to inspect the execution logs…
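Beyond the web console, the job can also be checked programmatically: with the non-blocking DataflowPipelineRunner, p.run() returns a PipelineResult whose state can be polled (a sketch; the log wording is illustrative):

PipelineResult result = p.run();
LOG.info("Job state: " + result.getState()); // stays RUNNING for a healthy streaming job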