This document outlines the labs for a Google Cloud Dataflow workshop. Lab 1 covers setting up the Dataflow development environment and building a first project. Lab 2 covers deploying that project to Google Cloud Platform. Lab 3 builds a streaming Dataflow pipeline: it creates Pub/Sub topics and subscriptions and deploys streaming samples that read from Pub/Sub and write to BigQuery.
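Lab 1's project setup pulls in the Dataflow Java SDK. For reference, a minimal Maven dependency would look roughly like this (a sketch; the version shown is an assumption based on the 2016-era 1.x SDK):

<dependency>
  <groupId>com.google.cloud.dataflow</groupId>
  <artifactId>google-cloud-dataflow-java-sdk-all</artifactId>
  <version>1.8.0</version> <!-- assumption: any 1.x release current at the time -->
</dependency>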
@SuppressWarnings("serial")
public class TestMain {
    private static final Logger LOG = LoggerFactory.getLogger(TestMain.class);

    public static void main(String[] args) {
        Pipeline p = Pipeline.create(
                PipelineOptionsFactory.fromArgs(args).withValidation().create());
        p.apply(TextIO.Read.named("sample-book").from("gs://jcconf2016-dataflow-workshop/sample/book-sample.txt"))
         // Transform: upper-case every line.
         .apply(ParDo.of(new DoFn<String, String>() {
             @Override
             public void processElement(ProcessContext c) {
                 c.output(c.element().toUpperCase());
             }
         }))
         // Sink: log each element instead of writing it anywhere.
         .apply(ParDo.of(new DoFn<String, Void>() {
             @Override
             public void processElement(ProcessContext c) {
                 LOG.info(c.element());
             }
         }));
        p.run();
    }
}
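Run without extra arguments, this pipeline executes locally on the DirectPipelineRunner. To submit it to the Dataflow service instead (Lab 2), the project, staging bucket, and runner can also be set in code; a minimal sketch, where the project id and bucket are placeholders:

DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
        .withValidation().as(DataflowPipelineOptions.class);
options.setProject("my-project");                        // placeholder project id
options.setStagingLocation("gs://my-bucket/staging");    // placeholder staging bucket
options.setRunner(BlockingDataflowPipelineRunner.class); // blocks until the job completes
Pipeline p = Pipeline.create(options);

The same values can instead be passed on the command line as --project, --stagingLocation, and --runner, which is what fromArgs(args) parses.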
Next, modify the program so that the output is written to Google Cloud Storage…
@SuppressWarnings("serial")
public class TestMain {
    private static final Logger LOG = LoggerFactory.getLogger(TestMain.class);

    public static void main(String[] args) {
        Pipeline p = Pipeline.create(
                PipelineOptionsFactory.fromArgs(args).withValidation().create());
        p.apply(TextIO.Read.named("sample-book").from("gs://jcconf2016-dataflow-workshop/sample/book-sample.txt"))
         .apply(ParDo.of(new DoFn<String, String>() {
             @Override
             public void processElement(ProcessContext c) {
                 c.output(c.element().toUpperCase());
             }
         }))
         // Write the transformed lines back to Cloud Storage instead of logging them.
         .apply(TextIO.Write.named("output-book").to("gs://jcconf2016-dataflow-workshop/result/book-sample.txt"));
        p.run();
    }
}
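Note that TextIO.Write shards its output by default, so the result is a set of files named like book-sample.txt-00000-of-00003 rather than a single object. For the labs a single file is easier to inspect; a sketch of forcing one shard, assuming withNumShards and withSuffix are available in this SDK version:

.apply(TextIO.Write.named("output-book")
        .to("gs://jcconf2016-dataflow-workshop/result/book-sample")
        .withSuffix(".txt")  // file name becomes book-sample-00000-of-00001.txt
        .withNumShards(1));  // single output file, at the cost of write parallelism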
Add a transform function that splits the text into individual words
@SuppressWarnings("serial")
public class TestMain {
    private static final Logger LOG = LoggerFactory.getLogger(TestMain.class);

    public static void main(String[] args) {
        Pipeline p = Pipeline.create(
                PipelineOptionsFactory.fromArgs(args).withValidation().create());
        p.apply(TextIO.Read.named("sample-book").from("gs://jcconf2016-dataflow-workshop/sample/book-sample.txt"))
         .apply(ParDo.of(new DoFn<String, String>() {
             // Counts empty lines; the value surfaces in the Dataflow monitoring UI.
             private final Aggregator<Long, Long> emptyLines =
                     createAggregator("emptyLines", new Sum.SumLongFn());

             @Override
             public void processElement(ProcessContext c) {
                 if (c.element().trim().isEmpty()) {
                     emptyLines.addValue(1L);
                 }
                 // Split the line into words.
                 String[] words = c.element().split("[^a-zA-Z']+");
                 // Output each word encountered into the output PCollection.
                 for (String word : words) {
                     if (!word.isEmpty()) {
                         c.output(word);
                     }
                 }
             }
         }))
         .apply(TextIO.Write.named("output-book").to("gs://jcconf2016-dataflow-workshop/result/book-sample.txt"));
        p.run();
    }
}
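The split regex "[^a-zA-Z']+" treats every run of characters other than letters and apostrophes as a delimiter, so digits and punctuation never reach the output. A quick standalone illustration:

String[] words = "it's 2016, JCConf!".split("[^a-zA-Z']+");
// words -> ["it's", "JCConf"]; the digits and punctuation are consumed as delimiters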
Word Count Sample - count the number of times each word appears in the document
@SuppressWarnings("serial")
public class TestMain {
    static class MyExtractWordsFn extends DoFn<String, String> {
        private final Aggregator<Long, Long> emptyLines = createAggregator(
                "emptyLines", new Sum.SumLongFn());

        @Override
        public void processElement(ProcessContext c) {
            if (c.element().trim().isEmpty()) {
                emptyLines.addValue(1L);
            }
            // Split the line into words.
            String[] words = c.element().split("[^a-zA-Z']+");
            // Output each word encountered into the output PCollection.
            for (String word : words) {
                if (!word.isEmpty()) {
                    c.output(word);
                }
            }
        }
    }

    public static class MyCountWords extends
            PTransform<PCollection<String>, PCollection<KV<String, Long>>> {
        @Override
        public PCollection<KV<String, Long>> apply(PCollection<String> lines) {
            // Convert lines of text into individual words.
            PCollection<String> words = lines.apply(ParDo.of(new MyExtractWordsFn()));
            // Count the number of times each word occurs.
            PCollection<KV<String, Long>> wordCounts = words.apply(Count.<String>perElement());
            return wordCounts;
        }
    }

    public static class MyFormatAsTextFn extends DoFn<KV<String, Long>, String> {
        @Override
        public void processElement(ProcessContext c) {
            c.output(c.element().getKey() + ": " + c.element().getValue());
        }
    }

    private static final Logger LOG = LoggerFactory.getLogger(TestMain.class);

    public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args)
                .withValidation().create());
        p.apply(TextIO.Read.named("sample-book").from(
                "gs://jcconf2016-dataflow-workshop/sample/book-sample.txt"))
         .apply(new MyCountWords())
         .apply(ParDo.of(new MyFormatAsTextFn()))
         .apply(TextIO.Write.named("output-book")
                 .to("gs://jcconf2016-dataflow-workshop/result/book-sample.txt"));
        p.run();
    }
}
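With the word-extraction logic in a named DoFn, it can be checked locally without launching a pipeline. A minimal sketch using the SDK's DoFnTester (the test class name and input strings are made up):

import java.util.List;
import com.google.cloud.dataflow.sdk.transforms.DoFnTester;

public class MyExtractWordsFnCheck {
    public static void main(String[] args) throws Exception {
        DoFnTester<String, String> tester = DoFnTester.of(new TestMain.MyExtractWordsFn());
        List<String> words = tester.processBatch("Hello JCConf", "", "Dataflow!");
        // Prints [Hello, JCConf, Dataflow]; the empty line only increments the aggregator.
        System.out.println(words);
    }
}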
Streaming example 1
Read messages from a Pub/Sub subscription, upper-case them, and log them:

public static void main(String[] args) {
    Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
    options.setStreaming(true);  // run as an unbounded (streaming) pipeline
    Pipeline p = Pipeline.create(options);
    // Read from an existing Pub/Sub subscription instead of a bounded text file.
    p.apply(PubsubIO.Read.named("my-pubsub-input")
            .subscription("projects/sunny-573/subscriptions/jcconf2016-sub001"))
     .apply(ParDo.of(new DoFn<String, String>() {
         @Override
         public void processElement(ProcessContext c) {
             c.output(c.element().toUpperCase());
         }
     }))
     .apply(ParDo.of(new DoFn<String, Void>() {
         @Override
         public void processElement(ProcessContext c) {
             LOG.info(c.element());
         }
     }));
    p.run();
}
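This first streaming example reads from a pre-created subscription. PubsubIO.Read can also be pointed at the topic itself, as the second example below does, in which case the Dataflow service creates and manages a subscription for the job:

p.apply(PubsubIO.Read.named("my-pubsub-input")
        .topic("projects/sunny-573/topics/jcconf2016"))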
Streaming example 2
Integrate the Word Count example and write the data into a BigQuery dataset...
/*
* Copyright (C) 2015 Google Inc.
*
* Licensed under the Apache License, Version 2.0 (the "License"); you may not
* use this file except in compliance with the License. You may obtain a copy of
* the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
* WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
* License for the specific language governing permissions and limitations under
* the License.
*/
package com.jcconf2016.demo;
import java.util.ArrayList;
import java.util.List;
import org.joda.time.Duration;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableReference;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.io.PubsubIO;
import com.google.cloud.dataflow.sdk.options.Default;
import com.google.cloud.dataflow.sdk.options.Description;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.options.StreamingOptions;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.transforms.windowing.FixedWindows;
import com.google.cloud.dataflow.sdk.transforms.windowing.Window;
import com.google.cloud.dataflow.sdk.values.KV;
import com.google.cloud.dataflow.sdk.values.PCollection;
/**
 * A streaming Google Cloud Dataflow example.
 *
 * <p>
 * The pipeline reads messages from a Cloud Pub/Sub topic, applies fixed-size
 * windows, counts the words in each window, and writes the counts to BigQuery.
 *
 * <p>
 * To run this example using managed resources in Google Cloud Platform,
 * specify the following command-line options:
 * --project=<YOUR_PROJECT_ID>
 * --stagingLocation=<STAGING_LOCATION_IN_CLOUD_STORAGE>
 * --runner=BlockingDataflowPipelineRunner
 * In Eclipse, you can just modify the existing 'SERVICE' run configuration.
 */
@SuppressWarnings("serial")
public class StreamingPipeline {
    static final int WINDOW_SIZE = 1; // Default window duration in minutes

    public static interface Options extends StreamingOptions {
        @Description("Fixed window duration, in minutes")
        @Default.Integer(WINDOW_SIZE)
        Integer getWindowSize();
        void setWindowSize(Integer value);

        @Description("Whether to run the pipeline with unbounded input")
        boolean isUnbounded();
        void setUnbounded(boolean value);
    }

    private static TableReference getTableReference(Options options) {
        TableReference tableRef = new TableReference();
        tableRef.setProjectId("sunny-573");
        tableRef.setDatasetId("jcconf2016");
        tableRef.setTableId("pubsub");
        return tableRef;
    }

    private static TableSchema getSchema() {
        List<TableFieldSchema> fields = new ArrayList<>();
        fields.add(new TableFieldSchema().setName("word").setType("STRING"));
        fields.add(new TableFieldSchema().setName("count").setType("INTEGER"));
        fields.add(new TableFieldSchema().setName("window_timestamp").setType("TIMESTAMP"));
        TableSchema schema = new TableSchema().setFields(fields);
        return schema;
    }

    static class FormatAsTableRowFn extends DoFn<KV<String, Long>, TableRow> {
        @Override
        public void processElement(ProcessContext c) {
            TableRow row = new TableRow().set("word", c.element().getKey())
                    .set("count", c.element().getValue())
                    // include a field for the window timestamp
                    .set("window_timestamp", c.timestamp().toString());
            c.output(row);
        }
    }

    private static final Logger LOG = LoggerFactory.getLogger(StreamingPipeline.class);

    public static void main(String[] args) {
        Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
        options.setStreaming(true);
        Pipeline p = Pipeline.create(options);

        PCollection<String> input = p.apply(PubsubIO.Read.topic("projects/sunny-573/topics/jcconf2016"));
        // Bucket the unbounded stream into fixed windows so Count can emit per-window results.
        PCollection<String> windowedWords = input.apply(
                Window.<String>into(FixedWindows.of(Duration.standardMinutes(options.getWindowSize()))));
        PCollection<KV<String, Long>> wordCounts = windowedWords.apply(new TestMain.MyCountWords());
        wordCounts.apply(ParDo.of(new FormatAsTableRowFn()))
                  .apply(BigQueryIO.Write.to(getTableReference(options)).withSchema(getSchema()));
        p.run();
    }
}
Monitor the Dataflow streaming task from the dashboard
Open the GCP Web Console and use the Dataflow dashboard to check the execution status of each stage,
then use Cloud Logging to inspect the execution logs…
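Beyond the web console, the job can also be checked programmatically: with the non-blocking DataflowPipelineRunner, p.run() returns a PipelineResult whose state can be polled (a sketch; the log wording is illustrative):

PipelineResult result = p.run();
LOG.info("Job state: " + result.getState()); // stays RUNNING for a healthy streaming job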