BigQuery case study
in Groovenauts
Dive into the
DataflowJavaSDK
BigQuery case study
in Groovenauts
Tomoyuki Chikanaga
2015.04.24
#bq_sushi tokyo #1
Groovenauts, Inc.
HQ:Fukuoka
Tokyo branch
Our Business
• MAGELLAN (new)
• Consulting
• Game Server
BigQuery anywhere
• Container Hosting Service
• Support HTTP/MQTT
• Built on Google Cloud Platform
BigQuery in MAGELLAN
• Resource Monitoring (VM/container etc.)
• Developer's Activity Logs
• Application Logs
• End-user's Access Logs
Schematic View
End-user
Developer
Containers
router
developers console
API request
Deploy
Deploy
Resource Monitoring
End-user
Developer
Containers
router
developers console
API request
Monitoring
System
usage logs
Watch System Usage
Extract user's usage
billing
(not yet implemented)
Developer's Activity Logs
End-user
Developer
Containers
router
developers console
Deploy
Deploy
developer's activity logs
Analyze/Trace
developer's actions
Application logs
End-user
Developer
Containers
router
developers console
API request
application logs
View logs
query
End user's access logs
End-user
Developer
Containers
router
developers console
API request
access logs
View logs
query
metrics
BigQuery Quota
• Concurrent rate limit: up to 20 concurrent queries
• Daily limit: 20,000 queries / project
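One way a multi-tenant service could stay under the 20-concurrent-query limit is a client-side semaphore gate. This is a plain-Java sketch, not anything from MAGELLAN; `QueryGate` and its API are invented for illustration.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical guard that caps in-flight BigQuery queries at a fixed number.
public class QueryGate {
    private final Semaphore slots;

    public QueryGate(int maxConcurrent) {
        this.slots = new Semaphore(maxConcurrent);
    }

    // Blocks until a slot is free, runs the query, then releases the slot.
    public <T> T run(Supplier<T> query) {
        slots.acquireUninterruptibly();
        try {
            return query.get();
        } finally {
            slots.release();
        }
    }

    public int freeSlots() {
        return slots.availablePermits();
    }

    public static void main(String[] args) {
        QueryGate gate = new QueryGate(20); // BigQuery's concurrent rate limit
        String result = gate.run(() -> "42 rows");
        System.out.println(result + ", free slots: " + gate.freeSlots());
    }
}
```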
BigQuery Quota
End-user
Developer
Containers
router
developers console
View logs
query
may reach the quota limit
as the number of developers grows.
BigQuery Quota
End-user
Developer
Containers
router
developers console
View logs
we plan to migrate to
other storage systems.
??
BigQuery in Business
• CPG (Manufacturer/Distribution/Retail)
• Automotive after-market
BigQuery in Business
• POS Data Analysis
• Excel + BigQuery
• GPS Telemetric Analysis
• company vehicle utilization/travel distance
etc..
POS Data Analysis
• Replace existing system
• RDB → BigQuery
• Excel: SQL Generation,
Visualization (Table, Graph)
Excel: SQL Generation
• Generate SQL using Excel functions
parameters
Templates for SQL
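The idea above — SQL templates filled in with parameters taken from Excel cells — can be sketched in plain Java. The placeholder syntax, parameter names, and table name here are all invented for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal SQL-template filler: replaces {name} placeholders with parameter values.
public class SqlTemplate {
    public static String fill(String template, Map<String, String> params) {
        String sql = template;
        for (Map.Entry<String, String> e : params.entrySet()) {
            sql = sql.replace("{" + e.getKey() + "}", e.getValue());
        }
        return sql;
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("table", "pos.sales");      // hypothetical dataset.table
        params.put("from", "2015-04-01");
        String sql = fill(
            "SELECT store_id, SUM(amount) FROM [{table}] " +
            "WHERE date >= '{from}' GROUP BY store_id", params);
        System.out.println(sql);
    }
}
```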
POS Data Analysis
• Result
• Analysis Time
• 12x faster
• Running Cost
• 95% cut
GPS Telemetric Analysis
Vehicle device
Customer
GPS Location Data
GPS Telemetric Analysis
BigQuery Pros. & Cons.
• Pros.
• Running Cost
• Scalability
• Cons.
• Stability
• Query Latency / Quota
Dive into the
DataflowJavaSDK
@nagachika
2015.04.24
#bq_sushi tokyo #1
• @nagachika (twitter/github)
• ruby-trunk-changes (d.hatena.ne.jp/nagachika)
• Ruby committer,
2.2 stable branch maintainer
• Fukuoka.rb (Regional Ruby Community)
Who are you?
One Day…
Boss
I've heard about Google Cloud Dataflow!
It may unify Batch & Streaming Distributed Processing.
Wow, that sounds awesome.
I'd like to integrate it with our service.
Eh!? I have to investigate the details...
I'll leave it to you.
Two Missions
• Port the SDK to other languages (Ruby etc.)
• Implement Custom Stream Input (AMQP)
from: https://cloud.google.com/dataflow/what-is-google-cloud-dataflow
Dataflow SDK for Java
Open Source
• Apache License Version 2.0
• You can read it
• You can modify it
• You can run it
• locally (PubsubIO is not supported)
• on the Cloud Dataflow service(beta)
http://dataflow-java-sdk-weekly.hatenablog.com/
Read every commit
• catch up on recent hot topics
• see which related components are
modified together
• get to know the developers and their territories
Disclaimer
• I’m not good at Java.
• I'm a newbie at Distributed Computing.
Directory Tree
• sdk/src/
• main/java/com/google/cloud/dataflow/sdk (SDK Source Code)
• test/java/com/google/cloud/dataflow/sdk (Test Code for SDK)
• examples/src/
• main/java/com/google/cloud/dataflow/examples (Example Pipeline
Source Code)
• test/java/com/google/cloud/dataflow/examples (Test for Examples)
• contrib/
• Community Contributed Library (join-library)
sdk/src/main/java/com/google/cloud/dataflow/sdk/
• coders/
• Coder classes
• io/
• Input/Output (Source/Sink)
• options/
• Command Line Options Utilities
• runners/
• Pipeline runners: drivers that run a pipeline locally or on the service
• transforms/
• PTransform classes
• values/
• PCollection classes
Pipeline Components
Source → PCollection → PTransform → PCollection → … → Sink
Pipeline as a Code
Pipeline p = Pipeline.create(options);
p.apply( TextIO.Read.named("Read").from(input) )
 .apply( new MyTransform() )
 .apply( TextIO.Write.named("Write").to(output) );
PCollection
PTransform
public <Output extends POutput>
Output apply(PTransform<? super PCollection<T>, Output> t)
• Pipeline.apply() / PCollection.apply() signature
PCollection
• Container of data in Dataflow Pipeline
• Bounded (fixed size) or
Unbounded (variable size ≒ streaming)
• Handle for the real data (elements),
cf. file descriptor, pipe etc.
PCollection
(Diagram: a Bounded PCollection holds a fixed set of elements; an Unbounded PCollection keeps growing.)
Coder
• Data in PCollection = Byte Stream
• Decode/Encode at PTransform’s In/Out
Coder
(Diagram: an element is Coder.encode()d into the PCollection between PTransforms and Coder.decode()d on the way out.)
Coder
• Integer
• Double
• String
• List<T>
• Map<K,V>
• KV<K,V> (Key Value pair)
• TableRow (← BigQuery Table’s row)
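A toy illustration of the Coder idea above — elements cross PTransform boundaries as bytes and are decoded on the other side. This is not the SDK's `Coder<T>` API, just a minimal String round-trip.

```java
import java.nio.charset.StandardCharsets;

// Stand-in for what a StringUtf8-style coder does: encode an element to
// bytes for the PCollection, decode it back when the next transform reads it.
public class StringRoundTrip {
    public static byte[] encode(String value) {
        return value.getBytes(StandardCharsets.UTF_8);
    }

    public static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] wire = encode("bq_sushi");   // what lives in the PCollection
        System.out.println(decode(wire));   // element revived unchanged
    }
}
```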
PTransform
• Each step in pipeline
• Core Transforms
• ParDo/GroupByKey/Combine/Flatten/Join
• Composite Transforms
• Root Transforms (read, write, create)
• Predefined Transforms (SDK Builtin)
User Defined Code
Composite Transform
• Construct a Transform from Transforms
• e.g. Sum, Count.Globally<T> etc.
Composite Transform
Count
• Override apply() method
public class Count {
public static class Globally<T>
extends PTransform<PCollection<T>, PCollection<Long>> {
@Override
public PCollection<Long> apply(PCollection<T> input) {
Combine.Globally<Long, Long> sumGlobally;
…
sumGlobally = Sum.longsGlobally().withFanout(fanout);
…
return input.apply(ParDo.named("Init")
.of(new DoFn<T, Long>() {
@Override
public void processElement(ProcessContext c) {
c.output(1L);
}
}))
.apply(sumGlobally);
}
}
}
PTransform.apply()
public abstract class PTransform<Input extends PInput,
Output extends POutput> {
public Output apply(Input input) {
}
}
apply()
PCollection.apply()
=>Pipeline.applyTransform()
=>Pipeline.applyInternal()
=>PipelineRunner.apply()
=>PTransform.apply()
apply()
• used in the construction phase
• apply() constructs a Pipeline from
Transforms
ParDo & DoFn
• User defined Runtime Code = DoFn
return input.apply(ParDo.named("Init")
.of(new DoFn<T, Long>() {
@Override
public void processElement(ProcessContext c) {
c.output(1L);
}
}))
.apply(sumGlobally);
User Defined Code
processElement
• DoFn<I,O>.processElement()
• Receive an element of input PCollection
• I ProcessContext.element()
• void ProcessContext.output(O output)
void DoFn<I,O>.processElement(ProcessContext context)
Example of DoFn
static class ExtractWordsFn extends DoFn<String, String> {
public void processElement(ProcessContext c) {
String[] words = c.element().split("[^a-zA-Z']+");
for (String word : words) {
if (!word.isEmpty()) {
c.output(word);
}
}
}
}
static class FormatCountsFn extends DoFn<KV<String, Long>, String> {
public void processElement(ProcessContext c) {
c.output(c.element().getKey() + ": " + c.element().getValue());
}
}
from WordCount.java
Staging
• How is user-defined code loaded into the
Dataflow managed service?
• DoFn<I,O> implements Serializable
• .jar files in $CLASSPATH are
uploaded to the GCS `staging` bucket
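A plain-Java sketch of why `DoFn` implementing `Serializable` matters: the function object is turned into bytes (staging) and revived elsewhere (the worker). `ShipFn` and `AddExclaim` are invented names standing in for a user-defined DoFn.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.function.Function;

public class ShipFn {
    // A Serializable function object, standing in for a user-defined DoFn.
    public static class AddExclaim implements Function<String, String>, Serializable {
        @Override
        public String apply(String s) { return s + "!"; }
    }

    // "Staging": turn the function object into bytes.
    public static byte[] serialize(Object o) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            ObjectOutputStream oos = new ObjectOutputStream(bos);
            oos.writeObject(o);
            oos.close(); // flush before grabbing the bytes
            return bos.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    // "Worker": revive the function object from bytes.
    @SuppressWarnings("unchecked")
    public static Function<String, String> deserialize(byte[] bytes) {
        try (ObjectInputStream ois =
                 new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (Function<String, String>) ois.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        byte[] shipped = serialize(new AddExclaim());            // staging side
        Function<String, String> revived = deserialize(shipped); // worker side
        System.out.println(revived.apply("dataflow")); // prints "dataflow!"
    }
}
```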
Two Missions
• Port the SDK to other languages (Ruby etc.)
→ The Dataflow service depends on the JVM runtime.
(A Python SDK is planned for a future release.)
• Implement Custom Stream Input (AMQP)
Source/Sink
• TextIO (GCS)
• DatastoreIO
• BigQueryIO
• PubsubIO (for streaming mode)
PubsubIO impl. in SDK
• PubsubIO.Read.Bound<T> extends
PTransform<PInput, PCollection<T>>
• Bound doesn't have any runtime implementation
• runners.worker.ReaderFactory translates
these objects into Source/Sink types and
parameters and transports them to Dataflow
service workers
Two Missions
• Port the SDK to other languages (Ruby etc.)
• Implement Custom Stream Input (AMQP)
→ Custom input development for Dataflow is not supported yet.
(Is there no future plan?)
I've found that there's no way to accomplish
these missions right now...
OK. But stay tuned for the activities in Dataflow.
Roger.
Official Documentation
https://cloud.google.com/dataflow/
Let’s dive into
the
DataflowJavaSDK
Dataflow Documentation
Windowing
• for Streaming mode
• for Combine/GroupByKey
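A plain-Java sketch (no SDK) of what GroupByKey followed by a counting Combine does, matching the k1/k2 example on the next slide.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// GroupByKey: collect values per key; Combine: here, count them.
public class GroupCombine {
    public static Map<String, List<Integer>> groupByKey(
            List<Map.Entry<String, Integer>> kvs) {
        Map<String, List<Integer>> grouped = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> kv : kvs) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                   .add(kv.getValue());
        }
        return grouped;
    }

    public static Map<String, Integer> countPerKey(
            Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        grouped.forEach((k, vs) -> counts.put(k, vs.size()));
        return counts;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> kvs = Arrays.asList(
            new AbstractMap.SimpleEntry<>("k1", 1),
            new AbstractMap.SimpleEntry<>("k1", 2),
            new AbstractMap.SimpleEntry<>("k1", 3),
            new AbstractMap.SimpleEntry<>("k2", 2));
        // k1:[1,2,3], k2:[2]  →  counts k1=3, k2=1
        System.out.println(countPerKey(groupByKey(kvs))); // {k1=3, k2=1}
    }
}
```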
Windowing
k1: 1
k1: 2
k1: 3
k2: 2
Group
by
Key
k1: [1,2,3]
k2: [2]
Combine
k1: 3
k2: 1
k1: [1,2,3]
k2: [2]
• These transforms require all elements of the input
⇒ in streaming mode, inputs are unbounded.
Windowing
• Fixed Time Windows
• Sliding Time Windows
• Session Windows
• Single Global Window
Group elements into windows by timestamp
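The fixed-window case above can be sketched in a few lines of plain Java (not SDK code): each element is assigned to the window containing its timestamp.

```java
// Fixed time windows: the timeline is cut into equal-size slices, and an
// element belongs to the slice containing its timestamp.
public class FixedWindows {
    // Returns the start of the fixed window that contains `timestampMs`.
    public static long windowStart(long timestampMs, long windowSizeMs) {
        return timestampMs - (timestampMs % windowSizeMs);
    }

    public static void main(String[] args) {
        long oneMinute = 60_000L;
        // Two events 10s apart fall into the same 1-minute window...
        System.out.println(windowStart(90_000L, oneMinute));  // 60000
        System.out.println(windowStart(100_000L, oneMinute)); // 60000
        // ...a later event lands in the next window.
        System.out.println(windowStart(125_000L, oneMinute)); // 120000
    }
}
```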
Trigger
• Streaming data may arrive with
some delay
• Dataflow should wait for a while
after the end of a window in wall-clock time
• Time-Based Triggers
• Data-Driven Triggers
