BigQuery case study in Groovenauts & Dive into the DataflowJavaSDK

Presentation slides for #bq_sushi tokyo #1

Published in: Technology

  1. 1. BigQuery case study in Groovenauts Dive into the DataflowJavaSDK
  2. 2. BigQuery case study in Groovenauts Tomoyuki Chikanaga 2015.04.24 #bq_sushi tokyo #1
  3. 3. Groovenauts, Inc. HQ:Fukuoka Tokyo branch
  4. 4. Our Business • MAGELLAN (new) • Consulting • Game Server
  5. 5. BigQuery anywhere • MAGELLAN (new) • Consulting • Game Server
  6. 6. BigQuery anywhere • MAGELLAN (new) • Consulting • Game Server
  7. 7. • Container Hosting Service • Support HTTP/MQTT • Built on Google Cloud Platform
  8. 8. BigQuery in MAGELLAN • Resource Monitoring (VM/container etc..) • Developer's Activity Logs • Application Logs • End-user's Access Logs
  9. 9. Schematic View End-user Developer Containers router developers console API request Deploy Deploy
  10. 10. Resource Monitoring End-user Developer Containers router developers console API request Monitoring System usage logs Watch System Usage Extract user's usage billing (not yet implemented)
  11. 11. Developer's Activity Logs End-user Developer Containers router developers console Deploy Deploy developer's activity logs Analyze/Trace developer's action
  12. 12. Application logs End-user Developer Containers router developers console API request application logs View logs query
  13. 13. End user's access logs End-user Developer Containers router developers console API request access logs View logs query metrics
  14. 14. BigQuery Quota • Concurrent rate limit:
 up to 20 concurrent queries. • Daily limit: 20,000 queries / project
  15. 15. BigQuery Quota End-user Developer Containers router developers console View logs query • Queries may reach the quota limit as the number of developers increases.
  16. 16. BigQuery Quota End-user Developer Containers router developers console View logs • We plan to migrate to other storage. ??
  17. 17. BigQuery in Business • CPG(Maker/Distribution/Retail) • Automotive after-market
  18. 18. BigQuery in Business • POS Data Analysis • Excel + BigQuery • GPS Telemetric Analysis • company vehicle utilization/travel distance etc..
  19. 19. POS Data Analysis • Replace existing system • RDB → BigQuery • Excel: SQL Generation,
 Visualization(Table, Graph)
  20. 20. Excel: SQL Generation • Generate SQL using Excel functions parameters Templates for SQL
  21. 21. POS Data Analysis • Result • Analysis Time • 12x faster • Running Cost • 95% cut
  22. 22. GPS Telemetric Analysis Vehicle device Customer GPS Location Data
  23. 23. GPS Telemetric Analysis
  24. 24. BigQuery Pros. & Cons. • Pros. • Running Cost • Scalability • Cons. • Stability • Query Latency / Quota
  25. 25. Dive into the DataflowJavaSDK @nagachika 2015.04.24 #bq_sushi tokyo #1
  26. 26. • @nagachika (twitter/github) • ruby-trunk-changes (d.hatena.ne.jp/nagachika) • Ruby committer
 2.2 stable branch maintainer • Fukuoka.rb (Regional Ruby Community) Who are you?
  27. 27. One Day… Boss I've heard about Google Cloud Dataflow! It may unify Batch & Streaming Distributed Processing. Wow, That sounds awesome. I'd like to integrate it with our service. Eh!? I have to investigate the details... I'll leave it to you.
  28. 28. Two Missions • Port SDK to other Language (Ruby etc..) • Implement Custom Stream Input (AMQP)
  29. 29. from: https://cloud.google.com/dataflow/what-is-google-cloud-dataflow
  30. 30. Dataflow SDK for Java
  31. 31. Open Source
  32. 32. Open Source • Apache License Version 2.0 • You can read it • You can modify it • You can run it • locally (PubsubIO is not supported) • on the Cloud Dataflow service(beta)
  33. 33. http://dataflow-java-sdk-weekly.hatenablog.com/
  34. 34. Read every commit • Catch up on recent hot topics • Related components are modified concurrently • Get to know the developers and their territory
  35. 35. Disclaimer • I’m not good at Java. • I’m a newbie at distributed computing.
  36. 36. Directory Tree • sdk/src/ • main/java/com/google/cloud/dataflow/sdk (SDK Source Code) • test/java/com/google/cloud/dataflow/sdk (Test Code for SDK) • examples/src/ • main/java/com/google/cloud/dataflow/examples (Example Pipeline Source Code) • test/java/com/google/cloud/dataflow/examples (Test for Examples) • contrib/
 Community Contributed Library (join-library)
  37. 37. sdk/src/main/java/com/google/cloud/dataflow/sdk/ • coders/ • Coder classes • io/ • Input/Output (Source/Sink) • options/ • Command Line Options Utilities • runners/ • Pipeline runners: drivers that run a pipeline locally or on the service • transforms/ • PTransform classes • values/ • PCollection classes
  38. 38. Pipeline Components (diagram: Source, PCollection, PTransform, Sink)
  39. 39. Pipeline as Code
      Pipeline p = Pipeline.create(options);
      p.apply( TextIO.Read.named("Read").from(input) )
       .apply( new MyTransform() )
       .apply( TextIO.Write.named("Write").to(output) );
      (slide annotations on the code: PCollection, PTransform)
      • Pipeline.apply()/PCollection.apply() Signature
      public <Output extends POutput> Output apply(PTransform<? super PCollection<T>, Output> t)
  40. 40. PCollection • Container of data in Dataflow Pipeline • Bounded (fixed size) or
 Unbounded (variable size ≒ streaming) • Handle to the real data (elements),
 cf. file descriptor, pipe etc..
  41. 41. PCollection Bounded PCollection … Unbounded PCollection Elements
  42. 42. Coder • Data in PCollection = Byte Stream • Decode/Encode at PTransform’s In/Out
  43. 43. Coder PTransform PTransform PCollection elemPValue PValue Coder.encode() Coder.decode()
  44. 44. Coder • Integer • Double • String • List<T> • Map<K,V> • KV<K,V> (Key Value pair) • TableRow (← BigQuery Table’s row)
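As a rough illustration of the encode/decode round trip that a coder performs between two transforms, here is a minimal sketch in plain Java with no SDK dependency. `SimpleCoder`, `Utf8StringCoder`, and `CoderSketch` are hypothetical names invented for this example. They are not the SDK's `Coder` API, and the real classes have different signatures.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical interface for illustration only; not the SDK's Coder class.
interface SimpleCoder<T> {
    byte[] encode(T value);
    T decode(byte[] bytes);
}

// Assumed example coder: represents a String as its UTF-8 bytes.
class Utf8StringCoder implements SimpleCoder<String> {
    public byte[] encode(String value) {
        return value.getBytes(StandardCharsets.UTF_8);
    }

    public String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }
}

class CoderSketch {
    // Encodes on the producing side and decodes on the consuming side,
    // so only bytes cross the boundary between the two steps.
    static <T> T roundTrip(SimpleCoder<T> coder, T value) {
        byte[] wire = coder.encode(value);
        return coder.decode(wire);
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(new Utf8StringCoder(), "sushi")); // prints "sushi"
    }
}
```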
  45. 45. PTransform • Each step in pipeline • Core Transforms • ParDo/GroupByKey/Combine/Flatten/Join • Composite Transforms • Root Transforms (read, write, create) • Predefined Transforms (SDK Builtin)
  46. 46. PTransform • Each step in pipeline • Core Transform • ParDo/GroupByKey/Combine/Flatten/Join • Composite Transforms • Root Transforms (read, write, create) • Predefined Transforms (SDK Builtin) User Defined Code
  47. 47. Composite Transform • Construct a Transform from Transforms • ex) Sum, Count.Globally<T> etc.. Composite Transform
  48. 48. Count • Override apply() method
      public class Count {
        public static class Globally<T> extends PTransform<PCollection<T>, PCollection<Long>> {
          @Override
          public PCollection<Long> apply(PCollection<T> input) {
            Combine.Globally<Long, Long> sumGlobally;
            …
            sumGlobally = Sum.longsGlobally().withFanout(fanout);
            …
            return input.apply(ParDo.named("Init")
                .of(new DoFn<T, Long>() {
                  @Override
                  public void processElement(ProcessContext c) {
                    c.output(1L);
                  }
                }))
                .apply(sumGlobally);
          }
        }
      }
  49. 49. PTransform.apply()
      public abstract class PTransform<Input extends PInput, Output extends POutput> {
        public Output apply(Input input) { }
      }
  50. 50. apply() PCollection.apply() =>Pipeline.applyTransform() =>Pipeline.applyInternal() =>PipelineRunner.apply() =>PTransform.apply()
  51. 51. apply() • Used in the construction phase • apply() constructs a Pipeline from Transforms
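The distinction between constructing a pipeline and running it can be sketched without the SDK. In the toy model below, `apply()` only records a step and nothing executes until `run()` is called. `DeferredPipeline` and `DeferredPipelineSketch` are hypothetical classes written for this illustration; they assume a simple list-in, list-out step and do not reflect how the SDK's runners work internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

// Hypothetical stand-in for a pipeline; not the SDK's Pipeline class.
class DeferredPipeline {
    private final List<Function<List<String>, List<String>>> steps = new ArrayList<>();
    int executedSteps = 0;

    // Construction phase: records the step and runs nothing.
    DeferredPipeline apply(Function<List<String>, List<String>> step) {
        steps.add(step);
        return this;
    }

    // Execution phase: runs the recorded steps in order.
    List<String> run(List<String> input) {
        List<String> current = input;
        for (Function<List<String>, List<String>> step : steps) {
            current = step.apply(current);
            executedSteps++;
        }
        return current;
    }
}

class DeferredPipelineSketch {
    // Example step: upper-cases every element.
    static List<String> upperCase(List<String> in) {
        List<String> out = new ArrayList<>();
        for (String s : in) {
            out.add(s.toUpperCase(Locale.ROOT));
        }
        return out;
    }

    public static void main(String[] args) {
        DeferredPipeline p = new DeferredPipeline().apply(DeferredPipelineSketch::upperCase);
        System.out.println(p.executedSteps);               // 0: apply() ran no user code
        System.out.println(p.run(Arrays.asList("a", "b"))); // [A, B]
    }
}
```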
  52. 52. ParDo & DoFn • User defined Runtime Code = DoFn
      return input.apply(ParDo.named("Init")
          .of(new DoFn<T, Long>() {
            @Override
            public void processElement(ProcessContext c) {
              c.output(1L);
            }
          }))
          .apply(sumGlobally);
      User Defined Code
  53. 53. processElement • DoFn<I,O>.processElement() • Receive an element of input PCollection • I ProcessContext.element() • void ProcessContext.output(O output) void DoFn<I,O>.processElement(ProcessContext context)
  54. 54. Example of DoFn
      static class ExtractWordsFn extends DoFn<String, String> {
        public void processElement(ProcessContext c) {
          String[] words = c.element().split("[^a-zA-Z']+");
          for (String word : words) {
            if (!word.isEmpty()) {
              c.output(word);
            }
          }
        }
      }
      static class FormatCountsFn extends DoFn<KV<String, Long>, String> {
        public void processElement(ProcessContext c) {
          c.output(c.element().getKey() + ": " + c.element().getValue());
        }
      }
      from WordCount.java
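To see what the two functions from WordCount.java do to a single element, the same logic can be run in plain Java without the SDK. This is only a sketch: `ExtractWordsSketch` and its two static methods are hypothetical helpers that return values instead of calling `c.output()`.

```java
import java.util.ArrayList;
import java.util.List;

class ExtractWordsSketch {
    // Same split pattern and empty-token filter as ExtractWordsFn,
    // but collecting the words into a list instead of emitting them.
    static List<String> extractWords(String line) {
        List<String> words = new ArrayList<>();
        for (String word : line.split("[^a-zA-Z']+")) {
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    // Same string formatting as FormatCountsFn.
    static String formatCount(String word, long count) {
        return word + ": " + count;
    }

    public static void main(String[] args) {
        System.out.println(extractWords("  Hello, world! it's 2015")); // [Hello, world, it's]
        System.out.println(formatCount("world", 2L));                   // world: 2
    }
}
```

The leading whitespace yields an empty first token, which is why the `isEmpty()` check is needed.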
  55. 55. Staging • How to load user defined code in Dataflow managed service? • DoFn<I,O> implements Serializable • .jar files in $CLASSPATH are uploaded to GCS `staging` bucket
  56. 56. Two Missions • Port SDK to other Language (Ruby etc..) • Implement Custom Stream Input (AMQP) Dataflow service depends on the JVM runtime.
 (Python SDK is planned for future release.)
  57. 57. Source/Sink • TextIO (GCS) • DatastoreIO • BigQueryIO • PubsubIO (for streaming mode)
  58. 58. PubsubIO impl. in SDK • PubsubIO.Read.Bound<T> extends PTransform<PInput, PCollection<T>> • Bound doesn’t have any runtime implementation. • runners.worker.ReaderFactory translates these objects into Source/Sink types and parameters and transports them to the Dataflow service workers
  59. 59. Two Missions • Port SDK to other Language (Ruby etc..) • Implement Custom Stream Input (AMQP) Dataflow custom input development is not supported yet. (Is there no future plan?)
  60. 60. OK. But stay tuned for the activities in Dataflow. I've found that there's no way to accomplish these missions right now... Roger.
  61. 61. Official Documentation https://cloud.google.com/dataflow/
  62. 62. Official Documentation https://cloud.google.com/dataflow/
  63. 63. Let’s dive into the DataflowJavaSDK
  64. 64. Let’s dive into the DataflowJavaSDK Dataflow Documentation
  65. 65. Windowing • for Streaming mode • for Combine/GroupByKey
  66. 66. Windowing • Group by Key: k1: 1, k1: 2, k1: 3, k2: 2 → k1: [1,2,3], k2: [2] • Combine: k1: [1,2,3], k2: [2] → k1: 3, k2: 1 • These transforms require all elements of the input. • In streaming mode, inputs are unbounded.
  67. 67. Windowing • Fixed Time Windows • Sliding Time Windows • Session Windows • Single Global Window Group elements into windows by timestamp
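One way to picture "group elements into windows by timestamp" is the plain-Java sketch below. It assigns a timestamp to its fixed window and to the sliding windows that contain it, then counts elements per window and key so that a per-key count becomes computable on an unbounded input. All names here (`WindowedCountSketch` and its methods) and the millisecond timestamps are assumptions made for the example. This is not the SDK's windowing API, and session windows and triggers are not covered.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class WindowedCountSketch {
    // Start of the fixed window [start, start + sizeMs) containing the timestamp.
    static long fixedWindowStart(long timestampMs, long sizeMs) {
        return timestampMs - Math.floorMod(timestampMs, sizeMs);
    }

    // Starts of all sliding windows of length sizeMs, begun every periodMs,
    // that contain the timestamp (latest window first).
    static List<Long> slidingWindowStarts(long timestampMs, long sizeMs, long periodMs) {
        List<Long> starts = new ArrayList<>();
        long latest = timestampMs - Math.floorMod(timestampMs, periodMs);
        for (long start = latest; start > timestampMs - sizeMs; start -= periodMs) {
            starts.add(start);
        }
        return starts;
    }

    // Counts elements per (fixed window, key). keys[i] carries timestampsMs[i].
    // The map key is "<windowStart>/<key>".
    static Map<String, Long> countPerWindowAndKey(String[] keys, long[] timestampsMs, long sizeMs) {
        Map<String, Long> counts = new TreeMap<>();
        for (int i = 0; i < keys.length; i++) {
            String windowAndKey = fixedWindowStart(timestampsMs[i], sizeMs) + "/" + keys[i];
            counts.merge(windowAndKey, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(fixedWindowStart(125_000L, 60_000L));             // 120000
        System.out.println(slidingWindowStarts(125_000L, 60_000L, 30_000L)); // [120000, 90000]

        String[] keys = {"k1", "k1", "k1", "k2"};
        long[] timestampsMs = {1_000L, 59_000L, 61_000L, 2_000L};
        // One-minute fixed windows: the third k1 element falls into the next window.
        System.out.println(countPerWindowAndKey(keys, timestampsMs, 60_000L));
        // {0/k1=2, 0/k2=1, 60000/k1=1}
    }
}
```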
  68. 68. Trigger • Streaming data may arrive with some delay • Dataflow should wait for a while, in wall-clock time, after the end of a window. • Time-Based Triggers • Data-Driven Triggers
