Slide 1
Extreme Sports & Beyond:
Exploring a new frontier in data with GoPro
Slide 2
Josh is the Manager of Data Architecture and Operations at
GoPro working within the Data Science & Engineering team.
Prior to GoPro, Josh led global supply chain operations analytics
efforts at Apple. He has 15 years of project, analytics,
operations, and business intelligence experience in a variety of
industries. He holds a BS in Industrial Engineering and an MBA
from the University of Texas at Austin.
David is a Principal Engineer in the Data Science and
Engineering team at GoPro and the designer of their Spark-
Kafka data ingestion pipeline. David has been developing
scalable data processing pipelines and eCommerce systems for
over 20 years in Silicon Valley. David’s current big data interests
include streaming data as fast as possible from devices to near
real-time dashboards and switching his primary programming
language to Scala from Java after 17 years. He holds a BS in
Computer Science from The Ohio State University.
Our Speakers
Josh Byrd
Manager, Data
Architecture &
Operations, GoPro
David Winters
Principal Engineer, Data
Science & Engineering,
GoPro
Slide 3
Let’s do this!
Slide 4
Our Story
When we got here…
Slide 5
Growing data needs
Slide 6
Dev Ops
• Infrastructure
• Hadoop Admin
Engineering
• Data pipeline
• ETL processing
Architecture
• Design
• Applications
Project Management
• Agile
Data Science and Engineering
We build the platform
Slide 7
Origin Story
• Make Friends
• Haul Ass
• Maintain Balance
• No Half-Assery
• Integrity. Always
• Be a HERO
Yes, this comes from the top…
Slide 8
GoPro Desktop Application
Slide 9
GoPro Desktop Application on Tableau
Slide 10
How the Magic Happens
The Philosopher’s Stone…
…or TPS for short
Slide 11
High Level Architecture
ETL Cluster
• File dumps
• MR
• Hive
Secure Data Mart
• End User Query
• Impala / Sentry
• Parquet
Analytics Apps
• HUE
• Tableau
• Trifacta
• R
Ingest Cluster
• Log file streaming
• Kafka
• Spark
Induction Framework
• Batch files
• Pre-processing
• Java
Original Cluster
Slide 12
Data Pipeline
Ingest Cluster
HTTP → ELB
Pipeline for processing streaming logs
To ETL Cluster
Slide 13
Data Pipeline
/path1/…
/path2/…
/path3/…
To ETL Cluster
/path4/…
Slide 14
Data Pipeline
ETL Cluster
HDFS
Hive Metastore
To SDM Cluster
From Ingest Cluster
Induction
framework
Slide 15
Data Delivery!
HDFS
Hive Metastore
Applications
Thrift / ODBC Server
User
Studio
Studio - Staging
GDA
Report
SDM
From ETL Cluster
Slide 16
Trifacta
Slide 17
GoPro Desktop Application on Tableau
Slide 18
RC Playbook: Your guide to
success at GoPro
Questions?
Slide 19
Learn more about Cloudera, Tableau, & Trifacta
http://www.cloudera.com/partners/solutions/trifacta.html
http://www.cloudera.com/partners/solutions/tableau.html
Slide 20
Thank You


Editor's Notes

  • #5 When we got here a little over two years ago, all we did was sell cameras. It was our job to assess the data landscape, understand the roadmap, and ultimately plan and implement an Enterprise Data Platform to support the company.
  • #6 Here’s what we saw…
    - The business was indeed growing and the product line was expanding in number and sophistication, BUT we were becoming more than a camera company.
    - We had a growing ecosystem of software and services.
    - We had a rich media side of the business that was growing across social and various media distribution channels.
    - We’re now moving into advanced capture, and with drones, entirely new categories.
    - All of this leads to the current big data landscape that we have today.
    So we brought together a team of badasses from companies like LinkedIn, Apple, Oracle, and Splice Machine to tackle the problem. Thus formed the Data Science and Engineering team at GoPro.
  • #7 What does Data Science and Engineering look like at GoPro? The team is broken into 4 areas:
    - Data Architecture and Data Operations
    - Data Engineering
    - Dev Ops
    - Project Management
    Analytics is a separate organization at GoPro, and a number of teams are building domain-specific data science expertise in addition to ours.
  • #8 To set the tone a bit, we have to take a moment and talk about our corporate values. We take these values seriously and have applied them to what we are doing in Data Science and Engineering. [Read the list; mention the fact that ass is in there twice.] So the stage is set. We have our task: build out a big data platform and haul ass! As you’ve seen already, the company has been hauling ass to deliver an amazing ecosystem for our cameras, the latest entrant of which is our GoPro Desktop Application.
  • #9 The GoPro App for desktop is the easiest way to offload and enjoy your GoPro photos and videos. Automatically offload your footage and keep everything organized in one place, so you can find your best shots fast. Make quick edits and share your favorite photos and videos straight to Facebook and YouTube™, or use the bundled GoPro Studio app for more advanced editing, including GoPro templates, slow-motion effects, and more. Of course, with its release we were immediately interested in understanding popularity and feature-usage patterns. Through our platform, and with the use of Tableau, our partners in the analytics organization were able to put together multiple views that exposed several KPIs and began to lay out preliminary insights into the features that resonated most with our community.
  • #10 Unfortunately we can’t show you what those numbers are, but suffice it to say the reporting for the application came together quite quickly and continues to evolve rapidly as we iterate through views into the KPIs that resonate with our decision makers. So the question is: how did all this come together?
  • #11 Magic. That’s how we did it. So much magic that we call our platform the Philosopher’s Stone. The benefit of that name is that it abbreviates to “TPS,” so we can write TPS reports. And pester people about cover sheets.
  • #12 Joke about extreme big data engineering at GoPro…
    A word about data sources: this is an IoT play. Logs come from devices, applications (desktop and mobile), external systems and services, ERP, web/email marketing, etc. Some are raw and some gzipped, some binary and some JSON; some arrive streaming and some in batch.
    Today we have 3 clusters to isolate workloads. GREEN ARROW: point to the clusters.
    We started with one cluster, ETL, and everything ran there: ingest (Flume), batch (framework), ETL (Hive), and analytical (Impala). That meant lots of resource contention (I/O, memory, cores), so we opted for 3 clusters to isolate the workloads:
    - Ingest cluster for near real-time streaming: Kafka and Spark Streaming (Cloudera parcels). Input: logs; output: JSON. Minutes cadence, moving toward real-time in seconds. (A sketch of this cluster’s front door follows this note.)
    - Induction framework for scheduled batch ingestion.
    - ETL cluster for heavy-duty aggregation: Hive (MapReduce). Input: JSON flat files; output: aggregated Parquet files. Hourly cadence.
    - Secure Data Mart: Kerberos, LDAP, Active Directory, Apache Sentry (Cloudera committers). Input: compressed Parquet files. The analytical SQL engine behind Tableau, ad-hoc queries (Hue), data wrangling (Trifacta), and data science (Jupyter Notebooks and RStudio).
    With all that said, we will keep examining newer technologies that could let us simplify this architecture and merge clusters in the future; Kudu is one candidate for consolidating some of them.
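To make that front door concrete, here is a minimal sketch of the servlet-side hand-off into Kafka. The broker addresses, class name, and per-environment topic naming are invented for the example; this illustrates the pattern, not GoPro’s actual code.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Hypothetical front door: the servlet layer hands each HTTP log payload to Kafka.
object LogIngestProducer {
  private val props = new Properties()
  props.put("bootstrap.servers", "kafka1:9092,kafka2:9092,kafka3:9092") // assumed hosts
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

  private val producer = new KafkaProducer[String, String](props)

  /** Route one raw log payload to the Kafka topic for its environment. */
  def send(environment: String, deviceId: String, payload: String): Unit = {
    val topic = s"logs-$environment" // e.g. logs-prod, logs-stage (illustrative naming)
    // Keying by device ID keeps a device's events ordered within one partition.
    producer.send(new ProducerRecord[String, String](topic, deviceId, payload))
  }
}
```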
  • #13 Let’s take a deeper dive into our streaming ingestion…
    - Logs are streamed from devices and software applications (desktop and mobile) to a web service endpoint.
    - The endpoint is an elastic pool of Tomcat servers sitting behind ELB in AWS.
    - A custom servlet pushes logs into Kafka topics by environment.
    - A series of Spark Streaming jobs process the logs from Kafka (see the sketch after this note).
    - The landing place in the ingestion cluster is HDFS, as JSON flat files.
    Rationalization of tech stacks… Why Kafka?
    - Unrivaled write throughput for a queue: a traditional queue manages ~100K writes/sec on the biggest box you can buy; Kafka manages ~1M writes/sec on 3-4 commodity servers.
    - Strong ordering policy for messages.
    - Distributed, and fault-tolerant through replication.
    - Supports both synchronous and asynchronous writes.
    - Pairs nicely with Spark Streaming for simpler scaling out (Kafka topic partitions map directly to Spark RDD partitions/tasks).
    Why Spark Streaming?
    - Strong transactional semantics: “exactly once” processing.
    - Leverages Spark technology for both data ingest and analytics.
    - Horizontally scalable, with high throughput from micro-batching.
    - Large open source community.
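As a rough illustration of one such job, the sketch below consumes a Kafka topic with the direct (receiver-less) API, the integration that gives the partition-to-RDD mapping and exactly-once semantics mentioned above, and lands each micro-batch in HDFS as JSON flat files. The topic name, HDFS path, and 60-second batch interval are assumptions for the example.

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object LogStreamToHdfs {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("LogStreamToHdfs")
    val ssc  = new StreamingContext(conf, Seconds(60)) // minutes-scale cadence

    // Direct stream: Kafka partitions map one-to-one onto RDD partitions.
    val kafkaParams = Map("metadata.broker.list" -> "kafka1:9092,kafka2:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("logs-prod"))

    stream.foreachRDD { (rdd, time) =>
      if (!rdd.isEmpty()) {
        // One directory per micro-batch; the hourly ETL picks these files up later.
        rdd.map { case (_, json) => json }
          .saveAsTextFile(s"hdfs:///data/ingest/logs-prod/batch=${time.milliseconds}")
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```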
  • #14 As previously stated, logs are streamed from devices and software applications (desktop and mobile) to the web service endpoint.
    - Logs are diverse: gzipped, raw, binary, JSON, batched events, and streamed single events. They vary significantly in size, from < 1 KB to > 1 MB.
    - Logs are redirected based on data category and routed to the appropriate Kafka topic and its respective Spark Streaming job.
    - Logs move from Kafka topic to Kafka topic, with each topic having a Spark Streaming job that consumes the log, processes it, and writes it to another topic.
    - The jobs form a tree-like structure, with more generic logic toward the root and more specialized logic toward the leaf nodes. There are generic jobs/services and specialized jobs/services.
    - Generic services include PII removal and hashing (sketched after this note), IP-to-geo lookups, and batched writing to HDFS. We batch the HDFS writes because Kafka likes small messages (~1 KB is ideal) while HDFS likes large files (100+ MB).
    - Specialized services contain business logic.
    - Finally, the logs are written into HDFS as JSON flat files (sometimes compressed, depending on the type of data).
    - Scheduled ETL jobs perform a distributed copy (distcp) to move the data to the ETL cluster for further, heavier aggregations.
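Here is a hedged sketch of one such generic service: hash a PII field with SHA-256 and republish the event to the next topic in the tree. The object and helper names, the field choice, the topic names, and the assumption that events arrive as (userId, json) pairs are all hypothetical.

```scala
import java.security.MessageDigest
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object PiiScrubStage {
  /** SHA-256 hash, so identifiers stay joinable across datasets but are not reversible. */
  def hashPii(value: String): String =
    MessageDigest.getInstance("SHA-256")
      .digest(value.getBytes("UTF-8"))
      .map("%02x".format(_)).mkString

  /** Intended to be called from foreachPartition in the consuming Spark Streaming
    * job: one producer per partition, each event forwarded to the downstream topic. */
  def forwardPartition(events: Iterator[(String, String)]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "kafka1:9092") // assumed host
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)
    try {
      events.foreach { case (userId, json) =>
        producer.send(new ProducerRecord("logs-prod-scrubbed", hashPii(userId), json))
      }
    } finally producer.close()
  }
}
```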
  • #15 On the ETL cluster… here’s where we do our heavy lifting.
    - Almost entirely Hive MapReduce jobs, with some Impala to make the really big, gnarly aggregations more performant.
    - Previously we had a custom Java MapReduce job for sessionization of events; this has been replaced with a Spark Streaming job on the ingestion cluster. In the future we want to push as much of the ETL processing as possible back into the ingestion cluster for more real-time processing.
    - We also have a custom Java induction framework which ingests data from external services that only make data available on slower schedules (daily, twice daily, etc.).
    - The output of the ETL cluster is Parquet files added to partitioned managed tables in the Hive metastore (see the sketch after this note). The Parquet files are then copied via distcp to the Secure Data Mart.
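For a sense of the shape of these aggregations, below is an illustrative hourly rollup: JSON events in, Parquet partitions out. The production jobs run as Hive MapReduce; this sketch expresses equivalent HiveQL through Spark’s HiveContext purely to keep the example self-contained, and every table and column name is invented. A scheduled `hadoop distcp` of the new partitions would then move the Parquet files to the Secure Data Mart, as the note above describes.

```scala
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkConf, SparkContext}

object HourlyRollup {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HourlyRollup"))
    val hc = new HiveContext(sc)
    // Allow the INSERT to create date/hour partitions on the fly.
    hc.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

    // feature_usage_hourly is assumed to be a managed table STORED AS PARQUET,
    // partitioned by (dt, hr); raw_events is the JSON landing table.
    hc.sql(
      """INSERT OVERWRITE TABLE feature_usage_hourly PARTITION (dt, hr)
        |SELECT feature,
        |       count(*)                    AS events,
        |       count(DISTINCT device_hash) AS devices,
        |       dt, hr
        |FROM raw_events
        |GROUP BY feature, dt, hr""".stripMargin)
  }
}
```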
  • #16 Parquet files are copied from the ETL cluster and added to partitioned managed tables in the Hive Metastore of the Secure Data Mart.
    - The Secure Data Mart is protected with Apache Sentry. Kerberos is used for authentication (corporate standard), and Active Directory stores the groups (corporate standard).
    - Access control is role-based, and the roles are assigned with Sentry (a sketch follows this note). Hue has a Sentry UI app to manage authorization.
    Hand off to Josh…
    Josh: From our Secure Data Mart we are able to leverage Tableau’s ODBC connectivity to Cloudera to visualize data in Tableau. Our governance structure in Tableau Server allows analysts to iterate quickly through views and test them in the browser in a staging location before publishing to a larger audience in a “production” folder for that business area. Trifacta is also present in this layer and plays a role in our team’s effort to move quickly.
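As a rough sketch of what those role-based grants look like, the snippet below issues Sentry statements over a Kerberized HiveServer2 connection. The GRANT grammar is Sentry’s, but the host, realm, database, role, and group names are placeholders.

```scala
import java.sql.DriverManager

object SdmGrants {
  def main(args: Array[String]): Unit = {
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    // Kerberized HiveServer2; principal and host are placeholders.
    val conn = DriverManager.getConnection(
      "jdbc:hive2://sdm-hs2:10000/default;principal=hive/_HOST@EXAMPLE.COM")
    val stmt = conn.createStatement()
    // Roles live in Sentry; membership comes from Active Directory groups.
    stmt.execute("CREATE ROLE analyst")
    stmt.execute("GRANT ROLE analyst TO GROUP gopro_analysts")
    stmt.execute("GRANT SELECT ON DATABASE sdm TO ROLE analyst")
    conn.close()
  }
}
```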
  • #17 Speak to Trifacta usage
  • #18 Pulling it all together, our team has been successful in powering day-0 analytics that give the business a very broad range of flexibility. [riff more on our platform]