Be A Hero: Transforming
GoPro Analytics Data Pipeline
Machine Learning Innovation Summit, 2017
Chester Chen
ABOUT SPEAKER
• Head of Data Science &
Engineering
• Prev. Director of Engineering,
Alpine Data Labs
• Has worked with Spark since
version 0.6
• Spoke at Hadoop Summit, Big
Data Scala, IEEE Big Data
Conferences
• Organizer of SF Big Analytics
meetup (6800+ members)
AGENDA
• Business Use Cases
• Data Platform Architecture
• Old Data Platforms: Pros & Cons and Challenges
• New Data Platform Architecture and Initiatives
• Adding Data Schema During Ingestion (Dynamic DDL)
Business Use Cases
GROWING DATA NEEDS FROM GOPRO ECOSYSTEM
DATA
Analytics
Platform
Consumer Devices GoPro Apps
E-Commerce
Social Media/OTT
3rd party data
Product Insight
CRM/Marketing/
Personalization
User segmentation
DATA PLATFORM CHALLENGES
Monitoring
Data
Quality
Enable
Predictive
Analytics
Cost
Scalability
Data Platform Architecture
Transformation
OLD DATA PLATFORM ARCHITECTURE
ETL Cluster
•File dumps (Json,
CSV)
• Spark Jobs
•Hive
Secure Data Mart
•End User Query
•Impala / Sentry
•Parquet
Analytics Apps
•HUE
•Tableau
Real Time Cluster
• Log file streaming
• Kafka
• Spark
• HBase
Induction
Framework
• Batch Ingestion
• Pre-processing
• Scheduled download
Rest API,
FTP
S3 sync
Streaming ingestion
Batch Ingestion
STREAMING PIPELINE
Streaming Cluster
ELB HTTP
Pipeline for processing of streaming logs
To Batch ETL Cluster
SPARK STREAMING PIPELINE
/path1/…
/path2/…
/path3/…
To Batch
ETL Cluster
/path4/…
OLD BATCH DATA PIPELINE
ETL Cluster
HDFS
HIVE Metastore
To SDM Cluster
From Streaming Pipeline
Pull
distcp
Hard-coded Hive SQL based on a
predefined schema to load
JSON transformed to Parquet
and loaded into Hive
Map-Reduce Jobs tend to fail
HDFS
HIVE Metastore
distcp
Aggregation
Hard-coded SQL
OLD ANALYTICS CLUSTER
HDFS
HIVE
Metastore
BI Reporting
SDM
From Batch Cluster
Exploratory Analytics
with Hue: Impala/Hive: SQL
Kerberos
distcp
3rd Party Service
PROS AND CONS OF OLD ARCHITECTURE
PROS
• Isolation of workloads
• Fast ingest
• Loosely coupled clusters
• Secure analytics cluster
CONS
• Multiple copies of data
• Tightly coupled storage and compute
• Lack of elasticity
• Operational overhead of multiple clusters
• Hard-coded batch Hive SQL not flexible to change
• Multiple Hive meta stores
• distcp across clusters can take a long time with increase of data volume
PROS AND CONS OF OLD ARCHITECTURE
CONS
• Not easy to scale
• Storage and compute cost
• Only a SQL interface, no predictive analytics tool
• Not easy to adapt to data schema changes
New Infrastructure
KEY INITIATIVES: INFRASTRUCTURE
• Separate Compute and Storage
  • Move storage to S3
  • Centralize Hive Metadata
  • Use ephemeral instance as compute cluster
• Simplify the ETL ingestion process and eliminate the distcp
• Elasticity
  • auto-scale compute cluster (expand & shrink based on demand)
• Enhance Analytics Capabilities
  • introducing Notebook
  • Scala, Python, R etc.
• AWS Cost Reduction
  • Reduce EBS storage cost
• Dynamic DDL
  • add schema on the fly
DATA PLATFORM ARCHITECTURE
Real Time
Cluster
•Log file streaming
•Kafka
•Spark
Batch Ingestion
Framework
•Batch Ingestion
•Pre-processing
Streaming ingestion
Batch Ingestion
S3
CLUSTERS
HIVE
METASTORE
PLOT.LY SERVER
TABLEAU SERVER
EXTERNAL SERVICE
Notebook
Rest API,
FTP
S3 sync,etc
Parquet
+
DDL
State Sync
OLAP
Aggregation
NEW DYNAMIC DDL ARCHITECTURE
Streaming Pipeline
ELB HTTP
Pipeline for processing of streaming logs
S3
HIVE
METASTORE
transition
Centralized Hive Meta Store
DATA PLATFORM ARCHITECTURE
Batch Pipeline
pull
S3
3rd Party Service
export
Centralized Hive Meta Store
S3
HIVE
METASTORE
Ingestion/Aggregation/Snapshot
with dynamic DDL
State sync
transition
ANALYTICS ARCHITECTURE – IN PROGRESS
BI Reporting/Visualization
Exploratory/Predictive
Analytics
Spark SQL/Scala/python/R
Hive
Metastore
DSE SELF-
SERVICE
PORTAL
OLAP
Aggregation
Dynamic DDL: Adding
Schema to Data on the fly
WHAT IS DYNAMIC DDL?
• Dynamically alter table and add column
{ "userId": "123", "eventId": "abc" }
Flattened Columns
record_userId, record_eventN
Existing Table X
A B C
a b c
Updated Table X
A B C userId eventId
a b c 123 abc
WHY USE DYNAMIC DDL?
• Reduce development time
  • Traditionally, adding a new event/attribute/column requires a lot of time
    across teams
  • Many Hive ETL SQL statements must be changed for every column change
• One way to solve this problem is to use a key-value pair table
  • Ingestion is easy: no changes needed for a newly added event/attribute/column
  • Hard for analytics: tabular data is much easier to work with
• Dynamic DDL
  • Automatically flatten attributes (for JSON data)
  • Turn data into columns
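The flattening step can be illustrated without Spark. The sketch below is only an assumption about how nested attributes might be turned into column names: it uses plain Scala maps in place of parsed JSON and joins nested keys with an underscore, following the record_userId naming shown on the previous slide. The object and method names are hypothetical and do not come from the talk.

```scala
object FlattenSketch {
  // Recursively flatten a nested record into (columnName -> value) pairs.
  // Nested keys are joined with "_", so record.userId becomes record_userId.
  def flatten(record: Map[String, Any], prefix: String = ""): Map[String, Any] =
    record.flatMap {
      case (key, nested: Map[_, _]) =>
        // Descend into the nested object, extending the column-name prefix.
        flatten(nested.asInstanceOf[Map[String, Any]], s"${prefix}${key}_")
      case (key, value) =>
        // Leaf attribute: emit one column.
        Map(s"${prefix}${key}" -> value)
    }
}
```

Each key of the flattened map is a candidate column; any key missing from the destination table is what the ALTER TABLE step would have to add.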
DYNAMIC DDL – CREATE TABLE
// Create the table manually because of a Spark bug.
// OutputInfo and ColumnDef are project-specific types.
import org.apache.spark.sql.{Column, DataFrame, SQLContext}

def createTable(sqlContext: SQLContext, columns: Seq[(String, String)],
                destInfo: OutputInfo, partitionColumns: Array[(ColumnDef, Column)]): DataFrame = {
  val partitionClause = if (partitionColumns.length == 0) "" else {
    s"""PARTITIONED BY (${partitionColumns.map(f => s"${f._1.name} ${f._1.`type`}").mkString(", ")})"""
  }
  val sqlStmt =
    s"""CREATE TABLE IF NOT EXISTS ${destInfo.tableName()} (${columns.map(f => s"${f._1} ${f._2}").mkString(", ")})
       $partitionClause
       STORED AS ${destInfo.destFormat.split('.').last}
     """.stripMargin
  // Spark 2.x doesn't recognize the CREATE IF NOT EXISTS syntax:
  // it still logs an AlreadyExistsException message, but throws no exception
  sqlContext.sql(sqlStmt)
}
DYNAMIC DDL – ALTER TABLE ADD COLUMNS
// First find the existing fields, then add the new fields
val tableDf = sqlContext.table(dbTableName)
val existingFields: Seq[StructField] = …
val newFields: Seq[StructField] = …
if (newFields.nonEmpty) {
  // Spark 2.x bug: https://issues.apache.org/jira/browse/SPARK-19261
  val sqlStmt: String = s"""ALTER TABLE $dbTableName ADD COLUMNS (${newFields.map(f =>
    s"${f.name} ${f.dataType.typeName}").mkString(", ")})"""
}
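The slide elides how the existing and new fields are computed. As a rough, Spark-free sketch of one possible approach (not necessarily the one used at GoPro), the code below diffs two column lists and builds the ALTER TABLE statement as a string. Field is a hypothetical stand-in for Spark's StructField, and the case-insensitive name comparison is an assumption based on Hive treating column names case-insensitively.

```scala
object AlterTableSketch {
  // Hypothetical stand-in for StructField: a column name plus a Hive type name.
  final case class Field(name: String, hiveType: String)

  // Fields present in the incoming data but absent from the table.
  // Names are compared case-insensitively (assumption, see lead-in).
  def newFields(existing: Seq[Field], incoming: Seq[Field]): Seq[Field] = {
    val known = existing.map(_.name.toLowerCase).toSet
    incoming.filterNot(f => known.contains(f.name.toLowerCase))
  }

  // Build the ALTER TABLE statement, or None when there is nothing to add.
  def alterStatement(table: String, added: Seq[Field]): Option[String] =
    if (added.isEmpty) None
    else {
      val cols = added.map(f => s"${f.name} ${f.hiveType}").mkString(", ")
      Some(s"ALTER TABLE $table ADD COLUMNS ($cols)")
    }
}
```

Returning an Option keeps the "nothing to add" case explicit, mirroring the newFields.nonEmpty guard in the slide.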
DYNAMIC DDL – ALTER TABLE ADD COLUMNS (SPARK 2.0)
// Hack for Spark 2.0 and Spark 2.1
if (newFields.nonEmpty) {
  // Spark 2.x bug: https://issues.apache.org/jira/browse/SPARK-19261
  alterTable(sqlContext, dbTableName, newFields)
}

def alterTable(sqlContext: SQLContext,
               tableName: String,
               newColumns: Seq[StructField]): Unit = {
  alterTable(sqlContext, getTableIdentifier(tableName), newColumns)
}

private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configuration)
  extends ExternalCatalog with Logging {
  ….
}
DYNAMIC DDL – PREPARE DATAFRAME
// Reorder the columns in the incoming data frame to match the order in
// the destination table. Project all columns from the table.
modifiedDF = modifiedDF.select(tableDf.schema.fieldNames.map(
  f => {
    if (modifiedDF.columns.contains(f)) col(f) else lit(null).as(f)
  }): _*)
// Coalesce the data frame into the desired number of partitions (files)
// to avoid too many partitions
modifiedDF.coalesce(ioInfo.outputInfo.numberOfPartition)
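The projection above can be pictured on a single row. This is a minimal sketch under the assumption that a row is a plain Scala map from column name to value; None plays the role of lit(null), and the names are hypothetical.

```scala
object AlignSketch {
  // Project a row onto the destination table's column order.
  // Columns the row does not carry come back as None (SQL NULL in the real pipeline).
  def align(tableColumns: Seq[String], row: Map[String, Any]): Seq[Option[Any]] =
    tableColumns.map(row.get)
}
```

For a table with columns A, B, C, userId and a row that carries only userId and A, the result keeps the table's order and fills B and C with None.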
DYNAMIC DDL – BATCH SPECIFIC ISSUES
• Issue 1: Several log files are mapped into the same table, and not all
columns are present in every file
CSV file 1
A B
A B C
CSV file 2
A X Y
Destination Table
Table Writer
B C
DYNAMIC DDL – BATCH SPECIFIC ISSUES
• Solution:
• Find DataFrame with max number of columns, use it as base, and reorder
columns against this DataFrame
val newDfs: Option[ParSeq[DataFrame]] = maxLengthDF.map { baseDf =>
  dfs.map { df =>
    df.select(baseDf.schema.fieldNames.map(f =>
      if (df.columns.contains(f)) col(f) else lit(null).as(f)): _*)
  }
}
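Choosing the base schema can be sketched on plain column-name lists. The widest function mirrors the approach described in the slide. The unionOf function is an alternative that is not from the talk: it keeps every column even when the widest file is not a superset of the others. All names here are hypothetical.

```scala
object BaseSchemaSketch {
  // Approach from the slide: use the column list with the most entries as the base.
  def widest(schemas: Seq[Seq[String]]): Option[Seq[String]] =
    if (schemas.isEmpty) None else Some(schemas.maxBy(_.length))

  // Alternative (an assumption, not from the talk): the ordered union of all
  // column names, so no column is dropped when files carry disjoint extras.
  def unionOf(schemas: Seq[Seq[String]]): Seq[String] =
    schemas.flatten.distinct
}
```

With the widest schema as base, a column that appears only in a narrower file would be projected away, which is the case unionOf is meant to cover.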
DYNAMIC DDL – BATCH SPECIFIC ISSUES
• Issue 2: Too many log files hurt performance
• Solution: Consolidate the DataFrames from several log files into chunks, and
union all DataFrames in each chunk into one.
val ys: Seq[Seq[DataFrame]] = destTableDFs.seq.grouped(mergeChunkSize).toSeq
val dfs: ParSeq[DataFrame] = ys.par.map(p => p.foldLeft(emptyDF) { (z, a) => z.unionAll(a) })
dfs.foreach(saveDataFrame(info, _))
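The chunk-and-union idea can be tried on ordinary collections. In this sketch, which only illustrates the grouping logic, a Seq of rows stands in for a DataFrame and concatenation stands in for unionAll. mergeInChunks and chunkSize are hypothetical names that play the role of the fold and mergeChunkSize above.

```scala
object ChunkSketch {
  // Group the per-file datasets into chunks of at most chunkSize,
  // then merge each chunk into a single dataset (one write per chunk).
  def mergeInChunks[A](datasets: Seq[Seq[A]], chunkSize: Int): Seq[Seq[A]] = {
    require(chunkSize > 0, "chunkSize must be positive")
    datasets.grouped(chunkSize).map(_.flatten).toList
  }
}
```

With five input files and a chunk size of 2, this yields three merged datasets, so three writes instead of five.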
SUMMARY
• The GoPro Data Platform is in transition, and we have just gotten started
• Central Hive Metastore + S3: separate storage from
compute and reduce cost
• Introduce cloud computing for elasticity and reduced
operational complexity
• Leverage dynamic DDL for flexible ingestion, aggregation
and snapshot for both batch and streaming
Questions?

 
SFBigAnalytics- hybrid data management using cdap
SFBigAnalytics- hybrid data management using cdapSFBigAnalytics- hybrid data management using cdap
SFBigAnalytics- hybrid data management using cdap
 
Sf big analytics: bighead
Sf big analytics: bigheadSf big analytics: bighead
Sf big analytics: bighead
 
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
 
2018 data warehouse features in spark
2018   data warehouse features in spark2018   data warehouse features in spark
2018 data warehouse features in spark
 
2018 02-08-what's-new-in-apache-spark-2.3
2018 02-08-what's-new-in-apache-spark-2.3 2018 02-08-what's-new-in-apache-spark-2.3
2018 02-08-what's-new-in-apache-spark-2.3
 
2018 02 20-jeg_index
2018 02 20-jeg_index2018 02 20-jeg_index
2018 02 20-jeg_index
 
Index conf sparkml-feb20-n-pentreath
Index conf sparkml-feb20-n-pentreathIndex conf sparkml-feb20-n-pentreath
Index conf sparkml-feb20-n-pentreath
 
Index conf sparkai-feb20-n-pentreath
Index conf sparkai-feb20-n-pentreathIndex conf sparkai-feb20-n-pentreath
Index conf sparkai-feb20-n-pentreath
 

Be A Hero: Transforming GoPro Analytics Data Pipeline

  • 1. Be A Hero: Transforming GoPro Analytics Data Pipeline Machine Learning Innovation Summit, 2017 Chester Chen
  • 2. ABOUT SPEAKER • Head of Data Science & Engineering • Prev. Director of Engineering, Alpine Data Labs • Has worked with Spark since version 0.6 • Spoke at Hadoop Summit, Big Data Scala, IEEE Big Data Conferences • Organizer of SF Big Analytics meetup (6800+ members)
  • 3. AGENDA • Business Use Cases • Data Platform Architecture • Old Data Platforms: Pros & Cons and Challenges • New Data Platform Architecture and Initiatives • Adding Data Schema During Ingestion (Dynamic DDL)
  • 5. GROWING DATA NEEDS FROM GOPRO ECOSYSTEM
  • 6. GROWING DATA NEEDS FROM GOPRO ECOSYSTEM DATA Analytics Platform Consumer Devices GoPro Apps E-Commerce Social Media/OTT 3rd party data Product Insight CRM/Marketing/ Personalization User segmentation
  • 9. OLD DATA PLATFORM ARCHITECTURE ETL Cluster • File dumps (JSON, CSV) • Spark Jobs • Hive Secure Data Mart • End User Query • Impala / Sentry • Parquet Analytics Apps • HUE • Tableau Real Time Cluster • Log file streaming • Kafka • Spark • HBase Induction Framework • Batch Ingestion • Pre-processing • Scheduled download REST API, FTP S3 sync Streaming ingestion Batch Ingestion
  • 10. STREAMING PIPELINE Streaming Cluster ELB HTTP Pipeline for processing of streaming logs To Batch ETL Cluster
  • 12. OLD BATCH DATA PIPELINE ETL Cluster HDFS HIVE Metastore To SDM Cluster From Streaming Pipeline Pull distcp Hard-coded Hive SQL based on a predefined schema: load JSON, transform to Parquet, and load into Hive Map-Reduce jobs tend to fail HDFS HIVE Metastore distcp Aggregation Hard-coded SQL
  • 13. OLD ANALYTICS CLUSTER HDFS HIVE Metastore BI Reporting SDM From Batch Cluster Exploratory Analytics with Hue: Impala/Hive: SQL Kerberos distcp 3rd Party Service
  • 14. PROS AND CONS OF OLD ARCHITECTURE PROS: • Isolation of workloads • Fast ingest • Loosely coupled clusters • Secure analytics cluster CONS: • Multiple copies of data • Tightly coupled storage and compute • Lack of elasticity • Operational overhead of multiple clusters • Hard-coded batch Hive SQL is not flexible to change • Multiple Hive metastores • distcp across clusters takes longer as data volume grows
  • 15. PROS AND CONS OF OLD ARCHITECTURE CONS: • Not easy to scale • Storage and compute cost • SQL interface only, no predictive analytics tools • Not easy to adapt to data schema changes
  • 17. KEY INITIATIVES: INFRASTRUCTURE • Separate Compute and Storage • Move storage to S3 • Centralize Hive Metadata • Use ephemeral instance as compute cluster • Simplify the ETL ingestion process and eliminate the distcp • Elasticity • auto-scale compute cluster (expand & shrink based on demand) • Enhance Analytics Capabilities • introducing Notebook • Scala, Python, R etc. • AWS Cost Reduction • Reduce EBS storage cost • Dynamic DDL • add schema on the fly
  • 18. DATA PLATFORM ARCHITECTURE Real Time Cluster • Log file streaming • Kafka • Spark Batch Ingestion Framework • Batch Ingestion • Pre-processing Streaming ingestion Batch Ingestion S3 CLUSTERS HIVE METASTORE PLOT.LY SERVER TABLEAU SERVER EXTERNAL SERVICE Notebook REST API, FTP S3 sync, etc. Parquet + DDL State Sync OLAP Aggregation
  • 19. NEW DYNAMIC DDL ARCHITECTURE Streaming Pipeline ELB HTTP Pipeline for processing of streaming logs S3 HIVE METASTORE transition Centralized Hive Metastore
  • 20. DATA PLATFORM ARCHITECTURE Batch Pipeline pull S3 3rd Party Service export Centralized Hive Meta Store S3 HIVE METASTORE Ingestion/Aggregation/Snapshot with dynamic DDL State sync transition
  • 21. ANALYTICS ARCHITECTURE – IN PROGRESS BI Reporting/Visualization Exploratory/Predictive Analytics Spark SQL/Scala/Python/R Hive Metastore DSE SELF-SERVICE PORTAL OLAP Aggregation
  • 22. Dynamic DDL: Adding Schema to Data on the fly
  • 23. WHAT IS DYNAMIC DDL? • Dynamically alter table and add column
        Incoming record: { { “userId”, “123”} {“eventId”, “abc”} }
        Flattened Columns: record_userId, record_eventN
        Existing Table X:
          A  B  C
          a  b  c
        Updated Table X:
          A  B  C  userId  eventId
          a  b  c  123     abc
  • 24. WHY USE DYNAMIC DDL? • Reduce development time • Traditionally, adding a new Event/Attribute/Column requires a lot of time across teams • Many Hive ETL SQL statements need to change with every column change • One way to solve this problem is to use a key-value pair table • Ingestion is easy, no changes needed for a newly added event/attribute/column • Hard for analytics, tabulated data is much easier to work with • Dynamic DDL • Automatically flatten attributes (for JSON data) • Turn data into columns
  • 25. DYNAMIC DDL – CREATE TABLE
        // Manually create the table due to a Spark bug
        def createTable(sqlContext: SQLContext,
                        columns: Seq[(String, String)],
                        destInfo: OutputInfo,
                        partitionColumns: Array[(ColumnDef, Column)]): DataFrame = {
          val partitionClause =
            if (partitionColumns.length == 0) ""
            else s"""PARTITIONED BY (${partitionColumns.map(f => s"${f._1.name} ${f._1.`type`}").mkString(", ")})"""

          // The column list must be interpolated with ${...}
          val sqlStmt =
            s"""CREATE TABLE IF NOT EXISTS ${destInfo.tableName()} (
               |  ${columns.map(f => s"${f._1} ${f._2}").mkString(", ")})
               |$partitionClause
               |STORED AS ${destInfo.destFormat.split('.').last}
               |""".stripMargin

          // Spark 2.x doesn't know the CREATE IF NOT EXISTS syntax;
          // it still logs an AlreadyExistsException message, but throws no exception
          sqlContext.sql(sqlStmt)
        }
  • 26. DYNAMIC DDL – ALTER TABLE ADD COLUMNS
        // First find the existing fields, then add the new fields
        val tableDf = sqlContext.table(dbTableName)
        val existingFields: Seq[StructField] = …
        val newFields: Seq[StructField] = …
        if (newFields.nonEmpty) {
          // Spark 2.x bug: https://issues.apache.org/jira/browse/SPARK-19261
          val sqlStmt: String =
            s"""ALTER TABLE $dbTableName ADD COLUMNS (
               |  ${newFields.map(f => s"${f.name} ${f.dataType.typeName}").mkString(", ")}
               |)""".stripMargin
        }
  • 27. DYNAMIC DDL – ALTER TABLE ADD COLUMNS (SPARK 2.0)
        // Hack for Spark 2.0 and Spark 2.1
        if (newFields.nonEmpty) {
          // Spark 2.x bug: https://issues.apache.org/jira/browse/SPARK-19261
          alterTable(sqlContext, dbTableName, newFields)
        }

        def alterTable(sqlContext: SQLContext, tableName: String, newColumns: Seq[StructField]): Unit = {
          alterTable(sqlContext, getTableIdentifier(tableName), newColumns)
        }

        private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configuration)
          extends ExternalCatalog with Logging {
          ….
        }
  • 28. DYNAMIC DDL – PREPARE DATAFRAME
        // Reorder the columns in the incoming DataFrame to match the order in the destination table.
        // Project all columns from the table.
        modifiedDF = modifiedDF.select(tableDf.schema.fieldNames.map(f =>
          if (modifiedDF.columns.contains(f)) col(f) else lit(null).as(f)): _*)

        // Coalesce the DataFrame into the desired number of partitions (files)
        // to avoid too many partitions.
        modifiedDF.coalesce(ioInfo.outputInfo.numberOfPartition)
  • 29. DYNAMIC DDL – BATCH SPECIFIC ISSUES • Issue 1: Several log files are mapped into the same table, and not all columns are present in every file (Diagram: CSV file 1 and CSV file 2, each with a different subset of columns, pass through the Table Writer into one Destination Table)
  • 30. DYNAMIC DDL – BATCH SPECIFIC ISSUES • Solution: Find the DataFrame with the maximum number of columns, use it as the base, and reorder the columns of every DataFrame against it
        val newDfs: Option[ParSeq[DataFrame]] = maxLengthDF.map { baseDf =>
          dfs.map { df =>
            df.select(baseDf.schema.fieldNames.map(f =>
              if (df.columns.contains(f)) col(f) else lit(null).as(f)): _*)
          }
        }
  • 31. DYNAMIC DDL – BATCH SPECIFIC ISSUES • Issue 2: Too many log files hurt performance • Solution: Consolidate the DataFrames of several log files into chunks; within each chunk, all DataFrames are unioned together
        val ys: Seq[Seq[DataFrame]] = destTableDFs.seq.grouped(mergeChunkSize).toSeq
        val dfs: ParSeq[DataFrame] = ys.par.map(p => p.foldLeft(emptyDF) { (z, a) => z.unionAll(a) })
        dfs.foreach(saveDataFrame(info, _))
  • 33. SUMMARY • GoPro Data Platform is in transition and we have just gotten started • Central Hive Metastore + S3: separate storage from compute and reduce cost • Introducing cloud computing for elasticity and to reduce operational complexity • Leverage dynamic DDL for flexible ingestion, aggregation and snapshot for both batch and streaming
  • 34. Questions?
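
The dynamic DDL steps on slides 23 to 28 (flatten the record, find the columns missing from the table, issue ALTER TABLE ... ADD COLUMNS, then fill absent columns with null) can be sketched without Spark. This is a minimal illustration under stated assumptions, not the code from the talk: the object DynamicDdlSketch, all of its function names, the default "record" prefix, and the choice to type every new column as string are hypothetical and exist only for this example.

```scala
// Hypothetical, Spark-free sketch of the dynamic DDL idea.
// All names below are illustrative assumptions, not GoPro's implementation.
object DynamicDdlSketch {
  // A column is a (name, Hive type) pair.
  type Column = (String, String)

  // Flatten a nested record into "prefix_key" column names.
  def flatten(record: Map[String, Any], prefix: String = "record"): Map[String, Any] =
    record.foldLeft(Map.empty[String, Any]) { case (acc, (key, value)) =>
      val name = s"${prefix}_$key"
      value match {
        case nested: Map[_, _] =>
          acc ++ flatten(nested.asInstanceOf[Map[String, Any]], name)
        case leaf =>
          acc + (name -> leaf)
      }
    }

  // Columns present in the incoming data but missing from the table
  // (compared case-insensitively, an assumption made for this sketch).
  def newColumns(existing: Seq[Column], incoming: Seq[Column]): Seq[Column] = {
    val known = existing.map(_._1.toLowerCase).toSet
    incoming.filterNot { case (name, _) => known.contains(name.toLowerCase) }
  }

  // Build the ALTER TABLE statement, or None when there is nothing to add.
  def alterTableSql(table: String, added: Seq[Column]): Option[String] =
    if (added.isEmpty) None
    else Some(
      s"ALTER TABLE $table ADD COLUMNS (" +
        added.map { case (n, t) => s"$n $t" }.mkString(", ") + ")")

  // Order a flattened row by the table's columns, using null for absent ones
  // (the plain-Scala analogue of lit(null).as(f) on the slides).
  def alignToTable(tableColumns: Seq[String], row: Map[String, Any]): Seq[Any] =
    tableColumns.map(c => row.getOrElse(c, null))

  def main(args: Array[String]): Unit = {
    val existing: Seq[Column] = Seq("a" -> "string", "b" -> "string", "c" -> "string")
    val flat = flatten(Map("userId" -> "123", "eventId" -> "abc"))
    // Assumption: every newly discovered column is typed as string.
    val incoming: Seq[Column] = flat.keys.toSeq.sorted.map(_ -> "string")
    val added = newColumns(existing, incoming)
    alterTableSql("x", added).foreach(println)
    println(alignToTable(existing.map(_._1) ++ added.map(_._1), flat))
  }
}
```

In a real pipeline the statement would be passed to Spark or Hive and the column types would come from the inferred DataFrame schema; keeping the diff and the SQL generation as pure functions makes that logic easy to unit test before any cluster is involved.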

Editor's Notes

  1. Variety of data: software (mobile, desktop and cloud apps); hardware (camera, drone, drone controller, VR, accessories, developer program); 3rd-party data (CRM, social media, OTT, e-commerce, etc.). Variety of data ingestion mechanisms: real-time streaming pipeline; batch pipeline (pushed or pulled data). Complex data transformation: data is often stored as binary to conserve space in the camera; special logic for pair events and flight time correction; heterogeneous data formats (JSON, CSV, binary). Seamless data aggregation: combine data from both hardware and software; build both event-based and state-based data structures.
  2. Scalability challenges: increasing number of data sources and service requests; quick visibility of the data; infrastructure scalability. Data quality: tools and infrastructure for the QA process. Hadoop DevOps challenges: manage Hadoop hardware (disk, security, service, dev and staging clusters). Monitoring: data pipeline monitoring metrics and infrastructure. Enable predictive analytics: tools for machine learning and exploratory analytics. Cost management: AWS (storage & computing) as well as license costs.