Talk 1: Evolution of GoPro's Data Platform
In this talk, we will share GoPro's experiences in building a data analytics cluster in the cloud. We will discuss:
* Evolution of the data platform from fixed-size Hadoop clusters to a cloud-based Spark cluster with a centralized Hive Metastore + S3: cost benefits and DevOps impact
* A configurable, Spark-based batch ingestion/ETL framework
* Migration of the streaming framework to the cloud + S3
* Analytics metrics delivery with Slack integration
* BedRock: data platform management, visualization, and self-service portal
* Visualizing machine learning features via Google Facets + Spark
Speakers:
Chester Chen
Chester Chen is the Head of Data Science & Engineering at GoPro. Previously, he was the Director of Engineering at Alpine Data Labs.
David Winters
David is an Architect on the Data Science and Engineering team at GoPro and the creator of their Spark-Kafka data ingestion pipeline. Previously, he worked at Apple and Splice Machine.
Hao Zou
Hao is a senior big data engineer on the Data Science and Engineering team. Previously, he worked at Alpine Data Labs and Pivotal.
A missing link in the ML infrastructure stack? (Chester Chen)
Talk at SF Big Analytics
Machine learning is quickly becoming a product engineering discipline. Although several new categories of infrastructure and tools have emerged to help teams turn their models into production systems, doing so is still extremely challenging for most companies. In this talk, we survey the tooling landscape and point out several parts of the machine learning lifecycle that are still underserved. We propose a new category of tool that could help alleviate these challenges and connect the fragmented production ML tooling ecosystem. We conclude by discussing similarities and differences between our proposed system and those of a few top companies.
Bio: Josh Tobin is the founder and CEO of a stealth machine learning startup. Previously, Josh worked as a deep learning & robotics researcher at OpenAI and as a management consultant at McKinsey. He is also the creator of Full Stack Deep Learning (fullstackdeeplearning.com), the first course focused on the emerging engineering discipline of production machine learning. Josh did his PhD in Computer Science at UC Berkeley advised by Pieter Abbeel.
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl... (Chester Chen)
GoPro's cameras, drones, and mobile devices, as well as its web and desktop applications, generate billions of event logs. The analytics metrics and insights that inform product, engineering, and marketing decisions need to be distributed quickly and efficiently, and we need to visualize the metrics to find trends and anomalies.
While building up the feature store for machine learning, we also need to visualize the features. Google Facets is an excellent project for visualizing features, but can it handle larger feature datasets?
These are issues we encountered at GoPro as part of our data platform evolution. In this talk, we will discuss some of the progress we have made: how we use Slack + Plot.ly to deliver analytics metrics and visualizations, and our work to visualize large feature sets using Google Facets with Apache Spark.
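To make the Slack delivery concrete, here is a minimal sketch of the general pattern, not GoPro's actual pipeline: it assumes the plotly (with the kaleido image engine) and slack_sdk packages, and the channel name, token variable, and metric values are all hypothetical.

```python
# Minimal sketch: render a daily-metric chart with Plotly and post it to Slack.
# Assumes plotly + kaleido and slack_sdk are installed; channel and data are illustrative.
import os
import plotly.graph_objects as go
from slack_sdk import WebClient

dates = ["2018-04-01", "2018-04-02", "2018-04-03"]
daily_active_devices = [120_000, 125_500, 131_200]   # illustrative numbers only

fig = go.Figure(go.Scatter(x=dates, y=daily_active_devices, mode="lines+markers"))
fig.update_layout(title="Daily Active Devices")
fig.write_image("daily_active_devices.png")          # static export via the kaleido engine

client = WebClient(token=os.environ["SLACK_TOKEN"])  # hypothetical bot token
client.files_upload(                                  # upload the chart to the metrics channel
    channels="#analytics-metrics",
    file="daily_active_devices.png",
    initial_comment="Daily active devices, updated by the metrics job.",
)
```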
Optimizing the Catalyst Optimizer for Complex Plans (Databricks)
For more than 6 years, Workday has been building various analytics products powered by Apache Spark. At the core of each product offering, customers use our UI to create data prep pipelines, which are then compiled to DataFrames and executed by Spark under the hood. As we built out our products, however, we started to notice places where vanilla Spark is not suitable for our workloads. For example, because our Spark plans are programmatically generated, they tend to be very complex, and often result in tens of thousands of operators. Another common issue is having case statements with thousands of branches, or worse, nested expressions containing such case statements.
With the right combination of these traits, the final DataFrame can easily take Catalyst hours to compile and optimize – that is, if it doesn’t first cause the driver JVM to run out of memory.
In this talk, we discuss how we addressed some of our pain points regarding complex pipelines. Topics covered include memory-efficient plan logging, using common subexpression elimination to remove redundant subplans, rewriting Spark’s constraint propagation mechanism to avoid exponential growth of filter constraints, as well as other performance enhancements made to Catalyst rules.
We then apply these changes to several production pipelines, showcasing the reduction of time spent in Catalyst, and list out ideas for further improvements. Finally, we share tips on how you too can better handle complex Spark plans.
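A small sketch of the kind of programmatically generated plan the talk describes, under the assumption that column names and loop sizes are illustrative: chaining many withColumn/when expressions makes Catalyst analysis and optimization time grow sharply, which you can observe by timing an explain call.

```python
# Sketch: build a deep, machine-generated expression tree and measure plan compilation time.
import time
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("complex-plan-demo").getOrCreate()
df = spark.range(1000).withColumnRenamed("id", "c0")

# Programmatically chain conditional expressions, as a pipeline compiler might.
for i in range(200):                      # raise this and watch planning time grow
    df = df.withColumn(
        f"c{i+1}",
        F.when(F.col(f"c{i}") % 2 == 0, F.col(f"c{i}") + 1).otherwise(F.col(f"c{i}") - 1),
    )

start = time.time()
df.explain(mode="cost")                   # forces analysis and optimization of the full plan
print(f"plan compilation took {time.time() - start:.1f}s")
```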
Programming for problem solving (airline reservation system) (Home)
The software we used for this project is Dev C++.
Dev-C++ is a free IDE for Windows that uses either MinGW or TDM-GCC as the underlying compiler.
Also, this software is quite handy compared with others, which is why it is often used by beginners.
The Killer Feature Store: Orchestrating Spark ML Pipelines and MLflow for Pro... (Databricks)
The ‘feature store’ is an emerging concept in data architecture that is motivated by the challenge of productionizing ML applications. The rapid iteration in experimental, data driven research applications creates new challenges for data management and application deployment.
MLOps with a Feature Store: Filling the Gap in ML Infrastructure (Data Science Milan)
A Feature Store enables machine learning (ML) features to be registered, discovered, and used as part of ML pipelines, thus making it easier to transform and validate the training data that is fed into machine learning systems. Feature stores can also enable consistent engineering of features between training and inference, but to do so, they need a common data processing platform. The first Feature Stores, developed at hyperscale AI companies such as Uber, Airbnb, and Facebook, enabled feature engineering using domain specific languages, providing abstractions tailored to the companies’ feature engineering domains. However, a general purpose Feature Store needs a general purpose feature engineering, feature selection, and feature transformation platform.
In this talk, we describe how we built a general purpose, open-source Feature Store for ML around dataframes and Apache Spark. We will demonstrate how data engineers can transform and engineer features from backend databases and data lakes, while data scientists use PySpark to select and transform features into train/test data in a file format of choice (.tfrecords, .npy, .petastorm, etc.) on a file system of choice (S3, HDFS). Finally, we will show how the Feature Store enables end-to-end ML pipelines to be factored into feature engineering and data science stages that can each run at different cadences.
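A generic PySpark sketch of the train/test materialization step described above; it is plain dataframe code rather than the Feature Store's own API, and the paths, feature names, and join key are hypothetical.

```python
# Sketch: join feature groups, split train/test, and write in a chosen format/file system.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("feature-join-demo").getOrCreate()

clicks = spark.read.parquet("s3://bucket/features/click_features")    # hypothetical paths
profiles = spark.read.parquet("s3://bucket/features/user_profile")

training = (clicks.join(profiles, on="user_id")
                  .select("user_id", "clicks_7d", "age_bucket", "label"))

train, test = training.randomSplit([0.8, 0.2], seed=42)
# "tfrecords" assumes the spark-tensorflow-connector is on the classpath.
train.write.mode("overwrite").format("tfrecords").save("s3://bucket/train")
test.write.mode("overwrite").parquet("s3://bucket/test")
```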
Bio:
Fabio Buso is the head of engineering at Logical Clocks AB, where he leads the Feature Store development. Fabio holds a master's degree in cloud computing and services with a focus on data intensive applications, awarded by a joint program between KTH Stockholm and TU Berlin.
Topics: feature store, MLOps.
Performance Optimization of Recommendation Training Pipeline at Netflix DB Ts... (Databricks)
Netflix is the world’s largest streaming service, with over 80 million members worldwide. Machine learning algorithms are used to recommend relevant titles to users based on their tastes.
At Netflix, we use Apache Spark to power our recommendation pipeline. Stages in the pipeline, such as label generation, data retrieval, feature generation, training, and validation, are based on the Spark ML PipelineStage framework. While this gives developers the flexibility to develop individual components as encapsulated pipeline stages, we find that coordination across stages can provide significant performance gains.
In this talk, we discuss how our Spark-based machine learning pipeline has been improved over the years. Techniques such as predicate pushdown and wide-transformation minimization have led to significant runtime improvements and resource savings.
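A minimal sketch of the predicate-pushdown idea mentioned above (not Netflix's code): filtering on a scan column before the wide join lets Spark push the predicate into the Parquet reader instead of shuffling the full table first. Paths and column names are hypothetical.

```python
# Sketch: narrow filter first, then the wide join runs on far less data.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("pushdown-demo").getOrCreate()

labels = spark.read.parquet("s3://bucket/labels")        # hypothetical paths
features = spark.read.parquet("s3://bucket/features")

recent_labels = labels.filter(F.col("ds") >= "2017-01-01")        # narrow transformation
training = recent_labels.join(features, on=["member_id", "ds"])   # wide shuffle on filtered data

training.explain()   # the scan node should show the ds predicate as a pushed/partition filter
```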
Flink has been used by many teams for ML use cases such as real-time feature engineering and near-line inference. For ML use cases that are more batch-oriented, such as model training and validation, other systems are usually used. This talk, given at Flink Forward 2019, presents the efforts in the Flink community to let Flink cover all ML use cases.
In this talk, we will present the basic features and functionality of Flock, an end-to-end research platform that we are developing at CISL which simplifies and automates the integration of machine learning solutions in data engines. Flock makes use of MLflow for model and experiment tracking but extends and complements it by providing automatic logging, model optimizations and support for the ONNX model format.
We will showcase Flock's features through a demo using Microsoft's Azure Data Studio and SQL Server.
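A minimal sketch of the MLflow + ONNX combination that Flock builds on; this is plain MLflow/skl2onnx usage rather than Flock's own API, and it assumes the mlflow, scikit-learn, and skl2onnx packages are installed.

```python
# Sketch: automatic experiment logging plus export of the trained model to ONNX.
import mlflow
import mlflow.onnx
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

mlflow.sklearn.autolog()                        # automatic parameter/metric logging

X, y = load_iris(return_X_y=True)
with mlflow.start_run():
    model = LogisticRegression(max_iter=200).fit(X, y)
    onnx_model = convert_sklearn(               # convert to the ONNX model format
        model, initial_types=[("input", FloatTensorType([None, X.shape[1]]))]
    )
    mlflow.onnx.log_model(onnx_model, artifact_path="model_onnx")
```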
Deep Dive of ADBMS Migration to Apache Spark—Use Cases Sharing (Databricks)
eBay has been using an enterprise ADBMS for over a decade, and in 2018 our team began migrating batch workloads from the ADBMS to Spark. We gathered many experiences and lessons during the migration journey (85% automated + 15% manual migration): we exposed many unexpected issues and gaps between the ADBMS and Spark SQL, made many decisions to fill those gaps in practice, and contributed a number of fixes to Spark core in order to unblock ourselves. This should be a helpful session for data and software engineers planning and executing their own migration work. We will share many of the specific issues we encountered and how we resolved or worked around them during the real migration process.
My talk at the Data Science Labs conference in Odessa.
Training a model in Apache Spark while having it automatically available for real-time serving is an essential feature for end-to-end solutions.
There is an option to export the model into PMML and then import it into a separate scoring engine. The idea of interoperability is great, but it has multiple challenges, such as code duplication, limited extensibility, inconsistency, and extra moving parts. In this talk we discussed an alternative solution that does not introduce custom model formats or new standards, is not based on an export/import workflow, and shares the Apache Spark API.
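A minimal sketch of the "no export/import" idea: the trained Spark ML PipelineModel is persisted in Spark's own format and reloaded by a scoring process through the same Spark API. Paths and column names are hypothetical; this is not the speaker's serving library.

```python
# Sketch: train and persist a PipelineModel, then reload it for scoring with the same API.
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("train-and-serve").getOrCreate()
train = spark.read.parquet("s3://bucket/train")          # expects f1, f2, label columns

pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=["f1", "f2"], outputCol="features"),
    LogisticRegression(featuresCol="features", labelCol="label"),
])
pipeline.fit(train).save("s3://bucket/models/lr_v1")

# In the serving process: same API, no format conversion.
model = PipelineModel.load("s3://bucket/models/lr_v1")
scored = model.transform(spark.read.parquet("s3://bucket/to_score"))
```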
Improving Spark SQL usability and computing efficiency is one of the missions of LinkedIn's Spark team. In this talk, we will present the Spark SQL ecosystem and roadmap at LinkedIn and introduce the highlighted projects we are working on, such as the following (a configuration sketch for the skew-join case appears after the list):
* Improving Dataset performance with automated column pruning
* Bringing an efficient 2d join algorithm to Spark SQL
* Fixing join skewness with adaptive execution
* Enhancing the cost-optimizer with a history-based learning approach
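As referenced above, a minimal configuration sketch for the skew-join item, using the adaptive query execution settings available in open-source Spark 3.x; LinkedIn's internal implementation may differ, and the threshold values shown are illustrative.

```python
# Sketch: enable adaptive execution and its skew-join handling in Spark 3.x.
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("aqe-skew-demo")
         .config("spark.sql.adaptive.enabled", "true")
         .config("spark.sql.adaptive.skewJoin.enabled", "true")
         # split a partition when it is both 5x the median size and larger than 256 MB
         .config("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
         .config("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256m")
         .getOrCreate())
```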
Developing ML-enabled Data Pipelines on Databricks using IDE & CI/CD at Runta... (Databricks)
Data and ML projects bring many new complexities beyond the traditional software development lifecycle. Unlike software projects, after they are successfully delivered and deployed they cannot be abandoned: they must be continuously monitored to check whether model performance still satisfies all requirements. We can always get new data with new statistical characteristics that can break our pipelines or affect model performance.
Correctness and Performance of Apache Spark SQL with Bogdan Ghit and Nicolas ... (Databricks)
In this talk, we present a comprehensive framework we developed at Databricks for assessing the correctness, stability, and performance of our Spark SQL engine. Apache Spark is one of the most actively developed open source projects, with more than 1200 contributors from all over the world. At this scale and pace of development, mistakes are bound to happen. We will discuss various approaches we take, including random query generation, random data generation, random fault injection, and longevity stress tests. We will demonstrate the effectiveness of the framework by highlighting several correctness issues we have found through random query generation and critical performance regressions we were able to diagnose within hours thanks to our automated benchmarking tools.
Apache® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models (Anyscale)
Apache Spark has rapidly become a key tool for data scientists to explore, understand, and transform massive datasets and to build and train advanced machine learning models. The question then becomes: how do I deploy these models to a production environment? How do I embed what I have learned into customer-facing data applications?
In this webinar, we will discuss best practices from Databricks on how our customers productionize machine learning models. We will:
* do a deep dive with actual customer case studies, and
* show live tutorials of a few example architectures and code in Python, Scala, Java, and SQL.
What MLflow is, what problems it solves for the machine learning lifecycle and how it solves them, how it is used with Databricks, and building a CI/CD pipeline with Databricks.
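A minimal MLflow tracking sketch for the lifecycle points listed above; the experiment name, parameter, and metric values are illustrative.

```python
# Sketch: record what was tried, how it performed, and any supporting artifact.
import mlflow

mlflow.set_experiment("demo-experiment")
with mlflow.start_run():
    mlflow.log_param("max_depth", 8)          # what was tried
    mlflow.log_metric("rmse", 0.42)           # how it performed
    with open("model_card.md", "w") as f:     # any supporting file
        f.write("demo model card")
    mlflow.log_artifact("model_card.md")
```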
Using Production Profiles to Guide Optimizations (Databricks)
Identifying the most important areas for optimization can be difficult. One large factor of this difficulty is the fact that different production workloads have different needs and different bottlenecks. Important efficiency concerns for one workload might be negligible for another. As a result, it is very important that engineers working on efficiency know how to profile their workload and use data to prioritize the right things. There are multiple aspects of this which are worth discussing. Firstly, it is important to get the profiling right.
There are different types of profilers that have different pros and cons, and as an engineer looking at profiles, it is important to have a high level understanding of what is happening under the hood to put everything in context. For instance, at Facebook we use both a perf-based profiler called Strobelight that has minimal Java specific context, and the third party async-profiler, which is a Java specific solution that misses native processes such as transforms.
Both of these solutions are important for different use cases. Secondly, being able to interpret the results and view them in meaningful ways is vital to properly identify the biggest potential efficiency wins. There are many different ways of looking at the data which can be useful in different situations. At Facebook, we rely heavily on data driven investigations to determine where our efficiency efforts are best spent. This involves heavy use of profiling.
In this talk, I will go over some of the different profilers and other tools we employ to gather the data. I will also provide some examples of how we’ve uncovered issues in the past, both deliberately and incidentally, and the process involved to come to the conclusions we did.
"The common use cases of Spark SQL include ad hoc analysis, logical warehouse, query federation, and ETL processing. Spark SQL also powers the other Spark libraries, including structured streaming for stream processing, MLlib for machine learning, and GraphFrame for graph-parallel computation. For boosting the speed of your Spark applications, you can perform the optimization efforts on the queries prior employing to the production systems. Spark query plans and Spark UIs provide you insight on the performance of your queries. This talk discloses how to read and tune the query plans for enhanced performance. It will also cover the major related features in the recent and upcoming releases of Apache Spark.
"
Matthieu Blanc will present spark.ml. Spark 1.2 introduced this new package, which provides a high-level API for building machine learning pipelines. We will walk through the basic concepts of this API with an example.
http://hugfrance.fr/spark-meetup-a-la-sg-avec-cloudera-xebia-et-influans-le-jeudi-11-juin/
50k runs; millions of metrics, parameters, and tags; bursts of up to 20k QPS. That is the volume of data managed by our MLflow tracking servers this year at Criteo. In this talk, you will learn how we set up a shared instance of MLflow at company scale. We will present our contributions to the SQLAlchemyStore to make it responsive at this scale, and how we turned MLflow into a production-ready system: how we scaled a shared instance horizontally on a Mesos cluster, our monitoring system based on Prometheus, integration with the company's Single Sign-On (SSO) authentication, and how our data scientists register their runs from the largest Hadoop cluster in Europe.
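A client-side sketch of using a shared MLflow tracking server like the one described; the server URI, experiment, and tag names are hypothetical, and the server itself is started separately (for example with the `mlflow server` CLI backed by a SQL store).

```python
# Sketch: point the MLflow client at a shared tracking server and log a tagged run.
import mlflow

mlflow.set_tracking_uri("https://mlflow.internal.example.com")   # shared company instance
mlflow.set_experiment("ads-ctr-model")

with mlflow.start_run():
    mlflow.set_tag("team", "ads")                 # tags help slice tens of thousands of runs
    mlflow.log_param("learning_rate", 0.05)
    mlflow.log_metric("auc", 0.79)
```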
A Deep Dive into Query Execution Engine of Spark SQL (Databricks)
Spark SQL enables Spark to perform efficient and fault-tolerant relational query processing using analytics database technologies. Relational queries are compiled into executable physical plans consisting of transformations and actions on RDDs, together with generated Java code. The code is compiled to Java bytecode, executed by the JVM, and optimized by the JIT into native machine code at runtime. This talk takes a deep dive into the Spark SQL execution engine, covering pipelined execution, whole-stage code generation, UDF execution, memory management, vectorized readers, and lineage-based RDD transformations and actions.
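A small sketch for peeking at the generated code the talk walks through; in Spark 3.x, explain(mode="codegen") prints the whole-stage-codegen Java source for each fused stage. The query is illustrative.

```python
# Sketch: inspect the Java code that whole-stage code generation produces for a query.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("codegen-demo").getOrCreate()
df = (spark.range(1_000_000)
      .filter(F.col("id") % 7 == 0)
      .select((F.col("id") * 2).alias("x")))

df.explain(mode="codegen")   # one fused Java function per whole-stage-codegen subtree
```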
The Hop project entered the Apache Software Foundation as an Incubator project in 2020, and Julian Hyde, one of their mentors, gave this presentation to educate the initial committers on the Apache Way and what to expect during the Incubation process.
The talk was given by Julian Hyde on October 1st, 2020, with the original title "Apache Incubation - What's it all about?"
Adding structure to your streaming pipelines: moving from Spark streaming to ... (DataWorks Summit)
How do you go from a strictly typed object-based streaming pipeline with simple operations to a structured streaming pipeline with higher order complex relational operations? This is what the Data Engineering team did at GoPro to scale up the development of streaming pipelines for the rapidly growing number of devices and applications.
When big data frameworks such as Hadoop first came to exist, developers were happy because we could finally process large amounts of data without writing complex multi-threaded code or worse yet writing complicated distributed code. Unfortunately, only very simple operations were available such as map and reduce. Almost immediately, higher level operations were desired similar to relational operations. And so Hive and dozens (hundreds?) of SQL-based big data tools became available for more developer-efficient batch processing of massive amounts of data.
In recent years, big data has moved from batch processing to stream-based processing since no one wants to wait hours or days to gain insights. Dozens of stream processing frameworks exist today and the same trend that occurred in the batch-based big data processing realm has taken place in the streaming world, so that nearly every streaming framework now supports higher level relational operations.
In this talk, we will discuss in a very hands-on manner how the streaming data pipelines for GoPro devices and apps have moved from the original Spark Streaming, with its simple RDD-based operations in Spark 1.x, to Spark's Structured Streaming, with its higher-level relational operations in Spark 2.x. We will talk about the differences, advantages, and necessary pain points that must be addressed in order to scale relational streaming pipelines for massive IoT streams. We will also talk about moving from "hand built" Hadoop/Spark clusters running in the cloud to a Spark-based cloud service.
Speakers: David Winters, Big Data Architect, GoPro; Hao Zou, Senior Software Engineer, GoPro
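A compact sketch of the Spark 1.x to 2.x move described above: the same Kafka topic consumed as a streaming DataFrame with relational operations instead of RDD-based DStream transformations. The broker, topic, and event schema are hypothetical, and the Kafka source requires the spark-sql-kafka connector package.

```python
# Sketch: Structured Streaming from Kafka with a relational aggregation on parsed JSON.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("structured-streaming-demo").getOrCreate()

event_schema = StructType([
    StructField("device_id", StringType()),
    StructField("event_type", StringType()),
    StructField("ts", LongType()),
])

events = (spark.readStream.format("kafka")                 # needs spark-sql-kafka package
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "device-events")
          .load()
          .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
          .select("e.*"))

counts = events.groupBy("event_type").count()              # higher-level relational operation

query = counts.writeStream.outputMode("complete").format("console").start()
```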
Hamburg Data Science Meetup - MLOps with a Feature Store (Moritz Meister)
MLOps is a trend in machine learning (ML) engineering that unifies ML system development (Dev) and ML system operation (Ops). Some ML lifecycle frameworks, such as TensorFlow Extended, are based around end-to-end pipelines that start with raw data and end in production models. In this talk we introduce the concept of a feature store as the missing piece of ML infrastructure that enables faster, lower-cost deployment of models. We will show how the Hopsworks Feature Store factors monolithic end-to-end ML pipelines into feature and model training pipelines that can each run at different cadences, and give examples of ingestion and training pipelines, including hyperparameter optimization and model deployment.
Enterprise guide to building a Data Mesh (Sion Smith)
Making Data Mesh simple, open source, and available to all: without vendor lock-in, without complex tooling, and using an approach centered around 'specifications', existing tools, and baking in a 'domain' model.
Azure Synapse Analytics is Azure SQL Data Warehouse evolved: a limitless analytics service, that brings together enterprise data warehousing and Big Data analytics into a single service. It gives you the freedom to query data on your terms, using either serverless on-demand or provisioned resources, at scale. Azure Synapse brings these two worlds together with a unified experience to ingest, prepare, manage, and serve data for immediate business intelligence and machine learning needs. This is a huge deck with lots of screenshots so you can see exactly how it works.
Developing Kafka Streams Applications with Upgradability in Mind with Neil Bu... (HostedbyConfluent)
Does your organization struggle with updating its Kafka Streams applications? Releasing a new version of a Kafka Streams application can be challenging, especially if its state has to be preserved between releases. Consider these best practices and architectural ideas to make the process smoother and improve your release process.
Having experienced the accidental removal of changelog topics and the need to expand partitions, we have found that these situations are much easier to handle with some planning. With proper planning, you can achieve easier application upgrades.
Key take-aways from the session include:
* How to minimize the rebuilding of state stores.
* How to change stream topologies without affecting the existing state stores.
* What you can do when you absolutely need to increase the number of partitions within your application.
* How to leverage schemas for application releases.
* Measures to prevent data corruption, especially if Kafka is not only your system of record but also your source of truth.
* Techniques to support rolling back an application.
* The advantages of splitting apart a Kafka Streams application into multiple applications.
Revolutionary container-based hybrid cloud solution for ML (Platform)
Ness' data science platform, NextGenML, puts the entire machine learning process (modelling, execution, and deployment) in the hands of data science teams.
The entire paradigm is built around collaboration on AI/ML and is implemented with full respect for best practices and a commitment to innovation.
Kubernetes (on-prem) + Docker, Azure Kubernetes Service (AKS), Nexus, Azure Container Registry (ACR), GlusterFS
Workflow
Argo->Kubeflow
DevOps
Helm, kSonnet, Kustomize, Azure DevOps
Code Management & CI/CD
Git, TeamCity, SonarQube, Jenkins
Security
MS Active Directory, Azure VPN, Dex (K8s) integrated with GitLab
Machine Learning
TensorFlow (model training, boarding, serving), Keras, Seldon
Storage (Azure)
Storage Gen1 & Gen2, Data Lake, File Storage
ETL (Azure)
Databricks, Spark on K8, Data Factory (ADF), HDInsight (Kafka and Spark), Service Bus (ASB)
Lambda functions & VMs, Cache for Redis
Monitoring and Logging
Grafana, Prometheus, Graylog
Jump Start with Apache Spark 2.0 on Databricks (Databricks)
Apache Spark 2.0 has laid the foundation for many new features and functionality. Its main three themes—easier, faster, and smarter—are pervasive in its unified and simplified high-level APIs for Structured data.
In this introductory part-lecture, part hands-on workshop, you'll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas (a minimal sketch of the new entry point follows the list):
* What's new in Spark 2.0
* SparkSessions vs. SparkContexts
* Datasets/DataFrames and Spark SQL
* Introduction to Structured Streaming concepts and APIs
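As referenced above, a minimal sketch of the Spark 2.0 entry point covered in the workshop: SparkSession subsumes the older SparkContext/SQLContext, and DataFrames and SQL share one API. The example data is illustrative.

```python
# Sketch: SparkSession as the single entry point for DataFrames and SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark2-jumpstart").getOrCreate()
sc = spark.sparkContext                      # the legacy context is still reachable

df = spark.range(10).toDF("id")
df.createOrReplaceTempView("numbers")
spark.sql("SELECT sum(id) AS total FROM numbers").show()
```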
Dynamic DDL: Adding structure to streaming IoT data on the fly (DataWorks Summit)
At the end of the day, data scientists want one thing: tabular data for their analysis. They do not want to spend hours or days preparing data. How does a data engineer handle the massive amount of data being streamed at them from IoT devices and apps, and at the same time add structure to it so that data scientists can focus on finding insights rather than preparing data? By the way, you need to do this within minutes (sometimes seconds). Oh... and there are a bunch more data sources that you need to ingest, and the current providers of data are changing their structure.
At GoPro, we have massive amounts of heterogeneous data being streamed at us from our consumer devices and applications, and we have developed a concept of "dynamic DDL" to structure our streamed data on the fly using Spark Streaming, Kafka, HBase, Hive, and S3. The idea is simple: add structure (schema) to the data as soon as possible, allow the providers of the data to dictate the structure, and automatically create event-based and state-based tables (DDL) for all data sources so that data scientists can access the data via their lingua franca, SQL, within minutes.
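A simplified sketch of the "dynamic DDL" idea: infer the schema from a batch of incoming JSON and create a Hive table from it. Paths, table names, and the batch layout are hypothetical, and the production system described above also handles schema evolution and event/state table generation.

```python
# Sketch: derive DDL from provider-defined JSON fields and create/append to a Hive table.
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("dynamic-ddl-demo")
         .enableHiveSupport().getOrCreate())

batch = spark.read.json("s3://bucket/incoming/2018-04-18-12-05/")   # provider-defined fields

ddl_columns = ", ".join(
    f"`{f.name}` {f.dataType.simpleString()}" for f in batch.schema.fields
)
spark.sql(f"CREATE TABLE IF NOT EXISTS events_app_launch ({ddl_columns}) STORED AS PARQUET")

batch.write.mode("append").insertInto("events_app_launch")
```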
Migrating on-premises workloads to Azure SQL Database (Parikshit Savjani)
Azure SQL Database is a fully managed cloud database service with built-in intelligence, elastic scale, performance, reliability, and data protection that enables enterprises and ISVs to reduce their total cost of ownership and operational overhead. In this session, I will share real-world experience of successfully migrating existing SaaS applications and on-premises workloads for some of our tier 1 customers and ISV partners to the Azure SQL Database service. The session walks through planning, assessment, migration tools, and best practices drawn from proven experience migrating real-world applications to Azure SQL Database.
I did a session at LeedsSharp about Azure integration services and where to use which technology:
- Azure Functions
- Logic Apps
- Synapse
- Data Factory
- Event Grid
- Service Bus
- Event Hub
- API Management
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D... (Databricks)
Many have dubbed the 2020s the decade of data. This is indeed an era of data zeitgeist.
From code-centric software development 1.0, we are entering software development 2.0, a data-centric and data-driven approach, where data plays a central theme in our everyday lives.
As the volume and variety of data garnered from myriad data sources continue to grow at an astronomical scale and as cloud computing offers cheap computing and data storage resources at scale, the data platforms have to match in their abilities to process, analyze, and visualize at scale and speed and with ease — this involves data paradigm shifts in processing and storing and in providing programming frameworks to developers to access and work with these data platforms.
In this talk, we will survey some emerging technologies that address the challenges of data at scale, how these tools help data scientists and machine learning developers with their data tasks, why they scale, and how they facilitate the future data scientists to start quickly.
In particular, we will examine in detail two open-source tools MLflow (for machine learning life cycle development) and Delta Lake (for reliable storage for structured and unstructured data).
Other emerging tools, such as Koalas, help data scientists do exploratory data analysis at scale in a language and framework they are familiar with; we will also touch on emerging data + AI trends in 2021.
You will understand the challenges of machine learning model development at scale, why you need reliable and scalable storage, and what other open source tools are at your disposal to do data science and machine learning at scale.
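A minimal Delta Lake sketch for the reliable-storage side discussed above; it requires a Spark session with the delta-spark package configured, and the local path and row count are illustrative.

```python
# Sketch: write a Delta table and read an earlier version back via time travel.
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("delta-demo")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

spark.range(100).write.format("delta").mode("overwrite").save("/tmp/demo_delta")

# Reliable, versioned reads: time travel back to the first snapshot.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/demo_delta")
v0.show(5)
```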
Similar to SF Big Analytics 2018-04-18: Evolution of GoPro's data platform
GPUs used with Apache Spark can speed up machine learning (ML) model training and inference, while data preparation stages are traditionally run on CPUs. The RAPIDS Accelerator for Apache Spark is a plugin jar that takes advantage of Apache Spark 3.x's ability to schedule on GPUs. The RAPIDS Accelerator replaces CPU expressions in a physical plan with GPU equivalents for DataFrame operations. No code change is required, making the transition to GPUs seamless.
We'll give an overview of what the RAPIDS Accelerator is, how it works, and the benefits of using it. We will discuss benchmarks showing the performance and cost benefits of leveraging GPUs for Spark ETL processing, and showcase a user tool that helps estimate speedups and cost savings.
Talk at SF Big Analytics https://www.meetup.com/sf-big-analytics/events/285731741/
Distributed systems are made up of many components such as authentication, a persistence layer, stateless services, load balancers, and stateful coordination services. These coordination services are central to the operation of the system, performing tasks such as maintaining system configuration state, ensuring service availability, name resolution, and storing other system metadata. Given their central role in the system, it is essential that these services remain available, fault-tolerant, and consistent. By providing a highly available, file-system-like abstraction as well as powerful recipes such as leader election, Apache Zookeeper is often used to implement these services. Although powerful, the Zookeeper interface may not be flexible enough or provide sufficient performance for all applications, and many systems are replacing Zookeeper-based solutions with Raft, which provides a more generic interface to high availability and fault tolerance through the use of state machine replication. This talk will go over a generic example of a stateful coordination service moving from Zookeeper to Raft.
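A small illustration of the ZooKeeper "recipe" style mentioned above, using the kazoo Python client; the ZooKeeper address, election path, and node identifier are hypothetical, and this is not Alluxio's code.

```python
# Sketch: leader election via the kazoo client's built-in Election recipe.
from kazoo.client import KazooClient

zk = KazooClient(hosts="zookeeper:2181")
zk.start()

def lead():
    # Runs only while this process holds leadership of the coordination service.
    print("I am the leader; serving metadata requests")

election = zk.Election("/myservice/leader", identifier="node-1")
election.run(lead)          # blocks until elected, then calls lead()
```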
Speaker: Tyler Crain (Alluxio)
Tyler Crain is a software engineer at Alluxio, working on distributed systems within the Alluxio core team. Before this, Tyler held Post-Doc positions at the University of Sydney and Sorbonne Universities where he performed research on topics including distributed key-value stores, distributed consensus and blockchain. Tyler received his PhD from the University of Rennes where he worked on Transactional Memory. He also holds a Masters degree in Computer Science from University of California Santa Barbara.
talk at SF Big Analytics:
Related Blog: https://www.alluxio.io/blog/from-zookeeper-to-raft-how-alluxio-stores-file-system-state-with-high-availability-and-fault-tolerance/
SF Big Analytics 2022-03-15: Persia: Scaling DL Based Recommenders up to 100 ... (Chester Chen)
Recent years have witnessed exponential growth in the scale of recommendation/Ads/search models—from Google's 2016 model with 1 billion parameters to Facebook's latest model with 12 trillion parameters. A significant quality boost has come with each jump in model capacity, which makes people believe the era of 100 trillion parameters is around the corner. To prepare for the exponential growth of model size, an efficient distributed training system is urgently needed. However, training such huge models is challenging even within industrial-scale data centers. In this talk, I will introduce Persia, an open training system developed by my team, which addresses this challenge through careful co-design of both the optimization algorithm and the distributed system architecture. Persia exhibits nearly linear speedup while scaling the number of workers and the model size. Besides the capability of training 100 trillion parameters, it also shows a clear advantage in efficiency over other open-sourced engines.
paper link:
https://arxiv.org/pdf/2111.05897.pdf
Speaker: Ji Liu
Dr. Ji Liu received his Ph.D. in computer science and his bachelor's degree in automation from the University of Wisconsin-Madison and the University of Science and Technology of China, respectively. After graduation, he joined the University of Rochester as an assistant professor, conducting research in machine learning, optimization, and reinforcement learning. The asynchronous and decentralized algorithms he developed are widely used in industry, for example at IBM and Microsoft. He left academia and joined Tencent in 2017 to explore AI's boundary. The AI agent Tstarbot his team developed was considered a milestone for mastering the most challenging RTS game, Starcraft II. His second stop in industry was Kwai, the second largest short-video company in China, where he founded and led multiple international teams with different functions: platform, product, and research. His teams contributed to 15+% annual revenue growth in Ads. He has published 100+ papers in top-tier CS conferences and journals and received multiple best paper awards (e.g., SIGKDD 2010 and the UAI 2015 Facebook best paper). He was an awardee of MIT TR 35 under 35 in China and an IBM faculty award in 2017, and was nominated as one of China's top 5 AI innovators under 35 in 2018.
SF Big Analytics talk: NVIDIA FLARE: Federated Learning Application Runtime E... (Chester Chen)
Topic:
NVIDIA FLARE: Federated Learning Application Runtime Environment for Developing Robust AI Models
Summary:
Federated learning (FL) enables building robust and generalizable AI models by leveraging diverse datasets from multiple collaborators without moving data. We created NVIDIA FLARE as an open-source SDK to make it easier for data scientists to use FL in their research. The SDK allows existing machine learning and deep learning workflows to be adapted for distributed learning across enterprises, and enables platform developers to build a secure, privacy-preserving offering for multiparty collaboration using homomorphic encryption or differential privacy. The SDK is a lightweight, flexible, and scalable Python package that allows researchers to bring their data science workflows implemented in any training library (PyTorch, TensorFlow, or even NumPy) and apply them in real-world FL settings. This talk will introduce the key design principles of NVIDIA FLARE and illustrate use cases (e.g., COVID analysis) with customizable FL workflows that implement different privacy-preserving algorithms.
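A toy NumPy illustration of the federated-averaging idea behind FL frameworks like NVIDIA FLARE; it is a concept sketch only, does not use the FLARE SDK's API, and all data and update rules are hypothetical.

```python
# Sketch: each site updates a model on its private data; the server averages only the weights.
import numpy as np

def local_update(weights, data, lr=0.1):
    # Hypothetical one-step local gradient update on a site's private data.
    X, y = data
    grad = X.T @ (X @ weights - y) / len(y)
    return weights - lr * grad

sites = [(np.random.randn(50, 3), np.random.randn(50)) for _ in range(4)]
global_w = np.zeros(3)

for rnd in range(10):                                       # federated rounds
    local_ws = [local_update(global_w, d) for d in sites]   # train locally, data stays put
    global_w = np.mean(local_ws, axis=0)                    # server aggregates only weights

print("aggregated weights:", global_w)
```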
Speaker: Dr. Holger Roth (NVIDIA)
Holger Roth is a Sr. Applied Research Scientist at NVIDIA focusing on deep learning for medical imaging. He has been working closely with clinicians and academics over the past several years to develop deep learning based medical image computing and computer-aided detection models for radiological applications. He is an Associate Editor for IEEE Transactions on Medical Imaging and holds a Ph.D. from University College London, UK. In 2018, he was awarded the MICCAI Young Scientist Publication Impact Award.
SF Big Analytics 20191112: How to performance-tune Spark applications in larg... (Chester Chen)
Uber developed a new Spark ingestion system, Marmaray, for data ingestion from various sources. It is designed to ingest billions of Kafka messages every 30 minutes, and the amount of data handled by the pipeline is on the order of hundreds of TBs. Omkar details how to tackle such scale and offers insights into optimization techniques. Key highlights include: how to understand bottlenecks in Spark applications; whether to cache your Spark DAG to avoid rereading input data; how to use accumulators effectively to avoid unnecessary Spark actions; how to inspect heap and non-heap memory usage across hundreds of executors; how changing the layout of data can save long-term storage cost; how to use serializers and compression effectively to save network and disk traffic; how to amortize the cost of your application by multiplexing jobs; and different techniques for reducing memory footprint, runtime, and on-disk usage. The team was able to significantly (~10%–40%) reduce memory footprint, runtime, and disk usage.
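A small sketch of two of the techniques mentioned above (not Uber's Marmaray code): caching an input that several actions reuse, and switching to Kryo serialization with RDD compression to cut network and disk traffic. The input path and column name are hypothetical.

```python
# Sketch: Kryo + compression at session level, caching to avoid rereading input per action.
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("tuning-demo")
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .config("spark.rdd.compress", "true")
         .getOrCreate())

events = spark.read.parquet("s3://bucket/kafka-ingest/2019-11-12/")   # hypothetical path
events.cache()                               # reused by both actions below

total = events.count()
by_topic = events.groupBy("topic").count().collect()
events.unpersist()
```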
Speaker: Omkar Joshi (Uber)
Omkar Joshi is a senior software engineer on Uber’s Hadoop platform team, where he’s architecting Marmaray. Previously, he led object store and NFS solutions at Hedvig and was an initial contributor to Hadoop’s YARN scheduler.
SF Big Analytics 20191112: Uncovering performance regressions in the TCP SACK... (Chester Chen)
Uncovering performance regressions in the TCP SACKs vulnerability fixes
In early July 2019, Databricks noticed some Apache Spark workloads regressing by as much as 6x. In this talk, we'll discuss how we traced these regressions back to the Linux kernel and the fixes for the TCP SACKs vulnerabilities. We will explain the symptoms we were seeing, walk through how we debugged the TCP connections, and dive into the Linux source to uncover the root cause.
Speaker: Chris Stevens (Databricks)
Chris Stevens is a software engineer at Databricks, where he works on the reliability, scalability, and security of Apache Spark clusters. His work focuses on auto-scaling compute, auto-scaling storage, node initialization performance, and node health monitoring. Prior to Databricks, Chris founded the Minoca OS project, where he built a POSIX-compliant, general-purpose OS from scratch to run on resource-constrained devices. He got his start at Microsoft working on the Windows kernel team, porting the Windows boot environment from BIOS to UEFI.
SFBigAnalytics_20190724: Monitor kafka like a ProChester Chen
Kafka operators need to provide guarantees to the business that Kafka is working properly and delivering data in real time, and they need to identify and triage problems so they can solve them before end users notice them. This elevates the importance of Kafka monitoring from a nice-to-have to an operational necessity. In this talk, Kafka operations experts Xavier Léauté and Gwen Shapira share their best practices for monitoring Kafka and the streams of events flowing through it: how to detect duplicates, catch buggy clients, and triage performance issues; in short, how to keep the business's central nervous system healthy and humming along, like a Kafka pro.
Speakers: Gwen Shapira, Xavier Léauté (Confluent)
Gwen is a software engineer at Confluent working on core Apache Kafka. She has 15 years of experience working with code and customers to build scalable data architectures. She currently specializes in building real-time reliable data processing pipelines using Apache Kafka. Gwen is an author of “Kafka - the Definitive Guide”, "Hadoop Application Architectures", and a frequent presenter at industry conferences. Gwen is also a committer on the Apache Kafka and Apache Sqoop projects.
One of the first engineers on the Confluent team, Xavier is responsible for analytics infrastructure, including real-time analytics in Kafka Streams. He was previously a quantitative researcher at BlackRock. Prior to that, he held various research and analytics roles at Barclays Global Investors and MSCI.
SF Big Analytics 2019-06-12: Managing uber's data workflows at scaleChester Chen
Talk 2. Managing Uber’s Data workflow at Scale.
Uber's microservices serve millions of rides a day, generating 100+ PB of data. To democratize data pipelines, Uber needed a central tool that provides a way to author, manage, schedule, and deploy data workflows at scale. This talk details Uber's journey toward a unified and scalable data workflow system used to manage this data and shares the challenges faced and how the company has rearchitected several components of the system—such as scheduling and serialization—to make them highly available and more scalable.
Speaker Alex Kira (Uber)
Alex Kira is an engineering tech lead at Uber, where he works on the data workflow management team. His team provides a data infrastructure platform for the company. Over his 19-year career, he has had experience across several software disciplines, including distributed systems, data infrastructure, and full-stack development.
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...Chester Chen
Building highly efficient data lakes using Apache Hudi (Incubating)
Even with the exponential growth in data volumes, ingesting, storing, and managing big data remains unstandardized and inefficient. Data lakes are a common architectural pattern to organize big data and democratize access across the organization. In this talk, we will discuss different aspects of building data lake architectures, pinpointing technical challenges and areas of inefficiency. We will then re-architect the data lake using Apache Hudi (Incubating), which provides streaming primitives right on top of big data. We will show how upserts and incremental change streams provided by Hudi help optimize data ingestion and ETL processing. Further, Apache Hudi manages the growth and file sizes of the resulting data lake using purely open-source file formats, while also providing optimized query performance and file-system listing. We will also provide hands-on tools and guides for trying this out on your own data lake.
Speaker: Vinoth Chandar (Uber)
Vinoth is a Technical Lead on Uber's Data Infrastructure team.
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at LyftChester Chen
Talk 1. Scaling Apache Spark on Kubernetes at Lyft
As part of this mission, Lyft invests heavily in open source infrastructure and tooling. At Lyft, Kubernetes has emerged as the next generation of cloud-native infrastructure to support a wide variety of distributed workloads, and Apache Spark has evolved to solve both machine learning and large-scale ETL workloads. By combining the flexibility of Kubernetes with the data processing power of Apache Spark, Lyft is able to drive ETL data processing to a different level. In this talk, we will cover the challenges the Lyft team faced and the solutions they developed to support Apache Spark on Kubernetes in production and at scale. Topics include: key traits of Apache Spark on Kubernetes; a deep dive into Lyft's multi-cluster setup and operations to handle petabytes of production data; how Lyft extends and enhances Apache Spark to support capabilities such as Spark pod lifecycle metrics and state management, resource prioritization, and queuing and throttling; dynamic job scale estimation and runtime dynamic job configuration; and how Lyft powers internal data scientists, business analysts, and data engineers via a multi-cluster setup.
Speaker: Li Gao
Li Gao is the tech lead of the cloud-native Spark compute initiative at Lyft. Prior to Lyft, Li worked at Salesforce, Fitbit, Marin Software, and a few startups in various technical leadership positions on cloud-native and hybrid-cloud data platforms at scale. Besides Spark, Li has scaled and productionized other open source projects, such as Presto, Apache HBase, Apache Phoenix, Apache Kafka, Apache Airflow, Apache Hive, and Apache Cassandra.
SFBigAnalytics- hybrid data management using cdapChester Chen
Cloud has emerged as a critical enabler of digital transformation, with the aim of reducing IT overheads and costs. However, cloud
migration is not instantaneous for a variety of reasons including data sensitivity, compliance and application performance. This results in the creation of diverse hybrid and multi-cloud environments and amplifies data management and integration challenges. This talk demonstrates how CDAP’s flexibility can allow you to utilize your existing on-premises infrastructure, as you evolve to the latest Big Data and Cloud services at your own pace, all while providing you a single, unified view of all your data, wherever it resides.
Speaker: Bhooshan Mogal, Google
Bhooshan Mogal is a Product Manager at Google, where he is focused on delivering best-in-class Data and Analytics services to GCP users. Prior to Google, he worked on data systems at Cask Data Inc, Pivotal and Yahoo.
Bighead: Airbnb's end-to-end machine learning platform
Airbnb has a wide variety of ML problems ranging from models on traditional structured data to models built on unstructured data such as user reviews, messages, and listing images. The ability to build, iterate on, and maintain healthy machine learning models is critical to Airbnb's success. Bighead aims to tie together various open source and in-house projects to remove incidental complexity from ML workflows. Bighead is built on Python, Spark, and Kubernetes. The components include a lifecycle management service, an offline training and inference engine, an online inference service, a prototyping environment, and a Docker image customization tool. Each component can be used individually. In addition, Bighead includes a unified model building API that smoothly integrates popular libraries including TensorFlow, XGBoost, and PyTorch. Each model is reproducible and iterable through standardization of data collection and transformation, model training environments, and production deployment. This talk covers the architecture, the problems that each individual component and the overall system aim to solve, and a vision for the future of machine learning infrastructure. Bighead is widely adopted at Airbnb, and we have a variety of models running in production. We plan to open source Bighead to allow the wider community to benefit from our work.
Speaker: Andrew Hoh
Andrew Hoh is the Product Manager for the ML Infrastructure and Applied ML teams at Airbnb. Previously, he has spent time building and growing Microsoft Azure's NoSQL distributed database. He holds a degree in computer science from Dartmouth College.
SF big Analytics : Stream all things by Gwen Shapira @ Lyft 2018Chester Chen
80% of the time in every project is spent on data integration: getting the data you want the way you want it. This problem remains challenging despite 40 years of attempts to solve it. We want a reliable, low-latency system that can handle varied data from a wide range of data management systems. We want a solution that is easy to manage and easy to scale. Is it too much to ask?
In this presentation, we’ll discuss the basic challenges of data integration and introduce design and architecture patterns that are used to tackle these challenges. We will explore how these patterns can be implemented using Apache Kafka and share pragmatic solutions that many engineering organizations used to build fast, scalable and manageable data pipelines.
Speaker: Gwen Shapira
Gwen Shapira is a system architect at Confluent, where she helps customers achieve success with their Apache Kafka implementation. She has 15 years of experience working with code and customers to build scalable data architectures, integrating relational and big data technologies. Gwen currently specializes in building real-time reliable data-processing pipelines using Apache Kafka. Gwen is an Oracle Ace Director, the coauthor of Hadoop Application Architectures, and a frequent presenter at industry conferences. She is also a committer on Apache Kafka and Apache Sqoop. When Gwen isn’t coding or building data pipelines, you can find her pedaling her bike, exploring the roads and trails of California and beyond.
Adjusting primitives for graph : SHORT REPORT / NOTESSubhajit Sahu
Graph algorithms such as PageRank are considered. Compressed Sparse Row (CSR) is an adjacency-list based graph representation.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Techniques to optimize the PageRank algorithm usually fall into two categories: reducing the work per iteration, and reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged can save iteration time. Skipping in-identical vertices (vertices with the same in-links) helps avoid duplicate computations and thus can also reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance, since the final ranks of chain nodes can be easily calculated; this can reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which can reduce the iteration time and the number of iterations, and also enables multi-iteration concurrency in the PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
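To make the first category concrete, the sketch below is a minimal sequential PageRank in Scala that freezes vertices once their rank change falls below a tolerance, a simplified version of the "skip converged vertices" idea above. The graph representation, damping factor, and tolerance are illustrative assumptions; this is not the STICD implementation.
// Minimal sequential PageRank that skips work for vertices already marked converged.
// Assumes every vertex appears as a key of outLinks; values and thresholds are illustrative.
object PageRankSkipConverged {
  def rank(outLinks: Map[Int, Seq[Int]],
           damping: Double = 0.85,
           tol: Double = 1e-6,
           maxIter: Int = 100): Map[Int, Double] = {
    val vertices = outLinks.keys.toVector
    val n = vertices.size
    // Build the reverse adjacency (in-links) once.
    val inLinks: Map[Int, Seq[Int]] =
      outLinks.toSeq
        .flatMap { case (src, dsts) => dsts.map(dst => dst -> src) }
        .groupBy(_._1)
        .map { case (dst, pairs) => dst -> pairs.map(_._2) }
        .withDefaultValue(Seq.empty)
    var ranks = vertices.map(v => v -> 1.0 / n).toMap
    var converged = Set.empty[Int]
    var changed = true
    var iter = 0
    while (iter < maxIter && changed) {
      changed = false
      ranks = vertices.map { v =>
        if (converged(v)) v -> ranks(v)        // skip computation on converged vertices
        else {
          val contrib = inLinks(v).map(u => ranks(u) / outLinks(u).size).sum
          val r = (1 - damping) / n + damping * contrib
          if (math.abs(r - ranks(v)) < tol) converged += v else changed = true
          v -> r
        }
      }.toMap
      iter += 1
    }
    ranks
  }
}
Note that freezing a vertex permanently is a heuristic: its neighbors may still move slightly, so this trades a little accuracy for less work per iteration.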
2. ABOUT SPEAKERS
• Chester Chen
• Head of Data Science & Engineering (DSE) at GoPro
• Previously, Director of Engineering, Alpine Data Labs
• Founder and Organizer of SF Big Analytics meetup
• David Winters
• Data Architect of Data Science & Engineering (DSE) at GoPro
• Previously worked at Splice Machine, Apple
• Hao Zou
• Senior Software Engineer of Data Science & Engineering (DSE) at GoPro
• Previously worked at Alpine Data Labs, Pivotal
3. AGENDA
• Business Use Cases
• Evolution of GoPro Data Platform
• Platform Architecture Transformation & Streaming to S3
• Configurable Spark Batch Framework
• Data Democratization
• Data Management & Visualization
• Data Metrics Delivery
• Initial exploration in ML feature visualization
6. EXAMPLES OF ANALYTICS USE CASES
• Product Analytics
• Web/E-Commerce Analytics
• Camera Analytics
• Mobile Analytics
• GoPro Plus Analytics
• CRM Analytics
• Digital Marketing Analytics
• Social Media Analytics
• Cloud Media Analysis
15. PROS AND CONS OF OLD SYSTEM
• Pros
• Isolation of workloads
• Fast ingest
• Secure
• Fast delivery/queries
• Loosely coupled clusters
• Cons
• Multiple copies of data
• Tightly coupled storage and compute
• Lack of elasticity
• Operational overhead of multiple clusters
16. DYNAMIC ELASTIC ARCHITECTURE
(Architecture diagram) Streaming clusters #1..#N consume events and streaming state messages and land data files, aggregates, and Parquet + DDL into a centralized data repository on S3, backed by a shared Hive Metastore with dynamic DDL. Ephemeral ETL clusters #1..#N and ephemeral analytical clusters #1..#N (with notebooks) are spun up on demand against the same repository. A batch induction framework handles scheduled batch downloads via:
• Rest API
• FTP downloads
• S3 sync
Improvements:
• Single copy of data
• Separate storage from compute
• Elastic clusters
• Reduced long-running clusters to maintain
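As a rough illustration of the "single copy of data, compute separated from storage" idea, the sketch below shows how an ephemeral Spark cluster might attach to a centralized Hive Metastore and read/write S3 directly. The metastore URI, bucket, and column names are placeholders, not GoPro's actual configuration.
import org.apache.spark.sql.SparkSession

// Sketch: an ephemeral cluster attaches to the shared Hive Metastore and S3.
// URI, bucket, table, and column names are illustrative placeholders.
val spark = SparkSession.builder()
  .appName("ephemeral-etl")
  .config("hive.metastore.uris", "thrift://metastore.internal:9083") // centralized metastore
  .enableHiveSupport()
  .getOrCreate()

// Read raw JSON events from the central S3 repository...
val events = spark.read.json("s3a://data-bucket/events/production/")
events.createOrReplaceTempView("events")

// ...aggregate, and write compressed Parquet back to S3 so any other
// ephemeral cluster can query the same single copy of the data.
spark.sql("SELECT event_date, count(*) AS event_count FROM events GROUP BY event_date")
  .write.mode("overwrite")
  .parquet("s3a://data-bucket/warehouse/daily_event_counts/")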
18. BATCH JOBS
(Diagram) Jobs are driven by job configuration files (Job.conf). Scheduled production jobs go through a job gateway and launch a new Spark cluster per job; dev jobs submitted from dev machines run on a new or an existing Spark cluster.
20. TAKEAWAYS
Key Changes
• Centralized Hive Metastore
• Leveraged S3 as centralized storage
• Separated compute and storage
• Provided horizontal scalability with cluster elasticity
• Less time spent managing infrastructure
21. TAKEAWAYS
Key Challenges
• Pushing data to S3
• Made use of parallel writes with multipart uploads (see the sketch after this list)
• Moving from Hadoop YARN to Spark Standalone
• Changed from fewer large EC2 instances to many smaller instances
• Combined Spark Streaming jobs
• Considering a move to containers for further improved instance utilization.
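For the first challenge, one possible way to enable parallel multipart uploads through the s3a connector is sketched below. The property names come from hadoop-aws, but the values are illustrative, not GoPro's production tuning.
import org.apache.spark.sql.SparkSession

// Illustrative s3a settings for parallel multipart uploads.
// Exact values (part size, thresholds, thread counts) need workload-specific tuning.
val spark = SparkSession.builder()
  .appName("s3-writer")
  .config("spark.hadoop.fs.s3a.fast.upload", "true")              // buffer and upload parts as they fill (Hadoop 2.x flag)
  .config("spark.hadoop.fs.s3a.multipart.size", "67108864")       // 64 MB per part
  .config("spark.hadoop.fs.s3a.multipart.threshold", "134217728") // switch to multipart above 128 MB
  .config("spark.hadoop.fs.s3a.threads.max", "64")                // parallel upload threads
  .getOrCreate()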
22. TAKEAWAYS
Key Benefits
• Cost
• Reduced redundant storage and compute costs
• Use of smaller instance types
• 60% AWS cost savings compared to one year ago
• Operations
• Reduced the complexity of DevOps support
• Analytics tools
• SQL only => Notebooks with SQL, Python, and Scala
25. BATCH INGESTION
(Diagram) Batch data downloads pull GoPro product data and several third-party data sources via Rest APIs, SFTP, and S3 sync. Input file formats are CSV and JSON. Each batch job runs on its own Spark cluster (new cluster per job).
26. TABLE WRITER JOBS
(Class-hierarchy diagram) SparkJob is the base of every job: all jobs share the same configuration loading, job state, and error reporting. HiveTableWriter mixes in CoreTableWriter, and all table writers get Dynamic DDL capabilities; once sources become DataFrames, they all behave the same. FileToHiveTableWriter specializes into AbstractCSVHiveTableWriter and AbstractJSONHiveTableWriter (CSV and JSON have different loaders), which back CSVTableWriter, JSONTableWriter, and customized CSV/JSON jobs. HBaseToHiveTableWriter and TableToHiveTableWriter (used by HBaseSnapshotJob and TableSnapshotJob) need a different loader to turn HBase records into DataFrames. JDBCToHiveTableWriter covers JDBC sources, and aggregate jobs build on the same framework.
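A hypothetical Scala reconstruction of this hierarchy is sketched below. The class names come from the diagram, but the method signatures and bodies are assumptions; for brevity it passes a SparkSession where the deck's code on slide 31 uses a SparkContext.
import com.typesafe.config.Config
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical reconstruction of the writer hierarchy; names are from the diagram,
// signatures and bodies are assumptions for illustration only.
trait SparkJob {
  // every job shares configuration loading, job state, and error reporting
  def run(spark: SparkSession, jobType: String, jobName: String, config: Config): Unit
}

trait CoreTableWriter {
  // once data is a DataFrame, all writers behave the same and get Dynamic DDL
  def writeToHive(df: DataFrame, table: String): Unit =
    df.write.mode("append").saveAsTable(table) // simplified: a real job would manage DDL and partitions
}

abstract class HiveTableWriter extends SparkJob with CoreTableWriter {
  def load(spark: SparkSession, config: Config): DataFrame  // each source brings its own loader
  def table(config: Config): String = config.getString("output.table")
  final override def run(spark: SparkSession, jobType: String, jobName: String, config: Config): Unit =
    writeToHive(load(spark, config), table(config))
}

// File-based sources (CSV vs JSON) differ only in how files are parsed into a DataFrame.
class CSVTableWriter extends HiveTableWriter {
  def load(spark: SparkSession, config: Config): DataFrame =
    spark.read.option("header", "true").csv(config.getString("input.path"))
}
class JSONTableWriter extends HiveTableWriter {
  def load(spark: SparkSession, config: Config): DataFrame =
    spark.read.json(config.getString("input.path"))
}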
27. ETL JOB CONFIGURATION
gopro.dse.config.etl {
mobile-job {
conf {}
process {}
input {}
output {}
post.process {}
}
}
include classpath("conf/production/etl_mobile_quik.conf")
include classpath("conf/production/etl_mobile_capture.conf")
include classpath("conf/production/etl_mobile_product_events.conf")
Job-level conf overrides the JobType conf. Job-specific includes define the JobType, the JobName, and the input & output specification.
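The skeleton above shows empty sections; a hypothetical filled-in job might look like the following, where every key inside the sections is invented for illustration rather than taken from GoPro's actual schema.
# Hypothetical filled-in job; section keys are illustrative assumptions.
gopro.dse.config.etl {
  mobile-job {
    conf { spark.executor.memory = "8g" }
    process { writer = "JSONTableWriter" }
    input { path = "s3a://data-bucket/mobile/raw/", format = "json" }
    output { table = "analytics.mobile_events", format = "parquet" }
    post.process { notify = ["slack"] }
  }
}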
30. DATA TRANSFORMATION
• Hive SQL over JDBC via Beeline
• Suitable for non-Java/Scala/Python programmers
• Spark Job
• Requires Spark and Scala knowledge; need to set up the job, configurations, etc.
• Dynamic Scala Scripts
• Scala as a script: compile Scala at runtime, mixed with Spark SQL
31. SCALA SCRIPTS EXAMPLES -- ONE SCALA SCRIPT FILE
class CameraAggCaptureMainJob extends SparkJob {
  def run(sc: SparkContext, jobType: String, jobName: String, config: Config): Unit = {
    val sqlContext: SQLContext = HiveContextFactory.getOrCreate(sc)
    val cameraCleanDataSchema = ... // define the DataFrame schema
    val cameraCleanDataStageDF = sqlContext.read.schema(cameraCleanDataSchema)
      .json("s3a://databucket/camera/work/production/clean-events/final/*")
    cameraCleanDataStageDF.createOrReplaceTempView("camera_clean_data")
    // Hive settings for the insert below, issued one statement per call
    sqlContext.sql("set hive.exec.dynamic.partition.mode=nonstrict")
    sqlContext.sql("set hive.enforce.bucketing=false")
    sqlContext.sql("set hive.auto.convert.join=false")
    sqlContext.sql("set hive.merge.mapredfiles=true")
    sqlContext.sql("""insert overwrite table work.camera_setting_shutter_dse_on
      select row_number() over (partition by metadata_file_name order by log_ts), ...""")
    // rest of code
  }
}
new CameraAggCaptureMainJob
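The script's last expression is `new CameraAggCaptureMainJob`, which suggests the framework compiles the file at runtime and evaluates it to obtain a job instance. A minimal sketch of that pattern using Scala's reflection ToolBox (which requires scala-compiler on the classpath) is shown below; this is an assumption about the mechanism, not GoPro's actual loader.
import scala.reflect.runtime.universe
import scala.tools.reflect.ToolBox

// Sketch: compile a .scala script at runtime and evaluate it to a job instance.
// SparkJob here is the framework's job trait; paths and names are illustrative.
object ScriptRunner {
  def loadJob(scriptSource: String): SparkJob = {
    val toolbox = universe.runtimeMirror(getClass.getClassLoader).mkToolBox()
    // The script's last expression is `new CameraAggCaptureMainJob`, so eval returns that instance.
    toolbox.eval(toolbox.parse(scriptSource)).asInstanceOf[SparkJob]
  }
}
// Usage (paths illustrative):
// val source = scala.io.Source.fromFile("jobs/CameraAggCaptureMainJob.scala").mkString
// val job = ScriptRunner.loadJob(source)
// job.run(sc, "camera", "camera-agg-capture", config)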
40. SLACK METRICS DELIVERY
• Why Slack?
• Push vs. Pull -- easy access
• Avoid another login when viewing metrics: once connected to Slack, you are already logged in
• Put metrics generation into the software engineering process
• SQL code is under source control
• The publishing job is scheduled and its performance is monitored
• Discussion/questions/comments on a specific metric can happen directly in the channel with the people involved
41. SLACK DELIVERY FRAMEWORK
• Slack Metrics Delivery Framework
• Configuration Driven
• Multiple private Channels : Mobile/Cloud/Subscription/Web etc.
• Daily/Weekly/Monthly Delivery and comparison
• New metrics can be added easily with new SQL and configurations
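A hypothetical metric definition for such a configuration-driven framework might look like the following; the field names, channel, and SQL are invented for illustration, not GoPro's actual schema.
# Hypothetical metric definition; all keys, channel names, and SQL are illustrative.
metrics.mobile-dau {
  channel  = "#mobile-metrics"            # private channel per team
  schedule = "daily"                      # daily / weekly / monthly
  compare  = ["day-over-day", "week-over-week"]
  sql = """
    SELECT event_date, count(distinct user_id) AS dau
    FROM analytics.mobile_events
    WHERE event_date >= date_sub(current_date(), 30)
    GROUP BY event_date
  """
}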
43. SLACK KPI DELIVERY ARCHITECTURE
(Diagram) A Slack Spark job saves the metrics to a Hive table and generates metrics JSON. The job sends Plot.ly JSON via HTTP POST to a Rest API server, which generates the graph, saves the image, and returns the image URL. The job then builds the Slack message JSON with that image URL and posts it to the channels via webhooks (HTTP POST).
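A minimal sketch of the final hop, posting the rendered chart to a Slack incoming webhook, is below. The webhook URL, message fields, and image URL are placeholders, and the JSON is built naively for brevity; a real job would use a JSON library.
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

// Sketch of the last step: post a message (with the chart's image URL) to a Slack
// incoming webhook. Requires Java 11+; URL and payload values are placeholders.
object SlackPublisher {
  private val client = HttpClient.newHttpClient()

  def post(webhookUrl: String, text: String, imageUrl: String): Int = {
    // naive JSON construction for illustration only
    val payload = s"""{"text": "$text", "attachments": [{"image_url": "$imageUrl"}]}"""
    val request = HttpRequest.newBuilder(URI.create(webhookUrl))
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(payload))
      .build()
    client.send(request, HttpResponse.BodyHandlers.ofString()).statusCode()
  }
}
// SlackPublisher.post("https://hooks.slack.com/services/T000/B000/XXXX",
//                     "Daily mobile DAU", "https://plot.ly/~team/123.png")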
44. SLACK DELIVERY BENEFITS
• Pros:
• Quick and easy access via Slack
• Can quickly deliver to engineering managers, executives, business owners, and product managers
• 100+ members have subscribed to the different channels since we launched the service
• Cons:
• Limited by Slack UI real estate: we can only display key metrics in a two-column format, so it is only suitable for high-level summary metrics
47. FEATURE VISUALIZATION
• Explore Feature Visualization via Google Facets
• Part 1 : Overview
• Part 2: Dive
• What is Facets Overview ?
48. FACETS OVERVIEW INTRODUCTION
• From Facets Home Page
• https://pair-code.github.io/facets/
• "Facets Overview "takes input feature data from any number of datasets, analyzes them feature by
feature and visualizes the analysis.
• Overview can help uncover issues with datasets, including the following:
• Unexpected feature values
• Missing feature values for a large number of examples
• Training/serving skew
• Training/test/validation set skew
• Key aspects of the visualization are outlier detection and distribution comparison across multiple
datasets.
• Interesting values (such as a high proportion of missing data, or very different distributions of a
feature across multiple datasets) are highlighted in red.
• Features can be sorted by values of interest such as the number of missing values or the skew
between the different datasets.
50. FACETS OVERVIEW IMPLEMENTATIONS
• The Facets Overview implementation consists of
• Feature Statistics Protocol Buffer definition
• Feature Statistics Generation
• Visualization
• Visualization
• The visualizations are implemented as Polymer web components, backed
by Typescript code
• It can be embedded into Jupyter notebooks or webpages.
• Feature Statistics Generation
• There are two implementations of the stats generation: Python and JavaScript
• Python: uses NumPy and pandas to generate the stats
• JavaScript: generates the stats in JavaScript
• Both implementations run the stats generation on a single node (in the notebook or browser)
52. FEATURE OVERVIEW SPARK
• Initial exploration attempt
• Is it possible to generate stats for larger datasets while keeping the stats size small?
• Can we generate the stats by leveraging the distributed computing capability of Spark instead of just using one node?
• Can we generate the stats in Spark and then use them from Python and/or JavaScript?
54. PREPARE SPARK DATA FRAME
case class NamedDataFrame(name:String, data: DataFrame)
val features = Array("Age", "Workclass", ….)
val trainData: DataFrame = loadCSVFile("./adult.data.csv")
val testData = loadCSVFile("./adult.test.txt")
val train = trainData.toDF(features: _*)
val test = testData.toDF(features: _*)
val dataframes = List(NamedDataFrame(name = "train", train),
NamedDataFrame(name = "test", test))
55. SPARK FACETS STATS GENERATOR
val generator = new FeatureStatsGenerator(DatasetFeatureStatisticsList())
val proto = generator.protoFromDataFrames(dataframes)
persistProto(proto)
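persistProto is likewise a helper from the deck; a plausible sketch (an assumption, not the actual code) that writes both the protobuf binary and the Base64-encoded string mentioned later in the findings:
import java.nio.file.{Files, Paths}
import java.util.Base64

// Plausible sketch of persistProto: serialize the generated DatasetFeatureStatisticsList
// proto (import path depends on the protobuf codegen used) and write it both as a binary
// file (loadable from Python) and as a Base64 string for the Facets web component.
def persistProto(proto: DatasetFeatureStatisticsList,
                 binPath: String = "stats.pb",
                 b64Path: String = "stats.b64"): Unit = {
  val bytes = proto.toByteArray // ScalaPB and Java protobuf both expose toByteArray
  Files.write(Paths.get(binPath), bytes)
  Files.write(Paths.get(b64Path), Base64.getEncoder.encode(bytes))
}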
59. INITIAL FINDINGS
• Implementation
• The first-pass implementation is not efficient
• We have to make multiple passes over each feature; as the number of features increases, performance suffers, which limits the number of features that can be used
• The size of the dataset used to generate the stats also determines the size of the generated protobuf file
• I haven't dug deeper into what exactly contributes to the change in size
• The combination of data size and feature count can produce a large file that won't fit in the browser
• With Spark DataFrames, we can't support TensorFlow Records
• The Base64-encoded protobuf string can be loaded by both Python and JavaScript
• The protobuf binary file can also be loaded by Python
• But somehow it cannot be loaded by JavaScript
60. WHAT’S NEXT?
• Improve the implementation's performance
• When we have a lot of data and features, what is the right dataset size to generate a stats file small enough to load into a browser or notebook?
• For example, in one experiment, 300 features produced a roughly 200 MB stats file
• How do we efficiently partition the features so that they remain viewable?
• Data is changing: how can we incrementally update the stats on a regular basis?
• How do we integrate this into production?
62. FINAL THOUGHTS
• We are still in the early stage of Data Platform Evolution.
• We will continue to share our experiences with you along the way.
• Questions?
Thank You
Data Science & Engineering
GoPro
Editor's Notes
High Level Architecture of Data Platform
Isolation of workloads 3 clusters (ingest, ETL, delivery)
Lambda architecture
Input and output data formats
Cadence of clusters
A word about Data Sources:
IoT data
Logs from devices, applications (desktop and mobile), external systems and services, ERP, web/email marketing, etc.
Some Raw and Gzip, Some Binary and JSON
Some streaming and some batch
Batch data
Web marketing, campaigns
Social media
ERP
CRM
Lambda architecture
Both batch and stream processing
Basic needs/workloads in a Data Platform
High throughput ingestion
Transformations: joins, aggregations, etc.
Fast queries
Today, we have 3 clusters to isolate these workloads
We started with one cluster, ETL
Everything ran there
Ingest (Flume)
Batch (Framework)
ETL (Hive)
Analytical (Impala)
Lots of resource contention (I/O, memory, cores)
To alleviate the resource contention, we opted for 3 clusters to isolate the workloads.
Ingest cluster for near real-time streaming
Kafka, Spark Streaming (Cloudera Parcels)
Input: Logs, Output: JSON
Minutes cadence
Moving towards more real-time in seconds
Induction framework for scheduled batch ingestion
ETL cluster for heavy duty aggregation
Input: JSON flat files, Output: Aggregated Parquet files
Hive (Map/Reduce)
Hourly cadence
Secure Data Mart
Kerberos, LDAP, Active Directory, Apache Sentry (Cloudera committers)
Input: Compressed Parquet files
Analytical SQL engine for Tableau, ad-hoc queries (Hue), data wrangling (Trifacta), and data science (Jupyter Notebooks and RStudio)
With all that said, we will examine the newer technologies that will enable us to simplify our architecture and merge clusters in the future.
Kudu is one possible new technology that could help us to consolidate some of the clusters.
Let’s take a deeper dive into our streaming ingestion…
Logs are streamed from devices and software applications (desktop and mobile) to web service endpoint
Endpoint is an elastic pool of Tomcat servers sitting behind ELB in AWS
Custom servlet pushes logs into Kafka topics by environment
A series of Spark streaming jobs process the logs from Kafka
Landing place in ingestion cluster is HDFS with JSON flat files
Rationalization of tech stacks…
Why Kafka?
Unrivaled write throughput for a queue
Traditional queue throughput: 100K writes/sec on the biggest box you can buy
Kafka throughput: 1M writes/sec on 3-4 commodity servers
Strong ordering policy of messages
Distributed
Fault-tolerant through replication
Support synchronous and asynchronous writes
Pairs nicely with Spark Streaming for simpler scaling out (Kafka topic partitions map directly to Spark RDD partitions/tasks)
Why Spark Streaming?
Strong transactional semantics - "exactly once" processing
Leverage Spark technology for both data ingest and analytics
Horizontally scalable - High throughput for micro-batching
Large open source community
Keyword: Impedance mismatch
As previously stated, logs are streamed from devices and software applications (desktop and mobile) to web service endpoint
Logs are diverse: gzipped, raw, binary, JSON, batched events, streamed single events
Vary significantly in size from < 1 KB to > 1 MB
Logs are redirected based on data category and routed to appropriate Kafka topic and respective Spark Streaming job
Logs move from Kafka topic to Kafka topic with each Kafka topic having a Spark Streaming job that consumes the log, processes the log, and writes the log to another topic
Tree like structure of jobs with more generic logic towards the root of the tree and more specialized logic moving towards the leaf nodes
There are generic jobs/services and specialized jobs/services
Generic services include PII removal and hashing, IP to Geo lookups, and batched writing to HDFS
We perform batched HDFS writing since Kafka likes small messages (1 KB ideal) and HDFS likes large files (100+ MB)
Specialized services contain business logic
Finally, the logs are written into HDFS as JSON flat files (which are sometimes compressed depending on the type of data)
Scheduled ETL jobs perform a distributed copy (distcp) to move the data to the ETL cluster for further heavier aggregations
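A minimal sketch of one hop in that chain, a Spark Streaming job that consumes a Kafka topic and batches the messages into large files, is shown below. It uses the spark-streaming-kafka-0-10 integration; topic names, broker addresses, and output paths are placeholders, and the generic processing step is left as a simple map.
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

// Sketch of a generic log-processing hop: Kafka topic -> Spark Streaming -> batched files.
val conf = new SparkConf().setAppName("generic-log-processor")
val ssc = new StreamingContext(conf, Seconds(60)) // minutes-level cadence, as in the notes

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "kafka:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "log-processor",
  "auto.offset.reset" -> "latest")

// One Kafka topic partition maps to one Spark partition, so parallelism follows the topic.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("raw-logs"), kafkaParams))

stream.map(_.value())                 // generic processing step (placeholder)
  .foreachRDD { (rdd, time) =>
    if (!rdd.isEmpty())               // batch small Kafka messages into large files for HDFS/S3
      rdd.saveAsTextFile(s"hdfs:///landing/raw-logs/batch-${time.milliseconds}")
  }

ssc.start()
ssc.awaitTermination()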
A few things…
Two flows of data: streaming and batch
Join data sources
Aggregate data sources
Convert to compressed columnar format (gzipped Parquet files)
On the ETL cluster…
Here’s where we do our heavy lifting.
Almost entirely all Hive Map Reduce jobs
Some Impala to make the really big gnarly aggregations more performant
Previously, had a custom Java Map Reduce job for sessionization of events
This has been replaced with a Spark Streaming job on the ingestion cluster
In the future, want to push as much of the ETL processing back into the ingestion cluster for more real-time processing
We also have a custom Java Induction framework which ingests data from external services that only make data available on slower schedules (daily, twice daily, etc.)
The output from the ETL cluster is Parquet files that are added to partitioned managed tables in the Hive metastore.
The Parquet files are then copied via distcp to the Secure Data Mart.
Parquet files are copied from the ETL cluster and added to partitioned managed tables in the Hive Metastore of the Secure Data Mart.
The Secure Data Mart is protected with Apache Sentry.
Kerberos is used for authentication. Corporate Standard
Active Directory stores the groups. Corporate Standard
Access control is role based and the roles are assigned with Sentry.
Hue has a Sentry UI app to manage authorization.
Store data in one place: data (S3) + structure (Hive Metastore)
Separate compute nodes from storage nodes
Elasticity: size of clusters and number of clusters
Lower operational overhead of maintaining HDFS storage nodes
Promote Kafka to be a centralized service (data hub)