gobblin-meetup-yarn

The document discusses running Gobblin, an open source data ingestion framework, on YARN. It provides an overview of the motivations and architecture when running Gobblin on YARN, including better resource utilization, support for Gobblin as a continuous long-running service, and better fit for streaming ingestion. Key implementation details covered include the use of Apache Helix for distributed task execution and coordination, log aggregation, and security/token management.

A Preview of Gobblin on Yarn

Yinan Li

Data Analytics Infrastructure @ LinkedIn

Agenda

•  Motivations
•  Architecture Overview
•  Implementation Notes
– The Role of Apache Helix
– Log Compaction
– Security and Token Management
•  Deployment @ LinkedIn
•  Future Work

Why Gobblin on Yarn

•  Better resource utilization
– Sharing of containers
– Better control over container provisioning
– Better container life cycle management
•  Supports Gobblin as a continuous long-running service
•  Better fit for streaming ingestion

The Role of Apache Helix

•  Distributed task execution framework
– Automatic task assignment and rebalancing
•  Coordination between the AM and containers
– Through ZooKeeper
•  Messaging between components
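
The automatic task assignment and rebalancing listed above can be pictured with a minimal sketch. This is plain Python, not the Helix API: the function names `assign_tasks` and `rebalance`, the round-robin policy, and the "fewest tasks first" rule are all assumptions made for illustration, and the slides do not say which policy Gobblin or Helix actually uses.

```python
from collections import defaultdict


def assign_tasks(tasks, containers):
    """Spread tasks over containers round-robin (illustrative policy only)."""
    if not containers:
        raise ValueError("at least one container is required")
    assignment = defaultdict(list)
    for i, task in enumerate(sorted(tasks)):
        assignment[containers[i % len(containers)]].append(task)
    return dict(assignment)


def rebalance(assignment, live_containers):
    """Reassign tasks from containers that are no longer live.

    Tasks on surviving containers stay where they are; orphaned tasks go to
    whichever live container currently holds the fewest tasks.
    """
    if not live_containers:
        raise ValueError("no live containers to rebalance onto")
    new_assignment = {c: list(assignment.get(c, [])) for c in live_containers}
    orphaned = sorted(
        task
        for container, tasks in assignment.items()
        if container not in new_assignment
        for task in tasks
    )
    for task in orphaned:
        target = min(live_containers, key=lambda c: len(new_assignment[c]))
        new_assignment[target].append(task)
    return new_assignment
```

In this sketch, losing one of three containers moves only that container's tasks, so work already running elsewhere is left undisturbed. A real deployment would rely on Helix, with cluster state kept in ZooKeeper, rather than on in-memory dictionaries.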

Log Aggregation

•  Containers are log sources
•  Logs get streamed to HDFS and further to the driver

[Diagram: Client/Driver, ApplicationMaster, two Containers, HDFS]
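
The flow on this slide (containers emit logs, a shared store holds them, the driver reads them back) can be sketched as follows. This is only an illustration under stated assumptions: a local temporary directory stands in for HDFS, and `stream_container_log` and `collect_for_driver` are hypothetical names, not Gobblin or Hadoop functions.

```python
import tempfile
from pathlib import Path


def stream_container_log(sink_dir, container_id, lines):
    """Append a container's log lines to its own file in the shared sink."""
    path = Path(sink_dir) / f"{container_id}.log"
    with path.open("a", encoding="utf-8") as f:
        for line in lines:
            f.write(line + "\n")
    return path


def collect_for_driver(sink_dir):
    """Read every container log from the sink, tagged by container id."""
    collected = []
    for path in sorted(Path(sink_dir).glob("*.log")):
        with path.open(encoding="utf-8") as f:
            for line in f:
                collected.append((path.stem, line.rstrip("\n")))
    return collected


if __name__ == "__main__":
    # The temporary directory is a stand-in for an HDFS location (assumption).
    with tempfile.TemporaryDirectory() as sink:
        stream_container_log(sink, "container_01", ["task 1 started"])
        stream_container_log(sink, "container_02", ["task 2 started"])
        for container_id, line in collect_for_driver(sink):
            print(f"[{container_id}] {line}")
```

Writing one file per container keeps the log sources independent of each other, and the driver only needs read access to the shared location.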

Security and Token Management

[Diagram: Client/Driver, ApplicationMaster, two Containers, HDFS; labels: token, keytab]

Deployment @ LinkedIn

•  Dark launch for a few data sources
– Running side by side with production instances running on MR
•  Planned to migrate more data sources in Q1 2016

Future Work

•  AM and container restart handling
•  Log retention management
•  Monitoring and reporting
•  Runtime cluster resizing

Thank You

•  https://github.com/linkedin/gobblin/wiki/Gobblin-on-Yarn
•  https://groups.google.com/forum/#!forum/gobblin-users


Yaron Haviv, Iguaz.io - OpenStack and BigData - OpenStack Israel 2015

Cloud Native Day Tel Aviv

Microservice message routing on Kubernetes

Frans van Buul

Building real time data-driven products

Lars Albertsson

This presentation will describe how to go beyond a "Hello world" stream application and build a real-time data-driven product. We will present architectural patterns, go through tradeoffs and considerations when deciding on technology and implementation strategy, and describe how to put the pieces together. We will also cover necessary practical pieces for building real products: testing streaming applications, and how to evolve products over time. Presented at highloadstrategy.com 2016 by Øyvind Løkling (Schibsted Products & Technology), joint work with Lars Albertsson (independent, www.mapflat.com).

Webinar Alpakka 2018-08-16

Enno Runne

As the number of systems within an IT infrastructure increases, the number of integrations needed by enterprises also multiplies. Recognizing that the old times of overnight file exchanges are no longer meeting real-time demands, a well-organized enterprise integration strategy is a critical success factor when your systems need to be connected all day. In this webinar with Enno Runne, Tech Lead for Alpakka at Lightbend, Inc., we’ll look at why integrations should be viewed as streams of data, and how Alpakka—a Reactive Enterprise Integration library for Java and Scala based on Reactive Streams and Akka—fits perfectly for today’s demands on system integrations. Specifically, we will review: How Alpakka brings streaming data flows directly to the surface, utilizing the features of Akka to tame the complexity of streams. Supported connectors for Amazon Web Services, Microsoft Azure, and Google Cloud, as well as others for event sourcing/persistence/DB technologies and traditional interfaces like FTP, HTTP, etc. A deeper look into the use cases for Alpakka’s most utilized interfaces to popular technologies like Apache Kafka, MQTT, and MongoDB. https://info.lightbend.com/webinar-pakk-your-alpakka-reactive-streams-integrations-for-aws-azure-google-cloud-recording.html

Pakk Your Alpakka: Reactive Streams Integrations For AWS, Azure, & Google Cloud

Lightbend

As the number of systems within an IT infrastructure increases, the number of integrations needed by enterprises also multiplies. Recognizing that the old times of overnight file exchanges are no longer meeting real-time demands, a well-organized enterprise integration strategy is a critical success factor when your systems need to be connected all day. In this webinar with Enno Runne, Tech Lead for Alpakka at Lightbend, Inc., we’ll look at why integrations should be viewed as streams of data, and how Alpakka—a Reactive Enterprise Integration library for Java and Scala based on Reactive Streams and Akka—fits perfectly for today’s demands on system integrations. Specifically, we will review: * How Alpakka brings streaming data flows directly to the surface, utilizing the features of Akka to tame the complexity of streams. * Supported connectors for Amazon Web Services, Microsoft Azure, and Google Cloud, as well as others for event sourcing/persistence/DB technologies and traditional interfaces like FTP, HTTP, etc. * A deeper look into the use cases for Alpakka’s most utilized interfaces to popular technologies like Apache Kafka, MQTT, and MongoDB.

12-Step Program for Scaling Web Applications on PostgreSQL

Konstantin Gredeskoul

gobblin-meetup-yarn

1. A Preview of Gobblin on Yarn. Yinan Li, Data Analytics Infrastructure @ LinkedIn

2. Agenda • Motivations • Architecture Overview • Implementation Notes – The Role of Apache Helix – Log Compaction – Security and Token Management • Deployment @ LinkedIn • Future Work

3. Why Gobblin on Yarn • Better resource utilization – Sharing of containers – Better control over container provisioning – Better container life cycle management • Supports Gobblin as a continuous long-running service • Better fit for streaming ingestion

4. Architecture Overview

5. The Role of Apache Helix • Distributed task execution framework – Automatic task assignment and rebalancing • Coordination between the AM and containers – Through ZooKeeper • Messaging between components
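The "automatic task assignment and rebalancing" bullet can be illustrated with a toy sketch. This is not the Helix API: the function names `assign_tasks` and `rebalance`, the round-robin policy, and the container and task names are all assumptions made for illustration. It only shows the general idea that when the set of live containers changes, every task is reassigned to a container that is still alive.

```python
from collections import defaultdict


def assign_tasks(tasks, containers):
    """Spread tasks round-robin over the live containers (hypothetical policy)."""
    if not containers:
        raise ValueError("no live containers to assign tasks to")
    live = sorted(containers)
    assignment = defaultdict(list)
    for i, task in enumerate(sorted(tasks)):
        assignment[live[i % len(live)]].append(task)
    return dict(assignment)


def rebalance(assignment, live_containers):
    """Recompute the assignment after containers join or leave."""
    tasks = [task for owned in assignment.values() for task in owned]
    return assign_tasks(tasks, live_containers)


# Hypothetical names: six tasks, three containers, then one container is lost.
tasks = [f"task-{i}" for i in range(6)]
before = assign_tasks(tasks, ["container-1", "container-2", "container-3"])
after = rebalance(before, ["container-1", "container-3"])
```

In a real deployment the membership change would be observed through ZooKeeper rather than passed in by hand, and the placement policy would be whatever the framework is configured to use.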

6. Log Aggregation • Containers are log sources • Logs get streamed to HDFS and further to the driver (Diagram: Client/Driver, ApplicationMaster, two Containers, HDFS)
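The log flow on this slide (containers to HDFS, then HDFS to the driver) can be sketched with local directories standing in for HDFS. Everything here is an assumption for illustration: the function names, the directory layout, the `gobblin.log` file name, and the use of a one-shot batch copy in place of continuous streaming. It is not how Gobblin implements log aggregation.

```python
import shutil
import tempfile
from pathlib import Path


def aggregate_logs(container_dirs, shared_dir):
    """Copy each container's *.log files into a per-container folder under shared_dir."""
    copied = []
    for cdir in container_dirs:
        target = Path(shared_dir) / Path(cdir).name
        target.mkdir(parents=True, exist_ok=True)
        for log in sorted(Path(cdir).glob("*.log")):
            copied.append(Path(shutil.copy2(log, target / log.name)))
    return copied


def pull_to_driver(shared_dir, driver_dir):
    """Mirror the aggregated log tree to the driver side."""
    return Path(shutil.copytree(shared_dir, driver_dir, dirs_exist_ok=True))


def demo():
    """Create two fake containers, aggregate their logs, and pull them to the driver."""
    with tempfile.TemporaryDirectory() as root:
        root = Path(root)
        containers = []
        for name in ("container_01", "container_02"):
            cdir = root / "local" / name
            cdir.mkdir(parents=True)
            (cdir / "gobblin.log").write_text(f"hello from {name}\n")
            containers.append(cdir)
        aggregate_logs(containers, root / "hdfs_standin")
        driver = pull_to_driver(root / "hdfs_standin", root / "driver")
        return sorted(p.relative_to(driver).as_posix() for p in driver.rglob("*.log"))


result = demo()
```

Keeping one folder per container preserves which container produced each log line once the files reach the driver.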

7. Security and Token Management (Diagram: Client/Driver, ApplicationMaster, two Containers, HDFS; labels: token, keytab)

8. Deployment @ LinkedIn • Dark launch for a few data sources – Running side by side with production instances running on MR • Planned to migrate more data sources in Q1 2016

9. Future Work • AM and container restart handling • Log retention management • Monitoring and reporting • Runtime cluster resizing

10. Thank You • https://github.com/linkedin/gobblin/wiki/Gobblin-on-Yarn • https://groups.google.com/forum/#!forum/gobblin-users
