Sputnik: Airbnb’s Apache Spark Framework for Data Engineering

•

2 likes•632 views

Apache Spark is a general-purpose big data execution engine. You can work with different data sources with the same set of API in both batch and streaming mode.

Data & Analytics

Sputnik: Airbnb’s Apache Spark
framework for data engineering
Egor Pakhomov
Software Engineer,

Typical Spark job
ds=2010-01-01
ds=2010-01-02
ds=2010-01-03
…
ds=2010-01-01
ds=2010-01-02
…
…Spark Application

https://ogirardot.files.wordpress.com/2015/05/future-of-spark.png
DataFrame API
Spark Core
PackagesGraphXMLlib
Spark
Streaming
Spark
SQL
DataSource API

Job logic vs Run Logic
▪ Job logic
▪ Job does some business logic (for example counting unique visits for every url)
▪ Job specifies
▪ Input tables and output tables
▪ Partitioning schema
▪ Validation for result data
▪ Run logic
▪ Running job for specific date retrieves input only for that date from input table
▪ Job tries to write to table, which does not exists, so we need to create the table
▪ Job runs in testing mode, so all result tables are created with “_testing” suffix

Job logic vs Run Logic
Console parameters
--ds : 2020-01-01
--writeEnv : DEV
Sputnik
TableReader
TableWriter
Hive
Input Table
Result Table
Sputnik job
Get Data
Business logic
Write Result

Writing data
What Sputnik HiveTableWriter does:
▪ creates table with “CREATE TABLE” hive statement, if table does not exist
▪ updates table metainformation
▪ normalize dataframe schema according to output Hive table
▪ repartitions and tries to reduce number of result files on disk
▪ runs the checks on result, before writing it
▪ changes output table name (staging/testing mode)

Reading data
Reading dataframe
Reading dataset

Testing
▪ Singleton Spark Session
▪ DataFrame comparison
▪ Loading data from csv/json
▪ Cleaning hive between runs

Backfilling
tHive Table
2020-01-01 2020-01-02 2020-01-03 2020-01-04 2020-01-05 2020-01-06 2020-01-07 2020-01-08
Daily job
--ds 2020-01-08
Daily job
--ds 2020-01-07

Managing environments
database.input_table
database.output_table
Sputnik
Production environment
database.input_table
database.output_table_dev
Sputnik
Developer environment

Managing environments
--writeEnv PROD
--writeEnv DEV
--writeEnv STAGE
database.input_table
database.input_table_dev
database.input_table_staging

https://github.com/airbnb/sputnik
egor.pakhomov@airbnb.com

The document describes the configuration of a Dynamic Multipoint Virtual Private Network (DMVPN) using three phases. Phase 1 establishes IPsec and IKE tunnels between the hub router and spoke routers using EIGRP routing. Phase 2 optimizes the configuration by removing split horizon and enabling next hop self. Phase 3 enables features like NHRP redirect and shortcut to optimize network traffic flow.

PySpark in practice slides

Dat Tran

Apache Spark in Depth: Core Concepts, Architecture & Internals

Anton Kirillov

Why and How to Interconnect IXP

Internet Society

This document discusses interconnecting internet exchange points (IXPs) through either layer 2 or layer 3 connections. It outlines the benefits of interconnecting IXPs such as creating more attractiveness through additional peering and route options for members. However, it also notes potential disadvantages like difficulty troubleshooting issues and controlling routes. The document provides recommendations on setting policies for costs and usage limits between interconnected IXPs, and describes the technical setups required for either layer 2 or layer 3 interconnections.

Comparison of SRv6 Extensions uSID, SRv6+, C-SRH

Kentaro Ebisawa

Managing Apache Spark Workload and Automatic Optimizing

Databricks

eBay is highly using Spark as one of most significant data engines. In data warehouse domain, there are millions of batch queries running every day against 6000+ key DW tables, which contains over 22PB data (compressed) and still keeps booming every year. In machine learning domain, it is playing a more and more significant role. We have introduced our great achievement in migration work from MPP database to Apache Spark last year in Europe Summit. Furthermore, from the vision of the entire infrastructure, it is still a big challenge on managing workload and efficiency for all Spark jobs upon our data center. Our team is leading the whole infrastructure of big data platform and the management tools upon it, helping our customers -- not only DW engineers and data scientists, but also AI engineers -- to leverage on the same page. In this session, we will introduce how to benefit all of them within a self-service workload management portal/system. First, we will share the basic architecture of this system to illustrate how it collects metrics from multiple data centers and how it detects the abnormal workload real-time. We develop a component called Profiler which is to enhance the current Spark core to support customized metric collection. Next, we will demonstrate some real user stories in eBay to show how the self-service system reduces the efforts both in customer side and infra-team side. That's the highlight part about Spark job analysis and diagnosis. Finally, some incoming advanced features will be introduced to describe an automatic optimizing workflow rather than just alerting. Speaker: Lantao Jin

Fine Tuning and Enhancing Performance of Apache Spark Jobs

Databricks

PySpark dataframe

Jaemun Jung

This document discusses PySpark DataFrames. It notes that DataFrames can be constructed from various data sources and are conceptually similar to tables in a relational database. The document explains that DataFrames allow richer optimizations than RDDs due to avoiding context switching between Java and Python. It provides links to resources that demonstrate how to create DataFrames, perform queries using DataFrame APIs and Spark SQL, and use an example flight data DataFrame.

Golang-powered open source IoT project Flogo to build ultra-lightweight integration microservices. The Internet of Things (IoT) brings up 50 billion devices until 2020, which have to be connected somehow. Challenges include low bandwidth, high latency, non-reliable connectivity and the need for low network costs. Therefore, a gateway is needed remotely on site of the devices to filter, aggregate and send just relevant data into the cloud or data center. This session introduces project Flogo: A 100% open source framework, which allows developing ultra lightweight IoT integration applications with a zero-coding web user interface or design chat bot. Coders can also rely just on code, of course. It is written in Google’s Go programming language and 20-50x more lightweight than similar Java or JavaScript frameworks. Therefore building very lightweight microservices independent of IoT is another good use case for this framework, e.g. for serverless architectures using open source frameworks such as OpenWhisk. The session focuses on live demos and shows how to build microservices and integrate IoT devices using standards such as MQTT, WebSockets, CoaP or REST. The last part of the session compares Project Flogo to other open source IoT projects like Node-RED and SaaS offerings such as AWS IoT. Please use the Flogo community to discuss or ask questions: https://community.tibco.com/products/project-flogo Video recording of these slides: https://youtu.be/-ThK6BZdoxw

Deep Dive into the New Features of Apache Spark 3.0

Databricks

Bgp tutorial for ISP

Wahyu Nasution

This document provides an overview of BGP (Border Gateway Protocol) basics and configuration for internet service providers. It discusses BGP attributes, path selection, and applying routing policies. The key points covered include the purpose of BGP in exchanging routing information between autonomous systems, BGP neighbor configuration for internal and external peers, and using attributes like AS path, local preference, communities to influence best path selection.

Dive into PySpark

Mateusz Buśkiewicz

Hadoop vs Apache Spark

ALTEN Calsoft Labs

MPLS + BGP Presentation

Gino McCarty

This document provides an overview of Multiprotocol Label Switching (MPLS), including its history, key concepts, applications, and use by service providers. MPLS was developed in the late 1990s to meet the needs of scalable routing and quality of service on the growing internet. It works by assigning fixed length labels to data packets, allowing routers to forward based on these labels rather than long network addresses. Major applications of MPLS include traffic engineering, virtual private networks, and bandwidth management. The document discusses how service providers like MegaPath use MPLS in their backbones to provide integrated data and voice services, and nationwide networking solutions for corporate customers.

PySpark Programming | PySpark Concepts with Hands-On | PySpark Training | Edu...

Edureka!

Improving Apache Spark by Taking Advantage of Disaggregated Architecture

Databricks

Shuffle in Apache Spark is an intermediate phrase redistributing data across computing units, which has one important primitive that the shuffle data is persisted on local disks. This architecture suffers from some scalability and reliability issues. Moreover, the assumptions of collocated storage do not always hold in today's data centers. The hardware trend is moving to disaggregated storage and compute architecture for better cost efficiency and scalability. To address the issues of Spark shuffle and support disaggregated storage and compute architecture, we implemented a new remote Spark shuffle manager. This new architecture writes shuffle data to a remote cluster with different Hadoop-compatible filesystem backends. Firstly, the failure of compute nodes will no longer cause shuffle data recomputation. Spark executors can also be allocated and recycled dynamically which results in better resource utilization. Secondly, for most customers currently running Spark with collocated storage, it is usually challenging for them to upgrade the disks on every node to latest hardware like NVMe SSD and persistent memory because of cost consideration and system compatibility. With this new shuffle manager, they are free to build a separated cluster storing and serving the shuffle data, leveraging the latest hardware to improve the performance and reliability. Thirdly, in HPC world, more customers are trying Spark as their high performance data analytics tools, while storage and compute in HPC clusters are typically disaggregated. This work will make their life easier. In this talk, we will present an overview of the issues of the current Spark shuffle implementation, the design of new remote shuffle manager, and a performance study of the work.

Introduccion al direccionamiento IPv6

ZONA 802

MENOG-Segment Routing Introduction

Rasoul Mesghali, CCIE RS

Presentation on arp protocol

Mohd. Ahmad Siddiqi

The document discusses address resolution protocol (ARP) which maps logical IP addresses to physical MAC addresses on a local area network. It explains that ARP broadcasts a request to find the MAC address associated with a given IP address, and the device with that IP address responds with its MAC. This dynamic address mapping is stored in an ARP cache for future use. It also describes how different network protocols may use ARP or similar methods to perform address mapping between logical and physical addresses.

Presto on Apache Spark: A Tale of Two Computation Engines

Databricks

The architectural tradeoffs between the map/reduce paradigm and parallel databases has been a long and open discussion since the dawn of MapReduce over more than a decade ago. At Facebook, we have spent the past several years in independently building and scaling both Presto and Spark to Facebook scale batch workloads, and it is now increasingly evident that there is significant value in coupling Presto’s state-of-art low-latency evaluation with Spark’s robust and fault tolerant execution engine.

Building DataCenter networks with VXLAN BGP-EVPN

Cisco Canada

The session specifically covers the requirements and approaches for deploying the Underlay, Overlay as well as the inter-Fabric connectivity of Data Center Networks or Fabrics. Within the VXLAN BGP-EVPN based Overlay, we focus on the insights like forwarding and control plane functions which are critical to the simplicity operation of the architecture in achieving scale, small failure domains and consistent configuration. To complete the overlay view on VXLAN BGP-EVPN, we are going to the insides of BGP and its EVPN address-familiy and extend to about how multiple DC Fabric can be interconnected within, either as stretched Fabrics or with true DCI. The session concludes with a brief overview of manageability functions, network orchestration capabilities and multi-tenancy details. This Advanced session is intended for network, design and operation engineers from Enterprises to Service Providers.

SRv6 Network Programming: deployment use-cases

APNIC

The document discusses SRv6 (Segment Routing IPv6) network programming and deployment use cases. It provides an overview of SRv6, including how it encodes the network path in packet headers, its scalability benefits, and support across industry. Specific functions of SRv6 LocalSIDs are explained, along with examples of fast reroute protection, traffic engineering, VPN overlays, and centralized vs distributed control planes. SRv6 is proposed as a simplification that can eliminate protocols for various use cases.

Igrp

1 2d

rtpengine - Media Relaying and Beyond

Andreas Granig

rtpengine is a media relay, WebRTC bridge, call recorder, media transcoder, and media player. It can relay and manipulate media in real-time by forwarding packets through a kernel module. It supports features like SDP profile transforming, ICE negotiation, DTLS-SRTP encryption, packet recording, transcoding between codecs, and injecting audio streams into calls from files or databases. rtpengine integrates with Kamailio through modules and configuration to manipulate media on SIP calls.

Segment Routing for Dummies

Gary Jan

Segment Routing is a source routing architecture that embeds instructions, called segments, directly in the packet. This allows packets to be steered through specific paths in the network by prepending or stitching segment IDs. Segment Routing simplifies network operations by removing the need for signaling, label distribution, and per-flow state. Paths can either be computed distributively using IGP flooding of segment IDs, or explicitly programmed by a controller. This provides flexibility to engineers while keeping the forwarding plane stateless and simple.

Big Data Processing With Spark

Edureka!

Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang

Databricks

As a general computing engine, Spark can process data from various data management/storage systems, including HDFS, Hive, Cassandra and Kafka. For flexibility and high throughput, Spark defines the Data Source API, which is an abstraction of the storage layer. The Data Source API has two requirements. 1) Generality: support reading/writing most data management/storage systems. 2) Flexibility: customize and optimize the read and write paths for different systems based on their capabilities. Data Source API V2 is one of the most important features coming with Spark 2.3. This talk will dive into the design and implementation of Data Source API V2, with comparison to the Data Source API V1. We also demonstrate how to implement a file-based data source using the Data Source API V2 for showing its generality and flexibility.

Spark Summit 2014: Spark Job Server Talk

Evan Chan

This document discusses Spark Job Server, an open source project that allows Spark jobs to be submitted and run via a REST API. It provides features like job monitoring, context sharing between jobs to reuse cached data, and asynchronous APIs. The document outlines motivations for the project, how to use it including submitting and monitoring jobs, and future plans like high availability and hot failover support.

Faster Data Integration Pipeline Execution using Spark-Jobserver

Databricks

As you may already know, the open-source Spark Job Server offers a powerful platform for managing Spark jobs, jars, and contexts, turning Spark into a much more convenient and easy-to-use service. The Spark-Jobserver can keep Spark context warmed up and readily available for accepting new jobs. At Informatica we are leveraging the Spark-Jobserver offerings to solve the data-visualization use-case.

What's hot

DMVPN configuration - Configuring Cisco dynamic Multipoint VPN - HUB, SPOKES,...

NetProtocol Xpert

Flogo - A Golang-powered Open Source IoT Integration Framework (Gophercon)

Kai Wähner

Deep Dive into the New Features of Apache Spark 3.0

Databricks

Bgp tutorial for ISP

Wahyu Nasution

Dive into PySpark

Mateusz Buśkiewicz

Hadoop vs Apache Spark

ALTEN Calsoft Labs

MPLS + BGP Presentation

Gino McCarty

PySpark Programming | PySpark Concepts with Hands-On | PySpark Training | Edu...

Edureka!

Improving Apache Spark by Taking Advantage of Disaggregated Architecture

Databricks

Introduccion al direccionamiento IPv6

ZONA 802

MENOG-Segment Routing Introduction

Rasoul Mesghali, CCIE RS

Presentation on arp protocol

Mohd. Ahmad Siddiqi

Presto on Apache Spark: A Tale of Two Computation Engines

Databricks

Building DataCenter networks with VXLAN BGP-EVPN

Cisco Canada

SRv6 Network Programming: deployment use-cases

APNIC

Igrp

1 2d

rtpengine - Media Relaying and Beyond

Andreas Granig

Segment Routing for Dummies

Gary Jan

Big Data Processing With Spark

Edureka!

Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang

Databricks

What's hot (20)

DMVPN configuration - Configuring Cisco dynamic Multipoint VPN - HUB, SPOKES,...

Flogo - A Golang-powered Open Source IoT Integration Framework (Gophercon)

Deep Dive into the New Features of Apache Spark 3.0

Bgp tutorial for ISP

Dive into PySpark

Hadoop vs Apache Spark

MPLS + BGP Presentation

PySpark Programming | PySpark Concepts with Hands-On | PySpark Training | Edu...

Improving Apache Spark by Taking Advantage of Disaggregated Architecture

Introduccion al direccionamiento IPv6

MENOG-Segment Routing Introduction

Presentation on arp protocol

Presto on Apache Spark: A Tale of Two Computation Engines

Building DataCenter networks with VXLAN BGP-EVPN

SRv6 Network Programming: deployment use-cases

Igrp

rtpengine - Media Relaying and Beyond

Segment Routing for Dummies

Big Data Processing With Spark

Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang

Similar to Sputnik: Airbnb’s Apache Spark Framework for Data Engineering

Spark Summit 2014: Spark Job Server Talk

Evan Chan

Faster Data Integration Pipeline Execution using Spark-Jobserver

Databricks

[Hack.Hydrosphere] Project TIDE

Bulat Lutfullin

The document summarizes a visual dataflow constructor tool called Tide that was created by Team Q-shke-Q. It consists of: 1) Four team members with various roles - two front-end developers, a back-end Apache Spark expert, and a team lead. 2) A description of their initial idea called Apache Spark Mist Web App and how it evolved into Tide after realizing a data scientist's perspective. 3) An overview of Tide which provides an abstraction layer over code to simplify creating Apache Spark jobs and allows using Python code snippets while still compiling to PySpark.

Introduction to Apache Spark 2.0

Knoldus Inc.

Sydney Apache Spark Meetup - Spark Natural Language Processing

Andy Huang

Spark sql under the hood - Data KRK meetup

Mikołaj Kromka

In recent years Apache Spark has received a lot of hype in the Big Data community. It is seen as a silver bullet for all problems related to gathering, processing and analysing massive datasets. Due to its rapid evolution (do not forget that Spark is one the most active open source projects), some of the ideas behind it seem to be unclear and require digging into different blog posts and presentations. During this talk we will dive into the internals of Spark SQL, look how our queries are translated to the actual code executed on the nodes and find different ways to debug and optimize them.

SQL Performance Improvements At a Glance in Apache Spark 3.0

Kazuaki Ishizaki

SQL Performance Improvements at a Glance in Apache Spark 3.0

Databricks

Spark SQL - 10 Things You Need to Know

Kristian Alexander

The document discusses loading data into Spark SQL and the differences between DataFrame functions and SQL. It provides examples of loading data from files, cloud storage, and directly into DataFrames from JSON and Parquet files. It also demonstrates using SQL on DataFrames after registering them as temporary views. The document outlines how to load data into RDDs and convert them to DataFrames to enable SQL querying, as well as using SQL-like functions directly in the DataFrame API.

Sharing (or stealing) the jewels of python with big data & the jvm (1)

Holden Karau

With the new Apache Arrow integration in PySpark 2.3, it is now starting become reasonable to look to the Python world and ask “what else do we want to steal besides tensorflow”, or as a Python developer look and say “how can I get my code into production without it being rewritten into a mess of Java?” Regardless of your specific side(s) in the JVM/Python divide, collaboration is getting a lot faster, so lets learn how to share! In this brief talk we will examine sharing some of the wonders of Spacy with the Java world, which still has a somewhat lackluster set of options for NLP.

Holden Karau - Spark ML for Custom Models

sparktc

This document summarizes a presentation on extending Spark ML pipelines. It discusses how pipeline stages can be estimators or transformers, with estimators needing to be trained to produce transformers. Pipeline stages must provide transformSchema and copy methods and can have configuration parameters. The document provides an example of a simple transformer and how to make it configurable. It also briefly discusses how to create an estimator by adding a fit method.

Introduction to Structured Streaming | Big Data Hadoop Spark Tutorial | Cloud...

CloudxLab

This document provides an introduction to Spark Structured Streaming. It discusses that Structured Streaming is a scalable, fault-tolerant stream processing engine built on the Spark SQL engine. It expresses streaming computations similar to batch processing and guarantees end-to-end exactly-once processing. The document also provides a code example of a word count application using Structured Streaming and discusses output modes for writing streaming query results.

Sqoop on Spark for Data Ingestion

DataWorks Summit

Sqoop on Spark provides a way to run Sqoop jobs using Apache Spark for parallel data ingestion. It allows Sqoop jobs to leverage Spark's speed and growing community. The key aspects covered are: - Sqoop jobs can be created and executed on Spark by initializing a Spark context and wrapping Sqoop and Spark initialization. - Data is partitioned and extracted in parallel using Spark RDDs and map transformations calling Sqoop connector APIs. - Loading also uses Spark RDDs and map transformations to parallelly load data calling connector load APIs. - Microbenchmarks show Spark-based ingestion can be significantly faster than traditional MapReduce-based Sqoop for large datasets

Are general purpose big data systems eating the world?

Holden Karau

Every-time there is a new piece of big data technology we often see many different specific implementations of the concepts, which often eventually consolidate down to a few viable options, and then frequently end up getting rolled into part of another larger project. This talk will examine this trend in big data ecosystem, look at the exceptions to the "rule", and look at how better interchange formats like Apache Arrow have the potential to change this going forward. In addition to general vague happy feelings (or sad depending on your ideas about how software should be made), this talk will look at some specific examples with deep learning, so if anyone is looking for a little bit of pixie dust to sprinkle on a failing business plan to take to silicon valley to raise a series A, you'll get something out this as well. Video - https://www.youtube.com/watch?v=P_YKrLFZQJo

Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)

Evan Chan

This was a talk that Kelvin Chu and I just gave at the SF Bay Area Spark Meetup 5/14 at Palantir Technologies. We discussed the Spark Job Server (http://github.com/ooyala/spark-jobserver), its history, example workflows, architecture, and exciting future plans to provide HA spark job contexts. We also discussed the use case of the job server at Ooyala to facilitate fast query jobs using shared RDD and a shared job context, and how we integrate with Apache Cassandra.

CODAIT/Spark-Bench

Emily Curtin

What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...

Simplilearn

This presentation about Apache Spark covers all the basics that a beginner needs to know to get started with Spark. It covers the history of Apache Spark, what is Spark, the difference between Hadoop and Spark. You will learn the different components in Spark, and how Spark works with the help of architecture. You will understand the different cluster managers on which Spark can run. Finally, you will see the various applications of Spark and a use case on Conviva. Now, let's get started with what is Apache Spark. Below topics are explained in this Spark presentation: 1. History of Spark 2. What is Spark 3. Hadoop vs Spark 4. Components of Apache Spark 5. Spark architecture 6. Applications of Spark 7. Spark usecase What is this Big Data Hadoop training course about? The Big Data Hadoop and Spark developer course have been designed to impart an in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab. What are the course objectives? Simplilearn’s Apache Spark and Scala certification training are designed to: 1. Advance your expertise in the Big Data Hadoop Ecosystem 2. Help you master essential Apache and Spark skills, such as Spark Streaming, Spark SQL, machine learning programming, GraphX programming and Shell Scripting Spark 3. Help you land a Hadoop developer job requiring Apache Spark expertise by giving you a real-life industry project coupled with 30 demos What skills will you learn? By completing this Apache Spark and Scala course you will be able to: 1. Understand the limitations of MapReduce and the role of Spark in overcoming these limitations 2. Understand the fundamentals of the Scala programming language and its features 3. Explain and master the process of installing Spark as a standalone cluster 4. Develop expertise in using Resilient Distributed Datasets (RDD) for creating applications in Spark 5. Master Structured Query Language (SQL) using SparkSQL 6. Gain a thorough understanding of Spark streaming features 7. Master and describe the features of Spark ML programming and GraphX programming Who should take this Scala course? 1. Professionals aspiring for a career in the field of real-time big data analytics 2. Analytics professionals 3. Research professionals 4. IT developers and testers 5. Data scientists 6. BI and reporting professionals 7. Students who wish to gain a thorough understanding of Apache Spark Learn more at https://www.simplilearn.com/big-data-and-analytics/apache-spark-scala-certification-training

Spark ML for custom models - FOSDEM HPC 2017

Holden Karau

- Spark ML pipelines involve estimators that are trained on datasets to produce immutable transformers. - A transformer must define transformSchema() to validate the input schema, transform() to do the work, and copy() for cloning. - Configurable transformers take parameters like inputCol and outputCol to allow configuration for meta algorithms. - Estimators are similar but fit() returns a model instead of directly transforming.

Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)

Spark Summit

This document discusses running Sqoop jobs on Apache Spark for faster data ingestion into Hadoop. The authors describe how Sqoop jobs can be executed as Spark jobs by leveraging Spark's faster execution engine compared to MapReduce. They demonstrate running a Sqoop job to ingest data from MySQL to HDFS using Spark and show it is faster than using MapReduce. Some challenges encountered are managing dependencies and job submission, but overall it allows leveraging Sqoop's connectors within Spark's distributed processing framework. Next steps include exploring alternative job submission methods in Spark and adding transformation capabilities to Sqoop connectors.

An introduction into Spark ML plus how to go beyond when you get stuck

Data Con LA

This document provides instructions for extending Spark ML pipelines by building new pipeline stages. It discusses the key components needed to build estimators and transformers, including implementing transformSchema, fit/transform methods, and parameter configuration. Examples are given of building a simple string indexer estimator and transformer. The document also briefly mentions additional features like persistence and serving that could be added.

Similar to Sputnik: Airbnb’s Apache Spark Framework for Data Engineering (20)

Spark Summit 2014: Spark Job Server Talk

Faster Data Integration Pipeline Execution using Spark-Jobserver

[Hack.Hydrosphere] Project TIDE

Introduction to Apache Spark 2.0

Sydney Apache Spark Meetup - Spark Natural Language Processing

Spark sql under the hood - Data KRK meetup

SQL Performance Improvements At a Glance in Apache Spark 3.0

SQL Performance Improvements at a Glance in Apache Spark 3.0

Spark SQL - 10 Things You Need to Know

Sharing (or stealing) the jewels of python with big data & the jvm (1)

Holden Karau - Spark ML for Custom Models

Introduction to Structured Streaming | Big Data Hadoop Spark Tutorial | Cloud...

Sqoop on Spark for Data Ingestion

Are general purpose big data systems eating the world?

Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)

CODAIT/Spark-Bench

What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...

Spark ML for custom models - FOSDEM HPC 2017

Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)

An introduction into Spark ML plus how to go beyond when you get stuck

More from Databricks

DW Migration Webinar-March 2022.pptx

Databricks

The document discusses migrating a data warehouse to the Databricks Lakehouse Platform. It outlines why legacy data warehouses are struggling, how the Databricks Platform addresses these issues, and key considerations for modern analytics and data warehousing. The document then provides an overview of the migration methodology, approach, strategies, and key takeaways for moving to a lakehouse on Databricks.

Data Lakehouse Symposium | Day 1 | Part 1

Databricks

The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse. Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today. Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow. This is an educational event. Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.

Data Lakehouse Symposium | Day 1 | Part 2

Databricks

Data Lakehouse Symposium | Day 2

Databricks

Data Lakehouse Symposium | Day 4

Databricks

The document discusses the challenges of modern data, analytics, and AI workloads. Most enterprises struggle with siloed data systems that make integration and productivity difficult. The future of data lies with a data lakehouse platform that can unify data engineering, analytics, data warehousing, and machine learning workloads on a single open platform. The Databricks Lakehouse platform aims to address these challenges with its open data lake approach and capabilities for data engineering, SQL analytics, governance, and machine learning.

5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop

Databricks

In this session, learn how to quickly supplement your on-premises Hadoop environment with a simple, open, and collaborative cloud architecture that enables you to generate greater value with scaled application of analytics and AI on all your data. You will also learn five critical steps for a successful migration to the Databricks Lakehouse Platform along with the resources available to help you begin to re-skill your data teams.

Democratizing Data Quality Through a Centralized Platform

Databricks

Bad data leads to bad decisions and broken customer experiences. Organizations depend on complete and accurate data to power their business, maintain efficiency, and uphold customer trust. With thousands of datasets and pipelines running, how do we ensure that all data meets quality standards, and that expectations are clear between producers and consumers? Investing in shared, flexible components and practices for monitoring data health is crucial for a complex data organization to rapidly and effectively scale. At Zillow, we built a centralized platform to meet our data quality needs across stakeholders. The platform is accessible to engineers, scientists, and analysts, and seamlessly integrates with existing data pipelines and data discovery tools. In this presentation, we will provide an overview of our platform’s capabilities, including: Giving producers and consumers the ability to define and view data quality expectations using a self-service onboarding portal Performing data quality validations using libraries built to work with spark Dynamically generating pipelines that can be abstracted away from users Flagging data that doesn’t meet quality standards at the earliest stage and giving producers the opportunity to resolve issues before use by downstream consumers Exposing data quality metrics alongside each dataset to provide producers and consumers with a comprehensive picture of health over time

Learn to Use Databricks for Data Science

Databricks

Data scientists face numerous challenges throughout the data science workflow that hinder productivity. As organizations continue to become more data-driven, a collaborative environment is more critical than ever — one that provides easier access and visibility into the data, reports and dashboards built against the data, reproducibility, and insights uncovered within the data.. Join us to hear how Databricks’ open and collaborative platform simplifies data science by enabling you to run all types of analytics workloads, from data preparation to exploratory analysis and predictive analytics, at scale — all on one unified platform.

Why APM Is Not the Same As ML Monitoring

Databricks

Application performance monitoring (APM) has become the cornerstone of software engineering allowing engineering teams to quickly identify and remedy production issues. However, as the world moves to intelligent software applications that are built using machine learning, traditional APM quickly becomes insufficient to identify and remedy production issues encountered in these modern software applications. As a lead software engineer at NewRelic, my team built high-performance monitoring systems including Insights, Mobile, and SixthSense. As I transitioned to building ML Monitoring software, I found the architectural principles and design choices underlying APM to not be a good fit for this brand new world. In fact, blindly following APM designs led us down paths that would have been better left unexplored. In this talk, I draw upon my (and my team’s) experience building an ML Monitoring system from the ground up and deploying it on customer workloads running large-scale ML training with Spark as well as real-time inference systems. I will highlight how the key principles and architectural choices of APM don’t apply to ML monitoring. You’ll learn why, understand what ML Monitoring can successfully borrow from APM, and hear what is required to build a scalable, robust ML Monitoring architecture.

The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix

Databricks

Autonomy and ownership are core to working at Stitch Fix, particularly on the Algorithms team. We enable data scientists to deploy and operate their models independently, with minimal need for handoffs or gatekeeping. By writing a simple function and calling out to an intuitive API, data scientists can harness a suite of platform-provided tooling meant to make ML operations easy. In this talk, we will dive into the abstractions the Data Platform team has built to enable this. We will go over the interface data scientists use to specify a model and what that hooks into, including online deployment, batch execution on Spark, and metrics tracking and visualization.

Stage Level Scheduling Improving Big Data and AI Integration

Databricks

In this talk, I will dive into the stage level scheduling feature added to Apache Spark 3.1. Stage level scheduling extends upon Project Hydrogen by improving big data ETL and AI integration and also enables multiple other use cases. It is beneficial any time the user wants to change container resources between stages in a single Apache Spark application, whether those resources are CPU, Memory or GPUs. One of the most popular use cases is enabling end-to-end scalable Deep Learning and AI to efficiently use GPU resources. In this type of use case, users read from a distributed file system, do data manipulation and filtering to get the data into a format that the Deep Learning algorithm needs for training or inference and then sends the data into a Deep Learning algorithm. Using stage level scheduling combined with accelerator aware scheduling enables users to seamlessly go from ETL to Deep Learning running on the GPU by adjusting the container requirements for different stages in Spark within the same application. This makes writing these applications easier and can help with hardware utilization and costs. There are other ETL use cases where users want to change CPU and memory resources between stages, for instance there is data skew or perhaps the data size is much larger in certain stages of the application. In this talk, I will go over the feature details, cluster requirements, the API and use cases. I will demo how the stage level scheduling API can be used by Horovod to seamlessly go from data preparation to training using the Tensorflow Keras API using GPUs. The talk will also touch on other new Apache Spark 3.1 functionality, such as pluggable caching, which can be used to enable faster dataframe access when operating from GPUs.

Simplify Data Conversion from Spark to TensorFlow and PyTorch

Databricks

In this talk, I would like to introduce an open-source tool built by our team that simplifies the data conversion from Apache Spark to deep learning frameworks. Imagine you have a large dataset, say 20 GBs, and you want to use it to train a TensorFlow model. Before feeding the data to the model, you need to clean and preprocess your data using Spark. Now you have your dataset in a Spark DataFrame. When it comes to the training part, you may have the problem: How can I convert my Spark DataFrame to some format recognized by my TensorFlow model? The existing data conversion process can be tedious. For example, to convert an Apache Spark DataFrame to a TensorFlow Dataset file format, you need to either save the Apache Spark DataFrame on a distributed filesystem in parquet format and load the converted data with third-party tools such as Petastorm, or save it directly in TFRecord files with spark-tensorflow-connector and load it back using TFRecordDataset. Both approaches take more than 20 lines of code to manage the intermediate data files, rely on different parsing syntax, and require extra attention for handling vector columns in the Spark DataFrames. In short, all these engineering frictions greatly reduced the data scientists’ productivity. The Databricks Machine Learning team contributed a new Spark Dataset Converter API to Petastorm to simplify these tedious data conversion process steps. With the new API, it takes a few lines of code to convert a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader with default parameters. In the talk, I will use an example to show how to use the Spark Dataset Converter to train a Tensorflow model and how simple it is to go from single-node training to distributed training on Databricks.

Scaling your Data Pipelines with Apache Spark on Kubernetes

Databricks

There is no doubt Kubernetes has emerged as the next generation of cloud native infrastructure to support a wide variety of distributed workloads. Apache Spark has evolved to run both Machine Learning and large scale analytics workloads. There is growing interest in running Apache Spark natively on Kubernetes. By combining the flexibility of Kubernetes and scalable data processing with Apache Spark, you can run any data and machine pipelines on this infrastructure while effectively utilizing resources at disposal. In this talk, Rajesh Thallam and Sougata Biswas will share how to effectively run your Apache Spark applications on Google Kubernetes Engine (GKE) and Google Cloud Dataproc, orchestrate the data and machine learning pipelines with managed Apache Airflow on GKE (Google Cloud Composer). Following topics will be covered: – Understanding key traits of Apache Spark on Kubernetes- Things to know when running Apache Spark on Kubernetes such as autoscaling- Demonstrate running analytics pipelines on Apache Spark orchestrated with Apache Airflow on Kubernetes cluster.

Scaling and Unifying SciKit Learn and Apache Spark Pipelines

Databricks

Pipelines have become ubiquitous, as the need for stringing multiple functions to compose applications has gained adoption and popularity. Common pipeline abstractions such as “fit” and “transform” are even shared across divergent platforms such as Python Scikit-Learn and Apache Spark. Scaling pipelines at the level of simple functions is desirable for many AI applications, however is not directly supported by Ray’s parallelism primitives. In this talk, Raghu will describe a pipeline abstraction that takes advantage of Ray’s compute model to efficiently scale arbitrarily complex pipeline workflows. He will demonstrate how this abstraction cleanly unifies pipeline workflows across multiple platforms such as Scikit-Learn and Spark, and achieves nearly optimal scale-out parallelism on pipelined computations. Attendees will learn how pipelined workflows can be mapped to Ray’s compute model and how they can both unify and accelerate their pipelines with Ray.

Sawtooth Windows for Feature Aggregations

Databricks

In this talk about zipline, we will introduce a new type of windowing construct called a sawtooth window. We will describe various properties about sawtooth windows that we utilize to achieve online-offline consistency, while still maintaining high-throughput, low-read latency and tunable write latency for serving machine learning features.We will also talk about a simple deployment strategy for correcting feature drift – due operations that are not “abelian groups”, that operate over change data.

Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink

Databricks

We want to present multiple anti patterns utilizing Redis in unconventional ways to get the maximum out of Apache Spark.All examples presented are tried and tested in production at Scale at Adobe. The most common integration is spark-redis which interfaces with Redis as a Dataframe backing Store or as an upstream for Structured Streaming. We deviate from the common use cases to explore where Redis can plug gaps while scaling out high throughput applications in Spark. Niche 1 : Long Running Spark Batch Job – Dispatch New Jobs by polling a Redis Queue · Why? o Custom queries on top a table; We load the data once and query N times · Why not Structured Streaming · Working Solution using Redis Niche 2 : Distributed Counters · Problems with Spark Accumulators · Utilize Redis Hashes as distributed counters · Precautions for retries and speculative execution · Pipelining to improve performance

Re-imagine Data Monitoring with whylogs and Spark

Databricks

In the era of microservices, decentralized ML architectures and complex data pipelines, data quality has become a bigger challenge than ever. When data is involved in complex business processes and decisions, bad data can, and will, affect the bottom line. As a result, ensuring data quality across the entire ML pipeline is both costly, and cumbersome while data monitoring is often fragmented and performed ad hoc. To address these challenges, we built whylogs, an open source standard for data logging. It is a lightweight data profiling library that enables end-to-end data profiling across the entire software stack. The library implements a language and platform agnostic approach to data quality and data monitoring. It can work with different modes of data operations, including streaming, batch and IoT data. In this talk, we will provide an overview of the whylogs architecture, including its lightweight statistical data collection approach and various integrations. We will demonstrate how the whylogs integration with Apache Spark achieves large scale data profiling, and we will show how users can apply this integration into existing data and ML pipelines.

Raven: End-to-end Optimization of ML Prediction Queries

Databricks

Machine learning (ML) models are typically part of prediction queries that consist of a data processing part (e.g., for joining, filtering, cleaning, featurization) and an ML part invoking one or more trained models. In this presentation, we identify significant and unexplored opportunities for optimization. To the best of our knowledge, this is the first effort to look at prediction queries holistically, optimizing across both the ML and SQL components. We will present Raven, an end-to-end optimizer for prediction queries. Raven relies on a unified intermediate representation that captures both data processing and ML operators in a single graph structure. This allows us to introduce optimization rules that (i) reduce unnecessary computations by passing information between the data processing and ML operators (ii) leverage operator transformations (e.g., turning a decision tree to a SQL expression or an equivalent neural network) to map operators to the right execution engine, and (iii) integrate compiler techniques to take advantage of the most efficient hardware backend (e.g., CPU, GPU) for each operator. We have implemented Raven as an extension to Spark’s Catalyst optimizer to enable the optimization of SparkSQL prediction queries. Our implementation also allows the optimization of prediction queries in SQL Server. As we will show, Raven is capable of improving prediction query performance on Apache Spark and SQL Server by up to 13.1x and 330x, respectively. For complex models, where GPU acceleration is beneficial, Raven provides up to 8x speedup compared to state-of-the-art systems. As part of the presentation, we will also give a demo showcasing Raven in action.

Processing Large Datasets for ADAS Applications using Apache Spark

Databricks

Semantic segmentation is the classification of every pixel in an image/video. The segmentation partitions a digital image into multiple objects to simplify/change the representation of the image into something that is more meaningful and easier to analyze [1][2]. The technique has a wide variety of applications ranging from perception in autonomous driving scenarios to cancer cell segmentation for medical diagnosis. Exponential growth in the datasets that require such segmentation is driven by improvements in the accuracy and quality of the sensors generating the data extending to 3D point cloud data. This growth is further compounded by exponential advances in cloud technologies enabling the storage and compute available for such applications. The need for semantically segmented datasets is a key requirement to improve the accuracy of inference engines that are built upon them. Streamlining the accuracy and efficiency of these systems directly affects the value of the business outcome for organizations that are developing such functionalities as a part of their AI strategy. This presentation details workflows for labeling, preprocessing, modeling, and evaluating performance/accuracy. Scientists and engineers leverage domain-specific features/tools that support the entire workflow from labeling the ground truth, handling data from a wide variety of sources/formats, developing models and finally deploying these models. Users can scale their deployments optimally on GPU-based cloud infrastructure to build accelerated training and inference pipelines while working with big datasets. These environments are optimized for engineers to develop such functionality with ease and then scale against large datasets with Spark-based clusters on the cloud.

Massive Data Processing in Adobe Using Delta Lake

Databricks

At Adobe Experience Platform, we ingest TBs of data every day and manage PBs of data for our customers as part of the Unified Profile Offering. At the heart of this is a bunch of complex ingestion of a mix of normalized and denormalized data with various linkage scenarios power by a central Identity Linking Graph. This helps power various marketing scenarios that are activated in multiple platforms and channels like email, advertisements etc. We will go over how we built a cost effective and scalable data pipeline using Apache Spark and Delta Lake and share our experiences. What are we storing? Multi Source – Multi Channel Problem Data Representation and Nested Schema Evolution Performance Trade Offs with Various formats Go over anti-patterns used (String FTW) Data Manipulation using UDFs Writer Worries and How to Wipe them Away Staging Tables FTW Datalake Replication Lag Tracking Performance Time!

More from Databricks (20)

DW Migration Webinar-March 2022.pptx

Data Lakehouse Symposium | Day 1 | Part 1

Data Lakehouse Symposium | Day 1 | Part 2

Data Lakehouse Symposium | Day 2

Data Lakehouse Symposium | Day 4

5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop

Democratizing Data Quality Through a Centralized Platform

Learn to Use Databricks for Data Science

Why APM Is Not the Same As ML Monitoring

The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix

Stage Level Scheduling Improving Big Data and AI Integration

Simplify Data Conversion from Spark to TensorFlow and PyTorch

Scaling your Data Pipelines with Apache Spark on Kubernetes

Scaling and Unifying SciKit Learn and Apache Spark Pipelines

Sawtooth Windows for Feature Aggregations

Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink

Re-imagine Data Monitoring with whylogs and Spark

Raven: End-to-end Optimization of ML Prediction Queries

Processing Large Datasets for ADAS Applications using Apache Spark

Massive Data Processing in Adobe Using Delta Lake

Recently uploaded

原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样

ihavuls

学校原件一模一样【微信：741003700 】《(unimelb毕业证书)墨尔本大学毕业证》【微信：741003700 】学位证，留信认证（真实可查，永久存档）原件一模一样纸张工艺/offer、雅思、外壳等材料/诚信可靠,可直接看成品样本，帮您解决无法毕业带来的各种难题！外壳，原版制作，诚信可靠，可直接看成品样本。行业标杆！精益求精，诚心合作，真诚制作！多年品质 ,按需精细制作，24小时接单,全套进口原装设备。十五年致力于帮助留学生解决难题，包您满意。本公司拥有海外各大学样板无数，能完美还原。 1:1完美还原海外各大学毕业材料上的工艺：水印，阴影底纹，钢印LOGO烫金烫银，LOGO烫金烫银复合重叠。文字图案浮雕、激光镭射、紫外荧光、温感、复印防伪等防伪工艺。材料咨询办理、认证咨询办理请加学历顾问Q/微741003700 【主营项目】一.毕业证【q微741003700】成绩单、使馆认证、教育部认证、雅思托福成绩单、学生卡等！二.真实使馆公证(即留学回国人员证明,不成功不收费) 三.真实教育部学历学位认证（教育部存档！教育部留服网站永久可查）四.办理各国各大学文凭(一对一专业服务,可全程监控跟踪进度) 如果您处于以下几种情况： ◇在校期间，因各种原因未能顺利毕业……拿不到官方毕业证【q/微741003700】 ◇面对父母的压力，希望尽快拿到； ◇不清楚认证流程以及材料该如何准备； ◇回国时间很长，忘记办理； ◇回国马上就要找工作，办给用人单位看； ◇企事业单位必须要求办理的 ◇需要报考公务员、购买免税车、落转户口 ◇申请留学生创业基金留信网认证的作用: 1:该专业认证可证明留学生真实身份 2:同时对留学生所学专业登记给予评定 3:国家专业人才认证中心颁发入库证书 4:这个认证书并且可以归档倒地方 5:凡事获得留信网入网的信息将会逐步更新到个人身份内，将在公安局网内查询个人身份证信息后，同步读取人才网入库信息 6:个人职称评审加20分 7:个人信誉贷款加10分 8:在国家人才网主办的国家网络招聘大会中纳入资料，供国家高端企业选择人才

Predictably Improve Your B2B Tech Company's Performance by Leveraging Data

Kiwi Creative

Harness the power of AI-backed reports, benchmarking and data analysis to predict trends and detect anomalies in your marketing efforts. Peter Caputa, CEO at Databox, reveals how you can discover the strategies and tools to increase your growth rate (and margins!). From metrics to track to data habits to pick up, enhance your reporting for powerful insights to improve your B2B tech company's marketing. - - - This is the webinar recording from the June 2024 HubSpot User Group (HUG) for B2B Technology USA. Watch the video recording at https://youtu.be/5vjwGfPN9lw Sign up for future HUG events at https://events.hubspot.com/b2b-technology-usa/

Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...

Kaxil Naik

Navigating today's data landscape isn't just about managing workflows; it's about strategically propelling your business forward. Apache Airflow has stood out as the benchmark in this arena, driving data orchestration forward since its early days. As we dive into the complexities of our current data-rich environment, where the sheer volume of information and its timely, accurate processing are crucial for AI and ML applications, the role of Airflow has never been more critical. In my journey as the Senior Engineering Director and a pivotal member of Apache Airflow's Project Management Committee (PMC), I've witnessed Airflow transform data handling, making agility and insight the norm in an ever-evolving digital space. At Astronomer, our collaboration with leading AI & ML teams worldwide has not only tested but also proven Airflow's mettle in delivering data reliably and efficiently—data that now powers not just insights but core business functions. This session is a deep dive into the essence of Airflow's success. We'll trace its evolution from a budding project to the backbone of data orchestration it is today, constantly adapting to meet the next wave of data challenges, including those brought on by Generative AI. It's this forward-thinking adaptability that keeps Airflow at the forefront of innovation, ready for whatever comes next. The ever-growing demands of AI and ML applications have ushered in an era where sophisticated data management isn't a luxury—it's a necessity. Airflow's innate flexibility and scalability are what makes it indispensable in managing the intricate workflows of today, especially those involving Large Language Models (LLMs). This talk isn't just a rundown of Airflow's features; it's about harnessing these capabilities to turn your data workflows into a strategic asset. Together, we'll explore how Airflow remains at the cutting edge of data orchestration, ensuring your organization is not just keeping pace but setting the pace in a data-driven future. Session in https://budapestdata.hu/2024/04/kaxil-naik-astronomer-io/ | https://dataml24.sessionize.com/session/667627

一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理

nuttdpt

毕业原版【微信:176555708】【(UCSB毕业证书)圣芭芭拉分校毕业证】【微信:176555708】成绩单、外壳、offer、留信学历认证（永久存档真实可查）采用学校原版纸张、特殊工艺完全按照原版一比一制作（包括：隐形水印，阴影底纹，钢印LOGO烫金烫银，LOGO烫金烫银复合重叠，文字图案浮雕，激光镭射，紫外荧光，温感，复印防伪）行业标杆！精益求精，诚心合作，真诚制作！多年品质 ,按需精细制作，24小时接单,全套进口原装设备，十五年致力于帮助留学生解决难题，业务范围有加拿大、英国、澳洲、韩国、美国、新加坡，新西兰等学历材料，包您满意。【我们承诺采用的是学校原版纸张（纸质、底色、纹路），我们拥有全套进口原装设备，特殊工艺都是采用不同机器制作，仿真度基本可以达到100%，所有工艺效果都可提前给客户展示，不满意可以根据客户要求进行调整，直到满意为止！】【业务选择办理准则】一、工作未确定，回国需先给父母、亲戚朋友看下文凭的情况，办理一份就读学校的毕业证【微信176555708】文凭即可二、回国进私企、外企、自己做生意的情况，这些单位是不查询毕业证真伪的，而且国内没有渠道去查询国外文凭的真假，也不需要提供真实教育部认证。鉴于此，办理一份毕业证【微信176555708】即可三、进国企，银行，事业单位，考公务员等等，这些单位是必需要提供真实教育部认证的，办理教育部认证所需资料众多且烦琐，所有材料您都必须提供原件，我们凭借丰富的经验，快捷的绿色通道帮您快速整合材料，让您少走弯路。留信网认证的作用: 1:该专业认证可证明留学生真实身份 2:同时对留学生所学专业登记给予评定 3:国家专业人才认证中心颁发入库证书 4:这个认证书并且可以归档倒地方 5:凡事获得留信网入网的信息将会逐步更新到个人身份内，将在公安局网内查询个人身份证信息后，同步读取人才网入库信息 6:个人职称评审加20分 7:个人信誉贷款加10分 8:在国家人才网主办的国家网络招聘大会中纳入资料，供国家高端企业选择人才留信网服务项目： 1、留学生专业人才库服务（留信分析） 2、国（境）学习人员提供就业推荐信服务 3、留学人员区块链存储服务 → 【关于价格问题（保证一手价格）】我们所定的价格是非常合理的，而且我们现在做得单子大多数都是代理和回头客户介绍的所以一般现在有新的单子我给客户的都是第一手的代理价格，因为我想坦诚对待大家不想跟大家在价格方面浪费时间对于老客户或者被老客户介绍过来的朋友，我们都会适当给一些优惠。选择实体注册公司办理，更放心，更安全！我们的承诺：客户在留信官方认证查询网站查询到认证通过结果后付款，不成功不收费！

一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理

z6osjkqvd

原版办【微信号:BYZS866】【英属哥伦比亚大学毕业证(UBC毕业证书)】【微信号:BYZS866】《成绩单、外壳、雅思、offer、真实留信官方学历认证（永久存档/真实可查）》采用学校原版纸张、特殊工艺完全按照原版一比一制作（包括：隐形水印，阴影底纹，钢印LOGO烫金烫银，LOGO烫金烫银复合重叠，文字图案浮雕，激光镭射，紫外荧光，温感，复印防伪）行业标杆！精益求精，诚心合作，真诚制作！多年品质 ,按需精细制作，24小时接单,全套进口原装设备，十五年致力于帮助留学生解决难题，业务范围有加拿大、英国、澳洲、韩国、美国、新加坡，新西兰等学历材料，包您满意。【我们承诺采用的是学校原版纸张（纸质、底色、纹路）我们拥有全套进口原装设备，特殊工艺都是采用不同机器制作，仿真度基本可以达到100%，所有工艺效果都可提前给客户展示，不满意可以根据客户要求进行调整，直到满意为止！】【业务选择办理准则】一、工作未确定，回国需先给父母、亲戚朋友看下文凭的情况，办理一份就读学校的毕业证【微信号BYZS866】文凭即可二、回国进私企、外企、自己做生意的情况，这些单位是不查询毕业证真伪的，而且国内没有渠道去查询国外文凭的真假，也不需要提供真实教育部认证。鉴于此，办理一份毕业证【微信号BYZS866】即可三、进国企，银行，事业单位，考公务员等等，这些单位是必需要提供真实教育部认证的，办理教育部认证所需资料众多且烦琐，所有材料您都必须提供原件，我们凭借丰富的经验，快捷的绿色通道帮您快速整合材料，让您少走弯路。留信网认证的作用: 1:该专业认证可证明留学生真实身份 2:同时对留学生所学专业登记给予评定 3:国家专业人才认证中心颁发入库证书 4:这个认证书并且可以归档倒地方 5:凡事获得留信网入网的信息将会逐步更新到个人身份内，将在公安局网内查询个人身份证信息后，同步读取人才网入库信息 6:个人职称评审加20分 7:个人信誉贷款加10分 8:在国家人才网主办的国家网络招聘大会中纳入资料，供国家高端企业选择人才留信网服务项目： 1、留学生专业人才库服务（留信分析） 2、国（境）学习人员提供就业推荐信服务 3、留学人员区块链存储服务【关于价格问题（保证一手价格）】我们所定的价格是非常合理的，而且我们现在做得单子大多数都是代理和回头客户介绍的所以一般现在有新的单子我给客户的都是第一手的代理价格，因为我想坦诚对待大家不想跟大家在价格方面浪费时间对于老客户或者被老客户介绍过来的朋友，我们都会适当给一些优惠。选择实体注册公司办理，更放心，更安全！我们的承诺：客户在留信官方认证查询网站查询到认证通过结果后付款，不成功不收费！

一比一原版(CU毕业证)卡尔顿大学毕业证如何办理

bmucuha

原件一模一样【微信：95270640】【卡尔顿大学毕业证CU学位证成绩单】【微信：95270640】（留信学历认证永久存档查询）采用学校原版纸张、特殊工艺完全按照原版一比一制作（包括：隐形水印，阴影底纹，钢印LOGO烫金烫银，LOGO烫金烫银复合重叠，文字图案浮雕，激光镭射，紫外荧光，温感，复印防伪）行业标杆！精益求精，诚心合作，真诚制作！多年品质 ,按需精细制作，24小时接单,全套进口原装设备，十五年致力于帮助留学生解决难题，业务范围有加拿大、英国、澳洲、韩国、美国、新加坡，新西兰等学历材料，包您满意。【业务选择办理准则】一、工作未确定，回国需先给父母、亲戚朋友看下文凭的情况，办理一份就读学校的毕业证【微信：95270640】文凭即可二、回国进私企、外企、自己做生意的情况，这些单位是不查询毕业证真伪的，而且国内没有渠道去查询国外文凭的真假，也不需要提供真实教育部认证。鉴于此，办理一份毕业证【微信：95270640】即可三、进国企，银行，事业单位，考公务员等等，这些单位是必需要提供真实教育部认证的，办理教育部认证所需资料众多且烦琐，所有材料您都必须提供原件，我们凭借丰富的经验，快捷的绿色通道帮您快速整合材料，让您少走弯路。留信网认证的作用: 1:该专业认证可证明留学生真实身份【微信：95270640】 2:同时对留学生所学专业登记给予评定 3:国家专业人才认证中心颁发入库证书 4:这个认证书并且可以归档倒地方 5:凡事获得留信网入网的信息将会逐步更新到个人身份内，将在公安局网内查询个人身份证信息后，同步读取人才网入库信息 6:个人职称评审加20分 7:个人信誉贷款加10分 8:在国家人才网主办的国家网络招聘大会中纳入资料，供国家高端企业选择人才 → 【关于价格问题（保证一手价格）我们所定的价格是非常合理的，而且我们现在做得单子大多数都是代理和回头客户介绍的所以一般现在有新的单子我给客户的都是第一手的代理价格，因为我想坦诚对待大家不想跟大家在价格方面浪费时间对于老客户或者被老客户介绍过来的朋友，我们都会适当给一些优惠。选择实体注册公司办理，更放心，更安全！我们的承诺：可来公司面谈，可签订合同，会陪同客户一起到教育部认证窗口递交认证材料，客户在教育部官方认证查询网站查询到认证通过结果后付款，不成功不收费！办理卡尔顿大学毕业证本科学位证成绩单CU学位证【微信：95270640 】外观非常精致，由特殊纸质材料制成，上面印有校徽、校名、毕业生姓名、专业等信息。办理卡尔顿大学毕业证CU学位证本科学位证成绩单【微信：95270640 】格式相对统一，各专业都有相应的模板。通常包括以下部分：校徽：象征着学校的荣誉和传承。校名:学校英文全称授予学位：本部分将注明获得的具体学位名称。毕业生姓名：这是最重要的信息之一，标志着该证书是由特定人员获得的。颁发日期：这是毕业正式生效的时间，也代表着毕业生学业的结束。其他信息：根据不同的专业和学位，可能会有一些特定的信息或章节。办理卡尔顿大学毕业证本科学位证成绩单CU学位证【微信：95270640 】价值很高，需要妥善保管。一般来说，应放置在安全、干燥、防潮的地方，避免长时间暴露在阳光下。如需使用，最好使用复印件而不是原件，以免丢失。综上所述，办理卡尔顿大学毕业证本科学位证成绩单CU学位证【微信：95270640 】是证明身份和学历的高价值文件。外观简单庄重，格式统一，包括重要的个人信息和发布日期。对持有人来说，妥善保管是非常重要的。

writing report business partner b1+ .pdf

VyNguyen709676

在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样

v7oacc3l

学校原件一模一样【微信：741003700 】《(英国UCA毕业证书)创意艺术大学毕业证》【微信：741003700 】学位证，留信认证（真实可查，永久存档）原件一模一样纸张工艺/offer、雅思、外壳等材料/诚信可靠,可直接看成品样本，帮您解决无法毕业带来的各种难题！外壳，原版制作，诚信可靠，可直接看成品样本。行业标杆！精益求精，诚心合作，真诚制作！多年品质 ,按需精细制作，24小时接单,全套进口原装设备。十五年致力于帮助留学生解决难题，包您满意。本公司拥有海外各大学样板无数，能完美还原。 1:1完美还原海外各大学毕业材料上的工艺：水印，阴影底纹，钢印LOGO烫金烫银，LOGO烫金烫银复合重叠。文字图案浮雕、激光镭射、紫外荧光、温感、复印防伪等防伪工艺。材料咨询办理、认证咨询办理请加学历顾问Q/微741003700 【主营项目】一.毕业证【q微741003700】成绩单、使馆认证、教育部认证、雅思托福成绩单、学生卡等！二.真实使馆公证(即留学回国人员证明,不成功不收费) 三.真实教育部学历学位认证（教育部存档！教育部留服网站永久可查）四.办理各国各大学文凭(一对一专业服务,可全程监控跟踪进度) 如果您处于以下几种情况： ◇在校期间，因各种原因未能顺利毕业……拿不到官方毕业证【q/微741003700】 ◇面对父母的压力，希望尽快拿到； ◇不清楚认证流程以及材料该如何准备； ◇回国时间很长，忘记办理； ◇回国马上就要找工作，办给用人单位看； ◇企事业单位必须要求办理的 ◇需要报考公务员、购买免税车、落转户口 ◇申请留学生创业基金留信网认证的作用: 1:该专业认证可证明留学生真实身份 2:同时对留学生所学专业登记给予评定 3:国家专业人才认证中心颁发入库证书 4:这个认证书并且可以归档倒地方 5:凡事获得留信网入网的信息将会逐步更新到个人身份内，将在公安局网内查询个人身份证信息后，同步读取人才网入库信息 6:个人职称评审加20分 7:个人信誉贷款加10分 8:在国家人才网主办的国家网络招聘大会中纳入资料，供国家高端企业选择人才

Open Source Contributions to Postgres: The Basics POSETTE 2024

ElizabethGarrettChri

Postgres is the most advanced open-source database in the world and it's supported by a community, not a single company. So how does this work? How does code actually get into Postgres? I recently had a patch submitted and committed and I want to share what I learned in that process. I’ll give you an overview of Postgres versions and how the underlying project codebase functions. I’ll also show you the process for submitting a patch and getting that tested and committed.

Challenges of Nation Building-1.pptx with more important

Sm321

Experts live - Improving user adoption with AI

jitskeb

一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理

y3i0qsdzb

原版办理【微信号:BYZS866】【巴斯大学毕业证(Bath毕业证书)】【微信号:BYZS866】《成绩单、外壳、雅思、offer、真实留信官方学历认证（永久存档/真实可查）》采用学校原版纸张、特殊工艺完全按照原版一比一制作（包括：隐形水印，阴影底纹，钢印LOGO烫金烫银，LOGO烫金烫银复合重叠，文字图案浮雕，激光镭射，紫外荧光，温感，复印防伪）行业标杆！精益求精，诚心合作，真诚制作！多年品质 ,按需精细制作，24小时接单,全套进口原装设备，十五年致力于帮助留学生解决难题，业务范围有加拿大、英国、澳洲、韩国、美国、新加坡，新西兰等学历材料，包您满意。【关于学历材料质量】我们承诺采用的是学校原版纸张（原版纸质、底色、纹路、）我们工厂拥有全套进口原装设备，特殊工艺都是采用不同机器制作，仿真度基本可以达到100%，所有成品以及工艺效果都可提前给客户展示，不满意可以根据客户要求进行调整，直到满意为止！【业务选择办理准则】一、工作未确定，回国需先给父母、亲戚朋友看下文凭的情况，办理一份就读学校的毕业证【微信号BYZS866】文凭即可二、回国进私企、外企、自己做生意的情况，这些单位是不查询毕业证真伪的，而且国内没有渠道去查询国外文凭的真假，也不需要提供真实教育部认证。鉴于此，办理一份毕业证【微信号BYZS866】即可三、进国企，银行，事业单位，考公务员等等，这些单位是必需要提供真实教育部认证的，办理教育部认证所需资料众多且烦琐，所有材料您都必须提供原件，我们凭借丰富的经验，快捷的绿色通道帮您快速整合材料，让您少走弯路。留信网认证的作用: 1:该专业认证可证明留学生真实身份 2:同时对留学生所学专业登记给予评定 3:国家专业人才认证中心颁发入库证书 4:这个认证书并且可以归档倒地方 5:凡事获得留信网入网的信息将会逐步更新到个人身份内，将在公安局网内查询个人身份证信息后，同步读取人才网入库信息 6:个人职称评审加20分 7:个人信誉贷款加10分 8:在国家人才网主办的国家网络招聘大会中纳入资料，供国家高端企业选择人才留信网服务项目： 1、留学生专业人才库服务（留信分析） 2、国（境）学习人员提供就业推荐信服务 3、留学人员区块链存储服务【关于价格问题（保证一手价格）】我们所定的价格是非常合理的，而且我们现在做得单子大多数都是代理和回头客户介绍的所以一般现在有新的单子我给客户的都是第一手的代理价格，因为我想坦诚对待大家不想跟大家在价格方面浪费时间对于老客户或者被老客户介绍过来的朋友，我们都会适当给一些优惠。选择实体注册公司办理，更放心，更安全！我们的承诺：客户在留信官方认证查询网站查询到认证通过结果后付款，不成功不收费！

06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM

Timothy Spann

06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM by Timothy Spann Principal Developer Advocate https://budapestdata.hu/2024/en/ https://budapestml.hu/2024/en/ tim.spann@zilliz.com https://www.linkedin.com/in/timothyspann/ https://x.com/paasdev https://github.com/tspannhw https://www.youtube.com/@flank-stack milvus vector database gen ai generative ai deep learning machine learning apache nifi apache pulsar apache kafka apache flink

Palo Alto Cortex XDR presentation .......

Sachin Paul

一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理

bopyb

毕业原版【微信:176555708】【(GWU,GW毕业证书)乔治·华盛顿大学毕业证】【微信:176555708】成绩单、外壳、offer、留信学历认证（永久存档真实可查）采用学校原版纸张、特殊工艺完全按照原版一比一制作（包括：隐形水印，阴影底纹，钢印LOGO烫金烫银，LOGO烫金烫银复合重叠，文字图案浮雕，激光镭射，紫外荧光，温感，复印防伪）行业标杆！精益求精，诚心合作，真诚制作！多年品质 ,按需精细制作，24小时接单,全套进口原装设备，十五年致力于帮助留学生解决难题，业务范围有加拿大、英国、澳洲、韩国、美国、新加坡，新西兰等学历材料，包您满意。【我们承诺采用的是学校原版纸张（纸质、底色、纹路），我们拥有全套进口原装设备，特殊工艺都是采用不同机器制作，仿真度基本可以达到100%，所有工艺效果都可提前给客户展示，不满意可以根据客户要求进行调整，直到满意为止！】【业务选择办理准则】一、工作未确定，回国需先给父母、亲戚朋友看下文凭的情况，办理一份就读学校的毕业证【微信176555708】文凭即可二、回国进私企、外企、自己做生意的情况，这些单位是不查询毕业证真伪的，而且国内没有渠道去查询国外文凭的真假，也不需要提供真实教育部认证。鉴于此，办理一份毕业证【微信176555708】即可三、进国企，银行，事业单位，考公务员等等，这些单位是必需要提供真实教育部认证的，办理教育部认证所需资料众多且烦琐，所有材料您都必须提供原件，我们凭借丰富的经验，快捷的绿色通道帮您快速整合材料，让您少走弯路。留信网认证的作用: 1:该专业认证可证明留学生真实身份 2:同时对留学生所学专业登记给予评定 3:国家专业人才认证中心颁发入库证书 4:这个认证书并且可以归档倒地方 5:凡事获得留信网入网的信息将会逐步更新到个人身份内，将在公安局网内查询个人身份证信息后，同步读取人才网入库信息 6:个人职称评审加20分 7:个人信誉贷款加10分 8:在国家人才网主办的国家网络招聘大会中纳入资料，供国家高端企业选择人才留信网服务项目： 1、留学生专业人才库服务（留信分析） 2、国（境）学习人员提供就业推荐信服务 3、留学人员区块链存储服务 → 【关于价格问题（保证一手价格）】我们所定的价格是非常合理的，而且我们现在做得单子大多数都是代理和回头客户介绍的所以一般现在有新的单子我给客户的都是第一手的代理价格，因为我想坦诚对待大家不想跟大家在价格方面浪费时间对于老客户或者被老客户介绍过来的朋友，我们都会适当给一些优惠。选择实体注册公司办理，更放心，更安全！我们的承诺：客户在留信官方认证查询网站查询到认证通过结果后付款，不成功不收费！

Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...

Aggregage

办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样

apvysm8

原版一模一样【微信：741003700 】【(uts毕业证书)悉尼科技大学毕业证学历证书】【微信：741003700 】学位证，留信认证（真实可查，永久存档）offer、雅思、外壳等材料/诚信可靠,可直接看成品样本，帮您解决无法毕业带来的各种难题！外壳，原版制作，诚信可靠，可直接看成品样本。行业标杆！精益求精，诚心合作，真诚制作！多年品质 ,按需精细制作，24小时接单,全套进口原装设备。十五年致力于帮助留学生解决难题，包您满意。本公司拥有海外各大学样板无数，能完美还原海外各大学 Bachelor Diploma degree, Master Degree Diploma 1:1完美还原海外各大学毕业材料上的工艺：水印，阴影底纹，钢印LOGO烫金烫银，LOGO烫金烫银复合重叠。文字图案浮雕、激光镭射、紫外荧光、温感、复印防伪等防伪工艺。材料咨询办理、认证咨询办理请加学历顾问Q/微741003700 留信网认证的作用: 1:该专业认证可证明留学生真实身份 2:同时对留学生所学专业登记给予评定 3:国家专业人才认证中心颁发入库证书 4:这个认证书并且可以归档倒地方 5:凡事获得留信网入网的信息将会逐步更新到个人身份内，将在公安局网内查询个人身份证信息后，同步读取人才网入库信息 6:个人职称评审加20分 7:个人信誉贷款加10分 8:在国家人才网主办的国家网络招聘大会中纳入资料，供国家高端企业选择人才

STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...

sameer shah

DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx

SaffaIbrahim1

End-to-end pipeline agility - Berlin Buzzwords 2024

Lars Albertsson

We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines. A quick poll on agility in changing pipelines from end to end indicated a huge span in capabilities. For the question "How long time does it take for all downstream pipelines to be adapted to an upstream change," the median response was 6 months, but some respondents could do it in less than a day. When quantitative data engineering differences between the best and worst are measured, the span is often 100x-1000x, sometimes even more. A long time ago, we suffered at Spotify from fear of changing pipelines due to not knowing what the impact might be downstream. We made plans for a technical solution to test pipelines end-to-end to mitigate that fear, but the effort failed for cultural reasons. We eventually solved this challenge, but in a different context. In this presentation we will describe how we test full pipelines effectively by manipulating workflow orchestration, which enables us to make changes in pipelines without fear of breaking downstream. Making schema changes that affect many jobs also involves a lot of toil and boilerplate. Using schema-on-read mitigates some of it, but has drawbacks since it makes it more difficult to detect errors early. We will describe how we have rejected this tradeoff by applying schema metaprogramming, eliminating boilerplate but keeping the protection of static typing, thereby further improving agility to quickly modify data pipelines without fear.

Recently uploaded (20)

原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样

Predictably Improve Your B2B Tech Company's Performance by Leveraging Data

Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...

一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理

一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理

一比一原版(CU毕业证)卡尔顿大学毕业证如何办理

writing report business partner b1+ .pdf

在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样

Open Source Contributions to Postgres: The Basics POSETTE 2024

Challenges of Nation Building-1.pptx with more important

Experts live - Improving user adoption with AI

一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理

06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM

Palo Alto Cortex XDR presentation .......

一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理

Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...

办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样

STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...

DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx

End-to-end pipeline agility - Berlin Buzzwords 2024

Sputnik: Airbnb’s Apache Spark Framework for Data Engineering

2. Sputnik: Airbnb’s Apache Spark framework for data engineering Egor Pakhomov Software Engineer,

3. Typical Spark job ds=2010-01-01 ds=2010-01-02 ds=2010-01-03 … ds=2010-01-01 ds=2010-01-02 … …Spark Application

4. Typical Spark job

5. https://ogirardot.files.wordpress.com/2015/05/future-of-spark.png DataFrame API Spark Core PackagesGraphXMLlib Spark Streaming Spark SQL DataSource API

6. https://ogirardot.files.wordpress.com/2015/05/future-of-spark.png DataFrame API Spark Core PackagesGraphXMLlib Spark Streaming Spark SQL DataSource API

7. Data engineer should write only

8. Job logic vs Run Logic ▪ Job logic ▪ Job does some business logic (for example counting unique visits for every url) ▪ Job specifies ▪ Input tables and output tables ▪ Partitioning schema ▪ Validation for result data ▪ Run logic ▪ Running job for specific date retrieves input only for that date from input table ▪ Job tries to write to table, which does not exists, so we need to create the table ▪ Job runs in testing mode, so all result tables are created with “_testing” suffix

9. Job logic vs Run Logic Console parameters --ds : 2020-01-01 --writeEnv : DEV Sputnik TableReader TableWriter Hive Input Table Result Table Sputnik job Get Data Business logic Write Result

10. Sputnik job

11. Running the Sputnik job

12. Writing data

13. Writing data What Sputnik HiveTableWriter does: ▪ creates table with “CREATE TABLE” hive statement, if table does not exist ▪ updates table metainformation ▪ normalize dataframe schema according to output Hive table ▪ repartitions and tries to reduce number of result files on disk ▪ runs the checks on result, before writing it ▪ changes output table name (staging/testing mode)

14. Reading data Reading dataframe Reading dataset

15. Testing ▪ Singleton Spark Session ▪ DataFrame comparison ▪ Loading data from csv/json ▪ Cleaning hive between runs

16. Managing configuration Job: Config:

17. Checks on output

18. Backfilling tHive Table 2020-01-01 2020-01-02 2020-01-03 2020-01-04 2020-01-05 2020-01-06 2020-01-07 2020-01-08 Daily job --ds 2020-01-08 Daily job --ds 2020-01-07

19. Backfilling tHive Table 2020-01-01 2020-01-02 2020-01-03 2020-01-04 2020-01-05 2020-01-06 2020-01-07 2020-01-08 Daily job --ds 2020-01-08 Daily job --ds 2020-01-07 Backfill job --startDate 2020-01-01 --endDate 2020-01-06

20. Backfilling tHive Table 2020-01-01 2020-01-02 2020-01-03 2020-01-04 2020-01-05 2020-01-06 2020-01-07 2020-01-08 Daily job --ds 2020-01-08 Daily job --ds 2020-01-07 Backfill job --startDate 2020-01-01 --endDate 2020-01-06 --stepSize 3

21. Managing environments database.input_table database.output_table Sputnik Production environment database.input_table database.output_table_dev Sputnik Developer environment

22. Managing environments --writeEnv PROD --writeEnv DEV --writeEnv STAGE database.input_table database.input_table_dev database.input_table_staging

23. Other parameters

24. https://github.com/airbnb/sputnik egor.pakhomov@airbnb.com

Sputnik: Airbnb’s Apache Spark Framework for Data Engineering

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Sputnik: Airbnb’s Apache Spark Framework for Data Engineering

Similar to Sputnik: Airbnb’s Apache Spark Framework for Data Engineering (20)

More from Databricks

More from Databricks (20)

Recently uploaded

Recently uploaded (20)

Sputnik: Airbnb’s Apache Spark Framework for Data Engineering