Continuous integration (CI) pipelines generate massive amounts of messy log data. At Pure Storage engineering, we run over 65,000 tests per day, creating a large triage problem. Spark’s flexible computing platform lets us write a single application for both streaming and batch jobs to understand the state of our CI pipeline. Spark indexes log data for real-time reporting (streaming), uses machine learning for performance modeling and prediction (batch job), and re-indexes old data for newly encoded patterns (batch job). Previous work on mixed streaming and batch environments describes the options for persisting data and their trade-offs: 1) short interval buckets, which hurt batch performance; 2) long interval buckets, which increase micro-batch time windows; or 3) additional software running in the background to compact the short interval buckets, which adds complexity. This talk will go over how we use the filesystem metadata of our disaggregated compute and storage layers to write over half a million files of varied sizes per day, from 52 billion events, and still run efficient batch jobs without compaction, processing over 40 TB per hour. We will go over the challenges and best practices for achieving efficiency in this mixed-workload environment.
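
To make the mixed workload concrete, here is a minimal PySpark sketch of the two halves sharing one dataset: a streaming job that indexes incoming CI log events into hourly buckets, and a batch function that re-scans those buckets for a newly encoded pattern. The paths, schema, trigger interval, and the reindex_pattern helper are hypothetical placeholders for illustration, not the setup described in the talk.

```python
# Minimal sketch of one Spark application serving both streaming and batch jobs.
# All paths, the event schema, and the bucketing interval are assumptions.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("ci-log-triage").getOrCreate()

event_schema = StructType([
    StructField("ts", TimestampType()),
    StructField("test_id", StringType()),
    StructField("level", StringType()),
    StructField("message", StringType()),
])

# Streaming half: index raw CI log events into short time-interval buckets
# for near-real-time reporting.
events = (spark.readStream
          .schema(event_schema)
          .json("/ci/raw-logs"))

indexed = events.withColumn("bucket", F.date_format(F.col("ts"), "yyyy-MM-dd-HH"))

query = (indexed.writeStream
         .format("parquet")
         .option("path", "/ci/indexed")
         .option("checkpointLocation", "/ci/checkpoints/indexed")
         .partitionBy("bucket")
         .trigger(processingTime="1 minute")
         .start())

# Batch half: re-scan the same dataset, e.g. to re-index old data against a
# newly encoded error pattern, without a separate compaction step.
def reindex_pattern(pattern: str):
    return (spark.read.parquet("/ci/indexed")
            .filter(F.col("message").rlike(pattern))
            .groupBy("bucket", "test_id")
            .count())
```

Because both halves read and write the same bucketed layout, how those buckets land on the filesystem determines whether the batch scans stay efficient, which is the trade-off the talk examines.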