From Python Scikit-learn to Scala Apache Spark—The Road to Uncovering Botnets with Avi Aminov

The landscape of security threats an enterprise faces is vast. It is imperative for an organization to know when one of the machines within its network has been compromised. One layer of detection can take advantage of the DNS requests made by machines within the network. A request to a Command & Control (CNC) domain can act as an indication of compromise. It is thus advisable to find these domains before they come into play. The team at Akamai aims to do just that. In this session, Aminov will share Akamai’s experience in porting their PoC detection algorithms, written in Python, to a reliable production-level implementation using Scala and Apache Spark. He will specifically cover their experience with an algorithm they developed to detect botnet domains based on passive DNS data. The session will also include some useful insights Akamai has gained while handing off solutions from research to development, including the transition from small-scale to large-scale data consumption, model export/import using PMML, and sampling techniques. This information is valuable for researchers and developers alike.

The Road to Uncovering Botnets
From Python Scikit-Learn to Scala Spark

whoami
• Avi Aminov
– ~2 years Security Researcher at Akamai
– Physics PhD student
• Asaf Nadler
– ~1.5 years Security Researcher at Akamai
– CS PhD student

Enterprise Threat Protection
• Detect malware presence from outbound traffic
– Behavioral pattern analysis
– Domain blacklisting
• Availability – End of June ’17
[Diagram labels: Branch / HQ; Enterprise DNS; Akamai Recursive DNS]

Data
• Akamai Data
– 20-30% of internet traffic
– Customer ISP/Enterprise logs – 20B DNS queries/day
• Third party data
– e.g. Authoritative DNS log lines
• Open data sources
– e.g. WHOIS information

Bot Networks – IP Fluxing
• Goal – Evasion
– Regular bots: waiting for orders
– Proxies: concealing origin server
[Diagram labels: Command & Control server; Bots; Proxy Bots]

Bot Networks Detection
• Detect illegitimate IP fluxing
• Features
– IP dispersity (Geo, systems)
– TTL features
– Lexical
Domain              Description     #Systems   #Countries
astro-travels.net   PoS CNC Host    157        11
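
The IP dispersity features above can be illustrated with a minimal sketch. It is not the speakers' implementation: the record layout, the function name `dispersity_features`, and all sample values are hypothetical, and it assumes that "systems" means the network (for example an autonomous system) hosting each IP.

```python
from collections import defaultdict

# Hypothetical passive-DNS records: (domain, resolved IP, hosting system, country).
# All values are invented for illustration (documentation IP ranges, private-use AS numbers).
RECORDS = [
    ("flux.example", "192.0.2.1", "AS64500", "US"),
    ("flux.example", "198.51.100.7", "AS64501", "BR"),
    ("flux.example", "203.0.113.9", "AS64502", "RU"),
    ("static.example", "192.0.2.10", "AS64500", "US"),
    ("static.example", "192.0.2.11", "AS64500", "US"),
]


def dispersity_features(records):
    """Count the distinct IPs, systems and countries seen for each domain."""
    ips = defaultdict(set)
    systems = defaultdict(set)
    countries = defaultdict(set)
    for domain, ip, system, country in records:
        ips[domain].add(ip)
        systems[domain].add(system)
        countries[domain].add(country)
    return {
        domain: {
            "n_ips": len(ips[domain]),
            "n_systems": len(systems[domain]),
            "n_countries": len(countries[domain]),
        }
        for domain in ips
    }


features = dispersity_features(RECORDS)
for domain, values in sorted(features.items()):
    print(domain, values)
```

In this toy data the fluxing domain is spread over three systems and three countries, while the static domain stays inside one system, which is the contrast the feature set is meant to capture.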

Decision Tree Model
Malicious with high confidence
• Spread across systems
• Unpopular
Benign with high confidence
• IPs in the same system
• Contains meaningful words
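
The two high-confidence leaves described above can be written out as plain rules. This is only a hand-written stand-in for a learned tree: the thresholds `min_systems` and `max_popularity`, the feature names, and the "uncertain" fallback are all assumptions made for illustration, not values from the talk.

```python
def classify(features, min_systems=10, max_popularity=100):
    """Toy stand-in for a learned decision tree (thresholds are hypothetical).

    features is a dict with:
      n_systems            - number of distinct systems hosting the domain's IPs
      popularity           - hypothetical count of clients querying the domain
      has_meaningful_words - True if the domain name contains dictionary words
    """
    # Leaf 1: spread across many systems and unpopular.
    if features["n_systems"] >= min_systems and features["popularity"] <= max_popularity:
        return "malicious"
    # Leaf 2: all IPs in the same system and a readable name.
    if features["n_systems"] == 1 and features["has_meaningful_words"]:
        return "benign"
    # Everything else is left undecided in this sketch.
    return "uncertain"


spread_out = {"n_systems": 157, "popularity": 12, "has_meaningful_words": True}
single_system = {"n_systems": 1, "popularity": 50000, "has_meaningful_words": True}
print(classify(spread_out))
print(classify(single_system))
```

A trained model would learn the split points from labelled data; the point here is only how dispersity, popularity and lexical features combine along a path of the tree.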

Challenge – Going to Production
[Diagram labels: Data Sources; Feature Extraction; Scoring; Blacklist; Feature Extraction; Model Training; Model; Model Evaluation]

What have we done so far?
• Flow
– Researcher describes an algorithm (document + Hive query)
– Dev rewrites the code in MapReduce (now Scala/Spark)
• Problems
– Not applicable to ML pipelines
– Prone to mistakes
– Longer development cycle

Can We Do Better? Option #1
• Research side – Pipeline in Scala/Spark
• Dev side – Implement the algorithms
• Pros
– Greater flexibility
– Research scale
• Cons
– Learning curve
– Lose sklearn/R benefits

Can We Do Better? Option #2
• Research side – Train locally and export model
• Dev side – Transform data using imported model
• Pros
– Quick implementation
– Unified procedure
• Cons
– No support for all models

Export scheme
• Predictive Model Markup Language
• General scheme for ML pipelines
– Data transformations
– Scoring models
• XML format – Readable
• Supported by major data science / ML frameworks using jPMML (R, sklearn)
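
To make the "XML format – Readable" point concrete, here is a small hand-written PMML document with a one-split tree, read back with the Python standard library. The field name `n_systems`, the threshold, and the scorer are assumptions for illustration. The scorer handles only the tiny subset of PMML used here and is not a substitute for a real PMML engine such as the jpmml library mentioned in the talk.

```python
import operator
import xml.etree.ElementTree as ET

# Hand-written PMML 4.3 tree with a single split on a hypothetical feature.
PMML_TEXT = """
<PMML xmlns="http://www.dmg.org/PMML-4_3" version="4.3">
  <Header/>
  <DataDictionary>
    <DataField name="n_systems" optype="continuous" dataType="double"/>
    <DataField name="label" optype="categorical" dataType="string"/>
  </DataDictionary>
  <TreeModel functionName="classification">
    <MiningSchema>
      <MiningField name="label" usageType="target"/>
      <MiningField name="n_systems"/>
    </MiningSchema>
    <Node score="benign">
      <True/>
      <Node score="malicious">
        <SimplePredicate field="n_systems" operator="greaterThan" value="10"/>
      </Node>
      <Node score="benign">
        <SimplePredicate field="n_systems" operator="lessOrEqual" value="10"/>
      </Node>
    </Node>
  </TreeModel>
</PMML>
"""

NS = {"p": "http://www.dmg.org/PMML-4_3"}
OPS = {
    "lessThan": operator.lt,
    "lessOrEqual": operator.le,
    "greaterThan": operator.gt,
    "greaterOrEqual": operator.ge,
    "equal": operator.eq,
}


def matches(node, row):
    """Evaluate the node's predicate; only <True/> and <SimplePredicate> are supported."""
    if node.find("p:True", NS) is not None:
        return True
    pred = node.find("p:SimplePredicate", NS)
    if pred is None:
        raise ValueError("unsupported predicate in this sketch")
    compare = OPS[pred.get("operator")]
    return compare(row[pred.get("field")], float(pred.get("value")))


def score(node, row):
    """Descend into the first matching child; return the score of the last node reached."""
    for child in node.findall("p:Node", NS):
        if matches(child, row):
            return score(child, row)
    return node.get("score")


root_node = ET.fromstring(PMML_TEXT).find("p:TreeModel/p:Node", NS)
print(score(root_node, {"n_systems": 157.0}))
print(score(root_node, {"n_systems": 1.0}))
```

Because the model is plain XML, the research side can export it once and the production side can load it without sharing any Python code, which is the hand-off the slide describes.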

PMML Simple Boilerplate
Python (Research side) Scala (Dev side)
Credit: jpmml lib, http://openscoring.io/, https://github.com/jpmml/
Maintained by Villu Ruusmann

Lessons Learned
• Work process adjusted to the task
– Training locally? Export the model
– Training on larger scales? Better to use Spark
• Use jpmml for model export
• When applicable, reduce workload in production
– Example – only look at domains with many IPs
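
The workload-reduction example above can be sketched as a cheap pre-filter that runs before feature extraction and scoring. The function name, the input layout, and the threshold `min_ips=3` are hypothetical; the talk does not state what cut-off was used.

```python
from collections import defaultdict


def domains_with_many_ips(resolutions, min_ips=3):
    """Return the domains observed with at least min_ips distinct IPs."""
    ips = defaultdict(set)
    for domain, ip in resolutions:
        ips[domain].add(ip)
    return sorted(d for d, seen in ips.items() if len(seen) >= min_ips)


# Hypothetical (domain, resolved IP) pairs; repeated pairs count once.
RESOLUTIONS = [
    ("flux.example", "192.0.2.1"),
    ("flux.example", "198.51.100.7"),
    ("flux.example", "203.0.113.9"),
    ("flux.example", "192.0.2.1"),
    ("static.example", "192.0.2.10"),
]

shortlist = domains_with_many_ips(RESOLUTIONS)
print(shortlist)
```

Only the shortlisted domains would then go through the more expensive feature extraction and model scoring steps.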

Challenge solved
[Diagram labels: Data Sources; Data Collection; Feature Extraction; Scoring; Blacklist; Model Training; Model; Model Evaluation; PMML]


Strata NY 2017 Parquet Arrow roadmap

Julien Le Dem

This document summarizes Apache Parquet and Apache Arrow, two open source projects for columnar data formats. It discusses how Parquet provides an on-disk columnar format for storage while Arrow provides an in-memory columnar format. The document outlines how Arrow builds on the success of Parquet by providing a common in-memory format that avoids serialization overhead and allows systems to share functionality. It provides examples of performance gains from the vertical integration of Parquet and Arrow.

Internals of Presto Service

Treasure Data, Inc.

Presto is a distributed SQL query engine that Treasure Data provides as a service. Taro Saito discussed the internals of the Presto service at Treasure Data, including how the TD Presto connector optimizes scan performance from storage systems and how the service manages multi-tenancy and resource allocation for customers. Key challenges in providing a database as a service were also covered, such as balancing cost and performance.

Apache Con 2021 Structured Data Streaming

Shivji Kumar Jha

Type safety is extremely important in any application built around a stream / queue. Type definition and evolution can either be built in the application or relied upon the data layer to support it out of the box allowing the application to only concentrate on business logic, not how of data store and evolution. It is this property of the good old relational databases (among others) that make them a favourite among all the modern NoSQL databases. Modern software architectures requires asynchronous communication (via stream / queue). While the data store and query design changes with asynchronous communication, type safety is still equally important. In this slide deck, used for Apache Con 2021 talk, we will go over ways in which one can force structure (schema) over the streaming data. As an example, we will talk about Apache Pulsar. Apache pulsar offers server as well as client side support for the structured streaming. We have been using pulsar for asynchronous communication among microservices in our nutanix beam and flow security central apps for over 1.5 years in production. This deck presents the technical details on what is schema, how to represent schema, what is available in the apache pulsar server and client side, how we have used pulsar’s schema support to build our use cases and our learnings from them.

PinTrace Advanced AWS meetup

Suman Karumuri

The document discusses distributed tracing at Pinterest. It provides an overview of distributed tracing, describes the motivation and architecture of Pinterest's tracing system called PinTrace, and discusses challenges faced and lessons learned. PinTrace collects trace data from services using instrumentation and sends it to a collector via a Kafka pipeline. This allows PinTrace to provide insights into request flows and performance bottlenecks across Pinterest's microservices. Key challenges included ensuring data quality, scaling the infrastructure, and user education on tracing.

Apache Spark sql

aftab alam

Spark SQL provides relational data processing capabilities in Spark. It introduces a DataFrame API that allows both relational operations on external data sources and Spark's built-in distributed collections. The Catalyst optimizer improves performance by applying database query optimization techniques. It is highly extensible, making it easy to add data sources, optimization rules, and data types for domains like machine learning. Spark SQL evaluation shows it outperforms alternative systems on both SQL query processing and Spark program workloads involving large datasets.

Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014

Chris Fregly

Spark Streaming allows for processing of real-time data streams using Spark. The document discusses using Spark Streaming with Amazon Kinesis for streaming data ingestion. It covers the Spark Streaming and Kinesis integration architecture, how the Spark Kinesis receiver works, scaling considerations, and fault tolerance mechanisms through checkpointing. Examples of monitoring and tuning Spark Streaming jobs on Kinesis data are also provided.

Solving Real Problems with Apache Spark: Archiving, E-Discovery, and Supervis...

Spark Summit

Today there are several compliance use cases — archiving, e-discovery, supervision + surveillance, to name a few — that appear naturally suited as Hadoop workloads but haven’t seen wide adoption. In this talk, we’ll discuss common limitations, how Apache Spark helps, and propose some new blueprints as to how to modernize this architecture and disrupt existing solutions. Additionally, we’ll discuss the rising role of Apache Spark in this ecosystem; leveraging machine learning and advanced analytics in a space that has traditionally been restricted to fairly rote reporting.

I Heart Log: Real-time Data and Apache Kafka

Jay Kreps

SOHOpelessly Broken

The Security of Things Forum

Swt

Ngoc Anh

This document outlines several open source semantic web tools, including ontology editors (Protégé, SWeDE), APIs for working with ontologies (Jena, Drive, cwm), persistence and query engines (Sesame, Kowari), and rule frameworks (SweetRules). It provides brief descriptions and links to tool homepages, with a focus on introductions, discriminators, and demonstrations of the capabilities of each tool.

Simulating the behavior of satellite Internet links to small islands

APNIC

This document summarizes a talk about simulating satellite internet links to small islands using a hardware-based simulation. The simulation aims to demonstrate how coding and performance enhancing proxies impact link utilization and packet loss. It consists of configuring the simulated satellite link parameters, running background traffic from servers to clients to generate demand, capturing traffic on both ends, and measuring the impact of coding and proxies on large file transfers and ping times. Preliminary results show that medium earth orbit links have higher goodput than geostationary links under high load, and that performance enhancing proxies help large file transfers without significantly impacting overall throughput. Future work will explore forward error correction coding and balancing redundancy with spare capacity.

Archiving, E-Discovery, and Supervision with Spark and Hadoop with Jordan Volz

Databricks

This document discusses using Hadoop for archiving, e-discovery, and supervision. It outlines the key components of each task and highlights traditional shortcomings. Hadoop provides strengths like speed, ease of use, and security. An architectural overview shows how Hadoop can be used for ingestion, processing, analysis, and machine learning. Examples demonstrate surveillance use cases. While some obstacles remain, partners can help address areas like user interfaces and compliance storage.

From Python Scikit-learn to Scala Apache Spark—The Road to Uncovering Botnets with Avi Aminov

1. The Road to Uncovering Botnets: From Python Scikit-Learn to Scala Spark

2. whoami
• Avi Aminov
– ~2 years Security Researcher at Akamai
– Physics PhD student
• Asaf Nadler
– ~1.5 years Security Researcher at Akamai
– CS PhD student

3. Enterprise Threat Protection
• Detect malware presence from outbound traffic
– Behavioral pattern analysis
– Domain blacklisting
• Availability – End of June ’17
[Diagram labels: Akamai Recursive DNS, Branch / HQ, Enterprise DNS]

4. Data
• Akamai Data
– 20-30% of internet traffic
– Customer ISP/Enterprise logs – 20B DNS queries/day
• Third party data
– e.g. Authoritative DNS log lines
• Open data sources
– e.g. WHOIS information

5. Bot Networks – IP Fluxing
• Goal – Evasion
– Regular bots: waiting for orders
– Proxies: concealing origin server
[Diagram labels: Command & Control server, Bots, Proxy Bots]

6. Bot Networks Detection
• Detect illegitimate IP fluxing
• Features
– IP dispersity (Geo, systems)
– TTL features
– Lexical
Domain | Description | #Systems | #Countries
astro-travels.net | PoS CNC Host | 157 | 11
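
The IP dispersity features listed on this slide can be sketched as simple distinct counts per domain. This is only an illustrative sketch, not the speakers' implementation: the record layout, the field names, and the sample values (documentation IP ranges and private-use AS numbers) are assumptions, and "system" is presumed here to mean the autonomous system that hosts an IP.

```python
from collections import defaultdict

# Hypothetical passive DNS observations: (domain, resolved IP, system, country).
# All values are made up for illustration.
records = [
    ("example-flux.test", "192.0.2.1", "AS64500", "US"),
    ("example-flux.test", "198.51.100.7", "AS64501", "DE"),
    ("example-flux.test", "203.0.113.9", "AS64502", "BR"),
    ("example-stable.test", "192.0.2.10", "AS64500", "US"),
    ("example-stable.test", "192.0.2.11", "AS64500", "US"),
]


def dispersity_features(rows):
    """Count the distinct IPs, systems and countries seen for each domain."""
    ips = defaultdict(set)
    systems = defaultdict(set)
    countries = defaultdict(set)
    for domain, ip, system, country in rows:
        ips[domain].add(ip)
        systems[domain].add(system)
        countries[domain].add(country)
    return {
        domain: {
            "n_ips": len(ips[domain]),
            "n_systems": len(systems[domain]),
            "n_countries": len(countries[domain]),
        }
        for domain in ips
    }


features = dispersity_features(records)
print(features["example-flux.test"])
print(features["example-stable.test"])
```

A domain whose IPs are spread over many systems and countries scores high on these counts, which is the kind of signal the #Systems and #Countries columns above describe.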

7. Decision Tree Model
Malicious with high confidence
• Spread across systems
• Unpopular
Benign with high confidence
• IPs in the same system
• Contains meaningful words

8. Challenge – Going to Production
[Pipeline diagram labels: Feature Extraction, Scoring, Blacklist, Feature Extraction, Model Training, Model, Model Evaluation, Data Sources]

9. What have we done so far?
• Flow
– Researcher describes an algorithm (document + Hive query)
– Dev rewrites the code in MapReduce (now Scala/Spark)
• Problems
– Not applicable to ML pipelines
– Prone to mistakes
– Longer development cycle

10. Can We Do Better? Option #1
• Research side
– Pipeline in Scala/Spark
• Dev side
– Implement the algorithms
• Pros
– Greater flexibility
– Research scale
• Cons
– Learning curve
– Lose sklearn/R benefits

11. Can We Do Better? Option #2
• Research side
– Train locally and export model
• Dev side
– Transform data using imported model
• Pros
– Quick implementation
– Unified procedure
• Cons
– No support for all models

12. Export scheme
• Predictive Model Markup Language
• General scheme for ML pipelines
– Data transformations
– Scoring models
• XML format
– Readable
• Supported by major data science / ML frameworks using jPMML (R, sklearn)

13. PMML Simple Boilerplate
[Code panels: Python (Research side), Scala (Dev side)]
Credit: jpmml lib http://openscoring.io/ , https://github.com/jpmml/
Maintained by Villu Ruusmann
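
To illustrate why a readable XML model file makes the research-to-dev hand-off straightforward, here is a minimal sketch that scores one row against a tiny hand-written PMML decision tree using only the Python standard library. It is a toy stand-in, not the jpmml-based boilerplate from the talk: the model document, the `n_systems` field, the threshold of 5, and the helper names are all assumptions, and only two predicate operators are handled.

```python
import xml.etree.ElementTree as ET

NS = {"pmml": "http://www.dmg.org/PMML-4_3"}

# Hand-written toy model; the field name and the threshold are hypothetical.
PMML_DOC = """<PMML xmlns="http://www.dmg.org/PMML-4_3" version="4.3">
  <Header description="toy example"/>
  <DataDictionary>
    <DataField name="n_systems" optype="continuous" dataType="double"/>
    <DataField name="label" optype="categorical" dataType="string"/>
  </DataDictionary>
  <TreeModel functionName="classification">
    <MiningSchema>
      <MiningField name="n_systems"/>
      <MiningField name="label" usageType="target"/>
    </MiningSchema>
    <Node id="0">
      <True/>
      <Node id="1" score="benign">
        <SimplePredicate field="n_systems" operator="lessOrEqual" value="5"/>
      </Node>
      <Node id="2" score="malicious">
        <SimplePredicate field="n_systems" operator="greaterThan" value="5"/>
      </Node>
    </Node>
  </TreeModel>
</PMML>"""

# Only the two operators used by the toy model are supported.
OPERATORS = {
    "lessOrEqual": lambda a, b: a <= b,
    "greaterThan": lambda a, b: a > b,
}


def matches(node, row):
    """Return True if the node's predicate holds for the given row."""
    if node.find("pmml:True", NS) is not None:
        return True
    predicate = node.find("pmml:SimplePredicate", NS)
    if predicate is None:
        return False
    compare = OPERATORS[predicate.get("operator")]
    return compare(float(row[predicate.get("field")]), float(predicate.get("value")))


def score(node, row):
    """Descend into the first matching child; return the deepest node's score."""
    for child in node.findall("pmml:Node", NS):
        if matches(child, row):
            return score(child, row)
    return node.get("score")


root = ET.fromstring(PMML_DOC)
tree_root = root.find("pmml:TreeModel/pmml:Node", NS)
print(score(tree_root, {"n_systems": 157}))
print(score(tree_root, {"n_systems": 1}))
```

In practice the export and the scoring would be left to the jpmml libraries credited above; the point of the sketch is only that the exported model is a plain document that either side can read.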

14. Lessons Learned
• Work process adjusted to the task
– Training locally? Export the model
– Training on larger scales? Better to use Spark
• Use jpmml for model export
• When applicable, reduce workload in production
– Example – only look at domains with many IPs
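
The last lesson, looking only at domains with many IPs, amounts to a cheap pre-filter that runs before feature extraction and scoring. A minimal sketch follows, assuming observations arrive as (domain, IP) pairs; the cut-off value and the function name are hypothetical, since the talk does not state them.

```python
from collections import defaultdict

MIN_DISTINCT_IPS = 3  # hypothetical cut-off; the talk does not state one


def domains_with_many_ips(rows, min_ips=MIN_DISTINCT_IPS):
    """Return the domains that resolved to at least `min_ips` distinct IPs."""
    ips_by_domain = defaultdict(set)
    for domain, ip in rows:
        ips_by_domain[domain].add(ip)
    return sorted(d for d, ips in ips_by_domain.items() if len(ips) >= min_ips)


# Made-up observations using documentation IP ranges.
observations = [
    ("example-flux.test", "192.0.2.1"),
    ("example-flux.test", "198.51.100.7"),
    ("example-flux.test", "203.0.113.9"),
    ("example-stable.test", "192.0.2.10"),
    ("example-stable.test", "192.0.2.10"),
]

candidates = domains_with_many_ips(observations)
print(candidates)
```

Only the surviving candidates would then go through the more expensive steps, which is where the reduction in production workload comes from.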

15. Challenge solved
[Pipeline diagram labels: Feature Extraction, Scoring, Blacklist, Data Collection, Model Training, Model, Model Evaluation, Data Sources, PMML]

16. Thank you! @AviBachsh
