While struggling to choose among different computing and machine learning frameworks such as Spark, Dask, Scikit-learn, and TensorFlow for your ETL and machine learning projects, have you thought about unifying them into one ecosystem?
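As a concrete taste of the unified-interface idea, here is a minimal sketch using Fugue's transform(), assuming its documented API; the data and the add_flag function are illustrative, not from the talk:

```python
import pandas as pd
from fugue import transform
from pyspark.sql import SparkSession

# Framework-agnostic logic: plain pandas in, plain pandas out.
def add_flag(df: pd.DataFrame) -> pd.DataFrame:
    df["flag"] = df["value"] > 10
    return df

data = pd.DataFrame({"value": [5, 15, 25]})

# Run locally on pandas (no Spark involved).
local_result = transform(data, add_flag, schema="*,flag:bool")

# Run the identical function distributed on Spark by swapping the engine.
spark = SparkSession.builder.getOrCreate()
spark_result = transform(data, add_flag, schema="*,flag:bool", engine=spark)
spark_result.show()
```

The business logic stays framework-free; only the engine argument decides where it runs.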
Enabling Scalable Data Science Pipeline with MLflow at Thermo Fisher Scientific (Databricks)
Thermo Fisher Scientific has one of the most extensive product portfolios in the industry, ranging from reagents to capital instruments, across customers in biotechnology, pharmaceuticals, academia, and more.
The modern data customer wants data now. Batch workloads are not going anywhere, but at Scribd the future of our data platform requires more and more streaming data sets.
Cloud-Native Apache Spark Scheduling with YuniKorn Scheduler (Databricks)
Kubernetes is the most popular container orchestration system, designed natively for the cloud. At Lyft and Cloudera, we have both built next-generation, cloud-native infrastructure based on Kubernetes that supports various distributed workloads.
We want to present multiple anti-patterns that use Redis in unconventional ways to get the maximum out of Apache Spark. All examples presented have been tried and tested in production at scale at Adobe. The most common integration is spark-redis, which interfaces with Redis as a DataFrame backing store or as an upstream for Structured Streaming. We deviate from the common use cases to explore where Redis can plug gaps while scaling out high-throughput applications in Spark.
Niche 1: Long-Running Spark Batch Job – Dispatch New Jobs by Polling a Redis Queue
- Why? Custom queries on top of a table: we load the data once and query N times
- Why not Structured Streaming
- Working solution using Redis

Niche 2: Distributed Counters
- Problems with Spark accumulators
- Utilizing Redis hashes as distributed counters (see the sketch after this list)
- Precautions for retries and speculative execution
- Pipelining to improve performance
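As a rough illustration of Niche 2, here is a minimal sketch (not Adobe's code) of using a Redis hash as a distributed counter from Spark executors, with pipelining to cut round trips; the host, key, and column names are placeholders:

```python
import redis
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/data/events")  # placeholder input

def count_partition(rows):
    # Pre-aggregate locally so Redis sees one HINCRBY per field, not per row.
    local = {}
    for row in rows:
        local[row["status"]] = local.get(row["status"], 0) + 1
    r = redis.Redis(host="redis-host", port=6379)
    pipe = r.pipeline(transaction=False)  # pipelining: batch commands, one round trip
    for field, n in local.items():
        pipe.hincrby("counters:job-42", field, n)
    pipe.execute()
    # Caveat from the talk: retried or speculative tasks would increment twice;
    # guard with per-(partition, attempt) keys or disable speculation.

df.foreachPartition(count_partition)
print(redis.Redis(host="redis-host").hgetall("counters:job-42"))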
How We Optimize Spark SQL Jobs with Parallel and Sync IO (Databricks)
Although NVMe has become more and more popular in recent years, large numbers of HDDs are still widely used in super-large-scale big data clusters. In an EB-level data platform, IO (including decompression and decoding) contributes a large proportion of Spark jobs' cost. In other words, IO operations are worth optimizing.
At ByteDance, we have implemented a series of IO optimizations to improve performance, including parallel reads and asynchronous shuffle. First, we implemented file-level parallel reads to improve performance when there are many small files. Second, we designed row-group-level parallel reads to accelerate queries in big-file scenarios. Third, we implemented asynchronous spill to improve job performance. Besides these, we designed Parquet column families, which split a table into a few column families stored in separate Parquet files. Different column families can be read in parallel, so read performance is much higher than with the existing approach. In our practice, end-to-end performance improved by 5% to 30%.
In this talk, I will illustrate how we implement these features and how they accelerate Apache Spark jobs.
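ByteDance's changes live inside Spark's Parquet reader, but the small-files idea can be sketched in userspace: read many files concurrently instead of sequentially. A hypothetical illustration with PyArrow (paths and pool size are arbitrary):

```python
from concurrent.futures import ThreadPoolExecutor
import pyarrow as pa
import pyarrow.parquet as pq

files = [f"/data/part-{i:05d}.parquet" for i in range(200)]  # many small files

def read_one(path: str) -> pa.Table:
    return pq.read_table(path)  # each call is dominated by IO, so threads overlap well

# File-level parallelism: overlap the per-file IO instead of reading serially.
with ThreadPoolExecutor(max_workers=16) as pool:
    tables = list(pool.map(read_one, files))

combined = pa.concat_tables(tables)
print(combined.num_rows)
```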
Superworkflow of Graph Neural Networks with K8S and Fugue (Databricks)
This document introduces a superworkflow for running Node2Vec on graphs using the Fugue framework on Kubernetes. It describes the Node2Vec algorithm and the steps in the superworkflow, including graph creation and indexing, random walks, Word2Vec preprocessing, and embedding training. The superworkflow provides advantages such as parallelized steps and efficient resource usage through auto-persist and checkpointing. Benchmark results show the superworkflow reduces runtime significantly compared to Spark MLlib, for example reducing a 100M-node graph embedding from 6,800 CPU hours to 100 CPU hours and 16 GPU hours. Open source links for the Node2Vec on Fugue project are also provided.
Managing big data stored on ADLSgen2/Databricks may be challenging. Setting up security, moving or copying the data of Hive tables or their partitions may be very slow, especially when dealing with hundreds of thousands of files.
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update... (Databricks)
The convergence of big data technology towards the traditional database domain has become an industry trend. At present, open source big data processing engines such as Apache Spark, Apache Hadoop, and Apache Flink already support SQL interfaces, and SQL usage basically occupies a dominant position. Companies use the above open source software to build their own ETL frameworks and OLAP technology. However, OLTP remains a strong point of traditional databases, one of the main reasons being their support for ACID.
Productionizing Machine Learning with a Microservices Architecture (Databricks)
Deploying machine learning models from training to production requires companies to deal with the complexity of moving workloads through different pipelines and rewriting code from scratch.
A Collaborative Data Science Development Workflow (Databricks)
Collaborative data science workflows have several moving parts, and many organizations struggle with developing an efficient and scalable process. Our solution consists of data scientists individually building and testing Kedro pipelines and measuring performance using MLflow tracking. Once a strong solution is created, the candidate pipeline is trained on cloud-agnostic, GPU-enabled containers. If this pipeline is production-worthy, the resulting model is served to a production application through MLflow.
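A minimal sketch of the tracking half of such a workflow, assuming an sklearn-style model; the run, parameter, and metric names are illustrative:

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=500, n_features=10, random_state=0)

with mlflow.start_run(run_name="candidate-pipeline"):
    mlflow.log_param("n_estimators", 100)
    model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
    mlflow.log_metric("train_mse", mean_squared_error(y, model.predict(X)))
    # Log the artifact so a production application can later load and serve it.
    mlflow.sklearn.log_model(model, "model")
```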
Building a Streaming Microservice Architecture: with Apache Spark Structured ... (Databricks)
As we continue to push the boundaries of what is possible with respect to pipeline throughput and data serving tiers, new methodologies and techniques continue to emerge to handle larger and larger workloads.
Performant Streaming in Production: Preventing Common Pitfalls when Productio... (Databricks)
Running a stream in a development environment is relatively easy. However, some topics can cause serious issues in production when they are not addressed properly.
Magnet Shuffle Service: Push-based Shuffle at LinkedIn (Databricks)
The number of daily Apache Spark applications at LinkedIn has increased by 3X in the past year. The shuffle process alone, which is one of the most costly operators in batch computation, processes PBs of data and billions of blocks daily in our clusters. With such a rapid increase in Apache Spark workloads, we quickly realized that the shuffle process can become a severe bottleneck for both infrastructure scalability and workload efficiency. In our production clusters, we have observed both reliability issues due to shuffle fetch connection failures and efficiency issues due to the random reads of small shuffle blocks on HDDs.
To tackle those challenges and optimize shuffle performance in Apache Spark, we have developed Magnet shuffle service, a push-based shuffle mechanism that works natively with Apache Spark. Our paper on Magnet has been accepted by VLDB 2020. In this talk, we will introduce how push-based shuffle can drastically increase shuffle efficiency when compared with the existing pull-based shuffle. In addition, by combining push-based shuffle and pull-based shuffle, we show how Magnet shuffle service helps to harden shuffle infrastructure at LinkedIn scale by both reducing shuffle related failures and removing scaling bottlenecks. Furthermore, we will share our experiences of productionizing Magnet at LinkedIn to process close to 10 PB of daily shuffle data.
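Magnet was later upstreamed into Apache Spark 3.2 as push-based shuffle; assuming that release line (YARN plus the external shuffle service), enabling it looks roughly like this:

```python
from pyspark.sql import SparkSession

# Sketch only: push-based shuffle requires YARN and the external shuffle service,
# and the config names below come from the Spark 3.2+ upstreaming of Magnet.
spark = (
    SparkSession.builder
    .appName("push-based-shuffle-demo")
    .config("spark.shuffle.service.enabled", "true")
    .config("spark.shuffle.push.enabled", "true")
    .getOrCreate()
)
```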
On Improving Broadcast Joins in Apache Spark SQL (Databricks)
- The document discusses improving broadcast joins in Apache Spark SQL, which are more efficient than shuffle joins when the broadcasted data fits in memory.
- Experimenting with increasing the broadcast threshold showed that executor-side broadcasting performs better than driver-side broadcasting by avoiding data shuffling to the driver (see the sketch after this list).
- Comparing the cost models of shuffle joins and broadcast joins showed that shuffle joins perform better with more cores while broadcast joins perform better when the size difference between tables is larger.
- Applying these techniques to joins in Workday HR customer data pipelines showed that increasing the broadcast threshold did not always improve performance due to the presence of self-joins and outer joins.
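A minimal sketch of the two knobs discussed above, the broadcast threshold and an explicit broadcast hint; table and column names are illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

# Raise the auto-broadcast threshold to 100 MB (the default is 10 MB).
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(100 * 1024 * 1024))

workers = spark.table("hr_workers")          # large fact table (illustrative)
departments = spark.table("hr_departments")  # small dimension table (illustrative)

# Or force the plan explicitly: broadcast the small side regardless of statistics.
joined = workers.join(broadcast(departments), "dept_id")
joined.explain()
```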
Understanding and Improving Code Generation (Databricks)
Code generation is integral to Spark’s physical execution engine. When implemented, the Spark engine creates optimized bytecode at runtime improving performance when compared to interpreted execution. Spark has taken the next step with whole-stage codegen which collapses an entire query into a single function.
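To see whole-stage codegen at work, Spark can print the generated Java for a plan; a quick sketch (the mode argument requires Spark 3.0+):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(1_000_000).selectExpr("id", "id * 2 AS doubled").filter("doubled % 3 = 0")

# Operators marked with '*' in the plan are collapsed into one generated function.
df.explain()
# Print the actual generated Java source for each whole-stage subtree.
df.explain(mode="codegen")
```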
End-to-End Deep Learning with Horovod on Apache Spark (Databricks)
Data processing and deep learning are often split into two pipelines, one for ETL processing, the second for model training. Enabling deep learning frameworks to integrate seamlessly with ETL jobs allows for more streamlined production jobs, with faster iteration between feature engineering and model training.
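A rough sketch of the horovod.spark entry point that enables this, with the training body elided; num_proc and the function body are placeholders:

```python
import horovod.spark

def train():
    # Each Spark task becomes one Horovod worker; a real job would build and
    # fit a Keras/PyTorch model here using Horovod's distributed optimizer.
    import horovod.tensorflow.keras as hvd
    hvd.init()
    return hvd.rank()

# Runs `train` on 4 Spark tasks and gathers the return values on the driver.
ranks = horovod.spark.run(train, num_proc=4)
print(sorted(ranks))
```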
Designing the Next Generation of Data Pipelines at Zillow with Apache Spark (Databricks)
The trade-off between development speed and pipeline maintainability is a constant for data engineers, especially for those in a rapidly evolving organization.
Continuous Processing in Structured Streaming with Jose Torres (Databricks)
This talk will cover the details of Continuous Processing in Structured Streaming and my work implementing the initial version in Spark 2.3, as well as the updates for 2.4. DStreams was Spark's first attempt at streaming, and through DStreams Spark became the first framework to provide both batch and streaming functionality in one unified execution engine.
Streaming execution happens through this "micro-batch" model, in which the underlying execution engine simply runs on batches of data over and over again. DStreams' design tightly couples the user-facing APIs with the execution model, and as a result it was very difficult to accomplish certain tasks important in streaming, e.g. using event time and working with late data, without breaking the user-facing APIs. Structured Streaming was the second (and latest) major streaming effort in Spark. Its design decouples the frontend (user-facing APIs) from the backend (execution), and allows us to change the execution model without any user API change.
However, the (historical) minimum possible latency for any record in DStreams or Structured Streaming was bounded by the amount of time it takes to launch a task. This limitation is a result of the fact that the engine requires us to know both the starting and the ending offset before any tasks are launched. In the worst case, the end-to-end latency is actually closer to the average batch time plus the task launching time. Continuous Processing removes this constraint and allows users to achieve sub-millisecond end-to-end latencies with the new execution engine.
This talk will take a technical deep dive into its capabilities, what it took to implement, and discuss the future developments.
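From the user's side the switch is just a trigger change; a sketch assuming a Kafka-to-Kafka pipeline (hosts, topics, and paths are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load()
)

query = (
    events.selectExpr("key", "value")    # continuous mode supports map-like ops
    .writeStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("topic", "events-out")
    .option("checkpointLocation", "/tmp/continuous-ckpt")
    .trigger(continuous="1 second")      # a checkpoint interval, not a batch interval
    .start()
)
```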
Databricks: What We Have Learned by Eating Our Dog Food (Databricks)
"Databricks Unified Analytics Platform (UAP) is a cloud-based service for running all analytics in one place - from highly reliable and performant data pipelines to state-of-the-art Machine Learning. From the original creators of Apache Spark and MLflow, it provides data science and engineering teams ready to use pre-packaged clusters with optimized Apache Spark and various ML frameworks coupled with powerful collaboration capabilities to improve productivity across the ML lifecycle. Yada yada yada... But in addition to being a vendor Databricks is also a user of UAP.
So, what have we learned by eating our own dogfood? Attend a 'from the trenches' report from Suraj Acharya, Director of Engineering responsible for Databricks' in-house data engineering team, on how his team put Databricks technology to use, the lessons they have learned along the way, and best practices for using Databricks for data engineering."
Improving Apache Spark's Reliability with DataSourceV2 (Databricks)
DataSourceV2 is Spark's new API for working with data from tables and streams, but "v2" also includes a set of changes to SQL internals, the addition of a catalog API, and changes to the data frame read and write APIs. This talk will cover the context for those additional changes and how "v2" will make Spark more reliable and predictable for building enterprise data pipelines. This talk will include:
- Problem areas where the current behavior is unpredictable or unreliable
- The new standard SQL write plans (and the related SPIP)
- The new table catalog API and a new Scala API for table DDL operations (and the related SPIP)
- Netflix's use case that motivated these changes
Creating an 86,000 Hour Speech Dataset with Apache Spark and TPUs (Databricks)
This document summarizes Daniel Galvez's presentation on creating The People's Speech Dataset using Apache Spark and TPUs. The key points are:
1) The dataset aims to provide 86,000 hours of speech data with forced alignments between audio and transcripts in order to be challenging, free to use, and have a commercial license.
2) The conceptual workload is to take hour-long audio files, split them into 15-second segments, and use a pretrained speech recognition model to discover when each word in the transcript was said.
3) Creating the dataset encountered limitations with accelerator-aware scheduling in Spark, memory issues with PySpark UDFs, crashes in TPUs, and the need to reorder data by
Data Distribution and Ordering for Efficient Data Source V2 (Databricks)
This presentation discusses data distribution and ordering in Apache Iceberg's Data Source V2. It explains that proper distribution and ordering of data is important for performance when writing and reading large datasets. The new version introduces an API for connectors to specify their required distribution and ordering, addressing issues in V1 where connectors could apply arbitrary transformations. Supported distribution options include ordered, clustered, and unspecified, and the API supports batch and streaming writes. Future work includes supporting distribution and ordering in table creation and improving partition handling. Proper data distribution and ordering is key to scaling performance in Iceberg.
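Assuming the Iceberg Spark SQL extensions are installed, the requested distribution and ordering can be declared on the table, roughly like this (catalog, table, and column names are placeholders):

```python
# Requires the iceberg-spark runtime jar and
# spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Cluster rows by partition, then sort within tasks before writing.
spark.sql("ALTER TABLE demo.db.events WRITE DISTRIBUTED BY PARTITION")
spark.sql("ALTER TABLE demo.db.events WRITE ORDERED BY ts, device_id")
```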
Apache Spark 2.0 includes improvements that provide considerable speedups for CPU-intensive queries through techniques like code generation. Profiling tools like flame graphs can help analyze where CPU cycles are spent by visualizing stack traces. Flame graphs are useful for performance troubleshooting but have limitations. Testing Spark applications locally and through unit tests allows faster iteration compared to running on clusters and saves resources. It is also important to test with local approximations of distributed components like HDFS and Hive.
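A minimal sketch of the local-testing point: a pytest fixture that spins up a local SparkSession so transformations can be verified without a cluster:

```python
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    # local[2] gives two worker threads; no cluster, HDFS, or Hive required.
    return (
        SparkSession.builder.master("local[2]").appName("unit-tests").getOrCreate()
    )

def test_dedup(spark):
    df = spark.createDataFrame([(1, "a"), (1, "a"), (2, "b")], ["id", "v"])
    assert df.dropDuplicates().count() == 2
```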
Apache Spark has been rapidly gaining steam, both in the headlines and in real-world adoption. Spark was developed in 2009 and open sourced in 2010. Since then, it has grown to become one of the largest open source communities in big data, with over 200 contributors from more than 50 organizations. This open source analytics engine stands out for its ability to process large volumes of data significantly faster than contemporaries such as MapReduce, primarily owing to in-memory storage of data within its own processing framework. That being said, one of the top real-world industry use cases for Apache Spark is its ability to process 'streaming data'.
How the Automation of a Benchmark Framework Keeps Pace with the Dev Cycle at I... (DevOps.com)
The team at InfluxData needed to review benchmark data on InfluxDB during their development cycle to ensure any new changes continue to improve performance. However, using the existing benchmarking framework, InfluxDB-comparison, was manual and time-consuming, and it was not used consistently. To change this, InfluxData asked the team at Bonitoo to enhance the benchmark framework to easily incorporate new use cases, add new versions of the software quickly and easily, and provide benchmark data on a cadence that works for the development cycle (daily, weekly, monthly).
In this webinar the team from Bonitoo will share how they were able to accomplish this as well as build automation into the existing framework. In addition, they will share the benchmark results generated from the framework that highlights how performant a time series database like InfluxDB is compared to the latest versions of products like MongoDB, Cassandra, Elasticsearch, and OpenTSDB.
This document discusses change data capture (CDC) and its components. CDC is an approach that identifies, captures, and delivers changes made to enterprise data sources. It feeds these changes into a central data stream that can be combined with other data sources in real-time. The document outlines Kafka Connect, Debezium, Schema Registry, and Apache Avro which are key parts of the CDC architecture. It also discusses future steps like supporting additional databases and improving deployment, as well as open issues around performance and compatibility with certain databases.
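A rough sketch of wiring up one such source: registering a Debezium MySQL connector (with Avro converters pointed at the Schema Registry) through the Kafka Connect REST API. Hostnames, credentials, and table names are placeholders, and the config keys follow Debezium 1.x naming:

```python
import requests

connector = {
    "name": "inventory-connector",
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql",
        "database.port": "3306",
        "database.user": "debezium",
        "database.password": "secret",
        "database.server.id": "184054",
        "database.server.name": "dbserver1",          # topic prefix for captured tables
        "table.include.list": "inventory.customers",
        # Avro + Schema Registry keeps schemas versioned alongside the change stream.
        "key.converter": "io.confluent.connect.avro.AvroConverter",
        "value.converter": "io.confluent.connect.avro.AvroConverter",
        "key.converter.schema.registry.url": "http://schema-registry:8081",
        "value.converter.schema.registry.url": "http://schema-registry:8081",
    },
}

resp = requests.post("http://connect:8083/connectors", json=connector)
resp.raise_for_status()
```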
Scaling Monitoring at Databricks from Prometheus to M3 (Libby Schulze)
M3 has been successfully deployed at Databricks to replace their Prometheus monitoring system. Some key lessons learned include monitoring important M3 metrics like memory and disk usage, having automated deployment processes, and planning for capacity needs and spikes in metrics. Updates to M3 have gone smoothly, and future plans include using new M3 features like downsampling and separate namespaces.
Declarative Programming and a Form of SDN (Miya Kohno)
The document discusses declarative programming as it relates to network programmability. It provides examples of declarative versus imperative code and explains key concepts of declarative programming such as the lack of side effects, referential transparency, and idempotence. It also discusses how declarative programming could benefit networking, given its robustness in complex distributed environments, though it may lack universal computational power, and it outlines some declarative approaches being used for network control, orchestration, and automation. OpenDaylight and ETSI NFV architectures are presented as examples combining declarative and imperative approaches.
Concurrency involves accomplishing multiple tasks with shared resources, giving the illusion of parallelism to users. Parallelism accomplishes tasks with dedicated resources, making it most efficient. I/O involves accessing network sockets, disks, files, and system calls, and can be blocking or non-blocking. Computational constructs like processes, threads, mutexes, coroutines, and green threads are used. Amdahl's law and latency versus throughput are discussed. Asynchronous processing uses buffers to separate producers and consumers. Event loops, actors, CSP, and observers are programming abstractions. Various runtimes like the JVM, NodeJS, CPython, Go, BEAM, and GHC are compared for concurrency and parallelism.
Fast federated SQL with Apache Calcite (Chris Baynes)
This document discusses Apache Calcite, an open source framework for federated SQL queries. It provides an introduction to Calcite and its components. It then evaluates Calcite's performance on single data sources through benchmarks. Lastly, it proposes a hybrid approach to enable efficient federated queries using Calcite and Spark.
The document discusses upcoming features and changes in Apache Airflow 2.0. Key points include:
1. Scheduler high availability will use an active-active model with row-level locks to allow killing a scheduler without interrupting tasks.
2. DAG serialization will decouple DAG parsing from scheduling to reduce delays, support lazy loading, and enable features like versioning.
3. Performance improvements include optimizing the DAG file processor and using a profiling tool to identify other bottlenecks.
4. The Kubernetes executor will integrate with KEDA for autoscaling and allow customizing pods through templating.
5. The official Helm chart, functional DAGs (sketched below), and smaller usability changes
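For point 5, a minimal sketch of an Airflow 2.0 functional DAG using the TaskFlow decorators, where return values flow between tasks via XCom; the task bodies are placeholders:

```python
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule_interval="@daily", start_date=datetime(2021, 1, 1), catchup=False)
def example_etl():
    @task
    def extract() -> dict:
        return {"rows": 100}            # placeholder payload

    @task
    def load(payload: dict) -> None:
        print(f"loaded {payload['rows']} rows")

    load(extract())                     # the return value is passed via XCom

example_dag = example_etl()
```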
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month (Nicolas Brousse)
TubeMogul grew from a few servers to over two thousand servers, handling over one trillion HTTP requests a month, each processed in less than 50 ms. To keep up with the fast growth, the SRE team had to implement an efficient continuous delivery infrastructure that allowed over 10,000 Puppet deployments and 8,500 application deployments in 2014. In this presentation, we will cover the nuts and bolts of the TubeMogul operations engineering team and how they overcame challenges.
Improve Monitoring and Observability for Kubernetes with OSS Tools (Nilesh Gule)
Slide deck from the presentation at the KubeDay Singapore event. The session covered the three pillars of observability and how to use Jaeger for distributed tracing, Loki for log aggregation, and Prometheus and Grafana for metrics in a distributed application. An Azure Kubernetes Service (AKS) cluster was used for the live demo.
https://events.linuxfoundation.org/kubeday-singapore/
MongoDB vs ScyllaDB: Tractian's Experience with Real-Time ML (ScyllaDB)
Tractian, an AI-driven industrial monitoring company, recently discovered that their real-time ML environment needed to handle a tenfold increase in data throughput. In this session, JP Voltani (Head of Engineering at Tractian), details why and how they moved to ScyllaDB to scale their data pipeline for this challenge. JP compares ScyllaDB, MongoDB, and PostgreSQL, evaluating their data models, query languages, sharding and replication, and benchmark results. Attendees will gain practical insights into the MongoDB to ScyllaDB migration process, including challenges, lessons learned, and the impact on product performance.
Intro - End to End ML with Kubeflow @ SignalConf 2018 (Holden Karau)
There are many great tools for training machine learning models, ranging from scikit-learn to Apache Spark and TensorFlow. However, many of these systems largely leave open the question of how to use our models outside of the batch world (like in a reactive application). Different options exist for persisting the results and using them for live training, and we will explore the trade-offs of the different formats and their corresponding serving/prediction layers.
Getting Started with PHP on Engine Yard Cloud (Engine Yard)
Topics Covered:
• How to deploy a PHP application to Engine Yard
• How to use Composer to automate dependency management
• The key differences between Orchestra and Engine Yard Cloud
This document discusses using Apache Kafka as a data hub to capture changes from various data sources using change data capture (CDC). It outlines several common CDC patterns, such as using modification dates, database triggers, or log files to identify changes (a sketch of the modification-date pattern follows). It then discusses using Kafka Connect to integrate data sources like MongoDB and PostgreSQL and replicate their changes. The document provides examples of open source CDC connectors and concludes with suggestions for getting involved in the Apache Kafka community.
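Of the patterns listed, the modification-date approach is the simplest to sketch: poll for rows whose timestamp advanced and hand them to a producer. A hypothetical illustration (the table, column, and publish helper are placeholders), with the usual caveat that polling misses hard deletes, which is why log-based tools like Debezium exist:

```python
import time
import psycopg2  # any DB-API driver works; PostgreSQL is just an example

def publish(row):
    # Hypothetical helper, e.g. a KafkaProducer.send() in a real pipeline.
    print("change:", row)

conn = psycopg2.connect("dbname=app user=etl")
last_seen = "1970-01-01 00:00:00"

while True:
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, payload, updated_at FROM orders "
            "WHERE updated_at > %s ORDER BY updated_at",
            (last_seen,),
        )
        for row in cur.fetchall():
            publish(row)
            last_seen = str(row[2])   # advance the high-water mark
    time.sleep(5)
```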
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK (zmhassan)
As Spark applications move to a containerized environment, there are many questions about how to best configure server systems in the container world. In this talk we will demonstrate a set of tools to better monitor performance and identify optimal configuration settings. We will demonstrate how Prometheus, a project that is now part of the Cloud Native Computing Foundation (CNCF), can be applied to monitor and archive system performance data in a containerized Spark environment. In our examples, we will gather Spark metric output through Prometheus and present the data with Grafana dashboards. We will use our examples to demonstrate how performance can be enhanced through different tuned configuration settings. Our demo will show how to configure settings across the cluster as well as within each node.
Distributed Real-Time Stream Processing: Why and How (Petr Zapletal)
In this talk you will discover various state-of-the-art open-source distributed streaming frameworks, their similarities and differences, implementation trade-offs, their intended use-cases, and how to choose between them. Petr will focus on the popular frameworks, including Spark Streaming, Storm, Samza and Flink. You will also explore theoretical introduction, common pitfalls, popular architectures, and much more.
The demand for stream processing is increasing. Immense amounts of data have to be processed quickly from a rapidly growing set of disparate data sources. This pushes the limits of traditional data processing infrastructures. Stream-based applications, including trading, social networks, the Internet of Things, and system monitoring, are becoming more and more important. A number of powerful, easy-to-use open source platforms have emerged to address this.
Petr's goal is to provide a comprehensive overview of modern streaming solutions and to help fellow developers with picking the best possible solution for their particular use-case. Join this talk if you are thinking about, implementing, or have already deployed a streaming solution.
Modern ETL Pipelines with Change Data Capture (Databricks)
In this talk we'll present how at GetYourGuide we've built from scratch a completely new ETL pipeline using Debezium, Kafka, Spark and Airflow, which can automatically handle schema changes. Our starting point was an error-prone legacy system that ran daily and was vulnerable to breaking schema changes, which caused many sleepless on-call nights. Like most companies, we also have traditional SQL databases that we need to connect to in order to extract relevant data.
This is usually done through either full or partial copies of the data with tools such as Sqoop. However, another approach that has become quite popular lately is to use Debezium as the change data capture layer, which reads database binlogs and streams these changes directly to Kafka. As having data once a day is no longer enough for our business, and we wanted our pipelines to be resilient to upstream schema changes, we decided to rebuild our ETL using Debezium.
We'll walk the audience through the steps we followed to architect and develop such a solution using Databricks to reduce operation time. By building this new pipeline we are now able to refresh our data lake multiple times a day, giving our users fresh data and protecting our nights of sleep.
Things You MUST Know Before Deploying OpenStack: Bruno Lago, Catalyst IT (OpenStack)
Audience: Advanced
About: Real world lessons and war stories about Catalyst IT’s experience in rolling out an OpenStack based public cloud in New Zealand.
This presentation will provide tips and advice that may save you a lot of time, money and nights of sleep if you are planning to run OpenStack in the future. It may also bring some insights to people that are already running OpenStack in production.
Topics covered will include: selection of hardware for optimal costs, techniques that drive quality and service levels up, common deployment mistakes, in-place upgrades, how to identify the maturity level of each project and decide what is ready for production, and much more!
Speaker Bio: Bruno Lago – Entrepreneur, Catalyst IT Limited
Bruno Lago is a solutions architect who has been involved with the Catalyst Cloud (New Zealand's first public cloud based on OpenStack) from its inception. He is passionate about open source software, cloud computing and disruptive technologies.
OpenStack Australia Day - Sydney 2016
https://events.aptira.com/openstack-australia-day-sydney-2016/
Similar to Fugue: Unifying Spark and Non-Spark Ecosystems for Big Data Analytics
The document discusses migrating a data warehouse to the Databricks Lakehouse Platform. It outlines why legacy data warehouses are struggling, how the Databricks Platform addresses these issues, and key considerations for modern analytics and data warehousing. The document then provides an overview of the migration methodology, approach, strategies, and key takeaways for moving to a lakehouse on Databricks.
Data Lakehouse Symposium | Day 1 | Part 1 (Databricks)
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
Data Lakehouse Symposium | Day 1 | Part 2 (Databricks)
The document discusses the challenges of modern data, analytics, and AI workloads. Most enterprises struggle with siloed data systems that make integration and productivity difficult. The future of data lies with a data lakehouse platform that can unify data engineering, analytics, data warehousing, and machine learning workloads on a single open platform. The Databricks Lakehouse platform aims to address these challenges with its open data lake approach and capabilities for data engineering, SQL analytics, governance, and machine learning.
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop (Databricks)
In this session, learn how to quickly supplement your on-premises Hadoop environment with a simple, open, and collaborative cloud architecture that enables you to generate greater value with scaled application of analytics and AI on all your data. You will also learn five critical steps for a successful migration to the Databricks Lakehouse Platform along with the resources available to help you begin to re-skill your data teams.
Democratizing Data Quality Through a Centralized Platform (Databricks)
Bad data leads to bad decisions and broken customer experiences. Organizations depend on complete and accurate data to power their business, maintain efficiency, and uphold customer trust. With thousands of datasets and pipelines running, how do we ensure that all data meets quality standards, and that expectations are clear between producers and consumers? Investing in shared, flexible components and practices for monitoring data health is crucial for a complex data organization to rapidly and effectively scale.
At Zillow, we built a centralized platform to meet our data quality needs across stakeholders. The platform is accessible to engineers, scientists, and analysts, and seamlessly integrates with existing data pipelines and data discovery tools. In this presentation, we will provide an overview of our platform’s capabilities, including:
- Giving producers and consumers the ability to define and view data quality expectations using a self-service onboarding portal
- Performing data quality validations using libraries built to work with Spark (see the sketch after this list)
- Dynamically generating pipelines that can be abstracted away from users
- Flagging data that doesn't meet quality standards at the earliest stage and giving producers the opportunity to resolve issues before use by downstream consumers
- Exposing data quality metrics alongside each dataset to provide producers and consumers with a comprehensive picture of health over time
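As a flavor of the Spark-based validations mentioned above, a minimal hand-rolled check; the dataset, columns, and thresholds are illustrative, and Zillow's platform generalizes this behind a self-service portal:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.table("listings")   # illustrative dataset

# Expectations: no null ids, and prices within a sane range.
total = df.count()
null_ids = df.filter(F.col("listing_id").isNull()).count()
bad_price = df.filter(~F.col("price").between(0, 10_000_000)).count()

# Fail the pipeline early so bad data never reaches downstream consumers.
if null_ids > 0 or bad_price / max(total, 1) > 0.01:
    raise ValueError(f"quality gate failed: {null_ids} null ids, {bad_price} bad prices")
```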
Learn to Use Databricks for Data Science (Databricks)
Data scientists face numerous challenges throughout the data science workflow that hinder productivity. As organizations continue to become more data-driven, a collaborative environment is more critical than ever: one that provides easier access and visibility into the data, reports and dashboards built against the data, reproducibility, and insights uncovered within the data. Join us to hear how Databricks' open and collaborative platform simplifies data science by enabling you to run all types of analytics workloads, from data preparation to exploratory analysis and predictive analytics, at scale, all on one unified platform.
Why APM Is Not the Same As ML Monitoring (Databricks)
Application performance monitoring (APM) has become the cornerstone of software engineering allowing engineering teams to quickly identify and remedy production issues. However, as the world moves to intelligent software applications that are built using machine learning, traditional APM quickly becomes insufficient to identify and remedy production issues encountered in these modern software applications.
As a lead software engineer at NewRelic, my team built high-performance monitoring systems including Insights, Mobile, and SixthSense. As I transitioned to building ML Monitoring software, I found the architectural principles and design choices underlying APM to not be a good fit for this brand new world. In fact, blindly following APM designs led us down paths that would have been better left unexplored.
In this talk, I draw upon my (and my team’s) experience building an ML Monitoring system from the ground up and deploying it on customer workloads running large-scale ML training with Spark as well as real-time inference systems. I will highlight how the key principles and architectural choices of APM don’t apply to ML monitoring. You’ll learn why, understand what ML Monitoring can successfully borrow from APM, and hear what is required to build a scalable, robust ML Monitoring architecture.
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix (Databricks)
Autonomy and ownership are core to working at Stitch Fix, particularly on the Algorithms team. We enable data scientists to deploy and operate their models independently, with minimal need for handoffs or gatekeeping. By writing a simple function and calling out to an intuitive API, data scientists can harness a suite of platform-provided tooling meant to make ML operations easy. In this talk, we will dive into the abstractions the Data Platform team has built to enable this. We will go over the interface data scientists use to specify a model and what that hooks into, including online deployment, batch execution on Spark, and metrics tracking and visualization.
Stage Level Scheduling Improving Big Data and AI Integration (Databricks)
In this talk, I will dive into the stage level scheduling feature added to Apache Spark 3.1. Stage level scheduling extends upon Project Hydrogen by improving big data ETL and AI integration and also enables multiple other use cases. It is beneficial any time the user wants to change container resources between stages in a single Apache Spark application, whether those resources are CPU, memory, or GPUs. One of the most popular use cases is enabling end-to-end scalable Deep Learning and AI to efficiently use GPU resources. In this type of use case, users read from a distributed file system, do data manipulation and filtering to get the data into a format that the Deep Learning algorithm needs for training or inference, and then send the data into a Deep Learning algorithm. Using stage level scheduling combined with accelerator-aware scheduling enables users to seamlessly go from ETL to Deep Learning running on the GPU by adjusting the container requirements for different stages in Spark within the same application. This makes writing these applications easier and can help with hardware utilization and costs.
There are other ETL use cases where users want to change CPU and memory resources between stages, for instance when there is data skew or the data size is much larger in certain stages of the application. In this talk, I will go over the feature details, cluster requirements, the API, and use cases. I will demo how the stage level scheduling API can be used by Horovod to seamlessly go from data preparation to training using the TensorFlow Keras API on GPUs.
The talk will also touch on other new Apache Spark 3.1 functionality, such as pluggable caching, which can be used to enable faster dataframe access when operating from GPUs.
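A rough sketch of the PySpark 3.1 stage level scheduling API described above; prepare() and train_fn() are hypothetical placeholders, and running multiple profiles in practice needs dynamic allocation plus a GPU discovery script:

```python
from pyspark import SparkContext
from pyspark.resource import (
    ExecutorResourceRequests, TaskResourceRequests, ResourceProfileBuilder,
)

sc = SparkContext.getOrCreate()

def prepare(x):      # hypothetical featurization step
    return (x, float(x) / 3)

def train_fn(rows):  # hypothetical per-partition training loop
    yield sum(r[1] for r in rows)

# Request GPU-capable executors for the training stage only.
ereqs = ExecutorResourceRequests().cores(4).memory("8g").resource("gpu", 1)
treqs = TaskResourceRequests().cpus(1).resource("gpu", 1)
gpu_profile = ResourceProfileBuilder().require(ereqs).require(treqs).build

etl = sc.parallelize(range(1_000_000)).map(prepare)        # default resources
training = etl.repartition(4).withResources(gpu_profile)   # this stage switches profiles
results = training.mapPartitions(train_fn).collect()
```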
Simplify Data Conversion from Spark to TensorFlow and PyTorch (Databricks)
In this talk, I would like to introduce an open-source tool built by our team that simplifies the data conversion from Apache Spark to deep learning frameworks.
Imagine you have a large dataset, say 20 GBs, and you want to use it to train a TensorFlow model. Before feeding the data to the model, you need to clean and preprocess your data using Spark. Now you have your dataset in a Spark DataFrame. When it comes to the training part, you may have the problem: How can I convert my Spark DataFrame to some format recognized by my TensorFlow model?
The existing data conversion process can be tedious. For example, to convert an Apache Spark DataFrame to a TensorFlow Dataset file format, you need to either save the Apache Spark DataFrame on a distributed filesystem in parquet format and load the converted data with third-party tools such as Petastorm, or save it directly in TFRecord files with spark-tensorflow-connector and load it back using TFRecordDataset. Both approaches take more than 20 lines of code to manage the intermediate data files, rely on different parsing syntax, and require extra attention for handling vector columns in the Spark DataFrames. In short, all these engineering frictions greatly reduced the data scientists’ productivity.
The Databricks Machine Learning team contributed a new Spark Dataset Converter API to Petastorm to simplify these tedious data conversion process steps. With the new API, it takes a few lines of code to convert a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader with default parameters.
In the talk, I will use an example to show how to use the Spark Dataset Converter to train a TensorFlow model and how simple it is to go from single-node training to distributed training on Databricks.
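For reference, a sketch of the converter API the talk describes, assuming Petastorm's published interface; the cache path, DataFrame, and model are placeholders:

```python
from pyspark.sql import SparkSession
from petastorm.spark import SparkDatasetConverter, make_spark_converter

spark = SparkSession.builder.getOrCreate()

# Intermediate files are materialized once under this cache dir (placeholder path).
spark.conf.set(
    SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF, "file:///tmp/petastorm_cache"
)

preprocessed_df = spark.table("training_features")   # placeholder Spark DataFrame
converter = make_spark_converter(preprocessed_df)

# A few lines replace the manual Parquet/TFRecord round trip described above.
with converter.make_tf_dataset(batch_size=64) as dataset:
    model.fit(dataset, steps_per_epoch=len(converter) // 64, epochs=3)  # placeholder model
```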
Scaling your Data Pipelines with Apache Spark on Kubernetes (Databricks)
There is no doubt Kubernetes has emerged as the next generation of cloud native infrastructure to support a wide variety of distributed workloads. Apache Spark has evolved to run both Machine Learning and large scale analytics workloads. There is growing interest in running Apache Spark natively on Kubernetes. By combining the flexibility of Kubernetes and scalable data processing with Apache Spark, you can run any data and machine pipelines on this infrastructure while effectively utilizing resources at disposal.
In this talk, Rajesh Thallam and Sougata Biswas will share how to effectively run your Apache Spark applications on Google Kubernetes Engine (GKE) and Google Cloud Dataproc, and orchestrate the data and machine learning pipelines with managed Apache Airflow on GKE (Google Cloud Composer). The following topics will be covered:
- Understanding key traits of Apache Spark on Kubernetes
- Things to know when running Apache Spark on Kubernetes, such as autoscaling
- Demonstrating analytics pipelines running on Apache Spark, orchestrated with Apache Airflow on a Kubernetes cluster
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
Pipelines have become ubiquitous, as the need for stringing multiple functions to compose applications has gained adoption and popularity. Common pipeline abstractions such as “fit” and “transform” are even shared across divergent platforms such as Python Scikit-Learn and Apache Spark.
Scaling pipelines at the level of simple functions is desirable for many AI applications; however, it is not directly supported by Ray's parallelism primitives. In this talk, Raghu will describe a pipeline abstraction that takes advantage of Ray's compute model to efficiently scale arbitrarily complex pipeline workflows. He will demonstrate how this abstraction cleanly unifies pipeline workflows across multiple platforms such as Scikit-Learn and Spark, and achieves nearly optimal scale-out parallelism on pipelined computations.
Attendees will learn how pipelined workflows can be mapped to Ray’s compute model and how they can both unify and accelerate their pipelines with Ray.
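As a toy illustration of how pipelined workflows map onto Ray's compute model (this is not the talk's abstraction; stage_a, stage_b, and raw_data are hypothetical):

import ray

ray.init()

@ray.remote
def fit_transform(stage, data):
    # Each stage exposes the familiar fit/transform contract
    return stage.fit(data).transform(data)

# Stages become Ray tasks; passing an object ref chains them, and
# independent branches run in parallel automatically.
ref_a = fit_transform.remote(stage_a, raw_data)
ref_b = fit_transform.remote(stage_b, ref_a)  # Ray resolves ref_a before the call
result = ray.get(ref_b)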
Sawtooth Windows for Feature AggregationsDatabricks
In this talk about Zipline, we will introduce a new type of windowing construct called a sawtooth window. We will describe various properties of sawtooth windows that we utilize to achieve online-offline consistency, while still maintaining high throughput, low read latency, and tunable write latency for serving machine learning features. We will also talk about a simple deployment strategy for correcting feature drift due to operations over change data that are not "abelian groups".
Re-imagine Data Monitoring with whylogs and SparkDatabricks
In the era of microservices, decentralized ML architectures, and complex data pipelines, data quality has become a bigger challenge than ever. When data is involved in complex business processes and decisions, bad data can, and will, affect the bottom line. As a result, ensuring data quality across the entire ML pipeline is both costly and cumbersome, while data monitoring is often fragmented and performed ad hoc. To address these challenges, we built whylogs, an open source standard for data logging. It is a lightweight data profiling library that enables end-to-end data profiling across the entire software stack. The library implements a language- and platform-agnostic approach to data quality and data monitoring. It can work with different modes of data operations, including streaming, batch, and IoT data.
In this talk, we will provide an overview of the whylogs architecture, including its lightweight statistical data collection approach and various integrations. We will demonstrate how the whylogs integration with Apache Spark achieves large scale data profiling, and we will show how users can apply this integration into existing data and ML pipelines.
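A minimal sketch of what profiling looks like, assuming whylogs' v1 Python API (the Spark integration follows the same profile-then-inspect pattern at cluster scale):

import pandas as pd
import whylogs as why

df = pd.DataFrame({"amount": [9.5, 12.0, None], "country": ["US", "DE", "US"]})

# Profile the dataframe: a lightweight statistical summary, not a copy of the data
results = why.log(df)
print(results.view().to_pandas())  # per-column counts, null ratios, distributions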
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
Machine learning (ML) models are typically part of prediction queries that consist of a data processing part (e.g., for joining, filtering, cleaning, featurization) and an ML part invoking one or more trained models. In this presentation, we identify significant and unexplored opportunities for optimization. To the best of our knowledge, this is the first effort to look at prediction queries holistically, optimizing across both the ML and SQL components.
We will present Raven, an end-to-end optimizer for prediction queries. Raven relies on a unified intermediate representation that captures both data processing and ML operators in a single graph structure.
This allows us to introduce optimization rules that
(i) reduce unnecessary computations by passing information between the data processing and ML operators,
(ii) leverage operator transformations (e.g., turning a decision tree to a SQL expression or an equivalent neural network) to map operators to the right execution engine, and
(iii) integrate compiler techniques to take advantage of the most efficient hardware backend (e.g., CPU, GPU) for each operator.
We have implemented Raven as an extension to Spark’s Catalyst optimizer to enable the optimization of SparkSQL prediction queries. Our implementation also allows the optimization of prediction queries in SQL Server. As we will show, Raven is capable of improving prediction query performance on Apache Spark and SQL Server by up to 13.1x and 330x, respectively. For complex models, where GPU acceleration is beneficial, Raven provides up to 8x speedup compared to state-of-the-art systems. As part of the presentation, we will also give a demo showcasing Raven in action.
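As a toy illustration of rule (ii) above (not Raven's actual implementation), a depth-1 decision tree can be compiled into an equivalent SQL CASE expression so the "model" runs inside the SQL engine:

def stump_to_sql(feature, threshold, left, right):
    # Emit SQL equivalent to a single decision-tree split
    return (f"CASE WHEN {feature} <= {threshold} "
            f"THEN {left} ELSE {right} END")

print(stump_to_sql("sqft", 1500.0, "'low'", "'high'"))
# CASE WHEN sqft <= 1500.0 THEN 'low' ELSE 'high' END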
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
Semantic segmentation is the classification of every pixel in an image/video. The segmentation partitions a digital image into multiple objects to simplify/change the representation of the image into something that is more meaningful and easier to analyze [1][2]. The technique has a wide variety of applications ranging from perception in autonomous driving scenarios to cancer cell segmentation for medical diagnosis.
Exponential growth in the datasets that require such segmentation is driven by improvements in the accuracy and quality of the sensors generating the data, extending to 3D point cloud data. This growth is further compounded by advances in cloud technologies that make the necessary storage and compute available for such applications. The need for semantically segmented datasets is a key requirement for improving the accuracy of the inference engines that are built upon them.
Streamlining the accuracy and efficiency of these systems directly affects the value of the business outcome for organizations that are developing such functionalities as a part of their AI strategy.
This presentation details workflows for labeling, preprocessing, modeling, and evaluating performance/accuracy. Scientists and engineers leverage domain-specific features/tools that support the entire workflow from labeling the ground truth, handling data from a wide variety of sources/formats, developing models and finally deploying these models. Users can scale their deployments optimally on GPU-based cloud infrastructure to build accelerated training and inference pipelines while working with big datasets. These environments are optimized for engineers to develop such functionality with ease and then scale against large datasets with Spark-based clusters on the cloud.
Massive Data Processing in Adobe Using Delta LakeDatabricks
At Adobe Experience Platform, we ingest TBs of data every day and manage PBs of data for our customers as part of the Unified Profile offering. At the heart of this is complex ingestion of a mix of normalized and denormalized data with various linkage scenarios, powered by a central Identity Linking Graph. This helps power various marketing scenarios that are activated in multiple platforms and channels, such as email and advertisements. We will go over how we built a cost-effective and scalable data pipeline using Apache Spark and Delta Lake, and share our experiences, covering:
· What are we storing?
· Multi Source – Multi Channel Problem
· Data Representation and Nested Schema Evolution
· Performance Trade-offs with Various Formats
· Anti-patterns used (String FTW)
· Data Manipulation using UDFs
· Writer Worries and How to Wipe them Away
· Staging Tables FTW (see the sketch after this list)
· Datalake Replication Lag Tracking
· Performance Time!
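For the staging-table pattern above, a hedged sketch of the kind of Delta Lake upsert involved (illustrative only, not Adobe's code; the table path, staging_df, and identity_id key are hypothetical):

from delta.tables import DeltaTable

# Land new profile fragments in a staging table first, then MERGE them
# into the main Delta table keyed on the identity id.
target = DeltaTable.forPath(spark, "/delta/profiles")
(target.alias("t")
    .merge(staging_df.alias("s"), "t.identity_id = s.identity_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())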
Machine Learning CI/CD for Email Attack DetectionDatabricks
Detecting advanced email attacks at scale is a challenging ML problem, particularly due to the rarity of attacks, the adversarial nature of the problem, and the scale of data. In order to move quickly and adapt to the newest threats, we needed to build a Continuous Integration / Continuous Delivery pipeline for the entire ML detection stack. Our goal is to enable detection engineers and data scientists to make changes to any part of the stack, including joined datasets for hydration, feature extraction code, and detection logic, and to develop and train ML models.
In this talk, we discuss why we decided to build this pipeline, how it is used to accelerate development and ensure quality, and dive into the nitty-gritty details of building such a system on top of an Apache Spark + Databricks stack.
6. Motivation of Fugue
● A pure abstraction layer
● Unify and simplify core concepts of distributed computing
● Decouple your logic from any specific solution
● Easy to learn and easy to switch
● NOT invasive, NOT obstructive, and NOT exclusive
7. Example: Node2Vec
Apply a walk strategy to a graph to generate a collection of node vectors to be used by embedding algorithms such as Word2Vec
15. Why DAG?
1. X = Run mapper A on a dataframe
2. Map X by mapper B and save
3. Map X by mapper C and save
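A minimal sketch of how steps 1-3 might look in Fugue's Python API (the mappers are hypothetical identity functions; the "# schema: *" comments are Fugue schema hints):

import pandas as pd
from fugue import FugueWorkflow

# schema: *
def mapper_a(df: pd.DataFrame) -> pd.DataFrame:
    return df

# schema: *
def mapper_b(df: pd.DataFrame) -> pd.DataFrame:
    return df

# schema: *
def mapper_c(df: pd.DataFrame) -> pd.DataFrame:
    return df

dag = FugueWorkflow()
x = dag.df([["k1", 0], ["k2", 1]], "k:str,f:int").transform(mapper_a)  # step 1
x.transform(mapper_b).save("/tmp/b.parquet")                           # step 2
x.transform(mapper_c).save("/tmp/c.parquet")                           # step 3
dag.run()  # nothing executes until here, which enables the optimizations below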
16. Optimizations on DAG Execution
● Automatically parallelize independent branches
● Auto persist
● More errors can be captured at “compile” time
● Determinism enables checkpointing, executions can “resume”
18. # Enriched syntax
a := CREATE [["k1",0],["k2",1]] SCHEMA k:str,f:int
# Transformer extension
b := TRANSFORM a USING plus_n PARTITION BY k
# SELECT statement
c := SELECT a.*, b.f2 FROM a JOIN b ON a.k = b.k
# Simplified syntax & multi tasks
SELECT f, f2, 3 AS f3 PERSIST
PRINT
OUTPUT TO "file.parquet"
# Checkpoint
df ?? TRANSFORM b USING expensive_op
OUTPUT c, df USING assert_eq
Fugue SQL
19. Fugue SQL vs Spark SQL
Feature | Fugue SQL | Spark SQL
Workflow level | Yes | No
Cross platform | Yes | No
SELECT statement | Yes | Yes
Other SQL statements | No (can be done in extensions) | Yes
Multiple statements | Yes | Yes (WITH statement)
Spark/Hive UDF (Java/Py) | Yes | Yes
Fugue extensions | Yes | No
Caching/checkpointing | Yes | No
24. ML Library: Node2Vec
● We implemented the distributed Node2Vec algorithm on Fugue
○ Use adjacency lists to represent a graph
○ Distributed Breadth-First Search for random walk
○ Cache critical variables for picking the next step during BFS
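A toy single-machine illustration of the core step that the distributed BFS parallelizes (not the production implementation; the graph is hypothetical):

import random

# Adjacency-list representation of a tiny graph
adj = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a"]}

def random_walk(adj, start, length):
    # Extend the walk one random neighbor at a time
    path = [start]
    while len(path) < length and adj.get(path[-1]):
        path.append(random.choice(adj[path[-1]]))
    return path

print(random_walk(adj, "a", 5))  # e.g. ['a', 'c', 'a', 'b', 'c']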
26. Large Scale Testing
● Graph (10 million vertices, 300 million edges)
○ 2-3 hours with 500 cores and 3 TB memory
● Graph (100 million vertices, 3 billion edges)
○ 6-8 hours with 2,000 cores and 12 TB memory
27. ML Library: Time Series Seasonality
● Forecast seasonality coefficients using Kalman Filter
○ Decent performance on noisy data
○ Simulate special events (holidays, etc.) and anomalies
○ Any interval: hourly, daily, weekly, yearly, etc.
● Handle a very large number of time series with seasonalities
28. Fugue Streaming
● Fugue supports Spark streaming very well
○ Treats batch processing and streaming equivalently
○ Fugue spark-streaming pipelines in production
● Fugue abstracts connectors for streaming
○ Kinesis connectors
○ Confluent Kafka connectors
○ Commonly used streaming APIs
31. Migrated Projects
● Collaborated with multiple product teams to migrate legacy pipelines
○ Large cost and runtime savings
○ Higher testability
○ Shorter development time
● Performance improvement on all migrated projects (by Dec 2019)
○ Average total CPU hours: 74.6% reduction
○ Average total runtime: 83.9% reduction
32. Multi-region Regression
Region-based models to be trained and tuned.
Pipeline | Reliability | Avg Cost/Run | Runtime
Legacy Pipeline | ~80% | ~$630 | 7+ hours
Fugue Pipeline | 99.5% | ~$23 | 30 min
Improvement | - | 95+% reduction | 90+% reduction
33. Time-series Forecasting
Forecast business metrics for better budget planning and decision making
Horizon: weekly, monthly, quarterly
Pipeline | Reliability | Avg Cost/Run | Runtime
Legacy Pipeline | ~70% | ~$70 | 2+ hours
Fugue Pipeline | 99.5% | ~$5 | 10 min
Improvement | - | 90+% reduction | 90+% reduction
34. Summary
▪ Fugue unifies various computing frameworks with uniform interfaces.
▪ Fugue SQL is a novel language for workflows.
▪ K8S + Spark + Fugue is a great combination with high flexibility and efficiency for distributed computing.
▪ The Fugue project will build a unified ecosystem for integrating distributed systems and machine learning.