The document describes Onyx, a new flexible and extensible data processing system. It discusses limitations of existing frameworks in new resource environments like resource disaggregation and transient resources. The Onyx architecture includes a compiler that transforms dataflow programs into optimized physical execution plans using passes, and a runtime that executes the plans across cluster resources. It provides examples of compiling and running MapReduce and ALS jobs, and handling dynamic data skew through runtime optimization.
This document summarizes a presentation about Netflix's big data platform and Spark. The key points are:
1. Netflix uses Apache Spark on YARN and Mesos clusters to process batch and streaming data from sources like Cassandra and Kafka.
2. Netflix has contributed improvements to Spark's dynamic resource allocation, predicate pushdown, and support for S3 filesystems.
3. A use case showed Spark outperforming Pig for an iterative job that duplicated and aggregated data in multiple steps.
[212] big models without big data using domain specific deep networks in data-... | NAVER D2
The document discusses techniques for using deep learning with limited data. It presents methods for data synthesis, domain adaptation, and data cleaning. For data synthesis, it describes using a game engine to procedurally generate synthetic videos with automatic annotations for action recognition training. For domain adaptation, it applies a model trained on mouse tracking saliency data to eye tracking data. For data cleaning, it introduces a technique to prune noisy images from a landmark dataset to obtain reliable training annotations. The techniques aim to leverage limited data to train deep networks for tasks like saliency mapping, image retrieval, and action recognition.
Stream processing continuously processes unbounded data streams, unlike batch processing, which requires bounded, finite data sets. The key challenges include out-of-order data arrival and the need to relate events that occur close together in time but may be processed out of order. To address this, stream processing systems use watermarks to indicate processing progress, triggers to determine output timing, and accumulation to handle refinements from late data.
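The interplay of watermarks, triggers, and accumulation can be illustrated with a minimal sketch. It is not taken from the presentation: the fixed window length, the heuristic watermark (newest event time minus an assumed `allowed_lag`), and the names `run` and `window_start` are all hypothetical choices made for illustration.

```python
from collections import defaultdict

WINDOW = 10  # window length in event-time seconds (illustrative value)


def window_start(event_time):
    """Start of the fixed window that contains event_time."""
    return event_time - event_time % WINDOW


def run(events, allowed_lag=5):
    """Process (event_time, value) pairs in arrival order.

    Returns a list of (window_start, running_sum, kind) emissions, where
    kind is "on-time" for the first result of a window and "late" for a
    refinement caused by data that arrived after the window had fired.
    """
    sums = defaultdict(int)  # accumulated state per window
    fired = set()            # windows whose on-time result was emitted
    emitted = []
    max_seen = 0
    for event_time, value in events:
        start = window_start(event_time)
        sums[start] += value
        # Heuristic watermark (an assumption): no event is expected to be
        # more than allowed_lag seconds behind the newest one seen so far.
        max_seen = max(max_seen, event_time)
        watermark = max_seen - allowed_lag
        if start in fired:
            # Late data: re-emit the refined, accumulated sum.
            emitted.append((start, sums[start], "late"))
        # Trigger: fire every window whose end the watermark has passed.
        for s in sorted(sums):
            if s not in fired and s + WINDOW <= watermark:
                fired.add(s)
                emitted.append((s, sums[s], "on-time"))
    return emitted
```

Feeding the events `(1, 1), (4, 2), (12, 5), (16, 1), (3, 10), (27, 1)` in that arrival order fires window 0 once the watermark passes 10, then emits a refined sum for window 0 when the event with time 3 arrives late.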
[2C5] Map-D: A GPU Database for Interactive Big Data Analytics | NAVER D2
Map-D is a super-fast SQL-enabled columnar database built into GPU memory that allows for real-time analytics and interactive visualization of big data. It is optimized to take advantage of GPU memory bandwidth and computational power. Map-D can scan data at over 2TB/second per node and handle queries of over 1 billion records within milliseconds by leveraging the parallel processing capabilities of GPUs. This allows for truly interactive analysis of large datasets.
This document describes Onyx, a new flexible and extensible data processing system. Onyx aims to address limitations in existing frameworks when dealing with new resource environments like disaggregated computing and transient resources. The Onyx architecture includes a compiler that transforms dataflow programs into optimized execution plans using various passes. The runtime then executes the plans across cluster resources. Onyx allows dynamic optimization by collecting metrics during execution and generating new plans. It can harness transient resources by placing tasks strategically.
Presto generates Java bytecode at runtime to optimize query execution. Key query operations like filtering, projections, joins and aggregations are compiled into efficient Java methods using libraries like ASM and Fastutil. This bytecode generation improves performance by 30% through techniques like compiling row hashing for join lookups directly into machine instructions.
This document discusses the development of Apache Pig on Tez, an execution engine for Pig jobs. Pig on Tez allows Pig workflows to be executed as directed acyclic graphs (DAGs) using Tez, improving performance over the default MapReduce execution. Key benefits of Tez include eliminating intermediate data writes, reducing job launch overhead, and allowing more flexible data flows. However, challenges remain around automatically determining optimal parallelism and integrating Tez with user interface and monitoring tools. Future work is needed to address these issues.
The document discusses Apache NiFi, an open source software project that provides a dataflow solution for managing enterprise data movement and integration. It describes challenges with traditional messaging systems for enterprise dataflow and introduces Apache NiFi as an alternative. NiFi is based on Flow-Based Programming and allows users to visually create dataflows that can transform, route, and process data in real-time. The document includes a demonstration of NiFi and discusses its architecture, features, and future proposals.
Amazon EC2 provides a broad selection of instance types to accommodate a diverse mix of workloads. In this session, we provide an overview of the Amazon EC2 instance platform, key platform features, and the concept of instance generations. We dive into the current generation design choices of the different instance families, including the General Purpose, Compute Optimized, Storage Optimized, Memory Optimized, and GPU instance families. We also detail best practices and share performance tips for getting the most out of your Amazon EC2 instances.
This document discusses how to use Storm and Hadoop together to enable real-time and batch processing of large datasets. It describes using Hadoop to precompute batch views of data, and Storm to incrementally update real-time views as new data streams in. This allows for low-latency queries by combining precomputed batch views with real-time views that compensate for recent data not yet absorbed into the batch views.
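A minimal sketch of the batch-plus-real-time idea, in plain Python rather than Hadoop or Storm. The function and class names (`batch_view`, `RealtimeView`, `query`) and the use of per-key counts as the view are assumptions made for illustration, not APIs of either system.

```python
from collections import Counter


def batch_view(events, up_to):
    """Precompute counts per key for events with timestamp < up_to
    (the role the summary assigns to Hadoop)."""
    return Counter(key for ts, key in events if ts < up_to)


class RealtimeView:
    """Incrementally updated counts for events at or after the batch
    cutoff (the role the summary assigns to Storm)."""

    def __init__(self, cutoff):
        self.cutoff = cutoff
        self.counts = Counter()

    def update(self, ts, key):
        # Only count data the batch view has not absorbed yet.
        if ts >= self.cutoff:
            self.counts[key] += 1


def query(key, batch, realtime):
    """Answer a query by adding the two views together."""
    return batch[key] + realtime.counts[key]
```

When the next batch run completes with a later cutoff, the real-time view would be rebuilt from that new cutoff, so recent data is counted exactly once.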
This document discusses Netflix's use of Spark workflows on Mesos clusters and how they implement autoscaling. It describes how Netflix runs Spark jobs on Mesos clusters, uses Fenzo to schedule tasks, and implements fixed-size Spark executors. It also discusses how the clusters are autoscaled by monitoring executor usage and resource availability, and scaling the cluster up or down by adding/removing Mesos agents based on utilization thresholds.
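Threshold-based scaling of the kind described above can be sketched as follows. The function name, the CPU-based utilization metric, the threshold values, and the one-agent-at-a-time step are all hypothetical; the summary does not state the actual values or the rules Netflix uses.

```python
def scaling_decision(used_cpus, total_cpus, agents,
                     scale_up_at=0.8, scale_down_at=0.3,
                     min_agents=1, max_agents=100):
    """Return the new agent count for the observed utilization.

    All thresholds and limits are illustrative assumptions.
    """
    if total_cpus <= 0:
        raise ValueError("total_cpus must be positive")
    utilization = used_cpus / total_cpus
    if utilization > scale_up_at and agents < max_agents:
        return agents + 1  # cluster is busy: add an agent
    if utilization < scale_down_at and agents > min_agents:
        return agents - 1  # cluster is idle: remove an agent
    return agents          # within the band: leave the cluster alone
```

Keeping a gap between the two thresholds avoids flapping, where the cluster alternates between adding and removing agents around a single cutoff.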
Some vignettes and advice based on prior experience with Cassandra clusters in live environments. Includes some material from other operational slides.
Learning Stream Processing with Apache Storm | Eugene Dvorkin
Over the last couple of years, Apache Storm has become a de facto standard for developing real-time analytics and complex event processing applications. Storm lets you tackle real-time data processing challenges the same way Hadoop enables batch processing of Big Data. Storm enables companies to have "Fast Data" alongside "Big Data". Some use cases where Storm can be used are Fraud Detection, Operation Intelligence, Machine Learning, ETL, Analytics, etc.
In this meetup, Eugene Dvorkin, Architect @WebMD and NYC Storm User Group organizer, will teach Apache Storm and Stream Processing fundamentals. While this meeting is geared toward new Storm users, experienced users may find something interesting as well.
The following topics will be covered:
• Why use Apache Storm?
• Common use cases
• Storm Architecture - components, concepts, topology
• Building simple Storm topology with Java and Groovy
• Trident and micro-batch processing
• Fault tolerance and guaranteed message delivery
• Running and monitoring Storm in production
• Kafka
• Storm at WebMD
• Resources
This document compares and summarizes several deep learning frameworks: Caffe, Chainer, CNTK, DL4J, Keras, MXNet, TensorFlow, and Theano. It describes who created each framework, when it was released, example applications, design motivations, and key features from technical, design, and programming perspectives.
Chainer v4 includes performance improvements like Intel integration and cuDNN enhancements. It also introduces usability features like Sequential chains and reorganized documentation. Chainer v4 allows exporting models to Caffe and ONNX formats. Chainer v5 is planned to improve usability with NumPy compatibility, distributions support, and code generation. It also aims to enhance performance through static subgraph caching.
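Advanced Java 8 Training (Day 6 of 6)
These are the contents of a Java language lecture I gave in October 2013, while serving as director of the TmaxSoft research lab.
I tried to include an understanding of the JVM and an introduction to Java 8.
The lecture video is available at the link below.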
http://javadom.blogspot.com/2017/07/8-6.html
Bobby from Yahoo presents on running Apache Storm as a service on and off Hadoop. Storm provides low-latency data processing through streaming data flows defined by topologies of spouts and bolts. Yahoo runs Storm as a service and also maintains Spark. Bobby discusses securing standalone Storm, running Storm on YARN for security, reduced overhead and elasticity, and future work including Nimbus high availability and running Storm topologies as unmanaged applications in YARN.
This document discusses using caching to improve performance for web applications. It provides three key points:
1. A cache stores data to serve future requests faster by avoiding database access. It is commonly used for things like login information, page content, and API responses.
2. There are different cache architectures like memcached and Redis that support storing data in-memory for fast retrieval. Factors like data size, update frequency, and consistency requirements determine the appropriate caching strategy.
3. Real-world examples show how companies like Facebook, Twitter, and Wonga use caching extensively to handle high volumes of traffic and database requests. Caching is critical to scaling applications in a cost-effective way.
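One common way to realize the first point is a cache-aside lookup with a time-to-live. The sketch below uses an in-process dictionary in place of memcached or Redis; the class name `CacheAside`, its parameters, and the `load` callback standing in for a database query are hypothetical and not taken from the document.

```python
import time


class CacheAside:
    """Serve reads from memory; fall back to a slow loader on a miss."""

    def __init__(self, load, ttl_seconds=60.0, clock=time.monotonic):
        self._load = load      # slow lookup, e.g. a database query
        self._ttl = ttl_seconds
        self._clock = clock    # injectable so expiry can be tested
        self._entries = {}     # key -> (value, expiry time)
        self.misses = 0

    def get(self, key):
        entry = self._entries.get(key)
        now = self._clock()
        if entry is not None and entry[1] > now:
            return entry[0]    # cache hit: the loader is not called
        self.misses += 1
        value = self._load(key)
        self._entries[key] = (value, now + self._ttl)
        return value

    def invalidate(self, key):
        """Drop an entry after the underlying data changes."""
        self._entries.pop(key, None)
```

The time-to-live bounds how stale a cached value can get, which is where the update-frequency and consistency requirements from the second point come in: frequently changing data calls for a short TTL or explicit invalidation.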
Ryosuke Okuta presented the roadmap for CuPy v4 and v5. Key points include: (1) CuPy aims to make NumPy code easily run on GPUs with minimal changes, (2) CuPy v4 adds wheel packages and memory profiling support, and (3) CuPy v5 plans to support Windows, add more functions, and potentially support AMD GPUs via HIP.
This document discusses using PostgreSQL with Amazon RDS. It begins with an introduction to Amazon RDS and then discusses setting up a PostgreSQL RDS instance, available features like backups and monitoring, limitations, pricing, and references for further reading. The document is intended to provide an overview of deploying and managing PostgreSQL on Amazon RDS.
Distributed real time stream processing- why and how | Petr Zapletal
In this talk you will discover various state-of-the-art open-source distributed streaming frameworks, their similarities and differences, implementation trade-offs, their intended use-cases, and how to choose between them. Petr will focus on the popular frameworks, including Spark Streaming, Storm, Samza and Flink. You will also explore a theoretical introduction, common pitfalls, popular architectures, and much more.
The demand for stream processing is increasing. Immense amounts of data have to be processed fast from a rapidly growing set of disparate data sources. This pushes the limits of traditional data processing infrastructures. These stream-based applications, including trading, social networks, the Internet of Things, and system monitoring, are becoming more and more important. A number of powerful, easy-to-use open source platforms have emerged to address this.
Petr's goal is to provide a comprehensive overview of modern streaming solutions and to help fellow developers with picking the best possible solution for their particular use-case. Join this talk if you are thinking about, implementing, or have already deployed a streaming solution.
The Best and Worst of Cassandra-stress Tool (Christopher Batey, The Last Pick... | DataStax
Making sure your data model will work on the production cluster after 6 months as well as it does on your laptop is an important skill. It's one that we use every day with our clients at The Last Pickle, and one that relies on tools like cassandra-stress. Knowing how the data model will perform under stress once it has been loaded with data can prevent expensive re-writes late in the project.
In this talk, Christopher Batey, Consultant at The Last Pickle, will shed some light on how to use the cassandra-stress tool to test your own schema, graph the results, and even extend the tool for your own use cases. While this may be called premature optimisation for an RDBMS, a successful Cassandra project depends on its data model.
About the Speaker
Christopher Batey Consultant / Software Engineer, The Last Pickle
Christopher (@chbatey) is a part-time consultant at The Last Pickle, where he works with clients to help them succeed with Apache Cassandra, as well as a freelance software engineer working in London. Likes: Scala, Haskell, Java, the JVM, Akka, distributed databases, XP, TDD, Pairing. Hates: Untested software, code ownership. You can check out his blog at: http://www.batey.info
Stream processing engines allow for real-time processing of streaming data. There are several open source stream processing frameworks including Storm, Spark Streaming, Samza, Flink, and Orleans. Each framework has different programming models, APIs, guarantees around message delivery, and approaches to reliability and fault tolerance. Performance characteristics around latency and throughput also vary between frameworks.
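[Open Source Consulting] Cloud-Based U2L Migration Strategy and Considerations | Ji-Woong Choi
This material explains guidelines for cloud-based U2C (Unix to Cloud) and U2L (Unix to Linux) migration, along with sizing considerations.
It draws on experience from many migration projects and covers the difficulty and considerations for each type of migration.
[142] Robots Based on an Understanding of Biology: Giving High-Performance Robots Human Flexibility and Safety | NAVER D2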
Yong Jae Kim is a professor at KoreaTech who researches wearable, surgical, humanoid and mobile robots. His research interests include mechanism design and control of flexible robots. He has worked at Samsung Electronics and MIT. Some of his projects include a high-DOF robotic hand called RoboRay Hand, a power-assist glove called TATH Glove, and a soft parallel gripper. He aims to design robots with both flexibility and precision, like the human hand. He proposes using tendon-driven and compliant mechanisms inspired by biology to achieve this goal.
[215] streetwise machine learning for painless parking | NAVER D2
The document summarizes research on using machine learning techniques for optimizing parking policies. It discusses using parking data from various sources like sensors and payments to set pricing, guide enforcement, and help drivers find spaces. Pricing models are developed to maximize the overall value people get from the parking system. A voting rule is proposed as a simple way to adjust prices based on occupancy levels over time. Spatial and temporal sampling techniques are explored to reduce sensor costs while still obtaining high quality data, such as prioritizing observations of locations with higher predictive uncertainty.
[221] Clova Music, a smart AI DJ assistant | NAVER D2
Clova Music provides AI-assisted music recommendations through three key techniques:
1) Musical semantic embedding to map music items to a shared semantic space for identifying similarities.
2) Matrix factorization for personalized music recommendations based on a user's preferences.
3) Music content modeling using convolutional recurrent attention networks (CRAN) to extract highlights from songs and a multimodal convolutional recurrent network (MCRN) to recognize musical emotions.
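The matrix factorization step in point 2 can be illustrated with a small generic sketch. It is not Clova Music's model: the SGD update, the hyperparameters, and the names `factorize`, `predict`, and `mse` are assumptions chosen to show the general technique of learning user and item factors from observed ratings.

```python
import random


def predict(P, Q, user, item):
    """Predicted preference: dot product of user and item factors."""
    return sum(p * q for p, q in zip(P[user], Q[item]))


def mse(P, Q, ratings):
    """Mean squared error over the observed (user, item, rating) triples."""
    return sum((r - predict(P, Q, u, i)) ** 2
               for u, i, r in ratings) / len(ratings)


def factorize(ratings, n_users, n_items, rank=2, lr=0.02, reg=0.01,
              epochs=500, seed=0):
    """Learn factor matrices P (users) and Q (items) with plain SGD.

    All hyperparameter defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    P = [[rng.uniform(0.0, 0.1) for _ in range(rank)] for _ in range(n_users)]
    Q = [[rng.uniform(0.0, 0.1) for _ in range(rank)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - predict(P, Q, u, i)
            for k in range(rank):
                p, q = P[u][k], Q[i][k]
                # Gradient step with L2 regularization on both factors.
                P[u][k] += lr * (err * q - reg * p)
                Q[i][k] += lr * (err * p - reg * q)
    return P, Q
```

Once trained, `predict(P, Q, user, item)` can score items the user has not rated, and the highest-scoring ones become the personalized recommendations.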
[241] large scale search with polysemous codes | NAVER D2
This document discusses using polysemous codes to perform large-scale search over visual signatures. Polysemous codes allow product quantization codes to be interpreted as both compact binary codes for efficient Hamming distance search and codes that preserve distance information for accurate nearest neighbor search. The key ideas are to learn an index assignment that maps similar product quantization codes to binary codes with smaller Hamming distance, and to directly optimize this assignment to match the distances between codebook centroids. This allows using a single code representation for both fast Hamming search and precise distance search, without increasing memory requirements. The document provides examples of applying polysemous codes to build a large graph connecting images based on visual similarity.
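The two-stage search that polysemous codes enable (a cheap Hamming filter followed by a precise distance ranking) can be sketched in a simplified form. This sketch does not implement the learned index assignment, and unlike the real method it stores a separate binary code and vector per item instead of deriving both from one product quantization code; the names `hamming`, `squared_distance`, and `two_stage_search` are hypothetical.

```python
def hamming(a, b):
    """Number of differing bits between two integer-encoded binary codes."""
    return bin(a ^ b).count("1")


def squared_distance(x, y):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))


def two_stage_search(query_code, query_vec, database, max_hamming, k=1):
    """Return indices of the k nearest items among the Hamming survivors.

    database is a list of (binary_code, vector) pairs. Stage 1 discards
    items whose code is too far in Hamming distance; stage 2 ranks the
    survivors by the more expensive exact distance.
    """
    survivors = [
        (index, vec)
        for index, (code, vec) in enumerate(database)
        if hamming(code, query_code) <= max_hamming
    ]
    survivors.sort(key=lambda item: squared_distance(item[1], query_vec))
    return [index for index, _ in survivors[:k]]
```

The filter only helps if codes that are close in Hamming distance also belong to nearby vectors, which is what the learned index assignment described above is meant to ensure.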
[213] building ai to recreate our visual world | NAVER D2
The document discusses recent advances in generative models for visual data synthesis and translation. It provides an overview of generative adversarial networks (GANs) and their applications, including generating images from random noise and translating images from one domain to another. Cycle-consistent adversarial networks (CycleGAN) are introduced as a way to perform image-to-image translation without paired training examples. The document highlights several examples of visual synthesis and translation using GANs and CycleGAN, including style transfer, domain adaptation, and generating images from segmentation maps.
[246] reasoning, attention and memory toward differentiable reasoning machines | NAVER D2
This document discusses using differentiable programming and memory networks to build machines capable of reasoning. It describes how attention and memory can help with reasoning tasks by focusing on relevant information and overcoming limits of data size. Application areas discussed include machine reading, dialog state tracking, and end-to-end dialog learning. Memory networks are presented as a way to perform these tasks with an end-to-end trainable architecture using a non-parametric memory accessed through attention. The document concludes by noting this is an open field of research with opportunities in theoretical analysis and improving learning procedures.
This document describes MIST, a system for large-scale IoT stream processing. MIST uses a cluster of machines to efficiently handle billions of IoT stream queries. It provides query APIs that allow users to define dataflow and complex event processing queries. MIST optimizes processing by sharing code, exploiting locality of code references through query grouping, and merging queries to reuse system resources.
Hybrid Transactional/Analytics Processing with Spark and IMDGsAli Hodroj
This document discusses hybrid transactional/analytical processing (HTAP) with Apache Spark and in-memory data grids. It begins by introducing the speaker and GigaSpaces. It then discusses how modern applications require both online transaction processing and real-time operational intelligence. The document presents examples from retail and IoT and the goals of minimizing latency while maximizing data analytics locality. It provides an overview of in-memory computing options and describes how GigaSpaces uses an in-memory data grid combined with Spark to achieve HTAP. The document includes deployment diagrams and discusses data grid RDDs and pushing predicates to the data grid. It describes how this was productized as InsightEdge and provides additional innovations and reference architectures.
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKzmhassan
As spark applications move to a containerized environment, there are many questions about how to best configure server systems in the container world. In this talk we will demonstrate a set of tools to better monitor performance and identify optimal configuration settings. We will demonstrate how Prometheus, a project that is now part of the Cloud Native Computing Foundation (CNCF), can be applied to monitor and archive system performance data in a containerized spark environment. In our examples, we will gather spark metric output through Prometheus and present the data with Grafana dashboards. We will use our examples to demonstrate how performance can be enhanced through different tuned configuration settings. Our demo will show how to configure settings across the cluster as well as within each node.
Build Large-Scale Data Analytics and AI Pipeline Using RayDPDatabricks
A large-scale end-to-end data analytics and AI pipeline usually involves data processing frameworks such as Apache Spark for massive data preprocessing, and ML/DL frameworks for distributed training on the preprocessed data. A conventional approach is to use two separate clusters and glue multiple jobs. Other solutions include running deep learning frameworks in an Apache Spark cluster, or use workflow orchestrators like Kubeflow to stitch distributed programs. All these options have their own limitations. We introduce Ray as a single substrate for distributed data processing and machine learning. We also introduce RayDP which allows you to start an Apache Spark job on Ray in your python program and utilize Ray’s in-memory object store to efficiently exchange data between Apache Spark and other libraries. We will demonstrate how this makes building an end-to-end data analytics and AI pipeline simpler and more efficient.
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...Databricks
As Apache Spark applications move to a containerized environment, there are many questions about how to best configure server systems in the container world. In this talk we will demonstrate a set of tools to better monitor performance and identify optimal configuration settings. We will demonstrate how Prometheus, a project that is now part of the Cloud Native Computing Foundation (CNCF: https://www.cncf.io/projects/), can be applied to monitor and archive system performance data in a containerized spark environment.
In our examples, we will gather spark metric output through Prometheus and present the data with Grafana dashboards. We will use our examples to demonstrate how performance can be enhanced through different tuned configuration settings. Our demo will show how to configure settings across the cluster as well as within each node.
Explore big data at speed of thought with Spark 2.0 and SnappydataData Con LA
Abstract:
Data exploration often requires running aggregation/slice-dice queries on data sourced from disparate sources. You may want to identify distribution patterns, outliers, etc. and aid the feature selection process as you train your predictive models. As you begin to understand your data, you want to ask ad-hoc questions expressed through your visualization tool (which typically translates to SQL queries), study the results and iteratively explore the data set through more queries. Unfortunately, even when data sets can be in-memory, large data set computations take time, breaking the train of thought and increasing time to insight. We know Spark can be fast through its in-memory parallel processing. But Spark 1.x isn’t quite there. Spark 2.0 promises to offer 10X better speed than its predecessor. Spark 2.0 ushers in some impressive improvements to interactive query performance. We first explore these advances: compiling the query plan, eliminating virtual function calls, and other improvements in the Catalyst engine. We compare the performance to other popular query processing engines by studying the Spark query plans. We then go through SnappyData (an open source project that integrates Spark with a database that offers OLTP, OLAP and stream processing in a single cluster) where we use smarter data colocation and synopses data (e.g. stratified sampling) to dramatically cut down on the memory requirements as well as the query latency. We explain the key concepts in summarizing data using structures like stratified sampling by walking through some examples in Apache Zeppelin notebooks (an open source visualization tool for Spark) and demonstrate how we can explore massive data sets with just your laptop resources while achieving remarkable speeds.
Bio:
Jags Ramnarayan is a founder and the CTO of SnappyData. Previously, Jags was the Chief Architect for “fast data” products at Pivotal and served in the extended leadership team of the company. At Pivotal and previously at VMWare, he led the technology direction for GemFire and other distributed in-memory products.
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceDatabricks
Zeus is an efficient, highly scalable and distributed shuffle as a service which is powering all Data processing (Spark and Hive) at Uber. Uber runs one of the largest Spark and Hive clusters on top of YARN in industry which leads to many issues such as hardware failures (Burn out Disks), reliability and scalability challenges.
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonMiklos Christine
Apache Spark is the next big data processing tool for Data Scientists. As seen in the recent StackOverflow analysis, it's the hottest big data technology on their site! In this talk, I'll use the PySpark interface to leverage the speed and performance of Apache Spark. I'll focus on the end-to-end workflow for getting data into a distributed platform, and leverage Spark to process the data for advanced analytics. I'll discuss the popular Spark APIs used for data preparation, SQL analysis, and ML algorithms. I'll explain the performance differences between Scala and Python, and how Spark has bridged the gap in performance. I'll focus on PySpark as the interface to the platform, and walk through a demo to showcase the APIs.
Talk Overview:
Spark's Architecture. What's out now and what's in Spark 2.0
Spark APIs: Most common APIs used by Spark
Common misconceptions and proper techniques for using Spark.
Demo:
Walk through ETL of the Reddit dataset.
SparkSQL Analytics + Visualizations of the Dataset using MatplotLib
Sentiment Analysis on Reddit Comments
BKK16-408B Data Analytics and Machine Learning From Node to ClusterLinaro
Linaro is building an OpenStack based Developer Cloud. Here we present what was required to bring OpenStack to 64-bit ARM, the pitfalls, successes and lessons learnt; what’s missing and what’s next.
Boosting spark performance: An Overview of TechniquesAhsan Javed Awan
This document provides an overview of techniques to boost Spark performance, including:
1) Phase 1 focused on memory management, code generation, and cache-aware algorithms which provided 5-30x speedups
2) Phase 2 focused on whole-stage code generation and columnar in-memory support which are now enabled by default in Spark 2.0+
3) Additional techniques discussed include choosing an optimal garbage collector, using multiple small executors, exploiting data locality, disabling hardware prefetchers, and keeping hyper-threading on.
Big Data in 200 km/h | AWS Big Data Demystified #1.3 Omid Vahdaty
What we're about
A while ago I entered the challenging world of Big Data. As an engineer, at first, I was not so impressed with this field. As time went by, I realised more and more that the technological challenges in this area are too great for one person to master. Just look at the picture in this article; it covers only a small fraction of the technologies in the Big Data industry…
Consequently, I created a meetup detailing all the challenges of Big Data, especially in the world of cloud. I am using AWS infrastructure to answer the basic questions of anyone starting their way in the big data world.
how to transform data (TXT, CSV, TSV, JSON) into Parquet, ORC
which technology should we use to model the data? EMR? Athena? Redshift? Spectrum? Glue? Spark? SparkSQL?
how to handle streaming?
how to manage costs?
Performance tips?
Security tips?
Cloud best practices tips?
Some of our online materials:
Website:
https://big-data-demystified.ninja/
Youtube channels:
https://www.youtube.com/channel/UCzeGqhZIWU-hIDczWa8GtgQ?view_as=subscriber
https://www.youtube.com/channel/UCMSdNB0fGmX5dXI7S7Y_LFA?view_as=subscriber
Meetup:
https://www.meetup.com/AWS-Big-Data-Demystified/
https://www.meetup.com/Big-Data-Demystified
Facebook Group :
https://www.facebook.com/groups/amazon.aws.big.data.demystified/
Facebook page (https://www.facebook.com/Amazon-AWS-Big-Data-Demystified-1832900280345700/)
Audience:
Data Engineers
Data Science
DevOps Engineers
Big Data Architects
Solution Architects
CTO
VP R&D
In this second part, we'll continue the review of Spark and introduce SparkSQL, which lets you use data frames in Python, Java, and Scala; read and write data in a variety of structured formats; and query Big Data with SQL.
Performance Characterization and Optimization of In-Memory Data Analytics on ...Ahsan Javed Awan
The document discusses performance optimization of Apache Spark on scale-up servers through near-data processing. It finds that Spark workloads have poor multi-core scalability and high I/O wait times on scale-up servers. It proposes exploiting near-data processing through in-storage processing and 2D-integrated processing-in-memory to reduce data movements and latency. The author evaluates these techniques through modeling and a programmable FPGA accelerator to improve the performance of Spark MLlib workloads by up to 9x. Challenges in hybrid CPU-FPGA design and attaining peak performance are also discussed.
Introduction to Spark - Phoenix Meetup 08-19-2014cdmaxime
This document provides an introduction to Apache Spark presented by Maxime Dumas. It discusses how Spark improves on MapReduce by offering better performance through leveraging distributed memory and supporting iterative algorithms. Spark retains MapReduce's advantages of scalability, fault-tolerance, and data locality while offering a more powerful and easier to use programming model. Examples demonstrate how tasks like word counting, logistic regression, and streaming data processing can be implemented on Spark. The document concludes by discussing Spark's integration with other Hadoop components and inviting attendees to try Spark.
Spark's distributed programming model uses resilient distributed datasets (RDDs) and a directed acyclic graph (DAG) approach. RDDs support transformations like map, filter, and actions like collect. Transformations are lazy and form the DAG, while actions execute the DAG. RDDs support caching, partitioning, and sharing state through broadcasts and accumulators. The programming model aims to optimize the DAG through operations like predicate pushdown and partition coalescing.
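The split between lazy transformations and eager actions can be illustrated without Spark itself. The following is a minimal pure-Python sketch, not Spark's implementation: `LazyDataset`, `traced_double`, and the variable names are hypothetical, and the "DAG" is reduced to a linear list of recorded operations.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, source, ops=()):
        self._source = source    # any iterable of input records
        self._ops = tuple(ops)   # recorded (kind, function) pairs, i.e. the lineage

    def map(self, fn):
        # Transformation: returns a new dataset and computes nothing yet.
        return LazyDataset(self._source, self._ops + (("map", fn),))

    def filter(self, pred):
        # Transformation: also only recorded.
        return LazyDataset(self._source, self._ops + (("filter", pred),))

    def collect(self):
        # Action: replay the recorded lineage over every source record.
        out = []
        for record in self._source:
            keep = True
            for kind, fn in self._ops:
                if kind == "map":
                    record = fn(record)
                elif not fn(record):
                    keep = False
                    break
            if keep:
                out.append(record)
        return out


calls = []

def traced_double(x):
    # Records each invocation so we can observe when work actually happens.
    calls.append(x)
    return 2 * x

ds = LazyDataset(range(5)).map(traced_double).filter(lambda x: x > 2)
calls_before_action = len(calls)  # still 0: nothing has executed
result = ds.collect()             # the action triggers the whole chain
```

Here `calls_before_action` stays at zero because `map` and `filter` only extend the recorded lineage; the work happens inside `collect`, which is the behaviour the summary attributes to actions.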
This document provides an overview of Apache Spark's architectural components through the life of simple Spark jobs. It begins with a simple Spark application analyzing airline on-time arrival data, then covers Resilient Distributed Datasets (RDDs), the cluster architecture, job execution through Spark components like tasks and scheduling, and techniques for writing better Spark applications like optimizing partitioning and reducing shuffle size.
The document discusses various machine learning clustering algorithms like K-means clustering, DBSCAN, and EM clustering. It also discusses neural network architectures like LSTM, bi-LSTM, and convolutional neural networks. Finally, it presents results from evaluating different chatbot models on various metrics like validation score.
The document discusses challenges with using reinforcement learning for robotics. While simulations allow fast training of agents, there is often a "reality gap" when transferring learning to real robots. Other approaches like imitation learning and self-supervised learning can be safer alternatives that don't require trial-and-error. To better apply reinforcement learning, robots may need model-based approaches that learn forward models of the world, as well as techniques like active localization that allow robots to gather targeted information through interactive perception. Closing the reality gap will require finding ways to better match simulations to reality or allow robots to learn from real-world experiences.
[243] Deep Learning to help student’s Deep LearningNAVER D2
This document describes research on using deep learning to predict student performance in massive open online courses (MOOCs). It introduces GritNet, a model that takes raw student activity data as input and predicts outcomes like course graduation without feature engineering. GritNet outperforms baselines by more than 5% in predicting graduation. The document also describes how GritNet can be adapted in an unsupervised way to new courses using pseudo-labels, improving predictions in the first few weeks. Overall, GritNet is presented as the state-of-the-art for student prediction and can be transferred across courses without labels.
[234]Fast & Accurate Data Annotation Pipeline for AI applicationsNAVER D2
This document provides a summary of new datasets and papers related to computer vision tasks including object detection, image matting, person pose estimation, pedestrian detection, and person instance segmentation. A total of 8 papers and their associated datasets are listed with brief descriptions of the core contributions or techniques developed in each.
[226]NAVER 광고 deep click prediction: 모델링부터 서빙까지NAVER D2
This document presents a formula for calculating the loss function J(θ) in machine learning models. The formula averages the negative log likelihood of the predicted probabilities being correct over all samples S, and includes a regularization term λ that penalizes predicted embeddings being dissimilar from actual embeddings. It also defines the cosine similarity term used in the regularization.
[214] Ai Serving Platform: 하루 수 억 건의 인퍼런스를 처리하기 위한 고군분투기NAVER D2
The document discusses running a TensorFlow Serving (TFS) container using Docker. It shows commands to:
1. Pull the TFS Docker image from a repository
2. Define a script to configure and run the TFS container, specifying the model path, name, and port mapping
3. Run the script to start the TFS container exposing port 13377
The document discusses linear algebra concepts including:
- Representing a system of linear equations as a matrix equation Ax = b where A is a coefficient matrix, x is a vector of unknowns, and b is a vector of constants.
- Solving for the vector x that satisfies the matrix equation using linear algebra techniques such as row reduction.
- Examples of matrix equations and their component vectors are shown.
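As a small worked instance of solving Ax = b by row reduction, the sketch below uses Gauss-Jordan elimination with exact fractions. The function name `solve_linear_system` and the 2x2 example are illustrative choices, not taken from the summarized document, and the code assumes A is square and non-singular.

```python
from fractions import Fraction


def solve_linear_system(a, b):
    """Solve Ax = b for square A by Gauss-Jordan elimination (exact arithmetic)."""
    n = len(a)
    # Build the augmented matrix [A | b].
    aug = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        # Find a row at or below the diagonal with a non-zero entry in this column.
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Scale the pivot row so the pivot becomes 1.
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    # The last column now holds the solution vector x.
    return [row[-1] for row in aug]


# 2x + y = 5 and x + 3y = 10
A = [[2, 1], [1, 3]]
b = [5, 10]
x = solve_linear_system(A, b)  # x = 1, y = 3
```

Substituting back confirms the result: 2(1) + 3 = 5 and 1 + 3(3) = 10.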
This document describes the steps to convert a TensorFlow model to a TensorRT engine for inference. It includes steps to parse the model, optimize it, generate a runtime engine, serialize and deserialize the engine, as well as perform inference using the engine. It also provides code snippets for a PReLU plugin implementation in C++.
The document discusses machine reading comprehension (MRC) techniques for question answering (QA) systems, comparing search-based and natural language processing (NLP)-based approaches. It covers key milestones in the development of extractive QA models using NLP, from early sentence-level models to current state-of-the-art techniques like cross-attention, self-attention, and transfer learning. It notes the speed and scalability benefits of combining search and reading methods for QA.
Taking AI to the Next Level in Manufacturing.pdfssuserfac0301
Read Taking AI to the Next Level in Manufacturing to gain insights on AI adoption in the manufacturing industry, such as:
1. How quickly AI is being implemented in manufacturing.
2. Which barriers stand in the way of AI adoption.
3. How data quality and governance form the backbone of AI.
4. Organizational processes and structures that may inhibit effective AI adoption.
6. Ideas and approaches to help build your organization's AI strategy.
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-EfficiencyScyllaDB
Freshworks creates AI-boosted business software that helps employees work more efficiently and effectively. Managing data across multiple RDBMS and NoSQL databases was already a challenge at their current scale. To prepare for 10X growth, they knew it was time to rethink their database strategy. Learn how they architected a solution that would simplify scaling while keeping costs under control.
Your One-Stop Shop for Python Success: Top 10 US Python Development Providersakankshawande
Simplify your search for a reliable Python development partner! This list presents the top 10 trusted US providers offering comprehensive Python development services, ensuring your project's success from conception to completion.
Ivanti’s Patch Tuesday breakdown goes beyond patching your applications and brings you the intelligence and guidance needed to prioritize where to focus your attention first. Catch early analysis on our Ivanti blog, then join industry expert Chris Goettl for the Patch Tuesday Webinar Event. There we’ll do a deep dive into each of the bulletins and give guidance on the risks associated with the newly-identified vulnerabilities.
Essentials of Automations: Exploring Attributes & Automation ParametersSafe Software
Building automations in FME Flow can save time, money, and help businesses scale by eliminating data silos and providing data to stakeholders in real-time. One essential component to orchestrating complex automations is the use of attributes & automation parameters (both formerly known as “keys”). In fact, it’s unlikely you’ll ever build an Automation without using these components, but what exactly are they?
Attributes & automation parameters enable the automation author to pass data values from one automation component to the next. During this webinar, our FME Flow Specialists will cover leveraging the three types of these output attributes & parameters in FME Flow: Event, Custom, and Automation. As a bonus, they’ll also be making use of the Split-Merge Block functionality.
You’ll leave this webinar with a better understanding of how to maximize the potential of automations by making use of attributes & automation parameters, with the ultimate goal of setting your enterprise integration workflows up on autopilot.
Have you ever been confused by the myriad of choices offered by AWS for hosting a website or an API?
Lambda, Elastic Beanstalk, Lightsail, Amplify, S3 (and more!) can each host websites + APIs. But which one should we choose?
Which one is cheapest? Which one is fastest? Which one will scale to meet our needs?
Join me in this session as we dive into each AWS hosting service to determine which one is best for your scenario and explain why!
"Choosing proper type of scaling", Olena SyrotaFwdays
Imagine an IoT processing system that is already quite mature and production-ready and for which client coverage is growing and scaling and performance aspects are life and death questions. The system has Redis, MongoDB, and stream processing based on ksqldb. In this talk, firstly, we will analyze scaling approaches and then select the proper ones for our system.
How information systems are built or acquired puts information, which is what they should be about, in a secondary place. Our language adapted accordingly, and we no longer talk about information systems but applications. Applications evolved in a way to break data into diverse fragments, tightly coupled with applications and expensive to integrate. The result is technical debt, which is re-paid by taking even bigger "loans", resulting in an ever-increasing technical debt. Software engineering and procurement practices work in sync with market forces to maintain this trend. This talk demonstrates how natural this situation is. The question is: can something be done to reverse the trend?
In the realm of cybersecurity, offensive security practices act as a critical shield. By simulating real-world attacks in a controlled environment, these techniques expose vulnerabilities before malicious actors can exploit them. This proactive approach allows manufacturers to identify and fix weaknesses, significantly enhancing system security.
This presentation delves into the development of a system designed to mimic Galileo's Open Service signal using software-defined radio (SDR) technology. We'll begin with a foundational overview of both Global Navigation Satellite Systems (GNSS) and the intricacies of digital signal processing.
The presentation culminates in a live demonstration. We'll showcase the manipulation of Galileo's Open Service pilot signal, simulating an attack on various software and hardware systems. This practical demonstration serves to highlight the potential consequences of unaddressed vulnerabilities, emphasizing the importance of offensive security practices in safeguarding critical infrastructure.
Driving Business Innovation: Latest Generative AI Advancements & Success StorySafe Software
Are you ready to revolutionize how you handle data? Join us for a webinar where we’ll bring you up to speed with the latest advancements in Generative AI technology and discover how leveraging FME with tools from giants like Google Gemini, Amazon, and Microsoft OpenAI can supercharge your workflow efficiency.
During the hour, we’ll take you through:
Guest Speaker Segment with Hannah Barrington: Dive into the world of dynamic real estate marketing with Hannah, the Marketing Manager at Workspace Group. Hear firsthand how their team generates engaging descriptions for thousands of office units by integrating diverse data sources—from PDF floorplans to web pages—using FME transformers, like OpenAIVisionConnector and AnthropicVisionConnector. This use case will show you how GenAI can streamline content creation for marketing across the board.
Ollama Use Case: Learn how Scenario Specialist Dmitri Bagh has utilized Ollama within FME to input data, create custom models, and enhance security protocols. This segment will include demos to illustrate the full capabilities of FME in AI-driven processes.
Custom AI Models: Discover how to leverage FME to build personalized AI models using your data. Whether it’s populating a model with local data for added security or integrating public AI tools, find out how FME facilitates a versatile and secure approach to AI.
We’ll wrap up with a live Q&A session where you can engage with our experts on your specific use cases, and learn more about optimizing your data workflows with AI.
This webinar is ideal for professionals seeking to harness the power of AI within their data management systems while ensuring high levels of customization and security. Whether you're a novice or an expert, gain actionable insights and strategies to elevate your data processes. Join us to see how FME and AI can revolutionize how you work with data!
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...Jason Yip
The typical problem in product engineering is not bad strategy, so much as “no strategy”. This leads to confusion, lack of motivation, and incoherent action. The next time you look for a strategy and find an empty space, instead of waiting for it to be filled, I will show you how to fill it in yourself. If you’re wrong, it forces a correction. If you’re right, it helps create focus. I’ll share how I’ve approached this in the past, both what works and lessons for what didn’t work so well.
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsDianaGray10
Join us to learn how UiPath Apps can directly and easily interact with prebuilt connectors via Integration Service--including Salesforce, ServiceNow, Open GenAI, and more.
The best part is you can achieve this without building a custom workflow! Say goodbye to the hassle of using separate automations to call APIs. By seamlessly integrating within App Studio, you can now easily streamline your workflow, while gaining direct access to our Connector Catalog of popular applications.
We’ll discuss and demo the benefits of UiPath Apps and connectors including:
Creating a compelling user experience for any software, without the limitations of APIs.
Accelerating the app creation process, saving time and effort
Enjoying high-performance CRUD (create, read, update, delete) operations for seamless data management.
Speakers:
Russell Alfeche, Technology Leader, RPA at qBotic and UiPath MVP
Charlie Greenberg, host
Dandelion Hashtable: beyond billion requests per second on a commodity serverAntonios Katsarakis
This slide deck presents DLHT, a concurrent in-memory hashtable. Despite efforts to optimize hashtables, which go as far as sacrificing core functionality, state-of-the-art designs still incur multiple memory accesses per request and block request processing in three cases. First, most hashtables block while waiting for data to be retrieved from memory. Second, open-addressing designs, which represent the current state-of-the-art, either cannot free index slots on deletes or must block all requests to do so. Third, index resizes block every request until all objects are copied to the new index. Defying folklore wisdom, DLHT forgoes open-addressing and adopts a fully-featured and memory-aware closed-addressing design based on bounded cache-line-chaining. This design (1) offers lock-free index operations and deletes that free slots instantly, (2) completes most requests with a single memory access, (3) utilizes software prefetching to hide memory latencies, and (4) employs a novel non-blocking and parallel resizing. On a commodity server with a memory-resident workload, DLHT surpasses 1.6B requests per second and provides 3.5x (12x) the throughput of the state-of-the-art closed-addressing (open-addressing) resizable hashtable on Gets (Deletes).

Digital Banking in the Cloud: How Citizens Bank Unlocked Their MainframePrecisely
Inconsistent user experience and siloed data, high costs, and changing customer expectations – Citizens Bank was experiencing these challenges while it was attempting to deliver a superior digital banking experience for its clients. Its core banking applications run on the mainframe and Citizens was using legacy utilities to get the critical mainframe data to feed customer-facing channels, like call centers, web, and mobile. Ultimately, this led to higher operating costs (MIPS), delayed response times, and longer time to market.
Ever-changing customer expectations demand more modern digital experiences, and the bank needed to find a solution that could provide real-time data to its customer channels with low latency and operating costs. Join this session to learn how Citizens is leveraging Precisely to replicate mainframe data to its customer channels and deliver on their “modern digital bank” experiences.
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfChart Kalyan
A Mix Chart displays historical data of numbers in a graphical or tabular form. The Kalyan Rajdhani Mix Chart specifically shows the results of a sequence of numbers over different periods.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/how-axelera-ai-uses-digital-compute-in-memory-to-deliver-fast-and-energy-efficient-computer-vision-a-presentation-from-axelera-ai/
Bram Verhoef, Head of Machine Learning at Axelera AI, presents the “How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-efficient Computer Vision” tutorial at the May 2024 Embedded Vision Summit.
As artificial intelligence inference transitions from cloud environments to edge locations, computer vision applications achieve heightened responsiveness, reliability and privacy. This migration, however, introduces the challenge of operating within the stringent confines of resource constraints typical at the edge, including small form factors, low energy budgets and diminished memory and computational capacities. Axelera AI addresses these challenges through an innovative approach of performing digital computations within memory itself. This technique facilitates the realization of high-performance, energy-efficient and cost-effective computer vision capabilities at the thin and thick edge, extending the frontier of what is achievable with current technologies.
In this presentation, Verhoef unveils his company’s pioneering chip technology and demonstrates its capacity to deliver exceptional frames-per-second performance across a range of standard computer vision networks typical of applications in security, surveillance and the industrial sector. This shows that advanced computer vision can be accessible and efficient, even at the very edge of our technological ecosystem.
Conversational agents, or chatbots, are increasingly used to access all sorts of services using natural language. While open-domain chatbots - like ChatGPT - can converse on any topic, task-oriented chatbots - the focus of this paper - are designed for specific tasks, like booking a flight, obtaining customer support, or setting an appointment. Like any other software, task-oriented chatbots need to be properly tested, usually by defining and executing test scenarios (i.e., sequences of user-chatbot interactions). However, there is currently a lack of methods to quantify the completeness and strength of such test scenarios, which can lead to low-quality tests, and hence to buggy chatbots.
To fill this gap, we propose adapting mutation testing (MuT) for task-oriented chatbots. To this end, we introduce a set of mutation operators that emulate faults in chatbot designs, an architecture that enables MuT on chatbots built using heterogeneous technologies, and a practical realisation as an Eclipse plugin. Moreover, we evaluate the applicability, effectiveness and efficiency of our approach on open-source chatbots, with promising results.
7. Data Processing from 10,000 Feet
Data Processing Application
Data Processing Framework (Spark, Flink, Hadoop MR, Dryad, Tez, ...)
Resource Environment
It is hard to add new application optimization features to existing frameworks.
8. Dynamic Optimization
Dynamic skew handling
Optimizing job execution based on its characteristics
Adapting execution to resource elasticity
9. Key Observation
Current data processing frameworks are not flexible and extensible.
=> A new flexible and extensible data processing system
14. Compiler Passes
Transform an IR DAG into an optimized IR DAG after a series of “passes”
Compile-Time Annotation Pass examples:
● Parallelism Pass
● Executor Placement Pass
● Data Flow Model Pass
● Stage Partitioning Pass
Variations:
● Transient Resource EP Pass
● Transient Resource DFM Pass
● Resource Disaggregation EP Pass
● Resource Disaggregation DFM Pass
15. Compiler Passes
Transform an IR DAG into an optimized IR DAG after a series of “passes”
Compile-Time Annotation Pass examples:
Common:
● Parallelism Pass
● Executor Placement Pass
● Data Flow Model Pass
● Stage Partitioning Pass
Variations:
Specialized:
● Transient Resource EP Pass
● Transient Resource DFM Pass
Specialized:
● Resource Disaggregation EP Pass
● Resource Disaggregation DFM Pass
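The idea of an annotation pass can be sketched as a function that takes an IR DAG and returns it with extra execution properties attached. This is a minimal illustration, not Onyx's actual API: the `Vertex` and `IRDag` classes, the annotation keys, and the placement rule in the transient-resource variation are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Vertex:
    name: str
    annotations: Dict[str, object] = field(default_factory=dict)


@dataclass
class IRDag:
    vertices: List[Vertex]


# A pass maps an IR DAG to an (annotated) IR DAG.
Pass = Callable[[IRDag], IRDag]


def parallelism_pass(dag: IRDag) -> IRDag:
    # Hypothetical rule: "map" vertices get 100 tasks, all others get 50.
    for v in dag.vertices:
        v.annotations["parallelism"] = 100 if v.name.startswith("map") else 50
    return dag


def default_placement_pass(dag: IRDag) -> IRDag:
    # Common case: every operator runs on a "compute" node.
    for v in dag.vertices:
        v.annotations["placement"] = "compute"
    return dag


def transient_placement_pass(dag: IRDag) -> IRDag:
    # Variation for transient resources. The rule below is an assumption
    # made for illustration only.
    for v in dag.vertices:
        is_reduce = v.name.startswith("reduce")
        v.annotations["placement"] = "reserved" if is_reduce else "transient"
    return dag


def run_passes(dag: IRDag, passes: List[Pass]) -> IRDag:
    # Apply the passes in order; each one refines the previous result.
    for p in passes:
        dag = p(dag)
    return dag
```

Swapping `default_placement_pass` for `transient_placement_pass` in the pass list changes the execution policy without touching the application or the other passes.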
16. Compiler Passes
Transform an IR DAG into an optimized IR DAG after a series of “passes”
Compile-Time Reshaping Pass examples:
● Loop Extraction Pass
● Loop Fusion Pass (Loop Optimization)
● Common Subexpression Elimination Pass
● Data Skew Reshaping Pass
Runtime Pass example:
● Data Skew Runtime Pass
17. Compiler Passes
Transform an IR DAG into an optimized IR DAG after a series of “passes”
Compile-Time Reshaping Pass examples:
● Loop Extraction Pass
● Loop Fusion Pass (Loop Optimization)
● Common Subexpression Elimination Pass
● Data Skew Reshaping Pass
Runtime Pass example:
● Data Skew Runtime Pass
Specialized
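Unlike an annotation pass, a reshaping pass changes the structure of the DAG. As a rough sketch of what a common subexpression elimination pass might do, the function below merges vertices that apply the same operator to the same inputs. The dictionary-based DAG encoding is hypothetical, and the sketch assumes operators are deterministic and side-effect free.

```python
from typing import Dict, Tuple

# Hypothetical encoding: vertex name -> (operator label, input vertex names).
# Entries are assumed to be listed in topological order.
Dag = Dict[str, Tuple[str, Tuple[str, ...]]]


def eliminate_common_subexpressions(dag: Dag) -> Dag:
    canonical: Dict[Tuple[str, Tuple[str, ...]], str] = {}
    renamed: Dict[str, str] = {}
    result: Dag = {}
    for name, (op, inputs) in dag.items():
        # Redirect inputs that pointed at an already-merged duplicate.
        new_inputs = tuple(renamed.get(i, i) for i in inputs)
        key = (op, new_inputs)
        if key in canonical:
            # Same operator on the same inputs: reuse the earlier vertex.
            renamed[name] = canonical[key]
        else:
            canonical[key] = name
            result[name] = (op, new_inputs)
    return result


dag = {
    "read": ("read(input)", ()),
    "parse_a": ("parse", ("read",)),
    "parse_b": ("parse", ("read",)),  # duplicate of parse_a
    "count": ("count", ("parse_a",)),
    "sum": ("sum", ("parse_b",)),
}
optimized = eliminate_common_subexpressions(dag)
```

After the pass, `parse_b` is gone and `sum` reads from `parse_a`, so the parsing work runs once instead of twice.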
18. Compiler to Runtime
Optimized IR DAG:
Map Stage: Type: “Map” Operator, Placement: “Compute” Node, Parallelism: 100
Edge: Shuffle, Pull, Disk
Reduce Stage: Type: “Reduce” Operator, Placement: “Compute” Node, Parallelism: 50
19. Compiler to Runtime
Physical DAG:
PhysicalStage: “Map” Tasks x 100
PhysicalStage: “Reduce” Tasks x 50
I/O channels for intermediate data flow between tasks
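The conversion from annotated stages to a physical DAG can be sketched as follows: each stage expands into as many tasks as its parallelism, and a shuffle edge becomes one I/O channel per (map task, reduce task) pair. The `Stage` class and function names are hypothetical; only the parallelism values (100 and 50) come from the slides.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Stage:
    name: str
    parallelism: int


def expand_stage(stage: Stage) -> List[str]:
    # One task per unit of parallelism.
    return [f"{stage.name}-{i}" for i in range(stage.parallelism)]


def shuffle_channels(src_tasks: List[str], dst_tasks: List[str]) -> List[Tuple[str, str]]:
    # A shuffle connects every upstream task to every downstream task.
    return [(s, d) for s in src_tasks for d in dst_tasks]


map_tasks = expand_stage(Stage("map", 100))
reduce_tasks = expand_stage(Stage("reduce", 50))
channels = shuffle_channels(map_tasks, reduce_tasks)
print(len(map_tasks), len(reduce_tasks), len(channels))  # 100 50 5000
```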
24. Onyx in Action
● Onyx compiler and runtime components
● Onyx job execution: MR, ALS
● Onyx runtime optimization: dynamic skew handling
● Harnessing transient resources with Onyx
Omitted other optimizations due to time constraints!
29. MapReduce
● We will show two executions of MapReduce using different settings:
○ Intermediate data is saved on disk and pulled by the reducers
○ Intermediate data is saved in memory and pushed to the reducers
● To vary the settings, we go through the following passes:
○ A data store pass
○ A data flow model pass
○ Both of these are “Annotation” passes
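The two MapReduce settings can be sketched as two pass configurations applied to the same shuffle edge. This is an illustration under assumed names: the edge dictionary, its keys, and the pass factory functions are hypothetical, not Onyx's API.

```python
from typing import Callable, Dict, List

Edge = Dict[str, str]
EdgePass = Callable[[Edge], Edge]


def make_data_store_pass(store: str) -> EdgePass:
    # Annotates where intermediate data is kept ("disk" or "memory").
    def apply(edge: Edge) -> Edge:
        return {**edge, "data_store": store}
    return apply


def make_data_flow_model_pass(model: str) -> EdgePass:
    # Annotates how data reaches the reducers ("pull" or "push").
    def apply(edge: Edge) -> Edge:
        return {**edge, "data_flow_model": model}
    return apply


def compile_edge(edge: Edge, passes: List[EdgePass]) -> Edge:
    for p in passes:
        edge = p(edge)
    return edge


shuffle = {"src": "map", "dst": "reduce", "communication": "shuffle"}

# Setting 1: saved on disk, pulled by the reducers.
disk_pull = compile_edge(
    shuffle, [make_data_store_pass("disk"), make_data_flow_model_pass("pull")]
)
# Setting 2: saved in memory, pushed to the reducers.
memory_push = compile_edge(
    shuffle, [make_data_store_pass("memory"), make_data_flow_model_pass("push")]
)
```

The MapReduce program itself is the same in both runs; only the annotations on the edge differ.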
32. Alternating Least Squares Example
● Alternating Least Squares (ALS) is an ML algorithm commonly used in recommendation systems.
● Most ML algorithms are iterative processes => ALS is one of them!
● But how is this expressed in terms of a DAG? (Acyclic!)
33. Alternating Least Squares Example
Naively…
(Read input data) → Iteration 1 → Iteration 2 → ... → Iteration N → (Write output)
Each iteration: a set of operators that executes the ALS algorithm
But what if we want to decide this “N” according to some condition? (ex. model convergence in ML)
34. Alternating Least Squares Example
Something special we have for the ALS example: Loops!
Unrolled: (Read input data) → Iteration 1 → Iteration 2 → ... → Iteration N → (Write output)
With a loop: (Read input data) → LoopVertex with termination condition → (Write output)
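One way to picture a loop vertex with a termination condition is as a wrapper that holds the loop body and decides after each iteration whether to run another one, so "N" no longer has to be fixed up front. The class below is a simplified sketch with a hypothetical interface; the slides do not show how Onyx's LoopVertex is implemented.

```python
from typing import Callable, Generic, Tuple, TypeVar

S = TypeVar("S")


class LoopVertex(Generic[S]):
    """Hypothetical stand-in for a loop vertex: a body plus a stop condition."""

    def __init__(
        self,
        body: Callable[[S], S],
        should_terminate: Callable[[S, int], bool],
        max_iterations: int,
    ) -> None:
        self.body = body
        self.should_terminate = should_terminate
        self.max_iterations = max_iterations

    def run(self, state: S) -> Tuple[S, int]:
        # Run one iteration at a time until the condition holds
        # or the safety cap on iterations is reached.
        iterations = 0
        while iterations < self.max_iterations and not self.should_terminate(state, iterations):
            state = self.body(state)
            iterations += 1
        return state, iterations


# Toy stand-in for model convergence: each iteration halves the error.
loop = LoopVertex(
    body=lambda err: err / 2,
    should_terminate=lambda err, i: err < 0.01,
    max_iterations=100,
)
final_error, iterations = loop.run(1.0)
print(iterations)  # 7
```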
36. Dynamic Data Partitioning Example
● What happens if there is a data skew while executing a job?
● How do we detect such a data skew and partition data appropriately?
Diagram: Onyx Compiler, Onyx Runtime; IR DAG, AnnotationPass(es) and ReshapingPass(es)
37. Dynamic Data Partitioning Example
● What happens if there is a data skew while executing a job?
● How do we detect such a data skew and partition data appropriately?
Diagram: Onyx Compiler, Onyx Runtime; Optimized IR DAG (Stage, Shuffle/Pull/Disk edge, Stage), Physical DAG Conversion
38. Dynamic Data Partitioning Example
● What happens if there is a data skew while executing a job?
● How do we detect such a data skew and partition data appropriately?
Diagram: Onyx Compiler, Onyx Runtime; Physical DAG Conversion, Physical DAG (PhysicalStage, PhysicalStage)
39. Dynamic Data Partitioning Example
● What happens if there is a data skew while executing a job?
● How do we detect such a data skew and partition data appropriately?
Diagram: Onyx Compiler, Onyx Runtime; Execute! Physical DAG (PhysicalStage, PhysicalStage)
40. Dynamic Data Partitioning Example
● What happens if there is a data skew while executing a job?
● How do we detect such a data skew and partition data appropriately?
Diagram: Onyx Compiler, Onyx Runtime; Physical DAG Executing..., Data Size Metric
41. Dynamic Data Partitioning Example
● What happens if there is a data skew while executing a job?
● How do we detect such a data skew and partition data appropriately?
Diagram: Onyx Compiler, Onyx Runtime; RuntimePass(es), New DAG
42. Dynamic Data Partitioning Example
● What happens if there is a data skew while executing a job?
● How do we detect such a data skew and partition data appropriately?
Diagram: Onyx Compiler, Onyx Runtime; Execute! New DAG
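The walkthrough above (collect a data size metric while executing, run a runtime pass, execute the new DAG) can be sketched with a simple rebalancing step. The heuristic used here, assigning the largest hash buckets first to the least-loaded reducer, is a stand-in chosen for illustration and is not necessarily what Onyx's Data Skew Runtime Pass does. The metric format and function names are hypothetical.

```python
import heapq
from typing import Dict, List


def static_assignment(bucket_sizes: Dict[int, int], num_reducers: int) -> List[List[int]]:
    # Baseline fixed before execution: bucket i goes to reducer i % num_reducers.
    assignment: List[List[int]] = [[] for _ in range(num_reducers)]
    for bucket in sorted(bucket_sizes):
        assignment[bucket % num_reducers].append(bucket)
    return assignment


def rebalance(bucket_sizes: Dict[int, int], num_reducers: int) -> List[List[int]]:
    # Uses measured sizes: largest buckets first, each to the least-loaded reducer.
    heap = [(0, r) for r in range(num_reducers)]
    heapq.heapify(heap)
    assignment: List[List[int]] = [[] for _ in range(num_reducers)]
    by_size = sorted(bucket_sizes.items(), key=lambda kv: (-kv[1], kv[0]))
    for bucket, size in by_size:
        load, reducer = heapq.heappop(heap)
        assignment[reducer].append(bucket)
        heapq.heappush(heap, (load + size, reducer))
    return assignment


def loads(assignment: List[List[int]], bucket_sizes: Dict[int, int]) -> List[int]:
    return [sum(bucket_sizes[b] for b in buckets) for buckets in assignment]


# Made-up data size metric: hash bucket -> size of map output, with two hot buckets.
metric = {0: 40, 1: 5, 2: 5, 3: 40, 4: 10, 5: 10, 6: 5, 7: 5}
print(loads(static_assignment(metric, 3), metric))  # [85, 20, 15]
print(loads(rebalance(metric, 3), metric))          # [40, 40, 40]
```

With the static assignment one reducer receives most of the data; after rebalancing with the observed sizes, all three reducers receive the same amount.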
45. Harnessing Transient Resources with Onyx
Using the techniques introduced in “Pado: A Data Processing Engine for Harnessing Transient Resources in Datacenters” from EuroSys 2017
71. Operator Placement Example with the Transient Resource Policy
Multinomial Logistic Regression (MLR): a machine learning application for classifying inputs, such as tumors as malignant or benign, and ad clicks as profitable or not.
Gradients are used to update the regression model, which is used for prediction.
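To make the gradient and model update steps concrete, here is a minimal single-example sketch of multinomial logistic regression in plain Python. It uses the standard softmax cross-entropy gradient and a plain gradient-descent update; the learning rate and data are made up, and the slides do not show how the job itself is written.

```python
import math
from typing import List

Matrix = List[List[float]]  # one row of weights per class


def scores(weights: Matrix, x: List[float]) -> List[float]:
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def softmax(values: List[float]) -> List[float]:
    m = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def gradient(weights: Matrix, x: List[float], label: int) -> Matrix:
    # Cross-entropy gradient for one example: (p_k - 1[k == label]) * x.
    probs = softmax(scores(weights, x))
    return [
        [(p - (1.0 if k == label else 0.0)) * xi for xi in x]
        for k, p in enumerate(probs)
    ]


def update(weights: Matrix, grad: Matrix, learning_rate: float) -> Matrix:
    # The gradient is used to update the model.
    return [
        [w - learning_rate * g for w, g in zip(w_row, g_row)]
        for w_row, g_row in zip(weights, grad)
    ]


def predict(weights: Matrix, x: List[float]) -> int:
    # The updated model is used for prediction: pick the highest-scoring class.
    s = scores(weights, x)
    return max(range(len(s)), key=s.__getitem__)


weights = [[0.0, 0.0], [0.0, 0.0]]  # 2 classes, 2 features
x, label = [1.0, 2.0], 1
weights = update(weights, gradient(weights, x, label), learning_rate=0.1)
```

After this single step, `predict(weights, x)` returns class 1, the label the example was trained on.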
89. Containers
● Amazon EC2 instances (with local SSDs) as containers
● 40 Transient Containers, 5 Reserved Containers
● All containers used for computation
90. Workloads
● Alternating Least Squares
○ Yahoo! Music User Ratings of Songs with Artist, Album, and Genre Meta Information, v. 1.0. https://webscope.sandbox.yahoo.com/catalog.php?datatype=r
● Multinomial Logistic Regression
○ Synthetic
● Map-Reduce
○ Page view statistics for Wikimedia projects. https://dumps.wikimedia.org/other/pagecounts-raw
92. Summary
● Introduces a new data processing system that is flexible and extensible
○ Compiler that represents various execution policies
○ Runtime that is modular and reconfigurable
● Adapts data processing seamlessly for new deployment and application requirements
93. We are working on creating an Apache incubator project. We look forward to contributions from many developers!
We are hiring software developers!
Contact: onyx@spl.snu.ac.kr
Software platform lab site: http://spl.snu.ac.kr
94. Onyx: A Flexible and Extensible Data Processing System
전병곤, 김주연, 송원욱
Software Platform Lab
Joint work with 양영석, 이산하, 서장호, 어정윤, 이계원, 엄태건, 이우연, 이윤성, 정주성, 하현민, 정은지, 김수정, 유경인, 신동진