Machine Learning (ML) models are often composed as pipelines of operators, from "classical" ML operators to pre-processing and featurization operators. Current systems deploy pipelines as "black boxes", where the same implementation used for training is run for inference. This approach is convenient but leaves substantial room for improving performance and resource usage. This talk presents Pretzel, a framework for deploying ML pipelines that is inspired by Database Systems: Pretzel inspects and optimizes pipelines end-to-end, much like queries, and manages resources common to multiple pipelines, such as operators' state. Pretzel is joint work with Seoul National University and Microsoft Research and was recently presented at OSDI '18. After the overview, this talk also shows experimental results of Pretzel against state-of-the-art ML solutions and discusses limitations and extensions.
Many Machine Learning inference workloads are increasingly demanding in terms of latency and throughput, and compute predictions from a limited number of models that are deployed together in the system. These models often share common structure and state. This scenario leaves ample room for optimizing runtime and memory, which current systems fail to exploit because they treat ML models and tasks as black boxes, and are thus unaware of optimization and sharing opportunities.
Pretzel, in contrast, adopts a white-box description of ML models, which allows the framework to optimize deployed models and running tasks, saving memory and increasing overall system performance. In particular, Pretzel can properly schedule ML jobs on NUMA machines, whose complexity affects latency and efficiency. In this talk we will show the motivations behind Pretzel, its current design, and possible future developments.
PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems
1. Speaker: Alberto Scolari, PhD student @ Politecnico di Milano, Italy - alberto.scolari@polimi.it
Politecnico di Milano, Milano, 19/10/2018
Yunseong Lee, Alberto Scolari, Byung-Gon Chun, Marco D. Santambrogio, Markus Weimer, Matteo Interlandi
PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems
2. ML-as-a-Service
• ML models are learnt from data during training
• Models are deployed on cloud platforms for Prediction Serving
• Key requirements:
1. Performance: latency/throughput
2. Minimal resource usage: minimal service cost
• State-of-the-art deployment strategy: Black Box
3. Inside the black box: interesting facts
• Applications host multiple models per machine (10-100s)
• Deployed models are often similar in structure and state
- Customer personalization, templates, transfer learning
• Inside, models are DAGs of different operators
• But with black boxes, you can apply only external optimizations: caching, batching, …
4. We need to know structure and state: the PRETZEL white-box model
Breaking the black-box model:
1. To generate an optimised version of a model on deployment: higher performance
2. To allocate shared state only once, and share resources among models: higher density
5. Outline
‣ Limitations of Black Box approaches
‣ PRETZEL, White Box Prediction Serving System
‣ Evaluation
‣ Conclusions and Future Work
7. Case study
• 250 Sentiment Analysis models in ML.Net (C#), each run 100 times
• The first warm-up execution is cold, the 99 following executions are hot
• Long-tail latency, especially when cold: cannot ensure Service Level Objectives (SLOs)
• Overheads: JIT compilation, memory allocation
• Profiling shows no clear bottleneck, with the ML operator LogReg accounting for only 0.3% of runtime in simple models
8. Resource waste
• Black Box models cannot share resources
• Each model has its own container/process/thread: overhead, poor scalability
• Each model has its own state
• But many operators have similar or identical state, and models are deployed together
9. Related work
• Optimisations for single operators, like DNNs [1-2]
• TensorFlow Serving [3] deploys models as Servable Python objects, ML.Net as zip files with state files and DLLs
• Clipper [4] and Rafiki [5] deploy pipelines as Docker containers
– They schedule requests based on a latency target
– They can apply caching and batching
• MauveDB [6] accepts regression and interpolation models and optimises them as DB views
• Tensor Comprehensions [7] optimizes DNN models only, via tensors
[1] https://www.microsoft.com/en-us/research/blog/microsoft-unveils-project-brainwave/
[2] In-Datacenter Performance Analysis of a Tensor Processing Unit, arXiv, Apr. 2017
[3] C. Olston, et al., TensorFlow-Serving: Flexible, high-performance ML serving. In Workshop on ML Systems at NIPS, 2017
[4] D. Crankshaw, et al., Clipper: A low-latency online prediction serving system. In NSDI, 2017
[5] W. Wang, et al., Rafiki: Machine Learning as an Analytics Service System. ArXiv e-prints, Apr. 2018
[6] A. Deshpande and S. Madden, MauveDB: Supporting model-based user views in database systems. In SIGMOD, 2006
[7] N. Vasilache, et al., Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions
11. Design principles
White Box Prediction Serving: make pipelines co-exist better, schedule better
1. End-to-end Optimizations: inspect models and optimise their internal execution
2. Multi-model Optimizations: share data, code and resources
12. Model optimizations
End-to-end:
1. Ahead-of-Time compilation at deployment, to minimise JIT overhead
2. Vector pooling, to pre-allocate data structures
Multi-model:
1. Use the Object Store to share operators' parameters/weights
2. Sub-plan materialisation to re-use intermediate results across models
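As a sketch of the vector-pooling idea — one possible implementation we assume for illustration, not Pretzel's actual code — buffers are pre-allocated once and recycled across predictions, so the hot path performs no per-request allocations.

using System;
using System.Collections.Concurrent;

// Illustrative vector pool: rent a pre-allocated buffer, return it after the prediction.
class VectorPool
{
    private readonly ConcurrentBag<float[]> _pool = new ConcurrentBag<float[]>();
    private readonly int _vectorSize;

    public VectorPool(int vectorSize, int initialCount)
    {
        _vectorSize = vectorSize;
        for (int i = 0; i < initialCount; i++)
            _pool.Add(new float[vectorSize]);                    // pre-allocate up front
    }

    public float[] Rent() =>
        _pool.TryTake(out var v) ? v : new float[_vectorSize];  // fall back to allocating

    public void Return(float[] v)
    {
        Array.Clear(v, 0, v.Length);                             // scrub before reuse
        _pool.Add(v);
    }
}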
13. Off-line phase - Flour + Oven
The slide shows an elided Flour program:
var fContext = ...;
var Tokenizer = ...;
return fPrgm.Plan();
alongside Figure 6 of the paper: "Model optimization and compilation in PRETZEL. In (1), a model is translated into a Flour program. (2) Oven Optimizer generates a DAG of logical stages from the program. Additionally, parameters and statistics are extracted. (3) A DAG of physical stages is generated by the Oven Compiler using logical stages, parameters, and statistics. A model plan is the union of all the elements and is fed to the runtime."
From the surrounding paper text: the optimizer recognizes that the Linear Regression can be pushed into CharNgram and WordNgram, bypassing the execution of Concat; the Tokenizer can be reused, so it is pipelined with CharNgram in one stage and a dependency between CharNgram and WordNgram is created in another, giving a final plan of 2 stages versus the initial 4 operators (and vectors) of ML.Net. Model plans are composed of two DAGs: a DAG of logical stages (abstractions of the Oven Optimizer output, with related parameters) and a DAG of physical stages (the actual code executed by the PRETZEL runtime). There is a 1-to-n mapping between logical and physical stages, so that a logical stage can represent the execution code of different physical implementations; the physical implementation is selected based on the parameters characterizing the logical stage. Record-at-a-time transformations (such as most featurizers) are pipelined together in a single pass over the data, which achieves the best data locality because records are likely to reside in CPU registers [33, 38].
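The Flour program on the slide is elided; the sketch below is a hypothetical reconstruction, loosely following the example in the OSDI paper, of what such a program looks like. All identifiers and parameters here (FlourContext, CSV.FromText, fields, numCNgrams, cParams, ...) are approximations, not the exact Flour API.

// Hypothetical Flour program for the sentiment-analysis pipeline (names approximate):
var fContext   = new FlourContext(objectStore);              // binds shared state
var tTokenizer = fContext.CSV
                         .FromText(fields, fieldTypes, sep)  // declare the input schema
                         .Tokenize();
var tCNgram    = tTokenizer.CharNgram(numCNgrams);           // featurizers over tokens
var tWNgram    = tTokenizer.WordNgram(numWNgrams);
var fPrgrm     = tCNgram.Concat(tWNgram)
                        .ClassifierBinaryLinear(cParams);    // final linear predictor
return fPrgrm.Plan();                                        // hand the DAG to Oven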
14. Oven optimizations
• Oven optimizes models much like DB queries
• It uses a rule-based optimiser that:
– repeatedly looks for patterns of operators within the model DAG
– merges matched operators into stages
Diagram steps: initial model DAG → push the linear predictor back and remove Concat → apply rules and group into stages → add statistics and create the Model Plan.
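As an illustration of how such a rule can work, here is a toy C# sketch — our assumption about the mechanism, not Oven's implementation: consecutive pipelinable (one-to-one) operators in a chain are merged into a single stage, while heavier operators get a stage of their own.

using System.Collections.Generic;

// Toy stage-fusion pass over a linear operator chain.
class Op
{
    public string Name;
    public bool OneToOne;   // true for cheap record-at-a-time operators
}

static class StageFuser
{
    public static List<List<Op>> Fuse(IEnumerable<Op> chain)
    {
        var stages = new List<List<Op>>();
        List<Op> current = null;
        foreach (var op in chain)
        {
            if (current == null || !op.OneToOne)
            {
                current = new List<Op> { op };   // open a new stage
                stages.Add(current);
                if (!op.OneToOne)
                    current = null;              // heavy op: stage of its own
            }
            else
            {
                current.Add(op);                 // pipeline into the current stage
            }
        }
        return stages;
    }
}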
15. On-line phase
• Two main components:
– the Runtime, with an Object Store
– the Scheduler
• The Runtime handles physical resources: threads and buffers
• The Object Store caches objects of all models
– models register and retrieve state objects via a key, such as the file MD5
• The Scheduler is event-based, each stage being an event
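A minimal sketch of the Object Store idea, assuming content-addressed keys such as a file's MD5 (our illustration, not Pretzel's code): the first model to register a key loads the object, and later lookups share the same instance.

using System;
using System.Collections.Concurrent;

// Content-addressed store: one shared instance per key across all models.
class ObjectStore
{
    private readonly ConcurrentDictionary<string, object> _objects =
        new ConcurrentDictionary<string, object>();

    public T GetOrAdd<T>(string key, Func<T> load) where T : class =>
        (T)_objects.GetOrAdd(key, _ => load());
}

// Hypothetical usage: two pipelines end up sharing one weight vector.
// var store = new ObjectStore();
// var w1 = store.GetOrAdd("md5-of-weights-file", () => LoadWeights("modelA"));
// var w2 = store.GetOrAdd("md5-of-weights-file", () => LoadWeights("modelB")); // same instance as w1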
17. Workload and testbed
• Two model classes written in ML.NET, running in both ML.NET and PRETZEL:
– 250 Sentiment Analysis (SA) models
– 250 Attendee Count (AC) models
• Testbed representing a small production server:
– 2x 8-core Xeon E5-2620 v4 at 2.10 GHz, HT disabled
– 32 GB RAM
– Windows 10
– .NET Core 2.0
18. Memory
• Experiments with all 250 AC models, which are smaller than the SA models
• With SA, only PRETZEL can load all the models
Setting                     | Shared Objects | Shared Runtime
ML.Net + Clipper            |                |
ML.Net                      |                | ✓
PRETZEL without ObjectStore |                | ✓
PRETZEL                     | ✓              | ✓
19. Latency
• Micro-benchmark with the stand-alone system, no communication
• All 250 SA models
Latency (ms) | ML.Net | PRETZEL
P99 (hot)    | 0.6    | 0.2
P99 (cold)   | 8.1    | 0.8
Worst (cold) | 280.2  | 6.2
20. Throughput
• 250 AC models, run 1000 times each
• 1000 queries in a batch
• ML.Net vs PRETZEL
22. Conclusions and Limitations
• We addressed performance/density bottlenecks in ML inference for Model-as-a-Service
• We advocate the adoption of a white-box approach
• We apply DB query optimization techniques to ML Prediction Serving
• The work was accepted at OSDI ’18
• Limitations:
- PRETZEL currently supports a subset of ML.Net operators
- No NN operators
- No automated code generation: implementing stages still involves some manual work
23. Future Work 1
• NUMA-aware Scheduler and Runtime
• Fully automated code generation of stages:
- hardware-specific templates [8]
- a Halide-based generator for CPU and GPU: no JIT anymore
• Support for user-coded operators for filtering and pre-processing
[8] K. Krikellas, S. Viglas, et al., Generating code for holistic query evaluation. In ICDE, pages 613-624. IEEE Computer Society, 2010
24. Future Work 2
• Supporting more ML.Net operators, including ONNX [9], is complex
• It is not just a matter of manpower: Oven rules need to scale fairly with the number of operators
- We cannot write rules for all possible (sequences of) operators
• We need a formal framework to describe operators
- Something like Relational Algebra for a query optimiser
- Maybe a Tensor Algebra, like Tensor Comprehensions [10]?
QUESTIONS?
Speaker: Alberto Scolari, PhD student @ Politecnico di Milano, Italy - alberto.scolari@polimi.it
[9] Open Neural Network Exchange (ONNX). https://onnx.ai, 2017
[10] Announcing Tensor Comprehensions. https://research.fb.com/announcing-tensor-comprehensions/
Y. Lee, A. Scolari, B.-G. Chun, M. D. Santambrogio, M. Weimer, M. Interlandi
PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems
https://arxiv.org/abs/1810.06115