Next Generation Workshop Car Diagnostics at BMW Powered by Apache Spark with ... – Databricks
Current workshop diagnostics are based on manually generated decision trees. This approach is increasingly reaching its limits due to growing variant diversity and the increasing complexity of vehicle systems. This session will describe BMW’s new Apache Spark-enabled approach: use the available data from cars and workshops to train models that can predict the right part to replace, or the action to take.
You’ll get an overview of BMW’s complete pipeline, including ETL, model training based on Spark 2.1, serializing results along with metadata, and serving the gained insights as a web app. You’ll also hear how Spark helped BMW leverage the information from millions of observations and thousands of features, and learn what pitfalls they experienced (e.g. setting up a working dev toolchain, working with 50K features, parallelizing well) and how you can avoid them.
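The serialization step in that pipeline (a model artifact plus its metadata) can be sketched roughly as follows; the model object, metadata fields, and file layout are illustrative stand-ins, not BMW’s actual format:

```python
import json
import pickle
import tempfile
from pathlib import Path

# Hypothetical trained artifact: a mapping from observed fault-code patterns to
# the part most likely to need replacement (stands in for a real Spark ML model).
model = {("P0301", "misfire"): "ignition coil",
         ("P0420",): "catalytic converter"}

# Metadata serialized alongside the model, as the talk describes.
metadata = {
    "spark_version": "2.1",       # training environment, per the abstract
    "n_observations": 1_000_000,  # illustrative figure
    "n_features": 50_000,         # order of magnitude mentioned in the talk
}

out_dir = Path(tempfile.mkdtemp())
(out_dir / "model.pkl").write_bytes(pickle.dumps(model))
(out_dir / "metadata.json").write_text(json.dumps(metadata, indent=2))

# The serving web app would load both and route predictions:
restored = pickle.loads((out_dir / "model.pkl").read_bytes())
print(restored[("P0420",)])  # catalytic converter
```

Keeping the metadata in a separate, human-readable file lets the serving layer check which Spark version and feature set produced a given model before loading it.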
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS... – Databricks
GPU acceleration has been at the heart of scientific computing and artificial intelligence for many years now. GPUs provide the computational power needed for the most demanding applications, such as deep neural networks and nuclear or weather simulation. Since the launch of RAPIDS in mid-2018, this vast computational resource has become available for data science workloads too. The RAPIDS toolkit, which is now available on the Databricks Unified Analytics Platform, is a GPU-accelerated drop-in replacement for utilities such as pandas, NumPy, scikit-learn, and XGBoost. Through its use of Dask wrappers, the platform allows for true large-scale computation with minimal, if any, code changes.
This talk discusses RAPIDS: its functionality, its architecture, and the way it integrates with Spark, on many occasions providing several orders of magnitude of acceleration versus CPU-only counterparts.
Analyzing IoT Data in Apache Spark Across Data Centers and Cloud with NetApp ... – Databricks
This session will explain how NetApp simplifies the process of analyzing IoT data with Apache Spark clusters across data centers and the cloud, using NetApp Private Storage (NPS) for AWS/Azure, NetApp Data Fabric, and NetApp Connectors for NFS and S3. IoT data originates at the edge in different geographical locations, and it can arrive at different data centers or the cloud depending on sensor location. The challenge is how to combine these different data streams across different data centers to generate wider insights.
Learn how NetApp Data Fabric helps solve this challenge. In the Data Fabric architecture, the IoT data is ingested via Kafka into an Apache Spark cluster running in AWS/Azure, but the data is stored in an NPS-provisioned NFS share through the NFS Connector. The IoT data in NPS can then be moved to on-prem data centers, or on-prem IoT data can be moved to NPS or ONTAP Cloud for processing in AWS/Azure using NetApp SnapMirror, FlexClone, or the NFS Connector. We’ll also review how NetApp StorageGRID object storage maintains IoT data for archival purposes using an S3 target. The above options allow you to analyze IoT data from AWS, StorageGRID, HDFS, or NFS, providing a feasible solution for deploying Spark clusters across data centers.
Takeaways will include identifying Spark challenges that can be remedied by extending your Spark environment to take advantage of NPS; understanding how NPS and StorageGRID can provide a cost-effective alternative for dev/test and DR for Spark analytics; and understanding Spark architecture and deployment options that utilize data from multiple locations, including on-prem and cloud-based repositories.
SQream DB - Bigger Data On GPUs: Approaches, Challenges, Successes – Arnon Shimoni
This talk will present SQream’s journey to building an analytics data warehouse powered by GPUs. SQream DB is an SQL data warehouse designed for larger-than-main-memory datasets (up to petabytes). It’s an on-disk database that combines novel ideas and algorithms to rapidly analyze trillions of rows with the help of high-throughput GPUs. We will explore some of SQream’s ideas and approaches to developing its analytics database – from simple prototypes and tech demos, to a fully functional data warehouse product containing the most important features for enterprise deployment. We will also describe the challenges of working with exotic hardware like GPUs, and what choices had to be made in order to combine CPU and GPU capabilities to achieve industry-leading performance – complete with real-world use-case comparisons.
As part of this discussion, we will also share some of the real issues that were discovered, and the engineering decisions that led to the creation of SQream DB’s high-speed columnar storage engine, designed specifically to take advantage of streaming architectures like GPUs.
Auto Scaling Systems With Elastic Spark Streaming: Spark Summit East talk by ... – Spark Summit
Come explore a feature we’ve created that is not supported out-of-the-box: the ability to add or remove nodes to always-on, real-time Spark Streaming jobs. Elastic Spark Streaming jobs can automatically adjust to the demands of traffic or volume. Using a set of configurable utility classes, these jobs scale down when lulls are detected and scale up when load is too high. We process multiple TBs per day with billions of events. Our traffic pattern experiences natural peaks and valleys with the occasional sustained, unexpected spike. Elastic jobs have freed us from manual intervention, given back developer time, and made a large financial impact through maximized resource utilization.
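The scale-up/scale-down decision described above can be sketched as follows; the thresholds, function name, and doubling/halving policy are illustrative assumptions, not the speakers’ actual utility classes:

```python
# Sketch of an elastic-scaling decision for a streaming job: compare batch
# processing time against the batch interval and adjust the executor count.

def desired_executors(current, batch_secs, interval_secs,
                      lo=0.5, hi=0.9, min_ex=2, max_ex=64):
    """Scale up when batches nearly overrun the interval, down during lulls."""
    load = batch_secs / interval_secs
    if load > hi:                      # falling behind: add capacity
        return min(max_ex, current * 2)
    if load < lo:                      # lull detected: shed capacity
        return max(min_ex, current // 2)
    return current                     # within the comfort band

print(desired_executors(8, batch_secs=58, interval_secs=60))  # overloaded -> 16
print(desired_executors(8, batch_secs=12, interval_secs=60))  # lull -> 4
```

The hysteresis band between `lo` and `hi` keeps the job from oscillating between sizes on ordinary load fluctuations.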
When OLAP Meets Real-Time, What Happens in eBay? – DataWorks Summit
OLAP cubes are about pre-aggregation: they reduce query latency by spending more time and resources on data preparation. But for real-time analytics, data preparation and visibility latency are critical. What happens when the OLAP cube meets real-time use cases?
Can we pre-build the cubes in real time, in a quicker and more cost-effective way? This is hard, but still doable.
At eBay, we built our own real-time OLAP solution based on Apache Kylin and Apache Kafka. We read unbounded events from a Kafka cluster, then divide the streaming data into three stages: an In-Memory Stage (continuous in-memory aggregation), an On-Disk Stage (flushed to disk, with columnar storage and indexes), and a Full Cubing Stage (built with MapReduce or Spark and saved to HBase). Data is aggregated into different layers at each stage, but remains queryable throughout. Data moves from one stage to the next automatically and transparently to the user.
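The In-Memory Stage’s continuous aggregation can be sketched in a few lines; the event shape and cube dimensions here are illustrative, not eBay’s schema:

```python
from collections import defaultdict

# Sketch of the In-Memory Stage: events streaming in from Kafka are
# continuously aggregated per dimension tuple before any flush to disk.
cube = defaultdict(lambda: {"count": 0, "latency_sum": 0})

def ingest(event):
    key = (event["site"], event["page"])        # cube dimensions
    cell = cube[key]
    cell["count"] += 1                          # measure: event count
    cell["latency_sum"] += event["latency_ms"]  # measure: total latency

for e in [{"site": "US", "page": "deal", "latency_ms": 120},
          {"site": "US", "page": "deal", "latency_ms": 80},
          {"site": "DE", "page": "home", "latency_ms": 200}]:
    ingest(e)

# Queryable at any time, before the on-disk or full-cubing stages run:
print(cube[("US", "deal")])  # {'count': 2, 'latency_sum': 200}
```

The later stages would persist these cells in columnar form and eventually fold them into the full cube, which is why the data stays queryable at every stage.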
This solution was built to support quite a few real-time analytics use cases at eBay; in this session we will also share some of them, such as site-speed monitoring and eBay site deal performance.
Speaker:
Qiaoneng Qian, Senior Product Manager, eBay
Elastify Cloud-Native Spark Application with Persistent Memory – Databricks
Cloud-native deployment has become one of the major trends for large-scale big data analytics. Compared to an on-premises data center, the cloud offers much stronger scalability and higher elasticity to big data applications. However, the cloud is also considered less performant than on-premises alternatives due to virtualization and cluster resource disaggregation. We present a new cloud-native Spark application architecture backed by persistent memory technology. The key ingredient of this architecture is a novel acceleration engine that uses Intel’s 3D XPoint technology as external memory. We discuss how the performance of multiple aspects of data processing can be improved using this new architecture. As a key takeaway, the audience will gain an understanding of the benefits of the latest persistent memory technology, and how such new technology could be leveraged in a cloud data processing architecture.
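The tiering idea behind such an acceleration engine can be sketched as a two-level store; here a temporary directory stands in for the persistent-memory device, and the whole API is hypothetical:

```python
import pickle
import tempfile
from pathlib import Path

# Sketch of memory tiering: hot blocks stay in DRAM, colder blocks spill to a
# persistent-memory-backed store (simulated by a temp directory).
class TieredStore:
    def __init__(self, dram_capacity=2):
        self.dram, self.capacity = {}, dram_capacity
        self.pmem = Path(tempfile.mkdtemp())   # stand-in for the pmem mount

    def put(self, key, value):
        if len(self.dram) >= self.capacity:    # spill oldest block to pmem
            old_key, old_val = next(iter(self.dram.items()))
            (self.pmem / old_key).write_bytes(pickle.dumps(old_val))
            del self.dram[old_key]
        self.dram[key] = value

    def get(self, key):
        if key in self.dram:
            return self.dram[key]              # DRAM hit
        return pickle.loads((self.pmem / key).read_bytes())  # pmem fallback

store = TieredStore()
for i in range(4):
    store.put(f"block{i}", list(range(i, i + 3)))
print(store.get("block0"))  # spilled to pmem, still readable -> [0, 1, 2]
```

The appeal of persistent memory in this role is that the “spilled” tier is far faster than disk, so working sets larger than DRAM degrade gracefully instead of falling off a performance cliff.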
Amazon EC2 provides resizable compute capacity in the cloud and makes web-scale computing easier for customers. It offers a wide variety of compute instances well suited to every imaginable use case, from static websites to on-demand high-performance supercomputing, all available via highly flexible pricing options. This session covers the latest EC2 features and capabilities, including new instance families available in Amazon EC2, the differences among their hardware types and capabilities, and their optimal use cases. We will also cover some best practices on how you can optimize your spend on EC2 to make the most of your EC2 instances, saving time and money.
Koalas: Making an Easy Transition from Pandas to Apache Spark – Databricks
In this talk, we present Koalas, a new open-source project that aims at bridging the gap between big data and small data for data scientists, and at simplifying Apache Spark for people who are already familiar with the pandas library in Python.
Pandas is the standard tool for data science in Python, and it is typically the first tool data scientists reach for to explore and manipulate a data set. The problem is that pandas does not scale well to big data: it was designed for small data sets that a single machine can handle.
When data scientists work today with very large data sets, they either have to migrate to PySpark to leverage Spark or downsample their data so that they can use pandas. This presentation will give a deep dive into the conversion between Spark and pandas dataframes.
Through live demonstrations and code samples, you will understand: – how to effectively leverage both pandas and Spark inside the same code base – how to leverage powerful pandas concepts such as lightweight indexing with Spark – technical considerations for unifying the different behaviors of Spark and pandas
Apache Kylin 2.0: From Classic OLAP to Real-Time Data Warehouse – Yang Li
Apache Kylin, which started as a big data OLAP engine, is reaching its v2.0. Yang Li explains how, armed with snowflake schema support, a full SQL interface, Spark cubing, and the ability to consume real-time streaming data, Apache Kylin is closing the gap to becoming a real-time data warehouse.
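What cubing buys you can be shown with a toy star schema: pre-aggregate the fact table once, then answer group-by queries from the much smaller cube. The schema and data below are illustrative, and SQLite stands in for Kylin’s HBase-backed storage:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE sales(day TEXT, region TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('2017-01-01','EU',10), ('2017-01-01','EU',15),
        ('2017-01-01','US',20), ('2017-01-02','US',5);
    -- The "cube": one row per (day, region) instead of one per sale.
    CREATE TABLE cube AS
        SELECT day, region, SUM(amount) AS total, COUNT(*) AS n
        FROM sales GROUP BY day, region;
""")
row = db.execute(
    "SELECT total FROM cube WHERE day='2017-01-01' AND region='EU'").fetchone()
print(row[0])  # 25.0 -- answered from the pre-built aggregate
```

At scale, the query never touches the raw fact rows; the latency cost has been paid up front during cube construction, which is exactly the preparation-time trade-off the abstract describes.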
Analyzing 2TB of Raw Trace Data from a Manufacturing Process: A First Use Cas... – Databricks
With the development of semiconductor devices, manufacturing systems have improved the productivity and efficiency of wafer fabrication. Owing to such improvements, the number of wafers yielded by the fabrication process has been rapidly increasing. However, current software systems for semiconductor wafers are not designed to process large numbers of wafers. To resolve this issue, BISTel (a world-class provider of manufacturing intelligence solutions and services for manufacturers) has built several big data products, such as Trace Analyzer (TA) and Map Analyzer (MA), using Apache Spark. TA analyzes raw trace data from a manufacturing process: it captures details on all variable changes, big and small, and gives a statistical summary of the traces (min, max, slope, average, etc.). Several of BISTel’s customers, among them top-tier semiconductor companies, use TA to analyze massive raw trace data from their manufacturing processes. In particular, TA is able to manage terabytes of data by applying Apache Spark’s APIs. MA is an advanced pattern recognition tool that sorts wafer yield maps and automatically identifies common yield-loss patterns. Some semiconductor companies also use MA to identify clustering patterns across more than 100,000 wafers, which can be considered big data in the semiconductor field. This talk will introduce these two products, which were developed on Apache Spark, and present how to handle large-scale semiconductor data from a software engineering perspective.
Speakers: Seungchul Lee, Daeyoung Kim
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr... – Databricks
You will learn how CERN has implemented an Apache Spark-based data pipeline to support deep learning research work in High Energy Physics (HEP). HEP is a data-intensive domain. For example, the amount of data flowing through the online systems at LHC experiments is currently of the order of 1 PB/s, with particle collision events happening every 25 ns. Filtering is applied before storing data for later processing.
Improvements in the accuracy of the online event filtering system are key to optimizing the usage and cost of compute and storage resources. A novel prototype of an event filtering system, based on a classifier trained using deep neural networks, has recently been proposed. This presentation covers how we implemented the data pipeline to train the neural network classifier using solutions from the Apache Spark and big data ecosystem, integrated with tools, software, and platforms familiar to scientists and data engineers at CERN. Data preparation and feature engineering make use of PySpark, Spark SQL, and Python code run via Jupyter notebooks.
We will discuss key integrations and libraries that make Apache Spark able to ingest data stored using HEP data format (ROOT) and the integration with CERN storage and compute systems. You will learn about the neural network models used, defined using the Keras API, and how the models have been trained in a distributed fashion on Spark clusters using BigDL and Analytics Zoo. We will discuss the implementation and results of the distributed training, as well as the lessons learned.
GPU databases - How to use them and what the future holds – Arnon Shimoni
GPU databases are the hottest new thing, with about 7 different companies producing their own variant. In this session, we will discuss why they were created, how they are already disrupting the database world, and what the future of computing holds for them.
This presentation demonstrates how the power of NVIDIA GPUs can be leveraged to both accelerate speed to insight and to scale the amount of hot and warm data analyzed to meet the increasing demands of data scientists and business intelligence professionals alike, as well as to find tactical and strategic insights with greater speed on exponentially growing datasets.
Organizations commonly believe that they are advancing in analytical capabilities due to the rise in the data science profession and the myriad of technologies available for analytics, business intelligence, artificial intelligence and machine learning. However, if you do the math, they are actually falling behind as the increases in the rates of data collection volume far outpace the rate of increases in hot and warm data used for analytics. This is causing organizations to rely on an ever-decreasing percentage of their information assets for decision making.
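A worked version of “do the math”, with illustrative growth rates (these figures are assumptions for the example, not numbers from the talk):

```python
# If raw collection grows 40% a year while the hot/warm tier actually analyzed
# grows only 10% a year, the share of data used for decisions shrinks quickly.
collected, analyzed = 1.0, 0.5        # start: half of all data is analyzed
for year in range(5):
    collected *= 1.40
    analyzed *= 1.10
share = analyzed / collected
print(f"{share:.1%}")                 # roughly 15% after five years
```

Any gap between the two growth rates produces the same qualitative result: the analyzed fraction decays geometrically, which is the “falling behind” the paragraph describes.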
We talk about why GPU databases were created and share what sets SQream apart from other GPU databases, MPP solutions, and in-memory and Hadoop-based analytics alternatives.
We will also outline how an organization can use GPU databases to thrive in the information revolution by using a significantly greater percentage of its data for analytical purposes, obtaining insights that are desired today, and will remain cost-effective into the next few years when data lakes are expected to balloon from petabytes to exabytes.
Open Source RAPIDS GPU Platform to Accelerate Predictive Data Analytics – inside-BigData.com
Today NVIDIA announced a GPU-acceleration platform for data science and machine learning, with broad adoption from industry leaders, that enables even the largest companies to analyze massive amounts of data and make accurate business predictions at unprecedented speed.
“Data analytics and machine learning are the largest segments of the high performance computing market that have not been accelerated — until now,” said Jensen Huang, founder and CEO of NVIDIA, who revealed RAPIDS in his keynote address at the GPU Technology Conference. “The world’s largest industries run algorithms written by machine learning on a sea of servers to sense complex patterns in their market and environment, and make fast, accurate predictions that directly impact their bottom line.”
"RAPIDS open-source software gives data scientists a giant performance boost as they address highly complex business challenges, such as predicting credit card fraud, forecasting retail inventory and understanding customer buying behavior. Reflecting the growing consensus about the GPU’s importance in data analytics, an array of companies is supporting RAPIDS — from pioneers in the open-source community, such as Databricks and Anaconda, to tech leaders like Hewlett Packard Enterprise, IBM and Oracle."
Learn more: https://insidehpc.com/2018/10/open-source-rapids-gpu-platform-accelerate-predictive-data-analytics/
TensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters – DataWorks Summit
In recent releases, TensorFlow has been enhanced for distributed learning and HDFS access. Outside of the Google cloud, however, users still needed a dedicated cluster for TensorFlow applications. There are several community projects wiring TensorFlow onto Apache Spark clusters. Unfortunately, they are limited to support synchronous distributed learning only, and don’t allow TensorFlow servers to communicate with each other directly.
In this talk, we will introduce a new framework, TensorFlowOnSpark, for scalable TensorFlow learning, which will be open sourced in Q1 2017. This new framework enables easy experimentation for algorithm designs, and supports scalable training & inferencing on Spark clusters. It supports all TensorFlow functionalities including synchronous & asynchronous learning, model & data parallelism, and TensorBoard. It provides architectural flexibility for data ingestion to TensorFlow and network protocols for server-to-server communication. With a few lines of code changes, an existing TensorFlow algorithm can be transformed into a scalable application.
Hardware Acceleration of Apache Spark on Energy-Efficient FPGAs with Christof... – Spark Summit
In this talk, we will present SPynq: a framework for the efficient mapping and acceleration of Spark applications on heterogeneous all-programmable MPSoC-based platforms, such as Zynq. Spark has been mapped to the Pynq platform, and the proposed framework allows seamless utilization of the programmable logic for hardware acceleration of computationally intensive Spark kernels. We have also developed the required Spark libraries, by extending MLlib, that hide the accelerator’s details to minimize the design effort needed to utilize the accelerators. A cluster of four worker nodes based on all-programmable MPSoCs has been implemented, and the proposed platform is evaluated on a typical machine learning application based on logistic regression. The logistic regression kernel has been developed as an accelerator and incorporated into Spark. The developed system is compared to a high-performance Xeon cluster of the kind typically used in cloud computing. The performance evaluation shows that the heterogeneous accelerator-based MPSoC can achieve up to 2.3x system speedup compared with the Xeon system (at 90% accuracy) and 20x better energy efficiency. For embedded applications, the proposed system can achieve up to 40x speedup compared to a software-only implementation on low-power embedded processors, with 30x lower energy consumption.
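In software terms, the logistic regression kernel that gets offloaded to the FPGA fabric is a dot product followed by a sigmoid; this minimal sketch omits the fixed-point arithmetic and pipelining of the real accelerator, and the weights are made up for illustration:

```python
import math

def predict(weights, bias, features):
    """Per-sample logistic regression inference: dot product plus sigmoid."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))   # sigmoid: probability of class 1

p = predict([0.8, -1.2], bias=0.1, features=[2.0, 0.5])
print(round(p, 3))  # 0.75
```

Because this inner loop is the same for every sample and has no data-dependent control flow, it maps naturally onto programmable logic, which is what makes it a good candidate kernel for acceleration.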
We have been offering many internet services and smartphone applications for over 20 years in Japan, and Cassandra has been used by our services since 2010. In this presentation, I will explain some issues and solutions around Cassandra, and our next-generation infrastructure for Cassandra.
About the Speaker
Satoshi Konno, Technical Manager, Yahoo Japan Corporation
Satoshi Konno is a software engineer with 20 years of experience. He has worked at Yahoo Japan as a programmer for 10 years, including the past 4 years on their NoSQL team, and he is currently in a computer science doctoral course studying distributed computing.
Coral is a framework that allows the distributed acceleration of large data sets across clusters of FPGA resources using simple programming models. It is designed to scale up from single devices to multiple FPGAs, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thus delivering a highly available service on top of a cluster of FPGAs, each of which may be prone to failures.
Coral abstracts FPGA resources (device, memory), enabling fault-tolerant heterogeneous distributed systems to easily be built and run effectively.
It allows:
- instant scalability to multiple FPGAs
- seamless virtualization of the FPGA cluster
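The application-layer failure handling described above can be sketched as a simple failover loop; the devices and failure model here are simulated, not Coral’s actual API:

```python
class DeviceFailure(Exception):
    """Raised when a simulated FPGA device is lost mid-task."""

def run_with_failover(task, devices):
    """Try each device in turn; a failure just moves the task along."""
    for dev in devices:
        try:
            return dev(task)
        except DeviceFailure:
            continue                  # mark device bad, reschedule the task
    raise RuntimeError("all devices failed")

def flaky(task):
    raise DeviceFailure("device lost")

def healthy(task):
    return task * 2                   # stand-in for the offloaded kernel

print(run_with_failover(21, [flaky, healthy]))  # failover succeeds -> 42
```

Handling failure in the scheduling layer like this is what lets the cluster stay available without redundant hardware: a lost device costs a retry, not an outage.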
Scaling Security on 100s of Millions of Mobile Devices Using Apache Kafka® an... – confluent
Watch this talk here: https://www.confluent.io/online-talks/scaling-security-on-100s-of-millions-of-mobile-devices-using-kafka-and-scylla-on-demand
Join mobile cybersecurity leader Lookout as they talk through their data ingestion journey.
Lookout enables enterprises to protect their data by evaluating threats and risks at post-perimeter endpoint devices and providing access to corporate data after conditional security scans. Their continuous assessment of device health creates a massive amount of telemetry data, forcing new approaches to data ingestion. Learn how Lookout changed its approach in order to grow from 1.5 million devices to 100 million devices and beyond, by implementing Confluent Platform and switching to Scylla.
Elastify Cloud-Native Spark Application with Persistent MemoryDatabricks
Cloud native deployment has become one of the major trends for large scale Big Data analytics. Compared to on-premise data center, cloud offers much stronger scalability and higher elasticity to Big Data applications. However, cloud is also considered to be less performance than on-premise alternatives due to virtualization and cluster resource disaggregation. We present a new cloud native Spark application architecture backed by persistent memory technology. The key ingredient of this architecture is a novel acceleration engine that uses Intel's 3DXPoint technology as external memory. We discuss how the performance of multiple aspects of data processing can be improved using this new architecture. As a key takeaway, audience will gain understanding on the benefits of latest persistent memory technology, and how such new technology could be leveraged in cloud data processing architecture.
Amazon EC2 provides resizable compute capacity in the cloud and makes web scale computing easier for customers. It offers a wide variety of compute instances is well suited to every imaginable use case, from static websites to high performance supercomputing on-demand, all available via highly flexible pricing options. This session covers the latest EC2 features and capabilities, including new instance families available in Amazon EC2, the differences among their hardware types and capabilities, and their optimal use cases. We also will cover some best practices on how you can optimize your spend on EC2 to make the most of your EC2 instances, saving time and money.
Koalas: Making an Easy Transition from Pandas to Apache SparkDatabricks
In this talk, we present Koalas, a new open-source project that aims at bridging the gap between the big data and small data for data scientists and at simplifying Apache Spark for people who are already familiar with the pandas library in Python.
Pandas is the standard tool for data science in python, and it is typically the first step to explore and manipulate a data set by data scientists. The problem is that pandas does not scale well to big data. It was designed for small data sets that a single machine could handle.
When data scientists work today with very large data sets, they either have to migrate to PySpark to leverage Spark or downsample their data so that they can use pandas. This presentation will give a deep dive into the conversion between Spark and pandas dataframes.
Through live demonstrations and code samples, you will understand: – how to effectively leverage both pandas and Spark inside the same code base – how to leverage powerful pandas concepts such as lightweight indexing with Spark – technical considerations for unifying the different behaviors of Spark and pandas
Apache kylin 2.0: from classic olap to real-time data warehouseYang Li
Apache Kylin, which started as a big data OLAP engine, is reaching its v2.0. Yang Li explains how, armed with snowflake schema support, a full SQL interface, spark cubing, and the ability to consume real-time streaming data, Apache Kylin is closing the gap to becoming a real-time data warehouse.
Analyzing 2TB of Raw Trace Data from a Manufacturing Process: A First Use Cas...Databricks
As the development of semiconductor devices, manufacturing system leads to improve productivity and efficiency for wafer fabrication. Owing to such improvement, the number of wafers yielded from the fabrication process has been rapidly increasing. However, current software systems for semiconductor wafers are not aimed at processing large number of wafers. To resolve this issue, the BISTel (a world-class provider of manufacturing intelligence solutions and services for manufacturers) tries to build several products for big data such as Trace Analyzer (TA) and Map Analyzer (MA) using Apache Spark. TA is to analyze raw trace data from a manufacturing process. It captures details on all variable changes, big and small and give the traces' statistical summary (i.e.: min, max, slope, average, etc.). Several BISTel's customers, which are the top-tier semiconductor company in the world use the TA to analyze the massive raw trace data from their manufacturing process. Especially, TA is able to manage terabytes of data by applying Apache Spark's APIs. MA is an advanced pattern recognition tool that sorts wafer yield maps and automatically identify common yield loss patterns. Also, some semiconductor companies use MA to identify clustering patterns for more than 100,000 wafers, which can be considered as big data in the semiconductor area. This talk will introduce these two products which are developed based on the Apache Spark and present how to handle the large-scale semiconductor data in the aspects of software techniques.
Speakers: Seungchul Lee, Daeyoung Kim
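The per-trace statistical summary the abstract lists (min, max, slope, average) can be sketched in plain Python. This is an illustrative stand-in, not BISTel's implementation; in TA the equivalent aggregation would run per trace across a Spark cluster, and the slope here is assumed to be a least-squares fit of value against sample index.

```python
from statistics import mean

def trace_summary(values):
    """Summarize one sensor trace: min, max, average, and slope.

    Slope is the least-squares fit of value against sample index.
    """
    n = len(values)
    xs = range(n)
    x_bar, y_bar = mean(xs), mean(values)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values)) \
        / sum((x - x_bar) ** 2 for x in xs)
    return {"min": min(values), "max": max(values),
            "avg": y_bar, "slope": slope}

# A perfectly linear trace rising by 2 per sample has slope exactly 2.0.
print(trace_summary([1.0, 3.0, 5.0, 7.0]))
```

In Spark, a function like this could be applied per sensor trace (for example inside a grouped aggregation), so terabytes of traces are summarized in parallel.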
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...Databricks
You will learn how CERN has implemented an Apache Spark-based data pipeline to support deep learning research work in High Energy Physics (HEP). HEP is a data-intensive domain. For example, the amount of data flowing through the online systems at LHC experiments is currently of the order of 1 PB/s, with particle collision events happening every 25 ns. Filtering is applied before storing data for later processing.
Improvements in the accuracy of the online event filtering system are key to optimize usage and cost of compute and storage resources. A novel prototype of event filtering system based on a classifier trained using deep neural networks has recently been proposed. This presentation covers how we implemented the data pipeline to train the neural network classifier using solutions from the Apache Spark and Big Data ecosystem, integrated with tools, software, and platforms familiar to scientists and data engineers at CERN. Data preparation and feature engineering make use of PySpark, Spark SQL and Python code run via Jupyter notebooks.
We will discuss key integrations and libraries that make Apache Spark able to ingest data stored using HEP data format (ROOT) and the integration with CERN storage and compute systems. You will learn about the neural network models used, defined using the Keras API, and how the models have been trained in a distributed fashion on Spark clusters using BigDL and Analytics Zoo. We will discuss the implementation and results of the distributed training, as well as the lessons learned.
GPU databases - How to use them and what the future holds - Arnon Shimoni
GPU databases are the hottest new thing, with about 7 different companies producing their own variant. In this session, we will discuss why they were created, how they are already disrupting the database world, and what the future of computing holds for them.
This presentation demonstrates how the power of NVIDIA GPUs can be leveraged to both accelerate speed to insight and to scale the amount of hot and warm data analyzed to meet the increasing demands of data scientists and business intelligence professionals alike, as well as to find tactical and strategic insights with greater speed on exponentially growing datasets.
Organizations commonly believe that they are advancing in analytical capabilities due to the rise in the data science profession and the myriad of technologies available for analytics, business intelligence, artificial intelligence and machine learning. However, if you do the math, they are actually falling behind as the increases in the rates of data collection volume far outpace the rate of increases in hot and warm data used for analytics. This is causing organizations to rely on an ever-decreasing percentage of their information assets for decision making.
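To make that claim concrete, here is a hypothetical back-of-the-envelope calculation; the 40% and 15% growth rates are illustrative assumptions, not figures from the talk.

```python
# Hypothetical growth rates, for illustration only: data collection
# grows 40% per year, while analyzed (hot/warm) data grows 15% per year.
collected, analyzed = 100.0, 50.0  # petabytes in year 0
shares = []
for year in range(6):
    share = 100 * analyzed / collected
    shares.append(share)
    print(f"year {year}: {share:.0f}% of collected data is analyzed")
    collected *= 1.40
    analyzed *= 1.15

# The analyzed share falls every year even though absolute
# analytics capacity keeps growing.
```

Under these assumed rates, the organization starts by analyzing half of its data and covers a steadily shrinking fraction each subsequent year.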
We talk about why GPU databases were created and share what sets SQream apart from other GPU databases, MPP solutions, and in-memory and Hadoop-based analytic alternatives.
We will also outline how an organization can use GPU databases to thrive in the information revolution by using a significantly greater percentage of its data for analytical purposes, obtaining insights that are desired today, and will remain cost-effective into the next few years when data lakes are expected to balloon from petabytes to exabytes.
Open Source RAPIDS GPU Platform to Accelerate Predictive Data Analytics - inside-BigData.com
Today NVIDIA announced a GPU-acceleration platform for data science and machine learning, with broad adoption from industry leaders, that enables even the largest companies to analyze massive amounts of data and make accurate business predictions at unprecedented speed.
“Data analytics and machine learning are the largest segments of the high performance computing market that have not been accelerated — until now,” said Jensen Huang, founder and CEO of NVIDIA, who revealed RAPIDS in his keynote address at the GPU Technology Conference. “The world’s largest industries run algorithms written by machine learning on a sea of servers to sense complex patterns in their market and environment, and make fast, accurate predictions that directly impact their bottom line.”
"RAPIDS open-source software gives data scientists a giant performance boost as they address highly complex business challenges, such as predicting credit card fraud, forecasting retail inventory and understanding customer buying behavior. Reflecting the growing consensus about the GPU’s importance in data analytics, an array of companies is supporting RAPIDS — from pioneers in the open-source community, such as Databricks and Anaconda, to tech leaders like Hewlett Packard Enterprise, IBM and Oracle."
Learn more: https://insidehpc.com/2018/10/open-source-rapids-gpu-platform-accelerate-predictive-data-analytics/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
TensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters - DataWorks Summit
In recent releases, TensorFlow has been enhanced for distributed learning and HDFS access. Outside of the Google cloud, however, users still needed a dedicated cluster for TensorFlow applications. There are several community projects wiring TensorFlow onto Apache Spark clusters. Unfortunately, they support only synchronous distributed learning, and don’t allow TensorFlow servers to communicate with each other directly.
In this talk, we will introduce a new framework, TensorFlowOnSpark, for scalable TensorFlow learning, which will be open sourced in Q1 2017. This new framework enables easy experimentation for algorithm designs, and supports scalable training & inferencing on Spark clusters. It supports all TensorFlow functionalities including synchronous & asynchronous learning, model & data parallelism, and TensorBoard. It provides architectural flexibility for data ingestion to TensorFlow and network protocols for server-to-server communication. With a few lines of code changes, an existing TensorFlow algorithm can be transformed into a scalable application.
Hardware Acceleration of Apache Spark on Energy-Efficient FPGAs with Christof...Spark Summit
In this talk, we will present SPynq: a framework for the efficient mapping and acceleration of Spark applications on heterogeneous all-programmable MPSoC-based platforms, such as Zynq. Spark has been mapped to the Pynq platform, and the proposed framework allows seamless utilization of the programmable logic for hardware acceleration of computationally intensive Spark kernels. We have also developed the required Spark libraries, extending MLlib, that hide the accelerator's details to minimize the design effort needed to utilize the accelerators. A cluster of 4 worker nodes based on all-programmable MPSoCs has been implemented, and the proposed platform is evaluated on a typical machine learning application based on logistic regression. The logistic regression kernel has been developed as an accelerator and incorporated into Spark. The developed system is compared to a high-performance Xeon cluster of the kind typically used in cloud computing. The performance evaluation shows that the heterogeneous accelerator-based MPSoC can achieve up to 2.3x system speedup compared with the Xeon system (at 90% accuracy) and 20x better energy efficiency. For embedded applications, the proposed system can achieve up to 40x speedup compared to a software-only implementation on low-power embedded processors, with 30x lower energy consumption.
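For reference, here is what a logistic regression kernel of the kind mentioned above computes, as a minimal software-only sketch in plain Python. This is illustrative only, not the SPynq accelerator code; the toy data, learning rate, and epoch count are invented.

```python
import math

def train_logreg(points, labels, lr=0.1, epochs=1000):
    """Stochastic gradient descent for 2-feature logistic regression."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            z = w[0] * x[0] + w[1] * x[1] + b
            p = 1.0 / (1.0 + math.exp(-z))   # sigmoid
            err = p - y
            w[0] -= lr * err * x[0]
            w[1] -= lr * err * x[1]
            b -= lr * err
    return w, b

def predict(w, b, x):
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0

# Linearly separable toy data: label 1 when the feature values are large.
pts = [(0.0, 0.0), (1.0, 0.0), (3.0, 3.0), (2.5, 3.0)]
lbl = [0, 0, 1, 1]
w, b = train_logreg(pts, lbl)
print([predict(w, b, p) for p in pts])  # → [0, 0, 1, 1]
```

The inner loop (dot product, sigmoid, weight update) is exactly the kind of regular, compute-heavy kernel that benefits from being moved into programmable logic.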
We have been offering many internet services and smart phone applications for over 20 years in Japan, and Cassandra has been used by our services since 2010. In this presentation, I will explain some issues and solutions about Cassandra, and our next generation infrastructure for Cassandra.
About the Speaker
Satoshi Konno Technical Manager, Yahoo Japan Corporation
Satoshi Konno is a software engineer with 20 years of experience. He has worked at Yahoo Japan as a programmer for 10 years, including the past 4 years on their NoSQL team, and he is currently enrolled in a computer science doctoral course, studying distributed computing.
Coral is a framework that allows the distributed acceleration of large data sets across clusters of FPGA resources using simple programming models. It is designed to scale up from single devices to multiple FPGAs, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thus delivering a highly available service on top of a cluster of FPGAs, each of which may be prone to failures.
Coral abstracts FPGA resources (device, memory), enabling fault-tolerant heterogeneous distributed systems to easily be built and run effectively.
It allows:
- instant scalability to multiple FPGAs
- seamless virtualization of the FPGA cluster
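The application-layer failure handling described above might look, in spirit, like the following Python sketch. This is hypothetical; Coral's actual API is not shown here, and the device handles and kernel function are invented.

```python
# Application-layer fault tolerance: the library, not the hardware,
# detects a failed device and reruns the task on another FPGA.
class DeviceFailure(Exception):
    pass

def run_with_failover(task, devices):
    """Try the task on each abstracted FPGA handle until one succeeds."""
    for dev in devices:
        try:
            return task(dev)
        except DeviceFailure:
            continue  # treat this device as unhealthy, try the next one
    raise RuntimeError("all devices failed")

# Toy cluster: device 0 is broken, device 1 works.
def kernel(dev):
    if dev == 0:
        raise DeviceFailure
    return f"result from device {dev}"

print(run_with_failover(kernel, [0, 1]))  # → result from device 1
```

Keeping the retry logic in the library layer is what lets the cluster stay available even when individual FPGAs fail.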
Scaling Security on 100s of Millions of Mobile Devices Using Apache Kafka® an...confluent
Watch this talk here: https://www.confluent.io/online-talks/scaling-security-on-100s-of-millions-of-mobile-devices-using-kafka-and-scylla-on-demand
Join mobile cybersecurity leader Lookout as they talk through their data ingestion journey.
Lookout enables enterprises to protect their data by evaluating threats and risks at post-perimeter endpoint devices and providing access to corporate data after conditional security scans. Their continuous assessment of device health creates a massive amount of telemetry data, forcing new approaches to data ingestion. Learn how Lookout changed its approach in order to grow from 1.5 million devices to 100 million devices and beyond, by implementing Confluent Platform and switching to Scylla.
Instant scaling and deployment of Vitis Libraries on Alveo clusters using InA...Christoforos Kachris
In this webinar we will show how to address the 3 main challenges in the domain of FPGA deployment.
First, we will show you how you can deploy your applications from high-level frameworks like Python, Java, scikit-learn and Keras more easily than ever.
Then, we will show how you can scale out your applications from frameworks like Keras and scikit-learn across multiple FPGA Alveo cards automatically.
Finally, we will show you how multiple users can share the available FPGA resources seamlessly. We will show how InAccel allows dynamic resource management of the FPGA resources in order to allow multi-application and multi-tenant deployment.
Backend.AI Technical Introduction (19.09 / 2019 Autumn) - Lablup Inc.
This slide introduces technical specs and details about Backend.AI 19.09.
* On-premise clustering / container orchestration / scaling on cloud
* Container-level fractional GPU technology that lets one GPU be used as many GPUs by many containers at the same time.
* NVIDIA GPU Cloud integrations
* Enterprise features
Harnessing the virtual realm for successful real world artificial intelligence - Alison B. Lowndes
Artificial Intelligence is impacting all areas of society, from healthcare and transportation to smart cities and energy. This talk describes how NVIDIA invests in both internal pure research and accelerated computation to enable its diverse customer base across gaming & extended reality, graphics, AI, robotics, simulation, high performance scientific computing, healthcare & more. You will be introduced to the GPU computing platform and shown real-world, successfully deployed applications, as well as given a glimpse of the current state of the art across academia, enterprise and startups.
Get Your Head in the Cloud - Lessons in GPU Computing with Schlumberger - inside-BigData.com
In this presentation from the GPU Technology Conference, Wyatt Gorman from Google and Abhishek Gupta from Schlumberger present: Get Your Head in the Cloud - Lessons in GPU Computing with Schlumberger.
"Demand for GPUs in High Performance Computing is only growing, and it is costly and difficult to keep pace in an entirely on-premise environment. We will hear from Schlumberger on why and how they are utilizing cloud-based GPU-enabled computing resources from Google Cloud to supply their users with the computing power they need, from exploration and modeling to visualization."
Watch the video: https://wp.me/p3RLHQ-kcl
Learn more: https://www.blog.google/products/google-cloud/schlumberger-chooses-gcp-to-deliver-new-oil-and-gas-technology-platform/
and
https://www.nvidia.com/en-us/gtc/
Give Your Confluent Platform Superpowers! (Sandeep Togrika, Intel and Bert Ha...HostedbyConfluent
Whether you are a die-hard DC comic enthusiast, mad for Marvel, or completely clueless when it comes to comic books, at the end of the day each of us would love to possess the superpower to transform data in seconds versus minutes or days. But architects and developers are challenged with designing and managing platforms that scale elastically and combine event streams with stored data, to enable more contextually rich data analytics. This is made even more complex by data coming from hundreds of sources, at hundreds of terabytes, or even petabytes, per day.
Now, with Apache Kafka and Intel hardware technology advances, organizations can turn massive volumes of disparate data into actionable insights with the ability to filter, enrich, join and process data instream. Let's consider Information Security. IT leaders need to ensure all company data and IP is secured against threats and vulnerabilities. A combination of real-time event streaming with Confluent Platform and Intel Architecture has enabled threat detection efforts that once took hours to be completed in seconds, while simultaneously reducing technical debt and data processing and storage costs.
In this session, Confluent and Intel architects will share detailed performance benchmarking results and new joint reference architecture. We’ll detail ways to remove Kafka performance bottlenecks, and improve platform resiliency and ensure high availability using Confluent Control Center and Multi-Region Clusters. And we’ll offer up tips for addressing challenges that you may be facing in your own super heroic efforts to design, deploy, and manage your organization’s data platforms.
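The filter/enrich/flag pattern described above can be illustrated with a minimal Python generator over an in-memory event stream. This is not Confluent's or Lookout's code; the event fields, the threat list, and the detection rule are all invented for illustration.

```python
# Hypothetical security-event stream processed record by record:
# filter, enrich with a lookup table, and flag threats in-stream,
# rather than in a later batch job.
THREAT_IPS = {"203.0.113.9"}  # enrichment "table" (think: a joined table)
events = [
    {"user": "alice", "src_ip": "198.51.100.4", "action": "login"},
    {"user": "bob",   "src_ip": "203.0.113.9",  "action": "login"},
    {"user": "carol", "src_ip": "198.51.100.7", "action": "logout"},
]

def detect_threats(stream):
    for event in stream:
        if event["action"] != "login":      # filter: only logins matter here
            continue
        enriched = {**event, "known_threat": event["src_ip"] in THREAT_IPS}
        if enriched["known_threat"]:        # flag as soon as the event arrives
            yield enriched

alerts = list(detect_threats(events))
print(alerts)  # bob's login from the known-threat IP is the only alert
```

In a real deployment the stream would be a Kafka topic and the enrichment a join against reference data, but the per-event shape of the computation is the same.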
This presentation gives an overview of InAccel's Accelerated Machine Learning Suite, which lets you speed up ML applications on the cloud without changing your code. The Accelerated ML Suite is fully compatible with Apache Spark and allows the seamless utilization of FPGAs on AWS to speed up your applications. More at https://www.inaccel.com/
This lecture aims to give some food for thought on how current High Performance Computing systems (hardware and software) tend to merge with Big Data ones (Machine Learning, Analytics and Enterprise workloads) in order to meet both workloads' demands while sharing the same clusters.
AWS re:Invent 2016: Large-Scale, Cloud-Based Analysis of Cancer Genomes: Less...Amazon Web Services
The PanCancer Analysis of Whole Genomes (PCAWG) project is a large-scale, highly distributed research collaboration designed to identify common patterns of mutations across 2,800 cancer genomes. The use of public and private clouds were instrumental in analyzing this dataset using current best practice containerized pipelines. This session describes the technical infrastructure built for the project, how we leveraged cloud environments to perform the “core” analysis, and the lessons learned along the way.
Disaster Recovery Experience at CACIB: Hardening Hadoop for Critical Financia...DataWorks Summit
Hadoop is becoming a standard platform for building critical financial applications such as risk reporting, trading and fraud detection. These applications require demanding SLAs (service-level agreements) in terms of RPO (Recovery Point Objective) and RTO (Recovery Time Objective). To achieve these SLAs, organizations need to build a disaster recovery plan that covers several layers, ranging from the infrastructure through the platform to the clients and the applications. In this talk, we will present the different architecture blueprints for disaster recovery, as well as their corresponding SLA objectives. Then, we will focus on the stretch-cluster solution that Crédit Agricole CIB is using in production. We will discuss the solution's advantages and drawbacks and the impact of this approach on the global architecture. Finally, we will explain in detail how to configure and deploy this solution and how to integrate each layer (storage layer, processing layer...) into the architecture.
Fast data in times of crisis with GPU accelerated database QikkDB | Business ...Matej Misik
Graphics cards (GPUs) open up new ways of processing and analytics over big data, showing millisecond selections over billions of rows, as well as telling stories about data. #QikkDB
How to present data to be understood by everyone? Data analysis is for scientists, but data storytelling is for everyone. For managers, product owners, sales teams, the general public. #TellStory
Learn about high performance computing with GPUs and how to present data, with a rich Covid-19 data story example, in the upcoming webinar.
Computação de Alto Desempenho - Fator chave para a competitividade do País, d...Igor José F. Freitas
Video: https://www.youtube.com/watch?v=8cFqNwhQ7uE
A key factor for the competitiveness of the country, of science, and of industry.
Talk delivered during Intel Innovation Week 2015.
Application Report: Big Data - Big Cluster Interconnects - IT Brand Pulse
As a leading analytics platform that runs on industry-standard hardware and integrates industry-standard database tools and applications, one of ParAccel's biggest challenges is to architect and test hardware (servers, storage, interconnects) that makes their software perform at its peak. In this case, they achieved their mission of eliminating a cluster bottleneck by implementing 10GbE NICs to provide the bandwidth needed today, and well into the future.
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio - Alluxio, Inc.
Alluxio Global Online Meetup
Apr 23, 2020
For more Alluxio events: https://www.alluxio.io/events/
Speakers:
Jiao (Jennie) Wang, Intel
Tsai Louie, Intel
Bin Fan, Alluxio
Today, many people run deep learning applications with training data from separate storage such as object storage or remote data centers. This presentation will demo the Intel Analytics Zoo + Alluxio stack, an architecture that enables high performance while keeping cost and resource efficiency balanced, without the network becoming an I/O bottleneck.
Intel Analytics Zoo is a unified data analytics and AI platform open-sourced by Intel. It seamlessly unites TensorFlow, Keras, PyTorch, Spark, Flink, and Ray programs into an integrated pipeline, which can transparently scale from a laptop to large clusters to process production big data. Alluxio, as an open-source data orchestration layer, accelerates data loading and processing in Analytics Zoo deep learning applications.
In this talk, we will go over:
- What is Analytics Zoo and how it works
- How to run Analytics Zoo with Alluxio in deep learning applications
- Initial performance benchmark results using the Analytics Zoo + Alluxio stack
Similar to 17th Athens Big Data Meetup - 1st Talk - Speedup Machine Learning Applications on the Cloud Using Hardware Accelerators
22nd Athens Big Data Meetup - 1st Talk - MLOps Workshop: The Full ML Lifecycl...Athens Big Data
Title: MLOps Workshop: The Full ML Lifecycle - How to Use ML in Production
Speakers: Spyros Cavadias (https://www.linkedin.com/in/spyros-cavadias/), Konstantinos Pittas (https://www.linkedin.com/in/konstantinos-pittas-83310270/), Thanos Gkinakos (https://www.linkedin.com/in/thanos-gkinakos-03582a128/)
Date: Saturday, December 17, 2022
Event: https://www.meetup.com/athens-big-data/events/289927468/
21st Athens Big Data Meetup - 2nd Talk - Dive into ClickHouse storage system - Athens Big Data
Title: Dive into ClickHouse storage system
Speaker: Alexander Sapin (https://github.com/alesapin/)
Date: Thursday, March 5, 2020
Event: https://meetup.com/Athens-Big-Data/events/268379195/
19th Athens Big Data Meetup - 2nd Talk - NLP: From news recommendation to wor...Athens Big Data
Title: NLP: From news recommendation to word prediction
Speaker: Mihalis Papakonstantinou (https://linkedin.com/in/mihalispapak/)
Date: Thursday, December 12, 2019
Event: https://meetup.com/Athens-Big-Data/events/266762191/
21st Athens Big Data Meetup - 1st Talk - Fast and simple data exploration wit...Athens Big Data
Title: Fast and simple data exploration with ClickHouse
Speaker: Alexander Kuzmenkov (https://github.com/akuzm/)
Date: Thursday, March 5, 2020
Event: https://meetup.com/Athens-Big-Data/events/268379195/
20th Athens Big Data Meetup - 2nd Talk - Druid: under the covers - Athens Big Data
Title: Druid: under the covers
Speaker: Peter Marshall (https://linkedin.com/in/amillionbytes/)
Date: Tuesday, January 28, 2020
Event: https://meetup.com/Athens-Big-Data/events/266900242/
20th Athens Big Data Meetup - 1st Talk - Druid: the open source, performant, ...Athens Big Data
Title: Druid: the open source, performant, real-time, analytical datastore
Speaker: Peter Marshall (https://linkedin.com/in/amillionbytes/)
Date: Tuesday, January 28, 2020
Event: https://meetup.com/Athens-Big-Data/events/266900242/
18th Athens Big Data Meetup - 2nd Talk - Run Spark and Flink Jobs on Kubernetes - Athens Big Data
Title: Run Spark and Flink Jobs on Kubernetes
Speaker: Chaoran Yu (https://linkedin.com/in/chaoran-yu-97b1144a/)
Date: Thursday, November 14, 2019
Event: https://meetup.com/Athens-Big-Data/events/265957761/
18th Athens Big Data Meetup - 1st Talk - Timeseries Forecasting as a Service - Athens Big Data
Title: Timeseries Forecasting as a Service
Speaker: Thanassis Spyrou (https://linkedin.com/in/thanassis-spyrou-92911959/)
Date: Thursday, November 14, 2019
Event: https://meetup.com/Athens-Big-Data/events/265957761/
17th Athens Big Data Meetup - 2nd Talk - Data Flow Building and Calculation P...Athens Big Data
Title: Data Flow Building and Calculation Pipelines via PySpark and ML Modeling via Python
Speaker: Theodoros Michalareas (https://linkedin.com/in/theodorosmichalareas/)
Date: Tuesday, September 24, 2019
Event: https://meetup.com/Athens-Big-Data/events/264702584/
16th Athens Big Data Meetup - 2nd Talk - A Focus on Building and Optimizing M...Athens Big Data
Title: A Focus on Building and Optimizing ML Models with Amazon SageMaker
Speaker: Pavlos Mitsoulis-Ntompos (https://linkedin.com/in/pavlosmitsoulis/)
Date: Thursday, March 14, 2019
Event: https://www.meetup.com/Athens-Big-Data/events/259091496/
16th Athens Big Data Meetup - 1st Talk - An Introduction to Machine Learning ...Athens Big Data
Title: An Introduction to Machine Learning with Python and Scikit-Learn
Speaker: Julien Simon (https://linkedin.com/in/juliensimon/)
Date: Thursday, March 14, 2019
Event: https://www.meetup.com/Athens-Big-Data/events/259091496/
15th Athens Big Data Meetup - 1st Talk - Running Spark On Mesos - Athens Big Data
Title: Running Spark On Mesos
Speaker: Mr. Chris Sidiropoulos (https://linkedin.com/in/chris-sidiropoulos-2a6b156a//)
Date: Tuesday, November 27, 2018
Event: https://meetup.com/Athens-Big-Data/events/256098657/
5th Athens Big Data Meetup - PipelineIO Workshop - Real-Time Training and Dep...Athens Big Data
Title: Real-Time Training and Deploying Spark ML Recommendations With Kafka and NetflixOSS
Speaker: Chris Fregly (https://linkedin.com/in/cfregly/)
Date: Monday, October 17, 2016
Event: https://meetup.com/Athens-Big-Data/events/234546355/
13th Athens Big Data Meetup - 2nd Talk - Training Neural Networks With Enterp...Athens Big Data
Title: Training Neural Networks With Enterprise Relational Data
Speaker: Dr. Christos Malliopoulos (https://linkedin.com/in/cmalliopoulos//)
Date: Thursday, June 21, 2018
Event: https://meetup.com/Athens-Big-Data/events/251411957/
11th Athens Big Data Meetup - 2nd Talk - Beyond Bitcoin; Blockchain Technolog...Athens Big Data
Title: Beyond Bitcoin; Blockchain Technologies And How They Disrupt Every Industry
Speaker: Mr. Dimitrios Kouzis-Loukas (https://linkedin.com/in/lookfwd//)
Date: Wednesday, December 27, 2017
Event: https://meetup.com/Athens-Big-Data/events/246033156/
9th Athens Big Data Meetup - 2nd Talk - Lead Scoring And Grading - Athens Big Data
Title: Lead Scoring And Lead Grading
Speaker: Dr. Lefteris Mantelas (https://linkedin.com/in/lefterismantelas//)
Date: Tuesday, October 24, 2017
Event: https://meetup.com/Athens-Big-Data/events/243829781/
Essentials of Automations: The Art of Triggers and Actions in FME - Safe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
The Art of the Pitch: WordPress Relationships and Sales - Laura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers, without pulling teeth or pulling your hair out. Practical tips and strategies for successful relationship building that leads to closing the deal.
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs - Alex Pruden
This paper presents Reef, a system for generating publicly verifiable succinct non-interactive zero-knowledge proofs that a committed document matches or does not match a regular expression. We describe applications such as proving the strength of passwords, the provenance of email despite redactions, the validity of oblivious DNS queries, and the existence of mutations in DNA. Reef supports the Perl Compatible Regular Expression syntax, including wildcards, alternation, ranges, capture groups, Kleene star, negations, and lookarounds. Reef introduces a new type of automata, Skipping Alternating Finite Automata (SAFA), that skips irrelevant parts of a document when producing proofs without undermining soundness, and instantiates SAFA with a lookup argument. Our experimental evaluation confirms that Reef can generate proofs for documents with 32M characters; the proofs are small and cheap to verify (under a second).
Paper: https://eprint.iacr.org/2023/1886
Pushing the limits of ePRTC: 100ns holdover for 100 days - Adtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Enhancing adoption of Open Source Libraries. A case study on Albumentations.AI - Vladimir Iglovikov, Ph.D.
Presented by Vladimir Iglovikov:
- https://www.linkedin.com/in/iglovikov/
- https://x.com/viglovikov
- https://www.instagram.com/ternaus/
This presentation delves into the journey of Albumentations.ai, a highly successful open-source library for data augmentation.
Created out of a necessity for superior performance in Kaggle competitions, Albumentations has grown to become a widely used tool among data scientists and machine learning practitioners.
This case study covers various aspects, including:
People: The contributors and community that have supported Albumentations.
Metrics: The success indicators such as downloads, daily active users, GitHub stars, and financial contributions.
Challenges: The hurdles in monetizing open-source projects and measuring user engagement.
Development Practices: Best practices for creating, maintaining, and scaling open-source libraries, including code hygiene, CI/CD, and fast iteration.
Community Building: Strategies for making adoption easy, iterating quickly, and fostering a vibrant, engaged community.
Marketing: Both online and offline marketing tactics, focusing on real, impactful interactions and collaborations.
Mental Health: Maintaining balance and not feeling pressured by user demands.
Key insights include the importance of automation, making the adoption process seamless, and leveraging offline interactions for marketing. The presentation also emphasizes the need for continuous small improvements and building a friendly, inclusive community that contributes to the project's growth.
Vladimir Iglovikov brings his extensive experience as a Kaggle Grandmaster and ex-Staff ML Engineer at Lyft, sharing valuable lessons and practical advice for anyone looking to enhance the adoption of their open-source projects.
Explore more about Albumentations and join the community at:
GitHub: https://github.com/albumentations-team/albumentations
Website: https://albumentations.ai/
LinkedIn: https://www.linkedin.com/company/100504475
Twitter: https://x.com/albumentations
Removing Uninteresting Bytes in Software Fuzzing - Aftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speed up fuzzing campaigns by pinpointing and eliminating uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries: Libxml's xmllint, a tool for parsing XML documents, and Binutils' readelf, an essential debugging and security-analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format) files. Our preliminary results show that AFL+DIAR not only discovers new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns, and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024 - Neo4j
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
DevOps and Testing slides at DASA Connect - Kari Kakkonen
Slides by me and Rik Marselis from the DASA Connect conference on 30.5.2024. We discuss what testing is, what agile testing is, and finally what testing in DevOps looks like. We also held a lovely workshop in which participants explored different ways to think about quality and testing in different parts of the DevOps infinity loop.
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor... - SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
7. www.inaccel.com™
Hyperscale Data Centers

Data Center    Site              Sq ft
Facebook       Santa Clara       86,000
Google         South Carolina    200,000
HP             Atlanta           200,000
IBM            Colorado          300,000
Microsoft      Chicago           700,000

[Source: "How Clean is Your Cloud?", Greenpeace 2011]
For comparison, Wembley Stadium: 172,000 square ft.
8. www.inaccel.com™
Data Center Power Consumption

˃ Data centers consumed 330 billion kWh in 2007, and consumption is expected to reach 1,012 billion kWh in 2020.

               2007 (billion kWh)    2020 (billion kWh)
Data Centers   330                   1,012
Telecoms       293                   951
Total Cloud    623                   1,963

[Source: "How Clean is Your Data Center", Greenpeace, 2012]
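The projected growth in the table above implies a steady compound annual growth rate. As a quick sanity check on the Greenpeace figures (the numbers come from the table; the calculation itself is plain arithmetic):

```python
# Implied compound annual growth rate (CAGR) of data center consumption:
# from 330 billion kWh in 2007 to a projected 1,012 billion kWh in 2020,
# i.e. over 13 years.
start, end, years = 330.0, 1012.0, 2020 - 2007

cagr = (end / start) ** (1 / years) - 1
print(f"Implied annual growth: {cagr:.1%}")  # roughly 9% per year
```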
10. www.inaccel.com™
Specialization

CPU: high speed, lower efficiency. GPU/FPGA: high throughput, higher efficiency.
GPUs and FPGAs can provide massive parallelism and higher efficiency than CPUs for certain categories of applications.
16. www.inaccel.com™
Market Size

˃ The data center accelerator market is expected to reach USD 21.19 billion by 2023 from USD 2.84 billion in 2018, at a CAGR of 49.47% from 2018 to 2023.
˃ The FPGA market is expected to grow at the highest CAGR during the forecast period, owing to the increasing adoption of FPGAs for the acceleration of enterprise workloads.
˃ The global MLaaS market is expected to register a CAGR of about 43.46% during 2018-2023 (the forecast period), reaching USD 8.315 billion by 2023 from USD 0.932 billion as of 2017.
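The quoted accelerator-market CAGR is internally consistent with the two endpoint figures, which is easy to verify:

```python
# Check that USD 2.84 billion (2018) growing at 49.47% per year for
# 5 years reaches roughly USD 21.19 billion (2023), as stated above.
start, cagr, years = 2.84, 0.4947, 5

end = start * (1 + cagr) ** years
print(f"Projected 2023 market: USD {end:.2f} billion")  # ~21.19
```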
17. www.inaccel.com™
Open FPGA Data Science

˃ We help data scientists and engineers run their applications up to 15x faster without changing their code.
18. www.inaccel.com™
Integration Solution for Application Acceleration

InAccel Scalable FPGA Resource Manager: accelerated ML suite, on-premise and cloud.

• Higher performance: up to 16x speedup compared to highly optimized libraries
• Lower cost: up to 4x lower TCO
• Zero code changes: seamless integration with widely used frameworks
• Easy deployment: Docker-based container for seamless integration
• On-prem or on cloud: available on cloud and on-prem

"Automated deployment, scaling and management of FPGA clusters"
20. www.inaccel.com™
Current Framework for FPGAs on the Cloud

Current limitations without the InAccel Coral Manager:
˃ Only one application can talk to each FPGA accelerator through OpenCL.
˃ Every application can talk to only a single FPGA.
˃ Complex device sharing
• from multiple threads/processes
• even from the same thread
˃ Explicit allocation of the resources (memory/compute units)

[Diagram: App1 talks through the vendor drivers to a single FPGA.]
21. InAccel FPGA Resource Manager
www.inaccel.com

Seamless integration with C/C++, Python, Java and Scala
Automatic configuration and management of the FPGA bitstreams and memory
Seamless sharing of the FPGA cluster from multiple applications
Automatic virtualization and scheduling of the applications to the FPGA cluster
Fully scalable: scale-up (multiple FPGAs per node) and scale-out (multiple FPGA-based servers over Spark)

[Diagram: applications call into the InAccel Coral Resource Manager, which runs on the InAccel Runtime (resource isolation) above the FPGA drivers on each server, driving the FPGA kernels.]

"Automating deployment, scaling, and management of FPGAs"
22. www.inaccel.com™
InAccel Coral FPGA Resource Manager

˃ Coral abstracts FPGA resources (device, memory), enabling fault-tolerant heterogeneous distributed systems to be built easily and run effectively.

World's first FPGA orchestrator: program against your FPGAs as if they were a single pool of accelerators.

[Diagram: same Coral architecture as on the previous slide.]
23. www.inaccel.com™
InAccel's Coral FPGA Manager

Acceleration abstraction layer to virtualize, manage and monitor the FPGA resources:
˃ Management
• Automatic device (re-)configurations and efficient memory transfers
˃ Fault-tolerance
• Highly available service on top of a cluster of FPGAs, each of which may be prone to failures
˃ Scalability
• Automatic scale-up from single devices (e.g. f1.2x) to multi-FPGA systems (e.g. f1.4x, f1.16x)

[Diagram: App1/App2/App3 connect over a C++/Java socket to the InAccel FPGA Manager, which drives the FPGA cluster through OpenCL/OPAE on top of the vendor RTE/drivers (Intel/Xilinx).]

Documentation: https://docs.inaccel.com/latest/
24. www.inaccel.com™
FPGA Manager Features

Ease of use:
˃ Write applications quickly in C/C++, Java, Scala and Python. InAccel offers all the required high-level functions that make it easy to build and accelerate parallel apps. No need to modify your application to use an unfamiliar parallel programming language (like OpenCL).

[Diagram: same manager architecture as on the previous slide.]
26. www.inaccel.com™
Software Simplicity

• 30x simpler code
• Software-like function invocation
• No need for OpenCL directives

https://github.com/Xilinx/AWS-F1-Developer-Labs/blob/master/helloworld_ocl/src/host.cpp
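To illustrate the "software-like function invocation" claim, here is a minimal sketch of what a high-level accelerator call can look like, in contrast to the OpenCL host code linked above (platform discovery, contexts, queues, buffers, kernel setup). The `CoralStub` class and its `submit` method are hypothetical stand-ins, not the actual InAccel API:

```python
# Hypothetical sketch: a high-level "submit a kernel, get a result" call.
# In raw OpenCL host code the same operation takes dozens of lines.

class CoralStub:
    """Stand-in for a resource-manager client; NOT the real InAccel API."""

    def submit(self, kernel: str, *args):
        # A real manager would schedule `kernel` on a free FPGA and copy
        # `args` to device memory; this stub just computes on the CPU.
        if kernel == "vadd":
            a, b = args
            return [x + y for x, y in zip(a, b)]
        raise ValueError(f"unknown kernel: {kernel}")

coral = CoralStub()
result = coral.submit("vadd", [1, 2, 3], [10, 20, 30])
print(result)  # [11, 22, 33]
```

The point of the pattern is that device selection, configuration and memory movement live behind the manager, so application code stays a plain function call.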
27. www.inaccel.com™
Example of Scaling to 2 FPGAs Using the Resource Manager for Logistic Regression

1.86x speedup using 2 FPGAs, simply by changing a line. You specify how many FPGAs you want to use:

inaccel start --fpga=xilinx:0,xilinx:1
or
inaccel start --fpga=all
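A 1.86x speedup on 2 FPGAs corresponds to a parallel scaling efficiency of about 93%, which is a one-line check:

```python
# Scaling efficiency when going from 1 to 2 FPGAs:
# measured speedup divided by the number of devices.
speedup, fpgas = 1.86, 2

efficiency = speedup / fpgas
print(f"Scaling efficiency: {efficiency:.0%}")  # 93%
```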
28. www.inaccel.com™
InAccel's Coral Manager Integrated with Spark

˃ Integrated solution that allows:
• Scale-up (1, 2, or 8 FPGAs per node)
• Scale-out to multiple nodes (using the Spark API)
• Seamless integration
• Docker-based deployment

[Diagram: a driver program (SparkContext) talks to a cluster manager, which schedules tasks onto executors on worker nodes (f1 instances); each worker runs the FPGA RTE and management interface on top of its own FPGA cluster.]
30. www.inaccel.com™
Bitstream Repository

˃ The FPGA Resource Manager is integrated with a JFrog bitstream repository that is used to store FPGA bitstreams.

[Diagram: the application pulls bitstreams from the FPGA bitstream repository and deploys them to the FPGA cluster.]

https://store.inaccel.com
31. www.inaccel.com™
Performance Evaluation on Machine Learning

˃ Up to 15x speedup for LR ML (7.5x overall)
˃ Up to 14x speedup for K-means ML (6.2x overall)
˃ Spark-GPU* achieves 3.8x-5.7x
˃ f1.4x: 16 cores + 2 FPGAs (InAccel)
˃ r5d.4x: 16 cores

[Charts: logistic regression and K-means clustering execution time on MNIST (24 GB, 100 iterations), r5d.4x vs f1.4x (InAccel), broken down into data preprocessing, data transformation and ML training; 15x and 14x ML-training speedups respectively.]

*[Spark-GPU: An Accelerated In-Memory Data Processing Engine on Clusters]
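The gap between the 15x training speedup and the 7.5x end-to-end speedup is ordinary Amdahl's-law behavior: the unaccelerated preprocessing and transformation stages bound the overall gain. The 15x and 7.5x figures come from the slide; the accelerated-fraction value below is inferred from them, not stated in the deck:

```python
# Amdahl's law: overall = 1 / ((1 - f) + f / s), where f is the fraction
# of runtime that is accelerated and s is the speedup of that fraction.
def overall_speedup(f: float, s: float) -> float:
    return 1.0 / ((1.0 - f) + f / s)

# With a 15x kernel speedup, a 7.5x overall speedup implies ML training
# accounted for roughly 93% of the original runtime.
print(f"{overall_speedup(0.9286, 15.0):.2f}x overall")  # ~7.50x
```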
32. www.inaccel.com™
Demo on Logistic Regression

˃ 14x faster training of logistic regression in a Python Jupyter notebook, instantly
https://www.inaccel.com/wp-content/uploads/InAccel_ML_accelerate_sound.mp4
34. www.inaccel.com™
Personnel Cost Savings

˃ Reduced complexity of programming the FPGAs
˃ Over 4x fewer lines of code (LoC)
˃ Significant savings in time to run and debug the FPGAs
35. www.inaccel.com™
FPGA Manager Deployment

Easy to deploy:
˃ Launch a container with InAccel's Docker image, or even deploy it as a DaemonSet on a Kubernetes cluster, and enjoy acceleration services at the drop of a hat.
˃ https://hub.docker.com/u/inaccel/

FPGA Manager:
• Easy deployment
• Easy scalability
• Easy integration
38. www.inaccel.com™
Serverless Deployment

˃ Integrated framework for serverless deployment
˃ Compatible with Kubernetes
˃ Compatible with Kubeless and Knative
˃ Users only have to upload the images to the S3 bucket; InAccel's FPGA Manager then automatically deploys the cluster of FPGAs, processes the data and stores the results back to the S3 bucket.
˃ Users do not have to know anything about the FPGA execution.

[Diagram: an upload to Amazon S3 triggers the InAccel FPGA Resource Manager, which runs functions from the f1 library of accelerated functions on a cluster of Amazon EC2 f1 instances and writes the results back to Amazon S3.]

https://medium.com/@inaccel/fpgas-goes-serverless-on-kubernetes-55c1d39c5e30
39. www.inaccel.com™
Integration with Arrow

˃ Seamless experience for the application developer writing software using Arrow-backed dataframes.
˃ Arrow's growing adoption makes our integration even more valuable.
˃ Zero extra overhead for other operations supported by Arrow, e.g. serialization.

"Apache Arrow specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. The Arrow memory format supports zero-copy reads for efficient data access without serialization overhead." (Apache Software Foundation)
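The zero-copy property quoted above is what makes "zero extra overhead" possible: consumers read the very buffer the producer wrote, with no serialization step in between. The principle can be illustrated with Python's standard-library buffer protocol (Arrow's own buffers work the same way, exposed via pyarrow):

```python
# Zero-copy access in miniature: a memoryview exposes the producer's
# buffer directly, so reading through it copies no bytes -- the same
# idea Arrow's columnar buffers rely on when handing data to consumers.
import array

column = array.array("d", [1.0, 2.0, 3.0, 4.0])  # a "column" of float64 values
view = memoryview(column)                         # zero-copy view, no serialization

# The view aliases the original memory: a mutation of the column is
# visible through the view without any copy or conversion step.
column[0] = 42.0
print(view[0])      # 42.0
print(view.nbytes)  # 32 (4 x 8-byte float64)
```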
41. www.inaccel.com™
Job Openings

˃ Software engineer: Java, Python, Scala, Go, C++
˃ DevOps: Kubernetes, Knative, Anthos, Mesos, Docker

• Collaboration with hyperscale companies
• State-of-the-art technology
• International exposure

Interested? jobs@inaccel.com