The capacity of data grows rapidly in big data area, more and more memory are consumed either in the computation or holding the intermediate data for analytic jobs. For those memory intensive workloads, end-point users have to scale out the computation cluster or extend memory with storage like HDD or SSD to meet the requirement of computing tasks. For scaling out the cluster, the extra cost from cluster management, operation and maintenance will increase the total cost if the extra CPU resources are not fully utilized. To address the shortcoming above, Intel Optane DC persistent memory (Optane DCPM) breaks the traditional memory/storage hierarchy and scale up the computing server with higher capacity persistent memory. Also it brings higher bandwidth & lower latency than storage like SSD or HDD. And Apache Spark is widely used in the analytics like SQL and Machine Learning on the cloud environment. For cloud environment, low performance of remote data access is typical a stop gap for users especially for some I/O intensive queries. For the ML workload, it's an iterative model which I/O bandwidth is the key to the end-2-end performance. In this talk, we will introduce how to accelerate Spark SQL with OAP (https://github.com/Intel-bigdata/OAP) to accelerate SQL performance on Cloud to archive 8X performance gain and RDD cache to improve K-means performance with 2.5X performance gain leveraging Intel Optane DCPM. Also we will have a deep dive how Optane DCPM for these performance gains.
Speakers: Cheng Xu, Piotr Balcer
The Linux Block Layer - Built for Fast StorageKernel TLV
The arrival of flash storage introduced a radical change in performance profiles of direct attached devices. At the time, it was obvious that Linux I/O stack needed to be redesigned in order to support devices capable of millions of IOPs, and with extremely low latency.
In this talk we revisit the changes the Linux block layer in the
last decade or so, that made it what it is today - a performant, scalable, robust and NUMA-aware subsystem. In addition, we cover the new NVMe over Fabrics support in Linux.
Sagi Grimberg
Sagi is Principal Architect and co-founder at LightBits Labs.
Cosco: An Efficient Facebook-Scale Shuffle ServiceDatabricks
Cosco is an efficient shuffle-as-a-service that powers Spark (and Hive) jobs at Facebook warehouse scale. It is implemented as a scalable, reliable and maintainable distributed system. Cosco is based on the idea of partial in-memory aggregation across a shared pool of distributed memory. This provides vastly improved efficiency in disk usage compared to Spark's built-in shuffle. Long term, we believe the Cosco architecture will be key to efficiently supporting jobs at ever larger scale. In this talk we'll take a deep dive into the Cosco architecture and describe how it's deployed at Facebook. We will then describe how it's integrated to run shuffle for Spark, and contrast it with Spark's built-in sort-based shuffle mechanism and SOS (presented at Spark+AI Summit 2018).
The Linux Block Layer - Built for Fast StorageKernel TLV
The arrival of flash storage introduced a radical change in performance profiles of direct attached devices. At the time, it was obvious that Linux I/O stack needed to be redesigned in order to support devices capable of millions of IOPs, and with extremely low latency.
In this talk we revisit the changes the Linux block layer in the
last decade or so, that made it what it is today - a performant, scalable, robust and NUMA-aware subsystem. In addition, we cover the new NVMe over Fabrics support in Linux.
Sagi Grimberg
Sagi is Principal Architect and co-founder at LightBits Labs.
Cosco: An Efficient Facebook-Scale Shuffle ServiceDatabricks
Cosco is an efficient shuffle-as-a-service that powers Spark (and Hive) jobs at Facebook warehouse scale. It is implemented as a scalable, reliable and maintainable distributed system. Cosco is based on the idea of partial in-memory aggregation across a shared pool of distributed memory. This provides vastly improved efficiency in disk usage compared to Spark's built-in shuffle. Long term, we believe the Cosco architecture will be key to efficiently supporting jobs at ever larger scale. In this talk we'll take a deep dive into the Cosco architecture and describe how it's deployed at Facebook. We will then describe how it's integrated to run shuffle for Spark, and contrast it with Spark's built-in sort-based shuffle mechanism and SOS (presented at Spark+AI Summit 2018).
Jvm tuning for low latency application & CassandraQuentin Ambard
G1, CMS, Shenandoah, or Zing? Heap size at 8GB or 31GB? compressed pointers? Region size? What is the maximum break time? Throughput or Latency... What gain? MaxGCPauseMillis, G1HeapRegionSize, MaxTenuringThreshold, UnlockExperimentalVMOptions, ParallelGCThreads, InitiatingHeapOccupancyPercent, G1RSetUpdatingPauseTimePercent, which parameters have the most impact?
Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...Databricks
The increasing challenge to serve ever-growing data driven by AI and analytics workloads makes disaggregated storage and compute more attractive as it enables companies to scale their storage and compute capacity independently to match data & compute growth rate. Cloud based big data services is gaining momentum as it provides simplified management, elasticity, and pay-as-you-go model.
ORC files were originally introduced in Hive, but have now migrated to an independent Apache project. This has sped up the development of ORC and simplified integrating ORC into other projects, such as Hadoop, Spark, Presto, and Nifi. There are also many new tools that are built on top of ORC, such as Hive’s ACID transactions and LLAP, which provides incredibly fast reads for your hot data. LLAP also provides strong security guarantees that allow each user to only see the rows and columns that they have permission for.
This talk will discuss the details of the ORC and Parquet formats and what the relevant tradeoffs are. In particular, it will discuss how to format your data and the options to use to maximize your read performance. In particular, we’ll discuss when and how to use ORC’s schema evolution, bloom filters, and predicate push down. It will also show you how to use the tools to translate ORC files into human-readable formats, such as JSON, and display the rich metadata from the file including the type in the file and min, max, and count for each column.
Основные темы, затронутые на семинаре:
Задачи и компоненты подсистемы управления памятью;
Аппаратные возможности платформы x86_64;
Как описывается в ядре физическая и виртуальная память;
API подсистемы управления памятью;
Высвобождение ранее занятой памяти;
Инструменты мониторинга;
Memory Cgroups;
Compaction — дефрагментация физической памяти.
Zero Data Loss Recovery Applianceによるデータベース保護のアーキテクチャオラクルエンジニア通信
データ量の増大、業務の24時間化に伴い、従来のバックアップ・ソリューションではデータ保護のニーズをすべて満たせなくなってきています。これを解消すべくOracle Databaseの保護に特化して設計されたエンジニアド・システム、Zero Data Loss Recovery Applianceが登場しました。これからの時代のデータ保護テクノロジーに関して、アーキテクチャを中心に紹介します。
Linux Memory Management with CMA (Contiguous Memory Allocator)Pankaj Suryawanshi
Fundamentals of Linux Memory Management and CMA (Contiguous Memory Allocator) In Linux.
Virtual Memory, Physical Memory, Swap Space, DMA, IOMMU, Paging, Segmentation, TLB, Hugepages, Ion google memory manager
Oracle Open World (OOW) 2014 presentation on Oracle Cache Fusion; how it works and how to use it in an optimized fashion to scale an Oracle RAC system.
Meta/Facebook's database serving social workloads is running on top of MyRocks (MySQL on RocksDB). This means our performance and reliability depends a lot on RocksDB. Not just MyRocks, but also we have other important systems running on top of RocksDB. We have learned many lessons from operating and debugging RocksDB at scale.
In this session, we will offer an overview of RocksDB, key differences from InnoDB, and share a few interesting lessons learned from production.
Note: When you view the the slide deck via web browser, the screenshots may be blurred. You can download and view them offline (Screenshots are clear).
In this presentation, Yasunori Goto and Qi fuli will talk about basis of NVDIMM, the issues of RAS of Non Volatile DIMM(NVDIMM), and what feature is made and is developing for it.
NVDIMM is expected as new age device recently. Though a cpu can read/write the NVDIMM directly like RAM, the data of NVDIMM remains after power down or reboot. So, on memory database will be one of the good example of usecase of NVDIMM.
Since many people have made great effort for Linux, NVDIMM drivers, filesystems,management command, and many libraries has been well developed for a few years,
However, Yasunori Goto found some issues about RAS(Reliabivity, Availability, and Serviceability) feature of NVDIMM, because characteristic of the NVDIMM is likely mixture of Storage and RAM. For example, NVDIMM does not have hotplug feature because it is inserted at DIMM slot like RAM, but its data must be back-upped/restored like storage.
Linux Kernel Booting Process (2) - For NLKBshimosawa
Describes the bootstrapping part in Linux, and related architectural mechanisms and technologies.
This is the part two of the slides, and the succeeding slides may contain the errata for this slide.
Delivered as plenary at USENIX LISA 2013. video here: https://www.youtube.com/watch?v=nZfNehCzGdw and https://www.usenix.org/conference/lisa13/technical-sessions/plenary/gregg . "How did we ever analyze performance before Flame Graphs?" This new visualization invented by Brendan can help you quickly understand application and kernel performance, especially CPU usage, where stacks (call graphs) can be sampled and then visualized as an interactive flame graph. Flame Graphs are now used for a growing variety of targets: for applications and kernels on Linux, SmartOS, Mac OS X, and Windows; for languages including C, C++, node.js, ruby, and Lua; and in WebKit Web Inspector. This talk will explain them and provide use cases and new visualizations for other event types, including I/O, memory usage, and latency.
This ppt was used by Devrim at pgDay Asia 2017. He talked about some important facts about WAL - Transaction Logs or xlogs in PostgreSQL. Some of these can really come handy on a bad day
Optimizing Apache Spark Throughput Using Intel Optane and Intel Memory Drive...Databricks
Apache Spark is a popular data processing engine designed to execute advanced analytics on very large data sets which are common in today’s enterprise use cases. To enable Spark’s high performance for different workloads (e.g. machine-learning applications), in-memory data storage capabilities are built right in.
However, Spark’s in-memory capabilities are limited by the memory available in the server; it is common for computing resources to be idle during the execution of a Spark job, even though the system’s memory is saturated. To mitigate this limitation, Spark’s distributed architecture can run on a cluster of nodes, thus taking advantage of the memory available across all nodes. While employing additional nodes would solve the server DRAM capacity problem, it does so at an increased cost. Intel(R) Memory Drive Technology is a software-defned memory (SDM) technology, which combined with an Intel(R) Optane(TM) SSD, expands the system’s memory.
This combination of Intel(R) Optane(TM) SSD with Intel Memory Drive Technology alleviates those memory limitations that are inherent to Spark, by making more memory available to the operating system and to Spark jobs, transparently.
Spring Hill (NNP-I 1000): Intel's Data Center Inference Chipinside-BigData.com
Today at Hot Chips 2019, Intel revealed new details of upcoming high-performance AI accelerators: Intel Nervana neural network processors, with the NNP-T for training and the NNP-I for inference. Intel engineers also presented technical details on hybrid chip packaging technology, Intel Optane DC persistent memory and chiplet technology for optical I/O.
"To get to a future state of ‘AI everywhere,’ we’ll need to address the crush of data being generated and ensure enterprises are empowered to make efficient use of their data, processing it where it’s collected when it makes sense and making smarter use of their upstream resources," said Naveen Rao, Intel vice president and GM, Artificial Intelligence Products Group. "Data centers and the cloud need to have access to performant and scalable general purpose computing and specialized acceleration for complex AI applications. In this future vision of AI everywhere, a holistic approach is needed—from hardware to software to applications.”
Learn more: https://www.intel.ai/accelerating-for-ai/?elq_cid=1192980
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Jvm tuning for low latency application & CassandraQuentin Ambard
G1, CMS, Shenandoah, or Zing? Heap size at 8GB or 31GB? compressed pointers? Region size? What is the maximum break time? Throughput or Latency... What gain? MaxGCPauseMillis, G1HeapRegionSize, MaxTenuringThreshold, UnlockExperimentalVMOptions, ParallelGCThreads, InitiatingHeapOccupancyPercent, G1RSetUpdatingPauseTimePercent, which parameters have the most impact?
Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...Databricks
The increasing challenge to serve ever-growing data driven by AI and analytics workloads makes disaggregated storage and compute more attractive as it enables companies to scale their storage and compute capacity independently to match data & compute growth rate. Cloud based big data services is gaining momentum as it provides simplified management, elasticity, and pay-as-you-go model.
ORC files were originally introduced in Hive, but have now migrated to an independent Apache project. This has sped up the development of ORC and simplified integrating ORC into other projects, such as Hadoop, Spark, Presto, and Nifi. There are also many new tools that are built on top of ORC, such as Hive’s ACID transactions and LLAP, which provides incredibly fast reads for your hot data. LLAP also provides strong security guarantees that allow each user to only see the rows and columns that they have permission for.
This talk will discuss the details of the ORC and Parquet formats and what the relevant tradeoffs are. In particular, it will discuss how to format your data and the options to use to maximize your read performance. In particular, we’ll discuss when and how to use ORC’s schema evolution, bloom filters, and predicate push down. It will also show you how to use the tools to translate ORC files into human-readable formats, such as JSON, and display the rich metadata from the file including the type in the file and min, max, and count for each column.
Основные темы, затронутые на семинаре:
Задачи и компоненты подсистемы управления памятью;
Аппаратные возможности платформы x86_64;
Как описывается в ядре физическая и виртуальная память;
API подсистемы управления памятью;
Высвобождение ранее занятой памяти;
Инструменты мониторинга;
Memory Cgroups;
Compaction — дефрагментация физической памяти.
Zero Data Loss Recovery Applianceによるデータベース保護のアーキテクチャオラクルエンジニア通信
データ量の増大、業務の24時間化に伴い、従来のバックアップ・ソリューションではデータ保護のニーズをすべて満たせなくなってきています。これを解消すべくOracle Databaseの保護に特化して設計されたエンジニアド・システム、Zero Data Loss Recovery Applianceが登場しました。これからの時代のデータ保護テクノロジーに関して、アーキテクチャを中心に紹介します。
Linux Memory Management with CMA (Contiguous Memory Allocator)Pankaj Suryawanshi
Fundamentals of Linux Memory Management and CMA (Contiguous Memory Allocator) In Linux.
Virtual Memory, Physical Memory, Swap Space, DMA, IOMMU, Paging, Segmentation, TLB, Hugepages, Ion google memory manager
Oracle Open World (OOW) 2014 presentation on Oracle Cache Fusion; how it works and how to use it in an optimized fashion to scale an Oracle RAC system.
Meta/Facebook's database serving social workloads is running on top of MyRocks (MySQL on RocksDB). This means our performance and reliability depends a lot on RocksDB. Not just MyRocks, but also we have other important systems running on top of RocksDB. We have learned many lessons from operating and debugging RocksDB at scale.
In this session, we will offer an overview of RocksDB, key differences from InnoDB, and share a few interesting lessons learned from production.
Note: When you view the the slide deck via web browser, the screenshots may be blurred. You can download and view them offline (Screenshots are clear).
In this presentation, Yasunori Goto and Qi fuli will talk about basis of NVDIMM, the issues of RAS of Non Volatile DIMM(NVDIMM), and what feature is made and is developing for it.
NVDIMM is expected as new age device recently. Though a cpu can read/write the NVDIMM directly like RAM, the data of NVDIMM remains after power down or reboot. So, on memory database will be one of the good example of usecase of NVDIMM.
Since many people have made great effort for Linux, NVDIMM drivers, filesystems,management command, and many libraries has been well developed for a few years,
However, Yasunori Goto found some issues about RAS(Reliabivity, Availability, and Serviceability) feature of NVDIMM, because characteristic of the NVDIMM is likely mixture of Storage and RAM. For example, NVDIMM does not have hotplug feature because it is inserted at DIMM slot like RAM, but its data must be back-upped/restored like storage.
Linux Kernel Booting Process (2) - For NLKBshimosawa
Describes the bootstrapping part in Linux, and related architectural mechanisms and technologies.
This is the part two of the slides, and the succeeding slides may contain the errata for this slide.
Delivered as plenary at USENIX LISA 2013. video here: https://www.youtube.com/watch?v=nZfNehCzGdw and https://www.usenix.org/conference/lisa13/technical-sessions/plenary/gregg . "How did we ever analyze performance before Flame Graphs?" This new visualization invented by Brendan can help you quickly understand application and kernel performance, especially CPU usage, where stacks (call graphs) can be sampled and then visualized as an interactive flame graph. Flame Graphs are now used for a growing variety of targets: for applications and kernels on Linux, SmartOS, Mac OS X, and Windows; for languages including C, C++, node.js, ruby, and Lua; and in WebKit Web Inspector. This talk will explain them and provide use cases and new visualizations for other event types, including I/O, memory usage, and latency.
This ppt was used by Devrim at pgDay Asia 2017. He talked about some important facts about WAL - Transaction Logs or xlogs in PostgreSQL. Some of these can really come handy on a bad day
Optimizing Apache Spark Throughput Using Intel Optane and Intel Memory Drive...Databricks
Apache Spark is a popular data processing engine designed to execute advanced analytics on very large data sets which are common in today’s enterprise use cases. To enable Spark’s high performance for different workloads (e.g. machine-learning applications), in-memory data storage capabilities are built right in.
However, Spark’s in-memory capabilities are limited by the memory available in the server; it is common for computing resources to be idle during the execution of a Spark job, even though the system’s memory is saturated. To mitigate this limitation, Spark’s distributed architecture can run on a cluster of nodes, thus taking advantage of the memory available across all nodes. While employing additional nodes would solve the server DRAM capacity problem, it does so at an increased cost. Intel(R) Memory Drive Technology is a software-defned memory (SDM) technology, which combined with an Intel(R) Optane(TM) SSD, expands the system’s memory.
This combination of Intel(R) Optane(TM) SSD with Intel Memory Drive Technology alleviates those memory limitations that are inherent to Spark, by making more memory available to the operating system and to Spark jobs, transparently.
Spring Hill (NNP-I 1000): Intel's Data Center Inference Chipinside-BigData.com
Today at Hot Chips 2019, Intel revealed new details of upcoming high-performance AI accelerators: Intel Nervana neural network processors, with the NNP-T for training and the NNP-I for inference. Intel engineers also presented technical details on hybrid chip packaging technology, Intel Optane DC persistent memory and chiplet technology for optical I/O.
"To get to a future state of ‘AI everywhere,’ we’ll need to address the crush of data being generated and ensure enterprises are empowered to make efficient use of their data, processing it where it’s collected when it makes sense and making smarter use of their upstream resources," said Naveen Rao, Intel vice president and GM, Artificial Intelligence Products Group. "Data centers and the cloud need to have access to performant and scalable general purpose computing and specialized acceleration for complex AI applications. In this future vision of AI everywhere, a holistic approach is needed—from hardware to software to applications.”
Learn more: https://www.intel.ai/accelerating-for-ai/?elq_cid=1192980
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
DAOS - Scale-Out Software-Defined Storage for HPC/Big Data/AI Convergenceinside-BigData.com
In this deck, Johann Lombardi from Intel presents: DAOS - Scale-Out Software-Defined Storage for HPC/Big Data/AI Convergence.
"Intel has been building an entirely open source software ecosystem for data-centric computing, fully optimized for Intel® architecture and non-volatile memory (NVM) technologies, including Intel Optane DC persistent memory and Intel Optane DC SSDs. Distributed Asynchronous Object Storage (DAOS) is the foundation of the Intel exascale storage stack. DAOS is an open source software-defined scale-out object store that provides high bandwidth, low latency, and high I/O operations per second (IOPS) storage containers to HPC applications. It enables next-generation data-centric workflows that combine simulation, data analytics, and AI."
Unlike traditional storage stacks that were primarily designed for rotating media, DAOS is architected from the ground up to make use of new NVM technologies, and it is extremely lightweight because it operates end-to-end in user space with full operating system bypass. DAOS offers a shift away from an I/O model designed for block-based, high-latency storage to one that inherently supports fine- grained data access and unlocks the performance of next- generation storage technologies.
Watch the video: https://youtu.be/wnGBW31yhLM
Learn more: https://www.intel.com/content/www/us/en/high-performance-computing/daos-high-performance-storage-brief.html
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
hbaseconasia2019 HBase Bucket Cache on Persistent MemoryMichael Stack
Anoop Sam John, Ramkrishna S Vasudevan, and Xu Kai of Intel
Track 1: Internals
https://open.mi.com/conference/hbasecon-asia-2019
THE COMMUNITY EVENT FOR APACHE HBASE™
July 20th, 2019 - Sheraton Hotel, Beijing, China
https://hbase.apache.org/hbaseconasia-2019/
Apache CarbonData & Spark meetup
"QATCodec: past, present and future" if from INTEL
Apache Spark™ is a unified analytics engine for large-scale data processing.
CarbonData is a high-performance data solution that supports various data analytic scenarios, including BI analysis, ad-hoc SQL query, fast filter lookup on detail record, streaming analytics, and so on. CarbonData has been deployed in many enterprise production environments, in one of the largest scenario it supports queries on single table with 3PB data (more than 5 trillion records) with response time less than 3 seconds!
Deep Learning Training at Scale: Spring Crest Deep Learning Acceleratorinside-BigData.com
Today at Hot Chips 2019, Intel revealed new details of upcoming high-performance AI accelerators: Intel Nervana neural network processors, with the NNP-T for training and the NNP-I for inference. Intel engineers also presented technical details on hybrid chip packaging technology, Intel Optane DC persistent memory and chiplet technology for optical I/O.
"To get to a future state of ‘AI everywhere,’ we’ll need to address the crush of data being generated and ensure enterprises are empowered to make efficient use of their data, processing it where it’s collected when it makes sense and making smarter use of their upstream resources," said Naveen Rao, Intel vice president and GM, Artificial Intelligence Products Group. "Data centers and the cloud need to have access to performant and scalable general purpose computing and specialized acceleration for complex AI applications. In this future vision of AI everywhere, a holistic approach is needed—from hardware to software to applications.”
Learn more: https://www.intel.ai/accelerating-for-ai/?elq_cid=1192980
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...Databricks
Nowadays, people are creating, sharing and storing data at a faster pace than ever before, effective data compression / decompression could significantly reduce the cost of data usage. Apache Spark is a general distributed computing engine for big data analytics, and it has large amount of data storing and shuffling across cluster in runtime, the data compression/decompression codecs can impact the end to end application performance in many ways.
However, there’s a trade-off between the storage size and compression/decompression throughput (CPU computation). Balancing the data compress speed and ratio is a very interesting topic, particularly while both software algorithms and the CPU instruction set keep evolving. Apache Spark provides a very flexible compression codecs interface with default implementations like GZip, Snappy, LZ4, ZSTD etc. and Intel Big Data Technologies team also implemented more codecs based on latest Intel platform like ISA-L(igzip), LZ4-IPP, Zlib-IPP and ZSTD for Apache Spark; in this session, we’d like to compare the characteristics of those algorithms and implementations, by running different micro workloads as well as end to end workloads, based on different generations of Intel x86 platform and disk.
It’s supposedly to be the best practice for big data software engineers to choose the proper compression/decompression codecs for their applications, and we also will present the methodologies of measuring and tuning the performance bottlenecks for typical Apache Spark workloads.
Abstract: Explore the packet I/O data path from a NIC across PCI-Express to cache/memory and understand how to build efficient CPU code for networked applications.
Speaker: Venky Venkatesan, Intel Fellow, Chief Architect – Packet Processing and Networking Applications
Overcoming Scaling Challenges in MongoDB Deployments with SSDMongoDB
Horizontal scaling of databases can increase performance and capacity, but adding nodes also increases infrastructure and management complexities. Cluster management can challenge even the most seasoned IT professional. While vertical scaling is easier to implement, it has traditionally been limited by memory and disk throughput. As both SSD latency and price continue to improve, the MongoDB database scaling equation changes. This session will review a number of SSD technologies that Intel employs (SATA, NVMe) and their impacts on I/O performance and database scaling. We will look at various architectural options for optimizing I/O based on our discussions with real world users. We will also provide attendees a glimpse at our future plans in terms of technologies in the storage area.
Accelerating Virtual Machine Access with the Storage Performance Development ...Michelle Holley
Abstract: Although new non-volatile media inherently offers very low latency, remote access
using protocols such as NVMe-oF and presenting the data to VMs via virtualized interfaces such as virtio
adds considerable software overhead. One way to reduce the overhead is to use the Storage
Performance Development Kit (SPDK), an open-source software project that provides building blocks for
scalable and efficient storage applications with breakthrough performance. Comparing the software
paths for virtualizing block storage I/O illustrates the advantages of the SPDK-based approach. Empirical
data shows that using SPDK can improve CPU efficiency by up to 10 x and reduce latency up to 50% over
existing methods. Future enhancements for SPDK will make its advantages even greater.
Speaker Bio: Anu Rao is Product line manager for storage software in Data center Group. She helps
customer ease into and adopt open source Storage software like Storage Performance Development Kit
(SPDK) and Intelligent Software Acceleration-Library (ISA-L).
Streamline End-to-End AI Pipelines with Intel, Databricks, and OmniSciIntel® Software
Preprocess, visualize, and Build AI Faster at-Scale on Intel Architecture. Develop end-to-end AI pipelines for inferencing including data ingestion, preprocessing, and model inferencing with tabular, NLP, RecSys, video and image using Intel oneAPI AI Analytics Toolkit and other optimized libraries. Build at-scale performant pipelines with Databricks and end-to-end Xeon optimizations. Learn how to visualize with the OmniSci Immerse Platform and experience a live demonstration of the Intel Distribution of Modin and OmniSci.
Webinář "Konsolidace Oracle DB na systémech s procesory M7, včetně migrace z konkurenčních serverových platforem"
Prezentuje Josef Šlahůnek, Oracle
9.3.2016
HPC DAY 2017 | Accelerating tomorrow's HPC and AI workflows with Intel Archit...HPC DAY
HPC DAY 2017 - http://www.hpcday.eu/
Accelerating tomorrow's HPC and AI workflows with Intel Architecture
Atanas Atanasov | HPC solution architect, EMEA region at Intel
Data Lakehouse Symposium | Day 1 | Part 1Databricks
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
Data Lakehouse Symposium | Day 1 | Part 2Databricks
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
In this session, learn how to quickly supplement your on-premises Hadoop environment with a simple, open, and collaborative cloud architecture that enables you to generate greater value with scaled application of analytics and AI on all your data. You will also learn five critical steps for a successful migration to the Databricks Lakehouse Platform along with the resources available to help you begin to re-skill your data teams.
Democratizing Data Quality Through a Centralized PlatformDatabricks
Bad data leads to bad decisions and broken customer experiences. Organizations depend on complete and accurate data to power their business, maintain efficiency, and uphold customer trust. With thousands of datasets and pipelines running, how do we ensure that all data meets quality standards, and that expectations are clear between producers and consumers? Investing in shared, flexible components and practices for monitoring data health is crucial for a complex data organization to rapidly and effectively scale.
At Zillow, we built a centralized platform to meet our data quality needs across stakeholders. The platform is accessible to engineers, scientists, and analysts, and seamlessly integrates with existing data pipelines and data discovery tools. In this presentation, we will provide an overview of our platform’s capabilities, including:
Giving producers and consumers the ability to define and view data quality expectations using a self-service onboarding portal
Performing data quality validations using libraries built to work with spark
Dynamically generating pipelines that can be abstracted away from users
Flagging data that doesn’t meet quality standards at the earliest stage and giving producers the opportunity to resolve issues before use by downstream consumers
Exposing data quality metrics alongside each dataset to provide producers and consumers with a comprehensive picture of health over time
Learn to Use Databricks for Data ScienceDatabricks
Data scientists face numerous challenges throughout the data science workflow that hinder productivity. As organizations continue to become more data-driven, a collaborative environment is more critical than ever — one that provides easier access and visibility into the data, reports and dashboards built against the data, reproducibility, and insights uncovered within the data.. Join us to hear how Databricks’ open and collaborative platform simplifies data science by enabling you to run all types of analytics workloads, from data preparation to exploratory analysis and predictive analytics, at scale — all on one unified platform.
Why APM Is Not the Same As ML MonitoringDatabricks
Application performance monitoring (APM) has become the cornerstone of software engineering allowing engineering teams to quickly identify and remedy production issues. However, as the world moves to intelligent software applications that are built using machine learning, traditional APM quickly becomes insufficient to identify and remedy production issues encountered in these modern software applications.
As a lead software engineer at NewRelic, my team built high-performance monitoring systems including Insights, Mobile, and SixthSense. As I transitioned to building ML Monitoring software, I found the architectural principles and design choices underlying APM to not be a good fit for this brand new world. In fact, blindly following APM designs led us down paths that would have been better left unexplored.
In this talk, I draw upon my (and my team’s) experience building an ML Monitoring system from the ground up and deploying it on customer workloads running large-scale ML training with Spark as well as real-time inference systems. I will highlight how the key principles and architectural choices of APM don’t apply to ML monitoring. You’ll learn why, understand what ML Monitoring can successfully borrow from APM, and hear what is required to build a scalable, robust ML Monitoring architecture.
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
Autonomy and ownership are core to working at Stitch Fix, particularly on the Algorithms team. We enable data scientists to deploy and operate their models independently, with minimal need for handoffs or gatekeeping. By writing a simple function and calling out to an intuitive API, data scientists can harness a suite of platform-provided tooling meant to make ML operations easy. In this talk, we will dive into the abstractions the Data Platform team has built to enable this. We will go over the interface data scientists use to specify a model and what that hooks into, including online deployment, batch execution on Spark, and metrics tracking and visualization.
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
In this talk, I will dive into the stage level scheduling feature added to Apache Spark 3.1. Stage level scheduling extends upon Project Hydrogen by improving big data ETL and AI integration and also enables multiple other use cases. It is beneficial any time the user wants to change container resources between stages in a single Apache Spark application, whether those resources are CPU, Memory or GPUs. One of the most popular use cases is enabling end-to-end scalable Deep Learning and AI to efficiently use GPU resources. In this type of use case, users read from a distributed file system, do data manipulation and filtering to get the data into a format that the Deep Learning algorithm needs for training or inference and then sends the data into a Deep Learning algorithm. Using stage level scheduling combined with accelerator aware scheduling enables users to seamlessly go from ETL to Deep Learning running on the GPU by adjusting the container requirements for different stages in Spark within the same application. This makes writing these applications easier and can help with hardware utilization and costs.
There are other ETL use cases where users want to change CPU and memory resources between stages, for instance there is data skew or perhaps the data size is much larger in certain stages of the application. In this talk, I will go over the feature details, cluster requirements, the API and use cases. I will demo how the stage level scheduling API can be used by Horovod to seamlessly go from data preparation to training using the Tensorflow Keras API using GPUs.
The talk will also touch on other new Apache Spark 3.1 functionality, such as pluggable caching, which can be used to enable faster dataframe access when operating from GPUs.
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
In this talk, I would like to introduce an open-source tool built by our team that simplifies the data conversion from Apache Spark to deep learning frameworks.
Imagine you have a large dataset, say 20 GBs, and you want to use it to train a TensorFlow model. Before feeding the data to the model, you need to clean and preprocess your data using Spark. Now you have your dataset in a Spark DataFrame. When it comes to the training part, you may have the problem: How can I convert my Spark DataFrame to some format recognized by my TensorFlow model?
The existing data conversion process can be tedious. For example, to convert an Apache Spark DataFrame to a TensorFlow Dataset file format, you need to either save the Apache Spark DataFrame on a distributed filesystem in parquet format and load the converted data with third-party tools such as Petastorm, or save it directly in TFRecord files with spark-tensorflow-connector and load it back using TFRecordDataset. Both approaches take more than 20 lines of code to manage the intermediate data files, rely on different parsing syntax, and require extra attention for handling vector columns in the Spark DataFrames. In short, all these engineering frictions greatly reduced the data scientists’ productivity.
The Databricks Machine Learning team contributed a new Spark Dataset Converter API to Petastorm to simplify these tedious data conversion process steps. With the new API, it takes a few lines of code to convert a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader with default parameters.
In the talk, I will use an example to show how to use the Spark Dataset Converter to train a Tensorflow model and how simple it is to go from single-node training to distributed training on Databricks.
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
There is no doubt Kubernetes has emerged as the next generation of cloud native infrastructure to support a wide variety of distributed workloads. Apache Spark has evolved to run both Machine Learning and large scale analytics workloads. There is growing interest in running Apache Spark natively on Kubernetes. By combining the flexibility of Kubernetes and scalable data processing with Apache Spark, you can run any data and machine pipelines on this infrastructure while effectively utilizing resources at disposal.
In this talk, Rajesh Thallam and Sougata Biswas will share how to effectively run your Apache Spark applications on Google Kubernetes Engine (GKE) and Google Cloud Dataproc, orchestrate the data and machine learning pipelines with managed Apache Airflow on GKE (Google Cloud Composer). Following topics will be covered: – Understanding key traits of Apache Spark on Kubernetes- Things to know when running Apache Spark on Kubernetes such as autoscaling- Demonstrate running analytics pipelines on Apache Spark orchestrated with Apache Airflow on Kubernetes cluster.
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
Pipelines have become ubiquitous, as the need for stringing multiple functions to compose applications has gained adoption and popularity. Common pipeline abstractions such as “fit” and “transform” are even shared across divergent platforms such as Python Scikit-Learn and Apache Spark.
Scaling pipelines at the level of simple functions is desirable for many AI applications, however is not directly supported by Ray’s parallelism primitives. In this talk, Raghu will describe a pipeline abstraction that takes advantage of Ray’s compute model to efficiently scale arbitrarily complex pipeline workflows. He will demonstrate how this abstraction cleanly unifies pipeline workflows across multiple platforms such as Scikit-Learn and Spark, and achieves nearly optimal scale-out parallelism on pipelined computations.
Attendees will learn how pipelined workflows can be mapped to Ray’s compute model and how they can both unify and accelerate their pipelines with Ray.
Sawtooth Windows for Feature AggregationsDatabricks
In this talk about zipline, we will introduce a new type of windowing construct called a sawtooth window. We will describe various properties about sawtooth windows that we utilize to achieve online-offline consistency, while still maintaining high-throughput, low-read latency and tunable write latency for serving machine learning features.We will also talk about a simple deployment strategy for correcting feature drift – due operations that are not “abelian groups”, that operate over change data.
We want to present multiple anti patterns utilizing Redis in unconventional ways to get the maximum out of Apache Spark.All examples presented are tried and tested in production at Scale at Adobe. The most common integration is spark-redis which interfaces with Redis as a Dataframe backing Store or as an upstream for Structured Streaming. We deviate from the common use cases to explore where Redis can plug gaps while scaling out high throughput applications in Spark.
Niche 1 : Long Running Spark Batch Job – Dispatch New Jobs by polling a Redis Queue
· Why?
o Custom queries on top a table; We load the data once and query N times
· Why not Structured Streaming
· Working Solution using Redis
Niche 2 : Distributed Counters
· Problems with Spark Accumulators
· Utilize Redis Hashes as distributed counters
· Precautions for retries and speculative execution
· Pipelining to improve performance
Re-imagine Data Monitoring with whylogs and SparkDatabricks
In the era of microservices, decentralized ML architectures and complex data pipelines, data quality has become a bigger challenge than ever. When data is involved in complex business processes and decisions, bad data can, and will, affect the bottom line. As a result, ensuring data quality across the entire ML pipeline is both costly, and cumbersome while data monitoring is often fragmented and performed ad hoc. To address these challenges, we built whylogs, an open source standard for data logging. It is a lightweight data profiling library that enables end-to-end data profiling across the entire software stack. The library implements a language and platform agnostic approach to data quality and data monitoring. It can work with different modes of data operations, including streaming, batch and IoT data.
In this talk, we will provide an overview of the whylogs architecture, including its lightweight statistical data collection approach and various integrations. We will demonstrate how the whylogs integration with Apache Spark achieves large scale data profiling, and we will show how users can apply this integration into existing data and ML pipelines.
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
Machine learning (ML) models are typically part of prediction queries that consist of a data processing part (e.g., for joining, filtering, cleaning, featurization) and an ML part invoking one or more trained models. In this presentation, we identify significant and unexplored opportunities for optimization. To the best of our knowledge, this is the first effort to look at prediction queries holistically, optimizing across both the ML and SQL components.
We will present Raven, an end-to-end optimizer for prediction queries. Raven relies on a unified intermediate representation that captures both data processing and ML operators in a single graph structure.
This allows us to introduce optimization rules that
(i) reduce unnecessary computations by passing information between the data processing and ML operators
(ii) leverage operator transformations (e.g., turning a decision tree to a SQL expression or an equivalent neural network) to map operators to the right execution engine, and
(iii) integrate compiler techniques to take advantage of the most efficient hardware backend (e.g., CPU, GPU) for each operator.
We have implemented Raven as an extension to Spark’s Catalyst optimizer to enable the optimization of SparkSQL prediction queries. Our implementation also allows the optimization of prediction queries in SQL Server. As we will show, Raven is capable of improving prediction query performance on Apache Spark and SQL Server by up to 13.1x and 330x, respectively. For complex models, where GPU acceleration is beneficial, Raven provides up to 8x speedup compared to state-of-the-art systems. As part of the presentation, we will also give a demo showcasing Raven in action.
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
Semantic segmentation is the classification of every pixel in an image/video. The segmentation partitions a digital image into multiple objects to simplify/change the representation of the image into something that is more meaningful and easier to analyze [1][2]. The technique has a wide variety of applications ranging from perception in autonomous driving scenarios to cancer cell segmentation for medical diagnosis.
Exponential growth in the datasets that require such segmentation is driven by improvements in the accuracy and quality of the sensors generating the data extending to 3D point cloud data. This growth is further compounded by exponential advances in cloud technologies enabling the storage and compute available for such applications. The need for semantically segmented datasets is a key requirement to improve the accuracy of inference engines that are built upon them.
Streamlining the accuracy and efficiency of these systems directly affects the value of the business outcome for organizations that are developing such functionalities as a part of their AI strategy.
This presentation details workflows for labeling, preprocessing, modeling, and evaluating performance/accuracy. Scientists and engineers leverage domain-specific features/tools that support the entire workflow from labeling the ground truth, handling data from a wide variety of sources/formats, developing models and finally deploying these models. Users can scale their deployments optimally on GPU-based cloud infrastructure to build accelerated training and inference pipelines while working with big datasets. These environments are optimized for engineers to develop such functionality with ease and then scale against large datasets with Spark-based clusters on the cloud.
Massive Data Processing in Adobe Using Delta LakeDatabricks
At Adobe Experience Platform, we ingest TBs of data every day and manage PBs of data for our customers as part of the Unified Profile Offering. At the heart of this is a bunch of complex ingestion of a mix of normalized and denormalized data with various linkage scenarios power by a central Identity Linking Graph. This helps power various marketing scenarios that are activated in multiple platforms and channels like email, advertisements etc. We will go over how we built a cost effective and scalable data pipeline using Apache Spark and Delta Lake and share our experiences.
What are we storing?
Multi Source – Multi Channel Problem
Data Representation and Nested Schema Evolution
Performance Trade Offs with Various formats
Go over anti-patterns used
(String FTW)
Data Manipulation using UDFs
Writer Worries and How to Wipe them Away
Staging Tables FTW
Datalake Replication Lag Tracking
Performance Time!
StarCompliance is a leading firm specializing in the recovery of stolen cryptocurrency. Our comprehensive services are designed to assist individuals and organizations in navigating the complex process of fraud reporting, investigation, and fund recovery. We combine cutting-edge technology with expert legal support to provide a robust solution for victims of crypto theft.
Our Services Include:
Reporting to Tracking Authorities:
We immediately notify all relevant centralized exchanges (CEX), decentralized exchanges (DEX), and wallet providers about the stolen cryptocurrency. This ensures that the stolen assets are flagged as scam transactions, making it impossible for the thief to use them.
Assistance with Filing Police Reports:
We guide you through the process of filing a valid police report. Our support team provides detailed instructions on which police department to contact and helps you complete the necessary paperwork within the critical 72-hour window.
Launching the Refund Process:
Our team of experienced lawyers can initiate lawsuits on your behalf and represent you in various jurisdictions around the world. They work diligently to recover your stolen funds and ensure that justice is served.
At StarCompliance, we understand the urgency and stress involved in dealing with cryptocurrency theft. Our dedicated team works quickly and efficiently to provide you with the support and expertise needed to recover your assets. Trust us to be your partner in navigating the complexities of the crypto world and safeguarding your investments.
Opendatabay - Open Data Marketplace.pptxOpendatabay
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
First ever open hub for data enthusiasts to collaborate and innovate. A platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. Leverage cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay AI-driven features streamline the data workflow. Finding the data you need shouldn't be a complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay breaks new ground with a dedicated, AI-generated, synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay. Marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
Techniques to optimize the pagerank algorithm usually fall in two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before pagerank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time, and the number of iterations. If a graph has no dangling nodes, pagerank of each strongly connected component can be computed in topological order. This could help reduce the iteration time, no. of iterations, and also enable multi-iteration concurrency in pagerank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
4. About US
Intel SSP/DAT – Data Analytics Technology
● Active open source contributions: About 30 committers from our team with
significant contributions to a lot of big data key projects including Spark, Hadoop,
HBase, Hive...
● A global engineering team (China, US, India) - Major team is based in Shanghai
● Mission: To fully optimize analytics solutions on IA and make Intel the best and
easiest platform for big data and analytics customers
Intel NVMS – Non-Volatile Memory Solutions Group
• Tasked with preparing the ground for Intel® Optane™ DC Persistent Memory
• Enabling new and existing software through Persistent Memory Development Kit
(PMDK)
• pmem.io
4
5. Out of memory Large Spill
Some Challenges in Spark: Memory???
7. 7
Intel®Optane™DCPersistentMemory-ProductOverview
(Optane™ based Memory Module for the Data Center)
IMC
Cascade Lake
Server
IMC
• 128, 256, 512GB
DIMM Capacity
• 2666 MT/sec
Speed
• 3TB (not including DRAM)
Capacity per CPU
• DDR4 electrical & physical
• Close to DRAM latency
• Cache line size access
* DIMM population shown as an example only.
11. 11
AppDirectModeOptions
Legacy Storage APIs
Block Atomicity
Storage APIs with DAX
(AppDirect)
PersistentMemory
USERSPACEKERNELSPACE
Standard
File API
GenericNVDIMMDriver
Application
FileSystem
Standard
Raw Device
Access
mmap
Load/
Store
Standard
File API
pmem-Aware
FileSystem
MMU
Mappings
“DAX”
BTT
DevDAX
PMDK
mmap
HARDWARE
• No Code Changes Required
• Operates in Blocks like SSD/HDD
• Traditional read/write
• Works with Existing File
Systems
• Atomicity at block level
• Block size configurable
• 4K, 512B*
• NVDIMM Driver required
• Support starting Kernel 4.2
• Configured as Boot Device
• Higher Endurance than Enterprise
SSDs
• High Performance Block Storage
• Low Latency, higher BW, High
IOPs
*Requires Linux
• Code changes may be required*
• Bypasses file system page cache
• Requires DAX enabled file system
• XFS, EXT4, NTFS
• No Kernel Code or interrupts
• No interrupts
• Fastest IO path possible
* Code changes required for load/store direct
access if the application does not already support
this.
12. Low Bandwidth IO storage (e.g. HDD, S3)
RDD Cache
SparkDCPMMoptimizationOverview
OAP Cache
DRAM
IO Layer
Compute Layer
Spark
Input Data
Cache Hit Cache Miss
DRAM
Tied Storage
RDD cache DCPMM optimization (Against DRAM):
● Reduce DRAM footprint
● Higher Capacity to cache more data
● Status: To Be Added
OAP DCPMM optimization (Against DRAM)
● High Capacity, Less Cache Miss
● Avoid heavy cost disk read
● Status: To Be Added
9 I/O intensive Workload
(from Decision Support
Queries)
K-means Workload
14. OAP (Optimized Analytics Package)
Goal
• IO cache is critical for I/O intensive workload especially on low bandwidth environment (e.g. Cloud, On-Prem
HDD based system)
• Make full use of the advantage from DCPMM to speed up Spark SQL
• High capacity and high throughput
• No latency and reduced DRAM footprint
• Better performance per TCO
Feature
• Fine grain cache (e.g. column chunk for Parquet) columnar based cache
• Cache aware scheduler (V2 API via preferred location)
• Self managed DCPMM pool (no extra GC overhead)
• Easy to use (easily turn on/off), transparent to user (no changes for their queries)
14
https://github.com/Intel-bigdata/OAP
15. OAP
15
SparkDCPMMFullSoftwareStack
SQL workload
Spark SQL
VMEMCACHE
PMEM-AWARE
File SYSTEM
Intel® Optane™ DC Persistent Memory Module
Unmodified SQL
Unchanged Spark
Provide scheduler and fine gain cache based on Data Source API
Abstract away hardware details and cache implementation
Expose persistent memory as memory-mapped files (DAX)
Native library
to access DCPMM
16. Deployment Overview
Server 1
Local Storage (HDD)
Spark Executor
Spark Gateway
(e.g. ThriftServer, Spark shell)SQL
Cached Data Source (v1/v2)
Task scheduled
Intel Optane DC Persistent
Memory
Cache Hit Cache Miss
Server 2
Native library (vmemcache)
Cache Aware Scheduler
17. Cache Design - Problem statement
• Local LRU cache
• Support for large capacities available with
persistent memory (many terabytes per server)
• Lightweight, efficient and embeddable
• In-memory
• Scalable
18. Cache Design - Existing solutions
• In-memory databases tend to rely on malloc() in some form for allocating
memory for entries
– Which means allocating anonymous memory
• Persistent Memory is exposed by the operating system through normal file-
system operations
– Which means allocating byte-addressable PMEM needs to use file memory mapping
(fsdax).
• We could modify the allocator of an existing in-memory database and be
done with it, right? J
19. Cache Design - Fragmentation
• Manual dynamic memory
management a’la
dlmalloc/jemalloc/tcmalloc/palloc
causes fragmentation
• Applications with substantial
expected runtime durations need
a way to combat this problem
– Compacting GC (Java, .NET)
– Defragmentation (Redis, Apache
Ignite)
– Slab allocation (memcached)
• Especially so if there’s substantial
expected variety in allocated sizes
20. Cache Design - Extent allocation
• If fragmentation is unavoidable,
and defragmentation/compacting
is CPU and memory bandwidth
intensive, let’s embrace it!
• Usually only done in relatively
large blocks in file-systems.
• But on PMEM, we are no longer
restricted by large transfer units
(sectors, pages etc)
21. Cache Design - Scalable replacement policy
• Performance of
libvmemcache was
bottlenecked by naïve
implementation of LRU
based on a doubly-linked
list.
• With 100st of threads, most
of the time of any request
was spent
waiting on a list lock…
• Locking per-node doesn’t
solve the problem…
22. Cache Design - Buffered LRU
• Our solution was quite
simple.
• We’ve added a wait-free
ringbuffer which buffers
the list-move operations
• This way, the list only
needs to get locked
during eviction or when
the ringbuffer is full.
23. Cache Design - Lightweight, embeddable,
in-memory caching
https://github.com/pmem/vmemcache
VMEMcache *cache = vmemcache_new("/tmp", VMEMCACHE_MIN_POOL,
VMEMCACHE_MIN_EXTENT, VMEMCACHE_REPLACEMENT_LRU);
const char *key = "foo";
vmemcache_put(cache, key, strlen(key), "bar", sizeof("bar"));
char buf[128];
ssize_t len = vmemcache_get(cache, key, strlen(key),
buf, sizeof(buf), 0, NULL);
vmemcache_delete(cache);
libvmemcache has normal get/put APIs, optional replacement policy, and configurable extent size
Works with terabyte-sized in-memory workloads without a sweat, with very high space utilization.
Also works on regular DRAM.
24. 24
Parquet File
Footer
RowGroup #1
Column Chunk #1
Column Chunk #2
Column Chunk #N
RowGroup #2
RowGroup #N
Guava Based
Cache
Vmemcache
Fine Grain Cache
Cache Status & Report
Cache Design – Status and Fine Grain
25. 25
Experiments and Configurations
DCPMM DRAM
Hardware DRAM 192GB (12x 16GB DDR4) 768GB (24x 32GB DDR4)
Intel Optane DC Persistent Memory 1TB (QS: 8 x 128GB) N/A
DCPMM Mode App Direct (Memkind) N/A
SSD N/A N/A
CPU 2 * Cascadelake 8280M (Thread(s) per core: 2, Core(s) per socket: 28, Socket(s): 2 CPU max MHz:
4000.0000 CPU min MHz: 1000.0000 L1d cache: 32K, L1i cache: 32K, L2 cache: 1024K, L3 cache: 39424)
OS 4.20.4-200.fc29.x86_64 (BKC: WW06'19, BIOS: SE5C620.86B.0D.01.0134.100420181737)
Software OAP 1TB DCPMM based OAP cache 610GB DRAM based OAP cache
Hadoop 8 * HDD disk (ST1000NX0313, 1-replica uncompressed & plain encoded data on Hadoop)
Spark 1 * Driver (5GB) + 2 * Executor (62 cores, 74GB), spark.sql.oap.rowgroup.size=1MB
JDK Oracle JDK 1.8.0_161
Workload Data Scale 2TB, 3TB, 4TB
Decision Making Queries 9 I/O intensive queries
Multi-Tenants 9 threads (Fair scheduled)
26. Test Scenario 1: Both Fit In DRAM And DCPMM - 2TB
*The performance gain can be much lower if IO is improved (e.g. compression
& encoding enabled) or some hacking codes fixed (e.g. NUMA scheduler)
With 2TB Data Scale, both DCPMM and DRAM based OAP cache can hold the full dataset
(613GB). DCPMM is 24.6% (=1- 100.1/132.81) performance lower than DRAM as measured on 9
I/O intensive decision making queries.
On Premise
Performance
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems,
components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases,
including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks. Configurations: See page20.Test by Intel on 24/02/2019.
27. Test Scenario 2: Fit In DCPMM Not For DRAM - 3TB
*The performance gain can be much lower if IO is improved (e.g. compression
& encoding enabled) or some hacking codes fixed (e.g. NUMA scheduler)
With 3TB Data Scale, only DCPMM based OAP cache can hold the full dataset (920GB).
DCPMM shows 8X* performance gain over DRAM as measured on 9 I/O intensive decision
making queries.
On Premise
Performance
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems,
components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases,
including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks. Configurations: See page20.Test by Intel on 24/02/2019.
28. Performance Analysis - System Metrics
● Input data is all cached in DCPMM while partially for DRAM
● DCPMM reaches up to 18GB/sbandwidth while DRAM case it’s bounded by disk IO which is only about
250MB/s ~ 450MB/s
● OAP cache doesn’t apply for shuffle data (intermediate data) that extra IO pressure put onto DRAM case
● DCPMM reads about 27.5GBwhile DRAM reads about 395.6GB
OAP Cache works avoiding disk read (blue)
and only disk write for shuffle (red)
Disk read (blue) comes from
shuffle data
Disk read (blue) happens from time to time
DCPMM
29. Test Scenario 3: None Of DRAM &DCPMM Fit - 4TB
*The performance gain can be much lower if IO is improved (e.g. compression
& encoding enabled) or some hacking codes fixed (e.g. NUMA scheduler)
With 4TB Data Scale, none of DCPMM and DRAM based OAP cache can hold the full dataset
(1226.7GB). DCPMM shows 1.66X* (=2252.80/1353.67) performance gain over DRAM as
measured on 9 I/O intensive decision making queries.
On Premise
Performance
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems,
components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases,
including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks. Configurations: See page20.Test by Intel on 24/02/2019.
31. DCPMM Storage Level
DRAM
Disk (SSD, HDD)
Object
Byte buffer
Byte buffer
Spark Core (Compute/Storage Memory)
DRAM (on heap)
DRAM (off heap)
Disk
K-means workload GraphX workload
Decompression
Deseriliazation
• Spark Storage Level
• A few storage levels serving for different purposes
including memory and disk
• Off-heap memory is supported to avoid GC
overhead in Java
• Large capacity storage level (disk) is widely used
for iterative computation workload (e.g. K-means,
GraphX workloads) by caching hot data in storage
level
• DCPMM Storage Level
• Extend memory layer
• Using Pmem library to access DCPMM avoiding
the overhead of decompression from disk
• Large capacity and high I/O performance of
DCPMM shows better performance than tied
solution (original DRAM + Disk solution)
32. 32
K-means in Spark
M0 - Mj
Mj+1 - Mk
Mk+1 - Mm
…
Mx+1 - Mn
HDFS
Cache #1
(DRAM + DISK)
Cache #2
(DRAM + DISK)
Cache #3
(DRAM + DISK)
…
Cache #N
(DRAM + DISK)
Spark Executor
Processes
Loading
Normalization
Caching
C1, C2, C3, … CK
For Each Record in
the Cache
For Each Record in
the Cache
For Each Record in
the Cache
For Each Record in
the Cache
For Each Record in
the Cache
Step 1: For each record in the cache, Sum the
vectors with the same closet centroid
Update the new Centroids
Step 2: Sum the vectors according to
the centroids and find out the new
centroids globally
Sync
N iterations
Load TrainRandom Initial Centroids
2 Iterations for all
of the cache data
to get the K
centroids in a
random mode
Load the data from HDFS to DRAM (and DCPMM / SSD if DRAM cannot hold all of the data)(Load), after that, the data will not be changed,
and will be iterated repeatedly in Initialization and Train stages.
33. Compute
Node
33
K-means Basic Data Flow
• Load data from HDFS to memory.
• Spill over to local storage
Load
• Compute using initial centroid based on
data in memory or local storage
Initialization
• Compute iterations based on local data
Train
CPU
Memory
Storage
Local
Shared
Compute
Node
CPU
Memory
Storage
Local
Shared
Compute
Node
CPU
Memory
Storage
Local
Shared
HDFS
…
…
K centroids
C1, C2, C3, … CK
1
2
3
1
2
3
34. 34
Experiments and Configurations
#1 AD #2 DRAM
Hardware DRAM 192GB (12x 16GB DDR4 2666) 768GB (24× 32GB DDR4 2666)
Intel Optane DC Persistent Memory 1TB (8x 128GB QS) N/A
DCPMM Mode AppDirect N/A
CPU Intel(R) Xeon(R) Platinum 8280L CPU @ 2.70GHz
Disk Drive 8x Intel DC S4510 SSD (1.8T) + 2 * P4500 (1.8T)
Software CPU / Memory Allocation Driver (5GB + 10 cores) + 2 * Executor (80GB + 32 Cores)
Software Stack Spark (2.2.2-snapshot: bb3e6ed9216f98f3a3b96c8c52f20042d65e2181) + Hadoop (2.7.5)
OS & Kernel & BIOS
Linux-4.18.8-100.fc27.x86_64-x86_64-with-fedora-27-Twenty_Seven
BIOS: SE5C620.86B.0D.01.0299.122420180146
Mitigation variants (1,2,3,3a,4, L1TF) 1,2,3,3a,4, L1TF
Cache Spill Disk N/A 8 * SSD
NUMA Bind 2 NUMA Nodes bind 2 Spark Executors respectively
Data Cache Size (OffHeap) DCPMM (no spill) 680GB DRAM (partially spill)
Workload
Record Count / Data Size / Memory Footprint
Row size 6.1 Billion Records / data size 1.1TB / cache data 954GB
Cache Spill Size 0 235GB
Kmeans (Iteration times) 10
Kmeans (K value) 5
1 Modified to support DCPMM
35. 35
Performance Comparison(DCPMM AD V.S. DRAM)
• DCPMM provides larger “memory” than DRAM and avoid the data spill in this case, hence got up to 4.98X performance in Training(READ),
and 2.85X end to end performance;
• The DCPMM AppDirect mode is about ~19% faster than DRAM in Load(WRITE cache into DCPMM / DRAM);
• Training(READ) execution gap of DRAM case caused by storage media, as well as the data decompression and de-serialization from the
local spill file;
4.98X1.19X
2.85X
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems,
components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases,
including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks. Configurations: See page11.Test by Intel on 14/03/2019.
DCPMM DCPMM
37. • Intel Optane DC Persistent Memory brings high capacity, low latency and high throughput
• Two operation modes for DCPMM: 1) App-Direct 2) 2LM
• Good for Spark in: 1) memory bound workload 2) I/O intensive workload
Summary
38. Want to know more?
See you at Intel Booth 100
Download source code from Github:
https://github.com/intel-bigdata/OAP
https://github.com/pmem/vmemcache
39. DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT