This document discusses scaling Python to HPC and big data environments. It describes how Intel provides Python distributions to improve performance and parallelism. Key points include:
- Intel distributions optimize Python packages like NumPy and SciPy with Intel MKL to improve performance of dense and sparse linear algebra operations.
- Composable multi-threading is supported using Intel TBB to better balance workloads across CPU threads.
- Intel DAAL provides optimized algorithms for machine learning in scikit-learn to accelerate tasks like PCA.
The document outlines techniques Intel uses to make Python more suitable for high performance and big data computing through parallelism, optimized libraries, and integration with frameworks like Spark and TensorFlow.
Intel Distribution for Python - Scaling for HPC and Big DataDESMOND YUEN
Make Python usable beyond prototyping environment by
scaling out to HPC and Big Data environments. Hardware and software efficiency crucial in production (Perf/Watt, etc.)
If you like what you read be sure you ♥ it below. Thank you!
The VTune analyzer provides an integrated performance analysis and tuning environment that helps you analyze your code's performance on systems with IA-32, Intel(R) 64, and IA-64 architecture.
Ray Tracing with Intel® Embree and Intel® OSPRay: Use Cases and Updates | SIG...Intel® Software
Explore practical examples of Intel® Embree and Intel® OSPRay in production rendering and the best practices of using the kernels in typical rendering pipelines.
Intel Distribution for Python - Scaling for HPC and Big DataDESMOND YUEN
Make Python usable beyond prototyping environment by
scaling out to HPC and Big Data environments. Hardware and software efficiency crucial in production (Perf/Watt, etc.)
If you like what you read be sure you ♥ it below. Thank you!
The VTune analyzer provides an integrated performance analysis and tuning environment that helps you analyze your code's performance on systems with IA-32, Intel(R) 64, and IA-64 architecture.
Ray Tracing with Intel® Embree and Intel® OSPRay: Use Cases and Updates | SIG...Intel® Software
Explore practical examples of Intel® Embree and Intel® OSPRay in production rendering and the best practices of using the kernels in typical rendering pipelines.
Embree Ray Tracing Kernels | Overview and New Features | SIGGRAPH 2018 Tech S...Intel® Software
Overview of the new Embree 3 ray tracing framework, including how to use the new API, supported geometry types, and ray intersection methods. Includes a look at new features like normal oriented curves, vertex grids, etc.
Today at Hot Chips 2019, Intel engineers presented technical details on hybrid chip packaging technology, Intel Optane DC persistent memory and chiplet technology for optical I/O.
"To get to a future state of ‘AI everywhere,’ we’ll need to address the crush of data being generated and ensure enterprises are empowered to make efficient use of their data, processing it where it’s collected when it makes sense and making smarter use of their upstream resources," said Naveen Rao, Intel vice president and GM, Artificial Intelligence Products Group. "Data centers and the cloud need to have access to performant and scalable general purpose computing and specialized acceleration for complex AI applications. In this future vision of AI everywhere, a holistic approach is needed—from hardware to software to applications.”
Learn more: https://www.intel.ai/accelerating-for-ai/?elq_cid=1192980
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Tuning For Deep Learning Inference with Intel® Processor Graphics | SIGGRAPH ...Intel® Software
Deep learning based Inference on edge based devices is growing rapidly. In this talk, learn about how developers and researchers are taking advantage of Intel® Processor Graphics to get best performance.
Learn how Intel worked with Pixar Animation Studios* and Sony Imageworks* to realize dynamic SIMD code generation of Open Shading Language shader networks, achieving 3-9x speedups with Intel® AVX-512.
Deep Learning Training at Scale: Spring Crest Deep Learning Acceleratorinside-BigData.com
Today at Hot Chips 2019, Intel revealed new details of upcoming high-performance AI accelerators: Intel Nervana neural network processors, with the NNP-T for training and the NNP-I for inference. Intel engineers also presented technical details on hybrid chip packaging technology, Intel Optane DC persistent memory and chiplet technology for optical I/O.
"To get to a future state of ‘AI everywhere,’ we’ll need to address the crush of data being generated and ensure enterprises are empowered to make efficient use of their data, processing it where it’s collected when it makes sense and making smarter use of their upstream resources," said Naveen Rao, Intel vice president and GM, Artificial Intelligence Products Group. "Data centers and the cloud need to have access to performant and scalable general purpose computing and specialized acceleration for complex AI applications. In this future vision of AI everywhere, a holistic approach is needed—from hardware to software to applications.”
Learn more: https://www.intel.ai/accelerating-for-ai/?elq_cid=1192980
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Open Source Interactive CPU Preview Rendering with Pixar's Universal Scene De...Intel® Software
Universal Scene Description* (USD) is an open source initiative developed by Pixar for fast, large scale, and universal asset management across multiple programs including Maya, Houdini, and others.
Spring Hill (NNP-I 1000): Intel's Data Center Inference Chipinside-BigData.com
Today at Hot Chips 2019, Intel revealed new details of upcoming high-performance AI accelerators: Intel Nervana neural network processors, with the NNP-T for training and the NNP-I for inference. Intel engineers also presented technical details on hybrid chip packaging technology, Intel Optane DC persistent memory and chiplet technology for optical I/O.
"To get to a future state of ‘AI everywhere,’ we’ll need to address the crush of data being generated and ensure enterprises are empowered to make efficient use of their data, processing it where it’s collected when it makes sense and making smarter use of their upstream resources," said Naveen Rao, Intel vice president and GM, Artificial Intelligence Products Group. "Data centers and the cloud need to have access to performant and scalable general purpose computing and specialized acceleration for complex AI applications. In this future vision of AI everywhere, a holistic approach is needed—from hardware to software to applications.”
Learn more: https://www.intel.ai/accelerating-for-ai/?elq_cid=1192980
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Build a Deep Learning Video Analytics Framework | SIGGRAPH 2019 Technical Ses...Intel® Software
Explore how to build a unified framework based on FFmpeg and GStreamer to enable video analytics on all Intel® hardware, including CPUs, GPUs, VPUs, FPGAs, and in-circuit emulators.
Today at Hot Chips 2019, Intel engineers presented technical details on hybrid chip packaging technology, Intel Optane DC persistent memory and chiplet technology for optical I/O.
"To get to a future state of ‘AI everywhere,’ we’ll need to address the crush of data being generated and ensure enterprises are empowered to make efficient use of their data, processing it where it’s collected when it makes sense and making smarter use of their upstream resources," said Naveen Rao, Intel vice president and GM, Artificial Intelligence Products Group. "Data centers and the cloud need to have access to performant and scalable general purpose computing and specialized acceleration for complex AI applications. In this future vision of AI everywhere, a holistic approach is needed—from hardware to software to applications.”
Learn more: https://www.intel.ai/accelerating-for-ai/?elq_cid=1192980
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
RenderMan*: The Role of Open Shading Language (OSL) with Intel® Advanced Vect...Intel® Software
This talk focuses on the newest release in RenderMan* 22.5 and its adoption at Pixar Animation Studios* for rendering future movies. With native support for Intel® Advanced Vector Extensions, Intel® Advanced Vector Extensions 2, and Intel® Advanced Vector Extensions 512, it includes enhanced library features, debugging support, and an extensive test framework.
Abstract: Explore the packet I/O data path from a NIC across PCI-Express to cache/memory and understand how to build efficient CPU code for networked applications.
Speaker: Venky Venkatesan, Intel Fellow, Chief Architect – Packet Processing and Networking Applications
Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...Intel® Software
Explore practical elements, such as performance profiling, debugging, and porting advice. Get an overview of advanced programming topics, like common design patterns, SIMD lane interoperability, data conversions, and more.
Practical RISC-V Random Test Generation using Constraint Programminged271828
A proof-of-concept random test generator for RISC-V ISA is presented. The test generator uses constraint programming for specification of relationships between instructions and operands. Example scenarios to cover basic instruction randomization, data hazards, and non-sharing are presented. The tool integrates the RISC-V instruction set simulator to enable the generation of self-checking tests. The tool is implemented in Python using a freely-available constraint solver library. A summary of problems encountered is provided and next steps are discussed.
Embree Ray Tracing Kernels | Overview and New Features | SIGGRAPH 2018 Tech S...Intel® Software
Overview of the new Embree 3 ray tracing framework, including how to use the new API, supported geometry types, and ray intersection methods. Includes a look at new features like normal oriented curves, vertex grids, etc.
Today at Hot Chips 2019, Intel engineers presented technical details on hybrid chip packaging technology, Intel Optane DC persistent memory and chiplet technology for optical I/O.
"To get to a future state of ‘AI everywhere,’ we’ll need to address the crush of data being generated and ensure enterprises are empowered to make efficient use of their data, processing it where it’s collected when it makes sense and making smarter use of their upstream resources," said Naveen Rao, Intel vice president and GM, Artificial Intelligence Products Group. "Data centers and the cloud need to have access to performant and scalable general purpose computing and specialized acceleration for complex AI applications. In this future vision of AI everywhere, a holistic approach is needed—from hardware to software to applications.”
Learn more: https://www.intel.ai/accelerating-for-ai/?elq_cid=1192980
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Tuning For Deep Learning Inference with Intel® Processor Graphics | SIGGRAPH ...Intel® Software
Deep learning based Inference on edge based devices is growing rapidly. In this talk, learn about how developers and researchers are taking advantage of Intel® Processor Graphics to get best performance.
Learn how Intel worked with Pixar Animation Studios* and Sony Imageworks* to realize dynamic SIMD code generation of Open Shading Language shader networks, achieving 3-9x speedups with Intel® AVX-512.
Deep Learning Training at Scale: Spring Crest Deep Learning Acceleratorinside-BigData.com
Today at Hot Chips 2019, Intel revealed new details of upcoming high-performance AI accelerators: Intel Nervana neural network processors, with the NNP-T for training and the NNP-I for inference. Intel engineers also presented technical details on hybrid chip packaging technology, Intel Optane DC persistent memory and chiplet technology for optical I/O.
"To get to a future state of ‘AI everywhere,’ we’ll need to address the crush of data being generated and ensure enterprises are empowered to make efficient use of their data, processing it where it’s collected when it makes sense and making smarter use of their upstream resources," said Naveen Rao, Intel vice president and GM, Artificial Intelligence Products Group. "Data centers and the cloud need to have access to performant and scalable general purpose computing and specialized acceleration for complex AI applications. In this future vision of AI everywhere, a holistic approach is needed—from hardware to software to applications.”
Learn more: https://www.intel.ai/accelerating-for-ai/?elq_cid=1192980
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Open Source Interactive CPU Preview Rendering with Pixar's Universal Scene De...Intel® Software
Universal Scene Description* (USD) is an open source initiative developed by Pixar for fast, large scale, and universal asset management across multiple programs including Maya, Houdini, and others.
Spring Hill (NNP-I 1000): Intel's Data Center Inference Chipinside-BigData.com
Today at Hot Chips 2019, Intel revealed new details of upcoming high-performance AI accelerators: Intel Nervana neural network processors, with the NNP-T for training and the NNP-I for inference. Intel engineers also presented technical details on hybrid chip packaging technology, Intel Optane DC persistent memory and chiplet technology for optical I/O.
"To get to a future state of ‘AI everywhere,’ we’ll need to address the crush of data being generated and ensure enterprises are empowered to make efficient use of their data, processing it where it’s collected when it makes sense and making smarter use of their upstream resources," said Naveen Rao, Intel vice president and GM, Artificial Intelligence Products Group. "Data centers and the cloud need to have access to performant and scalable general purpose computing and specialized acceleration for complex AI applications. In this future vision of AI everywhere, a holistic approach is needed—from hardware to software to applications.”
Learn more: https://www.intel.ai/accelerating-for-ai/?elq_cid=1192980
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Build a Deep Learning Video Analytics Framework | SIGGRAPH 2019 Technical Ses...Intel® Software
Explore how to build a unified framework based on FFmpeg and GStreamer to enable video analytics on all Intel® hardware, including CPUs, GPUs, VPUs, FPGAs, and in-circuit emulators.
Today at Hot Chips 2019, Intel engineers presented technical details on hybrid chip packaging technology, Intel Optane DC persistent memory and chiplet technology for optical I/O.
"To get to a future state of ‘AI everywhere,’ we’ll need to address the crush of data being generated and ensure enterprises are empowered to make efficient use of their data, processing it where it’s collected when it makes sense and making smarter use of their upstream resources," said Naveen Rao, Intel vice president and GM, Artificial Intelligence Products Group. "Data centers and the cloud need to have access to performant and scalable general purpose computing and specialized acceleration for complex AI applications. In this future vision of AI everywhere, a holistic approach is needed—from hardware to software to applications.”
Learn more: https://www.intel.ai/accelerating-for-ai/?elq_cid=1192980
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
RenderMan*: The Role of Open Shading Language (OSL) with Intel® Advanced Vect...Intel® Software
This talk focuses on the newest release in RenderMan* 22.5 and its adoption at Pixar Animation Studios* for rendering future movies. With native support for Intel® Advanced Vector Extensions, Intel® Advanced Vector Extensions 2, and Intel® Advanced Vector Extensions 512, it includes enhanced library features, debugging support, and an extensive test framework.
Abstract: Explore the packet I/O data path from a NIC across PCI-Express to cache/memory and understand how to build efficient CPU code for networked applications.
Speaker: Venky Venkatesan, Intel Fellow, Chief Architect – Packet Processing and Networking Applications
Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...Intel® Software
Explore practical elements, such as performance profiling, debugging, and porting advice. Get an overview of advanced programming topics, like common design patterns, SIMD lane interoperability, data conversions, and more.
Practical RISC-V Random Test Generation using Constraint Programminged271828
A proof-of-concept random test generator for RISC-V ISA is presented. The test generator uses constraint programming for specification of relationships between instructions and operands. Example scenarios to cover basic instruction randomization, data hazards, and non-sharing are presented. The tool integrates the RISC-V instruction set simulator to enable the generation of self-checking tests. The tool is implemented in Python using a freely-available constraint solver library. A summary of problems encountered is provided and next steps are discussed.
ONS 2018 LA - Intel Tutorial: Cloud Native to NFV - Alon Bernstein, Cisco & K...Kuralamudhan Ramakrishnan
The first wave of NFV was about taking a network function and running it as-is in a virtual environment. The web giants follow a different approach called Cloud Native. Cloud Native views the cloud as a huge distributed compute platform, applications are broken into micro-services and deployed in a container based environment using DevOps.
Communication Service Providers are looking to adopt Cloud Native, yet the existing Cloud Native principles are not sufficient to meet their business and NFV use case needs. In this session, Intel and Cisco will explore and share experiences addressing challenges, technology gaps and migration path to Cloud Native for NFV.
Join us to alleviate your concerns around data plane performance, control, and DevOps deployment when using micro-services, Containers, and Kubernetes implementations.
E5 Intel Xeon Processor E5 Family Making the Business Case Intel IT Center
This presentation highlights cloud computing advantages of the Intel® Xeon® processor E5 family and helps you make the business case for investing. Includes access to an ROI calculator.
Intel® Select Solutions for the Network provide a faster means to address these challenges as we transition to 5G with pre-validated, optimized building blocks to help drive scale. Hear the what, why, when and where around Intel® Select Solutions for the Network.
Software Development Tools for Intel® IoT PlatformsIntel® Software
This talk familiarizes participants with the benefits of using the Intel® software development tools and libraries for developing end-to-end IoT solutions.
HPC DAY 2017 | Accelerating tomorrow's HPC and AI workflows with Intel Archit...HPC DAY
HPC DAY 2017 - http://www.hpcday.eu/
Accelerating tomorrow's HPC and AI workflows with Intel Architecture
Atanas Atanasov | HPC solution architect, EMEA region at Intel
Ready access to high performance Python with Intel Distribution for Python 2018AWS User Group Bengaluru
Talk by Mayank Tiwari, Technical Consulting Engineer, Intel Software on the topic "Ready access to high performance Python with Intel Distribution for Python 2018" at AWS Community Day, Bangalore 2018
Describes how Clear Linux OS is designed, highlighting core features, operating models, and foundational tools that are key to understanding how the distro operates.
Tendências da junção entre Big Data Analytics, Machine Learning e Supercomput...Igor José F. Freitas
O objetivo desta palestra é apresentar aos desenvolvedores como o universo da Computação de Alto Desempenho (computação paralela) está se tornando cada vez mais acessível e se democratizando nos softwares de Big Data e Inteligência Artificial. Supercomputadores que até pouco tempo eram utilizados apenas em indústrias de nicho, setores do governo e pela ciência, estão contribuindo para a solução de grandes desafios da sociedade, da indústria e da ciência. Esta palestra terá uma abordagem técnica envolvendo conceitos de software e hardware com o intuito de provocar o desenvolvedor a fazer uso de grandes servidores para desenvolverem aplicações inovadoras.
Hetergeneous Compute with Standards Based OFI/MPI/OpenMP ProgrammingIntel® Software
Discover, extend, and modernize your current development approach for hetergeneous compute with standards-based OpenFabrics Interfaces* (OFI), message passing interface (MPI), and OpenMP* programming methods on Intel® Xeon Phi™ processors.
Similar to Scaling python to_hpc_big_data-maidanov (20)
Into the Box Keynote Day 2: Unveiling amazing updates and announcements for modern CFML developers! Get ready for exciting releases and updates on Ortus tools and products. Stay tuned for cutting-edge innovations designed to boost your productivity.
Developing Distributed High-performance Computing Capabilities of an Open Sci...Globus
COVID-19 had an unprecedented impact on scientific collaboration. The pandemic and its broad response from the scientific community has forged new relationships among public health practitioners, mathematical modelers, and scientific computing specialists, while revealing critical gaps in exploiting advanced computing systems to support urgent decision making. Informed by our team’s work in applying high-performance computing in support of public health decision makers during the COVID-19 pandemic, we present how Globus technologies are enabling the development of an open science platform for robust epidemic analysis, with the goal of collaborative, secure, distributed, on-demand, and fast time-to-solution analyses to support public health.
Unleash Unlimited Potential with One-Time Purchase
BoxLang is more than just a language; it's a community. By choosing a Visionary License, you're not just investing in your success, you're actively contributing to the ongoing development and support of BoxLang.
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoamtakuyayamamoto1800
In this slide, we show the simulation example and the way to compile this solver.
In this solver, the Helmholtz equation can be solved by helmholtzFoam. Also, the Helmholtz equation with uniformly dispersed bubbles can be simulated by helmholtzBubbleFoam.
Enhancing Research Orchestration Capabilities at ORNL.pdfGlobus
Cross-facility research orchestration comes with ever-changing constraints regarding the availability and suitability of various compute and data resources. In short, a flexible data and processing fabric is needed to enable the dynamic redirection of data and compute tasks throughout the lifecycle of an experiment. In this talk, we illustrate how we easily leveraged Globus services to instrument the ACE research testbed at the Oak Ridge Leadership Computing Facility with flexible data and task orchestration capabilities.
Globus Connect Server Deep Dive - GlobusWorld 2024Globus
We explore the Globus Connect Server (GCS) architecture and experiment with advanced configuration options and use cases. This content is targeted at system administrators who are familiar with GCS and currently operate—or are planning to operate—broader deployments at their institution.
Code reviews are vital for ensuring good code quality. They serve as one of our last lines of defense against bugs and subpar code reaching production.
Yet, they often turn into annoying tasks riddled with frustration, hostility, unclear feedback and lack of standards. How can we improve this crucial process?
In this session we will cover:
- The Art of Effective Code Reviews
- Streamlining the Review Process
- Elevating Reviews with Automated Tools
By the end of this presentation, you'll have the knowledge on how to organize and improve your code review proces
May Marketo Masterclass, London MUG May 22 2024.pdfAdele Miller
Can't make Adobe Summit in Vegas? No sweat because the EMEA Marketo Engage Champions are coming to London to share their Summit sessions, insights and more!
This is a MUG with a twist you don't want to miss.
Software Engineering, Software Consulting, Tech Lead.
Spring Boot, Spring Cloud, Spring Core, Spring JDBC, Spring Security,
Spring Transaction, Spring MVC,
Log4j, REST/SOAP WEB-SERVICES.
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERRORTier1 app
Even though at surface level ‘java.lang.OutOfMemoryError’ appears as one single error; underlyingly there are 9 types of OutOfMemoryError. Each type of OutOfMemoryError has different causes, diagnosis approaches and solutions. This session equips you with the knowledge, tools, and techniques needed to troubleshoot and conquer OutOfMemoryError in all its forms, ensuring smoother, more efficient Java applications.
Navigating the Metaverse: A Journey into Virtual Evolution"Donna Lenk
Join us for an exploration of the Metaverse's evolution, where innovation meets imagination. Discover new dimensions of virtual events, engage with thought-provoking discussions, and witness the transformative power of digital realms."
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Globus
The Earth System Grid Federation (ESGF) is a global network of data servers that archives and distributes the planet’s largest collection of Earth system model output for thousands of climate and environmental scientists worldwide. Many of these petabyte-scale data archives are located in proximity to large high-performance computing (HPC) or cloud computing resources, but the primary workflow for data users consists of transferring data, and applying computations on a different system. As a part of the ESGF 2.0 US project (funded by the United States Department of Energy Office of Science), we developed pre-defined data workflows, which can be run on-demand, capable of applying many data reduction and data analysis to the large ESGF data archives, transferring only the resultant analysis (ex. visualizations, smaller data files). In this talk, we will showcase a few of these workflows, highlighting how Globus Flows can be used for petabyte-scale climate analysis.
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus
As part of the DOE Integrated Research Infrastructure (IRI) program, NERSC at Lawrence Berkeley National Lab and ALCF at Argonne National Lab are working closely with General Atomics on accelerating the computing requirements of the DIII-D experiment. As part of the work the team is investigating ways to speedup the time to solution for many different parts of the DIII-D workflow including how they run jobs on HPC systems. One of these routes is looking at Globus Compute as a way to replace the current method for managing tasks and we describe a brief proof of concept showing how Globus Compute could help to schedule jobs and be a tool to connect compute at different facilities.
Experience our free, in-depth three-part Tendenci Platform Corporate Membership Management workshop series! In Session 1 on May 14th, 2024, we began with an Introduction and Setup, mastering the configuration of your Corporate Membership Module settings to establish membership types, applications, and more. Then, on May 16th, 2024, in Session 2, we focused on binding individual members to a Corporate Membership and Corporate Reps, teaching you how to add individual members and assign Corporate Representatives to manage dues, renewals, and associated members. Finally, on May 28th, 2024, in Session 3, we covered questions and concerns, addressing any queries or issues you may have.
For more Tendenci AMS events, check out www.tendenci.com/events
Quarkus Hidden and Forbidden ExtensionsMax Andersen
Quarkus has a vast extension ecosystem and is known for its subsonic and subatomic feature set. Some of these features are not as well known, and some extensions are less talked about, but that does not make them less interesting - quite the opposite.
Come join this talk to see some tips and tricks for using Quarkus and some of the lesser known features, extensions and development techniques.
SOCRadar Research Team: Latest Activities of IntelBrokerSOCRadar
The European Union Agency for Law Enforcement Cooperation (Europol) has suffered an alleged data breach after a notorious threat actor claimed to have exfiltrated data from its systems. Infamous data leaker IntelBroker posted on the even more infamous BreachForums hacking forum, saying that Europol suffered a data breach this month.
The alleged breach affected Europol agencies CCSE, EC3, Europol Platform for Experts, Law Enforcement Forum, and SIRIUS. Infiltration of these entities can disrupt ongoing investigations and compromise sensitive intelligence shared among international law enforcement agencies.
However, this is neither the first nor the last activity of IntekBroker. We have compiled for you what happened in the last few days. To track such hacker activities on dark web sources like hacker forums, private Telegram channels, and other hidden platforms where cyber threats often originate, you can check SOCRadar’s Dark Web News.
Stay Informed on Threat Actors’ Activity on the Dark Web with SOCRadar!
SOCRadar Research Team: Latest Activities of IntelBroker
Scaling python to_hpc_big_data-maidanov
1.
2. Scaling Out Python* To
HPC and Big Data
Sergey Maidanov
Software Engineering Manager for
Intel® Distribution for Python*
3. What Problems We Solve: Scalable Performance
Make Python usable beyond prototyping environment by
scaling out to HPC and Big Data environments
4. What Problems We Solve: Ease of Use
4
“Any articles I found on your site that
related to actually using the MKL for
compiling something were overly
technical. I couldn't figure out what
the heck some of the things were
doing or talking about.“ – Intel® Parallel Studio
2015 Beta Survey Response
https://software.intel.com/en-us/forums/intel-math-kernel-library/topic/280832
https://software.intel.com/en-us/articles/building-numpyscipy-with-intel-mkl-and-intel-fortran-on-windows
https://software.intel.com/en-us/articles/numpyscipy-with-intel-mkl
5. Why Yet Another Python Distribution?
Intel® Xeon® Processors Intel® Xeon Phi™ Product Family
Configuration Info: apt/atlas: installed with apt-get, Ubuntu 16.10, python 3.5.2, numpy 1.11.0, scipy 0.17.0; pip/openblas: installed with pip, Ubuntu 16.10, python 3.5.2, numpy 1.11.1, scipy 0.18.0; Intel Python: Intel Distribution for Python 2017;. Hardware: Xeon: Intel
Xeon CPU E5-2698 v3 @ 2.30 GHz (2 sockets, 16 cores each, HT=off), 64 GB of RAM, 8 DIMMS of 8GB@2133MHz; Xeon Phi: Intel Intel® Xeon Phi™ CPU 7210 1.30 GHz, 96 GB of RAM, 6 DIMMS of 16GB@1200MHz
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and
functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with
other products. * Other brands and names are the property of their respective owners. Benchmark Source: Intel Corporation
Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations.
Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not
specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804 .
Mature AVX2 instructions based product New AVX512 instructions based product
6. Scaling To HPC & Big Data Environments
• Hardware and software efficiency crucial in production (Perf/Watt, etc.)
• Efficiency = Parallelism
– Instruction Level Parallelism with effective memory access patterns
– SIMD
– Multi-threading
– Multi-node
Roofline Performance Model*
Arithmetic Intensity
SpMVBLAS1
Stencils
FFT
BLAS3 Particle
Methods
Low High
Gflop/s
Peak Gflop/s
* Roofline Performance Model https://crd.lbl.gov/departments/computer-science/PAR/research/roofline/
7. Efficiency = Parallelism and Python?
• CPython as interpreter inhibits parallelism but…
• … Overall Python tools evolved far toward unlocking parallelism
Native extensions
numpy*, scipy*, scikit-
learn* accelerated with
Intel® MKL, Intel® DAAL,
Intel® IPP
Composable multi-
threading with Intel®
TBB and Dask*
Multi-node
parallelism with
mpi4py* accelerated
with Intel® MPI*
Language extensions
for vectorization &
multi-threading
(Cython*, Numba*)
Integration with Big Data
platforms and Machine
Learning frameworks
(pySpark*, Theano*,
TensorFlow*, etc.)
Mixed language
profiling with Intel®
VTune™ Amplifier
8. Composable Multi-Threading
With Intel® TBB
• Amhdal’s law suggests extracting parallelism at all levels
• If software components do not coordinate on threads
use it may lead to oversubscription
• Intel TBB dynamically balances HW thread loads and
effectively manages oversubscription
• Intel engineers extended Cpython* and Numba* thread
pools with support of Intel® TBB
>python –m TBB myapp.py
Application
Component 1
Component N
Subcomponent 1
Subcomponent 2
Subcomponent K
Subcomponent 1
Subcomponent M
Subcomponent
1
Subcomponent
1
Subcomponent
1
Subcomponent
1
Subcomponent
1
Subcomponent
1
Subcomponent
1
Subcomponent
1
MKL
TBB
DAAL
pyDAAL
NumPy
SciPy TBB
Joblib
Dask
Application
PythonpackagesNativelibs
9. Composable Multi-Threading Example:
Batch QR Performance
Numpy
1.00x
Numpy
0.22x
Numpy
0.47x
Dask
0.61x
Dask
0.89x
Dask
1.46x
0.0x
0.2x
0.4x
0.6x
0.8x
1.0x
1.2x
1.4x
Default MKL Serial MKL Intel® TBB
Speedup relative to Default Numpy*
Intel® MKL,
OpenMP* threading
Intel® MKL,
Serial
Intel® MKL,
Intel® TBB threading
Over-
subscription
App-level
parallelism
only
TBB-
composable
nested
parallelism
System info: 32x Intel(R) Xeon(R) CPU E5-2698 v3 @ 2.30GHz, disabled HT, 64GB RAM; Intel(R) MKL 2017.0 Beta Update 1 Intel(R) 64 architecture, Intel(R) AVX2;
Intel(R)TBB 4.4.4; Ubuntu 14.04.4 LTS; Dask 0.10.0; Numpy 1.11.0.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark
and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause
the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the
performance of that product when combined with other products. * Other brands and names are the property of their respective owners. Benchmark Source:
Intel Corporation
Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel
microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability,
functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are
intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the
applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804 .
10. NumPy* and SciPy* Continuous Improvement
• MKL-level performance for dense linear algebra and FFT
• NumPy random, umath & NumExpr* exploit SIMD and multi-threading
out-of-the-box
• Better memory allocation and copy in NumPy
0
10
20
30
40
50
1024 2048 4096 8192 16384
AxisTitle
Black Scholes Formula
NumPy* umath scalability
Numpy - U1+TBB
Numpy - U1-opt+TBB
Intel Python 2017U1
Intel Python 2017U1.1
4X
See system configuration details on
the next slide
MOPS
Does not
scale beyond
16K options
11. NumExpr* Scalability: Black-Scholes Formula
System info: 32x Intel® Xeon® CPU E5-2698 v3 @ 2.30GHz, disabled HT, 64GB RAM; Intel® Distribution for Python* 2017 Update
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance
tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions.
Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you
in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. * Other
brands and names are the property of their respective owners. Benchmark Source: Intel Corporation
Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are
not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does
not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel.
Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific
to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more
information regarding the specific instruction sets covered by this notice. Notice revision #20110804 .
0
100
200
300
400
500
600
700
800
900
1024
2048
4096
8192
16384
32768
65536
131072
262144
524288
1048576
2097152
4194304
8388608
16777216
33554432
67108864
MOPS
Black-Scoles Formula with NumExpr*
What’s a
magic?
13. Skt-Learn* Optimizations With Intel® MKL... And Intel® DAAL
0x
1x
2x
3x
4x
5x
6x
7x
8x
9x
Approximate
neighbors
Fast K-means GLM GLM net LASSO Lasso path Least angle
regression,
OpenMP
Non-negative
matrix
factorization
Regression by
SGD
Sampling
without
replacement
SVD
Speedups of Scikit-Learn Benchmarks
Intel® Distribution for Python* 2017 Update 1 vs. system Python & NumPy/Scikit-Learn
System info: 32x Intel® Xeon® CPU E5-2698 v3 @ 2.30GHz, disabled HT, 64GB RAM; Intel® Distribution for Python* 2017 Gold; Intel® MKL 2017.0.0; Ubuntu 14.04.4 LTS; Numpy 1.11.1; scikit-learn 0.17.1.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any
change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. * Other brands
and names are the property of their respective owners. Benchmark Source: Intel Corporation
Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not
guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel
microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804 .
Effect of Intel MKL
optimizations for
NumPy* and SciPy*
1 1.11
54.13
0x
10x
20x
30x
40x
50x
60x
System Sklearn Intel SKlearn Intel PyDAAL
Speedup
Potential Speedup of Scikit-learn* due to
PyDAAL
PCA, 1M Samples, 200 Features
Effect of DAAL
optimizations for
Scikit-Learn*
Intel® Distribution for Python* ships Intel® Data
Analytics Acceleration Library with Python
interfaces, a.k.a. pyDAAL
14. Ideas Behind Intel® DAAL: Heterogeneous Analytics
• Data is different, data analytics pipeline is the same
• Data transfer between devices is costly, protocols are different
– Need data analysis proximity to Data Source
– Need data analysis proximity to Client
– Data Source device ≠ Client device
– Requires abstraction from communication protocols
Pre-processing Transformation Analysis Modeling Decision Making
Decompression,
Filtering, Normalization
Aggregation,
Dimension Reduction
Summary Statistics
Clustering, etc.
Machine Learning (Training)
Parameter Estimation
Simulation
Forecasting
Decision Trees, etc.
Scientific/Engineering
Web/Social
Business
Compute (Server, Desktop, … ) Client EdgeData Source Edge
Validation
Hypothesis testing
Model errors
15. Ideas Behind Intel® DAAL: Effective Data Management,
Streaming and Distributed Processing
…
Observations, n
Time
Memory
Capacity
Categorical
Blank/Missing
Numeric
Features,p
Big Data Attributes Computational Solution
Distributed across different devices •Distributed processing with communication-avoiding
algorithms
Huge data size not fitting into device
memory
•Distributed processing
•Streaming algorithms
Data coming in time •Data buffering & asynchronous computing
•Streaming algorithms
Non-homogeneous data •CategoricalNumeric (counters, histograms, etc)
•Homogeneous numeric data kernels
• Conversions, Indexing, Repacking
Sparse/Missing/Noisy data •Sparse data algorithms
•Recovery methods (bootstrapping, outlier correction)
Outlier
16. Ideas Behind Intel® DAAL: Storage & Compute
• Optimizing storage ≠ optimizing compute
– Storage: efficient non-homogeneous data encoding for smaller footprint and faster retrieval
– Compute: efficient memory layout, homogeneous data, contiguous access
– Easier manageable for traditional HPC, much more challenging for Big Data
…
Samples, n
Variables,p
Storage ComputeMemory
Filtering,
conversions,
basic statistics
Data homogenization
and blocking
DAAL DataSource DAAL NumericTable DAAL Algorithm
SIMD
ymm1
ymm3
ymm2
17. Ideas Behind Intel® DAAL: Languages & Platforms
DAAL has multiple programming language bindings
• C++ – ultimate performance for real-time analytics with DAAL
• Java*/Scala* – easy integration with Big Data platforms (Hadoop*, Spark*, etc)
• Python* – advanced analytics for data scientist
18. 50%
60%
70%
80%
90%
100%
110%
120%
130%
pyDAAL (Offline) pyDAAL (Offline+Basic
Statistics)
pyDAAL (Online)
10% faster if
basic statistics
pre-computed
while reading
dataset
25%
incremental
cost for
streaming
computing
support
Ex 1. Compute Basic Statistics While Reading Dataset
Ex 2. Offline vs. Online
• 𝑐𝑜𝑣 𝑋, 𝑌 = 𝐸((𝑋 − 𝐸 𝑋 )(𝑌 − 𝐸 𝑌 )
• Algorithm: Maximum Likelihood Estimator
• Dataset: CSV file n=1M, p=100
…
Samples, n
Variables,p
Storage ComputeMemory
Basic statistics Data blocking
SIMD
ymm1
ymm3
ymm2
System Info: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz, 504GB, 2x24 cores, HT=on, OS RH7.2 x86_64, Intel Distribution
for Python 2017 Update 1 (Python 3.5)
Software and workloads used in performance tests may have been optimized for performance only on Intel
microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems,
components, software, operations and functions. Any change to any of those factors may cause the results to vary. You
should consult other information and performance tests to assist you in fully evaluating your contemplated purchases,
including the performance of that product when combined with other products. * Other brands and names are the
property of their respective owners. Benchmark Source: Intel Corporation
Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for
optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction
sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization
on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for
use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel
microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the
specific instruction sets covered by this notice. Notice revision #20110804 .
Better memory management in pyDAAL allows
bigger datasets even for offline processing
• Ex.: Scikit-learn fails to process n=50M, p=1000 while
pyDAAL successfully completes work
Online processing removes dataset size limitation
19. Ex 3: Read vs. Compute
• Algorithm: SVM Classification with RBF kernel
• Training dataset: CSV file (PCA-preprocessed MNIST, 40
principal components) n=42000, p=40
• Testing dataset: CSV file (PCA-preprocessed MNIST, 40
principal components) n=28000, p=40
System Info: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz, 504GB, 2x24 cores, HT=on, OS RH7.2 x86_64, Intel
Distribution for Python 2017 Update 1 (Python 3.5)
Software and workloads used in performance tests may have been optimized for performance only on Intel
microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific
computer systems, components, software, operations and functions. Any change to any of those factors
may cause the results to vary. You should consult other information and performance tests to assist you in
fully evaluating your contemplated purchases, including the performance of that product when combined
with other products. * Other brands and names are the property of their respective owners. Benchmark
Source: Intel Corporation
Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel
microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include
SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability,
functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel.
Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors.
Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please
refer to the applicable product User and Reference Guides for more information regarding the specific
instruction sets covered by this notice. Notice revision #20110804 .
0
5
10
15
20
25
Scikit-Learn, Pandas pyDAAL
Training (sec)
Read Training Dataset (incl. labels) Training Compute
0
5
10
15
20
25
Scikit-Learn, Pandas pyDAAL
Prediction (sec)
Read Test Dataset Prediction Compute
60% faster CVS file
read, 2.2x faster
training
66x faster prediction
in pyDAAL. Read and
compute balanced
20. Distributed Parallelism
• Intel® MPI* accelerates Intel®
Distribution for Python (mpi4py*,
ipyparallel*)
• Intel Distribution for Python also
supports
– PySpark* - Python* interfaces for Spark*, an
engine for large-scale data processing
– Dask* - flexible parallel computing library for
numerical computing 1.7x 2.2x 3.0x 5.3x
0x
1x
2x
3x
4x
5x
6x
2 nodes 4 nodes 8 nodes 16 nodes
PyDAAL Implicit ALS with Mpi4Py*
Scales with MPI, Spark, Dask and
other distributed computing
engines
Configuration Info: Hardware (each node): Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz, 2x18 cores, HT is ON, RAM 128GB; Versions: Oracle Linux Server 6.6, Intel® DAAL 2017 Gold, Intel® MPI 5.1.3; Interconnect: 1 GB Ethernet.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions.
Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. *
Other brands and names are the property of their respective owners. Benchmark Source: Intel Corporation
Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel
does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to
Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804 .
21. Summary And Call To Action
• Intel created the Python* distribution for out-of-the-box performance and
scalability on Intel® Architecture
– With minimum to no code modification Python aims to scale
• Multiple technologies applied to unlock parallelism at all levels
– Numerical libraries, libraries for parallelism, Python code compilation/JITing, profiling
– Enhancing mature Python packages and bringing new technologies, e.g. pyDAAL, TBB
• With multiple choices available Python developer needs to be conscious what
will scale best
– Intel® VTune™ Amplifier helps making conscious decisions
Intel Distribution for Python is free!
https://software.intel.com/en-us/intel-distribution-for-python
Commercial support included for Intel® Parallel Studio XE customers!
Easy to install with Anaconda* https://anaconda.org/intel/