This issue’s feature article, Tuning Autonomous Driving Using Intel® System Studio, illustrates how the tools in Intel System Studio give embedded systems and connected device developers an integrated development environment to build, debug, and tune performance and power usage. Continuing the theme of tuning edge applications, Building Fast Data Compression Code for Cloud and Edge Applications shows how to use the Intel® Integrated Performance Primitives to speed data compression.
MobiCloud: Towards Cloud Mobile Hybrid Application Generation Using Semantically Enriched Domain Specific Languages - Amit Sheth
Ajith Ranabahu, Amit Sheth, Ashwin Manjunatha, and Krishnaprasad Thirunarayan, 'Towards Cloud Mobile Hybrid Application Generation using Semantically Enriched Domain Specific Languages', International Workshop on Mobile Computing and Clouds (MobiCloud 2010), Santa Clara, CA, October 28, 2010.
Paper: http://knoesis.org/library/resource.php?id=865
Project: http://knoesis.wright.edu/research/srl/projects/mobi-cloud/
For the full video of this presentation, please visit:
https://www.embedded-vision.com/platinum-members/embedded-vision-alliance/embedded-vision-training/videos/pages/sep-2019-alliance-vitf-cocoonhealth
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Pavan Kumar, Co-founder and CTO of Cocoon Health (formerly Cocoon Cam), delivers the presentation "Edge/Cloud Tradeoffs and Scaling a Consumer Computer Vision Product" at the Embedded Vision Alliance's September 2019 Vision Industry and Technology Forum. Kumar explains how his company is evolving its use of edge and cloud vision computing in continuing to bring new capabilities to the product category of baby monitors.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2021/09/may-2021-embedded-vision-summit-opening-remarks-may-28/
Jeff Bier, Founder of the Edge AI and Vision Alliance, welcomes attendees to the May 2021 Embedded Vision Summit on May 28, 2021. Bier provides an overview of the edge AI and vision market opportunities, challenges, solutions and trends. He also introduces the Edge AI and Vision Alliance and the resources it offers for both product creators and members, and reviews the day’s agenda and other logistics.
For the full video of this presentation, please visit:
https://www.edge-ai-vision.com/2020/03/market-analysis-on-socs-for-imaging-vision-and-deep-learning-in-automotive-and-mobile-markets-a-presentation-from-yole-developpement/
For more information about edge AI and vision, please visit:
http://www.edge-ai-vision.com
John Lorenz, Market and Technology Analyst for Computing and Software at Yole Développement, delivers the presentation “Market Analysis on SoCs for Imaging, Vision and Deep Learning in Automotive and Mobile Markets” at the Edge AI and Vision Alliance’s March 2020 Vision Industry and Technology Forum. Lorenz presents Yole Développement’s latest analysis on the evolution of SoCs for imaging, vision and deep learning.
For the full video of this presentation, please visit:
https://www.edge-ai-vision.com/2020/11/getting-efficient-dnn-inference-performance-is-it-really-about-the-tops-a-presentation-from-intel/
For more information about edge AI and computer vision, please visit:
https://www.edge-ai-vision.com
Gary Brown, Director of AI Marketing at Intel, presents the “Getting Efficient DNN Inference Performance: Is It Really About the TOPS?” tutorial at the September 2020 Embedded Vision Summit.
This presentation looks at how performance is measured among deep learning inference platforms, starting with the simple peak TOPS metric, why it’s used and why it might be misleading. Brown looks at compute efficiency as measured by real benchmark workload performance and how it relates to peak TOPS, comparing performance across Intel’s inference platforms. He also discusses how developers can use Intel’s DevCloud for the Edge to quickly access Intel’s inference platforms.
Unity XR platform has a new architecture – Unite Copenhagen 2019 (Unity Technologies)
Unity developed a new architecture that improves support for existing and future augmented reality (AR) and virtual reality (VR) platforms. Learn about the technology under the hood, the consequent benefits and improvements to the platform, and how it impacts your workflows in creating AR/VR experiences.
Speakers: Mike Durand, Matt Fuad - Unity
Watch the session on YouTube: https://youtu.be/Stqk1GxlSK0
Applications built with Delphi and C++ Builder for the Windows platform have proven to be indispensable instruments for businesses, but rewriting them for the cloud is often cost-prohibitive. rollApp offers a cloud platform that runs existing desktop applications in the cloud without any need to modify them. In this webinar you will learn how to move your application to the cloud and offer the benefits of a cloud solution to your users in a matter of weeks.
The growth of consumer electronics (CE) and its critical role in various advances, including sensors, displays, machine intelligence, automotive, and other areas. Invited lecture at Oxford.
Bridge-Stage Framework for the Smartphone Application Development using HTML5 (ijsrd.com)
Nowadays, the Web has become an integral part of our everyday lives. The rapid growth of the smartphone market has brought the Web from our home desks to wherever we are, enabling us to access this vast source of information at any time. The mobile operating systems (OS) used by modern smartphones are highly diverse: Google's Android, Apple's iOS, Microsoft's Windows Phone, and so on. Smartphone application development is done on each native platform: iPhone using Objective-C, Android using Java, Windows Mobile using C#, and so on. Therefore, a bridge-stage framework that supports 'write once and deploy everywhere' is needed for smartphone application development. This paper presents an HTML5-based bridge-stage framework that uses PhoneGap and WebKit to support the development of smartphone applications written as Web applications. A big problem with developing applications for mobile devices is platform fragmentation [6]: there are many different mobile platforms, further divided by the different versions available [5][2]. Users with older hardware are left without support and updates as newer devices reach the market [9]. The developer therefore must choose between limiting the solution to a minor part of the spectrum or developing for more platforms to reach as many users as possible. To maximize the number of potential users, the developer has to create an application for each platform and make sure the applications are backwards compatible so that users with older devices can use them.
For the full video of this presentation, please visit:
https://www.edge-ai-vision.com/2020/11/acceleration-of-deep-learning-using-openvino-3d-seismic-case-study-a-presentation-from-intel/
For more information about edge AI and computer vision, please visit:
https://www.edge-ai-vision.com
Manas Pathak, Global AI Lead for Oil and Gas at Intel, presents the “Acceleration of Deep Learning Using OpenVINO: 3D Seismic Case Study” tutorial at the September 2020 Embedded Vision Summit.
The use of deep learning for automatic seismic data interpretation is gaining the attention of many researchers across the oil and gas industry. The integration of high-performance computing (HPC) AI workflows in seismic data interpretation brings the challenge of moving and processing large amounts of data from HPC to AI computing solutions and vice-versa.
In this presentation, Pathak illustrates this challenge via a case study using a public deep learning model for salt identification applied on a 3D seismic survey from the F3 Dutch block in the North Sea. He presents a workflow to address this challenge and perform accelerated AI on seismic data. The Intel Distribution of OpenVINO toolkit was used to increase the inference performance of a pre-trained model on an Intel CPU. OpenVINO allows CPU users to get significant improvement in AI inference performance for high memory capacity deep learning models used on large datasets without any significant loss in accuracy.
This session was held by Vladimir Brenner, Partner Account Manager, Disruptors & AI, Intel AI at the Dive into H2O: London training on June 17, 2019.
Please find the recording here: https://youtu.be/60o3eyG5OLM
Deep Learning Neural Network Acceleration at the Edge - Andrea Gallo (Linaro)
Short
The growing amount of data captured by sensors, together with real-time constraints, means that not only big data analytics but also machine learning (ML) inference must be executed at the edge. The multiple options for neural network acceleration on Arm-based platforms provide an unprecedented opportunity for new intelligent devices. They also raise the risk of fragmentation and duplication of effort when multiple frameworks must support multiple accelerators.
Andrea Gallo, Linaro VP of Segment Groups, will summarise the existing NN frameworks and accelerator solutions, and will describe the efforts underway in the Arm ecosystem.
Abstract
The dramatically growing amount of data captured by sensors and the ever more stringent latency and real-time requirements are paving the way for edge computing, which implies that not only big data analytics but also machine learning (ML) inference must be executed at the edge. The multiple options for neural network acceleration on recent Arm-based platforms provide an unprecedented opportunity for new intelligent devices with ML inference. They also raise the risk of fragmentation and duplication of effort when multiple frameworks must support multiple accelerators.
Andrea Gallo, Linaro VP of Segment Groups, will summarise the existing NN frameworks, model description formats, accelerator solutions, and low-cost development boards, and will describe the efforts underway to identify the best technologies to improve consolidation and enable a competitive, innovative advantage for all vendors.
Audience
The session will be useful for everyone from executives to engineers. Executives will gain a deeper understanding of the issues and opportunities. Engineers at NN acceleration IP design houses will take away ideas for how to collaborate with the open source community in their area of expertise, and how to evaluate performance and accelerate multiple NN frameworks without modifying them for each new IP, whether targeting edge computing gateways, smart devices, or simple microcontrollers.
Benefits to the Ecosystem
The AI deep learning neural network ecosystem is just getting started, and it faces the same open source dynamics that GPU and video accelerators did in their early days: user-space drivers, binary blobs, proprietary APIs, and every possible means of protecting IP. The session will outline a proposal for a collaborative ecosystem effort to create a common framework that manages multiple NN accelerators while avoiding the need to modify deep learning frameworks through multiple forks.
oneAPI: Industry Initiative & Intel Product (Tyrone Systems)
With the growth of AI, machine learning, and data-centric applications, the industry needs a programming model that allows developers to take advantage of rapid innovation in processor architectures. TensorFlow supports the oneAPI industry initiative and its standards-based open specification.
oneAPI complements TensorFlow’s modular design and provides increased choice of hardware vendor and processor architecture, and faster support of next-generation accelerators. TensorFlow uses oneAPI today on Xeon processors and we look forward to using oneAPI to run on future Intel architectures.
The Wireless Remote Control Car Based on ARM9 (IOSR Journals)
Abstract: The Internet of Things (IoT) is of great importance in promoting and guiding the development of information technology and the economy. At present, applications of the IoT are developing rapidly, but due to the special requirements of some applications, existing technology cannot meet them very well. Much research work is being done to build the IoT. Wi-Fi-based wireless remote control offers high bandwidth and data rates, non-line-of-sight transmission, large-scale data collection, and high cost-effectiveness, and it supports video monitoring, which cannot be realized with simple RF links. Research on a Wi-Fi-based remote control car therefore has high practical significance for the development of the Internet of Things. Building on current research into applications of Wi-Fi, this paper discusses controlling the car using a Wi-Fi module, with its condition monitored through a remote PC or laptop that supports Wi-Fi. The PC or laptop interface has two tabs: the first tab monitors the car's condition, and the second tab presents four buttons that steer the car forward, backward, left, and right. Keywords: S3C2440 (ARM9), Wi-Fi module, camera, DC motors with driver IC, and laptop with Wi-Fi module.
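The four-button scheme described above amounts to mapping each button to a drive state for the two DC motors. The sketch below illustrates that mapping; the names and drive values are hypothetical, not taken from the paper, and real firmware on the S3C2440 board would translate these pairs into motor-driver IC signals.

```python
# Hypothetical mapping from control buttons to (left motor, right motor)
# drive directions: +1 forward, -1 reverse, 0 stop. Differential drive:
# opposite directions turn the car in place.
COMMANDS = {
    "forward":  (+1, +1),  # both motors forward
    "backward": (-1, -1),  # both motors reverse
    "left":     (-1, +1),  # left reverses, right forwards: turn left
    "right":    (+1, -1),  # right reverses, left forwards: turn right
}

def handle_button(button: str) -> tuple:
    """Return the motor drive pair for a button press; stop on unknown input."""
    return COMMANDS.get(button, (0, 0))
```

On the laptop side, each button press in the second tab would send one of these command names over the Wi-Fi link for the board to dispatch.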
Streamline End-to-End AI Pipelines with Intel, Databricks, and OmniSci (Intel® Software)
Preprocess, visualize, and build AI faster at scale on Intel architecture. Develop end-to-end AI pipelines for inference, including data ingestion, preprocessing, and model inference with tabular, NLP, RecSys, video, and image data, using the Intel oneAPI AI Analytics Toolkit and other optimized libraries. Build performant pipelines at scale with Databricks and end-to-end Xeon optimizations. Learn how to visualize with the OmniSci Immerse platform and experience a live demonstration of the Intel Distribution of Modin and OmniSci.
Recover 30% of your day with IBM Development Tools (Smarter Mainframe Develop...) - Susan Yoskin
If you need to attract new developers, and want to keep your company’s name out of the headlines, then this session is for you. When your business depends on your mainframe apps working and performing well—all the time—you need to be alerted to issues as they occur and have the tools to help you find and fix the problems and test your solutions before disaster strikes (we’ve all been in those late night and weekend drills). You also need to continue supporting these applications for years to come, and that will require new talent.
This session will introduce you to the development environments that college grads are already comfortable with, and help your applications become more resilient at the same time. We’ll walk you through the tools to help you accomplish all of this and demo some scenarios to show you how efficiently our tools can perform the tasks that slow you down.
Evolving Services Into a Cloud Native World (Iain Hull)
How Workday manages stateful services with a custom controller on Kubernetes.
Conference talk for CloudNative London 2018
https://skillsmatter.com/skillscasts/12106-evolving-services-into-the-cloud-native-world-how-workday-manage-stateful-services-with-a-custom-controller-on-kubernetes
Kubernetes and declarative infrastructure greatly simplify the way we deploy and manage software. Most services can be orchestrated with the control loops supplied by Kubernetes (deployments, stateful sets or jobs). Some stateful services in Workday require more advanced orchestration, and re-architecting them is not an easy option.
In this talk you will discover why some of our services require extra orchestration and how we evolved an existing service into a control loop on top of Kubernetes. The control loop organises multiple services into groups; these are dynamically created, deleted, and scaled. It also orchestrates blue/green deployments of each group. Now we can adopt more Kubernetes features and retire some of our old scheduling code. Finally, you will learn the process we follow to evaluate and design our own control loops, and when you might find them useful.
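The control loop the talk describes follows the standard Kubernetes reconcile pattern: compare the desired set of groups against what is actually running and emit the actions needed to converge. The sketch below illustrates that pattern in miniature; the function and action names are ours, not Workday's code, and a real controller would apply these actions via the Kubernetes API and re-run on every change event.

```python
# Minimal reconcile sketch: desired and actual both map a group name to
# its replica count; the output is the list of actions that converges
# actual toward desired.
def reconcile(desired: dict, actual: dict) -> list:
    actions = []
    for group, replicas in desired.items():
        if group not in actual:
            actions.append(("create", group, replicas))   # missing group
        elif actual[group] != replicas:
            actions.append(("scale", group, replicas))    # wrong size
    for group in actual:
        if group not in desired:
            actions.append(("delete", group))             # no longer wanted
    return actions
```

Because the loop only compares states and emits deltas, it is idempotent: running it again after the actions are applied yields an empty action list.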
Bio
Iain is a principal software engineer at Workday using Kubernetes and Scala to deliver their next generation elastic grid. His twin passions are large scale distributed computing and applying clean code to complex problems. He is interested in good design and how this can improve system reliability and reduce friction during development.
He loves sharing his experiences as he learns and builds new systems. He regularly speaks at local meetups in Dublin and has presented at conferences including GotoConf, Scala Days, Functional Kats and Lambda World.
Session 1908: Connecting Devices to the IBM IoT Cloud (Peter Niblett)
IBM MessageSight and the IBM Internet of Things cloud enable connectivity across a wide variety of devices, from existing devices in silos and systems to the wide range of new devices appearing on a daily basis. This session covers patterns of connectivity and how to make them happen, including sending events such as measurements and receiving commands. The session goes into detail on how to use the industry-standard MQ Telemetry Transport (MQTT) protocol to achieve this and covers best practices for topics and message format.
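The event/command split described above is expressed through MQTT topic conventions. The sketch below shows the topic and payload shapes as I understand the IBM IoT conventions (events on `iot-2/evt/...`, commands on `iot-2/cmd/...`, readings wrapped in a `{"d": ...}` JSON envelope); the helper function names are ours, and an actual device would pair these with an MQTT client library to publish and subscribe.

```python
import json

def event_topic(event_id: str, fmt: str = "json") -> str:
    """Topic a device publishes its events (e.g. measurements) to."""
    return f"iot-2/evt/{event_id}/fmt/{fmt}"

def command_topic(command_id: str, fmt: str = "json") -> str:
    """Topic a device subscribes to in order to receive commands."""
    return f"iot-2/cmd/{command_id}/fmt/{fmt}"

def make_event_payload(**readings) -> str:
    """Wrap sensor readings in the conventional {'d': {...}} JSON envelope."""
    return json.dumps({"d": readings})
```

For example, a temperature sensor might publish `make_event_payload(temp=21.5)` to `event_topic("status")` while listening for a reboot command on `command_topic("reboot")`.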
Qt and Ekkono are working together to improve machine learning integration in the embedded development space. Ekkono have their own SDK, built to help developers rapidly deploy edge machine learning to embedded connected devices, allowing for conscious, self-learning, and predictive software. Imagine if all this functionality was easily adoptable into your existing Qt workflows. The possibilities are mind-boggling.
In this webinar you will learn:
• How Ekkono and Qt are paving the way for a streamlined method to implement a machine learning model for anomaly detection within a Qt application
• How to improve workflows between machine learning experts and embedded stakeholders (UI/UX, product managers, embedded developers)
• How the integration between Ekkono's machine learning for the edge and the Qt framework provides a faster iteration and prototyping procedure for all stakeholders in the embedded space (machine learning experts, embedded developers, UI/UX experts)
The AI Index is an independent initiative at the Stanford Institute for Human-Centered Artificial Intelligence (HAI), led by the AI Index Steering Committee, an interdisciplinary group of experts from across academia and industry. The annual report tracks, collates, distills, and visualizes data relating to artificial intelligence, enabling decision-makers to take meaningful action to advance AI responsibly and ethically with humans in mind.
Intel Blockscale ASICs are built for the demanding environment of cryptocurrency mining. Each ASIC has built-in temperature and voltage sensor capabilities. The accelerator can be operated across a range of frequencies, enabling system designers to balance performance and efficiency.
Cryptography Processing with 3rd Gen Intel Xeon Scalable Processors (Desmond Yuen)
Cryptographic operations are amongst the most compute-intensive and critical operations applied to data as it is stored, moved, and processed. Understanding Intel's cryptography processing acceleration is essential to optimizing overall platform workload and service performance.
At Intel, security comes first both in the way we work and in what we work on. Our culture and practices guide everything we build, with the goal of delivering the highest performance and optimal protections. As with previous reports, the 2021 Intel Product Security Report demonstrates our Security First Pledge and our endless efforts to proactively seek out and mitigate security issues.
How can regulation keep up as transformation races ahead? 2022 Global regulat... (Desmond Yuen)
As the pandemic drags into its third year, financial services firms face a range of challenges, from increased operational complexity and an evolving regulatory directive to address environmental and social issues to new forms of competition and evolving technologies, such as digital assets and cryptocurrencies. Banks, insurers, asset managers and other financial services firms (collectively referred to as “firms” in the rest of this document) must innovate more effectively and rapidly to keep up with the pace of change while still identifying emerging risks and building appropriate governance and controls.
NASA Spinoffs Help Fight Coronavirus, Clean Pollution, Grow Food, More (Desmond Yuen)
NASA's mission of exploration requires new technologies, software, and research – which show up in daily life. The agency’s Spinoff 2022 publication tells the stories of companies, start-ups, and entrepreneurs transforming these innovations into cutting-edge products and services that boost the economy, protect the planet, and save lives.
“The value of NASA is not confined to the cosmos but realized throughout our country – from hundreds of thousands of well-paying jobs to world-leading climate science, understanding the universe and our place within it, to technology transfers that make life easier for folks around the world,” NASA Administrator Bill Nelson said. “As we combat the coronavirus pandemic and promote environmental justice and sustainability, NASA technology is essential to address humanity’s greatest challenges.”
Spinoff 2022 features more than 45 companies using NASA technology to advance manufacturing techniques, detoxify polluted soil, improve weather forecasting, and even clean the air to slow the spread of viruses, including coronavirus.
"NASA's technology portfolio contains many innovations that not only enable exploration but also address challenges and improve life here at home," said Jim Reuter, associate administrator of the agency’s Space Technology Mission Directorate (STMD) in Washington. "We’ve captured these examples of successful commercialization of NASA technology and research, not only to share the benefits of the space program with the public, but to inspire the next generation of entrepreneurs."
This year in Spinoff, readers will learn more about:
How companies use information from NASA’s vertical farm to sustainably grow fresh produce
New ways that technology developed for insulation in space keeps people warm in the great outdoors
How a system created for growing plants in space now helps improve indoor air quality and reduces the spread of airborne viruses like coronavirus
How phase-change materials – originally developed to help astronauts wearing spacesuits – absorb, hold, and release heat to help keep race car drivers cool
A Survey on Security and Privacy Issues in Edge Computing-Assisted Internet of Things (Desmond Yuen)
The Internet of Things (IoT) is an innovative paradigm envisioned to provide massive applications that are now part of our daily lives. Millions of smart devices are deployed within complex networks to provide vibrant functionalities including communications, monitoring, and controlling of critical infrastructures. However, this massive growth of IoT devices and the corresponding huge data traffic generated at the edge of the network have created additional burdens on the state-of-the-art centralized cloud computing paradigm due to bandwidth and resource scarcity. Hence, edge computing (EC) is emerging as an innovative strategy that brings data processing and storage near to the end users, leading to what is called EC-assisted IoT. Although this paradigm provides unique features and enhanced quality of service (QoS), it also introduces huge risks in data security and privacy aspects. This paper conducts a comprehensive survey on security and privacy issues in the context of EC-assisted IoT. In particular, we first present an overview of EC-assisted IoT including definitions, applications, architecture, advantages, and challenges. Second, we define security and privacy in the context of EC-assisted IoT. Then, we extensively discuss the major classifications of attacks in EC-assisted IoT and provide possible solutions and countermeasures along with the related research efforts. After that, we further classify some security and privacy issues as discussed in the literature based on security services and on security objectives and functions. Finally, several open challenges and future research directions for the secure EC-assisted IoT paradigm are also extensively provided.
PUTTING PEOPLE FIRST: ITS IS SMART COMMUNITIES AND CITIESDESMOND YUEN
The report covers the benefits, goals, challenges, and success factors associated with smart cities and communities and gives a glimpse of a path forward.
BUILDING AN OPEN RAN ECOSYSTEM FOR EUROPEDESMOND YUEN
Five companies—Deutsche Telekom, Orange, Telecom Italia, Telefónica, and Vodafone—published a report outlining why they feel Europe as a whole is lagging behind other regions such as the U.S. and Japan in developing Open RAN. The companies point to both a lack of companies developing key components, notably silicon chips, for Open RAN technologies, as well as the need to get incumbent equipment vendors Ericsson and Nokia on board with Open RAN development.
An Introduction to Semiconductors and IntelDESMOND YUEN
Did you know that...
The average American adult spends over 12 hours a day engaged with electronics — computers, mobile devices, TVs, cars, to name just a few — powered by semiconductors.
A common chip the size of your smallest fingernail is only about 1-millimeter thick but contains roughly 30 different layers of components and wires (called interconnects) that make up its complex circuitry.
Intel owns nearly 70,000 active patents worldwide. Its first — “Resistor for Integrated Circuit,” #3,631,313 — was granted to Gordon Moore on Dec. 28, 1971.
Those are a few fun facts in a high-level presentation that provides an easy-to-understand look at the world of semiconductors, why they matter and the role Intel plays in their creation.
Changing demographics and economic growth bloomDESMOND YUEN
Demography is destiny” is an oft-cited phrase that suggests the size, growth, and structure of a nation’s population deter mines its long-term social, economic, andpolitical fabric. The phrase highlights the role of
demographics in shaping many complex challenges
and opportunities societies face, including several
pertinent to economic growth and development.
Nevertheless, it is an overstatement to say that
demography determines all, as it downplays the
fact that both demographic trajectories and their
development implications are responsive to economic
incentives; to policy and institutional reforms; and to
changes in technology, cultural norms, and behavior.
The world is undergoing a major demographic
upheaval with three key components: population
growth, changes in fertility and mortality, and
associated changes in population age structure.
Intel Corporation (“Intel”) designs and manufactures
advanced integrated digital technology platforms that power
an increasingly connected world. A platform consists of
a microprocessor and chipset, and may be enhanced by
additional hardware, software, and services. The platforms
are used in a wide range of applications, such as PCs, laptops,
servers, tablets, smartphones, automobiles, automated
factory systems, and medical devices. Intel is also in the midst
of a corporate transformation that has seen its data-centric
businesses capture an increasing share of its revenue.
This report provides economic impact estimates for Intel in terms of employment, labor income, and gross domestic product (“GDP”) for the most recent historical year, 2019.1
Discover how private 5G networks can give enterprises options to enhance services and deliver new use cases with the level of control and investment they want.
Tackle more data science challenges than ever before without the need for discrete acceleration with the 3rd Gen Intel® Xeon® Scalable processors. Learn about the built-in AI acceleration and performance optimizations for popular AI libraries, tools and models.
The document describes how the latest Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instructions and Intel® Advanced Encryption Standard New Instructions (Intel® AES-NI) enabled in the latest Intel® 3rd Generation Xeon® Scalable Processor are used to significantly increase and achieve 1 Tb of IPsec throughput.
"Life and Learning After One-Hundred Years: Trust Is The Coin Of The Realm."DESMOND YUEN
The former secretary of state George Shultz passed away last weekend. He is one of the most influential secretaries of state in US history. Around the time of his hundredth birthday this past December, he published a short book on Trust and Effective Relationships
Telefónica views on the design, architecture, and technology of 4G/5G Open RA...DESMOND YUEN
This whitepaper is a blueprint for developing an Open RAN solution. It provides an overview of the main
technology elements that Telefónica is developing
in collaboration with selected partners in the Open
RAN ecosystem.
It describes the architectural elements, design
criteria, technology choices, and key chipsets
employed to build a complete portfolio of radio
units and baseband equipment capable of a full
4G/5G RAN rollout in any market of interest.
Unleash Unlimited Potential with One-Time Purchase
BoxLang is more than just a language; it's a community. By choosing a Visionary License, you're not just investing in your success, you're actively contributing to the ongoing development and support of BoxLang.
Multiple Your Crypto Portfolio with the Innovative Features of Advanced Crypt...Hivelance Technology
Cryptocurrency trading bots are computer programs designed to automate buying, selling, and managing cryptocurrency transactions. These bots utilize advanced algorithms and machine learning techniques to analyze market data, identify trading opportunities, and execute trades on behalf of their users. By automating the decision-making process, crypto trading bots can react to market changes faster than human traders
Hivelance, a leading provider of cryptocurrency trading bot development services, stands out as the premier choice for crypto traders and developers. Hivelance boasts a team of seasoned cryptocurrency experts and software engineers who deeply understand the crypto market and the latest trends in automated trading, Hivelance leverages the latest technologies and tools in the industry, including advanced AI and machine learning algorithms, to create highly efficient and adaptable crypto trading bots
Accelerate Enterprise Software Engineering with PlatformlessWSO2
Key takeaways:
Challenges of building platforms and the benefits of platformless.
Key principles of platformless, including API-first, cloud-native middleware, platform engineering, and developer experience.
How Choreo enables the platformless experience.
How key concepts like application architecture, domain-driven design, zero trust, and cell-based architecture are inherently a part of Choreo.
Demo of an end-to-end app built and deployed on Choreo.
Enhancing Research Orchestration Capabilities at ORNL.pdfGlobus
Cross-facility research orchestration comes with ever-changing constraints regarding the availability and suitability of various compute and data resources. In short, a flexible data and processing fabric is needed to enable the dynamic redirection of data and compute tasks throughout the lifecycle of an experiment. In this talk, we illustrate how we easily leveraged Globus services to instrument the ACE research testbed at the Oak Ridge Leadership Computing Facility with flexible data and task orchestration capabilities.
Experience our free, in-depth three-part Tendenci Platform Corporate Membership Management workshop series! In Session 1 on May 14th, 2024, we began with an Introduction and Setup, mastering the configuration of your Corporate Membership Module settings to establish membership types, applications, and more. Then, on May 16th, 2024, in Session 2, we focused on binding individual members to a Corporate Membership and Corporate Reps, teaching you how to add individual members and assign Corporate Representatives to manage dues, renewals, and associated members. Finally, on May 28th, 2024, in Session 3, we covered questions and concerns, addressing any queries or issues you may have.
For more Tendenci AMS events, check out www.tendenci.com/events
Advanced Flow Concepts Every Developer Should KnowPeter Caitens
Tim Combridge from Sensible Giraffe and Salesforce Ben presents some important tips that all developers should know when dealing with Flows in Salesforce.
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Globus
Large Language Models (LLMs) are currently the center of attention in the tech world, particularly for their potential to advance research. In this presentation, we'll explore a straightforward and effective method for quickly initiating inference runs on supercomputers using the vLLM tool with Globus Compute, specifically on the Polaris system at ALCF. We'll begin by briefly discussing the popularity and applications of LLMs in various fields. Following this, we will introduce the vLLM tool, and explain how it integrates with Globus Compute to efficiently manage LLM operations on Polaris. Attendees will learn the practical aspects of setting up and remotely triggering LLMs from local machines, focusing on ease of use and efficiency. This talk is ideal for researchers and practitioners looking to leverage the power of LLMs in their work, offering a clear guide to harnessing supercomputing resources for quick and effective LLM inference.
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...Juraj Vysvader
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I didn't get rich from it but it did have 63K downloads (powered possible tens of thousands of websites).
Check out the webinar slides to learn more about how XfilesPro transforms Salesforce document management by leveraging its world-class applications. For more details, please connect with sales@xfilespro.com
If you want to watch the on-demand webinar, please click here: https://www.xfilespro.com/webinars/salesforce-document-management-2-0-smarter-faster-better/
Modern design is crucial in today's digital environment, and this is especially true for SharePoint intranets. The design of these digital hubs is critical to user engagement and productivity enhancement. They are the cornerstone of internal collaboration and interaction within enterprises.
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERRORTier1 app
Even though at surface level ‘java.lang.OutOfMemoryError’ appears as one single error; underlyingly there are 9 types of OutOfMemoryError. Each type of OutOfMemoryError has different causes, diagnosis approaches and solutions. This session equips you with the knowledge, tools, and techniques needed to troubleshoot and conquer OutOfMemoryError in all its forms, ensuring smoother, more efficient Java applications.
Developing Distributed High-performance Computing Capabilities of an Open Sci...Globus
COVID-19 had an unprecedented impact on scientific collaboration. The pandemic and its broad response from the scientific community has forged new relationships among public health practitioners, mathematical modelers, and scientific computing specialists, while revealing critical gaps in exploiting advanced computing systems to support urgent decision making. Informed by our team’s work in applying high-performance computing in support of public health decision makers during the COVID-19 pandemic, we present how Globus technologies are enabling the development of an open science platform for robust epidemic analysis, with the goal of collaborative, secure, distributed, on-demand, and fast time-to-solution analyses to support public health.
SOCRadar Research Team: Latest Activities of IntelBrokerSOCRadar
The European Union Agency for Law Enforcement Cooperation (Europol) has suffered an alleged data breach after a notorious threat actor claimed to have exfiltrated data from its systems. Infamous data leaker IntelBroker posted on the even more infamous BreachForums hacking forum, saying that Europol suffered a data breach this month.
The alleged breach affected Europol agencies CCSE, EC3, Europol Platform for Experts, Law Enforcement Forum, and SIRIUS. Infiltration of these entities can disrupt ongoing investigations and compromise sensitive intelligence shared among international law enforcement agencies.
However, this is neither the first nor the last activity of IntekBroker. We have compiled for you what happened in the last few days. To track such hacker activities on dark web sources like hacker forums, private Telegram channels, and other hidden platforms where cyber threats often originate, you can check SOCRadar’s Dark Web News.
Stay Informed on Threat Actors’ Activity on the Dark Web with SOCRadar!
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...Shahin Sheidaei
Games are powerful teaching tools, fostering hands-on engagement and fun. But they require careful consideration to succeed. Join me to explore factors in running and selecting games, ensuring they serve as effective teaching tools. Learn to maintain focus on learning objectives while playing, and how to measure the ROI of gaming in education. Discover strategies for pitching gaming to leadership. This session offers insights, tips, and examples for coaches, team leads, and enterprise leaders seeking to teach from simple to complex concepts.
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Globus
The Earth System Grid Federation (ESGF) is a global network of data servers that archives and distributes the planet’s largest collection of Earth system model output for thousands of climate and environmental scientists worldwide. Many of these petabyte-scale data archives are located in proximity to large high-performance computing (HPC) or cloud computing resources, but the primary workflow for data users consists of transferring data, and applying computations on a different system. As a part of the ESGF 2.0 US project (funded by the United States Department of Energy Office of Science), we developed pre-defined data workflows, which can be run on-demand, capable of applying many data reduction and data analysis to the large ESGF data archives, transferring only the resultant analysis (ex. visualizations, smaller data files). In this talk, we will showcase a few of these workflows, highlighting how Globus Flows can be used for petabyte-scale climate analysis.
In software engineering, the right architecture is essential for robust, scalable platforms. Wix has undergone a pivotal shift from event sourcing to a CRUD-based model for its microservices. This talk will chart the course of this pivotal journey.
Event sourcing, which records state changes as immutable events, provided robust auditing and "time travel" debugging for Wix Stores' microservices. Despite its benefits, the complexity it introduced in state management slowed development. Wix responded by adopting a simpler, unified CRUD model. This talk will explore the challenges of event sourcing and the advantages of Wix's new "CRUD on steroids" approach, which streamlines API integration and domain event management while preserving data integrity and system resilience.
Participants will gain valuable insights into Wix's strategies for ensuring atomicity in database updates and event production, as well as caching, materialization, and performance optimization techniques within a distributed system.
Join us to discover how Wix has mastered the art of balancing simplicity and extensibility, and learn how the re-adoption of the modest CRUD has turbocharged their development velocity, resilience, and scalability in a high-growth environment.
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus
As part of the DOE Integrated Research Infrastructure (IRI) program, NERSC at Lawrence Berkeley National Lab and ALCF at Argonne National Lab are working closely with General Atomics on accelerating the computing requirements of the DIII-D experiment. As part of the work the team is investigating ways to speedup the time to solution for many different parts of the DIII-D workflow including how they run jobs on HPC systems. One of these routes is looking at Globus Compute as a way to replace the current method for managing tasks and we describe a brief proof of concept showing how Globus Compute could help to schedule jobs and be a tool to connect compute at different facilities.
Your Digital Assistant.
Making complex approach simple. Straightforward process saves time. No more waiting to connect with people that matter to you. Safety first is not a cliché - Securely protect information in cloud storage to prevent any third party from accessing data.
Would you rather make your visitors feel burdened by making them wait? Or choose VizMan for a stress-free experience? VizMan is an automated visitor management system that works for any industries not limited to factories, societies, government institutes, and warehouses. A new age contactless way of logging information of visitors, employees, packages, and vehicles. VizMan is a digital logbook so it deters unnecessary use of paper or space since there is no requirement of bundles of registers that is left to collect dust in a corner of a room. Visitor’s essential details, helps in scheduling meetings for visitors and employees, and assists in supervising the attendance of the employees. With VizMan, visitors don’t need to wait for hours in long queues. VizMan handles visitors with the value they deserve because we know time is important to you.
Feasible Features
One Subscription, Four Modules – Admin, Employee, Receptionist, and Gatekeeper ensures confidentiality and prevents data from being manipulated
User Friendly – can be easily used on Android, iOS, and Web Interface
Multiple Accessibility – Log in through any device from any place at any time
One app for all industries – a Visitor Management System that works for any organisation.
Stress-free Sign-up
Visitor is registered and checked-in by the Receptionist
Host gets a notification, where they opt to Approve the meeting
Host notifies the Receptionist of the end of the meeting
Visitor is checked-out by the Receptionist
Host enters notes and remarks of the meeting
Customizable Components
Scheduling Meetings – Host can invite visitors for meetings and also approve, reject and reschedule meetings
Single/Bulk invites – Invitations can be sent individually to a visitor or collectively to many visitors
VIP Visitors – Additional security of data for VIP visitors to avoid misuse of information
Courier Management – Keeps a check on deliveries like commodities being delivered in and out of establishments
Alerts & Notifications – Get notified on SMS, email, and application
Parking Management – Manage availability of parking space
Individual log-in – Every user has their own log-in id
Visitor/Meeting Analytics – Evaluate notes and remarks of the meeting stored in the system
Visitor Management System is a secure and user friendly database manager that records, filters, tracks the visitors to your organization.
"Secure Your Premises with VizMan (VMS) – Get It Now"
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns
Unlocking Business Potential: Tailored Technology Solutions by Prosigns
Discover how Prosigns, a leading technology solutions provider, partners with businesses to drive innovation and success. Our presentation showcases our comprehensive range of services, including custom software development, web and mobile app development, AI & ML solutions, blockchain integration, DevOps services, and Microsoft Dynamics 365 support.
Custom Software Development: Prosigns specializes in creating bespoke software solutions that cater to your unique business needs. Our team of experts works closely with you to understand your requirements and deliver tailor-made software that enhances efficiency and drives growth.
Web and Mobile App Development: From responsive websites to intuitive mobile applications, Prosigns develops cutting-edge solutions that engage users and deliver seamless experiences across devices.
AI & ML Solutions: Harnessing the power of Artificial Intelligence and Machine Learning, Prosigns provides smart solutions that automate processes, provide valuable insights, and drive informed decision-making.
Blockchain Integration: Prosigns offers comprehensive blockchain solutions, including development, integration, and consulting services, enabling businesses to leverage blockchain technology for enhanced security, transparency, and efficiency.
DevOps Services: Prosigns' DevOps services streamline development and operations processes, ensuring faster and more reliable software delivery through automation and continuous integration.
Microsoft Dynamics 365 Support: Prosigns provides comprehensive support and maintenance services for Microsoft Dynamics 365, ensuring your system is always up-to-date, secure, and running smoothly.
Learn how our collaborative approach and dedication to excellence help businesses achieve their goals and stay ahead in today's digital landscape. From concept to deployment, Prosigns is your trusted partner for transforming ideas into reality and unlocking the full potential of your business.
Join us on a journey of innovation and growth. Let's partner for success with Prosigns.
Sign up for future issues. For more complete information about compiler optimizations, see our Optimization Notice.

The Parallel Universe
CONTENTS
FEATURE
Letter from the Editor 3
Old and New
by Henry A. Gabb, Senior Principal Engineer at Intel Corporation
Tuning Autonomous Driving Using Intel® System Studio 5
Intel® GO™ SDK Offers Automotive Solution Developers an Integrated Solutions Environment
OpenMP* Is Turning 20! 19
Making Parallel Programming Accessible to C/C++ and Fortran Programmers
Julia*: A High-Level Language for Supercomputing 23
The Julia Project Continues to Break New Boundaries in Scientific Computing
Vectorization Becomes Important—Again 31
Open Source Code WARP3D Exemplifies Renewed Interest in Vectorization
Building Fast Data Compression Code for Cloud and Edge Applications 40
How to Optimize Your Compression with Intel® Integrated Performance Primitives (Intel® IPP)
MySQL* Optimization with Intel® C++ Compiler 48
How LeCloud Leverages MySQL* to Deliver Peak Service
Accelerating Linear Regression in R* with Intel® DAAL 58
Make Better Predictions with This Highly Optimized Open Source Package
Old and New
Back in 1993, the institute where I was doing my postdoctoral research got access to a Cray
C90* supercomputer. Competition for time on this system was so fierce that we were told―
in no uncertain terms―that if our programs didn’t take advantage of the architecture, they
should run elsewhere. The C90 was a vector processor, so we had to vectorize our code.
Those of us who took the time to read the compiler reports and make the necessary code
modifications saw striking performance gains. Though vectorization would eventually take a
backseat to parallelization in the multicore era, this optimization technique remains important.
In Vectorization Becomes Important―Again, Robert H. Dodds Jr. (professor emeritus at the
University of Illinois) shows how to vectorize a real application using modern programming tools.
We continue our celebration of OpenMP*’s 20th birthday with a guest editorial from Bronis R. de
Supinski, chief technology officer of Livermore Computing and the current chair of the OpenMP
Language Committee. Bronis gives his take on the conception and evolution of OpenMP as well
as its future direction in OpenMP Is Turning 20! Though 20 years old, OpenMP continues to
evolve with modern architectures.
I used to be a Fortran zealot, before becoming a Perl* zealot, and now a Python* zealot. Recently,
I had occasion to experiment with a new productivity language called Julia*. I recoded some of
my time-consuming data wrangling applications from Python to Julia, maintaining a line-for-line
translation as much as possible. The performance gains were startling, especially because these
were not numerically intensive applications, where Julia is known to shine. They were string
manipulation applications to prepare data sets for text mining. I’m not ready to forsake Python
and its vast ecosystem just yet, but Julia definitely has my attention. Take a look at Julia*: A High-
Level Language for Supercomputing for an overview of the language and its features.
Henry A. Gabb, Senior Principal Engineer at Intel Corporation, is a longtime high-performance and parallel
computing practitioner and has published numerous articles on parallel programming. He was editor/coauthor of
“Developing Multithreaded Applications: A Platform Consistent Approach” and was program manager of the Intel/
Microsoft Universal Parallel Computing Research Centers.
LETTER FROM THE EDITOR
This issue’s feature article, Tuning Autonomous Driving Using Intel® System Studio, illustrates
how the tools in Intel System Studio give embedded systems and connected device developers
an integrated development environment to build, debug, and tune performance and power
usage. Continuing the theme of tuning edge applications, Building Fast Data Compression Code
for Cloud and Edge Applications shows how to use the Intel® Integrated Performance Primitives
to speed data compression.
As I mentioned in the last issue of The Parallel Universe, R* is not my favorite language. It is useful,
however, and Accelerating Linear Regression in R* with Intel® DAAL shows how data analytics
applications in R can take advantage of the Intel® Data Analytics Acceleration Library. Finally, in
MySQL* Optimization with Intel® C++ Compiler, we round out this issue with a demonstration
of how Interprocedural Optimization significantly improves the performance of another
important application for data scientists, the MySQL database.
Coming Attractions
Future issues of The Parallel Universe will contain articles on a wide range of topics, including
persistent memory, IoT development, the Intel® AVX instruction set, and much more. Stay tuned!
Henry A. Gabb
July 2017
Tuning Autonomous Driving Using Intel® System Studio
Intel® GO™ SDK Offers Automotive Solution Developers an Integrated Solutions Environment
Lavanya Chockalingam, Software Technical Consulting Engineer, Intel Corporation
The Internet of Things is a collection of smart devices connected to the cloud. “Things” can be as
small and simple as a connected watch or a smartphone, or they can be as large and complex as
a car. In fact, cars are rapidly becoming some of the world’s most intelligent connected devices,
using sensor technology and powerful processors to sense and continuously respond to their
surroundings. Powering these cars requires a complex set of technologies:
• Sensors that pick up LIDAR, sonar, radar, and optical signals
• A sensor fusion hub that gathers millions of data points
• A microprocessor that processes the data
• Machine learning algorithms that require an enormous amount of computing power to make the data intelligent and useful
Successfully realizing these automotive innovations has the potential not only to change driving but also to transform society.
Intel® GO™ Automotive Software Development Kit (SDK)
From car to cloud―and the connectivity in between―there is a need for automated driving
solutions that include high-performance platforms, software development tools, and robust
technologies for the data center. With Intel GO automotive driving solutions, Intel brings its
deep expertise in computing, connectivity, and the cloud to the automotive industry.
Autonomous driving on a global scale takes more than high-performance sensing and computing
in the vehicle. It requires an extensive infrastructure of data services and connectivity. This data
will be shared with all autonomous vehicles to continuously improve their ability to accurately
sense and safely respond to surroundings. To communicate with the data center, infrastructure
on the road, and other cars, autonomous vehicles will need high-bandwidth, reliable two-way
communication along with extensive data center services to receive, label, process, store, and
transmit huge quantities of data every second. The software stack within autonomous driving
systems must be able to efficiently handle demanding real-time processing requirements while
minimizing power consumption.
The Intel GO automotive SDK helps developers and system designers maximize hardware
capabilities with a variety of tools:
• Computer vision, deep learning, and OpenCL™ toolkits to rapidly develop the necessary middleware and algorithms for perception, fusion, and decision-making
• Sensor data labeling tool for the creation of “ground truth” for deep learning training and environment modeling
• Autonomous driving-targeted performance libraries, leading compilers, performance and power analyzers, and debuggers to enable full stack optimization and rapid development in a functional safety compliance workflow
• Sample reference applications, such as lane change detection and object avoidance, to shorten the learning curve for developers
Intel® System Studio
Intel also provides software development tools that help accelerate time to market for automated driving solutions. Intel System Studio is a comprehensive, integrated tool suite that gives developers compilers, performance libraries, power and performance analyzers, and debuggers to maximize hardware capabilities while speeding the pace of development, helping deliver next-generation, power-efficient, high-performance, and reliable embedded and mobile devices. It includes tools to:
• Build and optimize your code
• Debug and trace your code to isolate and resolve defects
• Analyze your code for power, performance, and correctness
Build and Optimize Your Code
• Intel® C++ Compiler: A high-performance, optimized C and C++ cross-compiler that can offload compute-intensive code to Intel® HD Graphics.
• Intel® Math Kernel Library (Intel® MKL): A set of highly optimized linear algebra, fast Fourier transform (FFT), vector math, and statistics functions.
• Intel® Threading Building Blocks (Intel® TBB): C++ parallel computing templates to boost embedded system performance.
• Intel® Integrated Performance Primitives (Intel® IPP): A software library that provides a broad range of highly optimized functionality including general signal and image processing, computer vision, data compression, cryptography, and string manipulation.
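To make the parallel computing templates idea concrete: Intel TBB's tbb::parallel_for splits an iteration range across cores and schedules the chunks with work stealing. The hand-rolled std::thread sketch below shows roughly what TBB automates for you; it is an illustration only, not TBB code, and parallel_for_sketch is a name invented here:

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Apply f(i) for every i in [0, n), splitting the range statically across
// the available hardware threads. tbb::parallel_for does this in a single
// call, adding work stealing and tunable grain sizes.
template <typename Func>
void parallel_for_sketch(std::size_t n, Func f) {
    unsigned nthreads = std::max(1u, std::thread::hardware_concurrency());
    std::size_t chunk = (n + nthreads - 1) / nthreads;
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < nthreads; ++t) {
        std::size_t lo = t * chunk;
        std::size_t hi = std::min(n, lo + chunk);
        if (lo >= hi) break;
        // Each thread works on its own disjoint sub-range [lo, hi).
        pool.emplace_back([=] { for (std::size_t i = lo; i < hi; ++i) f(i); });
    }
    for (auto &th : pool) th.join();
}
```

With TBB installed, the same loop body would instead be passed to tbb::parallel_for over a tbb::blocked_range, which also adapts chunk sizes to the workload at runtime.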
Debug and Trace Your Code to Isolate and Resolve Defects
• Intel® System Debugger: Includes a System Debug feature that provides source-level debugging of OS kernel software, drivers, and firmware, plus a System Trace feature that provides an Eclipse* plug-in adding access to the Intel® Trace Hub for advanced SoC-wide instruction and data event tracing through its trace viewer.
• GNU* Project Debugger: This Intel-enhanced GDB is for debugging applications natively and remotely on Intel® architecture-based systems.
Analyze Your Code for Power, Performance, and Correctness
•• Intel® VTune™ Amplifier: This software performance analysis tool is for users developing serial and
multithreaded applications.
•• Intel® Energy Profiler: Analyzes platform-wide energy consumption using power-related data
collected on a target platform with the SoC Watch tool.
•• Intel® Performance Snapshot: Provides a quick, simple view into performance optimization
opportunities.
•• Intel® Inspector: A dynamic memory and threading error-checking tool for users developing serial and
multithreaded applications on embedded platforms.
•• Intel® Graphics Performance Analyzers: Real-time, system-level performance analyzers to optimize
CPU/GPU workloads.
Sign up for future issues. For more complete information about compiler optimizations, see our Optimization Notice.
The Parallel Universe
Optimizing Performance
Advanced Hotspot Analysis
Matrix multiplication is a commonly used operation in autonomous driving. Intel System Studio
tools, mainly the performance analyzers and libraries, can help maximize performance. Consider
this example of a very naïve implementation of matrix multiplication using three nested for-loops:
void multiply0(int msize, int tidx, int numt, TYPE a[][NUM], TYPE b[][NUM],
               TYPE c[][NUM], TYPE t[][NUM])
{
  int i, j, k;
  // Basic serial implementation
  for (i = 0; i < msize; i++) {
    for (j = 0; j < msize; j++) {
      for (k = 0; k < msize; k++) {
        c[i][j] = c[i][j] + a[i][k] * b[k][j];
      }
    }
  }
}
Advanced hotspot analysis is a fast and easy way to identify performance-critical code sections
(hotspots). The periodic instruction-pointer sampling performed by Intel VTune Amplifier identifies
code locations where an application spends the most time. A function may consume significant time
either because its code is slow or because it is called frequently. Either way, speeding up such
functions should have a big impact on overall application performance.
Running an advanced hotspot analysis on the previous matrix multiplication code using Intel
VTune Amplifier shows a total elapsed time of 22.9 seconds (Figure 1). Of that time, the CPU was
actively executing for 22.6 seconds. The CPI rate (i.e., cycles per instruction) of 1.142 is flagged
as a problem. Modern superscalar processors can issue four instructions per cycle, suggesting
an ideal CPI of 0.25, but various effects in the pipeline―like long latency memory instructions,
branch mispredictions, or instruction starvation in the front end―tend to increase the observed
CPI. A CPI of one or less is considered good, but different application domains will have different
expected values. In our case, we can further analyze the application to see if the CPI can be
lowered. Intel VTune Amplifier’s advanced hotspot analysis also indicates the top five hotspot
functions to consider for optimization.
1 Elapsed time and top hotspots before optimization
CPU Utilization
As shown in Figure 2, analysis of the original code indicates that only one of the 88 logical CPUs
is being used. This means there is significant room for performance improvement if we can
parallelize this sample code.
2 CPU usage
Parallelizing the sample code as shown below gives an immediate 12x speedup (Figure 3). The CPI
has also dropped below 1, another significant improvement.
void multiply1(int msize, int tidx, int numt, TYPE a[][NUM], TYPE b[][NUM],
               TYPE c[][NUM], TYPE t[][NUM])
{
  int i, j, k;
  // Basic parallel implementation; j and k must be private to each thread
  #pragma omp parallel for private(j, k)
  for (i = 0; i < msize; i++) {
    for (j = 0; j < msize; j++) {
      for (k = 0; k < msize; k++) {
        c[i][j] = c[i][j] + a[i][k] * b[k][j];
      }
    }
  }
}
3 Performance improvement from parallelization
General Exploration Analysis
Once you have used Basic Hotspots or Advanced Hotspots analysis to determine hotspots
in your code, you can perform General Exploration analysis to understand how efficiently
your code is passing through the core pipeline. During General Exploration analysis, Intel
VTune Amplifier collects a complete list of events for analyzing a typical client application. It
calculates a set of predefined ratios used for the metrics and facilitates identifying hardware-
level performance problems. Superscalar processors can be conceptually divided into the
front end (where instructions are fetched and decoded into the operations that constitute
them) and the back end (where the required computation is performed). General Exploration
analysis estimates which of these stages limits performance and breaks all pipeline slots into
four categories:
1. Pipeline slots containing useful work that issued and retired (retired)
2. Pipeline slots containing useful work that issued and canceled (bad speculation)
3. Pipeline slots that could not be filled with useful work due to problems in the front end
(front-end bound)
4. Pipeline slots that could not be filled with useful work due to a backup in the back end
(back-end bound)
Figure 4 shows the results of running a general exploration analysis on the parallelized
example code using Intel VTune Amplifier. Notice that 77.2 percent of pipeline slots are
blocked by back-end issues. Drilling down into the source code shows where these back-
end issues occur (Figure 5, 49.4 + 27.8 = 77.2 percent back-end bound). Memory issues and
L3 latency are very high. The memory bound metric shows how memory subsystem issues
affect performance. The L3 bound metric shows how often the CPU stalled on the L3 cache.
Avoiding cache misses (L2 misses/L3 hits) improves latency and increases performance.
4 General exploration analysis results
5 Identifying issues
Memory Access Analysis
The Intel VTune Amplifier’s Memory Access analysis identifies memory-related issues, like
NUMA (non-uniform memory access) problems and bandwidth-limited accesses, and attributes
performance events to memory objects (data structures). This information comes from
instrumenting memory allocations and deallocations and from resolving static and global
variables through symbol information.
By selecting the Function/Memory Object/Allocation Stack grouping (Figure 6), you can identify
the memory objects that are affecting performance. Of the three objects listed in the multiply1
function, one has a very high latency of 82 cycles. Double-clicking on this object takes you to the
source code, which indicates that array “b” has the highest latency. This is because array “b” is
accessed in column-major order. Interchanging the nested loops changes the access to row-major
order and reduces the latency, resulting in better performance (Figure 7).
6 Identifying memory objects affecting performance
7 General exploration analysis results after loop interchange
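The loop interchange just described can be sketched as follows. This is a minimal illustration; the function name multiply2 and the small matrix size are assumptions, not from the article:

```c
#include <assert.h>

#define NUM 4
typedef double TYPE;

/* Loop-interchanged multiply: swapping the j and k loops makes the
   innermost loop sweep b[k][j] along a row (unit stride in row-major C)
   instead of down a column, improving cache behavior. */
void multiply2(int msize, TYPE a[][NUM], TYPE b[][NUM], TYPE c[][NUM])
{
  int i, j, k;
  for (i = 0; i < msize; i++) {
    for (k = 0; k < msize; k++) {
      for (j = 0; j < msize; j++) {
        c[i][j] = c[i][j] + a[i][k] * b[k][j];
      }
    }
  }
}
```

With the i-k-j ordering, the innermost loop reads b[k][j] with unit stride, which removes the column-major access penalty identified above without changing the computed result.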
We can see that although the sample is still back-end bound, it is no longer memory bound; it is
now core bound. Core-bound stalls reflect a shortage of hardware compute resources or
dependencies among the software’s instructions. Here, the machine may have run out of
out-of-order resources, certain execution units may be overloaded, or dependencies in the
program’s data or instruction flow may be limiting performance. In this case, vector capacity
usage is low, which indicates that floating-point scalar or vector instructions are using only part
of the available vector capacity. This can be solved by vectorizing the code.
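As a sketch of that vectorization step, one option (assuming an OpenMP-capable compiler; the function name is hypothetical) is to request SIMD code generation on the unit-stride inner loop:

```c
#include <assert.h>

#define NUM 4
typedef double TYPE;

/* Hypothetical vectorized variant: with the loops interchanged so the
   innermost loop has unit stride, #pragma omp simd (enabled with
   -qopenmp or -fopenmp) asks the compiler to generate vector
   instructions for it. Compilers without OpenMP SIMD support simply
   ignore the pragma, so the code stays correct either way. */
void multiply_simd(int msize, TYPE a[][NUM], TYPE b[][NUM], TYPE c[][NUM])
{
  for (int i = 0; i < msize; i++) {
    for (int k = 0; k < msize; k++) {
      #pragma omp simd
      for (int j = 0; j < msize; j++)
        c[i][j] += a[i][k] * b[k][j];
    }
  }
}
```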
Another optimization option is to use the Intel Math Kernel Library, which offers highly optimized,
threaded implementations of many mathematical operations, including matrix multiplication. The
cblas_dgemm routine multiplies two double-precision matrices:
void multiply5(int msize, int tidx, int numt, TYPE a[][NUM], TYPE b[][NUM],
               TYPE c[][NUM], TYPE t[][NUM])
{
  double alpha = 1.0, beta = 0.0;
  cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
              NUM, NUM, NUM,
              alpha, (const double *)b, NUM,
              (const double *)a, NUM,
              beta, (double *)c, NUM);
}
Performance Analysis and Tuning for Image Resizing
Image resizing is a commonly used operation in the autonomous driving space. As an example, we
ran Intel VTune Amplifier’s advanced hotspots analysis on an open-source OpenCV* implementation
of image resize (Figure 8). The elapsed time is 0.33 seconds, and the top hotspot is
the cv::HResizeLinear function, which consumes 0.19 seconds of the total CPU time.
8 Advanced hotspot analysis
Intel IPP offers developers highly optimized, production-ready building blocks for image
processing, signal processing, and data processing (data compression/decompression and
cryptography) applications. These building blocks are optimized using the Intel® Streaming
SIMD Extensions (Intel® SSE) and Intel® Advanced Vector Extensions (Intel® AVX, Intel® AVX2)
instruction sets. Figure 9 shows the analysis results for the image resize that takes advantage
of Intel IPP. We can see that the elapsed time has gone down by a factor of two, and since only
one core is currently being used, there is an opportunity for further performance improvement
through parallelism via Intel Threading Building Blocks.
9 Checking code with Intel® VTune™ Amplifier
Conclusion
The Intel System Studio tools in the Intel GO SDK give automotive solution developers an
integrated development environment with the ability to build, debug and trace, and tune the
performance and power usage of their code. This helps both system and embedded developers
meet some of their most daunting challenges:
•• Accelerate software development to bring competitive automated driving cars to market faster
•• Quickly target and help resolve defects in complex automated driving (AD), advanced driver
assistance systems (ADAS), or software-defined cockpit (SDC) systems
•• Help speed performance and reduce power consumption
This is all provided in one easy-to-use software package.
Download a free trial of Intel® System Studio ›
Blog Highlights
Performance of Classic Matrix Multiplication Algorithm on Intel® Xeon Phi™ Processor System
BY SERGEY KOSTROV (INTEL) ›
Matrix multiplication (MM) of two matrices is one of the most fundamental operations in linear algebra.
The algorithm for MM is very simple: it can be easily implemented in any programming language,
and its performance improves significantly when different optimization techniques are applied.
Several versions of the classic matrix multiplication algorithm (CMMA) for computing the product of
square dense matrices are evaluated in four test programs. The performance of these CMMAs is compared
to the highly optimized ‘cblas_sgemm’ function of the Intel® Math Kernel Library (Intel® MKL). Tests were
completed on a computer system with an Intel® Xeon Phi™ 7210 processor running the Linux Red Hat*
operating system in ‘All2All’ cluster mode and with ‘Flat’, ‘Hybrid 50-50’, and ‘Cache’ MCDRAM modes.
Read more
OpenMP* Is Turning 20!
Making Parallel Programming Accessible to C/C++ and Fortran Programmers
Bronis R. de Supinski, Chief Technology Officer, Livermore Computing
I suppose that 20 years makes OpenMP* a nearly full-grown adult. In any event, this birthday
provides us with a good opportunity to review the origins and evolution of the OpenMP
specification. As the chair of the OpenMP Language Committee (a role that I have filled for the last
nine years, since shortly after the release of OpenMP 3.0), I am happy to have the opportunity to
provide this retrospective and to provide a glimpse into the health of the organization that owns
OpenMP and into our future plans to keep it relevant. I hope you will agree that OpenMP remains
true to its roots while evolving sufficiently to have reasonable prospects for another 20 years.
OpenMP arose in an era in which many compiler implementers supported unstandardized
directives to guide loop-based, shared-memory parallelization. While these directives were often
effective, they failed to meet a critical user requirement: portability. Not only did the syntax (i.e.,
spelling) of those implementation-specific directives vary but they also often exhibited subtle
differences in their semantics. Recognizing this deficiency, Mary Zosel from Lawrence Livermore
National Laboratory worked closely with the implementers to reach agreement on common
syntax and semantics that all would provide in OpenMP 1.0. In addition, they created the
OpenMP organization to own and to maintain that specification.
Many view OpenMP’s roots narrowly as exactly the loop-based, shared-memory directives
that were standardized in that initial specification. I prefer to look at them more broadly as a
commitment to support portable directive-based parallelization and optimization, possibly
limited to within a process, through an organization that combines compiler implementers
with user organizations who require their systems to deliver the highest possible portable
performance. Ultimately, I view OpenMP as a mechanism for programmers to express key
performance features of their applications that compilers would find difficult or even impossible
to derive through (static) analysis. Organizationally, OpenMP is vibrant, with membership that
includes representatives of all major compiler implementations (at least in the space relevant to
my user organization) and an active and growing set of user organizations.
Technically, the OpenMP specification met its original goal of unifying the available loop-
based parallelization directives. It continues to provide simple, well-defined constructs for that
purpose. Further, OpenMP has fostered ongoing improvements to those directives, such as the
ability to collapse loops and to control the placement of threads that execute the parallelized
code. Achieving consistently strong performance for these constructs across shared-memory
architectures and a complete range of compilers, OpenMP provides the portability that
motivated its creation.
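As one small illustration of those loop-directive refinements, the sketch below uses collapse(2) to flatten two nested loops into a single parallel iteration space. The function and names are illustrative, not from the article; compile with -fopenmp (without it the pragma is ignored and the loop runs serially):

```c
#include <assert.h>

/* collapse(2) merges the i and j loops so all n*m iterations are
   distributed among the threads of the parallel region, rather than
   parallelizing only the outer loop. */
void scale_grid(int n, int m, double grid[n][m], double s)
{
  #pragma omp parallel for collapse(2)
  for (int i = 0; i < n; i++)
    for (int j = 0; j < m; j++)
      grid[i][j] *= s;
}
```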
OpenMP’s evolution has led it to adopt several additional forms of parallelism. I am often
annoyed to hear people in our community say that we need a standardized mechanism for
task-based parallelism. OpenMP has provided exactly that for the last nine years! In 2008, the
OpenMP 3.0 specification was adopted with a complete task-based model. While I acknowledge
that OpenMP tasking implementations could still be improved, we face a chicken-and-egg
problem. I often hear users state that they will use OpenMP tasking when implementations
consistently deliver strong performance with the model. However, implementers also frequently
state that they will optimize their tasking implementation once they see sufficient adoption of the
model. Another issue for the OpenMP tasking model derives from one of OpenMP’s strengths—it
provides consistent syntax and semantics for a range of parallelization strategies that can all be
used in a single application. Supporting that generality necessarily implies inherent overheads.
However, model refinements can alleviate some of that overhead. For example, OpenMP 3.1
added the final clause (and its related concepts) to facilitate specifying a minimum task
granularity, which otherwise requires complex compiler analysis or much more complex program
structure. As we work toward future OpenMP specifications, we are continuing to identify ways to
reduce the overhead of OpenMP tasking.
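A minimal sketch of the final clause mentioned above; the cutoff value of 20 and the function are illustrative choices, not from the article. Compile with -fopenmp and call from inside a parallel/single region; without OpenMP the pragmas are ignored and the function runs serially:

```c
#include <assert.h>

/* Below the cutoff, final() makes the generated tasks (and all of their
   descendants) execute immediately and undeferred, so tiny subproblems
   do not pay task-creation overhead. */
long fib(long n)
{
  long x, y;
  if (n < 2) return n;
  #pragma omp task shared(x) final(n < 20)
  x = fib(n - 1);
  #pragma omp task shared(y) final(n < 20)
  y = fib(n - 2);
  #pragma omp taskwait
  return x + y;
}
```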
OpenMP added support for additional forms of parallelism with the release of OpenMP 4.0.
In particular, OpenMP now includes support for accelerator-based parallelization through its
device constructs and for SIMD (or vector) parallelism. The support for the latter is particularly
reminiscent of OpenMP’s origins. Support for SIMD parallelism through implementation-
specific directives had become widespread, most frequently in the form of ivdep. However,
the supported clauses varied widely and, in some cases, the spelling of the directives was also
different. More importantly, the semantics often varied in subtle ways that could lead directives to
be correct with one compiler but not with another. The addition of SIMD constructs to OpenMP
largely solves these problems, just as the creation of OpenMP 20 years ago solved them for
loop-based directives. Of course, the initial set of SIMD directives is not perfect and we continue
to refine them. For example, OpenMP 4.5 added the simdlen clause, which allows the user to
specify the preferred SIMD vector length to use.
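A brief, hypothetical use of the simdlen clause might look like this (the function and the choice of 8 lanes are illustrative; the pragma is ignored by compilers without OpenMP SIMD support):

```c
#include <assert.h>

/* simdlen(8) hints a preferred vector length of 8 lanes for this
   reduction loop; reduction(+:sum) keeps the accumulation correct
   under vectorization. */
double dot(const double *a, const double *b, int n)
{
  double sum = 0.0;
  #pragma omp simd simdlen(8) reduction(+:sum)
  for (int i = 0; i < n; i++)
    sum += a[i] * b[i];
  return sum;
}
```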
The OpenMP target construct specifies that computation in a structured block should be
offloaded to a device. Additional constructs that were added in the 4.0 specification support
efficient parallelization on devices such as GPUs. Importantly, all OpenMP constructs can be used
within a target region, which means that the years of advances in directive-based parallelization
are available for a range of devices. While one still must consider which forms of parallelism
will best map to specific types of devices, as well as to the algorithms being parallelized, the
orthogonality of OpenMP constructs greatly increases the portability of programs that need to
target multiple architectures.
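A minimal sketch of a target region (the function and map choices are illustrative): the map clauses copy x to the device and y both ways, and the combined construct parallelizes the loop there. Implementations fall back to host execution when no device is available, and compilers without OpenMP ignore the pragma entirely, so the result is the same either way.

```c
#include <assert.h>

/* Offload a saxpy loop to a device using the OpenMP target construct
   with explicit data mapping. */
void saxpy(int n, float a, const float *x, float *y)
{
  #pragma omp target teams distribute parallel for map(to: x[0:n]) map(tofrom: y[0:n])
  for (int i = 0; i < n; i++)
    y[i] = a * x[i] + y[i];
}
```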
We are actively working on the OpenMP 5.0 specification, with its release planned for November
2018. We have already adopted several significant extensions. OpenMP TR4 documents the
progress on OpenMP 5.0 through last November. We will also release a TR this November that
will document our continued progress. The most significant addition in TR4 is probably OMPT,
which extends OpenMP with a tool interface. I anticipate that OpenMP 5.0 will include OMPD, an
additional tool interface that will facilitate debugging OpenMP applications. TR4 included many
other extensions, perhaps most notably the addition of task reductions.
We expect that OpenMP 5.0 will provide several major extensions that were not included in TR4.
Perhaps most notably, OpenMP will greatly enhance support for complex memory hierarchies.
First, I expect it to include some form of the memory allocation mechanism that TR5 documents.
This mechanism will provide an intuitive interface for programmers to portably indicate which
memory should be used for particular variables or dynamic allocations in systems with multiple
types of memory (e.g., DDR4 and HBM). Further, I expect that OpenMP will provide a powerful
serialization and deserialization interface that will support deep copy of complex data structures.
This interface will eventually enable data-dependent layout transformations and associated
optimizations with device constructs.
As I have already mentioned, the OpenMP organization is healthy. We have added several
members in the past few years, both implementers and user organizations. We are currently
discussing changes to the OpenMP bylaws to support even wider participation. If you are
interested in shaping the future of OpenMP, we are always happy to have more participation! Visit
openmp.org for more information about OpenMP.
Bronis R. de Supinski is the chief technology officer of Livermore Computing, the supercomputing
center at Lawrence Livermore National Laboratory. He is also the chair of the OpenMP Language
Committee.
This document was prepared as an account of work sponsored by an agency of the United States government. Neither the United States government nor Lawrence Livermore
National Security, LLC, nor any of their employees makes any warranty, expressed or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or
usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific
commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or
favoring by the United States government or Lawrence Livermore National Security, LLC. The views and opinions of authors expressed herein do not necessarily state or reflect
those of the United States government or Lawrence Livermore National Security, LLC, and shall not be used for advertising or product endorsement purposes. This work, LLNL-
MI-732158, was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.
Julia*: A High-Level Language for Supercomputing
High-performance computing (HPC) has been at the forefront of scientific discovery. Scientists
routinely run simulations on millions of cores in distributed environments. The key software
stack involved has typically been a statically compiled language such as C/C++ or Fortran, in
conjunction with OpenMP*/MPI*. This software stack has stood the test of time and is almost
ubiquitous in the HPC world. The key reason for this is its efficiency in terms of the ability to use
all available compute resources while limiting memory usage. This level of efficiency has always
been beyond the scope of more “high-level” scripting languages such as Python*, Octave*, or R*.
This is primarily because such languages are designed to be high-productivity environments that
facilitate rapid prototyping―and are not designed to run efficiently on large compute clusters.
However, there is no technical reason why this should be the case, and this represents a tooling
problem for scientific computing.
The Julia Project Continues to Break New Boundaries in Scientific Computing
Ranjan Anantharaman, Viral Shah, and Alan Edelman, Julia Computing
Julia*, a new language for technical computing, is meant to address this problem. It reads like
Python or Octave, but performs as well as C. It has built-in primitives for multithreading and
distributed computing, allowing applications to scale to millions of cores.
The first natural questions that spring to mind are: Why can’t other high-level languages
perform this well? What’s unique about Julia that lets it achieve this level of efficiency while still
allowing the user to write readable code?
The answer is that Julia uses a combination of type inference and code specialization to
generate tight machine code. To see how, let’s use the Julia macro @code_native, which
displays the assembly code generated for a function call. For example:
julia> @code_native 1 + 1
        .section  __TEXT,__text,regular,pure_instructions
Filename: int.jl
        pushq   %rbp
        movq    %rsp, %rbp
Source line: 32
        leaq    (%rdi,%rsi), %rax
        popq    %rbp
        retq
Source line: 32
        nopw    (%rax,%rax)
This example shows the assembly code generated when Julia adds two integers. Notice that Julia
picks the right assembly instruction for the operation, and there is virtually no overhead.
The next code snippet shows the addition of two double-precision floating-point numbers:
julia> @code_native 1. + 1.
        .section  __TEXT,__text,regular,pure_instructions
Filename: float.jl
        pushq   %rbp
        movq    %rsp, %rbp
Source line: 240
        addsd   %xmm1, %xmm0
        popq    %rbp
        retq
Source line: 240
        nopw    (%rax,%rax)
Julia automatically picks a different instruction, addsd, since we are now adding two floating-point
numbers. This is because two versions of the function “+” are generated, depending on
the type of the input, which is inferred by the compiler. This idea allows the user to write high-level
code and then let Julia infer the types of the input and output, and subsequently compile a
specialized version of that function for those input types. This is what makes Julia fast (Figure 1).
1 Julia benchmark times relative to C (smaller is better). C performance = 1.0.
Benchmark code can be found in the GitHub repository.
Julia’s unique features solve the two-language problem in data science and numerical
computing. This has led to demand for Julia in a variety of industries from aerospace
to finance.
This led to the formation of Julia Computing, which consults with industry and fosters adoption
of Julia in the community. One of Julia Computing’s flagship products is JuliaPro*, a Julia
distribution that includes an IDE, a debugger, and more than 100 tested packages. JuliaPro will
soon ship with Intel® Math Kernel Library (Intel® MKL) for accelerated BLAS operations and
optimizations for multicore and the latest Intel® processors.
Parallel Computing in Julia
Julia has several built-in primitives for parallel computing at every level: vectorization (SIMD),
multithreading, and distributed computing.
Consider an embarrassingly parallel simulation as an example. A Monte Carlo calculation of π
involves generating random points with coordinates between 0 and 1 and counting the fraction
that land inside the unit circle. That fraction approximates the ratio of the area of the quarter
unit circle to that of the unit square, π/4, which can then be used to estimate π. The following
Julia code makes use of the @parallel construct. The reducer “+” sums the values of the final
expression computed in parallel on all Julia processes:
addprocs(2)  # Add 2 Julia worker processes

function parallel_π(n)
    in_circle = @parallel (+) for i in 1:n  # partition the work
        x = rand()
        y = rand()
        Int((x^2 + y^2) <= 1.0)
    end
    return (in_circle / n) * 4.0
end

parallel_π(10000)
Julia offloads work to its worker processes that compute the desired results and send them back
to the Julia master, where the reduction is performed. Arbitrary pieces of computation can be
assigned to different worker processes through this one-sided communication model.
A more interesting use case would be having different processes solve, for example, systems of
linear equations and then serialize their output and send it to the master process. Scalable
machine learning, which involves many independent backsolves, is one such example. Figure 2
shows the scaling characteristics of RecSys.jl, which processes millions of movie ratings to make
predictions. It is based on an algorithm called Alternating Least Squares (ALS), a simple, iterative
method for collaborative filtering. Since each solve is independent of the others, this code can be
parallelized and scaled easily.
2 Scaling chart of the ALS Recommender System: “nprocs” refers to the number of worker processes in the shared
memory and distributed setting and number of threads in the multithreaded setting
In Figure 2, we look at the scaling performance of Julia in multithreaded mode (Julia MT),
distributed mode (Julia Distr), and shared memory mode (Julia Shared). Shared memory
mode allows independent Julia worker processes (as opposed to threads) to access a shared
address space. An interesting aspect of this chart is the comparison between Julia’s distributed
capabilities and that of Apache Spark*, a popular framework for scalable analytics that is
widely adopted in the industry. What’s more interesting, though, are the consequences of
this experiment: Julia can scale well to a higher number of nodes. Let’s now move on to a real
experiment in supercomputing.
Bringing It All Together (at Scale): Celeste*
The Celeste* Project is a collaboration of Julia Computing (Keno Fischer), Intel Labs (Kiran
Pamnany), JuliaLabs@MIT (Andreas Noack, Jarrett Revels), Lawrence Berkeley National Labs
(David Schlegel, Rollin Thomas, Prabhat), and University of California–Berkeley (Jeffrey Regier,
Maximilian Lam, Steve Howard, Ryan Giordano, Jon McAuliffe). Celeste is a fully generative
hierarchical model that uses statistical inference to mathematically locate and characterize light
sources in the sky. This model allows astronomers to identify promising galaxies for spectrograph
targeting, define galaxies for further exploration, and help understand dark energy, dark matter,
and the geometry of the universe. The data set used for the experiment is the Sloan Digital Sky
Survey (SDSS) (Figure 3), with more than five million images comprising 55 terabytes of data.
3 A sample from the Sloan Digital Sky Survey (SDSS)
Using Julia’s native parallel computing capabilities, researchers were able to scale their
application to 8,192 Intel® Xeon® processor cores on the National Energy Research Scientific
Computing Center (NERSC) Cori* supercomputer at LBNL. This resulted in parallel speedups of
225x in image analysis, processing more than 20,000 images (or 250 GB), an increase of three
orders of magnitude compared to previous iterations.
Note that this was achieved with Julia’s native parallel capabilities, which allowed scaling of up
to 256 nodes and four threads per node. This kind of scaling puts Julia firmly in the HPC realm,
joining the elite league of languages such as C/C++ and Fortran.
Pradeep Dubey of Intel Labs summarized this nicely: “With Celeste, we are closer to bringing Julia
into the conversation because we’ve demonstrated excellent efficiency using hybrid parallelism―
not just processes, but threads as well―something that’s still impossible to do with Python or R.”
The next step in the project is scaling even further and processing the entire data set. This will
involve making use of the latest Intel® Xeon Phi™ processors and coprocessors.
Breaking New Boundaries
The Julia project has come a long way since its inception in 2012, breaking new boundaries in
scientific computing. Julia was conceived as a language that retained the productivity of Python
while being as fast and scalable as C/C++. The Celeste Project shows that this approach allows
Julia to be the first high-productivity language to scale well on clusters.
The Julia project continues to double its user base every year and to be increasingly adopted in
a variety of industries. Such advances in tooling for computational science not only bode well
for scientific research to address the challenges of the coming age, but will also result in rapid
innovation cycles and turnaround times in the industry.
LEARN MORE ABOUT JULIA COMPUTING ›
Vectorization Becomes Important―Again
Open Source Code WARP3D Exemplifies Renewed Interest in Vectorization
Robert H. Dodds Jr., Professor Emeritus, University of Illinois
Once the mainstay of high-performance computing, vectorization has reemerged with new importance given the increasing width of SIMD (vector) registers on each processor core: four doubles with AVX2, eight with AVX-512. Increased vector lengths create opportunities to boost performance in certain classes of software (e.g., large-scale finite element analysis to simulate the nonlinear behavior of mechanical systems). The open source research code WARP3D, available on GitHub, focuses on such simulations to support the development of safety and life-prediction methodologies for critical components typically found in energy production and related industries. The discipline emphasis (3-D nonlinear solids), open source license, and manageable code size (1,400 routines) appeal to researchers in industry and academia.
WARP3D reflects the current hierarchy of parallel computing: MPI to access multiple nodes in a cluster, threads via OpenMP* on each node, and vectorization within each thread. Realistic simulations to predict complex behavior (e.g., the internal stresses created in welding a large component) require many hours of parallel computation on current computers. The pervasive matrix operations needed to reach solutions map very well onto this hierarchy of parallel computing. Increased use of vectorization at the core level provides a direct multiplying effect on performance across concurrently executing threads and MPI ranks.
This article provides a brief view into vectorization as implemented in WARP3D. Developed
continuously over the last 20 years, the code adopts many of the evolving features in Fortran. We
employ components of Intel® Parallel Studio XE (Intel® Fortran Compiler, Intel® Math Kernel
Library, and Intel® MPI Library) to build executables included with the open source distribution
for Linux*, Windows*, and MacOS*. WARP3D follows an implicit formulation of the nonlinear finite
element method that necessitates solutions for very large, evolving sets of linear equations—
the PARDISO* and CPARDISO* packages in Intel Math Kernel Library provide exceptional
performance. Examples described here also illustrate the new Intel® Advisor 2017 with roofline
analysis that enables more convenient and in-depth exploration of loop-level performance.
1 Finite element model
Figure 1 illustrates the decomposition of a 3-D finite element model for a flange into domains
of elements (one domain/MPI rank) with elements in a domain assigned to blocks. Each discrete
element in the model has an associated number of generally small permanent and temporary
arrays (e.g., 6x6, 24x24, 60x60, 6x60), leading to millions of such arrays present during the
solution of moderate- to large-sized models.
Figure 2A shows a conventional, multilevel array of structures (AoS) for storage of element data,
with the various arrays for a single element likely contiguous in memory. While conceptually
convenient, the comparatively small, element-level matrices limit inner-loop vector lengths and
provide no mechanism to tune for cache size effects while processing the same array type across
all elements.
The alternative, blocked data structure, shown in Figure 2B for the element stiffness matrices Ke and Be (as examples), defines a structure of arrays (SoA) scheme using dense 2-D and 3-D arrays with the leading dimension set by the number of elements assigned to the block, termed span in WARP3D. All the Ke matrices for elements in a block, and separately the Be matrices, are thus contiguous in memory.
2A Conventional array of structures 2B Alternative blocked data structure
With a master list setting block ownership by domains, the following sequence illustrates the
concept of combined MPI plus thread parallelism in WARP3D:
c$OMP PARALLEL DO PRIVATE( blk )
do blk = 1, number_ele_blks
if( blk_to_domain_map(blk) .ne. myrank ) cycle
call process_a_blk( blk )
end do
c$OMP END PARALLEL DO
WARP3D employs a simple manager-worker design for MPI to synchronize tasks across ranks. In
the above, the manager (on rank 0) ensures that all workers execute the OpenMP parallel loop
to process the blocks they own. The process_a_blk routine dynamically allocates working
versions of blocked arrays as needed in a derived type, declared local to the routine and thus
local in each thread. The generally complex, multiple levels of computational routines to build/
update element matrices use vector-style coding—meaning they are unaware of the OpenMP
and MPI runtime environments. The data independence of element-based computations implied
here represents a key feature of the finite element method. Global arrays are accessed read-only
by the threads; the few block→gather operations on one vector specify $OMP ATOMIC UPDATE
for synchronization. All threading operations in WARP3D are accomplished with only a dozen
loops/directives of the type shown above.
The introduction of Cray* vector-based computers motivated the development of the blocking
approach described above in finite element codes first created by the U.S. national laboratories.
As these vector-based computers became extinct, the codes evolved to process element blocks
in parallel using threads and then across MPI ranks, with blocking retained as a means to tune for
effects of various cache architectures.
3A Conventional AoS scheme 3B Blocked SoA scheme
3 Example vectorization of blocked arrays
An example here illustrates essential features and performance of the conventional AoS and blocked SoA approaches. A sequence of matrix multiplications updates an array for each element in the model. For a single element, matrices Be, De, and Ke in this illustration have sizes 6x60, 6x6, and 60x60 (all doubles and dense). Figure 2B shows the blocked storage scheme, where the large arrays are allocatable components of a Fortran derived type. In models having different element types, and thus matrix sizes, WARP3D creates homogeneous blocks with all elements of the same type. Updates to Ke have the form: Ke ← Ke + transpose(Be) De Be. This update occurs 300M to 1B times during various simulations of the flange (De and Be also change for every update). In this example, De and Ke are symmetric, thus reducing computation to only lower-triangle terms.
Figure 3A shows essential code for the conventional AoS scheme. The computational outer loop
cycles sequentially over all elements in the model with different versions of the called subroutine
for symmetry, asymmetry, etc., while inner loops in the subroutine cycle over the much smaller
dimensions of a single element matrix. The Fortran intrinsic matmul at line 19 performs the first
matrix product, with loops over the comparatively small element arrays to compute final updates
for only the lower-triangle terms of Ke
. The compiler unrolls the k=1,6 inner loop at lines 23 to 25
and appears to replace matmul with equivalent source code which is then compiled—based on
the extensive messages about loop interchanges in the annotated optimization report issued at
line 19. [Editor’s note: See “Vectorization Opportunities for Improved Performance with Intel®
AVX-512” in The Parallel Universe issue 27 for advice on generating and interpreting compiler
optimization reports.]
Code for the blocked SoA scheme in Figure 3B runs computational loops in this order:
1. Over all blocks (not shown)
2. Over dimensions of a single element matrix
3. The innermost loops over the number of elements (span) in the block
The first set of loops at lines 10 to 28 computes the product db = De Be in a provided workspace for all elements in the block. A straightforward implementation would specify a four-level loop structure. Here, the two innermost loops have manual unrolling that essentially forces the compiler to vectorize the generally longer i=1,span loop. Other arrangements explored for these loops increase runtimes; the compiler does not know the iteration count for each loop, which limits the effectiveness of loop reordering. The second set of loops at lines 30 to 42 performs the update Ke ← Ke + transpose(Be) db. Here, utsz=1820 with index vector icp to map lower-triangle terms for an element onto columns of the blocked Ke. Again, the innermost loop i=1,span has been manually unrolled for vectorization.
Both sets of loops in this routine, lines 10 to 28 and 30 to 42, have only unit stride array
references and long (span) lengths set up for vectorization, a characteristic feature throughout
the WARP3D code. These are generally the most time-consuming loops for element-level
processing in WARP3D; multiple redesigns of the loops and array organizations have occurred
over the years with the evolution of compiler optimization and processor capabilities.
Figure 4 summarizes the relative execution performance for both approaches running on a single
thread, with the AoS scheme assigned a unit value. Numbers in parentheses indicate the size for
arrays in the compute routines. Relevant compiler options are: ifort (17.0.2) with
-O3 -xHOST -align array64byte -ftz. The computer has dual Intel® Xeon® E5-2698 v3
processors (with AVX2) with no other user software running during the tests. This processor has
cache sizes L1 = 32 KB/core, L2 = 256 KB/core, and L3 = 40 MB shared. Driver code and compute
routines reside in separate files to prevent inlining effects. The runtimes include efforts to call the
computation routines but not driver setup times. The driver program allocates thousands of array
sets for both AoS and SoA testing. On each call, the driver passes a randomly selected set to the
compute routine, in all cases processing 32 M elements. Multiple runs, each completing in two to
four minutes, reveal a 1 to 2 percent variability in measured CPU time.
4 Relative execution performance for both approaches running on a single thread
Only a single result is needed for the AoS scheme; there are no adjustable parameters. The
number of elements per block (span) varies in the SoA scheme with a compensating number
of blocks to maintain the identical number of elements processed in each case. The SoA results
show an expected strong effect of block size driven by the interaction of vector processing
lengths and L1, L2, and L3 cache utilization. Runtimes reach a minimum of 37 percent of the AoS
time for block sizes of 64 and 32 elements. At 16 elements/block, runtimes begin to increase. For
all block sizes, the arrays fit within the L3 cache; the much smaller, single-element arrays of AoS
almost fit within the L2 data cache.
Figure 5 shows output from the roofline analysis performed by Intel Advisor 2017 with results
from the key cases manually superposed onto a single plot for comparison. Note the log scale on
both axes. The roofline analysis with the graphical tools as implemented in Intel Advisor provides
a convenient mechanism to screen the performance of key loops across large codes (note the
hamburger menu in upper-right corner). Pointer movement over the symbols creates pop-ups
with loop location in source code, GFLOPS, and floating-point operations per byte of L1 cache.
Clicking on the symbol brings up source code for the loop annotated with additional performance
information (e.g., if peel and remainder loops were executed for the loop).
5 Roofline analysis results
During the inner loop of the Ke update at lines 33 to 41, the best performance of 8.08 GFLOPS for block size (span) = 64 resides effectively on the roofline for the L2 cache but well below the peak rate for Fused Multiply-Add (FMA) vector instructions, which are the kernel for this loop. The inner loop of the De Be computation at lines 11 to 27 gains a surprising 3.7x performance boost when the block size decreases from 1024→128, but shows no further gain for smaller blocks. The longest-running loop at lines 33 to 41 gains a 2x performance increase for block size reductions from 1024→64.
As expected from the timing results (Figure 4), GFLOPS and cache utilization for AoS shown by
the blue filled and outlined squares are smaller than those for the blocked SoA. Replacement of
code in the routine with the direct ek = ek + matmul(transpose(b), matmul(d,b))
increases runtime by 30 percent above the version in Figure 3A.
The best performance of 8.08 GFLOPS relative to the double-precision FMA peak of 51.4
GFLOPS shown on the roofline derives from the small number of floating point operations per
memory access. This is a well-known characteristic of element-level computations inherent in
the finite element method. Consider the key inner loop at lines 33 to 41, which has 12 double
operations (6 + and 6 *) and 13 references to double-precision array terms. The compute
intensity per byte is then 12/(13*8) ≈ 0.11, the same value reported in the roofline analysis
as FLOP/L1-byte. This value remains unchanged with block size. For reference, the figure also
indicates the best performing routine in MKL BLAS as displayed by the roofline analysis during
equation solving in WARP3D executions (single thread).
A few final points from the benchmarks:
•• The roofline tab for Code Analytics lists FMA as the only “trait” for the loop at lines 33 to 41, an 81 percent vectorization efficiency (3.24x gain over scalar), and a vector length of 4 doubles (AVX2). Further, the roofline tab for Memory Access Patterns confirms stride-1 access of all arrays for both inner loops in the routine.
•• Memory alignment. The compile process uses the option -align array64byte, needed for best performance on processors with AVX-512 instructions. Thirty-two-byte
alignment suffices for the AVX2 capable processor used here. Nevertheless, with the separate
compilation of the driver.f and routines.f files representative of build processes, the compiler
is unaware of the actual alignment of the formal array arguments to the subroutine bdbtSoA.
These adjustable arrays in the Fortran context guarantee to the compiler that array elements
are contiguous in memory, but not necessarily with a specific alignment. The annotated source
listing shows unaligned access to all arrays as expected.
Experiments with source code directives of the type !DIR$ ASSUME_ALIGNED
b(1,1,1):64, d(1,1,1):64, ... or, separately, !DIR$ VECTOR ALIGNED before
the two inner loops yield no measurable improvement in performance. The compiler listing
indicates aligned access for all arrays.
The explanation may lie in the Intel Advisor messages that so-called peel and remainder
loops, used to execute separately the initial loop iterations until access becomes aligned, are
not executed here even without source code directives. At runtime, the alignment of arrays
before entering the loop is easily determined to be 64 bytes in all cases here from the -align
array64byte option in compiling driver.f.
•• Unit stride. A simple experiment demonstrates the critical significance of stride-1 array
references on performance (at least on this processor). A switch in design/use of array
ek_symm(span,utsz) to ek_symm(utsz,span) with ek_symm(i,j) changed to
ek_symm(j,i) at line 34 reduces performance from 8.08 GFLOPS to 2.39 GFLOPS.
Summary
Vectorization offers potential speedups in codes with significant array-based computations—
speedups that amplify the improved performance obtained through higher-level, parallel
computations using threads and distributed execution on clusters. Key features for vectorization
include tunable array sizes to reflect various processor cache and instruction capabilities and
stride-1 accesses within inner loops.
The importance of vectorization to increase performance will continue to grow as hardware designers extend the width of vector registers to eight doubles (and hopefully more) on emerging processors to overcome plateauing clock rates and thread scalability.
Intel Advisor, especially the new roofline analysis capability, provides relevant performance
information and recommendations implemented in a convenient GUI at a level suitable for code
developers with differing levels of expertise.
Acknowledgments
The author gratefully acknowledges the many, key contributions of his former graduate
students, postdocs, and other collaborators worldwide toward making WARP3D a useful code
for researchers. See warp3d.net for a complete list and sponsors. These latest efforts toward
improving code performance are in collaboration with Drs. Kevin Manalo and Karen Tomko,
supported by the Intel® Parallel Computing Center at the Ohio Supercomputer Center.
References
“Intel Advisor Roofline Analysis,” The Parallel Universe, Issue 27, 2017.
“Vectorization Opportunities for Improved Performance with Intel® AVX-512,” The Parallel
Universe, Issue 27, 2017.
“Explicit Vector Programming in Fortran,” Corden, Martyn (Intel), January 2016.
About the author
Robert H. Dodds Jr., PhD, NAE, is the M.T. Geoffrey Chair (Emeritus) of Civil Engineering at
the University of Illinois at Urbana–Champaign. He currently is a research professor in the
Department of Civil Engineering, University of Tennessee, Knoxville.
TRY THE VECTORIZATION TOOLS IN INTEL® PARALLEL STUDIO XE ›
Building Fast Data Compression Code for Cloud and Edge Applications
How to Optimize Your Compression with Intel® Integrated Performance Primitives (Intel® IPP)
Finding efficient ways to compress and decompress data is more important than ever. Compressed
data takes up less space and requires less time and network bandwidth to transfer. In cloud service
code, efficient compression cuts storage costs. In mobile and client applications, high-performance
compression can improve communication efficiency―providing a better user experience.
However, compression and decompression consume processor resources. And, especially in data-
intensive applications, this can negatively affect the overall system performance. So an optimized
implementation of compression algorithms plays a crucial role in minimizing the system
performance impact. Intel® Integrated Performance Primitives (Intel® IPP) is a library that
Chao Yu, Software Technical Consulting Engineer, Intel Corporation,
and Sergey Khlystov, Senior Software Engineer, Intel Corporation
contains highly optimized functions for various domains, including lossless data compression.
In this article, we’ll discuss these functions and the latest improvements. We will examine their
performance and explain how these functions are optimized to achieve the best performance.
We will also explain how applications can decide which compression algorithm to use based on
workload characteristics.
Intel® IPP Data Compression Functions
The Intel IPP Data Compression Domain provides optimized implementations of the common
data compression algorithms, including BZIP2*, ZLIB*, LZO*, and a new LZ4* implementation, which will be available in the Intel IPP 2018 Update 1 release. These implementations provide “drop-in” replacements for the original compression code, so moving existing data compression code to Intel IPP-optimized code is easy.
These data compression algorithms are the basic functions in many applications, so a variety of
applications can benefit from Intel IPP. A typical example is Intel IPP ZLIB compression. ZLIB is
the fundamental compression method for various file archivers (e.g., gzip*, WinZip*, and PKZIP*),
portable network graphics (PNG) libraries, network protocols, and some Java* compression
classes. Figure 1 shows how these applications can adopt the optimized ZLIB in Intel IPP. Intel
IPP provides fully compatible APIs. The application simply needs to relink with the Intel IPP
ZLIB library. If the application uses ZLIB as a dynamic library, it can be easily switched to a new
dynamic ZLIB library built with Intel IPP. In the latter case, relinking the application is not required.
1 Intel® IPP provides drop-in optimization for ZLIB* functions
In recent releases, the Intel IPP data compression domains extended several functionalities and
increased compression performance:
•• Support for LZ4 data compression will be included in Intel IPP 2018 beta update 1, available soon.
•• Increased LZO compression performance by 20 to 50 percent. Added level 999 support in LZO
decompression.
•• A new fastest ZLIB data compression mode, which can improve performance by up to 2.3x compared
to ZLIB level 1 performance.
•• Trained Huffman tables for ZLIB compression let users improve compression ratio in the fastest mode.
Intel IPP Data Compression Optimization
How does Intel IPP data compression perform the optimization and achieve top performance?
Intel IPP uses several different methods to optimize data compression functionality. Data
compression algorithms are very challenging for optimization on modern platforms due to
strong data dependency (i.e., the behavior of the algorithm depends on the input data). Basically,
it means that only a few new CPU instructions can be used to speed up these algorithms. The
single instruction, multiple data (SIMD) instructions can be used here in limited cases, mostly
pattern searching.
Performance optimization of data compression in Intel IPP takes place in different ways and at
different levels:
•• Algorithmic optimization
•• Data optimization
•• New CPU instructions
•• Intel IPP data compression performance
Algorithmic Optimization
At the highest level, algorithmic optimization provides the greatest benefit. The data
compression algorithms are implemented from scratch in Intel IPP to process input data and
to generate output in the most effective way. For example, selecting proper starting conditions
for input data processing can bring a performance gain from the very beginning. Let’s look at
the operation of searching for patterns in input data. If you set a minimum match length of two
to three bytes, you will detect more matches―and obtain a better compression ratio. On the
other hand, you will drastically increase input data processing time. So, the minimum match
length should balance compression ratio and compression performance. This is mostly the
result of statistical experiments and then careful planning of condition checks and branches in
the code. In all cases, it is more effective to put checks for the most probable situations at the
beginning of your code so that you will not have to spend time on conditional operations with a
low probability of occurring.
Data Optimization
Careful planning of internal data layout and size also brings performance benefits. Proper
alignment of data makes it possible to save CPU clocks on data reads and writes. In most CPUs,
reading/writing executes faster on aligned data than on nonaligned data. This is especially true
when memory operations are executed on data elements longer than 16 bits. Another factor
affecting optimization is data size. If we keep internal data structures―arrays, tables, and lists―
small, we can often reuse the CPU data cache, increasing overall performance.
New CPU Instructions
The Intel® Streaming SIMD Extensions 4.2 (Intel® SSE 4.2) architecture introduced several CPU
instructions, which can be used in data compression algorithms according to their nature. More
opportunities came with the Intel® Advanced Vector Extensions 2 instruction set. First, the CRC32
instruction can be used to compute the checksum of input data for data integrity. It is a key part
of many data compression standards. The CRC32 instruction can also be used to implement a
hash function, another key part of compression algorithms. Also, bit manipulation instructions
(BMI) are used to speed up pattern matching functions. Finally, reading/writing data by 32-, 64-,
128-, or 256-bit portions saves memory bus cycles, which also improves performance.
Intel IPP Data Compression Performance
Each Intel IPP function includes multiple code paths, with each path optimized for specific
generations of Intel® and compatible processors. Also, new optimized code paths are added
before future processors are released, so developers can just link to the newest version of Intel
IPP to take advantage of the latest processor architectures.
Figure 2 shows performance results with the Intel IPP ZLIB compression/decompression library.
Compared with the open-source ZLIB code, the Intel IPP ZLIB implementation can achieve a 1.3x
to 3.3x performance improvement at different compression levels. At the same time, Intel IPP
provides fully compatible APIs and compression results.
Configuration info – Software versions: Intel® Integrated Performance Primitives (Intel® IPP) 2017, Intel® C++ Compiler 16.0.
Hardware: Intel® Core™ processor i7-6700K, 8 MB cache, 4.2 GHz, 16 GB RAM, Windows Server* 2012 R2. Software and
workloads used in performance tests may have been optimized for performance only on Intel® microprocessors.
2 Performance results with the Intel® IPP ZLIB* compression and decompression library
Choosing the Compression Algorithms for Applications
Since data compression is a trade-off between compression performance and compression ratio,
there is no “best” compression algorithm for all applications. For example, the BZIP2 algorithm achieves a good compression ratio, but because it is more complex, it requires significantly more CPU time for both compression and decompression.
Figure 3 is a summary of the common Intel IPP functions on compression performance and ratio.
Among these algorithms, LZ4 is the fastest compression method―about 20x faster than BZIP2 on standard compression corpus data―while BZIP2 achieves the best compression ratios. ZLIB provides a balance between performance and efficiency, so the algorithm is chosen
by many file archivers, compressed network protocols, file systems, and lossless image formats.
3 Common Intel® IPP functions and compression ratio
How should an application choose a compression algorithm? Often, the decision involves a number of factors:
•• Compression performance to meet the application’s requirement
•• Compression ratio
•• Compression latency, which is especially important in some real-time applications
•• Memory footprint to meet some embedded target application requirements
•• Conformance with standard compression methods, enabling the data to be decompressed by any
standard archive utility
•• Other factors as required by the application
Deciding on the compression algorithms for specific applications requires balancing all of these
factors, and also considering the workload characteristics and requirements. There is no “best”
compression method for all scenarios.
Consider a typical data compression usage example. Suppose we have a file download scenario
that requires transferring data from servers to a client application. To maximize performance,
lossless data compression is performed on the servers before transmission to the client. In
this situation, the client application needs to download the data over the network and then
decompress it. Because overall performance depends on both transmission and decompression
performance, we need to carefully consider workload and network characteristics. Some high-
compression algorithms, like BZIP2, could reduce data transmission. On the other hand, BZIP2
decompression takes longer to compute than other, less sophisticated methods.
Figure 4 shows the overall performance of the common Intel IPP compression algorithms at
different network speeds. Tests were run on the Intel® Core™ i5-4300U processor. When the
code runs on some lower-bandwidth networks (e.g., 1 Mbps), an algorithm that can achieve high
compression ratios is important to reduce the data/time for transmission. In our test scenario,
Intel IPP BZIP2 achieved the best performance (approximately 2.7x speedup) compared to
noncompressed transmission. However, when the data is transferred by a higher-performance
network (e.g., 512 Mbps), LZ4 had the best overall performance, since the algorithm is better
balanced for compression and transmission time in this scenario.
Configuration info – Software versions: Intel® Integrated Performance Primitives (Intel® IPP) 2018 Beta Update 1, Intel® C++
Compiler 16.0. Hardware: Intel® Core™ processor i7-6700K, 8 MB cache, 4.2 GHz, 16 GB RAM, Windows Server* 2012 R2.
Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors.
4 Data transfer with and without compression
Optimized Data Compression
Data storage and management are becoming high priorities for today’s data centers and edge-
connected devices. Intel IPP offers highly optimized, easy-to-use functions for data compression.
New CPU instructions and an improved implementation algorithm enable the data compression
algorithms to provide significant performance gains for a wide range of applications and
architectures. Intel IPP also offers common lossless compression algorithms to serve the needs
of different applications.
How to Use the MPI-3 Shared Memory in Intel® Xeon Phi™ Processors
BY LOC Q NGUYEN (INTEL) ›
MPI-3 shared memory is a feature introduced in version 3.0 of the message passing interface (MPI)
standard. It is implemented in Intel® MPI Library version 5.0.2 and beyond. MPI-3 shared memory
allows multiple MPI processes to allocate and have access to the shared memory in a compute node.
For applications that require multiple MPI processes to exchange huge local data, this feature reduces
the memory footprint and can improve performance significantly.
This white paper introduces the MPI-3 shared
memory feature, the corresponding APIs, and
a sample program to illustrate the use of MPI-3
shared memory in the Intel® Xeon Phi™ processor.
Blog Highlights
Read more ›
Learn more about Intel® Integrated Performance Primitives ›
MySQL* Optimization with Intel® C++ Compiler
How LeCloud Leverages MySQL* to Deliver Peak Service
LeCloud is a leading video cloud service provider that deploys its content delivery network
(CDN) nodes in more than 60 countries and regions, with 20 terabytes per second (Tbps) of
available bandwidth to support its cloud services, including cloud-on-demand, content sharing,
distribution and live broadcasting of commercial events, virtual reality, and more. LeCloud
platforms support more than a million live events per year, hundreds of millions of device
accesses a day, and millions of concurrent users per second. With the heavy demand for its high-
performance public database services, LeCloud uses MySQL*, a popular open-source database,
as the basis for its cloud service.
Huixiang Tao, Database Chief Architect, LeCloud Computing Co., Ltd.; Ying Hu, Software Technical Consulting
Engineer, Intel Corporation; Ming Gao, Industry Technical Specialist, Intel Corporation
One key challenge for LeCloud is optimizing MySQL database performance, which is also an
important topic for many database administrators (DBAs) and operations and maintenance IT
developers. Key
performance concerns include:
• Queries per second (QPS)
• Transactions per second (TPS)
• Query response time
This article shows how we optimized MySQL using Intel® C++ Compiler and its Interprocedural
Optimization (IPO) capability. Without having to modify any code, we compiled the MySQL
source code and tested the performance indicators. The result? By using Intel C++ Compiler
optimizations on MySQL in a set of test scenarios, performance improved between 5 and 35
percent, bringing developers another avenue to better MySQL performance.
Intel C++ Compiler
Intel C++ Compiler, a component of Intel® Parallel Studio XE, is a C and C++ optimizing compiler
that takes advantage of the latest instruction sets and architectural features to maximize
performance. Because the Intel® compiler development team knows the Intel® architecture so
well, it can apply more effective, specialized optimizations for CPU features, such as new SIMD
instructions and cache structure, than other compilers like GNU GCC* or Microsoft Visual C++*.
Some of the optimization technologies used in the present case study
include:
• Automatic vectorization
• Guided auto parallelism
• Interprocedural optimization
• Profile-guided optimization
The Intel compilers help users get the most benefit out of their Intel®-based platforms.
Automatic Vectorization
The automatic vectorizer (also called the auto-vectorizer) is a component of the Intel compiler
that automatically uses SIMD instructions in the Intel® Streaming SIMD Extensions (Intel® SSE,
Intel® SSE2, Intel® SSE3, and Intel® SSE4), Supplemental Streaming SIMD Extensions (SSSE3)
instruction sets, and the Intel® Advanced Vector Extensions (Intel® AVX, Intel® AVX2, and
Intel® AVX-512) instruction sets. The vectorizer detects operations in the program that can be
done in parallel and converts them from sequential to parallel form; depending on the data type,
a single SIMD instruction can process up to 16 elements at once. The compiler also supports a
variety of auto-vectorizing hints that
can help the compiler generate effective vector instructions on the latest processors, including
the Intel® Xeon® E5-2699 v4 used in this case study.
Guided Auto Parallelism
The Guided Auto Parallelism (GAP) feature of the Intel C++ Compiler offers advice to improve the
performance of sequential applications by suggesting changes that will take advantage of the
compiler’s ability to automatically vectorize and parallelize code as well as improve the efficiency
of data operations.
Interprocedural Optimization (IPO)
This automatic, multistep process allows the compiler to analyze code to find interprocedural
optimizations (i.e., optimizations that go beyond individual program subunits) within source files
and across multiple source files. IPO is covered in more detail below, as it is a key component of
this case study.
Profile-Guided Optimization (PGO)
In PGO, the compiler instruments the code, gathers profile information while the application
runs, and takes advantage of that information during subsequent recompilation. This improves
application performance by
shrinking code size, reducing branch mispredictions, and reorganizing code layout to minimize
instruction cache problems.
MySQL Optimization
Compiling with IPO
IPO is a key optimization technique in the Intel C++ Compiler. It does code profiling and static
topological analysis based on both single-source and multisource files, and then implements
specific optimizations like inlining, constant propagation, and dead function elimination for
programs that contain many commonly used small- and medium-sized functions. Figure 1 shows
the IPO process.
When you compile your source code with the IPO option, for single-file compilation, the compiler
performs inline function expansion for calls to procedures defined within the current source file.
For multifile IPO, the compiler may perform some inlining according to the multisource code, such
as inlining functions marked with inlining pragmas or attributes (GNU C and C++) and C++ class
member functions with bodies included in the class declaration. After each source file is compiled
with IPO, the compiler stores an intermediate representation of the source code in mock object files.
When you link with the IPO option, the compiler is invoked a final time to perform IPO across all
mock object files. During the analysis process, the compiler reads all intermediate representations
in the mock object files, object files, and library files to determine whether all references are resolved and
whether a given symbol is defined in a mock object file. Symbols included in the intermediate
representation in a mock object file for both data and functions are candidates for manipulation
based on the results of whole program analysis.
Figure 1. IPO compilation process
Building MySQL with the Intel Compiler
The latest official version of MySQL (v5.6.27) was used in this case study. This section describes
how to use Intel C++ Compiler to compile MySQL on Linux*.
Download the MySQL installation package:
wget http://downloads.mysql.com/archives/get/file/mysql-5.6.27.tar.gz
tar -zxvf mysql-5.6.27.tar.gz
• Compile MySQL with Intel C++ Compiler (Figure 2):
1. Install the build prerequisites: yum -y install wget make cmake gcc gcc-c++
autoconf automake zlib* libxml2* ncurses-devel libmcrypt*
libtool-ltdl-devel*
2. Compile MySQL with Intel C++ Compiler: Set CC to icc, CXX to icpc, and enable IPO
with the -ipo option.
(Figure 1 flow: source file → IPO compilation → .o file with IL information → linking with IPO → executable file)
Figure 2. Compiling MySQL* with Intel® C++ Compiler
• Create the MySQL grant system tables:
cd /usr/local/mysql-5.6.27-icc
groupadd mysql
useradd -M -g mysql -s /sbin/nologin mysql
chown -R mysql .
chgrp -R mysql .
./scripts/mysql_install_db --user=mysql --collation-server=utf8_general_ci
Performance Testing
The purpose of this analysis is to do comparative performance testing between MySQL built with
the GNU and Intel compilers. The performance metrics for online transaction processing (OLTP)
include queries per second (QPS) and response time (RT). The test tool was Sysbench* (v0.4.12),
a modular, cross-platform, multithreaded benchmarking tool that is commonly used to evaluate
database performance. Tables 1 and 2 show the test environment.
Table 1. Hardware environment
Server: Dell PowerEdge* R730xd
CPU: Intel® Xeon® processor E5-2699 v4 @ 2.20 GHz × 2
Memory: 128 GB
Disk: SATA* SSD S3510, 1.5 TB × 1
Table 2. Software environment
OS version: CentOS* 6.6
Kernel: 2.6.32-926
Intel® C++ Compiler: ICC 17.0.1
GNU GCC*: 4.4.7 20120313 (Red Hat* 4.4.7-11)
MySQL*: MySQL 5.6.27
Sysbench*: Sysbench 0.4.12
Test Steps
We installed two MySQL instances, listening on ports 3308 and 3318, on the same Intel® SATA*
SSD disk and used Sysbench for OLTP testing (Table 3). Before testing, we cleaned the operating
system cache and disabled the MySQL query cache. We created 10 MySQL database tables for
testing, with one million records per table. Tests used 4, 8, 16, 32, 64, 128, 256, and 512 threads
for random-read OLTP. The MySQL buffer_pool size was set to 2 GB, 8 GB, and 16 GB. Each test
was run three times, and the average was used as the final result.
Table 3. Test steps
Instance | MySQL* version | Compiler | Engine | Query cache
MySQL 3308 | 5.6.27 | GCC* | InnoDB | Disabled
MySQL 3318 | 5.6.27 | Intel® C++ Compiler | InnoDB | Disabled
Performance Testing Results
QPS
The results of the Sysbench QPS test on MySQL compiled with the Intel and GNU compilers are
shown in Table 4 and Figure 3. The number of threads and the MySQL buffer_pool size were
varied. Across the board, MySQL compiled with the Intel compiler gives significantly higher QPS,
especially as the number of threads increases. We see a 5 to 35 percent improvement in QPS for
MySQL compiled with the Intel C++ Compiler (Figure 3).
Table 4. Queries per second (higher is better)
Threads GCC_2GB ICC_2GB GCC_8GB ICC_8GB GCC_16GB ICC_16GB
4 1036.64 1184.88 1406.89 1598.23 1402.00 1592.23
8 1863.62 2154.04 2556.77 2882.07 2541.71 2856.27
16 3394.35 4013.79 4949.17 5444.82 4915.52 5468.59
32 5639.77 6625.03 9036.91 9599.00 9034.98 9570.54
64 6975.84 9510.31 12796.34 14680.04 12839.79 14662.85
128 6975.05 9465.03 12410.70 14804.35 12391.96 14744.11
256 6859.37 9367.19 11586.47 14083.09 11574.40 13887.13
512 6769.54 9260.24 10904.25 12290.89 10893.18 12310.28
Average Response Time
The results of the Sysbench average response time (RT) test on MySQL compiled with the Intel
and GNU compilers are shown in Table 5 and Figure 4. RT is in milliseconds. The number of
threads and the MySQL buffer_pool size were varied as in the QPS test. Once again, MySQL
compiled with the Intel compiler gives superior performance, especially at higher numbers of
threads.
Table 5. Average response time (in milliseconds, lower is better)
Threads GCC_2GB ICC_2GB GCC_8GB ICC_8GB GCC_16GB ICC_16GB
4 3.86 3.38 2.84 2.50 2.85 2.51
8 4.29 3.71 3.13 2.77 3.15 2.80
16 4.71 3.98 3.23 2.94 3.25 2.93
32 5.67 4.83 3.54 3.33 3.54 3.34
64 9.17 6.73 5.00 4.36 4.98 4.36
128 18.35 13.52 10.31 8.64 10.33 8.68
256 37.32 27.33 22.09 18.18 22.12 18.45
512 75.63 55.29 46.95 41.65 47.00 41.59
Figure 3. Queries per second for Intel® C++ Compiler versus GNU GCC*
Figure 4. Response time, Intel® C++ Compiler versus GNU GCC*
Conclusions
Maximizing MySQL performance is essential for DBAs, operators, and developers trying to reach
a performance goal. This study demonstrates that compiling MySQL with the Intel C++ Compiler
can significantly improve database performance. We found that the Intel C++ Compiler improved
performance by 5 to 35 percent, with an average improvement of about 15 percent.
There are many factors that affect MySQL performance, including the MySQL configuration,
the CPU, and the SSD. For example, the Intel® Xeon® processor E5-2620 v3 was used initially,
but upgrading to the Intel® Xeon® processor E5-2699 v4 improved performance. Using a faster
SSD, such as the Intel® SSD DC P3700 instead of the Intel® SSD DC S3510 used in this article,
should also further improve performance.
Try the Intel® C++ Compiler, part of Intel® Parallel Studio XE ›
Accelerating Linear Regression in R* with Intel® DAAL
Make Better Predictions with This Highly Optimized Open Source Package
Linear regression is a classic statistical algorithm that is used to predict an outcome by leveraging
information from previously collected observations. Many businesses make predictions from
such data and monetize the results; common examples include forecasting
sales numbers in a new quarter and predicting credit risk given spending patterns. Intel® Data
Analytics Acceleration Library (Intel® DAAL) contains a repertoire of tuned machine learning
algorithms optimized for use on Intel® processors. In this article, we leverage the power of
tuned linear regression in Intel DAAL to accelerate linear regression in R*, a software language
for statistical algorithms and analysis. A preliminary performance analysis indicates that linear
regression in Intel DAAL shows better system utilization with higher arithmetic intensity. Our
experiments demonstrate that incorporating Intel DAAL in linear regression in R speeds up model
training by up to 300x and model inference by up to 373x.
Steena Monteiro, Software Engineer, Intel Corporation, and
Shaojuan Zhu, Technical Consulting Engineer, Intel Corporation
Overview of Linear Regression
Linear regression¹ is a classic statistical algorithm used for predicting values of a dependent
variable. Simple linear regression predicts a single response (dependent variable, y) using a single
independent variable (x) and takes the following form:
y = βx + ε
where β is the coefficient (slope) of the regression line, and ε is the intercept.
Given a set of points, the goal of a linear regression model is to derive a line that will minimize
the squared error of the response variable. Real-world data sets contain multiple independent
variables (or features) to describe the domain they represent. An extension of simple linear
regression—multiple linear regression—is used to represent a multidimensional problem space.
Multiple linear regression is represented as:
y = αx₁ + βx₂ + γx₃ + … + ηxₙ + ε
where xᵢ are feature variables, and α, β, γ, and η are the corresponding coefficients.
In this article, we use linear regression to mean multiple linear regression.
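For the single-feature case, the least-squares solution has a closed form: β = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and ε = ȳ − βx̄. The sketch below (synthetic data, invented names) recovers the slope and intercept from noise-free points:

```python
def fit_simple(xs, ys):
    """Least-squares fit of y = beta*x + eps, returning (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    beta = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
         / sum((x - mean_x) ** 2 for x in xs)
    eps = mean_y - beta * mean_x
    return beta, eps

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]      # exact line: beta = 2, eps = 1
beta, eps = fit_simple(xs, ys)
print(beta, eps)  # → 2.0 1.0
```

Multiple linear regression generalizes this to solving the normal equations over all features, which is where tuned libraries like Intel DAAL earn their keep.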
Analyzing Arithmetic Intensity of Linear Regression in Intel DAAL and R
An algorithm’s utilization of a system’s floating point and bandwidth capability can be evaluated
by calculating the application’s arithmetic intensity. Arithmetic intensity² is defined as the ratio
of floating point operations (FLOPs) performed by the application to the bytes used during the
application’s execution. Arithmetic intensity that is close to the theoretical arithmetic intensity
of the system indicates better system utilization. Given the high-performance potential of Intel
DAAL, we examine the arithmetic intensity of linear regression in R and Intel DAAL through
the Intel® Software Development Emulator (Intel® SDE).³ We measure FLOPs and bytes across
We measure FLOPs and bytes across
three data sets synthesized to fit into the L1 cache (32 KB), L2 cache (256 KB), and L3 cache (20 MB).
(Table 1 shows the specs of the node used in our experiments.) Based on our experiments, the
arithmetic intensity of linear regression in Intel DAAL is 2.4x, 1.4x, and 1.4x greater than that of
linear regression in R, across the three data sets. Encouraged by these numbers, we proceed with
improving performance of linear regression in R using Intel DAAL.
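For intuition (this example is not from the article), consider a dot product of two double-precision vectors of length n: roughly 2n FLOPs (one multiply and one add per element) against 16n bytes of input traffic, for an arithmetic intensity of 0.125 FLOPs/byte:

```python
def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte of data moved."""
    return flops / bytes_moved

n = 1_000_000            # vector length (illustrative)
flops = 2 * n            # one multiply + one add per element
bytes_moved = 2 * n * 8  # two float64 input streams
print(arithmetic_intensity(flops, bytes_moved))  # → 0.125
```

Such a low ratio means the kernel is bandwidth-bound on most machines; a higher arithmetic intensity, as measured for Intel DAAL above, indicates the implementation does more useful work per byte it touches.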
Table 1: Haswell node specifications
Cores: 8 per socket
Sockets: 1
Frequency: 2.30 GHz (3.6 GHz Turbo)
L1: 32 KB
L2: 256 KB
L3: 20,480 KB
Table 2: Linear regression data sets from the University of California–Irvine (UCI) machine learning repository⁴
Data set | Rows (instances) | Columns (features) | Size (MB) | Predicted response
Airfoil self-noise | 1,503 | 5 | 0.06 | Sound pressure level
Parkinsons | 5,875 | 19 | 0.83 | Clinician’s Parkinson’s disease symptom score
Naval propulsion plants | 11,934 | 13 | 1.24 | Gas turbine compressor decay state coefficient
Poker | 1,000,000 | 10 | 80 | Card suit
Year prediction million songs database | 515,345 | 90 | 371.04 | Prediction of song release year
Linear Regression API in R
R is a software environment and a language that comes with a host of prepackaged algorithms for
data manipulation, classical statistics, graphics, bioinformatics, and machine learning.⁵ R provides
linear regression through the lm function.⁶ The lm function is used to train a linear regression
model, and a predict function is used to make inferences on new data.
Linear Regression in Intel DAAL
Linear regression is one of the two regression algorithms in Intel DAAL.⁷ Linear regression in Intel
DAAL can be executed in three different data processing modes—batch, online, and distributed.
The batch mode can be thought of as a serial mode of implementation, where all the data is
processed sequentially. In the online mode, the algorithm trains data in sets of blocks as data
becomes available. In the distributed mode, linear regression in Intel DAAL can be trained and
tested across multiple compute nodes. Intel DAAL provides a compute method through the
Algorithm class that is used for linear regression training and testing. In its native form, linear
regression in R can only be executed in serial. For very big data sets that do not fit into memory, R
will not work. However, Intel DAAL makes it possible to work on large data sets through its online
or streaming mode. For this article, we use linear regression in batch mode to accelerate the
algorithm in R. We use the QR method for linear regression in both R and Intel DAAL.
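The idea behind the online mode can be sketched in miniature: for least squares it suffices to accumulate a few running sums per block of data and solve once at the end, so no block ever has to coexist in memory with the rest. The class below is an illustrative stand-in for streaming training, limited to one feature; it is not the Intel DAAL API.

```python
class OnlineSimpleRegression:
    """Accumulate sufficient statistics block by block; solve at the end."""
    def __init__(self):
        self.n = self.sx = self.sy = self.sxx = self.sxy = 0.0

    def partial_fit(self, xs, ys):
        # Each block only contributes to running sums, then can be discarded.
        for x, y in zip(xs, ys):
            self.n += 1
            self.sx += x
            self.sy += y
            self.sxx += x * x
            self.sxy += x * y

    def finalize(self):
        beta = (self.n * self.sxy - self.sx * self.sy) \
             / (self.n * self.sxx - self.sx ** 2)
        return beta, (self.sy - beta * self.sx) / self.n

model = OnlineSimpleRegression()
xs = list(range(100))
ys = [3.0 * x + 2.0 for x in xs]
for start in range(0, 100, 25):            # four blocks, as if streamed
    model.partial_fit(xs[start:start + 25], ys[start:start + 25])
print(model.finalize())  # → (3.0, 2.0)
```

The streamed result matches what a batch solve over all 100 points would give, which is why the online mode makes out-of-memory data sets tractable.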
Integrating Linear Regression in R Using Intel DAAL
We use Intel DAAL’s C++ API for our experiments on linear regression. The Rcpp⁸ package in
R provides a way to integrate C++ functionality into native R code. Rcpp enables data transfer
between R objects and Intel DAAL C++ code. To successfully integrate Intel DAAL with R’s linear
regression, we wrote Intel DAAL wrappers in C++ to load data from a file, train, and test a linear
regression model. For details on implementing these wrappers, please refer to Lightning-Fast
Machine Learning R* Algorithms.⁹
To monitor performance benefits, we time the lm and predict functions for linear regression
training and testing, respectively, in native R. For the Intel DAAL wrapper, we measure Intel DAAL’s
compute execution, which is the core of Intel DAAL’s linear regression training and testing.
Table 2 provides a brief description of the data sets we used in this study. We use 75 percent of
each data set for linear regression training, and the remaining 25 percent for testing.
Dimension Reduction: Detecting and Eliminating Multicollinearity
A stable linear regression model requires that predictive features in the data should be free from
multicollinearity. Features are said to be multicollinear when sets of features are highly correlated
and can be expressed as linear combinations of each other. Ignoring multicollinearity can result
in a model that makes misleading predictions. We carry out analysis to detect multicollinearity on
features in the naval propulsion plants data set. Intel DAAL provides a set of distance functions,
namely correlation_distance and cosine_distance, to measure correlations between
different feature variables. While exploring the data set, we noticed two features—starboard
propeller torque and port propeller torque—to be duplicates of each other. We use box plots¹⁰
to summarize the distribution of values inside these features. Figure 1 shows the minimum,
lower quartile, mean, upper quartile, and maximum values of both torques. The box plots are
replicas of each other. This means we can eliminate one of these features, since they are
highly correlated and their duplication does not produce additional insight. While examining
distributions of data across the features, we noticed two features—gas turbine compressor
inlet and compressor outlet pressure—to be uniformly distributed (Figures 2 and 3) and
consequently produce an unstable linear regression model. We thus eliminated three out
of the 16 features to train and test linear regression on the naval propulsion plant data set.
As a side note, another alternative would be to use ridge regression (included in Intel DAAL)
due to its capability of working with collinear predictors.
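The duplicate-feature check described above boils down to a pairwise correlation (Intel DAAL's correlation_distance measures a closely related quantity, roughly one minus the correlation coefficient). A stdlib sketch with invented feature columns:

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length columns."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / (sqrt(sum((x - ma) ** 2 for x in a))
                  * sqrt(sum((y - mb) ** 2 for y in b)))

# Invented columns: port torque is an exact duplicate of starboard torque
starboard_torque = [10.0, 12.5, 11.0, 13.2, 12.1]
port_torque      = [10.0, 12.5, 11.0, 13.2, 12.1]
shaft_speed      = [5.1, 7.9, 4.2, 9.0, 6.6]

r = pearson(starboard_torque, port_torque)
if r > 0.99:   # threshold is a modeling choice
    print("near-duplicate features: drop one")  # fires for the duplicate pair
```

A correlation near 1 flags a redundant column that can be dropped without losing information, exactly the situation found with the two torque features.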
Figure 1. Distribution of multicollinear features in the naval propulsion plants data set
Figure 2. Uniform distribution of gas turbine compressor inlet air pressure in the naval propulsion plants data set
Figure 3. Uniform distribution of gas turbine compressor outlet air pressure in the naval propulsion plants data set