Intel® Compilers Professional Editions

  • 783 views
Uploaded on

 

More in: Technology , Business
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Be the first to comment
    Be the first to like this
No Downloads

Views

Total Views
783
On Slideshare
0
From Embeds
0
Number of Embeds
0

Actions

Shares
Downloads
18
Comments
0
Likes
0

Embeds 0

No embeds

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
    No notes for slide
  • These are the required brand names for conventional Pentium M, Woodcrest, and Clovertown
  • Think Parallel or Perish. James Reinders, lead evangelist and Director of Marketing and Business Development for Intel Software Development Products, doubts that any developer can survive the next decade without learning to think about parallelism. For multi-core processors like the Intel Core 2 Duo, hardware is only part of the performance story. To take full advantage of the new processors directly in applications, developers will likely "thread" their applications so they run in parallel and take advantage of the multiple cores. That can be a daunting task for developers, but Intel has just made it a lot easier. James explains his theory, while making a disturbing claim: software developers have to learn to think differently. By James Reinders, October 12, 2006.

I jotted out a note: "Memo to self: The fastest way to unemployment is to not hone my programming skills for exploiting parallelism." Originally I was going to say, "Memo to software developers." Then I realized I was just avoiding the truth: this was a memo to me. I have some experience in this area already, and I may be ahead of the curve, but the rush to multi-core is hardly going to slow down to match my speed.

Multi-core changes the world. For software developers, it is both an opportunity and a headache. Why have we waited so long to move to parallelism? Simple: it is a change we have been able to avoid, one which requires us to think and act differently. What benefits does it have? I think of it this way: you can have all the performance you want if your program is ready for parallelism. All you have to do is buy more processor cores. You no longer have to wait for Moore's law to double your performance every 18 months; instead, just buy more processor cores. Of course, you can also wait for double the cores at the same price and get the benefits for free as time marches on. But you need to think and program "parallel" for this to be real.

Do we have a choice? Not really, but that doesn't mean it is bad. In fact, I'm quite convinced that the "not parallel" era will appear to be a very primitive time in the history of computers when people look back in a hundred years. The world works in parallel; it is about time for computer programs to do the same.

The exciting new software development products we introduced in August are unique in how they solve key problems others are not solving. They make developers' work significantly easier. Updates to two of our great tools, Intel Thread Checker and Intel Thread Profiler, and a really important new product that extends C++ for parallelism, the Intel Threading Building Blocks, will make it much easier for developers to realize significant gains with each new processor innovation. At Intel, we have been doing multiprocessor work for years and are arguably the best in the world at producing and debugging multiprocessor code. Now everyone needs to care, and with these new tools we have solutions to help.

We all have things to learn. But the day is not far off when we will entertain a younger generation of developers with hair-raising tales of how we optimized instructions to speed up the one thread of execution we had available to us. We will tell our tales, and they will be interesting history because no one will be thinking that way anymore. Within a decade, a programmer who does not think "parallel" first will not be a programmer. Programming in parallel is not intrinsically harder than what we do now, but it is different.

We need to think differently, and we need tools that support this thinking. The tools in this new world do not need to be radically different, but they need to address key problems related to abstraction of thread management and correctness verification. Decades ago, programmers looked to high-level languages to abstract machines so that Fortran, COBOL and C could replace assembly language programming. In time, abstraction grew to include C++, Java and C#, to name a few. Now we need to avoid thinking that parallelism is exploited through Windows threads and pthreads; instead, we need to look to OpenMP and Intel Threading Building Blocks. We also need to acknowledge the new challenges in debugging threaded applications: specifically, we need to eliminate the potential for data races and deadlock. This is why the Intel Thread Checker is so exciting; it is a tool to help developers do exactly that. We have learned a lot since version 1, so version 3 is greatly refined. I expect there is a lot of room for future innovation here, and development of threaded applications will continue to get easier and easier.
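To make the correctness problem concrete, here is a minimal sketch (not from the original article; it uses C++11 std::thread for brevity) of the kind of data race that tools like Intel Thread Checker are designed to flag: two threads increment a shared counter without synchronization, so updates can be lost.

#include <iostream>
#include <thread>

int counter = 0;  // shared and unprotected: this is the bug

void work() {
    for (int i = 0; i < 100000; ++i)
        ++counter;  // unsynchronized read-modify-write: a data race
}

int main() {
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    // Expected 200000; the race typically loses updates, so the
    // printed value varies from run to run.
    std::cout << "counter = " << counter << '\n';
    return 0;
}

Guarding the increment with a mutex (or making the counter atomic) removes the race; a race detector points at the offending line automatically.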
  • Intel introduces Intel® Threading Building Blocks for C++ developers. This product abstracts scalable threading for developers of Windows, Linux and Mac OS applications. Just as Java and .NET abstracted programming for C++, C, Fortran and assembly-language programmers, Threading Building Blocks abstracts low-level threading for parallel programmers. The developer writes the high-level, task-level code they are familiar with; Threading Building Blocks detects the number of processor cores and scales the application accordingly (see the sketch after this note).
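A minimal sketch of that idea, using the classic TBB 1.x interfaces that appear later in this deck (the array name and sizes are illustrative): the developer writes the per-element task, and the TBB runtime detects the cores and distributes the work.

#include "tbb/task_scheduler_init.h"
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"

struct ScaleBy2 {
    float *a;
    explicit ScaleBy2(float *array) : a(array) {}
    void operator()(const tbb::blocked_range<size_t> &r) const {
        for (size_t i = r.begin(); i != r.end(); ++i)
            a[i] *= 2.0f;  // the task: no thread creation or joining in sight
    }
};

static float data[1000000];

int main() {
    tbb::task_scheduler_init init;  // runtime detects the number of cores
    tbb::parallel_for(tbb::blocked_range<size_t>(0, 1000000, 10000),
                      ScaleBy2(data));  // chunks of roughly 10000 iterations
    return 0;
}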
  • When you program to use two cores, you expect it to run faster than on one core: up to twice as fast.
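The "up to" matters: by Amdahl's law (a standard result, not stated on the slide), if a fraction p of a program's work is parallelized across n cores, the overall speedup is

    S(n) = 1 / ((1 - p) + p / n)

so a program that is 90% parallel runs on two cores at 1 / (0.1 + 0.45), about 1.82x rather than 2x; only the fully parallel portions double.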
  • Intel TBB extends C++ for parallelism in an easy-to-use and efficient manner. It is designed to work with any C++ compiler, simplifying development of applications for multi-core systems. This allows scalable programs at a fraction of the developer effort compared to C++ with raw threading packages. Intel TBB facilitates scalable performance in a way that works across a variety of machines today, and readies programs for tomorrow: it detects the number of cores on the hardware platform and adjusts as more cores are added, letting software adapt and take more effective advantage of multi-core hardware. Intel TBB is a proven solution, currently used in a wide variety of C++ applications where scalable performance is important, covering segments such as digital content creation, animation, financial services, electronic design automation and design simulation. Intel continues to support the commercial version of Intel Threading Building Blocks 2.0, which is available for $299 and includes one year of technical support, upgrades and new releases. The commercial version of Intel TBB is also included with the recently launched Intel® C++ Compiler Professional Edition 10.0.
  • Static Verifier builds upon the compiler's interprocedural analysis capability to provide whole-program detection of errors including routine mismatches, variable misuse, OpenMP directive errors and more
  • SV benefits less experienced users: it detects 200+ kinds of potential coding errors and can help "teach" users to correctly use certain features of the C/C++ and Fortran languages (for example, OpenMP directives). It benefits experienced users and QA: it detects hard-to-find errors (for example, typos or uninitialized variables) and inconsistent object declarations in different program units (for example, different types of dummy and actual arguments).
  • SW Versions (C++ build / Fortran build):
    Intel® C++ & Fortran, Windows, 10.0: 20070427_RC2 / 20070427_RC2
    Intel® C++ & Fortran, Linux, 9.1: 20060324_RC1 / 20060324_RC1
    PGI C++ & Fortran 6.2: pgi 6-2.5 / pgi 6-2.5
    GCC 4.2 Linux: gcc 4.2.0 / gcc 4.2.0
    GCC 4.2 Mac: gcc 4.2.0 / gcc 4.2.0
    MSVC 2005: msvc 8.1 / msvc 8.1
    Pathscale: pathscale 2.5 / pathscale 2.5

    Optimization Options for Win32:
    IC 10.0 -O2: C: -O2 -Qfp_port | C++: -O2 -Qfp_port -Qcxx_features | F: -O2 -Qfp_port
    IC 10.0 Baseline: C: -fast -Qprof_gen/use | C++: -fast -Qcxx_features | F: -fast
    IC 9.1 -O2: C: -O2 -Qfp_port | C++: -O2 -Qfp_port -Qcxx_features | F: -O2 -Qfp_port
    IC 9.1 Baseline: C: -fast | C++: -fast -Qcxx_features | F: -fast
    PGI C++ & Fortran -O2: C: -O2 -Op | C++: -O2 | F: -O2
    PGI C++ & Fortran Baseline: C: -fastsse -D__STDC__=0 -D_MSC_VER -Mipa=fast,inline -Mfprelaxed -tp P7 | C++: -fastsse -Mipa=fast,inline -Mfprelaxed -tp P7 | F: -fastsse -Mipa=fast,inline -Mfprelaxed -tp P7
    MSVC -O2: C: -O2 | C++: -O2 | F: N.A.
    MSVC Baseline: C: /O2 /Ob2 /GL /arch:SSE2 | C++: /O2 /Ob2 /GL /EHsc /GR /arch:SSE2 | F: N.A.

    Optimization Options for Linux64:
    IC 10.0 -O2: C: -O2 -fp_port | C++: -O2 -fp_port | F: -O2
    IC 10.0 Baseline: -fast for C, C++ and F
    IC 9.1 -O2: C: -O2 -fp_port | C++: -O2 -fp_port | F: -O2
    IC 9.1 Baseline: -fast for C, C++ and F
    PGI -O2: -O2 for C, C++ and F
    PGI Baseline: -fastsse -Mipa=fast -tp=p7 for C, C++ and F
    GCC -O2: -O2 for C, C++ and F
    GCC Baseline: C: -O3 -funroll-loops -ffast-math -fno-inline-functions | C++: -O3 -funroll-loops -ffast-math | F: -O3 -funroll-loops -ffast-math -fno-inline-functions
    Pathscale 2.5 -O2: -O2 for C, C++ and F
    Pathscale 2.5 Baseline: -Ofast -march=opteron for C, C++ and F

    Optimization Options for Mac 32:
    IC 10.0 -O2: -O2 for C, C++ and Fortran
    IC 10.0 Baseline: C: -fast -IPF_fp_relaxed -ansi_alias | C++: -fast -IPF_fp_relaxed -ansi_alias + SmartHeap | F: -fast -IPF_fp_relaxed -ansi_alias -parallel -opt-mem-bandwidth 0
    IC 9.1 -O2: -O2 for C, C++ and Fortran
    IC 9.1 Baseline: C: -fast -IPF_fp_relaxed -ansi_alias | C++: -fast -IPF_fp_relaxed -ansi_alias + SmartHeap | F: -fast -IPF_fp_relaxed -ansi_alias -parallel -opt-mem-bandwidth 0
    GCC 4.2 -O2: -O2 for C, C++ and Fortran
    GCC 4.2 Baseline: -O3 -funroll-all-loops -ffast-math -static for C, C++ and Fortran
  • Optimization Options for Win x86_64/EM64T:
    IC 10.0 -O2: C: -O2 -Qfp_port | C++: -O2 -Qfp_port -Qcxx_features | F: -O2 -Qfp_port
    IC 10.0 Baseline: C: -fast -Qprof_gen/use | C++: -fast -Qprof_gen/use -Qcxx_features | F: -fast -Qprof_gen/use
    IC 9.1 -O2: C: -O2 -Qfp_port | C++: -O2 -Qfp_port -Qcxx_features | F: -O2 -Qfp_port
    IC 9.1 Baseline: C: -fast -Qprof_gen/use | C++: -fast -Qprof_gen/use -Qcxx_features | F: -fast -Qprof_gen/use
    MSVC -O2: C: -O2 -Op | C++: -O2 -Op -GX -GR | F: N.A.
    MSVC Baseline: C: /O2 /Op /GL /arch:SSE2 | C++: /arch:SSE2 /O2 /GX /GR | F: N.A.

    Optimization Options for Linux:
    IC 10.0 -O2: C: -O2 -fp_port | C++: -O2 -fp_port | F: -O2
    IC 10.0 Baseline: -ipo -O3 -no-prec-div -xW for C, C++ and F
    IC 9.1 -O2: C: -O2 -fp_port | C++: -O2 -fp_port | F: -O2
    IC 9.1 Baseline: -ipo -O3 -no-prec-div -xW for C, C++ and F
    PGI -O2: C: -O2 -tp=k8-32 | C++: -O2 -tp=k8-32 | F: -O2 -tp=k8-32
    PGI Baseline: -fastsse -Mipa=fast -tp=p7 for C, C++ and F
    GCC -O2: -O2 for C, C++ and F
    GCC Baseline: C: -O3 -funroll-loops -ffast-math -fno-inline-functions | C++: -O3 -funroll-loops -ffast-math | F: -O3 -funroll-loops -ffast-math -fno-inline-functions
    Pathscale 2.5 -O2: -O2 for C, C++ and F
    Pathscale 2.5 Baseline: -Ofast -march=opteron for C, C++ and F
  • Grainsize explanation: when determining how much to subdivide a task to get the best performance on multi-core, we let the developer specify the granularity of the subdivision, allowing them to make the trade-offs that work best for their application. Version 1.1 adds the ability to let TBB calculate this grainsize value, allowing user choice where needed and automation where wanted.
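A sketch of the two modes described above, assuming the TBB 1.1 interfaces (auto_partitioner shipped in 1.1; the loop body and names here are illustrative):

#include "tbb/task_scheduler_init.h"
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include "tbb/partitioner.h"

struct ApplyGain {
    float *samples;
    explicit ApplyGain(float *s) : samples(s) {}
    void operator()(const tbb::blocked_range<size_t> &r) const {
        for (size_t i = r.begin(); i != r.end(); ++i)
            samples[i] *= 0.5f;
    }
};

static float buf[1000000];

int main() {
    tbb::task_scheduler_init init;
    // 1) Developer-chosen granularity: split into chunks of ~10000 iterations.
    tbb::parallel_for(tbb::blocked_range<size_t>(0, 1000000, 10000),
                      ApplyGain(buf));
    // 2) New in 1.1: omit the grainsize and let auto_partitioner choose it.
    tbb::parallel_for(tbb::blocked_range<size_t>(0, 1000000),
                      ApplyGain(buf), tbb::auto_partitioner());
    return 0;
}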
  • Features: generates best-performing code; built-in parallelization support; cross-OS support for Windows*, Linux* and Mac OS*; 32-bit and 64-bit support; inclusion of critical libraries for threading, math and multimedia (Intel® Math Kernel Library v9.1, Intel® Integrated Performance Primitives v5.2, Intel® Threading Building Blocks (Intel TBB) v1.1); better performance and threading support with our new High-Performance Parallel Optimizer (HPO); improved C++ exception handling and runtime type checking for better inlining and performance; security checking and diagnostics in our Static Verifier; OpenMP* correctness checking. OS updates: Windows* Vista* and VS.NET* 2005 support; support for the latest Linux distributions; support for 64-bit and 32-bit applications on Mac OS* X.
  • Features: generates best-performing code; built-in parallelization support; cross-OS support for Windows*, Linux* and Mac OS*. What's new: no need to buy Microsoft Visual Studio* (Intel Visual Fortran for Windows includes Visual Studio); inclusion of critical libraries for threading and math (Intel® Math Kernel Library v9.1; an IMSL* version is available for Windows); stand-alone Visual Fortran on Windows* (Microsoft* Visual Studio* included); better performance and threading through HPO; new Fortran 2003 standard features (async I/O, C interoperability, standards checking, and others); security checking and diagnostics; OpenMP* correctness checking. OS updates: Windows* Vista* and VS.NET* 2005 support; support for the latest Linux distributions; support for 64-bit and 32-bit applications on Mac OS* X.
  • The example above detects a buffer overflow. Mudflap is a feature available on Linux* and Mac OS* X on IA-32 and Intel® 64 processors for finding memory leaks and overwrites. The Mudflap library was developed by the GNU compiler developers; additional details are available on the gcc wiki page, http://gcc.gnu.org/wiki/Mudflap%20Pointer%20Debugging . Mudflap complements the -fstack-security-check option on Linux and Mac.
  • In 10.0 at -O3, the compiler does loop interchange, blocking for cache locality, and unrolling, and only then goes on to vectorize.
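A sketch of the kind of loop nest that pipeline rewrites (the function and array names are illustrative; compile with something like icc -O3 to see the transformations reported):

const int N = 1024;
float a[N][N], b[N][N];

void scale() {
    // As written, the inner loop walks down a column, striding N floats
    // per iteration. The compiler can interchange i and j so the inner
    // loop is unit-stride, block the nest for cache locality, unroll,
    // and only then vectorize the unit-stride inner loop with SSE.
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < N; ++i)
            a[i][j] = 2.0f * b[i][j];
}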
  • In the past, if you wanted to build a mixed-language application with Fortran and C, you had to use vendor-specific extensions to manage language differences such as pass-by-value and mixed-case names. Fortran 2003 defines a set of syntax features that make it possible to specify C-compatible declarations in a standard-conforming format. This makes it possible to write Fortran code that is called by C, or Fortran code that calls C, without using vendor extensions.

There are some aspects of the Fortran run-time environment which are not specified by the standard: for example, which Fortran unit numbers are used for standard input and standard output, or what the IOSTAT return value for an end of file is. Programs that needed to know these things would either guess or do some sort of run-time test to determine the answer. The ISO_FORTRAN_ENV module provides a uniform and portable way for applications to get the answers to these questions about the run-time environment.

With asynchronous I/O, an application can start a large READ or WRITE and then continue computing while the I/O happens in the background. The program can then wait for the operation to complete when the data is needed. For some applications, this can shorten run-time compared to traditional synchronous I/O, which stalls at each READ or WRITE.

The compiler has the ability to check program source for conformance to the Fortran 90, Fortran 95 or Fortran 2003 standards. Although Intel Fortran does not yet implement all of Fortran 2003, when you ask for standards checking and do not specify a version, the default is to check against Fortran 2003. In the past, Fortran 95 was the default level.

There are many more supported Fortran 2003 features that can make you more productive, such as the IMPORT statement in interface blocks. A complete list of all supported Fortran 2003 features is in the compiler release notes, which can be found at the Intel Fortran compiler web site.
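A sketch of the C interoperability feature from the calling side (all names here are hypothetical), with the Fortran interface shown in comments:

// Hypothetical Fortran 2003 side, using BIND(C) and ISO_C_BINDING kinds:
//
//   subroutine add_arrays(a, b, n) bind(c, name="add_arrays")
//     use iso_c_binding
//     real(c_double) :: a(*), b(*)
//     integer(c_int), value :: n
//     ...
//   end subroutine
//
// With that declaration, C or C++ needs only an ordinary prototype; no
// vendor-specific tricks for name mangling or argument passing.
extern "C" void add_arrays(double *a, double *b, int n);

int main() {
    double a[4] = {1, 2, 3, 4};
    double b[4] = {10, 20, 30, 40};
    add_arrays(a, b, 4);  // the Fortran routine adds b into a, element-wise
    return 0;
}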

Transcript

  • 1. Intel® Compilers Professional Editions The Best C++ and Fortran Development Solutions for Today’s Multi-core World
    • Phil De La Zerda , Director
    • Intel Software Development Products
    • Intel Corporation
    June 2007. CONTAINS EMBARGOED INFORMATION: no public mention of the new Intel Compiler 10.0 discussed in this disclosure is allowed until June 5.
  • 2. Multi-Core Processors Change the Rules. Single Core, Dual Core, Quad Core. Until recently, faster software came from faster processors; now performance will primarily come from multi-core processors. Maximize multi-core performance by parallelizing software. Intel® Xeon® processor; Dual-Core Intel® Xeon® processor 5100 series; new Quad-Core Intel® Xeon® 5300 for 2006.
  • 3. Great Performance Through Parallelism. Help mainstream programmers to quickly develop quality code. Educate Tomorrow's Experts:
    • Helped 45 universities add parallel programming courses
    • 7500 students took them
    • 2007 Goal: 400+ universities
    Provide Tools Today: Architect, Thread, Debug & Tune. Research Tomorrow's Techniques: Transactional Memory, Speculative Multi-threading. Higher-Ed Outreach. http://www.intel.com/software/products ACCELERATE TRANSITION TO PARALLEL PROGRAMMING
  • 4. New Usage Models: parallelism is key. [Chart: performance (KIPS through MIPS, GIPS and TIPS) versus dataset size (kilobytes through megabytes, gigabytes and terabytes) for single-core, multi-core and tera-scale systems, spanning text, multimedia, 3D & video and RMS workloads such as personal media creation and management, learning & travel, entertainment and health. Source: Electronic Visualization Laboratory, University of Illinois.]
  • 5. Worldwide Trends & The Opportunity
  • 6. Key messages
    • Multi-core is here to stay
      • Hype to the contrary is noise
    • Multi-core is ubiquitous – for software developers: worth utilizing, dangerous to ignore
    • Intel has the best products to help developers with parallelism
    • Intel has rich innovations ahead to take us far into the future with multi-core; we are committed
  • 7. "THINK PARALLEL (or PERISH)": the "memo to self" that software developers need to write. Multi-core processing is the new normal. Utilizing parallelism is the best way to realize performance in the future.
  • 8. "Everyone" needs to harness parallelism. Even more: they want "solutions", and they want tools to help. Help me express parallelism. Help me have confidence it will work. Help me tune performance. Help me: recommend what I do. Help me with cluster programming (write, tune, debug).
  • 9. Thinking parallel 1-2-3
    • Start with libraries, use as much as they apply. Recommend OpenMP (see the sketch after this list). We support libraries and OpenMP very well.
    • Recommend Intel Threading Building Blocks. Avoids leaping from step 1 to 3. New, exciting – and well supported by Intel. A focal point for energizing the industry.
    • Last resort: customers use MPI and/or hand threading. We support MPI and threading by hand very well. (So do not panic: many early adopters will hand-thread, and many cluster programmers will use MPI.)
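As a sketch of step 1, here is the kind of loop OpenMP handles well (illustrative code; the Intel compilers of this era enable OpenMP with -openmp on Linux and Mac OS, /Qopenmp on Windows):

double dot(const double *x, const double *y, int n) {
    double sum = 0.0;
    // One directive: the runtime splits the iterations across the
    // available cores and combines the per-thread partial sums.
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}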
  • 10. Thinking parallel 1-2-3
    • Start with libraries, use as much as they apply. Recommend OpenMP. We support libraries and OpenMP very well.
    • Recommend Intel Threading Building Blocks. Avoids leaping from step 1 to 3. New, exciting – and well supported by Intel. A focal point for energizing the industry.
    • Last resort: customers use MPI and/or hand threading. We support MPI and threading by hand very well. (So do not panic: many early adopters will hand-thread, and many cluster programmers will use MPI.)
  • 11. #2 – Intel Threading Building Blocks Which do you choose? Less work or More work?
      • C++ template-based runtime library
        • simplifies writing multi-threaded applications
        • high level programming paradigm
        • emphasizes scalability
        • pre-built and tested data structures & algorithms
  • 12. Less code to achieve parallelism. Example: 2D Ray Tracing Application.

Windows Threads version. Thread setup and initialization:

CRITICAL_SECTION MyMutex, MyMutex2, MyMutex3;

int get_num_cpus(void) {
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    return (int)si.dwNumberOfProcessors;
}

int nthreads = get_num_cpus();
HANDLE *threads = (HANDLE *)alloca(nthreads * sizeof(HANDLE));
InitializeCriticalSection(&MyMutex);
InitializeCriticalSection(&MyMutex2);
InitializeCriticalSection(&MyMutex3);
for (int i = 0; i < nthreads; i++) {
    DWORD id;
    threads[i] = CreateThread(NULL, 0, parallel_thread, i, 0, &id);
}
for (int i = 0; i < nthreads; i++) {
    WaitForSingleObject(threads[i], INFINITE);
}

Parallel task scheduling and execution:

const int MINPATCH = 150;
const int DIVFACTOR = 2;

typedef struct work_queue_entry_s {
    patch pch;
    struct work_queue_entry_s *next;
} work_queue_entry_t;

work_queue_entry_t *work_queue_head = NULL;
work_queue_entry_t *work_queue_tail = NULL;

void generate_work(patch *pchin) {
    int startx, stopx, starty, stopy;
    int xs, ys;
    startx = pchin->startx; stopx = pchin->stopx;
    starty = pchin->starty; stopy = pchin->stopy;
    if (((stopx - startx) >= MINPATCH) || ((stopy - starty) >= MINPATCH)) {
        int xpatchsize = (stopx - startx) / DIVFACTOR + 1;
        int ypatchsize = (stopy - starty) / DIVFACTOR + 1;
        for (ys = starty; ys <= stopy; ys += ypatchsize)
            for (xs = startx; xs <= stopx; xs += xpatchsize) {
                patch pch;
                pch.startx = xs;
                pch.starty = ys;
                pch.stopx = MIN(xs + xpatchsize - 1, stopx);
                pch.stopy = MIN(ys + ypatchsize - 1, stopy);
                generate_work(&pch);
            }
    } else { /* just trace this patch */
        work_queue_entry_t *q = (work_queue_entry_t *)malloc(sizeof(work_queue_entry_t));
        q->pch.starty = starty; q->pch.stopy = stopy;
        q->pch.startx = startx; q->pch.stopx = stopx;
        q->next = NULL;
        if (work_queue_head == NULL) { work_queue_head = q; }
        else { work_queue_tail->next = q; }
        work_queue_tail = q;
    }
}

void generate_worklist(void) {
    patch pch;
    pch.startx = startx; pch.stopx = stopx;
    pch.starty = starty; pch.stopy = stopy;
    generate_work(&pch);
}

bool schedule_thread_work(patch &pch) {
    EnterCriticalSection(&MyMutex3);
    work_queue_entry_t *q = work_queue_head;
    if (q != NULL) {
        pch = q->pch;
        work_queue_head = work_queue_head->next;
    }
    LeaveCriticalSection(&MyMutex3);
    return (q != NULL);
}

generate_worklist();

void parallel_thread(void *arg) {
    patch pch;
    while (schedule_thread_work(pch)) {
        for (int y = pch.starty; y <= pch.stopy; y++)
            for (int x = pch.startx; x <= pch.stopx; x++)
                render_one_pixel(x, y);
        if (scene.displaymode == RT_DISPLAY_ENABLED) {
            EnterCriticalSection(&MyMutex3);
            for (int y = pch.starty; y <= pch.stopy; y++) {
                GraphicsDrawRow(pch.startx - 1, y - 1, pch.stopx - pch.startx + 1,
                    (unsigned char *)&global_buffer[((y - starty) * totalx + (pch.startx - startx)) * 3]);
            }
            LeaveCriticalSection(&MyMutex3);
        }
    }
}

Intel® Threading Building Blocks version. Thread setup and initialization:

#include "tbb/task_scheduler_init.h"
#include "tbb/spin_mutex.h"

tbb::task_scheduler_init init;
tbb::spin_mutex MyMutex, MyMutex2;

Parallel task scheduling and execution:

#include "tbb/parallel_for.h"
#include "tbb/blocked_range2d.h"

class parallel_task {
public:
    void operator()(const tbb::blocked_range2d<int> &r) const {
        for (int y = r.rows().begin(); y != r.rows().end(); ++y)
            for (int x = r.cols().begin(); x != r.cols().end(); x++)
                render_one_pixel(x, y);
        if (scene.displaymode == RT_DISPLAY_ENABLED) {
            tbb::spin_mutex::scoped_lock lock(MyMutex2);
            for (int y = r.rows().begin(); y != r.rows().end(); ++y) {
                GraphicsDrawRow(startx - 1, y - 1, totalx,
                    (unsigned char *)&global_buffer[(y - starty) * totalx * 3]);
            }
        }
    }
    parallel_task() {}
};

parallel_for(tbb::blocked_range2d<int>(starty, stopy + 1, grain_size,
                                       startx, stopx + 1, grain_size),
             parallel_task());

Intel® Threading Building Blocks offers platform portability on Windows*, Linux*, and Mac OS* through its cross-platform API. This code comparison shows the additional code needed to make a 2D ray tracing program, Tacheon, correctly threaded. This allows the application to take advantage of current and future multi-core hardware. This example includes software developed by John E. Stone. Focus on the work to do, not "how" (thread control) to manage threads.
  • 13. Less code to achieve parallelism. Example: 2D Ray Tracing Application (same code comparison as slide 12).
  • 14. Less code to achieve parallelism. Example: 2D Ray Tracing Application (same code comparison as slide 12).
  • 15. Less code to achieve parallelism. Example: 2D Ray Tracing Application (same code comparison as slide 12). Focus on the work to do, not "how" (thread control) to manage threads. Just say 'NO' to explicit thread management: not because you can't do it, but because it isn't a good use of time developing and maintaining. Intel® Threading Building Blocks.
  • 16. Thinking parallel 1-2-3
    • Start with libraries, use as much as they apply. Recommend OpenMP. We support libraries and OpenMP very well.
    • Recommend Intel Threading Building Blocks. Avoids leaping from step 1 to 3. New, exciting – and well supported by Intel. A focal point for energizing the industry.
    • Last resort: customers use MPI and/or hand threading. We support MPI and threading by hand very well. (So do not panic: many early adopters will hand-thread, and many cluster programmers will use MPI.)
  • 17. Scalability: a competitive advantage for software. [Chart: speedup ("scaling", 1X to 8X) versus number of cores (1, 2, 4, 8), with curves ranging from poor through fair, good and excellent up to ideal.] DESIGN FOR SCALING. For excellent scaling, use Intel tools. Scalability, Correctness, Maintainability.
  • 18. Introducing Scalability: Libraries, OpenMP, MPI, and Intel Threading Building Blocks. Scalable and high performance: by focusing on the work to do, not "how" (thread control) to manage threads, we rely on Intel Threading Building Blocks to manage threads better, and we get better scalability and performance. Linux* Windows*
  • 19. Intel ® Threading Building Blocks (TBB) 2.0 Open Source
    • TBB 2.0 announcement extends TBB by:
      • Buildable for more OSes, more processors: now more than Linux, Mac OS X, and Windows; now more than IA-32, IA-64 and Intel 64 processors.
      • Creation of open source project http://threadingbuildingblocks.org
      • Source code available (GPL v2 with runtime exception)
      • Commercial Product continues http://threadingbuildingblocks.com
        • Separately – this month…
      • O’Reilly Nutshell book
  • 20. Introducing: Intel Compilers 10.0 Professional Editions Compilers and Libraries tuned for Multi-core Processors Featuring Intel’s Revolutionary New Optimizer
    • Intel® C++ Compiler 10.0 Professional Edition
    • Intel® C++ Compiler
    • Intel® Math Kernel Library
    • Intel® Integrated Performance Primitives
    • Intel® Threading Building Blocks
    • Intel® Fortran Compiler 10.0 Professional Edition
    • Intel® Fortran Compilers
    • Intel® Math Kernel Library
    • and on Windows only: Microsoft* Visual Studio is included
    Versions for: Windows* Linux* Mac OS* X
  • 21. Breakthrough Optimizer Design: uniquely blends parallelism optimizations for SSE and multi-core to achieve new levels of optimization
    • Single cohesive system to combine the essential functions required to optimize multi-core performance.
    • Vectorization – ensures performance optimization of software with complex media demands such as 3D and video.
    • Parallelization – maximizes multi-core processors by automatically generating multi-threaded code.
    • Loop Transformations – enable vectors and threaded code to automatically be transformed within a single framework.
    Parallelism can be exploited in two ways, vectors (SSE) and multi-core; this optimizer improves our automatic exploitation of them together.
  • 22. Combining vectorization, parallelization and loop transformation into a single transform delivers results
    • Our Goal:
    • Compiler Professional Edition is the result of three years of development, with the ultimate goal of making it easy for developers to advance parallelism without changing code or throwing fancy compiler switches.
    • Customer experience :
    • "In applications involving calculations on huge amounts of continuous memory, the Intel C++ 10.0 compiler is significantly faster than 9.1. In certain stand-alone tests, such as linear algebra matrix multiplication, 10.0 is up to 4 times faster than 9.1, due to improved automatic parallelization and automatic vectorization with "unroll and jam" that fits hand in glove with the Intel Core 2 micro-architecture."
    • Gunnar Staff & Lars Petter Endreen, SPT Group
  • 23. Helping developers scale, correct and maintain applications for multicore performance
    • Multi-core Highlights
    • Intel ® Threading Building Blocks
        • Parallelism for C++
        • A standard part of Intel C++
      • OpenMP*: C and Fortran
        • Now with API checking
        • Auto-parallelization
      • Intel ® Math Kernel Library (MKL):
        • New threaded scientific routines
      • Intel ® Integrated Performance Primitives (IPP):
        • New threaded multimedia acceleration and 64-bit support across all compilers and libraries
    Multi-threading for multicore has never been easier. Intel C++ Compiler Professional Edition
  • 24. Intel® Compiler Vulnerability Diagnostics. STRONG COMMITMENT TO HELP DEVELOPERS FIND VULNERABILITIES. Features and benefits:
    Controlling Compiler Diagnostics: Intel Compilers have a powerful and flexible diagnostic reporting facility that allows such things as changing individual diagnostics, groups of diagnostics, or all diagnostics from warnings to errors, and from remarks to warnings.
    Static Verifier: detects various defects and doubtful or inconsistent uses of language features in user code at compile/link time for an entire application. The Static Verifier understands C/C++ and Fortran, including OpenMP*, and can analyze mixed C/C++/Fortran applications.
    Mudflap Support (Linux* only): detects certain types of memory corruption on Linux.
    Improving C++ Source Code Using "Effective C++" Diagnostics: diagnostics based on the C++ usage guidance developed by Scott Meyers in his "Effective C++" books.
    Porting To/From Other Compilers: various ANSI/ISO compliance options can help identify non-portable language usage.
    Porting from 32 to 64 Bits: generates diagnostics warning of potential problems when porting legacy applications to 64 bits.
    Assisting Developing and Debugging Threaded Applications: Intel® C++ Compilers have unique diagnostic capabilities that can enhance the quality of threaded code and help when introducing threads to legacy applications.
    Detecting x87 Floating-Point Stack Corruption: identifies incorrect function declarations that can corrupt the x87 floating-point stack and generate incorrect numerical results.
    Enabling Stack Checking: run-time detection of stack corruption; helps detect commonly exploited security vulnerabilities.
    Detecting Buffer Overflow: run-time detection of buffer overflows; helps prevent commonly exploited security vulnerabilities.
  • 25. B E N C H M A R K S
  • 26. Better Performance on Windows*, Linux* and Mac OS* X Competition is not standing still, neither are we. We continue to deliver on our goal to BE THE BEST. Persistent leadership.
  • 27. Competitive Performance on AMD Processors. We continue to deliver on our goal to BE THE BEST, with effort (many switches) and without (-O2). Intel libraries and compilers continue to support a wide range of processors, with optimized software performance for multiple processors in the same binary. Intel's patented CPU-dispatch logic, available since 2000, is used by Intel compilers and libraries to offer sophisticated support for multiple processors in a single binary.
  • 28. Better Performance on Multi-Core Processors Optimized code is the first step to best usage of multi-core. Multiple optimized programs run better than multiple unoptimized programs.
  • 29. Better OpenMP Performance on Multi-Core Processors. OpenMP and libraries are the easiest, and best, way to tap into multi-core performance (when they apply, of course). Customers have converted hand-threaded programs to OpenMP; they gained performance, scalability and portability, and made their programs simpler, all at once.
  • 30. Standalone Intel Visual Fortran for Windows*
    • Intel® Visual Fortran for Windows now includes Microsoft* Visual Studio*
    • Great news for Compaq Visual Fortran users, and others, who do not have the Microsoft Visual Studio development environment
    • …and we brought back the COM Server Wizard too!
  • 31. T H E L I B R A R I E S
  • 32. Intel® Math Kernel Library 9.1 Flagship math processing library
    • Improved Functionality
      • Multi-core ready
        • Extensively threaded math functions with excellent scaling (see the sketch after this list)
      • Automatic runtime processor detection ensures great performance on whatever processor your application is running on.
      • Support for C and Fortran
    • What’s New
      • Optimizations for latest Intel processors including Quad-core Clovertown (Xeon® 5300 series) and Dual-core Woodcrest (Xeon® 5100 series)
      • LAPACK 3.1 Support
      • New Threading in Vector Math Functions
      • Support for 64-bit and 32-bit applications on Mac OS* X
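A sketch of what "extensively threaded" means in practice (the wrapper function here is hypothetical; cblas_dgemm is MKL's standard CBLAS matrix-multiply entry point): the caller writes ordinary serial code, and MKL detects the processor and threads the computation internally.

#include <mkl.h>

// C = A * B for n-by-n row-major matrices; MKL parallelizes internally.
void multiply(const double *A, const double *B, double *C, int n) {
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,
                1.0, A, n,   // alpha, A, leading dimension of A
                B, n,        // B, leading dimension of B
                0.0, C, n);  // beta, C, leading dimension of C
}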
  • 33. Intel® Integrated Performance Primitives 5.2 Highly optimized multimedia functions
    • Improved Functionality
      • Rapid Application Development
      • Cross-platform Compatibility & Code Re-Use
      • Outstanding Performance
    • What’s New
      • New data compression functions and code samples for full compatibility with zlib and libbzip2
      • Improved data compression performance
      • Support for new VC-1 and H.264 High Profile video codecs
      • Support for 64-bit and 32-bit applications on Mac OS* X
      • Additional optimizations for Quad-Core processors, and 64-bit applications
  • 34. Intel® Threading Building Blocks 1.1 Extends C++ for Parallelism
    • Improved Functionality
      • A C++ runtime library that uses familiar task patterns, not threads
      • A high level abstraction requiring less code for threading without sacrificing performance
      • Appropriately scales to the number of cores available
      • The thread library API is portable across Linux, Windows, or Mac OS platforms
      • Works with all C++ compilers (e.g., Microsoft, GNU and Intel)
    • What’s New (TBB 1.1 was launched in April 2007)
      • Auto_partitioner for better parallel algorithms
      • Microsoft Vista* support
      • Full, native 64 bit support for Mac OS X*
  • 35. Intel® C++ Compiler Professional Edition. Most Mac OS* X developers preferred the Intel C++ Compiler with our libraries. The Professional Edition is NOW available for Windows* and Linux* too.
  • 36. Intel® Fortran Compiler Professional Edition. Most Mac OS* X developers preferred the Intel Fortran compiler with our MKL library. The Professional Edition is NOW available for Windows* and Linux* too. The IMSL library is available in a "Professional with IMSL" version (Windows only).
  • 37. Thank you!
  • 38. Backup Foils
  • 39. P R I C I N G
  • 40. Professional Edition Pricing and Availability
      • Available June 5th, 2007
      • Licensing:
        • Single User Licenses: CD-ROM and ESD
        • Floating Licenses: ESD only
        • No Node Locked Licenses
    PRODUCTS (Single User) and MSRP:
    Intel® C++ Compiler Professional Edition for Windows*, Linux*, Mac OS* X: $599
    Intel® Fortran Compiler Professional Edition for Windows: $699
    Intel® Fortran Compiler Professional Edition + IMSL* for Windows: $1,599
    Intel® Fortran Compiler Professional Edition for Linux: $899
    Intel® Fortran Compiler Professional Edition for Mac OS X: $699
  • 41. Standard Edition Pricing and Availability
      • Available June 5th, 2007
      • Licensing:
        • Single User Licenses: CD-ROM and ESD
        • Floating Licenses: ESD only
        • No Node Locked Licenses
    PRODUCTS (Single User) and MSRP:
    Intel® C++ Compiler Standard Edition for Windows*, Linux*, Mac OS* X: $449
    Intel® Fortran Compiler Standard Edition for Windows: $599
    Intel® Fortran Compiler Standard Edition for Linux: $749
    Intel® Fortran Compiler Standard Edition for Mac OS X: $549
    Intel® Math Kernel Library for Windows, Linux, Mac OS X: $399
    Intel® Integrated Performance Primitives for Windows, Linux, Mac OS X: $199
    Intel® Threading Building Blocks for Windows, Linux, Mac OS X: $299
  • 42. New Student Package Supporting new programmers
      • Opportunity to LEARN using the latest versions of Intel Software Development Products.
      • Full functional versions with our standard full year of updates and support.
        • $49 (MSRP) for Mac OS* X:
          • Intel® C++ Professional Edition for Mac OS* X
          • Intel® Fortran Professional Edition for Mac OS* X
          • (2 compilers, 2 performance libraries and Intel Threading Building Blocks)
        • $129 (MSRP) for Windows*:
          • Intel® C++ Professional Edition for Windows
          • Intel® Fortran Professional Edition for Windows
          • Intel® VTune™ Analyzer for Windows
          • Intel® Thread Profiler for Windows
          • Intel® Thread Checker for Windows
          • (2 compilers, 2 performance libraries and Intel Threading Building Blocks
          • and 3 great analysis tools)
        • $129 (MSRP) for Linux*:
          • Intel® C++ Professional Edition for Linux
          • Intel® Fortran Professional Edition for Linux
          • Intel® VTune™ Analyzer for Linux
          • Intel® Thread Checker for Linux
          • (2 compilers, 2 performance libraries and Intel Threading Building Blocks
          • and 2 great analysis tools)
    • Available only through resellers worldwide. Downloads only (no CD).
  • 43. Resellers also offer academic versions and student packages to qualified buyers. Evaluations available: www.intel.com/software/products . Intel C++ Compiler 10.0 Professional Edition: C++ & MKL & IPP & TBB. Intel Fortran Compiler 10.0 Professional Edition: Fortran & MKL.
    PRODUCTS (Single User) and MSRP:
    Intel® C++ Compiler Professional Edition for Windows*, Linux*, Mac OS* X: $599
    Intel® Fortran Compiler Professional Edition for Windows: $699
    Intel® Fortran Compiler Professional Edition + IMSL* for Windows: $1,599
    Intel® Fortran Compiler Professional Edition for Linux: $899
    Intel® Fortran Compiler Professional Edition for Mac OS X: $699
  • 44. D E T A I L S (warning: deep dive) On some of the really cool features…
  • 45. Static Verifier (SV) in 10.0 Compilers
    • Static verifier detects many defects and inconsistent use of language features in user code at compile/link time for the whole application
      • Helps C++ and Fortran users develop and debug for memory access, buffer overflow, and OpenMP* usage errors
      • Finds OpenMP* API and data dependency errors
    $ cat -n main.cpp
    1 #include <iostream>
    2 int foo(const char *);
    3 int main() {
    4 char * y=NULL;
    5 std::cout << foo(y);
    6 return(0);
    7 }
    $ cat -n test.cpp
    1 #include <string>
    2 int foo(const char * widget) {
    3 return(std::strlen(widget));
    4 }
    $ icc -diag-enable sv3 main.cpp test.cpp
    main.cpp(5): error #12143: [SV] "y" is uninitialized
    test.cpp(3): warning #12086: [SV] header-file containing the declaration of intrinsic "strlen" should be included or forward declaration is wrong
    • Benefits less experienced users:
        • Detects 200+ kinds of potential coding errors;
        • Can help to “teach” users to correctly use certain features of C/C++ and Fortran languages (for example, OpenMP directives)
    • Benefits experienced users & QA:
        • Detects hard-to-find errors (for example, typos or uninitialized variables)
        • Detects inconsistent object declarations in different program units (for example, different types of dummy and actual arguments)
  • 46. Mudflap support in 10.0 Compilers
    • Risky pointer operations are instrumented by the compiler to prevent buffer overflows and invalid heap use.
    $ icc -o t-mudflap of-calc.c of-main.c -fmudflap -lmudflap
    $ ./t-mudflap
    ******* mudflap violation 1 (check/write): time=1177545175.621017 ptr=0x7fff5bc461d0 size=8 pc=0x2aaaaaae74a1 location=`of-calc.c:4 (calcSqrt)'
    /usr/lib64/libmudflap.so.0(__mf_check+0x41) [0x2aaaaaae74a1]
    ./t-mudflap(calcSqrt+0x8a) [0x40187c]
    ./t-mudflap(main+0x3b5) [0x401c5d]
    Nearby object 1: checked region begins 1B after and ends 8B after mudflap object 0x136ad320: name=`of-main.c:4 (main) a' bounds=[0x7fff5bc46180,0x7fff5bc461cf] size=80 area=stack check=0r/10w liveness=10 alloc time=1177545175.621003 pc=0x2aaaaaae6fc1
    $ cat -n of-calc.c
    1 #include <math.h>
    2 void calcSqrt(double *a, int N){
    3 for (int i=1;i<N;i++)
    4 a[i]=sqrt( (double)i );
    5 return;
    6 }
    $ cat -n of-main.c
    1 #include <stdio.h>
    2 void calcSqrt(double *a, int N);
    3 int main() {
    4 double a[10];
    5 for (int i=0;i<10;i++) {
    6 a[i] = (double) i;
    7 }
    8 calcSqrt(a,11);
    9 return 0;
    10 }
  • 47. HPO – High Performance Parallel Optimizer
    • New parallelizer and vectorizer structure for IA-32/Intel® 64/IA-64
      • Enabled by
        • -O3 -Q[a]x[P,T,N..]
        • -parallel -O3
      • Revolutionary new design: the vectorizer now runs after HLO (loop optimizations like "loop distribution")
        • Enable loop transformations then vectorization on IA32 and Intel 64 or Load Pair generation on IPF
      • Better interaction among optimizations
      • Parallelization on top of optimized serial code
      • More efficient multi-threaded code
      • Enhanced loop transformations
      • Unified cost model for vectorization and parallelization
    HPO improves performance, particularly for loops, in both serial and parallel execution
  • 48. HPO Example
    subroutine matmul(a,b,c,n)
      real(8), dimension(n,n) :: a,b,c
      c=0.d0                      ! line 4
      !$omp parallel do           ! line 5
      do i=1,n                    ! line 6
        do j=1,n                  ! line 7
          do k=1,n                ! line 8
            c(j,i)=c(j,i)+a(k,i)*b(j,k)
          enddo
        enddo
      enddo
    end

    Optimization report (abridged):
    matmul.f90(5): (col. 7) : OpenMP DEFINED LOOP WAS PARALLELIZED
    High Level Optimizer Report (matmul_)
    matmul.f90(4): (col. 1) : LOOP WAS VECTORIZED.
    matmul.f90(7): (col. 3) : PERMUTED LOOP WAS VECTORIZED.
    LOOP INTERCHANGE in loops at line: 7 8
    Loopnest permutation ( 1 2 3 ) --> ( 1 3 2 )
    Block, Unroll, Jam Report (loop line numbers, unroll factors and type of transformation):
    Loop at line 7 blocked by 111
    Loop at line 8 blocked by 111
    Loop at line 6 blocked by 111
    Loop at line 6 unrolled and jammed by 4
    Loop at line 8 unrolled and jammed by 4
  • 49. New C++ Exception Handling Revolutionary new design
    • Simplified internal presentation provides more opportunities for optimization:
      • Improved EH representation
      • Improved in-lining in the presence of EH
      • Improved EH control flow model
      • Improved optimization opportunities where opts were previously disabled, such as const propagation
      • Improved code size/data size vs gcc* for C++
      • More opportunity for parallelization
      • New options: -f[no-]exceptions
        • Accepted as a ‘dummy’ switch
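An illustrative fragment (not from the deck) of where the redesign pays off: exception edges used to make optimizers conservative around code like this, blocking inlining of callee() and propagation of the constant argument; the improved EH control-flow model keeps those optimizations legal.

#include <stdexcept>

int callee(int x) {
    if (x < 0) throw std::runtime_error("negative input");
    return x * 2;
}

int caller() {
    try {
        return callee(42);  // inlining plus constant propagation can fold this to 84
    } catch (const std::exception &) {
        return -1;
    }
}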
  • 50. Fortran – 2003 new features
      • C Interoperability (Intrinsic module ISO_C_BINDING and BIND(C) attribute) - Enables building portable, mixed-language apps
    • ISO_FORTRAN_ENV intrinsic module
        • Helps write portable code
    • Asynchronous I/O
        • Improves run-time performance for reads, writes of unformatted data
    • std03 default option
        • Diagnose Fortran 2003 standards exceptions
    • Additional features for enhancing programmer productivity
  • 51. Intel ® C++ and Fortran Compilers optimization reports
    • Gives developers the information they need to diagnose quickly and accurately:
    • Explains when the compiler was or wasn't able to perform a selected advanced optimization.
    • Identifies areas to improve the application's run-time performance.
    • Highlights messages regarding hot spots in the code with the Intel® VTune™ Analyzer for Linux*.
  • 52. VTune™ Analyzer for Linux* Shows what the Compiler Knows
    • Compile with the Intel compiler
    • Find your hotspot using Intel® VTune™ analyzer
    • Select the hot lines of code in the source view
    • Click the circled icon to see the compiler's opt report
    • The report tells you that the compiler didn't parallelize your critical loop because of an assumed dependency
    • You know there is no dependency and insert an OpenMP directive (see the sketch after this list)
    • You have faster parallel software
    • Check the logic with Intel® Thread Checker
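A sketch of the scenario in this workflow (the function is illustrative): with two pointer parameters, the compiler must assume out and in might alias, reports an assumed dependency, and declines to parallelize. The developer, knowing the buffers never overlap, asserts independence with an OpenMP directive.

void smooth(float *out, const float *in, int n) {
    // The directive tells the compiler the iterations are independent,
    // which the developer knows because out and in never overlap.
    #pragma omp parallel for
    for (int i = 1; i < n - 1; ++i)
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;
}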
  • 53. Intel Compilers 10.0
    • Windows*, Linux* and Mac OS* X
    • Compilers and libraries together – each tuned for multi-core parallelism and vector (SSE) parallelism
    • Revolutionary optimizer
      • boosts compilers and libraries
      • libraries and compilers continue leadership
    • C++ Parallelism: Intel Threading Building Blocks now a standard part of our C++ offering
    • Fortran: Significant new features
    • Numerous vulnerability detection features
    • available from Intel and resellers worldwide now
    • try before you buy – www.intel.com/software/products
  • 54. Intel® C++ Compiler Professional Edition. Most Mac OS* X developers preferred the Intel C++ Compiler with our libraries. The Professional Edition is NOW available for Windows* and Linux* too.
  • 55. Intel® Fortran Compiler Professional Edition. Most Mac OS* X developers preferred the Intel Fortran compiler with our MKL library. The Professional Edition is NOW available for Windows* and Linux* too. The IMSL library is available in the Intel® Fortran Compiler Professional Edition with IMSL (Windows only).