Analysis of mpich

Introduction
mpich is a “high performance” and “widely portable” implementation of MPI (Message Passing Interface), the standard for message passing between distributed-memory applications in parallel computing. I will be analyzing this program to find assembly, inline assembly, and other platform-specific code, with the goal of porting it to aarch64.

Analysis
This program contains a few inline assembly calls.
The following files matched my search for inline assembly (two of them turned out to be false positives, as noted further down):

./src/mpid/ch3/channels/nemesis/include/mpid_nem_memdefs.h
./src/mpid/ch3/channels/nemesis/utils/monitor/rdtsc.h
./src/openpa/src/primitives/opa_gcc_arm.h
./src/openpa/src/primitives/opa_gcc_ia64.h
./src/openpa/src/primitives/opa_gcc_intel_32_64_barrier.h
./src/openpa/src/primitives/opa_gcc_intel_32_64_ops.h
./src/openpa/src/primitives/opa_gcc_intel_32_64_p3barrier.h
./src/openpa/src/primitives/opa_gcc_intrinsics.h
./src/openpa/src/primitives/opa_gcc_ppc.h
./src/openpa/src/primitives/opa_gcc_sicortex.h
./src/openpa/src/primitives/opa_sun_atomic_ops.h
./src/pm/hydra/tools/debugger/debugger.h
./src/pm/hydra/tools/topo/hwloc/hwloc/include/private/cpuid.h
./src/pm/util/dbgiface.c
./test/mpid/atomic.c
./test/mpid/atomic_fai.c

Attempt to build the rpm on x86_64: successful
Attempt to build the rpm on aarch64: blocked by a missing build dependency

-missing dep: valgrind
-mpich requires valgrind in order to build successfully on aarch64, however valgrind is not available there

valgrind seems to be built for arm already. After looking through the rpm, it seems valgrind might just need to be made aware of aarch64; however, I will try a build first. Further investigation shows that, because of its low-level nature, valgrind requires quite a lot of work to run on aarch64. The valgrind site carries multiple warnings about the difficulty and size of a port, along with a page describing their porting plans and guidelines.

Porting valgrind does not seem like an option here.
Building mpich for aarch64 seems a little early here, since we might not be able to test with mpich.

I think I will try to make an aarch64 header file that may be required for mpich; however, I won’t be able to test it with mpich. The header files that contain inline assembly seem to implement the same set of functions, just written separately for each platform:

./src/openpa/src/primitives/opa_gcc_arm.h
./src/openpa/src/primitives/opa_gcc_ia64.h
./src/openpa/src/primitives/opa_gcc_intel_32_64_barrier.h
./src/openpa/src/primitives/opa_gcc_intel_32_64_ops.h
./src/openpa/src/primitives/opa_gcc_intel_32_64_p3barrier.h
./src/openpa/src/primitives/opa_gcc_intrinsics.h
./src/openpa/src/primitives/opa_gcc_ppc.h
./src/openpa/src/primitives/opa_gcc_sicortex.h
./src/openpa/src/primitives/opa_sun_atomic_ops.h

After looking through the list of files, I found another file (opa_gcc_intrinsics.h) that implements all of these functions with gcc intrinsics, which means the code may be able to compile on any platform by using the intrinsics instead of inline assembly. So this code might not actually need an aarch64 port? Maybe it still would for higher performance?
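To make the idea concrete, here is a minimal sketch of how one operation can have both an x86 inline assembly version and a gcc intrinsic fallback that builds on aarch64. This is not code from OpenPA: the function name demo_fetch_and_add is made up for illustration, and the preprocessor test is only an assumption about how such a choice could be made at compile time.

```c
#include <stdint.h>

/* Hypothetical helper, not part of OpenPA: atomically add `val` to `*ptr`
 * and return the value `*ptr` held before the addition. */
static inline int32_t demo_fetch_and_add(volatile int32_t *ptr, int32_t val)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    /* x86 path: hand-written inline assembly. xadd swaps the register with
     * the memory operand and stores the sum in memory, so `val` ends up
     * holding the old value. */
    __asm__ __volatile__("lock ; xaddl %0,%1"
                         : "+r"(val), "+m"(*ptr)
                         :
                         : "memory", "cc");
    return val;
#else
    /* Any other gcc target, including aarch64: let the compiler emit the
     * right instructions through its builtin. */
    return __sync_fetch_and_add(ptr, val);
#endif
}
```

Calling demo_fetch_and_add(&counter, 3) on a counter holding 5 returns 5 and leaves 8 in the counter on either path. The only open question is whether the hand-written path is measurably faster than what the compiler generates.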

file: ./src/mpid/ch3/channels/nemesis/include/mpid_nem_memdefs.h
This file has Pentium and x86_64 asm inside it (and also requires gcc).

file: ./src/mpid/ch3/channels/nemesis/utils/monitor/rdtsc.h

#ifndef __RDTSC_H
#define __RDTSC_H
#include 
#include 
/*#include "asm/msr.h" */
#define rdtsc(x) __asm__ __volatile__("rdtsc" : "=A" (x))

#define TIME_INIT do {__cpuMHz = SetMHz();} while(0)
#define TIME_PRE(cycles) rdtsc(cycles)
#define TIME_POST(cycles) do { unsigned long long __tmp;                \
                               rdtsc(__tmp);                            \
                               (cycles) = __tmp - (cycles); } while (0)

This shows an asm call for a single instruction, rdtsc (read time stamp counter).
It seems this file is only used to measure the time elapsed between different parts of the program, probably for performance monitoring.
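Since rdtsc only exists on x86, a timing helper for aarch64 would need another clock source. The sketch below is one possible approach and not something mpich ships: it uses the POSIX clock_gettime call with CLOCK_MONOTONIC, and the names demo_now_ns, DEMO_TIME_PRE and DEMO_TIME_POST are invented here to mirror the macros above. Note that it reports nanoseconds rather than CPU cycles.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* expose clock_gettime in strict modes */
#endif
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for rdtsc(): nanoseconds from a monotonic clock. */
static inline uint64_t demo_now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Same usage pattern as TIME_PRE / TIME_POST: after DEMO_TIME_POST,
 * `t` holds the elapsed nanoseconds since DEMO_TIME_PRE. */
#define DEMO_TIME_PRE(t)  ((t) = demo_now_ns())
#define DEMO_TIME_POST(t) ((t) = demo_now_ns() - (t))
```

Usage is the same as the original macros: declare a uint64_t, call DEMO_TIME_PRE on it before the code being measured and DEMO_TIME_POST after it. On older glibc versions clock_gettime lives in librt, so the link step may need -lrt.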

file: ./src/pm/hydra/tools/debugger/debugger.h
Seems like a false positive: the only asm I found is in a comment.

file: ./src/pm/hydra/tools/topo/hwloc/hwloc/include/private/cpuid.h
This file is used for x86 cpuid. Do we care about this for aarch64? I don’t think so, since cpuid is an x86 instruction.

file: ./src/pm/util/dbgiface.c
Another false positive in my search: the asm is in a comment.

file: ./test/mpid/atomic.c

/* FIXME: MPICH_SINGLE_THREADED is obsolete and no longer defined */
#if defined(MPICH_SINGLE_THREADED) || !defined(USE_ATOMIC_UPDATES)
#define MPID_Atomic_incr( count_ptr ) \
   __asm__ __volatile__ ( "lock; incl %0" \
                            : "=m" (*count_ptr) :: "memory", "cc" )

#define MPID_Atomic_decr_flag( count_ptr, nzflag )                                      \
   __asm__ __volatile__ ( "xor %%ax,%%ax; lock; decl %0 ; setnz %%al"                   \
                            : "=m" (*count_ptr) , "=a" (nzflag) :: "memory", "cc" )
#endif
__asm__ __volatile__ ( "rdtsc ; movl %%edx,%0 ; movl %%eax,%1"
                       : "=r" (high), "=r" (low) ::
                       "eax", "edx" );

Both of these are found in the file. The file does not wrap them in any ifdefs for a specific arch, so I am not sure when this code is actually built or run. It seems to be used only for testing the performance of atomic operations.
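For comparison, here is a rough sketch of what arch-independent versions of those two macros could look like. This is my own illustration, not code from mpich: the DEMO_ names are hypothetical, and it assumes a compiler with the __atomic builtins (gcc 4.7 or later, or clang).

```c
/* Hypothetical portable stand-in for MPID_Atomic_incr:
 * atomically add 1 to *count_ptr. */
#define DEMO_Atomic_incr(count_ptr) \
    ((void)__atomic_add_fetch((count_ptr), 1, __ATOMIC_SEQ_CST))

/* Hypothetical portable stand-in for MPID_Atomic_decr_flag:
 * atomically subtract 1 from *count_ptr, then set nzflag to 1 if the
 * counter is still non-zero and to 0 if it reached zero. */
#define DEMO_Atomic_decr_flag(count_ptr, nzflag) \
    ((nzflag) = (__atomic_sub_fetch((count_ptr), 1, __ATOMIC_SEQ_CST) != 0))
```

With an int counter starting at 1, DEMO_Atomic_incr brings it to 2, the first DEMO_Atomic_decr_flag leaves 1 with the flag set, and the second leaves 0 with the flag cleared. That matches what the "lock; decl" and "setnz" pair computes on x86.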

file: ./test/mpid/atomic_fai.c

/*
 * Test performance of atomic access operations.
 * This is a *very* simple test.
 */

#include "stdio.h"
#include "unistd.h"

#define MPID_Atomic_fetch_and_incr(count_ptr_, count_old_) do {         \
        (count_old_) = 1;                                               \
        __asm__ __volatile__ ("lock ; xaddl %0,%1"                      \
                              : "=r" (count_old_), "=m" (*count_ptr_)   \
                              :  "0" (count_old_),  "m" (*count_ptr_)); \
    } while (0)

Similar to the previous file, the comment mentions this is just for performance tests of atomic operations.

Is this ready for aarch64?
After the analysis of this package I don’t think an aarch64 port is actually required. Is it? The only place I noted where one could be used is in the following files:

./src/openpa/src/primitives/opa_gcc_arm.h
./src/openpa/src/primitives/opa_gcc_ia64.h
./src/openpa/src/primitives/opa_gcc_intel_32_64_barrier.h
./src/openpa/src/primitives/opa_gcc_intel_32_64_ops.h
./src/openpa/src/primitives/opa_gcc_intel_32_64_p3barrier.h
./src/openpa/src/primitives/opa_gcc_intrinsics.h
./src/openpa/src/primitives/opa_gcc_ppc.h
./src/openpa/src/primitives/opa_gcc_sicortex.h
./src/openpa/src/primitives/opa_sun_atomic_ops.h

That being said, there is a file that uses gcc intrinsics, which looks like it will work on all platforms. I do not think this package actually requires any porting, and I think it should build on aarch64. However, I cannot test this at this time because of the missing build deps.

Compiling for aarch64
Attempting to compile on aarch64 even with the missing deps.
I had to install:

autoconf automake libtool perl-Digest-MD5

cd  source dir
./autogen.sh
./configure
./configure --disable-f77 --disable-fc
The configure options disable Fortran. Configure takes a while; I forgot to time it, but maybe 20-30 minutes on aarch64.
make 2>&1 | tee m.txt

The make step appears to have completed successfully on aarch64, taking 1 hour and 30 minutes. I will post the log online and link to it.

The tests that come with this package all complete successfully on x86_64. On aarch64 the tests fail.

Is it worth trying to build the rpm? I might just run into issues with build deps and the like. It might be best to get some input on whether anything should actually be done to this package.

Next step:
I will be compiling this program on x86_64 and testing specific options, such as building it with the assembly and building it with gcc intrinsics. After compiling with the different options, I will run a benchmark to see how much performance improves with the assembly. If the improvement is significant, an aarch64 assembly port should be made.
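As a starting point for that comparison, a micro-benchmark could look roughly like the sketch below. This is only an assumption about how I might measure it, not an existing mpich test: bench_now_ns and bench_atomic_incr are made-up names, the loop uses the gcc __sync_fetch_and_add builtin, and the timing comes from the POSIX clock_gettime call. The same loop would be repeated with the inline assembly variant to compare the two.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* expose clock_gettime in strict modes */
#endif
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: current time in nanoseconds from a monotonic clock. */
static uint64_t bench_now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Hypothetical benchmark: perform `iters` atomic increments on a local
 * counter, store the final counter value in *final_count, and return the
 * average cost of one increment in nanoseconds. */
static double bench_atomic_incr(long iters, long *final_count)
{
    volatile long counter = 0;
    uint64_t start = bench_now_ns();
    for (long i = 0; i < iters; i++)
        __sync_fetch_and_add(&counter, 1);
    uint64_t elapsed = bench_now_ns() - start;
    *final_count = counter;
    return iters > 0 ? (double)elapsed / (double)iters : 0.0;
}
```

A single-threaded loop like this only measures the uncontended cost of the operation, so a fuller benchmark would also need several threads hitting the same counter.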

About oatleywillisa

Computer Networking Student
This entry was posted in SBR600.
