SPO600 – Conclusion

Introduction
Unfortunately, I was unable to finish porting the packages I chose to look into for this course (mpich and blender). It was much too early to start porting blender, and mpich continued to have problems throughout the course.

Final Stage of Mpich:
For mpich, I was able to configure and compile it on both x86_64 and aarch64. On x86_64 all the tests ran successfully. On aarch64 (qemu), however, every test failed because qemu could not emulate the network interfaces.

I moved to the foundation model and started a configure and compile; 20 hours later it was complete. Running the tests exposed multiple issues. First, all the tests would time out, even when the timeout was increased. Second, none of the perl scripts being run had “shebang” lines for perl, so they failed. After I added the lines, the scripts still exited with different errors.

Compiling on qemu did not take that long, but the same steps on the foundation model took almost a whole day. The commands used for configuring, compiling, and testing:

# These options disable the Fortran compilers, as Fortran is currently not available on aarch64
time ./configure --disable-f77 --disable-fc --with-atomic-primitive
real    22m37.044s
user    20m22.424s
sys     3m8.911s

time make
real    74m34.936s
user    72m9.635s
sys     2m1.169s

# Testing did not work on aarch64. On the foundation model it always timed out, no matter how long I waited; on qemu it failed immediately with errors.
time make testing

Finally I stopped trying to test the program and instead started looking for the functions that need to be ported to aarch64. I found files that list the atomic functions that must be provided for each architecture:

   static _opa_inline int   OPA_load_int(_opa_const OPA_int_t *ptr);
   static _opa_inline void  OPA_store_int(OPA_int_t *ptr, int val);
   static _opa_inline void *OPA_load_ptr(_opa_const OPA_ptr_t *ptr);
   static _opa_inline void  OPA_store_ptr(OPA_ptr_t *ptr, void *val);

   static _opa_inline void OPA_add_int(OPA_int_t *ptr, int val);
   static _opa_inline void OPA_incr_int(OPA_int_t *ptr);
   static _opa_inline void OPA_decr_int(OPA_int_t *ptr);

   static _opa_inline int OPA_decr_and_test_int(OPA_int_t *ptr);
   static _opa_inline int OPA_fetch_and_add_int(OPA_int_t *ptr, int val);
   static _opa_inline int OPA_fetch_and_incr_int(OPA_int_t *ptr);
   static _opa_inline int OPA_fetch_and_decr_int(OPA_int_t *ptr);

   static _opa_inline void *OPA_cas_ptr(OPA_ptr_t *ptr, void *oldv, void *newv);
   static _opa_inline int   OPA_cas_int(OPA_int_t *ptr, int oldv, int newv);

   static _opa_inline void *OPA_swap_ptr(OPA_ptr_t *ptr, void *val);
   static _opa_inline int   OPA_swap_int(OPA_int_t *ptr, int val);

   (the memory barriers may be macros instead of inline functions)
   static _opa_inline void OPA_write_barrier();
   static _opa_inline void OPA_read_barrier();
   static _opa_inline void OPA_read_write_barrier();

   Loads and stores with memory ordering guarantees (also may be macros):
   static _opa_inline int   OPA_load_acquire_int(_opa_const OPA_int_t *ptr);
   static _opa_inline void  OPA_store_release_int(OPA_int_t *ptr, int val);
   static _opa_inline void *OPA_load_acquire_ptr(_opa_const OPA_ptr_t *ptr);
   static _opa_inline void  OPA_store_release_ptr(OPA_ptr_t *ptr, void *val);

   static _opa_inline void OPA_compiler_barrier();

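Aarch64 should be able to use memory barriers similar to the other arm architectures, since dmb and dsb are both in the aarch64 instruction set. One difference to watch for: the aarch64 assembler requires an explicit option on these instructions (for example “dmb sy”), so the bare armv7 forms below would need an operand added: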

#define OPA_arm_dmb_() __asm__ __volatile__  ( "dmb" ::: "memory" )
#define OPA_arm_dsb_() __asm__ __volatile__  ( "dsb" ::: "memory" )

#define OPA_write_barrier()      OPA_arm_dsb_()
#define OPA_read_barrier()       OPA_arm_dmb_()
#define OPA_read_write_barrier() OPA_arm_dsb_()
#define OPA_compiler_barrier()   __asm__ __volatile__  ( "" ::: "memory" )
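
As a rough illustration of how these barriers might carry over, here is a minimal sketch written with the explicit barrier option that the aarch64 assembler expects. It assumes the full-system “sy” option is the one wanted, which may be stronger than necessary. The DEMO_ names and demo_publish are made up for this example and are not part of OpenPA. On other architectures the sketch falls back to GCC's __sync_synchronize() so that it still compiles.

```c
/* Hypothetical AArch64 barrier macros; the DEMO_ names are illustrative only. */
#if defined(__aarch64__)
/* A64 requires an explicit barrier option; "sy" is the full-system variant. */
#define DEMO_dmb_() __asm__ __volatile__ ("dmb sy" ::: "memory")
#define DEMO_dsb_() __asm__ __volatile__ ("dsb sy" ::: "memory")
#else
/* Fallback so the sketch builds elsewhere: a full barrier from GCC. */
#define DEMO_dmb_() __sync_synchronize()
#define DEMO_dsb_() __sync_synchronize()
#endif

#define DEMO_write_barrier()      DEMO_dsb_()
#define DEMO_read_barrier()       DEMO_dmb_()
#define DEMO_read_write_barrier() DEMO_dsb_()
#define DEMO_compiler_barrier()   __asm__ __volatile__ ("" ::: "memory")

/* Publish a value, then a flag, with a write barrier in between. */
static inline void demo_publish(int *data, int *ready, int value)
{
    *data = value;
    DEMO_write_barrier();
    *ready = 1;
}
```

A reader on another thread would check the flag, issue DEMO_read_barrier(), and only then read the data.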

Many of the atomic functions could be ported the same way for each of the architectures:

/* Aligned loads and stores are atomic. */
static _opa_inline int OPA_load_int(_opa_const OPA_int_t *ptr)
{
        return ptr->v;
}

/* Aligned loads and stores are atomic. */
static _opa_inline void OPA_store_int(OPA_int_t *ptr, int val)
{
        ptr->v = val;
}

/* Aligned loads and stores are atomic. */
static _opa_inline void *OPA_load_ptr(_opa_const OPA_ptr_t *ptr)
{
        return ptr->v;
}

/* Aligned loads and stores are atomic. */
static _opa_inline void OPA_store_ptr(OPA_ptr_t *ptr, void *val)
{
        ptr->v = val;
}

Many other functions that are supposed to be ported, and that have already been ported to armv7, look like they will work on aarch64 as well:

static _opa_inline int   OPA_load_acquire_int(_opa_const OPA_int_t *ptr)
{
    int tmp;
    tmp = ptr->v;
    OPA_arm_dsb_();
    return tmp;
}
static _opa_inline void *OPA_load_acquire_ptr(_opa_const OPA_ptr_t *ptr)
{
    void *tmp;
    tmp = ptr->v;
    OPA_arm_dsb_();
    return tmp;
}
static _opa_inline void  OPA_store_release_ptr(OPA_ptr_t *ptr, void *val)
{
    OPA_arm_dsb_();
    ptr->v = val;
}

Most of the remaining functions on the list require some assembly inside them for the load-link/store-conditional (LL/SC) primitives. I did not get a chance to look further into these functions, but they will need to be ported over to aarch64. Here are a couple of examples of the armv7 LL/SC primitive functions:

static _opa_inline int OPA_LL_int(OPA_int_t *ptr)
{   
    int val;
    __asm__ __volatile__ ("ldrex %0,[%1]"
                          : "=&r" (val)
                          : "r" (&ptr->v)
                          : "cc");

    return val;
}

/* Returns non-zero if the store was successful, zero otherwise. */
static _opa_inline int OPA_SC_int(OPA_int_t *ptr, int val)
{
    int ret; /* will be overwritten by strex */
    /*
      strex returns 0 on success
     */
    __asm__ __volatile__ ("strex %0, %1, [%2]\n"
                          : "=&r" (ret)
                          : "r" (val), "r"(&ptr->v)
                          : "cc", "memory");

    return !ret;
}
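
For comparison, aarch64 has its own exclusive load/store pair, ldxr and stxr, in place of ldrex and strex. The following is only a sketch of how a fetch-and-add loop could be built from them. It has not been run against mpich, it uses no acquire/release ordering, and demo_fetch_and_add_int is a made-up name rather than an OpenPA function. The non-aarch64 branch uses a GCC builtin purely so the example compiles elsewhere.

```c
/* Hypothetical sketch: fetch-and-add built from an exclusive load/store pair. */
static inline int demo_fetch_and_add_int(int *ptr, int val)
{
#if defined(__aarch64__)
    int old, sum;
    unsigned int failed;
    /* ldxr/stxr are the A64 counterparts of ldrex/strex; %w selects the
       32-bit view of a register. stxr writes 0 to its status register on
       success, so the loop retries while the status is non-zero. */
    __asm__ __volatile__ (
        "1: ldxr  %w0, [%3]\n"
        "   add   %w1, %w0, %w4\n"
        "   stxr  %w2, %w1, [%3]\n"
        "   cbnz  %w2, 1b\n"
        : "=&r" (old), "=&r" (sum), "=&r" (failed)
        : "r" (ptr), "r" (val)
        : "memory");
    return old;
#else
    /* Fallback so the sketch builds on other architectures. */
    return __sync_fetch_and_add(ptr, val);
#endif
}
```

Keeping the load, the add, and the store in one asm statement avoids the compiler placing other memory accesses between the exclusive pair.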

Posted in SBR600

Analysis of mpich

Introduction
mpich is a “high performance” and “widely portable” implementation of MPI (Message Passing Interface), a standard for message passing in distributed-memory applications used in parallel computing. I will be analyzing this program to find assembly, in-line assembly, and other platform-specific code, for the purpose of porting it to aarch64.

Analysis
This program contains a few inline assembly calls.
The following files matched my search for inline assembly:

./src/mpid/ch3/channels/nemesis/include/mpid_nem_memdefs.h
./src/mpid/ch3/channels/nemesis/utils/monitor/rdtsc.h
./src/openpa/src/primitives/opa_gcc_arm.h
./src/openpa/src/primitives/opa_gcc_ia64.h
./src/openpa/src/primitives/opa_gcc_intel_32_64_barrier.h
./src/openpa/src/primitives/opa_gcc_intel_32_64_ops.h
./src/openpa/src/primitives/opa_gcc_intel_32_64_p3barrier.h
./src/openpa/src/primitives/opa_gcc_intrinsics.h
./src/openpa/src/primitives/opa_gcc_ppc.h
./src/openpa/src/primitives/opa_gcc_sicortex.h
./src/openpa/src/primitives/opa_sun_atomic_ops.h
./src/pm/hydra/tools/debugger/debugger.h
./src/pm/hydra/tools/topo/hwloc/hwloc/include/private/cpuid.h
./src/pm/util/dbgiface.c
./test/mpid/atomic.c
./test/mpid/atomic_fai.c

Attempt to build the rpm on x86_64: successful
Attempt to build the rpm on aarch64:

-missing dep valgrind
-mpich requires valgrind in order to successfully build on aarch64, however valgrind is not available

valgrind seems to be built for arm already. After looking through the rpm, it seemed valgrind might just need to be made aware of aarch64, so I planned to try a build first. Further investigation shows that, because of its low-level nature, valgrind requires quite a lot of work to run on aarch64. The valgrind site carries multiple warnings about the difficulty and size of a porting task, along with a page that lays out their porting plans and guidelines.

Porting valgrind does not seem like an option here.
Building the mpich rpm for aarch64 seems a little early, since without valgrind we might not be able to test it.

I think I will try to make an aarch64 header file that may be required for mpich, although I won’t be able to test it with mpich. The header files that contain inline assembly all seem to implement the same set of functions, written specifically for each platform.

./src/openpa/src/primitives/opa_gcc_arm.h
./src/openpa/src/primitives/opa_gcc_ia64.h
./src/openpa/src/primitives/opa_gcc_intel_32_64_barrier.h
./src/openpa/src/primitives/opa_gcc_intel_32_64_ops.h
./src/openpa/src/primitives/opa_gcc_intel_32_64_p3barrier.h
./src/openpa/src/primitives/opa_gcc_intrinsics.h
./src/openpa/src/primitives/opa_gcc_ppc.h
./src/openpa/src/primitives/opa_gcc_sicortex.h
./src/openpa/src/primitives/opa_sun_atomic_ops.h

After looking through the list of files, I found that one of them (opa_gcc_intrinsics.h) implements all these functions with gcc intrinsics, which means the code may be able to compile on any platform using the intrinsics instead of in-line assembly. This code might therefore not actually need an aarch64 port, although one may still be worthwhile for higher performance.
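
To give an idea of what the intrinsics route looks like, GCC's __sync builtins cover the same kinds of operations without any inline assembly. This is a minimal sketch that assumes the __sync family is what the intrinsics header relies on. The demo_ names and the demo_int_t type are made up for illustration and do not come from OpenPA.

```c
/* Hypothetical wrappers over GCC's __sync builtins; no inline asm needed. */
typedef struct { volatile int v; } demo_int_t;

static inline int demo_fetch_and_add(demo_int_t *ptr, int val)
{
    return __sync_fetch_and_add(&ptr->v, val);    /* returns the old value */
}

static inline int demo_cas(demo_int_t *ptr, int oldv, int newv)
{
    /* Stores newv only if the current value equals oldv; returns the value seen. */
    return __sync_val_compare_and_swap(&ptr->v, oldv, newv);
}

static inline int demo_decr_and_test(demo_int_t *ptr)
{
    return __sync_sub_and_fetch(&ptr->v, 1) == 0; /* non-zero when it hits zero */
}
```

Because the compiler picks the instructions, the same source builds on any architecture GCC supports, at the cost of less control over the exact barriers used.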

file: ./src/mpid/ch3/channels/nemesis/include/mpid_nem_memdefs.h
This file has pentium and x86_64 asm inside it (it also requires gcc).

file: ./src/mpid/ch3/channels/nemesis/utils/monitor/rdtsc.h

#ifndef __RDTSC_H
#define __RDTSC_H
#include 
#include 
/*#include "asm/msr.h" */
#define rdtsc(x) __asm__ __volatile__("rdtsc" : "=A" (x))

#define TIME_INIT do {__cpuMHz = SetMHz();} while(0)
#define TIME_PRE(cycles) rdtsc(cycles)
#define TIME_POST(cycles) do { unsigned long long __tmp;                \
                               rdtsc(__tmp);                            \
                               (cycles) = __tmp - (cycles); } while (0)

This shows an asm call for a single instruction, rdtsc (read time stamp counter), which exists only on x86.
The file seems to be used only to measure the time between different parts of the program, probably for performance monitoring.

file: ./src/pm/hydra/tools/debugger/debugger.h
This seems like a false positive: asm was found in a comment.

file: ./src/pm/hydra/tools/topo/hwloc/hwloc/include/private/cpuid.h
This file is used for x86 cpuid. Do we care about this for aarch64? I don’t think so.

file: ./src/pm/util/dbgiface.c
Another false positive in my search. The asm is in a comment.

file: ./test/mpid/atomic.c

/* FIXME: MPICH_SINGLE_THREADED is obsolete and no longer defined */
#if defined(MPICH_SINGLE_THREADED) || !defined(USE_ATOMIC_UPDATES)
#define MPID_Atomic_incr( count_ptr ) \
   __asm__ __volatile__ ( "lock; incl %0" \
                            : "=m" (*count_ptr) :: "memory", "cc" )

#define MPID_Atomic_decr_flag( count_ptr, nzflag )                                      \
   __asm__ __volatile__ ( "xor %%ax,%%ax; lock; decl %0 ; setnz %%al"                   \
                            : "=m" (*count_ptr) , "=a" (nzflag) :: "memory", "cc" )
#endif

__asm__ __volatile__  ( "rdtsc ; movl %%edx,%0 ; movl %%eax,%1"
                        : "=r" (high), "=r" (low) ::
                        "eax", "edx" );

Both of these snippets are found in the file. The file does not guard them with ifdefs for a specific arch, so I am not sure when it gets built. It seems to be used only for testing the performance of atomic operations.

file: ./test/mpid/atomic_fai.c

/*
 * Test performance of atomic access operations.
 * This is a *very* simple test.
 */

#include "stdio.h"
#include "unistd.h"

#define MPID_Atomic_fetch_and_incr(count_ptr_, count_old_) do {         \
        (count_old_) = 1;                                               \
        __asm__ __volatile__ ("lock ; xaddl %0,%1"                      \
                              : "=r" (count_old_), "=m" (*count_ptr_)   \
                              :  "0" (count_old_),  "m" (*count_ptr_)); \
    } while (0)

Similar to the previous file, the comment mentions this is just for performance tests of atomic operations.

Is this ready for aarch64?
After the analysis of this package I don’t think an aarch64 port is actually required. The only place I noted where one could be used is in the following files:

./src/openpa/src/primitives/opa_gcc_arm.h
./src/openpa/src/primitives/opa_gcc_ia64.h
./src/openpa/src/primitives/opa_gcc_intel_32_64_barrier.h
./src/openpa/src/primitives/opa_gcc_intel_32_64_ops.h
./src/openpa/src/primitives/opa_gcc_intel_32_64_p3barrier.h
./src/openpa/src/primitives/opa_gcc_intrinsics.h
./src/openpa/src/primitives/opa_gcc_ppc.h
./src/openpa/src/primitives/opa_gcc_sicortex.h
./src/openpa/src/primitives/opa_sun_atomic_ops.h

That being said, there is a file that uses gcc intrinsics, which looks like it will work on all platforms. I do not think this package actually requires any porting, and I think it should build on aarch64. However, I cannot test this at this time due to the missing build deps.

Compiling for aarch64
Attempting to compile on aarch64 even with missing deps.
Had to install:

autoconf automake libtool perl-Digest-MD5

cd <source dir>
./autogen.sh
./configure
# Use configure options to disable fortran. Takes a while; I forgot to time it, maybe 20-30 minutes on aarch64.
./configure --disable-f77 --disable-fc
make 2>&1 | tee m.txt

This completed successfully, I think. I will post the log online and link to it. The build took 1 hour and 30 minutes on aarch64.
Compiled successfully on aarch64.

The tests that come with this package all complete successfully on x86_64. On aarch64 the tests fail.

Is it worth trying to build the rpm? I might just run into issues with build deps and the like. It might be best to get some input on whether anything should actually be done to this package.

Next step:
I will be compiling this program on x86_64 and testing specific options, such as building it with assembly and building it with gcc intrinsics. After compiling with the different options, I will run a benchmark to see how much performance the assembly gains. If the improvement is significant, an aarch64 assembly port should be made.


Analysis of Blender

Introduction
Blender is a very powerful, open source 3D animation suite. It allows for modeling, rigging, animating, game development, and many more amazing features. Blender is cross-platform: it runs on Windows, Mac, and Linux. This is an attempt to analyze the blender package to see how much work is required to port it to aarch64 (64-bit arm).

Unfortunately, a lot of this analysis is somewhat inaccurate, because of the amount of SIMD vector programming in many of these files. The search was also not tuned to find SSE, NEON, or any other vector programming, though it probably should have been considering the program.

Get Package:

fedpkg clone -a blender
cd blender
fedpkg prep
cd blender-version

Create a list of files to search:

find ./ | egrep -i "\.s$|\.asm$|\.c$|\.cpp$|\.h$|\.cc$" >> ~/spo600-package1/files.txt

Search for source assembly:

egrep -i "\.s$|\.asm$" ~/spo600-package1/files.txt
(no source assembly files found)

Search for in-line assembly:

egrep -i "asm\(|__asm" $(cat ~/spo600-package1/files.txt) | awk 'BEGIN{FS=":"}{print $1}' | sort -u
./extern/bullet2/src/BulletCollision/BroadphaseCollision/btDbvt.h
./extern/bullet2/src/LinearMath/btConvexHullComputer.cpp
./extern/bullet2/src/LinearMath/btQuickprof.cpp
./extern/bullet2/src/LinearMath/btVector3.cpp
./extern/Eigen3/Eigen/src/Core/arch/SSE/PacketMath.h
./extern/Eigen3/Eigen/src/Core/util/Macros.h
./extern/Eigen3/Eigen/src/Core/util/Memory.h
./extern/libmv/third_party/glog/src/stacktrace_powerpc-inl.h
./extern/libmv/third_party/glog/src/stacktrace_x86-inl.h
./extern/libmv/third_party/glog/src/utilities.h
./intern/cycles/util/util_system.cpp
./intern/moto/include/MT_assert.h
./source/blender/blenlib/intern/cpu.c

Raw Notes on Files
File:
./extern/bullet2/src/BulletCollision/BroadphaseCollision/btDbvt.h
Notes:
-Option to use intrinsics instead of assembly

// Use only intrinsics instead of inline asm
#define DBVT_USE_INTRINSIC_SSE  1

-Intrinsics may be “11% slower”
-Intrinsics vs assembly for this file can be found here.
Decision:
This file has an option to use SSE intrinsics or assembly. However, both of these are specific to the x86_64 architecture.

File:
./extern/bullet2/src/LinearMath/btConvexHullComputer.cpp
Notes:
-Asm is disabled in this file, shown here
-All assembly is within many ifdefs

#ifdef USE_X86_64_ASM

-Because the line that defines it is commented out, none of these should be compiled
Decision:
Does not need to be modified; all the assembly #ifdefs have been disabled due to a bug.

File:
./extern/bullet2/src/LinearMath/btQuickprof.cpp
Notes:
-Assembly was being used for time
-All assembly has been commented out in file
Decision:
Does not need to be modified, all assembly is commented out.

File:
./extern/bullet2/src/LinearMath/btVector3.cpp
Notes:
-File is full of intrinsics
-Lots of vector stuff
-Assembly is used for Apple here
-Assembly used for vector (SIMD) code; I was not quite sure what was going on here (a closer look shows it sits under arm NEON intrinsics)
-Some conditional use of SSE (Streaming SIMD Extensions)
-Some conditional use of arm specific intrinsics
Decision:
Some optional assembly inside. There are a lot of platform-specific intrinsics in this file. The presence of Arm NEON intrinsics is good, though, since it may help in porting to aarch64, which may use something similar to NEON.

File:
./extern/Eigen3/Eigen/src/Core/arch/SSE/PacketMath.h
Notes:
-All assembly is commented out
-“_mm” intrinsics are used everywhere (these are the intel SSE intrinsics)
Decision:
More platform specific intrinsics for the intel platform.

File:
./extern/Eigen3/Eigen/src/Core/util/Macros.h
Notes:
-EIGEN libraries for C/C++
-Linear algebra, matrix, and vector operations
-I think this is portable
Decision:
More intrinsics. Might be an issue.

File:
./extern/Eigen3/Eigen/src/Core/util/Memory.h

File:
./extern/bullet2/src/LinearMath/btConvexHullComputer.cpp

New Issue: Arch Specific Intrinsics
After finding so many files containing platform-specific intrinsics, I decided to stop and do another search. A lot of files turned up that may contain x86 intrinsics. I have not yet found out whether all of these files are required or only optionally compiled. Considering that some package dependencies try to install only if the platform is x86, I suspect this program was made only for x86; the fact that arm NEON intrinsics are present makes that all the more confusing.

Search for some intrinsics:

grep "_mm" $(cat ~/spo600-package1/files.txt ) | awk 'BEGIN{FS=":"}{print $1}' | sort -u
./extern/bullet2/src/BulletCollision/BroadphaseCollision/btDbvt.h
./extern/bullet2/src/BulletCollision/CollisionShapes/btConvexShape.cpp
./extern/bullet2/src/BulletCollision/Gimpact/gim_memory.h
./extern/bullet2/src/BulletDynamics/ConstraintSolver/btSequentialImpulseConstraintSolver.cpp
./extern/bullet2/src/BulletDynamics/ConstraintSolver/btSolverBody.h
./extern/bullet2/src/LinearMath/btMatrix3x3.h
./extern/bullet2/src/LinearMath/btQuadWord.h
./extern/bullet2/src/LinearMath/btQuaternion.h
./extern/bullet2/src/LinearMath/btScalar.h
./extern/bullet2/src/LinearMath/btVector3.cpp
./extern/bullet2/src/LinearMath/btVector3.h
./extern/Eigen3/Eigen/src/Core/arch/SSE/Complex.h
./extern/Eigen3/Eigen/src/Core/arch/SSE/MathFunctions.h
./extern/Eigen3/Eigen/src/Core/arch/SSE/PacketMath.h
./extern/Eigen3/Eigen/src/Core/util/Memory.h
./extern/Eigen3/Eigen/src/Geometry/arch/Geometry_SSE.h
./extern/Eigen3/Eigen/src/LU/arch/Inverse_SSE.h
./extern/libmv/libmv/simple_pipeline/detect.cc
./extern/libmv/libmv/tracking/brute_region_tracker.cc
./extern/libopenjpeg/dwt.c
./extern/libopenjpeg/mct.c
./extern/libopenjpeg/opj_malloc.h
./intern/audaspace/intern/AUD_JOSResampleReader.cpp
./intern/cycles/kernel/kernel_bvh_subsurface.h
./intern/cycles/kernel/kernel_bvh_traversal.h
./intern/cycles/util/util_math.h
./intern/cycles/util/util_types.h
./intern/guardedalloc/intern/mallocn.c
./intern/guardedalloc/intern/mmap_win.c
./source/blender/blenkernel/BKE_subsurf.h
./source/blender/blenkernel/intern/multires.c
./source/blender/blenkernel/intern/subsurf_ccg.c
./source/blender/blenlib/intern/math_geom.c
./source/blender/editors/object/object_bake.c
./source/blender/freestyle/intern/geometry/matrix_util.cpp
./source/blender/imbuf/intern/cineon/cineonlib.h
./source/blender/makesrna/intern/rna_tracking.c
./source/blender/makesrna/intern/rna_userdef.c
./source/blender/render/intern/raytrace/bvh.h
./source/blender/render/intern/raytrace/svbvh.h

Installing Blender on aarch64 Failed
So I decided to try installing Blender on aarch64 to see what happens. The first thing I noticed is that some of its build dependencies have not been built for aarch64. I looked on arm koji and could not find a built package.

yum-builddep blender.spec
Error: No Package found for OpenImageIO-devel

I decided I would try to build it (if it was easy), but the build failed: the package has a script that tries to determine the architecture and fails with an error when it decides it is not x86.

# Start with unknown platform
platform ?= unknown

# Use 'uname -m' to determine the hardware architecture.  This should
# return "x86" or "x86_64"
hw := ${shell uname -m}
#$(info hardware = ${hw})
ifneq (${hw},x86)
  ifneq (${hw},x86_64)
    ifneq (${hw},i386)
      ifneq (${hw},i686)
        $(error "ERROR: Unknown hardware architecture")
      endif
    endif
  endif
endif

Added a couple of lines around the $(error ...) call so the platform check is skipped on aarch64:

ifneq (${hw},aarch64)
endif

This allowed the package to start building, however it needed many build deps. Further investigation shows that this package has multiple build dependencies that have also not been built for aarch64.

I will stop here with trying to get Blender installed on aarch64 as I am getting sidetracked and will move back to analyzing the actual assembly code in Blender.

yum-builddep OpenImageIO.spec
Error: No Package found for Field3D-devel
Error: No Package found for hdf5-devel

Conclusion:
For now I have stopped working on blender; it may be too early in the development of aarch64 to port it. If I were to continue, I would look through the options for compiling blender with as many platform-specific features as possible turned off. This would be useful in determining which files must be ported to aarch64 and which are optional.


SPO600 – Lab 4 – Codebase Analysis

Introduction
In this lab we are downloading an rpm package and searching it for assembly language. We will be looking for .s and .S files, along with searching inside .c files for in-line assembly. It’s also good to search .cpp, .asm, and .h files, as these may also contain platform-specific code. Once all the assembly is believed to be found, we will analyse it to determine what its purpose is: performance, memory barriers, atomics, or low-level interactions.

Assembly is not always the only culprit behind platform-specific code. If an assembly search comes up bare, there may still be intrinsics, which can lock the program down to a specific compiler and even a specific platform. In another program analysis I ran into SIMD vector code that was written in C++ but only compiled on specific architectures.

The package I will be searching is “syslinux”, which consists of many lightweight bootloaders for many different file systems. It also includes a tool for booting legacy operating systems on different media. After looking through the spec file it is clear that this package is only for x86 processors, but it looks like it will run on both 32-bit and 64-bit.

ExclusiveArch: %{ix86} x86_64

Getting the Package
To start my search I downloaded the package and started searching through the source (this was performed on Fedora 19).

Download Package and Source:

fedpkg clone -a syslinux
cd syslinux
fedpkg prep
cd packagename-version

The Search
Objective: Find all files that have assembly within

Gather a list of files to search:

find ./ | egrep -i "\.s$|\.asm$|\.c$|\.cpp$|\.h$|\.cc$" >> ~/spo600-package1/files.txt

Now we have a file “files.txt” which contains the names of all files that might have something to do with assembly. The files ending in “.s”, “.S”, and “.asm” are probably assembly language source files. However, assembly can also appear as in-line assembly, so we will search through all “.c”, “.cc”, “.cpp”, and “.h” files for it.

Search for in-line assembly:

grep "asm(" $(cat ~/spo600-package1/files.txt)
grep "__asm" $(cat ~/spo600-package1/files.txt)

Now we have a list of all files that contain in-line assembly, as well as a list of all the source assembly files found earlier. Now it is time to look and see what they are doing.

Analysis
First we will start with the assembly source files, the ones with “.S” extensions. These files can be found here. Files that seem to come up often are:

memset.S
memcpy.S
memmove.S

There are multiple versions of these files in different directory structures. The directories that these files are in seem to represent different file systems and programs.

The ./dos/memcpy.S program:

#
# memcpy.S
#
# Simple 16-bit memcpy() implementation
#

 .text
 .code16gcc
 .globl memcpy
 .type memcpy, @function
memcpy:
 cld
 pushw %di
 pushw %si
 movw %ax,%di
 movw %dx,%si
 # The third argument is already in cx
 rep ; movsb
 popw %si
 popw %di
 ret

 .size memcpy,.-memcpy

The memcpy routine is written to be called as a function from another program. It copies a specified number of bytes from a source to a destination. First it clears the direction flag with “cld”, which makes the string instruction used later count upward. Next it saves the %di and %si registers on the stack so it can restore them when it is done. It places the destination argument (%ax) in %di and the source argument (%dx) in %si, then runs “rep ; movsb”. The “rep” prefix repeats the next instruction the number of times given by %cx, which happens to hold the third argument. The “movsb” instruction copies a single byte from the address in %si to the address in %di, then increments (or decrements) both addresses according to the direction flag set by “cld”. After copying the memory, byte by byte, to the destination, it restores the registers it saved on the stack and returns.
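
For readers more comfortable with C, the routine above behaves roughly like the following sketch. The name demo_memcpy is made up for this example; syslinux itself uses the assembly version.

```c
#include <stddef.h>

/* Hypothetical C rendering of the assembly routine: copy n bytes forward. */
static void *demo_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;             /* plays the role of %di */
    const unsigned char *s = src;       /* plays the role of %si */

    while (n--) {                       /* "rep" counts %cx down to zero */
        *d++ = *s++;                    /* "movsb" copies one byte, advances both */
    }
    return dst;                         /* %ax still holds the destination */
}
```

Like the assembly, this copies strictly forward one byte at a time, so it is only safe when the source and destination do not overlap in a way that a forward copy would clobber.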


SPO600 – Lab 3 – Assembly Language

Writing Assembly Programs
This was a very fun and interesting lab for the SPO600 course. I apologize now for the length of this blog post, but I wanted to document everything I learned, even some of the less important information. Writing in assembler is very different from other programming languages, mostly because of how simple each instruction is. The complexity in assembly seems to come only from the length of the code, whereas in other languages you can create very complex data structures and objects with single-line commands. It is almost relaxing to do things one step at a time in assembly; debugging it, however, can be a nightmare, since it’s more difficult to read. I would expect large programs written in assembly to be impossible to manage, with so many lines and so many tiny things that could go wrong. It was, however, very fun to learn and to write in this language.

The goal was to write an x86_64 assembly program (using gas, the GNU Assembler) with a loop that increments an integer and prints out a string with the incremented number. There are multiple stages of adding features and ways to change and improve as you go along. You can find the lab here. Another fun part about this lab is that at each step you complete, you port the assembly code from x86_64 over to aarch64. Here I will be going over many of the issues I had and the solutions I found. Below is the output we needed from our program:

Loop:  0
Loop:  1
Loop:  2
Loop:  3
Loop:  4
Loop:  5
Loop:  6
Loop:  7
Loop:  8
Loop:  9
Loop: 10
Loop: 11
... and so on

Help and Tips
When running into trouble and errors with assembly code, three things helped a lot. First was writing a C program that performed a similar function and then analysing the binary and the instructions within; this helped with seeing both which syntax was used and how it was used. Second was reading some of the instruction set and some of the information from the ARMv8_ISA_Overview. Finally, reading the quick start guides on the zenit wiki multiple times, for both AARCH64 and X86_64.

X86_64 Assembly
First I will post the final assembly program, then slowly take it apart and explain what is going on and, in some cases, why it differs between x86_64 and aarch64.

.text
.globl  _start

start = 0
max = 31

_start: 
        mov     $start,%r15     /* starting value in register */
        and     $0,%r12         /* value of 0 */
        add     $0x30,%r12      /* convert to ascii 0 */


loop:   
        // division for 2-digit number
        and     $0,%rdx         /* clear remainder */
        mov     %r15,%rax       /* set dividend */
        mov     $10,%r10        /* set divisor */
        div     %r10            /* divide */
        mov     %rax,%r14       /* store first digit */
        mov     %rdx,%r13       /* store second digit */

        // modify msg
        add     $0x30,%r14      /* convert increment to ascii */
        add     $0x30,%r13      /* convert increment to ascii */
        mov     %r13b,msg+7     /* modify single byte in msg */

        // skip if first digit is 0
        cmp     %r12,%r14
        je      continue
        mov     %r14b,msg+6     /* modify single byte in msg */

continue:
        // write
        mov     $len,%rdx       /* length of string */
        mov     $msg,%rsi       /* string */
        mov     $1,%rdi         /* file descriptor 1 = stdout */
        mov     $1,%rax         /* syscall 1 = write */
        syscall

        // loop
        inc     %r15            /* increment register */
        cmp     $max,%r15       /* compare max to increment value */
        jne     loop            /* jump to loop if not equal */

        // exit
        mov     $0,%rdi         /* exit status */
        mov     $60,%rax        /* syscall 60 = exit */
        syscall

.data
msg:    .ascii  "Loop:   \n"
len = . - msg

Assembly initialization
In the following snippet we are simply preparing for the later parts of the program. We load the starting point into a register so we can use it later, then zero another register and add 0x30 to turn that zero into ascii “0”. This could have been done in one instruction by moving 0x30 (the ascii value of 0) straight into the register. We will need the ascii value of zero at a later point in the program.

.text
.globl  _start

start = 0
max = 31

_start: 
        mov     $start,%r15     /* starting value in register */
        and     $0,%r12         /* value of 0 */
        add     $0x30,%r12      /* convert to ascii 0 */

Assembly Division
In this next part we divide the loop counter by 10 to split it into separate digits. This is necessary because each ascii digit only covers 0-9; there is no single ascii value for “10”, which takes two bytes, one for “1” and one for “0”. First we clear rdx: div treats rdx:rax as one 128-bit dividend, so leftover bits in rdx give incorrect results. Next we move the dividend into register rax; this is the number we want to divide. Then we set the divisor, which can be placed in any valid register and should equal 10. Then we run the divide instruction, which divides the value in rax by the value in the register we give it, here r10. The results are the quotient (stored in rax), which is the first digit, and the remainder (stored in rdx), which is the second digit. Finally, I move the results into safe registers to keep for later (the registers preserved across function calls are rsp, rbp, rbx, r12, r13, r14, and r15).

loop:   
        // division for 2-digit number
        and     $0,%rdx         /* clear remainder */
        mov     %r15,%rax       /* set dividend */
        mov     $10,%r10        /* set divisor */
        div     %r10            /* divide */
        mov     %rax,%r14       /* store first digit */
        mov     %rdx,%r13       /* store second digit */
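
As a rough sketch of the same idea in C (the helper name split_two_digits and the struct are my own inventions for illustration, not part of the lab), integer division by 10 gives the first digit and the remainder gives the second:

```c
#include <assert.h>

/* Hypothetical helper for illustration; the assembly does this inline. */
struct digits {
    unsigned tens;  /* the quotient, which div leaves in rax */
    unsigned ones;  /* the remainder, which div leaves in rdx */
};

struct digits split_two_digits(unsigned value)
{
    struct digits d;

    assert(value < 100);   /* only two decimal digits are handled */
    d.tens = value / 10;   /* first digit */
    d.ones = value % 10;   /* second digit */
    return d;
}
```

For example, split_two_digits(23) gives tens = 2 and ones = 3, the same pair the assembly saves in r14 and r13.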

Assembly Ascii Conversion
The first two add instructions just convert the 2 digits from their number values to their ascii counterparts. This is one of the tricky spots I found in both x86_64 and aarch64: we now need to modify a specific byte in a string we create in our data section (this string is at the end of the file). In order to do this we use special syntax within the mov instruction. Register r13 holds our second digit's ascii value, but it is a 64-bit register, so you need to use %r13b. The added “b” means byte; it will move only 1 byte over to the memory location specified. We are putting the byte into the memory address of msg, but we specify msg+7. Our string “Loop:   \n” has 9 bytes, and msg+7 puts the byte at offset 7 (counting from 0), which is the last space before the newline.

        // modify msg
        add     $0x30,%r14      /* convert increment to ascii */
        add     $0x30,%r13      /* convert increment to ascii */
        mov     %r13b,msg+7     /* modify single byte in msg */

Assembly Jump/Branch
This next part is fairly similar to the last step of placing an ascii value into the string. This ascii byte is the first digit of the number; however, we do not want to show this digit if it is a “0”. So we need to make a comparison between this ascii value and the ascii “0” value we created at the start of the program. If it is equal to “0”, we jump to the label continue, which skips the single instruction that would place the first digit in the string.

        // skip if first digit is 0
        cmp     %r12,%r14
        je      continue
        mov     %r14b,msg+6     /* modify single byte in msg */

continue:
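
The byte patching and the skipped leading zero can be sketched in C like this. The function name format_loop_line is hypothetical, and unlike the assembly, which patches msg in place, this sketch copies a fresh template on every call:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: writes a value 0..99 into a copy of the template. */
void format_loop_line(char out[10], unsigned value)
{
    assert(value < 100);
    memcpy(out, "Loop:   \n", 10);          /* 9 characters plus the NUL */
    out[7] = (char)('0' + value % 10);      /* second digit, like msg+7 */
    if (value / 10 != 0)                    /* skip a leading zero */
        out[6] = (char)('0' + value / 10);  /* first digit, like msg+6 */
}
```

Calling it with 5 leaves offset 6 as a space, while calling it with 30 fills both offsets.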

Assembly System Call
This next part prints out the string we created, including the digits we placed in it. To use the sys_write call we need a file descriptor, a string, and the length of the string. We put the length into the third argument register, the address of msg into the second argument register, and the file descriptor “1” (1 = stdout) into the first argument register. We then put “1” into the rax register; on x86_64, when using the syscall instruction, “1” means sys_write. The syscall instruction then invokes the system call.

        // write
        mov     $len,%rdx       /* length of string */
        mov     $msg,%rsi       /* string */
        mov     $1,%rdi         /* file descriptor 1 = stdout */
        mov     $1,%rax         /* syscall 1 = write */
        syscall
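
For comparison, here is a minimal sketch of the same call from C using the POSIX write() function, which takes the same three arguments. The wrapper name print_line is hypothetical:

```c
#include <assert.h>
#include <unistd.h>

/* Hypothetical wrapper for illustration: send a buffer to stdout. */
ssize_t print_line(const char *text, size_t length)
{
    assert(text != NULL);
    /* Same order as the registers: descriptor, buffer, length. */
    return write(STDOUT_FILENO, text, length);
}
```

On success write() returns the number of bytes written, which is also what the system call leaves in rax.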

Assembly Loop
This is how the loop functions for this assembly program. We have a register containing a value that starts at 0 and increments each time it runs the inc instruction. Then the cmp instruction compares the max amount of times we want to run to the incrementing value. Finally, if the values are not equal, it jumps back to the label “loop”, which is at the beginning of the code.

        // loop
        inc     %r15            /* increment register */
        cmp     $max,%r15       /* compare max to increment value */
        jne     loop            /* jump to loop if not equal */

This final bit of code is a syscall used to exit the program. Below it is the .data directive/section, which was used to create 2 labels/“variables”. One of them holds the string that we print and the other holds the length of the string.

        // exit
        mov     $0,%rdi         /* exit status */
        mov     $60,%rax        /* syscall 60 = exit */
        syscall

.data
msg:    .ascii  "Loop:   \n"
len = . - msg

AARCH64 Assembly
The aarch64 version of the above code does the same thing as the X86_64 code, except it does things a little differently. One of the main differences with aarch64 is that it yells at you every time you try to use a label or value as the first argument of an instruction. That, along with the reversed direction of arguments, makes things a little confusing sometimes. For example:

// move value from register 1 to register 0
mov     x0,x1    /* AARCH64 */

// move value from register 14 to register 13
mov     %r14,%r13    /* X86_64 */

Another thing to note is that X86_64 gas assembly uses % to mark registers and $ to mark values. There are a few other things that aarch64 does differently, such as different ways of modifying memory, different instructions, and new syntax. I found that aarch64 instructions seem much simpler, and more powerful. Aarch64 has very few ways to write each instruction, and if it’s not written the right way, the assembler will complain and tell you it’s wrong. On x86_64 the program will just not work properly, because each instruction can be written in so many ways, with many different functions. Here is the same program as above in aarch64:

.text
.globl  _start

start = 0
max = 31

_start:
        mov     x28,start       /* start value */
        mov     w20,0           /* get value 0 */
        add     w26,w20,0x30    /* convert to ascii 0 */

loop:

        // div
        mov     x20, 10         /* use value 10 */
        udiv    x21,x28,x20     /* divide by 10 */
        msub    x22,x20,x21,x28 /* get remainder */

        // modify msg
        add     w23,w21,0x30    /* convert increment to ascii */
        add     w24,w22,0x30    /* convert increment to ascii */
        adr     x25,msg         /* save address of msg in register */
        strb    w24,[x25,7]     /* store byte in msg, offset 7 */
        cmp     w23,w26         /* compare if it is ascii 0 */
        beq     continue        /* skip next instruction if above is ascii 0 */
        strb    w23,[x25,6]     /* store byte in msg, offset 6 */

continue:
        // write
        mov     x2,len          /* length of string */
        adr     x1,msg          /* save address of msg */
        mov     x0,1            /* file descriptor 1 = stdout */
        mov     x8,64           /* syscall 64 = write */
        svc     0

        // loop
        add     x28,x28,1       /* increment register */
        cmp     x28,max         /* check max size */
        bne     loop            /* branch to loop if not equal */

        // exit
        mov     x0,0            /* exit status */
        mov     x8,93           /* syscall 93 = exit */
        svc     0

.data
msg:    .ascii  "Loop:   \n"
len= . - msg

Aarch64 Assembly Instruction Differences
I really liked a few of the instructions for aarch64, such as the add instruction. It functions almost like an add and a mov instruction combined together. It takes the second and third arguments, adds them together, and puts the result inside the first argument.

add     w26,w20,0x30

Aarch64 Assembly Division
Another interesting difference is with the division. The udiv instruction divides the second argument by the third and places the quotient in the first argument. However, this means that there is no remainder obtained from the udiv instruction. In order to get it you must use a msub instruction with the following formula:

remainder = dividend - (divisor * quotient)
        mov     x20, 10         /* use value 10 */
        udiv    x21,x28,x20     /* divide by 10 */
        msub    x22,x20,x21,x28 /* get remainder */
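
Here is a small C sketch of the same two-step calculation. The msub instruction computes x22 = x28 - (x20 * x21), which is the dividend minus the divisor times the quotient. The function name remainder_via_msub is hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper mirroring udiv followed by msub. */
uint64_t remainder_via_msub(uint64_t dividend, uint64_t divisor)
{
    uint64_t quotient;

    assert(divisor != 0);
    quotient = dividend / divisor;         /* udiv x21,x28,x20 */
    return dividend - divisor * quotient;  /* msub x22,x20,x21,x28 */
}
```

With a dividend of 23 and a divisor of 10, the quotient is 2 and the result is 23 - (10 * 2) = 3.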

Aarch64 Assembly Memory Addresses
One of the main problems I had on aarch64 was trying to change a single byte inside a string. It is not the same as x86_64 because it requires you to use the adr instruction instead of mov. First you use the adr instruction with the label msg, which saves the address into a register. You have to do this because you cannot put msg directly inside the strb instruction: aarch64 is a load/store architecture, so store instructions take their address from a register. Next you use the instruction strb (store byte), which requires a “w” register for the first argument. The second argument contains the address, which we saved to the register, and the final number is the offset from that address, which selects the byte in the string you’d like to change.

        adr     x25,msg         /* save address of msg in register */
        strb    w24,[x25,7]     /* store byte in msg, offset 7 */

SPO600 – Lab 2

Analyzing Binary Files x86_64
In this lab for SPO600 we are compiling a C hello world program with a few different compiler options, and analyzing the binaries with “objdump”. Using the program objdump and a couple of its options, we will disassemble and view information on these files, and learn to read what they are doing in an assembly-style format. The lab material can be found here.

The first step I took in performing this lab was to make the necessary c files and a Makefile for performing all the compilations.

hello.c

#include <stdio.h>

int main() {
    printf("Hello World!\n");
}

hello-args.c

#include <stdio.h>

int main() {
    printf("Hello World!\n", 1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
}

hello-function.c

#include <stdio.h>

void dothis(void);

int main() {
        dothis();
}

void dothis(void) {
        printf("Hello World!\n");
}

Makefile

BINARIES=hello hello-1 hello-2 hello-3 hello-4 hello-5 hello-6

all:    ${BINARIES}

hello:          hello.c
        gcc     -g      -O0     -fno-builtin    -o hello        hello.c

hello-1:        hello.c
        gcc     -static -g      -O0     -fno-builtin    -o hello-1      hello.c

hello-2:        hello.c
        gcc     -g      -O0     -o hello-2      hello.c

hello-3:        hello.c
        gcc     -O0     -fno-builtin    -o hello-3      hello.c

hello-4:        hello-args.c
        gcc     -g      -O0     -fno-builtin    -o hello-4      hello-args.c

hello-5:        hello-function.c
        gcc     -g      -O0     -fno-builtin    -o hello-5      hello-function.c

hello-6:        hello.c
        gcc     -g      -O3     -fno-builtin    -o hello-6      hello.c

clean:
        rm      -f ${BINARIES}  *.o

Using the Makefile I simply run the “make” command and it will compile all parts for the lab. Using Makefiles really helps to keep everything organized and really clean. They save tons of time not having to rewrite all the compiling options and it confirms that each binary is the most recent version. It becomes even more useful when assembling files because you then have to link the file afterwards, which is just one more command to write. Now to begin analyzing the binary files! I will be using the following command to disassemble and view each of the binary file’s information:

objdump --source ./binary_file_here | less

Change (1)

Add the compiler option -static. Note and explain the change in size, section headers, and the function call.

gcc     -static -g      -O0     -fno-builtin    -o hello-1      hello.c

After adding -static to the gcc options, the size of the file grew by almost 100 times. With this new option, dynamic linking with shared libraries is prevented and static linking is enabled. This means the library code the program uses is copied directly into the binary, hence the massive increase in size.

Change (2)

Remove the compiler option -fno-builtin. Note and explain the change in the function call.

gcc     -g      -O0     -o hello-2      hello.c

After removing -fno-builtin from the gcc options, the function call in main changed. The code became smaller and started using the function “puts” instead of “printf”. After a little research, it seems that “puts” is much faster than “printf” when “printf” is given only a plain string with no extra arguments. This is possibly because printf must scan the string for format specifiers, checking each character. With builtins enabled, the compiler is allowed to make this substitution as an optimization during compilation.

Change (3)

Remove the compiler option -g. Note and explain the change in size, section headers, and disassembly output.

gcc     -O0     -fno-builtin    -o hello-3      hello.c

After removing -g from the gcc options, the file became about 10% smaller. The output also no longer shows any of the source code when I use “objdump --source”. The “-g” option in gcc adds extra debugging information, such as the mapping from machine code back to lines of the source files, which tools like objdump and the “gdb” debugger use.

Change (4)

Add additional arguments to the printf() function in your program. Note which register each argument is placed in.

gcc     -g      -O0     -fno-builtin    -o hello-4      hello-args.c

Adding arguments to the printf call shows each argument in the disassembled code as a mov or movl instruction. It looks like the mov instructions place arguments into registers, while the movl instructions save arguments into memory. At first this made me wonder why only the first 5 numbers are placed in registers. On x86_64 Linux the calling convention passes the first six integer or pointer arguments in registers (rdi, rsi, rdx, rcx, r8, and r9); the format string takes the first one, which leaves five for the numbers. The remaining arguments are stored on the stack at increasing memory addresses.
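
To make the register order concrete, here is a small C lookup sketch for integer arguments under the System V AMD64 calling convention used on x86_64 Linux. The helper names int_arg_location and printf_values_in_registers are hypothetical, and the sketch ignores floating point arguments, which use separate registers:

```c
#include <assert.h>
#include <stddef.h>

/* Integer/pointer argument registers, in the order the convention uses them. */
static const char *const INT_ARG_REGISTERS[] = {
    "rdi", "rsi", "rdx", "rcx", "r8", "r9"
};

#define INT_ARG_REGISTER_COUNT \
    (sizeof INT_ARG_REGISTERS / sizeof INT_ARG_REGISTERS[0])

static_assert(INT_ARG_REGISTER_COUNT == 6, "six integer argument registers");

/* Hypothetical helper: where does the Nth (0-based) integer argument go? */
const char *int_arg_location(size_t index)
{
    return index < INT_ARG_REGISTER_COUNT ? INT_ARG_REGISTERS[index] : "stack";
}

/* How many values after printf's format string still fit in registers. */
size_t printf_values_in_registers(size_t value_count)
{
    size_t available = INT_ARG_REGISTER_COUNT - 1;  /* rdi holds the format */
    return value_count < available ? value_count : available;
}
```

For the ten numbers in hello-args.c, printf_values_in_registers(10) gives 5, and int_arg_location(6) reports "stack" for the first argument that no longer fits.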

Change (5)

Move the printf() call to a separate function, and call that function from main(). Explain the changes in the object code.

gcc     -g      -O0     -fno-builtin    -o hello-5      hello-function.c

Using a separate function that calls printf in the C code changes the way the object code looks when viewed with objdump. Originally there was a printf call in main; now, however, there is a call to the custom function that was created. There is an address beside the new function call in main, and if you follow this address to its location, you find the custom function’s actual code. This code shows the printf call that would normally be inside main.

Change (6)

Remove -O0 and add -O3 to the gcc options. Note and explain the difference in the compiled code.

gcc     -g      -O3     -fno-builtin    -o hello-6      hello.c

At first glance it looks like the file grew a little larger (about 10% again); this is due to a trade-off between file size and optimization of the compiled program. By using the option “-O3” with gcc you get faster performance at the cost of larger file size and longer compilation times. One of the optimizations that is noticeable when using “objdump” is that the push and pop from the stack in the main function are no longer there, since they don’t need to be, which makes for unnoticeable performance gains in a hello world program. Some notes about the optimization levels: “-O3” may mess up some of the debugging data. “-O2” is very similar to it, except it tries not to increase file size and lacks a few optimizations. “-O0”, on the other hand, reduces compiling time and allows the debugging information to show properly.


SPO600 – Lab 1

The first lab for the course SPO600 is to review two different open source software packages and their review processes. I will be researching the process for the software packages: Ansible and Python.

Ansible – GPLv3
Ansible is a configuration management system for large numbers of computers. Using Ansible scripts and modules you can perform almost any task (shutdown, updates, copying files, configuration, etc.) across any number of computers, with a single run of a command or even automatically.

Ansible uses GitHub for version control. This shows the source for each release they make, all the commits, the contributors, and the bug tracker they use. Ansible has almost 500 contributors, ranging from over 165,000 lines added (the author of Ansible) to about 9 lines added. Other communities that are available for Ansible are its mailing list, which is just a Google Groups forum where users can ask questions, and the IRC channel #ansible on Freenode.

Ansible has a bug tracker on GitHub. This is used for all bugs that appear, and many seem to be fixed within a day or two. The issue I followed was this one. The poster said what the issue was and why it was happening. One of the contributors then made a patch and closed the bug, with a followup of a test run with the patch which showed the issue resolved. While this particular bug appeared to only have two people involved, there may have been more people behind the scenes since there were possible duplicate issues submitted.

Python – Python Software Foundation License
Python is an interpreted, interactive, object-oriented programming language. It is a very popular language which is used throughout Linux systems and is even used a lot by users on Windows. The Python library is very large and there are not too many things that you can’t do in Python anymore.

The community page for Python shows mailing lists, IRC, the bug tracker, and even some dates for conferences and workshops. The bug tracker uses the Roundup Issue Tracker to control all the issues and tickets. The specific issue I followed was a review of a function for the Python subprocess module. This issue had 3 people comment on it and left a link to the actual review of the code. There was a lot of back-and-forth communication about whether this should be put into the subprocess module, along with recommendations and requests to fix bugs. The reviewer was hesitant to let this function be added and in the end rejected the patch until it became a little more mature.

Comparison
The review processes seem fairly similar in function: people talk back and forth on a “ticket” or “review request”, usually between 2-4 people. The Python review process looks to be a lot more difficult than the Ansible process. In the examples I found, Python was rejecting a function based on its age and whether it had been “very” well tested. Ansible, on the other hand, seemed to encourage all types of functionality, but seemed picky about how it is invoked in the Ansible scripts, making sure it is consistent with the rest of Ansible. It really looks like both review processes help the contributor a lot with assistance and information, while trying to keep them on track with different goals/guidelines to get the review approved.
