
MCHF

Introduction

The state of a many-electron system is described by a wave function $\Psi$ that is a solution of the Schrödinger equation:
\begin{displaymath}
({\cal H}-E)\Psi=0
\end{displaymath} (3)

The approximate wave function $\Psi$ of the state labeled $\gamma L S$ is:

\begin{displaymath}
\Psi ( \gamma L S) = \sum_j c_j \Phi (\gamma _j L S ),
\end{displaymath} (4)

where $\gamma$ represents the dominant configuration, and any additional quantum numbers required for uniquely specifying the state being considered. The MCHF wave function $\Psi$ is expanded in terms of configuration state functions (CSFs) $\{ \Phi \}$ having the same $LS$ symmetry but arising from different electronic configurations ($ \gamma _j $). The mchf procedure consists of optimizing to self-consistency both the set of radial functions $\{ P_{n_j l_j} (r) \} $ and the mixing coefficients $\{ c_j \}$. The CSFs are built from a basis of one-electron spin-orbital functions, each containing a radial function $P_{nl}(r)$:
\begin{displaymath}
\phi _{nlm_lm_s} = \frac{1}{r} P_{nl} (r) Y_{lm_l}
(\theta, \varphi) \chi_{m_s}.
\end{displaymath} (5)

Associated with the wave function expansion is an energy functional for a given LS term and eigenvalue.

The traditional mchf program has been extended to allow the simultaneous optimization of energy expressions derived from several different terms, or even from several eigenvalues of the same term. The energy functional is then represented as a weighted average of the energy functionals for wave function expansions of different LS terms or parities. This approach facilitates Breit-Pauli calculations for complex atomic systems, for which previously somewhat arbitrary methods, such as cross-wise optimization, had been applied.

mchf was modified for systematic, large-scale methods using dynamic memory allocation and sparse matrix methods. All orbitals in a wave function expansion are assumed to be orthonormal. Configuration states are restricted to at most eight (8) subshells in addition to the closed shells common to all configuration states. The maximum size of a case is limited by the available memory and disk space. The wave function expansions are obtained from orbital sets of increasing size, allowing for the monitoring of convergence. The Davidson algorithm [dvdson] is applied for finding the needed eigenvalues and eigenvectors. In this version of the code, non-orthogonality is not supported. In the present atsp2K MCHF package, optimization over different parities is not foreseen, only over different terms of the same parity; we refer to this as "simultaneous optimization". Suppose ${\cal E}(T_i)$ represents an energy functional for term $T$ and eigenvalue $i$, assuming that both orbitals and wave functions are normalized. Then optimization is performed on the functional

\begin{displaymath}{\cal E} = \sum_{T_i} w_{T_i} {\cal E}(T_i) /\sum_{T_i}w_{T_i}\end{displaymath}

where $w_{T_i}$ is the weight for ${T_i}$.
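
As a minimal illustration of this weighted average, the sketch below accumulates the two sums; the function WAVG and its arguments are illustrative names, not actual mchf variables:

*     Sketch: weighted average of term/eigenvalue energies.
*     NTERM, W and ETERM are illustrative names, not the
*     actual mchf variables.
      DOUBLE PRECISION FUNCTION WAVG(NTERM, W, ETERM)
      INTEGER NTERM, I
      DOUBLE PRECISION W(NTERM), ETERM(NTERM), SUMW, SUME
      SUMW = 0.0D0
      SUME = 0.0D0
      DO 10 I = 1, NTERM
         SUMW = SUMW + W(I)
         SUME = SUME + W(I)*ETERM(I)
   10 CONTINUE
      WAVG = SUME/SUMW
      END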


Program Structure

The mchf program, Figure 5.10, performs data initialization (using inita(), initr(), and data()) and applies the SCF procedure. Output data, including the wave function approximation, are written by output(), summry(), wfn(), and eig_out().

Figure 5.10: main() of the mchf program: data initialization, scf() iterations, data output
\begin{figure}\centerline{ \psfig{file=tex/fig/mchf_main.epsi}}\end{figure}

Data initialization

mchf requires a number of input items for proper operation. The first part of the program, data(), shown in Figure 5.11, is designed to handle both input processing and memory management.

Estimates of the radial wave functions are taken from the file wfn.inp, if provided; otherwise, hydrogenic estimates are used. Upon completion (after a maximum of 200 iterations or a change in the weighted average energy of less than $10^{-8}$ a.u., in default mode) an updated wfn.out file is produced. During the course of the calculation, intermediate orbital results are written to the wfn.out file for restart purposes, in case the process terminates prematurely.
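
The default stopping rule can be sketched as follows; CONVRG and its arguments are illustrative names, not the actual mchf variables:

*     Sketch of the default SCF stopping rule: stop after 200
*     iterations or when the weighted average energy changes
*     by less than 1.D-8 a.u.
      LOGICAL FUNCTION CONVRG(IT, EOLD, ENEW)
      INTEGER IT
      DOUBLE PRECISION EOLD, ENEW
      CONVRG = (IT .GE. 200) .OR.
     :         (ABS(ENEW - EOLD) .LT. 1.0D-8)
      END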

Figure 5.11: Subroutine data(). After reading parameters from cfg.h, memory for cfgs and coefficients is allocated in alloc_mem. wavefn() reads the input wave function from wfn.inp. spintgrl() reads input files with the integral data and calls alcsts() for pointer initialization and memory management.
\begin{figure}\begin{center}
\centerline{\epsfig{figure=tex/fig/mchf_data.epsi,height=8cm}}\end{center}\end{figure}

Figure 5.12 shows the data structure of the interaction matrix (hmx) and the associated pointers (ico and ih). For each nonzero matrix element, ico records the running total of angular coefficients as the matrix is traversed in the direction of the arrows. The row index of each nonzero element is stored in ih.

Figure 5.12: Data Structure of the interaction matrix.
\begin{figure}\begin{center}
\centerline{\psfig{figure=tex/fig/mchf_data_mat.epsi}}\end{center}\end{figure}

Each coefficient data structure consists of two quantities:

  coeff   numerical coefficient (double precision)
  inptr   index pointer to an integral (integer)

Thus the position of each integral needs to be known in advance; this is achieved by generating all possible integrals from the orbital set once, before the start of the SCF iterations.
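
This indexed scheme can be sketched as follows: each angular coefficient coeff(k) carries a pointer inptr(k) into the table value() of pre-generated integral values, so a contribution is a simple indexed product. The wrapper CONTRB below is illustrative:

*     Sketch of the coefficient data structure: COEFF(K) is the
*     angular coefficient and INPTR(K) points into the table
*     VALUE() of pre-generated integral values.  The wrapper
*     itself is illustrative.
      DOUBLE PRECISION FUNCTION CONTRB(NCOEF, COEFF, INPTR, VALUE)
      INTEGER NCOEF, INPTR(NCOEF), K
      DOUBLE PRECISION COEFF(NCOEF), VALUE(*)
      CONTRB = 0.0D0
      DO 10 K = 1, NCOEF
         CONTRB = CONTRB + COEFF(K)*VALUE(INPTR(K))
   10 CONTINUE
      END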

During the initial phase of data initialization, data() allocates memory for the problem; the details are covered in section 5.3.

SCF procedure

The scf() algorithm is the most CPU-intensive part of the mchf program. It is an iterative process with two sequential phases in each iteration: solving the differential equations, then solving the eigenvalue problem (Figure 5.13).

Figure 5.13: Two phases of an scf() iteration: The eigenvalue phase constructs the matrix, finds eigenvalues and updates the coefficients. The differential equation phase updates radial functions on each node.
\begin{figure}\centerline{ \psfig{file=tex/fig/mchf_mpi.eps}}\end{figure}

The first phase solves the differential equation for each radial function and finds the sets of radial functions $\{ P_{n_j l_j} (r) \} $, Figure  5.14.

Figure 5.14: Steps in finding the radial functions.
\begin{figure}\centerline{ \psfig{file=tex/fig/scf_de.eps}}\end{figure}

The routines applied in this phase are shown in Figure 5.15.

Figure 5.15: de() procedure. On each iteration diag(), de(), orthog() and grange() are called.
\begin{figure}\begin{center}
\centerline{\psfig{figure=tex/fig/mchf_de.epsi}}\end{center}\end{figure}

Diagonalization

In the next phase the eigenvalue problem is solved, and the mixing coefficients $\{ c_j \}$ are updated.

Figure 5.16: Steps in solving the eigenvalue problem.
\begin{figure}\centerline{ \psfig{file=tex/fig/scf_diag.eps}}\end{figure}

During this phase diag() performs the process referred to as "simultaneous optimization" in three distinct steps. During the first step the matrix elements are assembled and the matrix is diagonalized. Diagonalization occurs in one of the routines diag_memory_all(), diag_disk_clst(), diag_disk_ico(), diag_disk_hmx(). They differ in the storage methods, which are chosen as a function of the available memory. The main loop runs over all coefficients, coeff, which are contained in the file c.lst. It is a complicated loop with a number of logical statements and mixed single and double precision arithmetic; this limits the achievable floating-point performance to less than 30% of the theoretical peak. The fastest routine is diag_memory_all(), which does not include any disk I/O. The main loop is shown below:

      do ii = 1, n_cf
         n_count_tmp = ncoef + ii
*        .. advance to the next nonzero matrix element?
         if (ii .gt. ico(nijcurr+nz)) nijcurr = nijcurr + 1
*        .. accumulate the contribution: coefficient times integral
         hmx(nijcurr) = hmx(nijcurr) +
     :           coeff(n_count_tmp)*value(inptr(n_count_tmp))
*        .. have we moved into the next column?
         if (nijcurr .gt. jptr(jjh)) then
            jjh = jjh + 1
            max_col = max_col + 1
         end if
      end do

This fragment assembles the matrix elements using the coefficient-count pointer ico, the column pointer jptr, and the coefficients coeff. The remaining diag_*() routines perform the same task but use disk I/O to varying degrees, with the worst case being diag_disk_hmx(), which reads each of the files c.lst, ih.nn.lst, and ico.lst.
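
When the coefficient data reside on disk, the list is processed in buffered chunks on each iteration, as in the following sketch. The record layout assumed here (one unformatted record per chunk) is for illustration only and is not the actual c.lst format:

*     Sketch: buffered processing of a coefficient list held on
*     disk.  The record layout (one unformatted record per
*     chunk) is an assumption, not the actual c.lst format.
      SUBROUTINE RDBUF(IU, NTOT)
      INTEGER LSDIM
      PARAMETER (LSDIM = 30000)
      INTEGER IU, NTOT, NLEFT, N, I, INPTR(LSDIM)
      DOUBLE PRECISION COEFF(LSDIM)
      NLEFT = NTOT
   10 IF (NLEFT .GT. 0) THEN
         N = MIN(NLEFT, LSDIM)
         READ (IU) (COEFF(I), INPTR(I), I = 1, N)
*        ... process the N buffered entries here ...
         NLEFT = NLEFT - N
         GO TO 10
      END IF
      END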

The next step computes the selected eigenvalues using the Davidson algorithm. The eigenvectors for each requested eigenvalue are saved and applied to update the coefficients. This step proceeds sequentially over each block; after the last block the weighted total energy is computed and displayed for each iteration.

Figure 5.17: Steps in solving the eigenvalue problem.
\begin{figure}\centerline{\psfig{file=tex/fig/diag_steps.eps}}\end{figure}

The optimization is performed on the functional

\begin{displaymath}{\cal E} = \sum_{T_i} w_{T_i} {\cal E}(T_i) /\sum_{T_i}w_{T_i}\end{displaymath}

where $w_{T_i}$ is the weight for $T_i$ and ${\cal E}(T_i)$ represents an energy functional for term $T$ and eigenvalue $i$. Updating the coefficients proceeds similarly to the calculation of the matrix elements. Depending on storage requirements and availability, one of the routines UPDATC_memory_all(), UPDATC_disk_ico(), UPDATC_disk_clst(), UPDATC_disk_ih() is applied to update the coefficients of all integrals that contribute to the energy. This procedure is applied to all blocks, Figure 5.17. For all coefficients, the weighted contribution of the mixing coefficients is accumulated in a complicated loop containing multiple logical constructs and both floating-point and integer arithmetic. Upon convergence, during the last iteration, the routine prprty() is called to compute a number of wave function properties, which are saved in the file summry.

The listing below shows the complexity of the coefficient updating process.

 
      if (iblock == 1) ncoef = 0
      do i = 1, cf_tot(iblock)
         n_count_tmp = ncoef + i
*        .. advance to the next nonzero matrix element?
         if (i.gt.ico(nijcurr)) then
            nijcurr = nijcurr + 1
*           .. have we also changed column?
            if (nijcurr.gt.jptr(max_col)) then
               jjh = jjh + 1
               max_col = max_col + 1
            end if
         end if
         iih = ihh(nijcurr)
         im1 = 0
*        .. loop over the requested eigenvalues
         do j = 1, maxev
            ioffw = (j-1)*ncfg
            if (leigen(j,iblock)) then
               wcoef = eigst_weight(j,iblock)
               W = wcoef*wt(ioffw+iih)*wt(ioffw+jjh)
               T = W*coeff(n_count_tmp)
*              .. off-diagonal elements contribute twice
               IF (IIH .NE. JJH) T = T + T
               coef(inptr(n_count_tmp)) = coef(inptr(n_count_tmp)) + T
               if (last) then
                  W0 = wt(ioffw+iih)*wt(ioffw+jjh)
                  T0 = W0*coeff(n_count_tmp)
                  IF (IIH .NE. JJH) T0 = T0 + T0
                  itmp_s = (inptr(n_count_tmp)) + idim*im1
                  tmp_coef(itmp_s) = tmp_coef(itmp_s) + T0
                  im1 = im1 + 1
               end if
            end if
         end do
      end do

Figure  5.18 shows the implementation and the call sequence in diag().

Figure 5.18: diag() computes the elements of the H matrix and solves the eigenvalue problem. At the beginning, memory is allocated for the diagonalization procedure. Then, one of the versions of diag_*() is called and the matrix elements are computed and diagonalized. Further, the dvdson() routines return the eigenvector(s), and the memory used by dvdson() is deallocated. Then, the coefficients are updated in one of the versions of updatc_*(). NOTE: the versions of diag_*() and updatc_*() differ only with regard to memory and disk use.
\begin{figure}\begin{center}
\centerline{\psfig{figure=tex/fig/mchf_diag.epsi}}\end{center}\end{figure}

There are specific storage requirements for each step: the arrays hmx_diag, inptr, coeff, and ico are accessed only once. In contrast, the iterative solution of the eigenvalue problem (dvdson) requires multiple read accesses to the interaction matrix (only hmx and ih are used). Therefore, higher priority for storing in memory was given to hmx and ih. After dvdson, the memory used in diag_hmx and dvdson is deallocated and reused in updatc. Finally, updating the coefficients requires a single access to ih and ico, suggesting higher priority for ico over coeff. Table 5.3 shows the multilevel storage scheduling derived from the frequency of data access. The storage scheduling depends on the size of the problem and the system capacity; mchf is designed to select the best level with respect to computational efficiency.

Figure 5.19: Storage access requirements in diag().
\begin{figure}\centerline{\psfig{file=tex/fig/diag_storage.epsi}}\end{figure}


Dynamic memory management

Memory management is an important factor in extending the limits of mchf calculations, and it is accomplished in two main ways. First, between the two phases of an scf() iteration, arrays can share memory; second, memory is allocated based on size and frequency of use in CPU-intensive processes.

Because the two phases of the SCF process are completely disjoint, the array of contributions to the energy functional and the array of integral values may share the same memory.

  1. In the orbital update phase, the appropriate view of the energy functional is given by Eq. (3). To compute the potential and/or exchange function for a specified orbital, say $a$, it is necessary to search the list for integrals involving orbital $a$. When an integral is found, its contribution needs to be determined and multiplied by $w_{ab}$ or $v_{abcd;k}$. Thus in this phase the values of the integrals themselves are not needed, only the coefficients defining their contribution to the energy. Since these coefficients depend on the eigenvectors, they need to be re-computed before the orbital update phase; in the present implementation, this occurs after the diagonalization process.
  2. In the matrix diagonalization phase, the matrix elements as defined by Eq. (2) need to be assembled, in order, as described earlier. Now the values of the integrals are needed, but not their total contribution to the energy functional. Thus, before the diagonalization phase, all integrals need to be re-evaluated (a sketch of the shared work array follows this list).
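
A minimal sketch of this sharing; PHASES, NINT and WORK are illustrative names, and in the real code the same dynamically allocated block is simply reused between the phases:

*     Sketch: one work array serves both phases, since the
*     phases never overlap in time.  NINT and WORK are
*     illustrative names.
      SUBROUTINE PHASES(NINT, WORK)
      INTEGER NINT, I
      DOUBLE PRECISION WORK(NINT)
*     Diagonalization phase: WORK holds the integral values.
      DO 10 I = 1, NINT
         WORK(I) = 0.0D0
   10 CONTINUE
*     ... assemble the matrix and diagonalize here ...
*     Orbital update phase: WORK is reused for the coefficients
*     defining each integral's contribution to the energy.
      DO 20 I = 1, NINT
         WORK(I) = 0.0D0
   20 CONTINUE
*     ... update the radial functions here ...
      END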

To maximize memory utilization, the storage for all arrays is allocated in a stepwise fashion. Upon a memory allocation failure, the remaining arrays are stored on disk. Under this scheme the order in which arrays are allocated becomes important, and the arrays used most frequently in each SCF iteration are allocated first. The angular coefficient data and the interaction matrix are the two major data structures that define the memory use of each SCF iteration. As the number of configurations grows, the memory capacity may be exceeded, requiring some or most of the data to be stored on disk. Disk read/write operations are then performed on each SCF and DVDSON iteration. Disk access is considerably slower than memory access, so to achieve high computational efficiency it is essential to avoid disk I/O and keep all data in memory. Therefore, optimizing mchf for large-scale calculations requires management of the disk/memory data storage. In general, the angular coefficients, the interaction matrix elements, and the associated pointers are stored in one-dimensional arrays. The arrays of angular coefficients are considerably larger than the interaction matrix.
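
In Fortran 90 notation (for brevity; the package itself allocates through the C malloc() interface described below), the stepwise scheme amounts to requesting the arrays in priority order and flagging the rest for disk. All names and sizes here are illustrative:

*     Sketch of stepwise allocation with disk fall-back, in
*     Fortran 90 notation; the package itself allocates through
*     the C malloc() interface.  Names and sizes are
*     illustrative.
      PROGRAM STEPAL
      INTEGER NHMX, NCLST, IERR
      PARAMETER (NHMX = 5000000, NCLST = 50000000)
      DOUBLE PRECISION, ALLOCATABLE :: HMX(:), COEFF(:)
      LOGICAL HMXDSK, CLSDSK
*     Highest priority: the interaction matrix, read on every
*     dvdson iteration.
      ALLOCATE (HMX(NHMX), STAT=IERR)
      HMXDSK = IERR .NE. 0
*     Lower priority: the (much larger) angular coefficient
*     data, accessed once per scf() iteration.
      ALLOCATE (COEFF(NCLST), STAT=IERR)
      CLSDSK = IERR .NE. 0
      END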

The memory allocation process (alcsts) starts with Level 4 and, upon success at each level, may proceed up to Level 1. At Level 4, if no memory is available even for performing Davidson iterations on disk (vector-matrix multiplication on disk), the program exits. (This represents a very large case beyond the practical limits of serial computing, on the order of millions of cfgs.) If the program cannot allocate memory for the interaction matrix of the largest block, a similar, local scheme (in diag) is used to allocate memory for each block. Blocks can vary significantly in size, and to improve performance mchf is designed to keep smaller blocks in memory if possible.


Table 5.2: Steps in allocating memory.
\begin{table}\begin{tabular*}{0.8\textwidth}
{@{\extracolsep{\fill}} l l l}
Level 1 & all arrays in memory ({\tt hmx, ih, ico, coeff, inptr}) & {\tt diag\_memory\_all} \\
Level 2 & {\tt coeff, inptr} on disk; {\tt hmx, ih, ico} in memory & {\tt diag\_disk\_clst} \\
Level 3 & {\tt coeff, inptr, ico} on disk; {\tt hmx, ih} in memory & {\tt diag\_disk\_ico} \\
Level 4 & in memory only {\tt hmx, ih, ico} for a single column & {\tt diag\_disk\_hmx} \\
\hline
\end{tabular*}\end{table}


As shown in Table 5.3, at levels 1, 2, and 3 ico has higher priority than coeff and inptr, because ico is used in both the diag and updatc routines. However, it is important to note that updatc almost always proceeds in memory, since all of the memory used in diag_hmx and dvdson is deallocated and made available.

The memory allocation procedure relies on malloc() and free(), the C routines normally used for dynamic memory management. malloc() allocates size bytes and returns a pointer to the allocated memory; if the request fails, malloc() returns a NULL pointer. The atsp2K package includes the routines alloc() and dalloc(), which use a similar approach, except that upon failure alloc() aborts by calling EXIT(). mchf has several levels of memory allocation, requiring the returned pointer to be monitored in order to adjust subsequent memory requests. For this purpose two routines specific to mchf have been introduced: diag_allocate() and diag_deallocate(). If memory is not sufficient to hold the array under consideration, memory for the largest column of that array is allocated and the array is stored on disk.

I/O files

The radial wave functions are stored in binary format in wfn.out. For each term, say LS, a file LS.l is produced with the same format as a <name>.j file, but with an extra line that contains the parameter for the specific mass shift [book], $S_{sms}$.

The summry file also contains some additional information including

  1. The mean radius, the expectation of $\sum_i r_i$.
  2. The mean square radius, the expectation of $\sum_i r_i^2$.
  3. The dipole-dipole operator, the expectation of $(\sum_i r_i)^2$.
  4. The isotope shift parameter, $S= - \sum_{i<j} \nabla_i\cdot \nabla_j$.
The mean radius gives an indication of the size of the atomic system, whereas the dipole-dipole operator (denoted as r.r in the summry file) is relevant to long-range interactions [babb].

  1. wfn.inp initial estimates for the wave function
  2. wfn.out the output of the mchf calculation
  3. cfg.inp configuration list
  4. cfg.h information used for memory allocation
  5. summry summary of wave function properties
  6. LSn.l eigenvectors for a specific term (text file)
  7. yint.lst information about all expansions in cfg.inp
  8. c.lst coefficients and integrals for deriving the energy
  9. ih.nn.lst row indices of the matrix elements

MPI version

Introduction

The MPI mchf implementation is based on the program structure of the serial code, with the most CPU-intensive operations modified for parallel execution. The program initialization is similar to the serial version: node 0 in mpi_mchf_sun() processes the parameters provided by the user in interactive mode and broadcasts them to the other processors. mpi_data() calls wavefn(), which reads the input wave function estimate from wfn.inp if it is present in the working directory; otherwise, wavefn() creates hydrogenic estimates. All initial parameters are broadcast from node 0 to the rest of the nodes. Then mpi_data() proceeds with calling mpi_spintgrl(), which allocates memory for each node (calling mpi_spalcsts()) and reads the angular data files supplied by nonh_mpi. mpi_spalcsts() has sufficient information about the size of the arrays, and attempts to allocate heap memory for all of them. If the calculation is too large, the coefficient data from c.lst.nnn are read from disk on each scf() iteration. In this case, considerably smaller arrays are allocated and used to buffer the input data. The parameter LSDIM=30000 controls the size of the buffer, and its size can be adjusted for efficient I/O processing. This matters only when the coefficient data are on disk, since on each scf() iteration the entire list of coefficients is processed. However, the user should avoid computing on disk; by increasing the number of processors, all coefficient and pointer data may be stored in memory.
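
A minimal sketch of the broadcast of user parameters from node 0; the array IPARAM is illustrative, not the actual mpi_mchf_sun() data:

*     Sketch: node 0 acquires the user parameters and
*     broadcasts them to all other nodes.  IPARAM is an
*     illustrative array.
      PROGRAM PBCAST
      INCLUDE 'mpif.h'
      INTEGER NPARAM
      PARAMETER (NPARAM = 8)
      INTEGER IPARAM(NPARAM), MYID, IERR
      CALL MPI_INIT(IERR)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, MYID, IERR)
      IF (MYID .EQ. 0) THEN
*        ... node 0 would read IPARAM interactively here ...
         IPARAM(1) = 0
      END IF
      CALL MPI_BCAST(IPARAM, NPARAM, MPI_INTEGER, 0,
     :               MPI_COMM_WORLD, IERR)
      CALL MPI_FINALIZE(IERR)
      END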

The scf() iterations have exactly the same structure as described for the serial mchf. The first phase solves the differential equation for each radial function, following the sequence of Figure 5.14. Then, during the second phase, diag() solves the eigenvalue problem and updates the integral coefficients of the radial functions, Figure 5.16.

mchf performs complicated computational tasks, including parallel and serial I/O, complex arithmetic loops, and matrix algebra. The efficiency of mchf is a function of the number of processors; it drops significantly below 0.6 when more than 16-24 processors are used. The time-consuming operations are the coefficient updates, the matrix diagonalization, and the exchange procedure. The parallel version, mchf_mpi, is structurally similar to the serial mchf program. However, it has only two levels of memory allocation: Level 1, where all arrays are in memory, and Level 2, where coeff and inptr are on disk and hmx, ih, ico are stored in memory. It is assumed that the number of processors can be increased as needed so that all data are stored in memory. The speed of an iteration may decrease considerably when all of the data are on disk, as opposed to having all data in memory.

MPI I/O

The mchf program implements parallel IO for the largest input data files, Figure  5.20.

Figure 5.20: The angular data are stored in files, which are processed in parallel. Files with negligible IO requirements are read by node 0, which broadcasts the data to other nodes.
\begin{figure}\centerline{ \psfig{file=tex/fig/mpi_io.eps}}\end{figure}

The IO files used by mpi_mchf can be divided into two categories based on their use:

  1. Small input data files (cfg.inp, cfg.h, wfn.inp). There is a single copy of each file. Node 0 processes each file and broadcasts the data to the other nodes. These files are small and incur only a negligible communication overhead. In addition, the output files wfn.out, summry, LSn.l have a single copy, which is written by processor 0.

    1. wfn.inp initial estimates for the wave function
    2. wfn.out computed radial functions.
    3. cfg.inp configuration list
    4. cfg.h file containing information for the memory allocation.
    5. summry summary of wave function properties
    6. LSn.l eigenvectors for each LS term.

  2. Node-dependent large data files, with a copy per node. They contain the integral coefficient and pointer data. Depending on the extent of the computational model, c.lst.nnn may exceed 1 GB and the pointer data files ico.lst.nnn, ih.lst.nnn, yint.lst.nnn may reach hundreds of MB, so they need to be processed in parallel. Each filename consists of the basename used in the serial version with the node ID appended (a sketch of the name construction follows this list). These files are needed only for the mchf calculation. They are Z independent and can be reused when the configuration lists have not changed.
    1. c.lst.000 pointer and coefficient data for each element of the Hamiltonian
    2. ico.lst.000 column indices per element
    3. ih.lst.000 row indices
    4. yint.lst.000 integral data

    Details about the format and each entry are given in chapter  13.16.
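
The node-dependent names can be formed as in the following sketch; NODFIL is an illustrative routine, not part of the package:

*     Sketch: building a node-dependent file name, e.g.
*     c.lst.007 on node 7.  NODFIL is an illustrative routine.
      SUBROUTINE NODFIL(MYID, FNAME)
      INTEGER MYID
      CHARACTER*12 FNAME
      WRITE (FNAME, '(A,I3.3)') 'c.lst.', MYID
      END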

