It seems possible to cause a race condition because there is no lock protecting __malloc_initialized

It is impossible [1] for a program to create a second running thread without having called an allocation routine (and therefore ptmalloc_init) while it was still single-threaded.

Because of that, ptmalloc_init can assume that it runs while there is only a single thread.


[1] Why is it impossible? Because creating a thread itself calls calloc.

For example, in this program:

#include <pthread.h>

void *fn(void *p) { return p; }
int main()
{
  pthread_t tid;
  pthread_create(&tid, NULL, fn, NULL);
  pthread_join(tid, NULL);
  return 0;
}

ptmalloc_init is called here (only a single thread exists at that point):

Breakpoint 2, ptmalloc_init () at /usr/src/debug/glibc-2.34-42.fc35.x86_64/malloc/arena.c:283
283       if (__malloc_initialized)
(gdb) bt
#0  ptmalloc_init () at /usr/src/debug/glibc-2.34-42.fc35.x86_64/malloc/arena.c:283
#1  __libc_calloc (n=17, elem_size=16) at malloc.c:3526
#2  0x00007ffff7fdd6c3 in calloc (b=16, a=17) at ../include/rtld-malloc.h:44
#3  allocate_dtv (result=result@entry=0x7ffff7dae640) at ../elf/dl-tls.c:375
#4  0x00007ffff7fde0e2 in __GI__dl_allocate_tls (mem=mem@entry=0x7ffff7dae640) at ../elf/dl-tls.c:634
#5  0x00007ffff7e514e5 in allocate_stack (stacksize=<synthetic pointer>, stack=<synthetic pointer>,
    pdp=<synthetic pointer>, attr=0x7fffffffde30)
    at /usr/src/debug/glibc-2.34-42.fc35.x86_64/nptl/allocatestack.c:429
#6  __pthread_create_2_1 (newthread=0x7fffffffdf58, attr=0x0, start_routine=0x401136 <fn>, arg=0x0)
    at pthread_create.c:648
#7  0x0000000000401167 in main () at p.c:7
Answer from Employed Russian on Stack Overflow
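
The pattern this answer describes, a lazily set flag that is checked without any lock, can be sketched roughly as follows. This is only an illustration with hypothetical names (demo_initialized, demo_init, demo_alloc); it is not glibc's code, and the tiny bump allocator merely stands in for the real arena setup. It is safe only under the same assumption the answer makes: the first call happens while a single thread exists.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for __malloc_initialized: a plain, unlocked flag. */
bool demo_initialized = false;
int demo_init_count = 0;            /* how many times the setup really ran */

static _Alignas(16) unsigned char demo_pool[1024];
static size_t demo_pool_used = 0;

/* Hypothetical stand-in for ptmalloc_init. There is no lock, so this is
   only safe if the first call is made while the process is still
   single-threaded. */
static void demo_init(void)
{
    if (demo_initialized)
        return;
    demo_initialized = true;
    demo_pool_used = 0;
    demo_init_count++;
}

/* Every allocation entry point checks the flag before doing real work. */
void *demo_alloc(size_t size)
{
    demo_init();
    if (size > sizeof demo_pool)
        return NULL;
    /* Round up so that offsets into the pool stay multiples of 16. */
    size = (size + 15u) & ~(size_t)15u;
    if (size > sizeof demo_pool - demo_pool_used)
        return NULL;
    void *p = demo_pool + demo_pool_used;
    demo_pool_used += size;
    return p;
}
```

Calling demo_alloc twice runs the setup once; the second call only pays for the flag check.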
2 of 2
0

GLIBC's dynamic memory allocator is designed to perform well in both single-threaded and multi-threaded programs. Instead of one central mutex, which would end up serializing every concurrent access to the allocator, it uses several mutexes. Arenas, each protected by its own mutex, were introduced to give each thread a kind of reserved memory area. Hence, threads can access the allocator's data structures in parallel as long as they use different arenas.

The main goal is to minimize contention on the mutexes.
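
To make the arena idea concrete, here is a minimal sketch, assuming a fixed pool of four arenas that each carry their own lock. All names (demo_arena, demo_arena_acquire, demo_arena_release) are hypothetical, and a C11 atomic_flag stands in for the mutex; glibc's real arena selection is more involved.

```c
#include <stdatomic.h>
#include <stddef.h>

#define DEMO_ARENA_COUNT 4

struct demo_arena {
    atomic_flag lock;    /* set while some thread owns this arena */
    size_t allocations;  /* example of state protected by the lock */
};

static struct demo_arena demo_arenas[DEMO_ARENA_COUNT] = {
    { ATOMIC_FLAG_INIT, 0 }, { ATOMIC_FLAG_INIT, 0 },
    { ATOMIC_FLAG_INIT, 0 }, { ATOMIC_FLAG_INIT, 0 },
};

/* Return the first arena whose lock can be taken without waiting,
   or NULL if every arena is currently busy. */
struct demo_arena *demo_arena_acquire(void)
{
    for (size_t i = 0; i < DEMO_ARENA_COUNT; i++) {
        /* test_and_set returns the previous value: false means we own it now. */
        if (!atomic_flag_test_and_set_explicit(&demo_arenas[i].lock,
                                               memory_order_acquire)) {
            demo_arenas[i].allocations++;
            return &demo_arenas[i];
        }
    }
    return NULL;
}

/* Give the arena back so that another thread can use it. */
void demo_arena_release(struct demo_arena *arena)
{
    atomic_flag_clear_explicit(&arena->lock, memory_order_release);
}
```

Two callers that hold different arenas never touch the same lock, which is the contention-avoidance property described above.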

The initialization step is critical because the main arena must be set up exactly once. The __malloc_initialized global variable is a flag that prevents multiple initializations. In a multi-threaded environment, such a flag would normally need a mutex, because an unsynchronized check-then-set on a shared variable is not thread-safe. But adding one would break the main design principle of avoiding a central mutex that would serialize concurrent threads for the whole lifetime of the process.

So, the unprotected __malloc_initialized is a trade-off that works as long as the first access to the memory allocator happens while the process is still single-threaded.
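
For contrast, a synchronized one-time initialization might look like the sketch below, written with C11 atomics. The names (guarded_state, guarded_init) are hypothetical and this is not how glibc does it. The point is only that once the flag is shared between running threads, every later call has to pay for at least an atomic load.

```c
#include <stdatomic.h>

enum { GUARD_UNINIT, GUARD_BUSY, GUARD_DONE };

static atomic_int guarded_state = GUARD_UNINIT;
int guarded_init_count = 0;   /* how many times the setup really ran */

/* One-time initialization that stays correct when several threads
   race to make the first call. */
void guarded_init(void)
{
    /* Fast path: already initialized, but still an atomic load per call. */
    if (atomic_load_explicit(&guarded_state, memory_order_acquire) == GUARD_DONE)
        return;

    int expected = GUARD_UNINIT;
    if (atomic_compare_exchange_strong(&guarded_state, &expected, GUARD_BUSY)) {
        /* This thread won the race and performs the setup. */
        guarded_init_count++;
        atomic_store_explicit(&guarded_state, GUARD_DONE, memory_order_release);
    } else {
        /* Another thread is initializing: wait until it has finished. */
        while (atomic_load_explicit(&guarded_state, memory_order_acquire) != GUARD_DONE)
            ;
    }
}
```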

Under Linux, a process starts single-threaded (the main thread). For both dynamically and statically linked programs, GLIBC has an initialization entry point (CSU = C Start Up) called __libc_start_main(), defined in csu/libc-start.c in the library's source tree. It performs many initializations before calling the main() function. This is where a first call to the dynamic allocator occurs to initialize the main arena.

Let's look at the following program which does not explicitly call any service from the dynamic memory allocator and does not create any thread:

#include <unistd.h>

int main(void)
{
  pause();
  return 0;
}

Let's compile it and run it with gdb and a breakpoint on malloc():

 gdb ./mm
[...]
(gdb) br malloc
Function "malloc" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (malloc) pending.
(gdb) run
Starting program: /.../mm 

Breakpoint 1, malloc (n=1441) at dl-minimal.c:49
49  dl-minimal.c: No such file or directory.
(gdb) where
#0  malloc (n=1441) at dl-minimal.c:49
#1  0x00007ffff7fec5e5 in calloc (nmemb=<optimized out>, size=size@entry=1) at dl-minimal.c:103
#2  0x00007ffff7fdc284 in _dl_new_object (realname=realname@entry=0x7ffff7ff4342 "", libname=libname@entry=0x7ffff7ff4342 "", type=type@entry=0, loader=loader@entry=0x0, 
    mode=mode@entry=536870912, nsid=nsid@entry=0) at dl-object.c:89
#3  0x00007ffff7fd1d2f in dl_main (phdr=0x555555554040, phnum=<optimized out>, user_entry=<optimized out>, auxv=<optimized out>) at rtld.c:1330
#4  0x00007ffff7febc4b in _dl_sysdep_start (start_argptr=start_argptr@entry=0x7fffffffdf70, dl_main=dl_main@entry=0x7ffff7fd15e0 <dl_main>) at ../elf/dl-sysdep.c:252
#5  0x00007ffff7fd104c in _dl_start_final (arg=0x7fffffffdf70) at rtld.c:449
#6  _dl_start (arg=0x7fffffffdf70) at rtld.c:539
#7  0x00007ffff7fd0108 in _start () from /lib64/ld-linux-x86-64.so.2
#8  0x0000000000000001 in ?? ()
#9  0x00007fffffffe2e2 in ?? ()
#10 0x0000000000000000 in ?? ()
(gdb) 

The output above shows that even if the main program never calls malloc() explicitly, GLIBC's internals call the memory allocator at least once, which triggers the initialization of the main arena.

We may consequently wonder why __malloc_initialized still has to be checked during the lifetime of the process, after the internal initialization step. The GLIBC initialization sets up various internal modules (main stack, pthreads...), and some of them may call the dynamic memory allocator. Hence __malloc_initialized allows the allocator to be called at any point during initialization. And if the allocator is never needed because of some esoteric configuration, it is not initialized at all.

Top answer
1 of 3
62

After trying some things, I finally managed to figure out how to do this.

First of all, in glibc, malloc is defined as a weak symbol, which means that it can be overridden by the application or a shared library. Hence, LD_PRELOAD is not necessarily needed. Instead, I implemented the following function in a shared library:

void*
malloc (size_t size)
{
  [ ... ]
}

This function gets called by the application instead of glibc's malloc.

Now, to be equivalent to the __malloc_hooks functionality, a couple of things are still missing.

1.) the caller address

In addition to the original parameters to malloc, glibc's __malloc_hooks also provide the address of the calling function, which is actually the return address that malloc will return to. To achieve the same thing, we can use the __builtin_return_address function that is available in gcc. I have not looked into other compilers, because I am limited to gcc anyway, but if you happen to know how to do such a thing portably, please drop me a comment :)

Our malloc function now looks like this:

void*
malloc (size_t size)
{
  void *caller = __builtin_return_address(0);
  [ ... ]
}

2.) accessing glibc's malloc from within your hook

As I am limited to glibc in my application, I chose to use __libc_malloc to access the original malloc implementation. Alternatively, dlsym(RTLD_NEXT, "malloc") can be used, but with the possible pitfall that dlsym itself uses calloc on its first call, which can result in infinite recursion leading to a segfault.

complete malloc hook

My complete hooking function now looks like this:

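#include <stddef.h>

/* glibc's internal entry point to the original malloc implementation */
extern void *__libc_malloc(size_t size);

/* defined below */
void *my_malloc_hook(size_t size, void *caller);

/* nonzero while calls should be routed through the hook */
int malloc_hook_active = 0;

void*
malloc (size_t size)
{
  /* return address of this call, i.e. where malloc was called from */
  void *caller = __builtin_return_address(0);
  if (malloc_hook_active)
    return my_malloc_hook(size, caller);
  return __libc_malloc(size);
}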

where my_malloc_hook looks like this:

void*
my_malloc_hook (size_t size, void *caller)
{
  void *result;

  // deactivate hooks for logging
  malloc_hook_active = 0;

  result = malloc(size);

  // do logging
  [ ... ]

  // reactivate hooks
  malloc_hook_active = 1;

  return result;
}

Of course, the hooks for calloc, realloc and free work similarly.
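
One limitation of the hook above is that malloc_hook_active is a single global, so two threads entering the hook at the same time can interfere with each other. A possible variation, sketched here under the assumption that a C11 compiler with _Thread_local and the __builtin_return_address builtin is available, keeps the guard per thread. To stay self-contained, the sketch wraps malloc in a separately named function (traced_malloc, a hypothetical name) instead of replacing malloc itself, and the "logging" step is a placeholder that allocates on purpose to exercise the guard.

```c
#include <stddef.h>
#include <stdlib.h>

/* Per-thread re-entrancy guard: nonzero while this thread is in the hook. */
static _Thread_local int in_hook = 0;

size_t traced_calls = 0;          /* demo counter, not synchronized */
void *traced_last_caller = NULL;  /* caller of the last traced allocation */

void *traced_malloc(size_t size);

/* Placeholder for real logging. It allocates through the wrapper on
   purpose, to show that the guard prevents endless recursion. */
static void log_allocation(void *caller)
{
    void *scratch = traced_malloc(32);
    free(scratch);
    traced_last_caller = caller;
    traced_calls++;
}

void *traced_malloc(size_t size)
{
    void *caller = __builtin_return_address(0);

    /* Already inside the hook on this thread: skip the tracing. */
    if (in_hook)
        return malloc(size);

    in_hook = 1;
    void *result = malloc(size);
    log_allocation(caller);
    in_hook = 0;
    return result;
}
```

A call such as traced_malloc(16) is logged once, while the allocation made by the logging code itself bypasses the hook.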

dynamic and static linking

With these functions, dynamic linking works out of the box. Linking the .so file containing the malloc hook implementation will result in all calls to malloc from the application, and also all library calls, being routed through my hook. Static linking is problematic though. I have not yet wrapped my head around it completely, but in static linking malloc is not a weak symbol, resulting in a multiple definition error at link time.

If you need static linking for whatever reason, for example translating function addresses in 3rd party libraries to code lines via debug symbols, then you can link these 3rd party libs statically while still linking the malloc hooks dynamically, avoiding the multiple definition problem. I have not yet found a better workaround for this; if you know one, feel free to leave me a comment.

Here is a short example:

gcc -o test test.c -lmalloc_hook_library -Wl,-Bstatic -l3rdparty -Wl,-Bdynamic

3rdparty will be linked statically, while malloc_hook_library will be linked dynamically, resulting in the expected behaviour, and addresses of functions in 3rdparty to be translatable via debug symbols in test. Pretty neat, huh?

Conclusion

The techniques above describe a non-deprecated, pretty much equivalent approach to __malloc_hooks, but with a couple of mean limitations:

__builtin_return_address only works with gcc

__libc_malloc only works with glibc

dlsym(RTLD_NEXT, [...]) is a GNU extension in glibc

the linker flags -Wl,-Bstatic and -Wl,-Bdynamic are specific to the GNU binutils.

In other words, this solution is utterly non-portable and alternative solutions would have to be added if the hooks library were to be ported to a non-GNU operating system.

2 of 3
2

You can use LD_PRELOAD and dlsym. See "Tips for malloc and free" at http://www.slideshare.net/tetsu.koba/presentations

Top answer
1 of 2
16

For understanding how dynamic memory allocation (the malloc, free, calloc, realloc library functions) really works there is no substitute for reading the source code of malloc(). It is well commented:

comments on chunks:

/*
   malloc_chunk details:

    (The following includes lightly edited explanations by Colin Plumb.)

    Chunks of memory are maintained using a `boundary tag' method as
    described in e.g., Knuth or Standish.  (See the paper by Paul
    Wilson ftp://ftp.cs.utexas.edu/pub/garbage/allocsrv.ps for a
    survey of such techniques.)  Sizes of free chunks are stored both
    in the front of each chunk and at the end.  This makes
    consolidating fragmented chunks into bigger chunks very fast.  The
    size fields also hold bits representing whether chunks are free or
    in use.

    An allocated chunk looks like this:


    chunk-> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
            |             Size of previous chunk, if unallocated (P clear)  |
            +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
            |             Size of chunk, in bytes                     |A|M|P|
      mem-> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
            |             User data starts here...                          .
            .                                                               .
            .             (malloc_usable_size() bytes)                      .
            .                                                               |
nextchunk-> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
            |             (size of chunk, but used for application data)    |
            +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
            |             Size of next chunk, in bytes                |A|0|1|
            +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

    Where "chunk" is the front of the chunk for the purpose of most of
    the malloc code, but "mem" is the pointer that is returned to the
    user.  "Nextchunk" is the beginning of the next contiguous chunk.

    Chunks always begin on even word boundaries, so the mem portion
    (which is returned to the user) is also on an even word boundary, and
    thus at least double-word aligned.

    Free chunks are stored in circular doubly-linked lists, and look like this:

    chunk-> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
            |             Size of previous chunk, if unallocated (P clear)  |
            +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    `head:' |             Size of chunk, in bytes                     |A|0|P|
      mem-> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
            |             Forward pointer to next chunk in list             |
            +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
            |             Back pointer to previous chunk in list            |
            +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
            |             Unused space (may be 0 bytes long)                .
            .                                                               .
            .                                                               |
nextchunk-> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    `foot:' |             Size of chunk, in bytes                           |
            +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
            |             Size of next chunk, in bytes                |A|0|0|
            +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

    The P (PREV_INUSE) bit, stored in the unused low-order bit of the
    chunk size (which is always a multiple of two words), is an in-use
    bit for the *previous* chunk.  If that bit is *clear*, then the
    word before the current chunk size contains the previous chunk
    size, and can be used to find the front of the previous chunk.
    The very first chunk allocated always has this bit set,
    preventing access to non-existent (or non-owned) memory. If
    prev_inuse is set for any given chunk, then you CANNOT determine
    the size of the previous chunk, and might even get a memory
    addressing fault when trying to do so.

    The A (NON_MAIN_ARENA) bit is cleared for chunks on the initial,
    main arena, described by the main_arena variable.  When additional
    threads are spawned, each thread receives its own arena (up to a
    configurable limit, after which arenas are reused for multiple
    threads), and the chunks in these arenas have the A bit set.  To
    find the arena for a chunk on such a non-main arena, heap_for_ptr
    performs a bit mask operation and indirection through the ar_ptr
    member of the per-heap header heap_info (see arena.c).

    Note that the `foot' of the current chunk is actually represented
    as the prev_size of the NEXT chunk. This makes it easier to
    deal with alignments etc but can be very confusing when trying
    to extend or adapt this code.

    The three exceptions to all this are:

     1. The special chunk `top' doesn't bother using the
        trailing size field since there is no next contiguous chunk
        that would have to index off it. After initialization, `top'
        is forced to always exist.  If it would become less than
        MINSIZE bytes long, it is replenished.

     2. Chunks allocated via mmap, which have the second-lowest-order
        bit M (IS_MMAPPED) set in their size fields.  Because they are
        allocated one-by-one, each must contain its own trailing size
        field.  If the M bit is set, the other bits are ignored
        (because mmapped chunks are neither in an arena, nor adjacent
        to a freed chunk).  The M bit is also used for chunks which
        originally came from a dumped heap via malloc_set_state in
        hooks.c.

     3. Chunks in fastbins are treated as allocated chunks from the
        point of view of the chunk allocator.  They are consolidated
        with their neighbors only in bulk, in malloc_consolidate.
*/
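
As a rough illustration of the header layout described in that comment, the sketch below decodes a size field that carries the A, M and P bits in its three low-order bits. The struct and helper names (demo_chunk, demo_chunk_size and so on) are hypothetical and only mirror the comment; they are not glibc's own definitions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Flag bits kept in the low-order bits of the size field. */
#define DEMO_PREV_INUSE     ((size_t)0x1)  /* P: previous chunk is in use   */
#define DEMO_IS_MMAPPED     ((size_t)0x2)  /* M: chunk was obtained by mmap */
#define DEMO_NON_MAIN_ARENA ((size_t)0x4)  /* A: chunk is not in main arena */
#define DEMO_FLAG_MASK \
    (DEMO_PREV_INUSE | DEMO_IS_MMAPPED | DEMO_NON_MAIN_ARENA)

/* Only the two header words that every chunk has. */
struct demo_chunk {
    size_t prev_size;  /* size of previous chunk, valid only if P is clear */
    size_t size;       /* size of this chunk in bytes, plus the flag bits  */
};

/* Chunk size with the flag bits masked off. */
size_t demo_chunk_size(const struct demo_chunk *c)
{
    return c->size & ~DEMO_FLAG_MASK;
}

bool demo_prev_inuse(const struct demo_chunk *c)
{
    return (c->size & DEMO_PREV_INUSE) != 0;
}

bool demo_is_mmapped(const struct demo_chunk *c)
{
    return (c->size & DEMO_IS_MMAPPED) != 0;
}

bool demo_non_main_arena(const struct demo_chunk *c)
{
    return (c->size & DEMO_NON_MAIN_ARENA) != 0;
}

/* "mem" in the diagram: user data starts right after the two header words. */
void *demo_chunk2mem(struct demo_chunk *c)
{
    return (char *)c + 2 * sizeof(size_t);
}
```

For example, a size field of 0x91 decodes to a 0x90-byte chunk whose predecessor is in use.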

comments on internal data structures:

/*
   -------------------- Internal data structures --------------------

   All internal state is held in an instance of malloc_state defined
   below. There are no other static variables, except in two optional
   cases:
 * If USE_MALLOC_LOCK is defined, the mALLOC_MUTEx declared above.
 * If mmap doesn't support MAP_ANONYMOUS, a dummy file descriptor
     for mmap.

   Beware of lots of tricks that minimize the total bookkeeping space
   requirements. The result is a little over 1K bytes (for 4byte
   pointers and size_t.)
 */

/*
   Bins

    An array of bin headers for free chunks. Each bin is doubly
    linked.  The bins are approximately proportionally (log) spaced.
    There are a lot of these bins (128). This may look excessive, but
    works very well in practice.  Most bins hold sizes that are
    unusual as malloc request sizes, but are more usual for fragments
    and consolidated sets of chunks, which is what these bins hold, so
    they can be found quickly.  All procedures maintain the invariant
    that no consolidated chunk physically borders another one, so each
    chunk in a list is known to be preceeded and followed by either
    inuse chunks or the ends of memory.

    Chunks in bins are kept in size order, with ties going to the
    approximately least recently used chunk. Ordering isn't needed
    for the small bins, which all contain the same-sized chunks, but
    facilitates best-fit allocation for larger chunks. These lists
    are just sequential. Keeping them in order almost never requires
    enough traversal to warrant using fancier ordered data
    structures.

    Chunks of the same size are linked with the most
    recently freed at the front, and allocations are taken from the
    back.  This results in LRU (FIFO) allocation order, which tends
    to give each chunk an equal opportunity to be consolidated with
    adjacent freed chunks, resulting in larger free chunks and less
    fragmentation.

    To simplify use in double-linked lists, each bin header acts
    as a malloc_chunk. This avoids special-casing for headers.
    But to conserve space and improve locality, we allocate
    only the fd/bk pointers of bins, and then use repositioning tricks
    to treat these as the fields of a malloc_chunk*.
 */
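
The "freed at the front, allocated from the back" rule can be sketched with a circular doubly-linked list whose header is itself a node. This is a simplified illustration with hypothetical names (demo_node, demo_bin_push_front, demo_bin_pop_back); it ignores chunk sizes and the repositioning trick mentioned in the comment.

```c
#include <stddef.h>

/* Only the two link fields; the bin header is a node of the same type. */
struct demo_node {
    struct demo_node *fd;  /* forward pointer */
    struct demo_node *bk;  /* back pointer    */
};

/* An empty bin points to itself in both directions. */
void demo_bin_init(struct demo_node *bin)
{
    bin->fd = bin;
    bin->bk = bin;
}

/* A freed chunk is linked in at the front of the bin. */
void demo_bin_push_front(struct demo_node *bin, struct demo_node *n)
{
    n->fd = bin->fd;
    n->bk = bin;
    bin->fd->bk = n;
    bin->fd = n;
}

/* An allocation takes the chunk at the back, i.e. the oldest one.
   Returns NULL when the bin is empty. */
struct demo_node *demo_bin_pop_back(struct demo_node *bin)
{
    struct demo_node *victim = bin->bk;
    if (victim == bin)
        return NULL;
    victim->bk->fd = bin;
    bin->bk = victim->bk;
    victim->fd = NULL;
    victim->bk = NULL;
    return victim;
}
```

Pushing three nodes and then popping three times returns them in the order they were pushed, which is the FIFO behaviour the comment describes.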

One of the authors of malloc(), Doug Lea, has written an article called "A Memory Allocator" which describes how malloc works (note that the article is from 2000, so there will be some out of date info).

From the article:

[Figure: Chunks]

[Figure: Bins]

An additional resource is chapter 7, "Memory Allocation", of "The Linux Programming Interface" by Michael Kerrisk. TLPI is the best reference of any type I have ever encountered, and I cannot recommend it highly enough.

[Figure: diagram of the implementation of malloc() and free(), from TLPI]

On a final note, malloc() obtains its memory from the kernel through brk() and sbrk(), which resize the heap by changing the location of the program break, and through mmap() for large allocations.

From comments in the source:

  In the new situation, brk() and mmap space is shared and there are no
  artificial limits on brk size imposed by the kernel. What is more,
  applications have started using transient allocations larger than the
  128Kb as was imagined in 2001.

  The price for mmap is also high now; each time glibc mmaps from the
  kernel, the kernel is forced to zero out the memory it gives to the
  application. Zeroing memory is expensive and eats a lot of cache and
  memory bandwidth. This has nothing to do with the efficiency of the
  virtual memory system, by doing mmap the kernel just has no choice but
  to zero.

  In 2001, the kernel had a maximum size for brk() which was about 800
  megabytes on 32 bit x86, at that point brk() would hit the first
  mmaped shared libaries and couldn't expand anymore. With current 2.6
  kernels, the VA space layout is different and brk() and mmap
  both can span the entire heap at will.

  Rather than using a static threshold for the brk/mmap tradeoff,
  we are now using a simple dynamic one. The goal is still to avoid
  fragmentation. The old goals we kept are
  1) try to get the long lived large allocations to use mmap()
  2) really large allocations should always use mmap()
  and we're adding now:
  3) transient allocations should use brk() to avoid forcing the kernel
     having to zero memory over and over again

  The implementation works with a sliding threshold, which is by default
  limited to go between 128Kb and 32Mb (64Mb for 64 bitmachines) and starts
  out at 128Kb as per the 2001 default.

  This allows us to satisfy requirement 1) under the assumption that long
  lived allocations are made early in the process' lifespan, before it has
  started doing dynamic allocations of the same size (which will
  increase the threshold).

  The upperbound on the threshold satisfies requirement 2)

  The threshold goes up in value when the application frees memory that was
  allocated with the mmap allocator. The idea is that once the application
  starts freeing memory of a certain size, it's highly probable that this is
  a size the application uses for transient allocations. This estimator
  is there to satisfy the new third requirement.

*/
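
The sliding threshold described in that comment can be sketched as below. This is a simplified model with hypothetical names (demo_params, demo_should_mmap, demo_on_free_mmapped); the 128 KiB starting value comes from the comment, while the 32 MiB upper bound is an assumption taken from the comment's first figure, and glibc's real code also adjusts other parameters.

```c
#include <stdbool.h>
#include <stddef.h>

#define DEMO_THRESHOLD_MIN ((size_t)128 * 1024)        /* starting value  */
#define DEMO_THRESHOLD_MAX ((size_t)32 * 1024 * 1024)  /* assumed maximum */

struct demo_params {
    size_t mmap_threshold;  /* requests of at least this size use mmap */
};

void demo_params_init(struct demo_params *p)
{
    p->mmap_threshold = DEMO_THRESHOLD_MIN;
}

/* Decide whether a request is served by mmap or from the brk heap. */
bool demo_should_mmap(const struct demo_params *p, size_t request)
{
    return request >= p->mmap_threshold;
}

/* Called when an mmapped block of the given size is freed: a size the
   application frees is probably used for transient allocations, so the
   threshold slides up to it, but never beyond the maximum. */
void demo_on_free_mmapped(struct demo_params *p, size_t size)
{
    if (size > p->mmap_threshold && size <= DEMO_THRESHOLD_MAX)
        p->mmap_threshold = size;
}
```

After a 1 MiB mmapped block has been freed, a later 256 KiB request no longer reaches the threshold and would be served from the brk heap in this model.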
2 of 2
10

There are several quite good references about heap exploitation in software security; one of my favorites is probably the 'binary hacking course' from LiveOverflow.

You can look at the following lectures for a simplified approach of the heap management (using the Protostar exercise set from Exploit-Exercises):

  • 0x14 - The Heap: what does malloc() do?
  • 0x15 - The Heap: How to exploit a Heap Overflow
  • 0x16 - The Heap: How do use-after-free exploits work?
  • 0x17 - The Heap: Once upon a free()
  • 0x18 - The Heap: dlmalloc unlink() exploit

And, also:

  • 0x1F - [Live] Remote oldschool dlmalloc Heap exploit

Then, you can try to read the write-ups on all the Heap exercises on Protostar.

But the blog posts from sploitfun are among the most accurate articles I have ever seen on the web about this specific topic. I would advise you to return to the sploitfun articles once you understand the basic principles well enough.