QNX From The Board Up #17 - mmap() PRIVATE vs. SHARED

Dive into more internals of mmap() to understand files in memory, the semantics of MAP_PRIVATE vs. MAP_SHARED, and how these flags impact memory sharing and efficiency in QNX.

Welcome to the blog series "From The Board Up" by Michael Brown. In this series we create a QNX image from scratch and build upon it stepwise, all the while looking deep under the hood to understand what the system is doing at each step.

With QNX for a decade now, Michael works on the QNX kernel and has in-depth knowledge and experience with embedded systems and system architecture.


So far we've explored mmap() through two use cases:

  • Give me Plain Old RAM à la malloc(), and
  • Give me direct access to a specific range of the physical address space.

In both of these, it's pretty clear that mmap() is creating a mapping from a process's virtual address space to the system's physical address space by manipulating the process's MMU configuration. Easy peasy.

So far, I've avoided getting deeply into MAP_PRIVATE vs. MAP_SHARED. The reason is that these were not the use cases that begat mmap() in the first place, so it's not necessarily obvious that MAP_PRIVATE and MAP_SHARED were actually shoehorned into these "newer" use cases.

To really understand MAP_PRIVATE vs. MAP_SHARED we have to take mmap() back to its roots and the most common use case: mapping a file into memory.

Once we see how they differ in their use cases here, it will become obvious why they exist. Then, we'll go back to the "newer" use cases and discuss how we can apply MAP_PRIVATE vs. MAP_SHARED.

Files. Files. Files.

So, you want access to the contents of a file. How can you do that? Well, I suppose you could:

  1. open() a file to get a file descriptor (fd),
  2. fstat() the fd to get the file's size,
  3. malloc() or mmap() to allocate a buffer of the size of the file,
  4. read() the contents of the file into the allocated buffer,
  5. Profit!

i.e. copy the data from the original source (the file) into memory.

And if you want to write the changes back:

  6. write() the buffer back to the file.

i.e. write the modified copy of the file back to the source (the file).
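For reference, here is a minimal sketch of that copy-in, copy-out approach. The names load_file() and store_file() are made up for illustration, and error handling is kept to the bare minimum (an interrupted read() or write() is simply treated as a failure).

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: copy a whole file into a malloc()ed buffer.
 * Returns NULL on failure; on success the caller free()s the buffer. */
char *load_file(const char *path, size_t *len_out)
{
    assert(path != NULL && len_out != NULL);

    int fd = open(path, O_RDONLY);                  /* step 1 */
    if (fd == -1) return NULL;

    struct stat st;
    if (fstat(fd, &st) == -1) {                     /* step 2 */
        close(fd);
        return NULL;
    }

    const size_t len = (size_t)st.st_size;
    char *buf = malloc(len ? len : 1);              /* step 3 */
    if (buf == NULL) { close(fd); return NULL; }

    size_t done = 0;
    while (done < len) {                            /* step 4 */
        ssize_t n = read(fd, buf + done, len - done);
        if (n <= 0) { free(buf); close(fd); return NULL; }
        done += (size_t)n;
    }
    close(fd);
    *len_out = len;
    return buf;
}

/* Hypothetical helper for step 6: write the buffer back over the file.
 * Returns 0 on success, -1 on failure. */
int store_file(const char *path, const char *buf, size_t len)
{
    assert(path != NULL && buf != NULL);

    int fd = open(path, O_WRONLY | O_TRUNC);
    if (fd == -1) return -1;

    size_t done = 0;
    while (done < len) {
        ssize_t n = write(fd, buf + done, len - done);
        if (n <= 0) { close(fd); return -1; }
        done += (size_t)n;
    }
    return close(fd);
}
```

That's two helpers, two loops and a buffer to manage, just to look at a file's contents and maybe change them.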

UNIX/QNX/POSIX is all about files. Files are an abstraction layer around data. So, this is going to happen a lot.

Then someone said, "I'm tired of writing this code over and over and over. Why don't you, dear operating system, just do this for me? I'll give you a file descriptor, a size, and you do the rest." Then the operating system people said, "Hm. You're right. I will provide you with a function called mmap(). And, as an operating system that knows what's going on in the system with all these calls to mmap(), I might be able to do this efficiently as well!"

So, that's what happened. When? AFAICT, starting around 1983 in the UNIX world, with 4.2BSD, but, apparently mmap() was still vapourware then. Not sure. Anyway, most people would agree 1983 was a long time ago, and that means there's been lots of time for people to say over the years between, "Hey, do you know what else we could add to mmap()?"

POSIX standardized a few of them, but, a lot of the rest are "Only this OS" and "Only this version of this OS." And QNX is no different.

I'll refer to this ur-use-case as "file-backed mappings".

MAP_PRIVATE vs. MAP_SHARED

When it comes to file-backed mappings and MAP_PRIVATE vs. MAP_SHARED, the question is:

  • do you want your own private copy that you can modify, i.e. your changes are private to your copy, or
  • do you want your changes in memory to be reflected in the file? i.e. share your changes with everyone else.

In other words PRIVATE means:

  • a copy of the original data source is made in memory, and
  • any changes made by you in memory are only reflected in your private copy.

... and SHARED means:

  • you "effectively" have access to the original data, and
  • any changes made by you in memory are visible to everyone.

This is a simple but powerful construct.
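To make the difference concrete, here is a small sketch using only POSIX calls. The helper poke_then_reread() is a made-up name: it maps the start of a file with whichever flag you pass, changes the first byte through the mapping, and then read()s the file to see whether the change landed there. The msync() call is my addition, to make sure a SHARED change has reached the file before we look.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: map the start of `path` with the given sharing flag
 * (MAP_PRIVATE or MAP_SHARED), overwrite byte 0 through the mapping, unmap,
 * and return byte 0 as read() now sees it in the file (-1 on error).
 * The file must already contain at least one byte. */
int poke_then_reread(const char *path, int share_flag, char new_byte)
{
    assert(path != NULL);
    assert(share_flag == MAP_PRIVATE || share_flag == MAP_SHARED);

    int fd = open(path, O_RDWR);
    if (fd == -1) return -1;

    const size_t len = 1;
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, share_flag, fd, 0);
    if (p == MAP_FAILED) { close(fd); return -1; }

    p[0] = new_byte;                /* change memory, not the file itself */
    (void)msync(p, len, MS_SYNC);   /* flush SHARED changes to the file */
    munmap(p, len);

    /* The fd has not been read from yet, so this reads byte 0. */
    char in_file;
    ssize_t n = read(fd, &in_file, 1);
    close(fd);
    return (n == 1) ? (unsigned char)in_file : -1;
}
```

With a file that starts with 'A', poke_then_reread(path, MAP_PRIVATE, 'B') should still report 'A', while the same call with MAP_SHARED should report 'B'.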

Aside

In early references to mmap(), the flags argument to mmap() was then called share:

"The parameter share specifies whether modifications made to this mapped copy of the page are to be kept private, or are to be shared with other references."

MAP_SHARED (1) or MAP_PRIVATE (2). That's it.

Oh, and there was a footnote in this section saying,

"This section represents the interface planned for later releases of the system. Of the calls described in this section. only sbrk() and getpagesize() are included in 4.2BSD"

So, looks like they were still working on it and throwing it out there for feedback.

Simple Use Case 1 - mmap() a file MAP_PRIVATE

Example: open() the file /dev/zero and mmap() it MAP_PRIVATE. (/dev/zero is not an actual file on a hard drive. It's a resource manager that, when asked to read() N bytes of data, says "Ok. Here are N instances of the byte 0x00".)

Boom, you got yourself a bunch of zeros in memory. If you make changes to those zeros by writing anywhere in the mapping, it doesn't matter: it's your private copy, so the changes are private to you in the memory used in the mapping.

This use case is so handy it was decided to create MAP_ANON as a shortcut.

(Ever wonder why there's a calloc(), when you could always just malloc() and then memset() zeros?)

Simple Use Case 2 - mmap() a file MAP_SHARED

Example: open() the file my_read_only_data.bin as O_RDONLY, and then mmap() it PROT_READ and MAP_SHARED. If someone else has done the same, the OS can be smart and say "You know what? I've already loaded the contents of this file into memory R/O for someone else. I'll just create a mapping to that memory in your process too."

Boom, you have 2 mappings to the same memory. Say that file is 32 MB. You just saved yourself 32 MB. If 3 processes created the mapping, you just saved yourself 64 MB. Imagine how much memory this would save if all processes needed this file.

And in fact, this is exactly what happens with libc.so and ldqnx-64.so, and indeed all shared libraries and executables. When an executable or shared library is loaded into a process:

  • the code is loaded PROT_READ|PROT_EXEC and MAP_SHARED, and therefore only really loaded into memory once, and
  • the data is loaded PROT_READ|PROT_WRITE and MAP_PRIVATE so that it's initialized with the data in the exe/shared library, but, any changes that are made are private to the process.

So, while the data is – necessarily – copied for each process, the code is shared amongst all processes that need it.
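Going back to the my_read_only_data.bin example, the mapping itself might look something like the sketch below. map_file_readonly() is a made-up name, and whether two such mappings really end up on the same physical memory is up to the OS; the code only asks for it.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: map a whole file PROT_READ and MAP_SHARED.
 * Returns MAP_FAILED on error; on success *len_out holds the file size
 * and the caller munmap()s the mapping when done. */
const void *map_file_readonly(const char *path, size_t *len_out)
{
    assert(path != NULL && len_out != NULL);

    int fd = open(path, O_RDONLY);
    if (fd == -1) return MAP_FAILED;

    struct stat st;
    if (fstat(fd, &st) == -1 || st.st_size == 0) {
        close(fd);
        return MAP_FAILED;   /* a zero-length mapping is not allowed */
    }

    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);               /* the mapping outlives the fd */
    if (p != MAP_FAILED) *len_out = (size_t)st.st_size;
    return p;
}
```

Note there is no read() and no buffer to allocate: the pointer that comes back simply is the file's contents.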

Abstraction

We can see that there's a lot more here than manipulating the MMU configuration; there's a lot of work being done by the operating system. However, it's saving you and your code a lot of work, and finding efficiencies where possible.

We must therefore move from the MMU and its configuration up to the more abstract idea of a "memory object", i.e. something that acts as a data source for a mapping. Then, we have either:

  • a private view of a memory object, i.e. copy its content, and then any changes are local changes to the process only, or
  • a shared view of a memory object, i.e. directly accessing the data source.

And at this level of abstraction, we're now getting into the terminology used by the POSIX specification, which talks about the following memory objects:

  • regular files (the original use case for mmap())
  • anonymous memory objects (Oh, you mean /dev/zero MAP_PRIVATE? Gotcha.)
  • shared memory objects (Plain Old RAM that 2+ processes explicitly agree to share for IPC), and
  • typed memory objects (More later).

The reason I structured these articles to start with MAP_ANON and MAP_PHYS is I have found explaining mmap() beginning from this level of abstraction too abstract. If you're unfamiliar with mmap() and then drop in cold into the POSIX spec and read "MAP_SHARED and MAP_PRIVATE describe the disposition of write references to the memory object." there aren't many epiphanies happening.

IMHO, it's easier if you first understand what the MMU is capable of, and then, given its capabilities, you can start moving up the software stack to more abstract functionality which is, really, just more value added by the OS.

With this context of:

  • PRIVATE = create a copy of data source, and
  • SHARED = access the data source directly (or effectively directly)

... let's take a look at MAP_PRIVATE vs. MAP_SHARED and how they're interpreted in the two big use cases for mmap() we've already looked at:

  • MAP_ANON for Plain Old RAM à la malloc(), and
  • MAP_PHYS when we want access to a specific range of physical addresses

MAP_ANON

For malloc(), we used MAP_ANON with MAP_PRIVATE because we wanted some Plain Old RAM only for ourselves. The question is, when would we ever want Plain Old RAM not just for ourselves? If it's not private to us, then we're potentially sharing it. But, with whom, and under what conditions? There are really only 2 possibilities:

  • we will take whatever the OS gives us and take the risk that some other process may or may not be using it, or
  • the OS will share it with specific processes, and only under controlled conditions.

Fortunately, the latter is what POSIX and QNX choose.

There are several ways to share memory between processes, but, the easiest is what we're talking about now: if you mmap() with MAP_ANON and MAP_SHARED, then that memory will be shared with children created by fork(). We saw fork() in an early version of our nkiss program when we created a child process using fork()-and-exec().

We'll create a little helper function to:

  • print the virtual address,
  • print the physical address, and
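  • print the first byte of the mapping as a character

#include <assert.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

static void
print_vaddr_paddr(char const * const desc, char* const vaddr)
{
    // Ask QNX which physical address backs the first byte of the mapping
    off_t paddr;
    int rv = mem_offset(vaddr, NOFD, 1, &paddr, NULL);
    assert(0 == rv);

    printf("%12s : "
           "vaddr = 0x%16.16" PRIX64 ", paddr = 0x%16.16" PRIX64 ", "
           "vaddr[0] = '%c'\n",
           desc, (uint64_t)(uintptr_t)vaddr, (uint64_t)paddr, vaddr[0] );
}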

Then, in main(), we'll use getopt() to allow us to create an ANON mapping that is either private (default) or shared:

int main(int argc, char* argv[])
{
    bool opt_private = true;

    while (true) {
        const int opt = getopt(argc, argv, "s");
        if (-1 == opt) break;
        switch(opt) {
            case 's': opt_private = false; break;
            default : return EXIT_FAILURE;
        }
    }

And then create the mapping based on the preference:

    const size_t len   = 32;
    const int    prot  = PROT_READ | PROT_WRITE;
    const int    flags = MAP_ANON  | (opt_private ? MAP_PRIVATE : MAP_SHARED);

    printf("Creating mapping with MAP_%s\n", opt_private ? "PRIVATE" : "SHARED");

    char * const vaddr = mmap(NULL, len, prot, flags, NOFD, 0);
    assert(MAP_FAILED != vaddr);

Now, so we can be extra sure, we'll make some changes to the data in the mapping in the parent and check if the child can see them. AND, we'll have the child do the same to see if the parent can see them.

    vaddr[0] = 'P'; // First byte modified by Parent

    print_vaddr_paddr("Parent", vaddr);


    // Now create a child process by fork()ing
    const pid_t pid = fork();
    assert(-1 != pid);
    if (0 == pid) {
        // Child process
        print_vaddr_paddr("Child", vaddr);
        vaddr[0] = 'C'; // First byte modified by child
        return EXIT_SUCCESS;
    }

    // Parent waits for child to terminate
    int child_status;
    const pid_t wait_rv = wait(&child_status);
    assert(wait_rv == pid);
    assert(WIFEXITED(child_status));
    assert(WEXITSTATUS(child_status) == EXIT_SUCCESS);

    // Let's see what we have in our memory now
    print_vaddr_paddr("Parent", vaddr);

    return EXIT_SUCCESS;
}

MAP_ANON | MAP_PRIVATE

Let's try ANON and PRIVATE and see what happens:

 / # private_vs_shared
Creating mapping with MAP_PRIVATE
      Parent : vaddr = 0x0000001D48E4E000, paddr = 0x0000000234168000, vaddr[0] = 'P'
       Child : vaddr = 0x0000001D48E4E000, paddr = 0x000000023416C000, vaddr[0] = 'P'
      Parent : vaddr = 0x0000001D48E4E000, paddr = 0x0000000234168000, vaddr[0] = 'P'

We can see that the child's mapping:

  • is at the same virtual address as it is in the parent
    • That makes sense; the child process is supposed to be a copy of the parent process,
  • but, the mapping is to a different physical address, and
  • the child's mapping has the same contents as the parent's.

i.e. a copy was created for the child at a different physical address in System RAM, but the child has the mapping at the same virtual address.

So, clearly, part of the work that QNX had to do for the fork() was to

  • grab some more System RAM for the mapping in the child, and
  • copy the contents from the parent's PRIVATE mapping into the child's PRIVATE mapping.

MAP_ANON | MAP_SHARED

Let's try ANON and SHARED and see what happens:

 / # private_vs_shared -s
Creating mapping with MAP_SHARED
      Parent : vaddr = 0x0000002882D98000, paddr = 0x0000000234168000, vaddr[0] = 'P'
       Child : vaddr = 0x0000002882D98000, paddr = 0x0000000234168000, vaddr[0] = 'P'
      Parent : vaddr = 0x0000002882D98000, paddr = 0x0000000234168000, vaddr[0] = 'C'

Ah! Looks like when you mmap() SHARED, when you fork(), the mapping in the child is at the same vaddr and paddr. i.e. these two processes, parent and child, are sharing the same RAM. Any change made by one is visible to the other.

Given that, it makes sense that the change made by the child ('C') is visible to the parent, as seen on the last line.

Notice that this wasn't true with MAP_PRIVATE: the change made by the child went to its own PRIVATE copy, which was discarded when the child exited.

MAP_PHYS

Ok, in a previous article I stated without explanation that MAP_PHYS has to be done with MAP_SHARED.

The reason for this is that with MAP_PHYS, saying you want MAP_PRIVATE doesn't make sense. Is QNX supposed to give you your own private copy of the contents of a peripheral's registers, or your own private copy of the contents of, e.g., a video buffer? Neither of those makes sense. Asking for a specific physical address range is (typically) either because:

  • there's a peripheral there, or
  • there's something there that is also accessed by a peripheral.

Therefore older versions of QNX will quietly let you "get away with it", i.e.

  • quietly convert MAP_PHYS | MAP_PRIVATE to MAP_PHYS | MAP_SHARED, and
  • not report an error.

In more recent versions we have started to move away from this quiet correction. Rather, this is now part of the "backward compatibility" setting for the VMM.

Remember when we created our own cat so we could look at the contents of /proc/config to see ASLR enabled or disabled? There was another field in there that we ignored then, but can talk about now.

/ # cat /proc/config
...
mmflags:0x205 (BACKWARDS_COMPAT,RANDOMIZE,V4)
...

That BACKWARDS_COMPAT means we'll let you still pass in MAP_PHYS | MAP_PRIVATE, and quietly convert it to MAP_PHYS | MAP_SHARED and not report an error.
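If a program wanted to check for this at run time, one option is to look for the name in that mmflags line. The sketch below assumes the line has exactly the format shown in the output above; mmflags_has() is a hypothetical helper, and reading /proc/config to get the line is left to the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: given one line of /proc/config text such as
 * "mmflags:0x205 (BACKWARDS_COMPAT,RANDOMIZE,V4)", report whether `flag`
 * appears as a whole name in the parenthesised, comma-separated list.
 * The line format is assumed from the output shown in this article. */
bool mmflags_has(const char *line, const char *flag)
{
    assert(line != NULL && flag != NULL);

    if (strncmp(line, "mmflags:", 8) != 0) return false;

    const char *p = strchr(line, '(');
    if (p == NULL) return false;
    p++;                                  /* step past the '(' */

    const size_t flag_len = strlen(flag);
    while (*p != '\0' && *p != ')') {
        size_t n = strcspn(p, ",)");      /* length of the current name */
        if (n == flag_len && strncmp(p, flag, n) == 0) return true;
        p += n;
        if (*p == ',') p++;               /* step past the separator */
    }
    return false;
}
```

Comparing whole names rather than using strstr() avoids a false match if some other flag name ever happens to contain the one you're looking for.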

Aside: V4

That just leaves the mmflags field V4 to discuss, which just means this QNX is using Version 4 of the Virtual Memory Manager (VMM). QNX 7 uses V3, but we never really called it V3 until the one we created for QNX 8 was named V4. The /proc/config in QNX 7 doesn't even mention which version of the VMM it uses.

Let's see what happens if we call mmap() with MAP_PHYS | MAP_PRIVATE to the video buffer at 0x000B8000:

    const size_t len   = 4000;
    const int    prot  = PROT_READ | PROT_WRITE;
    const int    flags = MAP_PHYS  | MAP_PRIVATE;
    const off_t offset = UINT64_C(0x000B8000);
    char * const vaddr = mmap(NULL, len, prot, flags, NOFD, offset);
    if (MAP_FAILED == vaddr) {
        printf("mmap() failed. errno = %d (%s)\n", errno, strerror(errno));
    }
    assert(MAP_FAILED != vaddr);
    print_vaddr_paddr("Video", vaddr);

and run that

Video  : vaddr = 0x0000002982DA3000, paddr = 0x00000000000B8000, vaddr[0] = 'S'

A few things to note:

  • No error.
  • paddr is the actual requested physical address. No referring to some copy in System RAM.
  • We can see the S character in the first byte.

The S character is there for two reasons:

  1. The initial state of the video buffer started with the message "SeaBIOS (version ", as we saw last time.
  2. The format of each character in the video buffer is:
    • First byte: ASCII character, then
    • Second byte: Character attributes (foreground and background colours)

Therefore, by reading the first byte of the buffer, we're getting the ASCII of the first character displayed, i.e. S.
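Following that layout, pulling the displayed text out of the buffer is just a matter of taking every other byte. Here is a minimal sketch; video_text() is a made-up helper, and it works on any buffer with this two-bytes-per-character format, not just the real mapping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copy the character bytes out of a text-mode video
 * buffer, skipping the attribute byte that follows each one. `out` must
 * have room for n_cells + 1 bytes (the extra one is the terminating NUL). */
void video_text(const volatile uint8_t *buf, size_t n_cells, char *out)
{
    assert(buf != NULL && out != NULL);

    for (size_t i = 0; i < n_cells; i++) {
        out[i] = (char)buf[2 * i];   /* even offsets: ASCII character */
        /* buf[2 * i + 1] is the attribute byte (colours); ignored here */
    }
    out[n_cells] = '\0';
}
```

With the 4000-byte mapping above, that would be 2000 character cells; passing a smaller n_cells reads just the start of the first line.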

Turning Backwards Compat Off

Let's turn BACKWARDS_COMPAT off and see what happens if we try to mmap() with MAP_PHYS|MAP_PRIVATE.

First we modify our build file to disable backward compatibility:

[virtual=x86_64,multiboot] boot = {
    startup-x86 -D 8250
    procnto-smp-instr -m~b
}

and confirm by looking at /proc/config

/ # cat /proc/config
...
mmflags:0x204 (RANDOMIZE,V4)
...

and now run our program to see:

mmap() failed. errno = 22 (Invalid argument)

I don't know how long we'll leave this backward compatibility on by default. Just be aware it's there and could disappear with any update. I'm sure there'll be mention in the release notes.

Summary

Just to review, we looked at a couple of the original use cases for mmap() with files, and saw how you can use PRIVATE and SHARED mappings, especially with executables and libraries. We didn't get into ALL the use cases with files, but, that was enough to help explain the essence of PRIVATE and SHARED, and how they apply to the ANON and PHYS use cases we've already dug into.

Flags                     Description
MAP_ANON | MAP_PRIVATE    Gimme RAM.
MAP_ANON | MAP_SHARED     Gimme RAM I can share with child processes when I fork().
MAP_PHYS | MAP_PRIVATE    Don't do it.
MAP_PHYS | MAP_SHARED     Give me access to a specific physical address range.

Coming Up...

We're not quite done with memory just yet!