Honey, I shrunk {fmt}: bringing binary size to 14k and ditching the C++ runtime

The {fmt} formatting library is known for its small binary footprint,
often producing code that’s several times smaller per function call compared
to alternatives like IOStreams, Boost Format, or, somewhat ironically,
tinyformat. This is mainly achieved through careful application of type erasure
on various levels, which effectively minimizes template bloat.

Formatting arguments are passed via type-erased format_args:

auto vformat(string_view fmt, format_args args) -> std::string;

template <typename... T>
auto format(format_string<T...> fmt, T&&... args) -> std::string {
  return vformat(fmt, fmt::make_format_args(args...));
}

As you can see, format delegates all its work to vformat, which is not a
template.

Output iterators and other output types are also type-erased through a specially
designed buffer API.

This approach confines template usage to a minimal top-level layer, leading to
both a smaller binary size and faster build times.
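To make the layering concrete, here is a minimal standalone sketch of the same idea. It is not {fmt}'s implementation: erased_arg, vformat_sketch and format_sketch are hypothetical names, only int, long long and string arguments are handled, and only plain {} fields are recognized. The point is just that the parsing loop lives in a non-template function, while the template layer does nothing but pack arguments.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical type-erased argument: a tag plus the stored value.
struct erased_arg {
  enum class kind { integer, text } tag;
  long long integer = 0;
  std::string_view text;

  erased_arg(int v) : tag(kind::integer), integer(v) {}
  erased_arg(long long v) : tag(kind::integer), integer(v) {}
  erased_arg(const char* s) : tag(kind::text), text(s) {}
  erased_arg(std::string_view s) : tag(kind::text), text(s) {}
};

// Non-template core: compiled once, no matter how many different
// argument lists the program formats.
std::string vformat_sketch(std::string_view fmt, const erased_arg* args,
                           std::size_t count) {
  assert(args != nullptr || count == 0);
  std::string out;
  std::size_t next = 0;
  for (std::size_t i = 0; i < fmt.size(); ++i) {
    bool is_field = fmt[i] == '{' && i + 1 < fmt.size() && fmt[i + 1] == '}';
    if (is_field && next < count) {
      const erased_arg& a = args[next++];
      if (a.tag == erased_arg::kind::integer)
        out += std::to_string(a.integer);
      else
        out.append(a.text.data(), a.text.size());
      ++i;  // skip the closing brace
    } else {
      out += fmt[i];
    }
  }
  return out;
}

// Thin template layer: packs the arguments into an array and forwards.
template <typename... T>
std::string format_sketch(std::string_view fmt, const T&... args) {
  // The trailing dummy element keeps the array non-empty when there are
  // no arguments; it is never read because count excludes it.
  const erased_arg packed[] = {erased_arg(args)..., erased_arg(0)};
  return vformat_sketch(fmt, packed, sizeof...(T));
}
```

With this shape, format_sketch("The answer is {}.", 42) and format_sketch("{} and {}", "left", 7) instantiate two tiny wrappers but share a single copy of the parsing code.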

For example, the following code:

// test.cc
#include <fmt/base.h>

int main() {
  fmt::print("The answer is {}.", 42);
}

compiles to just

.LC0:
        .string "The answer is {}."
main:
        sub     rsp, 24
        mov     eax, 1
        mov     edi, OFFSET FLAT:.LC0
        mov     esi, 17
        mov     rcx, rsp
        mov     rdx, rax
        mov     DWORD PTR [rsp], 42
        call    fmt::v11::vprint(fmt::v11::basic_string_view, fmt::v11::basic_format_args)
        xor     eax, eax
        add     rsp, 24
        ret

godbolt

It’s much smaller than the equivalent IOStreams code and comparable to that
of printf:

.LC0:
        .string "The answer is %d."
main:
        sub     rsp, 8
        mov     esi, 42
        mov     edi, OFFSET FLAT:.LC0
        xor     eax, eax
        call    printf
        xor     eax, eax
        add     rsp, 8
        ret

godbolt

Unlike printf, {fmt} provides full runtime type safety. Errors in format strings
can be caught at compile time, and even when the format string is determined at
runtime, errors are managed through exceptions, preventing undefined behavior,
memory corruption, and potential crashes. Additionally, {fmt} calls are
generally more efficient, particularly when using positional arguments, which C
varargs are not well suited for.

Back in 2020, I dedicated some time to optimizing the library size,
successfully reducing it to under 100kB (just ~57kB with -Os -flto).
A lot has changed since then. Most notably, {fmt} now uses the exceptional
Dragonbox algorithm for floating-point formatting, kindly
contributed by its author, Junekey Jeon. Let’s explore how these changes have
impacted the binary size and see if further reductions are possible.

But why, some say, the binary size? Why choose this as our goal?

There has been considerable interest in using {fmt} on memory-constrained
devices; see e.g. #758 and #1226 for just two examples from
the distant past. A particularly intriguing use case is retro computing, with
people using {fmt} on systems like Amiga (#4054).

We’ll apply the same methodology as in previous work, examining the
executable size of a program that uses {fmt}, as this is most relevant to end
users. All tests will be conducted on an aarch64 Ubuntu 22.04 system with GCC
11.4.0.

First, let’s establish the baseline: what’s the binary size for the latest
version of {fmt} (11.0.2)?

$ git checkout 11.0.2
$ g++ -Os -flto -DNDEBUG -I include test.cc src/format.cc
$ strip a.out && ls -lh a.out
-rwxrwxr-x 1 vagrant vagrant 75K Aug 30 19:24 a.out

The resulting binary size is 75kB (stripped). The positive takeaway is that
despite numerous developments over the past four years, the size has not
significantly regressed.

Now, let’s explore potential optimizations. One of the first adjustments you
might consider is disabling locale support. All the formatting in {fmt} is
locale-independent by default (which breaks with C++’s tradition of having
wrong defaults), but it is still available as an opt-in via the L format
specifier. It can be disabled in a somewhat obscure way via the
FMT_STATIC_THOUSANDS_SEPARATOR macro:

$ g++ -Os -flto -DNDEBUG "-DFMT_STATIC_THOUSANDS_SEPARATOR=','" \
      -I include test.cc src/format.cc
$ strip a.out && ls -lh a.out
-rwxrwxr-x 1 vagrant vagrant 71K Aug 30 19:25 a.out

Disabling locale support reduces the binary size to 71kB.

Next, let’s examine the results using our trusty tool, Bloaty:

$ bloaty -d symbols a.out

    FILE SIZE        VM SIZE
 --------------  --------------
  43.8%  41.1Ki  43.6%  29.0Ki    [121 Others]
   6.4%  6.04Ki   8.1%  5.42Ki    fmt::v11::detail::do_write_float<>()
   5.9%  5.50Ki   7.5%  4.98Ki    fmt::v11::detail::write_int_noinline<>()
   5.7%  5.32Ki   5.8%  3.88Ki    fmt::v11::detail::write<>()
   5.4%  5.02Ki   7.2%  4.81Ki    fmt::v11::detail::parse_replacement_field<>()
   3.9%  3.69Ki   3.7%  2.49Ki    fmt::v11::detail::format_uint<>()
   3.2%  3.00Ki   0.0%       0    [section .symtab]
   2.7%  2.50Ki   0.0%       0    [section .strtab]
   2.3%  2.12Ki   2.9%  1.93Ki    fmt::v11::detail::dragonbox::to_decimal<>()
   2.0%  1.89Ki   2.4%  1.61Ki    fmt::v11::detail::write_int<>()
   2.0%  1.88Ki   0.0%       0    [ELF Section Headers]
   1.9%  1.79Ki   2.5%  1.66Ki    fmt::v11::detail::write_float<>()
   1.9%  1.78Ki   2.7%  1.78Ki    [section .dynstr]
   1.8%  1.72Ki   2.4%  1.62Ki    fmt::v11::detail::format_dragon()
   1.8%  1.68Ki   1.5%    1016    fmt::v11::detail::format_decimal<>()
   1.6%  1.52Ki   2.1%  1.41Ki    fmt::v11::detail::format_float<>()
   1.6%  1.49Ki   0.0%       0    [Unmapped]
   1.5%  1.45Ki   2.2%  1.45Ki    [section .dynsym]
   1.5%  1.45Ki   2.0%  1.31Ki    fmt::v11::detail::write_loc()
   1.5%  1.44Ki   2.2%  1.44Ki    [section .rodata]
   1.5%  1.40Ki   1.1%     764    fmt::v11::detail::do_write_float<>()::{lambda()#2}::operator()()
 100.0%  93.8Ki 100.0%  66.6Ki    TOTAL

Unsurprisingly, a large portion of the binary size is dedicated to numeric
formatting, particularly floating-point numbers. FP formatting also relies on
sizable tables, which aren’t shown here. But what if floating-point support
isn’t required? {fmt} provides a way to disable it, though the method is
somewhat ad hoc and doesn’t extend to other types.

The core issue is that formatting functions need to be aware of all formattable
types. Or do they? This is true for printf as defined by the C standard, but
not necessarily for {fmt}. {fmt} supports an extension API that allows
formatting arbitrary types without knowing the complete set of types in advance.
While built-in and string types are handled specially for performance reasons,
focusing on binary size might warrant a different approach. By removing this
special handling and routing all types through the extension API, you can avoid
paying for types you don’t use.
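One way to picture this routing is a type-erased handle that carries a pointer to the value and a pointer to the function that knows how to format it. The sketch below is an illustration under my own assumptions, not {fmt}'s actual data structures: sketch_formatter, erased_value, erase and write_all are hypothetical names. Because the formatting functions are inline members that are only referenced from erase<T>, a typical compiler emits them only for the types a program actually passes.

```cpp
#include <cassert>
#include <string>

// Hypothetical extension point: one specialization per formattable type.
template <typename T> struct sketch_formatter;

struct point {
  int x, y;
};

template <> struct sketch_formatter<point> {
  static void format(std::string& out, const point& p) {
    out += '(' + std::to_string(p.x) + ", " + std::to_string(p.y) + ')';
  }
};

template <> struct sketch_formatter<double> {
  static void format(std::string& out, const double& d) {
    out += std::to_string(d);
  }
};

// Type-erased handle: the value plus the function that formats it.
struct erased_value {
  const void* value;
  void (*format)(std::string& out, const void* value);
};

// The lambda is instantiated only for types that reach this function.
template <typename T> erased_value erase(const T& value) {
  return {&value, [](std::string& out, const void* v) {
            sketch_formatter<T>::format(out, *static_cast<const T*>(v));
          }};
}

// Non-template core: it knows nothing about point, double, or any other
// concrete type.
void write_all(std::string& out, const erased_value* args, int count) {
  assert(args != nullptr || count == 0);
  for (int i = 0; i < count; ++i) args[i].format(out, args[i].value);
}
```

A program that only ever calls erase on a point never references sketch_formatter<double>::format, so the floating-point path has no reason to end up in the binary.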

I did an experimental implementation of this idea. With the
FMT_BUILTIN_TYPES macro set to 0, only int is handled specially, and all
other types go through the general extension API. We still need to know about
int for dynamic width and precision, for example

fmt::print("{:{}}\n", "hello", 10); // prints "hello     "

This gives you the “don’t pay for what you don’t use” model, though it comes
with a slight increase in per-call binary size. If you do format floating-point
numbers or other types, the relevant code will still be included in the build.
While it’s possible to make the FP implementation smaller, we won’t delve into
that here.

With FMT_BUILTIN_TYPES=0, the binary size in our example is reduced to 31kB,
a substantial improvement:

$ git checkout 377cf20
$ g++ -Os -flto -DNDEBUG \
      "-DFMT_STATIC_THOUSANDS_SEPARATOR=','" -DFMT_BUILTIN_TYPES=0 \
      -I include test.cc src/format.cc
$ strip a.out && ls -lh a.out
-rwxrwxr-x 1 vagrant vagrant 31K Aug 30 19:37 a.out

However, the updated Bloaty results reveal some lingering locale artifacts,
such as digit_grouping:

$ bloaty -d fullsymbols a.out

    FILE SIZE        VM SIZE
 --------------  --------------
  41.8%  18.0Ki  39.7%  11.0Ki    [84 Others]
   6.4%  2.77Ki   0.0%       0    [section .symtab]
   5.3%  2.28Ki   0.0%       0    [section .strtab]
   4.6%  1.99Ki   6.9%  1.90Ki    fmt::v11::detail::format_handler::on_format_specs(int, char const*, char const*)
   4.4%  1.88Ki   0.0%       0    [ELF Section Headers]
   4.1%  1.78Ki   5.8%  1.61Ki    fmt::v11::basic_appender fmt::v11::detail::write_int_noinline, unsigned int>(fmt::v11::basic_appender, fmt::v11::detail::write_int_arg, fmt::v11::format_specs const&, fmt::v11::detail::locale_ref) (.constprop.0)
   3.7%  1.60Ki   5.8%  1.60Ki    [section .dynstr]
   3.5%  1.50Ki   4.8%  1.34Ki    void fmt::v11::detail::vformat_to(fmt::v11::detail::buffer&, fmt::v11::basic_string_view, fmt::v11::detail::vformat_args::type, fmt::v11::detail::locale_ref) (.constprop.0)
   3.5%  1.49Ki   4.9%  1.35Ki    fmt::v11::basic_appender fmt::v11::detail::write_int, unsigned __int128, char>(fmt::v11::basic_appender, unsigned __int128, unsigned int, fmt::v11::format_specs const&, fmt::v11::detail::digit_grouping const&)
   3.1%  1.31Ki   4.7%  1.31Ki    [section .dynsym]
   3.0%  1.29Ki   4.2%  1.15Ki    fmt::v11::basic_appender fmt::v11::detail::write_int, unsigned long, char>(fmt::v11::basic_appender, unsigned long, unsigned int, fmt::v11::format_specs const&, fmt::v11::detail::digit_grouping const&)

After disabling these artifacts in commits e582d37 and
b3ccc2d, and introducing a more user-friendly way to opt out via
the FMT_USE_LOCALE macro, the binary size drops to 27kB:

$ git checkout b3ccc2d
$ g++ -Os -flto -DNDEBUG -DFMT_USE_LOCALE=0 -DFMT_BUILTIN_TYPES=0 \
      -I include test.cc src/format.cc
$ strip a.out && ls -lh a.out
-rwxrwxr-x 1 vagrant vagrant 27K Aug 30 19:38 a.out

The library has several places where size is traded off for speed.
For example, consider this function used to compute the number of decimal
digits:

auto do_count_digits(uint32_t n) -> int {
// An optimization by Kendall Willets from https://bit.ly/3uOIQrB.
// This increments the upper 32 bits (log10(T) - 1) when >= T is added.
#  define FMT_INC(T) (((sizeof(#T) - 1ull) << 32) - T)
  static constexpr uint64_t table[] = {
      FMT_INC(0),          FMT_INC(0),          FMT_INC(0),           // 8
      FMT_INC(10),         FMT_INC(10),         FMT_INC(10),          // 64
      FMT_INC(100),        FMT_INC(100),        FMT_INC(100),         // 512
      FMT_INC(1000),       FMT_INC(1000),       FMT_INC(1000),        // 4096
      FMT_INC(10000),      FMT_INC(10000),      FMT_INC(10000),       // 32k
      FMT_INC(100000),     FMT_INC(100000),     FMT_INC(100000),      // 256k
      FMT_INC(1000000),    FMT_INC(1000000),    FMT_INC(1000000),     // 2048k
      FMT_INC(10000000),   FMT_INC(10000000),   FMT_INC(10000000),    // 16M
      FMT_INC(100000000),  FMT_INC(100000000),  FMT_INC(100000000),   // 128M
      FMT_INC(1000000000), FMT_INC(1000000000), FMT_INC(1000000000),  // 1024M
      FMT_INC(1000000000), FMT_INC(1000000000)                        // 4B
  };
  auto inc = table[__builtin_clz(n | 1) ^ 31];
  return static_cast<int>((n + inc) >> 32);
}

The table used here is 256 bytes. There isn’t a one-size-fits-all solution,
and changing it unconditionally might negatively affect other use cases.
Fortunately, we have a fallback implementation of this function for cases
where __builtin_clz is unavailable, such as with constexpr:

template <typename T> constexpr auto count_digits_fallback(T n) -> int {
  int count = 1;
  for (;;) {
    // Integer division is slow so do it for a group of four digits instead
    // of for every digit. The idea comes from the talk by Alexandrescu
    // "Three Optimization Tips for C++". See speed-test for a comparison.
    if (n < 10) return count;
    if (n < 100) return count + 1;
    if (n < 1000) return count + 2;
    if (n < 10000) return count + 3;
    n /= 10000u;
    count += 4;
  }
}

All that remains is to provide users with control over when to use the fallback
implementation via (you guessed it) another configuration macro,
FMT_OPTIMIZE_SIZE:

auto count_digits(uint32_t n) -> int {
#ifdef FMT_BUILTIN_CLZ
  if (!is_constant_evaluated() && !FMT_OPTIMIZE_SIZE) return do_count_digits(n);
#endif
  return count_digits_fallback(n);
}
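To see the size/speed switch in isolation, here is a standalone sketch of the same pattern. It deliberately swaps in a simpler technique than the table above: the fast path estimates log10 from the bit length and corrects it with one lookup in a 10-entry power-of-ten table. SKETCH_OPTIMIZE_SIZE and the function names are hypothetical, and __builtin_clz assumes GCC or Clang.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical switch for this sketch; {fmt}'s own macro is
// FMT_OPTIMIZE_SIZE.
#ifndef SKETCH_OPTIMIZE_SIZE
#  define SKETCH_OPTIMIZE_SIZE 0
#endif

// Size-oriented variant: a plain loop with no lookup table.
constexpr int count_digits_small(std::uint32_t n) {
  int count = 1;
  while (n >= 10) {
    n /= 10;
    ++count;
  }
  return count;
}

// Speed-oriented variant: estimate log10 from the bit length, then correct
// the estimate with a single table lookup.
inline int count_digits_fast(std::uint32_t n) {
  static constexpr std::uint32_t powers_of_10[] = {
      1,      10,      100,      1000,      10000,
      100000, 1000000, 10000000, 100000000, 1000000000};
  n |= 1;  // maps 0 to 1; setting the low bit never changes the digit count
  int bits = 32 - __builtin_clz(n);
  int t = (bits * 1233) >> 12;  // approximately bits * log10(2)
  assert(t >= 0 && t <= 9);
  return t + (n >= powers_of_10[t] ? 1 : 0);
}

// The dispatcher picks one strategy at compile time.
inline int count_digits_sketch(std::uint32_t n) {
#if SKETCH_OPTIMIZE_SIZE
  return count_digits_small(n);
#else
  return count_digits_fast(n);
#endif
}
```

Building with -DSKETCH_OPTIMIZE_SIZE=1 leaves the fast variant and its table unreferenced, so an optimizing build can drop them, while both variants return the same digit counts.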

With this and a few similar changes, we reduced the binary size to 23kB:

$ git checkout 8e3da9d
$ g++ -Os -flto -DNDEBUG -I include \
      -DFMT_USE_LOCALE=0 -DFMT_BUILTIN_TYPES=0 -DFMT_OPTIMIZE_SIZE=1 \
      test.cc src/format.cc
$ strip a.out && ls -lh a.out
-rwxrwxr-x 1 vagrant vagrant 23K Aug 30 19:41 a.out

We could probably reduce the binary size even further with additional tweaks,
but let’s address the elephant in the room which is, of course, the C++ standard
library. What’s the point of optimizing the size when you end up getting
a megabyte or two of the C++ runtime?

While {fmt} relies minimally on the standard library, is it possible to
remove it completely as a dependency? One obvious problem is exceptions, and
these can be disabled via FMT_THROW, e.g. by defining it to abort.
In general this is not recommended, but it might be OK for some use cases,
especially considering that most errors are caught at compile time.
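The mechanism behind such a hook is an ordinary function-like macro that a build can redefine. The sketch below shows the pattern with hypothetical names (SKETCH_THROW, parse_width); it is not {fmt}'s code. By default the macro throws, and a build that defines it as, say, std::abort() turns every error path into a hard stop with no exception machinery.

```cpp
#include <cassert>
#include <cstdlib>
#include <stdexcept>

// Hypothetical error hook for this sketch; {fmt}'s real hook is FMT_THROW.
// A build can override it, e.g. with '-DSKETCH_THROW(x)=std::abort()'.
#ifndef SKETCH_THROW
#  define SKETCH_THROW(x) throw x
#endif

// Parses a decimal width such as "10" and reports malformed input
// through the hook.
inline int parse_width(const char* s) {
  assert(s != nullptr);
  if (*s < '0' || *s > '9') SKETCH_THROW(std::invalid_argument("width"));
  int value = 0;
  for (; *s >= '0' && *s <= '9'; ++s) value = value * 10 + (*s - '0');
  return value;
}
```

When the macro expands to std::abort(), its argument is discarded by the preprocessor, so the exception object is never even constructed.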

Let’s try it out and compile with -nodefaultlibs and exceptions disabled:

$ g++ -Os -flto -DNDEBUG -I include \
      -DFMT_USE_LOCALE=0 -DFMT_BUILTIN_TYPES=0 -DFMT_OPTIMIZE_SIZE=1 \
      '-DFMT_THROW(s)=abort()' -fno-exceptions test.cc src/format.cc \
      -nodefaultlibs -lc

/usr/bin/ld: /tmp/cc04DFeK.ltrans0.ltrans.o: in function `fmt::v11::basic_memory_buffer >::grow(fmt::v11::detail::buffer&, unsigned long)':
:(.text+0xaa8): undefined reference to `std::__throw_bad_alloc()'
/usr/bin/ld: :(.text+0xab8): undefined reference to `operator new(unsigned long)'
/usr/bin/ld: :(.text+0xaf8): undefined reference to `operator delete(void*, unsigned long)'
/usr/bin/ld: /tmp/cc04DFeK.ltrans0.ltrans.o: in function `fmt::v11::vprint_buffered(_IO_FILE*, fmt::v11::basic_string_view, fmt::v11::basic_format_args) [clone .constprop.0]':
:(.text+0x18c4): undefined reference to `operator delete(void*, unsigned long)'
collect2: error: ld returned 1 exit status

Amazingly, this approach mostly works. The only remaining dependency on the C++
runtime comes from fmt::basic_memory_buffer, which is a small stack-allocated
buffer that can grow into dynamic memory if necessary.
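For readers unfamiliar with this kind of buffer, here is a rough standalone sketch of the idea: storage starts inside the object and moves to the heap only when it overflows. small_buffer and its members are hypothetical and much simpler than fmt::basic_memory_buffer; the sketch uses malloc and free directly and aborts on allocation failure, so it needs neither operator new nor exceptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical buffer with 16 bytes of inline storage.
class small_buffer {
  static constexpr std::size_t inline_capacity = 16;
  char store_[inline_capacity];
  char* data_ = store_;
  std::size_t size_ = 0;
  std::size_t capacity_ = inline_capacity;

  // Moves the contents to a larger heap block.
  void grow(std::size_t min_capacity) {
    std::size_t new_capacity = capacity_ * 2;
    if (new_capacity < min_capacity) new_capacity = min_capacity;
    char* p = static_cast<char*>(std::malloc(new_capacity));
    if (!p) std::abort();
    std::memcpy(p, data_, size_);
    if (data_ != store_) std::free(data_);
    data_ = p;
    capacity_ = new_capacity;
  }

 public:
  small_buffer() = default;
  small_buffer(const small_buffer&) = delete;
  small_buffer& operator=(const small_buffer&) = delete;
  ~small_buffer() {
    if (data_ != store_) std::free(data_);
  }

  void append(const char* s, std::size_t n) {
    assert(s != nullptr || n == 0);
    if (n == 0) return;
    if (size_ + n > capacity_) grow(size_ + n);
    std::memcpy(data_ + size_, s, n);
    size_ += n;
  }

  const char* data() const { return data_; }
  std::size_t size() const { return size_; }
  bool on_heap() const { return data_ != store_; }
};
```

Short outputs never touch the heap; only appends that exceed the inline capacity trigger an allocation.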

fmt::print can write directly into the FILE buffer and generally
doesn’t require dynamic allocation. So we could remove the dependency on
fmt::basic_memory_buffer from fmt::print. However, since it may be used
elsewhere, a better solution is to replace the default allocator with one that
uses malloc and free instead of new and delete.

template <typename T> struct allocator {
  using value_type = T;

  T* allocate(size_t n) {
    FMT_ASSERT(n <= max_value<size_t>() / sizeof(T), "");
    T* p = static_cast<T*>(malloc(n * sizeof(T)));
    if (!p) FMT_THROW(std::bad_alloc());
    return p;
  }

  void deallocate(T* p, size_t) { free(p); }
};
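As a side note, an allocator of this shape needs very little extra to work with standard containers: a converting constructor and equality operators. The sketch below is my own standalone variant, not the library's code: malloc_allocator is a hypothetical name, and it throws directly instead of going through FMT_THROW. Since std::vector itself still depends on the C++ runtime, this only illustrates the allocator interface, not a runtime-free build.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

// Hypothetical malloc-based allocator modeled on the one above.
template <typename T> struct malloc_allocator {
  using value_type = T;

  malloc_allocator() = default;
  // Lets containers rebind the allocator to their internal node types.
  template <typename U> malloc_allocator(const malloc_allocator<U>&) {}

  T* allocate(std::size_t n) {
    // Guard against overflow in n * sizeof(T).
    if (n > static_cast<std::size_t>(-1) / sizeof(T)) throw std::bad_alloc();
    T* p = static_cast<T*>(std::malloc(n * sizeof(T)));
    if (!p && n != 0) throw std::bad_alloc();
    return p;
  }

  void deallocate(T* p, std::size_t) { std::free(p); }
};

// All instances are interchangeable because they hold no state.
template <typename T, typename U>
bool operator==(const malloc_allocator<T>&, const malloc_allocator<U>&) {
  return true;
}
template <typename T, typename U>
bool operator!=(const malloc_allocator<T>&, const malloc_allocator<U>&) {
  return false;
}

// Convenience alias used in the usage example.
using malloc_int_vector = std::vector<int, malloc_allocator<int>>;
```

A malloc_int_vector behaves like any other vector, except that its storage comes from malloc and is released with free.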

This reduces binary size to just 14kB:

$ git checkout c0fab5e
$ g++ -Os -flto -DNDEBUG -I include \
      -DFMT_USE_LOCALE=0 -DFMT_BUILTIN_TYPES=0 -DFMT_OPTIMIZE_SIZE=1 \
      '-DFMT_THROW(s)=abort()' -fno-exceptions test.cc src/format.cc \
      -nodefaultlibs -lc
$ strip a.out && ls -lh a.out
-rwxrwxr-x 1 vagrant vagrant 14K Aug 30 19:06 a.out

Considering that a C program with an empty main function is 6kB on this
system, {fmt} now adds less than 10kB to the binary.

We can also easily verify that it no longer depends on the C++ runtime:

$ ldd a.out
        linux-vdso.so.1 (0x0000ffffb0738000)
        libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000ffffb0530000)
        /lib/ld-linux-aarch64.so.1 (0x0000ffffb06ff000)

Hope you found this interesting, and happy embedded formatting!


Last modified on 2024-08-30
