Honey, I shrunk {fmt}: bringing binary size to 14k and ditching the C++ runtime
The {fmt} formatting library is known for its small binary footprint,
often producing code that’s several times smaller per function call compared
to alternatives like IOStreams, Boost Format, or, somewhat ironically,
tinyformat. This is mainly achieved through careful application of type erasure
on various levels, which effectively minimizes template bloat.
Formatting arguments are passed via type-erased format_args:
auto vformat(string_view fmt, format_args args) -> std::string;
template <typename... T>
auto format(format_string<T...> fmt, T&&... args) -> std::string {
  return vformat(fmt, fmt::make_format_args(args...));
}
As you can see, format delegates all its work to vformat, which is not a
template.
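To make the split between the thin template layer and the non-template core concrete, here is a minimal, self-contained sketch. It is not {fmt}'s implementation: erased_arg, erased_args, vformat_sketch and format_sketch are hypothetical names, only int and C strings are supported, and only bare {} fields are recognized.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical type-erased argument: a tag plus a small union.
struct erased_arg {
  enum class kind { integer, cstring } type;
  union {
    int int_value;
    const char* cstring_value;
  };
  erased_arg(int v) : type(kind::integer), int_value(v) {}
  erased_arg(const char* v) : type(kind::cstring), cstring_value(v) {}
};

// Hypothetical view over an array of erased arguments.
struct erased_args {
  const erased_arg* data;
  std::size_t size;
};

// The only function that does real work. It is not a template, so it is
// compiled once regardless of how many argument combinations callers use.
inline std::string vformat_sketch(const char* fmt, erased_args args) {
  std::string out;
  std::size_t next = 0;
  for (const char* p = fmt; *p; ++p) {
    if (p[0] == '{' && p[1] == '}') {
      assert(next < args.size && "more fields than arguments");
      const erased_arg& a = args.data[next++];
      if (a.type == erased_arg::kind::integer)
        out += std::to_string(a.int_value);
      else
        out += a.cstring_value;
      ++p;  // skip the closing brace
    } else {
      out += *p;
    }
  }
  return out;
}

// Thin template layer: packs the arguments into an array and forwards.
// The trailing dummy element keeps the array non-empty for zero arguments.
template <typename... T>
std::string format_sketch(const char* fmt, const T&... args) {
  const erased_arg store[] = {erased_arg(args)..., erased_arg(0)};
  return vformat_sketch(fmt, erased_args{store, sizeof...(T)});
}
```

Calling format_sketch("The answer is {}.", 42) yields "The answer is 42.". Each distinct argument list instantiates only the small wrapper, while the parsing loop exists once in the binary.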
Output iterators and other output types are also type-erased through a specially
designed buffer API.
This approach confines template usage to a minimal top-level layer, leading to
both a smaller binary size and faster build times.
For example, the following code:
// test.cc
#include <fmt/base.h>
int main() {
  fmt::print("The answer is {}.", 42);
}
compiles to just
.LC0:
.string "The answer is {}."
main:
sub rsp, 24
mov eax, 1
mov edi, OFFSET FLAT:.LC0
mov esi, 17
mov rcx, rsp
mov rdx, rax
mov DWORD PTR [rsp], 42
call fmt::v11::vprint(fmt::v11::basic_string_view, fmt::v11::basic_format_args)
xor eax, eax
add rsp, 24
ret
It’s much smaller than the equivalent IOStreams code and comparable to that
of printf:
.LC0:
.string "The answer is %d."
main:
sub rsp, 8
mov esi, 42
mov edi, OFFSET FLAT:.LC0
xor eax, eax
call printf
xor eax, eax
add rsp, 8
ret
Unlike printf, {fmt} offers full runtime type safety. Errors in format strings
can be caught at compile time, and even when the format string is determined at
runtime, errors are managed through exceptions, preventing undefined behavior,
memory corruption, and potential crashes. Additionally, {fmt} calls are
generally more efficient, particularly when using positional arguments, which C
varargs are not well-suited for.
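As a rough illustration of how a format-string mistake can surface at compile time, consider the sketch below. It is not how {fmt} performs its checks: count_placeholders is a hypothetical helper that merely counts bare {} fields, and the static_assert lines stand in for the comparison against the number of arguments.

```cpp
#include <cstddef>

// Hypothetical helper: counts "{}" fields in a string literal.
constexpr std::size_t count_placeholders(const char* s) {
  std::size_t n = 0;
  for (; *s; ++s) {
    if (s[0] == '{' && s[1] == '}') {
      ++n;
      ++s;  // skip the closing brace
    }
  }
  return n;
}

// Evaluated by the compiler; a mismatch stops the build instead of
// producing a program that misbehaves at runtime.
static_assert(count_placeholders("The answer is {}.") == 1,
              "format string expects exactly one argument");
static_assert(count_placeholders("{} + {} = {}") == 3,
              "format string expects exactly three arguments");
```

Changing the expected count in either static_assert turns the mismatch into a compile error, which is the property the paragraph above describes.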
Back in 2020, I dedicated some time to optimizing the library size,
successfully reducing it to under 100kB (just ~57kB with -Os -flto).
A lot has changed since then. Most notably, {fmt} now uses the exceptional
Dragonbox algorithm for floating-point formatting, kindly
contributed by its author, Junekey Jeon. Let’s explore how these changes have
impacted the binary size and see if further reductions are possible.
But why, some say, the binary size? Why choose this as our goal?
There has been considerable interest in using {fmt} on memory-constrained
devices, see e.g. #758 and #1226 for just two examples from
the distant past. A particularly interesting use case is retro computing, with
people using {fmt} on systems like Amiga (#4054).
We’ll apply the same methodology as in previous work, examining the
executable size of a program that uses {fmt}, as this is most relevant to end
users. All tests will be conducted on an aarch64 Ubuntu 22.04 system with GCC
11.4.0.
First, let’s establish the baseline: what’s the binary size for the latest
version of {fmt} (11.0.2)?
$ git checkout 11.0.2
$ g++ -Os -flto -DNDEBUG -I include test.cc src/format.cc
$ strip a.out && ls -lh a.out
-rwxrwxr-x 1 vagrant vagrant 75K Aug 30 19:24 a.out
The resulting binary size is 75kB (stripped). The positive takeaway is that
despite a large number of developments over the past four years, the size has not
significantly regressed.
Now, let’s explore potential optimizations. One of the first adjustments you
might consider is disabling locale support. All the formatting in {fmt} is
locale-independent by default (which breaks with C++’s tradition of having
wrong defaults), but it is still available as an opt-in via the L format
specifier. It can be disabled in a somewhat obscure way via the
FMT_STATIC_THOUSANDS_SEPARATOR macro:
$ g++ -Os -flto -DNDEBUG "-DFMT_STATIC_THOUSANDS_SEPARATOR=','" \
      -I include test.cc src/format.cc
$ strip a.out && ls -lh a.out
-rwxrwxr-x 1 vagrant vagrant 71K Aug 30 19:25 a.out
Disabling locale support reduces the binary size to 71kB.
Next, let’s examine the results using our trusty tool, Bloaty:
$ bloaty -d symbols a.out
FILE SIZE VM SIZE
-------------- --------------
43.8% 41.1Ki 43.6% 29.0Ki [121 Others]
6.4% 6.04Ki 8.1% 5.42Ki fmt::v11::detail::do_write_float<>()
5.9% 5.50Ki 7.5% 4.98Ki fmt::v11::detail::write_int_noinline<>()
5.7% 5.32Ki 5.8% 3.88Ki fmt::v11::detail::write<>()
5.4% 5.02Ki 7.2% 4.81Ki fmt::v11::detail::parse_replacement_field<>()
3.9% 3.69Ki 3.7% 2.49Ki fmt::v11::detail::format_uint<>()
3.2% 3.00Ki 0.0% 0 [section .symtab]
2.7% 2.50Ki 0.0% 0 [section .strtab]
2.3% 2.12Ki 2.9% 1.93Ki fmt::v11::detail::dragonbox::to_decimal<>()
2.0% 1.89Ki 2.4% 1.61Ki fmt::v11::detail::write_int<>()
2.0% 1.88Ki 0.0% 0 [ELF Section Headers]
1.9% 1.79Ki 2.5% 1.66Ki fmt::v11::detail::write_float<>()
1.9% 1.78Ki 2.7% 1.78Ki [section .dynstr]
1.8% 1.72Ki 2.4% 1.62Ki fmt::v11::detail::format_dragon()
1.8% 1.68Ki 1.5% 1016 fmt::v11::detail::format_decimal<>()
1.6% 1.52Ki 2.1% 1.41Ki fmt::v11::detail::format_float<>()
1.6% 1.49Ki 0.0% 0 [Unmapped]
1.5% 1.45Ki 2.2% 1.45Ki [section .dynsym]
1.5% 1.45Ki 2.0% 1.31Ki fmt::v11::detail::write_loc()
1.5% 1.44Ki 2.2% 1.44Ki [section .rodata]
1.5% 1.40Ki 1.1% 764 fmt::v11::detail::do_write_float<>()::{lambda()#2}::operator()()
100.0% 93.8Ki 100.0% 66.6Ki TOTAL
Unsurprisingly, a large portion of the binary size is dedicated to numeric
formatting, particularly floating-point numbers. FP formatting also relies on
sizable tables, which aren’t shown here. But what if floating-point support
isn’t required? {fmt} provides a way to disable it, though the method is
somewhat ad hoc and doesn’t extend to other types.
The core issue is that formatting functions need to be aware of all formattable
types. Or do they? This is true for printf as defined by the C standard, but
not necessarily for {fmt}. {fmt} supports an extension API that allows
formatting arbitrary types without knowing the complete set of types in advance.
While built-in and string types are handled specially for performance reasons,
focusing on binary size might warrant a different approach. By removing this
special handling and routing all types through the extension API, you can avoid
paying for types you don’t use.
I did an experimental implementation of this idea. With the
FMT_BUILTIN_TYPES macro set to 0, only int is handled specially, and all
other types go through the general extension API. We still need to know about
int for dynamic width and precision, for example:
fmt::print("{:{}}\n", "hello", 10); // prints "hello     "
This gives you the “don’t pay for what you don’t use” model, though it comes
with a slight increase in per-call binary size. If you do format floating-point
numbers or other types, the relevant code will still be included in the build.
While it’s possible to make the FP implementation smaller, we won’t delve into
that here.
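The idea of routing every type through an extension point can be sketched as follows. This is only an illustration under my own assumptions, not the mechanism behind FMT_BUILTIN_TYPES: custom_arg, format_thunk, make_arg, append_arg and the format_value overloads are all hypothetical names.

```cpp
#include <cassert>
#include <string>

// Hypothetical per-type formatters, selected by overload resolution.
inline void format_value(std::string& out, int v) { out += std::to_string(v); }
inline void format_value(std::string& out, const char* v) { out += v; }
inline void format_value(std::string& out, double v) {
  out += std::to_string(v);
}

// An erased argument: a pointer to the value plus a pointer to the code
// that knows how to format it.
struct custom_arg {
  const void* value;
  void (*format)(std::string& out, const void* value);
};

// Instantiated only for the types that actually appear at call sites.
template <typename T>
void format_thunk(std::string& out, const void* value) {
  format_value(out, *static_cast<const T*>(value));
}

// The referenced object must outlive the returned custom_arg.
template <typename T> custom_arg make_arg(const T& v) {
  return {&v, &format_thunk<T>};
}

// Non-template core: it never names a concrete type, so a program that
// does not pass a double never references the double formatter.
inline void append_arg(std::string& out, custom_arg arg) {
  assert(arg.format != nullptr);
  arg.format(out, arg.value);
}
```

Because the core only sees a function pointer, formatting code for a type is pulled into the binary when some call site creates an argument of that type, which matches the pay-for-what-you-use model described above. The cost is an extra pointer per argument and an indirect call.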
With FMT_BUILTIN_TYPES=0, the binary size in our example reduced to 31kB,
representing a substantial improvement:
$ git checkout 377cf20
$ g++ -Os -flto -DNDEBUG \
      "-DFMT_STATIC_THOUSANDS_SEPARATOR=','" -DFMT_BUILTIN_TYPES=0 \
      -I include test.cc src/format.cc
$ strip a.out && ls -lh a.out
-rwxrwxr-x 1 vagrant vagrant 31K Aug 30 19:37 a.out
However, the updated Bloaty results reveal some lingering locale artifacts,
such as digit_grouping:
$ bloaty -d fullsymbols a.out
FILE SIZE VM SIZE
-------------- --------------
41.8% 18.0Ki 39.7% 11.0Ki [84 Others]
6.4% 2.77Ki 0.0% 0 [section .symtab]
5.3% 2.28Ki 0.0% 0 [section .strtab]
4.6% 1.99Ki 6.9% 1.90Ki fmt::v11::detail::format_handler::on_format_specs(int, char const*, char const*)
4.4% 1.88Ki 0.0% 0 [ELF Section Headers]
4.1% 1.78Ki 5.8% 1.61Ki fmt::v11::basic_appender fmt::v11::detail::write_int_noinline, unsigned int>(fmt::v11::basic_appender, fmt::v11::detail::write_int_arg, fmt::v11::format_specs const&, fmt::v11::detail::locale_ref) (.constprop.0)
3.7% 1.60Ki 5.8% 1.60Ki [section .dynstr]
3.5% 1.50Ki 4.8% 1.34Ki void fmt::v11::detail::vformat_to(fmt::v11::detail::buffer&, fmt::v11::basic_string_view, fmt::v11::detail::vformat_args::type, fmt::v11::detail::locale_ref) (.constprop.0)
3.5% 1.49Ki 4.9% 1.35Ki fmt::v11::basic_appender fmt::v11::detail::write_int, unsigned __int128, char>(fmt::v11::basic_appender, unsigned __int128, unsigned int, fmt::v11::format_specs const&, fmt::v11::detail::digit_grouping const&)
3.1% 1.31Ki 4.7% 1.31Ki [section .dynsym]
3.0% 1.29Ki 4.2% 1.15Ki fmt::v11::basic_appender fmt::v11::detail::write_int, unsigned long, char>(fmt::v11::basic_appender, unsigned long, unsigned int, fmt::v11::format_specs const&, fmt::v11::detail::digit_grouping const&)
After disabling these artifacts in commits e582d37 and
b3ccc2d, and introducing a more user-friendly option to opt out via
the FMT_USE_LOCALE macro, the binary size drops to 27kB:
$ git checkout b3ccc2d
$ g++ -Os -flto -DNDEBUG -DFMT_USE_LOCALE=0 -DFMT_BUILTIN_TYPES=0 \
      -I include test.cc src/format.cc
$ strip a.out && ls -lh a.out
-rwxrwxr-x 1 vagrant vagrant 27K Aug 30 19:38 a.out
The library contains several areas where size is traded off for speed.
For example, consider this function used to compute the number of decimal
digits:
auto do_count_digits(uint32_t n) -> int {
// An optimization by Kendall Willets from https://bit.ly/3uOIQrB.
// This increments the upper 32 bits (log10(T) - 1) when >= T is added.
#  define FMT_INC(T) (((sizeof(#T) - 1ull) << 32) - T)
  static constexpr uint64_t table[] = {
      FMT_INC(0),          FMT_INC(0),          FMT_INC(0),           // 8
      FMT_INC(10),         FMT_INC(10),         FMT_INC(10),          // 64
      FMT_INC(100),        FMT_INC(100),        FMT_INC(100),         // 512
      FMT_INC(1000),       FMT_INC(1000),       FMT_INC(1000),        // 4096
      FMT_INC(10000),      FMT_INC(10000),      FMT_INC(10000),       // 32k
      FMT_INC(100000),     FMT_INC(100000),     FMT_INC(100000),      // 256k
      FMT_INC(1000000),    FMT_INC(1000000),    FMT_INC(1000000),     // 2048k
      FMT_INC(10000000),   FMT_INC(10000000),   FMT_INC(10000000),    // 16M
      FMT_INC(100000000),  FMT_INC(100000000),  FMT_INC(100000000),   // 128M
      FMT_INC(1000000000), FMT_INC(1000000000), FMT_INC(1000000000),  // 1024M
      FMT_INC(1000000000), FMT_INC(1000000000)                        // 4B
  };
  auto inc = table[__builtin_clz(n | 1) ^ 31];
  return static_cast<int>((n + inc) >> 32);
}
The table used here is 256 bytes. There isn’t a one-size-fits-all solution,
and changing it unconditionally might negatively affect other use cases.
Fortunately, we have a fallback implementation of this function for cases
where __builtin_clz is unavailable, such as with constexpr:
template <typename T> constexpr auto count_digits_fallback(T n) -> int {
  int count = 1;
  for (;;) {
    // Integer division is slow so do it for a group of four digits instead
    // of for every digit. The idea comes from the talk by Alexandrescu
    // "Three Optimization Tips for C++". See speed-test for a comparison.
    if (n < 10) return count;
    if (n < 100) return count + 1;
    if (n < 1000) return count + 2;
    if (n < 10000) return count + 3;
    n /= 10000u;
    count += 4;
  }
}
All that remains is to provide users with control over when to use the fallback
implementation via (you guessed it) another configuration macro,
FMT_OPTIMIZE_SIZE:
auto count_digits(uint32_t n) -> int {
#ifdef FMT_BUILTIN_CLZ
if (!is_constant_evaluated() && !FMT_OPTIMIZE_SIZE) return do_count_digits(n);
#endif
return count_digits_fallback(n);
}
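To see that the two strategies are interchangeable, the following self-contained sketch restates both and compares them at every power-of-ten boundary. The names count_digits_small, count_digits_fast, count_digits_sketch, implementations_agree, SKETCH_INC and SKETCH_OPTIMIZE_SIZE are hypothetical stand-ins chosen for this example, and the table version assumes a GCC- or Clang-compatible compiler for __builtin_clz.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for FMT_OPTIMIZE_SIZE; 0 selects the table version.
#ifndef SKETCH_OPTIMIZE_SIZE
#  define SKETCH_OPTIMIZE_SIZE 0
#endif

// Loop version: no lookup table, one division per four digits.
constexpr int count_digits_small(std::uint32_t n) {
  int count = 1;
  for (;;) {
    if (n < 10) return count;
    if (n < 100) return count + 1;
    if (n < 1000) return count + 2;
    if (n < 10000) return count + 3;
    n /= 10000u;
    count += 4;
  }
}

// Table version: 32 entries of 8 bytes, indexed by floor(log2(n | 1)).
inline int count_digits_fast(std::uint32_t n) {
#define SKETCH_INC(T) (((sizeof(#T) - 1ull) << 32) - T)
  static constexpr std::uint64_t table[] = {
      SKETCH_INC(0),          SKETCH_INC(0),          SKETCH_INC(0),
      SKETCH_INC(10),         SKETCH_INC(10),         SKETCH_INC(10),
      SKETCH_INC(100),        SKETCH_INC(100),        SKETCH_INC(100),
      SKETCH_INC(1000),       SKETCH_INC(1000),       SKETCH_INC(1000),
      SKETCH_INC(10000),      SKETCH_INC(10000),      SKETCH_INC(10000),
      SKETCH_INC(100000),     SKETCH_INC(100000),     SKETCH_INC(100000),
      SKETCH_INC(1000000),    SKETCH_INC(1000000),    SKETCH_INC(1000000),
      SKETCH_INC(10000000),   SKETCH_INC(10000000),   SKETCH_INC(10000000),
      SKETCH_INC(100000000),  SKETCH_INC(100000000),  SKETCH_INC(100000000),
      SKETCH_INC(1000000000), SKETCH_INC(1000000000), SKETCH_INC(1000000000),
      SKETCH_INC(1000000000), SKETCH_INC(1000000000)};
#undef SKETCH_INC
  unsigned index = static_cast<unsigned>(__builtin_clz(n | 1) ^ 31);
  assert(index < sizeof(table) / sizeof(table[0]));
  return static_cast<int>((n + table[index]) >> 32);
}

// Compile-time switch between the two, mirroring the macro-based dispatch.
inline int count_digits_sketch(std::uint32_t n) {
#if SKETCH_OPTIMIZE_SIZE
  return count_digits_small(n);
#else
  return count_digits_fast(n);
#endif
}

// Compares both versions at 0, at each 10^k - 1 and 10^k, and at UINT32_MAX.
inline bool implementations_agree() {
  if (count_digits_fast(0) != count_digits_small(0)) return false;
  for (std::uint32_t p = 10;; p *= 10) {
    if (count_digits_fast(p - 1) != count_digits_small(p - 1)) return false;
    if (count_digits_fast(p) != count_digits_small(p)) return false;
    if (p == 1000000000u) break;  // the next multiply would overflow
  }
  return count_digits_fast(UINT32_MAX) == count_digits_small(UINT32_MAX);
}
```

Building with -DSKETCH_OPTIMIZE_SIZE=1 leaves the table unreferenced by count_digits_sketch, so the linker can discard its 256 bytes, at the price of a loop with a division for larger inputs.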
With this and a few similar changes, we reduced the binary size to 23kB:
$ git checkout 8e3da9d
$ g++ -Os -flto -DNDEBUG -I include \
      -DFMT_USE_LOCALE=0 -DFMT_BUILTIN_TYPES=0 -DFMT_OPTIMIZE_SIZE=1 \
      test.cc src/format.cc
$ strip a.out && ls -lh a.out
-rwxrwxr-x 1 vagrant vagrant 23K Aug 30 19:41 a.out
We could probably reduce the binary size even further with additional tweaks,
but let’s address the elephant in the room which is, of course, the C++ standard
library. What’s the point of optimizing the size if you end up getting
a megabyte or two of the C++ runtime?
While {fmt} relies minimally on the standard library, is it possible to
remove it completely as a dependency? One obvious problem is exceptions, and
those can be disabled via FMT_THROW, e.g. by defining it to abort.
In general it’s not recommended, but it might be OK for some use cases,
especially considering that most errors are caught at compile time.
Let’s try it out and compile with -nodefaultlibs and exceptions disabled:
$ g++ -Os -flto -DNDEBUG -I include \
      -DFMT_USE_LOCALE=0 -DFMT_BUILTIN_TYPES=0 -DFMT_OPTIMIZE_SIZE=1 \
      '-DFMT_THROW(s)=abort()' -fno-exceptions test.cc src/format.cc \
      -nodefaultlibs -lc
/usr/bin/ld: /tmp/cc04DFeK.ltrans0.ltrans.o: in function `fmt::v11::basic_memory_buffer >::grow(fmt::v11::detail::buffer&, unsigned long)':
:(.text+0xaa8): undefined reference to `std::__throw_bad_alloc()'
/usr/bin/ld: :(.text+0xab8): undefined reference to `operator new(unsigned long)'
/usr/bin/ld: :(.text+0xaf8): undefined reference to `operator delete(void*, unsigned long)'
/usr/bin/ld: /tmp/cc04DFeK.ltrans0.ltrans.o: in function `fmt::v11::vprint_buffered(_IO_FILE*, fmt::v11::basic_string_view, fmt::v11::basic_format_args) [clone .constprop.0]':
:(.text+0x18c4): undefined reference to `operator delete(void*, unsigned long)'
collect2: error: ld returned 1 exit status
Amazingly, this approach mostly works. The only remaining dependency on the C++
runtime comes from fmt::basic_memory_buffer, which is a small stack-allocated
buffer that can grow into dynamic memory if necessary.
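For readers unfamiliar with this kind of buffer, here is a minimal sketch of the general pattern: inline storage first, heap storage once the content no longer fits. It is not fmt::basic_memory_buffer; small_buffer and its members are hypothetical, and it calls malloc and free directly so that it needs nothing from the C++ runtime beyond the C library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical buffer with InlineSize bytes of in-object storage that
// spills to the heap when more room is needed.
template <std::size_t InlineSize> class small_buffer {
 public:
  small_buffer() : data_(store_), size_(0), capacity_(InlineSize) {}
  ~small_buffer() {
    if (data_ != store_) std::free(data_);
  }
  small_buffer(const small_buffer&) = delete;
  small_buffer& operator=(const small_buffer&) = delete;

  void append(const char* s, std::size_t n) {
    assert(s != nullptr || n == 0);
    if (n == 0) return;
    if (size_ + n > capacity_) grow(size_ + n);
    std::memcpy(data_ + size_, s, n);
    size_ += n;
  }

  const char* data() const { return data_; }
  std::size_t size() const { return size_; }
  bool on_heap() const { return data_ != store_; }

 private:
  // Grows by 1.5x, or to the requested minimum if that is larger.
  void grow(std::size_t min_capacity) {
    std::size_t new_capacity = capacity_ + capacity_ / 2;
    if (new_capacity < min_capacity) new_capacity = min_capacity;
    char* p = static_cast<char*>(std::malloc(new_capacity));
    if (!p) std::abort();  // stands in for error reporting without exceptions
    std::memcpy(p, data_, size_);
    if (data_ != store_) std::free(data_);
    data_ = p;
    capacity_ = new_capacity;
  }

  char store_[InlineSize];
  char* data_;
  std::size_t size_;
  std::size_t capacity_;
};
```

Short outputs never touch the allocator, which is why the dynamic path is rarely exercised in practice, yet the grow function is still what drags allocation symbols into the link.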
fmt::print can write directly into the FILE buffer and generally
doesn’t require dynamic allocation. So we could remove the dependency on
fmt::basic_memory_buffer from fmt::print. However, since it may be used
elsewhere, a better solution is to replace the default allocator with one that
uses malloc and free instead of new and delete.
template <typename T> struct allocator {
  using value_type = T;
  T* allocate(size_t n) {
    FMT_ASSERT(n <= max_value<size_t>() / sizeof(T), "");
    T* p = static_cast<T*>(malloc(n * sizeof(T)));
    if (!p) FMT_THROW(std::bad_alloc());
    return p;
  }
  void deallocate(T* p, size_t) { free(p); }
};
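As a standalone illustration of the same allocator shape, the sketch below adds the few extra members the standard Allocator requirements ask for and plugs the result into std::vector. The names malloc_allocator and sum_first are hypothetical, allocation failure calls abort instead of throwing, and std::vector is used only to exercise the interface; whether a given container links without the C++ runtime is not something this example demonstrates.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Hypothetical allocator that bypasses operator new and operator delete.
template <typename T> struct malloc_allocator {
  using value_type = T;

  malloc_allocator() = default;
  // Converting constructor so containers can rebind to other value types.
  template <typename U> malloc_allocator(const malloc_allocator<U>&) {}

  T* allocate(std::size_t n) {
    assert(n <= SIZE_MAX / sizeof(T));  // guards the multiplication below
    T* p = static_cast<T*>(std::malloc(n * sizeof(T)));
    if (!p && n != 0) std::abort();  // no exceptions in this sketch
    return p;
  }
  void deallocate(T* p, std::size_t) { std::free(p); }
};

// All instances are interchangeable because they hold no state.
template <typename T, typename U>
bool operator==(const malloc_allocator<T>&, const malloc_allocator<U>&) {
  return true;
}
template <typename T, typename U>
bool operator!=(const malloc_allocator<T>&, const malloc_allocator<U>&) {
  return false;
}

// Hypothetical helper that forces several reallocations through the
// allocator and returns 1 + 2 + ... + count.
inline long sum_first(int count) {
  std::vector<int, malloc_allocator<int>> v;
  for (int i = 1; i <= count; ++i) v.push_back(i);
  long total = 0;
  for (int x : v) total += x;
  return total;
}
```

The allocator is stateless, so swapping it in changes where memory comes from without changing the size or layout of the container that uses it.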
This reduces binary size to just 14kB:
$ git checkout c0fab5e
$ g++ -Os -flto -DNDEBUG -I include \
      -DFMT_USE_LOCALE=0 -DFMT_BUILTIN_TYPES=0 -DFMT_OPTIMIZE_SIZE=1 \
      '-DFMT_THROW(s)=abort()' -fno-exceptions test.cc src/format.cc \
      -nodefaultlibs -lc
$ strip a.out && ls -lh a.out
-rwxrwxr-x 1 vagrant vagrant 14K Aug 30 19:06 a.out
Considering that a C program with an empty main function is 6kB on this
system, {fmt} now adds less than 10kB to the binary.
We can also easily verify that it no longer depends on the C++ runtime:
$ ldd a.out
linux-vdso.so.1 (0x0000ffffb0738000)
libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000ffffb0530000)
/lib/ld-linux-aarch64.so.1 (0x0000ffffb06ff000)
Hope you found this interesting and happy embedded formatting!
Last modified on 2024-08-30