10 comments

  • Joker_vD 9 hours ago
    In the end it all boils to a very simple argument. The C programmers want the C compilers to behave one way, the C implementers want the C compilers to behave the other way. Since the power structure is what it is — the C implementers are the ones who write the C standard and are the ones who actually get to implement the C compilers — the C compilers do, and will, behave the way the C implementers want them to.

    In this situation the C programmers can either a) accept that they're programming in a language that exists as it exists, not as they'd like it to exist; b) angrily deny a); or c) switch to some other system-level language with defined semantics.

    • userbinator 7 hours ago
      Given what most C compilers are written in, are C programmers also C implementers?

      I suspect it also depends on who exactly the compiler writers are; the GCC and LLVM teams seem to have more theoreticians/academics and thus think of the language more abstractly, leading to UB being treated as truly inexplicable and unconstrained, while MSVC and ICC are more on the practical side and their interpretation of it is, as the standard says, "in a documented manner characteristic of the environment". IMHO the "spirit of C" and the more commonsense approach is definitely the latter, and K&R themselves have always leaned in that direction. This is very much a "letter of the law vs. spirit of the law" argument. The fact that these two different sides have produced compilers with nearly the same performance characteristics shows, IMHO, that the argument that exploiting UB is mandatory for performance is a debunked myth.

      • zbentley 30 minutes ago
        I doubt it, but that's just a hunch. Is there data out there regarding compiler/language maintainer/standards committee members' contributions to other projects (beyond "so and so person works on $compiler and $application, both written in C"-type anecdotes)?

      If not, then, like ... sure, C compiler maintainers are people who program in C, but they're not "C programmers" as the term was intended (people who develop non-compiler software in C).

        My hunch is that that statement is overwhelmingly true if measured by influence of a given C compiler/implementation stack (because GCC/LLVM/MSVC take up a huge slice of the market, and their maintainers are in many cases paid specialists who don't do significant work on other projects), but untrue if measured by count of people who have worked on C compilers (because there are a huge number of small-market-share/niche compilers out there, often maintained by groups who develop those compilers for a specific, often closed-source, platform/SoC/whatever).

    • HexDecOctBin 8 hours ago
      Another alternative is for programmers to write their own C compiler and be free of these politics. Maybe I am biased since I am working on exactly such a project, but I have been seeing more and more in-progress compiler implementations for C or C-like languages over the past couple of years.
      • Joker_vD 6 hours ago
        The proposals for Boring C or a "Friendly Dialect of C" or whatever have been around for a while. None went beyond the early design stages because, it turns out, no two experienced C programmers could agree on which parts of C are reasonable/unreasonable (and should be kept/left out); see [0] for a first-hand account.

        [0] https://blog.regehr.org/archives/1287

        > In contrast, we want old code to just keep working, with latent bugs remaining latent.

        Well, just keep compiling it with the old compilers. "But we'd like to use new compilers for some 'free' gains!" Well, that sucks, you can't. "But we have to use new compilers because the old ones just plain don't work on the newer systems!" Well, that sucks too, and this here is why "technical debt" is called "debt": you've managed to put off paying it until now, and the repo team is here and knocking at your door.

        • zbentley 17 minutes ago
          I can't upvote this enough.

          I mostly work in compiled languages now, but started in interpreted/runtime languages.

          When I made that switch, it was baffling to me that the compiled-language folks don't do compatibility-breaking changes more often during big language/compiler revision updates.

          Compiled code isn't like runtime code--you can build it (in many cases bit-deterministically!) on any compiler version and it stays built! There's no risk of a toolchain upgrade preventing your software from running, just compiling.

          After having gone through the browser compatibility trenches and the Python 2->3 wars, I have no idea why your proposal isn't implemented more often: old compiler/language versions get critical/bugfix updates where practical, new versions get new features and aggressively deprecate old ones.

          Don't get me wrong, backwards compatibility is golden...when it comes to making software run. But I think it's a mistake that back compat is taken even further when it comes to compilers, rather than the reverse. I get that there are immense volumes of C/C++ out there, but I don't get why new features/semantics/optimizations aren't rolled out more aggressively (well, I do--maintainers of some of those immense volumes are on language steering committees and don't want to spin up projects to modernize their codebases--but I'm mad about it).

          "Just use an old compiler" seems like such a gimme--especially in the modern era of containers etc. where making old toolchains available is easier than ever. I get that it feels bad and accumulates paper cuts, but it is so much easier to deploy compiled code written on an old revision on a new system than it is to deploy interpreted/managed code.

          (There are a few cases where compilers need to be careful there--thinking about e.g. ELF format extensions and how to compile code with consideration for more aggressive linker optimizations that might be developed in the future--but they're the minority.)

      • uecker 7 hours ago
        And please provide feedback to WG14. Also please give feedback and file bugs for GCC / clang. There are users of C in the committee and we need your support. Also keeping C implementable for small teams is something that is at risk.
    • WalterBright 8 hours ago
      > behave the way the C implementers want them to

      If you don't please your users, you won't have any users.

      • Jweb_Guru 7 hours ago
        It's ironic that I have to tell you of all people this, but many users of C (or at least, backends of compilers targeted by C) do actually want the compiler to aggressively optimize around UB.
      • groestl 7 hours ago
        If you're self hosting your compiler on C, you are your own user.
      • Gibbon1 6 hours ago
        Consider that most programmers have long since fled for other languages.
      • godelski 7 hours ago
        Which users?
      • AlotOfReading 8 hours ago
        And yet, C++.
        • locknitpicker 7 hours ago
          > And yet, C++.

          By any metric, C++ is one of the most successful programming languages devised by mankind, if not the most successful.

          What point were you trying to make?

          • zbentley 10 minutes ago
            True! But C++ is popular almost entirely because of when (in history/what alternatives existed at the time) and where (on what platforms) it first became available, and how much adoption momentum was created during that era.

            I think claiming that C++ is successful because of the unintuitive-behavior-causing compiler behaviors/parts of the spec is an extraordinary claim--if that's what you mean, then I disagree. TFA discusses that many of the most pernicious UB-causing optimizations yield paltry performance gains.

          • Joker_vD 6 hours ago
            That it doesn't please lots of its users, I imagine. I, personally, certainly never enjoyed it, but sometimes you don't have a realistic alternative and have to use C++ (or C). In which case your pleasure or displeasure doesn't really matter; you just use that one tool with very sharp edges in the most unexpected (and ridiculously exposed) places with as much care as you can, then bandage your wounds and move on.
          • zem 6 hours ago
            that it has millions of users while pleasing approximately none of them
    • seg_lol 9 hours ago
      How about we agree on the ABI and everyone can have their own C compiler. Everyone C's the world through their own lenses.
      • dotancohen 8 hours ago
        We're not too far away from that. At the very least, Claude can provide feedback and help decide which compiler options to use, as per developer preference.
  • nananana9 7 hours ago
    Compiler developers hijacked and twisted the term "Undefined Behavior". Everyone understood what UB was in K&R C: if you write code that the standard doesn't assign a meaning to, the compiler outputs what it outputs. If you dereference a null pointer, the compiler emits a null pointer dereference, and when you hit it at runtime you get the undefined behavior (a page fault on modern systems).

    Nowadays, UB means something completely different - if at any point in time, the compiler reasons out that a piece of code is only reachable via UB, it will assume that this can never happen, and will quietly delete everything downstream:

    https://godbolt.org/z/EYxWqcfjx

    • kace91 7 hours ago
      Sorry if I’m missing something as this isn’t my field, but shouldn’t the two meanings be roughly equivalent to the user?

      As in, everything down from UB is only working by an accident of implementation that does not need to hold, and you should explicitly not rely on that. Whether the compiler happens to explicitly make it not ever work or just leaves it to fate should not be relevant.

      • Doxin 6 hours ago
        No, because the former definition is still something you can rely on given a specific compiler and a specific machine. Hell, a bunch of UB was pretty much universal anyway. Compilers would usually still emit sensible code for UB.

        UB just meant "the spec doesn't define what happens". It didn't use to mean "the compiler can just decide to do any wild thing if your program touches UB anywhere at any time". Hell, with the modern definition UB can apparently time travel: you don't even need to execute the UB code for it to start doing weird shit in some cases.

        UB went from "whatever happens when your compiler/hardware runs this is what happens" to "once a program contains UB, the compiler doesn't need to conform to the rest of the spec anymore."

        • kace91 1 hour ago
          >the former definition is still something you can rely on given a specific compiler and a specific machine.

          >UB just ment "the spec doesn't define what happens"

          What comes to mind is that the written code is then operating on a sub-spec, one that is probably undocumented and maybe even unintended, defined by the specifics of that compiler version and platform.

          It sounds like it could create a ton of issues, from code that can't be ported to difficulty for another person grokking the undocumented behavior that is being relied on.

          In this regard, as someone who could potentially inherit this code, I'd actually want the compiler to stop this potential behavior. Am I missing something? Is the spec not functional enough to rely on by itself?

      • anilakar 6 hours ago
        My main gripe with UB is that if a compiler is able to detect that undefined behavior is invoked, it is still allowed to compile (or rather omit) said code instead of rejecting the program.
    • atiedebee 3 hours ago
      ISO C99 actually defines multiple types of deviating behaviour. What you're describing is closer to implementation-defined behaviour than anything else.

      The three behaviours relevant in this discussion, from section 3.4:

        3.4.1 implementation-defined behavior
        unspecified behavior where each implementation documents how the choice is made
        EXAMPLE An example of implementation-defined behavior is the propagation of the high-order bit when a signed integer is shifted right.
      
        3.4.3 undefined behavior
        behavior, upon use of a nonportable or erroneous program construct or of erroneous data, for which this International Standard imposes no requirements
        Possible undefined behavior ranges from ignoring the situation completely with unpredictable results, to behaving during translation or program execution in a documented manner characteristic of the environment (with or without the issuance of a diagnostic message), to terminating a translation or execution (with the issuance of a diagnostic message).
        An example of undefined behavior is the behavior on integer overflow.
      
        3.4.4 unspecified behavior
        behavior where this International Standard provides two or more possibilities and imposes no further requirements on which is chosen in any instance
        An example of unspecified behavior is the order in which the arguments to a function are evaluated.
      
      K&R seems to also mention "undefined" and "implementation-defined" behaviour on several occasions. It doesn't specify what is meant by undefined behaviour, but it does indeed seem to be "whatever happens, happens" instead of "you can do whatever you want." But ISO C99 seems to be a lot looser with their definition.

      Using integer overflow, as in your example, for optimization has been shown to be beneficial by Chandler Carruth in a talk he gave at CppCon in 2016.[1] I think it would be best to have something similar to Zig's wrapping and saturating addition operators instead, but for that I think it is better to just use Zig (which I personally am very willing to do once it reaches 1.0 and other compiler implementations are available).[2]

      [1] https://youtu.be/yG1OZ69H_-o?si=x-9ALB8JGn5Qdjx_&t=2357 [2] https://ziglang.org/documentation/0.15.2/#Operators

      • nananana9 2 hours ago
        [1] is probably the best counterpoint I've seen, but there are other ways to enable this optimization, the most obvious being to use a register-sized index, which is what's passed to the function anyway. I'd be fine with an intrinsic for this as well (I don't think you'd use it often enough to justify the +%= syntax).

        It's also worth noting that even with the current very liberal handling of UB, the actual code sample in [1] was still missing this optimization; so it's not like liberal UB handling automatically led to faster code, understanding of the compiler was still needed.

        The question is one of risk: if the compiler is conservative, you're risking slightly less optimized code. If the compiler is very liberal and assumes UB never happens, you're risking that it will wipe your overflow check like in my godbolt (I've seen actual CVEs due to that, although I don't remember the project).

  • muldvarp 5 hours ago
    A C compiler is a relatively simple program (especially if you don't want any optimizations based on undefined behavior). If a large part of the userbase is unhappy with the way most modern C compilers work, they could easily write a "friendly"/"boring" C compiler.
    • zbentley 2 minutes ago
      Some of those already exist, e.g. https://bellard.org/tcc/

      However, they're not in widespread use. I would be curious to learn if there's any data/non-anecdotal information as to why. Is it momentum/inertia of GCC/LLVM/MSVC? Are alternative compilers incomplete and unable to compile a lot of practical programs (belying the "relatively simple program" claim)? Or is the performance differential due to optimizations really so significant that ordinary programs like e.g. vim or libjpeg or VLC have significant degradations when built with an alternative compiler?

  • Animats 9 hours ago
    Note that all the examples come from lack of bounds checking.
  • rurban 8 hours ago
    This was 2015, and we still have no -Wdeadcode warning about the removal of "dead code", i.e. what compilers consider dead code. If a programmer writes code, it is never dead. It was written; it had purpose. If the compiler thinks it is wrong, it needs to warn about it.

    The only dead code is generated code by macros.

    • BeeOnRope 8 hours ago
      Dead code is extremely common in C or C++ after inlining and other optimizations.
      • dotancohen 8 hours ago
        Or stubs. I'll often flesh out a class before implementing the methods.
      • catlifeonmars 8 hours ago
        OP means that the code has a dual purpose: one purpose is to be compiled, the other is to communicate structure or intent to programmers.
        • deathanatos 8 hours ago
          Do we know that? I've written "dead" code. Its point was to communicate structure or intent, but it was also still dead. This pattern, in one form or another, crops up a lot IME (in multiple languages, even, with varying abilities to optimize it):

            if condition that is "always" false:
              abort with message detailing the circumstances
          
          That `if` is "dead", in the sense that the condition is always false. But "dead" sometimes is just a proof — or if I'm not rigorous enough, an assumption — in my head. If the compiler can prove the same proof I have in my head, then the dead code is eliminated. If it can't, well, presumably it is left in the binary, either to never be executed, or to be executed in the case that the proof in my head is wrong.
          • guenthert 3 hours ago
            What about assertions that are meant to detect bad hardware? I'd think that's not too uncommon, particularly in shops building their own hardware. Noise on the bus, improper termination, ESD, dirty clock signal, etc. -- there are a million reasons why a bit might flip. I wouldn't want the compiler to optimize "obviously wrong" code out any more than empty loops.
          • pmontra 7 hours ago
            Some conditions depend strictly on inputs and the compiler can't reason much about them, and the developers can't be sure about what their users will do. So that pattern is common. It's a sibling of assertions.

            There are even languages with mandatory else branch.

      • somat 8 hours ago
        That's the problem
        • BeeOnRope 8 hours ago
          Why is that a problem? Inlining and optimization aren't minor aspects of compiling to native code, they are responsible for order-of-magnitude speedups.

          My point is that it is easy to say "don't remove my code" while looking at a simple single-function example, but in actual compilation huge portions of a function are "dead" after inlining, constant propagation and other optimizations, and that's not even talking about C-specific UB or other shenanigans. You don't want to throw that out.

          • somat 6 hours ago
            Apologies for the flippant one liner, You made a good point and deserve more than that.

            On the one hand, having the optimizer save you from your own bad code is a huge draw, this is my desperate hope with SQL, I can write garbage queries and the optimizer will save me from myself.

            But... Someone put that code there, spent time and effort to get that machinery into place with the expectation that it is doing something, and then the optimizer takes that away with no hint. That does not feel right either. Especially when the program now behaves differently when "optimized" vs unoptimized.

      • duped 6 hours ago
        If dead code (1) is common in your codebase, then your codebase is missing heaps of refactors.

        (1) "dead" meaning unused types, unreachable branches

        • muldvarp 4 hours ago
          Not really, no. If you use a regex library it is very likely that 80% of that code is effectively dead code.
    • muldvarp 4 hours ago
      I'd love for you to write a C compiler that does this and then realize how much dead code there is in your C projects.
    • stinkbeetle 7 hours ago
      That's not true for all code bases. Two common examples:

      It's very common for inline functions in headers to be written so that inlining and constant propagation from arguments result in dead code and better generated code. There is even __builtin_constant_p() to help with such things (e.g., you can use it to have a fast folded inline variant if an argument is constant, or call big out-of-line library code if it is variable).

      There are also configuration systems that end up with config options in headers that code tests with if (CONFIG_BLAH) {...} that can evaluate to zero in valid builds.

  • moktonar 6 hours ago
    UB is the definition of Free Will; that's why you can't control it, and for a programmer, something that cannot be controlled feels dangerous..
  • Panzerschrek 8 hours ago
    Making C compilers better and more predictable is impossible with so many UB cases listed in the standard. A better language should be used instead, where UB and implementation-defined behavior cases are minimized.
    • uecker 7 hours ago
      Where there is UB in the standard, it means that a C compiler is free to define the behavior. So of course somebody could write a C implementation which does this; see also Fil-C for a perfectly memory-safe version of C. So the first sentence makes no sense.

      But also note that there is an ongoing effort to remove UB from the standard. We have eliminated already about 30% of UB in the core language for the upcoming version C2Y.

    • dotancohen 8 hours ago
      Have you seen Rust? I'm loving it.
      • uecker 7 hours ago
        Rust is not super appealing to me as C user: too complex, slow compilation, etc.
        • mfru 6 hours ago
          Maybe Zig, Hare or C3 then?
          • uecker 6 hours ago
            Also, what I like about C is that it has mature tooling, is very portable with multiple implementations, and is very stable. I would not use a language for any serious project that does not offer all this.

            Honestly, I do not think the problem with C is so big that one needs to jump ship. There are real issues, yes, but there are also plenty of good tools and strategies to deal with UB; it is not really an issue for me.

  • mgaunard 6 hours ago
    C programs with undefined behaviour were never conforming or well-working.

    I stopped reading at the abstract; garbage rant full of contradictions.

  • zephen 9 hours ago
    Here's a cogent argument that any decision by compiler writers that they can do whatever they wish whenever they encounter an "undefined behavior" construct is rubbish:

    https://www.yodaiken.com/2021/05/19/undefined-behavior-in-c-...

    And here's a cautionary tale of how a compiler writer doing whatever they wish once they encounter undefined behavior makes debugging intractable:

    https://www.quora.com/What-is-the-most-subtle-bug-you-have-h...

    • deathanatos 8 hours ago
      > undefined behavior makes debugging intractable:

      By their own admission, the compiler warns about the UB. "-Wanal"¹, as some call it, makes it an error. Under UBSan the program aborts with:

        code.cpp:4:6: runtime error: execution reached the end of a value-returning function without returning a value
      
      … "intractable"?

      ¹a humorous name for -Wextra -Wall -Werror

      • zephen 6 hours ago
        Not everybody has full control over their environment.

        The -Werror flag is not even religiously used for building e.g. the Linux kernel, and -Wextra can introduce a lot of extraneous garbage.

        This will often make it easier (though still difficult) to winnow the program down to a smaller example, as that person did, rather than to enable everything and spend weeks debugging stuff that isn't the actual problem.

        • uecker 4 hours ago
          Yes, this is the funny thing. People do not want to spend time using the stricter language already supported by C compilers via compiler flags because "it is a waste of time", while others argue that we need to switch to much stricter languages. Both positions cannot be true at the same time.
    • Joker_vD 9 hours ago
      Wow, that's a very tortured reading of a specific line in the standard. And it doesn't really matter what Yodaiken thinks this line means, because the standard is written by C implementers for (mostly) C implementers. So if C compiler writers think this line means they can use UB for optimization purposes, then that's what it means.

      Yeah, I know it breaks the common illusion among C programmers that they're "close to the bare metal", but illusions should be dispelled, not indulged. C programmers program for the abstract C machine, which is then mediated by C compilers into machine code in the way the implementers of those compilers have publicly documented.

      • chc4 8 hours ago
        Yeah, this is basically Sovereign Citizen-tier argumentation: through some magic of definitions and historical readings and arguing about commas, I prove that actually everyone is incorrect. That's not how programming languages work! If everyone for 10+ years has been developing compilers with some definition of undefined behavior, and all modern compilers use undefined behavior in order to drive optimization passes which depend on those invariants, there is no possible way to argue that they're wrong and you know the One True C Programming Language interpretation instead.

        Moreover, compiler authors don't just go out maliciously trying to ruin programs through finding more and more torturous undefined behavior for fun: the vast majority of undefined behavior in C are things that if a compiler wasn't able to assume were upheld by the programmer would inhibit trivial optimizations that the programmer also expects the compiler to be able to do.

        • somat 7 hours ago
          I find where the argument gets lost is when undefined behavior is assumed to be exactly that, an invariant.

          That is to say, I find "could not happen" the most bizarre reading to make when optimizing around undefined behavior. "Whatever the machine does" makes sense, as does "we don't know". But "could not happen"? If it could not happen, the spec would have said "cannot happen"; instead, the spec does not know what will happen and so punts on the outcome, knowing full well that it will happen all the time.

          The problem is that there is no optimization to make around "whatever the hardware does" or "we have no clue", so the incentive is to choose the worst possible reading: "undefined behavior is incorrect code, and therefore a correct program will never contain it".

          • fluoridation 7 hours ago
            Some behaviors are left unspecified instead of undefined, which allows each implementation to choose whatever behavior is convenient, such as, as you put it, whatever the hardware does. IIRC this is the case in C for modulo with both negative operands.

            I would imagine that the standard writers choose one or the other depending on whether the behavior is useful for optimizations. There's also the matter that if a behavior is currently undefined, it's easy to later on make it unspecified or specified, while if a behavior is unspecified it's more difficult to make it undefined, because you don't know how much code is depending on that behavior.

            • zephen 5 hours ago
              But even integer overflow is undefined.

              It's practically impossible to find a program without UB.

              • uecker 4 hours ago
                I think this is not really true. Or rather, it depends on the UB you are talking about. There is UB which is simply UB because it is out of scope for the C standard, and there is UB such as signed integer overflow that can cause issues. It is realistic to deal with the latter, e.g. by converting it to traps with a compiler flag.
                • zephen 3 hours ago
                  > I think this is not really true. Or rather, it depends on the UB you are talking about.

                  I mean, if you're going to argue that a compiler can do anything with any UB, then by all means make that argument.

                  Otherwise, then no, I don't think it's reasonable for a compiler to cause an infinite loop inside a function simply because that function itself doesn't return a value.

                  • fluoridation 3 hours ago
                    When you say "cause", do you mean insert on purpose, or do you mean cause by accident? I could see the latter happening, for example because the compiler doesn't generate a ret if the non-void function doesn't return anything, so control flow falls through to whatever code happens to be next in memory. I'm not aware of any compiler that does that, but it's something I could see happening, and the developers would have no reason to "fix" it, because it's perfectly up to spec.
                  • uecker 1 hour ago
                    I am not sure what statement you are responding to. I am certainly not arguing that. I disagree with your claim that "it's practically impossible to find a program without UB".
        • twoodfin 8 hours ago
          Aliasing being the classic example. If code generation for every pointer dereference has to assume that it’s potentially aliasing any other value in scope, things get slow in a hurry.
      • AlotOfReading 8 hours ago
        Compiler writers are free to make whatever intentional choices they want and document them. UB is especially nasty compared to other kinds of bugs because implementors can't/refuse to commit to any specific behavior, not because they've chosen the wrong behaviors.
        • zephen 4 hours ago
          > Compiler writers are free to make whatever intentional choices they want and document them.

          Sure, but it's unlikely it's an intentional choice to cause an infinite loop simply because your boolean function didn't return a boolean.

      • zephen 5 hours ago
        > Wow, that's a very torturous reading of a specific line in a standard.

        It's actually a much more torturous reading to say "if any line in the program contains undefined behavior (such as the example given in the standard, integer overflow), then it's OK for the compiler to treat the entire program as garbage and create any behavior whatsoever in the executable."

        Which is exactly the claim he was addressing.