2016/11/21

C vs C++, performance on AVR

The aim of this post is to fight the widespread belief that C++ is too slow a language for embedded environments. According to this belief, microcontrollers should still be programmed in C, or even in assembler. You probably don't agree with me right now. The idea that C is much more efficient than C++ is so widespread that it almost seems like sacrilege to debate it. That's why I'm going to make a series of comparisons between the two languages, backed by real, objective numbers (code size, execution time, etc.). Once we have shown that C++ can compete with good old C, we'll see that it's actually the better alternative. For that, besides performance metrics, I will compare things like safety, code readability and portability.

The platform I'm using for the benchmarks is an AVR microcontroller. Specifically, an ATmega328P, because it is so widely used, and because it has been the base platform for Arduino, which many use as an example to argue that C++ is slow (more on this later).
In order to be as fair as possible, I'm also going to take a very different reference: the Atmel Software Framework (ASF), written in C. Given that Atmel is the manufacturer of AVRs, and that they claim their libraries are optimized by experts for both code size and performance, they should make a good benchmark.
As said above, on the C++ side we'll use Arduino as the reference. Arduino doesn't provide the best performance, but it is a great example of usability, and so it will be perfect to illustrate a different point.
Between the two, I'll develop a small toy framework to show that it's perfectly possible, using modern C++, to build libraries as easy to use as Arduino without giving up C-level performance (even assembler-level performance).

Step 0: Environment and configuration.

I'll run all tests using Atmel Studio, compiling with GCC, with the C++11 standard enabled and optimizing for size (unless noted otherwise).
Before we start measuring our own code, let's take a look at the code the compiler generates to initialize its own environment (execution stack, heap, etc.) inside the microcontroller. So, how big is a minimal C or C++ program?

int main(void)
{
 while (1)
 {
 }
}

Surprisingly, the answer is 166 bytes in C and 134 bytes in C++. Things don't start well for C defenders. Let's not rush, however. We'll just take note of these numbers, so that we can later judge the size of our own code better.

Step 1: Light an LED

The microcontroller equivalent of writing "hello, world" is turning an LED on. This is a very simple task: configure a port and turn on a specific pin, barely touching a couple of registers. In this example, we light the pin that connects to the Arduino Uno's built-in LED. The C version, right out of Atmel's sample code, would be:

int main(void)
{
 DDRB |= (1 << DDB5);
 PORTB |= (1 << PORTB5);
 
 while (1);
}

The generated code occupies only 4 bytes more than the empty program (170 bytes in total), and corresponds to the following assembly:

sbi 0x04, 5; // DDRB |= (1 << DDB5);
sbi 0x05, 5; // PORTB |= (1 << PORTB5);

Perfect. One instruction per task. Unbeatable performance. It's worth noting that this code can be compiled as C++ too, and the result will be exactly the same. But the goal here is not to prove that C++ can do the same, but that it can do better. For starters, the above code isn't very readable. Forcing the programmer to deal directly with registers is uncomfortable, error-prone and non-portable, and it definitely doesn't help maintenance. Ideally, we would write something like this:

int main(void)
{
 pinMode(LED_BUILTIN, OUTPUT);
 digitalWrite(LED_BUILTIN, HIGH);
 
 while (true);
}

That's Arduino code. Clean and clear. The problem? Even after extracting only the parts of the Arduino library that are directly involved here, the resulting binary is 368 bytes in size, versus 4 bytes for the C version. In order to understand why, we need to take a look at both libraries. If we do, we will discover that this part of the Arduino library is actually written in plain C!
Let's start by analyzing the first sample, the efficient ASF C code. When we pull in the relevant parts of the Atmel headers, the whole program comes to this:

#include <stdint.h>
#if __AVR_ARCH__ >= 100
#  define __SFR_OFFSET 0x00
#else
#  define __SFR_OFFSET 0x20
#endif

#define _MMIO_BYTE(mem_addr) (*(volatile uint8_t *)(mem_addr))
#define _SFR_IO8(io_addr) _MMIO_BYTE((io_addr) + __SFR_OFFSET)
 
#define DDRB _SFR_IO8(0x04)
#define DDB5 5
#define PORTB _SFR_IO8(0x05)
#define PORTB5 5
 
int main(void)
{
 DDRB |= (1 << DDB5);
 PORTB |= (1 << PORTB5);
 
 while (1);
}

Defines. Defines and macros everywhere. The only way to keep performance this tight in C is to use macros and defines, so that the compiler does all the work and no intermediate computation is left for run time. Defines are unsafe, carry no type information and disappear in the preprocessor (so they can't be seen while debugging), and macros are known for hiding pretty obscure bugs. Besides, both leak into all of your code and can't be contained in namespaces or anything similar. So if you have a macro that conflicts with something else in your code, you are screwed. For a more elaborate discussion of the disadvantages of macros, see Scott Meyers's Effective C++.
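
To make the complaint a bit more concrete, here is a small sketch that runs on a PC. It is not taken from either library: SQUARE, hal::square, hal::ledBit and the two counting helpers are names I made up for illustration. It shows one of the classic macro traps (the argument is evaluated twice) next to a typed, namespaced alternative that is still resolved at compile time.

```cpp
#include <cstdint>

// A classic function-like macro: the argument is pasted in textually.
#define SQUARE(x) ((x) * (x))

namespace hal {
// Typed, scoped replacement for something like "#define LED_BIT 5".
constexpr std::uint8_t ledBit = 5;

// Typed replacement for the macro: the argument is evaluated exactly once.
constexpr int square(int x) { return x * x; }
}  // namespace hal

// Helper that counts its own calls, to expose double evaluation.
inline int nextValue(int& calls) {
    ++calls;
    return 3;
}

// How many times does the argument expression run inside the macro?
inline int callsThroughMacro() {
    int calls = 0;
    const int result = SQUARE(nextValue(calls));  // expands to two calls
    (void)result;
    return calls;
}

// And how many times through the constexpr function?
inline int callsThroughFunction() {
    int calls = 0;
    const int result = hal::square(nextValue(calls));  // a single call
    (void)result;
    return calls;
}

static_assert(hal::square(4) == 16, "usable at compile time, like the macro");
static_assert((1u << hal::ledBit) == 0x20u, "the mask is still a constant");
```

Here callsThroughMacro() reports two evaluations and callsThroughFunction() reports one, while both static_asserts are checked by the compiler, so nothing is paid at run time for the safer version.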

Now let's see the Arduino version, which with all relevant code included, looks like this:

/*
* CppTest.cpp
*
* Created: 2016-09-02 14:14:39
* Author : Technik
*/
 
#include <avr/io.h>
#include <avr/pgmspace.h>
#include <avr/interrupt.h>
 
#define HIGH 0x1
#define LOW  0x0

#define INPUT 0x0
#define OUTPUT 0x1
#define INPUT_PULLUP 0x2
 
#define digitalPinToPort(P) ( pgm_read_byte( digital_pin_to_port_PGM + (P) ) )
#define digitalPinToBitMask(P) ( pgm_read_byte( digital_pin_to_bit_mask_PGM + (P) ) )
#define digitalPinToTimer(P) ( pgm_read_byte( digital_pin_to_timer_PGM + (P) ) )
#define portOutputRegister(P) ( (volatile uint8_t *)( pgm_read_word( port_to_output_PGM + (P))) )
#define portModeRegister(P) ( (volatile uint8_t *)( pgm_read_word( port_to_mode_PGM + (P))) )
 
#define LED_BUILTIN 13

#define NOT_A_PIN 0
#define NOT_A_PORT 0

#define NOT_AN_INTERRUPT -1


#define NOT_ON_TIMER 0
#define TIMER0A 1
#define TIMER0B 2
#define TIMER1A 3
#define TIMER1B 4
#define TIMER1C 5
#define TIMER2  6
#define TIMER2A 7
#define TIMER2B 8

#define PB 2
#define PC 3
#define PD 4

#ifndef cbi
#define cbi(sfr, bit) (_SFR_BYTE(sfr) &= ~_BV(bit))
#endif
 
// these arrays map port names (e.g. port B) to the
// appropriate addresses for various functions (e.g. reading
// and writing)
const uint16_t PROGMEM port_to_mode_PGM[] = {
 NOT_A_PORT,
 NOT_A_PORT,
 (uint16_t)&DDRB,
 (uint16_t)&DDRC,
 (uint16_t)&DDRD,
};

const uint16_t PROGMEM port_to_output_PGM[] = {
 NOT_A_PORT,
 NOT_A_PORT,
 (uint16_t)&PORTB,
 (uint16_t)&PORTC,
 (uint16_t)&PORTD,
};

const uint16_t PROGMEM port_to_input_PGM[] = {
 NOT_A_PORT,
 NOT_A_PORT,
 (uint16_t)&PINB,
 (uint16_t)&PINC,
 (uint16_t)&PIND,
};

const uint8_t PROGMEM digital_pin_to_port_PGM[] = {
 PD, /* 0 */
 PD,
 PD,
 PD,
 PD,
 PD,
 PD,
 PD,
 PB, /* 8 */
 PB,
 PB,
 PB,
 PB,
 PB,
 PC, /* 14 */
 PC,
 PC,
 PC,
 PC,
 PC,
};

const uint8_t PROGMEM digital_pin_to_bit_mask_PGM[] = {
 _BV(0), /* 0, port D */
 _BV(1),
 _BV(2),
 _BV(3),
 _BV(4),
 _BV(5),
 _BV(6),
 _BV(7),
 _BV(0), /* 8, port B */
 _BV(1),
 _BV(2),
 _BV(3),
 _BV(4),
 _BV(5),
 _BV(0), /* 14, port C */
 _BV(1),
 _BV(2),
 _BV(3),
 _BV(4),
 _BV(5),
};
 
const uint8_t PROGMEM digital_pin_to_timer_PGM[] = {
 NOT_ON_TIMER, /* 0 - port D */
 NOT_ON_TIMER,
 NOT_ON_TIMER,
 // on the ATmega168, digital pin 3 has hardware pwm
 TIMER2B,
 NOT_ON_TIMER,
 // on the ATmega168, digital pins 5 and 6 have hardware pwm
 TIMER0B,
 TIMER0A,
 NOT_ON_TIMER,
 NOT_ON_TIMER, /* 8 - port B */
 TIMER1A,
 TIMER1B,
 TIMER2A,
 NOT_ON_TIMER,
 NOT_ON_TIMER,
 NOT_ON_TIMER,
 NOT_ON_TIMER, /* 14 - port C */
 NOT_ON_TIMER,
 NOT_ON_TIMER,
 NOT_ON_TIMER,
 NOT_ON_TIMER,
};

static void turnOffPWM(uint8_t timer)
{
 switch (timer)
 {
#if defined(TCCR1A) && defined(COM1A1)
 case TIMER1A:   cbi(TCCR1A, COM1A1);    break;
#endif
#if defined(TCCR1A) && defined(COM1B1)
 case TIMER1B:   cbi(TCCR1A, COM1B1);    break;
#endif
#if defined(TCCR1A) && defined(COM1C1)
 case TIMER1C:   cbi(TCCR1A, COM1C1);    break;
#endif

#if defined(TCCR2) && defined(COM21)
 case  TIMER2:   cbi(TCCR2, COM21);      break;
#endif

#if defined(TCCR0A) && defined(COM0A1)
 case  TIMER0A:  cbi(TCCR0A, COM0A1);    break;
#endif

#if defined(TCCR0A) && defined(COM0B1)
 case  TIMER0B:  cbi(TCCR0A, COM0B1);    break;
#endif
#if defined(TCCR2A) && defined(COM2A1)
 case  TIMER2A:  cbi(TCCR2A, COM2A1);    break;
#endif
#if defined(TCCR2A) && defined(COM2B1)
 case  TIMER2B:  cbi(TCCR2A, COM2B1);    break;
#endif
 }
}
 
void pinMode(uint8_t pin, uint8_t mode)
{
 uint8_t bit = digitalPinToBitMask(pin);
 uint8_t port = digitalPinToPort(pin);
 volatile uint8_t *reg, *out;

 if (port == NOT_A_PIN) return;

 // JWS: can I let the optimizer do this?
 reg = portModeRegister(port);
 out = portOutputRegister(port);

 if (mode == INPUT) {
  uint8_t oldSREG = SREG;
  cli();
  *reg &= ~bit;
  *out &= ~bit;
  SREG = oldSREG;
 }
 else if (mode == INPUT_PULLUP) {
  uint8_t oldSREG = SREG;
  cli();
  *reg &= ~bit;
  *out |= bit;
  SREG = oldSREG;
 }
 else {
  uint8_t oldSREG = SREG;
  cli();
  *reg |= bit;
  SREG = oldSREG;
 }
}
 
void digitalWrite(uint8_t pin, uint8_t val)
{
 uint8_t timer = digitalPinToTimer(pin);
 uint8_t bit = digitalPinToBitMask(pin);
 uint8_t port = digitalPinToPort(pin);
 volatile uint8_t *out;

 if (port == NOT_A_PIN) return;

 // If the pin that support PWM output, we need to turn it off
 // before doing a digital write.
 if (timer != NOT_ON_TIMER) turnOffPWM(timer);

 out = portOutputRegister(port);

 uint8_t oldSREG = SREG;
 cli();

 if (val == LOW) {
  *out &= ~bit;
 }
 else {
  *out |= bit;
 }

 SREG = oldSREG;
}
 
int main(void)
{
 pinMode(LED_BUILTIN, OUTPUT);
 digitalWrite(LED_BUILTIN, HIGH);
 
 while (true);
}

Wow. More than 200 lines of code. We start to see why this code might not be as fast as Atmel's. Every time we change the state of a pin, we run several reads from program space and a few if-else branches. This code even needs to disable interrupts. In fairness, it also does more, like checking that the port actually exists on our MCU. But if the user makes a mistake and tries to drive a pin on a port that doesn't exist, the function just returns: the error is silent, and no one will notice.

Having seen both examples of how to use C, one tuned for performance and one for usability, and having pointed out some problems at both extremes, we can now ask the question: how does C++ allow us to sort out this mess?

Our main ally will be templates. By defining a few template structures in our library, we will get the compiler to reduce the generated code while keeping readability and safety. Incidentally, we will also make most errors show up early, during compilation, instead of remaining hidden until (best case) the first execution.
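
As a rough illustration of what "errors during compilation" can look like, here is a sketch that runs on a PC. It assumes the same Arduino Uno pin layout as the tables above (pins 0 to 7 on port D, 8 to 13 on port B, 14 to 19 on port C). The names Port, Pin, portOf, bitOf and kPinCount are hypothetical, invented for this example, and are not part of Arduino or of the framework developed in this series.

```cpp
#include <cstdint>

// Hypothetical compile-time pin map for an Arduino Uno style board.
enum class Port : std::uint8_t { B, C, D };

constexpr std::uint8_t kPinCount = 20;

// Which port does a digital pin belong to?
constexpr Port portOf(std::uint8_t pin) {
    return pin < 8 ? Port::D : (pin < 14 ? Port::B : Port::C);
}

// Which bit inside that port?
constexpr std::uint8_t bitOf(std::uint8_t pin) {
    return static_cast<std::uint8_t>(
        pin < 8 ? pin : (pin < 14 ? pin - 8 : pin - 14));
}

// The pin number is a template argument, so every lookup happens in the
// compiler and an invalid pin is rejected before the program ever runs.
template <std::uint8_t pin_>
struct Pin {
    static_assert(pin_ < kPinCount, "this board has no such pin");
    static constexpr Port port = portOf(pin_);
    static constexpr std::uint8_t mask =
        static_cast<std::uint8_t>(1u << bitOf(pin_));
};

static_assert(Pin<13>::port == Port::B, "the built-in LED sits on port B");
static_assert(Pin<13>::mask == 0x20, "bit 5 of port B");
// Pin<42>::mask;  // would not compile: "this board has no such pin"
```

Compare this with the run-time check in pinMode above, where a wrong pin number simply makes the function return without telling anyone.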

Starting from the bottom, the first thing is accessing the registers. There are two types of registers in the AVR: 8-bit and 16-bit registers. For all practical purposes, you use them the same way, and all you need in order to define them is their size and their position in memory. Both are known at compile time, so they will be template arguments. As most functionality is common, it will live in a shared template base class. And since all the state is stored in the register itself, the class will have no data members. We could even make it fully static, but we would lose the assignment operators, which help a lot with readability in this case.
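
As a preview of that design, here is a minimal sketch that runs on a PC, under a few loud assumptions: the AVR data space is replaced by a plain array (simulatedMemory) so nothing touches real hardware, the names RegisterBase, Register8 and Register16 are mine, two aliases stand in for the derived classes, and 16-bit values are stored low byte first. Some real 16-bit AVR registers also require a specific byte access order, which this sketch ignores.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Stand-in for the AVR data space so the sketch runs on a PC.
// On the real chip the address would be dereferenced directly.
inline volatile std::uint8_t* simulatedMemory() {
    static volatile std::uint8_t bytes[256] = {};
    return bytes;
}

// Shared implementation: width (RegT_) and position (address_) are
// template arguments, so the object itself stores nothing.
template <typename RegT_, std::uint16_t address_>
struct RegisterBase {
    static_assert(std::is_unsigned<RegT_>::value, "registers are unsigned");
    static_assert(address_ + sizeof(RegT_) <= 256u, "outside simulated memory");

    // Write byte by byte, low byte first.
    void operator=(RegT_ value) const {
        for (std::size_t i = 0; i < sizeof(RegT_); ++i) {
            simulatedMemory()[address_ + i] =
                static_cast<std::uint8_t>(value >> (8 * i));
        }
    }

    // Read the bytes back and reassemble the value.
    operator RegT_() const {
        RegT_ value = 0;
        for (std::size_t i = 0; i < sizeof(RegT_); ++i) {
            const RegT_ byte = simulatedMemory()[address_ + i];
            value = static_cast<RegT_>(value | (byte << (8 * i)));
        }
        return value;
    }
};

// The two register flavours differ only in their template arguments.
template <std::uint16_t address_>
using Register8 = RegisterBase<std::uint8_t, address_>;
template <std::uint16_t address_>
using Register16 = RegisterBase<std::uint16_t, address_>;

// "The class will have no data members": the compiler can confirm it.
static_assert(std::is_empty<Register8<0x24>>::value, "no per-object state");
static_assert(std::is_empty<Register16<0x84>>::value, "no per-object state");
```

With this in place, something like `Register16<0x84> counter; counter = 0x1234;` reads naturally thanks to the assignment operator, which is the readability argument made above. The address 0x84 is only an example value.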

First, we get rid of the DDRB register define

#define DDRB _SFR_IO8(0x04)

and replace it with a struct. This way we drop one macro, gain type safety, can limit access with namespaces, and keep all the flexibility. Everything at once.

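struct DDRBRegister {
 // Write the whole register.
 void operator=(uint8_t _r)
 {
  *reinterpret_cast<volatile uint8_t*>(0x24) = _r;
 }
 // Read the register as a plain value.
 operator uint8_t() const
 {
  return *reinterpret_cast<volatile uint8_t*>(0x24);
 }
 // Expose the register as an lvalue, so that |= and &= keep working.
 operator volatile uint8_t&()
 {
  return *reinterpret_cast<volatile uint8_t*>(0x24);
 }
} DDRB;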

So far it's all advantages, and the generated code is still exactly the same two assembly instructions. But we can generalize it and extend it to other registers.

template<uint16_t address_>
struct Register {
 void operator=(uint8_t _r)
 {
  *reinterpret_cast<volatile uint8_t*>(address_) = _r;
 }
 operator uint8_t() const
 {
  return *reinterpret_cast<volatile uint8_t*>(address_);
 }
 operator volatile uint8_t&()
 {
  return *reinterpret_cast<volatile uint8_t*>(address_);
 }
};
 
Register<0x24> DDRB;
Register<0x25> PORTB;

To improve readability, we can add methods for bit setting and clearing.

template<uint16_t address_>
struct Register {
 void operator=(uint8_t _r)
 {
  *reinterpret_cast<volatile uint8_t*>(address_) = _r;
 }
 operator uint8_t() const
 {
  return *reinterpret_cast<volatile uint8_t*>(address_);
 }
 operator volatile uint8_t&()
 {
  return *reinterpret_cast<volatile uint8_t*>(address_);
 }
 
 // The bit index is a template argument, so the mask is a compile-time constant.
 template<uint8_t bit_>
 void setBit() { *reinterpret_cast<volatile uint8_t*>(address_) |= (1 << bit_); }
 template<uint8_t bit_>
 void clearBit() { *reinterpret_cast<volatile uint8_t*>(address_) &= ~(1 << bit_); }
};
 
Register<0x24> DDRB;
Register<0x25> PORTB;

Since the bit is a template parameter, the compiler still resolves each of these calls to a single instruction, and the user's code becomes:

int main(void)
{
 DDRB.setBit<DDB5>();
 PORTB.setBit<PORTB5>();
 
 while (1);
}

That's exactly as fast as the original code, a bit more readable, and way safer. We could still get rid of the last defines, DDB5 and PORTB5, by converting them to static constexpr constants, but there are better ways to tackle that. We are still far from Arduino's ease of use, which is our goal, and this code still leaves room for many pitfalls, but this post is quite long already, so we will address all that in a second part.
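
For the curious, the plain constexpr option could look roughly like the sketch below. It is only one possible shape, not necessarily what the second part will use. To keep it runnable on a PC, the I/O space is simulated with an array (simulatedIo), and the avr namespace with its lowercase names ddrb, portb, ddb5 and portb5 is invented for this example. As a small bonus, a static_assert rejects bit indices that don't exist in an 8-bit register.

```cpp
#include <cstdint>

// Stand-in for the AVR I/O space so the sketch runs on a PC. On the real
// chip the address would be dereferenced directly, as in the listings above.
inline volatile std::uint8_t& simulatedIo(std::uint16_t address) {
    static volatile std::uint8_t space[256] = {};
    return space[address];
}

template <std::uint16_t address_>
struct Register {
    static_assert(address_ < 256, "outside the simulated I/O space");

    template <std::uint8_t bit_>
    void setBit() const {
        static_assert(bit_ < 8, "an 8-bit register only has bits 0 to 7");
        volatile std::uint8_t& reg = simulatedIo(address_);
        reg = static_cast<std::uint8_t>(reg | (1u << bit_));
    }

    template <std::uint8_t bit_>
    void clearBit() const {
        static_assert(bit_ < 8, "an 8-bit register only has bits 0 to 7");
        volatile std::uint8_t& reg = simulatedIo(address_);
        reg = static_cast<std::uint8_t>(reg & ~(1u << bit_));
    }

    operator std::uint8_t() const { return simulatedIo(address_); }
};

// Typed constants in a namespace instead of global defines.
namespace avr {
constexpr Register<0x24> ddrb{};
constexpr Register<0x25> portb{};
constexpr std::uint8_t ddb5 = 5;
constexpr std::uint8_t portb5 = 5;
}  // namespace avr

// Same two steps as the LED example, without a single macro.
inline void lightBuiltinLed() {
    avr::ddrb.setBit<avr::ddb5>();
    avr::portb.setBit<avr::portb5>();
}
```

A call such as `avr::ddrb.setBit<9>()` would now be a compile error instead of a silently wrong mask.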

2 comments:

  1. What's important is not the execution time but the development time. C++ is too complicated, too sophisticated, and requires programmers who are paid twice as much as C programmers. I worked with a customer that used a C++ RT generator, and they had a C++ guru who set quarter-of-an-hour meetings a month ahead just to answer questions. In the end, they dumped him and C++ and went into production. The total lifetime cost of a C++ project is around ten times that of a C one. I still remember the graphics drivers of ATI cards, written in C++, getting updates every week for years and never really working. Yep, 100MB just for drivers!

    1. I understand that concern. In this article I'm trying to debunk the myth that many people share, saying C++ is just too slow. In the second part, however, I will address readability, and will talk about how C++ can actually make your code simpler and less bug-prone (it's bugs that cost you time and money).
