UART reception CPU loading

We used to do UART reception byte by byte using the uart callback function. That proved to be rather CPU intensive and topped out at about 57600 baud to stay reliable.
We’ve now switched to using a HW pin to indicate frame start and end, using the pin-state trigger in combination with uart.readblob(). This already proves to be much, much faster…
Question: on the sending side we can easily go beyond several Mbaud (DMA controlled, with a PCB layout that has been optimised with short traces). Reception proves to be reliable, but I wanted to get a feel for the CPU load created by these high baud rates. Are you also using DMA transfers underneath, offloading the CPU, or are we clogging the CPU with these high rates?

For those interested, the code below gives a reliable back-and-forth UART data exchange at almost 2Mbaud. Amazing! The host uP is using DMA for TX and RX. Haven’t tried any faster, as the imp004 tops out at 3Mbaud. We’re using the RS-485 transmit-pin function to generate the frame sync signal back to the host uP (thanks guys, this settxactive function is what many of us have been waiting for for years!)
I’m still interested in better understanding the impact on CPU loading when using these high rates.

msg <- blob(256);

function SendImpToHostMsg()
{
    PcUart.write(msg);
    server.log(format("Sent blob with %d bytes", msg.len()));
}

function HostToImpMsgHandler()
{
    if (pcHostTxSync.read() == 1)   // start of the frame
    {
        PcUart.flush();             // block until any pending TX of ours has drained
    }
    else                            // end of the frame
    {
        msg = PcUart.readblob();
        server.log(format("Received blob with %d bytes", msg.len()));
        imp.wakeup(0.2, SendImpToHostMsg);
    }
}

function Init_Hardware()
{
    if (imp.info().type == "imp004m")
    {
        server.log("[Init_Hardware] Configuring HW for IMP004m");
        PcUart <- hardware.uartBCAW;
        PcUart.setrxfifosize(284);
        PcUart.settxfifosize(284);
        PcUart.configure(1843200, 8, PARITY_NONE, 1, NO_CTSRTS);

        pcHostTxSync <- hardware.pinD;
        pcHostRxSync <- hardware.pinW;
        // TX frame sync: settxactive asserts pcHostRxSync around our own transmissions
        pcHostRxSync.configure(DIGITAL_OUT);
        PcUart.settxactive(pcHostRxSync, 1, 2000, 2000);
        // RX frame sync: the host toggles pcHostTxSync at frame start/end
        pcHostTxSync.configure(DIGITAL_IN, HostToImpMsgHandler);
    }
    else if (imp.info().type == "imp005")
    {
        server.log("[Init_Hardware] Configuring HW for IMP005");
    }
}

Init_Hardware();

I’m actually a little amazed myself. That is much faster than we’ve ever tested imp UARTs at. The imp does not use DMA for UARTs, and on imp004 and earlier there isn’t even a hardware FIFO. (The imp005 and impC001 UARTs have hardware FIFOs.)

The imp004 CPU, at 96MHz, is actually a little slower than the ones in imp001-003. (It’s much lower-power, though.) At 1,843,200 baud it’ll be servicing an interrupt every 520 machine cycles. Good idea to use readblob() to cut down the Squirrel overhead – there’s definitely no way it can run bits of Squirrel every 520 cycles.

So yes, you probably are “clogging” the CPU, at least during packet reception. But that’s fine so long as the CPU doesn’t have anything better to do at that time. A 284-byte packet only takes 1.5ms at that baud rate, so even if packet reception takes 100% CPU (which it must be close to), everything else in the system should easily tolerate that 1.5ms of extra latency.
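For reference, the arithmetic behind those figures, assuming the usual 10 bits on the wire per byte (start + 8 data + stop):

byte time   = 10 / 1,843,200 baud ≈ 5.4µs ≈ 520 CPU cycles at 96MHz
packet time = 284 × 10 / 1,843,200 ≈ 1.54ms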

You probably already know this, but: the imp hardware generates UART baud rates by dividing down the master clock. So not all baud rates are achievable; the uart.configure() call rounds the requested baud-rate to the nearest available one. The faster the rate is, the further apart those available options are. The return value from uart.configure() is the actual baud rate that the imp will be using, so, if you haven’t already, you should quickly check that this is close to what the UART peer is set to. (But if it’s actually working, it must be fairly close.)
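A quick sketch of that check, reusing the PcUart setup from the code above:

local actual = PcUart.configure(1843200, 8, PARITY_NONE, 1, NO_CTSRTS);
server.log(format("Requested 1843200 baud, got %d baud", actual));
// if the two ends differ by more than a couple of percent, framing will start to fail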

We’ve had a story in our backlog for a while about exposing device CPU usage statistics to customers, but it’s not so obvious how to do so in a way that’s actually useful. Here, for instance, the interesting statistic is “how much CPU is being used during those 1.5ms bursts?”, so it’d be no help reporting that, say, the CPU was 30% used over the past second.

Peter

Thanks Peter.
There’s actually no need for us to use such a high speed; I was only curious about what it would do to the system. We’ll probably be running at 230400 or so, but it’s good to know there’s plenty of headroom.

The host uP has fractional baud-rate capability, so we’ll probably tune both sides to get as close a match as possible between them.

One more statistic: the above runs stably with a message exchange back and forth every 50ms (we let it run overnight and processed > 1,000,000 messages without a single byte error). Obviously there’s nothing else running in the Squirrel, so this is not a real-world capability, but it’s still impressive.

Would it be an idea to consider DMA for UARTs in impOS? On our CPU (LPC4078 @ 120MHz), this scenario takes around 0.1% CPU load :slight_smile: one interrupt for every packet…

Well, the hardware does support UART DMA (except on imp005), so it’s always been a possibility – but we never really had a customer user-story that required implementing it. And with respect, seeing as you did just mention how well the existing implementation already works for you, we kinda still don’t :slight_smile:

Peter

No need for us to change it; we only switched from interrupt-driven to DMA (which, by the way, is a lot simpler than the interrupt way on the LPC) to make the whole system more deterministic and less sensitive to the actual use of the serial connection.

I just noticed that your code uses NO_CTSRTS. That basically means that, not only is there an interrupt every 520 cycles, the interrupt latency can’t rise above 520 cycles without losing bytes. It’s quite impressive that we manage that at all, and I really don’t think that we can guarantee it in all circumstances, nor can we promise not to regress it in future releases. Surely UARTs going that fast ought to have RTS/CTS wired up?

Peter

The particular UART we use for this channel on the Cortex-M4 doesn’t have RTS/CTS pins (only one of the four available UARTs has them). We try to ‘clear the field’ on both sides with the HW pin handshaking that prepares both sides for reception, but you’re right, there’s a danger at these high speeds. As said, we’ll probably reduce it to 1/10 of this speed, so probably no problem. By enlarging the window defined by these sync signals, we can allow for even larger latency if that were required.
The problem we had in the past was that, as this is just binary data and there’s no such thing as a ‘special character’ to mark the start and end of the frame, we had to use a crude gap detection on the imp side that often failed to trigger, with all the consequences that come with that. This setup is going to be much more reliable.

Out of interest, how were you doing the reads “byte by byte”? The recommendation is always to empty the UART buffer in the callback, which is easily done with a readblob (which is indeed much higher performance than a loop reading a byte at a time).

This way, you don’t fall behind with all the handling involved in going into and out of a callback.
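A minimal sketch of that pattern (the UART choice and buffer handling are placeholders):

rxBuffer <- blob(0);
PcUart <- hardware.uartBCAW;

function rxCallback()
{
    // one readblob() call empties everything the buffer has accumulated,
    // so Squirrel runs once per burst rather than once per byte
    rxBuffer.writeblob(PcUart.readblob());
}

PcUart.configure(115200, 8, PARITY_NONE, 1, NO_CTSRTS, rxCallback);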

As far as I remember the history, we initially used the ‘while (c = uart.read()) {…}’ kind of callback, emptying the FIFO one byte at a time. That caused a certain loading, as callback execution was fast enough to catch almost every new incoming byte, so the ‘while’ didn’t loop very often. That made it top out at 57600, above which we started to see CRC errors in the frame. Replacing uart.read() with readblob() didn’t make much of a difference. Just guessing, but I think the very frequent execution of the callback is what caused the problem.

What I realise now, after thinking it through, is that we should have disabled the callback for, say, 80msec after every read/readblob (disable it after the read and enable it again within an imp.wakeup handler). Then the FIFO would do its work in the background, grabbing the frame, and the callback would be called far less frequently, reducing the load. I only realise that now, so I haven’t tested it. I’m not sure, for instance, whether the callback gets called immediately when you enable it while there are bytes in the FIFO, even when no new bytes are coming in on the serial port. You might not need that, though, as you can also read out the FIFO in the imp.wakeup lambda function right after re-enabling the uart callback.
I’d just never thought about such dynamic activation of the uart callback…

I am going to try it out though, as our older HW doesn’t have the HW sync pins we’re using now. As the frames come in on a regular 100msec timeslot, it would be pretty easy to time correctly.

Just tried it with the code below… it doesn’t work. You can’t call uart.configure() without corrupting what is already in the FIFO, so you can’t disable the callback dynamically after the first trigger at frame start.
The code below results in a blob that is not only smaller than expected, it also contains garbage… Too bad.

Retried the normal way with readblob() instead of read(). At 57600 that results in blobs of 1 and sometimes 2 bytes being added; at 115200 it adds 2 or 3, etc. It basically means that the callback is called for virtually every byte up to 57600 baud, and then, as the speed increases further, with increasing chunk sizes at the same frequency, resulting in serious load on the imp at higher speeds. Is there another way of disabling the callback after the first trigger? That would solve the load problem IMHO.

msg <- blob(256);   // frame buffer, as in the earlier listing
cnt <- 0;           // received-frame counter

function nextUartRead()
{
    local pos = msg.tell();
    msg.writeblob(PcUart.readblob());   // read what's been captured in the FIFO and append it to the blob
    if (pos < msg.tell())
    {
        server.log(format("Received chunk with %d bytes", msg.tell() - pos));
        imp.wakeup(0.02, nextUartRead); // something was added from the FIFO => try grabbing some more in 20 msec
    }
    else    // nothing new came in during the last 20 msec => assume frame end
    {
        // do what needs to be done with the frame
        server.log(format("Received blob with %d bytes (%d)", msg.tell(), cnt++));
        // re-activate the uart callback for the next frame
        PcUart.configure(115200, 8, PARITY_NONE, 1, NO_CTSRTS, uartCallback);
    }
}

function uartCallback()
{
    // disable the callback immediately (reconfiguring is the only way, but it corrupts the FIFO)
    PcUart.configure(115200, 8, PARITY_NONE, 1, NO_CTSRTS);
    // move the frame buffer to the start
    msg.seek(0);
    // read what's already there
    msg.writeblob(PcUart.readblob());
    server.log(format("Received chunk with %d bytes", msg.tell()));
    // schedule the next FIFO readout; in the meantime incoming bytes are caught in the FIFO
    imp.wakeup(0.02, nextUartRead);
}

// optimised without using the HW handshaking
function Init_Hardware2()
{
    if (imp.info().type == "imp004m")
    {
        server.log("[Init_Hardware] Configuring HW for IMP004m");
        PcUart <- hardware.uartBCAW;
        PcUart.setrxfifosize(284);
        PcUart.settxfifosize(284);
        PcUart.configure(115200, 8, PARITY_NONE, 1, NO_CTSRTS, uartCallback); // 230400 also worked well; 8 data bits, no parity, 1 stop bit
    }
    else if (imp.info().type == "imp005")
    {
        server.log("[Init_Hardware] Configuring HW for IMP005");
    }
}

Init_Hardware2();

Counterintuitively, string manipulation is generally faster than using blobs. You may get better throughput with uart.readstring() and the + operator.
Also, imp.wakeup(0.02) takes 30ms to trigger, not 20ms. Try using 0.01 instead.
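A rough sketch of the string-based variant (untested; rxString is a placeholder, and PcUart is from the earlier listings):

rxString <- "";

function rxCallback()
{
    // readstring() drains the receive buffer just as readblob() does;
    // the chunk is then appended with the + operator
    rxString = rxString + PcUart.readstring();
}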

I’d also note that you shouldn’t be calling server.log in any time critical code. This causes a lot of work (pushing the logging out to the network).

Plus, you are queueing multiple callbacks. You should cancel the previous callback if one was set up before setting up a new one - no need to reconfigure the UART callback, just send the completed blob to a handler with an imp.wakeup(0,…).
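A minimal sketch of that pattern, assuming the msg blob and PcUart from the earlier listings (the 10ms gap timeout is arbitrary):

gapTimer <- null;

function frameComplete()
{
    // frame handling runs outside the time-critical callback path
    server.log(format("Frame of %d bytes", msg.tell()));
    msg.seek(0);
}

function uartCallback()
{
    msg.writeblob(PcUart.readblob());
    // cancel any previously queued gap timer instead of stacking them up
    if (gapTimer != null) imp.cancelwakeup(gapTimer);
    gapTimer = imp.wakeup(0.01, function() {
        gapTimer = null;
        // no new data within the gap => frame ended; hand the blob to a handler
        imp.wakeup(0, frameComplete);
    });
}

PcUart.configure(115200, 8, PARITY_NONE, 1, NO_CTSRTS, uartCallback);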

I’ll try and write some code to show this this weekend.

Hugo’s right about server.log(). In Squirrel, the handlers registered via uart.configure() are the closest we get to interrupt handlers. They need to be as lean as possible.

@Hugo + @Peter, In my pre-Electric Imp designs I often dedicated a pin to showing how loaded the CPU was. Each time the CPU went idle I’d set the pin high, dropping it low whenever it left the idle state. Using an oscilloscope or external sampler let me see how different operations affected loading. For development devices, could you consider a mode where the bi-colour LED is used to mimic CPU loading? The mode would only be engaged when enabled in Squirrel and wouldn’t persist after a restart. For something smarter, the red LED could be used for CPU and the green for something else (e.g. TCP receive-buffer emptiness). These are things that we can’t satisfactorily achieve through Squirrel alone. The code required for this would be small. :wink:

Oh that’s not a bad idea (the pin part at least – less convinced about the LED part). It’d mean you couldn’t do it remotely, of course, but it pushes all the awkward decisions about sampling rates, time-averaging, etc., to the customer where they belong, instead of us trying to be all things to all people.

Peter

@hugo, I obviously know about not using server.log() in time-critical code; those calls were just there to verify the correct operation of the ‘algorithm’ of executing the uart callback only once and then reading the incoming stream in chunks. Without those logs, the result is the same.
The base problem is, and remains, that up to about 57600 baud on a not-too-loaded system, whether you use uart.read() or uart.readblob(), the Squirrel callback is triggered for almost every incoming byte, creating a lot of unnecessary load on the CPU. It would be much better if it could trigger on the first byte received and then become disabled, so that, say, 50-100msec later the whole frame can be retrieved with one readblob(). That’s only two Squirrel calls. That’s not possible though, because from the moment you call uart.configure() during reception (in an attempt to disable the callback), what ends up in the msg blob is garbage. Not sure why, but this is what the Squirrel code above shows. Probably something to do with re-initialisation of pointers to the reception buffer in the impOS code underneath when making the configure API call.

An extra API call, uart.disablecallback(), that doesn’t change the configuration but only disables the callback would be handy and probably not that difficult to implement. The way I usually do that in bare-metal C code is to point the function pointer representing the callback to NULL and test against NULL before executing it in the rest of the code base. Just a few lines of code :slight_smile:
@peter, absolutely agreed. I think that by the time a developer’s code becomes so complex that you need this kind of ‘assistance’, you probably also have the necessary equipment, such as a logic analyser, at your disposal. For instance, the reason I know the code above doesn’t work as you would expect is that I can verify the incoming byte stream with such an analyser, and I know it to be correct. I test every time-critical piece of embedded code with either an LA or a scope using a spare pin, so if that were possible on the imp, it would be a good step forward in profiling execution.
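For what it’s worth, the spare-pin trick can already bracket the Squirrel side of a handler today (pinA here is just an example of a free pin, and msg/PcUart are from the earlier listings; this won’t capture time spent inside impOS itself):

busyPin <- hardware.pinA;
busyPin.configure(DIGITAL_OUT, 0);

function instrumentedCallback()
{
    busyPin.write(1);                  // scope/LA marker: handler entry
    msg.writeblob(PcUart.readblob());
    busyPin.write(0);                  // scope/LA marker: handler exit
}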