Pushing Object Detection to Home-Assistant with Coral EdgeTPU

Forget the TensorFlow Component

Lots of awesome developers have added image processing components to Home-Assistant’s integration list. In fact, there are now 9 different image processors (a few that do more than just object detection) built right into Home-Assistant. I wrote the first version of the OpenCV image processing component, which was not the best given my lack of experience in Python at the time, but it obviously triggered some ideas in others – which is one of the amazing parts of open source software! However, most computers (and servers) are just not built to perform inference analysis, so the components that are self-hosted, like the OpenCV integration, just aren’t very efficient.

Enter Coral, the EdgeTPU from Google

Plenty of vendors offer cloud-based inference engines, but Google has also released hardware for performing inference at the “edge” (i.e. locally): the Google Coral Dev Board and the USB Accelerator (see all products here). The Coral Dev Board is similar to a Raspberry Pi with a TPU on board, and the USB Accelerator is an external TPU that connects over USB 3. A TPU is a Tensor Processing Unit, a chip designed specifically for processing tensors, or n-dimensional arrays. Tensors are basically mathematical representations of real-world patterns; they are, in basic terms, used for pattern matching. There are some alternatives to the EdgeTPU by Google, like the Intel Neural Compute Stick, but none seems to have the community backing that the EdgeTPU does.

Pushing State to Home-Assistant

My house was built around 1970, and it’s quite obvious any “repairs” (the term is used loosely) by the previous owners were done by those without the know-how. The doorbell button looks like it’s from the ’70s. I experimented with a Z-Wave doorbell, but it just wasn’t loud enough: my partner works from home and his office is in the basement. We already had a UniFi camera mounted above the door, so why not let the house tell us when someone was at the door? So, I implemented the OpenCV integration with Home-Assistant, and then tried the TensorFlow integration. I had to throw an extra 4 cores at the VM to get even semi-reliable results. When it worked, it took a few seconds before it would even trigger a notification, which frustrated our DoorDash drivers, and we had no idea which delivery company had dropped off packages because the driver had already left the frame of view.

Why was the reliability of the integrations such an issue? Well, for one, inference was running on Xeon processors (not exactly top of the line) and, two, those integrations were polling, only updating when the loop requested their state. I lived with it but hated it.

When I discovered the EdgeTPU, I ordered both a Coral Dev Board and a USB Accelerator; I had plenty of Raspberry Pis lying around, so I was sure I could put them to use. Of course, the idea got back-logged behind all of my other projects, like implementing my Distributed, Modular State Machine. I finally got around to it this last weekend.

The 1st Pass

I wanted the Raspberry Pi to push the state to Home-Assistant in order to get more immediate results. So the application was designed to consume an RTSP video stream and perform object detection on the frames. Watching the logs, I couldn’t believe how fast it was; each loop (retrieve frame, process, and push the state to Home-Assistant) appeared to take around a second.
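
As a rough illustration of what “pushing the state” can look like, here is a minimal sketch that assumes Home-Assistant’s REST API (a POST to /api/states/<entity_id> with a long-lived access token). The post doesn’t say which transport the application actually used, and the host, token, entity ID and attribute below are made-up placeholders. The sketch only builds the request; sending it is left as a comment.

```python
import json
import urllib.request


def build_state_request(base_url, token, entity_id, state, attributes=None):
    """Build (but do not send) a POST to Home-Assistant's /api/states/<entity_id>."""
    payload = {"state": state, "attributes": attributes or {}}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/states/{entity_id}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical values, for illustration only.
request = build_state_request(
    "http://homeassistant.local:8123",
    "LONG_LIVED_TOKEN",
    "binary_sensor.front_door_person",
    "on",
    {"confidence": 0.87},
)
# urllib.request.urlopen(request, timeout=5) would actually send it.
```

Because the detector decides when to send, Home-Assistant sees the change as soon as a frame is processed instead of waiting for its next poll.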

The logs, however, were very misleading. While I watched the logs and Home-Assistant, I stepped in front of the camera. Only, it took a couple seconds for it to detect a person was in the frame; when I left the view of the camera, it reported that a person was in the frame for close to five to six seconds. It was way better than using the Home-Assistant integrations, but it definitely puzzled me.

OpenCV VideoCapture Implementation

The code was written to run 1:1, 1 thread to 1 camera. The camera stream was fed to OpenCV’s VideoCapture class, and continually looped while the connection was open. I found on some forums the answer to why I was experiencing such delay: when you call the VideoCapture::read() function, it provides the next frame in the buffer, not the most recent frame. This wouldn’t be much of a problem if your processing could keep up with the frame rate of the video stream; if you can’t keep up with the frame rate, you experience lag, as I was.
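
To put numbers on that, here is a small back-of-the-envelope sketch. It is a simplified model, not a measurement: it assumes an unbounded FIFO buffer, a constant frame rate, and a constant inference time, and the function name and figures are invented for illustration. The k-th read returns the k-th captured frame, so the gap between “now” and “when the frame was captured” grows with every pass.

```python
def fifo_staleness(camera_fps, seconds_per_inference, inference_count):
    """Age (in seconds) of the frame handed to each inference when frames
    are consumed strictly in FIFO order, one per inference."""
    frame_interval = 1.0 / camera_fps
    lags = []
    for k in range(inference_count):
        read_time = k * seconds_per_inference  # when the k-th read happens
        captured_time = k * frame_interval     # FIFO: k-th read gets k-th frame
        # If we are faster than the camera, the read simply waits: no lag.
        lags.append(max(0.0, read_time - captured_time))
    return lags


# A 4 fps stream with one second per inference (illustrative numbers).
lags = fifo_staleness(camera_fps=4, seconds_per_inference=1.0, inference_count=5)
print(lags)  # [0.0, 0.75, 1.5, 2.25, 3.0]
```

In practice the buffer is bounded, so the lag levels off instead of growing forever, which fits a delay that settles at a few seconds.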

Attempting to work around this limitation, I found you could retrieve the number of frames in the buffer and set the current frame index. Unfortunately, this led to around 4-5 seconds per frame: still better than the Home-Assistant integrations, but completely unacceptable for replacing a doorbell! The answer I found somewhere deep in Stack Overflow (I found it).

Fun with Thread Synchronization

Have two threads for each video stream. The first thread continually pops the oldest frame off of the buffer and the second processes the, hopefully, current frame. Since there’s a shared resource involved, you can’t have both threads popping the VideoCapture’s buffer queue; no, you need to synchronize the shared resource, otherwise you run into concurrency issues. Concurrency issues, depending on the context and implementation, could crash your application, cause a thread to grab expired data, or even grab data that mutates later!

So we have two threads per video stream: the “Grabber” thread and the “Processor” thread. Grabbing a frame from the buffer and discarding it takes, essentially, no time at all, while processing a frame from the buffer could take a bit (the term “bit” is used loosely here). So which thread should be the one to tell the other “Hey dude, it’s my turn!”?

Whenever a thread wants to read from the buffer, it must tell the other “Hol’ up, yo!” to prevent some of the concurrency issues mentioned above. While the thread is chatting away with the buffer, the other thread is waiting… patiently or impatiently, kinda depends on how late they are to their next appointment. Imagine the UI thread is waiting: all of a sudden the user sees a frozen screen (and most likely bitches loudly to their cube mates). For this reason, threads should quit the chit-chat and let the next thread do what it needs to do!

To accomplish this behavior, we use a shared Lock: a local, domain-specific object that identifies who has the right to access shared resources across separate threads. A Lock, while similar, is different from a Mutex, which usually relates to system processes, though some people (and languages) use them interchangeably (they probably mean Semaphore). When a thread wants to access a shared resource, it attempts to acquire the Lock, waiting (sometimes impatiently) until it gets it, which is precisely the reason a lock should be released as soon as possible.
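
For a feel of the acquire/release dance in isolation, here is a minimal sketch using Python’s threading.Lock. The shared dictionary and the function name are made up for the example; the point is that the with block acquires the lock, does the smallest possible amount of work, and releases it right away (even if an exception is raised).

```python
import threading

counter_lock = threading.Lock()
shared = {"reads": 0}  # stand-in for any resource two threads both touch


def touch_shared_resource(times):
    for _ in range(times):
        # Blocks until the lock is free; released automatically on exit.
        with counter_lock:
            shared["reads"] += 1


threads = [
    threading.Thread(target=touch_shared_resource, args=(10_000,))
    for _ in range(4)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

Without the lock, the four threads could interleave their read-modify-write steps and lose updates; with it, every increment is counted.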

Back to the topic at hand: when the Processor thread has received its frame from the buffer, it immediately relinquishes the lock so it can process it – while the Grabber thread happily gifts the buffer’s oldest frames to the garbage collector – until the Processor needs its fix from the frame.

What the hell did I just read?

Exactly how to handle a FIFO buffer between discrete processes…

The Grabber thread:

while self._video_stream.isOpened():
    self.lock.acquire() # Blocking action, wait for lock to be free
    self._video_stream.grab()
    self.lock.release() # Put the lock up for grab

The Processor thread:

while self._video_stream.isOpened():
    self.lock.acquire() # Blocking action, wait for lock to be free
    frame = self._retrieve_frame()
    self.lock.release() # Put the lock up for grabs
    if frame is None:
        time.sleep(FRAME_FAILURE_SLEEP)
        continue # No frame yet; try again on the next pass

    detection_entity = self._process_frame(frame)

    self._set_state(detection_entity)
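
To see the two loops working together without a camera attached, here is a self-contained sketch. Everything in it is an assumption made for illustration: FakeStream is a hypothetical stand-in that mimics the grab()/retrieve()/isOpened() shape of cv2.VideoCapture, LatestFrameReader is an invented name, the sleeps fake the camera interval and the inference time, and the “detection” is just recording the frame number.

```python
import threading
import time


class FakeStream:
    """Hypothetical stand-in for cv2.VideoCapture (frames are just integers)."""

    def __init__(self, total_frames):
        self._total = total_frames
        self._index = -1  # nothing grabbed yet

    def isOpened(self):
        return self._index < self._total - 1

    def grab(self):
        if not self.isOpened():
            return False
        self._index += 1  # advance to the next frame, discarding the old one
        return True

    def retrieve(self):
        if self._index < 0:
            return False, None
        return True, self._index  # "decode" the most recently grabbed frame


class LatestFrameReader:
    def __init__(self, stream):
        self._stream = stream
        self._lock = threading.Lock()
        self.processed = []

    def _grab_loop(self):
        while self._stream.isOpened():
            with self._lock:  # hold the lock only for the grab itself
                self._stream.grab()
            time.sleep(0.001)  # simulated camera frame interval

    def _process_loop(self):
        while self._stream.isOpened():
            with self._lock:  # hold the lock only while fetching the frame
                ok, frame = self._stream.retrieve()
            if not ok:
                time.sleep(0.001)  # no frame yet; back off and retry
                continue
            time.sleep(0.01)  # simulated slow inference, outside the lock
            self.processed.append(frame)

    def run(self):
        threads = [
            threading.Thread(target=self._grab_loop),
            threading.Thread(target=self._process_loop),
        ]
        for thread in threads:
            thread.start()
        for thread in threads:
            thread.join()


reader = LatestFrameReader(FakeStream(total_frames=200))
reader.run()
print(f"processed {len(reader.processed)} of 200 frames")
```

On a typical run the Processor only sees a fraction of the 200 frames, and the frame numbers it does see jump forward, which is exactly the point: it always works on whatever the Grabber fetched last rather than wading through a backlog.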