Last year, Microsoft announced Teams Intelligent Cameras, a feature for Microsoft Teams Rooms that purported to use AI (artificial intelligence) to improve the meeting room video experience for remote attendees.
Intelligent Cameras originally promised to deliver “an equitable meeting experience” using unique technologies: AI-powered speaker tracking, multiple video streams that place each in-room participant in their own video frame, and people (face) recognition.
In the announcement, Microsoft promised to use facial movements and gestures to understand who is speaking, and to use facial recognition to show people’s names in the meeting roster, all delivered using Microsoft’s own technology.
Unfortunately, although Microsoft promised these features, Teams Intelligent Cameras has launched to GA (General Availability) and under-delivers on both the promise and the execution. This is bad news for everyone buying Microsoft Teams Rooms.
What has Microsoft announced?
The big change is that now Intelligent Cameras “harness OEM designed AI capabilities” from Logitech, Neat, Jabra, and Poly. That’s the same set of OEM-designed features Microsoft announced in July 2021 would be coming to Teams by the end of 2021, separately from the Intelligent Cameras feature.
Let that sink in for a minute; Microsoft has gone from aiming to “deliver an equitable experience” using “Microsoft AI facial recognition technology” that they will make “available to OEMs”… to grouping people-framing capabilities from several vendors under a single banner.
And, whilst Microsoft announced it as GA at Inspire, the headline feature of showing a separate video feed for each in-room attendee isn’t expected to arrive for “some months”, with no word on when person recognition will arrive.
It clearly isn’t GA: the documentation doesn’t appear to be available (at the time of writing), and the feature might not even be advanced enough to show in beta, because the demo video is a mock-up, given away by the misspelling of a name in the meeting roster:
Reading between the lines, what Microsoft is saying is that if you have bought kit such as Neat Symmetry, then you can use its “Auto Framing” feature with Teams, just like you can with Zoom, under the banner of “Intelligent Cameras”.
Microsoft should have kept the feature inside Teams
Despite what meeting room vendors would like you to believe, auto-framing and grouping of people isn’t a ground-breaking innovation.
If you’ve used a Snapchat filter or one of many apps on a phone that detect a face and its landmarks (the key points on a human face) and then overlay something on it, then you’ve seen AI at work in real time. Chipset vendors like Qualcomm have been including AI accelerators in their system-on-a-chip designs for years, allowing developers to use trained models in their applications and deliver extremely fast and capable facial recognition.
And if you’ve used Teams with Together Mode, or even background blur on a PC, then you’ve also seen similar technology at work; Teams needs the AVX (or AVX2) extensions to detect the person, their face, and their facial outline so that the background can be replaced. Face detection (what’s needed to determine where people are) can be performed in ~3-5ms even with an elderly CPU.
Simply put, Microsoft could have delivered the feature in-house, and it would have provided all MTR devices with the capability in a consistent way.
OEM vendors’ solutions aren’t very intelligent
Most OEM vendors of Microsoft Teams Rooms are providing last-generation solutions (i.e. not as good as an app on your phone or a modern laptop) for face detection to support Intelligent Cameras.
Take, for example, a Microsoft Teams Room built to the “Enhanced Microsoft Teams Room” standard from Microsoft; in the example below we see that the vendor camera can’t match auto-framing using open-source libraries or basic examples using AI acceleration.
I’ve used one of the cameras on Microsoft’s list, from a vendor that was also listed in the September 2021 announcement.
You’ll see that the auto-framing and speaker tracking features on the MTR vendor device are extremely slow in comparison to both the AI-predictive tracking using the Nvidia AR SDK and the basic face detection and tracking using the open-source Dlib library. The CPU-based Dlib managed nearly as well as the GPU model, even though the CPU (of a similar model to an MTR) was processing the three 4K video streams.
Although my example is slow, it doesn’t show the worst of how the vendor solutions work.
Using similar kit last week, I ran a hybrid event for executives of an international company. Everything went mostly to plan, apart from the slow speaker tracking. At one point, the vendor face detection functionality also misrecognized the projector as a face (it had two mics on top and looked a little like the robot from Short Circuit), much to the hilarity of the remote attendees.
How Microsoft should have delivered Intelligent Cameras
The lesson from that is: don’t leave it to the vendors to deliver the feature. The $1000+ cameras provide worse tracking than a $100 Brio using software-based face detection to crop video feeds. The Nvidia GPU, similar to what is available in many laptops, provides group-based framing (i.e. the functionality necessary to deliver the separate feeds that Intelligent Cameras promises) as part of the SDK.
What Microsoft should have done is allow the vendors to offer their own innovations while providing a consistent experience for Microsoft Teams.
Rather than slicing video feeds up at the camera, vendors should have provided the metadata so that the slicing could be performed by the Microsoft Teams service layer, using a similar method to Together mode to take each person from a video feed and compose them as individual frames. If Microsoft considered that too costly, they could instead have performed the cropping client-side against the full video feed, delivered at 4K if needed.
Ultimately the feature is taking a single camera feed and then splitting it into multiple streams. Taking the source video feed, along with the metadata highlighting the location of each face allows the decoded video to be cropped and arranged easily. In the example below, you’ll see the metadata shown as boxes against the real-time feeds. As you can see, apart from the predictive data (i.e. the arrow near my face on the left, showing where it thinks I’m moving to), it’s effectively the co-ordinates of each person:
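To make the cropping step concrete, here is a minimal sketch in Python of how a client might turn one face box into a crop rectangle on the full frame. It is only an illustration under assumptions: the `FaceBox` type, the `crop_for_face` function, the padding factor and the 16:9 tile shape are all hypothetical, and nothing here reflects how Teams or any vendor actually formats its metadata.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaceBox:
    """Hypothetical per-face metadata: a pixel box in the source frame."""
    x: int       # left edge of the detected face
    y: int       # top edge of the detected face
    width: int
    height: int


def crop_for_face(box, frame_w, frame_h, aspect=16 / 9, padding=1.0):
    """Return (left, top, width, height) of a crop centred on the face.

    The face box is padded on every side, stretched to the target aspect
    ratio, shrunk if it is larger than the frame, and then shifted so it
    stays inside the frame.
    """
    # Pad the face box so the tile shows head and shoulders, not just a face.
    crop_w = box.width * (1 + 2 * padding)
    crop_h = box.height * (1 + 2 * padding)

    # Grow whichever dimension is too small for the target aspect ratio.
    if crop_w / crop_h < aspect:
        crop_w = crop_h * aspect
    else:
        crop_h = crop_w / aspect

    # If the crop is bigger than the source frame, scale it down uniformly.
    scale = min(1.0, frame_w / crop_w, frame_h / crop_h)
    crop_w, crop_h = crop_w * scale, crop_h * scale

    # Centre on the face, then clamp so the crop never leaves the frame.
    centre_x = box.x + box.width / 2
    centre_y = box.y + box.height / 2
    left = min(max(centre_x - crop_w / 2, 0), frame_w - crop_w)
    top = min(max(centre_y - crop_h / 2, 0), frame_h - crop_h)
    return round(left), round(top), round(crop_w), round(crop_h)


# One 4K frame, two detected faces, two per-person crop rectangles.
faces = [FaceBox(1800, 900, 240, 240), FaceBox(0, 0, 240, 240)]
crops = [crop_for_face(face, 3840, 2160) for face in faces]
```

The point of the sketch is that the expensive part (finding the faces) has already happened by the time the metadata exists; turning co-ordinates into per-person tiles is a few lines of arithmetic that any client or service receiving the full feed could do.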
Then, Microsoft could have either loaded a pre-computed registered faces model based on the meeting roster into the source MTR to process, or used the Azure Face API. The recognition function is the piece of the puzzle that determines who the detected face is.
Naturally, to keep it simple, only the attendees of the meeting would be included in the model, so that twins who work for the same company but in different departments wouldn’t be shown as false positives.
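A minimal sketch of that roster restriction, in Python. It assumes each face has already been reduced to a numeric embedding by some recognition model; the `identify` function, the `enrolled` dictionary, the names, the two-dimensional embeddings and the 0.6 distance threshold are all hypothetical and chosen purely for illustration.

```python
import math


def identify(face_embedding, enrolled, roster, max_distance=0.6):
    """Return the closest roster member, or None if nobody is close enough.

    enrolled maps a person's name to a reference embedding.
    roster is the list of invited attendees; nobody else is considered.
    """
    best_name, best_distance = None, max_distance
    for name in roster:
        reference = enrolled.get(name)
        if reference is None:
            continue  # invited, but no enrolled face to compare against
        distance = math.dist(face_embedding, reference)
        if distance < best_distance:
            best_name, best_distance = name, distance
    return best_name


# Alex and Sam are twins, so their (made-up) embeddings are nearly identical.
enrolled = {
    "Alex": [0.10, 0.20],
    "Sam": [0.12, 0.20],
    "Priya": [0.90, 0.80],
}
seen_in_room = [0.12, 0.21]

# Only Alex and Priya were invited, so Sam is never a candidate.
match = identify(seen_in_room, enrolled, roster=["Alex", "Priya"])
```

With the whole company as the candidate set, the same face would match Sam instead, which is exactly the false positive that limiting the model to the roster avoids.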
Because recognition doesn’t need to run against every video frame, it doesn’t need acceleration or offloading to a cloud service like the Face API. Using the AVX2 extensions on the MTR, you can recognize a face in around 1-2 seconds while the device is also encoding/decoding video streams, performing real-time face detection, and running a UI. Each result can then be tagged in the metadata against the detected face, allowing a meeting room with around 10 people to have a fully tagged roster in around 20 seconds. If vendors used the capabilities in the modern ARM system-on-chip devices they should be putting into cameras and MTR on Android devices, it could be around 50 times faster, even on the cheapest CPUs.
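The timing estimate above is simple arithmetic, sketched here in Python. It assumes faces are recognized one after another and that the figures quoted in this article (1-2 seconds per face, roughly 50 times faster with on-chip acceleration) hold; the function name and parameters are hypothetical.

```python
def roster_tagging_seconds(people, seconds_per_face, speedup=1.0):
    """Estimate the time to tag every face once, processing them one by one."""
    return people * seconds_per_face / speedup


# CPU-only estimate for a 10-person room, at 1 and 2 seconds per face.
cpu_best = roster_tagging_seconds(10, 1)    # 10 seconds
cpu_worst = roster_tagging_seconds(10, 2)   # 20 seconds

# The same room if recognition were around 50 times faster.
accelerated_worst = roster_tagging_seconds(10, 2, speedup=50)  # under half a second
```

Either way, the roster is tagged once near the start of the meeting rather than on every frame, which is why the per-face cost matters far less than the per-frame cost of detection.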
Is there a good reason why Microsoft offloaded responsibility for Intelligent Cameras?
Microsoft hasn’t shared why they’ve changed direction, and naturally, it is their decision to make. But it would be interesting to understand why. Was it a fear of a misrecognition causing fall-out (similar to the Tay chatbot), or perhaps the OEM vendors pressuring Microsoft, so that you need to buy new equipment to use the functionality?
We’ll possibly never know, but what we do know is that Intelligent Cameras has a lot of potential – but only if Microsoft delivers the feature as a first-party feature included in all Teams Rooms. Only then can they provide an equitable meeting experience to customers who’ve already invested heavily in Microsoft’s Teams Room systems.