News from Industry

WebCodecs, WebTransport, and the Future of WebRTC

webrtchacks - Tue, 07/18/2023 - 14:30

Explore the future of Real-Time Communications with WebrtcHacks as we delve into the use of WebCodecs and WebTransport as alternatives to WebRTC's RTCPeerConnection. This comprehensive blog post features interviews with industry experts, a review of potential WebCodecs+WebTransport architecture, and a discussion on real-time media processing challenges. We also examine performance measurements, hardware encoder issues, and the practicality of these new technologies.

The post WebCodecs, WebTransport, and the Future of WebRTC appeared first on webrtcHacks.

New: Higher-Level WebRTC Protocols course

bloggeek - Mon, 07/17/2023 - 12:30

A new Higher-level WebRTC protocols course and discounts, available for a limited period of time.

Over a year ago, Philipp Hancke came to me with the idea of creating a new set of courses. Ones that will dig deeper into the heart of the protocols used in WebRTC. This being a huge undertaking, we decided to split it into several courses, and focus on the first one – Low-level WebRTC protocols.

We received positive feedback about it, so we ended up working on our second course in this series – Higher-level WebRTC protocols.

Why the need for additional WebRTC courses?

There is always something more to learn.

The initial courses at WebRTC Course were focused on giving an understanding of the different components of WebRTC itself and on getting developers to be able to design and then implement their application.

What was missing in all that was a closer look at the protocols themselves. Of looking at what goes on in the network, and being able to understand what goes over the wire. Which is why we started off with the protocols courses.

Where the Low-level WebRTC protocols looks at directly what goes to the network with WebRTC, our newer Higher-level WebRTC protocols is taking it up one level:

This time, we’re looking at the protocols that make use of RTP and RTCP to make the job of real time communications manageable.

If you don’t know exactly what header extensions are, and how they work (and why), or the types of bandwidth estimation algorithms that WebRTC uses – and again – how and why – then this course is for you.

If you know RTP and RTCP really well, because you’ve worked in the video conferencing industry, or have done SIP for years – then this course is definitely for you.

Just understanding the types of RTP header extensions that WebRTC ends up using, many of them proprietary, is going to be quite a surprise for you.

Our WebRTC Protocols courses

Got a use case where you need to render remote machines using WebRTC? These require sitting at the cutting edge of WebRTC, or more accurately and a slightly skewed angle versus what the general population does with WebRTC (including Google).

Taking upon yourself such a use case means you’ll need to rely more heavily on your own expertise and understanding of WebRTC.

There are now 2 available protocols courses for you:

  1. Low-level WebRTC protocols
  2. Higher-level WebRTC protocols (half-complete. Call it a work in progress)

And there are 2 different ways to purchase them:

  1. Each one separately – low and high
  2. As part of the bigger ALL INCLUDED WebRTC Developer bundle (the Higher-level course was just added to it)

You should probably hurry though…

  • There’s a 40% discount on the Higher-level WebRTC protocols course. This early-bird discount will be available until the end of this month ($180 instead of $300)
  • There’s also a 20% discount on all courses and ebooks. Call it a summer sale – this one is available using discount code SUMMER

Check out my WebRTC courses

The post New: Higher-Level WebRTC Protocols course appeared first on BlogGeek.me.

Cloud gaming, virtual desktops and WebRTC

bloggeek - Mon, 07/03/2023 - 13:30

WebRTC is an important technology for cloud gaming and virtual desktop type use cases. Here are the reasons and the challenges associated with it.

Google launched and shut down Stadia. A cloud gaming platform. It used WebRTC (yay), but it didn’t quite fit into Google’s future it seems.

That said, it does shed a light on a use case that I’ve been “neglecting” in my writing here, though it was and is definitely top of mind in discussions with vendors and developers.

What I want to put in writing this time is cloud gaming as a concept, and then alongside it, all virtual desktops and cloud rendering use cases.

Let’s dig in

Table of contents The rise and (predictable?) fall of Google Stadia

Google Stadia started life as Project Stream inside Google.

Technically, it made perfect sense. But at least in hindsight, the business plan wasn’t really there. Google is far remote from gaming, game developers and gamers.

On the technical side, the intent was to run high end games on cloud machines that would render the game and then have someone play the game “remotely”. The user gets a live video rendering of the game and sends back console signals. This meant games could be as complex as they need be and get their compute power from cloud servers, while keeping the user’s device at the same spec no matter the game.

Source: Google

I’ve added the WebRTC text on the diagram from Google – WebRTC was called upon so that the player could use a modern browser to play the game. No installation needed. This can work nicely even on iOS devices, where Apple is adamant about their part of the revenue sharing on anything that goes through the app store.

Stadia wanted to solve quite a few technological challenges:

  • Running high end console games on cloud machines
  • Remotely serving these games in real time
  • Playing the game inside a browser (or an equivalent)

And likely quite a few other challenges as well (scaling this whole thing and figuring out how to obtain and keep so many GPUs for example).

Technically, Stadia was a success. Businesswise… well… it shut down a little over 3 years after its launch – so not so much.

What Stadia did though, was show that this is most definitely possible.

WebRTC, Cloud gaming and the challenges of real time

To get cloud gaming right, Google had to do a few things with WebRTC. Things they haven’t really needed too much when the main thing for WebRTC at Google was Google Meet. These were lowering the latency, dealing with a larger color space and aiming for 4K resolution at 60 fps. What they got virtually for “free” with WebRTC was its data channel – the means to send game controller signals quickly from the player to the gaming machine in the cloud.

Lets see what it meant to add the other three things:

4K resolution at 60 fps

Google aimed for high end games, which meant higher resolutions and frame rates.

WebRTC is/was great for video conferencing resolutions. VGA, 720p and even 1080p. 4K was another jump up that scale. It requires more CPU and more bandwidth.

Luckily, for cloud gaming, the browser only needs to decode the video and not encode it. Which meant the real issue, besides making sure the browser can actually decode 4K resolutions efficiently, was to conduct efficient bandwidth estimation.

As an algorithm, bandwidth estimation is finely tuned and optimized for given scenarios. 4K and cloud gaming being a new scenario, meant that bitrates that were needed weren’t 2mbps or even 4mbps but rather more in the range of 10-35mbps.

The built-in bandwidth estimator in WebRTC can’t handle this… but the one Google built for the Stadia servers can. On the technical side, this was made possible by Google relying on sender-side bandwidth estimation techniques using transport-cc.

Lower latency: playout delay

Remember this diagram?

It can be found in my article titled With media delivery, you can optimize for quality or latency. Not both.

WebRTC is designed and built for lower latency, but in the sub-second latency, how would you sort the latency requirements of these 3 activities?

  1. Nailing a SpaceX rocket landing
  2. Playing a first shooter game (as old as I am, that means Doom or Quake for me)
  3. Having an online meeting with a potential customer

WebRTC’s main focus over the years has been online meetings. This means having 100 milliseconds or 200 milliseconds delay would be just fine.

With an online game? 100 milliseconds is the difference between winning and losing.

So Google tried to reduce latency even further with WebRTC by adding a concept of Playout Delay. The intent here is to let WebRTC know that the application and use case prefers playing out the media earlier and sacrificing even further in quality, versus waiting a bit for the benefit of maybe getting better quality.

Larger color space

Video conferencing and talking heads doesn’t need much. If you recall, with video compression what we’re after is to lose as much as we can out of the original video signal and then compress. The idea here is that whatever the eye won’t notice – we can make do without.

Apparently, for talking heads we can lose more of the “color” and still be happy versus doing something similar for an online game.

To make a point, if you’ve watched Game of Thrones at home, then you may remember the botch they had with the last season with some of the episodes that ended up being too dark for television. That was due to compression done by service providers…

So far this is my favorite screenshot from #BattleForWinterfell #GameofThrones pic.twitter.com/6uI45SjPG7

— Lady Emily (@GreatCheshire) April 29, 2019

While different from the color space issue here, it goes to show that how you treat color in video encoding matters. And it differs from one scenario to another.

When it comes to games, a different treatment of color space was needed. Specifically, moving from SDR to HDR, adding an RTP header extension in the process to express that additional information.

Oh, and if you want to learn more about these changes (especially resolution and color space), then make sure to watch this Kranky Geek session by YouTube about the changes they had to make to support Stadia:

What’s in cloud gaming anyway?

Here’s the thing. Google Stadia is one end of the spectrum in gaming and in cloud gaming.

Throughout the years, I’ve seen quite a few other reasons and market targets for cloud gaming.

Types of cloud games

Here are the ones that come out of the top of my head:

  • High end gaming. That’s the Google Stadia use case. Play a high end game anywhere you want on any kind of device. This reduces the reliance and need to upgrade your gaming hardware all the time
    • You’ll find NVIDIA, Amazon Luna and Microsoft xCloud focused in this domain
    • How popular/profitable this is is still questionable
  • Console gaming. PlayStation, Xbox, Switch. Whatever. Picking a game and playing without waiting to download and install is great. It also allows reducing/removing the hard drive from these devices (or shrinking them in size)
  • Mobile games. You can now sample mobile apps and games before downloading them, running them in the cloud. Other things here? You could play games of other users using their account and the levels they reached instead of slaving your way there
  • Retro/emulated games. There’s a large and growing body of games that can’t be played on today’s machines because the hardware for them is too old. These can be emulated, and now also played remotely as cloud games. How about playing a PlayStation 2 game? Or an old and classing SEGA arcade game? Me thinking Golden Axe
Improved gameplay

Why not even play these games with others remotely?

My son recently had a sit down with 4 other friends, all playing on Xbox together a TMNT game. It was great having them all over, but you could do it remotely as well. If the game doesn’t offer remote players, by pushing it to the cloud you can get that feature simply because all users immediately become remote players.

At this stage, you can even add a voice conference or a video call to the game between the players. Just to give them the level of collaboration they can get out of playing the likes of Fortnite. Granted, this requires more than just game rendering in the cloud, but it is possible and I do see it happen with some of the vendors in this space.

Beyond cloud gaming – virtual desktop, remote desktop and cloud rendering

Lower latencies. Bigger color space. Higher resolutions. Rendering in the cloud and consuming remotely.

All these aren’t specific to cloud gaming. They can easily be extended to virtual desktop and remote desktop scenarios.

You have a machine in the cloud – big or small or even a cluster. That “machine” handles computations and ends up rendering the result to a virtual display. You then grab that display and send it to a remote user.

One use case can just be a remote desktop a-la VNC. Here we’re actually trying to get connected from one machine to another, usually in a private and secure peer-to-peer fashion, which is different from what I am aiming for here.

Another, less talked about is doing things like Photoshop operations in the cloud. For the poor sad people like me who don’t have the latest Mac Pro with the shiny M2 Ultra chip, I might just want to “rent” the compute power online for my image or video editing jobs.

I might want to open a rendered 3D view of a sports car I’d like to buy, directly from the browser, having the ability to move my view around the car.

Or it might just be a simple VDI scenario, where the company (usually a large financial institute, but not only) would like the employees to work on Chromebook machines but have nothing installed or stored in them – all consumed by accessing the actual machine and data in their own corporate data center or secure cloud environment.

A good friend of mine asked me what PC to buy for himself. He needed it for work. He is a lawyer. My answer was the lowest end machine you can find would do the job. That saved him quite a lot of money I am guessing, and he wouldn’t even notice the difference for what he needs it for.

But what if he needs a bit more juice and power every once in a while? Can renting that in the cloud make a difference?

What about the need to use specialized software that is hard to install and configure? Or that requires a lot of collaboration on large amounts of data that need to be shared across the collaborators?

Taking the notion and capabilities of cloud gaming and applying them to non-gaming use cases can help us with multiple other requirements:

  1. CPU and memory requirements that can’t be met with a local machine easily
  2. The need to maintain privacy and corporate data in work from home environments
  3. Zero install environment, lowering maintenance costs

Do these have to happen with WebRTC? No

Can they happen with WebRTC? Yes

Would changing from proprietary VDI environments to open standard WebRTC in browsers improve things? Probably

Why use WebRTC in cloud gaming

Why even use WebRTC for cloud gaming or more general cloud rendering then?

With cloud gaming, we’re fine doing it from inside a dedicated app. So WebRTC isn’t really necessary. Or is it?

In one of our recent WebRTC Insights issues we’ve highlighted that Amazon Luna is dropping the dedicated apps in favor of the web (=WebRTC). From that article:

“We saw customers were spending significantly more time playing games on Luna using their web browsers than on native PC and Mac apps. When we see customers love something, we double down. We optimized the web browser experience with the full features and capabilities offered in Luna’s native desktop apps so customers now have the same exact Luna experience when using Luna on their web browsers.”

Browsers are still a popular enough alternative for many users. Are these your users too?

If you need or want web browser access for a cloud gaming / cloud rendering application, then WebRTC is the way to go. It is a slightly different opinion than the one I had with the future of live streaming, where I stated the opposite:

“The reason WebRTC is used at the moment is because it was the only game in town. Soon that will change with the adoption of solutions based on WebTransport+WebCodecs+WebAssembly where an alternative to WebRTC for live streaming in browsers will introduce itself.”

Why the difference? It is all about the latency we are willing to accommodate:

Your mileage may vary when it comes to the specific latency you’re aiming for, but in general – live streaming can live with slightly higher latency than our online meetings. So something other than WebRTC can cater for that better – we can fine tune and tweak it more.

Cloud gaming needs even lower latency than WebRTC. And WebRTC can accommodate for that. Using something else that is unproven yet (and suffers from performance and latency issues a bit at the moment) is the wrong approach. At least today.

Enter our WebRTC Protocols courses

Got a use case where you need to render remote machines using WebRTC? These require sitting at the cutting edge of WebRTC, or more accurately and a slightly skewed angle versus what the general population does with WebRTC (including Google).

Taking upon yourself such a use case means you’ll need to rely more heavily on your own expertise and understanding of WebRTC.

Over a year ago I launched with Philipp Hancke the Low-level WebRTC Protocols course. We’re now recording our next course – Higher-level WebRTC Protocols. 

If you are interested in learning more about this, be sure to join our waiting list for once we launch the course

Join the course waiting list

Oh, and I’d like to thank Midjourney for releasing version 5.2 – awesome images

The post Cloud gaming, virtual desktops and WebRTC appeared first on BlogGeek.me.

Apple Vision, VR/AR, the metaverse and what it means to the web and WebRTC

bloggeek - Mon, 06/19/2023 - 13:30

The Apple Vision pro is a new VR/AR headset. Here are my thoughts on if and how it will affect the metaverse and WebRTC.

There were quite a few interesting announcements and advances made in recent months that got me thinking about this whole area of the metaverse, augmented reality and virtual reality. All of which culminated with Apple’s unveiling last week of the Apple Vision Pro. For me, the prism from which I analyze things is the one of communication technologies, and predominantly WebRTC.

A quick disclaimer: I have no clue about what the future holds here or how it affects WebRTC. The whole purpose of this article is for me to try and sort my own thoughts by putting them “down on paper”.

Let’s get started then

Table of contents The Apple Vision Pro

Apple just announced its Vision Pro VR/AR headset. If you’re reading this blog, then you know about this already, so there isn’t much to say about it.

For me? This is the first time that I had this nagging feeling for a few seconds that I just might want to go and purchase an Apple product.

Most articles I’ve read were raving about this – especially the ones who got a few minutes to play with it at Apple’s headquarters.

AR/VR headsets thus far have been taking one of the two approaches:

  1. AR headsets were more akin to “glasses” that had an overhead display on them which is where the augmentation took place with additional information being displayed on top of reality. Think Google Glass
  2. VR headsets which you wear a whole new world on top of your head, looking at a video screen that replaces the real world altogether

Apple took the middle ground – their headset is a VR headset since it replaces what you see with two high resolution displays – one for each eye. But it acts as an AR headset – because it uses external cameras on the headset to project the world on these displays.t

The end result? Expensive, but probably with better utility than any other alternative, especially once you couple it with Apple’s software.

Video calling, FaceTime, televisions and AR

Almost at the sidelines of all the talks and discussions around Apple Vision Pro and the new Mac machines, there have been a few announcements around things that interest me the most – video calling.

FaceTime and Apple TV

One of the challenges of video calling has been to put it on the television. This used to be called a lean back experience for video calling, in a world predominantly focused on lean forward when it comes to video calling. I remember working on such proof of concepts and product demos with customers ~15 years ago or more.

These never caught on.

The main reason was somewhere between the cost of the hardware, maintaining privacy with a livingroom camera and microphone positioning/noise.

By tethering the iPhone to the television, the cost of hardware along with maintaining privacy gets solved. The microphones are now a lot better than they used to – mostly due to better software.

Apple, being Apple, can offer a unique experience because they own and control the hardware – both of the phone and the set-top box. Something that is hard for other vendors to pull off.

There’s a nice concept video on the Apple press release for this, which reminded me of this Facebook (now Meta) Portal presentation from Kranky Geek:

Can Android devices pull the same thing, connected to Chromecast enabled devices maybe? Or is that too much to ask?

Do television and/or set-top box vendors put an effort into a similar solution? Should they be worried in any way?

Where could/should WebRTC play a role in such solutions, if at all?

FaceTime and Apple Vision Pro

How do you manage video calls with a clunky AR/VR headset plastered on your face?

First off, there’s no external camera “watching you”, unless you add one. And then there’s the nagging thing of… well… the headset:

Apple has this “figured out” by way of generating a realistic avatar of you in a meeting. What is interesting to note here, is that in the Apple Vision Pro announcement video itself, Apple made a three important omissions:

  1. They don’t show how the other people in the meeting see the person with the Vision headset on
  2. There’s only a single person with a Vision headset on, and we have his worldview, so again, we can’t see how others with a Vision headset look like in such a call
  3. How do you maintain eye contact, or even know where the user is looking at? (a problem today’s video calling solutions have as well)

What do the people at the meeting see of her? Do they see her looking at them, or the side of her head? Do they see the context of her real-life surroundings or a virtual background?

I couldn’t find any person who played with the Apple Vision Pro headset and reported using FaceTime, so I am assuming this one is still a work in progress. It will be really interesting to see what they come up with once this is released to market, and how real life use looks and feels like.

Lifelike video meetings: Just like being there

Then there’s telepresence. This amorphous thing which for me translates into: “expensive video conferencing meeting rooms no one can purchase unless they are too rich”.

Or if I am a wee bit less sarcastic – it is where we strive to with video conferencing – what would be the ultimate experience of “just like being there” done remotely if we had the best technology money can buy today.

Google Project Starline is the current poster child of this telepresence technology.

The current iteration of telepresence strives to provide 3D lifelike experience (with eye contact obviously). To do so while maintaining hardware costs down and fitting more environments and hardware devices, it will rely on AI – like everything else these days.

The result as I understand it?

  • Background removal/replacement
  • Understanding depth, to be able to generate a 3D representation of a speaker on demand and fit it to what the viewer needs, as opposed to what the cameras directly capture

Now look at what FaceTime on an Apple Vision Pro really means:

Generate a hyper realistic avatar representation of the person – this sounds really similar to removing the background and using cameras to generate a 3D representation of the speaker (just with a bit more work and a bit less accuracy).

Both Vision Pro and Starline strive for lifelike experiences between remote people. Starline goes for a meeting room experience, capturing the essence of the real world. Vision Pro goes after a mix between augmented and virtual reality here – can’t really say this is augmented, but can’t say this is virtual either.

A telepresence system may end up selling a million units a year (a gross exaggeration on my part as to the size of the market, if you take the most optimistic outcome), whereas a headset will end up selling in the tens of millions or more once it is successful (and this is probably a realistic estimate).

What both of these ends of the same continuum of a video meeting experience do is they add the notion of 3D, which in video is referred to as volumetric video (we need to use big fancy words to show off our smarts).

And yes, that does lead me to the next topic I’d like to cover – volumetric video encoding.

Volumetric video coding

We have the metaverse now. Virtual reality. Augmented reality. The works.

How do we communicate on top of it? What does a video look like now?

The obvious answer today would be “it’s a 3D video”. And now we need to be able to compress it and send it over the network – just like any other 2D video.

The Alliance for Open Media, who has been behind the publication and promotion of the AV1 video codec, just published a call for proposals related to volumetric video compression. From the proposal, I want to focus on the following tidbits:

  • The Alliance’s Volumetric Visual Media (VVM) Working Group formed in February 2022 this is rather new
  • It is led by Co-Chairs Khaled Mammou, Principal Engineer at Apple, and Shan Liu, Distinguished Scientist and General Manager at Tencent Apple… me thinking Vision Pro
  • The purpose is the “development of new tools for the compression of volumetric visual media” better compression tools for 3D video

This being promoted now, on the same week Apple Vision Pro comes out might be a coincidence. Or it might not.

The founding members include all the relevant vendors interested in AR/VR that you’d assume:

  • Apple – obviously
  • Cisco – WebEx and telepresence
  • Google – think Project Starline
  • Intel & NVIDIA – selling CPUs and GPUs to all of us
  • Meta – and their metaverse
  • Microsoft – with Teams, Hololens and metaverse aspirations

The rest also have vested interest in the metaverse, so this all boils down to this:

AR/VR requires new video coding techniques to enable better and more efficient communications in 3D (among other things)

Apple Vision Pro isn’t alone in this, but likely the one taking the first bold steps

The big question for me is this – will Apple go off with its own volumetric video codecs here, touting how open they are (think FaceTime open) or will they embrace the Alliance of Open Media work that they themselves are co-chairing?

And if they do go for the open standard here, will they also make it available for other developers to use? Me thinking… WebRTC

Is the metaverse web based?

Before tackling the notion of WebRTC into the metaverse, there’s one more prerequisite – that’s the web itself.

Would we be accessing the metaverse via a web browser, or a similar construct?

For an open metaverse, this would be something we’d like to have – the ability to have our own identity(ies) in the metaverse go with us wherever we go – between Facebook, to Roblox, through Fortnite or whatever other “domain” we go to.

Last week also got us this sideline announcement from Matrix: Introducing Third Room TP2: The Creator Update

Matrix, an open source and open standard for decentralized communications, have been working on Third Room, which for me is a kind of a metaverse infrastructure for the web. Like everything related to the metaverse, this is mostly a work in progress.

I’d love the metaverse itself to be web based and open, but it seems most vendors would rather have it limited to their own closed gardens (Apple and Meta certainly would love it that way. So would many others). I definitely see how open standards might end up being used in the metaverse (like the work the Alliance of Open Media is doing), but the vendors who will adopt these open standards will end up deciding how open to make their implementations – and will the web be the place to do it all or not.

Where would one fit WebRTC in the metaverse, AR and VR?

Maybe. Maybe not.

The unbundling of WebRTC makes it both an option while taking us farther away from having WebRTC as part of the future metaverse.

Not having the web means no real reliance on WebRTC.

Having the tooling in WebRTC to assist developers in the metaverse means there’s incentive to use and adopt it even without the web browser angle of it.

WebRTC will need at some point to deal with some new technical requirements to properly support metaverse use cases:

  • Volumetric video coding
  • Improve its spatial audio capability
  • The number of audio streams that can be mixed (3 are the max today)

We’re still far away from that target, and there will be a lot of other technologies that will need to be crammed in alongside WebRTC itself to make this whole thing happen.

Apple’s new Vision Pro might accelerate that trajectory of WebRTC – or it might just do the opposite – solidify the world of the metaverse inside native apps.

I want to finish this off with this short piece by Jason Fried: The visions of the future

It looks at AR/VR and generative AI, and how they are two exact opposites in many ways.

Recently I also covered ChatGPT and WebRTC – you might want to take a look at that while at it.

The post Apple Vision, VR/AR, the metaverse and what it means to the web and WebRTC appeared first on BlogGeek.me.

Livestream this Friday: WebCodecs, WebTransport, and the Future of WebRTC

webrtchacks - Tue, 06/06/2023 - 14:10

Here at webrtcHacks we are always exploring what’s next in the world of Real Time Communications. One area we have touched on a few times is the use of WebCodecs with WebTransport as an alternative to WebRTC’s RTCPeerConnection. There have been several recent experiments by Bernard Aboba – WebRTC & WebTransport Co-Chair and webrtcHacks regular, […]

The post Livestream this Friday: WebCodecs, WebTransport, and the Future of WebRTC appeared first on webrtcHacks.

Is WebRTC really free? The costs of running a WebRTC application

bloggeek - Mon, 06/05/2023 - 12:30

Is WebRTC really free? It is open source and widely used due to it. But it isn’t free when it comes to running and hosting your own WebRTC applications.

If you are new to WebRTC, then start here – What is WebRTC?

Time to answer this nagging question:

Is WebRTC really free?

One of the reasons that WebRTC is the most widely used developer technology for real time communications in the world is that it is open source. It helps a lot that it comes embedded and available in all modern browsers. That means that anyone can use WebRTC for any purpose they see fit, without paying any upfront licensing fee or later on royalties. This has enabled thousands of companies to develop and launch their own applications.

But does that mean every web application built with WebRTC is free? No. WebRTC may well be free, but whatever is bolted on top of it might not be. And then there are still costs involved with getting a web application online and dealing with traffic costs.

For that reason, in this article, I’ll be touching on why WebRTC really is free, and what you have to factor in for it if you want to get your own WebRTC application.

Table of contents Yes. WebRTC itself is completely free

Since I am sure you didn’t really go read that other article – I’ll suggest it here again: What is WebRTC?

The TL;DR version of it?

The WebRTC software library is open sourced under a permissible open source license. That means its source code is available to everyone AND that individuals and companies can modify and use it anywhere they wish without needing to contribute back their changes. It makes it easier for commercial software to be developed with it (even when no changes or improvements are made to the base WebRTC library – just because of how corporate lawyers are).

You see? WebRTC really is free.

Google “owns” and maintains the main WebRTC library implementation. Everyone benefits from this. That siad, they aren’t doing this only from the goodness of their heart – they have their own uses for WebRTC they focus on.

However, there are costs involved with running a WebRTC application

While you don’t have to pay anything for WebRTC itself, there’s the application you develop, publish and then maintain. There are costs that come into play here – and considerable ones. These costs can vary depending on your requirements. 

I’d like to split the costs here into 3 components:

  1. The cost of developing a WebRTC application
  2. How expensive it is to optimize a WebRTC implementation
  3. Hosting and maintenance costs of a WebRTC application
1. The cost of developing a WebRTC application

The first thing you can put as a cost is to build the WebRTC application itself.

Here, as in all other areas, there’s more demand than supply when it comes to skilled WebRTC engineers. So much so that I had to write an article about hiring WebRTC developers – and I still send this link multiple times a month when asked about this.

Here too, you should split the cost into two parts:

  1. How much does it cost to develop your application?
  2. The WebRTC part of the application – how much investment do you need to put on it?

Since everything done in WebRTC requires skilled engineers (that are scarce when it comes to WebRTC expertise), you can safely assume it is going to be a wee bit more expensive than you estimate it to be.

2. How expensive it is to optimize a WebRTC implementation

I know what you’re going to say. Your WebRTC application is going to be awesome. Glorious. Superb. It is going to be so good that it will wipe the floor with the existing solutions such as Zoom, Google Meet and Microsoft Teams.

That kind of a mentality is healthy in an entrepreneur, but a dose of reality is necessary here:

  • You can’t out-do Google in quality with WebRTC
    • At least not if you’re going to butt head to head
    • Remember that they’re the ones who maintain WebRTC and implement it inside Chrome
    • And if you have the skillset to actually deliver on this one, it means you don’t need to read this article…
  • These vendors have large teams
    • Larger than what you are going to put out there
    • Almost definitely larger than what you’re going to budget for in the next 3 years
    • In man-years they are going to out-class you on pure media quality
    • Especially when the focus is on improving it in our industry at the moment
    • These vendors also need to deal with how Google runs WebRTC in the browser

This brings me to the need to optimize what you’re doing on an ongoing basis.

Ever since the pandemic, we’ve seen a growing effort in the leading vendors in this space to improve and optimize quality. This manifests itself in the research they publish as well as features they bring to the market. Here are a few examples:

  • Larger meeting sizes
  • Lower CPU use
  • Newer audio and video codecs
  • Introduction of AI algorithms to the media pipeline

You should plan for ongoing optimization of your own as well. Your customers are going to expect you to keep up with the industry. The notion of “good enough” works well here, but the bar of what is “good enough” is rising all the time.

Such optimizations are also needed not only to improve quality, but also to reduce costs.

Factor these costs in…

3. Hosting and maintenance costs of a WebRTC application

I had a meeting the other day. A founder of a startup who had to use WebRTC because customers needed something live and interactive. That component wasn’t at the core of his application, but not having it meant lost deals and revenue. It was a mandatory capability needed for a specific feature.

He complained about WebRTC being expensive to operate. Mainly because of bandwidth costs.

We can split WebRTC maintenance costs here into two categories: cloud costs, keeping the lights on costs.

Cloud costs

That startup founder was focused on cloud costs.

When we look at the infrastructure costs of web applications, there’s the usual CPU, memory, storage and network. We might be paying these directly, or indirectly via other managed and serverless services.

With WebRTC, the network component is the biggest hurt. Especially for video applications. You can reduce these costs by going to 2nd tier IaaS vendors or by hosting in “no-name” local data centers, but if you are like most vendors, you’re likely to end up on Amazon, Microsoft or Google cloud. And there, bandwidth costs for outgoing traffic are high.

WebRTC is peer to peer, but:

  • Not all sessions can go peer to peer. Some must be relayed via TURN servers
  • Large group calls in most cases will mean going through the cloud with your bandwidth to WebRTC media servers
  • All commercial WebRTC services I know have server components that gobble up bandwidth

And the more successful you become – the more bandwidth you’ll consume – and the higher your cloud costs are going to be.

You will need to factor this in when developing your application, especially deciding when to start optimizing for costs and bandwidth use.

Keeping the lights costs

Then there’s the “keeping the lights” costs.

WebRTC changes all the time. Things get deprecated and removed. Features change behavior over time. New features are added. You continually need to test that your application does not break in the upcoming Chrome release. Who is going to take care of all that in your WebRTC application?

You will also need to understand the way your WebRTC application is used. Are users happy? Are there areas you need to invest in with further optimization? Observability (=monitoring) is key here.

Keeping the lights on has its own set of costs associated with it.

Build vs buy a WebRTC infrastructure

Buying your WebRTC infrastructure by using managed services like CPaaS vendors is expensive. But then again, building your own (along with optimizing and maintaining it) is also expensive.

Roughly speaking, this is the kind of a decision table you’ll see in front of you:

BuildBuyPros Customized to your specific need
Ownership of the solution and ability to modify with changing requirements
Better control over costs Short time to market
Low initial cost
Less of a need for a highly skilled team of WebRTC expertsCons Time consuming. Longer time to market
High initial development costs
Ongoing maintenance costs
Finding/sourcing skilled WebRTC developers Cost at scale can be an issue
Harder to differentiate on the media layer

There’s also a middleground, where you can source/buy certain pieces and build others. Here are a few examples/suggestions:

  • Consider paying for a managed TURN service while building your own WebRTC application
  • Signaling can be outsourced using the likes of PubNub, Pusher and Ably
  • You can get your testing and monitoring needs from testRTC (a company I co-founded)

You can also start with a CPaaS vendor and once you scale and grow, invest the time and money needed to build your own infrastructure – once you’ve proven your application and got to product-market-fit.

So, how free is WebRTC, really?

Part of WebRTC’s claim to fame is its nature as an open source and thus free software for building interactive web applications. While the technology itself is indeed free of charge and offers numerous freedoms, there are still costs associated with running a WebRTC application.

When we had to launch our own video conferencing service some 25 years ago, we had to put an investment of several millions of dollars along with an engineering team for a period of a couple of years. Only to get to the implementation of a media engine.

WebRTC gives that to you for “free”. And it is also kind enough to be pre-integrated in all modern browsers.

What Google did with WebRTC was to reduce the barrier of entry to real time communication drastically.

Creating a WebRTC application isn’t free – not really. But it does come with a lot of alternatives that bring with them freedom and flexibility.

The post Is WebRTC really free? The costs of running a WebRTC application appeared first on BlogGeek.me.

WebRTC media resilience: the role FEC, RED, PLC, RTX and other acronyms play

bloggeek - Mon, 05/22/2023 - 13:00

How WebRTC media resilience works – what FEC, RED, PLC, RTX are and why they are needed to improve media quality in real-time communications.

Networks are finicky in nature, and media codecs even more so.

With networks, not everything sent is received on the other end, which means we have one more thing to deal with and care about when it comes to handling WebRTC media. Luckily for us, there are quite a few built-in tools that are available to us. But which one should we use at each point and what benefits do they bring?

This is what I’ll be focusing on in this article.

Table of contents Networks are lossy

Communication networks are lossy in nature. This means that if you send a packet through a network – there’s no guarantee of that packet reaching the other side. There’s also no guarantee that packets are reached in the order you’ve sent them or in a timely fashion, but that’s for another article.

This is why almost everything you do over the internet has this nice retransmission mechanism tucked away somewhere deep inside as an assumption. That retransmission mechanism is part of how TCP works – and for that matter, almost every other transport protocol implemented inside browsers.

The assumption here is that if something is lost, you simply send it again and you’re done. It may take a wee bit longer for the receiver to receive it, but it will get there. And if it doesn’t, we can simply announce that connection as severed and closed.

We call and measure that “something is lost” aspect of networks as packet loss.

Stripping away that automatic assumption that networks are reliable and everything you send over them is received on the other side is the first important step in understanding WebRTC but also in understanding real-time transport protocols and their underlying concepts.

Media codecs are lossy (and sensitive)

Media codecs are also lossy but in a different way. When an audio codec or a video codec needs to encode (=compress) the raw input from a microphone or a camera, what they do is strip the data out of things they deem unnecessary. These things are levels of perceived quality of the original media.

I remember many years ago, sitting at the dorms in the university and talking about albums and CDs. One of the roommates there was an audiophile. He always explained how vinyl albums have better audio quality than CDs and how MP3 just ruins audio quality. Me? I never heard the difference.

Perceived quality might be different between different people. The better the codec implementation, the more people will not notice degraded quality.

Back to codecs.

Most media codecs are lossy in nature. There are a few lossless ones, but these are rarely used for real time communications and not used in WebRTC at all. The reason we use lossy codecs is to have better compression rates:

Taking 1080p (Full HD) video at 30 frames per second will result in roughly 1.5Gbps of data. Without compressing it – it just won’t work. We’re trying to squeeze a lot of raw data over networks, and as always, we need to balance our needs with the resources available to us.

To compress more, we need:

  • To reduce what we care about to the its bare minimum (the lossy aspect of the codecs we use)
  • More CPU and memory to perform the compression
  • Make every bit we end up with matter

That last one is where media codecs become really sensitive.

If every bit matters, then losing a bit matters. And if losing a bit matters, then losing a whole packet matters even more.

Since networks are bound to lose packets, we’re going to need to deal with media packets missing and our system (in the decoder or elsewhere) needing to fill that gap somehow

More on lossy codecs

More on the future of audio codecs (lossy and lossless ones)

Types of WebRTC media correction

Media packets are lost. Our media decoders – or WebRTC system as a whole – needs to deal with this fact. This is done using different media correction mechanisms. Here’s a quick illustration of the available choices in front of us:

Each such media correction technique has its advantages and challenges. Let’s review them so we can understand them better.

PLC: Packet Loss Concealment

Every WebRTC implementation needs a packet loss concealment strategy. Why? Because at some point, in some cases, you won’t have the packets you need to play NOW. And since WebRTC is all about real-time, there’s no waiting with NOW for too long.

What does packet loss concealment mean? It means that if we lost one or more packets, we need to somehow overcome that problem and continue to run to the best of our ability.

Before we dive a bit deeper, it is important to state: not losing packets is always better than needing to conceal lost packets. More on that – later.

This is done differently between audio and video:

Audio PLC

For the most part, audio packets are decoded frame-by-frame and usually also packet-by-packet. If one is lost, we can try various ways to solve that. There are the most common approaches:

Illustration taken from Philipp Hancke’s presentation at Kranky Geek. Video PLC

Packet loss on video streams has its own headaches and challenges.

In video, most of the frames are dependant on previous ones, creating chains of dependencies:

I-frames or keyframes (whatever they are called depending on the video codec used) break these dependency chains, and then one can use techniques like temporal scalability to reduce the dependencies for some of the frames that follow.

When you lose a packet, the question isn’t only what to do with the current video frame and how to display it, but rather what is going to happen to future frames depending on the frame with the lost packet.

In the past, the focus was on displaying every bit that got decoded, which ended up with video played back with smears as well as greens and pinks.

Check it for yourself, with our most recent WebRTC fiddle around frame loss.

Today, we mostly not display frames until we have a clean enough bitstream, opting to freeze the video a bit or skip video frames than show something that isn’t accurate enough. With the advances in machine learning, they may change in the future.

PLC is great, but there’s a lot to be done to get back the lost packets as opposed to trying to make do with what we have. Next, we will see the additional techniques available to us.

RTX: Retransmissions

Here’s a simple mechanism (used everywhere) to deal with packet loss – retransmission.

In whatever protocol you use, make sure to either acknowledge receiving what is sent to you or NACKing (sending a negative acknowledgement) when not receiving what you should have received. This way, the sender can retransmit whatever was lost and you will have it readily available.

This works well if there’s enough time for another round trip of data until you must play it back. Or when the data can help you out in future decoding (think the dependency across frames in video codecs). It is why retransmissions don’t always work that well in WebRTC media correction – we’re dealing with real time and low latency.

Another variation of this in video streams is asking for a new I-frame. This way, the receiver can signal the sender to “reset” the video stream and start encoding it from scratch, which essentially means a request to break the dependency between the old frames and the new ones that should be sent after the packet loss.

RED: REDundancy Encoding

Retransmission means we overcome packet losses after the fact. But what if we could solve things without retransmissions? We can do that by sending the same packet more than once and be done with it.

Double or triple the bitstream by flooding it with the same information to add more robustness to the whole thing.

RED is exactly that. It concatenates older audio frames into fresh packets that are being sent, effectively doubling or tripling the packet size.

If a packet gets lost, the new frame it was meant to deliver will be found in one of the following packets that should be received.

Yes. it eats up our bandwidth budget, but in a video call where we send 1Mbps of video data or more, tripling the audio size from 40kbps to 90kbps might be a sacrifice worth making for cleaner audio.

FEC: Forward Error Correction

Redundancy encoding requires an additional 100% or more of bitrate. We can do better using other means, usually referred to as Forward Error Correction.

Mind you, redundancy encoding is just another type of forward error correction mechanism

With FEC, we are going to add more packets that can be used to restore other packets that are lost. The most common approach for FEC is by taking multiple packets, XORing them and sending the XORed result as an additional packet of data.

If one of the packets is lost, we can use the XORed packet to recreate the lost one.

There are other means of correction algorithms that are a wee bit more complex mathematically (google about Reed-Solomon if you’re interested), but the one used in WebRTC for this purpose is XOR.

FEC is still an expensive thing since it increases the bitrate considerably. Which is why it is used only sparingly:

  • When you know there’s going to be packet losses on the network
  • To protect only important video frames that many other frames are going to be dependent on
Making sense of WebRTC media correction

PLC, RTX, FEC, RED, …

How is each one signaled over the network? When would it make sense to use it? How does WebRTC implement it in the browser and what exactly can you expect out of it?

All that is mostly arcane knowledge. Something that is passed from one generation of WebRTC developers to another it seems.

Lucky for you, Philipp Hancke and myself are working on a new course – Higher Level WebRTC Protocols. In it, we are covering these specific topics as well as quite a few others in a level of detail that isn’t found anywhere else out there.

Most of the material is already written down. We just need to prettify it a bit and record it.

If you are interested in learning more about this, be sure to join our waiting list for once we launch the course

Join the course waiting list

The post WebRTC media resilience: the role FEC, RED, PLC, RTX and other acronyms play appeared first on BlogGeek.me.

ChatGPT meets WebRTC: What Generative AI means to Real Time Communications

bloggeek - Mon, 05/08/2023 - 13:00

ChatGPT is changing computing and as an extension how we interact with machines. Here’s how it is going to affect WebRTC.

ChatGPT became the service with the highest growth rate of any internet application, reaching 100 million active users within the first two months of its existence. A few are using it daily. Others are experimenting with it. Many have heard about it. All of us will be affected by it in one way or another.

I’ve been trying to figure out what exactly does a “ChatGPT WebRTC” duo means – or in other words – what does ChatGPT means for those of us working with and on WebRTC.

Here are my thoughts so far.

Table of contents Crash course on ChatGPT

Let’s start with a quick look at what ChatGPT really is (in layman terms, with a lot of hand waving, and probably more than a few mistakes along the way).

BI, AI and Generative AI

I’ll start with a few slides I cobbled up for a presentation I did for a group of friends who wanted to understand this.

ChatGPT is a product/service that makes use of machine learning. Machine learning is something that has been marketed a lot as AI – Artificial Intelligence. If you look at how this field has evolved, it would be something like the below:

We started with simple statistics – take a few numbers, sum them up, divide by their count and you get an average. You complicate that a bit with weighted average. Add a bit more statistics on top of it, collect more data points and cobble up a nice BI (Business Intelligence) system.

At some point, we started looking at deep learning:

Here, we train a model by using a lot of data points, to a point that the model can infer things about new data given to it. Things like “do you see a dog in this picture?” or “what is the text being said in this audio recording?”.

Here, a lot of 3 letter acronyms are used like HMM, ANN, CNN, RNN, GNN…

What deep learning did in the past decade or two was enable machines to describe things – be able to identify objects in images and videos, convert speech to text, etc.

It made it the ultimate classifier, improving the way we search and catalog things.

And then came a new field of solutions in the form of Generative AI. Here, machine learning is used to generate new data, as opposed to classifying existing data:

Here what we’re doing is creating a random input vector, pushing it into a generator model. The generator model creates a sample for us – something that *should* result in the type of thing we want created (say a picture of a dog). That sample that was generated is then passed to the “traditional” inference model that checks if this is indeed what we wanted to generate. If it isn’t, we iteratively try to fine tune it until we get to a result that is “real”.

This is time consuming and resource intensive – but it works rather well for many use cases (like some of the images on this site’s articles that are now generated with the help of Midjourney).

So…

  • We started with averages and statistics
  • Moved to “deep learning”, which is just hard for us to explain how the algorithms got to the results they did (it isn’t based on simple rules any longer)
  • And we then got to a point where AI generates new data
The stellar rise of ChatGPT

The thing is that all this thing I just explained wouldn’t be interesting without ChatGPT – a service that came to our lives only recently, becoming the hottest thing out there:

The Most Important Chart In 100 Years https://t.co/Ypcsqi0AWJ #AI #GPT #ChatGPT #technology @JohnNosta pic.twitter.com/QjMroVZ7cG

— Kyle Hailey (@kylelf_) February 16, 2023

ChatGPT is based on LLMs – Large Language Models – and it is fast becoming the hottest thing around. No other service grew as fast as ChatGPT, which is why every business in the world now is trying to figure out if and how ChatGPT will fit into their world and services.

Why ChatGPT and WebRTC are like oil and water

So it begged the question: what can you do with ChatGPT and WebRTC?

Problem is, ChatGPT and WebRTC are like oil and water – they don’t mix that well.

ChatGPT generates data whereas WebRTC enables people to communicate with each other. The “generation” part in WebRTC is taken care of by the humans that interact mostly with each other on it.

On one hand, this makes ChatGPT kinda useless for WebRTC – or at least not that obvious to use for it.

But on the other hand, if someone succeeds to crack this one up properly – he will have an innovative and unique thing.

What have people done with ChatGPT and WebRTC so far?

It is interesting to see what people and companies have done with ChatGPT and WebRTC in the last couple of months. Here are a few things that I’ve noticed:

In LiveKit’s and Twilio’s examples, the concept is to use the audio source from humans as part of prompts for ChatGPT after converting them using Speech to Text and then converting the ChatGPT response using Text to Speech and pass it back to the humans in the conversation.

Broadening the scope: Generative AI

ChatGPT is one of many generative AI services. Its focus is on text. Other generative AI solutions deal with images or sound or video or practically any other data that needs to be generated.

I have been using MidJourney for the past several months to help me with the creation of many images in this blog.

Today it seems that in any field where new data or information needs to be created, a generative AI algorithm can be a good place to investigate. And in marketing-speak – AI is overused and a new overhyped term was needed to explain what innovation and cutting edge is – so the word “generative” was added to AI for that purpose.

Fitting Generative AI to the world of RTC

How does one go about connecting generative AI technologies with communications then? The answer to this question isn’t an obvious or simple one. From what I’ve seen, there are 3 main areas where you can make use of generative AI with WebRTC (or just RTC):

  1. Conversations and bots
  2. Media compression
  3. Media processing

Here’s what it means

Conversations and bots

In this area, we either have a conversation with a bot or have a bot “eavesdrop” on a conversation.

The LiveKit and Twilio examples earlier are about striking a conversation with a bot – much like how you’d use ChatGPT’s prompts.

A bot eavesdropping to a conversation can offer assistance throughout a meeting or after the meeting –

  • It can try to capture to essence of a session, turning it into a summary
  • Help with note taking and writing down action items
  • Figure out additional resources to share during the conversation – such as knowledge base items that reflect what a customer is complaining about to a call center agent

As I stated above, this has little to do with WebRTC itself – it takes place elsewhere in the pipeline; and to me, this is mostly an application capability.

Media compression

An interesting domain where AI is starting to be investigated and used is media compression. I’ve written about Lyra, Google’s AI enabled speech codec in the past. Lyra makes assumptions on how human speech sounds and behaves in order to send less data over the network (effectively compressing it) and letting the receiving end figure out and fill out the gaps using machine learning. Can this approach be seen as a case of generative AI? Maybe

Would investigating such approaches where the speakers are known to better compress their audio and even video makes sense?

How about the whole super resolution angle? Where you send video at resolutions of WVGA or 720p and then having the decoder scale them up to 1080p or 4K, losing little in the process. We’re generating data out of thin air, though probably not in the “classic” sense of generative AI.

I’d also argue that if you know the initial raw content was generated using generative AI, there might be a better way in which the data can be compressed and sent at lower bitrates. Is that something worth pursuing or investigating? I don’t know.

Media processing

Similar to how we can have AI based codecs such as Lyra, we can also use AI algorithms to improve quality – better packet loss concealment that learns the speech patterns in real time and then mimics them when there’s packet loss. This is what Google is doing with their WaveNetEQ, something I mentioned in my WebRTC unbundling article from 2020.

Here again, the main question is how much of this is generative AI versus simply AI – and does that even matter?

Is the future of WebRTC generative (AI)?

ChatGTP and other generative AI services are growing and evolving rapidly. While WebRTC isn’t directly linked to this trend, it certainly is affected by it:

  • Applications will need to figure out how (and why) to incorporate generative AI with WebRTC as part of what they offer
  • Algorithms and codecs in WebRTC are evolving with the assistance of AI (generative or otherwise)

Like any other person and business out there, you too should see if and how does generative AI affects your own plans.

The post ChatGPT meets WebRTC: What Generative AI means to Real Time Communications appeared first on BlogGeek.me.

RTC@Scale 2023 – an event summary

bloggeek - Mon, 05/01/2023 - 13:00

RTC@Scale is Facebook’s virtual WebRTC event, covering current and future topics. Here’s the summary for RTC@Scale 2023 so you can pick and choose the relevant ones for you.

WebRTC Insights is a subscription service I have been running with Philipp Hancke for the past two years. The purpose of it is to make it easier for developers to get a grip of WebRTC and all of the changes happening in the code and browsers – to keep you up to date so you can focus on what you need to do best – build awesome applications.

We got into a kind of a flow:

  • Once every two weeks we finalize and publish a newsletter issue
  • Once a month we record a video summarizing libwebrtc release notes (older ones can be found on this YouTube playlist)

Oh – and we’re covering important events somewhat separately. Last month, a week after Meta’s RTC@Scale event took place, Philipp sat down and wrote a lengthy summary of the key takeaways from all the sessions, which we distributed to our WebRTC Insights subscribers.

As a community service (and a kind of a promotion for WebRTC Insights), we are now opening it up to everyone in this article

Table of contents Why this issue?

Meta ran their rtc@scale event again. Last year was a blast and we were looking forward to this one. The technical content was pretty good again. As last year, our focus for this summary is what we learned or what it means for folks developing with WebRTC. Once again, the majority of speakers were from Meta. At times they crossed the line of “is this generally useful” to the realm of “Meta specific” but most of the talks still provide value.

Compared to last year there were almost no “work with me” pitches (with one exception).

It is surprising how often Meta says “WebRTC” or “Google” (oh and Amazon as well).

Writing up these notes took a considerable amount of time (again) but we learned a ton and will keep referencing these talks in the future so it was totally worth it (again). You can find the list of speakers and topics on the conference website, the seven hours of raw video here (which includes the speaker introductions) or you just scroll down below for our summary.

SESSION 1 Rish Tandon / Meta – Meta RTC State of the Union

Duration: 13:50

Watch if you

  • watch nothing else and don’t want to dive into specific areas right away. It contains a ton of insights, product features and motivation for their technical decisions

Key insights:

  • Every conference needs a keynote!
  • 300 million daily calls on Messenger alone is huge
    • The Instagram numbers on top of that remain unclear. Huge but not big enough to brag about?
    • Meta seems to have fared well and has kept their usage numbers up after the end of the pandemic, despite the general downward/flat trend we see for WebRTC in the browser
    • 2022 being their largest-ever year in call volume this suggests eating someone else’s market share (Google Duo possibly?)
  • Traditionally RTC at Meta was mobile-first with 1-1 being the dominant use-case. This is changing with Whatsapp supporting 32 users (because FaceTime does? Larger calls are in the making), an improved desktop application experience with a paginated 5×5 grid. Avatars are not dead yet btw
  • Meta is building their unified “MetaRTC” stack on top of WebRTC and openly talks about it. But it is a very small piece in the architecture diagram. Whatsapp remains a separate stack. RSYS is their cross-platform library for all the things on top of the core functionality provided by libWebRTC
  • The paginated 4×4 grid demo is impressive
    • Pagination is a hard problem to solve since you need to change a lot of video stream subscriptions at the same time which, with simulcast, means a lot of large keyframes (thankfully only small resolution ones for this grid size)
    • You can see this as the video becomes visible from left to right at 7:19
    • Getting this right is tough, imagine how annoying it would be if the videos showed up in a random order…
  • End-to-end encryption is a key principle for Meta
    • This rules out an MCU as part of the architecture
    • Meta is clearly betting on the simulcast (with temporal layers), selective forwarding and dominant speaker identification for audio (with “Last-N” as described by Jitsi in 2015)
  • Big reliability improvements by defining a metric “%BAD” and then improving that
    • The components of that metric shown at 9:00 are interesting
    • In particular “last min quality reg” which probably measures if there was a quality issue that caused the user to hang up:
  • For mobile apps a grid layout that scales nicely with the number of participants is key to the experience. One of the interesting points made is that the Web version actually uses WASM instead of the browser’s native video elements
  • The “Metaverse” is only mentioned as part of the outlook. It drives screen sharing experiences which need to work with a tight latency budget of 80ms similar to game streaming
Sriram Srinivasan / Meta – Real-time audio at Meta Scale

Duration: 19:30

Watch if you are

  • An engineer working on audio. Audio reliability remains one of the most challenging problems with very direct impact to the user experience

Key insights:

  • Audio in RTC has evolved over the years:
    • We moved from wired-network audio-only calls to large multi party calls on mobile devices
    • Our quality expectations (when dogfooding) have become much higher in the last two decades
    • The Metaverse introduces new requirements which will keep us busy for the next one
  • Great examples of the key problems in audio reliability starting at 2:30
    • Participants can’t hear audio
    • Participants hear echo
    • Background noise
    • Voice breakup (due to packet loss)
    • Excessive latency (leading to talking over each other)
  • On the overview slide at 4:20 we have been working on the essentials in WebRTC for a decade, with Opus thankfully enabling the high-end quality
    • This is hard because of the diversity in devices and acoustic conditions (as well as lighting for video). This is why we still have vendors shipping their own devices (Meta discontinued their Portal device though)
    • Humans have very little tolerance for audio distortions
  • The basic audio processing pipeline diagram at 5:50 and gets walked through until 11:00
    • Acknowledges that the pipeline is built on libWebRTC and then says it was a good starting point back in the day. The opinion at Google seems to be that the libWebRTC device management is very rudimentary and one should adopt the Chrome implementations. This is something where Google was doing better than messenger with duo. They are not going to give that away for free to their nemesis
    • While there have been advances in AEC recently due to deep neural networks, this is a challenge on mobile devices. The solution is a “progressive enhancement” which enables more powerful features on high-end devices. On the web platform it is hard to decide this upfront as we can’t measure a lot due to fingerprinting concerns. You heard the term “progressive RTC application” or PRA here on WebRTC Insights first (but it is terrible, isn’t it?)
    • For noise suppression it is important to let the users decide. If you want to show your cute baby to a friend then filtering out the cries is not appropriate. Baseline should be filtering stationary noise (fan, air condition)
    • Auto gain control is important since the audio level gets taken into account by SFUs to identify the dominant speaker
    • Low-bitrate encoding is important in the market with the largest growth and terrible networks and low-end devices: India. We have seen this before from Google Duo
  • Audio device management (capture and rendering) starts at 11:00 and is platform-dependent
    • This is hard since it cannot be tested at scale but is device specific. So we need at least the right telemetry to identify which devices have issues and how often
    • End-call feedback which gets more specific for poor calls with a number of buckets. This is likely correlated with telemetry and the “last minute quality regression” metric
    • While all of this is great it is something Meta is keeping to themselves. After all, if Google made them spend the money why would they not make Google spend the money to compete?
    • This goes to show how players other than Google are also to blame for the current state of WebRTC (see Google isn’t your outsourcing vendor)
  • Break-down of “no-audio” into more specific cases at 13:00
    • The approach is to define, measure, fix which drives the error rate down
    • This is where WebRTC in the browser has disadvantages since we rarely get the details of the errors exposed to Javascript hence we need to rely on Google to identify and fix those problems
    • Speaking when muted and showing a notification is a common and effective UX fix
    • Good war stories, including the obligatory Bluetooth issues and interaction between phone calls and microphone access
  • Outlook at 17:40 about the Metaverse
    • Our tolerance for audio issues in a video call is higher because we have gotten used to the problems
    • Techniques like speaker detection don’t work in this setting anymore
Niklas Enbom / Meta – AV1 for Meta RTC

Duration: 18:00

Watch if you are

  • An engineer working on video, the “system integrators” perspective makes this highly valuable and applicable with lots of data and measurements
  • A product owner interested in how much money AV1 could save you

Key insights:

  • Human perception is often the best tool to measure video quality during development
  • AV1 is adopted by the streaming industry (including Meta who wrote a great blog post). Now is the time to work on RTC adaptation which lags behind:
  • AV1 is the next step after H.264 (a 20 year old codec) for most deployments (except Google who went after VP9 with quite some success)
  • Measurements starting at 4:20
    • The “BD-Rate” described the bitrate difference between OpenH264 and libaom implementations, showing a 30-40% lower bitrate for the same quality. Or a considerably higher quality for the same bitrate (but that is harder to express in the diagram as the Y-axis is in decibels)
    • 20% of Meta’s video calls end up with less than 200kbps (globally which includes India). AV1 can deliver a lot more quality in that area
    • The second diagram at 5:20 is about screen sharing which is becoming a more important use-case for Meta. Quality gains are even more important in this area which deals with high-resolution content and the bitrate difference for the same quality is up to 80%. AV1 screen content coding tools help address the special-ness of this use-case too
    • A high resolution screen sharing (4k-ish) diagram is at 6:00 and shows an ever more massive difference followed by a great visual demo. Sadly the libaom source code shown is blurry in both examples as a result of H.264 encoding) but you can see a difference
  • Starting at 6:30 we are getting into the advantages of AV1 for integrators or SFU developers:
    • Reference Picture Resampling removes the need for keyframes when switching resolution. This is important when switching the resolution down due to bandwidth estimates dropping – receiving a large key frame is not desirable at all in that situation. Measuring the amount of key frames due to resolution changes is a good metric to track (in the SFU) – quoted as 1.5 per minute
    • AV1 offers temporal, spatial and quality SVC
    • Meta currently uses Simulcast (with H.264) and requires (another good metric to track) 4 keyframes per minute (presumably that means when switching up)
  • Starting at 8:00 Niklas Enbom talks about the AV1 challenges they encountered:
    • AV1 can also provide significant cost savings (the exact split between cost savings and quality improvements is what you will end up fighting about internally)
    • Meta approached AV1 by doing an “offline evaluation” first, looking at what they could gain theoretically and then proceeding with a limited roll-out on desktop platforms which validated the evaluation results
    • Rolling this out to the diverse user base is a big challenge of its own, even if the results are fantastic
    • libAOM increases the binary size by 1MB which is a problem because users hate large apps (and yet, AV1 would save a lot more even on the first call) which becomes a political fight (we never heard about that from Google including it in libWebRTC and Chrome). It gets dynamically downloaded for that reason which also allows deciding whether it is really needed on this device (on low-end devices you don’t need to bother with AV1)
    • At 11:40 “Talk time” is the key metric for Meta/Messenger and AV1 means at least 3x CPU usage (5x if you go for the best settings). This creates a goal conflict between battery (which lowers the metric) and increased quality (which increases it). More CPU does not mean more power usage however, the slide at 13:00 talks about measuring that and shows results with a single-digit percentage increase in power usage. This can be reduced further with some tweaks and using AV1 for low-bitrate scenarios and using it only when the battery level is high enough. WebRTC is getting support (in the API) for doing this without needing to resort to SDP manipulation, this is a good example of the use-case (which is being debated in the spec pull request)
    • At 15:30 we get into a discussion about bitrate control, i.e. how quickly and well will the encoder produce what you asked for as shown in the slide:
  • Blue is the target bitrate, purple the actual bitrate and it is higher than the the bitrate for quite a while! Getting rate control on par with their custom H.264 one was a lot of work (due to Meta’s H.264 rate controller being quite tuned) and will hopefully result in an upstream contribution to libaom! The “laddering” of resolution and framerate depending on the target bitrate is an area that needs improvements as well, we have seen Google just ship some improvements in Chrome 113. The “field quality metrics” (i.e. results of getStats such as qpSum/framesEncoded) are codec-specific so cannot be used to compare between codecs which is an unsolved problem
  • At 17:00 we get into the description of the current state and the outlook:
    • This is being rolled out currently. Mobile support will come later and probably take the whole next year
    • VR and game streaming are obvious use-cases with more control over devices and encoders
    • VVC (the next version of HEVC) and AV2 are on the horizon, but only for the streaming industry and RTC lags behind by several years usually.
    • H.264 (called “the G.711 of video” during Q&A) is not going to go away anytime soon so one needs to invest in dealing with multiple codecs
Jonathan Christensen / SMPL – Keeping it Simple

Duration: 18:40

Watch if you are

  • A product manager who wants to understand the history of the industry, what products need to be successful and where we might be going
  • Interested in how you can spend two decades in the RTC space without getting bored

Key insights:

  • Great overview of the history of use-cases and how certain innovations were successfully implemented by products that shaped the industry by hitting the sweet spot of “uncommon utility” and “global usability”
  • At 3:00: When ICQ shipped in 1996 it popularized the concept of “presence”, i.e. showing a roster of people who are online in a centralized service.
  • At 4:40: next came MSN Messenger which did what ICQ did but got bundled with Windows which meant massive distribution. It also introduced free voice calling between users on the network in 2002. Without solving the NAT traversal issue which meant 85% of calls failed. Yep, that means not using a STUN server in WebRTC terms (but nowadays you would go 99.9% failure rate)
  • At 7:00: While MSN was arguing who was going to pay the cost (of STUN? Not even TURN yet!) Skype showed up in 2003 and provided the same utility of “free voice calls between users” but they solved NAT traversal using P2P so had a 95% call success rate. They monetized it by charging 0.02$ per minute for phone calls and became a verb by being “Internet Telephony that just works”
  • The advent of the first iPhone in 2007 led to the first mobile VoIP application, Skype for the iPhone which became the cash cow for Skype. The peer-to-peer model did however not work great there as it killed the battery quickly
  • At 9:30: WhatsApp entered the scene in 2009. It provided less utility than Skype (no voice or video calls, just text messaging) and yet introduced the important concept of leveraging the address book on the phone and using the phone number as an identifier which was truly novel back then!
  • When Whatsapp later added voice (not using WebRTC) they took over being “Internet Telephony that just works”
  • At 11:40: Zoom… which became a verb during the 2020 pandemic. The utility it provided was a friction free model
    • We disagree here, downloading the Zoom client has always been something WebRTC avoided, just as “going to a website” had the same frictionless-ness we saw with the earlyWebRTC applications like talky.io and the other we have forgotten about
    • What it really brought was a freemium business model to video calling that was easy to use freely and not just for a trial period
  • At 12:40: These slides ask you to think about what uncommon utility is provided by your product or project (hint: WebRTC commoditized RTC) and whether normal people will understand it (as the pandemic has shown, normal people did not understand video conferencing). What follows is a bit of a sales pitch for SMPL which is still great to listen to, small teams of RTC industry veterans would not work on boring stuff
  • At 15:00: Outlook into what is next followed by predictions. Spatial audio is believed to be one of the things but we heard that a lot over the last decade (or two if you have been around for long enough; Google is shipping  this feature to some Pixel phones, getting the name wrong), as is lossless codecs for screen sharing and Virtual Reality
  • We can easily agree that the prediction that “users will continue to win” (in WebRTC we do this every time a Googler improves libWebRTC) but whether there will be “new stars” in RTC remains to be seen
First Q&A

Duration: 25:00

Watch if:

  • You found the talks this relates to interesting and want more details

Key points:

  • We probably needs something like Simulcast but for audio
  • H.264 is becoming the G.711 of video. Some advice on what metrics to measure for video (freezes, resolution, framerate and qp are available through getStats) 
  • Multi-codec simulcast is an interesting use-case
  • The notion that “RTC is good enough” is indeed not great. WebRTC suffers in particular from it
SESSION 2 Sandhya Rao / Microsoft – Top considerations for integrating RTC with Android appliances

Duration: 21:30

Watch if you are

  • A product manager in the RTC space, even if not interested in Android

Key points:

  • This is mostly a shopping list of some of the sections you’ll have in a requirements document. Make sure to check if there’s anything here you’d like to add to your own set of requirements
  • Devices will be running Android OS more often than not. If we had to plot how we got there: Proprietary RTOS → Vxworks → Embedded Linux → Android
  • Some form factors discussed:
    • Hardware deskphone
    • Companion voice assistant
    • Canvas/large tablet personal device
    • always-on ambient screen
    • shared device for cross collaboration (=room system with touchscreen)
  • Things to think through: user experience, hardware/OS, maintenance+support
  • User experience
    • How does the device give a better experience than a desktop or a mobile device?
    • What are the top workloads for this device? focus only on them (make it top 3-5 at most)
  • Hardware & OS
    • Chipset selection is important. You’ll fall into the quality vs cost problem
    • Decide where you want to cut corners and where not to compromise
    • Understand which features take up which resources (memory, CPU, GPU)
    • What’s the lifecycle/lifetime of this device? (5+ years)
  • Maintenance & support
    • Environment of where the device is placed
    • Can you remotely access the device to troubleshoot?
    • Security & authentication aspects
    • Ongoing monitoring
Yun Zhang & Bin Liu / Meta – Scaling for large group calls

Duration: 19:00

Watch if you are

  • A developer dealing with group calling and SFUs, covers both audio and video. Some of it describes the very specific problems Meta ran into scaling the group size but interesting nonetheless

Key points:

  • The audio part of the talk starts at 2:00 with a retrospective slide on how audio was done at Meta for “small group calls”. For these it is sufficient to rely on audio being relatively little traffic compared to video, DTX reducing the amount of packets greatly as well as lots of people being muted. As conference sizes grow larger this does not scale, even forwarding the DTX silence indicator packets every 400ms could lead to a significant amount of packets. To solve this two ideas are used: “top-N” and “audio capping”
  • The first describes forwarding the “top-N” active audio streams. This is described in detail in the Jitsi paper from 2015. The slideshow uses the same mechanism with audio levels as RTP header extension (the use of that extension was confirmed in the Q&A; the algorithm itself can be tweaked on the server). The dominant speaker decision also affects Bandwidth allocation for video:
  • The second idea is “audio capping” which does not forward audio from anyone but the last couple of dominant speakers. Google Meet does this by rewriting audio to three synchronization sources which avoids some of the PLC/CNG issues described on one of the slides. An interesting point here is at 7:50 where it says “Rewrite the Sequence number in the RTP header, inject custom header to inform dropping start/end”. Google Meet uses the contributing source here and one might use the RTP marker bit to signal the beginning or end of a “talk spurt” as described in RFC 3551
  • The results from applying these techniques are shown at 8:40 – 38% reduction of traffic in a 20 person call, 63% for a 50 person call and less congestion from server to the client
  • The video part of the talk starts at 9:30 establishing some of the terminology used. While “MWS” or “multiway server” is specific to Meta we think the term “BWA” or “bandwidth allocation” to describe how the estimated bandwidth gets distributed among streams sent from server to client is something we should talk more about:
    • Capping the uplink is not part of BWA (IMO) but if nobody wants to receive 720p video from you, then you should not bother encoding or sending it and we need ways for the server to signal this to the client
  • The slide at 10:20 shows where this is coming from, Meta’s transition from “small group calls” to large ones. This is a bit more involved than saying “we support 50 users now”. Given this mention “lowest common denominator” makes us wonder if simulcast was used by the small calls even because it solves this problem
  • Video oscillation, i.e. how and when to switch between layers which needs to be done “intelligently”
  • Similarly, bandwidth allocation needs to do something smarter than splitting the bandwidth budget equally. Also there are bandwidth situations where you can not send video and need to degrade to sending only one and eventually none at all. Servers should avoid congesting the downstream link just as clients do BWE to avoid congesting the upstream
  • The slide at 13:00 shows the solution to this problem. Simulcast with temporal layers and “video pause”:
    • Simulcast with temporal layers provides (number of spatial layers) * (number of temporal layers) video layers with different bitrates that the server can pick from according to the bandwidth allocation
    • “Video pause” is a component of what Jitsi called “Video pausing” in the  “Last-N” paper
  • It is a bit unclear what module the “PE-BWA” replaces but taking into account use-cases like grid-view, pinned-user or thumbnail makes a lot of sense
  • Likewise, “Stream Subscription Manager” and “Video Forward Manager” are only meaningful inside Meta since we cannot use it. Maximizing for a “stable” experience rather than spending the whole budget makes sense. So do the techniques to control the downstream bandwidth used, picking the right spatial layer, dropping temporal layers and finally dropping “uninteresting” streams
  • At 18:10 we get into the results for the video improvements:
    • 51% less video quality oscillation (which suggests the previous strategy was pretty bad) and 20% less freezes
    • 34% overall video quality improvement, 62% improvement for the dominant speaker (in use-cases where it is being used; this may include allocating more bandwidth to the most recent dominant speakers)
  • At 18:30 comes the outlook:
    • Dynamic video layers structure sounds like informing the server about the displayed resolution on the client and letting it make smart decisions based on that
Saish Gersappa & Nitin Khandelwa / Whatsapp – Relay Infrastructure

Duration: 15:50

Watch if you are

  • A developer dealing with group calling and SFUs. Being Whatsapp this is a bit more distant from WebRTC (as well as the rest of the “unified” Meta stack?) but still has a lot of great points

Key points:

  • After an introduction of Whatsapp principles (and a number… billion of hours per week) for the first three minutes the basic “relay server” is described which is a media server that is involved for the whole duration of the call (i.e. there is no peer-to-peer offload)
  • The conversation needs to feel natural and network latency and packet loss create problems in this area. This gets addressed by using a global overlay network and routing via those servers. The relay servers are not run in the “core” data centers but at the “points of presence” (thousands) that are closer to the user. This is a very common strategy we have always recommended but the number and geographic distribution of the Meta PoPs makes this impressive. To reach the PoPs the traffic must cross the “public internet” where packet loss happens
  • At 5:30 this gets discussed. The preventive way to avoid packet loss is to do bandwidth estimation and avoid congesting the network. Caching media packets on the server and resending them from there is a very common method as well, typically called a NACK cache. It does not sound like FEC/RED is being used or at least not mentioned.
  • At 6:30 we go into device resource usage. An SFU with dominant speaker identification is used to reduce the amount of audio and video streams as well as limit the number of packets that need to be encrypted and decrypted. All of this costs CPU which means battery life and you don’t want to drain the battery
  • For determining the dominant speaker the server is using the “audio volume” on the client. Which means the ssrc-audio-level based variant of the original dominant speaker identification paper done by the Jitsi team.
  • Next at 8:40 comes a description of how simulcast (with two streams) is used to avoid reducing the call quality to the lowest common denominator. We wonder if this also uses temporal scalability, Messenger does but Whatsapp still seems to use their own stack
  • Reliability is the topic of the section starting at 10:40 with a particular focus on reliability in cases of maintenance. The Whatsapp SFU seems to be highly clustered with many independent nodes (which limits the blast radius); from the Q&A later it does not sound like it is a cascading SFU. Moving calls between nodes in a seamless way is pretty tricky, for WebRTC one would need to both get and set the SRTP state including rollover count (which is not possible in libSRTP as far as we know). There are two types of state that need to be taken into account:
    • Critical information like “who is in the call”
    • Temporary information like the current bandwidth estimate which constantly changes and is easy to recover
  • At 12:40 we have a description of handling extreme load spikes… like calling all your family and friends and wishing them a happy new year (thankfully this is spread over 24 hours!). Servers can throttle things like the bandwidth estimate in such cases in order to limit the load (this can be done e.g. when reaching certain CPU thresholds). Prioritizing ongoing calls and not accepting new calls is common practice, prioritizing 1:1 calls over multi party calls is acceptable for Whatsapp as a product but would not be acceptable for an enterprise product where meetings are the default mode of operation
  • Describing dominant speaker identification and simulcast as “novel approaches” is… not quite novel
Second Q&A

Duration: 28:00

Watch if:

  • You found the talks this relates to interesting and want more details

Key points:

  • There were a lot more questions and it felt more dynamic than the first Q&A
  • Maximizing video experience for a stable and smooth experience (e.g. less layer switches) often works better than chasing the highest bitrate!
  • Good questions and answers on audio levels, speaker detection and how BWE works and is used by the server
  • It does sound like WhatsApp still refuses to do DTLS-SRTP…
SESSION 3 Vinay Mathew / Dolby – Building a flexible, scalable infrastructure to test dolby.io at scale

Duration: 22:00

Watch if you are

  • A software developer or QA engineer working on RTC products

Key points:

  • Dolby.io today has the following requirements/limits:
    • Today: 50 participants in a group call; 100k viewers
    • Target: 1M viewers; up to 25 concurrent live streams; live performance streaming
  • Scale requires better testing strategies
  • For scale testing, Dolby split the functionality into 6 different areas:
    • Authentication and signaling establishment – how many can be handled per second (rate), geo and across geo
    • Call signaling performance – maximum number of join conference requests that can be handled per second
    • Media distribution performance – how does the backend handle the different media loads, looking at media metrics on client and server side
    • Load distribution validation – how does the backend scale up and down under different load sizes and changes
    • Scenario based mixing performance – focus on recording and streaming to huge audiences (a specific component of their platform)
    • Metrics collection from both server and client side – holistic collection of metrics and use a baseline for performance metrics out of it
  • Each component has its own set of metrics and rates that are measured and optimized separately
  • Use a mix of testRTC and in-house tools/scripts on AWS EC2 (Python based, using aiortc; locust for jobs distribution)
  • Homegrown tools means they usually overprovision EC2 instances for their tests. Something they want to address moving forward
  • Dolby decided not to use testRTC for scale testing. Partly due to cost issues and the need to support native clients
  • The new scale testing architecture for Dolby:
  • Mix of static and on demand EC2 instances, based on size of the test
  • Decided on a YAML based syntax to define the scenario
  • Scenarios are kept simple, and the scripting language used is proprietary and as a domain specific language
  • This looks like the minimal applicable architecture for stress testing WebRTC applications. If you keep your requirements of testing limited, then this approach can work really well
Jay Kuo / USC – blind quality assessment of real-time visual communications

Duration: 18:15

Watch if you

  • Want to learn how to develop ways to measure video quality in an RTC scenario

Key points:

  • We struggled a bit with this one as it is a bit “academic” (with an academic sales pitch even!) and not directly applicable. However, this is a very hard problem that needs this kind of research first
  • Video quality assessment typically requires both the sender side representation of the video and the receiver side. Not requiring the sender side video is called “blind quality assessment” and is what we need for applying it to RTC or conversational video
  • Ideally we want a number from such a method (called BVQA around 4:00) that we can include in the getStats output. The challenge here is doing this with low latency and efficiently in particular for Meta’s requirements to run on mobile phones
  • We do wonder how background blur affects this kind of measurement. Is the video codec simply bad for those areas or is this intentional…
Wurzel Parsons-Keir / Meta – Beware the bot armies! Protecting RTC user experiences using simulations

Duration: 25:00

Watch if you are

  • Interested in a better way to test than asking all your coworkers to join you for a test (we all have done that many times)
  • A developer that wants to test and validate changes that might affect media quality (such as bandwidth estimation)
  • Want to learn how to simulate a ten minute call in just one minute
  • Finally want to hear a good recruiting pitch for the team (the only one this year)
  • Yes, Philipp really liked this one. Wurzel’s trick of making his name more appealing to Germans works so please bear with him.

Key takeaways:

  • This is a long talk but totally worth the time
  • At 3:00 some great arguments for investing in developer experience and simulation, mainly by shifting the cost left from “test in production” and providing faster feedback cycles. It also enables building and evaluating complex features like “Grid View and Pagination” (which we saw during the keynote) much faster
  • After laying out the goals we jump to the problem at 6:00. Experiments in the field take time and pose a great risk. Having a way to test a change in a variety of scenarios, conditions and configuration (but how representative are the ones you have?) shortens the feedback cycle and reduces the risk
  • At 7:30 we get a good overview of what gets tested in the system and how. libWebRTC is just a small block here (but a complex one) followed by the introduction of “Newton” which is the framework Meta developed for deterministic and faster than real-time testing. A lot of events in WebRTC are driven by periodic events, such as a camera delivering frames at 30 frames per second, RTCP being sent at certain intervals, networks having a certain bits-per-second constraints and so on
  • At 9:20 we start with a “normal RTC test”, two clients and a simulated network. You want to introduce random variations for realism but make those reproducible. The common approach for that is to seed the random generator, log the seed and allow feeding it in as a parameter to reproduce
  • The solution to the problem of clocks is sadly not to send a probe into the event horizon of a black hole and have physics deal with making it look faster on the outside. Instead, a simulated clock and task queue is used. Those are again very libWebRTC specific terms, it provides a “fake clock” which is mainly used for unit tests. Newton extends this to end-to-end tests, the secret sauce here is how to tell the simulated network (assuming it is an external component and not one simulated by libWebRTC too) about those clocks as well
  • After that (around 10:30) it is a matter of providing a great developer experience for this by providing scripts to run thousands of calls, aggregate the results and group the logging for these. This allows judging both how this affects averages as well identifying cases (or rather seed values!) where this degraded the experience
  • At 12:00 we get into the second big testing system built which is called “Callagen” (such pun!) which is basically a large scale bot infrastructure that operates in real-time on real networks. The system sounds similar to what Tsahi built with testRTC in many ways as well as what Dolby talked about. Being Meta they need to deal with physical phones in hardware labs. One of the advantages of this is that it captures both sender input video as well as receiver output video, enabling traditional non-blind quality comparisons
  • Developer experience is key here, you want to build a system that developers actually use. A screenshot is shown at 14:40. We wonder what the “event types” are. As suggested by the Dolby talk there is a limited set of “words” in a “domain specific language” (DSL) to describe the actions and events. Agreeing on those would even make cross-service comparisons more realistic (as we have seen in the case of Zoom-vs-Agora this sometimes evolves into a mud fight) and might lead to agreeing on a set of commonly accepted baseline requirements for how a media engine should react to network conditions
  • The section starting at 16:00 is about how this applies to… doing RTC testing @scale at Meta. It extends the approach we have seen in the slides before and again reminds us of the Dolby mention of a DSL. As shown around 17:15 the “interfaces” for that are appium scripts for native apps or python-puppeteer ones for web clients (we are glad web clients are tested by Meta despite being a niche for them!)
  • At 17:40 the challenge of ensuring test configurations are representative. This is a tough problem and requires putting numbers on all your features so you can track changes. And some changes only affect the ratios in ways that… don’t show up until your product gets used by hundreds of millions of users in a production scenario. Newton reduces the risk here by validating with a statistically relevant number of randomized tests at least which increases the organizational confidence. Over time it also creates a feedback loop of how realistic the scenarios you test are. Compared to Google, Meta is in a pretty good position here as they only need to deal with a single organization doing product changes which might affect metrics rather than “everyone” using WebRTC in Chrome
  • Some example use-cases are given at 19:15 that this kind of work enables. Migrating strategies between “small calls” and “large calls” is tricky as some metrics will change. Getting insight into which ones and whether those changes are acceptable (while retaining the metrics for “small ones”) is crucial for migrations
    • Even solving the seemingly “can someone join me on a call” problem provides a ton of value to developers
    • The value of enabling changes to complex issues such as anything related to codecs cannot be underestimated
  • Callagen running a lot of simulations on appium also has the unexpected side-effect of exposing deadlocks earlier which is a clear win in terms of shifting the cost of such a bug “left” and providing a reproduction and validation of fixes
  • Source-code bisect, presented at 21:00, is the native libWebRTC equivalent of Chromium’s bisect-build.py together with bisect-variations.py. Instead of writing a jsfiddle, one writes a “sim plan”. And it works “at scale” and allows observing effects like a 2% decrease in some metric. libWebRTC has similar capabilities of performance monitoring to identify perf regressions that run in Google’s infrastructure but that is not being talked about much by Google sadly
  • A summary is provided at 23:00 and there is indeed a ton to be learned from this talk. Testing is important and crucial for driving changes in complex systems such as WebRTC. Having proof that this kind of testing provides value makes it easier to argue for it and it can even identify corner cases
  • At 24:00 there is a “how to do it yourself” slide which we very much appreciate from a “what can WE learn from this”. While some of it seems like generally applicable to testing any system, thinking about the RTC angle is useful and the talk gave some great examples. That small, take baby steps. They will pay off in the long run (and for “just” a year of effort the progress seems remarkable)
  • There is a special guest joining at the end!
Sid Rao / Amazon – Using Machine learning to enrich and analyze real-time communications

Duration: 17:45

Watch if you are

  • A developer Interested in audio quality
  • A product manager that wants to see a competitor’s demo

Key points:

  • This talk is a bit sales-y for the Chime SDK but totally worth it. As a trigger warning, “SIP” gets mentioned. This covers three (and a half) use cases:
    • Packet loss concealment which improves the opus codec considerably
    • Deriving insights (and value) from sessions, with a focus on 1:1 use-cases such as contact centers or sales calls
    • Identifying multiple speakers from the same microphone (which is not a full-blown use-case but still very interesting)
    • Speech super resolution
  • Packet loss concealments start at 3:40. It is describing how Opus as a codec is tackling the improvements that deep neural networks can offer. Much of is also described in the Amazon blog post and we describe our take on it in WebRTC Insights #63. This is close to home for Philipp obviously:
    • RFC 2198 provides audio redundancy for WebRTC. It was a hell of a fight to get that capability back in WebRTC but it was clear this had some drawbacks. While it can improve quality significantly It cannot address bigger problems such as burst loss effectively
    • Sending redundant data only when there is voice activity is a great idea. However, libWebRTC has a weird connection between VAD and the RTP marker bit and fixing this caused a very nasty regression for Amazon Chime (in contact centers?) which was only noticed once this hit Chrome Stable. This remains unsolved, as well as easy access to the VAD information in APIs such as Insertable Streams that can be used for encoding RED using Javascript
    • It is not clear how sending redundant audio which are part of the same UDP packet is making the WiFi congestion problem worse at 4:50 (audio NACK would resend packets in contrast)
    • The actual presentation of DRED starts at 5:20 and has a great demo. What the demo does not show is that the magic is how little bitrate is used compared to just sending x10 the amount of data. Which is the true magic of DRED. Whether it is worth it remains to be seen. Applying it to the browser may be hard due to the lack of APIs (we still lack an API to control FEC bandwidth or percentage) but if the browser can decode DRED sent by a server (from Amazon) thanks to the magic tricks in the wire format that would be a great win already (for Amazon but maybe for others as well so we are approving this)
  • Deriving insights starts at 9:15 and is great at motivating why 1:1 calls, while considered boring by developers, are still very relevant to users:
    • Call centers are a bit special though since they deal with “frequently asked questions” and provide guidance on those. Leveraging AI to automate some of this is the next step in customer support after “playbooks” with predefined responses
    • Transcribing the incoming audio to identify the topic and the actual question does make the call center agent more productive (or reduces the value of a highly skilled customer support agent) with clear metrics such as average call handle time while improving customer satisfaction which is a win-win situation for both sides (and Amazon Chime enabling this value)
  • Identifying multiple speakers from the same microphone (also known as diarisation) starts at 10:55:
    • The problem that is being solved here is using a single microphone (but why limit to that?) to identify different persons in the same room speaking when transcribing. Mapping that to a particular person’s “profile” (identified from the meeting roster) is a bit creepy though. And yet this is going to be important to solve the problem of transcription after the push to return to the office (in particular for Amazon who doubled down on this). The demo itself is impressive but the looks folks give each other…
    • The diversity of non-native speakers is another subtle but powerful demo. Overlapping speakers are certainly a problem but people are less likely to do this unintentionally while being in the same room
    • We are however unconvinced that using a voice fingerprint is useful in a contact center context (would you like your voice fingerprint being taken here), in particular since the caller’s phone number and a lookup based on that has provided enough context for the last two decades
  • Voice uplift (we prefer “speech super resolution”) starts at 14:35. It takes the principle of “super resolution” commonly applied to video (see this KrankyGeek talk) and applies it to… G.711 calls:
    • With the advent of WebRTC and the high-quality provided by Opus we got used to the level of quality it provides which means that we perceive the worse quality of a G.711 narrowband phone call much more which causes fatigue when listening to those. While this may not be relevant to WebRTC developers this is quite relevant to call center agents (whose ears are on the other hand not accustomed to the level of quality Opus provides)
    • G.711 reduces the audio bandwidth by narrowing the signal frequency range to [300Hz, 3.4 kHz]. This is a physical process and as such not reversible. However, deep neural networks have listened to enough calls to reconstruct the original signal with sufficient fidelity
    • This feature is a differentiator in the contact center space, where most calls still originate from PSTN offering G.711 narrowband call quality. Expanding this to wideband for contact center agents may bring big benefits to the agent’s comfort and by extension to the customer experience
  • The summary starts at 16:00. If you prefer just the summary so far, listen to it anyway:
    • DRED is available for integration “into the WebRTC” platform. We will see whether that is going to happen faster than the re-integration of RED which took more than a year
Third Q&A

Duration: 19:45

Watch if

  • You found the talks this relates to interesting and want more details

Key points:

  • A lot of questions about open sourcing the stuff that gets talked about
  • Great questions about Opus/DRED, video quality assessment, getting representative network data for Newton (and how it relates to the WebRTC FakeClock)
  • The problem with DRED is that you don’t have just a single model but different models depending on the platform. And you can’t ship all of them in the browser binary…
SESSION 4 Ishan Khot & Hani Atassi / Meta – RSYS cross-platform real-time system library

Duration: 18:15

Watch if you are

  • A software architect that has worked with libWebRTC as part of a larger system

Key points:

  • This talk is a bit of an internal talk since we can’t download and use rsys which makes it hard to relate to it which is only possible if you have done your own integration of libWebRTC into a larger system
  • rsys is Meta’s RTC extension of their msys messaging library. It came out of Messenger and the need to abstract the existing codebase and make it more usable for other products. This creates an internal conflict between “we care only about our main use-case” and “we want to support more products” (and we know how Google’s priorities are in WebRTC/Chrome for this…). For example, Messenger made some assumptions about video streams and did not consider screen sharing to be something that is a core feature (as we saw in the keynote that has changed)
  • You can see the overall architecture at 8:00
  • With (lib)WebRTC being just one of many blocks in the diagram (the other two interesting ones are “camera” and “audio” which relate to the device management modules from the  second talk. Loading libWebRTC is done at runtime to reduce the binary size of the app store download
  • The slides that follow are a good description of what you need besides “raw WebRTC” like signaling and call state machines
  • The slides starting at 12:20 focus on how testing is done as well as debuggability and monitoring
  • The four-minute outlook which starts at 14:00 makes an odd point about 50 participants in a call being a challenge
Raman Valia & Shreyas Basarge / Meta – Bringing RTC to the Metaverse

Duration: 22:00

Watch if you are

  • Interested in the Metaverse and what challenges it brings for RTC
  • A product manager that wants to understand how it is different from communication products
  • An engineer that is interested in how RTC concepts like Simulcast are applicable to a more generic “world state” (or game servers as we think of them)

Key points:

  • The Metaverse is not dead yet but we still think it is called Fortnite
  • The distinction between communicating (in a video call) and “being present” is useful as the Metaverse tries to solve the latter and is “always on”
  • Around 5:00 delivering media over process boundaries is actually something where WebRTC can provide a better solution than IPC (but one needs to disable encryption for that use-case)
  • Embodiment is the topic that starts at 7:00. One of the tricky things about the Metaverse is that due to headsets you cannot capture a person’s face or landmarks on it since they are obscured by AR/VR devices
  • The distinction between different “levels” of Avatars, stylized, photorealistic and volumetric at 8:30 is interesting but even getting to the second stage is going to be tough
  • Sharing the world state that is being discussed at 15:00 is an adjacent problem. It does require systems similar to RTC in the sense that we have mediaserver-like servers (you might call them game servers) and then need techniques similar to simulcast. Also we have “data channels” with different priorities. And (later on) even “floor control
  • For the outlook around 20:30 a large concert is mentioned as a use-case. Which has happened in Fortnite since 2019
Fourth Q&A

Duration: 14:10

Watch if

  • You found the talks this relates to interesting and want more details

Key points:

  • rsys design assumptions which led to the current architecture and how its performance gets evaluated. And how they managed to keep the organization aligned on the goals for the migration
  • Never-ending calls in the Metaverse and privacy expectations which are different in a 1:1 call and a virtual concert
Closing remarks

We tried capturing as much as possible, which made this a wee bit long. The purpose though is to make it easier for you to decide in which sessions to focus, and even in which parts of each session.

Oh – and did we mention you should check out (and subscribe) to our WebRTC Insights service?

The post RTC@Scale 2023 – an event summary appeared first on BlogGeek.me.

What exactly is a WebRTC media server?

bloggeek - Mon, 04/24/2023 - 13:00

WebRTC media server is an optional component in a WebRTC application. That said, in most common use cases, you will need one.

There are different types of WebRTC servers. One of them is the WebRTC media server. When will you be needing one and what exactly it does? Read on.

Oh – and if you’re looking to dig deeper into WebRTC media servers, make sure to check the end of this article for an announcement of our latest WebRTC course

Table of contents Servers in WebRTC

There are quite a few moving parts in a WebRTC application. There’s the client device side, where you’ll have the web browsers with WebRTC support and maybe other types of clients like mobile applications that have WebRTC implementations in them.

And then there are the server side components and there are quite a few of them. The illustration above shows the 4 types of WebRTC servers you are likely to need:

  • Application servers where the application logic resides. Unrelated directly to WebRTC, but there nonetheless
  • Signaling servers used to orchestrate and control how users get connected to one another, passing WebRTC signaling across the devices (WebRTC has no signaling protocol of its own)
  • TURN (and STUN) servers that are needed to get media routed through firewalls and NATs. Not all the time, but frequently enough to make them important
  • WebRTC media servers processing and routing WebRTC media packets in your infrastructure when needed

The illustration below shows how all of these WebRTC servers connect to the client devices and what types of data flows through them:

What is interesting, is that the only real piece of WebRTC infrastructure component that can be seen as optional is the WebRTC media server. That said, in most real-world use-cases you will need media servers.

The role of a WebRTC media server

At its conception, WebRTC was meant to be “between” browsers. Only recently, did the good people at the W3C see it fit to change it to something that can work also in browsers. We’ve know that to be the case all along

What does a WebRTC media server do exactly? It processes and routes media packets through the backend infrastructure – either in the cloud or on premise.

Let’s say you are building a group calling service and you want 10 people to be able to join in and talk to each other. For simplicity’s sake, assume we want to get 1Mbps of encoded video from each participant and show the other 9 participants on the screen of each of the users:

How would we go about building such an application without a WebRTC media server?

To do that, we will need to develop a mesh architecture:

We’d have the clients send out 1Mbps of their own media to all the other participants who wish to display them on their screen. This amounts to 9*1Mbps = 9Mbps of upstream data that each participant will be sending out. Each client receives streams from all 9 other participants, getting us to 9Mbps of downstream data.

This might not seem like much, but it is. Especially when sent over UDP in real time, and when we need to encode and encrypt each stream separately for each user, and to determine bandwidth estimation across the network. Even if we reduce the requirement from 1Mbps to a lower bitrate, this is still a hard problem to deal with and solve.

It becomes devilishly hard (impossible?) when we crank up the number to say 50 or a 100 participants. Not to mention the numbers we see today of 1,000 or more participants in sessions (either active participants or passive viewers).

Enter the WebRTC media server

This is where a WebRTC media server comes in. We will add it here to be able to do the following tasks for us:

  • Reduce the stress on the upstream connection of clients
    • Now clients will send out fewer media streams to the server
    • The server will be distributing the media it receives to other clients
  • Handle bandwidth estimation
    • Each client takes care of bandwidth estimation in front of the server
    • The server takes care of the whole “operation”, understanding the available bandwidth and constraints of all clients

Here’s what’s really going on and what we use these media servers for:

WebRTC media servers bridge the gaps in the architecture that we can’t solve with clients alone

How is a WebRTC media server different from TURN servers

Before we continue and dive in to the different types of media servers, there’s something that must be said and discussed:

WebRTC media server != TURN server

I’ve seen people try to use the TURN server to do what media servers do. Usually that would be things like recording the data stream.

This doesn’t work.

TURN servers route media through firewalls and NAT devices. They aren’t privy to the data being sent through them. WebRTC privacy is maintained by having data encrypted end to end when passing via TURN servers – the TURN servers don’t know the encryption key so can’t do anything with the media.

WebRTC media servers are implementations of WebRTC clients in a server component. From an architectural point of view, the “session” terminates in the WebRTC media server:

A WebRTC media server is privy to all data passing through it, and acts as a WebRTC client in front of each of the WebRTC devices it works with. It is also why it isn’t so well defined in WebRTC but at the same time so versatile.

Types of WebRTC media servers

This versatility of WebRTC media servers means that there are different types of such servers. Each one works under different architectural assumptions and concepts. Lets review them quickly here.

Routing media using an SFU

The most common and popular WebRTC media server is the SFU.

An SFU routes media between the devices, doing as little as possible when it comes to the media processing part itself.

The concept of an SFU is that it offloads much of the decision making of layout and display to the clients themselves, giving them more flexibility than any other alternative. At the same time, it takes care of bandwidth management and routing logic to best fit the capabilities of the devices it works with.

To do all that, it uses technologies such as bandwidth estimation, simulcast, SVC and many others (things like DTX, cascading and RED).

At the beginning, SFUs were introduced and used for group calls. Later on, they started to appear as live streaming and broadcast components.

Mixing media with an MCU

Probably the oldest media server solution is the MCU.

The MCU was introduced years before WebRTC, when networks were limited. Telephony systems had/have voice conferencing bridges built around the concept of MCUs. Video conferencing systems required the use of media servers simply because video compression required specialized hardware and later too much CPU from client devices.

In telephony and audio, you’ll see this referred to as mixers or audio bridges and not MCUs. That said, they still are one and the same technically.

What MCUs do is to receive and mix the media streams it receives from the various participants, sending a single stream of media towards the clients. For clients, an MCU looks like a call between 2 participants – it is the only entity the client really interacts with directly. This means there’s a single audio and a single video stream coming into and going out of the client – regardless of the number of participants and how/when they join and leave the session.

MCUs were less used in WebRTC from the get go. Part of it was the simple economies of scale – MCUs are expensive to operate, requiring a lot of CPU power (encoding and decoding media is expensive). It is cheaper to offer the same or similar services using SFUs. There are vendors who still rely on MCUs in WebRTC for group calling, though in most cases, you will find MCUs providing the recording mechanism only – where what they end up doing is taking all inputs and mixing them into a single stream to place in storage.

Bridging across standards using a gateway

Another type of media server that is used in WebRTC is a gateway.

In some cases, content – rendered, live or otherwise – needs to be shared in a WebRTC session – or a WebRTC session needs to be shared on another type of a protocol/medium. To do so, a gateway can be used to bridge between the protocols.

The two main cases where these happen are probably:

  1. Connecting surveillance cameras that don’t inherently support WebRTC to a WebRTC application
  2. Streaming a WebRTC session into a social network (think Twitch, YouTube Live, …)
The hybrid media server

One more example is a kind of a hybrid media server. One that might do routing and processing together. A group calling service that also records the call into a single stream for example. Such solutions are becoming more and more popular and are usually deployed as multiple media servers of different types (unlike the illustration above), each catering for a different part of the service. Splitting them up makes it easier to develop, maintain and scale them based on the workload needed by each media server type.

Cloud rendering

This might not be a WebRTC media server per se, but for me this falls within the same category.

Sometimes, what we want is to render content in the cloud and share it live with a user on a browser. This is true for things like cloud gaming or cloud application delivery (Photoshop in the cloud for hourly consumption). In such a case, this is more like a peer-to-peer WebRTC session taking place between a user on a browser and a cloud server that renders the content.

I see it as a media server because many of the aspects of development and scaling of the cloud rendering components are more akin to how you’d think about WebRTC media servers than they are about browser or native clients.

A quick exercise: What WebRTC media servers are used by Google Meet?

Let’s look at an example service – Google Meet. Why Google Meet? Well, because it is so versatile today and because if you want to trace capabilities in WebRTC, the best approach is to keep close tabs with what Google Meet is doing.

What WebRTC media servers does Google Meet use? Based on the functionality it offers, we can glean out the types that make up this service:

  • Supports large group meetings – this is where SFU servers are used by Google Meet to host and orchestrate the meeting. Each user has different layouts during the same session and can flexibly control what it views
  • Recording meetings – Google Meet recordings shows a single participant/screen share and mixes all audio streams. For the audio this means using an MCU server and for the video this is more akin to a switching SFU server (always picking out a single video stream out of those available and not aiming for a “what you see is what you get” kind of recording)
  • Connect to YouTube live – here, they connect between Google Meet and YouTube Live using an RTMP gateway in real-time instead of storing it in a file like it is done while recording
  • Dialing in from regular telephones – this one requires a hybrid gateway bridging server as well as an MCU to mix the audio into the meeting
  • Cloud based noise suppression – Google decided to implement noise suppression in Google Meet using servers. This requires an SFU/bridging gateway to connect to servers that process the media in such a way
  • Cloud based background removal – For low performing devices, Google Meet also runs background removal in the server, and like noise suppression, this requires an SFU/bridging gateway for this functionality

A classing meeting service in WebRTC may well require more than a single type of a WebRTC media server, likely deployed in hybrid mode across different hardware configurations.

When will you need a WebRTC media server?

As we’ve seen earlier, the answer to this is simple – when doing things with WebRTC clients only isn’t possible and we need something to bridge this gap.

We may lack:

  • Bandwidth on the client side, so we will alleviate that by adding WebRTC servers
  • CPU, memory or processing power, delegating that to the cloud
  • Conduct certain machine learning algorithms, where having them run in cloud services may make more sense (due to CPU, memory, availability of training data, speed, certain AI chips, …)
  • Bridging between WebRTC and other components that don’t use WebRTC, such as connecting to telephony systems, surveillance cameras, social media streaming services, etc
  • When we need the data on servers – so we record the sessions (we can also do this without a WebRTC server, but there will be a media server in the cloud there nonetheless)

What I usually do when analyzing the needs of a WebRTC application is to find these gaps and determine if a WebRTC media server is needed (it usually is). I do so by thinking of the solution as a P2P one, without media servers. And then based on the requirements and the gaps found, I’ll be adding certain WebRTC media server elements into the infrastructure needed for my WebRTC application.

E2EE and WebRTC media servers

We’ve seen a growing interest in recent years in privacy. The internet has shifted to encryption first connections and WebRTC offers encrypted only media. This shift towards privacy started as privacy from other malicious actors on the public internet but has since shifted also towards privacy from the service provider itself.

Running a group meetings service through a service provider that cannot access the meeting’s content himself is becoming more commonplace.

This capability is known as E2EE – End to End Encryption.

When introducing WebRTC media servers into the mix, it means that while they are still a part of the session and are terminating WebRTC peer connections (=terminating encrypted SRTP streams) on their own, they shouldn’t have access to the media itself.

This can be achieved only in the SFU type of WebRTC media servers by the use of insertable streams. With it, the application logic can exchange private encryption keys between the users and have a second encryption layer that passes transparently through the SFU – enabling it to do its job of packet routing without the ability to understand the media content itself.

WebRTC media servers and open source

Another important aspect to understand about WebRTC media servers is that most of those using media servers in WebRTC do so using open source frameworks for media servers.

I’ve written at length about WebRTC open source projects – there are details there about the market state and open source WebRTC media servers there.

What is important to note is that more often than not, projects who don’t use managed services for their WebRTC media servers usually pick open source WebRTC media servers to work with and not develop their own from scratch. This isn’t always the case, but it is quite common.

Video APIs, CPaaS and WebRTC media servers

WebRTC Video API and CPaaS is another area I cover quite extensively.

Vendors who decide to use a CPaaS vendor for their WebRTC application will mainly do it in one of two situations:

  1. They need to bridge audio calls to PSTN to connect them to regular telephony
  2. There’s a need for a WebRTC media server (usually an SFU) in their solution

Both cases require media servers…

This leads to the following important conclusion: there’s no such thing as a CPaaS vendor doing WebRTC that isn’t offering a managed WebRTC media server as part of its solution – and if there is, then I’ll question its usefulness for most potential customers.

Taking a deep dive into WebRTC protocols

Last year, I released the Low-level WebRTC protocols course along with Philipp Hancke.

The Low-level WebRTC protocols course has been a huge success, which is why we’re starting to work on our next course in this series: Higher level WebRTC protocols

Before we go about understanding WebRTC media servers, it is important to understand the inner-workings of the network protocols that WebRTC employs. Our low-level protocols course covers the first part of the underlying protocols. This second course, looks at the higher level protocols – the parts that look and deal a bit more with network realities – challenges brought to us by packet losses as well as other network characteristics.

Things we cover here include retransmissions, forward error correction, codecs packetization and a myriad of media processing algorithms.

Want to be the first to know when we open our early bird enrollment?

Join the waiting list

The post What exactly is a WebRTC media server? appeared first on BlogGeek.me.

WHIP & WHEP: Is WebRTC the future of live streaming?

bloggeek - Mon, 04/10/2023 - 13:00

WHIP and WHEP are specifications to get WebRTC into live streaming. But is this really what is needed moving forward?

WebRTC is great for real time. Anything else – not as much. Recently two new protocols came to being – WHIP and WHEP. They work as signaling to WebRTC to better support live streaming use cases.

In recent months, there has been a growing adoption in the implementation of these protocols (the adoption of actual use isn’t something I am privy to so can’t attest either way). This progress is a positive one, but I can’t ignore the feelings I have that this is only a temporary solution.

Table of contents What are WHIP and WHEP?

WHIP stands for WebRTC-HTTP Ingestion Protocol. WHEP stands for WebRTC-HTTP Egress Protocol. They are both relatively new IETF drafts that define a signaling protocol for WebRTC.

WebRTC explicitly decided NOT to have any signaling protocol so that developers will be able to pick and choose any existing signaling protocol of their choice – be it SIP, XMPP or any other alternative. For the media streaming industry, this wasn’t a good thing – they needed a well known protocol with ready-made implementations. Which led to WHIP and WHEP.

 To understand them how they fit into a solution, we can use the diagram below:

In a live streaming use case, we have one or more broadcasters who “Ingest” their media to a media server. That’s where WHIP comes in. The viewers on the other side, get their media streams on the egress side of the media servers infrastructure.

For a technical overview of WHIP & WHEP, check out this Kranky Geek session by Sergio Garcia Murillo from Dolby:

In video conferencing, WebRTC transformed the market and how it thought of meetings and interoperability by practically killing the notion of interoperability across vendors on the protocol level, shifting it to the application level and letting users install their own apps on devices or just load web pages on demand.

The streaming industry is different – it relies on 3 components, which can easily come from 3 different vendors:

  1. Media servers – the cloud or on premise infrastructure that processes the media and routes it around the globe
  2. Ingress/Ingestion – the media source. In many cases these are internet cameras connected via RTP/RTSP or OBS and GStreamer-based sources
  3. Egress/Viewers – those who receive the media, often doing so on media players

When a broadcaster implements his application, he picks and chooses the media servers and media players. Sometimes he will also pick the ingestion part, but not always. And none of the vendors in each of these 3 categories can really enforce the use of his own components for the others.

This posed a real issue for WebRTC – it has no signaling protocol – this is left for the implementers, but how do you develop such a solution that works across vendors without a suitable signaling protocol?

The answer for that was WHIP and WHEP –

  • WHIP connects Ingress/Ingestion to Media servers
  • WHEP connects Media servers to Egress/Viewers

These are really simple protocols built around the notion of a single HTTP request – in an attempt to get the streaming industry to use them and not shy away from the complexities hidden in WebRTC.

Strengths

Here’s what’s working well for WHIP and WHEP:

  • Simple to implement
    • These protocols for the main flow requires a single round trip – a request and a response
    • Some of the features of WebRTC were removed or made lenient in order to enable that, which is a good thing in this case
  • Operates similarly to other streaming protocols
    • The purpose of it all is to be used in an industry that already exist with existing solutions and vendors
    • The closer we can get to them the easier for them and the more likely they are to adopt it
  • Adoption
    • The above two got it to be adopted by many players in the industry already
    • This adoption is in the form of demos, POCs and actual products
  • WebRTC
    • It is here for over 10 years now and proven as a technology
    • Connecting it to video conferences to stream them live or to add external streams into them using WHIP is not that hard – there are quite a few playing with these use cases already
Weaknesses

There’s the challenging side of things as well:

  • Too simple
    • Edge cases aren’t clearly managed and handled
    • Things like renegotiation when required, ICE restarts, etc
  • WebRTC
    • While WebRTC is great, it wasn’t originally designed for streaming
    • We’re now using it for live streaming, but live isn’t the only thing streaming needs to solve

This last weakness – WebRTC – leads me to the next issue at hand.

Streaming, latency and WebRTC

Streaming comes in different shapes and sizes.

The scenario might have different broadcasters:viewers count – 1:1, 1:many, few:1, few:many – each has its own requirements and nuances as to what I’d prefer using on the sending side, receiving end and on the media server itself.

What really changes everything here is latency. How much latency are we willing to accept?

The lower the latency we want the more challenging the implementation is. The closer to live/real time we wish to get, the more sacrifices we will need to make in terms of quality. I’ve written about the need to choose either quality or latency.

WebRTC is razor focused on real time and live. So much so that it can’t really handle something that has latency in it. It can – but it will sacrifice too much for it at a high complexity cost – something you don’t really want or need.

What does that mean exactly?

  • WebRTC runs over UDP and falls back to TCP if it must
  • The reason behind it is that having the generic retransmissions built into TCP is mostly counterproductive to WebRTC – if a packet is lost, then resending it is going to be too late in many cases to make use of it live – remember?
  • So WebRTC relies on UDP and uses RTP, enabling it to decide how to handle packet losses, bitrate fluctuations and other network issues affecting real time communications
  • If we have a few seconds of latency, then we can use retransmissions on every packet to deal with packet losses. This is exactly what Netflix and YouTube do for example. With its focus on low latency, WebRTC doesn’t really allow that for us

This is when a few tough questions need to be asked – what exactly does your streaming service need?

  • Sub-second latency because it is real time and interactive?
  • If the viewer receives the media two seconds after it has been broadcasted. Is that a huge problem or is it ok?
  • What about 5 seconds?
  • And 30 seconds?
  • Is the stream even live to begin with or is it pre-recorded?

If you need things to be conducted in sub-second latency only, then WebRTC is probably the way to go. But if you have in your use case other latencies as well, then think twice before choosing WebRTC as your go-to solution.

A hybrid WebRTC approach to “live” streaming

An important aspect that needs to be mentioned here is that in many cases, WebRTC is used in a hybrid model in media streaming.

Oftentimes, we want to ingest media using WebRTC and view the media elsewhere using other protocols – usually because we don’t care as much about latency or because we already have the viewing component solved and deployed – here WebRTC ingest is added to an existing service.

Adding the WHIP protocol here, and ingesting WebRTC media to the streaming service means we can acquire the media from a web browser without installing anything. Real time is nice, but not always needed. Browser ingest though is mostly about reducing friction and enabling web applications.

The 3 horsemen: WebTransport, WebCodecs and WebAssembly

That last suggestion would have looked different just two years ago, when for real time the only game in town for browsers was WebRTC. Today though, it isn’t the case.

In 2020 I pointed to the unbundling of WebRTC. The trend in which WebRTC is being split into its core components so that developers will be able to use each one independently, and in a way, build their own solution that is similar to WebRTC but isn’t WebRTC. These components are:

  1. WebTransport – means to send anything over UDP at low latency between a server and a client – without or without retransmissions
  2. WebCodecs – the codecs used in WebRTC, decoupled from WebRTC, with their own frame by frame encoding and decoding interface
  3. WebAssembly – the glue that can implement things with high performance inside a browser

Theoretically, using these 3 components one can build a real time communication solution, which is exactly what Zoom is trying to do inside web browsers.

In the past several months I’ve seen more and more companies adopting these interfaces. It started with vendors using WebAssembly for background blurring and replacement. Moved on to companies toying around with WebTransport and/or WebCodecs for streaming and recently a lot of vendors are doing noise suppression with WebAssembly.

Here’s what Intel showcased during Kranky Geek 2021:

This trend is only going to grow.

How does this relate to streaming?

Good that you asked!

These 3 enables us to implement our own live streaming solution, not based on WebRTC that can achieve sub second latency in web browsers. It is also flexible enough for us to be able to add mechanisms and tools into it that can handle higher latencies as needed, where in higher latencies we improve upon the quality of the media.

Strengths

Here’s what I like about this approach:

  • I haven’t read about it or seen it anywhere, so I like to think of it as one I came up with on my own but seriously…
  • It is able with a single set of protocols and technologies to support any latency requirement we have in our service
  • Support for web browsers (not all yet, but we will be getting there)
  • No need for TURN or STUN servers – less server footprint and headaches and better firewall penetration (that’s assuming WebTransport becomes as common as WebSocket and gets automatically whitelisted by firewalls)
Weaknesses

It isn’t all shiny though:

  • Still new and nascent. We don’t know what doesn’t work and what the limitations are
  • Not all modern browsers support it properly yet
  • We’re back to square one – there’s no streaming protocol to support it that way, which means we don’t support the media streaming ecosystem as a whole
  • Connecting it to WebRTC when needed might not be straightforward
  • You need to build-your-own spec at the moment, which means more work to you
Is WebRTC the future of live streaming?

I don’t know.

WHIP and WHEP are here. They are gaining traction and have vendors behind them pushing them.

On the other hand, they don’t solve the whole problem – only the live aspect of streaming.

The reason WebRTC is used at the moment is because it was the only game in town. Soon that will change with the adoption of solutions based on WebTransport+WebCodecs+WebAssembly where an alternative to WebRTC for live streaming in browsers will introduce itself.

Can this replace WebRTC? For media streaming – yes.

Is this the way the industry will go? This is yet to be seen, but definitely something to track.

The post WHIP & WHEP: Is WebRTC the future of live streaming? appeared first on BlogGeek.me.

Web 上的视频帧处理 – WebAssembly、WebGPU、WebGL、WebCodecs、WebNN 和 WebTransport

webrtchacks - Tue, 03/28/2023 - 14:59

Note: Chinese translation thanks to Xueyuan Jia and Xiaoqian Wu of the W3C. See the English version here.  W3C Web 技术标准专家 François Daoust 和 Dominique Hazaël-Massieux(Dom)先前与我们探讨了如何使用 WebCodecs 和 Streams 进行实时视频处理。那篇文章重点介绍了如何设置流水线以应付来自摄像头、WebRTC 流或其他来源的视频帧低延迟处理。演示了一些处理示例 — 改变颜色、覆盖图像,甚至是改变视频编解码。引用的其他用例还包括机器学习处理,例如添加虚拟背景。 今天,他们将重点讨论可用于进行实际视频处理的诸多技术选项。有很多技术用来读取和更改视频帧内的像素。他们全面回顾了当前基于 Web 的所有技术选项 — JavaScript、WebAssembly (wasm)、WebGPU、WebGL、WebCodecs、Web 神经网络(WebNN)和 WebTransport。其中一些技术已经存在一段时间,许多则是新出现的。 这是一篇关于与视频分析与操作的文章。感谢 François 和 Dominique 与我们分享他们的研究,测试 Web 上可用的进行视频处理的完整技术目录。 正文内容 视频帧处理选项 使用 JavaScript 像素格式 性能 其他考虑 使用 WebAssembly 演示代码 […]

The post Web 上的视频帧处理 – WebAssembly、WebGPU、WebGL、WebCodecs、WebNN 和 WebTransport appeared first on webrtcHacks.

Video Frame Processing on the Web – WebAssembly, WebGPU, WebGL, WebCodecs, WebNN, and WebTransport

webrtchacks - Tue, 03/28/2023 - 14:53

There are a lot of options for reading and changing the pixels inside a video frame. In this post, W3C specialists François Daoust and Dominique Hazaël-Massieux (Dom) review every web-based option for processing video frames on the web available today - JavaScript, WebAssembly (wasm), WebGPU, WebGL, WebCodecs, Web Neural Networks (WebNN), and WebTransport.

The post Video Frame Processing on the Web – WebAssembly, WebGPU, WebGL, WebCodecs, WebNN, and WebTransport appeared first on webrtcHacks.

With WebRTC, don’t expect Google to be your personal outsourcing vendor

bloggeek - Mon, 03/27/2023 - 13:00

Understanding how WebRTC is governed in reality will enable you to make better decisions in your development strategy.

If you are correct or not is something we can argue about. What we can’t argue is that the expectation that a company who is maintaining an open source library doesn’t owe you anything.

Free is worth exactly what you pay for it. 0⃣

And there lies the whole issue – if you aren’t paying for WebRTC, then what gives you the right to complain? (btw – this is different from the other side of it – could Google do a better job of maintaining WebRTC for everyone at the same or lower effort, while increasing external contributions to it).

Table of contents Why this article?

To. Many, Times. People. Complain. About. Google.

I do that as well

If you are complaining, at least know that you’re complaining about something that is reasonable…

One of the more recent cases comes from Twilio (or more accurately a customer of theirs):

There was a minor change in Google’s implementation of WebRTC. For some reason, they decided to be less lenient with how they parse iceServers in peer connections to be more “spec compliant”.

Yes. It is nitpicking.

Yes. It is a useless change.

Yes. They could have decided not to do it.

But they did. And in a weird way, it makes sense to do so.

And there’s a process in place already for dealing with that – Canary and Beta versions of Chrome that vendors (like Twilio) can use to catch and handle these things beforehand. Or they can… well… register to the WebRTC Insights

Twilio had to fix their code (and they did by the way), and yet there are those who blame Google here for making changes in Chrome. Changes that one can say are needed.

I’d add a few more thoughts here before I continue to dive in to this topic properly:

  • When you make an omelet you break a few eggs. Every change done in Chrome is going to break someone’s code
  • Chrome is used by billions of users, on countless different devices, using implementations of an endless stream of companies and developers. If YOU think that you can create code that is flawless that won’t break for someone in the next upgrade, then do let me know – I am not hiring, but for you, I’ll definitely make an exception
Who “owns” WebRTC?

WebRTC is an open standard governed by the W3C and an open source library which confusingly is also named “webrtc”. I prefer to call it libwebrtc.

The WebRTC open source standard is somewhat split in “ownership” between the W3C and the IETF. W3C is in charge of the API surface we use in the browser for WebRTC and the IETF on the network protocol itself – what gets sent over the network.

WebRTC as an open source library is… well… it depends. Google develops and maintains libwebrtc – that’s the source code that goes into Chrome. And Edge. And Firefox. And Safari. Yes – all of them. And then there are other alternative libraries you can use.

The thing is this – you can’t really use a different WebRTC implementation in the browser, because browsers come with libwebrtc “built-in”. And in many cases, if you don’t need a browser, you may still want to use libwebrtc just to be as close as possible to the browser implementation.

Does that mean that Google owns the WebRTC implementation? To some degree it does – while there are alternatives, none of them are truly usable for many of the use cases.

That said, anyone can fork the Google WebRTC implementation and create his own project – open source or otherwise – and continue from there. Apple could do it. So could Microsoft and Mozilla. And yet they all decided to stick with libwebrtc as is.

Why is that?

I can think of two main reasons:

  1. Why “waste” resources (engineers, time, money, etc) when you can get it for free and have Google develop it for you?
  2. If you need to end up interoperating or having your application run on Chrome (=the Internet for non-iPhone users), then your best bet is to stick as close as possible to the source – which is libwebrtc

So in a way, Google owns WebRTC without really owning it. At least as long as Chrome is the undisputed and dominant form in which we consume the internet (are you reading this on a Chrome browser?)

I usually place a global market share graph at this stage. This time, I’ll share this website’s visitors distribution:

A few words about libwebrtc

libwebrtc is maintained by Google for Google. It is open sourced and you can use it. You can even contribute back, which isn’t a simple process.

By Google for Google means that prioritization of features, testing and bug fixes is done based on Google’s needs. These needs include Google Meet, a few other Google services and the need to support and maintain the larger ecosystem.

Who sets the tone here? What decides if your bug is more important to deal with than Google Meet or another vendor’s problems?

Put yourself in the shoes of the Google product manager for WebRTC and you’ll know the answer – it would be Google Meet first. The others later.

This also sets the tone as to the build system and code structure of libwebrtc. It is highly geared towards its use inside Chrome. Less elsewhere. And this in turn means that adopting it as a library inside your own application means dealing with code that isn’t meant to be a classic generic purpose SDK – you’ll need to figure your way through it (and with a bit less documentation than you’d like).

Vendors in the WebRTC ecosystem

There are now hundreds if not thousands of vendors using WebRTC in the ecosystem. They do it directly or indirectly via CPaaS vendors and other tooling and solutions. You can find many of them in my WebRTC Developer Tools Lanscape. Most of them view WebRTC as free. Not only that, it seems like many treat WebRTC as a human right – it needs to be there for them, it must be perfect, and if there’s something ”wrong” with it, then humanity has the obligation to fix it for them.

So… WebRTC is free. But what does that mean exactly? What is the SLA associated with it? What can you expect of it and come back to complain if it isn’t met?

Here are a few additional interesting questions, If WebRTC is cardinal and strategic to your application:

  • Have you invested anything in returning back to the community around WebRTC? Should you?
  • Do you have someone working part time or full time on the libwebrtc codebase itself? Is that work done in the public library or in your proprietary in-house fork?
  • If you run into an issue, can you ask Google to help you out and will they spend the time and resources to do so?
  • Can you pay Google (or anyone else) a support fee for solving your specific issues? (no)
  • Are you here only to take or also to give?

To be clear – there are no right or wrong answers here – just make sure you position your expectations based on your answers as well

Putting your money where your mouth is

Philipp Hancke has been doing WebRTC for a long time and is renowned for his bug reports. He even got Google to fix quite a few of them. Some bugs stayed open for years however, like this bug about TURN relay servers being used sometimes in cases where using STUN will be just fine. A bug here has an impact on the percentage of calls that get relayed via TURN servers which has a negative impact on call quality (at times) but also increases the cost to run those.

This bug has been open for since 2016. Quite a few Googlers took a look but without finding anything that stood out. The crucial hint of what goes wrong came in 2021 in another bug report. In the end, Philipp had to acquire the skills necessary to fix the bug (which will hopefully happen before the end of 2023).

This takes time and time is not cheap – especially that of engineers. Microsoft as his employer apparently decided it was important enough for him to spend time on fixing this and other issues.

Please Google add a feature for me!

HEVC encoding and decoding in WebRTC seems to be a topic some folks get excited about. It would be great to know why.. 

There is a bug report about it in the WebRTC issue tracker which gets fairly frequent updates. And yet… Google does nothing! How can that be?

One would say that’s because it is out of the requirements of what Google needs for Google. There are other contributing factors as well here:

  1. It is also not simple to implement and maintain
  2. Testing this is a headache, especially considering all potential edge cases, hardware, devices, …
  3. Patents. HEVC is a legal minefield. Chrome supports HEVC only when the underlying hardware does. Why would Google go further into that minefield for you?
  4. This isn’t a feature in WebRTC. Not a mandatory one. Even as an optional one you can argue that it is somewhat controversial
How to think about support in WebRTC?

There’s this modern concept of zero trust in cloud computing these days.

Here’s my suggestion to you wrt WebRTC and your stance:

Zero expectations.

Don’t expect – and you won’t be disappointed.

But more importantly – understand how this game is played:

  • Use WebRTC. Take what you are given and make the most of it
  • If you need to modify the source code:
    • Be sure to invest time and thought into how to do it in a way that will let you upgrade to later releases of libwebrtc
    • Upgrade frequently. 4 times a year is great. Less is going to be an issue
    • Follow up on security issues to patch in-between releases if needed. We keep track of these in WebRTC Insights
  • Test frequently
    • Test against the beta and canary releases
    • If things break – report back. Make sure to add as much useful information as possible (follow these suggestions for submitting a WebRTC bug in Chrome)
    • If things break – don’t wait for Google to fix it. See if there’s something on your end you can do to fix things and work around the issue
  • Have means to update your application
    • If you end up with an incompatibility with Chrome, you need a way to upgrade your application. Which will take time. You are in a race against the Chrome release train here
    • A way to release a hotfix to whatever it is your customers are using. Something that can be deployed within hours or days
  • Browsers have a release cadence of a version per month. Think about that. And then plan accordingly
  • Assume things will break. It is not a matter of if – just of when and how
  • Things can be handled and managed better by the Google team for WebRTC. But it isn’t. Nothing you can really do about it

And yes – we’re here to help – you can use WebRTC Insights to get ahead of these issues in many ways.

The post With WebRTC, don’t expect Google to be your personal outsourcing vendor appeared first on BlogGeek.me.

Real-Time Video Processing with WebCodecs and Streams: Processing Pipelines (Part 1)

webrtchacks - Tue, 03/14/2023 - 13:45

WebRTC used to be about capturing some media and sending it from Point A to Point B. Machine Learning has changed this. Now it is common to use ML to analyze and manipulate media in real time for things like virtual backgrounds, augmented reality, noise suppression, intelligent cropping, and much more. To better accommodate this […]

The post Real-Time Video Processing with WebCodecs and Streams: Processing Pipelines (Part 1) appeared first on webrtcHacks.

Different WebRTC server allocation schemes for scaling group calling

bloggeek - Mon, 03/13/2023 - 13:00

In group calls there are different ways to decide on WebRTC server allocation. Here are some of them, along with recommendations of when to use what.

In WebRTC group calling, media server scaling is one of the biggest challenges. There are multiple scaling architectures that are used, and most likely, you will be aiming at a routing alternative, where media servers are used to route media streams around between the various participants of a session.

As your service grows, you will need to deal with scale:

  • Due to an increase in the number of users in a single session
  • Because there’s a need to cater for a lot more sessions concurrently
  • Simply due to the need to support users in different geographical locations

In all these instances, you will have to deal with the following challenge: How do you decide on which server to allocate a new user? There are various allocation schemes to choose from for WebRTC group calling. Each with its own advantages and challenges. Below, I’ll highlight a few such schemes to help you with implementing the WebRTC allocation scheme that is most suitable for your application.

Table of contents Single data center allocation techniques

First things first. Media servers in WebRTC don’t scale well. For most use cases, a single server will be able to support 200-500 users. When more than these numbers are supported, it will usually be due to the fact that it sends lower bitrates by design, supports only voice or built to handle only one way live streaming scenarios.

This can be viewed as a bad thing, but in some ways, it isn’t all bad – with cloud architectures, it is preferable to keep the blast radius of failures smaller, so that an erroneous machine ends up affecting less users and sessions. WebRTC media servers force developers to handle scaling earlier in their development.

Our first order of the day is usually going to be deciding how to deal with more than a single media server in the same data center location. We are likely to load-balance these media servers through our signaling server policy, effectively associating a media server to a user or a media stream when the user joins a session. Here are a few alternatives to making this decision.

Server packing

This one is rather straightforward. We fill out a media server to capacity before moving on to fill out the next one.

Advantages:

  • Easy to implement
  • Simple to maintain

Challenges:

  • Increase blast radius by design
  • Makes little use of other server resources that are idle
Least used

In this technique, we look for the media server that has the most free capacity on it and place the new user or session on it.

Advantages:

  • Automatically balances resources across servers

Challenges:

  • Requires the allocation policy to know all sever’s capacities at all times
Round robin

Our “don’t think too much” approach. Allocate the next user or session to a server and move on to the next one in the list of servers for the next allocation.

Advantages:

  • Easy to implement

Challenges:

  • Feels arbitrary
Random

Then there’s the approach of picking up a server by random. It sounds reckless, but in many cases, it can be just as useful as least used or round robin.

Advantages:

  • Easy to implement

Challenges:

  • Feels really arbitrary
Region selection techniques

The second part is determining which region to send a session or a user in a session to.

If you plan on designing your service around a single media server handling the whole session, then the challenge is going to be where to open a brand new session (adding more users takes place on that same server anyway). Today, many services are moving away from the single server approach to a more distributed architecture.

Lets see what our options are here in general.

First in room

The first user in a session decides in which region and data center it gets created. If there are more than a single media server in that data center, then we go with our single data center allocation techniques to determine which one to use.

This is the most straightforward and naive approach, making it almost the default solution many start with.

Advantages:

  • East to implement

Challenges:

  • Group sizes are limited by a single machine size and scale
  • If the first user to join is located from all the rest of the users, then the media quality will be degraded for all the rest of the participants
  • It makes deciding capacities and availability of resources on servers more challenging due to the need to reserve capacity for potential additional users

Note that everything has a solution. The solutions though makes this harder to implement and may degrade the user experience in the edge cases it deals with.

Application specific

You can pick the first that joins the room to make the decision of geolocation or you can use other means to do that. Here, the intent is to use something you know in your application in advance to make the decision.

For example, if this is a course lesson with the teacher joining from India and all the students are joining from the UK, it might be beneficial to connect everyone to a media server in the UK or vice versa – depending on where you want to put the focus.

A similar approach is to have the session determine the location by the host (similar to first in room) or be the configuration of the host – at account creation or at session creation.

Advantages:

  • Usually easy to implement

Challenges:

  • Group sizes are limited by a single machine size and scale
  • It makes deciding capacities and availability of resources on servers more challenging due to the need to reserve capacity for potential additional users
  • Not exactly a challenge, but mostly an observation – to some applications, the user base is such that creating such optimizations makes little sense. An example can be a country-specific service
Cascading

Cascading is also viewed as distributed/mesh media servers architecture – pick the name you want for it.

With cascading, we let media servers communicate with each other to cater for a single session together. This approach is how modern services scale or increase media quality – in many ways, many of the other schemes here are “baked” into this one. Here are a few techniques that are applicable here:

  • Always connect a new user to the closest media server available. If this media server isn’t already part of the session, it will be added to the session by meshing it with the other media servers that cater for this session
  • When capacity in a media server is depleted, add a new user to a session by scaling it horizontally in the same data center with one of the techniques described in single data server allocation at the beginning of this article
  • In truly large scale sessions (think 10,000 users or more), you may want to entertain the option of creating a hierarchy of media servers where some don’t even interact with end users but rather serve as relay of media between media servers

Advantages:

  • Can achieve the highest media quality per individual user

Challenges:

  • Hard to implement
  • Usually requires more server resources
Sender decides

This one surprised me the first time I saw it. In this approach, we “disconnect” all incoming traffic from outgoing and treat each of them separately as if it were an independent live stream.

What does that mean? When a user joins, he will always connect to the media server closest to them in order to send their media. For the incoming media from other users, he will subscribe to their streams directly on the media servers of those users.

Advantages:

  • Rather simple to implement

Challenges:

  • Doesn’t use good inter-data center links between the servers
  • Doesn’t “feel” right. Something about the fact that not a single media server knows the state of the user’s device bothers me in how you’d optimize things like bandwidth estimation in this architecture
A word about allocation metrics

One thing I ignored in all this is how do you know when a server is “full”. This decision can be done in multiple ways, and I’ve seen different vendors take different approaches here. There are two competing aspects here to deal with:

  1. Utilization – we want our servers to be utilized to their fullest. Resources we pay for and not use are wasted resources
  2. Fragmentation – if we cram more users on servers, we may have a problem when a new user joins a session but has no room on the media server hosting that session. So at times, we’d like to keep some slack for such users. The only question is how much slack

Here are a few examples, so you can make an informed decision on your end:

  • Number of sessions. Limit the number of sessions on a server, no matter the number of users each session has. Good for services with rather small and predictable session sizes. Makes it easier to handle resource allocations in cases of server fragmentation
  • Number of users. Limit the number of users a single server can handle
  • CPU. Put a CPU threshold. Once that threshold is breached, mark the media server as full. You can use two thresholds here – one for not allowing new sessions on the server and one for not allowing any more users on the server
  • Network. Put a network threshold, in a similar way to what we did above for CPU

Sometimes, we will use multiple metrics to make our allocation decision.

Final words

Scaling group calls isn’t simple once you dive into the details. There are quite a few WebRTC allocation schemes that you can use to decide where to place new users joining group sessions. There are various techniques to implement allocation of users in group calling, each with its own advantages and challenges.

Pick your poison

One last word – this article was written based on a new lesson that was just added to the Advanced WebRTC Architecture course. If you are looking for the best WebRTC training, then check out my WebRTC Courses.

The post Different WebRTC server allocation schemes for scaling group calling appeared first on BlogGeek.me.

Can I trust WebRTC getStats accuracy?

bloggeek - Mon, 02/27/2023 - 12:30

Yes and no. WebRTC getStats is what we have to work with, so we have to make do with it. That said, your real problems may lie elsewhere altogether.

Philipp Hancke assisted in writing this article and Midjourney helped with most of the visuals

This is the question I was posed in a meeting last week:

Can I trust WebRTC getStats?

As the Jewish person that I am, I immediately answered with a question of my own:

Assume the answer is “No”. What are you going to do now?

I thought the conversation merits a bit more discussion and some public sharing, which led to this article being written.

Table of contents TL;DR

Yes. You can and should trust the accuracy of WebRTC getStats, but like with everything else, you should also keep a dose of happy suspicion around you.

Like any piece of software, libwebrtc and its getStats implementation by extension, has bugs. These bugs get fixed over time. The priority given to fixing them relates mostly to how much Google’s own services suffer from and a seemingly arbitrary prioritization for the rest of the issues.

See below to learn more on why we have a problem and what you can do about it.

A short history of WebRTC getStats Midjourney, envisioning the history of WebRTC getStats

WebRTC was announced somewhere in 2011 and the initial public code in Chrome was released in 2012. The protocol itself was stabilized and officially published by the W3C in January 2021. Just… 10 years later.

In between these 10 years a lot of discussions took place and the actual API surface of the WebRTC standard specification was modified to fit the feedback provided and to encompass additional use cases and requirements.

We’ve had these discussions taking place in parallel to WebRTC being implemented in web browsers and shipped out so developers can make use of them. Years before WebRTC was officially “standardized” we had hundreds if not thousands of applications in production using WebRTC, oftentimes with paying customers.

At some point, the getStats implementation in the standard specification diverged from that implemented by Google in Chrome, ending with two main alternatives:

  1. Spec-compliant getStats – the new API that adheres to the standard specification. Given that this specification is authored by Googlers it is not surprising that it ended up being a description of what Chrome implemented, whether it made much sense or not. This was added in Chrome 58 back in January 2017
  2. Legacy getStats – the original implementation in Chrome

. This made switching from one to the other a challenge:

  • Google could just implement the new stats, but that would break applications that used legacy getStats implementation
  • Developers wanted to use the spec compliant stats, but needed a browser that supports them

The decision was made that the distinction between the two would be how getStats() is called. Callback-based invocation returned the legacy stats while using a promise returned spec-compliant getStats. The logic behind this was that promises was a new construct introduced to Javascript at the time, so developers who used the legacy getStats didn’t use promises (yet).

This approach worked rather well for the last 6 years, with many (most?) applications adopting the use of the spec-compliant getStats:

We observed a step drop in usage when Google Meet stopped using the legacy API (that’s the blue line going down). That said, a few outliers still remain who use the old getStats. They will not be able to do so in 2024.

Google WebRTC housecleaning project

Fast forward to today (or last year).

WebRTC is a solid standard and implementation used by many. It got us through the pandemic in many ways and aspects.

All the bigger requirements from WebRTC are behind us. There aren’t that many innovations or new features that get introduced to it.

Which is leading Google in recent months to house cleaning tasks:

  • Figuring out where they can squeeze the lemon a bit more for performance reasons
  • Where they can get rid of deadweight by deprecating and killing unnecessary code
  • Following the WebRTC specification even more closely
  • Beefing up best practices in security even more

This house cleaning work has reached getStats, and with it, 4 main areas:

  1. Deprecating and later killing legacy getStats (after waiting for Google Meet to stop using it and migrate to the spec compliant variation)
  2. Trimming down the results object for performance reasons
  3. “Randomizing” the object identifiers in the returned getStats structure for both performance and best practices reasons. This is still planned so it is best to prepare for it and not to interpret the “id” attribute in any way
  4. Making sure all stats in the specifications are reflected in the getStats implementation itself

Such changes are great when viewed in the long term. But in the short term they are a huge headache.

Firefox & Safari

Since Safari uses libwebrtc, it will get most statistics out of the box. However, the binding at the WebKit layer needs some code to be written which creates some difference with libWebRTC changes that Safari does not notice. We observed this with the “trackIdentifier” property recently but there may be others. Apple seems rather reactive here.

Firefox used to spearhead the “spec” getStats implementation but has fallen behind and lacks several stats types (such as candidate-pair stats). This means workarounds like shown by this WebRTC sample are still required for very basic functionality. Statistics related to media quality are lacking even more.

Keeping up the pace with WebRTC getStats changes

At testRTC, we’re offering tools for the full lifecycle of WebRTC applications. These include testing and monitoring services. As such, we rely heavily on getStats.

Years ago, we had to implement the migration from legacy stats to spec complaint stats.

Then came 2022 and with it the housekeeping changes by Google to the statistics found in getStats. It started with Chrome 107 and continues even today. With each such release, we need to get an experienced WebRTC developer to check, test and fix our code to make sure our services collect the statistics properly. All that is on top of the need to support more metrics that Google adds to Chrome in WebRTC getStats from time to time.

Our job is harder than most in this simply because we need to collect and support all the stats – the customer base we have is varied and we never really know which metrics they’d be interested in.

This task of keeping up with getStats has been a bit of a challenge in the last few months. That’s because in each release something else changes. Each step is reasonable. Needed. Minor. But it brings with it changes we need to do in our own planning and roadmap.

To others, such changes have brought with them breakages as well. At times the need to update and upgrade open source components or to fix their own code.

This is a good thing

It is important to state – the changes and work conducted here by Google is for the better.

Going for a spec compliant WebRTC getStats implementation means we have actual documentation that we expect to work. It also means interoperability with other browsers and components (assuming they strive to spec compliance as well).

Improvements in performance and polishing out best practices means better performance and code for WebRTC applications in general.

Removing deadweight and deprecated/unused statistics and similar components means smaller codebase with less edge cases and “things” to test.

This is what we want our WebRTC implementation to be and look like.

The fact that we need to undergo this ordeal is the price we need to pay for it. It would have been a wee bit nicer if Google would lay their plans of such changes well in advance (not through sporadic PSAs but rather as a kind of a public roadmap). This will enable better planning for those running such applications. But it is what it is. And frankly – we get what we pay for (=free).

Chrome’s WebRTC getStats implementation might not be the reason for bad metric values

Then there are bugs. Metrics you obtain for getStats that don’t seem to reflect reality.

There are usually 3 reasons for that to happen:

  1. Chrome. There’s a Chrome bug that leads to bad metrics results via getStats. As I stated earlier, these get fixed based on the priority and backlog of Google when it comes to their libwebrtc library
  2. You. The value is correct. You just don’t understand what it means or how it gets calculated. Since there’s little in the way of documenting each and every metric in getStats, this is quite common
  3. The other side. When your browser interacts with a non-browser device, a native mobile application or a media server, it gets a lot of the data used to report specific metrics via WebRTC getStats from RTCP reports that are calculated, generated and sent by the other device. That side may also have bugs in it (highly likely and even more)

A few things to remember here:

WebRTC is used by MANY inside browsers. Think billion(s) of people

It is adopted by thousands of applications developing directly and indirectly on top of it

Using statistics is standard practice to optimizing for media quality and most of the large WebRTC applications rely on it heavily already

Why should your application and use case be any different in trusting WebRTC getStats?

What can you do about WebRTC getStats changes?

Nothing.

That said, I do have a few suggestions for you:

  1. Understand and assume that things will change, bugs will be found (and fixed), and that for the most part, getStats is a really powerful and useful tool
  2. Test your application (and its stats) against the latest browser builds. This should include the upcoming beta and even the nightly builds if you’re up for it
  3. Make sure your media servers and other components are up to date. Especially in the RTCP reports they spew. When in doubt, question their behavior before libwebrtc (remember that they also need to run after Google’s implementation of WebRTC in Chrome)
  4. Subscribe and follow the WebRTC Insights. That’s where we flag such upcoming issues, among other things we cater for

The post Can I trust WebRTC getStats accuracy? appeared first on BlogGeek.me.

Can a native media engine beat WebRTC’s performance?

bloggeek - Mon, 02/06/2023 - 13:00

WebRTC is the best media engine out there. And it has nothing to do with its performance…

I’ve been part of the video conferencing industry throughout the first decade of the 21st century and a bit of the 2nd decade as well. The driving force at the time was resolution and frame rate. There was an arms race among vendors as to who provides higher resolutions and frame rates in their room system. A lot of the ethos at the time was the implementation of proprietary media engines that were built for the task at hand. Optimizing and fine tuning them for media quality was considered a core competency.

Fast forward to 2023, what should be the mindset and ethos today?

This is a kind of a continuation to my article on the WebRTC predictions for 2023

Table of contents What is a media engine?

In the context of VoIP and WebRTC, a media engine is a component that takes care of media processing. Simplifying it, a media engine implementation does something like this:

  • Capturing the raw data from the input devices (camera and microphone, but also the display)
  • Encoding that media and then sending it over the network (with WebRTC, that’s using SRTP)
  • Receiving the media from the network and then decoding it
  • Playing it back to the speakers and the display

The media engine also deals with improving voice and video – things such as echo cancellation, noise suppression, packet loss concealment, background blurring, etc.

WebRTC (and libWebRTC) as a media engine

One of the descriptions of WebRTC that I love is that WebRTC is a media engine with a JavaScript API on top.

Google’s implementation of WebRTC is libWebRTC. Originally, it came from its acquisition of GIPS (Global IP Solutions) – a company that licensed their proprietary media engine to VoIP developers. Google took that library, sprinkled the WebRTC API definition on top of it and integrated it with their Chrome browser.

10 years ago, there were other media engines as well. Most large vendors built and maintained their own media engine – especially if their market was video conferencing.

WebRTC, being a standard on both network and interface later, with libWebRTC being an open source implementation of it (that is maintained by Google AND integrated inside the most popular web browser) – became the best media engine out there practically overnight (or at least within 10 years and through a pandemic).

Joining a video call in your browser? Great! If you aren’t using Zoom, then 99.99% chance that what you are using is WebRTC, with the libWebRTC implementation.

Can a media engine other than WebRTC perform better? Made with Midjourney

Yes.

But what does that even mean?

What does performing better than WebRTC mean exactly?

  • If it supports HEVC. Is it better?
  • Let’s say it uses 10% less CPU. Is it better? How about 30% less memory consumption. That’s definitely much better
  • The video encoder compresses the same video input at 5% less bits with similar video quality. Is it better now?
  • It has more resilience to packet losses. It must be better!
  • Offering more voice codecs makes it better. Obviously…

libWebRTC isn’t the best media engine out there. At least not in that one (or more) parameters you’ve decided to compare it with your own proprietary alternative. But does it even matter?

Advantages of native (and proprietary) media engines

Building and maintaining your own native and proprietary media engine? Good for you! Lets’ see what advantages you gain by doing that:

  • You own and control your destiny
    • The code is yours
    • Along with it, the ability to modify it at will
  • Your application, your behavior
    • libWebRTC is optimized for… well… nothing. Almost – it is optimized for Google’s own needs
    • Your implementation of a media engine can be optimized to the exact needs, architecture, hardware and software that you use
  • Easy to differentiate
    • You own the code. You modify it to your heart’s content
    • This means that media specific capabilities can be unique and differentiated
Challenges of native (and proprietary) media engines

Now that we’re happy with building our own native and proprietary media engines, lets see what are our challenges:

  • Resources
    • Developing and maintaining media engines is ridiculously expensive and time consuming
    • There aren’t a lot of experienced media engine engineers out there waiting in line to be hired
  • Availability
    • Where exactly is your media engine running? Windows?
      • Now we need it for Mac
      • Next week on iOS and Android
      • And on a gazillion of devices and chipsets
    • Every new device permutation you need to support is a new headache to deal with and optimize for
    • Did I mention it takes time and money to do that?
  • Browsers
    • You’ve got your super perfect solution, but what happens the moment your customers want to be able to use it in a browser?
    • That’s when you need WebRTC…
    • And for that, you need to gateway and interoperate between your own media engine and the WebRTC implementation found in browsers
    • In most cases, doing that will degrade the media experience AND remove most of your proprietary differentiated features
WebTransport, WebCodecs, WebAssembly

We’re in the 3rd year of the WebRTC unbundling trend. This is still early days.

WebAssembly is here. It is powerful. And it is used more and more, with ever increasing usefulness.

WebTransport and WebCodecs are still great experiments – usable mostly for proof of concepts or early implementations. Using these to power a full fledged media engine that doesn’t make use of WebRTC is still a challenge.

Not all browsers support these interfaces, and those that do still have instabilities and a lot of optimization work to pore into them.

Using these is a long term investment that won’t offer a usable solution for 2023.

Why would I choose WebRTC as my media engine every day of the week?

Going to use your own native and proprietary media engine implementation? Good for you!

But do you need browser support in your application? Are these 5% of the user base or interactions or is it more like 50% or more?

Are you looking to make use of open source media servers and components? If so, then are these available for your proprietary implementation or will it be easier to just use ones that support… WebRTC!

Assuming you need browser support for your application and that said browser support isn’t there just as another unused feature to win a customer deal (and then lay forgotten somewhere), then you should just use WebRTC.

Why?

Because at the end of the day, that’s what browsers have available for you.

The post Can a native media engine beat WebRTC’s performance? appeared first on BlogGeek.me.

WebRTC predictions for 2023

bloggeek - Mon, 01/23/2023 - 13:00

Here are the WebRTC predictions and trends you should expect in 2023. It is more of the same, but with nuanced differences.

As we’re starting 2023, it is time to look back and then into the future, to understand where we are and where we are headed with WebRTC. This year, things are getting somewhat trickier here:

  • WebRTC is a done deal. It is here to stay and there are no questions about the need to use it
  • We’re in a global recession (or about to be in one)
  • The pandemic is over, but rearing its ugly head in China, just when the Chinese government decided to open up everything
  • A new toy just came out (generative AI) with a technology paradigm shift that will affect everyone and everything

Oh, and did I mention that I changed a lot in my own work-life? I am now Chief Product Officer at Spearline, dealing with the larger picture of testing and monitoring communication networks. Life is full of surprises

There’s lots to cover, so let’s start.

Table of contents Our WebRTC map

Before I dive into the predictions, it is important to know where we stand. We’ll do this by looking at 3 different layers:

  1. WebRTC the technology
  2. Open source in WebRTC
  3. CPaaS and WebRTC

Let’s start with the technology itself

The era of differentiation

We are well into the era of differentiation:

This started with Google unbundling WebRTC in the browser, starting to offer pieces of it as separate future W3C standards as well as opening up more access to lower levels of the stack. In the past year we’ve seen growing use of these capabilities outside of Google and experimentation and in production.

2021 brought with it background blurring and replacement in the browser to the masses.

In 2022 we’ve seen proprietary codecs and noise suppression finding a solid home in WebRTC applications and technologies using these capabilities. Representative commercial examples of this are Dolby Voice proprietary codec and Twilio’s Krisp partnership on noise cancellation.

If this is hinting on anything, it is that we’re going to see more of these moving forward, as vendors try to differentiate further. The only thing slowing this trend down is the current market recession.

Peak WebRTC

The pandemic that has raised all boats is all but over.

China is opening up, with or without another COVID wave. Many have shifted to hybrid work. Others are now communicating via video sessions a lot more than they used to.

Zoom is seen as the poster child of the pandemic. If you overlay its stock price with WebRTC usage in Chrome, you get this interesting chart:

WebRTC is still 3-4 times bigger in use than it used to be prior to the pandemic. That said, throughout 2022 we’ve seen consistent decrease in use of WebRTC. This is likely to continue into 2023.

My guess/prediction is that we will stay at around 3 times the use we had at the beginning of 2020.

libWebRTC dominance

libWebRTC is still king of the hill when it comes to WebRTC client-side implementations.

Nothing comes close to it.

libWebRTC is Google’s implementation of WebRTC, and the one used across all browsers today. A monoculture.

For most projects, using libWebRTC as a starting point for a non-browser implementation is the way to go. In some niche use cases, other solutions can and should be considered. The main alternative in such cases is probably Pion today.

2022 has been mostly a year of optimizations and polishing for the libWebRTC implementation, continuing on Google’s focus in 2021. 2023 will look no different.

WebRTC Insights clients received an analysis of the contributors to the libWebRTC project throughout history as part of a recent issue tracker sent to them.

Lets try a quick Q&A here on libWebRTC:

Is there a competitive alternative to libWebRTC in WebRTC?

The most popular WebRTC implementation out there is libWebRTC.

It is also the most dominant since it got embedded in all modern browsers.

libWebRTC is well maintained and is undergoing consistent improvements and optimizations. No other WebRTC stack is getting the same level of investment.

This is not expected to change in the foreseeable future.

Why is Google investing in libWebRTC?

This isn’t about Google Meet. Google is monetizing the web via ads delivered on search conducted in browsers and smartphones. By placing more of our activities in browsers and on the web, Google can monetize more interactions – indirectly.

Then there’s Google Meet/Workspace, competing with Microsoft Office on enterprise productivity.

Commoditizing communications is Google’s way of managing complementary technologies. Ben Thomspon in his latest analysis of AI and the Big Five refers to Joel Spolsky’s Strategy Letter V which offers a great explanation for both Google’s approach and is a good segway to our next section on open source:

Open source is not exempt from the laws of gravity or economics. […] something is still going on which very few people in the open source world really understand: a lot of very large public companies, with responsibilities to maximize shareholder value, are investing a lot of money in supporting open source software, usually by paying large teams of programmers to work on it. And that’s what the principle of complements explains.

Once again: demand for a product increases when the price of its complements decreases. In general, a company’s strategic interest is going to be to get the price of their complements as low as possible. The lowest theoretically sustainable price would be the “commodity price” — the price that arises when you have a bunch of competitors offering indistinguishable goods. So:

Smart companies try to commoditize their products’ complements.

The state of WebRTC open source

Not much has changed since my analysis a year ago on WebRTC trends in 2022, where I looked at WebRTC open source projects.

  • Kurento is still dead
  • Janus is great, in the same way it were a year ago
  • Jitsi is still pushing on group meeting features
  • mediasoup is a solid alternative. Its founders and lead developers who worked at Around now work at Miro, who acquired Around
  • Pion is still growing in adoption and use

Unsurprisingly, Janus, Jitsi, mediasoup and Pion still reserve most of their founders and key figures. These are teams/individuals who are personally and emotionally invested in these projects, which is a good thing.

The challenge is that besides Janus, none of them offer any official support and custom development. For the rest, companies need to rely on in-house development or external outsourcing vendors and freelancers.

As this state hasn’t changed for a good few years, not much is expected to change in 2023.

The main difference or question mark can be put on the projects that are now indirectly owned by a business whose focus might be elsewhere:

  • Jitsi – Jitsi was acquired by Atlassian and then 8×8. 8×8 has its focus in UCaaS, CCaaS and CPaaS. Jitsi as a Service has been released and is promoted by 8×8. But what about its open source project? How much would 8×8 be willing to invest in the open source project in 2023?
  • mediasoup – the mediasoup founders are used to having a “day job”. Yesterday it was Around. Today it is Miro. Tomorrow – who knows? Is that going to affect the mediasoup project in 2023? Probably not, but the recession might have different plans for this project
  • Pion – Pion was created by Sean DuBois, who has an infectious enthusiasm towards it and towards easy accessibility of the WebRTC technology. This will probably continue moving forward
  • Janus – Janus is maintained by Meetecho, a company embedded in open source and providing services around them. The current state of the market is unlikely to change their focus and trajectory
CPaaS and WebRTC

The CPaaS landscape is changing and shifting where it comes to WebRTC.

We started seeing these shifts a couple of years ago, but it seems that change is accelerating in this space – something that is different from what is happening with WebRTC open source.

The perceived leaders in WebRTC CPaaS are still Twilio, Vonage and Agora. I have a feeling that by the end of 2023 this will change.

Let’s review the who’s who of WebRTC in CPaaS.

Twilio

No CPaaS list is complete without Twilio. I’ll obviously start with them.

Twilio is continuing their trend from last year of going after the Customer Experience Platform market.

There was one big change that took place in 2022, where Twilio announced focusing on 4 pillars, instead of spreading all over. This was conveyed in Jeff Lawson’s open letter laying off 11% of their workforce. These focus areas are:

  1. “Investing in our platform reliability and trust” scale, security, optimization, …
  2. Increasing the profitability of messaging SMS and social messaging
  3. Accelerating Segment adoption CDP (Customer Data Platform)
  4. Scaling the Flex customer base CCaaS (contact centers)

No word about WebRTC. Definitely no video in here.

The opposite has happened – Twilio Live, announced in 2021, is being shut down:

Interestingly, its migration guide is recommending Mux, a vendor that just launched a WebRTC video offering as well. Should Twilio customers using Programmable Video also migrate that part to Mux? One wonders

Vonage

Vonage has its hands full with Ericsson who acquired them.

Not much has changed on their platform besides the introduction of background blurring and replacement.

As the honeymoon between Vonage and Ericsson will dissipate, along with the realization of a recession, it will be interesting to see what will happen to the Vonage Video APIs – will the level of investment there remain high or will it shrink?

Agora

Agora’s stock tanked since its peak:

Our information there is more limited than that of Zoom simply because the Agora IPO took place only in 2020.

It got into a recent mud fight with Zoom over the quality of experience that their respective platforms offer.

Zoom

Zoom opted to go with the unbundled approach, using WebRTC only sparsely. For video, they are especially focused on building their own media stack replacing most of what WebRTC does. In the short term, such an approach isn’t too productive. Longer run, who knows?

Zoom and APIs and CPaaS is a long affair by now. One which hasn’t worked out well enough for Zoom. Their browser story wasn’t tight enough until recently. This got them to go head to head with competition and commission a performance report pitting their Zoom Video SDK versus Vonage Video API, Agora, Twilio Programmable Video and Amazon Chime SDK.

This specific post is telling:

  • Zoom is looking to publicize its existence as a video CPaaS vendor. Their market penetration here is smaller than the bigger video CPaaS vendors at the moment. This performance report is their assurance to potential customers that they are competitive in this market
  • Amazon is gaining ground. Zoom decided to add them in because they are now competitive and relevant in this market. The Amazon Chime SDK has penetrated the mindshare of developers and competitors (like Zoom) are noticing
Microsoft

IaaS gone video CPaaS. That was in 2020. Both Microsoft Azure and Amazon AWS introduced their own video APIs.

Microsoft had the better story: Azure Communication Services. Uses the same infrastructure as Microsoft Teams. Being able (in the longer run) to connect directly to Microsoft Teams calls.

The network effect and infrastructure were always in their favor. That said, it doesn’t appear enough in discussions I have with developers building WebRTC applications.

There’s a lot of untapped potential here.

Amazon

I am starting to see the Amazon Chime SDK in more places. It seems that like Amazon Connect, after 3 years of being out there, it is getting the critical mass it needs to become “a thing” in the industry.

This is one to watch closely, especially if you are a video API vendor yourself…

Cloudflare (new entrants)

There’s another IaaS vendor who is joining the party of Video APIs – Cloudflare.

Cloudflare started in 2021 with a managed TURN service. One that is still in private beta.

But they announced and launched on September 2023 two additional services:

  1. Cloudflare Stream – WebRTC-based live streaming 
  2. Cloudflare Calls – WebRTC video group calls

Both API offerings that are well-defined these days in the Video API or WebRTC CPaaS space.

Hopefully, they’ll move faster with these two than they had with their managed TURN service.

Mux (new entrants)

Mux, a vendor who focused on video delivery via APIs has joined the WebRTC market as well, offering their own Video APIs – Mux Real-Time Video. This is an interesting take, especially since their target audience is slightly different than that of developers who end up with CPaaS. It brings a fresh look and interpretation of the problem – just like the IaaS vendors and Zoom are.

The interesting part is that Twilio decided to refer their Twilio Live customers to Mux. If I were Mux, I’d mark every customer coming in from Twilio Live, making sure they get the best experience and support so that 6 months from now I can start talking to them about migrating away from Twilio Programmable Video.

SaaS as CPaaS, Embeddable & Prebuilt An embeddable video call, courtesy of DALL-E

Then there’s the lowcode/nocode trend and how it manifests itself in CPaaS. I’ve written an ebook about it – Lowcode & Nocode in Communication APIs (sponsored by Daily, a known CPaaS vendor). In the past two years we’ve seen more and more CPaaS vendors offering lowcode and nocode solutions on top of their video APIs.

To that specific market/solution, we are seeing SaaS vendors heading as well – for some reason, everyone thinks that CPaaS is a great business.

The notable examples here are Whereby, a meetings platform that started offering Whereby Embedded, and Digital Samba, who started from a webinars platform and is now offering Digital Samba Embedded.

This part of the market will continue to evolve, with CPaaS vendors and others offering ever higher layers of abstraction.

How did I do with my 2022 WebRTC predictions?

We’re done with the market overview. Time to move on to predictions.

I’ll start by looking at how I fared with my 2022 predictions of the upcoming trends

This was a hit and miss thing (obviously).

Hitting the nail

There were three trends that I was spot-on.

#1 – Scale & performance

My bet at the time was that we will continue to see a continuation in improving scale and performance of WebRTC. This was definitely the case for 2022.

At the Kranky Geek event in November 2022, Google in their WebRTC annual update spent the time on quite a few items, but the first one of them was performance optimizations:

We will review this slide a few more times later on.

#2 – #newtech

This is the new technology trend, which was split a bit internally:

  • WebAssembly – WebAssembly is now part and parcel of most dominant WebRTC applications out there. This is achieved today by background blurring/replacement and noise suppression.
  • WebTransport, WebCodecs – we’ve seen more of this, but mostly in the experimentation phase. Not much going in actual production (besides maybe Zoom)
  • AV1 – still an ongoing effort. We’re not there yet, but getting closer

#4 – Live streaming

Live streaming continued to evolve in 2022:

  • Cloudflare joining the fray of vendors offering solutions to it
  • Daily scaled up their live streaming to support 15,000 viewers
  • WHIP and WHEP standardization for… live streaming with WebRTC. A thing with a growing ecosystem. More on that in this Kranky Geek session on WHIP & WHEP
Missing miserably

This is where I got it wrong.

#3 – WebRTC infrastructure, hyperscaling and SD-WAN

Here, I thought we’ll still ponder if Anycast and SD-WAN are important to WebRTC.

And then Subspace got shut down, and with it, a lot of the effort to push this story forward. It is sad, because I do think that striving to lower latencies and clearer networks is the way to go. This setback will delay such attempts by a few years.

#5 – 2D to Metaverse

Extremes and experiments to counter Zoom fatigue. I don’t think that that many new alternatives and suggestions were made in 2022 that we haven’t seen before.

Cloud media processing

This is something I haven’t seen coming. It can’t be considered a trend yet, but it is something to keep a close eye on.

The whole point of using SFUs in WebRTC is in order to reduce infrastructure costs in compute.

BUT…

Google started with doing noise suppression in the cloud for Google Meet a few years back. This means decoding and encoding audio in the cloud in an SFU architecture.

And now Google is doing the same for background replacement on low-end devices

Is that a one-time transitional thing, or will others follow suit?

WebRTC predictions for 2023

Time to look at my predictions for 2023. This is where I think we will see the most focus in WebRTC this year, and how it will shape up.

#1 – libWebRTC (and the future of WebRTC)

In libWebRTC we will see more of the same, with a few nuances.

Google’s WebRTC library is mature. It has all the bells and whistles expected of it. Here’s where we will see Google taking libWebRTC:

  1. House cleaning. Cleaning up unused code (we’ve seen this with the recent and ongoing changes to the stats objects). Getting it ever closer to be spec-compliant. These are all things you do when you have time and no large fires to quell
  2. Squeezing the optimization lemon. Doing more with less. Improving performance in CPU and memory use. Improving the algorithms used for bandwidth estimation, echo cancellation, etc.
  3. Polishing collaboration. We’ve seen this take place in 2022. It will continue into 2023. Google will look for opportunities to introduce additional APIs and configurations to make collaboration easier and polished in WebRTC. Check out how you can share a Google Doc in a Google Meet or a Google Meet in a Google Doc for examples of where and why this is taking place

libWebRTC will maintain its leading and dominant position as the WebRTC stack of choice for client-side development. And Google will take it wherever THEY need it.

#2 – Machine learning and media processing

WebAssembly will continue to be a driving force in 2023 when it comes to WebRTC.

It will be used for media processing and in relatively the same places we see it used and experimented today – background replacement, noise suppression and proprietary codecs implementations.

We will also see it enabling more vendors to leave the peer connection implementations in WebRTC and play around with media engines developed using WebAssembly and running on top of WebRTC data channels or WebTransport.

#3 – Voice before video (Lyra first, AV1 later)

This one is a bit of an overreach, but one I am willing to make.

Lyra, Google’s ML-based voice codec, will find its way into WebRTC before AV1 will. This isn’t in terms of availability, but in terms of adoption and popularity of use.

AV1 takes up too much CPU power and memory. This makes it usable only in high-end devices or devices with newer hardware (which is almost non-existent still). We have ways to go until AV1 can become a reality. Probably one or two more years.

Lyra is here. And it is improving in performance and quality. Microsoft’s Satin is breathing down Google’s neck. Something will have to happen here. And my bet is that this will happen in 2023.

The technology is most probably ready. The market is ready.

You can learn more about it from Phillip Hancke’s session about voice codecs in WebRTC at the recent Kranky Geek event.

#4 – Observability

You can say I am biased. So be it.

Observability was always a real challenge with WebRTC applications. Its nature, due to many reasons (one of them being encryption), makes it hard to monitor using legacy tools and methodologies.

What we will see in 2023 is more interest in observability. We have more products in the market that use WebRTC. Contact centers are moving to the cloud. Many of the bigger vendors are in the process of shifting focus from SIP to WebRTC in their current deployments, and not just as a feature in their checklist.

This will bring with it the need for better tools to understand and figure out how WebRTC sessions behave – both in pre-production and in production.

And now it is time for some shameless self-promotion here –

Watch my session from Kranky Geek, where I discuss on where observability of WebRTC statistics fall short (hint: troubleshooting)

Don’t forget to check out the WebRTC products we have at Spearline

#5 – M&As and shutdowns

This is an easy one to make in 2023.

We’re in recession. It will get better by December. It will get worse and stay with us. Whoever is correct in his estimate at what will happen a year from now, one thing is quite apparent:

Companies are closing their pockets, downsizing and keeping to their core focus.

WebRTC is part of it, and as a relatively new technology, it might be hurt more than others. I don’t think this will be the case, simply because we’re also in transition towards hybrid work due to the pandemic we faced. These two will negate each other a bit.

The end though will be house cleaning of the industry itself:

  • Some vendors will not weather well and will shut down this year. Their technology might even be solid, but not reaching product-market-fit or just missing to execute on a solid business plan will get them there faster
  • Others will find their solution by being acquired. We’ve seen quite a few acquisitions in 2022. We will see more in 2023

This in itself puts a strain on developers who need to choose which CPaaS vendor to use – picking the wrong one may lead them stranded with the need to switch (think Twilio Live). They will go to the bigger, more known vendors. Which will lead to a vicious cycle since the smaller vendors may not have the time to grow quickly enough – potential customers will be less willing to risk using them.

Preparing for a rocky year Rendered using Midjourney

Interesting times ahead.

2023 will shape up to be challenging.

On one hand, we have more of the same in a lot of areas. On the other hand, the current market state is causing a lot of instabilities that will cause some shifts in the market.

And that, without saying a word about generative AI and what that might mean to the market of WebRTC and communications moving forward.

The post WebRTC predictions for 2023 appeared first on BlogGeek.me.

coturn: No Time to Die – Q&A with new project leads

webrtchacks - Tue, 01/17/2023 - 13:45

New coturn project leads Gustavo Garcia and Pavel Punsky give an update on the popular TURN server project, what's new in STUN and TURN standards, and the roadmap for the project

The post coturn: No Time to Die – Q&A with new project leads appeared first on webrtcHacks.

Pages

Subscribe to OpenTelecom.IT aggregator

Using the greatness of Parallax

Phosfluorescently utilize future-proof scenarios whereas timely leadership skills. Seamlessly administrate maintainable quality vectors whereas proactive mindshare.

Dramatically plagiarize visionary internal or "organic" sources via process-centric. Compellingly exploit worldwide communities for high standards in growth strategies.

Get free trial

Wow, this most certainly is a great a theme.

John Smith
Company name

Yet more available pages

Responsive grid

Donec sed odio dui. Nulla vitae elit libero, a pharetra augue. Nullam id dolor id nibh ultricies vehicula ut id elit. Integer posuere erat a ante venenatis dapibus posuere velit aliquet.

More »

Typography

Donec sed odio dui. Nulla vitae elit libero, a pharetra augue. Nullam id dolor id nibh ultricies vehicula ut id elit. Integer posuere erat a ante venenatis dapibus posuere velit aliquet.

More »

Startup Growth Lite is a free theme, contributed to the Drupal Community by More than Themes.