What powers Google Meet and Microsoft Teams? WebRTC Demystified | Step By Step Tutorial
If your heart is beating and you need food to survive, it’s extremely likely that this crazy pandemic forced you to participate in at least one video call. As companies are adjusting their infrastructures in order to accommodate more and more remote work and products such as Zoom have seen a 2900% growth in their daily active users in just 4 months, it’s clear that video calls will become an increasingly important part of virtually everyone’s life.
If you’d love to try and build your own video call app for work, for fun or simply because you have nothing else to do, then this article is for you!
WebRTC In Pills
Usually, before trying out any new technology, I try to get a basic understanding of how it actually works. If you don’t feel this urge and just want to see stuff working, just head to the next article; no one will judge you.
Simply put, WebRTC allows browsers to send data (like audio and video) in real time. It can acquire audio/video streams from webcams and microphones and transmit them peer-to-peer, without the user having to download external plugins or software. Let’s break down how the magic happens.
WebRTC — Main Components
WebRTC comprises the following 3 main components:
- MediaStream: this API gives access to webcams, mics and screens. It controls where the stream is consumed and lets you control the device producing the audio/video stream.
- RTCPeerConnection: this component is the core of WebRTC and allows participants (peers) to connect (more or less) directly, without intermediaries. Each peer transmits his stream (acquired through the MediaStream API) creating audio/video feeds that other peers can subscribe to. This API handles audio/video codecs, NAT traversal, packet loss management, bandwidth management, data transfer and much more.
- RTCDataChannel: this API was designed to achieve bi-directional transfer of arbitrary data. Its interface is inspired by WebSocket, but it runs over UDP (using SCTP on top of DTLS) instead of TCP in order to reduce the congestion and overhead typical of TCP connections.
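To make the three components a bit more tangible, here is a minimal sketch that checks whether an environment exposes them. `detectWebRTCSupport` and `WebRTCSupport` are hypothetical names made up for this example; the properties being probed (`navigator.mediaDevices.getUserMedia`, `RTCPeerConnection` and its `createDataChannel` method) are the standard browser ones. The global object is passed in as a parameter so the function can also be tried outside a browser.

```typescript
// Hypothetical result shape: one flag per WebRTC building block.
interface WebRTCSupport {
  mediaStream: boolean;
  peerConnection: boolean;
  dataChannel: boolean;
}

// Probes a global-like object (e.g. `globalThis` in a browser) for the three APIs.
function detectWebRTCSupport(globalLike: any): WebRTCSupport {
  const mediaDevices = globalLike?.navigator?.mediaDevices;
  const PeerConnection = globalLike?.RTCPeerConnection;
  const hasPeerConnection = typeof PeerConnection === "function";
  return {
    // MediaStream: camera/mic capture is exposed through getUserMedia().
    mediaStream: typeof mediaDevices?.getUserMedia === "function",
    // RTCPeerConnection: the constructor itself must exist.
    peerConnection: hasPeerConnection,
    // RTCDataChannel: channels are created from a peer connection.
    dataChannel:
      hasPeerConnection &&
      typeof PeerConnection.prototype?.createDataChannel === "function",
  };
}
```

In a browser you would call `detectWebRTCSupport(globalThis)` and, for example, hide the "start call" button when any of the flags is false.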
WebRTC — Data Exchange Flow
Sending and receiving audio/video streams with web servers such as YouTube is easy. Bob’s computer asks a DNS server for the address of YouTube.com and bam, he has it. Bob’s browser makes a request to the address he’s received and when he does so, YouTube knows how to send him that baby shark video he asked for.
But what if Bob, after a nasty fight with Alice, wanted to establish a peer-to-peer connection with his true friend Gareth to tell him how Alice and Eve were talking behind his back all the time and now he doesn’t want to have anything to do with either of them ever again? He could try and ask the same DNS server how to reach Gareth, but if he did that the DNS server would reply saying it’s never even heard of the guy. So how is a WebRTC call established? Let’s find out 🚀
Part 1 : SDP
The first thing our friend Bob would need to do in order to get in touch with Gareth would be generating an SDP (Session Description Protocol) offer. This offer contains information about the session Bob wants to start: what codecs he is able to understand, what kind of media he wants to transmit (audio/video/generic data) and more. More importantly, the SDP offer should also contain a list of IP addresses and ports on which Bob is prepared to receive incoming media streams, which Gareth will use to communicate with him. How does Bob generate this list? Please welcome ICE!
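An SDP offer is just text, so it can help to look at one. The sketch below uses a hand-written, heavily trimmed sample (real offers produced by a browser are much longer) and a tiny parser that lists the media sections and the codecs advertised in each. `sampleOffer`, `MediaSection` and `parseSdpMedia` are hypothetical names introduced for this example, and the parser only understands the `m=` and `a=rtpmap:` lines.

```typescript
// Hand-written, trimmed sample: one audio and one video section.
const sampleOffer = [
  "v=0",
  "o=- 4611731400430051336 2 IN IP4 127.0.0.1",
  "s=-",
  "t=0 0",
  "m=audio 9 UDP/TLS/RTP/SAVPF 111",
  "a=rtpmap:111 opus/48000/2",
  "m=video 9 UDP/TLS/RTP/SAVPF 96",
  "a=rtpmap:96 VP8/90000",
].join("\r\n");

interface MediaSection {
  kind: string; // "audio", "video" or "application"
  port: number;
  codecs: string[]; // codec names taken from the a=rtpmap lines
}

function parseSdpMedia(sdp: string): MediaSection[] {
  const sections: MediaSection[] = [];
  for (const rawLine of sdp.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line.startsWith("m=")) {
      // m=<media> <port> <proto> <fmt> ...
      const [kind, port] = line.slice(2).split(" ");
      sections.push({ kind, port: Number(port), codecs: [] });
    } else if (line.startsWith("a=rtpmap:") && sections.length > 0) {
      // a=rtpmap:<payload type> <encoding name>/<clock rate>[/<channels>]
      const encoding = line.split(" ")[1] ?? "";
      sections[sections.length - 1].codecs.push(encoding.split("/")[0]);
    }
  }
  return sections;
}
```

Calling `parseSdpMedia(sampleOffer)` returns one entry for the audio section (with `opus`) and one for the video section (with `VP8`). In a browser, the text to feed into it would come from the `sdp` property of a session description.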
Part 2: ICE (ICE Baby)
In order to instruct Gareth on how to connect with him, Bob goes through a process known as ICE candidate gathering.
ICE (Interactive Connectivity Establishment) is a standard method of NAT traversal: it gathers candidate addresses and runs connectivity checks on them to find a path along which media can flow through NATs. ICE collects candidates with the help of STUN and TURN servers. Let’s take a quick look at what these fancy acronyms stand for:
- STUN (Session Traversal Utilities for NAT) server: allows Bob to find out his public IP address, what kind of NAT he’s behind and which internet side port is associated by the NAT device with a particular local port on Bob’s machine.
- TURN (Traversal Using Relays around NAT) server: if a direct connection cannot be established with the addresses discovered through STUN (for example because of a restrictive NAT or firewall), a request is made to the TURN server, which will act as a media relay. The TURN server provides a public IP address and port of its own that will forward packets between the two peers. This relay address is added to the ICE candidate list. Of course, in this case the call won’t be truly peer-to-peer, as all data exchanged between Bob and Gareth will flow through the TURN server.
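Each ICE candidate ends up as a single line of text, and its `typ` field tells you where it came from: `host` for a local interface, `srflx` for an address learned through STUN, `relay` for one allocated on a TURN server. Here is a rough sketch of a parser for that line. The names `parseIceCandidate`, `ParsedCandidate` and `describeCandidate` are made up for this example, the IP addresses are from ranges reserved for documentation, and optional trailing fields such as `raddr`/`rport` are simply ignored.

```typescript
// The four candidate types defined by ICE.
type CandidateType = "host" | "srflx" | "prflx" | "relay";

interface ParsedCandidate {
  foundation: string;
  component: number;
  transport: string;
  priority: number;
  address: string;
  port: number;
  type: CandidateType;
}

const CANDIDATE_TYPES: CandidateType[] = ["host", "srflx", "prflx", "relay"];

const CANDIDATE_DESCRIPTIONS: Record<CandidateType, string> = {
  host: "local address of one of the machine's own network interfaces",
  srflx: "public address as seen by a STUN server (server reflexive)",
  prflx: "address discovered during connectivity checks (peer reflexive)",
  relay: "address allocated on a TURN server, which relays the media",
};

// Expected layout:
// candidate:<foundation> <component> <transport> <priority> <address> <port> typ <type> ...
function parseIceCandidate(line: string): ParsedCandidate {
  const fields = line.trim().replace(/^(a=)?candidate:/, "").split(/\s+/);
  const type = CANDIDATE_TYPES.find((t) => t === fields[7]);
  if (fields.length < 8 || fields[6] !== "typ" || type === undefined) {
    throw new Error(`Not a valid ICE candidate line: ${line}`);
  }
  return {
    foundation: fields[0],
    component: Number(fields[1]),
    transport: fields[2].toLowerCase(),
    priority: Number(fields[3]),
    address: fields[4],
    port: Number(fields[5]),
    type,
  };
}

function describeCandidate(candidate: ParsedCandidate): string {
  return CANDIDATE_DESCRIPTIONS[candidate.type];
}
```

For example, `parseIceCandidate("candidate:2 1 udp 1686052607 203.0.113.7 46154 typ srflx raddr 192.168.1.10 rport 46154")` yields a candidate of type `srflx`: the public address and port that the NAT assigned to Bob, as reported by the STUN server.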
Part 3: Signaling
Now that each client has gathered some media to send and created an offer, how does it get to the other peer? This is where signaling takes place: the discovery and negotiation process to establish the network session connection with Gareth, or more generally speaking, the other peer. WebRTC doesn’t enforce a particular signaling protocol, so you can use pretty much anything you like: SIP, WebSockets, XMPP, carrier pigeons or even smoke signals if that’s what you’re into (and latency is not an issue). No matter what you end up picking, the main idea is that each peer contacts a signaling server which acts as an intermediary to exchange the necessary information:
- Network data: where the peers are located on the internet so that they can find each other
- Session control information data: when and how to open, close, modify the session
- Media data: which audio/video codecs do both peers understand?
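Since WebRTC leaves signaling up to you, the message format is yours to invent. The sketch below is one possible shape, not a standard: `SignalMessage`, `SignalListener` and `InMemorySignalingServer` are hypothetical names, and the "server" is just an in-memory relay standing in for whatever transport you pick. Offers and answers carry the media data, candidates carry the network data, and `bye` is a stand-in for session control.

```typescript
// One possible set of signaling messages (an assumption, not part of WebRTC).
type SignalMessage =
  | { kind: "offer"; sdp: string } // media data: what the caller can send/receive
  | { kind: "answer"; sdp: string } // media data: what the callee agreed to
  | { kind: "candidate"; candidate: string } // network data: where to reach a peer
  | { kind: "bye" }; // session control: hang up

type SignalListener = (from: string, message: SignalMessage) => void;

// In-memory stand-in for a signaling server: it only forwards messages.
class InMemorySignalingServer {
  private peers = new Map<string, SignalListener>();

  // Registers a peer and the callback that receives its incoming messages.
  join(peerId: string, onMessage: SignalListener): void {
    this.peers.set(peerId, onMessage);
  }

  // Forwards a message to `to`; returns false if that peer never joined.
  send(from: string, to: string, message: SignalMessage): boolean {
    const deliver = this.peers.get(to);
    if (deliver === undefined) {
      return false;
    }
    // Round-trip through JSON to mimic what a real network transport would do.
    deliver(from, JSON.parse(JSON.stringify(message)) as SignalMessage);
    return true;
  }
}
```

Note that the server never looks inside the SDP or the candidates. It only passes them along, which is why almost any transport will do.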
Now that we’ve taken an overall look at what happens behind the curtains, let’s see what steps and which parts of the actual API are needed to get a bare minimum call between Bob and Gareth going.
1. Bob’s browser asks for access to the local webcam/mic using navigator.mediaDevices.getUserMedia(). Constraints such as video resolution or echo cancellation can be passed to this call.
2. Bob creates a new instance of RTCPeerConnection and adds the previously granted media tracks to the set of tracks which will be transmitted to the other peer. The RTCPeerConnection constructor accepts an RTCConfiguration parameter where you can specify various settings such as ICE servers (STUN and TURN server URLs), certificates and more.
3. Bob creates the SDP offer.
4. Bob sets the newly created offer as the local description (RTCPeerConnection.setLocalDescription()) and sends it to the receiver through the signaling channel.
5. Gareth waits for an incoming offer. As soon as he receives it, he creates his own instance of RTCPeerConnection and immediately afterwards sets the received offer as the remote description.
6. Gareth repeats steps 1 and 2 to capture local media and attach it to the peer connection.
7. Gareth creates the SDP answer.
8. As in step 4, Gareth sets the answer as his local description and sends it to Bob through the signaling channel.
9. Finally, Bob receives Gareth’s answer and sets it as his remote description.
10. Hurrah! Bob and Gareth can now exchange media 🎉🥳
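Put together, the steps above could look roughly like the sketch below. It is a simplified illustration under a few assumptions: `startCall`, `answerCall` and `acceptAnswer` are hypothetical helper names, and the `PeerLike` and `StreamLike` interfaces are declared locally to mirror only the subset of `RTCPeerConnection` and `MediaStream` used here, so the code can also be exercised with fakes outside a browser. Exchanging ICE candidates and error handling are left out to keep it short.

```typescript
// Minimal local stand-ins for the browser types (only what this sketch uses).
interface StreamLike {
  getTracks(): unknown[];
}
interface SessionDescriptionLike {
  type?: string;
  sdp?: string;
}
interface PeerLike {
  addTrack(track: unknown, stream: unknown): unknown;
  createOffer(): Promise<SessionDescriptionLike>;
  createAnswer(): Promise<SessionDescriptionLike>;
  setLocalDescription(description: SessionDescriptionLike): Promise<void>;
  setRemoteDescription(description: SessionDescriptionLike): Promise<void>;
}
type GetMedia = () => Promise<StreamLike>;
type SendSignal = (description: SessionDescriptionLike) => void;

// Bob's side: capture media, attach it, create the offer and send it.
async function startCall(peer: PeerLike, getMedia: GetMedia, send: SendSignal): Promise<void> {
  const stream = await getMedia(); // step 1
  for (const track of stream.getTracks()) {
    peer.addTrack(track, stream); // step 2
  }
  const offer = await peer.createOffer(); // step 3
  await peer.setLocalDescription(offer); // step 4
  send(offer);
}

// Gareth's side: accept the offer, attach his own media, reply with an answer.
async function answerCall(
  peer: PeerLike,
  offer: SessionDescriptionLike,
  getMedia: GetMedia,
  send: SendSignal
): Promise<void> {
  await peer.setRemoteDescription(offer); // step 5
  const stream = await getMedia(); // step 6
  for (const track of stream.getTracks()) {
    peer.addTrack(track, stream);
  }
  const answer = await peer.createAnswer(); // step 7
  await peer.setLocalDescription(answer); // step 8
  send(answer);
}

// Back on Bob's side: the answer completes the negotiation.
async function acceptAnswer(peer: PeerLike, answer: SessionDescriptionLike): Promise<void> {
  await peer.setRemoteDescription(answer); // step 9
}

// In a browser, the pieces would be wired up along these lines
// (stun.example.org is a placeholder, sendToGareth a hypothetical signaling function):
//   const peer = new RTCPeerConnection({ iceServers: [{ urls: "stun:stun.example.org" }] });
//   startCall(peer, () => navigator.mediaDevices.getUserMedia({ audio: true, video: true }), sendToGareth);
```

Passing the peer connection and the media getter in as parameters is purely a design choice for this example: it keeps the three functions free of browser globals, so the order of the calls is easy to follow and to test.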
This article has provided a high-level overview of what WebRTC is and the macro components that make the magic happen. Want to see it in action?
Head over to the next article of this series, where we’ll be building a simple video call app using Angular and PeerJS (a neat wrapper library built on top of WebRTC).
If you liked this article please feel free to talk to your grandma about it, send me a couple of bitcoin, do a backflip, or leave a comment in the section below :)