Metaverse Communication with Spatial Audio in Virtual Meetings

Online meetings and conferencing platforms have radically improved since the beginning of the Covid-19 pandemic. MS Teams meetings, Zoom calls, and similar video calls have become routine. Meanwhile, there is a new hype called the metaverse, which is hardly ever discussed from an audio perspective.

But despite growth in virtual productivity, real-time collaboration is still inconvenient and restrictive. Spatial audio virtual meeting platforms are emerging as one of the most promising alternatives to traditional conferencing methods. They enable, you guessed it, communication with 3D sound, just like in real life.

The benefits of spatial audio in communication platforms have already been looked at in this article. This article looks into the key ingredients of meeting platforms that put audio first!

Thanks to Aïli Niimura, Daniela Rieger, Immersitech, and Atmoky for contributing to writing this article.

Update 23

Dolby.io virtual worlds solutions added

3D sound-based virtual meeting platforms

3D sound-based virtual meeting platforms are liminal and hard to categorize. Often, you enter a website where you interact with other people in real-time through sound. You can see yourself move through a virtual space via a 2D or 3D visual representation, as an avatar if you wish.

The experience feels akin to a Zoom call or Microsoft Teams meeting, and yet these platforms are nothing of the sort. Spatial audio communication sets these tools apart from standard call platforms.

While 3D sound-based virtual meeting platforms can differ individually, they all have a few things in common:

Spatial audio communication plays a major role.
Visuals take a back seat.
Social communication, interaction, intelligibility, and live performance are emphasized.

What is the difference between 3D virtual meetings and platforms like Zoom?

How are they different from social VR platforms like AltspaceVR and Spatial? Or virtual meeting places like Virbela, Mozilla Hubs, and audio-only platforms like Swell and Clubhouse?

Spatial audio communication solutions can complement fully-fledged immersive worlds and are also valuable entities in and of themselves, meaning they can exist as standalone communication tools without much visual representation.

While most 3D social platforms prioritize graphics and avatars, these sound-based environments are less visually taxing. Instead of working with interactions that seem more like a video game, we get visuals that serve to support the sonic interaction.

These platforms emphasize spatial audio as the primary means of sensing your surroundings.

One of the key differences between the two is that in 3D sound-based environments, visuals do not have to be 3D, but audio must be.

2D Platforms

It seems that these platforms require a much smaller cognitive leap for the average person than jumping into Facebook Horizon or AltspaceVR for the first time.

While many social VR worlds do have a 2D version for PC, they tend to be clunkier, without many of the bells and whistles that are necessary for the experience. In these 2D top-down environments, only the audio is 3D, while the 2D visuals represent navigation on a horizontal plane.

The 3D audio software company atmoky shows a possible realisation of this approach, combining 2D video and high-resolution 3D audio, in their Web SDK demo video. On their homepage you can also test the tool yourself: atmoky - Spatial Audio Web SDK Demo.
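
To make the top-down idea concrete, here is a minimal sketch of how a 2D map position could be turned into the direction and distance a spatial audio renderer needs. It is not taken from any of the platforms mentioned here. The function name, the coordinate convention (x to the right, y up, heading 0 degrees facing up the map) and the sign convention (positive azimuth to the right) are all assumptions made for illustration.

```python
import math


def relative_azimuth_and_distance(listener_xy, listener_heading_deg, source_xy):
    """Return (azimuth_deg, distance) of a source relative to a listener on a 2D map.

    Assumed convention: x points right and y points up on the map, a heading of
    0 degrees means the avatar faces +y, headings grow clockwise, and a positive
    azimuth means the source is to the listener's right.
    """
    dx = source_xy[0] - listener_xy[0]
    dy = source_xy[1] - listener_xy[1]
    distance = math.hypot(dx, dy)
    # Bearing of the source, measured clockwise from the +y axis of the map.
    bearing_deg = math.degrees(math.atan2(dx, dy))
    # Subtract the listener's heading and wrap the result into [-180, 180).
    azimuth_deg = (bearing_deg - listener_heading_deg + 180.0) % 360.0 - 180.0
    return azimuth_deg, distance
```

With the listener at (0, 0) facing up the map, a source at (1, 0) comes out at an azimuth of 90 degrees (directly to the right) and a distance of 1. If the listener turns to face right (heading 90), a source at (0, 2) ends up at -90 degrees, so on the left. That is the "move or rotate and the sound follows" behaviour, with only 2D visuals involved.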

Audio-based social apps such as Swell or Clubhouse offer advanced networks of audio communication but are in dire need of spatial audio.

But if they did have spatial audio, would they be classified as one of these 3D audio virtual conferencing platforms? On one hand, we shouldn’t be too picky and just say yes, they would be. On the other hand, regardless of spatial audio integration, the ability to navigate around the space would be missing.

Even with 2D graphics, it is possible to convey movement throughout a 2D space. Taking things a step further and adding head-tracking may be one way to improve the platform, though that goes without saying for all spatial audio experiences. Here is a snippet of Immersitech’s approach that we will highlight later.

3D Platforms

AltspaceVR, Facebook Horizon, and Mozilla Hubs are all social VR platforms that enable spatial audio communication. These kinds of experiences are closely related to what we’re talking about but differ in some pretty major ways. Firstly, they focus primarily on the visual environment and a first-person perspective user interface.

While they might be considered more fleshed out than a primarily audio-based solution, better graphics and environment-heavy solutions are not always the most important features for easy and profound social communication. They certainly take a lot longer to jump in and jump out of.

The same also applies to other web-based 3D meeting places such as Virbela. A 3D meeting platform that requires 2D interaction (you are still using a PC) shouldn’t need a complex 3D environment, where so much gets lost in just trying to operate the system.

What do most virtual conferencing platforms use?

Virtual communication platforms usually need to compress audio in order to transmit it with low latency. Mono is often used to cut down on computational cost as well. Thanks to technological improvements, platforms like Zoom and Microsoft Teams are going from mono to stereo. It’s a start, but far from where we can and must go.

That said, you don’t need much to spatialize audio. You can spatialize any number of channels. Virtual conferencing platforms will likely encode positional data in their audio codecs in the future, creating three-dimensional audio environments for conference participants. The advantages are obvious: a more natural acoustic environment, better intelligibility of the people speaking even when speaking simultaneously, reduced fatigue as well as increased productivity.
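
As a rough illustration of how little is needed to place a mono voice in space, the sketch below pans a mono buffer to stereo with a constant-power pan law. This is a deliberately simplified stand-in: real spatial audio renderers typically use binaural (HRTF-based) processing, not plain panning, and the function name and the azimuth range of -90 to 90 degrees are assumptions for this example.

```python
import math


def pan_mono_to_stereo(samples, azimuth_deg):
    """Constant-power pan of a mono buffer to a list of (left, right) pairs.

    azimuth_deg is clamped to [-90, 90]: -90 is hard left, 0 is centre,
    90 is hard right. This is simple panning, not binaural rendering.
    """
    az = max(-90.0, min(90.0, azimuth_deg))
    # Map -90..90 degrees onto a pan angle of 0..pi/2.
    theta = (az + 90.0) / 180.0 * (math.pi / 2.0)
    left_gain = math.cos(theta)
    right_gain = math.sin(theta)
    return [(s * left_gain, s * right_gain) for s in samples]
```

Because the two gains are the cosine and sine of the same angle, their squares always sum to one, so the perceived loudness stays roughly constant while the voice moves from left to right. A platform that transmits a position alongside each mono stream could feed that position into a renderer of this kind on the receiving side.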

Since Corona, there has been a huge rise in such virtual events. So I shared more insights on how you can make them a success.

Why is spatial audio so important?

Spatial audio is fundamental for feeling natural and comfortable in a virtual space. Beyond ergonomics and increased productivity/intelligibility, spatial audio communication has many other straightforward benefits, such as improved virtual navigation and emotional resonance.

Navigation in a virtual space

In real life, most of us use more than just our eyes to navigate. We constantly use several senses to guide ourselves through this strange and beautiful world we live in. Navigating a 3D virtual world should be the same: not in a 1-to-1 sense, but in that we need more than one sensory mode in order to feel natural and comfortable, and to get around.

Spatial audio lets virtual spaces do more, more easily

It pares down the level of interactions actually needed, resulting in less manual work and more natural communication.

With spatial audio, virtual spaces have to work less hard to foster the types of interactions they were built to host. We can

think faster by using our instincts rather than cognitive reasoning.
waste less time doing manual work and writing out our thoughts.
engage in natural communication for longer periods of time.

Many of us have different styles of productivity and communication, but hearing a voice is more natural and intimate, yet so much less invasive. Its intuitive nature simply exists in and of itself. Perhaps that’s where the rise in 3D audio social conferencing platforms comes from – there is an obvious need for more natural communication.

The realism of natural communication is more easily achieved when it’s sonic rather than visual, due to the computational cost of 3D graphics and individualization of avatars.

Can spatial audio deepen the quality of our communication?

Making use of this level of instinct does more than just increase attention and productivity – it also adds emotion to our interactions. With naturalistic spatial audio, we can make our virtual speech communications convey more thought, depth, and emotional connection.

Every system has an adaptation period. When you move to a new country, you need to adapt to the culture, and when you use a different laptop, the keys just feel different. The same applies to virtual environments!

The less you have to do and adapt to, the better. Removing the clutter within 3D virtual environments and replacing it with 3D audio is one solution, while elevating a 2D virtual environment with spatial audio is another.

Why use these 3D audio worlds instead of Zoom?

They are better for creating groups naturally
It’s possible to separate individuals out into space while still feeling like you are together
You can see and hear everyone
You can have multiple sound events going on concurrently – talks, concerts, etc.

Spatial audio communication is the driving enabler of all the above. What a powerful tool!

What can 3D audio worlds be used for? Well, online conferencing is a straightforward use case. But gaming, for example, would also be greatly enhanced if you could hear your friends’ voices coming from their avatars. The use cases are truly quite broad, since our voices tell us so much about ourselves and our environment.

We also use our voices to communicate most of the time. In virtual settings, spatial audio can be a much more direct way of engaging in the experience than via visuals.

Not all spatial sound is the same. In order to achieve the full effect of 3D audio in virtual environments, it is not enough to reproduce the directions of the sound sources correctly. It’s also crucial to ensure a high degree of externalisation when reproducing through headphones. Only with this so-called out-of-the-head perception can we speak of a real 3D audio experience.

This is the goal of the 3D audio software company atmoky with their new product atmoky Ears. According to their own information, it is a tool for binaural rendering of 3D audio content for headphones. It includes a 3D audio externalizer, HRTF personalization for different age groups, and a performance mode.

Have you heard of High Fidelity?

High Fidelity is a real-time spatial audio API for group chats. From the people that brought you Second Life, remember?

High Fidelity has been out there for a long time. The company went through an extensive journey, from being a robust social VR platform to a top-down 2D social environment with 3D audio. Their first demo showed how you could listen to multiple DJ sets without one DJ’s sound conflicting with another.

As you move or rotate, the sound around you transforms accordingly, just like it might in real life, or perhaps even better.

High Fidelity took their strengths of low-latency networked audio, mature audio spatialization and de-noising, and created a tool for adding these capabilities to any platform.

That means that they’ve created an API (application programming interface) for developers to integrate into their own existing or budding tools. It’s easy to implement and works well. What’s great about this initiative is that it will (hopefully) provide an incentive for more online social platforms to move into spatial audio communication without having to spend further resources on creating their own spatial audio engine.

Are there other spatial audio integration tools?

Companies like Dirac, Immersitech, and Atmoky have also developed spatial audio integrations for voice calls.

These integrations allow you to hear your manager’s voice as if it’s coming from their video display. A speaker who can be seen on the top right of the screen is therefore also perceived acoustically from this position, which adds an extra level of comfort and realism.
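
How a position on screen might become a direction for the voice can be sketched in a few lines. The mapping below is purely illustrative and is not the method of Dirac, Immersitech, or Atmoky. The function name, the grid layout and the maximum angles of 30 and 15 degrees are assumptions chosen for the example.

```python
def tile_to_direction(col, row, n_cols, n_rows,
                      max_azimuth_deg=30.0, max_elevation_deg=15.0):
    """Map a video tile's grid cell to an (azimuth, elevation) pair in degrees.

    Column 0 is the leftmost tile and row 0 is the top row. The centre of the
    grid maps to (0, 0); positive azimuth is to the right, positive elevation
    is up. The maximum angles are illustrative defaults, not measured values.
    """
    # Normalised position of the tile centre, in [-1, 1] on both axes.
    x = ((col + 0.5) / n_cols) * 2.0 - 1.0
    y = 1.0 - ((row + 0.5) / n_rows) * 2.0
    return x * max_azimuth_deg, y * max_elevation_deg
```

In a 2 by 2 gallery view, the top-right tile (column 1, row 0) lands at 15 degrees to the right and 7.5 degrees up with these defaults, so the voice is rendered slightly right of and above centre, matching where the speaker is seen.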

These tools are particularly interesting as competition for spatial audio interactions in our embedded devices arises. It’s a new (ish) pocket of spatial audio that doesn’t get enough attention. It’s really useful right now, but we’ll see in the long run how necessary third-party integration tools become.

The rush of competitors in this space is somewhat overwhelming, but we will surely see how they evolve as our mobile electronics evolve to include these features as built-in capabilities (as announced by Apple in iOS 15). Here is the example of Atmoky.

Gathertown & Mozilla Hubs

Gathertown is a 2D video-calling space that lets multiple people hold separate conversations in parallel, walking in and out of those conversations just as easily as they would in real life.

It is slightly reminiscent of Mozilla Hubs, the 3D virtual room that you can share with friends. Within these environments, you can “watch videos, play with 3D objects, or just hang out.” You can also create your own environment. I know of a startup that recreated their office space here due to Covid. It was a fun way of coming together in virtual spaces during the home office period.

Although Gathertown is a 2D visual environment and Mozilla Hubs is fully 3D and cross-platform, they share similar qualities such as simplicity and customizability.

Gathertown can be described as cute: sometimes very smart and at other times impractical. While it does embrace the 8-bit pixelated art style, it somehow falls a little flat, quite literally, and doesn’t look very personal. The tutorial didn’t feel very efficient, and I wasn’t convinced that I would virtually leave a note at someone’s desk.

In Gathertown, I placed a sound object in space, but the audio was not spatialised. I would therefore lean towards using Mozilla Hubs as a long-term solution due to its spatialization system and easy interface. However, I could see Gathertown as a fun playground for a team-building activity, since it might be more familiar to folks due to the old-school video game style.
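
Walking in and out of a conversation boils down to making voices quieter with distance. The sketch below uses the formula of the "inverse" distance model from the Web Audio API and adds a hard cutoff radius. I do not know how Gathertown or Mozilla Hubs implement this internally, so the function name, the default values and the cutoff are assumptions for illustration only.

```python
def inverse_distance_gain(distance, ref_distance=1.0, rolloff=1.0, max_distance=None):
    """Gain for a voice at the given distance, using an inverse distance law.

    Within ref_distance the voice plays at full level. Beyond it, the gain is
    ref / (ref + rolloff * (distance - ref)), as in the Web Audio API's
    "inverse" distance model. The optional hard cutoff at max_distance is an
    addition for this sketch: it mutes voices outside a conversation radius.
    """
    if max_distance is not None and distance > max_distance:
        return 0.0
    d = max(distance, ref_distance)
    return ref_distance / (ref_distance + rolloff * (d - ref_distance))
```

With the defaults, a voice three units away plays at one third of its level, and with `max_distance=8` anyone further than eight units away is silent. Two groups standing far enough apart therefore do not disturb each other, while stepping closer fades a conversation in.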

Degy World

Degy World is a PC VR conferencing platform

Many PC social VR worlds feature spatial audio prominently. That’s because the game engines that these worlds are built with make it easy to have a voice sound as if it is coming from the avatar. Of course, quality 3D audio design extends beyond the built-in qualities of a game engine.

Case in point: all the worlds that focused on visuals and virtual conferencing fell a little flat to me, even if they also featured spatial audio.

Degy World was one example of a conferencing platform wherein my user experience was improved by spatial audio. Despite the experience being a little more than janky, I ended up making quality connections. Keeping these connections was made easier through direct messages. Sending written direct messages is always a plus, even when spatial audio communication is involved. We want hybrid worlds, not single-modality worlds.

Immersitech – Bringing Clarity to Communications

Immersitech delivers advanced noise cancellation, voice clarity and spatial audio for unified communications, virtual events, and social entertainment applications like gaming. Immersitech’s SDKs provide dramatic audio improvements with a simple premise: remove unwanted noise, enhance the audio users want to hear, like voices, and then place the user into an immersive, spatial environment. The result is a better-than-reality experience.

Immersitech’s SDKs are designed to provide an easy-to-integrate package for service providers offering cloud, mobile edge, or premise-based communications. They allow providers to establish unique user configurations that can be positioned to optimize the proximity to the main speakers and/or sports action, including being stacked so everyone has a front seat. Among the many exciting features are enhanced environment options that allow users to actively select others to talk privately with while still hearing the main session (Whisper), or to create a sub-conference for private discussions (Sidebar).
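
To illustrate the Whisper and Sidebar ideas conceptually, the sketch below computes which voices a given listener hears and at what level. It is not Immersitech’s SDK or API. The function name, the data structures and the choice to turn the main session down (rather than leave it untouched) during a whisper are all hypothetical.

```python
def listener_gains(listener, participants, whispers=(), sidebars=(), duck_gain=0.5):
    """Return {speaker: gain} describing what `listener` hears (hypothetical model).

    whispers: groups of participants talking privately while still hearing
              the main session (here at the assumed reduced level duck_gain).
    sidebars: groups of participants in a closed sub-conference.
    """
    whispers = [frozenset(w) for w in whispers]
    sidebars = [frozenset(s) for s in sidebars]

    def group_of(person, groups):
        # Return the first group containing the person, or None.
        for group in groups:
            if person in group:
                return group
        return None

    my_sidebar = group_of(listener, sidebars)
    my_whisper = group_of(listener, whispers)
    gains = {}
    for speaker in participants:
        if speaker == listener:
            continue
        their_sidebar = group_of(speaker, sidebars)
        their_whisper = group_of(speaker, whispers)
        if my_sidebar is not None or their_sidebar is not None:
            # Sidebar: only members of the same sub-conference hear each other.
            same = my_sidebar is not None and my_sidebar == their_sidebar
            gains[speaker] = 1.0 if same else 0.0
        elif their_whisper is not None:
            # Whispered speech is private to the whisper group.
            gains[speaker] = 1.0 if their_whisper == my_whisper else 0.0
        elif my_whisper is not None:
            # Whisperers keep hearing the main session, at a reduced level.
            gains[speaker] = duck_gain
        else:
            gains[speaker] = 1.0
    return gains
```

For a meeting with a host, Ann and Bob whispering, and Cy and Dee in a sidebar, Ann hears Bob at full level and the host at half level, while Cy hears only Dee. The per-listener gains could then be combined with a spatial renderer so that each audible voice also keeps its position.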

With immersive spatial audio, the benefits of attending a virtual meeting, event, or other virtual social interaction include a more natural audio experience, with audio seemingly coming from all directions. Spatial audio not only increases the enjoyment of the audio experience but also increases user engagement and reduces fatigue.

Dolby.io Virtual Worlds

Dolby.io Virtual Worlds wants to revolutionize the way people experience virtual worlds. By leveraging the Dolby Atmos technology, individuals are able to be heard more clearly and their voices can be spatially positioned. This creates a more immersive experience and facilitates communication between users.

Through Unreal Engine and Unity plugins, developers are provided with an easy-to-implement solution for adding spatial audio to their projects. Additionally, features such as spatial mixing, placement, AI noise reduction, echo cancellation, dynamic audio leveling and real-time streaming of live audio/video into virtual worlds offer maximum flexibility for creators in their applications of Dolby.io Spatial Audio.

Conclusion

There is a rise in shared virtual worlds that rely on spatial audio communication to revolutionize and improve communication platforms through the efficient transmission of sound and a non-fatiguing sense of presence. These platforms do not require heavy 3D graphics and often have a simple 2D user interface.

These platforms promise a future wherein virtual connections refrain from existing in the extreme. Spatial audio conferencing platforms will ideally just become a natural part of existing conferencing tools, while supporting emerging technology.

Hopefully, we can stop retroactively outfitting social tools with quality audio and spatial sound, and start building better environments from the get-go. Spatial sound for virtual communication can go a long way, and a host of new and existing tools are determined to prove it.

The tools offered by some of these solutions invite you to question how spatial audio can be integrated into your existing experiences, platforms and tools. It is critical in the coming months and years to be open-minded towards this audio framework, and to create the head-tracking abilities in our devices necessary for advancing this revolution even further.

A lot to digest? No worries, you now know who to talk to. Ask me anything!