A new open-source developer app for AI translation, dubbing and lip synching to try

Published on
Mar 2024

Text-to-speech, voice cloning, and visual dubbing are some of the hottest trends in AI at the moment. Used in tandem with AI transcription and translation, they make it possible to generate hyper-realistic voiceovers, indistinguishable from the sound of the speaker’s natural voice and speech patterns — including in entirely new languages.

Our partners at Sync Labs have just published an open-source repo for building an app that translates any video to any language with perfectly matched lip movements. Its backbone uses the Gladia API for speech-to-text and translation, ElevenLabs for text-to-speech and voice cloning, and Sync Labs for visual dubbing.
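To make the flow concrete, the three stages can be sketched as functions chained together. This is a minimal sketch and not the code from the Sync Labs repo: the `DubbingJob` type, the `run_pipeline` function and the three stage callables are hypothetical names, and the stub stages below only stand in for calls to the Gladia, ElevenLabs and Sync Labs APIs.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DubbingJob:
    """Holds the intermediate artifacts of one dubbing run (hypothetical type)."""
    video_path: str
    target_language: str
    translated_text: str = ""
    dubbed_audio_path: str = ""
    output_video_path: str = ""


def run_pipeline(
    job: DubbingJob,
    transcribe_and_translate: Callable[[str, str], str],
    synthesize_speech: Callable[[str], str],
    lip_sync: Callable[[str, str], str],
) -> DubbingJob:
    # Stage 1: speech-to-text plus translation (Gladia's role in the app).
    job.translated_text = transcribe_and_translate(job.video_path, job.target_language)
    # Stage 2: text-to-speech in a cloned voice (ElevenLabs' role).
    job.dubbed_audio_path = synthesize_speech(job.translated_text)
    # Stage 3: visual dubbing of the original video (Sync Labs' role).
    job.output_video_path = lip_sync(job.video_path, job.dubbed_audio_path)
    return job


# Stub stages for illustration only; a real app would call the vendors' APIs here.
def fake_stt(video_path: str, language: str) -> str:
    return f"[{language}] transcript of {video_path}"


def fake_tts(text: str) -> str:
    return "dubbed.wav"


def fake_lip_sync(video_path: str, audio_path: str) -> str:
    return "dubbed.mp4"


result = run_pipeline(DubbingJob("talk.mp4", "fr"), fake_stt, fake_tts, fake_lip_sync)
print(result.output_video_path)
```

Passing the stages in as callables keeps each vendor call swappable and lets the flow be tested without network access.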

Following a quick intro to all of the tech elements of this fantastic project, we’ll explain how you can test them first-hand using this app, which will be available for public access in a week.

Speech-to-text and translation

Speech-to-text or automatic speech recognition (ASR) converts spoken words into text. The process involves preprocessing audio data to enhance quality, employing advanced speech recognition algorithms to correctly identify words, and integrating language modeling to predict word sequences. Post-processing may be applied to refine the transcribed text, resulting in an accurate written representation of the spoken content. For a more detailed breakdown of how it works, feel free to visit our introduction to speech-to-text.
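As an illustration of the preprocessing step, here is a minimal sketch of two common operations: peak normalization and splitting the signal into short overlapping frames. The 25 ms frame and 10 ms hop are typical values assumed for the example, not settings taken from any specific ASR system, and the function names are our own.

```python
import math
from typing import List


def peak_normalize(samples: List[float], target_peak: float = 1.0) -> List[float]:
    """Scale the signal so its largest absolute sample equals target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silence: nothing to scale
    return [s * target_peak / peak for s in samples]


def split_into_frames(
    samples: List[float],
    sample_rate: int,
    frame_ms: float = 25.0,
    hop_ms: float = 10.0,
) -> List[List[float]]:
    """Cut the signal into overlapping frames; a trailing partial frame is dropped."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop_len):
        frames.append(samples[start:start + frame_len])
    return frames


# 0.1 s of a quiet 440 Hz tone sampled at 16 kHz, used as stand-in audio.
sample_rate = 16000
samples = [0.25 * math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(1600)]

normalized = peak_normalize(samples)
frames = split_into_frames(normalized, sample_rate)
print(len(frames), len(frames[0]))
```

Each frame would then typically be turned into spectral features before being passed to the recognition model.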

AI translation, also known as machine translation, employs AI/ML to automatically translate text or speech from one language to another. The process includes tokenization of input, utilizing natural language processing for context and grammar understanding, and employing machine learning models, often neural networks like the multilingual Whisper ASR, to predict accurate translations.
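To illustrate the tokenization step, here is a toy word-level tokenizer. It is only a sketch: production translation models generally rely on learned subword vocabularies rather than a regular expression, and the `tokenize` function is our own illustrative name.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word and punctuation tokens."""
    # \w+ matches a run of letters or digits; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text.lower())


tokens = tokenize("Hello, world! How are you?")
print(tokens)
# ['hello', ',', 'world', '!', 'how', 'are', 'you', '?']
```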

At Gladia, we rely on a hybrid ASR architecture, powered by optimized Whisper and other state-of-the-art models, supporting 99 languages for transcription and translation. Integrated into Sync’s app, our API allows us to transcribe what’s being said and translate it in near real-time, with the resulting transcript fed into the rest of the structure.

Text-to-speech and voice cloning

Text-to-speech technology does the opposite of speech-to-text by converting written text into spoken language. The system analyzes input text using natural language processing to understand its structure and semantics. Prosody modeling is then applied to incorporate elements like intonation and rhythm, which makes the synthesized speech sound natural and expressive. Finally, the synthesis engine generates speech from the analyzed text and the prosody model, producing a synthesized voice that faithfully represents the spoken version of the input text.
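To give a feel for what prosody modeling adds, here is a toy rule-based annotator that attaches an intonation contour and a pause length to each phrase based on its closing punctuation. It is a deliberately simplified stand-in: modern TTS systems learn prosody from data, and the `ProsodyUnit` type, the contour labels and the pause values are all assumptions made for this example.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class ProsodyUnit:
    text: str
    contour: str   # "rising", "level" or "falling"
    pause_ms: int  # silence to insert after the phrase


# Hypothetical values chosen for illustration only.
PAUSE_MS = {",": 150, ".": 400, "!": 400, "?": 400}
CONTOUR = {",": "level", ".": "falling", "!": "falling", "?": "rising"}


def annotate_prosody(text: str) -> List[ProsodyUnit]:
    """Split text at punctuation and tag each phrase with contour and pause."""
    units = []
    # Capture each phrase with the mark that ends it; text after the
    # last punctuation mark is ignored in this sketch.
    for phrase, mark in re.findall(r"([^,.!?]+)([,.!?])", text):
        units.append(ProsodyUnit(phrase.strip(), CONTOUR[mark], PAUSE_MS[mark]))
    return units


units = annotate_prosody("Hello, are you there? I am here.")
for unit in units:
    print(unit)
```

A synthesis engine would consume annotations of this kind, alongside the text itself, when generating the waveform.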

Voice cloning comes into play to make the output as close to the human voice as possible. To yield realistic results, the process starts with collecting a substantial dataset of the target voice. Relevant features like pitch and tone are extracted from these recordings, and machine learning models, often deep neural networks, are trained to mimic the unique characteristics of the voice across a wide emotional spectrum, e.g. confident speech, happy exclamations, angry rants, and so on.
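As an example of feature extraction, pitch can be estimated from a short stretch of audio with autocorrelation: the signal is compared with delayed copies of itself, and the delay that matches best gives the fundamental period. This is a minimal sketch on a synthetic tone, not the method any particular voice cloning product uses; the function name and the 50 to 500 Hz search range are assumptions.

```python
import math
from typing import List


def estimate_pitch_hz(
    samples: List[float],
    sample_rate: int,
    min_hz: float = 50.0,
    max_hz: float = 500.0,
) -> float:
    """Return the pitch whose period maximizes the raw autocorrelation."""
    min_lag = int(sample_rate / max_hz)  # shortest period considered
    max_lag = int(sample_rate / min_hz)  # longest period considered
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Raw (unnormalized) autocorrelation, which favors the shortest
        # matching period over its multiples.
        score = sum(samples[n] * samples[n + lag] for n in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


# 0.1 s of a 220 Hz sine wave at 16 kHz stands in for a voiced speech frame.
sample_rate = 16000
tone = [math.sin(2 * math.pi * 220 * n / sample_rate) for n in range(1600)]

pitch = estimate_pitch_hz(tone, sample_rate)
print(round(pitch, 1))  # close to 220 Hz, limited by the integer lag resolution
```

Real speech is noisier than a pure tone, so practical pitch trackers add voicing detection and smoothing on top of this idea.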

ElevenLabs is among the leading tools for text-to-speech and voice cloning. Its proprietary deep-learning tech lets users choose from a library of high-fidelity male and female voices (or produce them from scratch!), enabling seamless creation of custom videos, ebooks, and more in 29 languages.

Visual dubbing

Visual dubbing, or lip reanimation, is an AI technology that synchronizes translated or transcribed audio with realistic lip movements in video content.

By analyzing and replicating the original speaker's lip gestures, the system generates animated lip movements that align with the new audio. While the technology is raising obvious concerns about the use of deepfakes, it’s also a powerful tool to break down language barriers in video content, providing a high-fidelity alternative to traditional dubbing.
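One small piece of this alignment problem is purely about timing: working out which video frames overlap a given stretch of the new audio. The sketch below shows only that bookkeeping, under the assumption of a constant frame rate. It is not how Sync Labs' models work internally, and the segment timings are made up for the example.

```python
import math
from typing import List, Tuple


def segment_to_frames(start_s: float, end_s: float, fps: int) -> range:
    """Return the indices of every video frame that overlaps [start_s, end_s)."""
    first = math.floor(start_s * fps)
    last_exclusive = math.ceil(end_s * fps)
    return range(first, last_exclusive)


# Hypothetical dubbed-audio segments: (text, start in seconds, end in seconds).
segments: List[Tuple[str, float, float]] = [
    ("bonjour", 0.0, 0.5),
    ("tout le monde", 0.5, 1.25),
]
fps = 25

frame_map = {text: segment_to_frames(start, end, fps) for text, start, end in segments}
for text, frames in frame_map.items():
    print(text, frames.start, frames.stop - 1)
```

Note that frame 12 appears in both ranges, because the boundary at 0.5 s falls in the middle of that frame.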

On a mission to break the language barriers in video content and reinvent dubbing, Sync Labs enables developers to seamlessly lip-sync a video to audio in near real-time using a single API.

How the translation app works

We invite you to dive into the X thread below for a detailed video tutorial and instructions. There's also a Medium tutorial available, and of course the link to the original repo to clone and launch the app yourself.

Conclusion

Thanks to this amazing open-source project, we can see just how powerful speech-to-text, text-to-speech, voice cloning, and lip-synching technologies can be when used together. We hope you enjoy this incredible free tool by Sync Labs, powered by Gladia. If you’re building a voice app using our API and would like us to spread the word, do not hesitate to reach out here.

