Last week, OpenAI released Whisper, an open-source deep learning model for speech recognition. OpenAI’s tests on Whisper show promising results in transcribing audio not only in English but also in several other languages.
Developers and researchers who have experimented with Whisper are also impressed with what the model can do. What’s perhaps just as important, however, is what Whisper’s release tells us about the changing culture in artificial intelligence (AI) research and the kinds of applications we can expect in the future.
A return to openness?
OpenAI has been heavily criticized for not open-sourcing its models. GPT-3 and DALL-E, two of OpenAI’s most impressive deep learning models, are only available behind paid API services, and there is no way to download and test them. In contrast, Whisper was released as a pre-trained, open-source model that anyone can download and run on the computing platform of their choice. This latest development comes as the past few months have seen a trend toward greater openness among commercial AI research labs.
In May, Meta open-sourced OPT-175B, a large language model (LLM) similar in size to GPT-3. In July, Hugging Face released BLOOM, another open-source LLM of GPT-3 scale. And in August, Stability.ai released Stable Diffusion, an open-source image generation model that competes with OpenAI’s DALL-E.
Open-source models open new windows for research and for building specialized applications on top of deep learning models.
OpenAI’s Whisper embraces data diversity
A key feature of Whisper is the diversity of data used to train it. Whisper was trained on 680,000 hours of multilingual and multitask supervised data collected from the web. One-third of the training data consists of non-English audio examples.
“Whisper can robustly transcribe English speech and perform at a state-of-the-art level with nearly 10 languages — as well as translation from those languages to English,” an OpenAI spokesperson told VentureBeat in written comments.
Although the lab’s analysis of languages other than English is not comprehensive, users who have tested it report solid results.
Data diversity is also becoming a trend in the AI research community. BLOOM, released in July, was the first language model to support 59 languages. And Meta is working on a model that supports machine translation across 200 languages.
Advances toward greater data and language diversity will ensure that more people can access and benefit from advances in deep learning.
Run your own model
Because Whisper is open source, developers and users can choose to run it on the computation platform of their choice, whether it’s their laptop, desktop workstation, mobile device or cloud server. OpenAI released five different Whisper sizes, each trading off precision for speed, with the smallest model being about 60 times faster than the largest.
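For illustration, here is a minimal sketch of loading one of those checkpoints and transcribing a local audio file with the open-source whisper Python package; the file name is a placeholder.

```python
# Minimal sketch using the open-source `whisper` package from the openai/whisper repo.
# The audio file name is a placeholder; ffmpeg must be installed for decoding.
import whisper

# Pick a size that fits your hardware: "tiny", "base", "small", "medium" or "large".
# Smaller checkpoints run faster but are less accurate.
model = whisper.load_model("base")

# Transcribe a local file; Whisper handles audio decoding and language detection itself.
result = model.transcribe("interview.mp3")
print(result["text"])
```

Swapping "base" for "medium" or "large" trades speed for accuracy, which is exactly the choice the five released model sizes expose.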
“Since transcription using the largest Whisper model runs faster than real time on an [Nvidia] A100 [GPU], I expect there are practical use cases for running the smaller models on mobile or desktop systems, once the models are properly ported to the respective environments,” said the OpenAI spokesperson. “This would let users run automatic speech recognition (ASR) without uploading their voice data to the cloud and the privacy concerns that entails, while potentially draining more battery and adding latency compared to alternative ASR solutions.”
Developers who have tried Whisper are pleased with the possibilities it opens up. And it could pose a challenge to the cloud-based ASR services that have been the dominant option until now.
“At first glance, Whisper appears to be much more accurate than other SaaS [software-as-a-service] products,” MLOps expert Noah Gift told VentureBeat. “Because it’s free and programmable, it probably poses a very significant challenge to services that only offer transcription.”
Gift ran the model on his own computer to transcribe hundreds of MP4 files ranging from 10 minutes to several hours in length. For machines with Nvidia GPUs, running the model locally and syncing the results to the cloud can be much more cost-effective, says Gift.
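A rough sketch of that kind of local batch workflow might look like the following; the folder name is hypothetical, and syncing the resulting text files to cloud storage is left out.

```python
# Hypothetical batch transcription of a folder of MP4 files with the open-source
# whisper package; ffmpeg extracts the audio track from each video.
from pathlib import Path

import whisper

model = whisper.load_model("medium")  # larger checkpoints benefit from an Nvidia GPU

for video in Path("recordings").glob("*.mp4"):
    result = model.transcribe(str(video))
    out_file = video.with_suffix(".txt")
    out_file.write_text(result["text"], encoding="utf-8")
    print(f"Transcribed {video.name} -> {out_file.name}")
```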
“Many content creators with some programming experience who weren’t initially using transcription services due to cost will quickly adopt Whisper into their workflow,” Gift said.
Gift is now using Whisper to automate transcription in his workflow. And with automatic transcription in place, he can feed the output into other open-source language models, such as text summarizers.
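As a sketch of that kind of pipeline, a transcript could be passed to an open-source summarization model; the summarization checkpoint below is one possible choice rather than the one Gift uses, and the audio file name is a placeholder.

```python
# Hedged sketch: chain Whisper output into a Hugging Face summarization pipeline.
import whisper
from transformers import pipeline

asr_model = whisper.load_model("base")
transcript = asr_model.transcribe("episode.mp3")["text"]

# distilbart-cnn is one commonly used open summarization checkpoint (an assumption here).
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

# Summarize only the first chunk of text; a long transcript would need to be split
# into pieces that fit the summarizer's input limit.
summary = summarizer(transcript[:3000], max_length=150, min_length=40, do_sample=False)
print(summary[0]["summary_text"])
```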
“Content creators from indies to major movie studios can use this technology, and it has the potential to be an important tool for incorporating AI into our everyday workflows,” Gift said. “By making transcription a commodity, the real AI revolution can begin now for those in the content space — from YouTubers to news to feature films (all industries I’ve worked in professionally).”
Create your own applications
A number of steps have already been taken to make Whisper easier to use for people who don’t have the technical expertise to set up and run machine learning models. An example of this is a joint project by journalist Peter Stern and GitHub engineer Christina Warren to create a “free, secure and easy-to-use transcription app for journalists” based on Whisper.
Meanwhile, open-source models like Whisper open up new possibilities in the cloud. Developers are using platforms like Hugging Face to host Whisper and make it available through API calls.
“It takes 10 minutes for a company to build their own transcription service powered by Whisper and start transcribing calls or audio content at a high level,” Jeff Boudier, growth and product manager at Hugging Face, told VentureBeat.
Hugging Face already has several Whisper-based services, including a YouTube transcription app.
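As one illustration of the kind of service Boudier describes, here is a hedged sketch that wraps a Whisper checkpoint from the Hugging Face Hub in a small HTTP endpoint using the transformers library; the FastAPI wrapper, route name and checkpoint size are assumptions rather than details of any Hugging Face product.

```python
# Hypothetical transcription endpoint built on a Whisper checkpoint from the
# Hugging Face Hub. Run with: uvicorn app:app
# (requires fastapi, uvicorn, python-multipart, transformers and ffmpeg).
from fastapi import FastAPI, UploadFile
from transformers import pipeline

app = FastAPI()
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

@app.post("/transcribe")
async def transcribe(file: UploadFile):
    audio_bytes = await file.read()  # raw audio/video bytes uploaded by the client
    result = asr(audio_bytes)        # the pipeline decodes the bytes with ffmpeg
    return {"text": result["text"]}
```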
Or adapt existing applications to your needs
Another benefit of open-source models like Whisper is fine-tuning — the process of taking a pre-trained model and optimizing it for a new application. For example, Whisper can be fine-tuned to improve ASR performance on a language that the current model does not support well. Or it can be fine-tuned to better recognize medical or technical terms. Another interesting direction could be to fine-tune the model for tasks other than ASR, such as speaker verification, acoustic event detection and keyword spotting.
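To give a sense of what that looks like in practice, below is a compressed, hedged sketch of fine-tuning a Whisper checkpoint with the Hugging Face transformers and datasets libraries; the dataset, language config and hyperparameters are placeholders, and a real run would tune them carefully.

```python
# Hedged sketch of fine-tuning Whisper for a low-resource language with Hugging Face
# transformers. Dataset, language config ("sw") and hyperparameters are placeholders.
from datasets import Audio, load_dataset
from transformers import (Seq2SeqTrainer, Seq2SeqTrainingArguments,
                          WhisperForConditionalGeneration, WhisperProcessor)

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# Common Voice is one possible source of labeled speech; access requires accepting the
# dataset's terms on the Hub. Only a tiny slice is loaded here for illustration.
ds = load_dataset("mozilla-foundation/common_voice_11_0", "sw", split="train[:1%]")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

def prepare(example):
    # Turn raw audio into log-mel input features and the reference text into label IDs.
    audio = example["audio"]
    example["input_features"] = processor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    example["labels"] = processor.tokenizer(example["sentence"]).input_ids
    return example

ds = ds.map(prepare, remove_columns=ds.column_names)

def collate(features):
    # Pad audio features and label IDs separately; padded label positions get -100
    # so they are ignored by the loss.
    batch = processor.feature_extractor.pad(
        [{"input_features": f["input_features"]} for f in features], return_tensors="pt"
    )
    labels = processor.tokenizer.pad(
        [{"input_ids": f["labels"]} for f in features], return_tensors="pt"
    )
    batch["labels"] = labels["input_ids"].masked_fill(labels["attention_mask"].ne(1), -100)
    return batch

args = Seq2SeqTrainingArguments(
    output_dir="whisper-small-sw",
    per_device_train_batch_size=8,
    learning_rate=1e-5,
    max_steps=1000,
)
trainer = Seq2SeqTrainer(model=model, args=args, train_dataset=ds, data_collator=collate)
trainer.train()
```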
“It might be interesting to see where this goes,” Gift said. “For very technical verticals, a fine-tuned version could be a game changer in how they deliver technical information. For example, could this be the start of a revolution in medicine, where primary care clinicians record their conversations and those recordings eventually feed into AI systems that help diagnose patients?”
“We’ve already received feedback that Whisper can be used as a plug-and-play service to get better results than before,” Hugging Face technical lead Philipp Schmid told VentureBeat. “Combining the model with fine-tuning will improve performance even further. Fine-tuning, especially for languages that were not well represented in the pretraining dataset, can significantly improve performance.”