The idea for VOMO was inspired by the release of OpenAI's Whisper model, which demonstrated a significant leap in speech-to-text accuracy. At the time, I envisioned several key features: precise speech-to-text conversion, real-time transcription, the ability to refine transcribed text with GPT, and vectorized notes with a question-answering function.
As I began researching the products on the market, including OpenAI's Whisper, AssemblyAI, Google's and Microsoft's speech-to-text services, and Deepgram, I found that each had its own strengths and weaknesses. Whisper was the most powerful, but it lacked two features I needed: real-time speech-to-text, and support for audio files larger than 25MB, the upload ceiling of OpenAI's API, beyond which recordings have to be segmented manually.
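To make the segmentation problem concrete, here is a minimal sketch of the manual workaround: splitting a long recording into chunks with pydub before sending each one to OpenAI's transcription endpoint. The chunk length, file names, and helper function are illustrative assumptions, not VOMO's actual pipeline.

```python
# Sketch: manually segmenting a long recording for OpenAI's Whisper API,
# which rejects uploads over 25MB. Chunk length and names are illustrative.
from pathlib import Path

from openai import OpenAI          # pip install openai
from pydub import AudioSegment     # pip install pydub (requires ffmpeg)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def transcribe_long_recording(path: str, chunk_minutes: int = 10) -> str:
    audio = AudioSegment.from_file(path)
    chunk_ms = chunk_minutes * 60 * 1000  # pydub measures audio in ms
    texts = []
    for i, start in enumerate(range(0, len(audio), chunk_ms)):
        chunk_path = Path(f"chunk_{i}.mp3")
        audio[start:start + chunk_ms].export(chunk_path, format="mp3")
        with open(chunk_path, "rb") as f:
            result = client.audio.transcriptions.create(model="whisper-1", file=f)
        texts.append(result.text)
    return " ".join(texts)
```

Workable, but it means extra client-side code, extra latency, and awkward stitching at chunk boundaries, which is exactly the kind of plumbing I wanted the provider to handle.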
Google's and Microsoft's real-time speech-to-text offerings were not accurate enough for our needs, and if transcriptions were imprecise, users would be unlikely to keep using our service.
At first, I also found AssemblyAI's pricing too high.
Then I discovered Deepgram, which met many of my requirements. They offered a cloud-hosted Whisper model that could transcribe long recordings with the same accuracy and none of the 25MB ceiling, and their real-time speech-to-text pricing was acceptable (although I later removed that feature). For meeting recordings, Deepgram also supported automatic speaker identification (diarization) and smart formatting. These were all features we needed.
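To give a sense of how little glue code this takes, here is a minimal sketch of a prerecorded-transcription request against Deepgram's `/v1/listen` endpoint with diarization and smart formatting enabled. The query parameters follow Deepgram's public docs; the hosted-Whisper model name, file type, and `transcribe()` helper are illustrative assumptions rather than VOMO's production code.

```python
# Sketch: one prerecorded-transcription request to Deepgram, with speaker
# diarization and smart formatting enabled via query parameters.
import os

import requests  # pip install requests

DEEPGRAM_URL = "https://api.deepgram.com/v1/listen"

def transcribe(path: str, model: str = "whisper-large") -> dict:
    params = {
        "model": model,          # Deepgram's hosted Whisper tier (assumed name)
        "diarize": "true",       # tag each word with a speaker id
        "smart_format": "true",  # punctuation, paragraphs, numerals, etc.
    }
    headers = {
        "Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}",
        "Content-Type": "audio/m4a",  # match the actual file type
    }
    with open(path, "rb") as f:
        resp = requests.post(DEEPGRAM_URL, params=params, headers=headers, data=f)
    resp.raise_for_status()
    return resp.json()

result = transcribe("meeting.m4a")
print(result["results"]["channels"][0]["alternatives"][0]["transcript"])
```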
Later, I added a bulk speech-to-text feature, allowing users to select dozens of audio files from Apple's Voice Memos and import them into VOMO for batch transcription.
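The batching itself can be as simple as fanning the same request out over a bounded thread pool. Here is a sketch reusing the `transcribe()` helper above; the concurrency cap is an illustrative guess, not a documented Deepgram limit.

```python
# Sketch: batch transcription with bounded concurrency, reusing transcribe().
from concurrent.futures import ThreadPoolExecutor

def transcribe_batch(paths: list[str], model: str, max_workers: int = 5) -> dict:
    # Cap simultaneous requests so a large import doesn't hammer the API.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {p: pool.submit(transcribe, p, model) for p in paths}
        return {p: f.result() for p, f in futures.items()}
```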
However, I discovered that Deepgram's hosted Whisper model had concurrency limits, so we switched to the Nova-2 model. In my opinion, its transcription accuracy is comparable to Whisper's, but it processes audio faster.
As a result, we continue to use Deepgram's Nova-2 model.
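In code, the switch was essentially a one-parameter change. A sketch, again assuming the `transcribe()` helper above, with a naive backoff for HTTP 429 rate-limit responses; the retry policy is illustrative, not what VOMO ships.

```python
# Sketch: the model switch plus a naive backoff for 429 (rate-limit) replies.
import time

import requests

def transcribe_with_retry(path: str, retries: int = 3) -> dict:
    for attempt in range(retries):
        try:
            return transcribe(path, model="nova-2")  # was "whisper-large"
        except requests.HTTPError as err:
            if err.response.status_code == 429 and attempt < retries - 1:
                time.sleep(2 ** attempt)  # wait 1s, 2s, ... before retrying
            else:
                raise
```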
In summary, third-party services like Deepgram can significantly reduce the workload for products like VOMO. Most of the speech-related features we wanted to implement were already available through Deepgram.