Lmst

Voxtral Transcribe 2 from Mistral AI brings open, production ready speech AI to everyone: fast, accurate transcription, solid diarization and support for long, multilingual audio. It is a strong option if you want powerful speech understanding without locking into closed APIs.

#Voxtral #Transcribe2 #MistralAI #SpeechAI #AITranscription #OpenSourceAI #FLOSS

ICONIQ (@ICONIQCapital)

음성 합성 플랫폼 ElevenLabs가 Series D를 발표한 것을 축하하는 트윗으로, 플랫폼이 여러 언어의 음성을 전 세계적으로 접근 가능하게 만든다는 점을 강조합니다. 공동 창업자 @matiii와 @dabkowski_piotr가 의도한 방식대로 음성으로 순간을 기념했다는 내용입니다.

https://x.com/ICONIQCapital/status/2019068892946342163

#elevenlabs #voice #funding #speechai

🚀 Demo mới: hệ thống lồng tiếng video‑2‑video chất lượng cao, hiện hỗ trợ dubbing tiếng Anh → Pháp. Pipeline: TIGER (tách âm), WhisperX (diarization & STT), Mistral_Tower (dịch), CosyVoice3 (TTS). Tiếp tục cải thiện giữ nguyên tone & prosody. Mọi ý kiến, đề xuất đều chào đón!

#AI #SpeechAI #VoiceCloning #Dubbing #CôngNghệ #TríTuệNhânTạo #TinCôngNghệ #AIVietnam #VoiceAI #NhậnDạngGiọng #DịchTựĐộng

https://www.reddit.com/r/LocalLLaMA/comments/1qinq1x/we_have_come_a_long_way_in_voice_prosody_cl

Demo hệ thống lồng tiếng video AI chất lượng cao, hiện hỗ trợ dịch tiếng Anh → Pháp. Pipeline: TIGER (tách âm thanh), WhisperX (diarization + STT), Mistral_Tower (dịch), CosyVoice3 (TTS). Tiếng nói chưa giữ được ngữ điệu sau dịch, sẽ cải thiện. Mong nhận ý kiến! #AI #VoiceCloning #SpeechAI #Dubbing #CôngNghệ #AIÂmThanh

https://www.reddit.com/r/LocalLLaMA/comments/1qinq1x/we_have_come_a_long_way_in_voice_prosody_cloning/

Die Stimme kann biometrisch sein:

Aus dem Sprachsignal lässt sich mehr ableiten als Worte – bis hin zu Gesundheit, Bildung und politischen Präferenzen. Und betroffen sind auch Unbeteiligte, wenn ihre Stimme als Hintergrund in Aufnahmen landet.

Konsequenz: Kommunikation konsequent Ende-zu-Ende verschlüsseln – und keine proprietären Sprachassistenten oder Cloud-Transkription nutzen.

https://www.telepolis.de/article/Privatsphaere-endet-wo-das-Sprechen-beginnt-11145260.html

#Datenschutz #Privatsphäre #SpeechAI #Spracherkennung #Überwachung #KI #Biometrie #Datenminimierung #OnDevice #EUAIAct

ModelScope (@ModelScope2022)

StepFun의 음성 모델 'Step-Audio-R1.1'이 Artificial Analysis Speech Reasoning 리더보드에서 SOTA를 달성했습니다(정확도 96.4%). Grok, Gemini, GPT-Realtime 등을 능가했으며 네이티브 오디오 추론(End-to-End), 오디오-네이티브 CoT, 실시간 처리를 특징으로 합니다.

https://x.com/ModelScope2022/status/2011687986338136089

#speechai #audiomodel #sota #stepaudior1.1

FOSS Advent Calendar - Door 14: Bring Text to Life with Coqui TTS

Meet Coqui TTS, a powerful, open-source deep learning toolkit for cutting-edge Text-to-Speech. It turns written words into natural, expressive audio using state-of-the-art neural models, all while running completely offline on your own machine.

Coqui TTS supports a wide range of languages and voices, and its real strength lies in flexibility: you can use pre-trained models for instant results or train custom voices with your own datasets. Everything happens locally, your data stays private, no APIs or subscriptions required. Whether for accessibility tools, narration, creative projects, or research, Coqui gives you full control over synthetic speech, from tone and pace to emotional delivery.

Pro tip: Experiment with voice cloning or fine-tune a model for a unique vocal character. With Coqui, you’re not just generating speech you’re crafting it.

Link: https://github.com/coqui-ai/TTS

What would you create with open-source, local TTS-audiobooks, game dialogue, or your own custom assistant voice?

#AdventCalendar #AI #OpenSource #TTS #Python #MachineLearning #CoquiTTS #AIVoices #VoiceSynthesis #LocalAI #FOSS #Privacy #Accessibility #TextToSpeech #CreativeTech #VoiceTech #DeepLearning #ArtificialIntelligence #TechNerds #Innovation #FOSSAdvent #ContentCreation #EthicalAI #VoiceCloning #DevTools #FutureTech #AITools #SpeechAI #linux #ki #adventskalender

Mô hình giọng nói AI của Sesame gây ấn tượng với khả năng biểu cảm, đối thoại tự nhiên và thông minh vượt trội so với Moshi, dù cả hai dùng công nghệ nền tảng tương tự (Mimi, Llama). Cộng đồng đang tìm hiểu điều gì đã tạo nên bước nhảy vọt này: dữ liệu huấn luyện, hàm mất mát, kiến trúc, tích hợp LLM hay quy trình tổng thể?

#AI #SpeechAI #TextToSpeech #SesameAI #MoshiAI #LLM #Technology #TríTuệNhânTạo #GiọngNóiAI #CôngNghệ #MôHìnhNgônNgữ

https://www.reddit.com/r/LocalLLaMA/comments/1paj990/why

Mô hình giọng nói của Sesame được đánh giá là cảm xúc, tự nhiên và thông minh vượt trội so với Moshi, dù cả hai đều dựa trên công nghệ tương tự (Mimi, Llama). Cộng đồng đang tìm hiểu lý do cho sự khác biệt lớn này: liệu có phải do dữ liệu huấn luyện, hàm mục tiêu, kiến trúc, tích hợp LLM hay kỹ thuật hệ thống?

#AI #SpeechAI #Sesame #Moshi #LLM #MachineLearning
#TríTuệNhânTạo #MôHìnhGiọngNói #HọcMáy #XửLýNgônNgữTựNhiên

https://www.reddit.com/r/LocalLLaMA/comments/1paj990/why_does_sesames_speech

Meta’s new Omnilingual ASR model drops character error rates below 10 % for 78 % of the 1,600 languages it was tested on – a huge leap for low‑resource, under‑represented tongues. The system leverages in‑context learning and is released under Creative Commons, inviting the community to build on it. Read the full benchmark details! #OmnilingualASR #SpeechAI #LowResource #UnderrepresentedLangs

🔗 https://aidailypost.com/news/metas-omnilingual-asr-hits-sub10-error-78-1600-languages

As of today, my computer can __nicely__ read aloud for me !

I'm lazy, i read slowly, so i don't like reading, i skip a lot of articles

I have been looking for a solution for several months

#Accessibility #A11y #Orca #WebBrowser #ZenBrowser #Firefox #Piper #Pied #SpeechAI #AI #Nix #NixOS

Yesterday, I ordered food online. However it went a little off. And I contacted Support. They called me and for one moment, I thought it's a bot or recorded voice or something. And I hated it. Then I realized it's a human on the line.

I was planning to do an LLM+TTS+Speech Recognition and deploy it on A311D. To see if I can practice british accent with it. Now I'm rethinking about what I want to do. This way we are going, it doesn't lead to a good destination. I would hate it if I would have to talk to a voice enabled chatbot as support agent rather than a human.

And don't get me wrong. Voice enabled chatbots can have tons of good uses. But replacing humans with LLMs, not a good one. I don't think so.

#LLM #AI #TTS #ASR #speechrecognition #speechai #ML #MachineLearning #chatbot #chatbots #artificialintelligence

SpeechAgents: Human-Communication Simulation with Multi-Modal Multi-Agent Systems. Multi-modal LLM system simulates human communication using speech and generates human-like dialogues with consistent content, rhythm, & emotion.

Funnily, they also elaborate on a "think before you speak" design aspect. This might also be applicable to our everyday lives.

doi: 10.48550/arXiv.2401.03945
#LLM #multimodal #speechAI #multiagent #conversationalai

Amazon’s New Nova Sonic Voice Model Targets Voice AI Rivals With Real-Time Expressive Output

#AI #VoiceAI #NovaSonic #AmazonAI #AlexaPlus #AIModel #SpeechAI #RealTimeVoice #BedrockAI #AIassistant

https://winbuzzer.com/2025/04/08/amazons-new-nova-sonic-voice-model-targets-voice-ai-rivals-with-real-time-expressive-output-xcxwbn/

ChatGPT’s Advanced Voice Mode Expands to Web and Improves Conversational Flow

#AI #ChatGPT #VoiceAI #OpenAI #AIAssistants #Chatbots #SpeechAI #GenAI

https://winbuzzer.com/2025/03/25/chatgpts-advanced-voice-mode-expands-to-web-with-real-time-conversations-xcxwbn/

VoizHub AI Review - Clone Any Celebrity Voice 🎤, Dub Any Video or Voice In Any Language 🌍, Record In Real-Time 🎙️, AI Transcript Anything 📝, Turn Any Text Into Speech 🗣️, Turn Any URL Into Speech 🌐, Generate Podcasts 🎧, Narrate Audiobooks 📚, Narrate Short Videos 🎥, And More ✨!

Get Instant Access: https://www.digiproductz.com/get/voizhub-ai
Read Full Review: https://www.digiproductz.com/voizhub-ai-review/

#VoizHubAI #VoizHubAIBonuses #VoizHubAIOTOs #AIVoiceCloning #VoiceTechnology #SpeechAI #VideoDubbing #VoiceoverAI

For the past couple of years, as each new @mozilla #CommonVoice dataset of #voice #data is released, I've been using @observablehq to visualise the #metadata coverage across the 100+ languages in the dataset.

Version 17 was released yesterday (big ups to the team - EM Lewis-Jong, @jessie, Gina Moape, Dmitrij Feller) and there's some super interesting insights from the visualisation:

➡ Catalan (ca) now has more data in Common Voice than English (en) (!)

➡ The language with the highest average audio utterance duration at nearly 7 seconds is Icelandic (is). Perhaps Icelandic words are longer? I suspect so!

➡ Spanish (es), Bangla (Bengali) (bn), Mandarin Chinese (zh-CN) and Japanese (ja) all have a lot of recorded utterances that have not yet been validated. Albanian (sq) has the highest percentage of validated utterances, followed closely by Erzya / Arisa (myv).

➡ Votic (vot) has the highest percentage of invalidated utterances, but with 76% of utterances invalidated, I wonder if this language has been the target of deliberate invalidation activity (invalidating valid sentences, or recording sentences to be deliberately invalid) given the geopolitical instability in Russia currently.

See the visualisation here and let me know your thoughts below!

➡ https://observablehq.com/@kathyreid/mozilla-common-voice-v17-dataset-metadata-coverage

#linguistics #languages #data #VoiceAI #VoiceData #SpeechAI #SpeechData #DataViz

Last week, as part of my #PhD program at the #ANU School of #cybernetics, I gave my final presentation, which is a summary of my methods and #research findings. I covered my interview work, the #dataset documentation analysis work I've been doing and my analysis work around #accents in @mozilla's #CommonVoice platform.

There were some insightful and thought-provoking questions from my panel and audience members, and of course - so many ideas for future research inquiry!

A huge thanks to my panel, chaired so well by Professor Alexandra Zafiroglu, to Dr Elizabeth Williams, my meticulous, methodical and always-encouraging Primary Supervisor, and to my co-supervisors Dr Jofish Kaye and Dr Paul Wong 黃仲熙 for their deep expertise in #HCI and #data respectively.

Similarly, a huge thank you to my #PhD cohort - Charlotte Bradley, Tom Chan, Danny Bettay and Sam Backwell - as well as the other cohorts in the School - for your encouragement and intellectual journeying.

#PhD #PhDlife #cybernetics #milestone #ANU #voiceAI #speechAI #ASR #SpeechRecognition

Kathy Reid presenting her #PhD final presentation.

Results from Kathy Reid's survey of #ML practitioners

Kathy Reid's work in assessing the Whisper #ASR engine

#Quantiphi is working with #NeMo to build a modular generative AI solution to improve worker productivity. Nvidia also announced four inference GPUs, optimized for a diverse range of emerging LLM and generative AI applications. Each GPU is designed to be optimized for specific #AIInference workloads while also featuring specialized software.

#SpeechAI, #supercomputing in the #cloud, and #GPUs for #LLMs and #GenerativeAI among #Nvidia’s next big moves | #AI
https://venturebeat.com/ai/speech-ai-supercomputing-cloud-gpus-llms-generative-ai-nvidia-next-big-moves/

I *really* would love Audible to have speech to text recognition, for those hard to understand moments.

Also, give me a dwelling that can be generated alongside academic reference.

#Audible #speechtotext #speechAI #speechTechnology

#SpeechAI

Client Info