We should think about optimizations for speech recognition (synthesis needs its own approach):
- there are FPGA single-board computers (SBCs) that can be trained to react to certain keywords and then emit text or trigger an action
- instead of recording a 30 s sentence, record much shorter chunks and start processing right after the first one: check the parts individually, but also glue them together and send the whole sentence to the speech recognition model
- maybe use a language model to anticipate what might be said from partial sentences, especially with extra context, e.g. the speaker pointing at something
- find ways to detect made-up words (terms outside any vocabulary)
- construct words out of syllables instead of jumping straight to what could have been meant, and use that for the parts of a sentence where the speech recognition model is uncertain
- use the confidence values of the speech recognition model to look for errors (misunderstandings), possibly combining the syllable construction with wordlists and lists of names for that
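
The chunked-recording idea could look roughly like this. This is only a control-flow sketch: `recognize` stands in for a real speech-recognition call (a local model, for instance) and is stubbed here so the flow runs end to end; the chunk sizes are arbitrary.

```python
# Sketch: process short audio chunks as they arrive for quick partial
# hypotheses, then re-run recognition on the glued-together audio so the
# model sees the full sentence context.

def recognize(audio: bytes) -> str:
    """Placeholder for a real ASR call; returns a fake transcript."""
    return f"<{len(audio)} bytes transcribed>"

def chunked_recognition(chunks):
    partials = []
    audio_so_far = b""
    for chunk in chunks:
        audio_so_far += chunk
        # Early per-chunk hypothesis, available almost immediately.
        partials.append(recognize(chunk))
    # Final pass over the whole sentence; this usually supersedes
    # the partial results because the model has full context.
    final = recognize(audio_so_far)
    return partials, final

chunks = [b"\x00" * 1600, b"\x00" * 1600, b"\x00" * 800]
partials, final = chunked_recognition(chunks)
```

The partials give low latency (something to act on while the speaker is still talking); the final pass gives accuracy.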
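
The anticipation idea (a language model plus context such as a pointing gesture) can be illustrated with a toy rescorer. A real system would use an actual language model; the bigram table, the candidate words, and the context boost below are all made up for illustration.

```python
# Toy sketch: rank candidate next words for a partial sentence using
# bigram scores, and boost a word that matches non-speech context
# (e.g. the speaker is pointing at a lamp).

BIGRAMS = {
    ("turn", "on"): 0.6, ("turn", "off"): 0.4,
    ("on", "the"): 0.9, ("the", "lamp"): 0.5, ("the", "light"): 0.5,
}

def score(words, context_word=None):
    s = 1.0
    for a, b in zip(words, words[1:]):
        s *= BIGRAMS.get((a, b), 0.01)  # unseen pairs get a small floor
    if context_word and context_word in words:
        s *= 2.0  # the pointing gesture makes this word more likely
    return s

def pick(partial, candidates, context_word=None):
    return max(candidates, key=lambda w: score(partial + [w], context_word))

# The speaker has said "turn on the" and is pointing at a lamp:
best = pick(["turn", "on", "the"], ["lamp", "light"], context_word="lamp")
```

Without the context boost "lamp" and "light" tie; the gesture breaks the tie, which is the point of combining the language model with situational context.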
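
The last three points (made-up-word detection, syllable construction, confidence checking) fit together in one pipeline: accept confident words, check uncertain ones against wordlists and name lists, and only then try rebuilding from syllables. The wordlist, name list, syllable inventory, and greedy syllabifier below are illustrative stand-ins, not a real implementation.

```python
# Sketch of confidence-gated error checking. Words below a confidence
# threshold are checked against a wordlist and a name list; unknown ones
# are rebuilt from syllables before being flagged as possibly made up.

WORDLIST = {"please", "open", "window"}
NAMES = {"Alice"}
# Maps a heard syllable to its most likely intended syllable.
SYLLABLES = {"win": "win", "dow": "dow", "doh": "dow"}

def split_syllables(word):
    """Crude stand-in for a real syllabifier: greedy match, longest first."""
    out, i = [], 0
    while i < len(word):
        for n in (4, 3, 2):
            if word[i:i + n] in SYLLABLES:
                out.append(word[i:i + n])
                i += n
                break
        else:
            out.append(word[i])
            i += 1
    return out

def check(word, confidence, threshold=0.8):
    if confidence >= threshold:
        return word, "accepted"
    if word.lower() in WORDLIST or word in NAMES:
        return word, "in wordlist"
    # Low confidence and unknown: rebuild from syllables.
    rebuilt = "".join(SYLLABLES.get(s, s) for s in split_syllables(word))
    if rebuilt.lower() in WORDLIST:
        return rebuilt, "rebuilt from syllables"
    return word, "flagged"  # likely made up or misrecognized
```

For example, a low-confidence "windoh" would be rebuilt to "window", while a low-confidence nonsense word would be flagged rather than silently replaced by the nearest dictionary entry.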