We should think about optimizations for speech recognition (synthesis needs its own approach):
- there are FPGA single-board computers (SBCs) that can be trained to react to certain keywords and then emit text or trigger an action
- instead of recording a 30 s sentence, record much shorter chunks and start processing right after the first one: check the parts individually, but also glue them together and send the whole sentence to the speech recognition model
- maybe use a language model to anticipate what might be said from partial sentences, especially with extra context, e.g. the speaker pointing at something
- find ways to detect made-up words (terms outside any vocabulary)
- construct words out of syllables instead of jumping straight to what could have been meant, and use that for the parts of a sentence where the speech recognition model is uncertain
- use the confidence values of the speech recognition model to look for errors (misunderstandings), possibly combining the syllable construction with wordlists and lists of names for that
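
The chunked-recording idea could look roughly like this. This is only a control-flow sketch: `recognize` stands in for a real speech-recognition call (a local model, for instance) and is stubbed here so the flow runs end to end; the chunk sizes are arbitrary.

```python
# Sketch: process short audio chunks as they arrive for quick partial
# hypotheses, then re-run recognition on the glued-together audio so the
# model sees the full sentence context.

def recognize(audio: bytes) -> str:
    """Placeholder for a real ASR call; returns a fake transcript."""
    return f"<{len(audio)} bytes transcribed>"

def chunked_recognition(chunks):
    partials = []
    audio_so_far = b""
    for chunk in chunks:
        audio_so_far += chunk
        # Early per-chunk hypothesis, available almost immediately.
        partials.append(recognize(chunk))
    # Final pass over the whole sentence; this usually supersedes
    # the partial results because the model has full context.
    final = recognize(audio_so_far)
    return partials, final

chunks = [b"\x00" * 1600, b"\x00" * 1600, b"\x00" * 800]
partials, final = chunked_recognition(chunks)
```

The partials give low latency (something to act on while the speaker is still talking); the final pass gives accuracy.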
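
The anticipation idea (a language model plus context such as a pointing gesture) can be illustrated with a toy rescorer. A real system would use an actual language model; the bigram table, the candidate words, and the context boost below are all made up for illustration.

```python
# Toy sketch: rank candidate next words for a partial sentence using
# bigram scores, and boost a word that matches non-speech context
# (e.g. the speaker is pointing at a lamp).

BIGRAMS = {
    ("turn", "on"): 0.6, ("turn", "off"): 0.4,
    ("on", "the"): 0.9, ("the", "lamp"): 0.5, ("the", "light"): 0.5,
}

def score(words, context_word=None):
    s = 1.0
    for a, b in zip(words, words[1:]):
        s *= BIGRAMS.get((a, b), 0.01)  # unseen pairs get a small floor
    if context_word and context_word in words:
        s *= 2.0  # the pointing gesture makes this word more likely
    return s

def pick(partial, candidates, context_word=None):
    return max(candidates, key=lambda w: score(partial + [w], context_word))

# The speaker has said "turn on the" and is pointing at a lamp:
best = pick(["turn", "on", "the"], ["lamp", "light"], context_word="lamp")
```

Without the context boost "lamp" and "light" tie; the gesture breaks the tie, which is the point of combining the language model with situational context.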
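
The last three points (made-up-word detection, syllable construction, confidence checking) fit together in one pipeline: accept confident words, check uncertain ones against wordlists and name lists, and only then try rebuilding from syllables. The wordlist, name list, syllable inventory, and greedy syllabifier below are illustrative stand-ins, not a real implementation.

```python
# Sketch of confidence-gated error checking. Words below a confidence
# threshold are checked against a wordlist and a name list; unknown ones
# are rebuilt from syllables before being flagged as possibly made up.

WORDLIST = {"please", "open", "window"}
NAMES = {"Alice"}
# Maps a heard syllable to its most likely intended syllable.
SYLLABLES = {"win": "win", "dow": "dow", "doh": "dow"}

def split_syllables(word):
    """Crude stand-in for a real syllabifier: greedy match, longest first."""
    out, i = [], 0
    while i < len(word):
        for n in (4, 3, 2):
            if word[i:i + n] in SYLLABLES:
                out.append(word[i:i + n])
                i += n
                break
        else:
            out.append(word[i])
            i += 1
    return out

def check(word, confidence, threshold=0.8):
    if confidence >= threshold:
        return word, "accepted"
    if word.lower() in WORDLIST or word in NAMES:
        return word, "in wordlist"
    # Low confidence and unknown: rebuild from syllables.
    rebuilt = "".join(SYLLABLES.get(s, s) for s in split_syllables(word))
    if rebuilt.lower() in WORDLIST:
        return rebuilt, "rebuilt from syllables"
    return word, "flagged"  # likely made up or misrecognized
```

For example, a low-confidence "windoh" would be rebuilt to "window", while a low-confidence nonsense word would be flagged rather than silently replaced by the nearest dictionary entry.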