I am building a system around an Asterisk VoIP server. My goal is to enable streaming speech recognition as soon as an inbound call comes in, i.e. I want automatic speech recognition (ASR) to run from the very start of a conversation between two people. The ASR engine I have chosen is Kaldi, served by Vosk server ([login to view URL]). Since it needs to be integrated with Asterisk, I use an Asterisk-specific module ([login to view URL]) to carry out ASR operations without compatibility issues. So far, whenever somebody speaks during a call, it produces very clear text output. The problem I am struggling with is how to enable streaming ASR immediately during the conversation, i.e. from the moment the Dial() application in the Asterisk dialplan is executed.
That is the subject of this job: create a script (most likely using some Asterisk REST Interface components) that works as follows:
1) as soon as the Dial() application starts running, the real-time audio stream is processed by the ASR engine, which waits for input inside a Docker container (I deploy Kaldi as part of Vosk server, which is compatible with Asterisk; the out-of-the-box implementation is released on GitHub: [login to view URL]);
2) once the conversation begins and speech is detected, the audio stream is sent to the ASR engine powered by Vosk server (inside the Docker container);
3) while the conversation is ongoing and audio keeps flowing, the ASR engine generates transcription outputs (files), which should be forwarded to an HTTP server that evaluates their content (do not worry about the evaluation itself, it is outside the scope of this job);
4) when the conversation ends, the last phrases are processed by the ASR engine and the final outputs are passed to the HTTP server mentioned above;
5) for every inbound call the same steps are carried out: audio capture, speech recognition inside the Docker container, text file delivered to the HTTP server.
All of this must meet real-time requirements, so the data flow needs fast and seamless throughput both before and after ASR processing.
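To make the capture side of the steps above more concrete, here is a minimal sketch of the ARI requests such a script might issue once a call is seen. It only builds the requests and does not send them. The base URL, the application name `asr-tap`, the relay address and the helper names `ari_url` and `tap_plan` are assumptions for illustration; the `snoop`, `externalMedia` and `bridges` resources are part of ARI, but whether this is the right approach for a given Asterisk version and dialplan has to be verified.

```python
from urllib.parse import urlencode

# Assumptions: default Asterisk HTTP port and a hypothetical ARI app name.
ARI_BASE = "http://localhost:8088/ari"
APP = "asr-tap"


def ari_url(path, **params):
    """Build a full ARI URL with URL-encoded query parameters."""
    query = urlencode(params)
    return f"{ARI_BASE}{path}?{query}" if query else f"{ARI_BASE}{path}"


def tap_plan(channel_id, relay_host):
    """Return the (method, url) pairs needed to tap one call.

    channel_id: id of the channel that is executing Dial()
    relay_host: host:port of a hypothetical relay that receives the
                audio from Asterisk and forwards it to the ASR engine
    """
    return [
        # 1. Spy on both directions of the call without disturbing it.
        ("POST", ari_url(f"/channels/{channel_id}/snoop", spy="both", app=APP)),
        # 2. Create an external media channel that sends audio to the relay
        #    (slin = 16-bit signed linear PCM at 8 kHz).
        ("POST", ari_url("/channels/externalMedia", app=APP,
                         external_host=relay_host, format="slin")),
        # 3. Create a mixing bridge; the snoop and external media channels
        #    are then added to it with their runtime ids (not shown here).
        ("POST", ari_url("/bridges", type="mixing")),
    ]


if __name__ == "__main__":
    for method, url in tap_plan("1700000000.1", "127.0.0.1:4000"):
        print(method, url)
```

In this sketch the external media channel delivers RTP by default, while Vosk server listens on a WebSocket, so a small relay between the two is assumed. Sending the requests (with ARI credentials) and listening for ARI events to learn when a call starts and ends are left out.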
While searching the Internet for helpful material, I came across this Stack Overflow question: [login to view URL]
It describes the same goal, just in different words. However, I require the system design to be implemented with Kaldi/Vosk rather than Google Speech. As for the development language, I leave some options open: Python, Java or JS are acceptable.
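For the Kaldi/Vosk side, the following Python sketch shows how captured audio might be framed for Vosk server's WebSocket interface: a configuration message first, then small binary audio chunks, then an end-of-stream marker. The 8 kHz sample rate, the 100 ms chunk size and the function names are assumptions; the message shapes follow the Vosk server protocol as I understand it and should be checked against the deployed version. No network connection is made here.

```python
import json

# Assumptions: 16-bit mono PCM at 8 kHz (typical telephony audio),
# sent in 100 ms chunks to keep latency low.
SAMPLE_RATE = 8000
BYTES_PER_SAMPLE = 2
CHUNK_MS = 100


def frames(pcm, chunk_ms=CHUNK_MS):
    """Split raw PCM bytes into chunks of chunk_ms milliseconds."""
    size = SAMPLE_RATE * BYTES_PER_SAMPLE * chunk_ms // 1000
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]


def outgoing_messages(pcm):
    """Yield the messages to send over the WebSocket, in order."""
    # Text message announcing the sample rate of the audio that follows.
    yield json.dumps({"config": {"sample_rate": SAMPLE_RATE}})
    # Binary messages carrying the audio itself.
    yield from frames(pcm)
    # Text message telling the server to flush its final result.
    yield json.dumps({"eof": 1})


def final_text(message):
    """Return the recognized phrase from a server message.

    Returns None for partial results and for empty final results,
    so only finished phrases are passed on to the HTTP server.
    """
    data = json.loads(message)
    return data.get("text") or None


if __name__ == "__main__":
    silence = bytes(4000)  # 250 ms of silence as stand-in audio
    for msg in outgoing_messages(silence):
        print(type(msg).__name__, len(msg))
```

In a live setup the chunks would come from the call audio as it arrives rather than from a complete buffer, and each non-empty `final_text` result would be posted to the HTTP server; the end-of-stream message corresponds to the end of the conversation in step 4.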
The job will be considered complete and eligible for full payment only if the program demonstrably performs all the listed steps without implementation errors. It must also be compatible with all the software products mentioned above.