Author: LivelyCoffee (with help from Keeeweeee and HumanThakrar)
Listener & Transcribe Module v1.1
The EVA AI Listener is stored in a file called listener.py, containing the Listener class and the mic_exec() function, which work together to capture external audio and transcribe it.
The listener module (core.functions.listener) of EVA contains essential classes and functions that enable EVA to listen and transcribe user input (in the form, here, of audio data directly from the input device). It enables powerful access and control over the audio input stream which means we can modify the input system according to our needs.
In short, the listener module has a Listener class that is initialised by the main function. It opens the audio input stream to capture audio data from the input device: the class initialises and starts an InputStream(), which uses the callback function to store valid audio data (according to the applied conditions) for the listen() function to assemble into a numpy array.
Next, we have the listen() function that records audio, stores it into a numpy array, and returns the post-processed audio array as NDArray[np.float32].
Finally, we have the mic_exec() function, which uses a Listener object to record audio data and then sends it to a Faster-Whisper model (we are using the “tiny” model_size here, but “small” is recommended if possible). Faster-Whisper transcribes the audio data into text, which is then processed and returned as the “query”.
import sounddevice as sd
import numpy as np
from numpy.typing import NDArray
from faster_whisper import WhisperModel
import time
import math
MODEL_SIZE = "tiny" # or "small"
MODEL = WhisperModel(MODEL_SIZE, device="cpu", compute_type="int8") # or CUDA with float32
#MAX_TIME = 10 # seconds !--> Not required anymore
PAUSE_THRESHOLD = 1.2 # seconds (1.2 to 1.7 is good)
SPEECH_THRESHOLD = 3.6 # old - 3.5
AUDIO_THRESHOLD = 0.1 # old - 0.1
SAMPLE_RATE = 16000
class Listener:
    def __init__(self):
        self.started = False
        self.sound_data = []
        self.prev_time = time.time()
        self.stream = sd.InputStream(
            samplerate=SAMPLE_RATE,
            channels=1,
            callback=self.callback
        )
        self.stream.start()

    def callback(self, indata, frames, time_info, status):
        volume = math.sqrt(float((indata * indata).sum())) * 10
        if volume > SPEECH_THRESHOLD:  # USER SPEECH DETECTION
            self.started = True
            self.prev_time = time.time()
        if volume > AUDIO_THRESHOLD and self.started:  # AUDIO DETECTION
            self.sound_data.append(indata.copy())

    def listen(self) -> NDArray[np.float32]:
        self.started = False
        self.sound_data = []
        self.prev_time = time.time()
        print("[SR]: Listening...")
        while True:
            time.sleep(0.05)
            if ((time.time() - self.prev_time) >= PAUSE_THRESHOLD) and self.started:
                break
        sound_data = self.sound_data
        self.sound_data = []
        if not sound_data:
            return np.array([], dtype=np.float32)
        audio = np.concatenate(sound_data, axis=0)
        audio = audio.astype(np.float32)
        return audio.flatten()

    def shutdown(self):
        self.stream.stop()
        self.stream.close()
def mic_exec(listener: Listener) -> str:
    '''
    Main MIC Executor for EVA. Listens and Recognises User Query and Outputs a Sanitised Query
    '''
    while True:
        query = ""
        audio = listener.listen()
        if len(audio) != 0:
            print("[SR]: Recognising...")
            try:
                segments, info = MODEL.transcribe(audio, language="en", task="translate", condition_on_previous_text=False)
                query = " ".join(segment.text.strip() for segment in segments)
            except Exception:
                print()
                return ""
            query = str(query).lower()
            print(f"\nUSER: {query}")
            return query
#*---------- END OF CODE ----------*
Let us now go through the code (in listener.py) segment-by-segment. First and foremost, we have the imports, and all they do is import the necessary functions, modules and libraries into the file.
import sounddevice as sd
import numpy as np
from numpy.typing import NDArray
from faster_whisper import WhisperModel
import time
import math
We will be using the following libraries:
sounddevice
numpy
faster-whisper
time - the time library in Python (https://docs.python.org/3/library/time.html)
math - the math library in Python (https://docs.python.org/3/library/math.html)
All the above libraries work together and allow us to create the right tools and functions that enable this module to work. Proper credit is due to the creators and maintainers of these libraries for providing the support that makes this possible.
Initially, a few variables and constants have to be defined. This way, we can modify the right values without having to dive into the code too deeply. It allows for easier user access for customisation and fine-tuning.
MODEL_SIZE = "tiny" # or "small"
MODEL = WhisperModel(MODEL_SIZE, device="cpu", compute_type="int8") # or CUDA with float32
#MAX_TIME = 10 # seconds !--> Not required anymore
PAUSE_THRESHOLD = 1.2 # seconds (1.2 to 1.7 is good)
SPEECH_THRESHOLD = 3.6 # old - 3.5
AUDIO_THRESHOLD = 0.1 # old - 0.1
SAMPLE_RATE = 16000
MODEL_SIZE: This is the faster-whisper model size that we will be using to transcribe audio to text. Faster-whisper provides various model sizes (tiny, base, small, medium, and the large variants such as large-v3).
MODEL: This initialises the WhisperModel that we will be using, with the given params. Notice how we have set the device to be “cpu”? This is because we need to keep the GPU available for a better and faster working LLM and other OS-specific tasks that may require it.
model_size=MODEL_SIZE - self-explanatory (sets the model size to use).
device - specifies whether to use the CPU or the GPU (and if so, which one).
compute_type - handles how numbers are stored and processed inside the model. Most Neural Networks (NNs) use float32 (32-bit numbers), which provides better accuracy but slower speeds. This parameter lets you define the “accuracy v/s speed” tradeoff. For GPU: the recommended value is float16 (more accurate, fast enough on GPUs). For CPU: the recommended value is int8 (faster).
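The tradeoff above can be sketched as a tiny helper. This is a hypothetical illustration (the function name and flag are our own), though "cuda"/"cpu", "float16" and "int8" are real faster-whisper option values:

```python
# Hypothetical helper sketching the recommended device/compute_type pairings.
def model_settings(use_gpu: bool) -> dict:
    if use_gpu:
        # float16: more accurate, fast enough on GPUs
        return {"device": "cuda", "compute_type": "float16"}
    # int8: faster on CPUs at a small accuracy cost
    return {"device": "cpu", "compute_type": "int8"}

# e.g. WhisperModel(MODEL_SIZE, **model_settings(use_gpu=False))
print(model_settings(False))
```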
PAUSE_THRESHOLD (float, seconds): This is the number of seconds of silence you want to assume as the end of a query. Basically, “how long does the user have to stay silent for the model to start transcribing”. 1.2-1.7 works best, but it is really up to personal preference.
NOTE: This does NOT affect the quality of output given by the faster-whisper model.
SPEECH_THRESHOLD (float): This is the volume-floor for speech - the level of volume above which we recognise that the user is speaking. It usually has to be fine-tuned by the user, but going too low will break functionality. The recommended value is between 3 and 4.
NOTE: This does NOT affect the quality of output given by the faster-whisper model.
AUDIO_THRESHOLD (float): This is the volume-floor for audio; any sound below this level will be IGNORED completely. It can be fine-tuned to block background noise, but it DOES affect the quality of output given by the faster-whisper model.
SAMPLE_RATE (int): It is recommended that you do not change this value from the default of 16000. If required, it may be changed based on the input device used. It affects a lot of parameters, and may also affect the quality of output given by the faster-whisper model.
Listener Class
This is the main class that handles the audio input stream, audio pre-processing, storage, and output of the proper audio type for our model to transcribe.
class Listener:
    def __init__(self):
        self.started = False
        self.sound_data = []
        self.prev_time = time.time()
        self.stream = sd.InputStream(
            samplerate=SAMPLE_RATE,
            channels=1,
            callback=self.callback
        )
        self.stream.start()

    def callback(self, indata, frames, time_info, status):
        volume = math.sqrt(float((indata * indata).sum())) * 10
        if volume > SPEECH_THRESHOLD:  # USER SPEECH DETECTION
            self.started = True
            self.prev_time = time.time()
        if volume > AUDIO_THRESHOLD and self.started:  # AUDIO DETECTION
            self.sound_data.append(indata.copy())

    def listen(self) -> NDArray[np.float32]:
        self.started = False
        self.sound_data = []
        self.prev_time = time.time()
        print("[SR]: Listening...")
        while True:
            time.sleep(0.05)
            if ((time.time() - self.prev_time) >= PAUSE_THRESHOLD) and self.started:
                break
        sound_data = self.sound_data
        self.sound_data = []
        if not sound_data:
            return np.array([], dtype=np.float32)
        audio = np.concatenate(sound_data, axis=0)
        audio = audio.astype(np.float32)
        return audio.flatten()

    def shutdown(self):
        self.stream.stop()
        self.stream.close()
It has 4 functions - __init__, callback, listen and shutdown - all of which aid in EVA’s listening capability.
Let us dive right into it, exploring each function, its parameters, the code logic, execution and returning values. We will also explore how it integrates with other elements of the code, and where it is used in EVA.
__init__() Function
It is the initialising function (constructor) of the Listener class. Python does not strictly require every class to have one, but it is the conventional place to set up a class's initial state. We won’t go deep into how it works, since this is a Python basics topic which the reader can explore themselves.
def __init__(self):
    self.started = False
    self.sound_data = []
    self.prev_time = time.time()
    self.stream = sd.InputStream(
        samplerate=SAMPLE_RATE,
        channels=1,
        callback=self.callback
    )
    self.stream.start()
The __init__() function initialises key variables, and is also the point where the InputStream is created and started (basically, the input device is opened to allow audio data to stream in). Let us look at the first few lines of code:
self.started = False
self.sound_data = []
self.prev_time = time.time()
The self.started variable (bool) is owned by the class and acts as a status flag that updates according to defined rules. It tells us whether the user has started speaking. Let us explore WHY this even exists.
Basically, we want to be able to capture what the user is saying only if they do indeed start to speak. Otherwise, the listening function does NOT store any audio data. This saves on computation and transcribing power, and also prevents unnecessary code execution. We only start to actually “record” audio WHEN the user starts speaking. Initially, we say this condition is “False”.
<aside> 💡
In short, self.started is the variable that tells us if the user has started to speak or not. (i.e., if the volume of input audio indata has passed the SPEECH_THRESHOLD.)
</aside>
Next, we have the self.sound_data variable (list), which will store the audio data itself - to be specific, the raw audio chunks (each a numpy array) captured from the stream. It is initialised as an empty list.
Finally, we have the self.prev_time variable (float), a very important variable that handles the logic for when we want to stop listening. It is initialised with time.time(), which returns the time (in seconds, as a float) since the Unix epoch. We will use differences between such timestamps to measure the time between events in seconds.
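As a quick standalone illustration of how such timestamp differences behave (the 0.1-second sleep is an arbitrary stand-in for “some time passes”):

```python
import time

start = time.time()            # float seconds since the Unix epoch
time.sleep(0.1)                # stand-in for work / waiting
elapsed = time.time() - start  # difference of two timestamps = elapsed seconds
print(f"elapsed: {elapsed:.2f}s")
```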
Let us now look at the next part of the function, which is what initialises and creates the InputStream object from sounddevice :
self.stream = sd.InputStream(
    samplerate=SAMPLE_RATE,
    channels=1,
    callback=self.callback
)
We create self.stream as an instance of the sounddevice.InputStream class. The InputStream is what actually opens the input device to allow external audio data to enter. We must initialise this object with the samplerate we will be using with the input device, and the number of channels we will take audio in from - where 1 is mono and 2 is stereo. For models like faster-whisper, it is recommended to use mono audio, which also makes pre- and post-processing of the data much faster.
We then also need to provide the self.stream object with a “callback” function. This is the function that is executed whenever a new chunk of data enters the stream. We will look into this in more detail as we go further into the code. For now, remember that these are the parameters the InputStream class requires. For more information, you may read the sounddevice documentation.
Let us move forward into the last step of initialisation, that is, when we open the stream:
self.stream.start()
This opens the stream and starts to listen into external raw audio data coming in through the input device. It is what activates the device “microphone”.
callback() Function
It is the function called by the stream (here, the InputStream object) whenever new audio data flows into the stream from the input device.
def callback(self, indata, frames, time_info, status):
    volume = math.sqrt(float((indata * indata).sum())) * 10
    if volume > SPEECH_THRESHOLD:  # USER SPEECH DETECTION
        self.started = True
        self.prev_time = time.time()
    if volume > AUDIO_THRESHOLD and self.started:  # AUDIO DETECTION
        self.sound_data.append(indata.copy())
Inside it, we specify what we wish to do with this audio data and how to manipulate it. The callback also gives us other valuable information: the timing of the capture, the status (which helps with debugging and errors) and the number of audio frames.
The callback function is usually required to have this minimal basic structure, about which you can read more in the sounddevice documentation:
def callback(indata, outdata, frames, time, status):
    outdata[:] = indata

# In case of InputStream(), there is no outdata, hence:
def callback(indata, frames, time, status):
    pass  # put indata somewhere
indata - Chunk of raw audio data that was received in the InputStream.
outdata - Chunk of audio data to send to the OutputStream to be played by the output device.
frames - The number of samples received in this particular raw audio chunk. Remember that indata.shape = (frames, channels).
time - Contains the timing information about the audio stream. (We name it time_info so that it does not collide with the “time” library.)
status - Indicates problems and warnings in the audio stream. (A common one is input overflow, which may indicate your callback is too slow.)
It is also required that the callback function be fast, never stall/wait, and never block. Hence, it is recommended that we do not populate the callback function with too much code.
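A minimal defensive sketch of this structure (our own illustration, not part of the module) surfaces status and does nothing else expensive:

```python
import sys

captured = []  # stand-in for wherever the chunks go (a queue, a list, ...)

def callback(indata, frames, time_info, status):
    if status:                       # e.g. input overflow: callback too slow
        print(status, file=sys.stderr)
    captured.append(indata.copy())   # keep the callback itself cheap
```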
A better method would be, for instance, to make an audio queue and have indata copied into it to be processed elsewhere. But I could not get this working, hence we will not be talking about that mechanism here.
Let us take a look inside the callback function:
volume = math.sqrt(float((indata * indata).sum())) * 10
if volume > SPEECH_THRESHOLD:  # USER SPEECH DETECTION
    self.started = True
    self.prev_time = time.time()
if volume > AUDIO_THRESHOLD and self.started:  # AUDIO DETECTION
    self.sound_data.append(indata.copy())
The first line is used to calculate the signal strength (“volume”) of this particular chunk of data (indata). Despite the name, it is not a stable volume measure - or really “volume” at all.
The current calculation is crude and exaggerates the difference between chunk spikes - which here, ironically, actually helps us better control the flow.
That is why we will not be using the more refined RMS method, which is normalised, frame-independent, and gives a better representation of “volume”:
# using numpy - slower and usually compute heavy
volume = np.linalg.norm(indata)*10
# using basic math functions --> CURRENT, approximately same as numpy method
volume = math.sqrt(float((indata * indata).sum())) * 10
# using the more stable RMS method
volume = float(np.sqrt(np.mean(indata**2)))
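To see the scale difference on a concrete chunk, here is a standalone comparison (the 440 Hz sine with amplitude 0.05 is an arbitrary synthetic example, not real microphone data):

```python
import math
import numpy as np

# Synthetic 0.1 s mono "chunk" at 16 kHz: a quiet 440 Hz sine wave.
t = np.linspace(0, 0.1, 1600, endpoint=False)
chunk = (0.05 * np.sin(2 * np.pi * 440 * t)).astype(np.float32).reshape(-1, 1)

# Current method: scaled L2 norm - grows with the number of frames.
vol_current = math.sqrt(float((chunk * chunk).sum())) * 10

# RMS method: normalised by sample count, so it is frame-independent.
vol_rms = float(np.sqrt(np.mean(chunk ** 2)))

print(vol_current, vol_rms)  # the first value is far larger
```

Any threshold constants tuned for one method would have to be re-tuned if you switched to the other.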
In simple terms, the callback function does three things:
1. Calculate the signal strength of the chunk: volume = math.sqrt(float((indata * indata).sum())) * 10
2. Compare the signal strength (volume) with SPEECH_THRESHOLD → if true, set self.started to True and update the user-last-spoken time.
3. Append the chunk to the audio storage, but ONLY if it is valid sound (above AUDIO_THRESHOLD) and the user has started speaking.
if volume > SPEECH_THRESHOLD:  # USER SPEECH DETECTION
    self.started = True  # User has started to speak 👍
    self.prev_time = time.time()  # User last spoke at this time
# ONLY record if the chunk has no noise + user has started speaking
if volume > AUDIO_THRESHOLD and self.started:  # AUDIO DETECTION
    self.sound_data.append(indata.copy())
Note that we do not discard every chunk in which the user DID NOT SPEAK (that is why we separate the audio and speech thresholds) - otherwise faster-whisper would receive continuous, unnatural speech with all pauses removed, which causes incorrect and really weird transcriptions.
Next, we are checking first to see if the user is speaking or not (or in better terms, if this audio chunk is a part of user speech, or is just noise/silence). This happens by comparing the signal strength (volume) with the earlier defined constant SPEECH_THRESHOLD. This helps us set the self.started variable to True to denote that the user is indeed speaking or at-least has started speaking. We also update the self.prev_time to reset the timing sequence for comparison later. We will explore this as we reach the other functions in the class.
In short, the self.prev_time variable tells us when last the user spoke, or basically helps us answer “when did the user last speak?”.
The second, and final code execution we do in the callback, is to ONLY append a copy of indata into the self.sound_data storage IF it is valid sound (this removes unnecessary noise and silence, which might make processing by the faster-whisper model harder). Less noise and silence means faster-whisper is more accurate in telling us what the user has said.
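This “valid sound” gate can be demonstrated in isolation. A small sketch using the module’s own volume formula and AUDIO_THRESHOLD constant (the synthetic chunks are made up for illustration):

```python
import math
import numpy as np

AUDIO_THRESHOLD = 0.1  # same constant as in the module

def is_valid_sound(chunk: np.ndarray) -> bool:
    # Same signal-strength formula the callback uses.
    volume = math.sqrt(float((chunk * chunk).sum())) * 10
    return volume > AUDIO_THRESHOLD

silence = np.zeros((1600, 1), dtype=np.float32)      # pure silence: discarded
speech = np.full((1600, 1), 0.05, dtype=np.float32)  # non-silent chunk: kept

print(is_valid_sound(silence), is_valid_sound(speech))
```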
We limit the callback to only this much code execution, which is already a fair bit borderline in terms of how heavy the callback function can be made. Again, using a queue and moving indata out of the callback for processing is usually the better way to go.
If you would like to explore this, you may use the queue library in python, and create a queue:
import queue
audio_queue = queue.Queue()  # made inside the Listener class instead

# to put items for processing in the queue:
def callback(indata, frames, time, status):
    audio_queue.put(indata.copy())

# to get an item from the queue to be processed -> NOT in the callback
while not audio_queue.empty():
    chunk = audio_queue.get()
...
We may now finally move on to the next function.
listen() Function
The listen function is the main listening function of the Listener class; it returns the post-processed, usable/valid audio data as a numpy float32 array.
def listen(self) -> NDArray[np.float32]:
    self.started = False
    self.sound_data = []
    self.prev_time = time.time()
    print("[SR]: Listening...")
    while True:
        time.sleep(0.05)
        if ((time.time() - self.prev_time) >= PAUSE_THRESHOLD) and self.started:
            break
    sound_data = self.sound_data
    self.sound_data = []
    if not sound_data:
        return np.array([], dtype=np.float32)
    audio = np.concatenate(sound_data, axis=0)
    audio = audio.astype(np.float32)
    return audio.flatten()
The listen function is a blocking function, which means the execution of code (in fact, of main.py) stalls until listen has exited/returned the array. If your code ever hangs, the culprit is most likely in the listen or callback functions, unless it is an external library or operating system issue.
We start off by setting the initial values and resetting sound_data (so that we have no carry-over of previous audio data). We also make sure to reset the other variables to ensure there are no logic flow or timing issues.
We then print a debug statement ([SR]: Listening… - where SR stands for Speech Recognition) and create an infinite while True loop. This loop is what makes the listen function blocking: it pauses the current execution and allows sound data to accumulate in the self.sound_data audio storage variable according to the logic we have implemented.
print("[SR]: Listening...")
while True:
    time.sleep(0.05)
    if ((time.time() - self.prev_time) >= PAUSE_THRESHOLD) and self.started:
        break
Here, we use time.sleep(0.05) instead of sd.sleep(50) because sd.sleep() hands loop control over to the PortAudio thread created by sounddevice to handle the stream, which may cause stalls, hangs or even program crashes. We prevent this by using a Python-controlled sleep via time.sleep() with a sleep time of 0.05 seconds (sd.sleep() takes milliseconds, so the equivalent would be 50).
Then, we have the actual logic that checks whether the user has completed their speech, so we can finalise (or “freeze” the array) and continue with post-processing.
To do this, we use an if statement to check whether the current time minus the previous time passes the threshold - basically, whether the time since the user last spoke has exceeded PAUSE_THRESHOLD seconds. If so, we break the loop and freeze the output array.
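The break condition can be checked in isolation with injected timestamps instead of a real clock (a pure-logic sketch; the times below are made up for illustration):

```python
PAUSE_THRESHOLD = 1.2  # same constant as in the module

def should_stop(now: float, prev_time: float, started: bool) -> bool:
    # Stop only if the user started speaking AND has been silent long enough.
    return ((now - prev_time) >= PAUSE_THRESHOLD) and started

print(should_stop(now=11.0, prev_time=9.5, started=True))  # 1.5 s of silence
```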
sound_data = self.sound_data
self.sound_data = []
This means that the function starts to take in audio data and keeps checking whether the user has stopped talking. If so, we un-block the code and freeze the current audio.
We do this “freezing” by assigning the sound data to a local variable (sound_data), and then clearing the self.sound_data variable as a redundancy measure to prevent old audio from creeping into the next transcription.
After this step, we begin processing the received audio data, which is now stored in the local variable sound_data as a list of numpy arrays (one per chunk).
if not sound_data:
    return np.array([], dtype=np.float32)
audio = np.concatenate(sound_data, axis=0)
audio = audio.astype(np.float32)
return audio.flatten()
We also check whether audio was captured at all; if not, sound_data will be empty (or, hopefully not, None). In that case, we directly return an empty array of the proper datatype and valid formatting acceptable as audio input to the faster-whisper model. (Even though we check for an empty input before processing, this redundancy helps prevent errors and bugs.)
Currently, the sound_data variable is a list of audio chunks captured through the callback function - basically a collection of audio snippets recorded from the InputStream. So we first have to convert these parts of audio data into one long audio array.
To do this, we use numpy to concatenate (or basically, “stitch”) all the different audio segments or chunks end-to-end into one long audio “file” so that you now have one long continuous audio stream to work with.
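The “stitching” step can be shown with fake chunks shaped (frames, 1), as sounddevice delivers them (the chunk sizes and values here are arbitrary):

```python
import numpy as np

# Three fake callback chunks, each shaped (frames, 1).
chunks = [np.full((4, 1), i, dtype=np.float32) for i in range(3)]

audio = np.concatenate(chunks, axis=0)      # stitch end-to-end: shape (12, 1)
audio = audio.astype(np.float32).flatten()  # 1-D float32 array for faster-whisper

print(audio.shape, audio.dtype)
```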