Audio Classifier
This tutorial will show you how to train an audio classifier using Teachable Machine, then write Python code that detects the trained sounds.
You can use it for various purposes, such as...
- Send voice commands to a robot
- Detect audio alarms
- Recognize animals by their sounds
Note that when using Teachable Machine, audio samples are limited to 1 second each, so you can't have long commands. If you need longer commands (e.g. "turn left"), you can break them into multiple classes (i.e. one for "turn", another for "left") and detect whether the two classes appear consecutively.
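As a rough illustration of that idea, here is a minimal sketch. The class names "1 turn" and "2 left" are hypothetical; replace them with the classes you actually train. It fires a command when the second word is heard shortly after the first...
import time

# Hypothetical Teachable Machine class names; replace with your own.
FIRST_WORD = '1 turn'
SECOND_WORD = '2 left'
MAX_GAP_SECONDS = 1.5  # how long apart the two words may be

last_word = None
last_word_time = 0.0

# Call this with the class name of each detection (see the Code section below).
def handle_detection(category_name):
    global last_word, last_word_time
    now = time.time()
    if category_name == FIRST_WORD:
        last_word, last_word_time = category_name, now
    elif (category_name == SECOND_WORD and last_word == FIRST_WORD
          and now - last_word_time <= MAX_GAP_SECONDS):
        print('Command detected: turn left')
        last_word = None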
Training
- Open the Teachable Machine website (teachablemachine.withgoogle.com).
- Click "Get Started".
- Select "Audio Project".
- Record your background noise, then click "Extract Sample".
- Record your audio sample. Note that this records for two seconds by default, so if you are recording a single word, say it twice. Click "Extract Sample" after each recording.
- Add a new class and repeat the sample recording if needed.
- When you have sufficient samples, click "Train Model" and wait for it to complete.
- Click "Export Model", select "TensorFlow Lite", then download the model.
You should now have a zip file named "converted_tflite.zip". Open it up and extract the file named "soundclassifier_with_metadata.tflite"; this file contains the trained weights.
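If you prefer to do this step in Python rather than with your file manager, here is a minimal sketch using the standard library (it assumes converted_tflite.zip is in the current directory)...
import zipfile

# Pull the model file out of the Teachable Machine export.
with zipfile.ZipFile('converted_tflite.zip') as archive:
    archive.extract('soundclassifier_with_metadata.tflite')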
Installation
To run an audio classifier in Python, you'll need to install a few things.
LiteRT
This is the machine learning runtime that will run the model. LiteRT (formerly known as TensorFlow Lite) provides a lighter alternative to TensorFlow (smaller install size, lower memory use). If you already have TensorFlow installed, you can use that instead.
To install LiteRT on Linux systems (including Raspberry Pi), first create and activate a new virtual environment...
python -m venv litert
source litert/bin/activate
On Mac and Windows, you can skip the above step. Next, you'll need to install LiteRT...
pip install ai-edge-litert
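To confirm the install worked and that your exported model loads, you can run a quick sanity check like the sketch below. It only inspects the model file; the actual classifier is built with MediaPipe in the Code section...
# Sanity check: load the Teachable Machine model with the LiteRT interpreter.
from ai_edge_litert.interpreter import Interpreter

interpreter = Interpreter(model_path='soundclassifier_with_metadata.tflite')
interpreter.allocate_tensors()
print(interpreter.get_input_details()[0]['shape'])   # expected audio input shape
print(interpreter.get_output_details()[0]['shape'])  # one score per trained class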
MediaPipe
MediaPipe provides machine learning models for various tasks. We'll be using the Audio Classifier model provided by MediaPipe.
pip install mediapipe
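You can verify the install with a short import check (just a sanity check, not part of the classifier)...
# Sanity check: the imports should succeed and report the installed version.
from importlib.metadata import version
import mediapipe
from mediapipe.tasks.python import audio

print(version('mediapipe'))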
Code
import time
from mediapipe.tasks import python
from mediapipe.tasks.python.audio.core import audio_record
from mediapipe.tasks.python.components import containers
from mediapipe.tasks.python import audio
import numpy as np
# Various options. You can play around with these.
model = 'soundclassifier_with_metadata.tflite'
max_results = 5
score_threshold = 0.8
overlapping_factor = 0.5
# Callback function. The audio classifier will run this after each classification.
def print_result(result, timestamp_ms):
    if len(result.classifications[0].categories) < 1:
        return
    first_result = result.classifications[0].categories[0]
    # Only print if the detected audio is not the background noise.
    # (Teachable Machine prefixes class names with an index; the background class starts with '0'.)
    if first_result.category_name[0] != '0':
        print(first_result)
# Initialize the audio classification model.
base_options = python.BaseOptions(model_asset_path=model)
options = audio.AudioClassifierOptions(
    base_options=base_options, running_mode=audio.RunningMode.AUDIO_STREAM,
    max_results=max_results, score_threshold=score_threshold,
    result_callback=print_result)
classifier = audio.AudioClassifier.create_from_options(options)
# Initialize the audio recorder
buffer_size, sample_rate, num_channels = 44100, 44100, 1
record = audio_record.AudioRecord(num_channels, sample_rate, buffer_size)
# Initialize a tensor to store the audio data
audio_format = containers.AudioDataFormat(num_channels, sample_rate)
audio_data = containers.AudioData(buffer_size, audio_format)
# We'll try to run inference every interval_between_inference seconds.
# This is usually half of the model's input length, which creates an overlap
# between incoming audio segments and improves classification accuracy.
input_length_in_second = buffer_size / sample_rate
interval_between_inference = input_length_in_second * (1 - overlapping_factor)
last_inference_time = time.time()
# Start audio recording in the background.
record.start_recording()
# Loop forever
while True:
    # Wait until at least interval_between_inference seconds has passed since
    # the last inference.
    now = time.time()
    diff = now - last_inference_time
    if diff < interval_between_inference:
        time.sleep(0.01)
        continue
    last_inference_time = now

    # Load the input audio from the AudioRecord instance and run classify.
    data = record.read(buffer_size)
    audio_data.load_from_array(data.astype(np.float32))
    classifier.classify_async(audio_data, time.time_ns() // 1_000_000)
Save the above code into a .py file (e.g. audio.py), and run it with...
python audio.py
Say the word that you used for training; you should see the result printed on screen. You may also see some warning messages (mostly related to timestamps), which you can safely ignore.
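To act on detections instead of just printing them (for example, the robot-command use case from the introduction), you can swap the print_result callback in the code above for one that dispatches on the class name. A minimal sketch with hypothetical class names...
# Hypothetical mapping from Teachable Machine class names to actions;
# replace the keys with the classes you actually trained.
ACTIONS = {
    '1 forward': lambda: print('robot: move forward'),
    '2 stop': lambda: print('robot: stop'),
}

def handle_result(result, timestamp_ms):
    if len(result.classifications[0].categories) < 1:
        return
    top = result.classifications[0].categories[0]
    action = ACTIONS.get(top.category_name)
    if action is not None:
        action()

# Pass handle_result instead of print_result when building AudioClassifierOptions:
# options = audio.AudioClassifierOptions(..., result_callback=handle_result)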