[Cristal] Setting up a voice recognition/Google assistant server
Introduction
This article is one of the steps needed to create a personalized voice assistant; the whole series is presented in the introductory article. Of course, if you just want to see how to set up an Ubuntu server that performs voice recognition over TCP/IP and integrates the Google Assistant SDK, you are in the right place!
Transmitting a WAV audio file between the ESP32 and the Linux server for recognition
As I said in the introductory article, you can use any machine here as long as it runs Linux. The first step is to add a Python script that performs speech recognition, so Python must be installed on your Linux machine. You also need the “speech_recognition” module. To install it, first create a Python virtual environment at the root of your project:
pip install virtualenv
python3 -m venv <virtual-environment-name>
It is necessary to activate the environment each time you want to use it, whether to run a script using it or to install a new module.
source <virtual-environment-name>/bin/activate
If the name of your environment appears in parentheses at the start of your terminal prompt, it is activated. Now install the module that performs voice recognition.
pip install SpeechRecognition
You can now add this Python file to your server. It performs French voice recognition, but you can simply change the language code to the one you want. Name it “recognize-fr.py”.
import os

import speech_recognition as sr

filename = os.path.join(os.getcwd(), "enregistrement.wav")

# Check if the file is empty
if os.path.getsize(filename) == 0:
    print("Le fichier est vide, aucun traitement effectué.")
    with open("rapport.txt", "w") as file:
        file.write("Erreur de reconnaissance\n")
else:
    r = sr.Recognizer()
    with sr.AudioFile(filename) as source:
        audio = r.record(source)
    try:
        datafr = r.recognize_google(audio, language="fr-FR")
        print("Reconnaissance réussie : ", datafr)
    except (sr.UnknownValueError, sr.RequestError):
        # The speech was not understood, or the recognition service could not be reached
        print("Ressayez s'il vous plaît...")
        datafr = "Erreur de reconnaissance"
    with open("rapport.txt", "w") as file:
        file.write(datafr)
        file.write("\n")

# Check the content of the file after writing
with open("rapport.txt", "r") as file:
    content = file.read()
print("Contenu du fichier :", repr(content))
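The script above only checks whether the file is empty. As a possible extra guard, here is a minimal sketch that uses Python's standard wave module to confirm that the upload really is a readable WAV file with at least one audio frame. The function name describe_wav is my own and is not part of the original script.

```python
import wave


def describe_wav(path):
    """Return (channels, sample_rate_hz, duration_s), or None if the file
    cannot be read as a WAV file or contains no audio frames."""
    try:
        with wave.open(path, "rb") as wav:
            frames = wav.getnframes()
            rate = wav.getframerate()
            if frames == 0 or rate == 0:
                return None
            return wav.getnchannels(), rate, frames / rate
    except (wave.Error, EOFError, OSError):
        # wave.Error: not a RIFF/WAVE file; EOFError: file too short for a header
        return None
```

Calling describe_wav("enregistrement.wav") before the recognition step would let you write the error report early when it returns None.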
However, this script is not enough on its own: you also have to add the receiving and sending tasks. I coded this part in C++, so you need a C++ compiler such as g++ on your machine.
#include <cstdlib>
#include <cstring>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <unistd.h>

#define PORT 8080
#define MAX_CONNECTIONS 5

// Returns true if the file is missing or has a size of zero
bool is_file_empty(const std::string& filename) {
    struct stat file_stat;
    if (stat(filename.c_str(), &file_stat) != 0) {
        return true;
    }
    return file_stat.st_size == 0;
}

int main() {
    int serverSocket, recSocket, sendSocket;
    struct sockaddr_in serverAddr{}, clientAddr{};
    socklen_t addrSize = sizeof(clientAddr);
    char buffer[1024] = {0};

    // Create the server socket (socket() returns -1 on failure)
    if ((serverSocket = socket(AF_INET, SOCK_STREAM, 0)) < 0) {
        std::cerr << "Erreur de création de socket" << std::endl;
        return -1;
    }
    serverAddr.sin_family = AF_INET;
    serverAddr.sin_addr.s_addr = INADDR_ANY;
    serverAddr.sin_port = htons(PORT);

    // Bind the socket to the port
    if (bind(serverSocket, (struct sockaddr *)&serverAddr, sizeof(serverAddr)) < 0) {
        std::cerr << "Échec de la liaison" << std::endl;
        close(serverSocket);
        return -1;
    }

    // Listen for incoming connections
    if (listen(serverSocket, MAX_CONNECTIONS) < 0) {
        std::cerr << "Échec de l'écoute" << std::endl;
        close(serverSocket);
        return -1;
    }
    std::cout << "Serveur en écoute sur le port " << PORT << std::endl;

    while (true) {
        std::cout << "En attente de connexion..." << std::endl;
        addrSize = sizeof(clientAddr);  // accept() overwrites this value
        recSocket = accept(serverSocket, (struct sockaddr *)&clientAddr, &addrSize);
        if (recSocket < 0) {
            std::cerr << "La connexion utile à la réception a échouée" << std::endl;
            continue;
        }
        std::cout << "Connexion acceptée" << std::endl;

        // Receive the file
        ssize_t bytesRead;
        std::ofstream outfile("enregistrement.wav", std::ios::binary);
        if (!outfile.is_open()) {
            std::cerr << "Erreur d'ouverture du fichier enregistrement.wav" << std::endl;
            close(recSocket);
            continue;
        }
        bool receivedData = false;
        // recv() returns 0 once the client closes its side: that marks the end of the file
        while ((bytesRead = recv(recSocket, buffer, sizeof(buffer), 0)) > 0) {
            outfile.write(buffer, bytesRead);
            receivedData = true;
        }
        outfile.close();

        // Close the socket once the audio file has been received
        close(recSocket);
        if (!receivedData || is_file_empty("enregistrement.wav")) {
            std::cerr << "Fichier reçu est vide, aucun traitement effectué" << std::endl;
            continue;  // Move on to the next connection
        }
        std::cout << "Fichier reçu avec succès" << std::endl;

        // Run the Python script
        std::cout << "Exécution du script Python..." << std::endl;
        int result = system("python3 recognize-fr.py");
        if (result != 0) {
            std::cerr << "Échec de l'exécution du script Python" << std::endl;
        } else {
            std::cout << "Script Python exécuté avec succès" << std::endl;
        }

        // Wait for the client to reconnect in order to fetch the report
        addrSize = sizeof(clientAddr);
        sendSocket = accept(serverSocket, (struct sockaddr *)&clientAddr, &addrSize);
        if (sendSocket < 0) {
            std::cerr << "La connexion utile à l'envoi a échouée" << std::endl;
            continue;  // Move on to the next connection
        }

        // Read the content of rapport.txt
        std::ifstream reportFile("rapport.txt");
        if (!reportFile.is_open()) {
            std::cerr << "Échec de l'ouverture de rapport.txt" << std::endl;
            close(sendSocket);  // Close the socket on failure
            continue;           // Move on to the next connection
        }
        std::stringstream reportBuffer;
        reportBuffer << reportFile.rdbuf();
        std::string reportContent = reportBuffer.str();
        reportFile.close();

        // Send the content of rapport.txt to the client
        ssize_t sentBytes = send(sendSocket, reportContent.c_str(), reportContent.size(), 0);
        if (sentBytes < 0) {
            std::cerr << "Échec de l'envoi du rapport" << std::endl;
        } else {
            std::cout << "Rapport envoyé avec succès (" << sentBytes << " bytes)" << std::endl;
        }
        close(sendSocket);
    }

    // Close the server socket (in theory, this line is never reached)
    close(serverSocket);
    return 0;
}
Compile with this command:
g++ -std=c++11 -o server-side main.cpp
Before launching the server, make sure that your Python virtual environment is activated. Otherwise the python3 call made by the server will not find the SpeechRecognition module and recognition will fail!
./server-side
If all goes well you should see this:
./server-side
Serveur en écoute sur le port 8080
En attente de connexion...
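If you want to exercise the server before the ESP32 side is ready, a small Python client can play the same role. The sketch below assumes the two-connection exchange implemented by the server: a first connection streams the WAV file and closes, and a second connection reads the report. The function names send_wav and fetch_report, and the timeout values, are my own choices.

```python
import socket


def send_wav(host, port, wav_path, chunk_size=4096):
    """First connection: stream the file, then close the socket so the
    server sees the end of the stream. Returns the number of bytes sent."""
    sent = 0
    with socket.create_connection((host, port), timeout=10) as sock:
        with open(wav_path, "rb") as wav:
            while True:
                chunk = wav.read(chunk_size)
                if not chunk:
                    break
                sock.sendall(chunk)
                sent += len(chunk)
    return sent


def fetch_report(host, port, timeout=60):
    """Second connection: read until the server closes the socket and
    return the first line of what it sent."""
    data = b""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            chunk = sock.recv(1024)
            if not chunk:
                break
            data += chunk
    return data.decode("utf-8", errors="replace").split("\n", 1)[0]
```

For example, calling send_wav("<your-server-ip>", 8080, "test.wav") and then fetch_report("<your-server-ip>", 8080) should return either the recognized sentence or "Erreur de reconnaissance".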
But if you see this, don’t worry:
./server-side
Échec de la liaison
This means that another process is already using port 8080, which our server needs to communicate with the ESP32 (often it is a previous instance of server-side that is still running). You can check which process is using this port:
sudo lsof -i:8080
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
server-si 34293 root 3u IPv4 278644 0t0 TCP *:http-alt (LISTEN)
You can then kill it using its PID:
sudo kill -9 34293
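If you prefer to check the port from a script rather than with lsof, the same test can be sketched in Python by attempting the bind that the server performs. The function name port_is_free is hypothetical, and this only tells you whether the port is taken, not by which process.

```python
import socket


def port_is_free(port, host="0.0.0.0"):
    """Try to bind the port on all interfaces, as the server does.
    Returns True if the bind succeeds, False if the OS refuses it
    (port already in use, or a port below 1024 without root rights)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False
    return True
```

Calling port_is_free(8080) before starting ./server-side gives a quick yes/no answer.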
And it should now work! One last piece of configuration is needed, though. We are going to run two programs in parallel on our Linux machine, so it is preferable to give each of them its own session (this is also necessary if you use SSH like me, so that the programs keep running after you disconnect).
To do this, install screen which is a terminal multiplexer:
sudo apt install screen
Now every time you run the server-side script, do this in order:
- Create a new session :
screen -S <name-of-your-session>
- Or connect to a pre-existing session that you created :
screen -r <name-of-your-session>
- Activate your virtual environment
- Run
./server-side
- Press Ctrl+A, then D, to detach from the session
Managing sending/receiving on the ESP32 for voice recognition
The following C++ files send the audio file and receive the recognized character string:
wavserv.h:
#ifndef WAV_H_
#define WAV_H_
#include <WiFi.h>
#include <FS.h>
#include <SD.h>
#include <SPI.h>
String recowav();
#endif
wavserv.cpp:
#include <WiFi.h>
#include <FS.h>
#include <SD.h>
#include <SPI.h>
#include <wavserv.h>

// Server configuration
const char* serverIP = "<your-server-ip>";
const int serverPort = 8080;

// SPI pins
#define SCK 18
#define MISO 19
#define MOSI 23
#define CS 5
#define BUFFER_SIZE 4096

String recowav() {
    // Initialize SPI and the SD card
    SPI.begin(SCK, MISO, MOSI, CS);
    if (!SD.begin(CS)) {
        Serial.println("Card Mount Failed");
        return "error";
    }
    uint8_t cardType = SD.cardType();
    if (cardType == CARD_NONE) {
        Serial.println("No SD card attached");
        return "error";
    }

    // Create the socket
    WiFiClient client;
    if (!client.connect(serverIP, serverPort)) {
        Serial.println("Connection to server failed");
        return "error";
    }

    // Open the .wav file to send
    File file = SD.open("/audio.wav");
    if (!file) {
        Serial.println("Failed to open file for reading");
        client.stop();  // Do not leave the connection open on failure
        return "error";
    }

    // Read and send the .wav file chunk by chunk
    char buffer[BUFFER_SIZE];
    while (file.available()) {
        size_t bytesRead = file.read((uint8_t*)buffer, BUFFER_SIZE);
        if (client.write((const uint8_t*)buffer, bytesRead) != bytesRead) {
            Serial.println("Error sending file");
            break;
        }
    }
    file.close();

    // Close the socket to signal the end of the file transfer
    client.flush();
    client.stop();

    // Reconnect to receive the report
    if (!client.connect(serverIP, serverPort)) {
        Serial.println("Connection to server failed");
        return "error";
    }

    // Receive the content of rapport.txt
    Serial.println("En attente de la réponse du serveur...");
    while (client.connected() || client.available()) {
        if (client.available()) {
            String report = client.readStringUntil('\n');  // Read until the newline character
            client.stop();
            return report;
        }
    }

    // The server closed the connection without sending anything
    client.stop();
    return "error";
}
Enable the Google Assistant SDK
We will now see how to connect your homemade voice assistant to your Google Home application, so that you can control all of your connected devices exactly as if you were talking to an official Google voice assistant! I made this choice to avoid getting lost in lots of different APIs, although you can go that way if you wish. Here Google does not process any audio directly: the server simply sends a text command to your Google Assistant using your account.
Setting up the Google Assistant SDK is quite complex, so follow the official guide.
If you have problems with google-oauthlib-tool, particularly the --headless parameter, do this:
- Execute
screen -S auth
- Execute
source env/bin/activate
- Execute (modify the command to match your secret file)
google-oauthlib-tool --scope https://www.googleapis.com/auth/assistant-sdk-prototype --save --client-secrets </path/to/client_secret_client-id.json>
- Complete authentication on any device using Chrome
- At this stage, you should see a "failed to load website" page
- Open Chrome DevTools (F12)
- Go to the Network tab
- Reload the web page
- On the entry that appears, click "Copy as cURL"
- On your Linux machine, press Ctrl+A, then D, to detach from the screen session
- Paste the copied command in the terminal
Connect Google Assistant with ESP32
We are going to create an intermediate server that accepts HTTP requests and executes bash commands. For security reasons, the secret identifiers specific to our device are not stored on it: each request passes them as arguments!
- Initialize a new Node.js project
Use npm to initialize a new project. This will create a package.json file where information about your project and its dependencies will be stored.
npm init -y
- Install Express
Express is a minimalist framework for Node.js that makes it easy to create web servers. Install Express as a dependency in your project.
npm install express
- Create the Server
Create a server.js file in your project directory. This file will contain the server code.
touch server.js
Open server.js in a text editor and add the following code:
const express = require('express');
const { execFile } = require('child_process');
const app = express();

// API key used to secure requests
const API_KEY = 'VOTRE_CLE_API_GENERATED'; // Replace with the generated API key

app.use(express.json());

// Middleware that checks the API key
app.use((req, res, next) => {
    const apiKey = req.header('x-api-key');
    if (apiKey !== API_KEY) {
        return res.status(403).send('Accès refusé');
    }
    next();
});

app.post('/execute', (req, res) => {
    const { deviceId, deviceModelId, phrase } = req.body || {};

    // Check the parameters: all three must be non-empty strings
    const params = [deviceId, deviceModelId, phrase];
    if (!params.every((p) => typeof p === 'string' && p.length > 0)) {
        return res.status(400).send('Paramètres manquants : deviceId, deviceModelId ou phrase');
    }

    // Pass the parameters as an argument array. execFile does not go through a shell,
    // so quotes or $(...) inside the phrase cannot inject extra commands.
    execFile('./run_assistant.sh', params, (error, stdout, stderr) => {
        if (error) {
            return res.status(500).send(`Erreur d'exécution : ${error.message}`);
        }
        if (stderr) {
            return res.status(500).send(`Erreur de commande : ${stderr}`);
        }
        res.send(stdout);
    });
});

const PORT = 3000; // Choose the port you want to use
app.listen(PORT, () => {
    console.log(`Serveur en écoute sur le port ${PORT}`);
});
- Create the run_assistant.sh script, which executes the necessary Bash commands. Place this file in the same directory as server.js.
#!/bin/bash
# Read the parameters
DEVICE_ID=$1
DEVICE_MODEL_ID=$2
PHRASE=$3
# Activate the virtual environment
source ~/prog/cristal-env/bin/activate
echo "Environnement activé."
# Get the current date and time
CURRENT_DATETIME=$(date '+%d-%m-%Y %H:%M:%S')
# Build the full command (only used for the log entry below)
COMMAND="python -m googlesamples.assistant.grpc.textinput --device-id $DEVICE_ID --device-model-id $DEVICE_MODEL_ID"
# Write the command and the phrase to the log file with the timestamp
echo "[$CURRENT_DATETIME] Command: $COMMAND, Phrase: \"$PHRASE\"" >> command_log.txt
# Run the expect script
expect ./send_command.exp "$DEVICE_ID" "$DEVICE_MODEL_ID" "$PHRASE"
The line echo "[$CURRENT_DATETIME] Command: $COMMAND, Phrase: \"$PHRASE\"" >> command_log.txt appends the full command and phrase to the command_log.txt file, prefixed with the timestamp.
- Make sure the script is executable:
chmod +x run_assistant.sh
- Install expect (if necessary):
On Ubuntu, you can install expect with the following command:
sudo apt-get install expect
- Create an Expect Script:
We will create an expect script that sends the phrase after detecting the prompt. Save it as send_command.exp in the same directory, since that is the name run_assistant.sh calls:
#!/usr/bin/expect
# Read the arguments
set device_id [lindex $argv 0]
set device_model_id [lindex $argv 1]
set phrase [lindex $argv 2]
# Launch the Python command
spawn python -m googlesamples.assistant.grpc.textinput --device-id $device_id --device-model-id $device_model_id
# Wait for the prompt
expect ": "
# Send the phrase and press Enter
send "$phrase\r"
# Wait for the process to finish
expect eof
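Since the phrase comes from an HTTP request and ends up on a command line, it is worth making sure it is always passed as a single argument and never re-parsed by a shell. The Python sketch below illustrates the idea with an argument list. The names run_with_args and assistant_argv are hypothetical, and this illustrates the principle only; it does not replace the Node.js server.

```python
import subprocess


def run_with_args(program, *args):
    """Run a program with an explicit argument list. No shell is involved,
    so quotes, semicolons or $(...) inside an argument stay plain text."""
    completed = subprocess.run(
        [program, *args], capture_output=True, text=True, check=True
    )
    return completed.stdout


def assistant_argv(device_id, device_model_id, phrase):
    """Argument list matching the parameters run_assistant.sh reads ($1, $2, $3)."""
    return ["./run_assistant.sh", device_id, device_model_id, phrase]
```

With this form, a phrase such as turn on the light"; echo pwned reaches the script as one literal string instead of being executed.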
As you have probably noticed, you need to define an API key on your server, and you will reuse it on your ESP32. Set const API_KEY to a long, random character string. You can, for example, generate this secret key with:
node -e "console.log(require('crypto').randomBytes(50).toString('hex'))"
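If Node.js is not at hand, an equivalent key can be generated with Python's standard secrets module. This is a minimal sketch; the function name generate_api_key is my own, and the default of 50 bytes simply mirrors the Node.js command.

```python
import secrets


def generate_api_key(num_bytes=50):
    """Return num_bytes of cryptographically strong randomness as a hex
    string (two hexadecimal characters per byte)."""
    return secrets.token_hex(num_bytes)
```

Print the result of generate_api_key() once and paste it into both server.js and your ESP32 code.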
All you have to do now is leave the intermediate server listening on port 3000, using the same steps as for the voice recognition server: create a second session with screen (with a different name, of course), run the server with the command node server.js, then detach from the session.
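Before moving to the ESP32, you may want to check the intermediate server from your computer. The sketch below builds the same POST request with Python's standard library: the /execute route, the x-api-key header and the three JSON fields come from server.js, while the function names and the timeout are assumptions of mine.

```python
import json
import urllib.request


def build_execute_request(base_url, api_key, device_id, device_model_id, phrase):
    """Build the POST request expected by the /execute route.
    json.dumps takes care of escaping quotes inside the phrase."""
    body = json.dumps(
        {"deviceId": device_id, "deviceModelId": device_model_id, "phrase": phrase}
    ).encode("utf-8")
    return urllib.request.Request(
        base_url + "/execute",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json", "x-api-key": api_key},
    )


def send_phrase(base_url, api_key, device_id, device_model_id, phrase, timeout=30):
    """Send the request and return (status, body). urlopen raises
    urllib.error.HTTPError for 4xx and 5xx answers such as 403 or 500."""
    request = build_execute_request(base_url, api_key, device_id, device_model_id, phrase)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8")
```

For example, send_phrase("http://<your-server-ip>:3000", "<your-api-key>", "<device-id>", "<device-model-id>", "allume la lumière") should return status 200 and the output of the assistant.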
Managing sending/receiving on the ESP32 for Google Assistant
The following C++ files (the header first, then the implementation) send the command for Google Assistant along with the secret identifiers of the device.
#ifndef GASDK_H_
#define GASDK_H_
#include <WiFi.h>
#include <HTTPClient.h>
void exec_com_assistant(String apiKey, String deviceId, String modelId, String phrase);
#endif
#include <WiFi.h>
#include <HTTPClient.h>

const char* serverUrl = "http://<your-server-ip>:3000/execute";

// Escape backslashes and double quotes so the value stays valid inside a JSON string
static String jsonEscape(String value) {
    value.replace("\\", "\\\\");
    value.replace("\"", "\\\"");
    return value;
}

void exec_com_assistant(String apiKey, String deviceId, String modelId, String phrase) {
    HTTPClient http;

    // Start the POST request
    http.begin(serverUrl);
    http.addHeader("Content-Type", "application/json");
    http.addHeader("x-api-key", apiKey.c_str());

    // JSON request body with the parameters
    String jsonPayload = "{\"deviceId\": \"" + jsonEscape(deviceId) +
                         "\", \"deviceModelId\": \"" + jsonEscape(modelId) +
                         "\", \"phrase\": \"" + jsonEscape(phrase) + "\"}";

    // Send the request
    int httpResponseCode = http.POST(jsonPayload);
    if (httpResponseCode > 0) {
        String response = http.getString();
        Serial.println("Réponse du serveur : " + response);
    } else {
        // Negative codes are client-side errors (connection refused, timeout...)
        Serial.println("Échec de la requête : " + http.errorToString(httpResponseCode));
    }
    http.end();
}
Conclusion
Once the server is correctly configured, and in particular the two virtual terminals managed by screen, the following configuration should be running permanently:
- session cristalWAV -> listening on port 8080
- session cristal GOOGLE -> listening on port 3000
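To check this configuration at a glance, here is a small Python sketch that tests whether something accepts TCP connections on both ports. The function names are hypothetical. Note that the probe opens and closes a real connection: on port 8080 the voice recognition server treats it as an empty upload and ignores it, so run the check while no recognition is in progress.

```python
import socket


def is_listening(port, host="127.0.0.1", timeout=1.0):
    """Return True if a TCP connection to host:port is accepted."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise
        return probe.connect_ex((host, port)) == 0


def check_services(ports=(8080, 3000)):
    """Map each port to True (listening) or False (nothing answering)."""
    return {port: is_listening(port) for port in ports}
```

Running check_services() on the server returns a dictionary with one entry per port, with True for each port where a server is answering.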
Author: Romain MELLAZA
Publication date: July 11, 2024