Project Key Notes
Project: Trixy
Programming language: Python, PyTorch
Project Type: Server / Client / Standalone voice assistant
Project Terms: Client = Satellite; "Satellite registered" != "Satellite connected"
Network: Command Socket (Port 2101), Raw Audio Input Live Stream (Port 2102, 16 kHz, Mono), Raw Audio Output Live Stream (Port 2103, 16 kHz, Mono), Raw Music Output Live Stream (Port 2104, 48 kHz, Stereo), Multi-Client support
Plugin: Full plugin system, dynamically loaded from the ./plugins/*/ directory. Each plugin gets its own folder containing at least a main.py and a config.json. Plugins can be enabled/disabled (via a variable in the config.json)
Wakeword Detection: Custom-made PyTorch model running only on the client
Voice Recognition: Custom-made PyTorch model running only on the client
Event Handler: Main part of the project. Everything is called through and uses the event handler
Container and Factory: The main parts, such as the event handler, plugin system, and satellite manager, are stored in a container on the server and registered in the application system. Plugins gain access to the application system and from there they can trigger events, access the satellites to transfer audio, and so on.
User Interface: Textual TUI and tmux. Starts with the system. Maintenance is done via an SSH connection by attaching to the tmux session. No "print()" calls in the prod system; in debug/dev mode there is no Textual UI, only print output
Schedule: The voice assistant has a schedule manager that triggers events
Satellite Status: registered, not connected, connected (a satellite must be registered before it is allowed to connect to the server. A registered satellite gets a registration file (JSON format) on the server, keyed by MAC address and containing the room, alias name, and an id)
Event Handler
The Event Handler is the main component; everything runs through it.
Almost everything raises an event, and both the system and plugins can react to the raised events.
There is a decorator "@TrixyEvent(["event_name", ...])" you can put above a function. That function will be called whenever one of the listed events is triggered:
Example:
.....
event_data = WakewordReceived_EventData()
event_handler.trigger("wakeword_received", event_data)
.....
class MyPlugin(TrixyPlugin):
    @TrixyEvent(["wakeword_received"])
    def on_wakeword_received(self, event_name, event_data):
        pass

    @TrixyEvent(["satellite_connected", "satellite_disconnected"])
    def on_satellite_connection_changed(self, event_name, satellite_data):
        pass
Command Socket
Basic information
- Uses the Trixy Protocol, a custom protocol
- It is not HTTP and not an API.
- It is more of a data stream, like in ICQ Messenger, MSN Messenger, or in online games like Counter-Strike.
- The protocol sends a serialized Python class instance
- The first 4 bytes are a magic number "TRXI"
- followed by the version: major:int (4 byte binary), minor:int (4 byte binary), revision:int (4 byte binary)
- followed by a datetime in binary
- followed by options (32 bit binary, flags)
bit 1: if set, the serialized string is gz compressed
bit 2: if set, the serialized string is encrypted
bit 3: if set, the serialized string is in JSON format, otherwise it's binary
bit 4: if set, the receiver must send a "received" response
bit 5: if set, the serialized class is base64 encoded
bit 6: if set, the message is a multi-part message
bit 7: if set, it's a serialized dict, not a class; the class name may be empty and the class name length may be 0
bit 8: if set, silent - no debug log for sending/receiving this message
... other state-of-the-art option flags
- followed by the MD5 checksum of the serialized class instance
- followed by str.length:int (4 byte binary) for the class name
- followed by the class name
- followed by the length:int (4 byte binary) of the serialized class instance
- followed by the serialized class instance.
- When the message is received, it is deserialized into the specific class (by class name), or, when option bit 7 is set, into a dict.
- All classes that are de-/serialized for the command socket are stored in ./trixy_core/network/cmd/*.py
- There may be some hard-coded command messages that do not follow this protocol, to save network traffic. Those commands are only a few bytes and they all start with "TRXI". They are not de-/serialized; they are handled and sent directly by the network component (hard coded):
- TRXINOOP (noop command for keep-alive / heartbeat)
- TRXIPING (ping command)
- TRXIPONG (pong command)
- TRXIPRNT (simply prints a string, making it easy to test whether the connection was established)
- TRXIHELO (hello command for debugging; simply prints "Hello" when received)
- ... some other state-of-the-art commands
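The framing above could be packed like this. The magic number, version fields, flag bits, MD5, and the length-prefixed class name and payload come from the notes; the exact binary layout of the datetime (here an 8-byte unix timestamp double), the byte order, and the flag bit positions are assumptions:

```python
# Sketch of Trixy Protocol framing. Layout details marked below are
# assumptions; the notes only specify the fields, not their encoding.

import hashlib
import json
import struct
import time

# Flag bits from the notes (bit 1 taken as the LSB; bit order is an assumption)
FLAG_GZIP      = 1 << 0
FLAG_ENCRYPTED = 1 << 1
FLAG_JSON      = 1 << 2
FLAG_ACK       = 1 << 3
FLAG_BASE64    = 1 << 4
FLAG_MULTIPART = 1 << 5
FLAG_DICT      = 1 << 6
FLAG_SILENT    = 1 << 7

def pack_message(class_name: str, payload: dict,
                 flags: int = FLAG_JSON | FLAG_DICT) -> bytes:
    body = json.dumps(payload).encode("utf-8")
    name = class_name.encode("utf-8")
    msg = b"TRXI"                                 # magic number
    msg += struct.pack(">iii", 1, 0, 0)           # major, minor, revision
    msg += struct.pack(">d", time.time())         # datetime (assumed: 8-byte double)
    msg += struct.pack(">I", flags)               # 32-bit option flags
    msg += hashlib.md5(body).digest()             # 16-byte MD5 checksum
    msg += struct.pack(">i", len(name)) + name    # class name length + name
    msg += struct.pack(">i", len(body)) + body    # payload length + payload
    return msg
```

The receiver would parse the same fields in order and, when the dict flag is set, skip class lookup and hand back the deserialized dict directly.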
Basic Wakeword info
- custom PyTorch model containing 2 wakewords ("custom" and "system command"... only with the "system command" wakeword are you allowed to issue administration commands/intents like set time, reboot system, ...)
- The PyTorch file (.pth is a .zip file with a different file extension) is password protected
- The PyTorch file must contain a meta file (.json) with information about the model.
- model finetuning can be done at any time
Basic Voice Recognition info
- custom PyTorch model
- The PyTorch file (.pth is a .zip file with a different file extension) is password protected
- The PyTorch file must contain a meta file (.json) with information about the model and all known speakers: "speaker": [...], where the speaker id is the array index.
- model finetuning can be done at any time and new speakers may be added
Workflows
Satellite Registration
- The server must be in "Registration Mode".
- When entering registration mode, the server stays in it for 60 seconds, then automatically switches back to normal mode.
- While in registration mode, the server allows unknown satellites to connect.
- While in registration mode, the first unknown satellite that tries to connect will be registered and gets a registration file on the server.
- During registration, multiple commands may be exchanged between satellite and server. The server will ask for the room, alias name, and MAC address.
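Based on the fields above, a registration file could look like this (the exact field names and layout are assumptions; the notes only require the room, alias name, and an id, keyed by MAC address):

```json
{
  "id": "sat-001",
  "mac_address": "AA:BB:CC:DD:EE:FF",
  "room": "kitchen",
  "alias": "Kitchen Satellite"
}
```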
Satellite Connection
- On system start, the satellite automatically connects to the server using the address and port from satellite_config.json
- When the connection is lost or fails, the satellite retries connecting to the server after 5 seconds.
- When a satellite has connected, it sends basic information (room id, alias, MAC address, Trixy client version, ports for the audio sockets) to the server.
- The server compares the information. The MAC address must be valid, registered, and must not be on the blacklist.
- The server sends an accepted/denied command to the satellite and triggers a "satellite_connected" event; when accepted, it also sends the ports for the audio stream socket connections.
- The server creates a satellite instance and adds it to the satellite manager.
- The satellite connects the audio streaming sockets and monitors the connections.
- When a satellite connection is lost/closed, the server triggers a "satellite_disconnected" event.
Wakeword Detection
- Starts on the satellite
- When a wakeword is detected:
- Pause wakeword detection
- Start recording the microphone into a buffer
- Send a command to the server (including wakeword id, speaker id, speaker name)
- After 10 seconds without receiving a "selected" command from the server, the audio buffer is deleted
- When the server receives a "wakeword detected" command from a satellite, it waits 1 second to check whether other satellites also detected the wakeword
- The server raises an event on the event handler.
- The server selects the satellite with the highest volume (when you have a satellite in the kitchen and one in the living room and you say the wakeword in the kitchen, the satellite in the living room will also notice it... but the volume in the kitchen was higher)
- The server starts a conversation session
- The server sends a command to the selected satellite (with the conversation session id).
- The selected satellite sends the buffer through the raw audio input stream, then keeps streaming live audio.
- After 3 seconds of silence, the client stops transferring (the client checks every chunk before sending for silence and tracks how many chunks, i.e. how many seconds, were silent).
- 60 seconds is the max recording time.
- When the recording is stopped, either by the 60 second timeout, 3 seconds of silence, or an "abort" command from the server, the client sends a command that the recording is done.
- Info: Silence detection only starts after a non-silent chunk has been detected... so before silence detection can start, it must be loud first.
- When the "recording done" command is received, the server triggers a "raw audio input received" event with conversation id, audio data, speaker id, speaker name, and other state-of-the-art info
- Plugins may react to the event, like a "Mozilla DeepSpeech STT" plugin that converts the raw audio data to text... and triggers a "text received" event.
- Plugins may react to the "text received" event, like an NLP plugin that converts the text into an intent and triggers an "intent received" event.
- Plugins may react to the "intent received" event, like a "HomeMatic" plugin that, for example, switches on the light... and a "TTS" plugin that creates a wave file or raw audio data with the spoken text... and triggers a "tts received" event...
- "tts received" is a native server event, and the server sends the TTS audio data through the audio output stream (16 kHz, Mono).
- Plugins may ask questions and wait for raw audio input.
By the way, the satellite automatically buffers and plays all audio data sent through the audio output and music output socket streams.
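The per-chunk silence counter described above could look like this. The 3-second stop rule, the "only after speech" rule, and the 16 kHz mono format come from the notes; the RMS threshold value and the 100 ms chunk size are assumptions:

```python
# Sketch of the client-side silence detection (16 kHz mono, 16-bit PCM).
# SILENCE_RMS and CHUNK_SAMPLES are assumed values, not from the notes.

SAMPLE_RATE = 16_000
CHUNK_SAMPLES = 1_600          # 100 ms per chunk (assumption)
SILENCE_RMS = 500              # assumed threshold on 16-bit samples
STOP_AFTER_SECONDS = 3.0       # from the notes

class SilenceDetector:
    def __init__(self):
        self.heard_speech = False   # silence counting starts only after speech
        self.silent_seconds = 0.0

    def feed(self, samples) -> bool:
        """Feed one chunk of int samples; return True when streaming should stop."""
        rms = (sum(s * s for s in samples) / len(samples)) ** 0.5
        if rms >= SILENCE_RMS:
            self.heard_speech = True
            self.silent_seconds = 0.0       # any loud chunk resets the counter
        elif self.heard_speech:
            self.silent_seconds += len(samples) / SAMPLE_RATE
        return self.silent_seconds >= STOP_AFTER_SECONDS
```

The client would call feed() on every chunk before sending it, and stop transferring (and send the "recording done" command) once it returns True or the 60-second cap is hit.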
Here is a basic conversation that should be possible, with the wakeword "Trixy":
User: Trixy, Please order a pizza
Trixy: Ok, what kind of pizza do you want?
User: A spicy one
Trixy: Ok, I will send an order to your local pizza service
As you can see, the wakeword was only said once. The "PizzaOrder" plugin asked a question within the conversation and waited for an answer... and after the user gave the answer and silence was detected again, the plugin received the final text. The event handler was triggered multiple times, but the conversation id did not change... so the system knew that the final text was meant for the PizzaOrder plugin.
I think the best way to solve this is: run STT, then send the text to the NLP, which creates the intent, and then the plugin is triggered by the "intent received" event. The plugin notices that the type of pizza is missing and asks for it... then the user says something... this again triggers the audio received event... STT runs again and triggers the intent handler... the intent handler notices that there is already an ongoing conversation (because the conversation id is passed along the whole way), and it re-triggers the same intent, but this time with the type of pizza.
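The conversation-id bookkeeping in that flow could be sketched like this. Only the idea of resuming a pending intent via an unchanged conversation id comes from the notes; the class, the slot name "pizza_type", and the intent dict shape are hypothetical:

```python
# Sketch of conversation-id tracking in an intent handler.
# All names (IntentHandler, pizza_type, order_pizza) are assumptions.

class IntentHandler:
    def __init__(self):
        self.pending = {}  # conversation_id -> intent waiting for a missing slot

    def on_text_received(self, conversation_id, text):
        intent = self.pending.pop(conversation_id, None)
        if intent is None:
            # First utterance of this conversation: build a fresh intent.
            intent = {"name": "order_pizza", "slots": {}}
        else:
            # Same conversation id: the new text answers the open question.
            intent["slots"]["pizza_type"] = text
        if "pizza_type" not in intent["slots"]:
            # Plugin will ask for the type; remember the intent under this id.
            self.pending[conversation_id] = intent
        return intent
```

A real version would of course extract slots with the NLP plugin instead of hard-coding one slot; the point is only that the unchanged conversation id routes the follow-up answer back to the pending intent.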
Server / Client / Standalone
Only the server and standalone versions have the event handler, plugins, etc. The client only has wakeword detection, voice recognition, and the socket streams with the network component.
The standalone version has wakeword detection and voice recognition as well as the event handler, plugins, and so on... A connection to the server is not required.
The standalone version may connect to the server to update the ML models and to exchange information such as calendar entries, music, assets, and things plugins can handle... For example, there may be an "address book" plugin or a "notes" plugin... and then the notes and addresses may be synchronized when the standalone connects to the server.
The standalone version will try to connect to the server every 30 seconds (defined in ./config/standalone_config.json)
Debugging
- With the --debug startup argument you can enter dev mode.
- In dev mode there is no TUI, but there is print output.
- A pprint(:str) function is used for all debug output. In dev mode this function uses print(:str); in prod mode it logs the output and shows it in a Log-Widget.
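The pprint() helper could be as small as this (the logging target and the module-level DEV_MODE switch are assumptions; the name intentionally follows the notes, even though it shadows the stdlib pprint module):

```python
# Sketch of the pprint() debug helper. DEV_MODE would be set from the
# --debug startup argument; the logger name "trixy" is an assumption.

import logging

DEV_MODE = True  # set from the --debug startup argument

def pprint(message: str) -> None:
    if DEV_MODE:
        print(message)                             # dev mode: plain console output
    else:
        logging.getLogger("trixy").info(message)   # prod: log + Log-Widget feed
```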
Schedule Manager
- There may be multiple schedule entries
- Each schedule has a unique name.
- A schedule may have multiple triggers such as date, time, event (from the event handler), weekday, and other state-of-the-art triggers
- A schedule may trigger an event, start a specific ML training, or start an internal function. Multiple actions are allowed.
- Contains other state-of-the-art features
Basic Programming strategies
Satellites and Satellite Manager
There should be a satellite manager that stores every registered satellite. To make programming plugins easy, we need simple access to the satellites. The access point for satellites is the satellite manager class. The manager contains an array of instances of the satellite class. Each registered satellite gets an instance, no matter whether it is connected or not.
This instance stores all information as properties with getters and setters, such as (last known) room id, MAC address, alias name, IP, command socket, raw audio input socket, raw audio output socket, raw music output socket, reference to the last conversation, ... as well as functions like "say", "disconnect", ...
The manager also has functions like "disconnect()", "disconnectAll()", "reconnect()", "reconnectAll()", ...
It also has functions to find a satellite by room id, MAC address, connection state, ... or to find multiple satellites.
It would be good if you could also select a satellite by index directly from the manager, like "satellite_manager[0]" returning the satellite at array index 0... or having selectors (case insensitive) when the key is a string, like satellite_manager["status = connected, room = kitchen"]... and when multiple satellites match the selection, the operation is applied to all satellites matching the selector... I don't know if this is possible in Python, but it would be cool.
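This is possible in Python via __getitem__. A minimal sketch, assuming a flat "field = value, field = value" selector grammar and simple Satellite fields (both assumptions; the notes only give the two access examples):

```python
# Sketch of index and selector access on the satellite manager.
# The selector grammar and Satellite fields are assumptions.

class Satellite:
    def __init__(self, room, status):
        self.room = room
        self.status = status

class SatelliteManager:
    def __init__(self, satellites):
        self._satellites = list(satellites)

    def __getitem__(self, key):
        if isinstance(key, int):
            return self._satellites[key]          # satellite_manager[0]
        # Parse "field = value, field = value" (case insensitive).
        wanted = {}
        for part in key.split(","):
            field, _, value = part.partition("=")
            wanted[field.strip().lower()] = value.strip().lower()
        # Return every satellite whose fields match all selector pairs.
        return [s for s in self._satellites
                if all(str(getattr(s, f, "")).lower() == v
                       for f, v in wanted.items())]
```

Applying a function to every match ("say" on all kitchen satellites, for instance) would then just be a loop over the returned list, or a thin proxy object wrapping it.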
Plugins
Plugins have their own directory in the plugins folder. Each plugin has a main.py and a config.json. The config.json is automatically loaded and stored in the plugin's config variable.
There is a plugin manager that manages all the plugins.
The main.py must contain a class extending the TrixyPlugin class. TrixyPlugin is the main plugin class every plugin must extend. It already has the property "application" (a link to the application main class), the "config" property (the plugin's config.json is loaded into it), a property with getter and setter for "enabled", as well as an "is_enabled()" function, a "reload_config()" function, and a "save_config()" function to save the config.
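The base class could look roughly like this. The property and method names come from the notes; the constructor signature and config-path handling are assumptions:

```python
# Sketch of the TrixyPlugin base class. Constructor shape and file
# handling are assumptions; the attribute names come from the notes.

import json
import os

class TrixyPlugin:
    def __init__(self, application, plugin_dir):
        self.application = application      # link to the application main class
        self.plugin_dir = plugin_dir        # e.g. ./plugins/myplugin
        self.config = {}
        self.reload_config()                # loads config.json into self.config

    @property
    def enabled(self):
        return bool(self.config.get("enabled", False))

    @enabled.setter
    def enabled(self, value):
        self.config["enabled"] = bool(value)

    def is_enabled(self):
        return self.enabled

    def reload_config(self):
        with open(os.path.join(self.plugin_dir, "config.json")) as f:
            self.config = json.load(f)

    def save_config(self):
        with open(os.path.join(self.plugin_dir, "config.json"), "w") as f:
            json.dump(self.config, f, indent=2)
```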
File System
I want the following structure:
- ./ (the Trixy application is here)
- ./plugins (this is the directory where the plugins are stored)
- ./plugins/myplugin (this is an example plugin directory for "myplugin". Every plugin gets its own directory, containing a "main.py" and a "config.json")
- ./models (the PyTorch/ML models are stored here)
- ./models/wakeword (this is the directory for the wakeword models)
- ./models/wakeword/mymodel (this is an example directory for the "mymodel" wakeword model... the .pth file is in here)
- ./models/voice_recognition (this is the directory for the voice recognition models. Same as with wakeword, there is a subdirectory per model, and the .pth file is in it)
- ./config (the configuration files for the server/client are in this directory)
- ./trixy_core (all core Python scripts are in here, well structured by component)
- ./trixy_core/arbitration (Python scripts for the arbitration component)
- ./trixy_core/config (Python scripts for the config component)
- ./trixy_core/conversation (Python scripts for the conversation component)
- ./trixy_core/events (Python scripts for the events component)
- ./trixy_core/scheduler (Python scripts for the scheduler component)
- ./trixy_core/network (Python scripts for the network component)
- ./trixy_core/assets (Python scripts for the assets component)
- ./trixy_core/...... (there are many other components)
- ./assets (the profiles are in this directory)
- ./assets/<"default"/profile_id>
- ./trainer (the trainers are in here)
- ./trainer/data (the data for the trainers is in here)
- ./trainer/data/wakeword (training data for the wakeword trainer)
- ./trainer/data/voice_recognition (training data for the voice_recognition trainer)
- ./trainer/wakeword (the wakeword trainer)
- ./trainer/voice_recognition (the voice recognition trainer)
The application can be started with:
"> python3 main.py server"
"> python3 main.py client"
"> python3 main.py standalone"
There are also some startup arguments, like --config, with which you can specify the config file that should be used... but by default the client should use "client_config.json", the server "server_config.json", and the standalone "standalone_config.json"
Assets
In the config file a profile is given. It is used and stored by the asset manager. When asking the asset manager for the path of the asset "audio/success.wav", the asset manager will look in the asset directory for the file "./assets/MyProfile/audio/success.wav"... if it doesn't exist, it will fall back to the default profile: "./assets/default/audio/success.wav". If this doesn't exist either, the asset manager will return false.
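The lookup with fallback can be sketched as a single function (returning None instead of False is a deliberate Python-idiom tweak; the function name and signature are assumptions):

```python
# Sketch of the asset lookup with profile fallback.
# Returns None where the notes say "return false" (Python idiom).

import os

def resolve_asset(asset_rel_path, profile, assets_root="./assets"):
    """Try ./assets/<profile>/<path>, then ./assets/default/<path>."""
    for candidate_profile in (profile, "default"):
        path = os.path.join(assets_root, candidate_profile, asset_rel_path)
        if os.path.exists(path):
            return path
    return None
```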
User Interface (TUI)
- Multiple views... you can switch views with F1, F2, F3, ...
- Uses Textual CSS
- Uses widgets
- The UI has a title: "Trixy Server" / "Trixy Client" / "Trixy Standalone"
- May have multiple Sub-Main Views for satellites, plugins, schedule, and ML Trainer
- A Sub-Main View overrides the F-keys and main menu for its own use. You can exit a Sub-Main View by pressing "Esc" to go back to the main view
- A Sub-Main View extends the title with the sub view name... Example: "Trixy Server - Plugins"
- The content area is scrollable.
Main View (Server)
F1 (General)
- Shows general status
- How many satellites are connected
- Hostname
- Host address
- Command Socket Port
- How many plugins are available
- How many Containers are started
- Server Up-Time
- Version
- OS / OS Version
F2 (Config)
Shows the server configuration; you can also change the configuration here.
F3 (Satellites)
- Shows a list of all registered satellites and whether they are connected or not.
- When selecting a satellite, it opens a Sub-Main View for the selected satellite
- There are also buttons to switch to registration mode, blacklist a satellite, delete a satellite registration, ...
- Auto-updates the satellite status (connected/disconnected) every 10 seconds
F4 (Plugins)
- Shows a list of all available plugins and their status (enabled / disabled)
- When selecting a plugin, it opens a new Sub-Main View for this plugin.
- There is a button to toggle the enabled status of the selected plugin
F5 (Schedule)
- Shows a list of all schedules and when they were last triggered (or "disabled" or "never")
- When selecting a schedule entry, it opens in a Sub-Main View
- You can add a new schedule or remove the selected schedule entry
F6 (ML Trainer)
- Shows a list of all ML trainers and when they last ran, or the progress (current epoch of max epochs) of the current run when they are currently training
- ML trainers are hard coded
- When selecting an ML trainer, it opens in a new Sub-Main View.
F9 (Logs)
- The individual log entries are shown in a scrollable log widget
Sub-Main View (Satellites)
F1 (Info)
- Up-time of the satellite
- Host Name / IP
- Ping (updated every 10 seconds)
- MAC Address
- Version
- OS / OS Version
F2 (Connection)
- Information about the Client/Server Connection (Command Socket Stream)
- IP / Port
- Connection Up-Time
- Status
- Bytes transferred total / today
- Information about the Client/Server Connection (Raw Audio Input Socket Stream)
- IP / Port
- Connection Up-Time
- Status
- Bytes transferred today
- Information about the Client/Server Connection (Raw Audio Output Socket Stream)
- IP / Port
- Connection Up-Time
- Status
- Bytes transferred today
- Information about the Client/Server Connection (Raw Music Output Socket Stream)
- IP / Port
- Connection Up-Time
- Status
- Bytes transferred today
F3 (Conversation)
- Information about wakeword detection: when the model was trained, which model is used, when the last detection happened
- Information about voice recognition: when the model was trained, which model is used, who was detected last
- When the last conversation happened and how long it lasted (live-updated on change)
- Whether a conversation is active right now (this label is live-updated as soon as a wakeword, silence, or timeout is detected)
F4 (Configuration)
- Configuration for the client
F5 (Updates)
- Buttons to update the wakeword model and the voice recognition model
- Button to update the scripts (the server sends a "scripts.zip" (or .tar.gz) file to the client containing all the Python scripts, and the client extracts it and overrides all the data).
- Reboot system (of the satellite)
- Buttons to update assets (default and the profile the satellite is using)
- Button to update plugins
- Button to enter "Wakeword Detected" mode... this fakes a "wakeword detected"
- Button to play back the "audio/test.wav" asset file (path depends on the asset path, using the asset manager)
- Button to play back / stop a music file using the music output socket live stream; the server sends the audio data to the client
Sub-Main View (Plugins)
F1 (Info)
- General plugin info... name, description, version, last modified time, is active, ...
- Enable / Disable Buttons
F2 (Config)
- Here are the configs for the plugin. The plugin may have a "config_view.py" script to handle the plugin configuration. If not, the view is auto-generated based on the config.json structure. (All values are edited in text boxes, and when saving, the type is automatically cast... when the original value in the .json file was an integer, it is saved as an integer... if it was a boolean, it is saved as a boolean... same for floats and strings.)
Sub-Main View (Schedule)
F1 (Info)
- Basic information
- Name
- Description
- Last time triggered
- Duration: how long the action took when it was last triggered
F2 (Trigger)
- Configuration of the triggers
F3 (Action)
- Configuration of the actions
Sub-Main View (ML Trainer)
F1 (Info)
- Basic information
- ML Trainer Name
F2 (Model Info)
Some basic information about the currently used model... some of it from the meta .json file included in the .pth file...
F3 (Training)
If a training is currently running, it shows the current status (epoch, max epochs, accuracy, loss, ...)
and a progress bar... plus how long an epoch takes to train and how long the training will probably take (updated every 5 seconds).
If no training is currently running, it shows the stats of the last training. There are input fields for starting a training with specific params like batch size, min epochs, max epochs, ... or continuing from an old checkpoint, ...
There are also Buttons to Start/Stop/Pause the training.
Main View (Client)
F1 (General)
- Shows general status
- Hostname
- Host address
- Command Socket Port
- Server Host Name / Port
- Client Up-Time
- Version
- OS / OS Version
F2 (Config)
Shows the client configuration
F3 (Wakeword)
- Information about the used wakeword model
- Information about the used voice recognition model
- Information about the last time a wakeword was said (and which one: custom or system command)
- Information about who last used the wakeword (speaker id and speaker name)
- Button to manually trigger "wakeword said", so the audio stream starts (for 60 seconds or until 3 seconds of silence)
ML Trainer
Information
Technology: PyTorch
Metafile: metafile.json (.pth files are .zip files with a different file extension, so we can use zip to include the metafile in the .pth file)
All trainers can be imported from the server scripts, or they can be executed from the terminal with custom arguments.
The configuration files are located in the ./config directory, or passed as start arguments when executing from the console. Arguments, when set, have higher priority than config variables.
Raw wave files must never be deleted, moved, or edited. They stay untouched!
Use state-of-the-art mechanics and advanced features to build professional ML trainers.
Also include other optional technologies you can turn on/off in the configuration file... technologies like those Porcupine uses, for example, or other state-of-the-art products... so we can train different kinds of models. The technology that is used is stored in the model's metadata.
Meta File
.pth files are really .zip files with a different file extension, and it's possible to add custom files to those archives. I make use of this and want to add a metafile.json to the archive. Basic information is stored here, such as training date, author, description, training results, the name of the computer the model was trained on, and training information like duration, number of epochs, and the chunk length used for training...
When using .pt files, "torch.package" should be used to store all this meta information... and when using ONNX, the information is stored with the ONNX module's "metadata_props".
I want to support all 3 file formats.
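For the .pth case, writing and reading metafile.json can be done with the stdlib zipfile module, since the archive is an ordinary zip (the metadata fields shown are examples from the notes; the helper names are assumptions):

```python
# Sketch of embedding metafile.json in a .pth archive via zipfile.
# Works because .pth files saved by modern torch.save are zip archives.

import json
import zipfile

def write_meta(pth_path: str, meta: dict) -> None:
    """Append metafile.json to an existing .pth (zip) archive."""
    with zipfile.ZipFile(pth_path, "a") as zf:
        zf.writestr("metafile.json", json.dumps(meta, indent=2))

def read_meta(pth_path: str) -> dict:
    with zipfile.ZipFile(pth_path) as zf:
        return json.loads(zf.read("metafile.json"))
```

For .pt files the same dict would go through torch.package, and for ONNX through metadata_props, so the trainer can expose one read_meta() regardless of format.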
Wakeword Info
There are 2 Wakewords. 1) Custom; 2) "System Command" for admin commands.
Wakeword Data Source
The training data is stored in ./trainer/data/wakeword/raw//*.wav
In "custom" and "system command", the wave files contain only the wakeword. Some of them are 2 seconds long, some are 8 seconds with silence at the beginning and/or end. So they must be trimmed to a fitting, constant length... sometimes with a bit of silence at the beginning and a bit at the end, but the full wakeword must stay in the wave file. So the first thing to detect is the max length of the spoken wakeword without silence. Then you can trim all wave files to the same size, with a random duration of silence at the beginning/end, and save them as wave files in ".../chunked//*.wav"
Besides "system command" and "custom", there are also a "negative" and a "background" directory. These contain waves that do not include the wakeword. You must split those waves into many single wave files. Some of the raw/negative files are 30 seconds long... For example, if the max wakeword length is 1.5 seconds, you can split a single wave into 20 wave files. None of those files contain the wakeword.
After that, sort them into training directories... use a good algorithm to split the files into "train" / "val" / "test".
Sampling Rate: 16 kHz
Bit Depth / Format: 16-bit PCM mono
Feature Extraction: Log-Mel Spectrogram (20-40 filter banks, 10 ms shift)
Data augmentation: background noises, silence, volume, pitch, speed
Model architecture: RepCNN
Voice Recognition
The voice recognition is an ML model that recognizes a speaker and returns their name. The number of speakers is dynamic... there can be 4 speakers or 90 speakers... The number of speakers and their names are stored in the model meta file (.pth files in the metafile.json, .pt files use torch.package, and ONNX uses metadata_props... and it's always "num_speaker" for the number of speakers and "speaker001" = "Johannes", "speaker002" = "Peter", ...)
Based on "num_speaker", the model architecture is formed. This allows a dynamic number of speakers.
With fine-tuning, new speakers can be added.
Voice Recognition Data Source
The training data is stored in ./trainer/data/voice_recognition/raw//*.wav
The files should be trimmed of silence. Then they can be split into slices with the same length the wakeword has. So when the preparation is called, we need the length as an argument. It should be the same as the wakeword used... if not given, the wakeword model should be read to get the meta information.
Then, same as for the wakeword training negative/background data, the files should be split into multiple pieces. Some of the raw files are 30 seconds long and can be split into 20 files...
The directory the wave files are in has the same name as the speaker... this is the "speaker name". The speaker name must be added to the speaker metadata array... the index of the entry is the "speaker id".
After that, sort them into training directories... use a good algorithm to split the files into "train" / "val" / "test".
Sampling Rate: 16 kHz
Bit Depth / Format: 16-bit PCM mono
Feature Extraction: Log-Mel Spectrogram or MFCC (40 filter banks, 25 ms window, 10 ms shift)
VAD (Voice Activity Detection): WebRTC VAD or energy-based
Data augmentation: noise, reverberation, loudness, mic simulation
Model architecture: ECAPA-TDNN, TitaNet-S, SpeakerNet-M
Target output: Speaker Embedding
Loss functions: Additive Margin Softmax (e.g. ArcFace), GE2E, Triplet Loss