AI Model Training Pipeline

Employee Transcribes Waveforms

Trained employee listens to audio recordings and manually transcribes them into Cree syllabics. These labelled pairs form the core training dataset.

Fine-tune Pre-trained Model (Whisper)

Waveform–syllabics pairs are used to fine-tune Whisper, adapting it to recognise spoken Cree. The result is a dedicated Cree speech-to-text model.

Output: Waveform → Syllabic tokens

🔓

Employee transcription no longer needed. The Cree speech-to-text model now handles audio transcription automatically.

Employee Translates to English

Transcribed Cree syllabics are translated into English by a fluent speaker, producing Cree–English sentence pairs under strict privacy protocols.

Train Cree–English Translation Model

The Cree–English sentence pairs are used to train a dedicated translation model from scratch, learning the mapping between both languages.

Output: Cree–English translation model

🔓

Employee translation no longer needed. The Cree–English model now handles translation of transcribed audio automatically.

Train Cree Syllabics Tokenizer

A tokenizer is trained on the full transcribed dataset, learning how to split Cree syllabic text into meaningful sub-word units. This tokenizer will be used by the LLM.

Output: Syllabics tokenizer

Monolingual Cree Text

Raw transcriptions for native Cree language fluency

Syllabics Tokenizer

Defines how Cree text is split into tokens for the LLM

Speech–Text Pairs

Waveform + syllabics for multimodal training

Cree–English Pairs

Sentence pairs for bilingual language understanding

Train Large Language Model from Scratch

Using monolingual Cree text, the Cree syllabics tokenizer, speech–text pairs, and Cree–English pairs together, a multimodal LLM is trained from scratch. The monolingual data builds native Cree fluency; the pairs align it bilingually and enable audio understanding.

✦ Multimodal · Trained from scratch