# FLiP: FactLoLM (English, SONAR)
Factorized Log-linear Models trained to recover lexical content (keywords) from SONAR sentence embeddings, as described in:
Santosh Kesiraju, Bolaji Yusuf, Simon Sedlacek, Oldrich Plchot, Petr Schwarz. *FLiP: Towards understanding and interpreting multimodal multilingual sentence embeddings.* Speech@FIT, Brno University of Technology. arXiv:2604.18109
- Code: `BUTSpeechFIT/FLiP`
- Data: `BUT-FIT/FLiP-data`
## Available variants
Trained on Mozilla Common Voice v15 (English), vocabulary of 100 000 unigrams.
| Path | Rank | Params (M) | Model size |
|---|---|---|---|
| `mcv15/rank-512/` | 512 | ~153 | 207 MB |
| `mcv15/rank-1024/` | 1024 | ~305 | 414 MB |
Each variant directory contains:
- `model.safetensors`: `E1` (vocab × rank), `E2` (rank × 1024), `b` (vocab × 1)
- `vocab.json`: `{"word": index, ...}` (100 000 entries)
- `config.json`: hyperparameters
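To fetch a single variant without cloning the whole repository, `huggingface_hub.snapshot_download` with an `allow_patterns` filter is enough. A minimal sketch; the `repo_id` below is a placeholder for wherever this card is hosted:

```python
# Minimal download sketch. NOTE: the repo_id is an assumption / placeholder;
# replace it with the actual Hub repo hosting these variants.
from huggingface_hub import snapshot_download

local_root = snapshot_download(
    repo_id="BUT-FIT/FLiP",                # hypothetical repo id
    allow_patterns=["mcv15/rank-512/*"],   # only the rank-512 variant
)
# local_root now contains mcv15/rank-512/{model.safetensors, vocab.json, config.json}
```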
## Model architecture
FactLoLM scores every vocabulary word $w$ for an input SONAR sentence embedding $\mathbf{e} \in \mathbb{R}^{1024}$:

$$\mathbf{s} = E_1 E_2 \, \mathbf{e} + \mathbf{b},$$

where $E_1 \in \mathbb{R}^{V \times r}$ and $E_2 \in \mathbb{R}^{r \times 1024}$ form a rank-$r$ factorization of the full $V \times 1024$ projection matrix, and $\mathbf{b} \in \mathbb{R}^{V}$ is a per-word bias. Keywords are the top-$n$ words by (logit) score $s_w$.
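A practical consequence of the factorization, sketched below with random weights: projecting through $E_2$ first keeps the intermediate rank-sized, so the full $V \times 1024$ matrix $E_1 E_2$ is never materialized. Dimensions are shrunk here only to keep the check fast; they are not those of the released models.

```python
# Sanity-check sketch (random weights, reduced V for speed): both
# associations give the same scores, but the right-to-left one avoids
# building the full V x 1024 matrix E1 @ E2.
import torch

V, r, d = 10_000, 512, 1024      # reduced V; released models use 100k words
E1, E2, b = torch.randn(V, r), torch.randn(r, d), torch.randn(V)
e = torch.randn(d)               # stand-in for a SONAR embedding

scores_fast = E1 @ (E2 @ e) + b  # O(r*d + V*r), rank-sized intermediate
scores_slow = (E1 @ E2) @ e + b  # materializes the (V, d) product first
assert torch.allclose(scores_fast, scores_slow, rtol=1e-4, atol=1e-2)
```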
## Usage
```python
import json

import torch
from safetensors.torch import load_file

# Load the factorized projection (files described under "Available variants")
tensors = load_file("mcv15/rank-512/model.safetensors")  # after hf download
E1, E2, b = tensors["E1"], tensors["E2"], tensors["b"]   # CPU tensors
with open("mcv15/rank-512/vocab.json") as f:
    vocab = json.load(f)                                 # {word: index}
int2word = {v: k for k, v in vocab.items()}

def extract_keywords(embedding: torch.Tensor, topn: int = 10):
    """embedding: (1024,) SONAR sentence embedding (speech or text)."""
    # Associate right-to-left so the intermediate stays rank-sized
    scores = E1 @ (E2 @ embedding) + b.squeeze(-1)
    top_ids = scores.topk(topn).indices.tolist()
    return [int2word[i] for i in top_ids]
```
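To produce the `(1024,)` input, SONAR's text pipeline from the `sonar-space` package can be used. A sketch, assuming the encoder/tokenizer card names from the SONAR README; verify them against your installed version:

```python
# Sketch: encode text with SONAR, then reuse extract_keywords() from above.
# Encoder/tokenizer names follow the SONAR README and may differ across
# sonar-space versions.
from sonar.inference_pipelines.text import TextToEmbeddingModelPipeline

t2vec = TextToEmbeddingModelPipeline(
    encoder="text_sonar_basic_encoder",
    tokenizer="text_sonar_basic_encoder",
)
embeddings = t2vec.predict(
    ["The quick brown fox jumps over the lazy dog."],
    source_lang="eng_Latn",
)                                # shape: (1, 1024)
print(extract_keywords(embeddings[0], topn=10))
```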
Or use the full FLiP pipeline (recommended):
```bash
python scripts/evaluate.py \
    --sdict mcv15/rank-512/model.safetensors \
    --vocab mcv15/rank-512/vocab.json \
    --data_yaml configs/datasets.yaml \
    --dataset mcv_15_en_test \
    --topn 10 --metrics all
```
## Results
Keyword extraction on MCV15 EN (SONAR embeddings):
| Model | Text acc. | Speech acc. |
|---|---|---|
| LiP (full-rank baseline) | 59.45 | 57.27 |
| FLiP rank-512 | 76.77 | 73.62 |
| FLiP rank-1024 | 77.29 | 74.09 |
Span-aware accuracy vs SpLiCE (10 k concept vocabulary):
| Method | Text | Speech |
|---|---|---|
| SpLiCE | 29.58 | 28.21 |
| FLiP | 61.45 | 58.83 |
## Citation
```bibtex
@misc{kesiraju2026flip,
  title         = {{FLiP}: Towards understanding and interpreting multimodal multilingual sentence embeddings},
  author        = {Kesiraju, Santosh and Yusuf, Bolaji and Sedl{\'{a}}{\v{c}}ek, {\v{S}}imon and Plchot, Old{\v{r}}ich and Schwarz, Petr},
  year          = {2026},
  eprint        = {2604.18109},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2604.18109},
}
```
## License
MIT; see LICENSE.