# FLiP: FactLoLM (English, SONAR)
Factorized Log-linear Models trained to recover lexical content (keywords) from SONAR sentence embeddings, as described in:
Santosh Kesiraju, Bolaji Yusuf, Simon Sedlacek, Oldrich Plchot, Petr Schwarz. *FLiP: Towards understanding and interpreting multimodal multilingual sentence embeddings.* Speech@FIT, Brno University of Technology. arXiv:2604.18109
- Code: `BUTSpeechFIT/FLiP`
- Data: `BUT-FIT/FLiP-data`
## Available variants
Trained on Mozilla Common Voice v15 (English), vocabulary of 100 000 unigrams.
| Path | Rank | Params (M) | Model size |
|---|---|---|---|
| `mcv15/rank-512/` | 512 | ~153 | 207 MB |
| `mcv15/rank-1024/` | 1024 | ~305 | 414 MB |
Each variant directory contains:
- `model.safetensors`: `E1` (vocab × rank), `E2` (rank × 1024), `b` (vocab × 1)
- `vocab.json`: `{"word": index, ...}` (100 000 entries)
- `config.json`: hyperparameters
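To fetch a single variant without cloning the whole repository, `huggingface_hub.snapshot_download` with an `allow_patterns` filter is enough. A minimal sketch; the `repo_id` below is a placeholder for wherever this card is hosted:

```python
# Minimal download sketch. NOTE: the repo_id is an assumption / placeholder;
# replace it with the actual Hub repo hosting these variants.
from huggingface_hub import snapshot_download

local_root = snapshot_download(
    repo_id="BUT-FIT/FLiP",                # hypothetical repo id
    allow_patterns=["mcv15/rank-512/*"],   # only the rank-512 variant
)
# local_root now contains mcv15/rank-512/{model.safetensors, vocab.json, config.json}
```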
## Model architecture
FactLoLM scores every vocabulary word $w$ for an input SONAR sentence embedding $\mathbf{e} \in \mathbb{R}^{1024}$:

$$\mathbf{s} = E_1 E_2 \, \mathbf{e} + \mathbf{b},$$

where $E_1 \in \mathbb{R}^{V \times r}$ and $E_2 \in \mathbb{R}^{r \times 1024}$ form a rank-$r$ factorization of the full $V \times 1024$ projection matrix, and $\mathbf{b} \in \mathbb{R}^{V}$ is a per-word bias. Keywords are the top-$n$ words by (logit) score $s_w$.
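A practical consequence of the factorization, sketched below with random weights: projecting through $E_2$ first keeps the intermediate rank-sized, so the full $V \times 1024$ matrix $E_1 E_2$ is never materialized. Dimensions are shrunk here only to keep the check fast; they are not those of the released models.

```python
# Sanity-check sketch (random weights, reduced V for speed): both
# associations give the same scores, but the right-to-left one avoids
# building the full V x 1024 matrix E1 @ E2.
import torch

V, r, d = 10_000, 512, 1024      # reduced V; released models use 100k words
E1, E2, b = torch.randn(V, r), torch.randn(r, d), torch.randn(V)
e = torch.randn(d)               # stand-in for a SONAR embedding

scores_fast = E1 @ (E2 @ e) + b  # O(r*d + V*r), rank-sized intermediate
scores_slow = (E1 @ E2) @ e + b  # materializes the (V, d) product first
assert torch.allclose(scores_fast, scores_slow, rtol=1e-4, atol=1e-2)
```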
## Usage
```python
import json

import torch
from safetensors.torch import load_file

# Load the factorized projection (files described under "Available variants")
tensors = load_file("mcv15/rank-512/model.safetensors")  # after hf download
E1, E2, b = tensors["E1"], tensors["E2"], tensors["b"]   # CPU tensors
with open("mcv15/rank-512/vocab.json") as f:
    vocab = json.load(f)                                 # {word: index}
int2word = {v: k for k, v in vocab.items()}

def extract_keywords(embedding: torch.Tensor, topn: int = 10):
    """embedding: (1024,) SONAR sentence embedding (speech or text)."""
    # Associate right-to-left so the intermediate stays rank-sized
    scores = E1 @ (E2 @ embedding) + b.squeeze(-1)
    top_ids = scores.topk(topn).indices.tolist()
    return [int2word[i] for i in top_ids]
```
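To produce the `(1024,)` input, SONAR's text pipeline from the `sonar-space` package can be used. A sketch, assuming the encoder/tokenizer card names from the SONAR README; verify them against your installed version:

```python
# Sketch: encode text with SONAR, then reuse extract_keywords() from above.
# Encoder/tokenizer names follow the SONAR README and may differ across
# sonar-space versions.
from sonar.inference_pipelines.text import TextToEmbeddingModelPipeline

t2vec = TextToEmbeddingModelPipeline(
    encoder="text_sonar_basic_encoder",
    tokenizer="text_sonar_basic_encoder",
)
embeddings = t2vec.predict(
    ["The quick brown fox jumps over the lazy dog."],
    source_lang="eng_Latn",
)                                # shape: (1, 1024)
print(extract_keywords(embeddings[0], topn=10))
```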
Or use the full FLiP pipeline (recommended):
```bash
python scripts/evaluate.py \
    --sdict mcv15/rank-512/model.safetensors \
    --vocab mcv15/rank-512/vocab.json \
    --data_yaml configs/datasets.yaml \
    --dataset mcv_15_en_test \
    --topn 10 --metrics all
```
## Results
Keyword extraction on MCV15 EN (SONAR embeddings):
| Model | Text acc. | Speech acc. |
|---|---|---|
| LiP (full-rank baseline) | 59.45 | 57.27 |
| FLiP rank-512 | 76.77 | 73.62 |
| FLiP rank-1024 | 77.29 | 74.09 |
Span-aware accuracy vs SpLiCE (10 k concept vocabulary):
| Method | Text | Speech |
|---|---|---|
| SpLiCE | 29.58 | 28.21 |
| FLiP | 61.45 | 58.83 |
## Citation
```bibtex
@misc{kesiraju2026flip,
  title         = {{FLiP}: Towards understanding and interpreting multimodal multilingual sentence embeddings},
  author        = {Kesiraju, Santosh and Yusuf, Bolaji and Sedl{\'{a}}{\v{c}}ek, {\v{S}}imon and Plchot, Old{\v{r}}ich and Schwarz, Petr},
  year          = {2026},
  eprint        = {2604.18109},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2604.18109},
}
```
## License
MIT; see LICENSE.