Kernels

Activation

Activation is a python package that contains custom CUDA-based activation kernels, primarily targeting AMD GPUs.

  • Currently implemented
    • PolyNorm

    • RMSNorm

    • FusedAddRMSNorm

      A fused operator that combines residual addition (x + residual) with RMSNorm in a single kernel.

      • Instead of:

        y = x + residual
        hidden_state = rms_norm(y, weight, eps)
        out = y + some_op(hidden_state)
        
      • Fused as:

        hidden_state, y = fused_add_rms_norm(x, residual, weight, eps)
        out = y + some_op(hidden_state)
        
    • FusedMulPolyNorm

      A fused operator that combines PolyNorm with an element-wise multiplication by a Tensor.

      • Instead of:

        y = poly_norm(x, weight, bias, eps)
        out = y * a
        
      • Fused as:

        out = fused_mul_poly_norm(x, a, weight, bias, eps)
        
    • FusedMulGroupedPolyNorm (CUDA)

      A CUDA-accelerated grouped variant of FusedMulPolyNorm for MoE (Mixture of Experts) models. Fuses the entire PolyNorm computation into CUDA kernels (fwd + bwd), with per-expert weights/bias, in-kernel binary search for expert mapping, optional routing scores multiplication, and hidden_clamp fusion.

      • Instead of:

        for i, expert in enumerate(experts):
            out[start:end] = fused_mul_poly_norm(x[start:end], mul[start:end], weight[i], bias[i], eps)
        
      • Fused as:

        out = fused_mul_grouped_poly_norm(x, mul, weight, bias, offsets, eps,
                                           scores=scores, hidden_clamp=10.0)
        

Installation

# Local CUDA build (development)
pip install --no-build-isolation -e .

Usage

import torch
import activation

torch.set_default_device("cuda")
poly_norm = activation.layers.PolyNorm(eps=1e-6)
x = torch.randn(10, 10)

print(poly_norm(x))

Performance

  • Test cases are from the Motif LLM
  • The results can be reproduced using the provided benchmarking tools.
  • For details on how to use the benchmarking tools, please refer to the benchmarks README.
  • The benchmark results may show fluctuations, especially in the backward pass and when the dimension size is small.

RMSNorm

H100 Results

Forward Performance

RMSNorm Forward Performance

Backward Performance

RMSNorm Backward Performance

MI250 Results

Forward Performance

RMSNorm Forward Performance

Backward Performance

RMSNorm Backward Performance


FusedAddRMSNorm

For fusion case performance, the non-fused baseline was implemented with our custom kernels.

H100 Results

Forward Performance

FusedAddRMSNorm Forward Performance

Backward Performance

FusedAddRMSNorm Backward Performance

MI250 Results

Forward Performance

FusedAddRMSNorm Forward Performance

Backward Performance

FusedAddRMSNorm Backward Performance


PolyNorm

H100 Results

Forward Performance

PolyNorm Forward Performance

Backward Performance

PolyNorm Backward Performance

MI250 Results

Forward Performance

PolyNorm Forward Performance

Backward Performance

PolyNorm Backward Performance


FusedMulPolyNorm

For fusion case performance, the non-fused baseline was implemented with our custom kernels.

H100 Results

Forward Performance

FusedMulPolyNorm Forward Performance

Backward Performance

FusedMulPolyNorm Backward Performance

MI250 Results

Forward Performance

FusedMulPolyNorm Forward Performance

Backward Performance

FusedMulPolyNorm Backward Performance


FusedMulGroupedPolyNorm (CUDA)

This kernel is implemented in CUDA C++ (compiled via setup.py). Benchmarks compare three variants: Naive (raw PyTorch reference), Compiled (torch.compile'd reference), and CUDA (fused CUDA kernel). Benchmark dimension: 1280, 384 experts.

Training profile (B200, motif3_seq, lbs=8, seqlen=4K):

CUDA kernel torch.compile Speedup
Forward 0.7 ms 2.1 ms 3.0x
Backward 1.4 ms 3.7 ms 2.6x

Pre-commit Hooks

This project uses pre-commit to automatically check and format code before commits.

Setup

  1. Install pre-commit:

    pip install pre-commit
    
  2. Install the git hooks:

   pre-commit install

Once installed, the configured hooks will run automatically on each commit.

Included Hooks

The following tools are run via pre-commit:

  • yapf โ€“ Python code formatter
  • typos โ€“ Spell checker for common typos
  • isort โ€“ Organizes and sorts Python imports
  • clang-format โ€“ Formats C++/CUDA code (--style=file)
  • pymarkdown โ€“ Lints and auto-fixes Markdown files
  • actionlint โ€“ Validates GitHub Actions workflows

Usage

  • Run all checks on the entire codebase:

    pre-commit run --all-files
    
  • Run a specific hook (example: isort):

  pre-commit run isort --all-files
Downloads last month
110
Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support