Stock Prediction AI: Regime-Aware LightGBM

I reverse-engineered a hedge fund's trading strategy and open-sourced the model.

A LightGBM regressor that predicts next-day log returns for 150 US stocks, paired with a one-line regime rule that decides whether to act on the prediction. Trained on a MacBook M1. No cloud GPU, no paid data beyond a single FMP API subscription.

Author: @jc_builds


[Figure: equity curves vs buy-and-hold vs SPY across 4 market regimes]

TL;DR

Regime          Strategy vs B&H    Strategy vs SPY
2010 bull       -0.1%              +10.9% ✓
2018 bear       +11.9% ✓           +18.1% ✓
2020 COVID      +22.0% ✓           +38.8% ✓
2022 bear       -0.0%              -3.8%
Live 2025–26    -3.2%              -6.2%
5-split avg     +6.1%              +11.6%

The model beats buy-and-hold in crashes (2018, 2020) and is roughly flat in bulls and grinding bears (2010, 2022, live). It is a crash shield, not a stock picker.


How it works

[Figure: pipeline]

The whole system is five boring parts:

  1. Universe: 150 large-cap US equities, 20 years of daily bars from FMP.
  2. Features: 47 engineered features per stock, covering multi-horizon returns, rolling vol, drawdowns, moving averages, SPY regime context, and cross-sectional ranks. All strictly known at day t to predict day t+1.
  3. Model: a single LightGBM regressor trained to predict next-day log return. MSE loss. Early stopping on a 10% chronological validation tail.
  4. Decision: a per-stock long-or-cash gate, threshold tuned on validation to maximize excess-return-vs-buy-and-hold.
  5. Backtest: walk-forward, out-of-sample, 5 bps per position change.
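Steps 2 and 3 hinge on two alignment details: the target for day t is the return realized on day t+1, and the validation set is the most recent slice of the training window rather than a random sample. The sketch below shows both in plain Python. The function names and toy prices are illustrative assumptions, not the code in the repository.

```python
import math

def next_day_log_returns(closes):
    """Target for day t is log(close[t+1] / close[t]); the last day has no target."""
    return [math.log(closes[t + 1] / closes[t]) for t in range(len(closes) - 1)]

def chronological_split(rows, val_fraction=0.10):
    """Split time-ordered rows into (train, val), with val as the most recent tail."""
    n_val = max(1, int(round(len(rows) * val_fraction)))
    return rows[:-n_val], rows[-n_val:]

# Toy closing prices (hypothetical data, oldest first).
closes = [100.0, 101.0, 99.0, 102.0, 103.0, 101.5, 104.0, 105.0, 104.0, 106.0, 107.0]
targets = next_day_log_returns(closes)

# Pair each day index (a stand-in for that day's feature row) with its target.
rows = list(zip(range(len(targets)), targets))
train, val = chronological_split(rows)
```

Because the split is by position in time, every training row precedes every validation row, which is what keeps early stopping from peeking at the future.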

No shorts. No leverage. No options. No intraday.
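A long-or-cash backtest with a per-flip cost fits in a few lines. This is a minimal sketch under stated assumptions: the strategy starts in cash, the cost is charged as a multiplicative haircut on every position change, and buy-and-hold pays no cost. The function name and toy returns are hypothetical.

```python
import math

COST_PER_FLIP = 0.0005  # 5 bps, charged whenever the position changes

def backtest_long_or_cash(positions, log_returns, cost=COST_PER_FLIP):
    """Return (strategy_total, buy_and_hold_total) as simple returns.

    positions[t] is 1 (long) or 0 (cash), decided before log_returns[t] is realized.
    The strategy starts in cash, so the first entry into a long position is a charged flip.
    """
    strat_log, prev = 0.0, 0
    for pos, r in zip(positions, log_returns):
        if pos != prev:
            strat_log += math.log(1.0 - cost)  # pay the cost on each flip
        strat_log += pos * r                   # earn the return only while long
        prev = pos
    bh_log = sum(log_returns)
    return math.exp(strat_log) - 1.0, math.exp(bh_log) - 1.0

# Straight-up bull: staying long every day can only match B&H minus the entry cost.
bull_strat, bull_bh = backtest_long_or_cash([1] * 5, [0.01] * 5)

# Crash: sitting out the two down days beats B&H even after three flips.
crash_strat, crash_bh = backtest_long_or_cash([1, 0, 0, 1], [0.01, -0.05, -0.04, 0.02])
```

The two toy runs illustrate the asymmetry the rest of this card keeps returning to: the gate can add value only on days it correctly sits out.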


The one rule that moved the needle

[Figure: regime rule diagram]

The raw model was already decent, but the thing that actually moved metrics was a one-line regime gate:

# Pseudocode: see src/strategy.py for the real version
bull = spy_price > spy_200_day_moving_average
threshold = bull_threshold if bull else bear_threshold  # bull: -0.003, bear: tuned
position = 1 if pred > threshold else 0
  • In bull markets (SPY above 200-day MA), demand a strongly bearish signal before going to cash. Otherwise stay long and earn the bull.
  • In bear markets, use a stricter bar: go long only if the model genuinely expects up.
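For readers who want something runnable, here is a self-contained sketch of the same gate, including the trailing 200-day mean. It is not the code in src/strategy.py. The function names are hypothetical, and `bear_threshold=0.0` is a placeholder assumption, since the real bear threshold is tuned per split.

```python
def trailing_mean(values, window=200):
    """Mean of the last `window` values, or None until enough history exists."""
    if len(values) < window:
        return None
    return sum(values[-window:]) / window

def regime_position(pred, spy_closes, bull_threshold=-0.003, bear_threshold=0.0, window=200):
    """Return 1 (long) or 0 (cash) for one stock on one day.

    spy_closes is the SPY closing-price history up to and including today, oldest first.
    """
    ma = trailing_mean(spy_closes, window)
    if ma is None:
        raise ValueError("need at least `window` SPY closes")
    bull = spy_closes[-1] > ma
    threshold = bull_threshold if bull else bear_threshold
    return 1 if pred > threshold else 0

# Toy SPY histories (hypothetical): one steadily rising, one steadily falling.
rising = [100 + 0.1 * i for i in range(250)]
falling = [200 - 0.1 * i for i in range(250)]

mild_dip_in_bull = regime_position(-0.001, rising)    # above -0.003, so stay long
sharp_dip_in_bull = regime_position(-0.004, rising)   # below -0.003, so go to cash
mild_dip_in_bear = regime_position(-0.001, falling)   # not above 0.0, so stay in cash
upside_in_bear = regime_position(0.002, falling)      # above 0.0, so go long
```

The same mildly negative prediction produces opposite positions depending on the regime, which is the whole point of the rule.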

That one if statement lifted the 2018 split from flat to +11.85% vs buy-and-hold.
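The tuned threshold can be found with a plain grid search over validation data. The sketch below assumes the objective is total log return minus buy-and-hold, net of the 5 bps flip cost. The candidate grid, function names, and toy data are illustrative assumptions, not the repository's tuning code.

```python
def excess_log_return(preds, log_returns, threshold, cost=0.0005):
    """Strategy minus buy-and-hold, in log-return terms, for a long-or-cash gate."""
    strat, prev = 0.0, 0
    for p, r in zip(preds, log_returns):
        pos = 1 if p > threshold else 0
        if pos != prev:
            strat -= cost  # approximation: log(1 - cost) is close to -cost for small cost
        strat += pos * r
        prev = pos
    return strat - sum(log_returns)

def tune_threshold(preds, log_returns, candidates):
    """Pick the candidate threshold with the highest validation excess return."""
    return max(candidates, key=lambda thr: excess_log_return(preds, log_returns, thr))

# Toy validation data (hypothetical): model predictions and realized next-day log returns.
val_preds = [0.004, -0.006, -0.004, 0.003, -0.008]
val_returns = [0.010, -0.030, 0.005, 0.008, -0.020]
candidates = [-0.010, -0.005, -0.003, 0.0, 0.003]

best = tune_threshold(val_preds, val_returns, candidates)
```

Tuning on validation only, and then freezing the value before touching the test year, is what keeps this from becoming a fit to the test set.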


Results across 5 walk-forward regimes

[Figure: excess return vs SPY by regime]

The training is a classic walk-forward: for each test year Y, train on all data from 2005 through Y-1, never touching Y. For the live split, training ran through 2025-04-22 and testing ran forward from there.
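The split schedule described above can be written down explicitly. This sketch assumes calendar-year boundaries for the historical splits and assumes the live test window opens the day after the training cutoff. The dictionary keys and function name are hypothetical.

```python
from datetime import date

def walk_forward_splits(test_years, first_train_year=2005):
    """For each test year Y, train on first_train_year..Y-1 and test on Y only."""
    splits = []
    for year in test_years:
        splits.append({
            "name": str(year),
            "train_start": date(first_train_year, 1, 1),
            "train_end": date(year - 1, 12, 31),
            "test_start": date(year, 1, 1),
            "test_end": date(year, 12, 31),
        })
    return splits

splits = walk_forward_splits([2010, 2018, 2020, 2022])

# The live split is handled separately: training ends on 2025-04-22 and
# testing runs forward from there with no fixed end date.
splits.append({
    "name": "live",
    "train_start": date(2005, 1, 1),
    "train_end": date(2025, 4, 22),
    "test_start": date(2025, 4, 23),  # assumption: the day after the cutoff
    "test_end": None,
})

# Sanity check: no split lets training data reach into its own test window.
no_overlap = all(s["train_end"] < s["test_start"] for s in splits)
```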

Where it won (crashes):

  • 2018 bear: volatile Q4 crash. Strategy +18.05% vs SPY, beat B&H on 128 of 130 stocks.
  • 2020 COVID: V-shaped crash and rally. Strategy +38.83% vs SPY, beat B&H on 113 of 138 stocks.

Where it barely won or lost (bulls and grinding bears):

  • 2010 bull: +10.90% vs SPY (the 150-stock basket happened to beat SPY significantly in 2010, and the strategy just matched the basket minus costs).
  • 2022 bear: -3.78% vs SPY. LightGBM's early stopping triggered at iteration 1; the 2005–2021 training distribution simply didn't predict 2022's rate-hike bear well. Effectively always-long, matched the basket minus costs.
  • Live 2025–26: -6.16% vs SPY. A strong bull year. Same ceiling problem as 2010: a long-or-cash strategy's upper bound in a straight-up bull is basically B&H minus costs.

Average across 675 stock-years tested: +11.57% excess vs SPY, +6.11% excess vs per-stock buy-and-hold.


What worked vs what failed

[Figure: worked vs failed]

The table lists ten variants (v1 through v9, plus v4b). Seven regressed against the v1 baseline, one (v4) shipped as the champion, and v9 ships separately. The champion isn't clever: it's the simplest thing that works plus one regime rule.

Version   Idea                                        Outcome
v1        Plain LightGBM, MSE, val-tuned threshold    baseline that works
v2        Asymmetric y-weighted loss                  regressed: early-stopped at iter 1
v3        Magnitude-weighted training                 regressed: averaged away crash calls
v4        v1 + regime-aware threshold                 champion, shipped
v4b       v4 with fixed threshold                     slight regression
v5        5-seed ensemble                             regressed: averaged away high-conviction crash picks
v6        P(big-down) classifier                      regressed: target too rare
v7        MLX MLP on Apple GPU                        regressed: calibration collapsed
v8        v4 trained on 2015+ only                    regressed: best_iter 120 → 5, lost regularization
v9        Multi-modal top-10 rotation                 ships separately, feast-or-famine profile

What I actually learned

[Figure: learnings]

  1. The model only helps in crashes. In a bull market, just buying and holding wins. Any long-or-cash strategy is capped at B&H minus transaction costs.
  2. A one-line rule beat every clever loss function. I spent weeks on asymmetric losses, magnitude weighting, and custom objectives. A single "is SPY above its 200-day average?" gate outperformed all of them.
  3. More training data usually helps, but not always. Twenty years of training usually beat ten. One version (v8) flipped it. You have to test instead of guess.
  4. Leave the losing year in the README. The 2025 live run lost 6% vs SPY. Hiding it would make every other number untrustworthy.

Artifacts

This repository contains:

stockprediction-ai/
├── README.md                  this file
├── config.json                hyperparameters, feature columns, regime thresholds, splits
├── results.json               per-split metrics (strat final, B&H final, excess)
├── boosters/
│   ├── 2010.txt               LightGBM text-format booster, trained on 2005–2009
│   ├── 2018.txt               trained on 2005–2017
│   ├── 2020.txt               trained on 2005–2019
│   ├── 2022.txt               trained on 2005–2021
│   └── live.txt               trained on 2005-01-01 through 2025-04-22
└── images/                    charts used in this card

Boosters are plain LightGBM text files. Load them with any LightGBM runtime, no Python required.


Using the live booster

import json
import lightgbm as lgb
import numpy as np
import pandas as pd
from huggingface_hub import hf_hub_download

REPO = "jc-builds/stockprediction-ai"

# 1. Grab the live booster and config
booster_path = hf_hub_download(REPO, "boosters/live.txt")
config_path  = hf_hub_download(REPO, "config.json")

booster = lgb.Booster(model_file=booster_path)
with open(config_path) as f:   # context manager closes the file after parsing
    config = json.load(f)
feature_cols = config["feature_columns"]

# 2. Build a feature row for one stock on one trading day.
#    build_features_for_today is not defined in this snippet; see
#    github.com/jc-builds/stockprediction/blob/main/src/features.py
#    for the full 47-feature builder (the code is open-source).
features: pd.DataFrame = build_features_for_today(symbol="AAPL")
x = features[feature_cols].astype(float).values

# 3. Predict next-day log return
pred = booster.predict(x)[0]

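# 4. Apply the regime rule.
#    spy_close_today and spy_200_day_ma_today are not defined in this snippet;
#    compute them from SPY daily closes before this step.
bull        = spy_close_today > spy_200_day_ma_today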
thr_bull    = config["regime_rule"]["bull_threshold"]   # -0.003
thr_bear    = config["splits"]["live"]["bear_threshold"]
threshold   = thr_bull if bull else thr_bear
position    = 1 if pred > threshold else 0              # 1=long, 0=cash

Limitations: read before using

  • Not investment advice. My live run trailed SPY by about 6% in a year SPY made 32%. Treat this as an educational artifact.
  • Survivorship bias in the 150-stock universe. The chosen tickers are all still listed in 2025-26; anything that went bankrupt in 2008-10 isn't in the dataset.
  • Transaction cost model is linear. 5 bps per flip. Real slippage is bigger on small caps and during panics.
  • No tax modeling. Wash sales and short-term gains are ignored.
  • The regime rule is leaky in subtle ways. SPY's 200-day MA uses the same history the booster is trained on. In practice this isn't lookahead (the rule only reads today's price vs its trailing mean), but any regime rule is a second optimization over the test set in disguise.
  • The 2022 case is an honest miss. The 2005–2021 training distribution did not generalize. best_iter=1 on v4 means the model "gave up": it returned a near-constant prediction. Any real deployment would need to detect this (e.g., refuse to trade when val loss is flat) rather than fall through to always-long.
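One way to catch the 2022-style failure is a guard that runs before any position is taken. The sketch below flags a model as degenerate if early stopping fired almost immediately or if its validation predictions are nearly constant, and then falls back to cash. The cutoffs (`min_iterations`, `min_pred_std`), the function names, and the choice of cash as the fallback are all assumptions for illustration, not something the shipped strategy does.

```python
import statistics

def model_is_degenerate(best_iteration, val_preds, min_iterations=5, min_pred_std=1e-5):
    """Flag a booster that early-stopped almost immediately or predicts a near-constant."""
    if best_iteration < min_iterations:
        return True
    return statistics.pstdev(val_preds) < min_pred_std

def guarded_position(position, degenerate):
    """Sit in cash (0) instead of trusting the gate when the model is degenerate."""
    return 0 if degenerate else position

# Hypothetical cases: an early-stopped model, a healthy one, and a flat-prediction one.
stopped_early = model_is_degenerate(1, [0.0001, 0.0001, 0.0001])
healthy = model_is_degenerate(120, [0.004, -0.006, 0.001])
flat_preds = model_is_degenerate(120, [0.0002] * 10)
```

With LightGBM, the first argument would typically come from the trained booster's `best_iteration` attribute.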

Reproduce

git clone https://github.com/jc-builds/stockprediction
cd stockprediction
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# Set FMP_API_KEY in .env, then:
PYTHONPATH=src python src/run.py --model v4

Citation

If this was useful in your own work or teaching, please cite:

@misc{jcbuilds_stockprediction_2026,
  author       = {Jared Cassoutt (@jc_builds)},
  title        = {Stock Prediction AI: Regime-Aware LightGBM},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/jc-builds/stockprediction-ai}},
}

License

MIT. Do whatever you want, but don't sue me when you lose money.
