CoFEE: Reasoning Control for LLM-Based Feature Discovery in Quant VC
The headline result
Compared to vanilla prompting, CoFEE produces 15.2% higher success, 29% fewer features, and a 53.3% reduction in cost. The features get better and there are fewer of them, at half the spend. The paper's claim is that this is not an artifact of clever prompt tuning but the result of treating feature discovery as a reasoning problem and enforcing the reasoning explicitly.
The trick is simple to state and load-bearing in practice: stop asking the LLM for features, ask it to think. Reason backward from the outcome. Decompose the goal into subgoals. Check for leakage at every step. Throw out weak ideas before they become features.
Feature discovery is a reasoning problem
The dominant pattern for using an LLM to engineer features is a single-shot prompt: “here is the dataset, here is the target, list useful features.” The model produces a long list. Some features are predictive, some are leakage in disguise, some are proxies for the target, some are post-outcome variables that would not be available at prediction time. The downstream pipeline cannot tell which is which without additional engineering, and the cost of generating that long list grows linearly with the number of attempts.
The paper's reframing is that feature discovery is fundamentally a reasoning problem: it requires identifying abstractions that are predictive of a target outcome while avoiding leakage, proxies, and post-outcome signals. The leakage check is not an afterthought; it is part of the reasoning. CoFEE is built around making that reasoning visible, structured, and controllable.
This is also where CoFEE differs from earlier automated feature engineering tools. Featuretools, autofeat, and similar libraries are excellent at the structured-data, deterministic-transformation case: window functions, aggregations, and cross-tabulations on relational data. They were not designed for the unstructured-text case where the right features have to be reasoned about, not transformed. CoFEE is the LLM-era counterpart for unstructured inputs: the same ambition (automate the feature engineering step), a different mechanism (reasoning, not transformation), and a different domain of applicability (free-form text and qualitative attributes instead of relational tables).
The four cognitive behaviors CoFEE enforces
CoFEE introduces four cognitive behaviors into the LLM's feature discovery loop. Each is a reasoning move with a clear analog in how a careful human analyst constructs features.
1. Backward chaining from outcomes. Start with the target and work backward toward observable variables that could plausibly cause it. The model is forced to articulate the causal pathway from feature to outcome before the feature is allowed into the candidate set. Backward chaining is what separates “variables merely correlated with the target” from “variables that plausibly drive it.”
2. Subgoal decomposition. Break the prediction goal into subgoals. For a venture-capital outlier prediction problem, the subgoals might be founder quality, market timing, team composition, and execution velocity. Each subgoal becomes a sub-problem with its own feature set. Decomposition forces the model to cover the space of useful features rather than fixating on one cluster.
3. Verification against observability and leakage criteria. Every candidate feature is checked against two questions: (a) would this feature actually be observable at the time of prediction, and (b) is it a function of the outcome itself? If either answer disqualifies the feature, the feature is rejected. This is the cognitive behavior that does the most work on the cost side: unverifiable and leaking features get filtered before they are scored downstream.
4. Explicit backtracking of rejected reasoning paths. When a candidate feature is rejected, the path that produced it is recorded and the model backtracks rather than rolling forward. Traditional prompts let the LLM keep proposing similar features even after one was rejected; CoFEE's backtracking uses each rejection as a signal to redirect the search.
The four behaviors are not independent. Backward chaining generates the candidate space; subgoal decomposition organizes it; verification prunes it; backtracking redirects it. The empirical result is that the candidate set converges faster on features that survive downstream evaluation.
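The interaction of the four behaviors can be sketched as a small control loop. Everything below is illustrative, not the paper's implementation: the propose function is a deterministic stub standing in for an LLM call, and the candidate metadata (causal pathway, observability, lineage) is assumed to be articulated by the model during backward chaining.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    name: str
    subgoal: str
    causal_pathway: str            # backward-chained rationale
    observable_at_decision: bool   # observability check input
    derived_from_outcome: bool     # function-of-outcome check input

def propose(subgoal: str, rejected: set[str]) -> list[Candidate]:
    """Stub for the LLM proposal step; skipping previously rejected names
    is the backtracking signal that redirects the search."""
    table = {
        "founder quality": [
            Candidate("prior_exits", "founder quality",
                      "experienced founders execute faster", True, False),
            Candidate("post_funding_revenue", "founder quality",
                      "revenue reflects execution", True, True),
        ],
        "market timing": [
            Candidate("category_search_growth", "market timing",
                      "rising demand precedes breakout rounds", True, False),
            Candidate("valuation_at_exit", "market timing",
                      "big markets produce big exits", False, True),
        ],
    }
    return [c for c in table.get(subgoal, []) if c.name not in rejected]

def verify(c: Candidate) -> bool:
    """Observability check plus function-of-outcome check."""
    return c.observable_at_decision and not c.derived_from_outcome

def discover(subgoals: list[str]) -> list[str]:
    accepted, rejected = [], set()
    for sg in subgoals:                       # subgoal decomposition
        for cand in propose(sg, rejected):    # backward chaining
            if verify(cand):                  # verification
                accepted.append(cand.name)
            else:
                rejected.add(cand.name)       # explicit backtracking
    return accepted

print(discover(["founder quality", "market timing"]))
# → ['prior_exits', 'category_search_growth']
```

Only the verified candidates reach the downstream scorer; the leaky and unobservable ones are pruned before they cost anything further.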
How CoFEE cuts LLM feature engineering cost by 53%
The headline economic claim of the paper is that structured reasoning recovers more than half the LLM API spend that vanilla prompting wastes on unstructured exploration. The paper benchmarks CoFEE against a vanilla-prompt baseline on an LLM feature discovery task in venture capital, with three numbers anchoring the comparison:
- -53.3% cost. Total spend on the LLM API for the feature discovery phase falls by more than half. Backtracking and verification eliminate the long tail of candidate features that vanilla prompting generates and then has to filter downstream. For teams running LLM feature engineering at scale, this is the difference between feature discovery being a major line item and a rounding error.
- -29% feature count. The candidate set CoFEE proposes is roughly 30% smaller than the vanilla baseline. Smaller candidate sets are cheaper to score downstream, less prone to multiple-comparison false positives, and easier to audit.
- +15.2% success. Features produced by CoFEE are more often retained by the downstream classifier and more often predictive of the held-out outcome. The reasoning scaffolding produces better features, not just fewer of them.
The cost reduction matters in two ways. Practically, it brings large-scale feature discovery into reach for teams that could not previously afford it. Methodologically, it shows that the economics of LLM-based feature engineering are dominated by unstructured exploration; structuring the reasoning explicitly recovers more than half the spend without sacrificing predictive quality. For practitioners optimizing LLM API spend, the implication is that the cheapest improvement to a feature engineering pipeline is usually not a smaller model or a per-token discount; it is structuring the reasoning so the model stops generating features that will be rejected anyway.
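As a rough sanity check on how the three headline numbers compound, here is the back-of-envelope arithmetic under one hypothetical interpretation (that "success" is a per-feature retention rate; the paper's exact accounting may differ):

```python
# Relative-to-baseline ratios taken from the reported headline numbers.
cost = 1 - 0.533        # total LLM spend: -53.3%
features = 1 - 0.29     # candidate count: -29%
success = 1 + 0.152     # per-feature success rate: +15.2% (assumed meaning)

retained = features * success   # relative count of retained features
per_retained = cost / retained  # relative spend per retained feature

print(round(retained, 3), round(per_retained, 3))
# → 0.818 0.571
```

Under this reading, CoFEE retains roughly 18% fewer features overall, but each retained feature costs roughly 43% less to obtain.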
Provider-agnostic by design. The benchmark uses a recent frontier model as the underlying base model, but the cognitive scaffolding is provider-agnostic. The same four behaviors transfer to GPT-4 and successors, Anthropic Claude (Sonnet, Opus, Haiku), Google Gemini, and any modern instruction-tuned LLM. Teams switching providers for cost, latency, or governance reasons keep the same feature discovery contract; only the underlying token pricing changes.
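One way to see why the scaffolding transfers: it only needs a "prompt in, text out" contract from the provider. A minimal sketch, with a stub standing in for any provider SDK call (the names here are ours, not CoFEE's API):

```python
from typing import Protocol

class Completion(Protocol):
    """The only contract the scaffolding depends on."""
    def __call__(self, prompt: str) -> str: ...

def verify_feature(candidate: str, llm: Completion) -> bool:
    """Run an observability check through whichever provider is plugged in."""
    answer = llm(f"Is '{candidate}' observable at prediction time? Answer YES or NO.")
    return answer.strip().upper().startswith("YES")

# Stub adapter; a real one would wrap a provider SDK behind the same signature.
def stub_provider(prompt: str) -> str:
    return "YES" if "prior_exits" in prompt else "NO"

print(verify_feature("prior_exits", stub_provider))        # True
print(verify_feature("post_exit_revenue", stub_provider))  # False
```

Swapping GPT, Claude, or Gemini then means swapping the adapter, not the feature discovery logic.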
How CoFEE detects feature leakage automatically
Target leakage is among the most expensive failure modes in machine learning: a model performs well in training because a feature is implicitly a function of the label, then collapses when deployed because the feature is unavailable or post-outcome at prediction time. Detecting leakage by hand is slow, requires domain knowledge of when each variable is observed, and scales badly when the LLM is generating hundreds of candidate features. CoFEE turns this manual review step into an automated component of the reasoning loop.
The verification behavior runs every candidate feature through two structured checks before the feature is allowed into the downstream pipeline:
- Observability check. Would this feature actually be available at the time of prediction? Variables measured after the prediction date, or variables that depend on actions taken in response to the prediction, fail this check. The model is required to articulate the time-of-decision information set explicitly, which catches a class of leakage that surface-level prompts miss.
- Function-of-outcome check. Is this feature a transformation, proxy, or strict function of the target itself? Features that look predictive but are mechanically derived from the outcome (post-funding revenue, post-acquisition valuation, label-conditional aggregations) fail this check.
Combined, the two checks operationalize what an experienced data scientist does intuitively when reviewing a feature list: ask, for each candidate, “could I actually compute this at decision time, and is it really independent of the answer?” CoFEE makes that review the LLM's job rather than the human's, and runs it on every candidate rather than spot-checking. The downstream effect is fewer leaking features in production, fewer model-collapse incidents, and less emergency retraining to unwind leakage-driven mistakes.
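The two checks can be stated as metadata predicates. The sketch below is illustrative only: the paper operationalizes the checks as verification prompts, not code, and the field and column names are our assumptions.

```python
from datetime import date

def observable_at_prediction(observed_at: date, prediction_date: date) -> bool:
    """Observability check: the value must exist at or before the moment
    the prediction is made."""
    return observed_at <= prediction_date

def function_of_outcome(derived_from: set[str], outcome_cols: set[str]) -> bool:
    """Function-of-outcome check: flag features whose lineage touches the
    label or any post-outcome column."""
    return bool(derived_from & outcome_cols)

outcome_cols = {"exit_valuation", "raised_series_b"}
pred_date = date(2020, 1, 1)

# Passes both checks: observed before the prediction date, no label lineage.
ok = (observable_at_prediction(date(2019, 6, 1), pred_date)
      and not function_of_outcome({"founder_bio"}, outcome_cols))

# Caught by the function-of-outcome check: mechanically derived from the label.
rejected = function_of_outcome({"exit_valuation"}, outcome_cols)

print(ok, rejected)  # True True
```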
For teams whose ML pipelines have been bitten by silent leakage at deployment, this is the part of CoFEE that pays for itself first. The cost reduction and the success-rate gain are measurable; the leakage prevention is the failure mode that would have shown up months later in production, and now does not.
Where CoFEE sits in quant VC: feature discovery for Think-Reason-Learn
Most LLM-native machine learning pipelines, including the rest of the Think-Reason-Learn family, depend on a high-quality set of input features. GPTree walks decision trees whose branches are natural-language questions over those features. Random Rule Forest builds ensembles of YES/NO rules whose predicates are those features. Reasoned Rule Mining calibrates Bayesian evidence over those features. Policy Induction embeds natural-language policies that reference those features inside the prompt.
In all four cases, the quality of the feature set caps the quality of the downstream pipeline. CoFEE is the upstream machinery that improves that cap. Better features feeding a GPTree produce better trees. Better features feeding RRF produce sharper rules. Better features feeding Policy Induction produce policies that generalize more cleanly. The paper's contribution is to make the upstream feature discovery step a structured reasoning process rather than a single-shot generation.
CoFEE also closes a loop with the LLM-Augmented ML thread, where features feed classical learners. In LLM-AR the LLM-generated features become predicates inside a probabilistic logic program; GPT-HTree uses LLM-derived personas as features in a hierarchical clustering pipeline; rare-event prediction uses LLM features inside ensemble learners. Wherever an LLM-generated feature is consumed by a downstream model, CoFEE is the discipline that produces the feature.
Why this matters for quant venture capital
Venture capital is a domain where the predictive variables are mostly buried in unstructured text: founder bios, GitHub histories, publication records, deal memos, market commentary. Classical machine learning has limited reach into that data because the feature engineering required is prohibitive at scale. LLMs change the economics of that step, but only if the feature engineering itself is reliable.
CoFEE's contribution to quant VC is to make LLM feature engineering reliable enough to deploy in production: features that survive leakage checks, that are observable at the time of prediction, and that are reasoned forward from causal pathways rather than scraped from surface-level correlations. Combined with the downstream Think-Reason-Learn pipelines, this turns the unstructured founder-and-market data into structured features that classical and LLM-native learners can both consume.
Limitations
The paper is explicit about several limitations. The cognitive behaviors are operationalized as prompt-level scaffolding, which makes them sensitive to the underlying base model and to the calibration of the verification prompts. The 53.3% cost reduction is measured on a specific feature discovery task in venture capital and is not guaranteed to generalize to every domain, though the underlying mechanism (backtracking plus verification) should transfer. The leakage and observability checks rely on the model's ability to articulate the time-of-prediction information set; in domains where that information set is itself ambiguous, the checks may fail silently. Finally, the paper does not directly compare against learned reasoning-control approaches (e.g., RL fine-tuning of feature discovery agents); a comparison there is a natural next step.
Read the paper
CoFEE: Reasoning Control for LLM-Based Feature Discovery.
Maximilian Westermann, Ben Griffin, Aaron Ontoyin Yin, Zakari Salifu, Yagiz Ihlamur, Kelvin Amoaba, Joseph Ternasky, Fuat Alican, Yigit Ihlamur.
arXiv preprint arXiv:2604.21584, April 2026.
Read on arXiv · Download PDF.
For the broader Think-Reason-Learn family that consumes CoFEE's features, see Think-Reason-Learn, GPTree, Random Rule Forest, Reasoned Rule Mining, and Policy Induction. For the LLM-Augmented ML thread that also consumes LLM-generated features, see LLM-AR, GPT-HTree, Rare-event prediction, Verifiable Reasoning, and Learning What to Ask. For empirical benchmarking, see VCBench.
Authored by members of the Vela team and our collaborators at the University of Oxford. See the full roster of contributors.
For research collaboration on LLM-based feature discovery, reasoning control, cognitive scaffolding for LLM agents, or quant VC pipelines that consume LLM-generated features, email engage@vela.partners.