CoFEE: Reasoning Control for LLM-Based Feature Discovery in Quant VC
The headline result
Compared to vanilla prompting, CoFEE produces 15.2% higher success, 29% fewer features, and a 53.3% reduction in cost. The features get better and there are fewer of them, at half the spend. The paper's claim is that this is not an artifact of clever prompt tuning but the result of treating feature discovery as a reasoning problem and enforcing the reasoning explicitly.
The trick is simple to state and load-bearing in practice: stop asking the LLM for features, ask it to think. Reason backward from the outcome. Decompose the goal into subgoals. Check for leakage at every step. Throw out weak ideas before they become features.
Feature discovery is a reasoning problem
The dominant pattern for using an LLM to engineer features is a single-shot prompt: “here is the dataset, here is the target, list useful features.” The model produces a long list. Some features are predictive, some are leakage in disguise, some are proxies for the target, some are post-outcome variables that would not be available at prediction time. The downstream pipeline cannot tell which is which without additional engineering, and the cost of generating that long list grows linearly with the number of attempts.
The paper's reframing is that feature discovery is fundamentally a reasoning problem: it requires identifying abstractions that are predictive of a target outcome while avoiding leakage, proxies, and post-outcome signals. The leakage check is not an afterthought; it is part of the reasoning. CoFEE is built around making that reasoning visible, structured, and controllable.
This is also where CoFEE differs from earlier automated feature engineering tools. Featuretools, autofeat, and similar libraries are excellent at the structured-data, deterministic-transformation case: window functions, aggregations, and cross-tabulations on relational data. They were not designed for the unstructured-text case where the right features have to be reasoned about, not transformed. CoFEE is the LLM-era counterpart for unstructured inputs: the same ambition (automate the feature engineering step), a different mechanism (reasoning, not transformation), and a different domain of applicability (free-form text and qualitative attributes instead of relational tables).
The four cognitive behaviors CoFEE enforces
CoFEE introduces four cognitive behaviors into the LLM's feature discovery loop. Each is a reasoning move with a clear analog in how a careful human analyst constructs features.
1. Backward chaining from outcomes. Start with the target and work backward toward observable variables that could plausibly cause it. The model is forced to articulate the causal pathway from feature to outcome before the feature is allowed into the candidate set. Backward chaining is what separates “variables merely correlated with the target” from “variables that plausibly drive it.”
2. Subgoal decomposition. Break the prediction goal into subgoals. For a venture-capital outlier prediction problem, the subgoals might be founder quality, market timing, team composition, and execution velocity. Each subgoal becomes a sub-problem with its own feature set. Decomposition forces the model to cover the space of useful features rather than fixating on one cluster.
3. Verification against observability and leakage criteria. Every candidate feature is checked against two questions: (a) would this feature actually be observable at the time of prediction, and (b) is it a function of the outcome itself? If either answer disqualifies the feature, the feature is rejected. This is the cognitive behavior that does the most work on the cost side: unverifiable and leaking features get filtered before they are scored downstream.
4. Explicit backtracking of rejected reasoning paths. When a candidate feature is rejected, the path that produced it is recorded and the model backtracks rather than rolling forward. Traditional prompts let the LLM keep proposing similar features even after one was rejected; CoFEE's backtracking uses each rejection as a signal to redirect the search.
The four behaviors are not independent. Backward chaining generates the candidate space; subgoal decomposition organizes it; verification prunes it; backtracking redirects it. The empirical result is that the candidate set converges faster on features that survive downstream evaluation.
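The interaction of the four behaviors can be sketched as a small control loop. Everything below is illustrative, not the paper's implementation: the propose function is a deterministic stub standing in for an LLM call, and the candidate metadata (causal pathway, observability, lineage) is assumed to be articulated by the model during backward chaining.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    name: str
    subgoal: str
    causal_pathway: str            # backward-chained rationale
    observable_at_decision: bool   # observability check input
    derived_from_outcome: bool     # function-of-outcome check input

def propose(subgoal: str, rejected: set[str]) -> list[Candidate]:
    """Stub for the LLM proposal step; skipping previously rejected names
    is the backtracking signal that redirects the search."""
    table = {
        "founder quality": [
            Candidate("prior_exits", "founder quality",
                      "experienced founders execute faster", True, False),
            Candidate("post_funding_revenue", "founder quality",
                      "revenue reflects execution", True, True),
        ],
        "market timing": [
            Candidate("category_search_growth", "market timing",
                      "rising demand precedes breakout rounds", True, False),
            Candidate("valuation_at_exit", "market timing",
                      "big markets produce big exits", False, True),
        ],
    }
    return [c for c in table.get(subgoal, []) if c.name not in rejected]

def verify(c: Candidate) -> bool:
    """Observability check plus function-of-outcome check."""
    return c.observable_at_decision and not c.derived_from_outcome

def discover(subgoals: list[str]) -> list[str]:
    accepted, rejected = [], set()
    for sg in subgoals:                       # subgoal decomposition
        for cand in propose(sg, rejected):    # backward chaining
            if verify(cand):                  # verification
                accepted.append(cand.name)
            else:
                rejected.add(cand.name)       # explicit backtracking
    return accepted

print(discover(["founder quality", "market timing"]))
# → ['prior_exits', 'category_search_growth']
```

Only the verified candidates reach the downstream scorer; the leaky and unobservable ones are pruned before they cost anything further.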
How CoFEE cuts LLM feature engineering cost by 53%
The headline economic claim of the paper is that structured reasoning recovers more than half the LLM API spend that vanilla prompting wastes on unstructured exploration. The paper benchmarks CoFEE against a vanilla-prompt baseline on an LLM feature discovery task in venture capital, with three numbers anchoring the comparison:
- -53.3% cost. Total spend on the LLM API for the feature discovery phase falls by more than half. Backtracking and verification eliminate the long tail of candidate features that vanilla prompting generates and then has to filter downstream. For teams running LLM feature engineering at scale, this is the difference between feature discovery being a major line item and a rounding error.
- -29% feature count. The candidate set CoFEE proposes is roughly 30% smaller than the vanilla baseline. Smaller candidate sets are cheaper to score downstream, less prone to multiple-comparison false positives, and easier to audit.
- +15.2% success. Features produced by CoFEE are more often retained by the downstream classifier and more often predictive of the held-out outcome. The reasoning scaffolding produces better features, not just fewer of them.
The cost reduction matters in two ways. Practically, it brings large-scale feature discovery into reach for teams that could not previously afford it. Methodologically, it shows that the economics of LLM-based feature engineering are dominated by unstructured exploration; structuring the reasoning explicitly recovers more than half the spend without sacrificing predictive quality. For practitioners optimizing LLM API spend, the implication is that the cheapest improvement to a feature engineering pipeline is usually not a smaller model or a per-token discount; it is structuring the reasoning so the model stops generating features that will be rejected anyway.
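As a rough sanity check on how the three headline numbers compound, here is the back-of-envelope arithmetic under one hypothetical interpretation (that "success" is a per-feature retention rate; the paper's exact accounting may differ):

```python
# Relative-to-baseline ratios taken from the reported headline numbers.
cost = 1 - 0.533        # total LLM spend: -53.3%
features = 1 - 0.29     # candidate count: -29%
success = 1 + 0.152     # per-feature success rate: +15.2% (assumed meaning)

retained = features * success   # relative count of retained features
per_retained = cost / retained  # relative spend per retained feature

print(round(retained, 3), round(per_retained, 3))
# → 0.818 0.571
```

Under this reading, CoFEE retains roughly 18% fewer features overall, but each retained feature costs roughly 43% less to obtain.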
Provider-agnostic by design. The benchmark uses a recent frontier model as the underlying base model, but the cognitive scaffolding is provider-agnostic. The same four behaviors transfer to GPT-4 and successors, Anthropic Claude (Sonnet, Opus, Haiku), Google Gemini, and any modern instruction-tuned LLM. Teams switching providers for cost, latency, or governance reasons keep the same feature discovery contract; only the underlying token pricing changes.
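One way to see why the scaffolding transfers: it only needs a "prompt in, text out" contract from the provider. A minimal sketch, with a stub standing in for any provider SDK call (the names here are ours, not CoFEE's API):

```python
from typing import Protocol

class Completion(Protocol):
    """The only contract the scaffolding depends on."""
    def __call__(self, prompt: str) -> str: ...

def verify_feature(candidate: str, llm: Completion) -> bool:
    """Run an observability check through whichever provider is plugged in."""
    answer = llm(f"Is '{candidate}' observable at prediction time? Answer YES or NO.")
    return answer.strip().upper().startswith("YES")

# Stub adapter; a real one would wrap a provider SDK behind the same signature.
def stub_provider(prompt: str) -> str:
    return "YES" if "prior_exits" in prompt else "NO"

print(verify_feature("prior_exits", stub_provider))        # True
print(verify_feature("post_exit_revenue", stub_provider))  # False
```

Swapping GPT, Claude, or Gemini then means swapping the adapter, not the feature discovery logic.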
How CoFEE detects feature leakage automatically
Target leakage is among the most expensive failure modes in machine learning: a model performs well in training because a feature is implicitly a function of the label, then collapses when deployed because the feature is unavailable or post-outcome at prediction time. Detecting leakage by hand is slow, requires domain knowledge of when each variable is observed, and scales badly when the LLM is generating hundreds of candidate features. CoFEE turns this manual review step into an automated component of the reasoning loop.
The verification behavior runs every candidate feature through two structured checks before the feature is allowed into the downstream pipeline:
- Observability check. Would this feature actually be available at the time of prediction? Variables measured after the prediction date, or variables that depend on actions taken in response to the prediction, fail this check. The model is required to articulate the time-of-decision information set explicitly, which catches a class of leakage that surface-level prompts miss.
- Function-of-outcome check. Is this feature a transformation, proxy, or strict function of the target itself? Features that look predictive but are mechanically derived from the outcome (post-funding revenue, post-acquisition valuation, label-conditional aggregations) fail this check.
Combined, the two checks operationalize what an experienced data scientist does intuitively when reviewing a feature list: ask, for each candidate, “could I actually compute this at decision time, and is it really independent of the answer?” CoFEE makes that review the LLM's job rather than the human's, and runs it on every candidate rather than spot-checking. The downstream effect is fewer leaking features in production, fewer model-collapse incidents, and less emergency retraining to unwind leakage-driven mistakes.
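The two checks can be stated as metadata predicates. The sketch below is illustrative only: the paper operationalizes the checks as verification prompts, not code, and the field and column names are our assumptions.

```python
from datetime import date

def observable_at_prediction(observed_at: date, prediction_date: date) -> bool:
    """Observability check: the value must exist at or before the moment
    the prediction is made."""
    return observed_at <= prediction_date

def function_of_outcome(derived_from: set[str], outcome_cols: set[str]) -> bool:
    """Function-of-outcome check: flag features whose lineage touches the
    label or any post-outcome column."""
    return bool(derived_from & outcome_cols)

outcome_cols = {"exit_valuation", "raised_series_b"}
pred_date = date(2020, 1, 1)

# Passes both checks: observed before the prediction date, no label lineage.
ok = (observable_at_prediction(date(2019, 6, 1), pred_date)
      and not function_of_outcome({"founder_bio"}, outcome_cols))

# Caught by the function-of-outcome check: mechanically derived from the label.
rejected = function_of_outcome({"exit_valuation"}, outcome_cols)

print(ok, rejected)  # True True
```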
For teams whose ML pipelines have been bitten by silent leakage at deployment, this is the part of CoFEE that pays for itself first. The cost reduction and the success-rate gain are measurable; the leakage prevention is the failure mode that would have shown up months later in production, and now does not.
Where CoFEE sits in quant VC: feature discovery for Think-Reason-Learn
Most LLM-native machine learning pipelines, including the rest of the Think-Reason-Learn family, depend on a high-quality set of input features. GPTree walks decision trees whose branches are natural-language questions over those features. Random Rule Forest builds ensembles of YES/NO rules whose predicates are those features. Reasoned Rule Mining calibrates Bayesian evidence over those features. Policy Induction embeds natural-language policies that reference those features inside the prompt.
In all four cases, the quality of the feature set caps the quality of the downstream pipeline. CoFEE is the upstream machinery that improves that cap. Better features feeding a GPTree produce better trees. Better features feeding RRF produce sharper rules. Better features feeding Policy Induction produce policies that generalize more cleanly. The paper's contribution is to make the upstream feature discovery step a structured reasoning process rather than a single-shot generation.
CoFEE also closes a loop with the LLM-Augmented ML thread, where features feed classical learners. In LLM-AR the LLM-generated features become predicates inside a probabilistic logic program; GPT-HTree uses LLM-derived personas as features in a hierarchical clustering pipeline; rare-event prediction uses LLM features inside ensemble learners. Wherever an LLM-generated feature is consumed by a downstream model, CoFEE is the discipline that produces the feature.
Why this matters for quant venture capital
Venture capital is a domain where the predictive variables are mostly buried in unstructured text: founder bios, GitHub histories, publication records, deal memos, market commentary. Classical machine learning has limited reach into that data because the feature engineering required is prohibitive at scale. LLMs change the economics of that step, but only if the feature engineering itself is reliable.
CoFEE's contribution to quant VC is to make LLM feature engineering reliable enough to deploy in production: features that survive leakage checks, that are observable at the time of prediction, and that are reasoned forward from causal pathways rather than scraped from surface-level correlations. Combined with the downstream Think-Reason-Learn pipelines, this turns the unstructured founder-and-market data into structured features that classical and LLM-native learners can both consume.
Limitations
The paper is explicit about several limitations. The cognitive behaviors are operationalized as prompt-level scaffolding, which makes them sensitive to the underlying base model and to the calibration of the verification prompts. The 53.3% cost reduction is measured on a specific feature discovery task in venture capital and is not guaranteed to generalize to every domain, though the underlying mechanism (backtracking plus verification) should transfer. The leakage and observability checks rely on the model's ability to articulate the time-of-prediction information set; in domains where that information set is itself ambiguous, the checks may fail silently. Finally, the paper does not directly compare against learned reasoning-control approaches (e.g., RL fine-tuning of feature discovery agents); a comparison there is a natural next step.
Read the paper
CoFEE: Reasoning Control for LLM-Based Feature Discovery.
Maximilian Westermann, Ben Griffin, Aaron Ontoyin Yin, Zakari Salifu, Yagiz Ihlamur, Kelvin Amoaba, Joseph Ternasky, Fuat Alican, Yigit Ihlamur.
arXiv preprint arXiv:2604.21584, April 2026.
Read on arXiv · Download PDF.
For the broader Think-Reason-Learn family that consumes CoFEE's features, see Think-Reason-Learn, GPTree, Random Rule Forest, Reasoned Rule Mining, and Policy Induction. For the LLM-Augmented ML thread that also consumes LLM-generated features, see LLM-AR, GPT-HTree, Rare-event prediction, Verifiable Reasoning, and Learning What to Ask. For empirical benchmarking, see VCBench.
Authored by members of the Vela team and our collaborators at the University of Oxford. See the full roster of contributors.
For research collaboration on LLM-based feature discovery, reasoning control, cognitive scaffolding for LLM agents, or quant VC pipelines that consume LLM-generated features, email engage@vela.partners.