StatsPAI

What is StatsPAI about?

StatsPAI delivers a comprehensive workbench for applied causal inference and econometrics in Python. It mirrors familiar Stata/R commands (e.g., regress, ivregress, reghdfe, csdid, rdrobust, synth) while exposing Pythonic result objects that can be inspected, plotted, and exported directly to agents or notebooks.

How to use StatsPAI?

Install the package via pip:
```
pip install statspai
```

Import the library and load a teaching dataset:

import statspai as sp
card = sp.datasets.card_1995()

Call a Stata‑style routine:

result = sp.regress("lwage ~ educ + exper", data=card, robust="hc1")
print(result.summary())

Access tidy data frames, plots, or export tables using methods such as .tidy(), .plot(), sp.outreg2(), and sp.modelsummary().

Key Features of StatsPAI

- Unified API covering OLS, IV/2SLS, high-dimensional FE, staggered DiD, regression discontinuity, synthetic control, matching, and more.
- Result objects with .summary(), .tidy(), .plot(), .to_latex(), .to_docx(), .to_agent_summary().
- Validation tier exposing validation_status, parity matrices, and cross-language certification against Stata/R references.
- Teaching datasets (Card, Callaway-Sant'Anna, Lee, LaLonde, California Prop 99, etc.) for quick experimentation.
- Interactive plot editor (sp.interactive) that generates reproducible Matplotlib code.
- Companion skill repositories for agents to automate empirical research pipelines.
- MCP server support enabling model-context-protocol interactions.

Use Cases of StatsPAI

- Academic researchers replacing Stata/R scripts with a single Python codebase.
- Economists performing policy evaluation (minimum-wage effects, tax reforms, public health interventions) with reproducible notebooks.
- LLM or autonomous agents that need structured, machine-readable output for downstream decision-making.
- Teaching causal inference courses using ready-made datasets and familiar command syntax.
- Teams exporting regression tables directly to Word/Excel for reports.

FAQ

Q: Do I need a Stata or R license to use StatsPAI? A: No. All algorithms are implemented in Python; Stata and R outputs serve only as references for parity validation.

Q: How can I check if a function has been certified? A: Use sp.describe_function("function_name")["validation_status"] or sp.list_functions(validation_status="certified").

Q: Can I customize the output format of tables? A: Yes. Functions like sp.outreg2, sp.modelsummary, and .to_latex() accept arguments to control file type, styling, and column selection.

Q: Is Bayesian analysis supported? A: StatsPAI focuses on frequentist econometrics. For Bayesian causal inference, consider complementary packages such as CausalPy.

Q: How do I integrate StatsPAI with an MCP server? A: After installing the library, the MCP server can be launched according to the package’s documentation; the library exposes metadata (sp.function_schema()) that the server consumes.


StatsPAI: a Python-native Stata/R replacement for applied causal inference

StatsPAI is for empirical researchers who would normally jump between Stata, R, and Python. Its goal is to make common Stata/R econometrics and causal-inference workflows feel native in Python: load a dataset, estimate a model, inspect diagnostics, export tables, and hand the result to an agent or notebook without leaving one API.

It is meant to be a practical replacement path for new Python-first work:

- Stata-style routines: regress, ivregress, reghdfe, csdid, rdrobust, synth, psmatch2, outreg2.
- R-style routines: lm, fixest, did, rdrobust, Synth, DoubleML, MatchIt, modelsummary, broom.
- Python-native outputs: .summary(), .tidy(), .plot(), .to_latex(), .to_docx(), .to_agent_summary() where supported by the result object.
- Companion Stata tooling: our own stata-code can work with StatsPAI so agents can understand existing Stata workflows, translate them into Python, and cross-check results more smoothly.
- Companion skill repos: Auto-Empirical-Research-Skills, AER-Skills, Awesome-Journal-Skills, and Paper-WorkFlow can work alongside StatsPAI and an agent as the methods, journal, manuscript, and reproducibility skill layer.

StatsPAI is not a promise that every Stata/R command is bit-for-bit identical. When exact external parity matters, use the validation_status metadata, the reference-parity tests, and sp.cross_validate to see what has been certified for that estimator.

Install

pip install statspai

Then:

import statspai as sp

print(sp.datasets.list_datasets()[["name", "design", "n_obs"]].head())

StatsPAI ships teaching datasets such as Card (1995), Callaway-Sant'Anna mpdta, Lee (2008) RD, LaLonde/NSW, and California Proposition 99. The examples below run offline after installation.

At a glance: 1,139 registered functions across 87 submodules; 339k LOC (core) + 182k LOC (tests). Run python scripts/registry_stats.py to reproduce these numbers.

If You Come From Stata Or R

| What you used before | Stata / R examples | StatsPAI entry point |
| --- | --- | --- |
| OLS / robust SE | `reg y x, vce(robust)` / `lm()` + `sandwich` | `sp.regress(..., robust="hc1")` |
| IV / 2SLS | `ivregress 2sls` / `AER::ivreg()` | `sp.ivreg("y ~ (d ~ z) + x", data=df)` |
| High-dimensional FE | `reghdfe` / `fixest::feols()` | `sp.feols("y ~ x |
| Staggered DiD | `csdid` / `did::att_gt()` | `sp.callaway_santanna()` + `sp.aggte()` |
| Regression discontinuity | `rdrobust` / `rdrobust::rdrobust()` | `sp.rdrobust()` |
| Synthetic control | `synth` / `Synth::synth()` | `sp.synth()` |
| Matching / PSM | `psmatch2` / `MatchIt` | `sp.psmatch2()` and matching helpers |
| Publication tables | `outreg2`, `esttab` / `modelsummary` | `sp.outreg2()`, `sp.modelsummary()` |
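
The `robust="hc1"` option in the OLS row refers to the HC1 heteroskedasticity-robust (sandwich) variance, the same small-sample correction Stata applies with `vce(robust)`. As a minimal sketch of that formula for a one-regressor model, written in plain Python and independent of StatsPAI's own implementation (the helper name `hc1_slope_se` and the toy data are illustrative assumptions, not part of the package):

```python
import math

def hc1_slope_se(x, y):
    """Return (slope, HC1 standard error) for y = a + b*x fitted by OLS."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    dx = [xi - x_bar for xi in x]  # centered regressor
    sxx = sum(d * d for d in dx)
    slope = sum(d * (yi - y_bar) for d, yi in zip(dx, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # HC0 variance of the slope: squared residuals weighted by the
    # squared centered regressor, divided by Sxx squared.
    hc0_var = sum((d * e) ** 2 for d, e in zip(dx, resid)) / sxx ** 2
    # HC1 rescales HC0 by n / (n - k), with k = 2 estimated parameters.
    hc1_var = hc0_var * n / (n - 2)
    return slope, math.sqrt(hc1_var)

# Toy data, made up for illustration.
slope, se = hc1_slope_se([0, 1, 2, 3], [0, 1, 2, 4])
print(round(slope, 4), round(se, 4))
```

With several regressors the same idea applies in matrix form; the scalar version is shown only because it can be checked by hand.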

Compared With Other Python Causal Packages

StatsPAI is meant to be the broad Stata/R-style workbench for applied empirical research, not only a single modeling family.

| Package | Best fit | Where StatsPAI is different |
| --- | --- | --- |
| `causallib` | Observational causal inference with a scikit-learn-style workflow: IPW, matching, standardization, doubly robust estimation, and evaluation. | StatsPAI is broader for Stata/R migration: OLS, IV, high-dimensional FE, DiD, RD, synthetic control, matching, diagnostics, validation metadata, and publication-table export in one API. |
| `CausalPy` | Bayesian causal analysis for quasi-experimental settings, built around PyMC models, uncertainty, and visual diagnostics. | StatsPAI prioritizes familiar Stata/R econometrics commands, frequentist workflows, cross-language parity evidence, bundled teaching datasets, and agent-ready result summaries. |

Use causallib when you mainly want sklearn-style treatment-effect pipelines. Use CausalPy when you want Bayesian causal modeling in PyMC. Use StatsPAI when you want one Python package to replace the everyday Stata/R empirical workflow.

Beginner Examples With Results

The outputs below are rounded from the bundled examples in this repository using StatsPAI 1.20.0.

1. OLS: the first `regress` / `lm` replacement

Question: how much higher is log wage for one more year of schooling in the Card (1995) teaching dataset?

import statspai as sp

card = sp.datasets.card_1995()
ols = sp.regress("lwage ~ educ + exper", data=card, robust="hc1")
print(ols.summary())

Result:

Model: OLS
Dependent Variable: lwage

           Coefficient  Std. Error  t-statistic  P>|t|
Intercept       4.9060      0.0599      81.8392 0.0000
educ            0.1088      0.0042      25.8730 0.0000
exper           0.0164      0.0014      11.3496 0.0000

R-squared: 0.2102

Read it like a Stata/R regression table: in this replica, one additional year of schooling is associated with about 0.109 higher log wage, before dealing with endogeneity.
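
Because the outcome is in logs, the `educ` coefficient can be converted into an approximate percentage change. A small sketch using only the rounded numbers printed in the table above (the 1.96 critical value assumes a normal approximation):

```python
import math

coef, se = 0.1088, 0.0042  # educ row of the OLS table

# A log-point coefficient b corresponds to a 100 * (exp(b) - 1) percent change.
pct = 100 * (math.exp(coef) - 1)

# Normal-approximation 95% confidence interval on the log scale.
z = 1.96
ci_low, ci_high = coef - z * se, coef + z * se

print(f"{pct:.1f}% higher wage per extra year of schooling")
print(f"95% CI (log points): [{ci_low:.4f}, {ci_high:.4f}]")
```

For small coefficients the log-point value and the percentage are close, which is why 0.109 is often read informally as "about 11 percent".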

2. IV / 2SLS: replace `ivregress 2sls` or `AER::ivreg`

Question: instrument education with proximity to a four-year college (nearc4).

import statspai as sp

card = sp.datasets.card_1995()
iv = sp.ivreg(
    "lwage ~ (educ ~ nearc4) + exper + expersq + black + south + smsa",
    data=card,
)
print(iv.summary())

Result:

Model: IV-2SLS
Dependent Variable: lwage

           Coefficient  Std. Error  t-statistic  P>|t|
educ            0.1418      0.0188       7.5606 0.0000

Model Diagnostics:
First-stage F (educ): 159.8305
Partial R2 (educ)   : 0.0505
Hausman p-value     : 0.0322

StatsPAI prints the coefficient and the diagnostics you would usually collect with separate post-estimation calls.
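
To see what the IV estimate and the first-stage F are measuring, here is a self-contained sketch on simulated data. It does not call StatsPAI: it uses the textbook result that with one instrument and no covariates, 2SLS reduces to cov(z, y) / cov(z, x). The data-generating process, the sample size, and all variable names are assumptions chosen for illustration.

```python
import random

random.seed(0)
n = 20_000
true_effect = 2.0

z, x, y = [], [], []
for _ in range(n):
    zi = 1.0 if random.random() < 0.5 else 0.0  # binary instrument
    ui = random.gauss(0, 1)                      # unobserved confounder
    xi = 1.0 * zi + ui + random.gauss(0, 1)      # endogenous regressor
    yi = true_effect * xi + 3.0 * ui + random.gauss(0, 1)
    z.append(zi)
    x.append(xi)
    y.append(yi)

def cov(a, b):
    """Covariance with 1/n normalization (the scale cancels in ratios)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)

ols = cov(x, y) / cov(x, x)  # contaminated by the confounder
iv = cov(z, y) / cov(z, x)   # just-identified 2SLS (Wald) estimator

# First stage: regress x on z. With one instrument, F equals t squared.
pi_hat = cov(z, x) / cov(z, z)
mz, mx = sum(z) / n, sum(x) / n
resid_var = sum(
    (xi - mx - pi_hat * (zi - mz)) ** 2 for xi, zi in zip(x, z)
) / (n - 2)
first_stage_f = pi_hat ** 2 / (resid_var / (n * cov(z, z)))

print(f"OLS: {ols:.3f}  IV: {iv:.3f}  first-stage F: {first_stage_f:.0f}")
```

In this simulation OLS overstates the effect because the confounder moves both the regressor and the outcome, while the IV estimate stays close to the true value of 2.0. A large first-stage F indicates the instrument is strongly related to the regressor.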

3. Staggered DiD: replace `csdid` or R `did`

Question: what is the average minimum-wage effect on teen employment in the Callaway-Sant'Anna mpdta example?

import statspai as sp

mp = sp.datasets.mpdta()
gt = sp.callaway_santanna(
    data=mp,
    y="lemp",
    t="year",
    i="countyreal",
    g="first_treat",
)
overall = sp.aggte(gt, type="simple", bstrap=False)
print(overall.summary())

Result:

Callaway and Sant'Anna (2021) - aggte[simple]

ATT:        -0.032977
Std. Error:  0.005493
95% CI:     [-0.043742, -0.022211]
P-value:     0.0000
Observations: 2,500

The headline estimate is negative and statistically precise in this bundled replica.
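
The building block behind this kind of estimate is the group-time effect ATT(g, t). As a hedged sketch of the unconditional version with a never-treated comparison group, ATT(g, t) is the average outcome change from period g-1 to t among units first treated in g, minus the same change among never-treated units. The toy panel and the helper `att_gt` below are illustrative assumptions and do not reflect how StatsPAI implements or configures the estimator.

```python
from statistics import mean

# Toy balanced panel: unit -> (first treatment year or None, {year: outcome}).
panel = {
    "a": (2004, {2003: 1.0, 2004: 1.5, 2005: 1.7}),
    "b": (2004, {2003: 2.0, 2004: 2.3, 2005: 2.9}),
    "c": (None, {2003: 1.0, 2004: 1.2, 2005: 1.4}),
    "d": (None, {2003: 3.0, 2004: 3.2, 2005: 3.4}),
}

def att_gt(panel, g, t):
    """ATT(g, t) with a never-treated comparison group and no covariates."""
    base = g - 1  # last pre-treatment period for cohort g

    def avg_change(cohort):
        return mean(
            y[t] - y[base] for first, y in panel.values() if first == cohort
        )

    return avg_change(g) - avg_change(None)

att_2004 = att_gt(panel, 2004, 2004)
att_2005 = att_gt(panel, 2004, 2005)

# With a single treated cohort, a simple overall summary is the plain
# average of its post-treatment cells.
simple_att = mean([att_2004, att_2005])
print(round(att_2004, 3), round(att_2005, 3), round(simple_att, 3))
```

With several cohorts, the overall summary weights each post-treatment cell by cohort size rather than averaging equally.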

4. Regression discontinuity: replace `rdrobust`

Question: is there an incumbent advantage at the zero-margin cutoff in the Lee (2008) Senate election design?

import statspai as sp

lee = sp.datasets.lee_2008_senate()
rd = sp.rdrobust(data=lee, y="voteshare_next", x="margin", c=0)
print(rd.summary())

Result:

Sharp RD Estimation

RD Effect:   0.061599
Std. Error:  0.022662
95% CI:     [0.017183, 0.106015]
P-value:     0.0066

Bandwidth H: 0.042287
N Effective Left: 440
N Effective Right: 443

The robust bias-corrected RD estimate is about 0.062 on the vote-share scale, or roughly 6 percentage points if `voteshare_next` is measured as a share between 0 and 1.
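
For intuition about what a sharp RD estimate is, the sketch below fits a triangular-kernel local linear regression on each side of the cutoff and takes the difference of the two fitted values at the cutoff. This is the conventional local linear estimator only; it does not include the bias correction, robust standard errors, or data-driven bandwidth selection that `rdrobust` is known for. The helper name, the bandwidth, and the noiseless toy data are assumptions for illustration.

```python
def local_linear_at_cutoff(xs, ys, cutoff, h, side):
    """Triangular-kernel weighted linear fit on one side, evaluated at the cutoff."""
    pts = []
    for x, y in zip(xs, ys):
        d = x - cutoff
        on_side = d >= 0 if side == "right" else d < 0
        if on_side and abs(d) < h:
            pts.append((d, y, 1 - abs(d) / h))  # (centered x, y, kernel weight)
    sw = sum(w for _, _, w in pts)
    mx = sum(w * d for d, _, w in pts) / sw
    my = sum(w * y for _, y, w in pts) / sw
    sxx = sum(w * (d - mx) ** 2 for d, _, w in pts)
    sxy = sum(w * (d - mx) * (y - my) for d, y, w in pts)
    slope = sxy / sxx
    return my - slope * mx  # intercept of the centered fit = value at cutoff

# Noiseless toy data with a known jump of 0.06 at x = 0.
xs = [i / 100 for i in range(-50, 51)]
ys = [0.5 + 0.3 * x + (0.06 if x >= 0 else 0.0) for x in xs]

h = 0.2  # bandwidth chosen by hand for the sketch
rd_effect = (
    local_linear_at_cutoff(xs, ys, 0.0, h, "right")
    - local_linear_at_cutoff(xs, ys, 0.0, h, "left")
)
print(round(rd_effect, 4))
```

Because the toy data are exactly linear on each side, the estimator recovers the jump exactly; with real data the result depends on the bandwidth and kernel.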

5. Synthetic control: replace Stata/R `synth`

Question: how did California's Proposition 99 affect cigarette sales?

import statspai as sp

prop99 = sp.datasets.california_prop99()
sc = sp.synth(
    data=prop99,
    outcome="cigsale",
    unit="state",
    time="year",
    treated_unit="California",
    treatment_time=1989,
)
print(sc.summary())

Result:

Synthetic Control Method

ATT:        -13.085166
Std. Error:  4.164718
95% CI:     [-21.247862, -4.922469]
P-value:     0.0789

Active donor weights:
Montana  0.8420
Nevada   0.1580

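The estimate says California's per-capita cigarette sales were about 13 packs lower than those of its synthetic control after the intervention in this replica. Note that the reported p-value (0.0789) is above the conventional 0.05 level.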
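
Donor weights like the ones printed above come from a constrained fit: non-negative weights that sum to one and make the weighted donors track the treated unit before the intervention. As a simplified sketch with only two donors, where the problem has a closed form, and matching on pre-period outcomes only (full synthetic control methods also match on predictors), with made-up numbers that are not the Proposition 99 data and a hypothetical helper name:

```python
def two_donor_weight(treated, donor_a, donor_b):
    """Weight on donor_a in [0, 1] minimizing pre-period squared error."""
    num = sum((t - b) * (a - b) for t, a, b in zip(treated, donor_a, donor_b))
    den = sum((a - b) ** 2 for a, b in zip(donor_a, donor_b))
    # The objective is convex in the weight, so clipping the unconstrained
    # minimizer to [0, 1] gives the constrained optimum.
    return min(1.0, max(0.0, num / den))

# Toy pre-period series.
donor_a_pre = [100.0, 98.0, 97.0, 95.0]
donor_b_pre = [120.0, 121.0, 119.0, 118.0]
treated_pre = [0.75 * a + 0.25 * b for a, b in zip(donor_a_pre, donor_b_pre)]

w = two_donor_weight(treated_pre, donor_a_pre, donor_b_pre)

# Post-period gap between the treated unit and its synthetic counterpart.
donor_a_post = [94.0, 93.0]
donor_b_post = [117.0, 116.0]
treated_post = [90.0, 88.0]
synthetic_post = [w * a + (1 - w) * b for a, b in zip(donor_a_post, donor_b_post)]
gaps = [t - s for t, s in zip(treated_post, synthetic_post)]
att = sum(gaps) / len(gaps)

print(round(w, 3), round(att, 3))
```

The ATT here is the average post-period gap. With more donors the weights require a constrained optimizer rather than a closed form.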

Interactive Plot Editing

If you miss Stata's Graph Editor, use sp.interactive(fig) on any matplotlib figure returned by StatsPAI. It opens a Jupyter editing panel with a live preview, so beginners can adjust a figure without learning every matplotlib option first.

What it is for:

- change titles, labels, fonts, colors, markers, line widths, grids, legends, axis limits, figure size, and export DPI;
- switch among publication-oriented themes, including academic, ggplot-like, FiveThirtyEight-style, and dark presentation styles;
- keep the data layer protected while editing cosmetic elements;
- export reproducible Python code for the edits, so the final figure can be regenerated from a script instead of being only a manual screenshot.

import statspai as sp

mp = sp.datasets.mpdta()
gt = sp.callaway_santanna(data=mp, y="lemp", t="year",
                          i="countyreal", g="first_treat")
agg = sp.aggte(gt, type="dynamic", bstrap=False)
fig, ax = sp.ggdid(agg)

editor = sp.interactive(fig)   # edit the plot in Jupyter
print(editor.generate_code())  # copy reproducible matplotlib edits

The screenshot above shows the intended workflow: preview on one side, editing controls on the other, and code export for reproducibility.

Everyday Workflow

import statspai as sp

card = sp.datasets.card_1995()
r1 = sp.regress("lwage ~ educ + exper", data=card, robust="hc1")
r2 = sp.ivreg("lwage ~ (educ ~ nearc4) + exper", data=card)

print(r1.summary())                         # human-readable table
print(r1.tidy().head())                      # broom-style dataframe
sp.modelsummary(r1, r2, output="table.docx") # Word table
sp.outreg2(r1, r2, filename="results.xlsx")  # Stata-style export


Validation And Agent Use

StatsPAI has a large API surface, so validation status matters.

import statspai as sp

print(sp.describe_function("ivreg")["validation_status"])
print(sp.list_functions(validation_status="certified")[:5])

Use the validation metadata to distinguish:

- certified functions with external numerical evidence;
- validated functions with internal or published-reference checks;
- API-stable functions whose interface is stable but whose exact Stata/R parity may be design-dependent;
- experimental functions for frontier workflows.

Agent-facing metadata is available through sp.list_functions(), sp.describe_function(), and sp.function_schema().

Cross-language parity, made queryable

The validation tier above has a richer, auditable backing: a parity index where every verified function records what it was aligned against, to what tolerance, on which test, and how closely it matched. Each row traces to a committed test artifact (the pinned StatsPAI ↔ R ↔ Stata harness, version-locked via renv.lock + per-run provenance) — nothing is asserted from memory.

import statspai as sp

sp.parity_status("feols")
# {'status': 'bit-exact', 'reference': 'fixest::feols',
#  'reference_versions': {'R': '...4.5.2...', 'fixest': '0.14.0'},
#  'tolerance': 'rel_est<=1e-06, rel_se<=1e-06', 'headline': {...}, 'test': [...]}

sp.parity_summary()              # honest coverage counts (verified vs unverified)
sp.parity_matrix(status="bit-exact")

Grades: bit-exact (machine tolerance vs a named R/Stata reference), aligned (documented looser tolerance), analytical-only (recovers a known DGP truth), external-replication (published-paper numbers), and unverified (registered but no parity evidence attached yet — the honest gap). The full, auto-generated matrix is published at docs/parity.md.
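
The tolerance string shown in the `sp.parity_status` output (`rel_est<=1e-06, rel_se<=1e-06`) reads as a relative-difference check on the estimate and its standard error. The sketch below shows one plausible way such a check could work. It is an assumption about the idea, not StatsPAI's actual harness, and the helper names and numbers are hypothetical.

```python
def rel_diff(value, reference):
    """Relative difference, falling back to absolute when the reference is 0."""
    if reference == 0:
        return abs(value)
    return abs(value - reference) / abs(reference)

def within_tolerance(est, se, ref_est, ref_se, tol=1e-6):
    """True when both the estimate and its standard error match the reference."""
    return rel_diff(est, ref_est) <= tol and rel_diff(se, ref_se) <= tol

# Hypothetical numbers for illustration only.
close_match = within_tolerance(0.1088, 0.0042, 0.10880001, 0.0042)
loose_match = within_tolerance(0.1088, 0.0042, 0.1090, 0.0042)
print(close_match, loose_match)
```

Checking the standard error as well as the point estimate matters because two implementations can agree on coefficients while differing in their variance formulas or degrees-of-freedom corrections.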

Changelog

Release notes live outside the README:

- CHANGELOG.md for the full version history.
- Docs changelog page for the rendered documentation site.

The README is intentionally focused on first-time users.

Reviewers

StatsPAI is under JOSS review. Reviewers can start with:

Citation

If you use StatsPAI in research, cite the package and the underlying method papers for each estimator. sp.citation() returns the package citation, and many result objects expose estimator-level citation helpers.

@software{wang2026statspai,
  author  = {Wang, Biaoyue and Rozelle, Scott},
  title   = {StatsPAI: Validation-Tiered Causal Inference and
             Econometrics Workflows for Python},
  year    = {2026},
  version = {1.20.0},
  url     = {https://github.com/brycewang-stanford/StatsPAI}
}

License

MIT. See LICENSE.
