Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Elex 4549 add model flexibility #109

Open
wants to merge 3 commits into
base: develop
Choose a base branch
from

Conversation

lennybronner
Copy link
Collaborator

@lennybronner lennybronner commented Sep 25, 2024

Description

Instead of being stuck with OLS as our base model for the bootstrap model, this PR gives us flexibility to use different models. We were wed to OLS because we had a quick way of computing the leave-one-out-residual, which we needed for the bootstrap, since the training residual would be biased towards zero. We now use k-fold cross validation to get an estimate for the leave-one-out residual (k-fold residual will be greater than or equal to loo residual, so this is if anything more conservative). We also add the ability to play with OLS vs. QR models. In the future, we may expand the models that are being used here.

This needs this branch of elex-solver, which allows us to calculate the k-fold residual. Tests will fail until this branch is merged/released.

Jira Ticket

https://arcpublishing.atlassian.net/browse/ELEX-4549

Test Steps

elexmodel 2017-11-07_VA_G --estimands=margin --office_id=G --geographic_unit_type=county --pi_method bootstrap  --features baseline_normalized_margin --percent_reporting 30 --aggregates postal_code --aggregates unit --model_parameters '{"model_type": "QR"}'

vs

elexmodel 2017-11-07_VA_G --estimands=margin --office_id=G --geographic_unit_type=county --pi_method bootstrap  --features baseline_normalized_margin --percent_reporting 30 --aggregates postal_code --aggregates unit --model_parameters '{"model_type": "OLS"}'

@lennybronner lennybronner requested a review from a team as a code owner September 25, 2024 02:19
@dmnapolitano dmnapolitano self-assigned this Sep 30, 2024
@@ -5,7 +5,7 @@

INSTALL_REQUIRES = (
"click>=8.1",
"elex-solver>=2.1.1",
"elex-solver @ git+https://github.com/washingtonpost/elex-solver.git@updating-residuals",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed this temporarily so we can test / run the tests here in github 🎉

Copy link
Contributor

@dmnapolitano dmnapolitano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks pretty good! 🎉 🎉 I have some questions here and on the elex-solver PR, and I was also wondering if you could add documentation about the new solver options to the README? 🤔 🎉

model.fit(x, y, weights=weights, taus=0.5, **model_kwargs)
else:
raise BootstrapElectionModelException(
f"Please use defined model type 'QR' or 'OLS', input is: {self.model_type}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is such a polite and helpful error message! 😄 🎉

model_kwargs = {"fit_intercept": True, "regularize_intercept": False, "n_feat_ignore_reg": 1}
if self.model_type == "OLS":
# if we run OLS we use cross validation to find the optimal lambda
# TODO: why do we not do this for other models also
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lol wait, why do we not do this for other models also? 🤔

Is this a "how can I write code to make this happen" question or a theoretical question (or both)? 🤔

y_train,
weights_train,
aggregate_indicator_train,
K=5,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious, how did you pick the default of 5 here (and elsewhere)? I mean I've done (and seen) 3-fold, 5-fold, 10-fold CV but usually my choice there depends on the number of examples in the data set and how much computing power I have 🤔


# we cannot just apply ols_y_B/old_z_B to the test units because that would be missing
# our contest level random effect
# TODO: DO WE NEVER DO ANYTHING WITH THE OUTPUT
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants