Add option to summarize replicate z-scores from controls as medians #38

Open
gwaybio opened this issue Feb 11, 2021 · 8 comments · Fixed by #39
Labels
enhancement New feature or request

Comments

@gwaybio
Member

gwaybio commented Feb 11, 2021

Currently, the only option is to summarize replicate z-scores as a mean. I need to add an option to summarize them as a median.

@gwaybio
Member Author

gwaybio commented Feb 12, 2021

@shntnu noted in https://github.com/broadinstitute/neuronal-cell-painting/issues/6#issuecomment-767719260 that we should give more thought to using negative control correlation z-scores to transform replicate correlations because of skewed distributions.

It is also possible that Amoolya brought this up yesterday, and suggested using median instead of mean.

I thought that she was referring to step 4 in https://raw.githubusercontent.com/broadinstitute/grit-benchmark/main/media/grit_calculation.png. This is what I added as an option in #39.

@shntnu - is your interpretation different?

To me, z-scoring is suitable for grit, even in the presence of skewed pairwise correlations. If we think about z-scoring as a way to normalize the correlations and interpret how many perturbations will exist above/below the distribution then we are burned. However, if we think about using the z-score to find where the mean/median replicate correlation is in respect to the controls (which we do) then we are ok. Biologically, negative controls will have variance and, in a CRISPR experiment, potentially different off-target effects. Comparing replicate correlations to this potentially skewed distribution will help us know how different, on average, the replicates are from all controls. This underscores the importance of quality controls, which is true for all experiments.
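
For concreteness, here is a minimal sketch of the idea above: z-score each replicate correlation against the control correlations, then summarize with a mean or a median. The function name `grit_sketch`, the `summary` parameter, and the use of the sample standard deviation are assumptions for illustration, not necessarily what #39 implements.

```python
import statistics


def grit_sketch(replicate_corrs, control_corrs, summary="mean"):
    """Summarize replicate correlations as z-scores relative to controls.

    Hypothetical helper: names and the sample-std choice are assumptions.
    """
    control_mean = statistics.mean(control_corrs)
    control_std = statistics.stdev(control_corrs)  # sample standard deviation

    # Where does each replicate correlation sit relative to the controls?
    z_scores = [(r - control_mean) / control_std for r in replicate_corrs]

    if summary == "mean":
        return statistics.mean(z_scores)
    if summary == "median":
        return statistics.median(z_scores)
    raise ValueError(f"unknown summary: {summary!r}")


# Toy numbers only: two well-correlated replicates and one weak one.
replicates = [0.8, 0.9, 0.1]
controls = [0.0, 0.1, -0.1, 0.2, -0.2]
grit_mean = grit_sketch(replicates, controls, summary="mean")
grit_median = grit_sketch(replicates, controls, summary="median")
```

With these toy numbers the single weak replicate pulls the mean summary down while the median summary is unaffected by it, so the two summaries can differ when replicate correlations are skewed.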

@gwaybio
Member Author

gwaybio commented Feb 12, 2021

a nice succinct summary on uses and misuses of z-scores https://influentialpoints.com/Training/z_scores_use_and_misuse.htm

@gwaybio gwaybio reopened this Feb 12, 2021
@AnneCarpenter
Member

I will need to rely on other experts here - if you don't think you have sufficient ones chiming in, please ask for help in finding them! Sounds like Amoolya may be all you need, if you can have her read drafts sooner rather than later it will help I'm sure.

Remember how you say that Grit ranges between -1 and + 1? If that is a mathematical relationship and not just by chance, then I suspect that range might only hold true for experiments where the distribution is relatively normal and not skewed too much. Just speculation though.

@gwaybio
Member Author

gwaybio commented Feb 12, 2021

> I will need to rely on other experts here

This doesn't give me confidence! (Just FYI) Do you mean you'd like a consensus opinion in order to move forward? If so, what does a consensus opinion look like?

> Remember how you say that Grit ranges between -1 and + 1? If that is a mathematical relationship and not just by chance, then I suspect that range might only hold true for experiments where the distribution is relatively normal and not skewed too much. Just speculation though.

We can assume that random data follows a normal distribution. We cannot assume that every profile without signal will appear random. That is the cost of doing business :)

@AnneCarpenter
Member

I just meant please do not count on ME to judge whether decisions on this are sound. If you, or other experts you trust, feel confident about it then carry on :)

@gwaybio
Member Author

gwaybio commented Feb 12, 2021

Gotcha! Thanks for clarifying. I choose @shntnu to trust :)

@gwaybio
Member Author

gwaybio commented Feb 12, 2021

some more info: in broadinstitute/grit-benchmark#22 I calculated grit using mean and median summary, and plotted the results (pasted below). The y axis is grit calculated with median, and the x axis is grit calculated with mean. The difference is very minor. We observe slightly elevated grit scores using median, potentially because poorly targeting guides reduce the mean score.

[Figure: cell_health_grit_metric_summary_comparison]

Across comparisons, the Spearman rank correlation between grit calculated with mean and grit calculated with median ranges from 0.9798 to 0.9835 (see this notebook).
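
For readers who want to repeat this kind of comparison, below is a small standard-library sketch of a Spearman rank correlation (Pearson correlation of average ranks). The helper names and the toy scores are made up for illustration; they are not the values from the notebook.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy grit scores per perturbation (illustrative only).
toy_grit_mean = [0.5, 1.2, 2.0, 3.1, 4.0]
toy_grit_median = [0.6, 1.1, 2.4, 4.6, 4.5]
rho = spearman(toy_grit_mean, toy_grit_median)
```

In the toy data only the top two perturbations swap rank between the two summaries, so `rho` stays high without reaching 1.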

@shntnu
Member

shntnu commented Mar 22, 2021

> However, if we think about using the z-score to find where the mean/median replicate correlation is in respect to the controls (which we do) then we are ok.

I agree with this and with the explanation in #38 (comment).

I'd note that there are other, completely different ways of reporting the comparison of the two distributions (1. correlations to replicates and 2. correlations to negative controls), e.g. Average Precision (replicates = class 1, negative controls = class 2), but @gwaygenomics's choice of the average z-score of class 1, using class 2 as a reference, is a defensible choice.
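
As a rough illustration of the Average Precision alternative, the sketch below ranks all correlations from highest to lowest and averages the precision at each replicate's rank. The function name and the tie handling (replicates listed first among equal correlations) are assumptions for illustration, not an existing API.

```python
def average_precision(replicate_corrs, control_corrs):
    """Average precision with replicates as class 1 and controls as class 2.

    Hypothetical helper: ties keep input order, which places replicates first.
    """
    if not replicate_corrs:
        raise ValueError("need at least one replicate correlation")

    # Label each correlation, then rank everything from highest to lowest.
    labeled = [(c, True) for c in replicate_corrs]
    labeled += [(c, False) for c in control_corrs]
    labeled.sort(key=lambda pair: pair[0], reverse=True)

    hits = 0
    precisions = []
    for rank, (_, is_replicate) in enumerate(labeled, start=1):
        if is_replicate:
            hits += 1
            # Precision among the top `rank` correlations.
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions)


# Toy numbers only: one replicate ranks below a control.
ap = average_precision([0.9, 0.2], [0.5, 0.1])
```

Unlike the z-score summary, this depends only on the rank order of the correlations, so it is insensitive to how skewed the control distribution is.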

3 participants