Define percent strong #21

Open
shntnu opened this issue Nov 9, 2020 · 9 comments
Labels
documentation Improvements or additions to documentation

Comments

@shntnu
Member

shntnu commented Nov 9, 2020

(Stubs for now, so we can add this documentation to code later)

Percent strong is reported in two ways. We should distinguish between them (they are similar but not the same):

  1. The fraction of replicate pairs that are more similar to each other than 95% of non-replicate pairs.
  2. The fraction of replicate sets that are more coherent than 95% of non-replicate sets of the same size. The coherence of a set is defined as the median correlation among elements of the set.

The second version can be a bit confusing so here is an example:

  • Consider an experiment that has 100 perturbations, each performed in 5 replicates.
  • For each perturbation, we compute the pairwise correlation among its 5 replicates (the replicate set), and report the median value. This median replicate correlation value is the coherence of the perturbation's replicates.
  • We now compute the coherence of non-replicate sets of the same size. So we sample 5 random wells in the experiment such that no two are replicates of the same perturbation. We compute the pairwise correlation among the 5 wells (the non-replicate set), and report the median value. We do this several times to build a null distribution.
  • We now report the fraction of perturbations that have a coherence greater than the 95th percentile of the null. This is the percent strong.
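
The steps in this example can be sketched in plain Python. This is only an illustration under stated assumptions, not the package's code: the function names (`pearson`, `coherence`, `percent_strong`), the `wells` dictionary layout, the default of 1,000 null samples, and the order-statistic approximation of the 95th percentile are all hypothetical choices. It also assumes every perturbation has the same number of replicates.

```python
import random
import statistics
from itertools import combinations


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def coherence(profiles):
    """Median pairwise correlation among the profiles of one set."""
    return statistics.median([pearson(a, b) for a, b in combinations(profiles, 2)])


def percent_strong(wells, n_null=1000, quantile=0.95, seed=0):
    """wells: dict mapping perturbation name -> list of replicate profiles.

    Assumes all perturbations have the same replicate count (hypothetical
    simplification) and more perturbations than replicates per perturbation.
    """
    rng = random.Random(seed)
    perturbations = list(wells)
    set_size = len(wells[perturbations[0]])

    # Null distribution: coherence of sets of wells drawn from distinct
    # perturbations, so no two wells in a set are replicates of each other.
    null = []
    for _ in range(n_null):
        chosen = rng.sample(perturbations, set_size)
        null.append(coherence([rng.choice(wells[p]) for p in chosen]))

    # Simple order-statistic approximation of the requested null quantile.
    null.sort()
    threshold = null[min(int(quantile * n_null), n_null - 1)]

    strong = sum(coherence(reps) > threshold for reps in wells.values())
    return strong / len(wells)
```

Calling `percent_strong(wells)` returns a fraction between 0 and 1; multiply by 100 to report it as a percentage.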
@shntnu added the documentation label Nov 9, 2020
@gwaybio
Member

gwaybio commented Nov 19, 2020

@niranjchandrasekaran - at checkin today I think I may have answered your question about "percent strong" incorrectly. We are not calculating medians in percent_strong.py - all we do currently is determine the percentage of "group_replicates" that are higher than a quantile (95% default) of "not group_replicates".
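
For readers comparing the two definitions, the pairwise behaviour described in this comment can be sketched roughly as follows. This is not the code in percent_strong.py: the function name `percent_strong_pairwise`, the `(similarity, is_replicate_pair)` tuple layout, and the order-statistic approximation of the quantile are assumptions made for illustration.

```python
def percent_strong_pairwise(pairs, quantile=0.95):
    """pairs: iterable of (similarity, is_replicate_pair) tuples.

    Returns the fraction of replicate pairs whose similarity exceeds the
    given quantile of the non-replicate pair similarities.
    """
    pairs = list(pairs)
    replicate = [s for s, is_rep in pairs if is_rep]
    non_replicate = sorted(s for s, is_rep in pairs if not is_rep)

    # Order-statistic approximation of the quantile of the non-replicate pairs.
    idx = min(int(quantile * len(non_replicate)), len(non_replicate) - 1)
    threshold = non_replicate[idx]

    return sum(s > threshold for s in replicate) / len(replicate)
```

No medians are involved here: every replicate pair is compared against the threshold individually, which is what separates this from the set-based definition.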

@niranjchandrasekaran
Member

@gwaygenomics, does that mean the current implementation of percent_strong computes the first type of percent_strong in #21 (comment) and not the second type?

@shntnu
Member Author

shntnu commented Nov 19, 2020

> @gwaygenomics, does that mean the current implementation of percent_strong computes the first type of percent_strong in #21 (comment) and not the second type?

I think that is correct.

The good news is that it should require relatively little extra code to implement the second type, given that this matrix is already being computed:

similarity_melted_df = assign_replicates(
    similarity_melted_df=similarity_melted_df, replicate_groups=replicate_groups
)

@gwaybio
Member

gwaybio commented Jan 14, 2021

note: we've also renamed percent_strong to percent_matching

@shntnu closed this as completed Mar 22, 2021
@gwaybio reopened this May 13, 2021
@gwaybio
Member

gwaybio commented May 13, 2021

@shntnu
Member Author

shntnu commented Aug 20, 2021

I'll copy here text by @gwaygenomics from the paper https://github.com/broadinstitute/lincs-profiling-complementarity because it's the clearest description of the method I've come across!


Constructing an appropriate null distribution to calculate reproducibility metrics
In order to calculate percent replicating and percent matching metrics, we constructed matched null distributions. We designed the null distributions to control for different replicate counts between different compounds and MOAs, taking into account different replicate counts per assay. We also constructed different null distributions within each treatment dose independently to control for dose differences.

Specifically, for percent replicating, for a given perturbation x with n replicates of dose p, we randomly sampled n non-replicate profiles from all 1,327 common perturbations treated with dose p. We performed this sampling procedure 1,000 times per replicate cardinality class (e.g. compounds with 3 replicates, 4 replicates, 5 replicates, etc.) with two additional restrictions: (1) the random sample did not include replicates for perturbation x, and (2) no two compounds of the same non-x perturbation were included in the same null group. For example, in cases where a compound treatment at a specific dose had five replicates, we sampled 1,000 groups of five randomly sampled non-replicate profiles of the same dose. For percent replicating, we used level 4 profiles considering compound and dose information as replicates. We considered a replicating profile one in which the ground truth median pairwise replicate correlation was higher than 95% of the null distribution. We therefore calculate the percent replicating metric as the total number of replicating profiles over all common compounds.

For percent matching, we performed a similar procedure. The only differences were that we (1) used level 5 consensus signatures and (2) considered MOA and dose information as replicates. We subsequently constructed dose and MOA replicate count-specific null distributions to compare against. We considered a matched MOA one in which the ground truth MOA median pairwise correlation was higher than 95% of the null distribution. We therefore calculate the percent matching metric as the total number of matched MOAs over all common MOAs.

We used these null distributions to calculate a non-parametric p value. First, for each compound, we calculated its median pairwise replicate correlation. We next calculated the median pairwise correlations of each randomly sampled group matched to the specific dose and replicate count. Lastly, we calculated a compound-specific p value by dividing how many times the real median pairwise correlation of replicates was higher than all 1,000 randomly sampled groups of median pairwise correlations.
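
One possible reading of this p value, sketched in plain Python. The quoted text is ambiguous about the direction of the comparison, so the convention used here (the fraction of null groups whose median correlation is at least as high as the real one) is an assumption, as is the function name `empirical_p_value`. It is not taken from the paper's code.

```python
def empirical_p_value(observed, null_values):
    """Fraction of null median correlations at least as high as the observed one.

    observed: median pairwise replicate correlation of one compound.
    null_values: median pairwise correlations of the matched random groups
    (same dose and replicate count), e.g. 1,000 values.
    """
    null_values = list(null_values)
    return sum(v >= observed for v in null_values) / len(null_values)
```

Under this convention, a compound whose real median correlation is higher than 95% of the null distribution gets a p value of at most 0.05.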

@shntnu
Member Author

shntnu commented Oct 22, 2021

Quick note because this came up when reviewing @jccaicedo's paper:

As of Oct 2021, the definitions in #21 (comment) might be inconsistent with the terminology used in the package.

@gwaybio
Member

gwaybio commented Oct 22, 2021

I am fairly sure they will be inconsistent, although I do think the differences will be very minor. We did not use this package in that paper, and I wrote the package implementation a couple of months before.

@shntnu
Member Author

shntnu commented Oct 22, 2021

Makes sense 👍
(I added this note because we were citing this issue in the LUAD gdoc (in comments, not actually citing it) and I didn't want people to get confused)
