updated branch from previous merges to deal with conflicts

cytomining · Sep 30, 2024 · a62a6fc
2 parents 76b69aa + af07d49
commit a62a6fc
Showing 28 changed files with 335 additions and 75 deletions.
diff --git a/.github/workflows/integration-test.yml b/.github/workflows/integration-test.yml
@@ -56,6 +56,28 @@ jobs:
         uses: pre-commit/[email protected]
         with:
           extra_args: --all-files
+  python-type-checks:
+    # This job is used to check Python types
+    name: Python type checks
+    # Avoid fail-fast to retain output
+    strategy:
+      fail-fast: false
+    runs-on: ubuntu-22.04
+    if: github.event_name != 'schedule'
+    steps:
+      - name: Checkout repo
+        uses: actions/checkout@v4
+      - name: Setup python, and check pre-commit cache
+        uses: ./.github/actions/setup-env
+        with:
+          python-version: ${{ env.TARGET_PYTHON_VERSION }}
+          cache-pre-commit: false
+          cache-venv: true
+          setup-poetry: true
+          install-deps: true
+      - name: Run mypy
+        run: |
+          poetry run mypy .
   integration-test:
     name: Pytest (Python ${{ matrix.python-version }} on ${{ matrix.os }})
     # Runs pytest on all tested versions of python and OSes

diff --git a/README.md b/README.md
@@ -11,16 +11,20 @@
 Pycytominer is a suite of common functions used to process high dimensional readouts from high-throughput cell experiments.
 The tool is most often used for processing data through the following pipeline:
 
-<img height="325" alt="Description of the pycytominer pipeline. Images flow from feature extraction and are processed with a series of steps" src="https://github.com/cytomining/pycytominer/blob/main/media/pipeline.png?raw=true">
+<img height="700" align="center" alt="Description of the pycytominer pipeline. Images flow from feature extraction and are processed with a series of steps" src="https://github.com/cytomining/pycytominer/blob/main/media/pipeline.png?raw=true">
+
+> Figure 1. The standard image-based profiling experiment and the role of Pycytominer. (A) In the experimental phase, a scientist plates cells, often perturbing them with chemical or genetic agents and performs microscopy imaging. In image analysis, using CellProfiler for example, a scientist applies several data processing steps to generate image-based profiles. In addition, scientists can apply a more flexible approach by using deep learning models, such as DeepProfiler, to generate image-based profiles. (B) Pycytominer performs image-based profiling to process morphology features and make them ready for downstream analyses. (C) Pycytominer performs five fundamental functions, each implemented with a simple and intuitive API. Each function enables a user to implement various methods for executing operations.
 
 [Click here for high resolution pipeline image](https://github.com/cytomining/pycytominer/blob/main/media/pipeline.png)
 
-Image data flow from a microscope to cell segmentation and feature extraction tools (e.g. CellProfiler or DeepProfiler).
+Image data flow from a microscope to cell segmentation and feature extraction tools (e.g. [CellProfiler](https://cellprofiler.org/) or [DeepProfiler](https://cytomining.github.io/DeepProfiler-handbook/docs/00-welcome.html)) (**Figure 1A**).
 From here, additional single cell processing tools curate the single cell readouts into a form manageable for pycytominer input.
-For CellProfiler, we use [cytominer-database](https://github.com/cytomining/cytominer-database) or [CytoTable](https://github.com/cytomining/CytoTable).
-For DeepProfiler, we include single cell processing tools in [pycytominer.cyto_utils](pycytominer/cyto_utils/).
+For [CellProfiler](https://cellprofiler.org/), we use [cytominer-database](https://github.com/cytomining/cytominer-database) or [CytoTable](https://github.com/cytomining/CytoTable).
+For [DeepProfiler](https://cytomining.github.io/DeepProfiler-handbook/docs/00-welcome.html), we include single cell processing tools in [pycytominer.cyto_utils](pycytominer/cyto_utils/).
 
-From the single cell output, pycytominer performs five steps using a simple API (described below), before passing along data to [cytominer-eval](https://github.com/cytomining/cytominer-eval) for quality and perturbation strength evaluation.
+Next, Pycytominer performs reproducible image-based profiling (**Figure 1B**).
+The Pycytominer API consists of five key steps (**Figure 1C**).
+The outputs generated by Pycytominer are utilized for downstream analysis, which includes machine learning models and statistical testing to derive biological insights.
 
 The best way to communicate with us is through [GitHub Issues](https://github.com/cytomining/pycytominer/issues), where we are able to discuss and troubleshoot topics related to pycytominer.
 Please see our [`CONTRIBUTING.md`](https://github.com/cytomining/pycytominer/blob/main/CONTRIBUTING.md) for details about communicating possible bugs, new features, or other information.
@@ -66,6 +70,30 @@ Pycytominer is primarily built on top of [pandas](https://pandas.pydata.org/docs
 
 Pycytominer currently supports [parquet](https://parquet.apache.org/) and compressed text file (e.g. `.csv.gz`) i/o.
 
+### CellProfiler support
+
+Currently, Pycytominer fully supports data generated by [CellProfiler](https://cellprofiler.org/), adhering defaults to its specific data structure and naming conventions.
+
+CellProfiler-generated image-based profiles typically consist of two main components:
+
+- **Metadata features:** This section contains information about the experiment, such as plate ID, well position, incubation time, perturbation type, and other relevant experimental details. These feature names are prefixed with `Metadata_`, indicating that the data in these columns contain metadata information.
+- **Morphology features:** These are the quantified morphological features prefixed with the default compartments (`Cells_`, `Cytoplasm_`, and `Nuclei_`). Pycytominer also supports non-default compartment names (e.g., `Mito_`).
+
+Note, [`pycytominer.cyto_utils.cells.SingleCells()`](pycytominer/cyto_utils/cells.py) contains code designed to interact with single-cell SQLite files exported from CellProfiler.
+Processing capabilities for SQLite files depends on SQLite file size and your available computational resources (for ex. memory and CPU).
+
+### Handling inputs from other image analysis tools (other than CellProfiler)
+
+Pycytominer also supports processing of raw morphological features from image analysis tools beyond [CellProfiler](https://cellprofiler.org/).
+These tools include [In Carta](https://www.moleculardevices.com/products/cellular-imaging-systems/high-content-analysis/in-carta-image-analysis-software), [Harmony](https://www.revvity.com/product/harmony-5-2-office-revvity-hh17000019#product-overview), and others.
+Using Pycytominer with these tools requires minor modifications to function arguments, and we encourage these users to pay particularly close attention to individual function documentation.
+
+For example, to resolve potential feature issues in the `normalize()` function, you must manually specify the morphological features using the `features` [parameter](https://pycytominer.readthedocs.io/en/latest/pycytominer.html#pycytominer.normalize.normalize).
+The `features` parameter is also available in other key steps, such as [`aggregate`](https://pycytominer.readthedocs.io/en/latest/pycytominer.html#pycytominer.aggregate.aggregate) and [`feature_select`](https://pycytominer.readthedocs.io/en/latest/pycytominer.html#pycytominer.feature_select.feature_select).
+
+If you are using Pycytominer with these other tools, please file [an issue](https://github.com/cytomining/pycytominer/issues) to reach out.
+We'd love to hear from you so that we can learn how to best support broad and multiple use-cases.
+
 ## API
 
 Pycytominer has five major processing functions:
@@ -97,6 +125,8 @@ Each processing function has unique arguments, see our [documentation](https://p
 
 The default way to use pycytominer is within python scripts, and using pycytominer is simple and fun.
 
+The example below demonstrates how to perform normalization with a dataset generated by [CellProfiler](https://cellprofiler.org/).
+
 ```python
 # Real world example
 import pandas as pd
@@ -135,21 +165,6 @@ And, more specifically than that, image-based profiling readouts from [CellProfi
 
 Therefore, we have included some custom tools in `pycytominer/cyto_utils` that provides other functionality:
 
-- [Data processing for image-based profiling](#data-processing-for-image-based-profiling)
-  - [Installation](#installation)
-  - [Frameworks](#frameworks)
-  - [API](#api)
-  - [Usage](#usage)
-    - [Pipeline orchestration](#pipeline-orchestration)
-  - [Other functionality](#other-functionality)
-    - [CellProfiler CSV collation](#cellprofiler-csv-collation)
-    - [Creating a cell locations lookup table](#creating-a-cell-locations-lookup-table)
-    - [Generating a GCT file for morpheus](#generating-a-gct-file-for-morpheus)
-  - [Citing pycytominer](#citing-pycytominer)
-
-Note, [`pycytominer.cyto_utils.cells.SingleCells()`](pycytominer/cyto_utils/cells.py) contains code to interact with single-cell SQLite files, which are output from CellProfiler.
-Processing capabilities for SQLite files depends on SQLite file size and your available computational resources (for ex. memory and cores).
-
 ### CellProfiler CSV collation
 
 If running your images on a cluster, unless you have a MySQL or similar large database set up then you will likely end up with lots of different folders from the different cluster runs (often one per well or one per site), each one containing an `Image.csv`, `Nuclei.csv`, etc.
@@ -228,7 +243,7 @@ pycytominer.cyto_utils.write_gct(
 )
 ```
 
-## Citing pycytominer
+## Citing Pycytominer
 
 If you have used `pycytominer` in your project, please use the citation below.
 You can also find the citation in the 'cite this repository' link at the top right under `about` section.

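The README hunks above describe the CellProfiler column convention (`Metadata_` for metadata, compartment prefixes such as `Cells_`, `Cytoplasm_`, and `Nuclei_` for morphology features) and note that non-default compartments such as `Mito_` are supported. A minimal sketch of that naming convention using only pandas follows. The helper `split_columns`, the constant `COMPARTMENT_PREFIXES`, and the example column names are hypothetical illustrations, not Pycytominer's own implementation.

```python
import pandas as pd

# Default CellProfiler compartment prefixes named in the README text.
COMPARTMENT_PREFIXES = ("Cells_", "Cytoplasm_", "Nuclei_")


def split_columns(profiles, compartments=COMPARTMENT_PREFIXES):
    """Hypothetical helper: split columns into metadata and morphology features by prefix."""
    meta = [col for col in profiles.columns if col.startswith("Metadata_")]
    feats = [col for col in profiles.columns if col.startswith(tuple(compartments))]
    return meta, feats


# Made-up profile table for illustration only.
profiles = pd.DataFrame(
    {
        "Metadata_Plate": ["P1", "P1"],
        "Metadata_Well": ["A01", "A02"],
        "Cells_AreaShape_Area": [100.0, 120.0],
        "Nuclei_Intensity_Mean": [0.4, 0.5],
        "Mito_Texture_Entropy": [1.2, 1.1],
    }
)

# With the default prefixes, the Mito_ column is not picked up as a feature.
meta_cols, feature_cols = split_columns(profiles)

# Adding a non-default compartment prefix picks it up.
custom_meta, custom_features = split_columns(
    profiles, COMPARTMENT_PREFIXES + ("Mito_",)
)
```

For data from tools that do not follow these prefixes, an explicit list like `feature_cols` is the kind of value the README suggests passing through the `features` parameter rather than relying on inference.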
diff --git a/media/legacy_pipeline.png b/media/legacy_pipeline.png
diff --git a/media/pipeline.png b/media/pipeline.png
diff --git a/poetry.lock b/poetry.lock
diff --git a/pycytominer/aggregate.py b/pycytominer/aggregate.py
@@ -83,7 +83,7 @@ def aggregate(
     # Only extract single object column in preparation for count
     if compute_object_count:
         count_object_df = (
-            population_df.loc[:, np.union1d(strata, [object_feature])]
+            population_df.loc[:, list(np.union1d(strata, [object_feature]))]
             .groupby(strata)[object_feature]
             .count()
             .reset_index()
@@ -92,7 +92,9 @@ def aggregate(
 
     if features == "infer":
         features = infer_cp_features(population_df)
-    population_df = population_df[features]
+
+    # recast as dataframe to protect against scenarios where a series may be returned
+    population_df = pd.DataFrame(population_df[features])
 
     # Fix dtype of input features (they should all be floats!)
     population_df = population_df.astype(float)
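The recast in this hunk guards against `population_df[features]` returning a Series, which happens in pandas when the selector is a single label rather than a list. A small standalone sketch of that behavior, using a made-up two-column frame (the column names are assumptions for illustration):

```python
import pandas as pd

# Made-up single-cell table for illustration only.
population_df = pd.DataFrame(
    {
        "Metadata_Well": ["A01", "A01", "A02"],
        "Cells_AreaShape_Area": [1.0, 2.0, 3.0],
    }
)

# Selecting with a single label yields a Series, not a DataFrame.
selected = population_df["Cells_AreaShape_Area"]

# Wrapping the selection in pd.DataFrame restores a one-column DataFrame,
# so later DataFrame-only operations (and the type checker) see one type.
recast = pd.DataFrame(selected)

# Selecting with a list already yields a DataFrame; the wrap is then a no-op in type.
selected_with_list = pd.DataFrame(population_df[["Cells_AreaShape_Area"]])
```

Either way the result has the same single column, so the wrap costs little and keeps the variable's type stable for mypy.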
@@ -101,7 +103,9 @@ def aggregate(
     population_df = pd.concat([strata_df, population_df], axis="columns")
 
     # Perform aggregating function
-    population_df = population_df.groupby(strata, dropna=False)
+    # Note: type ignore added below to address the change in variable types for
+    # label `population_df`.
+    population_df = population_df.groupby(strata, dropna=False)  # type: ignore[assignment]
 
     if operation == "median":
         population_df = population_df.median().reset_index()
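The `# type: ignore[assignment]` in this hunk is needed because the name `population_df` is rebound from a DataFrame to a groupby object. One possible alternative, shown here only as a sketch and not as what this commit does, is to bind the groupby object to its own name so no ignore is required. The function `aggregate_median` and its example data are hypothetical.

```python
import pandas as pd


def aggregate_median(population_df: pd.DataFrame, strata: list[str]) -> pd.DataFrame:
    """Hypothetical sketch: median-aggregate features per strata group."""
    # A separate name for the groupby object keeps population_df typed as a DataFrame.
    grouped = population_df.groupby(strata, dropna=False)
    return grouped.median().reset_index()


# Made-up data for illustration only.
example_df = pd.DataFrame(
    {
        "Metadata_Well": ["A01", "A01", "A02"],
        "Cells_AreaShape_Area": [1.0, 3.0, 5.0],
    }
)
result = aggregate_median(example_df, ["Metadata_Well"])
```

The trade-off is a slightly larger change to the function body in exchange for dropping the ignore comment.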
@@ -118,10 +122,10 @@ def aggregate(
         for column in population_df.columns
         if column in ["ImageNumber", "ObjectNumber"]
     ]:
-        population_df = population_df.drop([columns_to_drop], axis="columns")
+        population_df = population_df.drop(columns=columns_to_drop, axis="columns")
 
     if output_file is not None:
-        output(
+        return output(
             df=population_df,
             output_filename=output_file,
             output_type=output_type,