Merge pull request #228 from BCG-Gamma/dev/2.0.0

BUILD: release sklearndf 2.0.0

j-ittner authored Aug 26, 2022
2 parents 7cb9780 + a5dcea4 commit e769c73

Showing 82 changed files with 4,417 additions and 3,465 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -413,3 +413,6 @@ TSWLatexianTemp*

# exclude notebooks directory: this is generated during build
/notebooks/

# OmniGraffle previews
**/*.graffle/preview.jpeg
7 changes: 5 additions & 2 deletions .idea/sklearndf.iml

Some generated files are not rendered by default.

22 changes: 13 additions & 9 deletions .pre-commit-config.yaml
@@ -1,37 +1,41 @@
repos:
- repo: https://github.com/PyCQA/isort
rev: 5.5.4
rev: 5.10.1
hooks:
- id: isort

- repo: https://github.com/psf/black
rev: 22.1.0
rev: 22.6.0
hooks:
- id: black
language_version: python3

- repo: https://gitlab.com/pycqa/flake8
rev: 3.9.0
rev: 4.0.1
hooks:
- id: flake8
name: flake8
entry: flake8 --config tox.ini
language: python_venv
additional_dependencies: [ flake8-comprehensions, flake8-import-order ]
additional_dependencies:
- flake8-comprehensions ~= 3.10
types: [ python ]

- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v3.2.0
rev: v4.3.0
hooks:
- id: check-added-large-files
- id: check-json
- id: check-xml
- id: check-yaml
exclude: condabuild/meta.yaml

- repo: https://github.com/pre-commit/mirrors-mypy
rev: v0.931
rev: v0.971
hooks:
- id: mypy
files: src/
files: src|sphinx|test
language_version: python38
additional_dependencies:
- numpy>=1.22
- gamma-pytools>=2.0.dev5,<3a
- numpy~=1.22
- gamma-pytools~=2.0,!=2.0.0
165 changes: 95 additions & 70 deletions README.rst
@@ -1,6 +1,13 @@
.. image:: sphinx/source/_static/sklearndf_logo.png
.. image:: sphinx/source/_images/sklearndf_logo.png

|
----

.. Begin-Badges
|pypi| |conda| |azure_build| |azure_code_cov|
|python_versions| |code_style| |made_with_sphinx_doc| |License_badge|

.. End-Badges
*sklearndf* is an open source library designed to address a common need with
`scikit-learn <https://github.com/scikit-learn/scikit-learn>`__: the outputs of
@@ -11,55 +18,67 @@ feature names.
To this end, *sklearndf* enhances scikit-learn's estimators as follows:

- **Preserve data frame structure**:
Return data frames as results of transformations, preserving feature names as the column index.
Return data frames as results of transformations, preserving feature names as the
column index.
- **Feature name tracing**:
Add additional estimator properties to enable tracing a feature name back to its original input feature; this is especially useful for transformers that create new features (e.g., one-hot encode), and for pipelines that include such transformers.
Add additional estimator properties to enable tracing a feature name back to its
original input feature; this is especially useful for transformers that create new
features (e.g., one-hot encode), and for pipelines that include such transformers.
- **Easy use**:
Simply append DF at the end of your usual scikit-learn class names to get enhanced data frame support!
Simply append DF at the end of your usual scikit-learn class names to get enhanced
data frame support!
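The "preserve data frame structure" idea can be sketched with plain *pandas* and scikit-learn — this is an illustrative sketch of the concept, not *sklearndf*'s actual implementation:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Minimal sketch (plain pandas + scikit-learn, NOT sklearndf's actual code):
# a native transformer returns an ndarray, losing the column index ...
df = pd.DataFrame({"age": [20.0, 30.0, 40.0], "fare": [7.0, 14.0, 21.0]})
raw = StandardScaler().fit_transform(df)

# ... so the DF-style idea is to re-wrap the result as a data frame,
# preserving the original row and column indices.
result = pd.DataFrame(raw, index=df.index, columns=df.columns)
print(list(result.columns))  # ['age', 'fare']
```

The DF classes spare you this re-wrapping step: their ``fit_transform`` returns the data frame directly.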

.. Begin-Badges
|pypi| |conda| |azure_build| |azure_code_cov|
|python_versions| |code_style| |made_with_sphinx_doc| |License_badge|

.. End-Badges
The following quickstart guide provides a minimal example workflow to get up and running
with *sklearndf*.
For additional tutorials and the API reference,
see the `sklearndf documentation <https://bcg-gamma.github.io/sklearndf/>`__.
Changes and additions to new versions are summarized in the
`release notes <https://bcg-gamma.github.io/sklearndf/release_notes.html>`__.


Installation
------------

*sklearndf* supports both PyPI and Anaconda
*sklearndf* supports both PyPI and Anaconda.
We recommend installing *sklearndf* into a dedicated environment.


Anaconda
~~~~~~~~

.. code-block:: RST
.. code-block:: sh
conda install sklearndf -c bcg_gamma -c conda-forge
conda create -n sklearndf
conda activate sklearndf
conda install -c bcg_gamma -c conda-forge sklearndf
Pip
~~~

.. code-block:: RST
macOS and Linux:
^^^^^^^^^^^^^^^^

.. code-block:: sh
python -m venv sklearndf
source sklearndf/bin/activate
pip install sklearndf
Windows:
^^^^^^^^

Quickstart
----------
.. code-block::
The following quickstart guide provides a minimal example workflow to get up and running
with *sklearndf*.
For additional tutorials and the API reference,
see the *sklearndf* `documentation <https://bcg-gamma.github.io/sklearndf/>`__.
python -m venv sklearndf
sklearndf\Scripts\activate.bat
pip install sklearndf
Changes and additions to new versions are summarized in the
`release notes <https://bcg-gamma.github.io/sklearndf/release_notes.html>`__.
Quickstart
----------

Creating a DataFrame friendly scikit-learn preprocessing pipeline
Creating a DataFrame-friendly scikit-learn preprocessing pipeline
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The titanic data set includes categorical features such as class and sex, and also has
@@ -74,9 +93,9 @@ We will build a preprocessing pipeline which:
- for categorical variables fills missing values with the string 'Unknown' and then one-hot encodes
- for numerical values fills missing values using median values

The strength of *sklearndf* is to maintain the scikit-learn conventions and expressivity,
while also preserving data frames, and hence feature names. We can see this after using
fit_transform on our preprocessing pipeline.
The strength of *sklearndf* is to maintain the scikit-learn conventions and
expressiveness, while also preserving data frames, and hence feature names. We can see
this after using ``fit_transform`` on our preprocessing pipeline.

.. code-block:: Python
@@ -92,12 +111,14 @@ fit_transform on our preprocessing pipeline.
)
from sklearndf.pipeline import (
PipelineDF,
ClassifierPipelineDF
ClassifierPipelineDF,
)
from sklearndf.classification import RandomForestClassifierDF
# load titanic data
titanic_X, titanic_y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
titanic_X, titanic_y = fetch_openml(
"titanic", version=1, as_frame=True, return_X_y=True
)
# select features
numerical_features = ['age', 'fare']
@@ -109,7 +130,7 @@ fit_transform on our preprocessing pipeline.
preprocessing_categorical_df = PipelineDF(
steps=[
('imputer', SimpleImputerDF(strategy='constant', fill_value='Unknown')),
('one-hot', OneHotEncoderDF(sparse=False, handle_unknown="ignore"))
('one-hot', OneHotEncoderDF(sparse=False, handle_unknown="ignore")),
]
)
@@ -125,26 +146,27 @@ fit_transform on our preprocessing pipeline.
transformed_df.head()
+-------------+------------+------------+------------+------------------+------------+----------+------------+------------+------------+--------+----------+
| feature_out | embarked_C | embarked_Q | embarked_S | embarked_Unknown | sex_female | sex_male | pclass_1.0 | pclass_2.0 | pclass_3.0 | age | fare |
+=============+============+============+============+==================+============+==========+============+============+============+========+==========+
|0 |0 |0 |1 |0 |1 |0 |1 |0 |0 |29 |211.3375 |
+-------------+------------+------------+------------+------------------+------------+----------+------------+------------+------------+--------+----------+
|1 |0 |0 |1 |0 |0 |1 |1 |0 |0 |0.9167 |151.55 |
+-------------+------------+------------+------------+------------------+------------+----------+------------+------------+------------+--------+----------+
|2 |0 |0 |1 |0 |1 |0 |1 |0 |0 |2 |151.55 |
+-------------+------------+------------+------------+------------------+------------+----------+------------+------------+------------+--------+----------+
|3 |0 |0 |1 |0 |0 |1 |1 |0 |0 |30 |151.55 |
+-------------+------------+------------+------------+------------------+------------+----------+------------+------------+------------+--------+----------+
|4 |0 |0 |1 |0 |1 |0 |1 |0 |0 |25 |151.55 |
+-------------+------------+------------+------------+------------------+------------+----------+------------+------------+------------+--------+----------+
+-------------+------------+------------+---+------------+--------+--------+
| feature_out | embarked_C | embarked_Q | … | pclass_3.0 | age    | fare   |
+=============+============+============+===+============+========+========+
| **0**       | 0          | 0          | … | 0          | 29     | 211.34 |
+-------------+------------+------------+---+------------+--------+--------+
| **1**       | 0          | 0          | … | 0          | 0.9167 | 151.55 |
+-------------+------------+------------+---+------------+--------+--------+
| **2**       | 0          | 0          | … | 0          | 2      | 151.55 |
+-------------+------------+------------+---+------------+--------+--------+
| **3**       | 0          | 0          | … | 0          | 30     | 151.55 |
+-------------+------------+------------+---+------------+--------+--------+
| **4**       | 0          | 0          | … | 0          | 25     | 151.55 |
+-------------+------------+------------+---+------------+--------+--------+


Tracing features from post-transform to original
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The *sklearndf* pipeline has a `feature_names_original_` attribute which returns a series
mapping the output columns (the series' index) to the input columns (the series' values).
The *sklearndf* pipeline has a ``feature_names_original_`` attribute
which returns a *pandas* ``Series``, mapping the output column names (the series' index)
to the input column names (the series' values).
We can therefore easily select all output features generated from a given input feature,
such as in this case for embarked.
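The shape of that mapping can be sketched with a plain *pandas* ``Series`` — a hypothetical stand-in for the real ``feature_names_original_`` attribute, for illustration only:

```python
import pandas as pd

# Hypothetical stand-in for feature_names_original_: output feature names
# form the index, and the originating input feature is the value.
feature_names_original = pd.Series(
    {
        "embarked_C": "embarked",
        "embarked_Q": "embarked",
        "sex_female": "sex",
        "sex_male": "sex",
        "age": "age",
    }
)

# select all output features generated from the input feature 'embarked'
embarked_features = feature_names_original[
    feature_names_original == "embarked"
].index.tolist()
print(embarked_features)  # ['embarked_C', 'embarked_Q']
```

With the fitted pipeline, the same boolean-indexing pattern selects the one-hot columns derived from ``embarked``, as the table below shows.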

@@ -157,34 +179,34 @@ such as in this case for embarked.
+-------------+------------+------------+------------+------------------+
| feature_out | embarked_C | embarked_Q | embarked_S | embarked_Unknown |
+=============+============+============+============+==================+
|0 |0.0 |0.0 |1.0 |0.0 |
| **0** | 0.0 | 0.0 | 1.0 | 0.0 |
+-------------+------------+------------+------------+------------------+
|1 |0.0 |0.0 |1.0 |0.0 |
| **1** | 0.0 | 0.0 | 1.0 | 0.0 |
+-------------+------------+------------+------------+------------------+
|2 |0.0 |0.0 |1.0 |0.0 |
| **2** | 0.0 | 0.0 | 1.0 | 0.0 |
+-------------+------------+------------+------------+------------------+
|3 |0.0 |0.0 |1.0 |0.0 |
| **3** | 0.0 | 0.0 | 1.0 | 0.0 |
+-------------+------------+------------+------------+------------------+
|4 |0.0 |0.0 |1.0 |0.0 |
| **4** | 0.0 | 0.0 | 1.0 | 0.0 |
+-------------+------------+------------+------------+------------------+


Completing the pipeline with a classifier
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Scikit-learn regressors and classifiers have a *sklearndf* sibling obtained by appending
DF to the class name; the API remains the same.
The result of any predict and decision function will be returned as a pandas series
(single output) or data frame (class probabilities or multi-output).
``DF`` to the class name; the API of the native estimators is preserved.
The result of any predict and decision function will be returned as a *pandas*
``Series`` (single output) or ``DataFrame`` (class probabilities or multi-output).

We can combine the preprocessing pipeline above with a classifier to create a full
predictive pipeline. *sklearndf* provides two useful, specialised pipeline objects for
this, RegressorPipelineDF and ClassifierPipelineDF. Both implement a special two-step
pipeline with one preprocessing step and one prediction step, while staying compatible
with the general sklearn pipeline idiom.
this, ``RegressorPipelineDF`` and ``ClassifierPipelineDF``.
Both implement a special two-step pipeline with one preprocessing step and one
prediction step, while staying compatible with the general sklearn pipeline idiom.

Using ClassifierPipelineDF we can combine the preprocessing pipeline with
RandomForestClassifierDF() to fit a model to a selected training set and then score
Using ``ClassifierPipelineDF`` we can combine the preprocessing pipeline with
``RandomForestClassifierDF`` to fit a model to a selected training set and then score
on a test set.

.. code-block:: Python
@@ -197,17 +219,21 @@ on a test set.
max_features=2/3,
max_depth=7,
random_state=42,
n_jobs=-3
n_jobs=-3,
)
)
# split data and then fit and score random forest classifier
df_train, df_test, y_train, y_test = train_test_split(titanic_X, titanic_y, random_state=42)
df_train, df_test, y_train, y_test = train_test_split(
titanic_X, titanic_y, random_state=42
)
pipeline_df.fit(df_train, y_train)
print(f"model score: {pipeline_df.score(df_test, y_test).round(2)}")
model score: 0.79
|
model score: 0.79


Contributing
@@ -220,8 +246,7 @@ For any bug reports or feature requests/enhancements please use the appropriate
`GitHub form <https://github.com/BCG-Gamma/sklearndf/issues>`_, and if you wish to do
so, please open a PR addressing the issue.

We do ask that for any major changes please discuss these with us first via an issue or
at our team email: [email protected].
We do ask that for any major changes please discuss these with us first via an issue.

For further information on contributing please see our
`contribution guide <https://bcg-gamma.github.io/sklearndf/contribution_guide.html>`__.
@@ -254,10 +279,10 @@ or have a look at
.. Begin-Badges
.. |conda| image:: https://anaconda.org/bcg_gamma/sklearndf/badges/version.svg
:target: https://anaconda.org/BCG_Gamma/sklearndf
:target: https://anaconda.org/BCG_Gamma/sklearndf

.. |pypi| image:: https://badge.fury.io/py/sklearndf.svg
:target: https://pypi.org/project/sklearndf/
:target: https://pypi.org/project/sklearndf/

.. |azure_build| image:: https://dev.azure.com/gamma-facet/facet/_apis/build/status/BCG-Gamma.sklearndf?repoName=BCG-Gamma%2Fsklearndf&branchName=develop
:target: https://dev.azure.com/gamma-facet/facet/_build?definitionId=8&_a=summary
@@ -266,15 +291,15 @@ or have a look at
:target: https://dev.azure.com/gamma-facet/facet/_build?definitionId=8&_a=summary

.. |python_versions| image:: https://img.shields.io/badge/python-3.7|3.8|3.9-blue.svg
:target: https://www.python.org/downloads/release/python-380/
:target: https://www.python.org/downloads/release/python-380/

.. |code_style| image:: https://img.shields.io/badge/code%20style-black-000000.svg
:target: https://github.com/psf/black
:target: https://github.com/psf/black

.. |made_with_sphinx_doc| image:: https://img.shields.io/badge/Made%20with-Sphinx-1f425f.svg
:target: https://bcg-gamma.github.io/sklearndf/index.html
:target: https://bcg-gamma.github.io/sklearndf/index.html

.. |license_badge| image:: https://img.shields.io/badge/License-Apache%202.0-olivegreen.svg
:target: https://opensource.org/licenses/Apache-2.0
:target: https://opensource.org/licenses/Apache-2.0

.. End-Badges
