Merge pull request #228 from BCG-Gamma/dev/2.0.0

BUILD: release sklearndf 2.0.0

j-ittner authored Aug 26, 2022
2 parents 7cb9780 + a5dcea4 commit e769c73

Showing 82 changed files with 4,417 additions and 3,465 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -413,3 +413,6 @@ TSWLatexianTemp*

# exclude notebooks directory: this is generated during build
/notebooks/

# OmniGraffle previews
**/*.graffle/preview.jpeg
7 changes: 5 additions & 2 deletions .idea/sklearndf.iml

Some generated files are not rendered by default.

22 changes: 13 additions & 9 deletions .pre-commit-config.yaml
@@ -1,37 +1,41 @@
repos:
- repo: https://github.com/PyCQA/isort
rev: 5.5.4
rev: 5.10.1
hooks:
- id: isort

- repo: https://github.com/psf/black
rev: 22.1.0
rev: 22.6.0
hooks:
- id: black
language_version: python3

- repo: https://gitlab.com/pycqa/flake8
rev: 3.9.0
rev: 4.0.1
hooks:
- id: flake8
name: flake8
entry: flake8 --config tox.ini
language: python_venv
additional_dependencies: [ flake8-comprehensions, flake8-import-order ]
additional_dependencies:
- flake8-comprehensions ~= 3.10
types: [ python ]

- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v3.2.0
rev: v4.3.0
hooks:
- id: check-added-large-files
- id: check-json
- id: check-xml
- id: check-yaml
exclude: condabuild/meta.yaml

- repo: https://github.com/pre-commit/mirrors-mypy
rev: v0.931
rev: v0.971
hooks:
- id: mypy
files: src/
files: src|sphinx|test
language_version: python38
additional_dependencies:
- numpy>=1.22
- gamma-pytools>=2.0.dev5,<3a
- numpy~=1.22
- gamma-pytools~=2.0,!=2.0.0
165 changes: 95 additions & 70 deletions README.rst
@@ -1,6 +1,13 @@
.. image:: sphinx/source/_static/sklearndf_logo.png
.. image:: sphinx/source/_images/sklearndf_logo.png

|
----

.. Begin-Badges
|pypi| |conda| |azure_build| |azure_code_cov|
|python_versions| |code_style| |made_with_sphinx_doc| |License_badge|

.. End-Badges
*sklearndf* is an open source library designed to address a common need with
`scikit-learn <https://github.com/scikit-learn/scikit-learn>`__: the outputs of
@@ -11,55 +18,67 @@ feature names.
To this end, *sklearndf* enhances scikit-learn's estimators as follows:

- **Preserve data frame structure**:
Return data frames as results of transformations, preserving feature names as the column index.
Return data frames as results of transformations, preserving feature names as the
column index.
- **Feature name tracing**:
Add additional estimator properties to enable tracing a feature name back to its original input feature; this is especially useful for transformers that create new features (e.g., one-hot encode), and for pipelines that include such transformers.
Add additional estimator properties to enable tracing a feature name back to its
original input feature; this is especially useful for transformers that create new
features (e.g., one-hot encode), and for pipelines that include such transformers.
- **Easy use**:
Simply append DF at the end of your usual scikit-learn class names to get enhanced data frame support!
Simply append DF at the end of your usual scikit-learn class names to get enhanced
data frame support!
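The "preserve data frame structure" idea can be sketched with plain *pandas* and scikit-learn — this is an illustrative sketch of the concept, not *sklearndf*'s actual implementation:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Minimal sketch (plain pandas + scikit-learn, NOT sklearndf's actual code):
# a native transformer returns an ndarray, losing the column index ...
df = pd.DataFrame({"age": [20.0, 30.0, 40.0], "fare": [7.0, 14.0, 21.0]})
raw = StandardScaler().fit_transform(df)

# ... so the DF-style idea is to re-wrap the result as a data frame,
# preserving the original row and column indices.
result = pd.DataFrame(raw, index=df.index, columns=df.columns)
print(list(result.columns))  # ['age', 'fare']
```

The DF classes spare you this re-wrapping step: their ``fit_transform`` returns the data frame directly.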

.. Begin-Badges
|pypi| |conda| |azure_build| |azure_code_cov|
|python_versions| |code_style| |made_with_sphinx_doc| |License_badge|

.. End-Badges
The following quickstart guide provides a minimal example workflow to get up and running
with *sklearndf*.
For additional tutorials and the API reference,
see the `sklearndf documentation <https://bcg-gamma.github.io/sklearndf/>`__.
Changes and additions to new versions are summarized in the
`release notes <https://bcg-gamma.github.io/sklearndf/release_notes.html>`__.


Installation
------------

*sklearndf* supports both PyPI and Anaconda
*sklearndf* supports both PyPI and Anaconda.
We recommend installing *sklearndf* into a dedicated environment.


Anaconda
~~~~~~~~

.. code-block:: RST
.. code-block:: sh
conda install sklearndf -c bcg_gamma -c conda-forge
conda create -n sklearndf
conda activate sklearndf
conda install -c bcg_gamma -c conda-forge sklearndf
Pip
~~~

.. code-block:: RST
macOS and Linux:
^^^^^^^^^^^^^^^^

.. code-block:: sh
python -m venv sklearndf
source sklearndf/bin/activate
pip install sklearndf
Windows:
^^^^^^^^

Quickstart
----------
.. code-block::
The following quickstart guide provides a minimal example workflow to get up and running
with *sklearndf*.
For additional tutorials and the API reference,
see the *sklearndf* `documentation <https://bcg-gamma.github.io/sklearndf/>`__.
python -m venv sklearndf
sklearndf\Scripts\activate.bat
pip install sklearndf
Changes and additions to new versions are summarized in the
`release notes <https://bcg-gamma.github.io/sklearndf/release_notes.html>`__.
Quickstart
----------

Creating a DataFrame friendly scikit-learn preprocessing pipeline
Creating a DataFrame-friendly scikit-learn preprocessing pipeline
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The titanic data set includes categorical features such as class and sex, and also has
@@ -74,9 +93,9 @@ We will build a preprocessing pipeline which:
- for categorical variables fills missing values with the string 'Unknown' and then one-hot encodes
- for numerical values fills missing values using median values

The strength of *sklearndf* is to maintain the scikit-learn conventions and expressivity,
while also preserving data frames, and hence feature names. We can see this after using
fit_transform on our preprocessing pipeline.
The strength of *sklearndf* is to maintain the scikit-learn conventions and
expressiveness, while also preserving data frames, and hence feature names. We can see
this after using ``fit_transform`` on our preprocessing pipeline.

.. code-block:: Python
@@ -92,12 +111,14 @@ fit_transform on our preprocessing pipeline.
)
from sklearndf.pipeline import (
PipelineDF,
ClassifierPipelineDF
ClassifierPipelineDF,
)
from sklearndf.classification import RandomForestClassifierDF
# load titanic data
titanic_X, titanic_y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
titanic_X, titanic_y = fetch_openml(
"titanic", version=1, as_frame=True, return_X_y=True
)
# select features
numerical_features = ['age', 'fare']
@@ -109,7 +130,7 @@ fit_transform on our preprocessing pipeline.
preprocessing_categorical_df = PipelineDF(
steps=[
('imputer', SimpleImputerDF(strategy='constant', fill_value='Unknown')),
('one-hot', OneHotEncoderDF(sparse=False, handle_unknown="ignore"))
('one-hot', OneHotEncoderDF(sparse=False, handle_unknown="ignore")),
]
)
@@ -125,26 +146,27 @@ fit_transform on our preprocessing pipeline.
transformed_df.head()
+-------------+------------+------------+------------+------------------+------------+----------+------------+------------+------------+--------+----------+
| feature_out | embarked_C | embarked_Q | embarked_S | embarked_Unknown | sex_female | sex_male | pclass_1.0 | pclass_2.0 | pclass_3.0 | age | fare |
+=============+============+============+============+==================+============+==========+============+============+============+========+==========+
|0 |0 |0 |1 |0 |1 |0 |1 |0 |0 |29 |211.3375 |
+-------------+------------+------------+------------+------------------+------------+----------+------------+------------+------------+--------+----------+
|1 |0 |0 |1 |0 |0 |1 |1 |0 |0 |0.9167 |151.55 |
+-------------+------------+------------+------------+------------------+------------+----------+------------+------------+------------+--------+----------+
|2 |0 |0 |1 |0 |1 |0 |1 |0 |0 |2 |151.55 |
+-------------+------------+------------+------------+------------------+------------+----------+------------+------------+------------+--------+----------+
|3 |0 |0 |1 |0 |0 |1 |1 |0 |0 |30 |151.55 |
+-------------+------------+------------+------------+------------------+------------+----------+------------+------------+------------+--------+----------+
|4 |0 |0 |1 |0 |1 |0 |1 |0 |0 |25 |151.55 |
+-------------+------------+------------+------------+------------------+------------+----------+------------+------------+------------+--------+----------+
+-------------+------------+------------+---+------------+--------+--------+
| feature_out | embarked_C | embarked_Q | … | pclass_3.0 | age    | fare   |
+=============+============+============+===+============+========+========+
| **0**       | 0          | 0          | … | 0          | 29     | 211.34 |
+-------------+------------+------------+---+------------+--------+--------+
| **1**       | 0          | 0          | … | 0          | 0.9167 | 151.55 |
+-------------+------------+------------+---+------------+--------+--------+
| **2**       | 0          | 0          | … | 0          | 2      | 151.55 |
+-------------+------------+------------+---+------------+--------+--------+
| **3**       | 0          | 0          | … | 0          | 30     | 151.55 |
+-------------+------------+------------+---+------------+--------+--------+
| **4**       | 0          | 0          | … | 0          | 25     | 151.55 |
+-------------+------------+------------+---+------------+--------+--------+


Tracing features from post-transform to original
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The *sklearndf* pipeline has a `feature_names_original_` attribute which returns a series
mapping the output columns (the series' index) to the input columns (the series' values).
The *sklearndf* pipeline has a ``feature_names_original_`` attribute
which returns a *pandas* ``Series``, mapping the output column names (the series' index)
to the input column names (the series' values).
We can therefore easily select all output features generated from a given input feature,
such as in this case for embarked.
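The shape of that mapping can be sketched with a plain *pandas* ``Series`` — a hypothetical stand-in for the real ``feature_names_original_`` attribute, for illustration only:

```python
import pandas as pd

# Hypothetical stand-in for feature_names_original_: output feature names
# form the index, and the originating input feature is the value.
feature_names_original = pd.Series(
    {
        "embarked_C": "embarked",
        "embarked_Q": "embarked",
        "sex_female": "sex",
        "sex_male": "sex",
        "age": "age",
    }
)

# select all output features generated from the input feature 'embarked'
embarked_features = feature_names_original[
    feature_names_original == "embarked"
].index.tolist()
print(embarked_features)  # ['embarked_C', 'embarked_Q']
```

With the fitted pipeline, the same boolean-indexing pattern selects the one-hot columns derived from ``embarked``, as the table below shows.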

@@ -157,34 +179,34 @@ such as in this case for embarked.
+-------------+------------+------------+------------+------------------+
| feature_out | embarked_C | embarked_Q | embarked_S | embarked_Unknown |
+=============+============+============+============+==================+
|0 |0.0 |0.0 |1.0 |0.0 |
| **0** | 0.0 | 0.0 | 1.0 | 0.0 |
+-------------+------------+------------+------------+------------------+
|1 |0.0 |0.0 |1.0 |0.0 |
| **1** | 0.0 | 0.0 | 1.0 | 0.0 |
+-------------+------------+------------+------------+------------------+
|2 |0.0 |0.0 |1.0 |0.0 |
| **2** | 0.0 | 0.0 | 1.0 | 0.0 |
+-------------+------------+------------+------------+------------------+
|3 |0.0 |0.0 |1.0 |0.0 |
| **3** | 0.0 | 0.0 | 1.0 | 0.0 |
+-------------+------------+------------+------------+------------------+
|4 |0.0 |0.0 |1.0 |0.0 |
| **4** | 0.0 | 0.0 | 1.0 | 0.0 |
+-------------+------------+------------+------------+------------------+


Completing the pipeline with a classifier
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Scikit-learn regressors and classifiers have a *sklearndf* sibling obtained by appending
DF to the class name; the API remains the same.
The result of any predict and decision function will be returned as a pandas series
(single output) or data frame (class probabilities or multi-output).
``DF`` to the class name; the API of the native estimators is preserved.
The result of any predict and decision function will be returned as a *pandas*
``Series`` (single output) or ``DataFrame`` (class probabilities or multi-output).

We can combine the preprocessing pipeline above with a classifier to create a full
predictive pipeline. *sklearndf* provides two useful, specialised pipeline objects for
this, RegressorPipelineDF and ClassifierPipelineDF. Both implement a special two-step
pipeline with one preprocessing step and one prediction step, while staying compatible
with the general sklearn pipeline idiom.
this, ``RegressorPipelineDF`` and ``ClassifierPipelineDF``.
Both implement a special two-step pipeline with one preprocessing step and one
prediction step, while staying compatible with the general sklearn pipeline idiom.

Using ClassifierPipelineDF we can combine the preprocessing pipeline with
RandomForestClassifierDF() to fit a model to a selected training set and then score
Using ``ClassifierPipelineDF`` we can combine the preprocessing pipeline with
``RandomForestClassifierDF`` to fit a model to a selected training set and then score
on a test set.

.. code-block:: Python
@@ -197,17 +219,21 @@ on a test set.
max_features=2/3,
max_depth=7,
random_state=42,
n_jobs=-3
n_jobs=-3,
)
)
# split data and then fit and score random forest classifier
df_train, df_test, y_train, y_test = train_test_split(titanic_X, titanic_y, random_state=42)
df_train, df_test, y_train, y_test = train_test_split(
titanic_X, titanic_y, random_state=42
)
pipeline_df.fit(df_train, y_train)
print(f"model score: {pipeline_df.score(df_test, y_test).round(2)}")
model score: 0.79
|
model score: 0.79


Contributing
@@ -220,8 +246,7 @@ For any bug reports or feature requests/enhancements please use the appropriate
`GitHub form <https://github.com/BCG-Gamma/sklearndf/issues>`_, and if you wish to do
so, please open a PR addressing the issue.

We do ask that for any major changes please discuss these with us first via an issue or
at our team email: [email protected].
We do ask that for any major changes please discuss these with us first via an issue.

For further information on contributing please see our
`contribution guide <https://bcg-gamma.github.io/sklearndf/contribution_guide.html>`__.
@@ -254,10 +279,10 @@ or have a look at
.. Begin-Badges
.. |conda| image:: https://anaconda.org/bcg_gamma/sklearndf/badges/version.svg
:target: https://anaconda.org/BCG_Gamma/sklearndf
:target: https://anaconda.org/BCG_Gamma/sklearndf

.. |pypi| image:: https://badge.fury.io/py/sklearndf.svg
:target: https://pypi.org/project/sklearndf/
:target: https://pypi.org/project/sklearndf/

.. |azure_build| image:: https://dev.azure.com/gamma-facet/facet/_apis/build/status/BCG-Gamma.sklearndf?repoName=BCG-Gamma%2Fsklearndf&branchName=develop
:target: https://dev.azure.com/gamma-facet/facet/_build?definitionId=8&_a=summary
@@ -266,15 +291,15 @@ or have a look at
:target: https://dev.azure.com/gamma-facet/facet/_build?definitionId=8&_a=summary

.. |python_versions| image:: https://img.shields.io/badge/python-3.7|3.8|3.9-blue.svg
:target: https://www.python.org/downloads/release/python-380/
:target: https://www.python.org/downloads/release/python-380/

.. |code_style| image:: https://img.shields.io/badge/code%20style-black-000000.svg
:target: https://github.com/psf/black
:target: https://github.com/psf/black

.. |made_with_sphinx_doc| image:: https://img.shields.io/badge/Made%20with-Sphinx-1f425f.svg
:target: https://bcg-gamma.github.io/sklearndf/index.html
:target: https://bcg-gamma.github.io/sklearndf/index.html

.. |license_badge| image:: https://img.shields.io/badge/License-Apache%202.0-olivegreen.svg
:target: https://opensource.org/licenses/Apache-2.0
:target: https://opensource.org/licenses/Apache-2.0

.. End-Badges
