Merge pull request #228 from BCG-Gamma/dev/2.0.0
BUILD: release sklearndf 2.0.0
Showing 82 changed files with 4,417 additions and 3,465 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,37 +1,41 @@ | ||
repos: | ||
- repo: https://github.com/PyCQA/isort | ||
rev: 5.5.4 | ||
rev: 5.10.1 | ||
hooks: | ||
- id: isort | ||
|
||
- repo: https://github.com/psf/black | ||
rev: 22.1.0 | ||
rev: 22.6.0 | ||
hooks: | ||
- id: black | ||
language_version: python3 | ||
|
||
- repo: https://gitlab.com/pycqa/flake8 | ||
rev: 3.9.0 | ||
rev: 4.0.1 | ||
hooks: | ||
- id: flake8 | ||
name: flake8 | ||
entry: flake8 --config tox.ini | ||
language: python_venv | ||
additional_dependencies: [ flake8-comprehensions, flake8-import-order ] | ||
additional_dependencies: | ||
- flake8-comprehensions ~= 3.10 | ||
types: [ python ] | ||
|
||
- repo: https://github.com/pre-commit/pre-commit-hooks | ||
rev: v3.2.0 | ||
rev: v4.3.0 | ||
hooks: | ||
- id: check-added-large-files | ||
- id: check-json | ||
- id: check-xml | ||
- id: check-yaml | ||
exclude: condabuild/meta.yaml | ||
|
||
- repo: https://github.com/pre-commit/mirrors-mypy | ||
rev: v0.931 | ||
rev: v0.971 | ||
hooks: | ||
- id: mypy | ||
files: src/ | ||
files: src|sphinx|test | ||
language_version: python38 | ||
additional_dependencies: | ||
- numpy>=1.22 | ||
- gamma-pytools>=2.0.dev5,<3a | ||
- numpy~=1.22 | ||
- gamma-pytools~=2.0,!=2.0.0 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,6 +1,13 @@ | ||
.. image:: sphinx/source/_static/sklearndf_logo.png | ||
.. image:: sphinx/source/_images/sklearndf_logo.png | ||
|
||
| | ||
---- | ||
|
||
.. Begin-Badges | ||
|pypi| |conda| |azure_build| |azure_code_cov| | ||
|python_versions| |code_style| |made_with_sphinx_doc| |License_badge| | ||
|
||
.. End-Badges | ||
*sklearndf* is an open source library designed to address a common need with | ||
`scikit-learn <https://github.com/scikit-learn/scikit-learn>`__: the outputs of | ||
|
@@ -11,55 +18,67 @@ feature names. | |
To this end, *sklearndf* enhances scikit-learn's estimators as follows: | ||
|
||
- **Preserve data frame structure**: | ||
Return data frames as results of transformations, preserving feature names as the column index. | ||
Return data frames as results of transformations, preserving feature names as the | ||
column index. | ||
- **Feature name tracing**: | ||
Add additional estimator properties to enable tracing a feature name back to its original input feature; this is especially useful for transformers that create new features (e.g., one-hot encode), and for pipelines that include such transformers. | ||
Add additional estimator properties to enable tracing a feature name back to its | ||
original input feature; this is especially useful for transformers that create new | ||
features (e.g., one-hot encode), and for pipelines that include such transformers. | ||
- **Easy use**: | ||
Simply append DF at the end of your usual scikit-learn class names to get enhanced data frame support! | ||
Simply append DF at the end of your usual scikit-learn class names to get enhanced | ||
data frame support! | ||
|
||
.. Begin-Badges | ||
|pypi| |conda| |azure_build| |azure_code_cov| | ||
|python_versions| |code_style| |made_with_sphinx_doc| |License_badge| | ||
|
||
.. End-Badges | ||
The following quickstart guide provides a minimal example workflow to get up and running | ||
with *sklearndf*. | ||
For additional tutorials and the API reference, | ||
see the `sklearndf documentation <https://bcg-gamma.github.io/sklearndf/>`__. | ||
Changes and additions to new versions are summarized in the | ||
`release notes <https://bcg-gamma.github.io/sklearndf/release_notes.html>`__. | ||
|
||
|
||
Installation | ||
------------ | ||
|
||
*sklearndf* supports both PyPI and Anaconda | ||
*sklearndf* supports both PyPI and Anaconda. | ||
We recommend installing *sklearndf* into a dedicated environment. | ||
|
||
|
||
Anaconda | ||
~~~~~~~~ | ||
|
||
.. code-block:: RST | ||
.. code-block:: sh | ||
conda install sklearndf -c bcg_gamma -c conda-forge | ||
conda create -n sklearndf | ||
conda activate sklearndf | ||
conda install -c bcg_gamma -c conda-forge sklearndf | ||
Pip | ||
~~~ | ||
|
||
.. code-block:: RST | ||
macOS and Linux: | ||
^^^^^^^^^^^^^^^^ | ||
|
||
.. code-block:: sh | ||
python -m venv sklearndf | ||
source sklearndf/bin/activate | ||
pip install sklearndf | ||
Windows: | ||
^^^^^^^^ | ||
|
||
Quickstart | ||
---------- | ||
.. code-block:: | ||
The following quickstart guide provides a minimal example workflow to get up and running | ||
with *sklearndf*. | ||
For additional tutorials and the API reference, | ||
see the *sklearndf* `documentation <https://bcg-gamma.github.io/sklearndf/>`__. | ||
python -m venv sklearndf | ||
sklearndf\Scripts\activate.bat | ||
pip install sklearndf | ||
Changes and additions to new versions are summarized in the | ||
`release notes <https://bcg-gamma.github.io/sklearndf/release_notes.html>`__. | ||
Quickstart | ||
---------- | ||
|
||
Creating a DataFrame friendly scikit-learn preprocessing pipeline | ||
Creating a DataFrame-friendly scikit-learn preprocessing pipeline | ||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
||
The titanic data set includes categorical features such as class and sex, and also has | ||
|
@@ -74,9 +93,9 @@ We will build a preprocessing pipeline which: | |
- for categorical variables fills missing values with the string 'Unknown' and then one-hot encodes | ||
- for numerical values fills missing values using median values | ||
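The two preprocessing rules listed above can be illustrated without any library. The following is a minimal plain-Python sketch, not the *sklearndf* implementation: the helper names ``impute_median`` and ``one_hot`` and the toy values are hypothetical, and the ``<feature>_<category>`` column naming is only assumed to mirror the output columns shown later in this README.

```python
from statistics import median


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def one_hot(name, values, fill_value="Unknown"):
    """Fill missing categories with a constant, then one-hot encode."""
    filled = [fill_value if v is None else v for v in values]
    categories = sorted(set(filled))
    # one output column per category, named <feature>_<category>
    return {
        f"{name}_{category}": [int(v == category) for v in filled]
        for category in categories
    }


# toy data (hypothetical): None marks a missing value
ages = impute_median([29.0, None, 2.0, 30.0])
embarked = one_hot("embarked", ["S", "C", None, "S"])
print(ages)
print(embarked)
```

In the actual pipeline these steps are carried out by ``SimpleImputerDF`` and ``OneHotEncoderDF``, which additionally keep the result in a data frame.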
|
||
The strength of *sklearndf* is to maintain the scikit-learn conventions and expressivity, | ||
while also preserving data frames, and hence feature names. We can see this after using | ||
fit_transform on our preprocessing pipeline. | ||
The strength of *sklearndf* is to maintain the scikit-learn conventions and | ||
expressiveness, while also preserving data frames, and hence feature names. We can see | ||
this after using ``fit_transform`` on our preprocessing pipeline. | ||
|
||
.. code-block:: Python | ||
|
@@ -92,12 +111,14 @@ fit_transform on our preprocessing pipeline. | |
) | ||
from sklearndf.pipeline import ( | ||
PipelineDF, | ||
ClassifierPipelineDF | ||
ClassifierPipelineDF, | ||
) | ||
from sklearndf.classification import RandomForestClassifierDF | ||
# load titanic data | ||
titanic_X, titanic_y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True) | ||
titanic_X, titanic_y = fetch_openml( | ||
"titanic", version=1, as_frame=True, return_X_y=True | ||
) | ||
# select features | ||
numerical_features = ['age', 'fare'] | ||
|
@@ -109,7 +130,7 @@ fit_transform on our preprocessing pipeline. | |
preprocessing_categorical_df = PipelineDF( | ||
steps=[ | ||
('imputer', SimpleImputerDF(strategy='constant', fill_value='Unknown')), | ||
('one-hot', OneHotEncoderDF(sparse=False, handle_unknown="ignore")) | ||
('one-hot', OneHotEncoderDF(sparse=False, handle_unknown="ignore")), | ||
] | ||
) | ||
|
@@ -125,26 +146,27 @@ fit_transform on our preprocessing pipeline. | |
transformed_df.head() | ||
+-------------+------------+------------+------------+------------------+------------+----------+------------+------------+------------+--------+----------+ | ||
| feature_out | embarked_C | embarked_Q | embarked_S | embarked_Unknown | sex_female | sex_male | pclass_1.0 | pclass_2.0 | pclass_3.0 | age | fare | | ||
+=============+============+============+============+==================+============+==========+============+============+============+========+==========+ | ||
|0 |0 |0 |1 |0 |1 |0 |1 |0 |0 |29 |211.3375 | | ||
+-------------+------------+------------+------------+------------------+------------+----------+------------+------------+------------+--------+----------+ | ||
|1 |0 |0 |1 |0 |0 |1 |1 |0 |0 |0.9167 |151.55 | | ||
+-------------+------------+------------+------------+------------------+------------+----------+------------+------------+------------+--------+----------+ | ||
|2 |0 |0 |1 |0 |1 |0 |1 |0 |0 |2 |151.55 | | ||
+-------------+------------+------------+------------+------------------+------------+----------+------------+------------+------------+--------+----------+ | ||
|3 |0 |0 |1 |0 |0 |1 |1 |0 |0 |30 |151.55 | | ||
+-------------+------------+------------+------------+------------------+------------+----------+------------+------------+------------+--------+----------+ | ||
|4 |0 |0 |1 |0 |1 |0 |1 |0 |0 |25 |151.55 | | ||
+-------------+------------+------------+------------+------------------+------------+----------+------------+------------+------------+--------+----------+ | ||
+-------------+------------+------------+---+------------+--------+--------+ | ||
| feature_out | embarked_C | embarked_Q | … | pclass_3.0 | age | fare | | ||
+=============+============+============+===+============+========+========+ | ||
| **0** | 0 | 0 | … | 0 | 29 | 211.34 | | ||
+-------------+------------+------------+---+------------+--------+--------+ | ||
| **1** | 0 | 0 | … | 0 | 0.9167 | 151.55 | | ||
+-------------+------------+------------+---+------------+--------+--------+ | ||
| **2** | 0 | 0 | … | 0 | 2 | 151.55 | | ||
+-------------+------------+------------+---+------------+--------+--------+ | ||
| **3** | 0 | 0 | … | 0 | 30 | 151.55 | | ||
+-------------+------------+------------+---+------------+--------+--------+ | ||
| **4** | 0 | 0 | … | 0 | 25 | 151.55 | | ||
+-------------+------------+------------+---+------------+--------+--------+ | ||
|
||
|
||
Tracing features from post-transform to original | ||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
||
The *sklearndf* pipeline has a `feature_names_original_` attribute which returns a series | ||
mapping the output columns (the series' index) to the input columns (the series' values). | ||
The *sklearndf* pipeline has a ``feature_names_original_`` attribute | ||
which returns a *pandas* ``Series``, mapping the output column names (the series' index) | ||
to the input column names (the series' values). | ||
We can therefore easily select all output features generated from a given input feature, | ||
such as in this case for embarked. | ||
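The mapping idea described above can be sketched in plain Python. This is an illustration only: a ``dict`` stands in for the *pandas* ``Series``, the helper ``columns_from`` is hypothetical, and the column names are assumed from the example output in this README.

```python
# Hypothetical stand-in for the mapping described above:
# keys are output column names, values are the original input columns.
feature_names_original = {
    "embarked_C": "embarked",
    "embarked_Q": "embarked",
    "embarked_S": "embarked",
    "embarked_Unknown": "embarked",
    "sex_female": "sex",
    "sex_male": "sex",
    "age": "age",
    "fare": "fare",
}


def columns_from(original, mapping):
    """Return the output columns derived from one original input feature."""
    return [out for out, source in mapping.items() if source == original]


embarked_columns = columns_from("embarked", feature_names_original)
print(embarked_columns)
```

With a real ``Series``, the same selection would typically be a boolean filter on the values followed by reading off the index.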
|
||
|
@@ -157,34 +179,34 @@ such as in this case for embarked. | |
+-------------+------------+------------+------------+------------------+ | ||
| feature_out | embarked_C | embarked_Q | embarked_S | embarked_Unknown | | ||
+=============+============+============+============+==================+ | ||
|0 |0.0 |0.0 |1.0 |0.0 | | ||
| **0** | 0.0 | 0.0 | 1.0 | 0.0 | | ||
+-------------+------------+------------+------------+------------------+ | ||
|1 |0.0 |0.0 |1.0 |0.0 | | ||
| **1** | 0.0 | 0.0 | 1.0 | 0.0 | | ||
+-------------+------------+------------+------------+------------------+ | ||
|2 |0.0 |0.0 |1.0 |0.0 | | ||
| **2** | 0.0 | 0.0 | 1.0 | 0.0 | | ||
+-------------+------------+------------+------------+------------------+ | ||
|3 |0.0 |0.0 |1.0 |0.0 | | ||
| **3** | 0.0 | 0.0 | 1.0 | 0.0 | | ||
+-------------+------------+------------+------------+------------------+ | ||
|4 |0.0 |0.0 |1.0 |0.0 | | ||
| **4** | 0.0 | 0.0 | 1.0 | 0.0 | | ||
+-------------+------------+------------+------------+------------------+ | ||
|
||
|
||
Completing the pipeline with a classifier | ||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
||
Scikit-learn regressors and classifiers have a *sklearndf* sibling obtained by appending | ||
DF to the class name; the API remains the same. | ||
The result of any predict and decision function will be returned as a pandas series | ||
(single output) or data frame (class probabilities or multi-output). | ||
``DF`` to the class name; the API of the native estimators is preserved. | ||
The result of any predict and decision function will be returned as a *pandas* | ||
``Series`` (single output) or ``DataFrame`` (class probabilities or multi-output). | ||
|
||
We can combine the preprocessing pipeline above with a classifier to create a full | ||
predictive pipeline. *sklearndf* provides two useful, specialised pipeline objects for | ||
this, RegressorPipelineDF and ClassifierPipelineDF. Both implement a special two-step | ||
pipeline with one preprocessing step and one prediction step, while staying compatible | ||
with the general sklearn pipeline idiom. | ||
this, ``RegressorPipelineDF`` and ``ClassifierPipelineDF``. | ||
Both implement a special two-step pipeline with one preprocessing step and one | ||
prediction step, while staying compatible with the general sklearn pipeline idiom. | ||
|
||
Using ClassifierPipelineDF we can combine the preprocessing pipeline with | ||
RandomForestClassifierDF() to fit a model to a selected training set and then score | ||
Using ``ClassifierPipelineDF`` we can combine the preprocessing pipeline with | ||
``RandomForestClassifierDF`` to fit a model to a selected training set and then score | ||
on a test set. | ||
|
||
.. code-block:: Python | ||
|
@@ -197,17 +219,21 @@ on a test set. | |
max_features=2/3, | ||
max_depth=7, | ||
random_state=42, | ||
n_jobs=-3 | ||
n_jobs=-3, | ||
) | ||
) | ||
# split data and then fit and score random forest classifier | ||
df_train, df_test, y_train, y_test = train_test_split(titanic_X, titanic_y, random_state=42) | ||
df_train, df_test, y_train, y_test = train_test_split( | ||
titanic_X, titanic_y, random_state=42 | ||
) | ||
pipeline_df.fit(df_train, y_train) | ||
print(f"model score: {pipeline_df.score(df_test, y_test).round(2)}") | ||
model score: 0.79 | ||
| | ||
model score: 0.79 | ||
|
||
|
||
Contributing | ||
|
@@ -220,8 +246,7 @@ For any bug reports or feature requests/enhancements please use the appropriate | |
`GitHub form <https://github.com/BCG-Gamma/sklearndf/issues>`_, and if you wish to do | ||
so, please open a PR addressing the issue. | ||
|
||
We do ask that for any major changes please discuss these with us first via an issue or | ||
at our team email: [email protected]. | ||
We do ask that you discuss any major changes with us first via an issue. | ||
|
||
For further information on contributing please see our | ||
`contribution guide <https://bcg-gamma.github.io/sklearndf/contribution_guide.html>`__. | ||
|
@@ -254,10 +279,10 @@ or have a look at | |
.. Begin-Badges | ||
.. |conda| image:: https://anaconda.org/bcg_gamma/sklearndf/badges/version.svg | ||
:target: https://anaconda.org/BCG_Gamma/sklearndf | ||
:target: https://anaconda.org/BCG_Gamma/sklearndf | ||
|
||
.. |pypi| image:: https://badge.fury.io/py/sklearndf.svg | ||
:target: https://pypi.org/project/sklearndf/ | ||
:target: https://pypi.org/project/sklearndf/ | ||
|
||
.. |azure_build| image:: https://dev.azure.com/gamma-facet/facet/_apis/build/status/BCG-Gamma.sklearndf?repoName=BCG-Gamma%2Fsklearndf&branchName=develop | ||
:target: https://dev.azure.com/gamma-facet/facet/_build?definitionId=8&_a=summary | ||
|
@@ -266,15 +291,15 @@ or have a look at | |
:target: https://dev.azure.com/gamma-facet/facet/_build?definitionId=8&_a=summary | ||
|
||
.. |python_versions| image:: https://img.shields.io/badge/python-3.7|3.8|3.9-blue.svg | ||
:target: https://www.python.org/downloads/release/python-380/ | ||
:target: https://www.python.org/downloads/release/python-380/ | ||
|
||
.. |code_style| image:: https://img.shields.io/badge/code%20style-black-000000.svg | ||
:target: https://github.com/psf/black | ||
:target: https://github.com/psf/black | ||
|
||
.. |made_with_sphinx_doc| image:: https://img.shields.io/badge/Made%20with-Sphinx-1f425f.svg | ||
:target: https://bcg-gamma.github.io/sklearndf/index.html | ||
:target: https://bcg-gamma.github.io/sklearndf/index.html | ||
|
||
.. |license_badge| image:: https://img.shields.io/badge/License-Apache%202.0-olivegreen.svg | ||
:target: https://opensource.org/licenses/Apache-2.0 | ||
:target: https://opensource.org/licenses/Apache-2.0 | ||
|
||
.. End-Badges |