[MRG] Common Private Loss Module #19088
Conversation
Force-pushed from a8692c3 to aae84ef.
Force-pushed from b3e670a to 8c3c710.
Force-pushed from 35e462d to b7e34bf.
To keep this PR clean, I rebased the 2 commits with b7e34bf.
Thanks for doing this work @lorentzenchr! Should we start using this module at least for GLMs? And remove parts of …? Would it make sense to separate regression and classification losses with a mixin and possibly put them in separate files?
I think separating the losses into regression and classification would make this PR easier to review.
Hi @lorentzenchr, I have started to have a look at your pull request to become more familiar with loss nomenclature. Together with two minor comments (more will come... :) ), I have a major one about documentation. I know this is a private module, but it is not clear to me how private modules have been managed in scikit-learn. Still, some private functions are documented in scikit-learn, and you have already written the docstrings for each of the losses: they are just not rendered in the documentation. I believe having the losses described in the doc will be really helpful and will save you some work in finalizing #11430 (which I would really like to see merged one day... :) ). Thanks for your work!
Not sure this is relevant... apparently #14265 and #15116 introduced some guidelines to manage parallelism... perhaps there is a way to connect this with sklearn/ensemble/_hist_gradient_boosting/splitting.pyx, lines 309 to 312 in c9c89cf.
Documentation
Based on @cmarmo's suggestion in #19088 (review), I have the same question: to render or not to render the docstrings of this private module under the API Reference?
Number of Threads
I think that it is beneficial to have complete control over the number of threads in this low-level module, i.e. leave it as is. Any function/estimator, in particular the user-facing ones, can then decide how to use this logic.
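For illustration, a minimal sketch of that split (function and method names are hypothetical, not this module's final API): the low-level loss object takes n_threads verbatim, and the calling estimator decides how many threads to request.

# Hypothetical sketch: the estimator owns the threading decision and forwards
# it to the low-level loss object's OpenMP loops.
import numpy as np
from joblib import effective_n_jobs

def mean_loss(loss_obj, y_true, raw_prediction, sample_weight=None, n_jobs=-1):
    # `loss_obj.loss` stands in for the module's vectorized per-sample loss method.
    n_threads = effective_n_jobs(n_jobs)
    per_sample = loss_obj.loss(
        y_true=y_true,
        raw_prediction=raw_prediction,
        sample_weight=sample_weight,
        n_threads=n_threads,
    )
    return np.average(per_sample, weights=sample_weight)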
I think this PR is already quite large (with a lot of repeating stuff, I miss C++ templates...). I'd like to completely remove
Do you mean
Where should …
I would be happy with C++ templates if it makes the code easier to maintain. How would you want to use templates here?
Hmm breaking this PR up into categorical + regression would not work. What about just adding |
Even with additional dependencies, I don't have a convincing solution for vectorizing all the scalar loss/gradient/hessian functions (aka making them ufuncs) together with openmp support (n_threads), see #15123 (comment) and following comments.
I can certainly do that, if another possible reviewer agrees. In that case, I would open a new PR.
Some comments about the design/API.
for i in prange(
    n_samples, schedule='static', nogil=True, num_threads=n_threads
):
    loss[i] = closs_pinball_loss(y_true[i], raw_prediction[i], self.quantile)
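For reference, a NumPy sketch of the standard pinball (quantile) loss that closs_pinball_loss is expected to compute, with self.quantile as the target quantile level; this is an illustration only, not the Cython implementation.

# Illustrative NumPy version of the pinball (quantile) loss, not the Cython code.
import numpy as np

def pinball_loss(y_true, raw_prediction, quantile):
    # loss = quantile * (y - pred) if y >= pred else (1 - quantile) * (pred - y)
    diff = y_true - raw_prediction
    return np.where(diff >= 0, quantile * diff, (quantile - 1) * diff)

print(pinball_loss(np.array([1.0, 3.0]), np.array([2.0, 2.0]), quantile=0.9))
# [0.1 0.9]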
If y_true has a dtype of float32, would this upcast to float64 each time?
Yes. It is upcast to float64. This is similar to the HGBT losses, where any call to exp, expit, etc. is upcast to float64, e.g.

p_i = _cexpit(raw_predictions[i])
cdef double closs(self, double y_true, double raw_prediction) nogil:
    return closs_pinball_loss(y_true, raw_prediction, self.quantile)
It does not look like closs is being called anywhere. Where do you expect this function to be used?
This is expected to be used in SGD, see

cdef double loss(self, double p, double y) nogil:
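To make that intended use concrete, here is a minimal pure-Python sketch (hypothetical, not the actual SGD code): SGD visits one sample at a time, so it calls the pointwise gradient instead of the array methods.

# Illustrative sketch only, not the SGD module's implementation.
import numpy as np

def sgd_epoch(w, X, y, cgradient, lr=0.01):
    # cgradient(y_true, raw_prediction) returns d(loss)/d(raw_prediction).
    for i in range(X.shape[0]):
        raw = float(X[i] @ w)
        g = cgradient(y[i], raw)
        w = w - lr * g * X[i]  # chain rule: d(loss)/dw = g * x_i
    return w

# e.g. with the half squared error gradient raw_prediction - y_true:
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, 2.0])
w = sgd_epoch(np.zeros(2), X, y, cgradient=lambda y_t, raw: raw - y_t)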
@thomasjpfan We could get rid of all the repetitions of the 50 instances of …
TL;DR: I have no conclusion on which approach is best or generally fastest. I think I'll give up on benchmarking, as it seems to be a magic art of wizards :blush:
This is an impressive contribution, @lorentzenchr!
This is my first pass, and I still need to get into the details of the tests. But here are a first few comments.
I'm not 100% sure, but my guess is that the CI failure for doc-min-dependencies is due to Cython=0.28.5. This did not happen before setting …

gcc -pthread -shared -B /home/circleci/miniconda/envs/testenv/compiler_compat -L/home/circleci/miniconda/envs/testenv/lib -Wl,-rpath=/home/circleci/miniconda/envs/testenv/lib -Wl,--no-as-needed -Wl,--sysroot=/ build/temp.linux-x86_64-3.6/sklearn/_isotonic.o -Lbuild/temp.linux-x86_64-3.6 -lm -o sklearn/_isotonic.cpython-36m-x86_64-linux-gnu.so -fopenmp
Exited with code exit status 1
CircleCI received exit code 1
Force-pushed from 3d1d995 to c7f7523.
Hi @lorentzenchr, I would like to do another review of this PR, as well as of the previous experiments you mentioned -- I need to find some time. :) I also think that reducing code duplication would greatly help in maintaining this module; yet I do not see how easy/possible this is.
I got some time to get into the losses after reviewing your notebook of experiments: all the implementations in sklearn/_loss/_loss.pyx look correct to me. 👍
I still need to go through the tests.
I've left some comments regarding design and efficiency.
# Fused types for y_true, y_pred, raw_prediction
ctypedef fused Y_DTYPE_C:
    np.npy_float64
    np.npy_float32

# Fused types for gradient and hessian
ctypedef fused G_DTYPE_C:
    np.npy_float64
    np.npy_float32
Fused types have been defined in the code base for KDTree, BallTree and DistMetrics (here are their definitions).
To ease maintenance in the long run, we might want to define as few new fused types as possible (I would favor using existing ones or Cython's built-in fused types).
Are there particular reasons for using numpy's types? Would using Cython's floating be possible here?
No particular reason for numpy types other than copying from hist_gradient_boosting/common.pxd.
I do not use cython.floating as I want to have 2 different fused types:
- Y_DTYPE_C for y_true, raw_prediction, sample_weight
- G_DTYPE_C for loss, gradient, hessian
Using only one fused type, all of them would always be forced to the same dtype. The use case is again HGBT, where gradients and hessians are float32, whereas y_true is float64.
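A small NumPy illustration of that HGBT use case (values made up): y_true stays in float64 while the gradient buffer is float32, which is exactly the mix that two separate fused types allow.

# Made-up values; illustrates only the dtype split between the two fused types.
import numpy as np

y_true = np.array([0.0, 1.5, 2.0], dtype=np.float64)          # Y_DTYPE_C
raw_prediction = np.array([0.1, 1.0, 2.5], dtype=np.float64)  # Y_DTYPE_C
gradient = np.empty(y_true.shape, dtype=np.float32)           # G_DTYPE_C

# half squared error: d(loss)/d(raw_prediction) = raw_prediction - y_true
np.subtract(raw_prediction, y_true, out=gradient)
print(gradient.dtype)  # float32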
This makes sense. Maybe it is worth indicating the motivation here, explaining that defining two types allows more flexibility on the input dtypes.
Reading a bit in https://numpy.org/devdocs/reference/c-api/dtype.html, np.npy_float64 really is always a 64-bit floating point number, even on 32-bit systems, in contrast to plain C double, which may be implementation specific. But I wonder if any C compiler used for scikit-learn does not comply with IEEE 754-1989.
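A quick runtime sanity check (nothing more than standard NumPy introspection) that the platform double backing np.float64 is the IEEE 754 binary64 type:

# Sanity check on the build/runtime platform.
import numpy as np

info = np.finfo(np.float64)
print(np.dtype(np.double).itemsize * 8)  # 64 bits
print(info.nmant, info.maxexp)           # 52 mantissa bits, max exponent 1024 for binary64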
In sklearn/_loss/_loss.pxd (outdated):
cdef class cBinaryCrossEntropy(cLossFunction):
    cdef double closs(self, double y_true, double raw_prediction) nogil
    cdef double cgradient(self, double y_true, double raw_prediction) nogil
    cdef double2 cgrad_hess(self, double y_true, double raw_prediction) nogil
Isn't cCategoricalCrossEntropy missing here?
Good question! cCategoricalCrossEntropy does not have an implementation of closs, cgradient and cgrad_hess. Those C functions working on single data points are intended for SGD in linear models, and there is no multiclass support so far.
On top, the signature would be different. We have

cdef double closs(self, double y_true, double raw_prediction)

But for cCategoricalCrossEntropy, it would be

cdef double closs(self, double y_true, double* raw_prediction)
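For context, a NumPy/SciPy sketch (shapes and names assumed, not this PR's code) of the multiclass case: the categorical cross-entropy couples all classes through a log-sum-exp term, so the per-sample function needs the whole raw_prediction vector, hence the double* in the signature.

# Illustrative only; shapes and names are assumptions.
import numpy as np
from scipy.special import logsumexp

def closs_categorical(y_true_class, raw_prediction):
    # raw_prediction has shape (n_classes,); loss = -log softmax(raw_prediction)[k]
    return logsumexp(raw_prediction) - raw_prediction[int(y_true_class)]

print(closs_categorical(2, np.array([0.1, -0.3, 1.2])))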
# cython: wraparound=False
# cython: language_level=3

# Design:
Thanks for providing design notes 👍
There's a significant amount of code duplication for the _loss, _gradient and _gradient_hessian methods.
Is there a particular reason for this? Is it necessary for the implementation (I am thinking of your comment here)?
If this is for efficiency reasons (like avoiding dispatch when calling template methods), it might be possible to factor some code out of those classes' methods without losing efficiency by using an include file and include statements.
This is what has been used for KDTree and BallTree, whose main methods are defined in an include file. See this comment for the initial motives.
The performance critical point is c) Wrap C functions in a loop to get Python functions operating on ndarrays.
Options:
1. Implement loop functions once in the base class and use virtual inheritance => performance penalty
2. Use a wrapper function that takes a function pointer and loops over this function => performance penalty
3. Use C++ CRTP: make the virtual inheritance of 1. a compile time problem.
4. Use C++ templates on functors (struct with operator()): makes 2. a compile time problem.
5. Use Tempita or another template based on string replacement (as in _sag_fast.pyx.tp) => no syntax checks
6. Write out loops manually = current approach: assured performance, no C++, but a lot of repeating code.

1+2 have performance penalties. 3 and 4 would require a considerable commitment to using C++ templates. 5 is ugly. That leaves us with 6 :zany_face: :rofl:
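To make option 2 concrete, here is a rough Python analogue (illustrative only): one generic loop that takes the pointwise loss as an argument. In Cython/C the corresponding call goes through a function pointer that cannot be inlined, which is where the performance penalty comes from.

# Python analogue of option 2, illustrative only.
import numpy as np

def generic_loss_loop(closs, y_true, raw_prediction, out):
    # One loop shared by all losses; closs is the per-sample function.
    for i in range(y_true.shape[0]):
        out[i] = closs(y_true[i], raw_prediction[i])
    return out

def closs_half_squared_error(y, raw):
    return 0.5 * (raw - y) ** 2

y = np.array([0.0, 1.0])
raw = np.array([0.2, 0.8])
print(generic_loss_loop(closs_half_squared_error, y, raw, np.empty_like(y)))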
OK, using include files and include statements is kind of in between 5. and 6.
Is 1. really that costly?
Have a look at my notebook; the penalty seems to be roughly 10-20% compared to 6 (= the current implementation).
I understand the need for performance (which I value), but to me it comes at the cost of harder code maintenance.
I think the view of a core-dev is necessary here.
I was thinking of proceeding similarly to how KDTree and BallTree are implemented: the main implementation is defined via template methods in an include file (binary_tree.pxi):

scikit-learn/sklearn/neighbors/_binary_tree.pxi, lines 91 to 99 in 54375d2:
# Implementation Notes
# --------------------
# This implementation uses the common object-oriented approach of having an
# abstract base class which is extended by the KDTree and BallTree
# specializations.
#
# The BinaryTree "base class" is defined here and then subclassed in the BallTree
# and KDTree pyx files. These files include implementations of the
# "abstract" methods.
The helper functions for both KDTree and BallTree (scikit-learn/sklearn/neighbors/_binary_tree.pxi, lines 101 to 143 in 54375d2):
# Necessary Helper Functions
# --------------------------
# These are the names and descriptions of the "abstract" functions which are
# defined in kd_tree.pyx and ball_tree.pyx:
# cdef int allocate_data(BinaryTree tree, ITYPE_t n_nodes, ITYPE_t n_features):
#     """Allocate arrays needed for the KD Tree"""
# cdef int init_node(BinaryTree tree, ITYPE_t i_node,
#                    ITYPE_t idx_start, ITYPE_t idx_end):
#     """Initialize the node for the dataset stored in tree.data"""
# cdef DTYPE_t min_rdist(BinaryTree tree, ITYPE_t i_node, DTYPE_t* pt):
#     """Compute the minimum reduced-distance between a point and a node"""
# cdef DTYPE_t min_dist(BinaryTree tree, ITYPE_t i_node, DTYPE_t* pt):
#     """Compute the minimum distance between a point and a node"""
# cdef DTYPE_t max_rdist(BinaryTree tree, ITYPE_t i_node, DTYPE_t* pt):
#     """Compute the maximum reduced-distance between a point and a node"""
# cdef DTYPE_t max_dist(BinaryTree tree, ITYPE_t i_node, DTYPE_t* pt):
#     """Compute the maximum distance between a point and a node"""
# cdef inline int min_max_dist(BinaryTree tree, ITYPE_t i_node, DTYPE_t* pt,
#                              DTYPE_t* min_dist, DTYPE_t* max_dist):
#     """Compute the minimum and maximum distance between a point and a node"""
# cdef inline DTYPE_t min_rdist_dual(BinaryTree tree1, ITYPE_t i_node1,
#                                    BinaryTree tree2, ITYPE_t i_node2):
#     """Compute the minimum reduced distance between two nodes"""
# cdef inline DTYPE_t min_dist_dual(BinaryTree tree1, ITYPE_t i_node1,
#                                   BinaryTree tree2, ITYPE_t i_node2):
#     """Compute the minimum distance between two nodes"""
# cdef inline DTYPE_t max_rdist_dual(BinaryTree tree1, ITYPE_t i_node1,
#                                    BinaryTree tree2, ITYPE_t i_node2):
#     """Compute the maximum reduced distance between two nodes"""
# cdef inline DTYPE_t max_dist_dual(BinaryTree tree1, ITYPE_t i_node1,
#                                   BinaryTree tree2, ITYPE_t i_node2):
#     """Compute the maximum distance between two nodes"""
are defined in their respective .pyx files, respectively here (scikit-learn/sklearn/neighbors/_kd_tree.pyx, lines 26 to 34 in 54375d2):
#----------------------------------------------------------------------
# The functions below specialized the Binary Tree as a KD Tree
#
# Note that these functions use the concept of "reduced distance".
# The reduced distance, defined for some metrics, is a quantity which
# is more efficient to compute than the distance, but preserves the
# relative rankings of the true distance. For example, the reduced
# distance for the Euclidean metric is the squared-euclidean distance.
# For some metrics, the reduced distance is simply the distance.
and there (scikit-learn/sklearn/neighbors/_ball_tree.pyx, lines 33 to 41 in 54375d2):
#----------------------------------------------------------------------
# The functions below specialized the Binary Tree as a Ball Tree
#
# Note that these functions use the concept of "reduced distance".
# The reduced distance, defined for some metrics, is a quantity which
# is more efficient to compute than the distance, but preserves the
# relative rankings of the true distance. For example, the reduced
# distance for the Euclidean metric is the squared-euclidean distance.
# For some metrics, the reduced distance is simply the distance.
Here, one would proceed as follows:
- make _loss.pyx an include file _loss.pxi which defines cLossFunction and its template methods (that is cLossFunction.{closs, cgradient, cgrad_hess, _loss, _gradient, _loss_gradient, and _gradient_hessian});
- make a .pyx file per loss, which defines a class extending cLossFunction and which implements the helper functions called by the template methods (closs, cgradient, closs_grad, cgrad_hess).
In this context, cLossFunction, defined in _loss.pxi, would look like this:
cdef class cLossFunction:
    """Base class for convex loss functions."""

    cdef double closs(self, double y_true, double raw_prediction) nogil:
        """Compute the loss for a single sample.

        Parameters
        ----------
        y_true : double
            Observed, true target value.
        raw_prediction : double
            Raw prediction value (in link space).

        Returns
        -------
        double
            The loss evaluated at `y_true` and `raw_prediction`.
        """
        return closs(y_true, raw_prediction)

    # cgradient and cgrad_hess are defined similarly

    def _loss(
        self,
        Y_DTYPE_C[::1] y_true,          # IN
        Y_DTYPE_C[::1] raw_prediction,  # IN
        Y_DTYPE_C[::1] sample_weight,   # IN
        G_DTYPE_C[::1] loss,            # OUT
        int n_threads=1
    ):
        """Compute the pointwise loss value for each input.

        Parameters
        ----------
        y_true : array of shape (n_samples,)
            Observed, true target values.
        raw_prediction : array of shape (n_samples,)
            Raw prediction values (in link space).
        sample_weight : array of shape (n_samples,) or None
            Sample weights.
        loss : array of shape (n_samples,)
            A location into which the result is stored.
        n_threads : int
            Number of threads used by OpenMP (if any).

        Returns
        -------
        loss : array of shape (n_samples,)
            Element-wise loss function.
        """
        cdef:
            int i
            int n_samples = y_true.shape[0]

        if sample_weight is None:
            for i in prange(
                n_samples, schedule='static', nogil=True, num_threads=n_threads
            ):
                loss[i] = closs(y_true[i], raw_prediction[i])
        else:
            for i in prange(
                n_samples, schedule='static', nogil=True, num_threads=n_threads
            ):
                loss[i] = (sample_weight[i]
                           * closs(y_true[i], raw_prediction[i]))

        return np.asarray(loss)

    # _gradient, _loss_gradient and _gradient_hessian are defined similarly
and _half_squared_error.pyx would for instance look like:
# Note: we potentially need some more imports for some losses.
include "_loss.pxi"

# -------------------------------------
# Single point inline C functions
# -------------------------------------

# Half Squared Error
cdef inline double closs(
    double y_true,
    double raw_prediction
) nogil:
    return 0.5 * (raw_prediction - y_true) * (raw_prediction - y_true)

cdef inline double cgradient(
    double y_true,
    double raw_prediction
) nogil:
    return raw_prediction - y_true

cdef inline double_pair cgrad_hess(
    double y_true,
    double raw_prediction
) nogil:
    cdef double_pair gh
    gh.val1 = raw_prediction - y_true  # gradient
    gh.val2 = 1.                       # hessian
    return gh

cdef class cHalfSquaredError(cLossFunction):
    """Half Squared Error with identity link.

    Domain:
    y_true and y_pred all real numbers

    Link:
    y_pred = raw_prediction
    """
    # cLossFunction entirely defines the behavior
    # of this loss via template methods
    pass
Pros: this would remove code duplication, easing maintenance. Performance and class hierarchy aren't modified (no use of virtual inheritance like previously).
Cons: the compilation will take longer, the binaries will be larger, and more files will come in. Functions which are used as helpers in several cLossFunction subclasses (e.g. cgrad_hess_half_gamma and cgrad_hess_half_poisson, called by cgrad_hess_half_tweedie) could be defined in the main (or in another) include file, but I think this is manageable.
What do you think of this?
@jjerphan Thanks for explaining the include statement in more detail. Now, I get it.
I think it's a viable solution. Mainly, I've 2 concerns: first, the binaries are already largish; second, whether the tradeoff between less overall code and more individual files comes out positive or negative.
@jjerphan Thanks for explaining the include statement in more detail. Now, I get it.
You're welcome, I should have done it initially.
Mainly, I've 2 concerns: first, the binaries are already largish; second, whether the tradeoff between less overall code and more individual files comes out positive or negative.
I do agree, it mainly helps removing duplicated logic. Moreover, what I propose here is also a deprecated development pattern in Cython.
I can explore it in a PR on your fork with this PR branch as a target to see if this would even be possible.
Although not a beauty, I favor the use of cython.Tempita over include. They are (ok, both) already used in the codebase, so no extra commitment is required. However, I'd be reluctant to put work into a PR as we know such solutions work - no need for a proof of concept (at least not for me).
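For readers unfamiliar with it, here is a rough illustration of how a Tempita template can stamp out the repeated class definitions via plain string substitution at build time, before cythonization. This is not the actual template; the loss and function names below are placeholders.

# Placeholder template, not the actual _loss.pyx.tp; for illustration only.
from Cython import Tempita

template = Tempita.Template("""
{{for name, closs in losses}}
cdef class c{{name}}(cLossFunction):
    cdef double closs(self, double y_true, double raw_prediction) nogil:
        return {{closs}}(y_true, raw_prediction)
{{endfor}}
""")

print(template.substitute(losses=[
    ("HalfSquaredError", "closs_half_squared_error"),
    ("AbsoluteError", "closs_absolute_error"),
]))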
How to proceed?
I guess, further feedback might be the best way forward. I can put it on the agenda of the next dev meeting (the 3rd time?). My hope was that this PR is far from controversial and easy to merge:
- It does not touch anything (yet).
- Benchmarks indicate no regression and some nice improvements.
- Tests are veeeery solid (more than the combination of tests from separate/cluttered loss modules!)
Although not a beauty, I favor the use of cython.Tempita over include. They are (ok, both) already used in the codebase, so no extra commitment is required. However, I'd be reluctant to put work into a PR as we know such solutions work - no need for a proof of concept (at least not for me).
OK.
How to proceed? I guess, further feedback might be the best way forward.
So do I.
On a side note, probably cython.final can help with method inlining here?
Edit: I do not have a strong opinion on using cython.Tempita over include.
Thanks for your effort and for checking. As all tests pass, I'm pretty sure that the loss implementations, gradients and hessians are correct. Regarding design decisions and efficiency, see my comment #19088 (comment).
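As an aside, the kind of correctness check such tests rely on can be sketched in a few lines (a simplified, hypothetical example using the half squared error): compare the analytic gradient against a central finite difference of the loss.

# Simplified illustration of a gradient check; not the test suite's code.
import numpy as np

def half_squared_error(y, raw):
    return 0.5 * (raw - y) ** 2

def gradient(y, raw):
    return raw - y

rng = np.random.default_rng(0)
y, raw = rng.normal(size=10), rng.normal(size=10)
eps = 1e-6
# central finite difference w.r.t. raw_prediction
fd = (half_squared_error(y, raw + eps) - half_squared_error(y, raw - eps)) / (2 * eps)
assert np.allclose(fd, gradient(y, raw), rtol=1e-6, atol=1e-8)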
Force-pushed from dea5fd1 to a5018f4.
Closed in favor of #20567.
Reference Issues/PRs
Solves #15123.
Alternative PR using tempita to remove code duplication: #20567
What does this implement/fix? Explain your changes.
This PR introduces a new private module sklearn._loss. It implements the most common loss functions, their gradients and hessians, a low-level Cython interface and a high-level Python interface.
Any other comments?
n_threads for OpenMP parallelism can be set in any calling function.
Losses implemented
Follow-up
See PR #19089 for loss replacement of HGBT and linear logistic regression.