
[MRG] Common Private Loss Module #19088

Closed. Wanted to merge 26 commits from the loss_module branch.

Conversation

@lorentzenchr (Member) commented Jan 1, 2021

Reference Issues/PRs

Solves #15123.
Alternative PR using Tempita to remove code duplication: #20567

What does this implement/fix? Explain your changes.

This PR introduces a new private module, sklearn._loss. It implements the most common loss functions together with their gradients and Hessians, exposing a low-level Cython interface and a high-level Python interface (see the usage sketch at the end of this description).

Any other comments?

  • Solid test suite. (I hope so! Some more tests are possible.)
  • n_threads for OpenMP parallelism can be set in any calling function.
  • The implementation is very similar to the losses in HGBT.

Losses implemented

  • squared error
  • absolute error
  • pinball/quantile loss
  • Poisson deviance (log link)
  • Gamma deviance (log link)
  • Tweedie deviance (log link)
  • Log-loss/Binary crossentropy (a numerically stable version)
  • Multinomial-loss/Categorical crossentropy (label encoded target)

Follow-up

See PR #19089 for loss replacement of HGBT and linear logistic regression.
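As a usage sketch of the intended high-level interface (my illustration; the class name and exact import path are assumptions, while the _loss signature follows the snippets quoted later in this thread):

import numpy as np
# Hypothetical import; the Python-level class name is an assumption.
from sklearn._loss import HalfSquaredError

loss = HalfSquaredError()
y_true = np.array([0.0, 1.0, 2.0])
raw_prediction = np.array([0.1, 0.9, 2.2])
out = np.empty_like(y_true)

# _loss(y_true, raw_prediction, sample_weight, loss, n_threads)
loss._loss(y_true, raw_prediction, None, out, n_threads=1)
print(out)  # element-wise loss values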

@lorentzenchr lorentzenchr force-pushed the loss_module branch 2 times, most recently from b3e670a to 8c3c710 Compare January 2, 2021 12:28
@lorentzenchr lorentzenchr changed the title [WIP] Common Private Loss Module [MRG] Common Private Loss Module Jan 2, 2021
Base automatically changed from master to main January 22, 2021 10:53
@lorentzenchr (Member Author) commented:

To keep this PR clean, I rebased the 2 commits with b7e34bf.

@rth (Member) commented Mar 29, 2021

Thanks for doing this work, @lorentzenchr!

Should we start using this module at least for GLMs, and remove the parts of glm_distribution.py which become redundant? On the one hand, adding a private code API that is tested but not yet used is fine as part of the migration; but if we can remove some code at the same time, that would be better IMO. Unless you think it would make this harder to review?

Would it make sense to separate regression and classification losses with a mixin and possibly put them in separate files?

@thomasjpfan (Member) commented:

Would it make sense to separate regression and classification losses with a mixin and possibly put them in separate files?

I think separating the losses into regression and classification would make this PR easier to review.

@cmarmo (Contributor) left a comment:

Hi @lorentzenchr, I have started to have a look at your pull request to become more familiar with loss nomenclature. Together with two minor comments (more will come... :) ), I have a major one about documentation. I know this is a private module, and it is not clear to me how private modules have been managed in scikit-learn. Still, some private functions are documented in scikit-learn, and you have already written the docstrings for each of the losses, but they are not rendered in the documentation. I believe having the losses described in the docs will be really helpful and will save you some work in finalizing #11430 (which I would really like to see merged one day... :) ). Thanks for your work!

@cmarmo (Contributor) commented Mar 31, 2021

Not sure this is relevant... apparently #14265 and #15116 introduced some guidelines to manage parallelism... perhaps there is a way to connect the n_threads here to the OMP_NUM_THREADS and SKLEARN_OPENMP_PARALLELISM_ENABLED environment variables? Like here:

IF SKLEARN_OPENMP_PARALLELISM_ENABLED:
    cdef int n_threads = omp_get_max_threads()
ELSE:
    cdef int n_threads = 1

@lorentzenchr (Member Author) commented:

Documentation

Based on @cmarmo's suggestion in #19088 (review), I have the same question: should the docstrings of this private module be rendered under the API Reference or not?

Number of Threads

Not sure this is relevant... apparently #14265 and #15116 introduced some guidelines to manage parallelism... perhaps there is a way to connect the n_threads here to the OMP_NUM_THREADS and SKLEARN_OPENMP_PARALLELISM_ENABLED environment variables?

I think it is beneficial to have complete control over the number of threads in this low-level module, i.e. leave it as is. Any function/estimator, in particular the user-facing ones, can then decide to use the logic with OMP_NUM_THREADS; a sketch follows below. I guess there will be instances where we always want to set n_threads=1. This is in line with what is done in #17622.
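As a minimal sketch (my addition, not part of this PR) of what "control in the caller" could look like: the calling estimator picks the thread count, for instance with the existing helper _openmp_effective_n_threads (assumed here to honor OMP_NUM_THREADS and the build-time OpenMP switch), while the loss module keeps its explicit n_threads argument.

import numpy as np
from sklearn.utils._openmp_helpers import _openmp_effective_n_threads

def caller_side_loss(loss_obj, y_true, raw_prediction, sample_weight=None):
    # The calling estimator, not the loss module, decides on parallelism;
    # a solver could also simply hard-code n_threads=1 here.
    n_threads = _openmp_effective_n_threads()
    out = np.empty_like(y_true)
    # Signature as sketched later in this thread:
    # _loss(y_true, raw_prediction, sample_weight, loss, n_threads)
    return loss_obj._loss(y_true, raw_prediction, sample_weight, out, n_threads)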

@lorentzenchr (Member Author) commented:

Should we start using this module at least for GLMs, and remove the parts of glm_distribution.py which become redundant? On the one hand, adding a private code API that is tested but not yet used is fine as part of the migration; but if we can remove some code at the same time, that would be better IMO. Unless you think it would make this harder to review?

I think this PR is already quite large (with a lot of repetition; I miss C++ templates...). I'd like to completely remove glm_distribution.py in the end. My idea was to have a LinearLoss, see https://github.com/lorentzenchr/scikit-learn/blob/4e007384e7eb4e86553a1bb44a220d9a7ce48134/sklearn/linear_model/_linear_loss.py#L10, which is from #19089.
I don't know if it is worth replacing some parts of glm_distributions.py just temporarily.

Would it make sense to separate regression and classification losses with a mixin and possibly put them in separate files?

I think separating the losses into regression and classification would make this PR easier to review.

Do you mean *.py (x)or *.pyx files? We could restructure into:

  • _loss_common.pyx with just cLossFunction
  • _loss_regression.pyx
  • _loss_classification.pyx

Where should cBinaryCrossEntropy go? From my point of view, it is a regression loss for targets in the closed interval [0, 1]; see the sketch below. Only cCategoricalCrossEntropy requires discrete targets, i.e. label encoding instead of OHE, in order to save memory.
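For intuition, a short algebra sketch (my addition, standard textbook material): writing the numerically stable log-loss in terms of the raw prediction z shows it is well-defined for every y in [0, 1], not only for y in {0, 1}:

L(y, z) = -\left[ y \log p + (1 - y) \log(1 - p) \right]
        = \log\left(1 + e^{z}\right) - y z,
\qquad p = \operatorname{expit}(z) = \frac{1}{1 + e^{-z}},

which is finite and convex in z for every fixed y \in [0, 1].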

@thomasjpfan (Member) commented:

I think this PR is already quite large (with a lot of repetition; I miss C++ templates...).

I would be happy with C++ templates if it makes the code easier to maintain. How would you want to use templates here?

Where should cBinaryCrossEntropy go? From my point of view, it is a regression loss for targets in the closed interval [0, 1].

Hmm breaking this PR up into categorical + regression would not work. What about just adding cHalfSquaredError and cBinaryCrossEntropy? Future PRs would be easier to merge because the tests would have already been merged.

@lorentzenchr (Member Author) commented:

I would be happy with C++ templates if it makes the code easier to maintain. How would you want to use templates here?

Even with additional dependencies, I don't have a convincing solution for vectorizing all the scalar loss/gradient/hessian functions (aka making them ufuncs) together with OpenMP support (n_threads); see #15123 (comment) and the following comments.

What about just adding cHalfSquaredError and cBinaryCrossEntropy? Future PRs would be easier to merge because the tests would have already been merged.

I can certainly do that, if another possible reviewer agrees. In that case, I would open a new PR.

@thomasjpfan (Member) left a comment:

Some comments about the design/API.

for i in prange(
    n_samples, schedule='static', nogil=True, num_threads=n_threads
):
    loss[i] = closs_pinball_loss(y_true[i], raw_prediction[i], self.quantile)
@thomasjpfan (Member):

If y_true has a dtype of float32, would this upcast to float64 each time?

@lorentzenchr (Member Author) replied:

Yes, it is upcast to float64. This is similar to the HGBT losses, where any call to exp, expit, etc. upcasts to float64, e.g.

p_i = _cexpit(raw_predictions[i])

See also #18659.
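To make the mechanism explicit, a small sketch (mine, not code from the PR): the scalar kernels take C doubles, so a float32 element read from a fused-type memoryview is implicitly converted to double at the call site, per the usual C conversion rules.

from libc.math cimport exp

# Scalar kernel computes in double precision.
cdef inline double _cexpit_like(double x) nogil:
    return 1.0 / (1.0 + exp(-x))

# If raw_predictions is a float32 memoryview, raw_predictions[i] is
# upcast to double at this call site, on every iteration:
#     p_i = _cexpit_like(raw_predictions[i])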

Comment on lines +786 to +846
cdef double closs(self, double y_true, double raw_prediction) nogil:
    return closs_pinball_loss(y_true, raw_prediction, self.quantile)
@thomasjpfan (Member):

It does not look like closs is being called anywhere. Where do you expect this function to be used?

@lorentzenchr (Member Author) replied:

This is expected to be used in SGD, see

cdef double loss(self, double p, double y) nogil:

@lorentzenchr (Member Author) commented Apr 11, 2021

I would be happy with C++ templates if it makes the code easier to maintain. How would you want to use templates here?

@thomasjpfan We could get rid of all the repetitions of the 50 instances of for i in prange() by using a templated wrapper function that accepts functors, i.e. C++ structs with a call operator. This way, the functions can be inlined inside the for-loop (as opposed to passing pointers to loss functions into a wrapper function like generic_loop). I've benchmarked one implementation of this idea. It should also work with OpenMP, but we have to put in the #pragma omp directives ourselves, see https://github.com/lorentzenchr/notebooks/blob/master/bench_logloss_Cython.ipynb, in particular the plot under section 2.2 (same as in details). I don't know why the templated version seems to have an overhead compared to the other implementations for large arrays. I also observed that the first call to a function takes longer than subsequent calls (i.e. timeit gives shorter times compared to timing myself and calling gc.collect() each time).

TL;DR: I have no conclusion about which approach is best or generally fastest. I think I'll give up benchmarking, as it seems to be a magic art of wizards :blush:


@jjerphan (Member) left a comment:

This is an impressive contribution, @lorentzenchr!

This is my first pass; I still need to get into the details of the tests. But here are a first few comments.

@lorentzenchr (Member Author) commented:

I'm not 100% sure but my guess is that the CI failure for doc-min-dependencies is due to Cython=0.28.5. This did not happen before setting NPY_NO_DEPRECATED_API in a55d0ff, so I'm reverting that commit.

gcc -pthread -shared -B /home/circleci/miniconda/envs/testenv/compiler_compat -L/home/circleci/miniconda/envs/testenv/lib -Wl,-rpath=/home/circleci/miniconda/envs/testenv/lib -Wl,--no-as-needed -Wl,--sysroot=/ build/temp.linux-x86_64-3.6/sklearn/_isotonic.o -Lbuild/temp.linux-x86_64-3.6 -lm -o sklearn/_isotonic.cpython-36m-x86_64-linux-gnu.so -fopenmp
error: Command "gcc -pthread -B /home/circleci/miniconda/envs/testenv/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION -I/home/circleci/miniconda/envs/testenv/lib/python3.6/site-packages/numpy/core/include -I/home/circleci/miniconda/envs/testenv/lib/python3.6/site-packages/numpy/core/include -I/home/circleci/miniconda/envs/testenv/include/python3.6m -I/home/circleci/miniconda/envs/testenv/include/python3.6m -c sklearn/_loss/_loss.c -o build/temp.linux-x86_64-3.6/sklearn/_loss/_loss.o -MMD -MF build/temp.linux-x86_64-3.6/sklearn/_loss/_loss.o.d -fopenmp" failed with exit status 1


@lorentzenchr lorentzenchr force-pushed the loss_module branch 2 times, most recently from 3d1d995 to c7f7523 Compare May 10, 2021 16:09
@jjerphan (Member) commented:

Hi @lorentzenchr,

I would like to do another review of this PR, as well as of the previous experiments you mentioned; I need to find some time. :)

I also think that reducing code duplication would greatly help with maintaining this module; yet I do not see how easy or even possible that is.

@jjerphan (Member) left a comment:

I got some time to get into the losses after reviewing your notebook for experiments: all the implementations in sklearn/_loss/_loss.pyx look correct to me. 👍

I still need to go through the tests.

I've left some comments regarding design and efficiency.

Comment on lines +9 to +18
# Fused types for y_true, y_pred, raw_prediction
ctypedef fused Y_DTYPE_C:
    np.npy_float64
    np.npy_float32


# Fused types for gradient and hessian
ctypedef fused G_DTYPE_C:
    np.npy_float64
    np.npy_float32
@jjerphan (Member):

Fused types have been defined in the code base for KDTree, BallTree and DistMetrics (here are their definitions).

To ease maintenance in the long run, we might want to define as few new fused types as possible (I would favor using existing ones or Cython's built-in fused types).

Are there particular reasons for using numpy's types?
Would using Cython's floating be possible here?

@lorentzenchr (Member Author) replied:

No particular reason for numpy types other than copying from the hist_gradient_boosting/common.pxd.

I do not use cython.floating as I want to have 2 different fused types:

  • Y_DTYPE_C for y_true, raw_prediction, sample_weight
  • G_DTYPE_C for loss, gradient, hessian

Using only one fused type, all of them would always be forced to the same dtype. The use case is again HGBT, where gradients and hessians are float32, whereas y_true is float64; see the sketch below.
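A minimal sketch (my addition, not code from this PR) of why two independent fused types matter: each fused type is specialized independently, so y_true can stay float64 while the gradient output is float32.

cimport numpy as np

ctypedef fused Y_DTYPE_C:
    np.npy_float64
    np.npy_float32

ctypedef fused G_DTYPE_C:
    np.npy_float64
    np.npy_float32

# 2 x 2 = 4 specializations are generated; the HGBT use case corresponds
# to Y_DTYPE_C = npy_float64 and G_DTYPE_C = npy_float32.
cdef void _gradient_sketch(
    Y_DTYPE_C[::1] y_true,
    Y_DTYPE_C[::1] raw_prediction,
    G_DTYPE_C[::1] gradient,
) nogil:
    cdef int i
    for i in range(y_true.shape[0]):
        gradient[i] = <G_DTYPE_C> (raw_prediction[i] - y_true[i])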

@jjerphan (Member):

This makes sense. Maybe it is worth stating the motives here, explaining that defining two types allows more flexibility on the inputs.

@lorentzenchr (Member Author) replied:

Reading a bit in https://numpy.org/devdocs/reference/c-api/dtype.html: np.npy_float64 really is always a 64-bit floating point number, even on 32-bit systems, in contrast to plain C double, which may be implementation specific. But I wonder whether any C compiler used for scikit-learn does not comply with IEEE 754-1989.

cdef class cBinaryCrossEntropy(cLossFunction):
    cdef double closs(self, double y_true, double raw_prediction) nogil
    cdef double cgradient(self, double y_true, double raw_prediction) nogil
    cdef double2 cgrad_hess(self, double y_true, double raw_prediction) nogil
@jjerphan (Member):

Isn't cCategoricalCrossEntropy missing here?

@lorentzenchr (Member Author) replied:

Good question! cCategoricalCrossEntropy does not have an implementation of closs, cgradient and cgrad_hess. Those C functions working on single data points are intended for SGD in linear models, and there is no multiclass support there so far.
On top of that, the signature would be different. We have

cdef double closs(self, double y_true, double raw_prediction)

But for cCategoricalCrossEntropy, it would be

cdef double closs(self, double y_true, double* raw_prediction)

# cython: wraparound=False
# cython: language_level=3

# Design:
@jjerphan (Member) commented May 18, 2021:

Thanks for providing design notes 👍

There's a significant amount of code duplication across the _loss, _gradient and _gradient_hessian methods.
Is there a particular reason for this? Is it necessary for the implementation (I am thinking of your comment here)?

If this is for efficiency reasons (like avoiding dispatch when calling template methods), it might be possible to factor some code out of those classes' methods without losing efficiency, using an include file and include statements.

This is what has been used for KDTree and BallTree, whose main methods are defined in an include file. See this comment for the initial motives.

@lorentzenchr (Member Author) replied:

The performance-critical point is c): wrap the C functions in a loop to get Python functions operating on ndarrays.
Options:

  1. Implement the loop functions once in the base class and use virtual inheritance => performance penalty
  2. Use a wrapper function that takes a function pointer and loops over it => performance penalty
  3. Use C++ CRTP: make the virtual inheritance of 1. a compile-time problem.
  4. Use C++ templates on functors (structs with operator()): make 2. a compile-time problem.
  5. Use Tempita or another template language based on string replacement (as in _sag_fast.pyx.tp) => no syntax checks
  6. Write out the loops manually = current approach: assured performance, no C++, but a lot of repeated code.

1 and 2 have performance penalties. 3 and 4 would require a considerable commitment to C++ templates. 5 is ugly. That leaves us with 6 :zany_face: :rofl: (A sketch of option 2 follows below.)
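To make the penalty of option 2 concrete, here is a minimal sketch (mine, not code from this PR) of a generic loop over a C function pointer; the indirect call in the hot loop cannot be inlined by the compiler, which is exactly the overhead option 2 refers to.

ctypedef double (*loss_fn)(double y_true, double raw_prediction) nogil

cdef void generic_loop(
    loss_fn fn,
    double[::1] y_true,
    double[::1] raw_prediction,
    double[::1] loss,
) nogil:
    cdef int i
    for i in range(y_true.shape[0]):
        # Indirect call through the function pointer: correct, but opaque
        # to the optimizer and therefore not inlinable.
        loss[i] = fn(y_true[i], raw_prediction[i])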

@jjerphan (Member):

OK, using include files and include statements is kind of in between 5. and 6.

Is 1. really that costly?

@lorentzenchr (Member Author) replied May 18, 2021:

Have a look at my notebook; the penalty seems to be roughly 10-20% compared to 6 (the current implementation).

@jjerphan (Member):

I understand the need for performance (which I value), but to me it comes at the cost of harder code maintenance.

I think the view of a core-dev is necessary here.

@jjerphan (Member) commented May 25, 2021:

I was thinking of proceeding similarly to how KDTree and BallTree are implemented: the main implementation is defined via template methods in an include file (binary_tree.pxi):

# Implementation Notes
# --------------------
# This implementation uses the common object-oriented approach of having an
# abstract base class which is extended by the KDTree and BallTree
# specializations.
#
# The BinaryTree "base class" is defined here and then subclassed in the BallTree
# and KDTree pyx files. These files include implementations of the
# "abstract" methods.

The helper functions for both KDTree and BallTree:

# Necessary Helper Functions
# --------------------------
# These are the names and descriptions of the "abstract" functions which are
# defined in kd_tree.pyx and ball_tree.pyx:
# cdef int allocate_data(BinaryTree tree, ITYPE_t n_nodes, ITYPE_t n_features):
# """Allocate arrays needed for the KD Tree"""
# cdef int init_node(BinaryTree tree, ITYPE_t i_node,
# ITYPE_t idx_start, ITYPE_t idx_end):
# """Initialize the node for the dataset stored in tree.data"""
# cdef DTYPE_t min_rdist(BinaryTree tree, ITYPE_t i_node, DTYPE_t* pt):
# """Compute the minimum reduced-distance between a point and a node"""
# cdef DTYPE_t min_dist(BinaryTree tree, ITYPE_t i_node, DTYPE_t* pt):
# """Compute the minimum distance between a point and a node"""
# cdef DTYPE_t max_rdist(BinaryTree tree, ITYPE_t i_node, DTYPE_t* pt):
# """Compute the maximum reduced-distance between a point and a node"""
# cdef DTYPE_t max_dist(BinaryTree tree, ITYPE_t i_node, DTYPE_t* pt):
# """Compute the maximum distance between a point and a node"""
# cdef inline int min_max_dist(BinaryTree tree, ITYPE_t i_node, DTYPE_t* pt,
# DTYPE_t* min_dist, DTYPE_t* max_dist):
# """Compute the minimum and maximum distance between a point and a node"""
# cdef inline DTYPE_t min_rdist_dual(BinaryTree tree1, ITYPE_t i_node1,
# BinaryTree tree2, ITYPE_t i_node2):
# """Compute the minimum reduced distance between two nodes"""
# cdef inline DTYPE_t min_dist_dual(BinaryTree tree1, ITYPE_t i_node1,
# BinaryTree tree2, ITYPE_t i_node2):
# """Compute the minimum distance between two nodes"""
# cdef inline DTYPE_t max_rdist_dual(BinaryTree tree1, ITYPE_t i_node1,
# BinaryTree tree2, ITYPE_t i_node2):
# """Compute the maximum reduced distance between two nodes"""
# cdef inline DTYPE_t max_dist_dual(BinaryTree tree1, ITYPE_t i_node1,
# BinaryTree tree2, ITYPE_t i_node2):
# """Compute the maximum distance between two nodes"""

are defined in the respective .pyx files, here

#----------------------------------------------------------------------
# The functions below specialized the Binary Tree as a KD Tree
#
# Note that these functions use the concept of "reduced distance".
# The reduced distance, defined for some metrics, is a quantity which
# is more efficient to compute than the distance, but preserves the
# relative rankings of the true distance. For example, the reduced
# distance for the Euclidean metric is the squared-euclidean distance.
# For some metrics, the reduced distance is simply the distance.

and there:

#----------------------------------------------------------------------
# The functions below specialized the Binary Tree as a Ball Tree
#
# Note that these functions use the concept of "reduced distance".
# The reduced distance, defined for some metrics, is a quantity which
# is more efficient to compute than the distance, but preserves the
# relative rankings of the true distance. For example, the reduced
# distance for the Euclidean metric is the squared-euclidean distance.
# For some metrics, the reduced distance is simply the distance.


Here, one would proceed as follows:

  • make _loss.pyx an include file _loss.pxi which defines cLossFunction and its template methods (that is cLossFunction.{closs, cgradient, cgrad_hess, _loss, _gradient, _loss_gradient, and _gradient_hessian}).
  • make a .pyx file per loss, which defines a class extending cLossFunction and which implements the helper functions called by the template methods (closs, cgradient, closs_grad, cgrad_hess).

In this context, cLossFunction, defined in _loss.pxi, would look like this:

cdef class cLossFunction:
    """Base class for convex loss functions."""
    
    cdef double closs(self, double y_true, double raw_prediction) nogil:
        """Compute the loss for a single sample.

        Parameters
        ----------
        y_true : double
            Observed, true target value.
        raw_prediction : double
            Raw prediction value (in link space).

        Returns
        -------
        double
            The loss evaluated at `y_true` and `raw_prediction`.
        """
        return closs(y_true, raw_prediction)
        
    # cgradient and cgrad_hess are defined similarly
    
    def _loss(
        self,
        Y_DTYPE_C[::1] y_true,          # IN
        Y_DTYPE_C[::1] raw_prediction,  # IN
        Y_DTYPE_C[::1] sample_weight,   # IN
        G_DTYPE_C[::1] loss,            # OUT
        int n_threads=1
    ):
        """Compute the pointwise loss value for each input.

        Parameters
        ----------
        y_true : array of shape (n_samples,)
            Observed, true target values.
        raw_prediction : array of shape (n_samples,)
            Raw prediction values (in link space).
        sample_weight : array of shape (n_samples,) or None
            Sample weights.
        loss : array of shape (n_samples,)
            A location into which the result is stored.
        n_threads : int
            Number of threads used by OpenMP (if any).

        Returns
        -------
        loss : array of shape (n_samples,)
            Element-wise loss function.
        """
        cdef:
            int i
            int n_samples = y_true.shape[0]

        if sample_weight is None:
            for i in prange(
                n_samples, schedule='static', nogil=True, num_threads=n_threads
            ):
                loss[i] = closs(y_true[i], raw_prediction[i])
        else:
            for i in prange(
                n_samples, schedule='static', nogil=True, num_threads=n_threads
            ):
                loss[i] = (sample_weight[i]
                    * closs(y_true[i], raw_prediction[i]))

        return np.asarray(loss)

    # _gradient, _loss_gradient and _gradient_hessian are defined similarly

and _half_squared_error.pyx would for instance look like:

# Note: we potentially need some more imports for some losses.
include "_loss.pxi"

# -------------------------------------
# Single point inline C functions
# -------------------------------------
# Half Squared Error
cdef inline double closs(
    double y_true,
    double raw_prediction
) nogil:
    return 0.5 * (raw_prediction - y_true) * (raw_prediction - y_true)


cdef inline double cgradient(
    double y_true,
    double raw_prediction
) nogil:
    return raw_prediction - y_true


cdef inline double_pair cgrad_hess(
    double y_true,
    double raw_prediction
) nogil:
    cdef double_pair gh
    gh.val1 = raw_prediction - y_true  # gradient
    gh.val2 = 1.                       # hessian
    return gh


cdef class cHalfSquaredError(cLossFunction):
    """Half Squared Error with identity link.

    Domain:
    y_true and y_pred all real numbers

    Link:
    y_pred = raw_prediction
    """
    # cLossFunction entirely defines the behavior of this loss
    # via its template methods.
    pass

Pros: this would remove code duplication, easing maintenance. Performance and the class hierarchy aren't modified (no use of virtual inheritance, as discussed previously).

Cons: the compilation will take longer, the binaries will be larger, and more files will come in. Functions which are used as helpers in several cLossFunction subclasses (e.g. cgrad_hess_half_gamma and cgrad_hess_half_poisson, called by cgrad_hess_half_tweedie) could be defined in the main (or in another) include file, but I think this is manageable.

What do you think of this?

@lorentzenchr (Member Author) replied:

@jjerphan Thanks for explaining the include statement in more detail. Now I get it.
I think it's a viable solution. Mainly, I have 2 concerns: first, the binaries are already largish; second, I am not sure whether the tradeoff between less code overall and more individual files comes out positive or negative.

@jjerphan (Member) commented May 29, 2021:

@jjerphan Thanks for explaining the include statement in more detail. Now I get it.

You're welcome, I should have done it initially.

Mainly, I have 2 concerns: first, the binaries are already largish; second, I am not sure whether the tradeoff between less code overall and more individual files comes out positive or negative.

I do agree; it mainly helps with removing duplicated logic. Moreover, what I propose here relies on include files, which are a deprecated development pattern in Cython.

I can explore it in a PR on your fork with this PR branch as a target to see if this would even be possible.

@lorentzenchr (Member Author) replied:

Although not a beauty, I favor the use of cython.Tempita over include (a rough sketch follows this comment). Both are already used in the codebase, so no extra commitment is required. However, I'd be reluctant to put work into a proof-of-concept PR, as we know such solutions work; no need for a proof of concept (at least not for me).

How to proceed?
I guess further feedback might be the best way forward. I can put it on the agenda of the next dev meeting (for the 3rd time?). My hope was that this PR is far from controversial and easy to merge:

  • It does not touch anything (yet).
  • Benchmarks indicate no regression and some nice improvements.
  • Tests are very, very solid (more than the combination of tests from the separate/cluttered loss modules!)
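A rough sketch (mine, with hypothetical names) of what the Tempita approach could look like, following the style of _sag_fast.pyx.tp: the prange loop is written once in a .pyx.tp template and expanded per loss at build time by string substitution. Y_DTYPE_C and G_DTYPE_C are assumed to come from the module's .pxd, and the scalar closs_* functions from _loss.pyx.

# _loss.pyx.tp -- hypothetical Tempita template, processed at build time
import numpy as np
from cython.parallel import prange

{{py:
# One entry per loss: (class name suffix, scalar C loss function).
losses = [
    ("HalfSquaredError", "closs_half_squared_error"),
    ("AbsoluteError", "closs_absolute_error"),
]
}}

{{for name, closs in losses}}

cdef class c{{name}}(cLossFunction):

    def _loss(self, Y_DTYPE_C[::1] y_true, Y_DTYPE_C[::1] raw_prediction,
              G_DTYPE_C[::1] loss, int n_threads=1):
        cdef int i
        for i in prange(y_true.shape[0], schedule='static', nogil=True,
                        num_threads=n_threads):
            loss[i] = {{closs}}(y_true[i], raw_prediction[i])
        return np.asarray(loss)

{{endfor}}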

@jjerphan (Member) commented Jul 15, 2021:

Although not a beauty, I favor the use of cython.Tempita over include. Both are already used in the codebase, so no extra commitment is required. However, I'd be reluctant to put work into a proof-of-concept PR, as we know such solutions work; no need for a proof of concept (at least not for me).

OK.

How to proceed? I guess further feedback might be the best way forward.

So do I.

On a side note, perhaps cython.final can help with method inlining here?

Edit: I do not have a strong opinion on using cython.Tempita over include.

@lorentzenchr (Member Author) commented:

I got some time to get into the losses after reviewing your notebook for experiments: all the implementations in sklearn/_loss/_loss.pyx look correct to me. 👍

Thanks for your effort and for checking. As all tests pass, I'm pretty sure that the loss implementations, gradients and hessians are correct.

Regarding design decisions and efficiency, see my comment #19088 (comment).

@lorentzenchr (Member Author) commented:

Closed in favor of #20567.

Successfully merging this pull request may close these issues: A common private module for differentiable loss functions used as objective functions in estimators.

6 participants