Add pairwise #54

nalimilan · 2020-11-12T21:27:14Z

It works well overall but there are a few issues:

When vectors have different types, promote_typejoin is used (via broadcast) to choose element type. promote_type would be more appropriate but there is no mechanism to do this in Base.
When skipping missing values, inference isn't able to realize that the result cannot be missing. This problem is fixed if I use a positional argument rather than a keyword argument for skipmissing but it's far from ideal for users.
Since eachrow(df::DataFrame) returns a DataFrameRows objects which is also a table, pairwise(cor, eachrow(df)) still computes correlation between columns rather than between rows. And anyway DataFrameRow is not accepted by cor
since it's not an AbstractVector. One needs something like (Vector(r) for r in eachrow(df)) to work around this limitation.
Since Tables.jl objects can be of any type, we must have single method that performs dispatch internally by calling Tables.istable. Even AbstractVector inputs can be either vectors of vectors or row-oriented or column-oriented tables. This means the method that returns a named array and the method that returns a plain matrix have to live in the same package (which has to depend on NamedArrays).

The package should probably be renamed if we want to include this as it's clearly not a frequency table.

It works well overall but there are a few issues: - When vectors have different types, `promote_typejoin` is used (via `broadcast`) to choose element type. `promote_type` would be more appropriate but there is no mechanism to do this in Base. - When skipping missing values, inference isn't able to realize that the result cannot be `missing`. This problem is fixed if I use a positional argument rather than a keyword argument for `skipmissing` but it's far from ideal for users. - Since `eachrow(df::DataFrame)` returns a `DataFrameRows` objects which is also a table, `pairwise(cor, eachrow(df))` still computes correlation between columns rather than between rows. And anyway `DataFrameRow` is not accepted by `cor` since it's not an `AbstractVector`. One needs something like `(Vector(r) for r in eachrow(df))` to work around this limitation. - Since Tables.jl objects can be of any type, we must have single method that performs dispatch internally by calling `Tables.istable`. Even `AbstractVector` inputs can be either vectors of vectors or row-oriented or column-oriented tables. This means the method that returns a named array and the method that returns a plain matrix have to live in the same package (which has to depend on NamedArrays).

bkamins · 2020-11-14T17:50:17Z

src/pairwise.jl

+    pairwise(f, x[, y], symmetric::Bool=false, skipmissing::Symbol=:none)
+
+Return a matrix holding the result of applying `f` to all possible pairs
+of vectors in iterators `x` and `y`. Rows correspond to


why vectors? Cannot we say that x and y have to be iterables of iterables and we apply f to a their cross-product?

OK - now I see why below. Maybe then require x and y to be AbstractVector{<:AbstractVector}?

Yes I guess we could allow anything, f can take care of throwing an error if needed.

Also maybe write exactly what you have in code - if both x and y are Table.jl compliant then they are treated as table. Otherwise BOTH are treated as iterators?

Actually, to skip missing values we rely on view so I guess for that case we need AbstractVector. In the future we could use skipmissings but I'd say it's not high priority.

bkamins · 2020-11-14T17:53:53Z

src/pairwise.jl

+`skipmissing` is different from `:none`).
+
+# Keyword arguments
+- `symmetric::Bool=false`: If `true`, `f` is only called to compute


formally commutative, as symmetric is a property of a binary relation typically.

"Symmetric" referred to the table. Not sure what's the best term.

bkamins · 2020-11-14T17:59:21Z

src/pairwise.jl

+            return _pairwise(Val(skipmissing), f, x′, y′, symmetric)
+        else
+            throw(ArgumentError("x and y must be either iterators of AbstractArrays, " *
+                                "or Tables.jl objects"))


reverse the order here I think.

bkamins · 2020-11-14T18:01:58Z

src/pairwise.jl

+        keys(yi) == inds ||
+            throw(ArgumentError("All input vectors must have the same indices"))
+    end
+    x′ = collect(x)


maybe better use comprehension? It will narrow down the eltype of x'

Good question. I'm not sure what's best. This will only make a difference for vector inputs, right? For these, if you write [x1, x2] you'll get a narrow type already (actually, narrower than what a comprehension would do alone thanks to promote_type as opposed to typejoin). So that would be useful mainly if you pass a vector that you allocated with an abstract type, in which case you may have reasons to do that (avoid unnecessary specialization...).

Anyway it shouldn't affect performance a lot since most of the time should be spent in f(xi, yi). Do you have a use case in mind?

no - I just know that collect does not narrow type.

bkamins · 2020-11-14T18:04:34Z

I like it very much. The only question is if it would be hard/needed to make it more general and allow not only pairwise but arbitrary dimensional cross-product (not sure if it is needed though).

nalimilan · 2020-11-14T18:08:08Z

OK, thanks. It shouldn't be hard to support more types when we don't skip missing values, but it's more tricky when skipping them. Not sure it's worth adding that complexity right now give the intended use cases. EDIT: sorry I hadn't understood your point (I thought it was about allowing any iterators and not just vectors). Yes I imagine it would be possible to accept more than two arguments, but then it wouldn't be pairwise anymore, would it? I guess it would be outer. :-)

mschauer · 2020-11-19T10:12:53Z

Maybe just anticipate the use for more than 2 arguments and rename it, without implementing it yet.

nalimilan · 2020-11-19T10:45:44Z

But I find it much more user-friendly to tell people "to compute pairwise correlation, use pairwise(cor, df)" than "use outer(cor, df)". Also pairwise is already the name used in Distances for this operation and I don't see it moving to outer.

mschauer · 2020-11-19T11:05:52Z

I like calling it pairwise, because it does some specific things which make sense in the statistical/data context and not so much as generic map on all tuples (e.g. dealing with cor and missing).

quinnj · 2020-11-20T06:06:38Z

I'm fine adding pairwise to DataAPI.jl if that'd be helpful.

nalimilan · 2020-12-08T08:56:58Z

@dkarrasch I don't know whether you've seen this. Ideally this would cohabit nicely with Distances.jl, but the new support for passing vectors of vectors there (JuliaStats/Distances.jl#188 and JuliaStats/Distances.jl#194) will create ambiguities. Not sure what to do about it.

dkarrasch · 2020-12-08T09:25:32Z

No, I haven't seen this. The issue is that we relax types in those PRs, right? Is there any way you could depend on Distances here, and then define your own metric types that have fields corresponding to the keyword arguments you use here, and finally extend Distances.pairwise?

nalimilan · 2020-12-10T13:28:57Z

The problem is that Tables.jl objects don't have a particular type. Any object can be a table, including notably vectors or iterators of names tuples. As long as Distances only defines methods for matrices or vectors, the conflict with this PR is quite limited: only Tables.jl objects that are also vectors are problematic (including Vector{<: NamedTuple}}, but I can live with that). But if Distances accepts any iterator (JuliaStats/Distances.jl#194) then the method from this PR will never get called when the first argument is a metric supported by Distances, since Distances methods are more specific. So even depending on Distances here wouldn't fix the dispatch problem.

The only solution AFAICT is to have the most general function check whether an input is a table in the Tables.jl sense, and adapt its behavior depending on that. But @KristofferC didn't like Distances depend on Tables.jl (JuliaStats/Distances.jl#123), and there's the additional problem that we also want to return a NamedArray (or a similar type) in that case, which adds another dependency.

We could maybe work around this problem by having Distances check that the eltype is <: Union{Number, Missing} and if not bail out to a function that could live in DataAPI and that this PR would provide when the package is loaded (might be FreqTables or another more general package).

At any rate these issues are tricky. But I think the stats ecosystem would really benefit from having a general pairwise method that works with any function, with Distances providing optimized support for the metrics it defines.

bkamins · 2020-12-10T15:01:47Z

Maybe for the time being we might require FreqTables.pairwise and Distances.pairwise to be written if both packages are loaded (which is somewhat likely but not super likely)? Essentially - making the pairwise function to be owned by the package.

nalimilan · 2020-12-11T16:44:17Z

@dkarrasch Regarding your comment at JuliaStats/Distances.jl#188 (review) (better keep the discussion in a single place):

@nalimilan I wonder if it helps FreqTables.jl if we made these pairwise methods fully generic for iterators, and removed the type constraints on a and b. Then the Distances.jl pairwise method would act metric::PreMetric.

Making pairwise fully generic for iterators is what JuliaStats/Distances.jl#194 does, right? I don't think that would help for the reasons I gave above: Distances would still define the most specific method, so the one in this PR wouldn't have a chance of catching e.g. pairwise(Euclidean(), DataFrame(...)).

nalimilan · 2021-01-01T16:34:59Z

After discussing this on Slack with @bkamins and @quinnj I think the way forward is to require passing iterators over rows or columns of Tables.jl objects explicitly, via Tables.AbstractRows and Tables.AbstractColumns, or Tables.jl objects wrapped in explicit types. Then we can have fallback methods taking any iterator in StatsBase and Distances, with a common empty function definition in a package like DataAPI or a future StatsAPI. See JuliaStats/StatsBase.jl#627 for the StatsBase part.

nalimilan added 3 commits November 12, 2020 22:05

Enable tests

f850f5c

Improve handling of diagonal

ad13045

Repository owner deleted a comment from coveralls Nov 13, 2020

nalimilan added 2 commits November 13, 2020 10:25

Use cor/cov where possible for inference tests

556e3c6

Fix test deps

f2d856d

bkamins reviewed Nov 14, 2020

View reviewed changes

nalimilan mentioned this pull request Jan 1, 2021

Add pairwise JuliaStats/StatsBase.jl#627

Merged

nalimilan mentioned this pull request Jul 25, 2021

Added Minimal Support for Reliability Scores (aka Crombach's alpha) JuliaStats/StatsBase.jl#701

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add pairwise #54

Add pairwise #54

nalimilan commented Nov 12, 2020

bkamins Nov 14, 2020

bkamins Nov 14, 2020

nalimilan Nov 14, 2020

bkamins Nov 14, 2020

nalimilan Nov 14, 2020

bkamins Nov 14, 2020

nalimilan Nov 14, 2020

bkamins Nov 14, 2020

bkamins Nov 14, 2020

nalimilan Nov 14, 2020

bkamins Nov 14, 2020

bkamins commented Nov 14, 2020

nalimilan commented Nov 14, 2020 •

edited

Loading

mschauer commented Nov 19, 2020

nalimilan commented Nov 19, 2020

mschauer commented Nov 19, 2020

quinnj commented Nov 20, 2020

nalimilan commented Dec 8, 2020 •

edited

Loading

dkarrasch commented Dec 8, 2020

nalimilan commented Dec 10, 2020

bkamins commented Dec 10, 2020

nalimilan commented Dec 11, 2020

nalimilan commented Jan 1, 2021

Add pairwise #54

Are you sure you want to change the base?

Add pairwise #54

Conversation

nalimilan commented Nov 12, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bkamins commented Nov 14, 2020

nalimilan commented Nov 14, 2020 • edited Loading

mschauer commented Nov 19, 2020

nalimilan commented Nov 19, 2020

mschauer commented Nov 19, 2020

quinnj commented Nov 20, 2020

nalimilan commented Dec 8, 2020 • edited Loading

dkarrasch commented Dec 8, 2020

nalimilan commented Dec 10, 2020

bkamins commented Dec 10, 2020

nalimilan commented Dec 11, 2020

nalimilan commented Jan 1, 2021

nalimilan commented Nov 14, 2020 •

edited

Loading

nalimilan commented Dec 8, 2020 •

edited

Loading