Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add pairwise #54

Draft
wants to merge 5 commits into
base: master
Choose a base branch
from
Draft

Add pairwise #54

wants to merge 5 commits into from

Conversation

nalimilan
Copy link
Owner

It works well overall but there are a few issues:

  • When vectors have different types, promote_typejoin is used (via broadcast) to choose element type. promote_type would be more appropriate but there is no mechanism to do this in Base.
  • When skipping missing values, inference isn't able to realize that the result cannot be missing. This problem is fixed if I use a positional argument rather than a keyword argument for skipmissing but it's far from ideal for users.
  • Since eachrow(df::DataFrame) returns a DataFrameRows objects which is also a table, pairwise(cor, eachrow(df)) still computes correlation between columns rather than between rows. And anyway DataFrameRow is not accepted by cor
    since it's not an AbstractVector. One needs something like (Vector(r) for r in eachrow(df)) to work around this limitation.
  • Since Tables.jl objects can be of any type, we must have single method that performs dispatch internally by calling Tables.istable. Even AbstractVector inputs can be either vectors of vectors or row-oriented or column-oriented tables. This means the method that returns a named array and the method that returns a plain matrix have to live in the same package (which has to depend on NamedArrays).

The package should probably be renamed if we want to include this as it's clearly not a frequency table.

It works well overall but there are a few issues:
- When vectors have different types, `promote_typejoin` is used (via `broadcast`)
  to choose element type. `promote_type` would be more appropriate but there is
  no mechanism to do this in Base.
- When skipping missing values, inference isn't able to realize that the result
  cannot be `missing`. This problem is fixed if I use a positional argument rather
  than a keyword argument for `skipmissing` but it's far from ideal for users.
- Since `eachrow(df::DataFrame)` returns a `DataFrameRows` objects which is also
  a table, `pairwise(cor, eachrow(df))` still computes correlation between columns
  rather than between rows. And anyway `DataFrameRow` is not accepted by `cor`
  since it's not an `AbstractVector`. One needs something like
  `(Vector(r) for r in eachrow(df))` to work around this limitation.
- Since Tables.jl objects can be of any type, we must have single method that
  performs dispatch internally by calling `Tables.istable`. Even `AbstractVector`
  inputs can be either vectors of vectors or row-oriented or column-oriented tables.
  This means the method that returns a named array and the method that returns a
  plain matrix have to live in the same package (which has to depend on NamedArrays).
Repository owner deleted a comment from coveralls Nov 13, 2020
Repository owner deleted a comment from coveralls Nov 13, 2020
pairwise(f, x[, y], symmetric::Bool=false, skipmissing::Symbol=:none)

Return a matrix holding the result of applying `f` to all possible pairs
of vectors in iterators `x` and `y`. Rows correspond to
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why vectors? Cannot we say that x and y have to be iterables of iterables and we apply f to a their cross-product?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK - now I see why below. Maybe then require x and y to be AbstractVector{<:AbstractVector}?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I guess we could allow anything, f can take care of throwing an error if needed.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also maybe write exactly what you have in code - if both x and y are Table.jl compliant then they are treated as table. Otherwise BOTH are treated as iterators?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, to skip missing values we rely on view so I guess for that case we need AbstractVector. In the future we could use skipmissings but I'd say it's not high priority.

`skipmissing` is different from `:none`).

# Keyword arguments
- `symmetric::Bool=false`: If `true`, `f` is only called to compute
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

formally commutative, as symmetric is a property of a binary relation typically.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Symmetric" referred to the table. Not sure what's the best term.

return _pairwise(Val(skipmissing), f, x′, y′, symmetric)
else
throw(ArgumentError("x and y must be either iterators of AbstractArrays, " *
"or Tables.jl objects"))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reverse the order here I think.

keys(yi) == inds ||
throw(ArgumentError("All input vectors must have the same indices"))
end
x′ = collect(x)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe better use comprehension? It will narrow down the eltype of x'

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. I'm not sure what's best. This will only make a difference for vector inputs, right? For these, if you write [x1, x2] you'll get a narrow type already (actually, narrower than what a comprehension would do alone thanks to promote_type as opposed to typejoin). So that would be useful mainly if you pass a vector that you allocated with an abstract type, in which case you may have reasons to do that (avoid unnecessary specialization...).

Anyway it shouldn't affect performance a lot since most of the time should be spent in f(xi, yi). Do you have a use case in mind?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no - I just know that collect does not narrow type.

@bkamins
Copy link
Collaborator

bkamins commented Nov 14, 2020

I like it very much. The only question is if it would be hard/needed to make it more general and allow not only pairwise but arbitrary dimensional cross-product (not sure if it is needed though).

@nalimilan
Copy link
Owner Author

nalimilan commented Nov 14, 2020

OK, thanks. It shouldn't be hard to support more types when we don't skip missing values, but it's more tricky when skipping them. Not sure it's worth adding that complexity right now give the intended use cases. EDIT: sorry I hadn't understood your point (I thought it was about allowing any iterators and not just vectors). Yes I imagine it would be possible to accept more than two arguments, but then it wouldn't be pairwise anymore, would it? I guess it would be outer. :-)

@mschauer
Copy link

Maybe just anticipate the use for more than 2 arguments and rename it, without implementing it yet.

@nalimilan
Copy link
Owner Author

But I find it much more user-friendly to tell people "to compute pairwise correlation, use pairwise(cor, df)" than "use outer(cor, df)". Also pairwise is already the name used in Distances for this operation and I don't see it moving to outer.

@mschauer
Copy link

I like calling it pairwise, because it does some specific things which make sense in the statistical/data context and not so much as generic map on all tuples (e.g. dealing with cor and missing).

@quinnj
Copy link
Contributor

quinnj commented Nov 20, 2020

I'm fine adding pairwise to DataAPI.jl if that'd be helpful.

@nalimilan
Copy link
Owner Author

nalimilan commented Dec 8, 2020

@dkarrasch I don't know whether you've seen this. Ideally this would cohabit nicely with Distances.jl, but the new support for passing vectors of vectors there (JuliaStats/Distances.jl#188 and JuliaStats/Distances.jl#194) will create ambiguities. Not sure what to do about it.

@dkarrasch
Copy link

No, I haven't seen this. The issue is that we relax types in those PRs, right? Is there any way you could depend on Distances here, and then define your own metric types that have fields corresponding to the keyword arguments you use here, and finally extend Distances.pairwise?

@nalimilan
Copy link
Owner Author

The problem is that Tables.jl objects don't have a particular type. Any object can be a table, including notably vectors or iterators of names tuples. As long as Distances only defines methods for matrices or vectors, the conflict with this PR is quite limited: only Tables.jl objects that are also vectors are problematic (including Vector{<: NamedTuple}}, but I can live with that). But if Distances accepts any iterator (JuliaStats/Distances.jl#194) then the method from this PR will never get called when the first argument is a metric supported by Distances, since Distances methods are more specific. So even depending on Distances here wouldn't fix the dispatch problem.

The only solution AFAICT is to have the most general function check whether an input is a table in the Tables.jl sense, and adapt its behavior depending on that. But @KristofferC didn't like Distances depend on Tables.jl (JuliaStats/Distances.jl#123), and there's the additional problem that we also want to return a NamedArray (or a similar type) in that case, which adds another dependency.

We could maybe work around this problem by having Distances check that the eltype is <: Union{Number, Missing} and if not bail out to a function that could live in DataAPI and that this PR would provide when the package is loaded (might be FreqTables or another more general package).

At any rate these issues are tricky. But I think the stats ecosystem would really benefit from having a general pairwise method that works with any function, with Distances providing optimized support for the metrics it defines.

@bkamins
Copy link
Collaborator

bkamins commented Dec 10, 2020

Maybe for the time being we might require FreqTables.pairwise and Distances.pairwise to be written if both packages are loaded (which is somewhat likely but not super likely)? Essentially - making the pairwise function to be owned by the package.

@nalimilan
Copy link
Owner Author

@dkarrasch Regarding your comment at JuliaStats/Distances.jl#188 (review) (better keep the discussion in a single place):

@nalimilan I wonder if it helps FreqTables.jl if we made these pairwise methods fully generic for iterators, and removed the type constraints on a and b. Then the Distances.jl pairwise method would act metric::PreMetric.

Making pairwise fully generic for iterators is what JuliaStats/Distances.jl#194 does, right? I don't think that would help for the reasons I gave above: Distances would still define the most specific method, so the one in this PR wouldn't have a chance of catching e.g. pairwise(Euclidean(), DataFrame(...)).

@nalimilan
Copy link
Owner Author

After discussing this on Slack with @bkamins and @quinnj I think the way forward is to require passing iterators over rows or columns of Tables.jl objects explicitly, via Tables.AbstractRows and Tables.AbstractColumns, or Tables.jl objects wrapped in explicit types. Then we can have fallback methods taking any iterator in StatsBase and Distances, with a common empty function definition in a package like DataAPI or a future StatsAPI. See JuliaStats/StatsBase.jl#627 for the StatsBase part.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants