Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: support empty arrays, improve ibis.array() API #9473

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

NickCrews
Copy link
Contributor

@NickCrews NickCrews commented Jun 29, 2024

Picking out the array stuff from #8666.
I think this is a better approach than #9458

Instead of trying to fit the two cases of 0-length and 1+ length arrays into the same op, I split them up into separate ones.
By doing this, if we guarantee that all the elements of ops.Array() have the right type before construction,
we don't have to store an explicit dtype on ops.Array, and instead can just use rlz.highest_precedence_dtype(). And we we don't have to do fancy casting during compilation, all the elements will already have been casted as needed.

This allows for the compilation of array on some sql backends like postgres.
If we tried to cast the entire array, you end up with SQL like cast [..] as STRUCT<...>[],
which postgres freaks about.
Instead, if we cast each individual element,
such as [cast({...} as ROW..., cast({...} as ROW...], then this is valid SQL.

I added a Length annotation to ops.Array to verify the length is 1+. IDK, this isn't really needed, since if you ever did construct one, then the rlz.highest_precedence_dtype([]) would fail. But that might fail at a later time,
and I wanted to raise an error at construction time. But, end users should never be constructing ops.Arrays directly,
so this is a guardrail just for us ibis devs. So IDK, we could remove it, but I think it is a nice hint for future us.

Copy link
Contributor

ACTION NEEDED

Ibis follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message.

Please update your PR title and description to match the specification.

Picking out the array stuff from ibis-project#8666

Instead of trying to fit the two cases of 0-length and 1+ length arrays into the same op, I split them up into separate ones.
By doing this, if we guarantee that all the elements of ops.Array() have the right type before construction,
we don't have to do any fancy casting during compilation, all the elements will
already have been casted as needed.

This allows for the compilation of array<structs> on some sql backends like postgres.
If we tried to cast the entire array, you end up with SQL like `cast [..] as STRUCT<...>[]`,
which postgres freaks about.
Instead, if we cast each individual element,
such as `[cast({...} as ROW..., cast({...} as ROW...]`, then this is valid SQL.

I added a Length annotation to ops.Array to verify the length is 1+. IDK, this isn't really needed, since if you ever did construct one, then the rlz.highest_precedence_dtype([]) would fail. But that might fail at a later time,
and I wanted to raise an error at construction time. But, end users should never be constructing ops.Arrays directly,
so this is a guardrail just for us ibis devs.
So IDK, we could remove it, but I think it is a nice hint for future us.
@NickCrews NickCrews changed the title WIP: try ops.EmptyArray feat: support empty arrays, improve ibis.array() API Jun 29, 2024
@NickCrews NickCrews marked this pull request as ready for review June 29, 2024 18:45
@NickCrews
Copy link
Contributor Author

@cpcloud OK this is finally looking pretty good and is ready for a review!

@NickCrews NickCrews enabled auto-merge (rebase) June 29, 2024 20:23
Copy link
Member

@cpcloud cpcloud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's still too much unnecessary munging here in my opinion.

return ir.null(type)

values = tuple(values)
if len(values) == 0:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use standard Python idioms for this.

Suggested change
if len(values) == 0:
if not values:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops sorry will do.

if values is None:
if type is None:
raise ValidationError("If values is None/NULL, type must be provided")
return ir.null(type)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like this. It's unnecessary to have to support another way to construct a null value.

If someone wants a NULL with an array type, they should use the null function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a thought here but I better do it at my computer

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See these tests in mismo:

I need to construct an array<int> with a variety of values:

  • NULL
  • []
  • [1,2,3]

There are three different ibis APIs for doing this:

  • ibis.null(type="array<int>"): this only works for the NULL case. I would need some if/else branching in my test setup to use this for the NULL case, but would need something else for the other cases.
  • ibis.literal(val, type="array<int>"). This is currently what I do, and it works for all 3 inputs here, no if/else branching required. But the docstring for literal() says that it is going to get deprecated soon for constructing complex types, and I should use array()/map()/struct() instead.
  • ibis.array(): If we keep the functionality the way I propose here, then I can do ibis.array(val, type="array<int>") and it works for all three input cases. If we don't allow passing in None, then I would need to do some if/else branching to use ibis.null() in that case.

I think it can be summarized as SOMEONE is gonna need to do this if values is None: branching, and I would rather have it be down here in one place in library code, than in many places in user code.

return ops.Array(tuple(values)).to_expr()
type = dt.dtype(type) if type is not None else None
if type is not None and not isinstance(type, dt.Array):
raise ValidationError(f"type must be an array, got {type}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we just let this fall through and bubble up as it would? This function is already getting stuffed with a bunch of branching, and this doesn't seem to add much value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guarding against the footgun of someone passing in "float" instead of "array" (I literally accidentally did this as I adjusted the tests this last round), which would silently give the wrong result for some inputs. I agree not needed, I can remove, just a nicety

if isinstance(x, (ir.Value, Deferred)):
return x.cast(type) if type is not None else x
else:
return ibis.literal(x, type=type)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the purpose of literal here?

Copy link
Contributor Author

@NickCrews NickCrews Jun 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you do ibis.array([1, 2], type="array<float>"), we somehow need to get those ints to floats. My original implementation stored them in the array as ints, and convert to float during compilation. But that required storing the eventual type in ops.Array. so here we pre convert all the elements to floats, and now ops.Array can just use rlz.highest_precedence_dtype(). In order to do that pre casting, we need this literal.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still not sure why this is required. Is there some API that fails to execute if you don't do these casts?

What if the elements themselves are arrays? Doesn't this then need to be recursive in the casting/call to _value?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shoot, github escaped the <> I posted, I meant ibis.array([1, 2], type="array<float>") not ibis.array([1, 2], type="array")

Now does the concern make sense? Two ways of storing this:

  1. store the inputs as is, and keep track of the dtype separately op.Array(values=[1,2], value_type="float")
  2. cast the inputs early, then we don't need to store the dtype: op.Array(values=[op.Literal(1, "float"), op.Literal(2, "float")])

I am trying to go for option 2.

I'm still not sure why this is required

Specificaly, if I remove the .literal() here, then we pass a plain python 1 into the op.Array constructor, and then using pattern matching it tries to coerce that to an ibis expression, which yields a op.Literal(1, int). But we wanted it to be a float, so we have to do the casting up front here.

What if the elements themselves are arrays

you mean what if x were a python list of ibis values? ie someone did array_of_arrays = ibis.array([[ibis.literal(1)], [ibis.literal(2)]])? Yes, this would fail, good catch. not sure what to do about it yet....

@NickCrews
Copy link
Contributor Author

I'm gonna be out of office for this week. So I hope when I get back I see that y'all will have figured it all out and everything is merged and happy 😁

gforsyth pushed a commit that referenced this pull request Jul 1, 2024
Workaround for pola-rs/polars#17294

pl.concat_list(Iterable[T]) results in pl.List[T], EXCEPT when T is a
pl.List, in which case pl.concat_list(Iterable[pl.List[T]]) results in
pl.List[T].
If polars ever supports a more consistent array constructor,
we should switch to that.

Found this when working on
#9473
@NickCrews
Copy link
Contributor Author

@cpcloud whenever you get the chance, I responded to each of the individual comments, curious what you think. I agree it's sorta mungy, but I can't think of a cleaner way of supporting the API I want. I don't really want to cut anything from the API, but are you willing to? Eg not support ibis.array(None)?

@NickCrews
Copy link
Contributor Author

Perhaps the larger question here is the tension between

  • provide a high level API that is very flexible in what it takes in (None, python lists of python vals, python lists of ibis exprs, ibis array exprs)
  • don't over guarantee anything that then we are going to have to continue supporting.

As is reasonable with our roles in ibis (user vs maintainer) It sounds like I care more about the 1st one, you about the 2nd haha 😂

One argument for favoring flexibility is consistency: I think we can both agree supporting list[ir.Value] is needed for ibis.array, .map,and .struct (they already do). but this sort of opens up Pandora's box: now these nested constructors are the only ibis constructor API that accepts ibis values. I want to bring the rest of the constructors in line with this. I think stopping with the scalar constructors would make sense, but also making ibis.memtable more generous with accepting ibis values could be a nice DX improvement, as I advocate for in some other issue.

I'm curious what @kszucs @gforsyth @jcrist think about this tension. Should we make things easier for users, or keep our API maintainable?

@cpcloud
Copy link
Member

cpcloud commented Jul 15, 2024

Part of the reason that pandas API is so large and complex to use and maintain (in my opinion) is that it favors flexibility over consistency. I think we should err on the side of consistency as a project, and provide flexibility where there's not a ton of additional complexity.

That means that sometimes the complexity of handling multiple input types gets pushed to the user and I would argue that that is the best place to handle non-trivial input normalization since the application has much better information about how it wants those inputs to look.

I am pretty against opening up all the ibis.* constructor APIs to accept other Ibis expressions as it adds a bunch of requirements and decision making around what's valid and how we validate the input beyond what we're already doing.

For nested types and tables, things get really complex as now you have to decide whether to accept mixed Ibis-expression-Python-value objects, and are we going to validate all of that?

Copy link
Member

@cpcloud cpcloud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we please leave all the additional complexity and decisions out here beyond addressing the empty array case and allow the user to specify a type?

if type is not None and not isinstance(type, dt.Array):
raise ValidationError(f"type must be an array, got {type}")

if isinstance(values, ir.Value):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we're going to accept inputs like this. Can you get rid of this code path?

if isinstance(x, (ir.Value, Deferred)):
return x.cast(type) if type is not None else x
else:
return ibis.literal(x, type=type)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still not sure why this is required. Is there some API that fails to execute if you don't do these casts?

What if the elements themselves are arrays? Doesn't this then need to be recursive in the casting/call to _value?

@NickCrews
Copy link
Contributor Author

Can we please leave all the additional complexity and decisions out here beyond addressing the empty array case and allow the user to specify a type?

Yeah I think this is probably a good idea. I'll trim it down as I can.

To confirm, I am assuming that array_of_arrays = ibis.array([[ibis.literal(1)], [ibis.literal(2)]]) is a valid input. You also want to be able to support this, right?

@cpcloud
Copy link
Member

cpcloud commented Jul 16, 2024

To confirm, I am assuming that array_of_arrays = ibis.array([[ibis.literal(1)], [ibis.literal(2)]]) is a valid input. You also want to be able to support this, right?

No, I'm saying we shouldn't support this. It opens the door to mixing ibis expressions and python types, and we'd have to unify both and it's a lot of additional complexity.

@NickCrews
Copy link
Contributor Author

I currently have lots of code that looks like t.mutate(emails=ibis.array([t.email0, t.email1, t.email2]). Am I missing something here, how can I accomplish this goal?

@cpcloud
Copy link
Member

cpcloud commented Jul 23, 2024

@NickCrews I am still having a lot of trouble following everything here. I suspect that the PR is still trying to do too much.

Can you provide a minimum list of cases showing

  1. the current behavior
  2. the desired behavior

and avoid speculating on specific solutions on how to get the desired behavior? This problem space (complex type casting and inference) is really hairy, and it's very very difficult to review related PRs if they are trying to address a bunch of things at once.

@NickCrews
Copy link
Contributor Author

Yes, I agree it is really hairy, and I think that would help. Maybe I start with a PR that just adds or changes tests to precisely show what the current behavior is

@ncclementi
Copy link
Contributor

@NickCrews what's the status of this one? Is the plan to close it and start a new separate PR with a more targeted scope?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants