Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Split off list constructor logic from pl.concat_list into pl.list #8510

Open
cmdlineluser opened this issue Apr 25, 2023 · 28 comments
Open

Split off list constructor logic from pl.concat_list into pl.list #8510

cmdlineluser opened this issue Apr 25, 2023 · 28 comments
Labels
A-api Area: changes to the public API A-dtype-list/array Area: list/array data type enhancement New feature or an improvement of an existing feature
Milestone

Comments

@cmdlineluser
Copy link
Contributor

Problem description

Discussing #8503 sparked the thought - would it make sense to "rename" pl.concat_list to pl.list?

pl.list is currently marked for deprecation as part of the .arr to .list rename.

DeprecationWarning: `pl.list` is deprecated, 
please use `pl.implode` instead.

Similar to how there is pl.struct and the .struct namespace:

pl.struct(...).struct

If pl.list were repurposed it would be:

pl.list(...).list
@cmdlineluser cmdlineluser added the enhancement New feature or an improvement of an existing feature label Apr 25, 2023
@stinodego
Copy link
Member

Fully agree. Will add this to the milestones for the next breaking release.

@cmdlineluser
Copy link
Contributor Author

Hi @stinodego - I just read your comment regarding horizontal sum which made me think of this issue.

After the concat_list/namespace rename

pl.concat_list(pl.col("nums") * 2).arr.sum()

Would become

pl.list(pl.col("nums") * 2).list.sum()

Your comment made me wonder if it would be possible to provide a syntax sugar version:

pl.list.sum(pl.col("nums") * 2)

sugar:

pl.sum("nums")      # vertical
pl.list.sum("nums") # horizontal

no-sugar:

pl.col("nums").sum()
pl.list("nums").list.sum()

@lucazanna
Copy link

Quick thoughts on this:

i like the idea of aligning the methods name between structs and lists.
however a pl.list(…).list does not seem very readable to me / clear that it works on rows instead of columns (at least at first)

What about keeping concat_list so we have
concat_list(…).list

And then rename struct to concat_struct so that we have
concat_struct(…).struct

i believe that concat word makes it clearer that we are concatenating multiple rows together

thoughts ?

@stinodego
Copy link
Member

I am also a bit on the fence about this. On the Rust side, struct is called as_struct. Maybe that's better; not sure. I'll brood on this for a bit.

@ritchie46
Copy link
Member

Yes, I agree that we miss an (I don't know the english name of this, adjective? verb?), but something that says "what" we are going to do with the noun "list".

We already have this on many places. read_csv, to_uppercase. So I think we should do one of those:

  • as_list, as_struct
  • to_list, to_struct,
  • pack_list, pack_struct

But the noun list, struct is not clear as we do not state what we do with those.

@stinodego stinodego self-assigned this May 26, 2023
@stinodego
Copy link
Member

stinodego commented May 28, 2023

We have a set of functions that take one or more expressions and combines them to generate a new column of a certain datatype. We should rename these to reflect that they are essentially doing the same thing. I have given it a lot of thought and I think the as_ prefix is the most expressive. It would result in the following names:

Before After
datetime as_datetime
date as_date
time as_time
duration as_duration
struct as_struct
concat_list as_list
concat_str as_string
format* format

* Convenience wrapper around concat_str, remains unchanged

I plan to do the following:

  1. Refactor to put all these functions into a single module, as well as gather any dedicated tests (DONE)
  2. Change behaviour to have the expression named after the datatype
  3. Add new function names, deprecate old.

@lucazanna
Copy link

hi @stinodego,

Would it make sense to add an ‘as_array’ method?

Same logic as ´as_list’ but with the result being an array

@stinodego
Copy link
Member

Yes, it's already up on my whiteboard 😄

@granthamtaylor
Copy link

I use pl.concat_list() to concatenate a number of columns together.

With a seemingly small number of assumptions, this operation will always result in a fixed size list.

It would be very convenient to be able to have the resulting concatenation be an array.

@stinodego
Copy link
Member

That won't be possible, as concat_list does not always result in lists of the same size. Consider:

import polars as pl

df = pl.DataFrame({"a": [[1, 2], [3], [3, 4]], "b": [1, 2, 3]})
result = df.with_columns(pl.concat_list("a", "b").alias("concat"))
shape: (3, 3)
┌───────────┬─────┬───────────┐
│ a         ┆ b   ┆ concat    │
│ ---       ┆ --- ┆ ---       │
│ list[i64] ┆ i64 ┆ list[i64] │
╞═══════════╪═════╪═══════════╡
│ [1, 2]    ┆ 1   ┆ [1, 2, 1] │
│ [3]       ┆ 2   ┆ [3, 2]    │
│ [3, 4]    ┆ 3   ┆ [3, 4, 3] │
└───────────┴─────┴───────────┘

We will add an as_array function soon enough - for now, you'll have to cast the result of concat_list to an array.

@stinodego stinodego added the needs triage Awaiting prioritization by a maintainer label Dec 17, 2023
@stinodego stinodego added needs decision Awaiting decision by a maintainer and removed needs triage Awaiting prioritization by a maintainer labels Jan 19, 2024
@stinodego stinodego added the A-api Area: changes to the public API label May 23, 2024
@stinodego stinodego changed the title Rename pl.concat_list to pl.list Rename expression functions that create a specific data type column, e.g. pl.concat_list May 23, 2024
@stinodego
Copy link
Member

We have discussed this, and there is a distinction to be made between the functions listed.

There are "constructor" functions:

  • datetime
  • date
  • time
  • duration
  • struct

Those are fine as-is. Perhaps a prefix like new_ would be more clear (e.g. new_datetime), but there is no great need to change this right now.

Then there are the "concatenate" functions:

  • concat_list
  • concat_str

These names are clear and we don't want to change them.

The real issue here is that concat_list has extremely strange behavior - it functions both as a 'concatenate' and also as a 'constructor'. The following two code samples have the same result:

pl.select(pl.concat_list(1, 2, 3))
pl.select(pl.concat_list([1], [2], [3]))
shape: (1, 1)
┌───────────┐
│ literal   │
│ ---       │
│ list[i64] │
╞═══════════╡
│ [1, 2, 3] │
└───────────┘

The second one is clearly concatenating the inputs, while the first one is constructing a new list and taking the inputs as elements to that list.

We should add a new pl.list function that acts as a constructor and supports the first behavior. We should become much stricter in the concat_list function and only allow list inputs.

@stinodego stinodego changed the title Rename expression functions that create a specific data type column, e.g. pl.concat_list Split of list constructor logic from pl.concat_list into pl.list May 27, 2024
@stinodego stinodego added accepted Ready for implementation and removed needs decision Awaiting decision by a maintainer labels May 27, 2024
@stinodego stinodego changed the title Split of list constructor logic from pl.concat_list into pl.list Split off list constructor logic from pl.concat_list into pl.list May 28, 2024
@stinodego stinodego assigned reswqa and unassigned stinodego May 28, 2024
@reswqa
Copy link
Collaborator

reswqa commented Jun 6, 2024

@stinodego Quick Q: For the new pl.list constructor, does it only allow scalar inputs? For example:

pl.list(1, 2, 3, 4) # allow -> a Series[List] has only one element [1,2,3,4]

pl.list([1,2,3]) # disallow
pl.list(1,[2,3]) # disallow
pl.list(pl.col("a"), pl.col("b")) # disallow

Let's take a look at the other constructors, they all seem to support Expr. For instance, we allow

pl.time(pl.col("hour"), pl.col("minute"))

┌──────────┐
│ time     │
│ ---      │
│ time     │
╞══════════╡
│ 12:15:00 │
│ 13:30:00 │
│ 14:45:00 │
└──────────┘

So shall we also support pl.list(pl.col("non-list-col1"), pl.col("non-list-col2"))? This again looks very similar to pl.concat_list... Is the difference only that pl.concat_list only supports list inputs: pl.concat_list(pl.col("list-col1"), pl.col("list-col2"))?

@stinodego
Copy link
Member

I hadn't really thought very deeply about this, but I think basically the distinction will be that concat_list will remove one level of nesting, while pl.list does not.

pl.list will take any number of (possibly expression) inputs. These are cast to their supertype, and then the return type will be List(supertype) with every list having the same length (number of inputs).
We can possibly add pl.array constructor in exactly the same manner.

pl.concat_list will take any number of (possibly expression) inputs that must be of type list. The inner types can be cast to a supertype. Then the return type is List(_inner_ supertype).
We can possibly add a pl.concat_array in the same manner, which will guarantee the output has size (sum of all input sizes).

I think one of the issues currently is that supertypes allow adding nesting levels, e.g. supertype of List(Int8) and Int16 is List(Int16). This should probably not be allowed (these types should have no common supertype). That should clarify a lot. But I'm not sure what the implications are of this.

@reswqa
Copy link
Collaborator

reswqa commented Jun 12, 2024

Thanks for the clarification. I think basically the proposal makes sense, I just not sure if letting things like Int and List(Int) without supertype would break other codes.

But more importantly, we may need to have a discussion about whether supertypes should disallow adding nesting level(That's probably what we'll have to decide before 1.0).

@stinodego
Copy link
Member

But more importantly, we may need to have a discussion about whether supertypes should disallow adding nesting level(That's probably what we'll have to decide before 1.0).

Agreed!

@stinodego stinodego added the needs decision Awaiting decision by a maintainer label Jun 12, 2024
@stinodego
Copy link
Member

@reswqa : Supertypes no longer add a List nesting level, see the recent fix by Ritchie:
#16918

So with that, I think this can be implemented and tested properly.

@reswqa
Copy link
Collaborator

reswqa commented Jun 14, 2024

Great! Let me try to implement the proposal above.

@stinodego stinodego removed the needs decision Awaiting decision by a maintainer label Jun 17, 2024
@reswqa

This comment was marked as off-topic.

@reswqa reswqa removed their assignment Jun 21, 2024
@stinodego

This comment was marked as off-topic.

@stinodego
Copy link
Member

Now that the supertype issue has been fixed, I no longer think we really need to split up concat_list. The current behavior seems to be:

  1. Any non-list types are cast to list
  2. All columns are combined at one nesting level is removed.

This is perfectly usable and I don't think splitting up the function is going to really help anyone. It is technically a bit overloaded but that's more of a technicality, in my opinion.

I added some doc examples to clarify this behavior. With that, this can be closed, in my opinion. I will consider re-opening this if I hear some strong arguments for another approach here.

@stinodego stinodego closed this as not planned Won't fix, can't repro, duplicate, stale Jun 22, 2024
@stinodego stinodego removed the accepted Ready for implementation label Jun 22, 2024
@NickCrews
Copy link

NickCrews commented Jun 30, 2024

I want to argue that we want separate functions for pl.list() and pl.concat_list().

In ibis, we transpile ibis expressions into polars expressions. One of our rules is to turn an ibis.Array into a polars.List:

https://github.com/NickCrews/ibis/blob/1162137c6c702315ab136c55e8cafa0c32b3d422/ibis/backends/polars/compiler.py#L971-L980

  1. note how we need separate code paths for empty arrays and non-empty arrays. It would be great if there was one polars function that could handle both.
  2. The signature for our transpilation function is supposed to be ibis.Array[T] -> pl.List[T]. And it usually is, except for when T is an array. We have to mark a test as xfail for this case: https://github.com/NickCrews/ibis/blob/1162137c6c702315ab136c55e8cafa0c32b3d422/ibis/backends/tests/test_array.py#L1409-L1436. If polars stays the way it is, in order to make our compilation rule work, I think we need to add some special casing to our transpilation rule, something like
@translate.register(ops.Array)
def array_column(op, **kw):
    cols = [translate(col, **kw) for col in op.exprs]
    if op.dtype.is_array():
        cols = [c.implode() for c in cols]
    return pl.concat_list(cols)

In an effort very similar to this issue, I am trying to overhaul array construction in ibis-project/ibis#9473. The API that I am pushing for there is def array(values: Iterable[Any], *, type: ibis.DataType | str | None = None)
This API has some advantages:

  • Since it accepts a single Iterable[Any], instead of the variadic *args: Any version, it can be idempotent, ie ibis.array(ibis.array([1,2])) works. This is a nice property.
  • This also makes it have the same API as python's list(values: Iterable[Any]) constructor. Familiar to python users.
  • Since it has the dtype optional kwarg-only parameter, it can support empty arrays (without this, we can't infer the dtype of the contained values).

I advocate for polars to create a def list(values: Iterable[Any], *, type: pl.DataType | None = None) or similar API.

@NickCrews
Copy link

Just added a workaround for this in ibis: ibis-project/ibis#9484

@stinodego
Copy link
Member

Comment by @mcrumiller in the duplicated issue is relevant:
#17307 (comment)

Perhaps there is some merit to splitting the two after all.

@stinodego stinodego reopened this Jul 1, 2024
@stinodego stinodego added the A-dtype-list/array Area: list/array data type label Jul 1, 2024
@mcrumiller
Copy link
Contributor

If you do decide to go this route, the following seems simple and obvious from a user perspective:

  • pl.list() - make a list out of columns, i.e. each column becomes an element. All columns must have a common supertype.
  • pl.array() - make an array out of columns. All columns must have a common super dtype.
  • pl.concat_lists() - (note addition of the 's' because we are concatenating multiple lists). Take existing pl.List columns and concatenate them. If a column is scalar, wrap it in a list. All columns must have common inner supertype.
  • pl.concat_arrays() - Take existing pl.Array columns and concatenate. All columns must have common inner supertype.

@jeroenjanssens
Copy link
Contributor

@mcrumiller, will any of these functions allow us to combine two (or more) list[...] columns into one list[list[...]] column? Currently pl.concat_list() flattens the list[...] columns into one list[...] column.

@m00ngoose
Copy link

I agree with @mcrumiller 's suggestions.
@jeroenjanssens yes I think pl.list() would be that function

@NickCrews
Copy link

@mcrumiller's API seems good to me. Reiterating that I think they should all accept vals: Iterable[x], not *vals: x, for the reason I say above.

@deanm0000 deanm0000 added this to the 2.0.0 milestone Aug 15, 2024
@deanm0000
Copy link
Collaborator

@stinodego I just stumbled on this and saw you said you'd make it a milestone but I think that was before you had milestones setup in github so I made it a 2.0.0 milestone.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-api Area: changes to the public API A-dtype-list/array Area: list/array data type enhancement New feature or an improvement of an existing feature
Projects
None yet
Development

No branches or pull requests