Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ibis has no way to convert UDFs to substrait plan #644

Open
Anindyadeep opened this issue Jun 12, 2023 · 5 comments
Open

Ibis has no way to convert UDFs to substrait plan #644

Anindyadeep opened this issue Jun 12, 2023 · 5 comments

Comments

@Anindyadeep
Copy link

Ibis is doing some incredible work by integrating substrait for generating substrait plan of the user's query to support cross DB operations in python.

Suppose we have a table like this :

┏━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┓
┃ cust_id ┃ income1 ┃ income2 ┃ income3  ┃
┡━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━┩
│ int64   │ float64 │ float64 │ float64  │
├─────────┼─────────┼─────────┼──────────┤
│       1 │ 20000.0 │ 3560.57 │      nan │
│       2 │ 34546.9 │ 6000.66 │   1000.0 │
│       3 │ 75430.2 │ 8111.01 │      nan │
│       4 │ 55430.2 │ 8111.01 │   1200.0 │
│       5 │     nan │ 8111.01 │      nan │
│       6 │     nan │     nan │ 100000.0 │
└─────────┴─────────┴─────────┴──────────┘

Right now we define udf's in ibis like this

import ibis.expr.datatypes as dt 
from ibis.backends.pandas.udf import udf

@udf.analytic(input_type = [dt.double, dt.double], output_type=dt.double)
def function(c1, c2):
    return c1 + c2 

And hence we can apply this function to our tables like this

function(table.income1, table.income2)

And applying this function returns this

┏━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ AnalyticVectorizedUDF() ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ float64                 │
├─────────────────────────┤
│                23560.57 │
│                40547.56 │
│                83541.21 │
│                63541.21 │
│                     nan │
│                     nan │
└─────────────────────────┘

Even we can mutate our existing table to add a new column with this function.

mutate_expression = table.mutate(
    added = function(table.income1, table.income2)
)

Before coming to the main problem, consider this, I have a simple expression like this

expression = table.income1 + table.income2

And now I can generate the substrait plan of this expression using this code :

from ibis_substrait.compiler.core import SubstraitCompiler

compiler = SubstraitCompiler()
expression = table.income1 + table.income2
substrait_plan = compiler.compile(table.mutate(expression))

Hence I can get the substrait plan. But when I am trying to get the substrait plan through an user defined function then I am getting this error:

udf_expression = table.mutate(
    added = function(table.income1, table.income2)
)

substrait_plan_udf = compiler.compile(table.mutate(udf_expression))

Doing this gives me the error : KeyError: 'AnalyticVectorizedUDF'.

I even thought that substrait might also not provide the support for now. But it seems like substrait do support :

  • UserDefined defined type
  • ParameterizedUserDefined type
  • UserDefined relation

But Not user defined relations.

This concludes that ibis is not supporting generating substrait plans for user defined functions. But it will be awesome if we have one.

@gforsyth
Copy link
Member

Hi @Anindyadeep -- thanks for raising this!

We're currently refactoring UDF support in Ibis to try to make it more consistent across backends. Once that lands, we can look in to how we can better support UDF compilation to substrait.

There's currently support for element-wise UDFs, tested against pyarrow.compute -- note that this does require registering the UDF separately with the backend. There's not currently a way (that I'm aware of) to serialize a UDF into a substrait plan. Instead, the current pattern is to register a function with the backend, then refer to it by name in the substrait plan.

@Anindyadeep
Copy link
Author

@gforsyth , Can you please provide me more details or the link of that (in the documentation) where I can refer to, about registering a function with the backend and then refer it by name in the substrait plan. Also is that thing to be done with substrait manually or can be done using ibis.

Thanks

@gforsyth
Copy link
Member

Documentation is very light (sorry). There's an integration test that shows how to create an elementwise UDF in Ibis, then registering that UDF with pyarrow.compute, then executing it using Substrait to send the plan from Ibis to PyArrow.

https://github.com/ibis-project/ibis-substrait/blob/main/ibis_substrait/tests/integration/test_pyarrow.py

It doesn't require using Substrait directly, but it does require defining the UDF in the pyarrow process manually.

@omriel1
Copy link

omriel1 commented Jun 29, 2023

Could not describe my questions better than you did @Anindyadeep! I'll also mention that this feature could be very useful.
In the meanwhile, your answers @gforsyth are very useful, thank you!

@Anindyadeep
Copy link
Author

Thanks @OmriLevyTau, but yes this feature would be super useful.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: backlog
Development

No branches or pull requests

3 participants