Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Replace pandas dependence/use with narwhals #7462

Open
cluhmann opened this issue Aug 15, 2024 · 6 comments · May be fixed by #7463
Open

ENH: Replace pandas dependence/use with narwhals #7462

cluhmann opened this issue Aug 15, 2024 · 6 comments · May be fixed by #7463

Comments

@cluhmann
Copy link
Member

Before

No response

After

No response

Context for the issue:

With the rise in popularity of packages such as polars, arrow, and others PyMC's dependence on pandas is looking less and less universally useful. Narwhals was developed for developers of python libraries that consumes dataframes who wish wishing to make their libraries completely dataframe-agnostic. So maybe we consider using that instead of pandas per se? Narwhals has no dependencies and has negligible overhead, so it seems relatively lightweight.

It will require some refactoring as it relies on (as subset of) the polars API.

@ricardoV94
Copy link
Member

Last time I checked we don't really depend on pandas, it's arviz that does

@cluhmann
Copy link
Member Author

Does that mean that we can use polars dataframes when using PyMC?

@ricardoV94
Copy link
Member

ricardoV94 commented Aug 15, 2024

Does that mean that we can use polars dataframes when using PyMC

As data? Not sure, there's some special logic for handling pandas.

But PyMC does not depend on pandas, so maybe you are requesting a new feature, not a change of dependency

@ricardoV94
Copy link
Member

ricardoV94 commented Aug 15, 2024

Btw pandas special logic is:

  1. Dispatch so pt.as_tensor accepts pandas series/matrices

I don't think we can replace that by narwhals since dispatch works on types at runtime. We would need to dispatch on polars as well.

  1. Special logic in pm.data? and observed for detecting nans and triggering automatic imputation. This could possibly be backend agnostic although that code is pretty spaghetti, so someone would need to check.

2.1 It may actually already work because IIRC it's all based on duck typing

@fonnesbeck
Copy link
Member

Yeah, was taking a peek at this today -- if we were using pandas-specific functionality (merging, etc.) then it would make sense to use narwhals. For the post part we are taking DataFrames (and Series) and turning them into ndarrays. The only exception may be in deriving dims from indexes, since polars does not use indexes.

@fonnesbeck fonnesbeck linked a pull request Aug 15, 2024 that will close this issue
10 tasks
@MarcoGorelli
Copy link
Contributor

MarcoGorelli commented Aug 16, 2024

Hey, thanks for looking into Narwhals 🙏

First, thanks for opening #7463, it's great to see Polars support come along - facilitating that was one of my goals with Narwhals, and if it can happen even without it, even better 💪

I think #7463 is already a net-positive, I just wanted to leave some comments, in case they're of interest:

  • pandas is currently a required a dependency of PyMC, and so is ArviZ. Would you consider making them both optional dependencies?
  • regarding the dispatch, I think Narwhals could simplify it: you call nw.from_native(..., eager_only=True) on the input, pass that to dataframe_to_tensor_variable, and then just do:
    @_as_tensor_variable.register(nw.DataFrame)
    @_as_tensor_variable.register(nw.Series)
    def dataframe_to_tensor_variable(df: , *args, **kwargs) -> TensorVariable:
        return pt.as_tensor_variable(df.to_numpy(), *args, **kwargs)
    to_numpy is part of the Narwhals API, and nw.from_native is a practically-free operation. But it can of course also be done without Narwhals, as you've done 🙌
  • if you need to do something pandas-specific, you can still do that without pandas being a required dependency:
    if (pd := sys.modules.get('pandas', None)) is not None and isinstance(df, pd.DataFrame):
        # pandas-specific logic goes here
    In fact, that's what Altair do, using is_pandas_dataframe

No hard feeling of course if you keep the current approach, I was wanted to point out the possibilities ♾️ All the best, and it was really fun meeting some of you at PyData London!

EDIT: upon further inspection, I was wrong about the pytensor part, Narwhals wouldn't help there (unless you used it in PyTensor too), I think it would only potentially help in convert_data?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants