
SNOW-1458135 Implement DataFrame and Series initialization with lazy Index objects #2137

Open
wants to merge 60 commits into base: main
Changes from 58 commits

Commits (60)
2094c3f
Update Series and DataFrame constructors to handle lazy Index objects…
sfc-gh-vbudati Aug 21, 2024
97a7229
Merge branch 'main' into vbudati/SNOW-1458135-df-series-init-with-laz…
sfc-gh-vbudati Aug 21, 2024
1979257
update changelog
sfc-gh-vbudati Aug 21, 2024
5dbb76d
add more tests
sfc-gh-vbudati Aug 21, 2024
7de467f
fix minor bug
sfc-gh-vbudati Aug 21, 2024
5dd06fd
fix isocalendar docstring error
sfc-gh-vbudati Aug 21, 2024
8b94462
truncate tests, update changelog wording, reduce 2 queries to one que…
sfc-gh-vbudati Aug 22, 2024
c89dc5d
Merge branch 'main' into vbudati/SNOW-1458135-df-series-init-with-laz…
sfc-gh-vbudati Aug 22, 2024
a2089b8
Merge branch 'main' into vbudati/SNOW-1458135-df-series-init-with-laz…
sfc-gh-vbudati Aug 22, 2024
a9376c1
Get rid of the join performed when only index is an Index object and …
sfc-gh-vbudati Aug 22, 2024
420a5ac
Add back the index join query to DataFrame/Series constructor, update…
sfc-gh-vbudati Aug 22, 2024
66d634c
Update tests
sfc-gh-vbudati Aug 23, 2024
f277041
Merge branch 'main' into vbudati/SNOW-1458135-df-series-init-with-laz…
sfc-gh-vbudati Aug 23, 2024
6a2cb79
added edge case logic, fix test query count
sfc-gh-vbudati Aug 23, 2024
df96f4a
Merge branch 'main' into vbudati/SNOW-1458135-df-series-init-with-laz…
sfc-gh-vbudati Aug 23, 2024
f971b0d
more test fixes
sfc-gh-vbudati Aug 23, 2024
13db956
Merge branch 'main' into vbudati/SNOW-1458135-df-series-init-with-laz…
sfc-gh-vbudati Aug 23, 2024
8c78f8d
fix dict case
sfc-gh-vbudati Aug 23, 2024
7970101
more test case fixes
sfc-gh-vbudati Aug 24, 2024
f3de1c3
correct the logic for series created with dict and index
sfc-gh-vbudati Aug 26, 2024
2447022
Merge branch 'main' into vbudati/SNOW-1458135-df-series-init-with-laz…
sfc-gh-vbudati Aug 26, 2024
82728bf
fix query counts
sfc-gh-vbudati Aug 26, 2024
23587a4
Merge branch 'vbudati/SNOW-1458135-df-series-init-with-lazy-index' of…
sfc-gh-vbudati Aug 26, 2024
1577ddc
fix join count
sfc-gh-vbudati Aug 26, 2024
8903f60
refactor series and df
sfc-gh-vbudati Sep 4, 2024
668c889
merge main into current branch
sfc-gh-vbudati Sep 4, 2024
67a07c1
refactor dataframe and series constructors
sfc-gh-vbudati Sep 4, 2024
a4351ba
Merge branch 'main' into vbudati/SNOW-1458135-df-series-init-with-laz…
sfc-gh-vbudati Sep 4, 2024
1453680
fix docstring tests
sfc-gh-vbudati Sep 5, 2024
f39e751
Merge branch 'main' into vbudati/SNOW-1458135-df-series-init-with-laz…
sfc-gh-vbudati Sep 5, 2024
b73f027
fix some tests
sfc-gh-vbudati Sep 6, 2024
c6fc05d
Merge branch 'main' into vbudati/SNOW-1458135-df-series-init-with-laz…
sfc-gh-vbudati Sep 6, 2024
d422f86
replace series constructor
sfc-gh-vbudati Sep 6, 2024
1ea5d00
fix tests
sfc-gh-vbudati Sep 9, 2024
024acd8
Merge branch 'main' into vbudati/SNOW-1458135-df-series-init-with-laz…
sfc-gh-vbudati Sep 9, 2024
f4a80f3
fix loc and iloc tests
sfc-gh-vbudati Sep 9, 2024
ce1ffa6
fix test
sfc-gh-vbudati Sep 9, 2024
00d2a8b
fix test
sfc-gh-vbudati Sep 9, 2024
cb91849
fix last valid index error
sfc-gh-vbudati Sep 9, 2024
d9fdbb0
remove stuff unnecessarily commented out
sfc-gh-vbudati Sep 9, 2024
3d5b785
explain high query count
sfc-gh-vbudati Sep 9, 2024
7f9dbaa
rewrite binary op test, fix coverage
sfc-gh-vbudati Sep 9, 2024
6de9f49
fix tests
sfc-gh-vbudati Sep 11, 2024
c2fb474
Merge branch 'main' into vbudati/SNOW-1458135-df-series-init-with-laz…
sfc-gh-vbudati Sep 11, 2024
2274d1e
remove print statements and unnecessary comments
sfc-gh-vbudati Sep 11, 2024
9eef8d7
fix tests
sfc-gh-vbudati Sep 11, 2024
cc09403
increase coverage
sfc-gh-vbudati Sep 12, 2024
10c3954
try to move out common logic, add more tests
sfc-gh-vbudati Sep 14, 2024
64dda24
Merge branch 'main' into vbudati/SNOW-1458135-df-series-init-with-laz…
sfc-gh-vbudati Sep 14, 2024
da56734
update df init
sfc-gh-vbudati Sep 14, 2024
8b47e17
moved common logic out, fixed some tests
sfc-gh-vbudati Sep 16, 2024
fa4eb09
remove unnecessary diffs
sfc-gh-vbudati Sep 16, 2024
db28630
fix doctest and couple of tests
sfc-gh-vbudati Sep 17, 2024
17be4c3
apply feedback to simplify logic
sfc-gh-vbudati Sep 18, 2024
301f47f
Merge branch 'main' into vbudati/SNOW-1458135-df-series-init-with-laz…
sfc-gh-vbudati Sep 18, 2024
2eb14a7
update query counts to use constants
sfc-gh-vbudati Sep 18, 2024
95065f7
Merge branch 'vbudati/SNOW-1458135-df-series-init-with-lazy-index' of…
sfc-gh-vbudati Sep 18, 2024
d9bbd9b
remove docstring update, add docstrings for helper functions
sfc-gh-vbudati Sep 18, 2024
f40c5b4
try to break down df init into three steps: data, columns, and index
sfc-gh-vbudati Sep 20, 2024
8cce409
merge main into current branch
sfc-gh-vbudati Sep 20, 2024
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -9,6 +9,7 @@
- Added the following new functions in `snowflake.snowpark.functions`:
- `make_interval`
- Added support for using Snowflake Interval constants with `Window.range_between()` when the order by column is TIMESTAMP or DATE type
- Added support for constructing `Series` and `DataFrame` objects with the lazy `Index` object as `data`, `index`, and `columns` arguments.

#### Improvements

@@ -128,6 +128,7 @@ def get_valid_index_values(
-------
Optional[Row]: The desired index (a Snowpark Row) if it exists, else None.
"""
frame = frame.ensure_row_position_column()
index_quoted_identifier = frame.index_column_snowflake_quoted_identifiers
data_quoted_identifier = frame.data_column_snowflake_quoted_identifiers
row_position_quoted_identifier = frame.row_position_snowflake_quoted_identifier
147 changes: 146 additions & 1 deletion src/snowflake/snowpark/modin/plugin/_internal/utils.py
@@ -11,8 +11,10 @@

import numpy as np
import pandas as native_pd
from pandas._typing import Scalar
from pandas._typing import AnyArrayLike, Scalar
from pandas.core.dtypes.base import ExtensionDtype
from pandas.core.dtypes.common import is_integer_dtype, is_object_dtype, is_scalar
from pandas.core.dtypes.inference import is_list_like

import snowflake.snowpark.modin.pandas as pd
import snowflake.snowpark.modin.plugin._internal.statement_params_constants as STATEMENT_PARAMS
@@ -2001,3 +2003,146 @@ def create_frame_with_data_columns(
def rindex(lst: list, value: int) -> int:
"""Find the last index in the list of item value."""
return len(lst) - lst[::-1].index(value) - 1
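For instance, with duplicates present, `rindex` returns the last matching position rather than the first; a self-contained illustration of the helper above:

```python
def rindex(lst: list, value: int) -> int:
    """Find the last index in the list of item value."""
    # Reverse the list, find the first occurrence there, then map that
    # position back to an index in the original (un-reversed) list.
    return len(lst) - lst[::-1].index(value) - 1

# 2 first appears at position 1, but its last occurrence is position 3.
print(rindex([1, 2, 3, 2], 2))  # → 3
```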


def error_checking_for_init(
index: Any, dtype: Union[str, np.dtype, ExtensionDtype]
) -> None:
"""
Common error messages for the Series and DataFrame constructors.

Parameters
----------
index: Any
The index to check.
dtype: str, numpy.dtype, or ExtensionDtype
The dtype to check.
"""
from modin.pandas import DataFrame

if isinstance(index, DataFrame): # pandas raises the same error
raise ValueError("Index data must be 1-dimensional")

if dtype == "category":
Collaborator:
Where does this check come from? I don't see it being checked anywhere before.

Contributor Author:
I added this check because it was not checked before - we do not support the Categorical type yet. If the user passes in dtype=category, this makes the dtype of the Series/DataFrame category.

Collaborator:
The dtype only seems to be used when the data is local; in that case we should already apply the dtype check before uploading the data, so we shouldn't need to do such a check here. Is it not erroring out today?

Contributor Author:
No, it's not erroring out here today - the data itself is not categorical, but it should be treated as categorical if dtype is category.

raise NotImplementedError("pandas type category is not implemented")
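In native pandas, `dtype="category"` coerces otherwise ordinary data into a categorical Series; that is the behavior the check above guards against, since categorical columns are not yet supported. A minimal native-pandas sketch of what the constructor would otherwise have to produce:

```python
import pandas as pd

# Plain string data becomes categorical purely because of the dtype argument,
# even though the data itself carries no categorical information.
s = pd.Series(["a", "b", "a"], dtype="category")
print(s.dtype)  # category

# The Snowpark pandas constructor raises NotImplementedError up front
# instead of silently producing an unsupported categorical column.
```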


def assert_fields_are_none(
class_name: str, data: Any, index: Any, columns: Any = None
) -> None:
"""
Assert that `data`, `index`, and `columns` are all None; used when a query compiler is passed to the `class_name` constructor.
"""
assert (
data is None
), f"Invalid {class_name} construction! Cannot pass both data and query_compiler."
assert (
index is None
), f"Invalid {class_name} construction! Cannot pass both index and query_compiler."
assert (
columns is None
), f"Invalid {class_name} construction! Cannot pass both columns and query_compiler."


def convert_index_to_qc(index: Any) -> Any:
Contributor Author:
I'm not sure how to make the return type "SnowflakeQueryCompiler" without causing circular import issues.

Collaborator:
Use "SnowflakeQueryCompiler" with quotes.

Contributor Author:
That does not work either - it still causes the issue.

"""
Method to convert an object representing an index into a query compiler for set_index or reindex.

Parameters
----------
index: Any
The object to convert to a query compiler.

Returns
-------
SnowflakeQueryCompiler
The converted query compiler.
"""
from modin.pandas import Series

from snowflake.snowpark.modin.plugin.extensions.index import Index

if isinstance(index, Index):
idx_qc = index.to_series()._query_compiler
elif isinstance(index, Series):
idx_qc = index._query_compiler
else:
idx_qc = Series(index)._query_compiler
return idx_qc
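The dispatch above follows a simple rule: an `Index` is converted to a Series first, a `Series` is used as-is, and anything else is wrapped in a new Series. A native-pandas stand-in (the real helper returns the Series' query compiler, which has no native equivalent, so this sketch stops at the Series):

```python
import pandas as pd

def convert_index_to_series(index) -> pd.Series:
    # Stand-in for convert_index_to_qc: produces the Series whose
    # query compiler the real helper would extract.
    if isinstance(index, pd.Index):
        return index.to_series()
    if isinstance(index, pd.Series):
        return index
    return pd.Series(index)

print(list(convert_index_to_series(pd.Index([10, 20]))))  # [10, 20]
print(list(convert_index_to_series([1, 2, 3])))           # [1, 2, 3]
```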


def convert_index_to_list_of_qcs(index: Any) -> list:
"""
Method to convert an object representing an index into a list of query compilers for set_index.

Parameters
----------
index: Any
The object to convert to a list of query compilers.

Returns
-------
list
The list of query compilers.
"""
from modin.pandas import Series

from snowflake.snowpark.modin.plugin.extensions.index import Index

if (
not isinstance(index, (native_pd.MultiIndex, Series, Index))
and is_list_like(index)
and len(index) > 0
and all((is_list_like(i) and not isinstance(i, tuple)) for i in index)
):
# If given a list of lists, convert it to a MultiIndex.
index = native_pd.MultiIndex.from_arrays(index)
if isinstance(index, native_pd.MultiIndex):
index_qc_list = [
s._query_compiler
for s in [
Series(index.get_level_values(level)) for level in range(index.nlevels)
]
]
else:
index_qc_list = [convert_index_to_qc(index)]
return index_qc_list
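The list-of-lists branch leans on `MultiIndex.from_arrays`, then splits the resulting MultiIndex into one Series per level; the same shape in plain pandas:

```python
import pandas as pd

# A non-empty list of list-likes (none of which are tuples) is treated
# as the levels of a MultiIndex.
index = [["a", "a", "b"], [1, 2, 3]]
mi = pd.MultiIndex.from_arrays(index)

# One Series per level, mirroring the "one query compiler per level" result.
levels = [pd.Series(mi.get_level_values(level)) for level in range(mi.nlevels)]
print(mi.nlevels)       # 2
print(list(levels[0]))  # ['a', 'a', 'b']
```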


def add_extra_columns_and_select_required_columns(
query_compiler: Any,
columns: Union[AnyArrayLike, list],
data_columns: Union[AnyArrayLike, list],
) -> Any:
"""
Method to add extra columns to and select the required columns from the provided query compiler.
This is used in DataFrame construction in the following cases:
- general case when data is a DataFrame
- data is a named Series, and this name is in `columns`

Parameters
----------
query_compiler: Any
The query compiler to select columns from, i.e., data's query compiler.
columns: AnyArrayLike or list
The columns to select from the query compiler.
data_columns: AnyArrayLike or list
The columns in the data. This is data.columns if data is a DataFrame or data.name if data is a Series.

Returns
-------
SnowflakeQueryCompiler
The query compiler with the extra columns added and only the required columns selected.
"""
from modin.pandas import DataFrame

# The `columns` parameter is used to select the columns from `data` that will be in the resultant DataFrame.
# If a value in `columns` is not present in data's columns, it will be added as a new column filled with NaN values.
# These columns are tracked by the `extra_columns` variable.
if data_columns is not None and columns is not None:
extra_columns = [col for col in columns if col not in data_columns]
# To add these new columns to the DataFrame, perform `__getitem__` only with the extra columns
# and set them to None.
extra_columns_df = DataFrame(query_compiler=query_compiler)
extra_columns_df[extra_columns] = None
query_compiler = extra_columns_df._query_compiler

# To select the columns for the resultant DataFrame, perform `.loc[]` on the created query compiler.
# This step is performed to ensure that the right columns are picked from the InternalFrame since we
# never explicitly drop the unwanted columns. `.loc[]` also ensures that the columns in the resultant
# DataFrame are in the same order as the columns in the `columns` parameter.
columns = slice(None) if columns is None else columns
return DataFrame(query_compiler=query_compiler).loc[:, columns]._query_compiler
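In native-pandas terms (the names below are illustrative stand-ins, not the real internals), the two steps are: add each missing column as nulls, then use `.loc` to both select and order the final columns in one pass:

```python
import pandas as pd

data = pd.DataFrame({"A": [1, 2], "C": [5, 6]})
columns = ["A", "B"]  # "B" is not in the data; "C" is unwanted

# Step 1: add the extra columns, filled with nulls.
extra_columns = [col for col in columns if col not in data.columns]
for col in extra_columns:
    data[col] = None

# Step 2: .loc selects only the requested columns, in the requested order,
# implicitly dropping everything else (here, "C").
result = data.loc[:, columns]
print(list(result.columns))  # ['A', 'B']
```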
@@ -2328,7 +2328,7 @@ def any(
def reindex(
self,
axis: int,
labels: Union[pandas.Index, "pd.Index", list[Any]],
labels: Union[pandas.Index, "pd.Index", list[Any], "SnowflakeQueryCompiler"],
**kwargs: dict[str, Any],
) -> "SnowflakeQueryCompiler":
"""
@@ -2338,7 +2338,7 @@
----------
axis : {0, 1}
Axis to align labels along. 0 is for index, 1 is for columns.
labels : list-like
labels : list-like, SnowflakeQueryCompiler
Index-labels to align with.
method : {None, "backfill"/"bfill", "pad"/"ffill", "nearest"}
Method to use for filling holes in reindexed frame.
@@ -2536,15 +2536,15 @@ def _add_columns_for_monotonicity_checks(

def _reindex_axis_0(
self,
labels: Union[pandas.Index, "pd.Index", list[Any]],
labels: Union[pandas.Index, "pd.Index", list[Any], "SnowflakeQueryCompiler"],
**kwargs: dict[str, Any],
) -> "SnowflakeQueryCompiler":
"""
Align QueryCompiler data with a new index.

Parameters
----------
labels : list-like
labels : list-like, SnowflakeQueryCompiler
Index-labels to align with.
method : {None, "backfill"/"bfill", "pad"/"ffill", "nearest"}
Method to use for filling holes in reindexed frame.
@@ -2562,12 +2562,15 @@
"""
self._raise_not_implemented_error_for_timedelta()

if isinstance(labels, native_pd.Index):
labels = pd.Index(labels)
if isinstance(labels, pd.Index):
new_index_qc = labels.to_series()._query_compiler
if isinstance(labels, SnowflakeQueryCompiler):
new_index_qc = labels
else:
new_index_qc = pd.Series(labels)._query_compiler
if isinstance(labels, native_pd.Index):
labels = pd.Index(labels)
if isinstance(labels, pd.Index):
new_index_qc = labels.to_series()._query_compiler
else:
new_index_qc = pd.Series(labels)._query_compiler

new_index_modin_frame = new_index_qc._modin_frame
modin_frame = self._modin_frame
15 changes: 6 additions & 9 deletions src/snowflake/snowpark/modin/plugin/docstrings/dataframe.py
@@ -82,16 +82,13 @@ class DataFrame(BasePandasDataset):
Notes
-----
``DataFrame`` can be created either from passed `data` or `query_compiler`. If both
parameters are provided, data source will be prioritized in the next order:
parameters are provided, an assertion error will be raised. `query_compiler` can only
be specified when the `data`, `index`, and `columns` are None.

1) Modin ``DataFrame`` or ``Series`` passed with `data` parameter.
2) Query compiler from the `query_compiler` parameter.
3) Various pandas/NumPy/Python data structures passed with `data` parameter.

The last option is less desirable since import of such data structures is very
inefficient, please use previously created Modin structures from the fist two
options or import data using highly efficient Modin IO tools (for example
``pd.read_csv``).
Using pandas/NumPy/Python data structures as the `data` parameter is less desirable since
importing such data structures is very inefficient.
Please use previously created Modin structures or import data using highly efficient Modin IO
tools (for example ``pd.read_csv``).

Examples
--------