
[Security Issue] pem_private_key not redacted in Spark Logical Plan UI #525

Open
Loudegaste opened this issue Sep 8, 2023 · 4 comments
Loudegaste commented Sep 8, 2023

Hi,
we are using the Snowflake Spark connector to push data from Foundry to Snowflake. We noticed that the pem_private_key is not redacted from the query plan and is therefore leaking.

We expect pem_private_key to be redacted, just like sfURL in the screenshot.

We first raised the issue with the Foundry team. After review, they concluded that the issue comes from the Spark connector itself and should therefore be handled here.

Python version: 3.8.*
Pyspark version: 3.2.1

Here is the code used with the Spark connector:

connection_parameters = {
    "sfURL": config["snowflake_account"],
    "sfUser": "...",
    "pem_private_key": key,
    "role": "...",
    "sfWarehouse": config["warehouse"],
    "sfDatabase": config["database"],
    "sfSchema": config["schema"],
}

inp.dataframe().write.format(SNOWFLAKE_SOURCE_NAME).options(
    **connection_parameters
).option("dbtable", f'"{raw_table_name}"').mode("overwrite").save()


@Loudegaste
Author

Hi,
following up on this: further experimentation on our side has revealed that this is a non-deterministic issue. Across multiple runs of exactly the same pipeline, the pem_private_key is sometimes redacted and sometimes not. So far we haven't found any factor that predicts the behaviour.


rshkv commented Sep 29, 2023

@Loudegaste, neither Snowflake's connector nor Foundry seem to do anything additional about redacting the pem_private_key. They just rely on Spark's built-in redaction mechanism.

Spark, when rendering the query plan, just goes through SQLConf.redact which redacts based on the config values for spark.sql.redaction.options.regex and spark.redaction.regex. The former defaults to (?i)url and the latter is overridden in Foundry to include additional keywords.

I wonder if the non-determinism you see is explained by the fact that Spark, when redacting, looks for sensitive keywords not just in the config key but also in the config value. If the pem_private_key differs between runs, you may sometimes see it redacted because it happens to contain the string url in that run.
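The behaviour described above can be sketched in plain Python (a hypothetical re-implementation for illustration, not Spark's actual Scala code), assuming the default spark.sql.redaction.options.regex of (?i)url and that the regex is tested against both the option key and the option value:

```python
import re

# Sketch of Spark's option redaction (assumption: mirrors the behaviour of
# SQLConf.redactOptions, not its actual implementation). The regex --
# default "(?i)url" for spark.sql.redaction.options.regex -- is tested
# against BOTH the option key and the option value.
REDACTION_REGEX = re.compile(r"(?i)url")
REDACTED = "*********(redacted)"

def redact_options(options):
    return {
        k: REDACTED
        if REDACTION_REGEX.search(k) or REDACTION_REGEX.search(v)
        else v
        for k, v in options.items()
    }

# "sfURL" always matches on its key; "pem_private_key" only matches when
# the base64-encoded key material happens to contain "url", which would
# explain the apparent non-determinism between runs.
print(redact_options({
    "sfURL": "myaccount.snowflakecomputing.com",
    "pem_private_key": "MIIEvQIBADANBgkqhkiG9w0BAQEFAASC",
}))
```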

@Loudegaste
Author

Hi @rshkv,
thanks for the reply. Does that mean the issue needs to be raised with Spark directly?
By the way, do you think spark.sql.redaction.string.regex could provide a workaround in the meantime?
We've actually tried changing spark.redaction.regex to include 'pem', but this didn't solve the issue.
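For illustration, string-level redaction can be mimicked without Spark (a sketch of an assumption about how spark.sql.redaction.string.regex behaves: replacing matching substrings in text Spark emits, such as plan output; the PEM-shaped pattern here is hypothetical):

```python
import re

# Sketch (assumption): when spark.sql.redaction.string.regex is set, every
# match in strings Spark produces (e.g. plan text) is replaced with a
# redaction marker. A PEM-shaped pattern is used as a hypothetical example.
STRING_REDACTION_REGEX = re.compile(
    r"-----BEGIN PRIVATE KEY-----[\s\S]*?-----END PRIVATE KEY-----"
)

def redact_string(text):
    return STRING_REDACTION_REGEX.sub("*********(redacted)", text)

plan_text = (
    "pem_private_key=-----BEGIN PRIVATE KEY-----\n"
    "MIIEvQIBADANBgkqhkiG9w0BAQEFAASC\n"
    "-----END PRIVATE KEY----- sfURL=..."
)
print(redact_string(plan_text))
```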

@Loudegaste
Author

As @rshkv suggested, keys do indeed get redacted when they contain "url" as a substring. This gives us an ugly workaround: appending "url" to the end of the key being used.
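The workaround above can be sketched as a check against the default regex (a hypothetical illustration, assuming Spark redacts an option whenever (?i)url matches its key; whether the connector accepts the renamed key is what the comment above reports):

```python
import re

# Hypothetical check against Spark's default
# spark.sql.redaction.options.regex value.
DEFAULT_OPTIONS_REGEX = re.compile(r"(?i)url")

def key_always_redacted(key):
    # If the regex matches the key itself, redaction no longer depends on
    # whether the key *value* happens to contain "url".
    return bool(DEFAULT_OPTIONS_REGEX.search(key))

print(key_always_redacted("pem_private_key"))      # depends on the value
print(key_always_redacted("pem_private_key_url"))  # key itself matches
```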
