
SNOW-1612981: SNOW-1640968 SNOW-1629635 Massive Memory Load #1004

Open
watercable76 opened this issue Aug 6, 2024 · 24 comments
Labels: bug · status-fixed_awaiting_release (The issue has been fixed, its PR merged, and now awaiting the next release cycle of the connector) · status-triage_done (Initial triage done, will be further handled by the driver team)


@watercable76

Issue Description

Since the Snowflake outage on August 1st, 2024, my project has had a massive memory issue when making queries to Snowflake. Queries we've been using for over a year have never caused this many issues, and the problems all seem to have started after the outage.

We were using the Snowflake 3.1.0 C# package and had not made any code changes when the issue started.

Since the memory usage spiked, we have done the following:

  • Updated our connections to be await using var snowflakeDbConnection = new SnowflakeDbConnection(configuration.GetConnectionString("SnowflakeDb"));
  • Upgraded to 4.1.0

The issue seems slightly improved, but memory usage is still higher than we have ever seen and worse than we are comfortable with.

@github-actions github-actions bot changed the title Massive Memory Load SNOW-1612981: Massive Memory Load Aug 6, 2024
@dliberatore-th

dliberatore-th commented Aug 12, 2024

My company has been seeing this issue ramp up since at least early July, and it's causing OOM crashes in containers, which leads to a bad user experience for our customers. We're having to rearchitect a large portion of our application to work around this problem. A 2 GB hard-limit container that simply queries Snowflake can crash OOM rapidly. .NET's OutOfMemoryException doesn't even get thrown; the Linux OOM killer simply terminates the process and we lose the application. We've done some basic analysis and there appears to be a memory allocation issue somewhere in the REST API querying part of the code, and whatever is being allocated doesn't appear to be reclaimable by the .NET GC.

@sfc-gh-dszmolka
Contributor

Hi all - thanks for raising this issue. The fact that it appeared without any code change or driver change suggests it probably does not come from the driver (i.e. this repo) but perhaps from the backend itself; at this point, of course, there is not enough information to tell.

Speaking of which, do any of you (or other readers of this ticket who hit the same issue) have any details on how to reproduce the issue? Most helpful would be code which, when run, leads to the unexpectedly high memory usage.

Other details might also be relevant, such as:

  • do you observe the issue only on one particular Snowflake deployment, or on multiple ones? (only relevant if you have multiple accounts on multiple different deployments)
  • is it related to a specific type of query (SELECT, INSERT, etc.), a specific table/type of data, or a specific Snowflake object (e.g. only when using Streams, only when calling stored procedures, etc.)?
  • if executing the same/similar workload from a different Snowflake driver (ODBC, Python, etc.) is possible, does it also exhibit the same behaviour?

At this point really any additional info might help; if you have queryIds to share, that could also be useful. Finally, if you're not comfortable sharing all these details in a public place like this, you can raise a ticket with Snowflake Support to work 1:1 with a support engineer.

Thank you all in advance for every detail which could help narrow down this issue.

@sfc-gh-dszmolka sfc-gh-dszmolka self-assigned this Aug 13, 2024
@sfc-gh-dszmolka sfc-gh-dszmolka added status-triage Issue is under initial triage status-information_needed Additional information is required from the reporter labels Aug 13, 2024
@watercable76
Author

We are only running SELECT-based statements on views to get data, and only using the C# driver. As far as I know, we have multiple databases, one for each environment, and the issue is happening on all of them.

I've set up the connection to Snowflake, the command, and the data reader to be cleaned up by the GC (via await using), like this:

await using var snowflakeDbConnection = new SnowflakeDbConnection(configuration.GetConnectionString("SnowflakeDb"));
snowflakeDbConnection.Open();

await using var snowflakeDbCommand = snowflakeDbConnection.CreateCommand();
snowflakeDbCommand.CommandText = @"SELECT data FROM created_view WHERE parameter = :parameter";
snowflakeDbCommand.CommandTimeout = 30;

var parameter = snowflakeDbCommand.CreateParameter();
parameter.ParameterName = "parameter";
parameter.Value = value;
parameter.DbType = DbType.String;
snowflakeDbCommand.Parameters.Add(parameter);

await using var dataReader = await snowflakeDbCommand.ExecuteReaderAsync();
snowflakeDbConnection.Close();
return mapper.Map<IEnumerable<Model>>(dataReader);

Even after implementing this, the memory usage is still higher than we want or are comfortable with.
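
For reference, here is a minimal sketch of the same call with the rows materialized before the connection is disposed, since a DbDataReader generally can't be consumed after its connection is closed. Model, mapper, configuration and value are assumptions carried over from the snippet above, and the usual System.Data / System.Linq / Snowflake.Data.Client usings are omitted:

// Sketch only -- reuses the (assumed) configuration, value, mapper and Model from the snippet above.
await using var snowflakeDbConnection = new SnowflakeDbConnection(configuration.GetConnectionString("SnowflakeDb"));
await snowflakeDbConnection.OpenAsync();

await using var snowflakeDbCommand = snowflakeDbConnection.CreateCommand();
snowflakeDbCommand.CommandText = @"SELECT data FROM created_view WHERE parameter = :parameter";
snowflakeDbCommand.CommandTimeout = 30;

var parameter = snowflakeDbCommand.CreateParameter();
parameter.ParameterName = "parameter";
parameter.Value = value;
parameter.DbType = DbType.String;
snowflakeDbCommand.Parameters.Add(parameter);

await using var dataReader = await snowflakeDbCommand.ExecuteReaderAsync();
// Materialize the mapped rows while the reader and connection are still open;
// the await using declarations then dispose the reader, command, and connection on return.
return mapper.Map<IEnumerable<Model>>(dataReader).ToList();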

@elliottcrush

FYI - we experienced the exact same issues here and had to contact Snowflake support to get this resolved. They confirmed that they had to make a load balancer change for our EU connection; as soon as that was made, our memory pattern improved drastically. @watercable76 @dliberatore-th FYI

@sfc-gh-dszmolka
Contributor

sfc-gh-dszmolka commented Aug 15, 2024

This is very interesting @elliottcrush; would you mind sharing the case number here for reference, perhaps? It could serve not only as a reference, but also help our teams sync internally on the possible issues, if no driver change, no code change, and nothing other than switching the load balancer could mitigate the issue. Thank you in advance!

edit: I might have found it. Also, it could be indicative of the error if you see any elevated rate of HTTP 502/503 responses besides the memory usage increase and/or connection resets.

@watercable76
Author

Yes, all of the errors were 502s when trying to make the calls. We thought it was an issue with SNAT connections, which is why we made sure to close connections and set up our variables to be cleaned up by the GC. We're trying to work with Snowflake to get a support case opened for this.

@elliottcrush

SF support said the memory issues we (our company) were seeing were causing 400s on the load balancer.

If you email me, @sfc-gh-dszmolka - [email protected] - I'll share the details privately.

@sfc-gh-dszmolka
Contributor

sfc-gh-dszmolka commented Aug 15, 2024

Thank you @elliottcrush, in the meantime I was able to find it. Appreciate the quick turnaround with the offer though!

@sfc-gh-dszmolka sfc-gh-dszmolka added status-triage_done Initial triage done, will be further handled by the driver team and removed status-triage Issue is under initial triage status-information_needed Additional information is required from the reporter labels Aug 15, 2024
@elliottcrush

@sfc-gh-dszmolka we're starting to see the exact same issue happening again in production within the last 15 minutes. I've re-raised it with Snowflake support; any ideas?

@sfc-gh-dszmolka
Contributor

sfc-gh-dszmolka commented Aug 16, 2024

Thank you for re-raising it with Snowflake Support, and please do keep working with Snowflake Support on this particular issue rather than here. We have very different SLOs for the two kinds of channels.

As a retrospective, to the best of my knowledge the recurrence happened because the LB change was not configured as permanent. Now it is, since you re-raised it yesterday, and I see from the associated support case that you confirmed it helped (again). Apologies for the issues this problem caused.

edit: Our team responsible for traffic ingest into Snowflake is investigating the root cause, and if it somehow turns out to be related to the driver, we will update here too.

@watercable76
Author

We raised a support ticket too and were able to get the load balancer configuration reverted to the one we were on previously. At first glance it's doing a lot better, but I'm going to keep monitoring for another week before I can say it's actually fine again.

@dliberatore-th

dliberatore-th commented Aug 16, 2024

Similar situation here re: a support ticket leading to a load balancer change resolving the client-side memory usage.

I'm inclined to say that this issue is indeed a valid concern for the connector library. Receiving an error response code from the REST APIs shouldn't cause a massive memory spike on the client side. At least, that's not expected or logical behavior from a client library.

The server-side problem was masked by the fact that these API call failures aren't shown in the logs unless you're running at debug log level. For example, see Core/RestRequester.cs. This makes it difficult for monitoring software/services to alert on issues from log ingestion under a normal production configuration (non-debug level). Perhaps these failures shouldn't be relegated to the Debug log level but logged at Warning or Error when an error status is returned from the Snowflake backend? I'm not sure if this is worth opening a separate issue or not.
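
Something along these lines, purely as a hypothetical sketch of the suggestion (not the connector's actual RestRequester code; httpClient, request, cancellationToken, and logger are stand-ins for whatever the driver already uses):

// Hypothetical illustration only -- not the connector's actual code.
HttpResponseMessage response = await httpClient.SendAsync(request, cancellationToken);
if (!response.IsSuccessStatusCode)
{
    // Surface backend failures at Error (or Warning) so they show up under a
    // normal production log configuration instead of only at Debug level.
    logger.Error($"Snowflake REST request to {request.RequestUri} failed with HTTP {(int)response.StatusCode}");
}
else
{
    logger.Debug($"Snowflake REST request to {request.RequestUri} succeeded");
}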

edit: Since we run the associated microservice in containers, reconfiguring production would have meant building a new container image with updated configuration, which is not really a great option for us.

@elliottcrush

https://community.snowflake.com/s/article/FAQ-Updates-on-Migration-of-Traffic-Serving-Proxy-Load-Balancing-Infrastructure

The change to the load balancer and notes on what could be impacted are in the article above.

Which Snowflake drivers and clients are impacted?
Any web browser or Snowflake driver relying on a trust store provided by the operating system or Java runtime environment could be impacted. The following Snowflake drivers rely on the operating system's trust store:
Snowflake NodeJS driver
Snowflake .NET driver

@sfc-gh-dszmolka
Contributor

Thank you all for the comments! The above article just says that if you use these drivers, the new CA might be missing from the truststore and you might need to manually install it. At this point, I'm not sure if it's related.

It could indeed be related more to how the HTTP connections are made and maintained with Snowflake from the driver.
The new Envoy LB, which the Snowflake Community article is about, is indeed the new platform, and migrations have been ongoing for many months now. This issue, however, is fairly new.

Thus we're looking into this aspect, but at this moment we have not been able to reproduce the issue at the magnitude you folks are seeing.

Here's what we tried (we tried this on both Linux and macOS):

  1. install .NET SDK 8 and dotnet-counters
  2. check out https://github.com/Snowflake-Labs/sf-samples/tree/main/samples/spcs/sf-connection/dotnet, which is a simple console app targeting .NET 8.0. Originally made for testing our container platform SPCS; we now use it outside of that platform.
  3. modify Snowflake.App.csproj to use a newer version of the .NET driver, because the example still uses the old 2.1.4. Tried this with both 3.1.0 and 4.1.0.
  4. (optionally) modify log4net.config to log at TRACE level
  5. run the program below, which is a slight modification of the App.cs that comes with the dotnet example repo. It constantly opens connections and runs a simple query in a loop:
using System.Data;
using Snowflake.Data.Client;

class App
{
    const string TOKENPATH = "/snowflake/session/token";

    static string getConnectionString()
    {
        string? account = Environment.GetEnvironmentVariable("SNOWFLAKE_ACCOUNT");
        string? host = Environment.GetEnvironmentVariable("SNOWFLAKE_HOST");
        if (File.Exists(TOKENPATH))
        {
            // automatically set by env
            string token = File.ReadAllText(TOKENPATH);
            return $"account={account};authenticator=oauth;token={token}";
        }
        else
        {
            // basic auth, variables must be set by user
            string? user = Environment.GetEnvironmentVariable("SNOWFLAKE_USER");
            string? password = Environment.GetEnvironmentVariable("SNOWFLAKE_PASSWORD");
            return $"account={account};user={user};password={password}";
        }
    }

    static void SetEnvVariables()
    {
        Environment.SetEnvironmentVariable("SNOWFLAKE_ACCOUNT", "{account targeting Envoy}");
        Environment.SetEnvironmentVariable("SNOWFLAKE_USER", "{user}");
        Environment.SetEnvironmentVariable("SNOWFLAKE_PASSWORD", "{value}");
    }

    static int Main()
    {
        SetEnvVariables();
        for (var i = 0; i < int.MaxValue; i++)
        {
            try
            {
                //Console.WriteLine("Starting connection!");
                using (IDbConnection conn = new SnowflakeDbConnection())
                {
                    conn.ConnectionString = getConnectionString();
                    conn.Open();
                    using (IDbCommand cmd = conn.CreateCommand())
                    {
                        cmd.CommandText = "SELECT current_account() AS account, current_database() AS database, current_schema() AS schema";
                        IDataReader reader = cmd.ExecuteReader();
                        if (reader.Read())
                        {
                            string account, database, schema;
                            account = reader.GetString(0);
                            database = reader.GetString(1);
                            schema = reader.GetString(2);
                            //Console.WriteLine("account: " + account);
                            //Console.WriteLine("database: " + database);
                            //Console.WriteLine("schema: " + schema);
                        }
                    }
                }
            }
            catch (Exception ex)
            {
                Console.WriteLine(ex.Message);
                return 1;
            }

            Thread.Sleep(100);
        }

        return 0;
    }
}
  6. Run it under dotnet-counters and observe how the memory changes: ~/.dotnet/tools/dotnet-counters monitor --refresh-interval 1 -- dotnet run Program.cs

Our experience is that even though connections are frequently being opened, GC Heap Size sits around 30-50 MB and Working Set around 200-215 MB.

So far, we have not been able to reproduce the issue you folks are seeing. Obviously there is something triggering these problems for multiple users, but at this moment we could not yet pinpoint it.

So if it would be possible to:

  1. share which Snowflake .NET driver versions you're using (I've seen info about 3.1.0/4.1.0 in the original issue submission) and the OS on which you experience the problem
  2. share any details on how you manage connections with the driver: how you're connecting to Snowflake and closing those connections
  3. share a reproducible scenario which leads to the issue you're seeing; that would be invaluable for troubleshooting this issue and super appreciated too

Thank you in advance, and thanks again for all the info you have already shared here!

@sfc-gh-dszmolka
Contributor

Quick status: we're still trying to reproduce the issue internally and have not managed to yet. Kudos to the one of you who provided a code example on the ticket opened with Snowflake Support! 👍

May I ask, if you already have (or will have) a ticket for this issue with Snowflake Support, whether you have a DEV/test account we could move over to Envoy (the new LB on which the symptoms started) to profile it and observe the traffic more closely without interrupting important production flows? That could be very helpful.

If you have such an account to spare, even for a limited amount of time (it can then be migrated back to the previous LB), please send it on your Support case and mention that it was requested for reproduction and profiling. You can of course also send it here. Thank you so much!

@elliottcrush

Thanks @sfc-gh-dszmolka - that was us with the code example. We're liaising with support about going back to the new LB. We don't have a separate environment; however, we are willing to upgrade our snowflake-connector-net and migrate to the new LB during an agreed time frame. I'll continue to discuss all of this on the support ticket.

@sfc-gh-dszmolka
Contributor

A short update: we discovered one possible culprit for the issue on the driver side and would like to address it in #1009. However, this might not be an actual fix: without a way to reproduce the issue, we also cannot confirm that the proposed change fixes the problem described in this particular issue.

Anyway, we're working on eliminating this possibility.

@sfc-gh-dszmolka sfc-gh-dszmolka removed the status-pr_pending_merge A PR is made and is under review label Aug 27, 2024
@sfc-gh-dszmolka sfc-gh-dszmolka added the status-fixed_awaiting_release The issue has been fixed, its PR merged, and now awaiting the next release cycle of the connector. label Aug 27, 2024
@sfc-gh-dszmolka
Contributor

The aforementioned change has been merged and will be part of the next .NET driver release, scheduled to happen in the coming days.

Due to the lack of a reliable reproduction of this issue, we cannot verify whether the change helps in this particular case, but after the next release it will be possible to test it and see how it performs.

@sfc-gh-dszmolka
Contributor

In addition to that (and we appreciate the suggestion), #1008, which is now also merged and will be part of the next release, exposes the non-successful messages at Error log level (was: Debug), so hopefully they will be more easily visible.

@watercable76
Author

@sfc-gh-dszmolka Thanks for getting that PR in and working through all of these problems we've had. Do you know when that next release will be published?

@sfc-gh-dszmolka
Contributor

I believe it should come in the next few days; first we're hoping to get in yet another fix for another possible culprit (the team discovered some anomalies around Okta-type authenticators, which might even be related to this issue, and wanted to fix that as well -> #1010).

I would also like to take the opportunity to ask here, just to confirm whether the fix in #1010 could be related or not: are you using Okta-type authenticators in the applications which saw the unexplained sudden increase in memory usage, or something else? Thanks for providing data on this!

@watercable76
Author

Not using Okta over here; more Azure-based than anything.

@sfc-gh-dszmolka
Contributor

sfc-gh-dszmolka commented Sep 2, 2024

Short update: we're also hoping to get #1014 in before the release. At this stage that warrants a list of the changes made so far in an attempt to mitigate this issue:

@sfc-gh-dszmolka sfc-gh-dszmolka changed the title SNOW-1612981: Massive Memory Load SNOW-1612981: SNOW-1640968 Massive Memory Load Sep 2, 2024
@dliberatore-th

dliberatore-th commented Sep 3, 2024

One possible factor here, different from your test code above which does things synchronously in a serial manner, is that we're doing things async from a REST API of our own to serve requests in parallel. This is intended to fill several components on a page.

I can't share our code with you and unfortunately don't have a nice condensed test case for you, but you could probably simulate it well enough by making async calls with a max of maybe 15 at a time, using a startup delay of a few ms each, and having each use a query that takes a few seconds to complete (and only returns a few rows totaling a few KB) so the in-flight requests stack up; something like the sketch below.
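
A rough, hypothetical sketch of that shape (not our production code; the connection string, the SYSTEM$WAIT query, and the concurrency numbers are placeholders):

using System.Data.Common;
using System.Linq;
using System.Threading.Tasks;
using Snowflake.Data.Client;

class ParallelRepro
{
    const int MaxConcurrency = 15;   // "a max of maybe 15 at a time"
    const int StartupDelayMs = 10;   // "a startup delay of a few ms each"

    static async Task RunQueryAsync(string connectionString, int i)
    {
        await Task.Delay(i * StartupDelayMs); // stagger the start of each call

        await using var conn = new SnowflakeDbConnection(connectionString);
        await conn.OpenAsync();

        await using DbCommand cmd = conn.CreateCommand();
        // Placeholder for a query that takes a few seconds and returns only a few rows.
        cmd.CommandText = "SELECT SYSTEM$WAIT(3)";

        await using var reader = await cmd.ExecuteReaderAsync();
        while (await reader.ReadAsync())
        {
            // Nothing to do with the rows; we only care about memory behaviour.
        }
    }

    static async Task Main()
    {
        // Placeholder; in our case the credentials come from configuration.
        var connectionString = "account={account};user={user};password={value}";

        while (true) // loop so memory can be observed over time (Ctrl+C to stop)
        {
            // Fire MaxConcurrency queries at once so in-flight requests stack up.
            var tasks = Enumerable.Range(0, MaxConcurrency)
                                  .Select(i => RunQueryAsync(connectionString, i));
            await Task.WhenAll(tasks);
        }
    }
}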

Thanks for keeping up with the efforts to improve the software. We're looking forward to trying those fixes out.

edit: Not using Okta here either.

@sfc-gh-dszmolka sfc-gh-dszmolka changed the title SNOW-1612981: SNOW-1640968 Massive Memory Load SNOW-1612981: SNOW-1640968 SNOW-1629635 Massive Memory Load Sep 11, 2024