Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Two 200 before the 304 #138

Open
3 tasks done
nitriques opened this issue Jul 21, 2016 · 21 comments
Open
3 tasks done

Two 200 before the 304 #138

nitriques opened this issue Jul 21, 2016 · 21 comments
Assignees
Milestone

Comments

@nitriques
Copy link
Member

nitriques commented Jul 21, 2016

When fetching an external image, jit now issues 2 HTTP 200 responses before sending the 304. Here's why:

  1. The modified time and etag are calculated with a 2nd http request to get the headers (I wonder why, more on that later)
  2. I've added things to use the filemtime of the cache file, which then creates new last modified and etag (so our logic fails)
  3. You then get a proper 304.

I wanted to be able to server the cache from the external server #130 and #136

The two request are shaped like this:

  1. The first one requests the whole file, skipping the headers
  2. The second one only request the headers. Based on libcurl's documentation, doing so sends a HEAD http request and should (if the remote server's smart enough) only sends headers...

I can't seem to figure out how we can get rid of one or the other, but I think we should never do both, as it's the way right now. Should we:

  1. Do the HEAD request always, to make sure the cache is valid ? What about servers that do not return cache headers ?
  2. Store this info in the DB, to be able to regenerate the original etag ?
  3. Give up ? I mean, this behavior occurs only for the first person to hit the url, since all others will get the cache and never see the original response ?
  4. Change our we cache external images ?

The more I think about it, the more I want performance. The simplest thing would be to number 4 including the following:

  1. Never cache external images that do not send proper cache headers ? See Proposition: Never cache external images that do not send proper cache headers #139
  2. Always respond with the cache file's mtime and etag and hide the one from the origin ?
  3. Randomly poke the remote server to validate the cache ?
@michael-e
Copy link
Member

michael-e commented Jul 21, 2016

jit now issues 2 HTTP 200 responses before sending the 304

I assume you mean the following: JIT issues two requests to the remote server and waits for the responses before managing to send a 304 response. Right?

If yes, do you mean that this has been introduced in 2.0? Or is it the same in 1.7?

BTW: I just discovered a bug with JIT 1.7 when I tried to test the reponses using cURL. (I don't know for sure if JIT 2.0 has the same behaviour.) The "If-Modified-Since" logic is not working properly. I only get a 304 response if the value of the If-Modified-Since header matches the Last-Modified response header exactly. Now if I add some time to the If-Modified-Since value (asking for a later point in time), it should still be a 304, but it is not. I get a 200 response instead.

The second one only request the headers. Based on libcurl's documentation, doing so sends a HEAD http request and should (if the remote server's smart enough) only sends headers...

I wouldn't care about this, because HEAD requests are standardized, AFAIK, and should work with any HTTP server. As you said, you can't rely on getting any cache headers from the remote server, (but that is the same with 200 responses).

Apart from that, none of your solutions sounds good to me. "Randomly poking" a remote server is not a clear solution, and it scales terribly. If you have some thousand external images in the cache, you will generate a lot of requests/traffic even if you don't need to serve a single images to a client (because it is used in old, archived articles, for example). This might mean less traffic than with the current behaviour, but it might as well be the other way 'round, depending on the percentage (and frequency) of cached images being actually served by JIT.

More questions:

  • Are we sure that JIT sends two requests to the remote server?
  • Does this only occur once or with every request?

If the latter is the case, there may be reasons for this (for example: servers not sending cache headers, as you mentioned above).

The more I think about it, the more I want performance.

Agreed. I once tested putting all my image files to Amazon AWS and serving them via JIT. Yes, this was significantly slower, you could see that in your browser when loading a page (without browser cache, of course). Using CloudFront made it a lot faster and smoother (but still a bit slower than with local files). At that time I also thought about speeding things up with external images, but I couldn't find a proper solution. I think that checking if a file still exists is the least JIT should do even with external images. So I decided to go for local images (and a bigger hard drive on the server)...

(The best idea I had was delivering the cache file first, then checking for the external file. This would solve any speed issues, but it would mean that you deliver a cache file for at least one more time if the original file has been removed. That felt wrong to me.)

About ETags: I personally don't use them at all, I remove all of them (in Apache):

If you're not taking advantage of the flexible validation model that ETags provide, it's better to just remove the ETag altogether. The Last-Modified header validates based on the component's timestamp.
https://developer.yahoo.com/blogs/ydn/high-performance-sites-rule-13-configure-etags-7211.html

Last question: Does JIT really use the remote files's ETag and cache headers? (I don't know, because I even reset image cache headers in Apache.)

Sorry for posting a lot of questions and no solutions. But it's a delicate matter (and I love JIT).

@nitriques
Copy link
Member Author

But it's a delicate matter (and I love JIT).

Same here !

I assume you mean the following: JIT issues two requests to the remote server and waits for the responses before managing to send a 304 response. Right?

You are right, but this is not what I've met: The request are those handle on the server where jit runs. I do not know if it's new to 2.0 (but it really looks like it)

The "If-Modified-Since" logic is not working properly.

I know! That the origin of my work on that extension haha!

Are we sure that JIT sends two requests to the remote server?

Totally.

Does this only occur once or with every request?

Right now, it's when there is no cache. When a cache file is created, it lasts forever, without ever checking if the url got modified. I was trying to solve #136 and stumble upon this for external images.

"Randomly poking" a remote server is not a clear solution, and it scales terribly.

Indeed. But how can you choose to check if you cache is still valid ?

Does JIT really use the remote files's ETag and cache headers?

Not at all. Only the Last Modified value is used. And then we calculate a etag based on that value and the filename.

@michael-e
Copy link
Member

You are right, but this is not what I've met: The request are those handle on the server where jit runs. I do not know if it's new to 2.0 (but it really looks like it)

Sorry, I still don't get it. Do you mean that JIT itself sends 3 answers to a single request??? Can you explain in simple words what is going on (on "Remote server", "JIT server" and "Client")?

When a cache file is created, it lasts forever, without ever checking if the url got modified.

So this has been broken in JIT 2? I am rather sure that JIT 1.7 sends a HEAD request to the remote server before delivering the cache file. (I vaguely remember that I myself implemented this a long time ago.)

"Randomly poking" a remote server is not a clear solution, and it scales terribly.

Indeed. But how can you choose to check if you cache is still valid ?

By sending a request to the remote server (using "If-Modifed since"), see above.

Does JIT really use the remote files's ETag and cache headers?

Not at all. Only the Last Modified value is used. And then we calculate a etag based on that value and the filename.

Ah, I see. It's unclear to me why ETags have been implemented at all. They don't add any value in this scenario.

@nitriques
Copy link
Member Author

Can you explain in simple words

Haha!

Forget the remote server. The remote server always send 200 responses since we never sent it any cache headers.

  1. First request: The server requests the external image, cache it and serve it as a 200.
  2. Second request: The server gets the cache headers from the client, is not able to use it since the cache has a different last modified value than the one send in request number 1. This also returns a 200 (which I am trying to transform into a 304).
  3. Third request: The 304 is properly sent.

So this has been broken in JIT 2? I am rather sure that JIT 1.7 sends a HEAD request to the remote server before delivering the cache file.

Yes!

By sending a request to the remote server (using "If-Modifed since")

Yeah good idea, but that was never implemented. We do not store the original date from the remote server's response. Could we use the cache file's time ?

Ah, I see. It's unclear to me why ETags have been implemented at all. They don't add any value in this scenario.

Indeed! We should only use the date, parse it into a timestamp and check if it's earlier than, not strictly equals.

@michael-e
Copy link
Member

michael-e commented Jul 21, 2016

  1. First request: The server requests the external image, cache it and serve it as a 200.

Let's assume that the JIT response to the client is: 200 and Last-Modified: Tue, 19 Jul 2016 10:43:29 GMT

  1. Second request: The server gets the cache headers from the client, is not able to use it since the cache has a different last modified value than the one send in request number 1. This also returns a 200 (which I am trying to transform into a 304).

Ah, I see. It should be different, of course. If it is the same client, it will send If-Modified-Since: Tue, 19 Jul 2016 10:43:29 GMT. In this case JIT should send a 304 repsonse. (JIT 1.7 does this, according to my tests.)

By sending a request to the remote server (using "If-Modifed since")

Yeah good idea, but that was never implemented. We do not store the original date from the remote server's response. Could we use the cache file's time ?

You are right. We only check if the file is still available under the given URL. (Which is probably sufficient for services like Flickr, for example, which have unique filenames.) My idea would make things more complicated, so I am not sure if it's a good one. :-)

Indeed! We should only use the date, parse it into a timestamp and check if it's earlier than, not strictly equals.

Yep.

@nitriques
Copy link
Member Author

JIT 1.7 does this, according to my tests.

Cool. 2.0 does not :( But only for external images. The big question is: how can we send the right Date the first time ? And all other times too and be able to invalidate it...

@nitriques
Copy link
Member Author

I knew you and I would have fun with this when I posted it yesterday night ;)

@nitriques nitriques added the Bug label Jul 21, 2016
@nitriques nitriques added this to the 2.1.0 milestone Jul 21, 2016
@nitriques nitriques self-assigned this Jul 21, 2016
@michael-e
Copy link
Member

michael-e commented Jul 21, 2016

I knew you and I would have fun with this when I posted it yesterday night ;)

:-)

The big question is: how can we send the right Date the first time ? And all other times too and be able to invalidate it...

AFAIK, you don't need to send any date with a 304 response. (As far as I see, JIT sets the header anyway, but either my Apache or my nginx frontend server seem to remove it if the status code is 304.)

Anyway, for external images, JIT 1.7 will always do a HEAD request to the remote server; this is the important line:

$last_modified = strtotime(Image::getHttpHeaderFieldValue($image_path, 'Last-Modified'));

A bit later:

if($_SERVER['HTTP_IF_MODIFIED_SINCE'] == $last_modified_gmt || str_replace('"', NULL, stripslashes($_SERVER['HTTP_IF_NONE_MATCH'])) == $etag){
    ...

@nitriques
Copy link
Member Author

my Apache or my nginx frontend server seem to remove it if the status code is 304.

Apache does it. nginx most likely also does it. It's in the spec: https://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html#sec10.3.5

This link is also in our codebase ;)

For external images, JIT 1.7 will always do a HEAD request

Indeed. That's what I am trying to remove while not breaking things as I currently did.

By caching the file on the spot, we could avoid using the data from the remote server but we must still be able to avoid having an infinite cache. A proxy would only use the max-age won't it ?

@michael-e
Copy link
Member

That's what I am trying to remove while not breaking things as I currently did.

Aha, so we get to the point, finally! :-))

This functionality has been implemented on purpose in order to no longer deliver a cache file if the original (external) file no longer exists. I think it is a bad idea to avoid the HEAD request, because in this case you will serve the cache file (at least for a while) although the original file has been removed from the interwebs.

@nitriques
Copy link
Member Author

I totally agree with you. But right now, this HEAD request causes problems since it loads the network card with request under high load.

It's funny because there was lots of code to avoid serving bad cache for external resources, but we would gladly sent bad cache for our local files ;)

@nitriques
Copy link
Member Author

I'd rather send invalid cache file and be able to support more requests

@michael-e
Copy link
Member

michael-e commented Jul 21, 2016

but we would gladly sent bad cache for our local files ;)

AFAIK, we don't serve a cache file if the original local file has been removed (which is the topic here).

I'd rather send invalid cache file and be able to support more requests

I understand that this might be desired in certain systems. But not always. So what about making it optional? Just adding an option like "Don't check if external images still exist". The simplest implementation would deliver the cache file forever. A more advanced — like mentioned above — might deliver the cache file first, then do the HEAD request (and remove the cache file if the original file has gone away).

@nitriques
Copy link
Member Author

AFAIK, we don't serve a cache file if the original local file has been removed (which is the topic here).

No I was referring to #136 which states that we serve the cache even if the file changed.

So what about making it optional

Let me bring random back! Set the configurable factor to one, the HEAD request is always made, bad cache is never used. That would be the default since it's the old behaviour. Set it to 0.5 or even 0.1 to lessen the chances of issuing a HEAD request to validate the cache but may send an invalid response.

Otherwise, we need to hit the Database or other storage to expire the cache on a url basis. Like I said, performance is key, so I think this is a no-go.

@michael-e
Copy link
Member

Like you, I am not fond of involving the database. So yes, your proposal sounds like a solution. (Documentation is king, of course.)

@nitriques
Copy link
Member Author

Great!

nitriques added a commit that referenced this issue Jul 28, 2016
The way we handled HEAD request was

1. Different from the GET request, since it lacked some checks
2. Way to general for our needs

This commit refactors the old things with a single brand new function
that will

1. Make sure the request is a 200
2. Return only cache related headers, since that's the goal

Re #136
Cc #138
Cc #130
Cc #137
nitriques added a commit that referenced this issue Jul 28, 2016
This commit cleans up the code and fixes a bunch of issues.

* Cache validation

There was a bug, #137, which would made the cache unusable with external
images. This has been fixed. The function also makes sure to validate
that the original file, either remote or local, has not changed since
the last modification of the cache file itself. If the cache file is no
longer valid, it gets deleted. This also fixes #136.

* Last modified value

There was a problem describe in #138 where a second request to a url
with proper cache heading would still send a 200, because the
Last-Modified value was read from a different source. This commit
changes this by doing the following.

1. Always use a local mtime: either the original local file or our
tempfile created from the http request.
2. If cache is enabled, always use the cache file mtime, even when the
request creates the cache file. This insure repeatable results for all
subsequent requests.

This brings a side effect that is needed:
before, we used to touch the cache file when found valid and we must
need to stop this to be able to get repeatable results.

---

Throughout the process, values are "exported" into the parameter array
when certain condition are met. This insure that the values sent to the
user will be the right one.

Sorry for the big commit, I did my best to split things up :)

Fixes #137
Fixes #136
Fixes #138
@nitriques
Copy link
Member Author

@michael-e Do not rush anything, just a reminder. I am not in a rush to release this, I'll wait for you.

nitriques added a commit that referenced this issue Aug 2, 2016
This commits adds a new hidden configuration value, `cache_invalidation_odds`
which allows developer to control how often cache is validated.

This is particularly usefull on stressed servers, and even better if
serving remote images, since it can lead to a 60% request time
diminution.

The default behaviour is not changed and cache validation will always
occur.

It also updates the x-jit debug header to specify the result of the
cache validation

Fixes #138
@nitriques
Copy link
Member Author

@michael-e again, no rush, just wanted to let you know that my random thingny works really great. I'll deploy it in production tomorrow 👍

@michael-e
Copy link
Member

Cool. Tell us how it works in production!

@nitriques
Copy link
Member Author

Still doing extensive testing, and it looks promising 👍

@michael-e
Copy link
Member

I have only done a short test (looking at the debug headers), and it seems to work "as advertised". But I guess in this case your production environment is a much better test!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants