
Metric suggestions #184

Open
joshmoore opened this issue Jul 30, 2024 · 6 comments

Comments

@joshmoore

In working on converting OME-Zarr filesets from Zarr v2 to Zarr v3, I've been using zarr-python for the group and metadata migration and tensorstore for the large array conversion. The metrics API has been a welcome tool. Kudos and 👍 for it being non-experimental.

I have a few suggestions, though, ignoring how practical they may be to implement 🙂

  1. Capture read/write times. This will allow calculating throughput more accurately.
  2. Add a method on the Store implementation that returns the bytes_{read,written} etc. metrics with a generic prefix.
  3. Add an optional argument which takes previous metrics and subtracts them from the current metrics.

You can find some of these ideas loosely implemented in the simple TSMetrics class below:

import time

import tensorstore as ts


class TSMetrics:
    """
    Instances of this class capture the current tensorstore metrics.
    If an existing instance is passed in on creation, it will be stored
    in order to deduct previous values from those measured by this instance.
    """

    CHUNK_CACHE_READS = "/tensorstore/cache/chunk_cache/reads"
    CHUNK_CACHE_WRITES = "/tensorstore/cache/chunk_cache/writes"

    BATCH_READ = "/tensorstore/kvstore/{store_type}/batch_read"
    BYTES_READ = "/tensorstore/kvstore/{store_type}/bytes_read"
    BYTES_WRITTEN = "/tensorstore/kvstore/{store_type}/bytes_written"

    OTHER = [
        "/tensorstore/cache/hit_count",
        "/tensorstore/cache/kvs_cache_read",
        "/tensorstore/cache/miss_count",
        "/tensorstore/kvstore/{store_type}/delete_range",
        "/tensorstore/kvstore/{store_type}/open_read",
        "/tensorstore/kvstore/{store_type}/read",
        "/tensorstore/kvstore/{store_type}/write",
        "/tensorstore/thread_pool/active",
        "/tensorstore/thread_pool/max_delay_ns",
        "/tensorstore/thread_pool/started",
        "/tensorstore/thread_pool/steal_count",
        "/tensorstore/thread_pool/task_providers",
        "/tensorstore/thread_pool/total_queue_time_ns",
        "/tensorstore/thread_pool/work_time_ns",
    ]

    def __init__(self, read_config, write_config, start=None):
        self.time = time.time()
        self.read_type = read_config["kvstore"]["driver"]
        self.write_type = write_config["kvstore"]["driver"]
        self.start = start
        self.data = ts.experimental_collect_matching_metrics()

    def value(self, key):
        rv = None
        for item in self.data:
            k = item["name"]
            v = item["values"]
            if k == key:
                if len(v) > 1:
                    raise Exception(f"Multiple values for {key}: {v}")
                rv = v[0]["value"]
                break
        if rv is None:
            raise Exception(f"unknown key: {key} from {self.data}")

        if self.start is not None:
            orig = self.start.value(key)
        else:
            orig = 0

        return rv - orig

    def read(self):
        return self.value(self.BYTES_READ.format(store_type=self.read_type))

    def written(self):
        return self.value(self.BYTES_WRITTEN.format(store_type=self.write_type))

    def elapsed(self):
        # Wall-clock seconds since the previous snapshot, if one was given;
        # otherwise the absolute snapshot time.
        if self.start is not None:
            return self.time - self.start.time
        return self.time

Example usage:

    ...
    read = ts.open(read_config).result()
    write = ts.open(write_config).result()
    before = TSMetrics(read_config, write_config)
    future = write.write(read)
    future.result()
    after = TSMetrics(read_config, write_config, before)
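From the `after` snapshot, the deltas can be turned into a throughput figure. A minimal sketch with a hypothetical helper (not part of tensorstore; the byte count and duration below are made-up stand-ins for `after.read()` and `after.elapsed()`):

```python
# Hypothetical helper (sketch): average throughput over the measured interval.
def throughput_mib_s(bytes_delta: int, seconds: float) -> float:
    """Bytes transferred over elapsed wall time, in MiB/s."""
    return bytes_delta / (1024 * 1024) / seconds

# e.g. 300 MiB read over 2.5 s of wall time:
print(f"read: ~{throughput_mib_s(300 * 1024 * 1024, 2.5):.1f} MiB/s")
```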
@laramiel
Collaborator

laramiel commented Jul 31, 2024

We do have http total (cumulative) latency: "/tensorstore/http/total_time_ms"
I could add a few more to capture latency of various kvstore drivers.

Several other drivers also include read_latency_ms / write_latency_ms, which are histograms and should give a good summary of read/write latency. I can make some of these more consistent, but not all kvstores will support all metrics.

You probably want ts.experimental_collect_matching_metrics('/tensorstore/'). It may not make a difference now, but we may add common Prometheus metrics at some point.
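To illustrate why the prefix matters, here is a sketch of the shape the metrics API returns, a list of `{"name": ..., "values": [...]}` dicts, with the prefix keeping only tensorstore-owned names. The entries are illustrative, not real output:

```python
# Illustrative sample of collected metrics; a second, non-tensorstore
# entry stands in for future Prometheus-style metrics.
metrics = [
    {"name": "/tensorstore/kvstore/file/bytes_read", "values": [{"value": 1024}]},
    {"name": "/prometheus/some_other_metric", "values": [{"value": 3}]},
]

# Passing '/tensorstore/' as the prefix would keep only the first entry.
tensorstore_only = [m for m in metrics if m["name"].startswith("/tensorstore/")]
print([m["name"] for m in tensorstore_only])
```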

@joshmoore
Author

I could add a few more to capture latency of various kvstore drivers.

💯 on having the total wherever possible.

but not all kvstores will support all metrics.

Understood, but a vote for total_time_ms to be as widespread as possible.

You probably want ts.experimental_collect_matching_metrics('/tensorstore/')

👍

@laramiel
Collaborator

laramiel commented Aug 1, 2024

With cc9fb66 I made the kvstore metrics a bit more consistent.

I think that the latency metrics are pretty close to what you want:

/tensorstore/kvstore/driver/read_latency_ms
/tensorstore/kvstore/driver/write_latency_ms

They will report buckets as well as aggregate latency values.
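For instance, a sketch of pulling the aggregate out of one histogram sample, in the dict shape the metrics API reports (the sample values are illustrative, matching the file-driver example later in this thread):

```python
# One illustrative histogram sample: per-bucket counts plus
# count / mean / sum_of_squared_deviation aggregates.
sample = {"0": 0, "1": 1, "2": 0, "3": 0, "4": 1,
          "count": 2, "mean": 2.5, "sum_of_squared_deviation": 12.5}

# Aggregate latency over the interval is count * mean (ms for this metric).
total_latency_ms = sample["count"] * sample["mean"]
print(total_latency_ms)  # 5.0
```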

@joshmoore
Author

Thanks, @laramiel! I've updated and now see:

/tensorstore/kvstore/file/read_latency_ms [{'0': 0, '1': 1, '2': 0, '3': 0, '4': 1, 'count': 2, 'mean': 2.5, 'sum_of_squared_deviation': 12.5}]

I might be misinterpreting but what I would like to achieve is something of the form:

read:  ~152.78ms @ 3.60GB/s

I can get the delta in read size from the total (by tracking the previous value) but with the aggregates for time, I can't, right?

@laramiel
Collaborator

laramiel commented Aug 2, 2024

The total time is count * mean. Here it should be 5 ms.
Then you need to look at the /tensorstore/kvstore/driver/bytes_read metric.

So, for example, if I have this:

/tensorstore/kvstore/file/batch_read=71804
/tensorstore/kvstore/file/bytes_read=301163610386
/tensorstore/kvstore/file/open_read=150897
/tensorstore/kvstore/file/read=150897
/tensorstore/kvstore/file/read_latency_ms={count=71804 mean=7.40619 buckets=[0,  50,1967,13997,30356,21320,3724,214,82,43,42,  9]}

Then my total read time = 71804 * 7.40619 / 1000 = 531.79 seconds
And my average throughput is 301163610386 / (1024 * 1024) / 531.79 = 540.09 MB/s

But also the latency value may not be quite what we want; the start time is recorded somewhat before the OS ::read call, so it includes some queue latency and some other delays.

@joshmoore
Author

The total time is count * mean. Here it should be 5.

👍 Understood and thanks.

But also the latency value may not be quite what we want; the start time is recorded somewhat before the OS ::read call, so it includes some queue latency and some other delays.

Ah, this is good to know. I'm not overly concerned, but if there's ever a different property that I need to test, just let me know. 👍
