Recognize TV shows using an HLS playlist #305

nathandebalthasar · 2024-03-14T17:41:05Z

Hello,
I've been trying to recognize TV shows as well as ads ingested using DejaVu in real time using an HLS playlist. The shows last from a few minutes to hours and the ads generally last for a few dozen seconds.

The main problem is that when I run recognition on a TS segment that should match an audio file ingested by DejaVu, the input_confidence attribute is very low, or not close enough to 1, depending on the length of the segment.

When using 60-second TS segments, the input_confidence value tends towards 0. The value is often <= 0.1 with the default settings and rises to at most 0.2 with these settings: https://github.com/denis-stepanov/advent?tab=readme-ov-file#dejavu-tuning
With 6-second segments, the value is closer to 1, around 0.5 to 0.9 most of the time. However, the second result returned by the program often has a value closer to 1, and that result is the wrong audio.

The files ingested are WMV files, and the audio specs are the following:

- 3 audio tracks
- Codec WMA 9.2
- Constant bit rate mode at 96 kbps
- 2 channels
- 48 kHz sample rate

I converted these WMV files to TS files using ffmpeg so that they match the characteristics of the TS segments, which are the following:

- Single audio track
- Codec AAC LC Version 4
- Muxing Mode: ADTS
- 2 channels
- 48 kHz sample rate
- Lossy compression mode
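
For reference, a conversion like the one described above could be sketched as follows, building the ffmpeg command from Python. The exact command used in this thread was not posted, so the helper names `build_ffmpeg_command` and `convert`, the choice of the first audio track, and the file names are all assumptions for illustration.

```python
import subprocess


def build_ffmpeg_command(src, dst, audio_track=0):
    """Build an ffmpeg command that keeps one audio track as AAC in MPEG-TS.

    audio_track is a hypothetical choice: the WMV files carry 3 audio
    tracks and the thread does not say which one was used.
    """
    return [
        "ffmpeg",
        "-y",                          # overwrite dst if it already exists
        "-i", str(src),
        "-map", f"0:a:{audio_track}",  # select a single audio stream, no video
        "-c:a", "aac",                 # ffmpeg's native AAC encoder
        "-ar", "48000",                # 48 kHz sample rate
        "-ac", "2",                    # 2 channels
        "-f", "mpegts",                # MPEG transport stream container
        str(dst),
    ]


def convert(src, dst, audio_track=0):
    """Run the conversion; requires ffmpeg to be installed and on PATH."""
    subprocess.run(build_ffmpeg_command(src, dst, audio_track), check=True)
```

Matching the container and codec parameters this way does not guarantee that the audio matches what the broadcaster's own encoding chain produced, so some fingerprint differences may remain.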

I also noticed something odd. When I take an excerpt from a TS file that I converted from a WMV file ingested by DejaVu, the input_confidence is 1 or close to 1 most of the time. But when I take the same excerpt of audio from a TS segment of my HLS playlist, the result is poor: close to 0 for 60-second segments, or close to 1 but not close enough for 6-second segments. How can that be explained?

How can you get more relevant results?

mkommar · 2024-03-14T17:45:16Z

To be honest, you do want short clips. Google uses this method. Is there a design reason that requires 60 seconds? Think of the algorithm: any noise or variation will make the "beat" vary. In my attempt, I used 2 to 4 seconds.
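
One way to act on this suggestion is to cut each HLS segment into short, overlapping windows and run recognition on each window separately. The sketch below only computes the window boundaries. The function name `sliding_windows` and the 4-second window with a 2-second hop are illustrative assumptions, not DejaVu settings.

```python
def sliding_windows(duration, window=4.0, hop=2.0):
    """Return (start, end) times in seconds for overlapping windows.

    A clip shorter than one window is returned as a single window.
    """
    if duration <= 0 or window <= 0 or hop <= 0:
        raise ValueError("duration, window and hop must be positive")
    if duration <= window:
        return [(0.0, float(duration))]
    windows = []
    index = 0
    # Multiply by the index instead of accumulating to avoid float drift.
    while index * hop + window <= duration:
        start = index * hop
        windows.append((start, start + window))
        index += 1
    return windows


# Example: a 6-second HLS segment yields two 4-second windows.
print(sliding_windows(6.0))
```

Each window could then be cut from the segment and passed to the recognizer on its own, so that a window lying entirely inside an ingested ad is not diluted by the surrounding audio.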

quannabe · 2024-03-14T17:50:58Z

Agree with the above. Shorter clips should yield better results.

Curious about your use case. Can you share more?

nathandebalthasar · 2024-03-14T18:04:23Z

> To be honest, you do want short clips. Google uses this method. Is there a design reason that requires 60 seconds? Think of the algorithm, any noise or variation will make the "beat" vary. In my attempt, I used 2 to 4 seconds.

There is no particular reason to use 60-second segments. I was using 6-second segments at the beginning, and at some point I found that longer segments produced fewer false positives, at the cost of a loss of precision.

> To be honest, you do want short clips.

Does that apply only to the files used during recognition, or also to the files that DejaVu ingests?

> Agree with the above. Shorter clips should yield better results.
>
> Curious about your use case. Can you share more?

I'm building a solution that aims to recognize a given television program, series or ad in real time using TS segments from an HLS playlist.

nathanagez · 2024-03-15T08:46:47Z

Hi @quannabe @mkommar, we tested with shorter clips but we ended up with low confidence results as well.

This is how we proceeded. We have TV ads that last between 10 and 20 seconds, which we ingested in DejaVu. If I take the exact same file and compare it with what DejaVu fingerprinted, we obtain a very good confidence level (1 or close to 1).

Let's assume we have the following:

ad_1.wmv or .mp3 (the format doesn't make much difference)
ad_2.wmv
ad_3.wmv

We ingest all of them in DejaVu. If we then provide ad_1.wmv for recognition, it matches what we have in the database, with an input_confidence equal or close to 1.

Now let's do the same. We ingest:

ad_1.wmv
ad_2.wmv
ad_3.wmv

The start and end of our .ts segment contain audio unknown to DejaVu, but the middle contains the ad_2 we ingested previously.

If we run the recognition on this segment, this is where we end up with very low confidence.
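
A back-of-the-envelope calculation may explain part of this. It rests on two assumptions worth verifying against the DejaVu version in use: that input_confidence is the number of matched hashes divided by the total number of hashes in the queried clip, and that hashes are spread roughly evenly over time. Under those assumptions, the unknown audio around the ad caps the score even when the ad itself matches perfectly. The function name below is hypothetical.

```python
def input_confidence_ceiling(known_seconds, segment_seconds):
    """Upper bound on input_confidence for a segment that is only partly known.

    Assumes input_confidence = matched hashes / hashes in the queried clip,
    and that hashes are distributed evenly over the clip's duration.
    """
    if segment_seconds <= 0:
        raise ValueError("segment_seconds must be positive")
    # Clamp so the known part can never exceed the segment itself.
    known = min(max(known_seconds, 0.0), segment_seconds)
    return round(known / segment_seconds, 2)


# A 15-second ad in the middle of a 60-second segment:
print(input_confidence_ceiling(15, 60))  # 0.25
# A 6-second segment that lies entirely inside the ad:
print(input_confidence_ceiling(6, 6))    # 1.0
```

If these assumptions hold, a 60-second segment cannot score near 1 for a 10 to 20 second ad, and codec differences would lower the score further. DejaVu's results also appear to include a fingerprinted_confidence field, relative to the ingested file's own hashes, which may be a more useful signal for this case.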

quannabe · 2024-03-20T14:32:19Z

Interesting use case!

I've had issues with query times greatly increasing as the audio library size increases. Have you run into this?

mkommar · 2024-03-20T14:37:32Z

Got it. Do you require passive listening, or will this identification happen from a direct recording? In other words, will the use case always have a direct recording from a source stream, or will it pick up background audio on a phone or Alexa device?

Mahesh
