Recognize TV shows using an HLS playlist #305

Open
nathandebalthasar opened this issue Mar 14, 2024 · 6 comments

Comments

@nathandebalthasar

Hello,
I've been trying to recognize, in real time, TV shows and ads ingested with DejaVu, using an HLS playlist as input. The shows last from a few minutes to several hours, and the ads generally last a few dozen seconds.

The main problem is that when I run recognition on a TS segment that should match an audio file ingested by DejaVu, the input_confidence attribute is very low, or not close enough to 1, depending on the length of the segment.

When using 60-second TS segments, the input_confidence value tends towards 0. Often, the value is <= 0.1 using the default settings and can grow to <= 0.2 using these settings.
Using 6-second segments, the value is closer to 1, around 0.5 to 0.9 most of the time. However, the second result returned by the program often has a value closer to 1, and that result is the wrong audio.
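
One possible contributor to the gap between the two segment lengths, sketched below. It assumes input_confidence is the fraction of the query's hashes that matched the returned song, which is my understanding of how DejaVu computes it and is worth checking against the version in use. Under that assumption, any audio in the segment that is not in the database dilutes the ratio, so a short ad inside a long segment cannot score near 1 even with perfect matching. The function names and the hashes_per_second constant are made up for illustration.

```python
def input_confidence(matched_hashes: int, query_hashes: int) -> float:
    """Fraction of the query's hashes that matched (assumed definition)."""
    if query_hashes <= 0:
        return 0.0
    return round(matched_hashes / query_hashes, 2)


def best_case_confidence(known_seconds: float, segment_seconds: float,
                         hashes_per_second: int = 100) -> float:
    """Upper bound when only `known_seconds` of the segment are in the database.

    hashes_per_second is a made-up constant; it cancels out of the ratio.
    """
    matched = int(known_seconds * hashes_per_second)
    total = int(segment_seconds * hashes_per_second)
    return input_confidence(matched, total)


print(best_case_confidence(10, 60))  # 0.17: a 10-second ad inside a 60-second segment
print(best_case_confidence(6, 6))    # 1.0: a 6-second segment fully inside known audio
```

This would only account for segments that mix known and unknown audio. It does not explain low values on a segment taken entirely from an ingested show.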

The files ingested are WMV files, and the audio specs are the following:

  • 3 audio tracks
  • Codec WMA 9.2
  • Constant bit rate mode at 96kbps
  • 2 channels
  • 48 kHz sample rate

I converted these WMV files to TS files with ffmpeg so that they match the characteristics of the TS segments, which are the following:

  • Single audio track
  • Codec AAC LC Version 4
  • Muxing Mode: ADTS
  • 2 channels
  • 48 kHz sample rate
  • Lossy compression mode
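
For reference, a minimal sketch of that kind of conversion, not the exact command I ran. It assumes ffmpeg is on the PATH and that the first audio track of the WMV is the one to keep; the function names and file names are hypothetical. The flags select one audio stream, drop the video, encode to AAC at 48 kHz stereo, and write an MPEG-TS container.

```python
import subprocess
from typing import List


def build_ffmpeg_cmd(src: str, dst: str, audio_track: int = 0) -> List[str]:
    """Build an ffmpeg command that turns one audio track of `src` into a TS file."""
    return [
        "ffmpeg", "-y",
        "-i", src,
        "-map", f"0:a:{audio_track}",  # keep a single audio track (assumed: the first)
        "-vn",                         # drop the video stream
        "-c:a", "aac",                 # ffmpeg's built-in AAC encoder
        "-ar", "48000",                # 48 kHz sample rate
        "-ac", "2",                    # 2 channels
        "-f", "mpegts",                # MPEG-TS container
        dst,
    ]


def convert(src: str, dst: str, audio_track: int = 0) -> None:
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_ffmpeg_cmd(src, dst, audio_track), check=True)


# Hypothetical usage: convert("ad_1.wmv", "ad_1.ts")
```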

I also noticed something odd. When I take a part of a TS file that I converted from a WMV file ingested by DejaVu, the input_confidence is 1 or close to 1 most of the time. But when I take the same part of the audio from a TS segment of my HLS playlist, the result is poor: close to 0 for 60-second segments, or close to 1 but not close enough for 6-second segments. What could explain that?

How can I get more relevant results?

@mkommar

mkommar commented Mar 14, 2024 via email

@quannabe

Agree with the above. Shorter clips should yield better results.

Curious about your use case. Can you share more?

@nathandebalthasar
Author

nathandebalthasar commented Mar 14, 2024

To be honest, you do want short clips. Google uses this method. Is there a design reason that requires 60 seconds? Think of the algorithm: any noise or variation will make the "beat" vary. In my attempt, I used 2 to 4 seconds.

There is no particular reason to use 60-second segments. I was using 6-second segments at the beginning, and at some point I found that longer segments gave fewer false positives, at the cost of a loss of precision.

To be honest, you do want short clips.

Does that apply only to the files used during recognition, or also to the files that DejaVu ingests?

Agree with the above. Shorter clips should yield better results.

Curious about your use case. Can you share more?

I'm building a solution that aims to recognize a given television program, series, or ad in real time using TS segments from an HLS playlist.

@nathanagez

nathanagez commented Mar 15, 2024

Hi @quannabe @mkommar, we tested with shorter clips, but we ended up with low-confidence results as well.

This is how we proceeded. We have TV ads that last 10 to 20 seconds, which we ingested into DejaVu. If I take the exact same file and compare it with what DejaVu fingerprinted, we get a very good confidence level (close to or equal to 1).

Let's assume we have the following:

  • ad_1.wmv or .mp3 (the format makes little difference)
  • ad_2.wmv
  • ad_3.wmv

We ingest all of them into DejaVu. Then, if we provide ad_1.wmv for recognition, it matches what we have in the database and we get an input_confidence close to or equal to 1.

Now let's do the same. We ingest:

  • ad_1.wmv
  • ad_2.wmv
  • ad_3.wmv

This time we run recognition on a .ts segment. The start and end of the segment contain audio unknown to DejaVu, but the middle contains our ad_2 ingested previously.

If we run the recognition on this segment, we end up with very low confidence.
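
One workaround we could try for this mixed-content case, sketched below and not tested against DejaVu: cut the segment into short overlapping windows, run recognition on each window, and only accept a match that wins in several windows. Everything here is hypothetical (the function names, the 0.5 threshold, the vote count). How each window's best (name, input_confidence) pair is obtained from the recognizer is left to the caller.

```python
from collections import Counter
from typing import List, Optional, Sequence, Tuple

WindowResult = Optional[Tuple[str, float]]  # best (name, input_confidence), or None


def window_bounds(total_seconds: float, window: float, hop: float) -> List[Tuple[float, float]]:
    """Return (start, end) pairs of overlapping windows that fit inside the segment."""
    bounds = []
    start = 0.0
    while start + window <= total_seconds:
        bounds.append((start, start + window))
        start += hop
    return bounds


def vote(results: Sequence[WindowResult], min_confidence: float = 0.5,
         min_votes: int = 2) -> Optional[str]:
    """Pick the name that wins the most confident windows, or None if too few agree."""
    votes = Counter(
        name for name, confidence in (r for r in results if r is not None)
        if confidence >= min_confidence
    )
    if not votes:
        return None
    name, count = votes.most_common(1)[0]
    return name if count >= min_votes else None


# A 60-second segment cut into 6-second windows every 3 seconds gives 19 windows.
print(len(window_bounds(60, 6, 3)))
# Two confident windows agree on ad_2; the single ad_1 hit is outvoted.
print(vote([None, ("ad_2", 0.8), ("ad_2", 0.7), ("ad_1", 0.9), None]))
```

The idea is that windows falling entirely inside the ad are not diluted by the unknown audio around it, while requiring agreement across windows should filter out some of the one-off false positives that short clips produce.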

@quannabe

Interesting use case!

I've had issues with query times greatly increasing as the audio library size increases. Have you run into this?

@mkommar

mkommar commented Mar 20, 2024 via email

4 participants