We are having a similar issue with source maps scraped by Sentry.
Sentry reports that our source file was not utf8.
However, it is valid utf8.
Uploading the map directly to Sentry with the API works fine. The problem only arises when it is scraped.
Looking at the code that @benvinegar linked to, and loading the file directly into a Python REPL to check its contents, I see that the contents are indeed six.binary_type and can be successfully decoded as utf8, so it's unusual that the file is making its way into that code branch at all.
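A quick sketch of that REPL check (the file contents are simulated inline here; in practice you would read the downloaded source map with `open(path, "rb")`):

```python
# Simulated source-map bytes; substitute the actual downloaded file.
contents = '{"version":3,"sources":["café.js"]}'.encode("utf-8")

# six.binary_type is `bytes` on Python 3 (`str` on Python 2).
assert isinstance(contents, bytes)

# Succeeds for valid utf-8; raises UnicodeDecodeError otherwise.
decoded = contents.decode("utf-8")
print("decoded %d characters" % len(decoded))
```

If `decode("utf-8")` succeeds here, the bytes themselves are valid utf-8 and the error Sentry raises must come from somewhere else in the pipeline.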
Running iconv -f utf-8 against the file is successful.
Is it possible Sentry is incorrectly raising a utf8 error when the problem is something else? Maybe the file isn’t downloaded completely or something?
We also pull this information out of the HTTP headers in the response. Can you share a link to one of these assets? If it’s on sentry.io, we can also help in support and check there.
We did this because I'm pretty sure that without the charset being utf-8, it's handled as ascii. Though I think it might be safe to not be so strict, since ascii is just a safe subset of utf-8.
This problem surfaces when your server sends back a text/* Content-Type without a charset. In that case we were getting an explicit ISO-8859-1 encoding value, whereas we expected either None or utf-8. So the PR explicitly allows this charset, on the grounds that it's compatible with how we decode the content.
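A minimal sketch of that relaxed check, using a hypothetical helper (`charset_ok` is not Sentry's actual function; the whitelist below just mirrors the values discussed above):

```python
from email.message import Message


def charset_ok(content_type_header):
    """Return True if the response charset should be accepted for utf-8 decoding."""
    msg = Message()
    msg["Content-Type"] = content_type_header
    charset = msg.get_param("charset")  # None when the header carries no charset
    # None and utf-8 were accepted before; ascii and ISO-8859-1 reflect the
    # relaxed whitelist described above (hypothetical, for illustration).
    return charset is None or charset.lower() in (
        "utf-8", "us-ascii", "ascii", "iso-8859-1",
    )


print(charset_ok("text/javascript"))                      # no charset given
print(charset_ok("text/javascript; charset=ISO-8859-1"))  # allowed by the PR
print(charset_ok("text/javascript; charset=shift_jis"))   # still rejected
```

The point is only where the decision is made: the charset comes from parsing the Content-Type header, so a server that omits it (or whose framework fills in ISO-8859-1 by default) trips the strict version of this check even when the bytes are perfectly valid utf-8.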