utils.parse: fix encoding in parse_html #4201

bastimeyer · 2021-11-22T08:46:31Z

This fixes the character encoding in parsed HTML documents. This should've been caught way sooner.

lxml's XML parser requires bytes as input, which is why strings must be encoded to bytes first (utf8 by default). The content is then decoded and parsed correctly as utf8, regardless of whether the XML declaration with the document's encoding is present.
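A rough illustration of the XML case, using the standard library's `xml.etree.ElementTree` as a stand-in for lxml so that it runs without extra dependencies. The helper name `parse_xml_text` is hypothetical and is not the function changed in this PR; it only mirrors the "encode strings to bytes first" step described above.

```python
import xml.etree.ElementTree as ET


def parse_xml_text(data, encoding="utf8"):
    """Hypothetical helper: encode str input to bytes, then parse as XML."""
    if isinstance(data, str):
        data = data.encode(encoding)
    # Stand-in for lxml's XML parser: stdlib ElementTree also accepts bytes
    return ET.fromstring(data)


# Without an XML declaration, the bytes are read as UTF-8 (the XML default)
without_decl = parse_xml_text("<root>ä</root>")

# With an explicit declaration, the result is the same
with_decl = parse_xml_text('<?xml version="1.0" encoding="utf-8"?><root>ä</root>')

print(without_decl.text, with_decl.text)
```

Both documents yield the same text, which matches the point that a missing declaration does not matter for XML input.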

The HTML parser, on the other hand, treats byte inputs differently, but only when the <meta charset="utf8"> tag is missing in the first X bytes of the document. So if input strings are encoded to bytes here as well (utf8 by default) and the tag is missing too, this leads to decoding errors, because the parser won't treat the input as utf8-encoded data.
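A minimal sketch of what such a decoding error looks like, using only the standard library. It assumes the parser falls back to a single-byte encoding such as ISO-8859-1 when no charset is declared; that fallback is an assumption for illustration, not something stated in this PR, and the function name `garble` is hypothetical.

```python
def garble(text, assumed_fallback="iso-8859-1"):
    """Encode text as UTF-8, then decode it with a different encoding.

    This imitates a parser that receives UTF-8 bytes but does not treat
    them as UTF-8. The fallback encoding is an assumption for illustration.
    """
    return text.encode("utf-8").decode(assumed_fallback)


original = "<html><body>ä</body></html>"
broken = garble(original)

# "ä" is two bytes in UTF-8 (0xC3 0xA4); each byte becomes its own character
print(broken)

# Passing the str through unchanged avoids the round trip through bytes
untouched = original
print(untouched)
```

ASCII-only documents survive the round trip unchanged, which may be one reason a problem like this goes unnoticed for a while.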

See the added tests.

utils.parse: fix encoding in parse_html

d79a479

bastimeyer added the bug label Nov 22, 2021

bastimeyer mentioned this pull request Nov 22, 2021

plugins.ard_mediathek: fix plugin #4202

Merged

gravyboat merged commit 12c17c4 into streamlink:master Nov 23, 2021

Billy2011 added a commit to Billy2011/streamlink-27 that referenced this pull request Nov 23, 2021

[PATCH] utils.parse: fix encoding in parse_html (streamlink#4201)

e67f8c0

bastimeyer deleted the utils/parse/html-encoding branch November 23, 2021 11:06

back-to mentioned this pull request Nov 26, 2021

plugins.tviplayer: unable to handle CNN Portugal #4209

Closed

bastimeyer mentioned this pull request Apr 1, 2022

plugins.okru: Could not find metadata and cannot record stream #4414

Closed
