Skip to content

utils.url: make update_scheme always update target #4053

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Sep 30, 2021

Conversation

bastimeyer
Copy link
Member

@bastimeyer bastimeyer commented Sep 27, 2021

Previously, update_scheme only added schemes on scheme-less URLs and
kept target URLs which included a scheme intact.

This is confusing for plugin implementers and it has already been used
incorrectly a lot of times when trying to force the HTTPS protocol on
unencrypted HTTP URLs. Since the method is called update_scheme, it
should do exactly that, in all cases.

However, always updating the target scheme will break those calls of
update_scheme where the scheme was previously only added if needed, eg.
on scheme-less URLs, partial URLs or URL paths.

An example of this are scheme-less input URLs passed to Streamlink's
Session.resolve_url(), which has to implicitly set the HTTP scheme.
Always overriding the scheme would turn explicit HTTPS schemes to HTTP,
which can break certain sites/plugins. The various stream-protocol
plugins have the same issue and have to implicitly set a scheme as well.

This change therefore introduces the force=True keyword, so that URL
schemes can be set implicitly again in the methods and plugins mentioned
above by setting force to False.

Another effect of forcing the scheme override by default is the pattern
currently used in many plugins, which often update (partial) URLs by
calling update_scheme(self.url, target). Since self.url depends on
the user's input and since the input URL's scheme is implicitly set to
HTTP by the Session's URL resolver, target URLs can be downgraded from
HTTPS to HTTP, which is not desired. This is not a problem in cases
where the server will redirect, but it can be a problem where the server
returns an error instead.

These plugins will therefore need to be fixed so that they either set
the URL schemes implicitly or so that they force the according scheme.

Another solution (which will be done eventually) is the change of the
implicit input URL scheme from HTTP to HTTPS in Session.resolve_url()
and the various stream-protocol plugins, which will turn all scheme-less
input URLs to HTTPS by default, unless the user explicitly sets an HTTP
URL.

  • always update target URL scheme with current URL scheme
  • add force=True keyword and do not override scheme if force is False
  • set force=False on all implicitly updated URL schemes
    • Session.resolve_url
    • plugins.{akamaihd,hds,hls,http}
  • parametrize tests, re-order/re-format assertions and add new ones

ref #4047

Here's a list of all update_scheme calls. As you can see, every call with "http://" as current URL have the force=False parameter set. The other ones are a mixture of "https://" (also via constants) and self.url. The self.url ones are problematic for the reasons mentioned above, but as I've also said, there's an upgrade path for that by switching from http by default to https (#4047).

Please don't merge unless this we're 100% sure that everything's good here.

$ grep -RHn 'update_scheme(' src/ | sort -V | sed -E 's/ +/ /g' | column -tl2 | head -n-2
src/streamlink/plugins/abweb.py:112:         iframe_url = update_scheme('https://', iframe_url)
src/streamlink/plugins/abweb.py:124:         hls_url = update_scheme('https://', m.group('url'))
src/streamlink/plugins/akamaihd.py:18:       url = update_scheme("http://", data.get("url"), force=False)
src/streamlink/plugins/albavision.py:89:     playlist_url = m and update_scheme(self.url, m.group(1))
src/streamlink/plugins/albavision.py:116:    playlist_url = m and update_scheme(self.url, m.group(1))
src/streamlink/plugins/ard_mediathek.py:45:  stream = HTTPStream(self.session, update_scheme("https://", url))
src/streamlink/plugins/ard_mediathek.py:49:  return HLSStream.parse_variant_playlist(self.session, update_scheme("https://", info["_stream"])).items()
src/streamlink/plugins/ard_mediathek.py:72:  stream_ = update_scheme("https://", stream_)
src/streamlink/plugins/atresplayer.py:50:    super().__init__(update_scheme("https://", url))
src/streamlink/plugins/cdnbg.py:65:          iframe_url = update_scheme(self.url, html_unescape(iframe_url))
src/streamlink/plugins/cdnbg.py:79:          update_scheme(iframe_url, stream_url),
src/streamlink/plugins/delfi.py:41:          src = update_scheme(self.url, x['src'])
src/streamlink/plugins/dogus.py:39:          mobile_url = mobile_url_m and update_scheme(self.url, mobile_url_m.group("url"))
src/streamlink/plugins/earthcam.py:61:       hls_url = update_scheme(self.url, f"{hls_domain}{hls_playpath}")
src/streamlink/plugins/euronews.py:47:       validate.transform(lambda url: update_scheme("https://", url))
src/streamlink/plugins/hds.py:21:            url = update_scheme("http://", data.get("url"), force=False)
src/streamlink/plugins/hls.py:21:            url = update_scheme("http://", data.get("url"), force=False)
src/streamlink/plugins/http.py:18:           url = update_scheme("http://", data.get("url"), force=False)
src/streamlink/plugins/idf1.py:40:           lambda x: [update_scheme(IDF1.DACAST_API_URL, x['hls']), x['hds']] + [y['src'] for y in x.get('html5', [])]
src/streamlink/plugins/invintus.py:41:       return HLSStream.parse_variant_playlist(self.session, update_scheme(self.url, hls_url))
src/streamlink/plugins/mediaklikk.py:70:     validate.map(lambda p: update_scheme(self.url, p["file"]))
src/streamlink/plugins/nbcsports.py:20:      url = update_scheme(self.url, platform_url)
src/streamlink/plugins/nbc.py:20:            url = update_scheme(self.url, platform_url)
src/streamlink/plugins/sportschau.py:25:     validate.transform(lambda url: update_scheme("https:", url))
src/streamlink/plugins/sportschau.py:45:     yield from HLSStream.parse_variant_playlist(self.session, update_scheme("https:", data.get("videoURL"))).items()
src/streamlink/plugins/sportschau.py:47:     yield "audio", HTTPStream(self.session, update_scheme("https:", data.get("audioURL")))
src/streamlink/plugins/streamable.py:33:     stream_url = update_scheme(self.url, info["url"])
src/streamlink/plugins/tv999.py:31:          validate.transform(lambda x: update_scheme('http:', x)),
src/streamlink/plugins/vk.py:56:             iframe_url = update_scheme(self.url, _i.attributes['src'])
src/streamlink/plugins/webtv.py:64:          url = update_scheme(self.url, source["src"])
src/streamlink/session.py:234:               self.http.proxies["http"] = update_scheme("http://", value, force=False)
src/streamlink/session.py:236:               self.http.proxies["https"] = update_scheme("http://", value, force=False)
src/streamlink/session.py:239:               self.http.proxies["https"] = update_scheme("https://", value)

@bastimeyer
Copy link
Member Author

Let's take a look at the update_scheme calls I've listed above:

Any calls that have https: in their current URL parameter (via a string or some kind of constant) have most likely been set intentionally, either with the (unmet) expectation that they would update the URL from http to https, or with the intention of adding a scheme with https as the rational default. This means that these calls can all be ignored because update_scheme will now do the correct thing in both cases.

Next are the calls with http:. Except for one plugin (tv999), these are only present in the session class and stream-protocol plugins as a way of implicitly adding a scheme, and they've been changed in this PR to only add the scheme if it doesn't exist via force=False. That one plugin exception has not been changed (yet).

This means that this is the list of plugins that need to be checked and updated if there are any issues with http to https redirection. As you can see, all of them except tv999 use self.url. Shouldn't take too long to check every plugin if https is available and switch to that instead of self.url.

src/streamlink/plugins/albavision.py:89:     playlist_url = m and update_scheme(self.url, m.group(1))
src/streamlink/plugins/albavision.py:116:    playlist_url = m and update_scheme(self.url, m.group(1))
src/streamlink/plugins/cdnbg.py:65:          iframe_url = update_scheme(self.url, html_unescape(iframe_url))
src/streamlink/plugins/cdnbg.py:79:          update_scheme(iframe_url, stream_url),
src/streamlink/plugins/delfi.py:41:          src = update_scheme(self.url, x['src'])
src/streamlink/plugins/dogus.py:39:          mobile_url = mobile_url_m and update_scheme(self.url, mobile_url_m.group("url"))
src/streamlink/plugins/earthcam.py:61:       hls_url = update_scheme(self.url, f"{hls_domain}{hls_playpath}")
src/streamlink/plugins/invintus.py:41:       return HLSStream.parse_variant_playlist(self.session, update_scheme(self.url, hls_url))
src/streamlink/plugins/mediaklikk.py:70:     validate.map(lambda p: update_scheme(self.url, p["file"]))
src/streamlink/plugins/nbcsports.py:20:      url = update_scheme(self.url, platform_url)
src/streamlink/plugins/nbc.py:20:            url = update_scheme(self.url, platform_url)
src/streamlink/plugins/streamable.py:33:     stream_url = update_scheme(self.url, info["url"])
src/streamlink/plugins/tv999.py:31:          validate.transform(lambda x: update_scheme('http:', x)),
src/streamlink/plugins/vk.py:56:             iframe_url = update_scheme(self.url, _i.attributes['src'])
src/streamlink/plugins/webtv.py:64:          url = update_scheme(self.url, source["src"])

How should this issue be solved?

There are two solutions (which can also be combined):

  1. Fix all update_scheme calls in the various plugins
  2. Switch from http to https in the Session's URL resolver and stream-protocol plugins

(1) could be done in this PR as an additional commit, but I would prefer solving (2) in a separate PR for visibility and legibility reasons.

Previously, update_scheme only added schemes on scheme-less URLs and
kept target URLs which included a scheme intact.

This is confusing for plugin implementers and it has already been used
incorrectly a lot of times when trying to force the HTTPS protocol on
unencrypted HTTP URLs. Since the method is called update_scheme, it
should do exactly that, in all cases.

However, always updating the target scheme will break those calls of
update_scheme where the scheme was previously only added if needed, eg.
on scheme-less URLs, partial URLs or URL paths.

An example of this are scheme-less input URLs passed to Streamlink's
`Session.resolve_url()`, which has to implicitly set the HTTP scheme.
Always overriding the scheme would turn explicit HTTPS schemes to HTTP,
which can break certain sites/plugins. The various stream-protocol
plugins have the same issue and have to implicitly set a scheme as well.

This change therefore introduces the `force=True` keyword, so that URL
schemes can be set implicitly again in the methods and plugins mentioned
above by setting force to False.

Another effect of forcing the scheme override by default is the pattern
currently used in many plugins, which often update (partial) URLs by
calling `update_scheme(self.url, target)`. Since `self.url` depends on
the user's input and since the input URL's scheme is implicitly set to
HTTP by the Session's URL resolver, target URLs can be downgraded from
HTTPS to HTTP, which is not desired. This is not a problem in cases
where the server will redirect, but it can be a problem where the server
returns an error instead.

These plugins will therefore need to be fixed so that they either set
the URL schemes implicitly or so that they force the according scheme.

Another solution (which will be done eventually) is the change of the
implicit input URL scheme from HTTP to HTTPS in `Session.resolve_url()`
and the various stream-protocol plugins, which will turn all scheme-less
input URLs to HTTPS by default, unless the user explicitly sets an HTTP
URL.

- always update target URL scheme with current URL scheme
- add force=True keyword and do not override scheme if force is False
- set force=False on all implicitly updated URL schemes
  - `Session.resolve_url`
  - `plugins.{akamaihd,hds,hls,http}`
- parametrize tests, re-order/re-format assertions and add new ones
Update URL schemes to https instead of using the input URL as reference
and set force to False on those plugins which support multiple sites, so
that the https scheme only gets added if it is missing.
@bastimeyer bastimeyer force-pushed the utils/url/update_scheme branch from 38d7a5a to 7cfd434 Compare September 28, 2021 15:03
@gravyboat
Copy link
Member

Thanks for the write up. Correct me if I'm wrong but it seems like option 2 will be the most beneficial and easy to understand going forward?

@bastimeyer
Copy link
Member Author

The changes of this PR should be good now and all plugins got fixed (see the commit message of the second commit).

Switching from http to https by default on scheme-less URLs as described by (2) is now independent of this, as it only affects the input URL and not the update_scheme calls. Anything else that previously relied on self.url has been changed to https://, with force=False on those "multi-site" plugins (because I couldn't check everything, and some of the plugins are from regions where web technology is at least 10-15 years behind).

@gravyboat gravyboat merged commit 481eab6 into streamlink:master Sep 30, 2021
@bastimeyer bastimeyer deleted the utils/url/update_scheme branch September 30, 2021 06:22
Billy2011 added a commit to Billy2011/streamlink-27 that referenced this pull request Sep 30, 2021
…k#4053)

Previously, update_scheme only added schemes on scheme-less URLs and
kept target URLs which included a scheme intact.

This is confusing for plugin implementers and it has already been used
incorrectly a lot of times when trying to force the HTTPS protocol on
unencrypted HTTP URLs. Since the method is called update_scheme, it
should do exactly that, in all cases.

However, always updating the target scheme will break those calls of
update_scheme where the scheme was previously only added if needed, eg.
on scheme-less URLs, partial URLs or URL paths.

An example of this are scheme-less input URLs passed to Streamlink's
`Session.resolve_url()`, which has to implicitly set the HTTP scheme.
Always overriding the scheme would turn explicit HTTPS schemes to HTTP,
which can break certain sites/plugins. The various stream-protocol
plugins have the same issue and have to implicitly set a scheme as well.

This change therefore introduces the `force=True` keyword, so that URL
schemes can be set implicitly again in the methods and plugins mentioned
above by setting force to False.

Another effect of forcing the scheme override by default is the pattern
currently used in many plugins, which often update (partial) URLs by
calling `update_scheme(self.url, target)`. Since `self.url` depends on
the user's input and since the input URL's scheme is implicitly set to
HTTP by the Session's URL resolver, target URLs can be downgraded from
HTTPS to HTTP, which is not desired. This is not a problem in cases
where the server will redirect, but it can be a problem where the server
returns an error instead.

These plugins will therefore need to be fixed so that they either set
the URL schemes implicitly or so that they force the according scheme.

Another solution (which will be done eventually) is the change of the
implicit input URL scheme from HTTP to HTTPS in `Session.resolve_url()`
and the various stream-protocol plugins, which will turn all scheme-less
input URLs to HTTPS by default, unless the user explicitly sets an HTTP
URL.

- always update target URL scheme with current URL scheme
- add force=True keyword and do not override scheme if force is False
- set force=False on all implicitly updated URL schemes
  - `Session.resolve_url`
  - `plugins.{akamaihd,hds,hls,http}`
- parametrize tests, re-order/re-format assertions and add new ones
Billy2011 added a commit to Billy2011/streamlink-27 that referenced this pull request Sep 30, 2021
Update URL schemes to https instead of using the input URL as reference
and set force to False on those plugins which support multiple sites, so
that the https scheme only gets added if it is missing.
@bastimeyer
Copy link
Member Author

I just noticed that this breaks socks4/socks5 proxy URLs for the https-proxy option, because it now always overrides the scheme (no force=False param). This needs to be fixed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants