Improved plugins URL matching API #3814

@bastimeyer

Motivation

For Streamlink's Session to find a matching plugin for a given input URL, plugins must implement the can_handle_url classmethod. For historical reasons, and because many different people have contributed, every plugin handles the method's return value and defines its URL regex(es) in its own way. This is inconsistent, makes the code harder to read, adds complexity and maintenance burden, and makes it harder for new contributors to learn how to write plugins.

The current API also doesn't allow static code analysis of the URL regex(es). That would be a requirement if we eventually want to replace the inefficient plugin loading logic in the Session with a pre-build step that generates a JSON object listing the built-in plugins with their regexes and matching priorities. The Session could then use that data to find and load only the single matching plugin instead of loading everything at once.

Proposal

A better solution would therefore be having strict definitions of each plugin's URL regex(es) and their matching priorities, in a declarative way.

Plugin.can_handle_url(url) and Plugin.priority(url) could then be removed from built-in plugins and deprecated for third-party plugins.

The Plugin.matchers attribute

Let's define a list of URL regexes and priorities.

from typing import ClassVar, List, NamedTuple, Pattern

class Matcher(NamedTuple):
    pattern: Pattern
    priority: int

class Plugin:
    matchers: ClassVar[List[Matcher]] = None

A plugin could implement it like this (more elegant solution down below):

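import re

from streamlink.plugin import LOW_PRIORITY, NORMAL_PRIORITY, Matcher, Plugin

class MyPlugin(Plugin):
    matchers = [
        # Matcher as defined above has no default priority, so pass one explicitly
        Matcher(re.compile("primary"), NORMAL_PRIORITY),
        # any combination of re flags can be passed here; re.IGNORECASE is just an example
        Matcher(re.compile("secondary", re.IGNORECASE), priority=LOW_PRIORITY),
    ]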
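
The attribute can be tried out without Streamlink installed. The following is a minimal, self-contained sketch: the priority constants are local stand-ins whose numeric values are assumptions, the example.com-style hostnames are made up, re.Pattern is used in place of typing.Pattern, and priority is given a default so that it can be omitted.

```python
import re
from typing import ClassVar, List, NamedTuple, Optional

# Stand-in constants; the numeric values are assumptions for illustration only.
NO_PRIORITY = 0
LOW_PRIORITY = 10
NORMAL_PRIORITY = 20


class Matcher(NamedTuple):
    pattern: re.Pattern
    # A default lets plugins omit the priority for their primary URL.
    priority: int = NORMAL_PRIORITY


class Plugin:
    # None means "this plugin declares no matchers".
    matchers: ClassVar[Optional[List[Matcher]]] = None


class MyPlugin(Plugin):  # hypothetical plugin
    matchers = [
        Matcher(re.compile(r"https?://primary\.example/")),
        Matcher(re.compile(r"https?://secondary\.example/", re.IGNORECASE), priority=LOW_PRIORITY),
    ]
```

Because the matchers are plain class-level data, they can be inspected without instantiating the plugin.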

And the Session could resolve an input URL like this:

from streamlink.exceptions import NoPluginError
from streamlink.plugin.plugin import Matcher, NO_PRIORITY, Plugin
from streamlink.utils.url import update_scheme

class Session:
    def resolve_url(self, url: str, follow_redirects: bool = False) -> Plugin:
        url = update_scheme("http://", url)

        matcher: Matcher
        candidate = None
        priority = NO_PRIORITY
        for plugin in self.plugins.values():
            for matcher in plugin.matchers or []:
                if matcher.priority > priority and matcher.pattern.match(url) is not None:
                    candidate = plugin
                    priority = matcher.priority

        if candidate:
            return candidate(url)

        # `follow_redirects` logic...

        raise NoPluginError
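
The priority-based selection can be exercised in isolation. This is a sketch under stated assumptions rather than the actual Session code: resolve, GenericPlugin and SitePlugin are hypothetical names, the priority values are stand-ins, and the scheme update and redirect handling are left out.

```python
import re
from typing import List, NamedTuple, Optional, Type

# Assumed stand-in values for the priority constants.
NO_PRIORITY, LOW_PRIORITY, NORMAL_PRIORITY = 0, 10, 20


class Matcher(NamedTuple):
    pattern: re.Pattern
    priority: int = NORMAL_PRIORITY


class Plugin:
    matchers: Optional[List[Matcher]] = None

    def __init__(self, url: str) -> None:
        self.url = url


class GenericPlugin(Plugin):  # hypothetical low-priority catch-all
    matchers = [Matcher(re.compile(r"https?://\S+\.m3u8"), LOW_PRIORITY)]


class SitePlugin(Plugin):  # hypothetical site-specific plugin
    matchers = [Matcher(re.compile(r"https?://foo\.bar/"))]


def resolve(plugins: List[Type[Plugin]], url: str) -> Optional[Plugin]:
    """Return an instance of the plugin whose matching pattern has the highest priority."""
    candidate, priority = None, NO_PRIORITY
    for plugin in plugins:
        for matcher in plugin.matchers or []:
            # Strict ">" means that on equal priority the first plugin found wins.
            if matcher.priority > priority and matcher.pattern.match(url) is not None:
                candidate, priority = plugin, matcher.priority
    return candidate(url) if candidate else None
```

A URL matched by both plugins resolves to the site-specific one regardless of the order in which the plugins are iterated, because only a strictly higher priority replaces the current candidate.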

The pluginmatcher class decorator

Since defining the matchers attribute on each plugin doesn't 100% ensure consistency and also requires the import of the Matcher, a decorator could be used instead, which makes defining URLs at the top/head of each plugin definition mandatory and doesn't require special imports.

from typing import Callable, Pattern, Type

def pluginmatcher(pattern: Pattern, priority: int = NORMAL_PRIORITY) -> Callable[[Type[Plugin]], Type[Plugin]]:
    matcher = Matcher(pattern, priority)

    def decorator(cls: Type[Plugin]) -> Type[Plugin]:
        if not issubclass(cls, Plugin):
            raise TypeError(f"{cls!r} is not a Plugin")
        if cls.matchers is None:
            cls.matchers = []
        cls.matchers.insert(0, matcher)

        return cls

    return decorator

A plugin would then use the decorator like this:

import re
from streamlink.plugin import LOW_PRIORITY, Plugin, pluginmatcher

@pluginmatcher(re.compile(
    r"https?://foo\.bar/"
))
@pluginmatcher(priority=LOW_PRIORITY, pattern=re.compile(r"""
    https?://baz\.qux/
""", re.VERBOSE))
class MyPlugin(Plugin):
    pass

An alternative decorator implementation could be compiling the regex in the decorator itself, so that re.compile doesn't have to be called in each plugin. I'd prefer it this way, as it's cleaner and it'd also simplify static code analysis in the future.

def pluginmatcher(pattern: str, flags: int = 0, priority: int = NORMAL_PRIORITY) -> Callable[[Type[Plugin]], Type[Plugin]]:
    matcher = Matcher(re.compile(pattern, flags), priority)

    # ...

With that variant, a plugin passes the pattern strings directly:

@pluginmatcher(
    # language=PythonRegExp
    r"https?://foo\.bar/"
)
@pluginmatcher(
    # language=PythonVerboseRegExp
    r"""
    https?://baz\.qux/
    """,
    re.VERBOSE,
    priority=LOW_PRIORITY
)
class MyPlugin(Plugin):
    pass
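
For reference, here is a self-contained sketch of the compiling variant that runs without Streamlink. Matcher, Plugin and the priority values are local stand-ins (assumptions). One deliberate deviation from the snippet further up: the decorator builds a new list instead of calling insert on the existing one, so that a decorated subclass would not modify a list inherited from its parent.

```python
import re
from typing import Callable, List, NamedTuple, Optional, Type

# Assumed stand-in values for the priority constants.
LOW_PRIORITY, NORMAL_PRIORITY = 10, 20


class Matcher(NamedTuple):
    pattern: re.Pattern
    priority: int


class Plugin:
    matchers: Optional[List[Matcher]] = None


def pluginmatcher(
    pattern: str, flags: int = 0, priority: int = NORMAL_PRIORITY
) -> Callable[[Type[Plugin]], Type[Plugin]]:
    matcher = Matcher(re.compile(pattern, flags), priority)

    def decorator(cls: Type[Plugin]) -> Type[Plugin]:
        if not issubclass(cls, Plugin):
            raise TypeError(f"{cls!r} is not a Plugin")
        # Decorators are applied bottom-up, so prepending keeps the source order.
        cls.matchers = [matcher, *(cls.matchers or [])]
        return cls

    return decorator


@pluginmatcher(r"https?://foo\.bar/")
@pluginmatcher(
    r"""
    https?://baz\.qux/
    """,
    re.VERBOSE,
    priority=LOW_PRIORITY,
)
class MyPlugin(Plugin):
    pass
```

After decoration, MyPlugin.matchers lists the foo.bar matcher first and the baz.qux matcher second, which is the order in which they appear in the source.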

One drawback, though, is that some IDEs and editors can no longer parse the pattern string as a regex without annotations, configs or plugins. PyCharm, for example, needs the re.compile(pattern) call to detect the regex language injection in the pattern parameter. Otherwise a # language=PythonRegExp or # language=PythonVerboseRegExp annotation is needed, and writing custom language injection rules in the IDE is neither simple nor portable.
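
To illustrate why string patterns in a decorator help static analysis, here is a rough sketch of how a pre-build step might read them with Python's ast module, without importing the plugin. The function name extract_matchers and the output layout are assumptions, and the sketch only handles a literal pattern passed as the first positional argument and a priority given as a bare name.

```python
import ast
import json
from typing import Dict, List

# Plugin source as a pre-build step might read it from a file.
SOURCE = r'''
@pluginmatcher(r"https?://foo\.bar/")
@pluginmatcher(r"https?://baz\.qux/", priority=LOW_PRIORITY)
class MyPlugin(Plugin):
    pass
'''


def extract_matchers(source: str) -> List[Dict[str, str]]:
    """Collect (plugin, pattern, priority name) entries from pluginmatcher decorators."""
    result = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ClassDef):
            continue
        for deco in node.decorator_list:
            is_matcher = (
                isinstance(deco, ast.Call)
                and isinstance(deco.func, ast.Name)
                and deco.func.id == "pluginmatcher"
            )
            if not is_matcher or not deco.args:
                continue
            pattern = deco.args[0]
            if not (isinstance(pattern, ast.Constant) and isinstance(pattern.value, str)):
                continue  # not a plain string literal, cannot be analysed statically
            priority = "NORMAL_PRIORITY"  # name of the assumed default
            for keyword in deco.keywords:
                if keyword.arg == "priority" and isinstance(keyword.value, ast.Name):
                    priority = keyword.value.id
            result.append({"plugin": node.name, "pattern": pattern.value, "priority": priority})
    return result


entries = extract_matchers(SOURCE)
as_json = json.dumps(entries, indent=2)
```

With re.compile(...) calls as decorator arguments, the same extraction would have to unwrap a nested call and interpret its flags, which is the extra complexity the string-based variant avoids.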

The Plugin.matches, Plugin.match and Plugin.matcher attributes

As is common in many plugins, URL regexes have capture groups from which data is read. Since there would no longer be custom re.Pattern class attributes/variables (_re_url, etc.), each plugin would have to call self.matchers[n].pattern.match(self.url) to match the input URL and extract that data, which is awkward.

The Plugin should automatically define re.Match results in its constructor, for every item in the matcher list and for the one that first matched.

The matcher results should also be recomputed whenever the url gets updated.

from typing import ClassVar, List, Match, NamedTuple, Optional, Pattern, Sequence

class Matcher(NamedTuple):
    pattern: Pattern
    priority: int

class Plugin:
    matchers: ClassVar[Optional[List[Matcher]]] = None
    matches: Sequence[Optional[Match]]
    matcher: Optional[Pattern]  # None if no pattern matched
    match: Optional[Match]  # None if no pattern matched

    _url: str

    @property
    def url(self) -> str:
        return self._url

    @url.setter
    def url(self, value: str):
        self._url = value

        matches = [(pattern, pattern.match(value)) for pattern, priority in self.matchers or []]
        self.matches = tuple(m for p, m in matches)
        self.matcher, self.match = next(((p, m) for p, m in matches if m is not None), (None, None))

    def __init__(self, url: str) -> None:
        self.url = url

        # ...
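
The recomputation on URL updates can be checked with a small standalone sketch. DemoPlugin and its two patterns are hypothetical, Matcher and Plugin are local stand-ins rather than Streamlink's classes, and the default priority value is an assumption.

```python
import re
from typing import ClassVar, List, NamedTuple, Optional


class Matcher(NamedTuple):
    pattern: re.Pattern
    priority: int = 20  # assumed stand-in for NORMAL_PRIORITY


class Plugin:
    matchers: ClassVar[Optional[List[Matcher]]] = None

    def __init__(self, url: str) -> None:
        self.url = url  # goes through the setter below

    @property
    def url(self) -> str:
        return self._url

    @url.setter
    def url(self, value: str) -> None:
        self._url = value
        pairs = [(m.pattern, m.pattern.match(value)) for m in self.matchers or []]
        # One entry per matcher, in declaration order; None where nothing matched.
        self.matches = tuple(match for _, match in pairs)
        # The first pattern that matched and its result, or (None, None).
        self.matcher, self.match = next(((p, m) for p, m in pairs if m is not None), (None, None))


class DemoPlugin(Plugin):  # hypothetical plugin with differently named capture groups
    matchers = [
        Matcher(re.compile(r"https?://foo\.bar/(?P<id>\d+)")),
        Matcher(re.compile(r"https?://foo\.bar/channel/(?P<name>\w+)")),
    ]


plugin = DemoPlugin("https://foo.bar/channel/abc")
first = (plugin.matches[0], plugin.match.group("name"))

plugin.url = "https://foo.bar/42"  # assigning a new URL recomputes all results
second = (plugin.matches[1], plugin.match.group("id"))
```

Here first holds None for the pattern that did not match plus the captured channel name, and second shows that the roles of the two patterns are swapped once the URL changes.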

See this HLSPlugin example, which gets its self.match from the first matching regex. The .groupdict() call works for both regexes, as they define the same capture group names.

Multiple regexes with different capture groups can be accessed via self.matches[n] and the matching regex itself via self.matcher (in case that's needed).

@pluginmatcher(re.compile(
    r"hls(?:variant)?://(?P<url>\S+)(?:\s(?P<params>.+))?"
))
@pluginmatcher(priority=LOW_PRIORITY, pattern=re.compile(
    r"(?P<url>\S+\.m3u8(?:\?\S*)?)(?:\s(?P<params>.+))?"
))
class HLSPlugin(Plugin):
    def _get_streams(self):
        data = self.match.groupdict()
        url = update_scheme("http://", data.get("url"))
        params = parse_params(data.get("params"))

        # ...
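
The claim that both regexes yield the same group names can be verified with the re module alone. The helper first_match below is a hypothetical stand-in for what self.match would provide, and the example.com URLs are made up.

```python
import re
from typing import Dict, Optional

# The two patterns from the HLSPlugin example, in matcher order.
PREFIXED = re.compile(r"hls(?:variant)?://(?P<url>\S+)(?:\s(?P<params>.+))?")
SUFFIXED = re.compile(r"(?P<url>\S+\.m3u8(?:\?\S*)?)(?:\s(?P<params>.+))?")


def first_match(value: str) -> Optional[Dict[str, Optional[str]]]:
    """Return the named groups of the first pattern that matches, or None."""
    for pattern in (PREFIXED, SUFFIXED):
        match = pattern.match(value)
        if match is not None:
            return match.groupdict()
    return None


explicit = first_match("hls://example.com/live foo=bar")
implicit = first_match("https://example.com/index.m3u8?token=1")
unmatched = first_match("https://example.com/")
```

Both results are dictionaries with the keys url and params, so code that consumes the match result does not need to know which of the two patterns matched.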

Ideas

Caching

Instead of a Matcher NamedTuple, a custom Matcher class could be implemented which caches the pattern's match result, so that the regexes don't have to be matched against the input URL twice, first in Session.resolve_url(url) and in the Plugin constructor afterwards. However, as long as all plugins need to be in memory at the same time and are kept for the entire runtime, caching results of irrelevant plugins doesn't make much sense.
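
One possible shape for such a class, as a rough sketch only: CachingMatcher is a hypothetical name, it remembers just the most recent URL so memory use stays constant, and the calls counter exists purely to make the caching observable.

```python
import re
from typing import Optional


class CachingMatcher:
    """Hypothetical replacement for the Matcher NamedTuple that caches the last result."""

    def __init__(self, pattern: re.Pattern, priority: int = 20) -> None:
        self.pattern = pattern
        self.priority = priority  # 20 is an assumed stand-in for NORMAL_PRIORITY
        self._url: Optional[str] = None
        self._result: Optional[re.Match] = None
        self.calls = 0  # number of real regex evaluations, for demonstration only

    def match(self, url: str) -> Optional[re.Match]:
        # Only run the regex when the URL differs from the one seen last.
        if url != self._url:
            self._url = url
            self._result = self.pattern.match(url)
            self.calls += 1
        return self._result


matcher = CachingMatcher(re.compile(r"https?://foo\.bar/(?P<id>\d+)"))
in_session = matcher.match("https://foo.bar/42")  # e.g. while resolving the URL
in_plugin = matcher.match("https://foo.bar/42")   # e.g. in the Plugin constructor
```

The second lookup returns the very same match object without evaluating the regex again. A negative result is cached as well, so a URL that does not match is also evaluated only once.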

Plugin metadata

Instead of having to manually maintain plugin_matrix.rst in the docs, metadata could be added to the pluginmatcher decorator, which would describe the URLs in a natural human-readable way.

Feedback

I'd appreciate some feedback on my proposed changes and on whether they make sense. Have I missed something obvious? Is there a better or simpler way? Is it not worth it?

I know this would be a big change and would require a lot of work to update every plugin, but I think it will be worth it for the reasons mentioned above. After updating nearly half of the plugins yesterday, I haven't found a single case that caused problems. There are still some plugins I have yet to update, VK for example, whose complex can_handle_url logic needs to be translated (complex plugin matching logic doesn't make sense).

As mentioned earlier, a deprecation path could be implemented for third-party plugins, so that this won't be a breaking change, but I haven't thought about any of that yet.

This is just an early proposal / suggestion, so let's not rush things and discuss this first. An actual implementation can also wait until it's the right time.
