Implement data extraction via web browser (Chrome devtools protocol)

Meta-thread for the implementation of data extraction via a web browser.

----

## Motivation

We need a way to extract data from websites where a pure Python implementation is impossible, e.g. because of JS runtime and DOM requirements.

There are Python implementations for executing JS code, and other projects also try to execute JS code in NodeJS, but none of this solves the problem we're facing with the Twitch plugin, not to mention that running untrusted JS code in NodeJS is insanely stupid.

See the following thread/comment which explains this in detail:

- https://github.com/streamlink/streamlink/issues/5370#issuecomment-1582806603

## Rationale

In the linked comment I came to the conclusion that adding `selenium` as a dependency for this task is impossible, because of the detection methods which certain sites implement when the web browser is launched via its `webdriver` interface, which is what Selenium does. It's not just that `Navigator.prototype.webdriver` returns `true` in that case, but there are many other ways for detecting this different environment the browser is run in.

This is a bit unfortunate, because the W3C `webdriver` and `bi-directional webdriver` specifications are the correct way to automate browsers. It sadly limits us to Chromium-based browsers and an implementation based on the Chrome Devtools Protocol (CDP) for remotely and bi-directionally controlling the browser process. The reason for this is that Chromium behaves just like a normally launched browser when it's started with the `--remote-debugging-port=...` parameter (unless the port is set to 0 for some reason), in contrast to being driven via `chromedriver` (which itself communicates via the CDP). Firefox also implements the CDP, but only a tiny subset which is unsuitable for our needs. Safari doesn't support this and other non-Chromium browsers are mostly irrelevant.
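
As a rough illustration of what the `--remote-debugging-port=...` parameter gives us: a browser launched this way serves a small HTTP endpoint at `/json/version`, whose JSON body contains a `webSocketDebuggerUrl` field with the websocket address for the CDP connection. The following is only a minimal stdlib sketch of that discovery step. The function names, the default port `9222` and the timeout are assumptions made for this example, not the planned API.

```python
import json
from urllib.request import urlopen


def version_endpoint(host: str, port: int) -> str:
    """Build the URL of the HTTP endpoint that describes the debugging target."""
    return f"http://{host}:{port}/json/version"


def parse_websocket_url(payload: str) -> str:
    """Extract the browser-level websocket URL from a /json/version response body."""
    data = json.loads(payload)
    try:
        return data["webSocketDebuggerUrl"]
    except KeyError:
        raise ValueError("response does not contain a websocket debugger URL") from None


def discover_websocket_url(host: str = "127.0.0.1", port: int = 9222, timeout: float = 5.0) -> str:
    """Query a running browser (needs Chromium started with --remote-debugging-port=<port>)."""
    with urlopen(version_endpoint(host, port), timeout=timeout) as response:
        return parse_websocket_url(response.read().decode("utf-8"))
```

`discover_websocket_url()` only works while a browser is actually listening on that port. The two helper functions are pure and can be tested without one.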

We can implement this ourselves fairly "easily" without having to rely on many additional dependencies. All that needs to be done is to build APIs for launching the Chromium-based web browser, for the CDP websocket connection and session management, serialization/deserialization classes+methods for the devtools protocol, and finally a CDP client which plugins can use to write extraction logic.
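
To make the connection and serialization part a bit more concrete: CDP messages are JSON objects. A command carries an `id`, a `method` and `params` (plus a `sessionId` when it is addressed to an attached target), the response repeats that `id`, and events arrive without an `id`. The sketch below shows this bookkeeping without any I/O. The class `MessageRouter` and its methods are hypothetical names for this example; the real implementation is meant to be async and will look different.

```python
import itertools
import json
from typing import Any, Callable, Dict, List, Optional


class MessageRouter:
    """Tracks pending CDP commands and routes incoming messages (illustration only, no I/O)."""

    def __init__(self) -> None:
        self._ids = itertools.count(1)
        self.pending: Dict[int, str] = {}  # command id -> method name
        self.results: Dict[int, Any] = {}  # command id -> "result" payload
        self.errors: Dict[int, Any] = {}   # command id -> "error" payload
        self.listeners: Dict[str, List[Callable[[dict], None]]] = {}

    def command(self, method: str, params: Optional[dict] = None, session_id: Optional[str] = None) -> str:
        """Serialize a command; the returned string is what would be sent over the websocket."""
        cmd_id = next(self._ids)
        message: Dict[str, Any] = {"id": cmd_id, "method": method, "params": params or {}}
        if session_id is not None:
            message["sessionId"] = session_id
        self.pending[cmd_id] = method
        return json.dumps(message)

    def on(self, event: str, callback: Callable[[dict], None]) -> None:
        """Register a listener for an event name such as "Page.loadEventFired"."""
        self.listeners.setdefault(event, []).append(callback)

    def receive(self, raw: str) -> None:
        """Deserialize one incoming message and treat it as a response or as an event."""
        message = json.loads(raw)
        if "id" in message:
            # Responses repeat the id of the command they belong to
            cmd_id = message["id"]
            self.pending.pop(cmd_id, None)
            if "error" in message:
                self.errors[cmd_id] = message["error"]
            else:
                self.results[cmd_id] = message.get("result", {})
        else:
            # Events have no id and are dispatched by their method name
            for callback in self.listeners.get(message["method"], []):
                callback(message.get("params", {}))
```

For example, `router.command("Page.navigate", {"url": "https://example.com"})` returns the JSON string to send, and a later `router.receive('{"id": 1, "result": {...}}')` resolves that command.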

On Streamlink's Matrix/Gitter channel, I announced yesterday that I already have the whole thing working for the Twitch plugin, but in a very early stage, without any tests and without proper separation in the published git branch. That's what I'm going to finish now.

## Roadmap

1. **CDP spec implementation**

   - [x] #5381

   Add `script/generate-cdp.py`, which parses the JSON data of the CDP specification and generates {,de}serialization classes and methods for CDP commands and events with proper typing data. The generated Python modules will be committed into the code base and can be updated if needed.

   I will use a modification of the generator script of the [`python-chrome-devtools-protocol`](https://github.com/HyperionGray/python-chrome-devtools-protocol) project, which is MIT licensed, so no problem at all. It works beautifully with my applied fixes and added features.

2. **Implement web browser launcher**

   - [x] #5386 

   This will be based on [`trio`](https://github.com/python-trio/trio) for async I/O code and its async subprocess handling.

   Not writing async code and instead using different threads would make this a thousand times more difficult, especially when writing tests. Compared to `asyncio` of the stdlib, `trio` handles `KeyboardInterrupt`s gracefully and the code is much easier to understand, maintain and test. It's even worth making use of `trio` in other parts of the code at some point in the future, too.

   `trio` is packaged on all major Linux distros.

   The web browser launcher is split into a generic base class and an implementation for Chromium which the CDP client will rely on. The important stuff here is the collection of Chromium executable names, fallback paths and launch arguments.

3. **Implement CDP connection logic and CDP session management**

   - [x] #5388

   Based on [`trio-websocket`](https://github.com/HyperionGray/trio-websocket), which is a bit awkward considering that we already have `websocket-client` as a dependency for the already existing `plugin.api.websocket` implementation. But as said, not writing async code would be insane.

   `trio-websocket` seems to be only partially packaged on Linux. On Arch, it's in the AUR, and Fedora doesn't have an RPM. We can make the whole thing optional though if there are any packaging concerns.

   The CDP connection and session logic will be based on the design of the implementation of `trio-chrome-devtools-protocol` (MIT), with more improvements and fixes.

4. **Implement a CDP client**

   - [x] #5410

   A class with convenience methods for intercepting network requests/responses and for listening to console API calls (for data communication with injected scripts).

   Plugins like Twitch can then set up network hooks for intercepting or reading data, as well as executing scripts. Not going to track this implementation in this thread here.
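
The convenience methods from step 4 could map onto the CDP's `Fetch` and `Runtime` domains roughly as follows. This is a sketch under assumptions, not the client's actual interface: the function names and the `replacements` mapping are made up for this example, while the CDP method names, event names and parameter keys come from the protocol itself (`Fetch.enable`, `Fetch.requestPaused`, `Fetch.continueRequest`, `Fetch.fulfillRequest`, `Runtime.consoleAPICalled`).

```python
import base64
from typing import Any, Dict, List


def enable_interception(url_pattern: str) -> Dict[str, Any]:
    """Command which makes the browser pause matching requests before they are sent."""
    return {
        "method": "Fetch.enable",
        "params": {"patterns": [{"urlPattern": url_pattern, "requestStage": "Request"}]},
    }


def handle_request_paused(event: Dict[str, Any], replacements: Dict[str, str]) -> Dict[str, Any]:
    """Answer a Fetch.requestPaused event: fulfill known URLs ourselves, let the rest continue."""
    request_id = event["requestId"]
    body = replacements.get(event["request"]["url"])
    if body is None:
        return {"method": "Fetch.continueRequest", "params": {"requestId": request_id}}
    return {
        "method": "Fetch.fulfillRequest",
        "params": {
            "requestId": request_id,
            "responseCode": 200,
            "responseHeaders": [{"name": "Content-Type", "value": "text/plain"}],
            # The protocol expects the response body to be base64 encoded
            "body": base64.b64encode(body.encode("utf-8")).decode("ascii"),
        },
    }


def console_values(event: Dict[str, Any]) -> List[Any]:
    """Collect primitive values from a Runtime.consoleAPICalled event (e.g. data logged by an injected script)."""
    return [arg.get("value") for arg in event.get("args", [])]
```

A plugin would send the `Fetch.enable` command once, answer every `Fetch.requestPaused` event with the command returned by `handle_request_paused()`, and read data passed back by injected scripts via `console_values()`.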

Further changes are updates to the CLI argparser and Streamlink session for setting/overriding webbrowser paths, CDP host/port, etc.
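
How such overrides could feed into the launcher is sketched below, assuming a user-supplied executable path takes precedence over the built-in lookup. Everything named here is hypothetical: the candidate names in `EXECUTABLE_NAMES` are placeholders for the real collection of executable names and fallback paths, and the function names are invented for this example. The Chromium switches used (`--remote-debugging-port`, `--user-data-dir`, `--no-first-run`, `--headless=new`) do exist, but the final set of launch arguments may differ. The sketch also only builds the argument list with the stdlib and leaves out the `trio` based subprocess handling.

```python
import shutil
from pathlib import Path
from typing import Iterable, List, Optional

# Placeholder candidates, NOT the real list of names which the launcher will ship with
EXECUTABLE_NAMES = ("chromium", "chromium-browser", "google-chrome", "chrome")


def resolve_executable(
    override: Optional[str] = None,
    names: Iterable[str] = EXECUTABLE_NAMES,
    fallback_paths: Iterable[str] = (),
) -> Optional[str]:
    """Find the browser executable: user override first, then names on PATH, then fallback paths."""
    if override is not None:
        # An explicit override is never silently replaced by a different browser
        return shutil.which(override)
    for name in names:
        found = shutil.which(name)
        if found:
            return found
    for path in fallback_paths:
        if Path(path).is_file():
            return str(path)
    return None


def launch_arguments(executable: str, port: int, user_data_dir: str, headless: bool = False) -> List[str]:
    """Build the argv list for a browser which exposes the CDP on the given port."""
    args = [
        executable,
        f"--remote-debugging-port={port}",
        f"--user-data-dir={user_data_dir}",
        "--no-first-run",
    ]
    if headless:
        args.append("--headless=new")
    return args
```

`resolve_executable()` returns `None` if nothing was found, so the caller can raise a readable error instead of failing when the subprocess is spawned.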

I hope to have everything ready by the time Twitch turns the requirement on again.

## Concerns

- **Maintenance burden**
  I personally don't care because I'm implementing and will very likely be maintaining this myself anyway.
  The only added dependencies will be `trio`, `trio-websocket`, `typing-extensions` and `pytest-trio` (as well as `inflection` for the generator script, which is irrelevant for packagers). As mentioned above, we can make everything optional, so if a distributor doesn't have those packages, their users won't be able to use any of this.
- **Having the CDP classes in the code base will bloat it up**
  Unfortunate requirement, but I will limit this only to the needed/relevant CDP domains.
- **Chromium-based browser limitation due to the webdriver interface**
  Unfortunate restriction. I explained this above and in the linked comment.
- **Headless mode**
  ~~Since certain sites like Twitch also detect the browser when it's run in headless mode, this will make using such plugins without a desktop environment (loosely speaking) impossible. That is nothing which we can fix though unless we can circumvent such detections, which might be impossible ("canvas fingerprinting", etc.). There are lots of projects on the web which try to achieve this. I'm unaware of any successful ones.~~
  Headless mode apparently does work. We can add a switch for this if it causes problems.
- **User-friendliness**
  Having to launch the web browser, which will display its window for a split second if not run in headless mode, can be annoying, but once again, this is nothing which we can fix if headless mode gets detected on some sites. If other sites don't detect headless mode, then there won't be any distraction/disruption. The browser window size can be shrunk to its minimum. There's no way to launch Chromium in minimized mode, but maybe it's possible to force a certain window title, so users can apply rules to their window manager, etc.
