Description
Meta-thread for the implementation of data extraction via a web browser.
Motivation
We need a way to extract data from websites which make it impossible to do this in a pure Python implementation, e.g. because of JS runtime and DOM requirements, etc.
There are Python implementations for executing JS code, and other projects also try to execute JS code in NodeJS, but none of this solves the problem we're facing with the Twitch plugin, not to mention that running untrusted JS code in NodeJS is insanely stupid.
See the following thread/comment which explains this in detail:
Rationale
In the linked comment I came to the conclusion that adding `selenium` as a dependency for this task is impossible, because of the detection methods which certain sites implement when the web browser is launched via its `webdriver` interface, which is what Selenium does. It's not just that `Navigator.prototype.webdriver` returns `true` in that case, but there are many other ways for detecting this different environment the browser is run in.
This is a bit unfortunate, because the W3C `webdriver` and bi-directional `webdriver` specifications are the correct way to automate browsers, and it sadly limits us to Chromium-based browsers and an implementation based on the Chrome DevTools Protocol (CDP) for remotely and bi-directionally controlling the browser process. The reason for this is that Chromium works just like normal when it's launched with the `--remote-debugging-port=...` parameter (unless the port is set to 0 for some reason) compared to using `chromedriver` (which itself is communicating via the CDP). Firefox also implements the CDP, but only a tiny subset which is unsuitable for our needs. Safari doesn't support this and other non-Chromium browsers are mostly irrelevant.
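As a rough illustration of the launch approach described above, the following sketch locates a Chromium-like executable and builds its argument list. The candidate executable names, the function names and the extra `--user-data-dir` and `--no-first-run` flags are assumptions made for this example, not the launcher's actual collection of names and arguments.

```python
import shutil
from typing import List, Optional, Sequence

# Hypothetical candidate names, for illustration only.
CANDIDATE_NAMES = ("chromium", "chromium-browser", "google-chrome", "chrome")


def find_chromium(names: Sequence[str] = CANDIDATE_NAMES) -> Optional[str]:
    """Return the path of the first candidate found on PATH, or None."""
    for name in names:
        path = shutil.which(name)
        if path:
            return path
    return None


def build_launch_args(executable: str, port: int, user_data_dir: str) -> List[str]:
    """Build the argument list for a browser with remote debugging enabled."""
    # The text above notes that port 0 is a special case, so this sketch
    # insists on an explicit, non-zero port.
    if not 0 < port < 65536:
        raise ValueError(f"invalid remote debugging port: {port}")
    return [
        executable,
        f"--remote-debugging-port={port}",
        # Assumption: a separate profile directory keeps the user's profile untouched.
        f"--user-data-dir={user_data_dir}",
        "--no-first-run",
    ]
```

The resulting list could then be handed to whichever subprocess API the launcher ends up using.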
We can implement this ourselves fairly "easily" without having to rely on many additional dependencies. All that needs to be done is build APIs for launching the chromium-based web browser, the CDP websocket connection and session management, serialization/deserialization classes+methods for the devtools protocol, and finally a CDP client which plugins can use to write extraction logic.
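To make the serialization part more concrete: CDP messages are JSON objects, where commands carry an `id`, a `method` and `params`, responses echo the `id` with a `result` or an `error`, and events have a `method` but no `id`. The sketch below shows one possible way of encoding commands and classifying incoming messages. The `CommandEncoder` and `classify` names are hypothetical, and this is not the design of the actual implementation.

```python
import itertools
import json
from typing import Any, Dict, Optional, Tuple


class CommandEncoder:
    """Assigns increasing command IDs and serializes CDP commands to JSON."""

    def __init__(self) -> None:
        self._ids = itertools.count(1)

    def encode(
        self,
        method: str,
        params: Optional[Dict[str, Any]] = None,
        session_id: Optional[str] = None,
    ) -> Tuple[int, str]:
        cmd_id = next(self._ids)
        message: Dict[str, Any] = {"id": cmd_id, "method": method, "params": params or {}}
        if session_id is not None:
            # Commands for an attached target carry the session ID at the top level.
            message["sessionId"] = session_id
        return cmd_id, json.dumps(message)


def classify(raw: str) -> Tuple[str, Dict[str, Any]]:
    """Tell responses, errors and events apart after decoding a message."""
    data = json.loads(raw)
    if "id" in data:
        return ("error" if "error" in data else "response"), data
    if "method" in data:
        return "event", data
    raise ValueError("not a CDP response or event")
```

A connection class would keep a mapping from command ID to the waiting caller and resolve it when a message classified as `response` or `error` with the same ID arrives.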
On Streamlink's Matrix/Gitter channel I announced yesterday that I already have the whole thing working for the Twitch plugin, but at a very early stage, without any tests and without proper separation in the published git branch. That's what I'm going to finish now.
Roadmap
- CDP spec implementation
  Add `script/generate-cdp.py` which parses the JSON data of the CDP specification and which then generates {,de}serialization classes and methods for CDP commands and events with proper typing data. Those generated Python modules will be committed into the code base and can be updated if needed.
  I will use a modification of the generator script of the `python-chrome-devtools-protocol` project, which is MIT licensed, so no problem at all. It works beautifully with my applied fixes and added features.
- Implement web browser launcher
  This will be based on `trio` for async I/O code and its async subprocess handling.
  Not writing async code and instead using different threads would make this a thousand times more difficult, especially when writing tests. Compared to `asyncio` of the stdlib, `trio` handles `KeyboardInterrupt`s gracefully and the code is much easier to understand, maintain and test. It's even worth making use of `trio` in other parts of the code at some point in the future, too. `trio` is packaged on all major Linux distros.
  The web browser launcher is split into a generic base class and an implementation for Chromium which the CDP client will rely on. The important stuff here is the collection of Chromium executable names, fallback paths and launch arguments.
- Implement CDP connection logic and CDP session management
  Based on `trio-websocket`, which is a bit awkward considering that we already have `websocket-client` as a dependency for the already existing `plugin.api.websocket` implementation. But as said, not writing async code would be insane. `trio-websocket` seems to be only partially packaged on Linux. On Arch, it's in the AUR, and Fedora doesn't have an RPM. We can make the whole thing optional though if there are any packaging concerns.
  The CDP connection and session logic will be based on the design of the implementation of `trio-chrome-devtools-protocol` (MIT), with more improvements and fixes.
- Implement a CDP client
  A class with convenience methods for intercepting network requests/responses and for listening to console API calls (for data communication with injected scripts).
  Plugins like Twitch can then set up network hooks for intercepting or reading data, as well as executing scripts. Not going to track this implementation in this thread here.
Further changes are updates to the CLI argparser and Streamlink session for setting/overriding webbrowser paths, CDP host/port, etc.
I hope that I will have everything ready by the time Twitch turns the requirement on again.
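To illustrate the code generation step from the roadmap, here is a minimal sketch which turns one command definition, written in the shape used by the CDP specification's JSON files, into the source of a Python function that builds the command payload. This is not the `python-chrome-devtools-protocol` generator: the spec fragment is cut down by hand, `generate_command` and `snake_case` are hypothetical helpers, and a regular expression stands in for the `inflection` package.

```python
import json
import re
from typing import Any, Dict, Optional

# Hand-written fragment in the shape of the CDP specification JSON.
SPEC_FRAGMENT = json.loads("""
{
  "domains": [{
    "domain": "Page",
    "commands": [{
      "name": "navigate",
      "parameters": [
        {"name": "url", "type": "string"},
        {"name": "referrer", "type": "string", "optional": true}
      ]
    }]
  }]
}
""")

PY_TYPES = {"string": "str", "integer": "int", "number": "float", "boolean": "bool"}


def snake_case(name: str) -> str:
    """camelCase -> snake_case (simple stand-in for inflection.underscore)."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def generate_command(domain: str, command: Dict[str, Any]) -> str:
    """Return Python source for a function building the command's JSON payload."""
    # Required parameters must come before optional ones in the signature.
    params = sorted(command.get("parameters", []), key=lambda p: bool(p.get("optional")))
    args, body = [], []
    for param in params:
        key = repr(param["name"])
        arg = snake_case(param["name"])
        py_type = PY_TYPES.get(param.get("type"), "Any")
        if param.get("optional"):
            args.append(f"{arg}: Optional[{py_type}] = None")
            body.append(f"    if {arg} is not None:")
            body.append(f"        params[{key}] = {arg}")
        else:
            args.append(f"{arg}: {py_type}")
            body.append(f"    params[{key}] = {arg}")
    method = repr(f"{domain}.{command['name']}")
    header = f"def {snake_case(command['name'])}({', '.join(args)}) -> dict:"
    footer = f"    return {{'method': {method}, 'params': params}}"
    return "\n".join([header, "    params = {}", *body, footer])


# Generate the function source and load it into a namespace for a quick check.
source = generate_command("Page", SPEC_FRAGMENT["domains"][0]["commands"][0])
namespace: Dict[str, Any] = {"Optional": Optional, "Any": Any}
exec(source, namespace)
```

A real generator would additionally cover types, events and return values, and would write the generated modules to disk instead of executing them.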
Concerns
- Maintenance burden
  I personally don't care because I'm implementing and will very likely be maintaining this myself anyway.
  The only added dependencies will be `trio`, `trio-websocket`, `typing-extensions` and `pytest-trio` (as well as `inflection` for the generator script, which is irrelevant for packagers). As mentioned above, we can make everything optional, so if a distributor doesn't have those packages, their users won't be able to use any of this.
- Having the CDP classes in the code base will bloat it up
  Unfortunate requirement, but I will limit this only to the needed/relevant CDP domains.
- Chromium-based browser limitation due to the webdriver interface
  Unfortunate restriction. I explained this above and in the linked comment.
- Headless mode
  Since certain sites like Twitch also detect the browser when it's run in headless mode, this will make using such plugins without a desktop environment (loosely speaking) impossible. That is nothing which we can fix though unless we can circumvent such detections, which might be impossible ("canvas fingerprinting", etc.). There are lots of projects on the web which try to achieve this. I'm unaware of any successful ones.
  Headless mode apparently does work. We can add a switch for this if it causes problems.
- User-friendliness
  Having to launch the web browser, which will display its window for a split second if not run in headless mode, can be annoying, but once again, this is nothing which we can fix if headless mode gets detected on some sites. If other sites don't detect headless mode, then there won't be any distraction/disruption. The browser window size can be shrunk to its minimum. There's no way to launch Chromium in minimized mode, but maybe it's possible to force a certain window title, so users can apply rules to their window manager, etc.
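Regarding the optional dependencies mentioned under the maintenance concern, one possible way of gating the feature is to check for the importable modules up front and fail with a clear message. This is only a sketch: the function and exception names are hypothetical, and `trio_websocket` is the import name of the `trio-websocket` package.

```python
import importlib.util
from typing import List, Sequence

# Import names of the optional runtime dependencies.
OPTIONAL_MODULES = ("trio", "trio_websocket")


class WebbrowserUnavailableError(RuntimeError):
    """Raised when the optional webbrowser dependencies are not installed."""


def missing_modules(names: Sequence[str] = OPTIONAL_MODULES) -> List[str]:
    """Return the top-level modules from `names` which can't be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def require_webbrowser_support(names: Sequence[str] = OPTIONAL_MODULES) -> None:
    """Raise a descriptive error if any optional dependency is missing."""
    missing = missing_modules(names)
    if missing:
        raise WebbrowserUnavailableError(
            "webbrowser support requires these missing modules: " + ", ".join(missing)
        )
```

A plugin which needs the web browser could call such a check before doing anything else, so that users of distributions without those packages get an understandable error.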