Bert-Geosite-Classification

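This is a PyTorch machine learning project that trains a binary classifier for HTTP content.

It first downloads data using the geosite website lists, then trains the model by fine-tuning BERT.

An API for classification is provided.

git clone the two models below: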

https://huggingface.co/e1732a364fed/bert-geosite-classification-head-v1/tree/main

https://huggingface.co/e1732a364fed/bert-geosite-classification-body-v1/tree/main

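Then rename the two cloned folders to bert_geosite_by_head (head model) and bert_geosite_by_body (body model).

Once the models are downloaded, you can skip straight to step 3 below and run predictions.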
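The rename step can also be scripted. The following is a minimal sketch, assuming both repositories were cloned with git's default folder names (the last path component of each repository URL) into one directory. The helper name rename_model_dirs is hypothetical and not part of this project.

```python
from pathlib import Path

# Default `git clone` folder name -> folder name this README asks for.
RENAMES = {
    "bert-geosite-classification-head-v1": "bert_geosite_by_head",
    "bert-geosite-classification-body-v1": "bert_geosite_by_body",
}


def rename_model_dirs(root="."):
    """Rename freshly cloned model folders; return the expected names that now exist."""
    root = Path(root)
    present = []
    for cloned, wanted in RENAMES.items():
        src, dst = root / cloned, root / wanted
        # Only rename when the clone is there and the target is still free.
        if src.is_dir() and not dst.exists():
            src.rename(dst)
        if dst.is_dir():
            present.append(wanted)
    return present
```

Calling rename_model_dirs() in the project folder returns the list of model folders that are in place, so an empty list means nothing was cloned yet.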

This project is already in use in the ruci proxy project: geosite_gfw

Steps

install requirements

pip install transformers numpy scikit-learn flask requests
pip install "requests[socks]"
pip install torch --index-url https://download.pytorch.org/whl/cu124

If you are not using an NVIDIA GPU, you may omit the --index-url parameter.

Download bert-base-multilingual-cased from Hugging Face and store the files in the ./bert/ folder.

Alternatively, use uv directly:

uv venv
source .venv/bin/activate
uv pip install -r requirements.txt

(requirements.txt was generated with uv pip freeze > requirements.txt)

If you prefer venv over uv, run the following commands:

python3 -m venv venv
source venv/bin/activate
pip3 install -r requirements.txt

1. pull geosite responses data with `python pull.py`

Or specify the geosite list with python pull.py -l proxy-list.txt

The project contains 2 geosite lists from https://github.com/Loyalsoldier/v2ray-rules-dat, but they may be out of date. You can use your own list file.

This generates a {list_name}_out folder that contains the response of each website.
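To illustrate the data flow of this step, here is a minimal sketch of reading a list file and deriving the output folder name. It is not the code in pull.py. The function names are hypothetical, and the sketch assumes that the list has one entry per line, that a full: prefix marks an exact domain, that regexp: entries and # comments are skipped, and that {list_name} is the file name without its extension.

```python
from pathlib import Path


def read_domains(list_path):
    """Return the plain domains found in a geosite-style list file."""
    domains = []
    for raw in Path(list_path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Assumption: blank lines, comments and regex rules are skipped.
        if not line or line.startswith("#") or line.startswith("regexp:"):
            continue
        # Assumption: "full:" marks an exact domain, so only the prefix is dropped.
        if line.startswith("full:"):
            line = line[len("full:"):]
        domains.append(line)
    return domains


def output_dir_for(list_path):
    """Assumed naming rule: proxy-list.txt -> proxy-list_out."""
    return Path(list_path).stem + "_out"
```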

2. train model with

python classify.py --mode=train_head
python classify.py --mode=train_body

Each command generates its trained model file.

You can set the ok and ban directories with --ok_dir and --ban_dir.

You can also download the pretrained model files instead of training your own.
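As a rough picture of how an ok folder and a ban folder become a binary-labeled dataset, the sketch below walks both folders and pairs each file's text with a label. This is an assumption about the data layout, not the code in classify.py: the function name load_labeled_texts, the label values (0 for ok, 1 for ban), and the one-response-per-file layout are all hypothetical.

```python
from pathlib import Path

LABELS = {"ok": 0, "ban": 1}  # assumed label encoding, not taken from classify.py


def load_labeled_texts(ok_dir, ban_dir):
    """Return (texts, labels) read from the two folders, ok files first."""
    texts, labels = [], []
    for name, folder in (("ok", ok_dir), ("ban", ban_dir)):
        for path in sorted(Path(folder).iterdir()):
            if path.is_file():
                # Saved responses may contain invalid bytes; replace them instead of failing.
                texts.append(path.read_text(encoding="utf-8", errors="replace"))
                labels.append(LABELS[name])
    return texts, labels
```

The two returned lists have the same length and can be passed to a tokenizer and a train/test split.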

3. predict with

python classify.py --mode predict_head --text "Your input text here"
python classify.py --mode predict_body --text "Your input text here"

for example,

# this will return Prediction: ban
python classify.py --mode predict_body --text "<body>google</body>"

# this will return Prediction: ban
python classify.py --mode predict_body --text "<body>porn</body>"

# this will return Prediction: ok
python classify.py --mode predict_body --text "<body>baidu</body>"
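If you want to call the command-line predictor from another Python script, one option is to run it as a subprocess and parse its output. The sketch below assumes the result is printed on its own line in the form Prediction: <label>, as the comments above suggest. The helper names parse_prediction and predict_body are hypothetical.

```python
import re
import subprocess
import sys


def parse_prediction(output):
    """Extract the label from a 'Prediction: <label>' line, or None if absent."""
    match = re.search(r"^Prediction:\s*(\w+)\s*$", output, flags=re.MULTILINE)
    return match.group(1) if match else None


def predict_body(text):
    """Run classify.py in predict_body mode; classify.py must be in the current directory."""
    result = subprocess.run(
        [sys.executable, "classify.py", "--mode", "predict_body", "--text", text],
        capture_output=True,
        text=True,
        check=True,  # raise if classify.py exits with an error
    )
    return parse_prediction(result.stdout)
```

Each call starts a new process and loads the model again, so for repeated predictions the API in step 4 is likely the better fit.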

4. serve api with

python classify.py --mode serve_api --port 5134

Tested on a Mac, memory usage is 325.8 MB.

5. request

predict by passing the data

curl -X POST http://localhost:5134/predict \
    -H "Content-Type: application/json" \
    -d '{"text": "<body>your website http response body</body>", "model_name": "body"}'

response:

{
  "result": "ok"
}

or

{
  "result": "ban"
}
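The same /predict call can be made from Python. The sketch below uses only the standard library and mirrors the curl example above. The function names build_predict_request and predict are hypothetical, and the server is assumed to be running locally on port 5134 as in step 4.

```python
import json
import urllib.request


def build_predict_request(text, model_name="body", base_url="http://localhost:5134"):
    """Build the POST request for /predict without sending it."""
    payload = json.dumps({"text": text, "model_name": model_name}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/predict",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(text, model_name="body", timeout=10):
    """Send the request and return the "result" field of the JSON response."""
    request = build_predict_request(text, model_name)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["result"]
```

With the server running, predict("<body>your website http response body</body>") returns either "ok" or "ban".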

check by passing the domain

curl -X POST http://localhost:5134/check \
    -H "Content-Type: application/json" \
    -d '{"domain": "www.baidu.com"}'

with proxy:

curl -X POST http://localhost:5134/check \
    -H "Content-Type: application/json" \
    -d '{"domain": "www.google.com", "socks5_proxy": "127.0.0.1:10800"}'
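The /check endpoint can be called from Python in the same way. This is a minimal standard-library sketch that mirrors the two curl examples above. The function names are hypothetical, and because the shape of the /check response is not documented here, the parsed JSON is returned as-is.

```python
import json
import urllib.request


def build_check_request(domain, socks5_proxy=None, base_url="http://localhost:5134"):
    """Build the POST request for /check; socks5_proxy is only sent when given."""
    body = {"domain": domain}
    if socks5_proxy:
        body["socks5_proxy"] = socks5_proxy  # e.g. "127.0.0.1:10800"
    return urllib.request.Request(
        base_url + "/check",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def check(domain, socks5_proxy=None, timeout=30):
    """Send the request and return the decoded JSON response unchanged."""
    request = build_check_request(domain, socks5_proxy)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

The timeout default is longer than a plain prediction would need, on the assumption that the server has to fetch the domain before classifying it.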

for more arguments and options, see the source code

benchmark

benchmark.py benchmarks the cpu and mps devices. On macOS, mps is much faster than cpu. Run python benchmark.py to see how fast it is on your Mac.
