Tags: Tencent/TurboTransformers
Albert model aware (#202)
* pass the unit test
* the Albert model uses the model-aware allocator
* polish the Albert unit test
* support variable-sequence-length benchmarking for Albert (see the sketch after this entry)
* better logging for the GPU benchmark
* polish the code and the benchmark scripts
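A minimal sketch of what variable-sequence-length benchmarking looks like, in the spirit of the #202 change. It uses the Hugging Face AlbertModel as a stand-in for the TurboTransformers Albert runtime; the length range, batch size, and timing loop are illustrative assumptions, not the repository's actual benchmark script.

```python
# Hypothetical variable-sequence-length benchmark sketch (not the repo's script).
import random
import time

import torch
from transformers import AlbertModel

model = AlbertModel.from_pretrained("albert-base-v2")
model.eval()

random.seed(0)
for _ in range(10):
    seq_len = random.randint(10, 500)  # draw a different sequence length each run
    input_ids = torch.randint(0, 30000, (1, seq_len), dtype=torch.long)
    start = time.time()
    with torch.no_grad():
        model(input_ids)
    print(f"seq_len={seq_len:4d} latency={(time.time() - start) * 1e3:.2f} ms")
```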
Jiaruifang/fix onnxrt docker (#152)
* the onnxruntime CPU and GPU packages are not compatible (see the check sketched after this entry)
* update the README
* the Docker CI uses the onnxruntime CPU version only
* pin the Miniconda version; the CI test Docker image comes from Docker Hub
* get the CI test to pass
* fix Miniconda's version to py3.7
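For context on the incompatibility the #152 fix works around: the `onnxruntime` and `onnxruntime-gpu` wheels install the same module and cannot coexist in one environment, so CI pins the CPU-only wheel. A quick, hedged way to check which build is installed (this snippet is illustration only, not part of the CI scripts above):

```python
# Verify which onnxruntime build is present; the CPU wheel reports only
# CPUExecutionProvider, while onnxruntime-gpu also exposes CUDA.
import onnxruntime as ort

print(ort.get_device())               # "CPU" for the CPU wheel, "GPU" for onnxruntime-gpu
print(ort.get_available_providers())  # e.g. ['CPUExecutionProvider']
```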
Develop (#86)
* Jiaruifang/update readme (#14): update the README; set the PyTorch version to 1.4.0; use 0.2.0 as the Turbo version
* Jiaruifang/update readme (#16): upgrade onnxruntime to v1.2.0 in the dev CPU Docker image; document how to get a CPU version from Docker Hub
* Jiaruifang/update readme (#17):
  * update the README and use 0.2.0 as the version
  * add a benchmark comparing against gemmlowp
  * return hidden_states from BertModel
  * update the P40 speedup figure
  * add a GitHub Actions hook for the develop branch
  * revert the matmul benchmark unit test
  * set the PyTorch CPU version to 1.4.0; remove torchvision from docker_ci
  * upgrade onnxruntime to v1.2.0 in the dev CPU Docker image; add how to get a CPU version from Docker Hub
  * fix typos
* delete "turbotransformer." and add a blank line in the README (#20)
* Jiaruifang/polish (#30): remove duplicated license comments; update the README to describe variable-length support for onnxruntime more accurately
* hidden state was added to the BERT layer, so fix sequence classification accordingly (#36)
* Jiaruifang/amd blis (#69): add BLIS support for AMD CPUs
* Jiaruifang/decoder gpu allocator (#85):
  * Jiaruifang/multi head attn (#29): add a more functional MultiHeadedAttention; add position-wise feed-forward
  * Jiaruifang/transformer decoder layer (#32): add TransformerDecoderLayer
  * Jiaruifang/transformer decoder layer (#33): add TransformerDecoderLayer; check that multi-headed attn's max_relative_positions is 0
  * Jiaruifang/transformer decoder (#35): fix a multi_headed_attention_test.py bug
  * Jiaruifang/fixbug multiheadedattn (#40):
    * return attns from the decoder and check them in decoder_transformer_decoder_layer_test
    * fix a multi_headed_attention_test.py bug
    * add a set_stderr_verbose_level Python interface
    * add a profiling method for decoder_multi_headed_attn_test
    * fix mask-caused bugs in MultiHeadedAttention
    * set the WITH_PROFILER option in CMakeLists to OFF; fix a profiler bug
  * Jiaruifang/weight trans ffn (#43): profile the FFN; tune weight transpose for Intel 61xx; fine-tune the multi_headed_attention layer; fix some bugs
  * Jiaruifang/merge bert multiheaded attn (#49): use MultiHeadedAttention to do BERT attention
  * Jiaruifang/gpu decoder (#51): add a GPU transformer decoder implementation; using cub's CachingAllocator still has bugs to fix; performance to be tuned
  * add LayerNorm support to MultiHeadedAttention's from_torch; fix a bug in from_torch of MultiHeadedAttention
  * fix bugs from attn masks in the transformer decoder layer (#64); polish code
  * Jiaruifang/debug decoder layer mask (#68): change the transformer decoder mask from float to bool; let MultiHeadedAttention take layer_cache as an input parameter; add layer_cache for self-attention
  * softmax supports a 3D mask (#72): GPU softmax supports a 3D mask
  * Develop (#74): add BLIS support for AMD CPUs
  * initialize the best-fit CUDA allocator (a hedged sketch of the best-fit strategy follows this list); fix a GetInstance bug; remove the temp tensor
  * add CUDA allocator unit tests; fix a bug in the best-fit CUDA allocator; add more unit tests (one broken version left all GPU unit tests failing)
  * add comments for best-fit and bump the release version
  * merge the decoder and the best-fit CUDA memory allocator; update the README
* Jiaruifang/cpu allocator (#88): add a CPU best-fit allocator
* Jiaruifang/debug decoder layer mask (#89): add a CPU best-fit allocator; fix a bug in the allocator test; fix the tgt_pad_mask bug; update the README; revert to the cub allocator
* Jiaruifang/benchmark amd blas (#90): polish the benchmark code for BLAS on AMD CPUs; add a general GEMM benchmark; show the BLAS type in matmul_benchmark
* Jiaruifang/gpu timer (#91): add a GPU profiler; fix a bug caused by attn_score in BERT attention
* Jiaruifang/gpu concat (#92): accelerate GPU concat; add a loss file
* Jiaruifang/profiler kernels (#97): print profiling results in increasing order (see the profiling sketch after this list); fix the best-fit CUDA allocator bug; move the profiler into functions
* Jiaruifang/fix bestfit bug (#98): fix a bug in the C++ mask (#95); fix the best-fit allocator bug; update the README
* Jiaruifang/fix bestfit bug (#99): the same fixes as #98, plus a missing file
* update the README and fix the attn_score bug in bert_attn (#100); fix a shared_ptr bug; fix a CUDA C++11 bug
* Jiaruifang/decoder readme (#101): update the README (carries the #100 fixes)

Co-authored-by: shicheng <523656402@qq.com>
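The entries above repeatedly mention best-fit allocators for CUDA and CPU memory. A minimal sketch of the best-fit strategy itself, assuming a single pool with a flat free list: serve each request from the smallest free block that is large enough, splitting off the remainder. This illustrates the algorithm only; it is not TurboTransformers' C++/CUDA implementation, which also handles alignment and device memory.

```python
# Simplified best-fit allocator sketch (illustrative, not the repo's code).
class BestFitAllocator:
    def __init__(self, pool_size):
        # free list of (offset, size) blocks; starts as one big block
        self.free_blocks = [(0, pool_size)]

    def allocate(self, size):
        # best fit: pick the smallest free block that can hold the request
        candidates = [b for b in self.free_blocks if b[1] >= size]
        if not candidates:
            raise MemoryError(f"no free block of {size} bytes")
        offset, block_size = min(candidates, key=lambda b: b[1])
        self.free_blocks.remove((offset, block_size))
        if block_size > size:  # split off the unused remainder
            self.free_blocks.append((offset + size, block_size - size))
        return offset

    def free(self, offset, size):
        # return the block; a real allocator would coalesce neighbors here
        self.free_blocks.append((offset, size))


alloc = BestFitAllocator(1024)
a = alloc.allocate(100)  # offset 0
b = alloc.allocate(200)  # offset 100
alloc.free(a, 100)
c = alloc.allocate(64)   # reuses the freed 100-byte block: it is the best fit
```

Best fit trades a slower search for less wasted space per block than first fit, which matters when activation tensors of many different sizes are allocated and freed every inference step.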
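The profiler work in #91 and #97 describes a pattern of accumulating per-kernel elapsed time and printing the totals sorted in increasing order. A hedged sketch of that pattern, assuming a simple context manager and made-up kernel names; the actual profiler lives in the C++ core:

```python
# Illustrative profiling-aggregation sketch (not the repo's C++ profiler).
import time
from collections import defaultdict
from contextlib import contextmanager

timings = defaultdict(float)

@contextmanager
def profile(name):
    # accumulate wall-clock time under the given kernel name
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] += time.perf_counter() - start

with profile("gemm"):
    sum(i * i for i in range(100_000))
with profile("softmax"):
    sum(i for i in range(10_000))

# print profiling results in increasing order, as #97 describes
for name, elapsed in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name:10s} {elapsed * 1e3:8.3f} ms")
```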