Tags: Tencent/TurboTransformers
Albert model aware (#202)
* pass the unit test
* the Albert model uses the model-aware allocator
* polish the Albert unit test
* support variable-sequence-length benchmarking for Albert (see the sketch after this entry)
* better logging for the GPU benchmark
* polish the code and the benchmark scripts
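A minimal sketch of what variable-sequence-length benchmarking looks like, in the spirit of the #202 change. It uses the Hugging Face AlbertModel as a stand-in for the TurboTransformers Albert runtime; the length range, batch size, and timing loop are illustrative assumptions, not the repository's actual benchmark script.

```python
# Hypothetical variable-sequence-length benchmark sketch (not the repo's script).
import random
import time

import torch
from transformers import AlbertModel

model = AlbertModel.from_pretrained("albert-base-v2")
model.eval()

random.seed(0)
for _ in range(10):
    seq_len = random.randint(10, 500)  # draw a different sequence length each run
    input_ids = torch.randint(0, 30000, (1, seq_len), dtype=torch.long)
    start = time.time()
    with torch.no_grad():
        model(input_ids)
    print(f"seq_len={seq_len:4d} latency={(time.time() - start) * 1e3:.2f} ms")
```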
Jiaruifang/fix onnxrt docker (#152)
* the onnxruntime CPU and GPU packages are not compatible (see the check sketched after this entry)
* update the README
* the Docker CI uses the onnxruntime CPU version only
* pin the Miniconda version; the CI test Docker image comes from Docker Hub
* get the CI test to pass
* fix Miniconda's version to py3.7
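For context on the incompatibility the #152 fix works around: the `onnxruntime` and `onnxruntime-gpu` wheels install the same module and cannot coexist in one environment, so CI pins the CPU-only wheel. A quick, hedged way to check which build is installed (this snippet is illustration only, not part of the CI scripts above):

```python
# Verify which onnxruntime build is present; the CPU wheel reports only
# CPUExecutionProvider, while onnxruntime-gpu also exposes CUDA.
import onnxruntime as ort

print(ort.get_device())               # "CPU" for the CPU wheel, "GPU" for onnxruntime-gpu
print(ort.get_available_providers())  # e.g. ['CPUExecutionProvider']
```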
Develop (#86)
* Jiaruifang/update readme (#14): update the README; set the PyTorch version to 1.4.0; use 0.2.0 as the Turbo version
* Jiaruifang/update readme (#16): upgrade onnxruntime to v1.2.0 in the dev CPU Docker image; document how to get a CPU version from Docker Hub
* Jiaruifang/update readme (#17):
  * update the README and use 0.2.0 as the version
  * add a benchmark comparing against gemmlowp
  * return hidden_states from BertModel
  * update the P40 speedup figure
  * add a GitHub Actions hook for the develop branch
  * revert the matmul benchmark unit test
  * set the PyTorch CPU version to 1.4.0; remove torchvision from docker_ci
  * upgrade onnxruntime to v1.2.0 in the dev CPU Docker image; add how to get a CPU version from Docker Hub
  * fix typos
* delete "turbotransformer." and add a blank line in the README (#20)
* Jiaruifang/polish (#30): remove duplicated license comments; update the README to describe variable-length support for onnxruntime more accurately
* hidden state was added to the BERT layer, so fix sequence classification accordingly (#36)
* Jiaruifang/amd blis (#69): add BLIS support for AMD CPUs
* Jiaruifang/decoder gpu allocator (#85):
  * Jiaruifang/multi head attn (#29): add a more functional MultiHeadedAttention; add position-wise feed-forward
  * Jiaruifang/transformer decoder layer (#32): add TransformerDecoderLayer
  * Jiaruifang/transformer decoder layer (#33): add TransformerDecoderLayer; check that multi-headed attn's max_relative_positions is 0
  * Jiaruifang/transformer decoder (#35): fix a multi_headed_attention_test.py bug
  * Jiaruifang/fixbug multiheadedattn (#40):
    * return attns from the decoder and check them in decoder_transformer_decoder_layer_test
    * fix a multi_headed_attention_test.py bug
    * add a set_stderr_verbose_level Python interface
    * add a profiling method for decoder_multi_headed_attn_test
    * fix mask-caused bugs in MultiHeadedAttention
    * set the WITH_PROFILER option in CMakeLists to OFF; fix a profiler bug
  * Jiaruifang/weight trans ffn (#43): profile the FFN; tune weight transpose for Intel 61xx; fine-tune the multi_headed_attention layer; fix some bugs
  * Jiaruifang/merge bert multiheaded attn (#49): use MultiHeadedAttention to do BERT attention
  * Jiaruifang/gpu decoder (#51): add a GPU transformer decoder implementation; using cub's CachingAllocator still has bugs to fix; performance to be tuned
  * add LayerNorm support to MultiHeadedAttention's from_torch; fix a bug in from_torch of MultiHeadedAttention
  * fix bugs from attn masks in the transformer decoder layer (#64); polish code
  * Jiaruifang/debug decoder layer mask (#68): change the transformer decoder mask from float to bool; let MultiHeadedAttention take layer_cache as an input parameter; add layer_cache for self-attention
  * softmax supports a 3D mask (#72): GPU softmax supports a 3D mask
  * Develop (#74): add BLIS support for AMD CPUs
  * initialize the best-fit CUDA allocator (a hedged sketch of the best-fit strategy follows this list); fix a GetInstance bug; remove the temp tensor
  * add CUDA allocator unit tests; fix a bug in the best-fit CUDA allocator; add more unit tests (one broken version left all GPU unit tests failing)
  * add comments for best-fit and bump the release version
  * merge the decoder and the best-fit CUDA memory allocator; update the README
* Jiaruifang/cpu allocator (#88): add a CPU best-fit allocator
* Jiaruifang/debug decoder layer mask (#89): add a CPU best-fit allocator; fix a bug in the allocator test; fix the tgt_pad_mask bug; update the README; revert to the cub allocator
* Jiaruifang/benchmark amd blas (#90): polish the benchmark code for BLAS on AMD CPUs; add a general GEMM benchmark; show the BLAS type in matmul_benchmark
* Jiaruifang/gpu timer (#91): add a GPU profiler; fix a bug caused by attn_score in BERT attention
* Jiaruifang/gpu concat (#92): accelerate GPU concat; add a loss file
* Jiaruifang/profiler kernels (#97): print profiling results in increasing order (see the profiling sketch after this list); fix the best-fit CUDA allocator bug; move the profiler into functions
* Jiaruifang/fix bestfit bug (#98): fix a bug in the C++ mask (#95); fix the best-fit allocator bug; update the README
* Jiaruifang/fix bestfit bug (#99): the same fixes as #98, plus a missing file
* update the README and fix the attn_score bug in bert_attn (#100); fix a shared_ptr bug; fix a CUDA C++11 bug
* Jiaruifang/decoder readme (#101): update the README (carries the #100 fixes)

Co-authored-by: shicheng <523656402@qq.com>
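The entries above repeatedly mention best-fit allocators for CUDA and CPU memory. A minimal sketch of the best-fit strategy itself, assuming a single pool with a flat free list: serve each request from the smallest free block that is large enough, splitting off the remainder. This illustrates the algorithm only; it is not TurboTransformers' C++/CUDA implementation, which also handles alignment and device memory.

```python
# Simplified best-fit allocator sketch (illustrative, not the repo's code).
class BestFitAllocator:
    def __init__(self, pool_size):
        # free list of (offset, size) blocks; starts as one big block
        self.free_blocks = [(0, pool_size)]

    def allocate(self, size):
        # best fit: pick the smallest free block that can hold the request
        candidates = [b for b in self.free_blocks if b[1] >= size]
        if not candidates:
            raise MemoryError(f"no free block of {size} bytes")
        offset, block_size = min(candidates, key=lambda b: b[1])
        self.free_blocks.remove((offset, block_size))
        if block_size > size:  # split off the unused remainder
            self.free_blocks.append((offset + size, block_size - size))
        return offset

    def free(self, offset, size):
        # return the block; a real allocator would coalesce neighbors here
        self.free_blocks.append((offset, size))


alloc = BestFitAllocator(1024)
a = alloc.allocate(100)  # offset 0
b = alloc.allocate(200)  # offset 100
alloc.free(a, 100)
c = alloc.allocate(64)   # reuses the freed 100-byte block: it is the best fit
```

Best fit trades a slower search for less wasted space per block than first fit, which matters when activation tensors of many different sizes are allocated and freed every inference step.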
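The profiler work in #91 and #97 describes a pattern of accumulating per-kernel elapsed time and printing the totals sorted in increasing order. A hedged sketch of that pattern, assuming a simple context manager and made-up kernel names; the actual profiler lives in the C++ core:

```python
# Illustrative profiling-aggregation sketch (not the repo's C++ profiler).
import time
from collections import defaultdict
from contextlib import contextmanager

timings = defaultdict(float)

@contextmanager
def profile(name):
    # accumulate wall-clock time under the given kernel name
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] += time.perf_counter() - start

with profile("gemm"):
    sum(i * i for i in range(100_000))
with profile("softmax"):
    sum(i for i in range(10_000))

# print profiling results in increasing order, as #97 describes
for name, elapsed in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name:10s} {elapsed * 1e3:8.3f} ms")
```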