Merge branch 'master' of github.com:Jittor/jittor

Dun Liang 2022-09-16 01:01:20 +08:00
commit e661f19e20
14 changed files with 1593 additions and 624 deletions

View File

@ -2,6 +2,7 @@
![Jittor Logo](https://cg.cs.tsinghua.edu.cn/jittor/favicon_package_v0/JittorLogo_Final1220.svg)
[Quick Start](#快速开始) | [Install](#安装) | [Tutorial](#教程) | [English](./README.md)
@ -18,6 +19,7 @@ The front-end language of Jittor is Python. The front end uses a modular and dynamic-graph execution design
* [Jittor Documents](https://cg.cs.tsinghua.edu.cn/jittor/assets/docs/index.html)
* [Github](https://github.com/jittor/jittor), [Gitee](https://gitee.com/jittor/jittor)
* [Jittor Forum](https://discuss.jittor.org/)
* [Awesome Jittor List](https://github.com/Jittor/jittor/blob/master/AWESOME-JITTOR-LIST.md)
* IM: QQ Group(761222083)
@ -88,40 +90,18 @@ for i,(x,y) in enumerate(get_data(n)):
## Install
Jittor environment requirements:
Jittor supports **Linux** (e.g. Ubuntu/CentOS/Arch), **macOS**, and **Windows**. The dependencies for **Linux** and **macOS** are as follows:
* Python version >= 3.7
* C++ compiler (at least one of the following)
    - g++ >=5.4.0 for Linux
    - clang >=8.0 for macOS
* GPU compiler (optional): nvcc >=10.0
* GPU acceleration library (optional): cudnn-dev (the cuDNN development package; the tar installation method is recommended, [reference link](https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html#installlinux-tar))
Jittor also supports mainstream domestic Chinese Linux distributions such as UOS, KylinOS, iSoft, and Loongnix; just follow the Linux pip installation method with python and g++ ready.
**Windows** requirements are:
* Python version >= 3.8 (installing from the official Python site is recommended, <https://www.python.org/downloads/windows/>)
* x86_64 processor
* Windows 10 or above.
If you prefer not to configure the environment manually, we recommend installing with Docker.
Alternatively, you can install via pip or manually.
Note 1: macOS users need to install additional dependencies, see [macOS install](#macOS-安装).
| OS | CPU | Python | Compiler | (Optional) GPU platform |
|--------------------------------------------------------|-------------------------------------|--------|--------------|---------------------------------------------|
| Linux<br>(Ubuntu, CentOS, Arch, <br>UOS, KylinOS, ...) | x86 <br>x86_64 <br>ARM <br>loongson | >= 3.7 | g++ >=5.4 | Nvidia CUDA >= 10.0, [cuDNN](https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html#installlinux-tar) <br> or [AMD ROCm](https://docs.amd.com/) >= 4.0 <br> or [Hygon DCU DTK](https://tycloud.hpccube.com/doc/1.0.6/11277/general-handbook/software-tutorial/jittor.html) >= 22.04 |
| macOS <br>(>= 10.14 Mojave) | intel<br>Apple Silicon | >= 3.7 | clang >= 8.0 | - |
| Windows 10 & 11 | x86_64 | [>= 3.8](https://www.python.org/downloads/windows/) | - | Nvidia CUDA >= 10.2 [cuDNN](https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html#install-windows) |
Jittor offers three installation methods: pip, docker, and manual installation:
## Pip install
@ -142,11 +122,11 @@ jittor will automatically search the path for a suitable compiler; if you want to manually specify
### macOS install
On macOS, please use [homebrew](https://brew.sh) to install the additional dependencies (python>=3.7, onednn).
On macOS, please use [homebrew](https://brew.sh) to install the additional dependencies.
```bash
brew install python@3.7 onednn libomp
brew install onednn libomp
```
Then you can install jittor via pip and test whether it runs successfully.
@ -157,7 +137,7 @@ python3.7 -m pip install jittor
python3.7 -m jittor.test.test_example
```
Currently, jittor only supports CPU computation on macOS.
### Windows install
@ -439,3 +419,4 @@ Jittor is currently maintained by the [Tsinghua CSCG Group](https://cg.cs.tsinghua.edu.cn
As shown in the LICENSE.txt file, Jittor is licensed under the Apache 2.0 license.

View File

@ -91,34 +91,14 @@ We provide some jupyter notebooks to help you quickly get started with Jittor.
Jittor environment requirements:
* System: **Linux** (e.g. Ubuntu/CentOS/Arch), **macOS**, or **Windows**; environment requirements of **Linux** and **macOS** are listed below:
| OS | CPU | Python | Compiler | (Optional) GPU platform |
|--------------------------------------------------------|-------------------------------------|--------|--------------|---------------------------------------------|
| Linux<br>(Ubuntu, CentOS, Arch, <br>UOS, KylinOS, ...) | x86 <br>x86_64 <br>ARM <br>loongson | >= 3.7 | g++ >=5.4 | Nvidia CUDA >= 10.0, [cuDNN](https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html#installlinux-tar) <br> or [AMD ROCm](https://docs.amd.com/) >= 4.0 <br> or [Hygon DCU DTK](https://tycloud.hpccube.com/doc/1.0.6/11277/general-handbook/software-tutorial/jittor.html) >= 22.04 |
| macOS <br>(>= 10.14 Mojave) | intel<br>Apple Silicon | >= 3.7 | clang >= 8.0 | - |
| Windows 10 & 11 | x86_64 | [>= 3.8](https://www.python.org/downloads/windows/) | - | Nvidia CUDA >= 10.2 [cuDNN](https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html#install-windows) |
* Python version >= 3.7
* CPU compiler (requires at least one of the following)
* g++ (>=5.4.0)
* clang (>=8.0)
* GPU compiler (optional)
* nvcc (>=10.0 for g++ or >=10.2 for clang)
* GPU library: cudnn-dev (recommend tar file installation, [reference link](https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html#installlinux-tar))
**Windows** requirements are:
* Python: version >= 3.8 (recommended to install from <https://www.python.org/downloads/windows/>)
* x86_64 CPU processor
* Windows 10 or above
Note#1: macOS users have to install additional dependencies, see [macOS install](#macOS-install).
Jittor offers three ways to install: pip, docker, or manual.
@ -142,7 +122,7 @@ python3.7 -m jittor.test.test_example
Please first install additional dependencies with [homebrew](https://brew.sh).
```bash
brew install python@3.7 onednn libomp
brew install onednn libomp
```
@ -433,3 +413,4 @@ Jittor is currently maintained by the [Tsinghua CSCG Group](https://cg.cs.tsingh
Jittor is Apache 2.0 licensed, as found in the LICENSE.txt file.

View File

@ -5,7 +5,7 @@
[Quickstart](#quickstart) | [Install](#install) | [Tutorial](#tutorial) | [Chinese](./README.cn.md)
[Quick Start](#快速开始) | [Install](#安装) | [Tutorial](#教程)
[Quick Start](#快速开始) | [Install](#安装) | [Tutorial](#教程) | [English](./README.md)
Jittor is a high-performance deep learning framework based on JIT compiling and meta-operators. The whole framework and meta-operators are compiled just-in-time. A powerful op compiler and tuner are integrated into Jittor. It allows us to generate high-performance code specialized for your model. Jittor also contains a wealth of high-performance model libraries, including: image recognition, detection, segmentation, generation, differentiable rendering, geometric learning, reinforcement learning, etc.
@ -22,6 +22,7 @@ Related Links:
* [Jittor Documents](https://cg.cs.tsinghua.edu.cn/jittor/assets/docs/index.html)
* [Github](https://github.com/jittor/jittor), [Gitee](https://gitee.com/jittor/jittor)
* [Jittor Forum](https://discuss.jittor.org/)
* [Awesome Jittor List](https://github.com/Jittor/jittor/blob/master/AWESOME-JITTOR-LIST.md)
* IM: QQ Group(761222083)
Related Links:
@ -31,6 +32,7 @@ Related Links:
* [Jittor Documents](https://cg.cs.tsinghua.edu.cn/jittor/assets/docs/index.html)
* [Github](https://github.com/jittor/jittor), [Gitee](https://gitee.com/jittor/jittor)
* [Jittor Forum](https://discuss.jittor.org/)
* [Awesome Jittor List](https://github.com/Jittor/jittor/blob/master/AWESOME-JITTOR-LIST.md)
* IM: QQ Group(761222083)
@ -115,52 +117,17 @@ We provide some jupyter notebooks to help you quickly get started with Jittor.
## Install
Jittor environment requirements:
Jittor supports **Linux** (e.g. Ubuntu/CentOS/Arch), **macOS**, and **Windows**. The dependencies for **Linux** and **macOS** are as follows:
* Python version >= 3.7
* C++ compiler (at least one of the following)
    - g++ >=5.4.0 for Linux
    - clang >=8.0 for macOS
* GPU compiler (optional): nvcc >=10.0
* GPU acceleration library (optional): cudnn-dev (the cuDNN development package; the tar installation method is recommended, [reference link](https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html#installlinux-tar))
Jittor also supports mainstream domestic Chinese Linux distributions such as UOS, KylinOS, iSoft, and Loongnix; just follow the Linux pip installation method with python and g++ ready.
**Windows** requirements are:
* Python version >= 3.8 (installing from the official Python site is recommended, <https://www.python.org/downloads/windows/>)
* x86_64 processor
* Windows 10 or above.
If you prefer not to configure the environment manually, we recommend installing with Docker.
Alternatively, you can install via pip or manually.
Note 1: macOS users need to install additional dependencies, see [macOS install](#macOS-安装).
Jittor offers three installation methods: pip, docker, and manual installation:
Jittor environment requirements:
* System: **Linux** (e.g. Ubuntu/CentOS/Arch), **macOS**, or **Windows**; environment requirements of **Linux** and **macOS** are listed below:
| OS | CPU | Python | Compiler | (Optional) GPU platform |
|--------------------------------------------------------|-------------------------------------|--------|--------------|---------------------------------------------|
| Linux<br>(Ubuntu, CentOS, Arch, <br>UOS, KylinOS, ...) | x86 <br>x86_64 <br>ARM <br>loongson | >= 3.7 | g++ >=5.4 | Nvidia CUDA >= 10.0, [cuDNN](https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html#installlinux-tar) <br> or [AMD ROCm](https://docs.amd.com/) >= 4.0 <br> or [Hygon DCU DTK](https://tycloud.hpccube.com/doc/1.0.6/11277/general-handbook/software-tutorial/jittor.html) >= 22.04 |
| macOS <br>(>= 10.14 Mojave) | intel<br>Apple Silicon | >= 3.7 | clang >= 8.0 | - |
| Windows 10 & 11 | x86_64 | [>= 3.8](https://www.python.org/downloads/windows/) | - | Nvidia CUDA >= 10.2 [cuDNN](https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html#install-windows) |
* Python version >= 3.7
* CPU compiler (requires at least one of the following)
* g++ (>=5.4.0)
* clang (>=8.0)
* GPU compiler (optional)
* nvcc (>=10.0 for g++ or >=10.2 for clang)
* GPU library: cudnn-dev (recommend tar file installation, [reference link](https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html#installlinux-tar))
**Windows** requirements are:
* Python: version >= 3.8 (recommended to install from <https://www.python.org/downloads/windows/>)
* x86_64 CPU processor
* Windows 10 or above
Note#1: macOS users have to install additional dependencies, see [macOS install](#macOS-install).
Jittor offers three installation methods: pip, docker, and manual installation:
Jittor offers three ways to install: pip, docker, or manual.
@ -186,12 +153,12 @@ jittor will automatically search the path for a suitable compiler; if you want to manually specify
### macOS install
On macOS, please use [homebrew](https://brew.sh) to install the additional dependencies (python>=3.7, onednn).
On macOS, please use [homebrew](https://brew.sh) to install the additional dependencies.
Please first install additional dependencies with [homebrew](https://brew.sh).
```bash
brew install python@3.7 onednn libomp
brew install onednn libomp
```
Then you can install jittor via pip and test whether it runs successfully.
@ -203,7 +170,7 @@ python3.7 -m pip install jittor
python3.7 -m jittor.test.test_example
```
Currently, jittor only supports CPU computation on macOS.
Currently jittor only supports CPU in macOS.

View File

@ -9,7 +9,7 @@
# file 'LICENSE.txt', which is part of this source code package.
# ***************************************************************
__version__ = '1.3.5.11'
__version__ = '1.3.5.14'
from jittor_utils import lock
with lock.lock_scope():
ori_int = int

File diff suppressed because it is too large.

View File

@ -69,6 +69,10 @@ unordered_map<string, unsigned int> cutt_plan_cache;
EXTERN_LIB unordered_map<string, unsigned int> cutt_plan_cache;
void CuttTransposeOp::jit_run() {
// Return if x is empty
if (x->num == 0)
return;
cudaGetLastError();
auto* __restrict__ xp = x->mem_ptr;
auto* __restrict__ yp = y->mem_ptr;
@ -116,4 +120,4 @@ void CuttTransposeOp::jit_run() {
}
#endif // JIT
} // jittor
} // jittor

View File

@ -525,20 +525,224 @@ _triple = _ntuple(3)
_quadruple = _ntuple(4)
def unique(x):
def unique(
input: jt.Var,
return_inverse: bool=False,
return_counts: bool=False,
dim: int=None):
r'''
Returns the unique elements of the input tensor.
Args:
x the input tensor.
input (var) the input var
return_inverse (bool) Whether to also return the indices for where elements in the original input ended up in the returned unique list. default: False
return_counts (bool) Whether to also return the counts for each unique element. default: False
dim (int) the dimension to apply unique. If None, the unique of the flattened input is returned. default: None
Example:
>>> jittor.unique(jittor.array([1, 3, 2, 3]))
jt.Var([1 2 3], dtype=int32)
>>> jittor.unique(jittor.array([1, 3, 2, 3, 2]), return_inverse=True, return_counts=True)
(jt.Var([1 2 3], dtype=int32), jt.Var([0 2 1 2 1], dtype=int32), jt.Var([1 2 2], dtype=int32))
>>> jittor.unique(jittor.array([[1, 3], [2, 3]]), return_inverse=True)
(jt.Var([1 2 3], dtype=int32), jt.Var([[0 2]
[1 2]], dtype=int32))
>>> jittor.unique(jittor.array([[1, 3], [1, 3]]), dim=0)
jt.Var([[1 3]], dtype=int32)
'''
x = x.reshape(-1)
_,x = jt.argsort(x)
index,= jt.index((x.shape[0],))
y = x[1:][x[index[1:]] != x[index[:-1]]]
x = jt.concat([x[:1],y],dim=0)
return x
temp_shape = None
if dim is None:
temp_shape = list(input.shape)
input_flatten = input.flatten()
dim = 0
else:
input_flatten = input
input_flatten = input_flatten.transpose(dim, 0)
orig_shape = input_flatten.shape
input_flatten = input_flatten.view(orig_shape[0], -1)
with jt.flag_scope(compile_options = {"FLAGS: --extended-lambda ": 1} if jt.flags.use_cuda else {}):
indice = jt.code((input_flatten.shape[0], ), 'int32', [input_flatten],
cpu_header='''
#include <algorithm>
''',
cpu_src='''
@alias(input_flatten, in0)
@alias(indice, out)
int dimlen = input_flatten_shape0, dimsize = input_flatten_shape1;
for(int i = 0; i < dimlen; ++i) @indice(i) = i;
std::sort(&@indice(0), &@indice(dimlen), [&](int a, int b){
for(int i = 0; i < dimsize; ++i) {
int lhs = @input_flatten(a, i), rhs = @input_flatten(b, i);
if (lhs != rhs) return lhs < rhs;
}
return false;
});
''',
cuda_header='''
#undef out
#include <thrust/extrema.h>
#include <thrust/device_ptr.h>
#include <thrust/execution_policy.h>
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <cub/cub.cuh>
#include <executor.h>
''',
cuda_src=
'''
@alias(input_flatten, in0)
@alias(indice, out)
int dimlen = indice_shape0, dimsize = input_flatten_shape1;
if (dimsize == 1) {
size_t raw_allocation, d_allocation, temp_storage_bytes = 0;
void *d_temp_storage = NULL;
int32_t* raw_ptr = (int32_t*)exe.allocator->alloc(dimlen * (sizeof(int32_t) + sizeof(input_flatten_type)), raw_allocation);
thrust::device_ptr<int32_t> arange_ptr = thrust::device_pointer_cast(raw_ptr);
thrust::sequence(arange_ptr, arange_ptr + dimlen);
cub::DeviceRadixSort::SortPairs(d_temp_storage, temp_storage_bytes, input_flatten_p,
(input_flatten_type*)(raw_ptr + dimlen), thrust::raw_pointer_cast(arange_ptr), indice_p, dimlen);
d_temp_storage = exe.allocator->alloc(temp_storage_bytes, d_allocation);
cub::DeviceRadixSort::SortPairs(d_temp_storage, temp_storage_bytes, input_flatten_p,
(input_flatten_type*)(raw_ptr + dimlen), thrust::raw_pointer_cast(arange_ptr), indice_p, dimlen);
exe.allocator->free(raw_ptr, dimlen * (sizeof(int) + sizeof(input_flatten_type)), raw_allocation);
exe.allocator->free(d_temp_storage, temp_storage_bytes, d_allocation);
} else {
thrust::device_ptr<input_flatten_type> input_ptr = thrust::device_pointer_cast(input_flatten_p);
thrust::device_ptr<int32_t> indice_ptr = thrust::device_pointer_cast(indice_p);
thrust::sequence(indice_ptr, indice_ptr + dimlen);
thrust::sort(thrust::device, indice_ptr, indice_ptr + dimlen,
[=] __device__ (int32_t a, int32_t b)->bool {
for(int i = 0; i < dimsize; ++i) {
input_flatten_type lhs = input_ptr[i + a * dimsize],
rhs = input_ptr[i + b * dimsize];
if (lhs != rhs) return lhs < rhs;
}
return false;
});
}
'''
)
input_sorted = input_flatten[indice][:]
dimlen = indice.shape[0]
diff = jt.logical_not(jt.all(input_sorted[1:] == input_sorted[: -1], 1))
diff = jt.concat([jt.Var([False]), diff], 0)
diff = jt.array(diff, dtype = jt.int32)
with jt.flag_scope(compile_options = {"FLAGS: --extended-lambda ": 1} if jt.flags.use_cuda else {}):
output, inverse = jt.code(
[(-input_sorted.shape[0], ), (indice.shape)],
[input_sorted.dtype, indice.dtype],
[input_sorted, diff, indice],
cpu_header='''
#include <algorithm>
@alias(input_sorted, in0)
@alias(diff, in1)
@alias(indice, in2)
@alias(output, out0)
@alias(inverse, out1)
''',
cpu_src=
f"bool return_inverse = {int(return_inverse)};" +
'''
int tot = -1;
for (int i = 0; i < input_sorted_shape0; ++i) {
if (i == 0 || @diff(i)) {
++tot; @output(tot) = i;
}
if (return_inverse)
@inverse(@indice(i)) = tot;
}
output->set_shape({tot + 1});
''',
cuda_header='''
#undef out
#include <thrust/extrema.h>
#include <thrust/device_ptr.h>
#include <thrust/execution_policy.h>
#include <thrust/scan.h>
#include <executor.h>
@alias(input_sorted, in0)
@alias(diff, in1)
@alias(indice, in2)
@alias(output, out0)
@alias(inverse, out1)
''',
cuda_src=
f"bool return_inverse = {int(return_inverse)};" +
'''
int dimlen = input_sorted_shape0, dimsize = input_sorted_shape1;
size_t raw_allocation;
int32_t* raw_ptr = (int32_t*)exe.allocator->alloc(2 * dimlen * sizeof(int), raw_allocation);
thrust::device_ptr<int32_t> diff_ptr = thrust::device_pointer_cast(diff_p),
inverse_ptr = thrust::device_pointer_cast(inverse_p),
array_ptr = thrust::device_pointer_cast(raw_ptr),
sum_ptr = thrust::device_pointer_cast(raw_ptr + dimlen),
indice_ptr = thrust::device_pointer_cast(indice_p);
thrust::device_ptr<input_sorted_type> input_ptr = thrust::device_pointer_cast(input_sorted_p);
if (return_inverse) {
thrust::inclusive_scan(diff_ptr, diff_ptr + dimlen, sum_ptr);
thrust::scatter(sum_ptr, sum_ptr + dimlen, indice_ptr, inverse_ptr);
}
thrust::sequence(array_ptr, array_ptr + dimlen);
int32_t num = thrust::unique(array_ptr, array_ptr + dimlen,
[=] __device__ (int32_t a, int32_t b)->bool {
for(int i = 0; i < dimsize; ++i) {
input_sorted_type lhs = input_ptr[i + a * dimsize],
rhs = input_ptr[i + b * dimsize];
if (lhs != rhs) return false;
}
return true;
}) - array_ptr;
cudaMemcpy(output_p, raw_ptr, sizeof(int32_t) * num, cudaMemcpyDeviceToDevice);
exe.allocator->free(raw_ptr, 2 * dimlen * sizeof(int32_t), raw_allocation);
output->set_shape({ num });
'''
)
indice_shape = (output.shape[0], )
output = input_sorted[output][:]
new_shape = list(orig_shape[1:])
new_shape.insert(0, -1)
output = output.view(new_shape).transpose(dim, 0)
if temp_shape is not None:
inverse = inverse.view(temp_shape).transpose(dim, 0)
if return_inverse:
if return_counts:
counts = jt.zeros(indice_shape, dtype=jt.int32)
jt.scatter_(counts, 0, inverse.flatten(), jt.ones(dimlen), reduce='add')
return output, inverse, counts
else:
return output, inverse
else:
return output
jt.Var.unique = unique
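As a sanity check of the new signature, the result can be compared against numpy.unique (a quick sketch, assuming a working jittor build; `.numpy()` materializes a jt.Var):

```python
import numpy as np
import jittor as jt

a = np.array([1, 3, 2, 3, 2], dtype=np.int32)
out, inv, cnt = jt.unique(jt.array(a), return_inverse=True, return_counts=True)
ref_out, ref_inv, ref_cnt = np.unique(a, return_inverse=True, return_counts=True)
# unique values, inverse indices, and counts should all agree with numpy
assert np.allclose(out.numpy(), ref_out)
assert np.allclose(inv.numpy(), ref_inv)
assert np.allclose(cnt.numpy(), ref_cnt)
```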
@ -1176,8 +1380,11 @@ def numpy_cumprod(a, dim):
return func(a, dim)
def linspace(start, end, steps):
res = jt.index((steps,))[0]
res = res*float((end-start)/(steps-1))+start
if steps > 1:
res = jt.index((steps,))[0]
res = res*float((end-start)/(steps-1))+start
else:
res = jt.array([start])
return res
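A minimal sketch of the edge case this guard fixes: with a single step, the old formula divided by steps-1 == 0, while the new branch simply returns [start].

```python
import jittor as jt

print(jt.linspace(0, 10, 5))   # evenly spaced: 0, 2.5, 5, 7.5, 10
print(jt.linspace(3, 7, 1))    # a single step now yields [3] instead of dividing by zero
```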
def randperm(n, dtype="int32"):

View File

@ -12,7 +12,6 @@
# file 'LICENSE.txt', which is part of this source code package.
# ***************************************************************
from abc import abstractmethod
from sys import breakpointhook
import jittor as jt
from jittor import flatten, init, Module
import numpy as np
@ -262,7 +261,8 @@ def sign(x: jt.Var) -> jt.Var:
def gelu(x):
r''' Applies the element-wise function:
.. math:: \text{GELU}(x) = x * \Phi(x)
.. math::
\text{GELU}(x) = x * \Phi(x)
where :math:`\Phi(x)` is the Cumulative Distribution Function for Gaussian Distribution.
@ -546,6 +546,31 @@ class Dropout(Module):
def dropout(x,p=0.5,is_train=False):
return Dropout(p=p,is_train=is_train)(x)
class DropPath(Module):
'''Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).
'''
def __init__(self, p=0.5, is_train=False):
'''
:param p: Specifies the probability of dropping each sample's path. Defaults to 0.5.
:type p: float
:param is_train: Specifies whether the module is in training mode. Defaults to False.
:type is_train: bool
'''
self.p = p
self.is_train = is_train
#TODO: test model.train() to change self.is_train
def execute(self, x):
if self.p == 0. or not self.is_train:
return x
keep_prob = 1 - self.p
shape = (x.shape[0], ) + (1, ) * (x.ndim - 1)
random_tensor = keep_prob + jt.rand(shape, dtype=x.dtype)
output = x.divide(keep_prob) * random_tensor.floor()
return output
def droppath(x,p=0.5,is_train=False):
return DropPath(p=p,is_train=is_train)(x)
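A usage sketch for the new module; `is_train=True` is passed explicitly here because, per the TODO above, `model.train()` does not yet toggle the flag:

```python
import jittor as jt
from jittor import nn

x = jt.rand(4, 3, 8)                 # a batch of 4 samples
dp = nn.DropPath(p=0.25, is_train=True)
y = dp(x)
# Each sample is zeroed as a whole with probability 0.25; survivors are
# scaled by 1/keep_prob so the expected value of the output matches x.
```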
class Linear(Module):
def __init__(self, in_features, out_features, bias=True):
self.in_features = in_features
@ -2799,4 +2824,91 @@ def _fft2(x, inverse=False):
y = jt.compile_extern.cufft_ops.cufft_fft(x, inverse)
if inverse:
y /= x.shape[1] * x.shape[2]
return y
return y
def one_hot(x: jt.Var, num_classes: int=-1) -> jt.Var:
''' Returns the one_hot encoding of inputs.
:param x: class values of any shape
:type x: jt.Var with bool or integer dtype
:param num_classes: Total number of classes. If set to -1, the number of classes will be inferred as one greater than the largest class value in the input tensor.
:type num_classes: int, optional
:return: a Var with one more dimension with 1 values at the index
of last dimension indicated by the input, and 0 everywhere else.
:rtype: jt.Var
.. note::
if the values in x are greater than or equal to num_classes or less than 0,
the corresponding rows of the returned one_hot will be all zeros.
Example:
>>> jt.nn.one_hot(jt.arange(5) % 3)
jt.Var([[1 0 0]
[0 1 0]
[0 0 1]
[1 0 0]
[0 1 0]], dtype=int32)
>>> jt.nn.one_hot(jt.arange(5) % 3, num_classes=5)
jt.Var([[1 0 0 0 0]
[0 1 0 0 0]
[0 0 1 0 0]
[1 0 0 0 0]
[0 1 0 0 0]], dtype=int32)
>>> jt.nn.one_hot(jt.arange(6).reshape(3,2) % 3)
jt.Var([[[1 0 0]
[0 1 0]]
[[0 0 1]
[1 0 0]]
[[0 1 0]
[0 0 1]]], dtype=int32)
'''
assert x.dtype in [jt.bool, jt.int8, jt.int16, jt.int32, jt.int64, jt.uint8, jt.uint16, jt.uint32, jt.uint64]
if num_classes == -1:
num_classes = x.max().item() + 1
N = len(x.shape)
indices = ["i"+str(i) for i in range(N)]
y = jt.ones_like(x).reindex(
x.shape + [num_classes],
indices,
extras=[x],
overflow_conditions=[f"i{N} != @e0({','.join(indices)})"],
overflow_value=0)
return y
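Conceptually, the reindex call above is equivalent to broadcasting a comparison against the class range; a reference sketch (not the actual implementation, assuming `unsqueeze` and the `int32` cast behave as in recent jittor releases):

```python
import jittor as jt

def one_hot_ref(x: jt.Var, num_classes: int) -> jt.Var:
    # positions where the class index equals the input value become 1;
    # out-of-range values match nothing, giving the all-zero rows noted above
    return (x.unsqueeze(-1) == jt.arange(num_classes)).int32()

print(one_hot_ref(jt.arange(5) % 3, 3))
```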
class KLDivLoss(Module):
''' Computes the Kullback-Leibler divergence loss.
'''
def __init__(self, reduction: str = 'mean', log_target: bool = False):
'''
:param reduction: Specifies the reduction to apply to the output. Can be 'mean', 'sum', 'batchmean', or 'none'. Defaults to 'mean'.
:type reduction: str, optional
:param log_target: Specifies whether target is given in log space. Defaults to False.
:type log_target: bool, optional
'''
self.reduction = reduction
self.log_target = log_target
def execute(self, input: jt.Var, target: jt.Var) -> jt.Var:
if not self.log_target:
loss_pointwise = target * (target.log() - input)
else:
loss_pointwise = target.exp() * (target - input)
if self.reduction == "mean":
loss = loss_pointwise.mean()
elif self.reduction == "batchmean":
loss = loss_pointwise.sum() / input.size(0)
elif self.reduction == "sum":
loss = loss_pointwise.sum()
else:
loss = loss_pointwise
return loss
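A usage sketch mirroring the PyTorch convention this class follows: the input is expected in log space and the target in probability space unless `log_target=True` (assuming `nn.log_softmax`/`nn.softmax` from jittor):

```python
import jittor as jt
from jittor import nn

log_pred = nn.log_softmax(jt.rand(3, 5), dim=1)   # log-probabilities
target = nn.softmax(jt.rand(3, 5), dim=1)         # probabilities (log_target=False)
loss = nn.KLDivLoss(reduction='batchmean')(log_pred, target)
print(loss)
```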

View File

@ -32,9 +32,9 @@ struct IndexOp : Op {
Example::
print(jt.index([2,2], 0)())
print(jt.index([2,2], 0))
# output: [[0,0],[1,1]]
print(jt.index([2,2], 1)())
print(jt.index([2,2], 1))
# output: [[0,1],[0,1]]
*/
IndexOp(NanoVector shape, int64 dim, NanoString dtype=ns_int32);

View File

@ -90,5 +90,14 @@ class TestCuttTransposeOp(unittest.TestCase):
assert ((da-jda.data)<1e-5).all(), (da, jda.data, da-jda.data)
assert ((db-jdb.data)<1e-5).all(), (db-jdb.data)
@unittest.skipIf(cutt_ops==None, "Not use cutt, Skip")
@jt.flag_scope(use_cuda=1)
def test_transpose_empty(self):
a = jt.zeros((0, 10))
b = a.transpose(1, 0)
c = b.data
assert c.shape[0] == 10
assert c.shape[1] == 0
if __name__ == "__main__":
unittest.main()

View File

@ -0,0 +1,57 @@
# ***************************************************************
# Copyright (c) 2022 Jittor. All Rights Reserved.
# Maintainers:
# Dun Liang <randonlang@gmail.com>.
# Xiangli Li <1905692338@qq.com>
# Jiapeng Zhang <zhangjp20@mails.tsinghua.edu.cn>
#
# This file is subject to the terms and conditions defined in
# file 'LICENSE.txt', which is part of this source code package.
# ***************************************************************
import unittest
import jittor as jt
import numpy as np
skip_this_test = False
try:
jt.dirty_fix_pytorch_runtime_error()
import torch
except:
torch = None
skip_this_test = True
def test_unique_with_torch(input, dim=None):
jt0, jt1, jt2 = jt.unique(jt.array(input), True, True, dim)
torch0, torch1, torch2 = torch.unique(torch.tensor(input), True, True, True, dim)
assert np.allclose(jt0, torch0) and np.allclose(jt1, torch1) and np.allclose(jt2, torch2)
@unittest.skipIf(skip_this_test, "No Torch found")
class TestUnique(unittest.TestCase):
def test_unique(self):
test_unique_with_torch(np.array([1, 3, 2, 3, 3, 3], dtype=np.int32))
test_unique_with_torch(np.array([[1, 3], [2, 3], [1, 2]], dtype=np.int64))
def test_unique_dim(self):
test_unique_with_torch(np.array([[1, 3], [2, 3], [1, 3], [2, 3]]), 0)
test_unique_with_torch(np.array([[1, 3], [2, 3], [1, 3], [2, 3]]), 1)
@unittest.skipIf(not jt.compiler.has_cuda, "No CUDA found")
@jt.flag_scope(use_cuda=1)
def test_unique_cuda(self):
self.test_unique()
@unittest.skipIf(not jt.compiler.has_cuda, "No CUDA found")
@jt.flag_scope(use_cuda=1)
def test_unique_dim_cuda(self):
self.test_unique_dim()
if __name__ == "__main__":
unittest.main()

View File

@ -16,7 +16,7 @@ In detail, autocompletion of the following functions is supported.
- methods of jittor.Var
Prerequisite:
- mypy for automatic stub generation
- mypy for automatic stub generation, installation: pip install mypy
Usage: python3 -m jittor.utils.gen_pyi
@ -35,7 +35,7 @@ def add_indent(s: str, n=1):
def ctype_to_python(type_str):
if type_str == "bool":
return "bool"
if type_str in ["int", "uint", "int64", "uint64", "size_t"]:
if type_str in ["int", "uint", "uint8", "int64", "uint64", "size_t"]:
return "int"
if type_str in ["float32", "float64"]:
return "float"
@ -49,6 +49,8 @@ def ctype_to_python(type_str):
return "Var"
if type_str in ["vector<VarHolder*>", "vector<VarHolder*>&&"]:
return "List[Var]"
if type_str in ["vector_to_tuple<VarHolder*>"]:
return "Tuple[Var]"
if type_str == "NanoVector":
return "Tuple[int]"
if type_str == "vector<NanoVector>&&":
@ -161,11 +163,22 @@ def gen_ops_stub(jittor_path):
if func_name == "bool":
continue
docstring = func.__doc__[:func.__doc__.find("Declaration:")]
docstring = docstring.replace("'''", '"""').strip()
declarations = re.findall(r"Declaration:\n(.+)\n", func.__doc__)
docstrings = []
declarations = []
for i, doc in enumerate(re.split(r"Declaration:\n(.+)\n", func.__doc__)):
if i % 2 == 0:
if not doc.strip() and docstrings:
# if the current docstring is empty, use the last docstring
docstrings.append(docstrings[-1])
else:
docstrings.append(doc.replace("'''", '"""').strip())
else:
declarations.append(doc)
for i in range(len(declarations)):
decl = declarations[i]
docstring = docstrings[i]
for decl in declarations:
decorators = "@overload\n" if len(declarations) > 1 else ""
return_type = ctype_to_python(decl.split(' ', maxsplit=1)[0])
param_hints = decl_to_param_hints(decl)
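To illustrate the parsing change above: `re.split` with a capturing group yields alternating docstring/declaration chunks, and an empty docstring chunk falls back to its predecessor. A standalone sketch with a hypothetical doc blob:

```python
import re

doc = ("Add two vars.\n"
       "Declaration:\n"
       "VarHolder* add(VarHolder* x, VarHolder* y)\n"
       "\n"
       "Declaration:\n"
       "VarHolder* add(VarHolder* x, float y)\n")
parts = re.split(r"Declaration:\n(.+)\n", doc)
# even indices are docstrings (the empty one reuses the previous docstring),
# odd indices are the overload declarations
print(parts)
```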

View File

@ -8,6 +8,22 @@ import numpy as np
from typing import Any, BinaryIO, cast, Dict, Optional, Type, Tuple, Union, IO, List
loaded_storages = {}
deserialized_objects = {}
def _is_zipfile(fn):
f = open(fn, "rb")
read_bytes = []
start = f.tell()
byte = f.read(1)
while byte != b"":
read_bytes.append(byte)
if len(read_bytes) == 4:
break
byte = f.read(1)
f.seek(start)
local_header_magic_number = [b'P', b'K', b'\x03', b'\x04']
return read_bytes == local_header_magic_number
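For reference, `PK\x03\x04` is the ZIP local-file-header magic, which is what the helper above looks for; a quick self-contained check:

```python
import io
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("archive/data.pkl", b"")
print(buf.getvalue()[:4])   # b'PK\x03\x04', the signature _is_zipfile detects
```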
def _maybe_decode_ascii(bytes_str: Union[bytes, str]) -> str:
if isinstance(bytes_str, bytes):
@ -19,14 +35,14 @@ def load_tensor(contents, dtype, numel, key, location):
loaded_storages[key] = np.frombuffer(contents[name], dtype).copy()
def get_dtype_size(dtype):
if dtype is np.float32 or dtype is np.int32:
dtype = dtype.__str__()
if dtype == "float32" or dtype == "int32":
return 4
elif dtype is np.float64 or dtype is np.int64:
if dtype == "float64" or dtype == "int64":
return 8
elif dtype is np.float16 or dtype is np.int16:
if dtype == "float16" or dtype == "int16":
return 2
else:
return 1
return 1
def persistent_load(saved_id):
global contents
@ -38,7 +54,7 @@ def persistent_load(saved_id):
storage_type, key, location, numel = data
dtype = storage_type.dtype
if key not in loaded_storages:
nbytes = numel * get_dtype_size(dtype)
nbytes = numel
load_tensor(contents, dtype, nbytes, key, _maybe_decode_ascii(location))
return loaded_storages[key]
@ -72,13 +88,12 @@ class StorageType():
return f'StorageType(dtype={self.dtype})'
def jittor_rebuild(storage, storage_offset, size, stride, requires_grad, backward_hooks):
# print(storage, size)
if len(size) == 0:
return jt.array(storage)
return jt.array(storage).reshape(size)
def jittor_rebuild_var(data, requires_grad, backward_hooks):
v = jt.array(data)
v = jt.array(data)
v.requires_grad = requires_grad
return v
@ -96,6 +111,38 @@ class UnpicklerWrapper(pickle.Unpickler): # type: ignore[name-defined]
return super().find_class(mod_name, name)
class ArrayWrapper:
def __init__(self, storage, size=None, requires_grad=None):
self.requires_grad = requires_grad
self.size = size
self.storage = storage
def __str__(self):
return self.storage.__str__()
def jittor_rebuild_direct(storage, storage_offset, size, stride, requires_grad, backward_hooks):
if len(size) == 0:
return ArrayWrapper(storage, size=size)
storage = storage.reshape(size)
return ArrayWrapper(storage, size=size)
def jittor_rebuild_var_direct(data, requires_grad, backward_hooks):
v = ArrayWrapper(data, requires_grad=requires_grad)
return v
class DirectUnpicklerWrapper(pickle.Unpickler): # type: ignore[name-defined]
def find_class(self, mod_name, name):
if type(name) is str and 'Storage' in name:
try:
return StorageType(name)
except KeyError:
pass
if type(name) is str and '_rebuild_tensor_v2' in name:
return super().find_class("jittor_utils.load_pytorch", "jittor_rebuild_direct")
if type(name) is str and '_rebuild_parameter' in name:
return super().find_class("jittor_utils.load_pytorch", "jittor_rebuild_var_direct")
return super().find_class(mod_name, name)
def _check_seekable(f) -> bool:
def raise_err_msg(patterns, e):
for p in patterns:
@ -117,18 +164,97 @@ def extract_zip(input_zip):
input_zip = ZipFile(input_zip)
return {name: input_zip.read(name) for name in input_zip.namelist()}
def _is_compressed_file(f):
compress_modules = ['gzip']
try:
return f.__module__ in compress_modules
except AttributeError:
return False
def _should_read_directly(f):
if _is_compressed_file(f):
return False
try:
return f.fileno() >= 0
except io.UnsupportedOperation:
return False
except AttributeError:
return False
def persistent_load_direct(saved_id):
global deserialized_objects
assert isinstance(saved_id, tuple)
typename = _maybe_decode_ascii(saved_id[0])
data = saved_id[1:]
if typename == 'module':
# Ignore containers that don't have any sources saved
return data[0]
elif typename == 'storage':
data_type, root_key, location, size, view_metadata = data
location = _maybe_decode_ascii(location)
if root_key not in deserialized_objects:
deserialized_objects[root_key] = np.zeros(size, dtype=data_type.dtype)
storage = deserialized_objects[root_key]
if view_metadata is not None:
view_key, offset, view_size = view_metadata
if view_key not in deserialized_objects:
deserialized_objects[view_key] = storage[offset:offset + view_size]
return deserialized_objects[view_key]
else:
return storage
else:
raise RuntimeError("Unknown saved id type: %s" % saved_id[0])
def load_pytorch(fn_name):
global contents
global contents, deserialized_objects
if not fn_name.endswith(".pth"):
print("This function is designed to load pytorch pth format files.")
return None
else:
contents = extract_zip(fn_name)
data_file = io.BytesIO(contents['archive/data.pkl'])
pickle_load_args = {'encoding': 'utf-8'}
unpickler = UnpicklerWrapper(data_file, **pickle_load_args)
unpickler.persistent_load = persistent_load
result = unpickler.load()
if _is_zipfile(fn_name):
contents = extract_zip(fn_name)
data_file = io.BytesIO(contents['archive/data.pkl'])
pickle_load_args = {'encoding': 'utf-8'}
unpickler = UnpicklerWrapper(data_file, **pickle_load_args)
unpickler.persistent_load = persistent_load
result = unpickler.load()
else:
deserialized_objects = {}
f = open(fn_name, "rb")
f_should_read_directly = _should_read_directly(f)
MAGIC_NUMBER = 0x1950a86a20f9469cfc6c
PROTOCOL_VERSION = 1001
pickle_load_args = {'encoding': 'utf-8'}
magic_number = pickle.load(f, **pickle_load_args)
if magic_number != MAGIC_NUMBER:
raise RuntimeError("Invalid magic number; corrupt file?")
protocol_version = pickle.load(f, **pickle_load_args)
if PROTOCOL_VERSION != protocol_version:
raise RuntimeError("Invalid protocol version.")
_sys_info = pickle.load(f, **pickle_load_args)
unpickler = DirectUnpicklerWrapper(f, **pickle_load_args)
unpickler.persistent_load = persistent_load_direct
result = unpickler.load()
offset = f.tell() if f_should_read_directly else None
deserialized_storage_keys = pickle.load(f, **pickle_load_args)
f.read(8)
for key in deserialized_storage_keys:
assert key in deserialized_objects
dtype = deserialized_objects[key].dtype
size = deserialized_objects[key].size * get_dtype_size(dtype)
byte_data = f.read(size)
deserialized_objects[key][:] = np.frombuffer(byte_data, dtype).copy()
f.read(8)
if offset is not None:
offset = f.tell()
for key, params in result.items():
requires_grad = params.requires_grad
shape = params.size
result[key] = jt.array(params.storage)
if shape is not None and len(shape) > 0:
result[key] = result[key].reshape(shape)
if requires_grad is not None:
result[key].requires_grad = requires_grad
return result
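A usage sketch (the checkpoint path is hypothetical); both the zip-based and the legacy pickle-based `.pth` layouts now land in the same dict of jt.Var:

```python
import jittor as jt
from jittor_utils.load_pytorch import load_pytorch

params = load_pytorch("checkpoint.pth")   # hypothetical file name
for name, var in params.items():
    print(name, var.shape)
```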
if __name__ == "__main__":

View File

@ -19,23 +19,69 @@ def check_is_en(src):
return en_cnt == len(src)
def check_is_both(src):
if src.startswith("!"):
return True
return len(src) < 2
def splite_markdown_blocks(src):
''' split markdown document into text, code, table blocks
'''
blocks = []
block = ""
status = "text"
def commit_block():
blocks.append((block, status))
for line in src.split('\n'):
line = line + "\n"
if line.startswith("```"):
assert status in ["text", "code"]
if status == "text":
commit_block()
status = "code"
block = line
elif status == "code":
block += line
commit_block()
status = "text"
block = ""
elif line.strip().startswith('|') and line.strip().endswith('|'):
assert status in ["text", "table"]
if status == "text":
commit_block()
status = "table"
block = line
else:
block += line
else:
if status == "table":
commit_block()
status = "text"
block = line
else:
block += line
if status != "code":
commit_block()
return blocks
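A small check of the splitter on a synthetic document (assuming the function above is in scope; the fence string is built at runtime to avoid nesting code fences here):

```python
fence = "`" * 3
sample = (
    "some text\n"
    + fence + "python\nprint('hi')\n" + fence + "\n"
    + "| a | b |\n| 1 | 2 |\n"
    + "more text"
)
for block, kind in splite_markdown_blocks(sample):
    print(kind, repr(block))
# code and table blocks come back verbatim and are later copied into both
# the English and Chinese outputs unchanged; only text blocks are split.
```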
for mdname in all_src_md:
print(mdname)
with open(mdname, "r", encoding='utf8') as f:
src = f.read()
src = src.split("```")
en_src = []
cn_src = []
for i, s in enumerate(src):
if i%2==1:
en_src.append(s)
cn_src.append(s)
src_blocks = splite_markdown_blocks(src)
en_src = ""
cn_src = ""
for block, status in src_blocks:
if status == "code" or status == "table":
en_src += block
cn_src += block
else:
en_s = []
cn_s = []
for line in s.split('\n'):
for line in block.split('\n'):
if check_is_both(line):
en_s.append(line)
cn_s.append(line)
@ -43,10 +89,9 @@ for mdname in all_src_md:
en_s.append(line)
else:
cn_s.append(line)
en_src.append("\n".join(en_s))
cn_src.append("\n".join(cn_s))
en_src = "```".join(en_src)
cn_src = "```".join(cn_src)
en_src += "\n".join(en_s)
cn_src += "\n".join(cn_s)
with open(mdname.replace(".src.md", ".md"), 'w', encoding='utf8') as f:
f.write(en_src)
with open(mdname.replace(".src.md", ".cn.md"), 'w', encoding='utf8') as f: