Exusial 2022-10-06 16:42:08 +08:00
commit 17db747ae4
26 changed files with 1854 additions and 727 deletions

View File

@ -2,6 +2,7 @@
![Jittor Logo](https://cg.cs.tsinghua.edu.cn/jittor/favicon_package_v0/JittorLogo_Final1220.svg)
[Quick Start](#快速开始) | [Install](#安装) | [Tutorial](#教程) | [English](./README.md)
@ -18,6 +19,7 @@ Jittor前端语言为Python。前端使用了模块化和动态图执行的设
* [Jittor Documents](https://cg.cs.tsinghua.edu.cn/jittor/assets/docs/index.html)
* [Github](https://github.com/jittor/jittor), [Gitee](https://gitee.com/jittor/jittor)
* [Jittor Forum](https://discuss.jittor.org/)
* [Awesome Jittor List](https://github.com/Jittor/jittor/blob/master/AWESOME-JITTOR-LIST.md)
* IM: QQ Group(761222083)
@ -88,40 +90,18 @@ for i,(x,y) in enumerate(get_data(n)):
## Install
The Jittor framework has the following environment requirements:
Jittor supports **Linux** (e.g. Ubuntu/CentOS/Arch), **macOS**, and **Windows**. The dependencies for **Linux** and **macOS** are:
* Python version >= 3.7
* C++ compiler (at least one of the following)
    - g++ >= 5.4.0 for Linux
    - clang >= 8.0 for macOS
* GPU compiler (optional): nvcc >= 10.0
* GPU acceleration library (optional): cudnn-dev (the cuDNN development package; the tar installation method is recommended, [reference link](https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html#installlinux-tar))
Jittor also supports mainstream domestic Chinese Linux distributions such as UOS, Kylin, Puhua, and Loongnix; installation follows the Linux pip method, and only Python and g++ need to be prepared.
**Windows** requirements are:
* Python version >= 3.8 (installing from the official Python website is recommended: <https://www.python.org/downloads/windows/>)
* x86_64 processor
* Windows 10 or above.
If you prefer not to configure the environment by hand, we recommend installing with Docker.
In addition, you can also install via pip or manually.
Note 1: macOS users need to install additional dependencies, see [macOS install](#macOS-安装).
| OS | CPU | Python | Compiler | (Optional) GPU platform |
|--------------------------------------------------------|-------------------------------------|--------|--------------|---------------------------------------------|
| Linux<br>(Ubuntu, CentOS, Arch, <br>UOS, KylinOS, ...) | x86 <br>x86_64 <br>ARM <br>loongson | >= 3.7 | g++ >=5.4 | Nvidia CUDA >= 10.0, [cuDNN](https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html#installlinux-tar) <br> or [AMD ROCm](https://docs.amd.com/) >= 4.0 <br> or [Hygon DCU DTK](https://tycloud.hpccube.com/doc/1.0.6/11277/general-handbook/software-tutorial/jittor.html) >= 22.04 |
| macOS <br>(>= 10.14 Mojave) | intel<br>Apple Silicon | >= 3.7 | clang >= 8.0 | - |
| Windows 10 & 11 | x86_64 | [>= 3.8](https://www.python.org/downloads/windows/) | - | Nvidia CUDA >= 10.2 [cuDNN](https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html#install-windows) |
Jittor offers three installation methods: pip, Docker, and manual installation.
## Pip install
@ -142,11 +122,11 @@ jittor会自动在路径中寻找合适的编译器, 如果您希望手动指定
### macOS install
On macOS, please use [homebrew](https://brew.sh) to install the additional dependencies (python>=3.7, onednn).
On macOS, please use [homebrew](https://brew.sh) to install the additional dependencies.
```bash
brew install python@3.7 onednn libomp
brew install onednn libomp
```
Afterwards you can install jittor via pip and test whether it runs successfully.
@ -157,7 +137,7 @@ python3.7 -m pip install jittor
python3.7 -m jittor.test.test_example
```
Currently, jittor only supports CPU computation on macOS.
### Windows install
@ -439,3 +419,4 @@ Jittor目前由[清华大学计算机图形学组](https://cg.cs.tsinghua.edu.cn
As described in the LICENSE.txt file, Jittor is released under the Apache 2.0 license.

View File

@ -91,34 +91,14 @@ We provide some jupyter notebooks to help you quick start with Jittor.
Jittor environment requirements:
* System: **Linux** (e.g. Ubuntu/CentOS/Arch), **macOS**, or **Windows**; the environment requirements of **Linux** and **macOS** are listed below:
| OS | CPU | Python | Compiler | (Optional) GPU platform |
|--------------------------------------------------------|-------------------------------------|--------|--------------|---------------------------------------------|
| Linux<br>(Ubuntu, CentOS, Arch, <br>UOS, KylinOS, ...) | x86 <br>x86_64 <br>ARM <br>loongson | >= 3.7 | g++ >=5.4 | Nvidia CUDA >= 10.0, [cuDNN](https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html#installlinux-tar) <br> or [AMD ROCm](https://docs.amd.com/) >= 4.0 <br> or [Hygon DCU DTK](https://tycloud.hpccube.com/doc/1.0.6/11277/general-handbook/software-tutorial/jittor.html) >= 22.04 |
| macOS <br>(>= 10.14 Mojave) | intel<br>Apple Silicon | >= 3.7 | clang >= 8.0 | - |
| Windows 10 & 11 | x86_64 | [>= 3.8](https://www.python.org/downloads/windows/) | - | Nvidia CUDA >= 10.2 [cuDNN](https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html#install-windows) |
* Python version >= 3.7
* CPU compiler (require at least one of the following)
* g++ (>=5.4.0)
* clang (>=8.0)
* GPU compiler (optional)
* nvcc (>=10.0 for g++ or >=10.2 for clang)
* GPU library: cudnn-dev (recommend tar file installation, [reference link](https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html#installlinux-tar))
**Windows** requirements are:
* Python: version >= 3.8 (we recommend installing from <https://www.python.org/downloads/windows/>)
* x86_64 CPU processor
* Windows 10 or above
Note#1: macOS users have to install additional dependencies, see [macOS install](#macOS-install).
Jittor offers three ways to install: pip, docker, or manual.
@ -142,7 +122,7 @@ python3.7 -m jittor.test.test_example
Please first install additional dependencies with [homebrew](https://brew.sh).
```bash
brew install python@3.7 onednn libomp
brew install onednn libomp
```
@ -433,3 +413,4 @@ Jittor is currently maintained by the [Tsinghua CSCG Group](https://cg.cs.tsingh
Jittor is Apache 2.0 licensed, as found in the LICENSE.txt file.

View File

@ -5,7 +5,7 @@
[Quickstart](#quickstart) | [Install](#install) | [Tutorial](#tutorial) | [Chinese](./README.cn.md)
[Quick Start](#快速开始) | [Install](#安装) | [Tutorial](#教程)
[Quick Start](#快速开始) | [Install](#安装) | [Tutorial](#教程) | [English](./README.md)
Jittor is a high-performance deep learning framework based on JIT compiling and meta-operators. The whole framework and meta-operators are compiled just-in-time. A powerful op compiler and tuner are integrated into Jittor. It allows us to generate high-performance code specialized for your model. Jittor also contains a wealth of high-performance model libraries, including: image recognition, detection, segmentation, generation, differentiable rendering, geometric learning, reinforcement learning, etc.
@ -22,6 +22,7 @@ Related Links:
* [Jittor Documents](https://cg.cs.tsinghua.edu.cn/jittor/assets/docs/index.html)
* [Github](https://github.com/jittor/jittor), [Gitee](https://gitee.com/jittor/jittor)
* [Jittor Forum](https://discuss.jittor.org/)
* [Awesome Jittor List](https://github.com/Jittor/jittor/blob/master/AWESOME-JITTOR-LIST.md)
* IM: QQ Group(761222083)
Related Links:
@ -31,6 +32,7 @@ Related Links:
* [Jittor Documents](https://cg.cs.tsinghua.edu.cn/jittor/assets/docs/index.html)
* [Github](https://github.com/jittor/jittor), [Gitee](https://gitee.com/jittor/jittor)
* [Jittor Forum](https://discuss.jittor.org/)
* [Awesome Jittor List](https://github.com/Jittor/jittor/blob/master/AWESOME-JITTOR-LIST.md)
* IM: QQ Group(761222083)
@ -115,52 +117,17 @@ We provide some jupyter notebooks to help you quick start with Jittor.
## Install
The Jittor framework has the following environment requirements:
Jittor supports **Linux** (e.g. Ubuntu/CentOS/Arch), **macOS**, and **Windows**. The dependencies for **Linux** and **macOS** are:
* Python version >= 3.7
* C++ compiler (at least one of the following)
    - g++ >= 5.4.0 for Linux
    - clang >= 8.0 for macOS
* GPU compiler (optional): nvcc >= 10.0
* GPU acceleration library (optional): cudnn-dev (the cuDNN development package; the tar installation method is recommended, [reference link](https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html#installlinux-tar))
Jittor also supports mainstream domestic Chinese Linux distributions such as UOS, Kylin, Puhua, and Loongnix; installation follows the Linux pip method, and only Python and g++ need to be prepared.
**Windows** requirements are:
* Python version >= 3.8 (installing from the official Python website is recommended: <https://www.python.org/downloads/windows/>)
* x86_64 processor
* Windows 10 or above.
If you prefer not to configure the environment by hand, we recommend installing with Docker.
In addition, you can also install via pip or manually.
Note 1: macOS users need to install additional dependencies, see [macOS install](#macOS-安装).
Jittor offers three installation methods: pip, Docker, and manual installation.
Jittor environment requirements:
* System: **Linux** (e.g. Ubuntu/CentOS/Arch), **macOS**, or **Windows**; the environment requirements of **Linux** and **macOS** are listed below:
| OS | CPU | Python | Compiler | (Optional) GPU platform |
|--------------------------------------------------------|-------------------------------------|--------|--------------|---------------------------------------------|
| Linux<br>(Ubuntu, CentOS, Arch, <br>UOS, KylinOS, ...) | x86 <br>x86_64 <br>ARM <br>loongson | >= 3.7 | g++ >=5.4 | Nvidia CUDA >= 10.0, [cuDNN](https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html#installlinux-tar) <br> or [AMD ROCm](https://docs.amd.com/) >= 4.0 <br> or [Hygon DCU DTK](https://tycloud.hpccube.com/doc/1.0.6/11277/general-handbook/software-tutorial/jittor.html) >= 22.04 |
| macOS <br>(>= 10.14 Mojave) | intel<br>Apple Silicon | >= 3.7 | clang >= 8.0 | - |
| Windows 10 & 11 | x86_64 | [>= 3.8](https://www.python.org/downloads/windows/) | - | Nvidia CUDA >= 10.2 [cuDNN](https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html#install-windows) |
* Python version >= 3.7
* CPU compiler (require at least one of the following)
* g++ (>=5.4.0)
* clang (>=8.0)
* GPU compiler (optional)
* nvcc (>=10.0 for g++ or >=10.2 for clang)
* GPU library: cudnn-dev (recommend tar file installation, [reference link](https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html#installlinux-tar))
**Windows** requirements are:
* Python: version >= 3.8 (we recommend installing from <https://www.python.org/downloads/windows/>)
* x86_64 CPU processor
* Windows 10 or above
Note#1: macOS users have to install additional dependencies, see [macOS install](#macOS-install).
Jittor offers three installation methods: pip, Docker, and manual installation.
Jittor offers three ways to install: pip, docker, or manual.
@ -186,12 +153,12 @@ jittor会自动在路径中寻找合适的编译器, 如果您希望手动指定
### macOS install
On macOS, please use [homebrew](https://brew.sh) to install the additional dependencies (python>=3.7, onednn).
On macOS, please use [homebrew](https://brew.sh) to install the additional dependencies.
Please first install additional dependencies with [homebrew](https://brew.sh).
```bash
brew install python@3.7 onednn libomp
brew install onednn libomp
```
Afterwards you can install jittor via pip and test whether it runs successfully.
@ -203,7 +170,7 @@ python3.7 -m pip install jittor
python3.7 -m jittor.test.test_example
```
Currently, jittor only supports CPU computation on macOS.
Currently jittor only supports CPU in macOS.

View File

@ -9,7 +9,7 @@
# file 'LICENSE.txt', which is part of this source code package.
# ***************************************************************
__version__ = '1.3.5.11'
__version__ = '1.3.5.18'
from jittor_utils import lock
with lock.lock_scope():
ori_int = int
@ -168,7 +168,7 @@ single_log_capture = None
class log_capture_scope(_call_no_record_scope):
"""log capture scope
example::
Example::
with jt.log_capture_scope(log_v=0) as logs:
LOG.v("...")
@ -301,12 +301,13 @@ def array(data, dtype=None):
----------------
Example::
>>> jt.array(1)
jt.Var([1], dtype=int32)
>>> jt.array([0, 2.71, 3.14])
jt.Var([0. 2.71 3.14], dtype=float32)
>>> jt.array(np.arange(4, dtype=np.uint8))
jt.Var([0 1 2 3], dtype=uint8)
>>> jt.array(1)
jt.Var([1], dtype=int32)
>>> jt.array([0, 2.71, 3.14])
jt.Var([0. 2.71 3.14], dtype=float32)
>>> jt.array(np.arange(4, dtype=np.uint8))
jt.Var([0 1 2 3], dtype=uint8)
'''
if isinstance(data, core.Var):
if dtype is None:
@ -346,9 +347,10 @@ def random(shape, dtype="float32", type="uniform"):
----------------
Example::
>>> jt.random((2, 3))
jt.Var([[0.96788853 0.28334728 0.30482838]
[0.46107793 0.62798643 0.03457401]], dtype=float32)
>>> jt.random((2, 3))
jt.Var([[0.96788853 0.28334728 0.30482838]
[0.46107793 0.62798643 0.03457401]], dtype=float32)
'''
# TODO: move those code to core
if dtype == "float16":
@ -489,7 +491,7 @@ def var(x, dim=None, dims=None, unbiased=False, keepdims=False):
:param keepdim: if True, the output shape is same as input shape except for the dimension in dim.
:type keepdim: bool.
Example:
Example::
>>> a = jt.rand(3)
>>> a
@ -615,6 +617,64 @@ def clamp(x, min_v=None, max_v=None):
Var.clamp = clamp
def clamp_(x, min_v=None, max_v=None):
''' In-place version of clamp().
Args:
x (Jittor Var):
the input var
min_v (Number or Var, optional): lower bound of the clamp range
max_v (Number or Var, optional): upper bound of the clamp range
Return:
x itself after clamp.
'''
return x.assign(x.clamp(min_v=min_v, max_v=max_v))
Var.clamp_ = clamp_
def erfinv_(x):
''' In-place version of erfinv().
'''
return x.assign(x.erfinv())
Var.erfinv_ = erfinv_
def erf_(x):
''' In-place version of erf().
'''
return x.assign(x.erf())
Var.erf_ = erf_
def abs_(x):
''' In-place version of abs().
'''
return x.assign(x.abs())
Var.abs_ = abs_
def sigmoid_(x):
''' In-place version of sigmoid().
'''
return x.assign(x.sigmoid())
Var.sigmoid_ = sigmoid_
def sqrt_(x):
''' In-place version of sqrt().
'''
return x.assign(x.sqrt())
Var.sqrt_ = sqrt_
def add_(x, y):
''' In-place version of add().
'''
return x.assign(x.add(y))
Var.add_ = add_
def multiply_(x, y):
''' In-place version of multiply().
'''
return x.assign(x.multiply(y))
Var.multiply_ = multiply_
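The in-place helpers above simply assign the result of the corresponding functional op back into the same Var and, as the `clamp_` docstring notes, return that Var. A minimal usage sketch (illustrative only, assuming a build that includes these helpers):

```python
import jittor as jt

x = jt.array([-1.5, 0.3, 2.7])
x.clamp_(min_v=0.0, max_v=1.0)   # x is modified in place
print(x)                         # values now lie in [0, 1]

y = jt.array([4.0, 9.0])
y.sqrt_()                        # y becomes [2, 3]
y.add_(1.0)                      # y becomes [3, 4]
```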
def type_as(a, b):
return a.unary(op=b.dtype)
Var.type_as = type_as
@ -645,11 +705,51 @@ def pow(x, y):
return core.ops.pow(x, y)
Var.pow = Var.__pow__ = pow
def argmax(x, dim, keepdims:bool=False):
def argmax(x: Var, dim: int, keepdims:bool=False):
''' Returns the indices and values of the maximum elements along the specified dimension.
:param x: the input Var.
:type x: jt.Var, numpy array, or python sequence.
:param dim: the dimension to reduce.
:type dim: int.
:param keepdims: whether the output Var has dim retained or not. Defaults to False
:type keepdims: bool, optional
Example::
>>> a = jt.randn((2, 4))
>>> a
jt.Var([[-0.33272865 -0.4951588 1.4128606 0.13734372]
[-1.633469 0.19593953 -0.7803732 -0.5260756 ]], dtype=float32)
>>> a.argmax(dim=0)
(jt.Var([0 1 0 0], dtype=int32), jt.Var([-0.33272865 0.19593953 1.4128606 0.13734372], dtype=float32))
>>> a.argmax(dim=1)
(jt.Var([2 1], dtype=int32), jt.Var([1.4128606 0.19593953], dtype=float32))
'''
return jt.arg_reduce(x, "max", dim, keepdims)
Var.argmax = argmax
def argmin(x, dim, keepdims:bool=False):
def argmin(x, dim: int, keepdims:bool=False):
''' Returns the indices and values of the minimum elements along the specified dimension.
:param x: the input Var.
:type x: jt.Var, numpy array, or python sequence.
:param dim: the dimension to reduce.
:type dim: int.
:param keepdims: whether the output Var has dim retained or not. Defaults to False
:type keepdims: bool, optional
Example::
>>> a = jt.randn((2, 4))
>>> a
jt.Var([[-0.33272865 -0.4951588 1.4128606 0.13734372]
[-1.633469 0.19593953 -0.7803732 -0.5260756 ]], dtype=float32)
>>> a.argmin(dim=0)
(jt.Var([1 0 1 1], dtype=int32), jt.Var([-1.633469 -0.4951588 -0.7803732 -0.5260756], dtype=float32))
>>> a.argmin(dim=1)
(jt.Var([1 0], dtype=int32), jt.Var([-0.4951588 -1.633469 ], dtype=float32))
'''
return jt.arg_reduce(x, "min", dim, keepdims)
Var.argmin = argmin
@ -665,7 +765,7 @@ def randn(*size, dtype="float32", requires_grad=True) -> Var:
:param requires_grad: whether to enable gradient back-propagation, defaults to True.
:type requires_grad: bool, optional
Example:
Example::
>>> jt.randn(3)
jt.Var([-1.019889 -0.30377278 -1.4948598 ], dtype=float32)
@ -690,7 +790,7 @@ def rand(*size, dtype="float32", requires_grad=True) -> Var:
:param requires_grad: whether to enable gradient back-propagation. Defaults to True.
:type requires_grad: bool, optional
Example:
Example::
>>> jt.rand(3)
jt.Var([0.31005102 0.02765604 0.8150749 ], dtype=float32)
@ -713,7 +813,7 @@ def rand_like(x, dtype=None) -> Var:
Otherwise, use the specified dtype. Defaults to None.
:type dtype: str, optional
Example:
Example::
>>> x = jt.zeros((2, 3))
>>> jt.rand_like(x)
@ -733,7 +833,7 @@ def randn_like(x, dtype=None) -> Var:
Otherwise, use the specified dtype. Defaults to None.
:type dtype: str, optional
Example:
Example::
>>> x = jt.zeros((2, 3))
>>> jt.randn_like(x)
@ -758,16 +858,16 @@ def randint(low, high=None, shape=(1,), dtype="int32") -> Var:
:param dtype: data type of the output, defaults to "int32".
:type dtype: str, optional
Example:
Example::
>>> jt.randint(3, shape=(3, 3))
jt.Var([[2 0 2]
[2 1 2]
[2 0 1]], dtype=int32)
[2 1 2]
[2 0 1]], dtype=int32)
>>> jt.randint(1, 3, shape=(3, 3))
jt.Var([[2 2 2]
[1 1 2]
[1 1 1]], dtype=int32)
[1 1 2]
[1 1 1]], dtype=int32)
'''
if high is None: low, high = 0, low
v = (jt.random(shape) * (high - low) + low).clamp(low, high-0.5)
@ -786,15 +886,15 @@ def randint_like(x, low, high=None) -> Var:
:param high: One above the highest integer to be drawn from the distribution.
:type high: int
Example:
Example::
>>> x = jt.zeros((2, 3))
>>> jt.randint_like(x, 10)
jt.Var([[9. 3. 4.]
[4. 8. 5.]], dtype=float32)
[4. 8. 5.]], dtype=float32)
>>> jt.randint_like(x, 10, 20)
jt.Var([[17. 11. 18.]
[14. 17. 15.]], dtype=float32)
[14. 17. 15.]], dtype=float32)
'''
return randint(low, high, x.shape, x.dtype)
@ -817,15 +917,15 @@ def normal(mean, std, size=None, dtype="float32") -> Var:
:param dtype: data type of the output, defaults to "float32".
:type dtype: str, optional
Example:
Example::
>>> jt.normal(5, 3, size=(2,3))
jt.Var([[ 8.070848 7.654219 10.252696 ]
[ 6.383718 7.8817277 3.0786133]], dtype=float32)
[ 6.383718 7.8817277 3.0786133]], dtype=float32)
>>> mean = jt.randint(low=0, high=10, shape=(10,))
>>> jt.normal(mean, 0.1)
jt.Var([1.9524184 1.0749301 7.9864206 5.9407325 8.1596155 4.824019 7.955083
8.972998 6.0674286 8.88026 ], dtype=float32)
8.972998 6.0674286 8.88026 ], dtype=float32)
'''
if size is None:
if isinstance(mean, Var) and isinstance(std, Var):
@ -996,17 +1096,18 @@ class Module:
----------------
Example::
>>> net = nn.Sequential(nn.Linear(2, 10), nn.ReLU(), nn.Linear(10, 2))
>>> for p in net.parameters():
... print(p.name)
...
>>> for p in net.parameters():
... print(p.name())
...
0.weight
0.bias
2.weight
2.bias
>>> net = nn.Sequential(nn.Linear(2, 10), nn.ReLU(), nn.Linear(10, 2))
>>> for p in net.parameters():
... print(p.name)
...
>>> for p in net.parameters():
... print(p.name())
...
0.weight
0.bias
2.weight
2.bias
'''
ps = []
stack = []
@ -1093,14 +1194,15 @@ class Module:
----------------
Example::
>>> net = nn.Linear(2, 5)
>>> net.named_parameters()
[('weight', jt.Var([[ 0.5964666 -0.3175258 ]
[ 0.41493994 -0.66982657]
[-0.32677156 0.49614117]
[-0.24102807 -0.08656466]
[ 0.15868133 -0.12468725]], dtype=float32)),
('bias', jt.Var([-0.38282675 0.36271113 -0.7063226 0.02899247 0.52210844], dtype=float32))]
>>> net = nn.Linear(2, 5)
>>> net.named_parameters()
[('weight', jt.Var([[ 0.5964666 -0.3175258 ]
[ 0.41493994 -0.66982657]
[-0.32677156 0.49614117]
[-0.24102807 -0.08656466]
[ 0.15868133 -0.12468725]], dtype=float32)),
('bias', jt.Var([-0.38282675 0.36271113 -0.7063226 0.02899247 0.52210844], dtype=float32))]
'''
state_dict = self.state_dict()
@ -1118,13 +1220,14 @@ class Module:
----------------
Example::
>>> net = nn.Sequential(nn.Linear(2, 10), nn.ReLU(), nn.Linear(10, 2))
>>> net.modules()
[Sequential(
0: Linear(2, 10, float32[10,], None)
1: relu()
2: Linear(10, 2, float32[2,], None)
), Linear(2, 10, float32[10,], None), relu(), Linear(10, 2, float32[2,], None)]
>>> net = nn.Sequential(nn.Linear(2, 10), nn.ReLU(), nn.Linear(10, 2))
>>> net.modules()
[Sequential(
0: Linear(2, 10, float32[10,], None)
1: relu()
2: Linear(10, 2, float32[2,], None)
), Linear(2, 10, float32[10,], None), relu(), Linear(10, 2, float32[2,], None)]
'''
ms = []
def callback(parents, k, v, n):
@ -1139,13 +1242,14 @@ class Module:
----------------
Example::
>>> net = nn.Sequential(nn.Linear(2, 10), nn.ReLU(), nn.Linear(10, 2))
>>> net.named_modules()
[('', Sequential(
0: Linear(2, 10, float32[10,], None)
1: relu()
2: Linear(10, 2, float32[2,], None)
)), ('0', Linear(2, 10, float32[10,], None)), ('1', relu()), ('2', Linear(10, 2, float32[2,], None))]
>>> net = nn.Sequential(nn.Linear(2, 10), nn.ReLU(), nn.Linear(10, 2))
>>> net.named_modules()
[('', Sequential(
0: Linear(2, 10, float32[10,], None)
1: relu()
2: Linear(10, 2, float32[2,], None)
)), ('0', Linear(2, 10, float32[10,], None)), ('1', relu()), ('2', Linear(10, 2, float32[2,], None))]
'''
ms = []
stack = []
@ -1369,7 +1473,7 @@ Arguments of hook are defined as::
:param path: path to save.
:type path: str
Example:
Example::
>>> class Net(nn.Module):
>>> ...
@ -1392,7 +1496,7 @@ Arguments of hook are defined as::
:param path: path to load.
:type path: str
Example:
Example::
>>> class Net(nn.Module):
>>> ...
@ -1621,14 +1725,16 @@ def grad_hooker(args, hook):
def register_hook(v, hook):
""" Register a hook on any jittor Variable; if the hook returns a value that is not None,
the gradient of this variable will be altered, Example::
the gradient of this variable will be altered,
x = jt.array([0.0, 0.0])
y = x * [1,2]
y.register_hook(lambda g: g*2)
dx = jt.grad(y, x)
print(dx)
# will be [2, 4]
Example::
x = jt.array([0.0, 0.0])
y = x * [1,2]
y.register_hook(lambda g: g*2)
dx = jt.grad(y, x)
print(dx)
# will be [2, 4]
"""
def _hook(grads):
@ -1710,7 +1816,9 @@ def jittor_exit():
atexit.register(jittor_exit)
def vtos(v):
return f"jt.Var({v.data}, dtype={v.dtype})"
data_str = f"jt.Var({v.data}, dtype={v.dtype})"
data_str = data_str.replace("\n", "\n ")
return data_str
Var.__str__ = vtos
Var.__repr__ = vtos
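The reworked `vtos` indents the continuation lines of multi-line arrays so the data stays visually aligned inside the `jt.Var(...)` wrapper (the exact padding follows the replacement string above). A quick, informal way to see the effect:

```python
import jittor as jt

x = jt.zeros((2, 3))
print(x)
# Prints something like:
#   jt.Var([[0. 0. 0.]
#            [0. 0. 0.]], dtype=float32)
# where the second row is indented instead of starting at column 0.
```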

File diff suppressed because it is too large

View File

@ -1329,6 +1329,9 @@ if use_data_gz:
f.write(md5)
files.append(data_o_path)
files = [f for f in files if "__data__" not in f]
else:
files = [f for f in files
if "__data__" not in f or "src" in f.split("__data__")[1]]
cc_flags += f" -l\"jit_utils_core{lib_suffix}\" "
compile(cc_path, cc_flags+opt_flags, files, 'jittor_core'+extension_suffix)

View File

@ -691,7 +691,7 @@ def trunc_normal_(var, mean=0., std=1., a=-2., b=2.):
print(linear.weight)
linear.weight.trunc_normal_(std=.02) # This is ok too
"""
return _no_grad_trunc_normal_(var, mean, std, a, b)
return var.assign(_no_grad_trunc_normal_(var, mean, std, a, b))
Var.trunc_normal_ = trunc_normal_
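With the fix above, `trunc_normal_` now writes the truncated-normal samples back into the Var instead of only returning them. A small sanity check, illustrative only:

```python
import numpy as np
import jittor as jt
from jittor import nn

linear = nn.Linear(4, 4)
before = linear.weight.numpy().copy()
linear.weight.trunc_normal_(std=0.02)   # re-initializes the weight in place
assert not np.allclose(linear.weight.numpy(), before)
```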
def _no_grad_trunc_normal_(var, mean, std, a, b):

View File

@ -525,20 +525,224 @@ _triple = _ntuple(3)
_quadruple = _ntuple(4)
def unique(x):
def unique(
input: jt.Var,
return_inverse: bool=False,
return_counts: bool=False,
dim: int=None):
r'''
Returns the unique elements of the input tensor.
Args:
x: the input tensor.
input (Var): the input var.
return_inverse (bool): whether to also return the indices of where elements in the original input ended up in the returned unique list. Default: False.
return_counts (bool): whether to also return the counts of each unique element. Default: False.
dim (int): the dimension to apply unique on. If None, the unique of the flattened input is returned. Default: None.
Example::
>>> jittor.unique(jittor.array([1, 3, 2, 3]))
jt.Var([1 2 3], dtype=int32)
>>> jittor.unique(jittor.array([1, 3, 2, 3, 2]), return_inverse=True, return_counts=True)
(jt.Var([1 2 3], dtype=int32), jt.Var([0 2 1 2 1], dtype=int32), jt.Var([1 2 2], dtype=int32))
>>> jittor.unique(jittor.array([[1, 3], [2, 3]]), return_inverse=True)
(jt.Var([1 2 3], dtype=int32), jt.Var([[0 2]
[1 2]], dtype=int32))
>>> jittor.unique(jittor.array([[1, 3], [1, 3]]), dim=0)
jt.Var([[1 3]], dtype=int32)
'''
x = x.reshape(-1)
_,x = jt.argsort(x)
index,= jt.index((x.shape[0],))
y = x[1:][x[index[1:]] != x[index[:-1]]]
x = jt.concat([x[:1],y],dim=0)
return x
temp_shape = None
if dim is None:
temp_shape = list(input.shape)
input_flatten = input.flatten()
dim = 0
else:
input_flatten = input
input_flatten = input_flatten.transpose(dim, 0)
orig_shape = input_flatten.shape
input_flatten = input_flatten.view(orig_shape[0], -1)
with jt.flag_scope(compile_options = {"FLAGS: --extended-lambda ": 1} if jt.flags.use_cuda else {}):
indice = jt.code((input_flatten.shape[0], ), 'int32', [input_flatten],
cpu_header='''
#include <algorithm>
''',
cpu_src='''
@alias(input_flatten, in0)
@alias(indice, out)
int dimlen = input_flatten_shape0, dimsize = input_flatten_shape1;
for(int i = 0; i < dimlen; ++i) @indice(i) = i;
std::sort(&@indice(0), &@indice(dimlen), [&](int a, int b){
for(int i = 0; i < dimsize; ++i) {
int lhs = @input_flatten(a, i), rhs = @input_flatten(b, i);
if (lhs != rhs) return lhs < rhs;
}
return false;
});
''',
cuda_header='''
#undef out
#include <thrust/extrema.h>
#include <thrust/device_ptr.h>
#include <thrust/execution_policy.h>
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <cub/cub.cuh>
#include <executor.h>
''',
cuda_src=
'''
@alias(input_flatten, in0)
@alias(indice, out)
int dimlen = indice_shape0, dimsize = input_flatten_shape1;
if (dimsize == 1) {
size_t raw_allocation, d_allocation, temp_storage_bytes = 0;
void *d_temp_storage = NULL;
int32_t* raw_ptr = (int32_t*)exe.allocator->alloc(dimlen * (sizeof(int32_t) + sizeof(input_flatten_type)), raw_allocation);
thrust::device_ptr<int32_t> arange_ptr = thrust::device_pointer_cast(raw_ptr);
thrust::sequence(arange_ptr, arange_ptr + dimlen);
cub::DeviceRadixSort::SortPairs(d_temp_storage, temp_storage_bytes, input_flatten_p,
(input_flatten_type*)(raw_ptr + dimlen), thrust::raw_pointer_cast(arange_ptr), indice_p, dimlen);
d_temp_storage = exe.allocator->alloc(temp_storage_bytes, d_allocation);
cub::DeviceRadixSort::SortPairs(d_temp_storage, temp_storage_bytes, input_flatten_p,
(input_flatten_type*)(raw_ptr + dimlen), thrust::raw_pointer_cast(arange_ptr), indice_p, dimlen);
exe.allocator->free(raw_ptr, dimlen * (sizeof(int) + sizeof(input_flatten_type)), raw_allocation);
exe.allocator->free(d_temp_storage, temp_storage_bytes, d_allocation);
} else {
thrust::device_ptr<input_flatten_type> input_ptr = thrust::device_pointer_cast(input_flatten_p);
thrust::device_ptr<int32_t> indice_ptr = thrust::device_pointer_cast(indice_p);
thrust::sequence(indice_ptr, indice_ptr + dimlen);
thrust::sort(thrust::device, indice_ptr, indice_ptr + dimlen,
[=] __device__ (int32_t a, int32_t b)->bool {
for(int i = 0; i < dimsize; ++i) {
input_flatten_type lhs = input_ptr[i + a * dimsize],
rhs = input_ptr[i + b * dimsize];
if (lhs != rhs) return lhs < rhs;
}
return false;
});
}
'''
)
input_sorted = input_flatten[indice][:]
dimlen = indice.shape[0]
diff = jt.logical_not(jt.all(input_sorted[1:] == input_sorted[: -1], 1))
diff = jt.concat([jt.Var([False]), diff], 0)
diff = jt.array(diff, dtype = jt.int32)
with jt.flag_scope(compile_options = {"FLAGS: --extended-lambda ": 1} if jt.flags.use_cuda else {}):
output, inverse = jt.code(
[(-input_sorted.shape[0], ), (indice.shape)],
[input_sorted.dtype, indice.dtype],
[input_sorted, diff, indice],
cpu_header='''
#include <algorithm>
@alias(input_sorted, in0)
@alias(diff, in1)
@alias(indice, in2)
@alias(output, out0)
@alias(inverse, out1)
''',
cpu_src=
f"bool return_inverse = {int(return_inverse)};" +
'''
int tot = -1;
for (int i = 0; i < input_sorted_shape0; ++i) {
if (i == 0 || @diff(i)) {
++tot; @output(tot) = i;
}
if (return_inverse)
@inverse(@indice(i)) = tot;
}
output->set_shape({tot + 1});
''',
cuda_header='''
#undef out
#include <thrust/extrema.h>
#include <thrust/device_ptr.h>
#include <thrust/execution_policy.h>
#include <thrust/scan.h>
#include <executor.h>
@alias(input_sorted, in0)
@alias(diff, in1)
@alias(indice, in2)
@alias(output, out0)
@alias(inverse, out1)
''',
cuda_src=
f"bool return_inverse = {int(return_inverse)};" +
'''
int dimlen = input_sorted_shape0, dimsize = input_sorted_shape1;
size_t raw_allocation;
int32_t* raw_ptr = (int32_t*)exe.allocator->alloc(2 * dimlen * sizeof(int), raw_allocation);
thrust::device_ptr<int32_t> diff_ptr = thrust::device_pointer_cast(diff_p),
inverse_ptr = thrust::device_pointer_cast(inverse_p),
array_ptr = thrust::device_pointer_cast(raw_ptr),
sum_ptr = thrust::device_pointer_cast(raw_ptr + dimlen),
indice_ptr = thrust::device_pointer_cast(indice_p);
thrust::device_ptr<input_sorted_type> input_ptr = thrust::device_pointer_cast(input_sorted_p);
if (return_inverse) {
thrust::inclusive_scan(diff_ptr, diff_ptr + dimlen, sum_ptr);
thrust::scatter(sum_ptr, sum_ptr + dimlen, indice_ptr, inverse_ptr);
}
thrust::sequence(array_ptr, array_ptr + dimlen);
int32_t num = thrust::unique(array_ptr, array_ptr + dimlen,
[=] __device__ (int32_t a, int32_t b)->bool {
for(int i = 0; i < dimsize; ++i) {
input_sorted_type lhs = input_ptr[i + a * dimsize],
rhs = input_ptr[i + b * dimsize];
if (lhs != rhs) return false;
}
return true;
}) - array_ptr;
cudaMemcpy(output_p, raw_ptr, sizeof(int32_t) * num, cudaMemcpyDeviceToDevice);
exe.allocator->free(raw_ptr, 2 * dimlen * sizeof(int32_t), raw_allocation);
output->set_shape({ num });
'''
)
indice_shape = (output.shape[0], )
output = input_sorted[output][:]
new_shape = list(orig_shape[1:])
new_shape.insert(0, -1)
output = output.view(new_shape).transpose(dim, 0)
if temp_shape is not None:
inverse = inverse.view(temp_shape).transpose(dim, 0)
if return_inverse:
if return_counts:
counts = jt.zeros(indice_shape, dtype=jt.int32)
jt.scatter_(counts, 0, inverse.flatten(), jt.ones(dimlen), reduce='add')
return output, inverse, counts
else:
return output, inverse
else:
return output
jt.Var.unique = unique
@ -558,11 +762,14 @@ jt.Var.deg2rad = deg2rad
def arctan2(y,x):
angle = jt.zeros(x.shape,dtype=x.dtype)
x = (x!=0.0).ternary(x, x+1e-30)
x = (x!=0.0).ternary(x, 1e-30)
angle = (y/x).arctan()
mask = y<0 | ((y==0) & (x<0))
mask = (x<0)&(y<0)
angle = angle - mask*np.pi
mask = (x<0)&(y>=0)
angle = angle + mask*np.pi
return angle
atan2 = arctan2
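The corrected quadrant handling can be sanity-checked against NumPy; a minimal sketch (the updated unit test further below does the same more thoroughly):

```python
import numpy as np
import jittor as jt

x = jt.float32([1, -1, -1, 0])
y = jt.float32([1,  1, -1, -1])
np.testing.assert_allclose(jt.arctan2(y, x).numpy(),
                           np.arctan2(y.numpy(), x.numpy()), atol=1e-6)
```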
def nonzero(x):
@ -1176,8 +1383,11 @@ def numpy_cumprod(a, dim):
return func(a, dim)
def linspace(start, end, steps):
res = jt.index((steps,))[0]
res = res*float((end-start)/(steps-1))+start
if steps > 1:
res = jt.index((steps,))[0]
res = res*float((end-start)/(steps-1))+start
else:
res = jt.array([start])
return res
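The `steps > 1` guard avoids the division by zero that previously occurred for a single step. A short sketch of the intended behaviour (assuming `linspace` is re-exported as `jt.linspace`):

```python
import jittor as jt

print(jt.linspace(0, 1, 5))   # 0, 0.25, 0.5, 0.75, 1
print(jt.linspace(3, 10, 1))  # previously divided by (steps - 1) == 0; now returns [3]
```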
def randperm(n, dtype="int32"):

View File

@ -12,7 +12,6 @@
# file 'LICENSE.txt', which is part of this source code package.
# ***************************************************************
from abc import abstractmethod
from sys import breakpointhook
import jittor as jt
from jittor import flatten, init, Module
import numpy as np
@ -262,7 +261,8 @@ def sign(x: jt.Var) -> jt.Var:
def gelu(x):
r''' Applies the element-wise function:
.. math:: \text{GELU}(x) = x * \Phi(x)
.. math::
\text{GELU}(x) = x * \Phi(x)
where :math:`\Phi(x)` is the Cumulative Distribution Function for Gaussian Distribution.
@ -546,6 +546,31 @@ class Dropout(Module):
def dropout(x,p=0.5,is_train=False):
return Dropout(p=p,is_train=is_train)(x)
class DropPath(Module):
'''Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).
'''
def __init__(self, p=0.5, is_train=False):
'''
:param p: the probability of dropping each sample's path; the residual branch is kept with probability 1 - p. Defaults to 0.5.
:type p: float
:param is_train: specifies whether the module is in training mode; when False, DropPath is an identity mapping. Defaults to False.
:type is_train: bool
'''
self.p = p
self.is_train = is_train
#TODO: test model.train() to change self.is_train
def execute(self, x):
if self.p == 0. or not self.is_train:
return x
keep_prob = 1 - self.p
shape = (x.shape[0], ) + (1, ) * (x.ndim - 1)
random_tensor = keep_prob + jt.rand(shape, dtype=x.dtype)
output = x.divide(keep_prob) * random_tensor.floor()
return output
def droppath(x,p=0.5,is_train=False):
return DropPath(p=p,is_train=is_train)(x)
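A brief usage sketch for the new `DropPath` module: during training each sample's whole path is zeroed with probability `p` and the kept samples are rescaled by `1/(1-p)`; with `is_train=False` (the default) it is an identity mapping:

```python
import jittor as jt
from jittor import nn

drop = nn.DropPath(p=0.2, is_train=True)
x = jt.ones((4, 8, 8))      # a batch of 4 samples
y = drop(x)                 # each sample is either all zeros or scaled by 1/0.8
print(y[:, 0, 0])           # roughly 20% of the entries are 0, the rest 1.25
```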
class Linear(Module):
def __init__(self, in_features, out_features, bias=True):
self.in_features = in_features
@ -755,7 +780,23 @@ LeakyReLU = Leaky_relu
ReLU6 = jt.make_module(relu6)
Softmax = jt.make_module(softmax, 2)
GELU = jt.make_module(gelu)
Flatten = jt.make_module(jt.flatten)
class Flatten(Module):
''' Flattens the contiguous range of dimensions in a Var.
:param start_dim: the first dimension to be flattened. Default: 1.
:type start_dim: int
:param end_dim: the last dimension to be flattened. Default: -1.
:type end_dim: int
'''
def __init__(self, start_dim=1, end_dim=-1):
self.start_dim = start_dim
self.end_dim = end_dim
def execute(self, x) -> jt.Var:
return x.flatten(self.start_dim, self.end_dim)
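`Flatten` is now a configurable module rather than a thin wrapper around `jt.flatten`, so the flattened dimension range can be chosen per layer. A minimal sketch:

```python
import jittor as jt
from jittor import nn

x = jt.random((2, 3, 4, 5))
print(nn.Flatten()(x).shape)                        # [2, 60]  (flattens dims 1..-1)
print(nn.Flatten(start_dim=0, end_dim=1)(x).shape)  # [6, 4, 5]
```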
from jittor.depthwise_conv import DepthwiseConv
@ -1744,6 +1785,23 @@ def resize(img, size, mode="nearest", align_corners=False, tf_mode=False):
elif mode == 'nearest':
x = hid * (h / H)
y = wid * (w / W)
elif mode == "area":
# Area interpolation: average the input over adaptive pooling regions to resize the original image.
stride = (h // H, w // W)
assert stride[0] > 0 and stride[1] > 0
x, y = jt.meshgrid(jt.arange(0, H, 1), jt.arange(0, W, 1))
startH = jt.floor(x*h/H).int32()
endH = jt.ceil((x+1)*h/H).int32()
maxH = int(jt.max(endH - startH).data)
startW = jt.floor(y*w/W).int32()
endW = jt.ceil((y+1)*w/W).int32()
maxW = int(jt.max(endW - startW).data)
pixel_count = (endH - startH) * (endW - startW)
adaptive_output = img.reindex([img.shape[0], img.shape[1], H, W, maxH, maxW], ["i0", "i1", "@e0(i2, i3) + i4", "@e2(i2, i3) + i5"], extras=[startH, endH, startW, endW], overflow_conditions=["i4 >= @e1(i2, i3) - @e0(i2, i3)", "i5 >= @e3(i2, i3) - @e2(i2, i3)"], overflow_value=0)
adaptive_output = adaptive_output.reduce("sum", [4,5]) / pixel_count[None, None, ...]
return adaptive_output
else:
if (tf_mode):
x = hid * (h / H)
@ -2799,4 +2857,91 @@ def _fft2(x, inverse=False):
y = jt.compile_extern.cufft_ops.cufft_fft(x, inverse)
if inverse:
y /= x.shape[1] * x.shape[2]
return y
return y
def one_hot(x: jt.Var, num_classes: int=-1) -> jt.Var:
''' Returns the one_hot encoding of inputs.
:param x: class values of any shape
:type x: jt.Var with bool or integer dtype
:param num_classes: Total number of classes. If set to -1, the number of classes will be inferred as one greater than the largest class value in the input tensor.
:type num_classes: int, optional
:return: a Var with one more dimension with 1 values at the index
of last dimension indicated by the input, and 0 everywhere else.
:rtype: jt.Var
.. note::
    If a value in x is greater than or equal to num_classes or less than 0,
    the corresponding returned one_hot row will be all zeros.
Example::
>>> jt.nn.one_hot(jt.arange(5) % 3)
jt.Var([[1 0 0]
[0 1 0]
[0 0 1]
[1 0 0]
[0 1 0]], dtype=int32)
>>> jt.nn.one_hot(jt.arange(5) % 3, num_classes=5)
jt.Var([[1 0 0 0 0]
[0 1 0 0 0]
[0 0 1 0 0]
[1 0 0 0 0]
[0 1 0 0 0]], dtype=int32)
>>> jt.nn.one_hot(jt.arange(6).reshape(3,2) % 3)
jt.Var([[[1 0 0]
[0 1 0]]
[[0 0 1]
[1 0 0]]
[[0 1 0]
[0 0 1]]], dtype=int32)
'''
assert x.dtype in [jt.bool, jt.int8, jt.int16, jt.int32, jt.int64, jt.uint8, jt.uint16, jt.uint32, jt.uint64]
if num_classes == -1:
num_classes = x.max().item() + 1
N = len(x.shape)
indices = ["i"+str(i) for i in range(N)]
y = jt.ones_like(x).reindex(
x.shape + [num_classes],
indices,
extras=[x],
overflow_conditions=[f"i{N} != @e0({','.join(indices)})"],
overflow_value=0)
return y
class KLDivLoss(Module):
''' Computes the Kullback-Leibler divergence loss.
'''
def __init__(self, reduction: str = 'mean', log_target: bool = False):
'''
:param reduction: Specifies the reduction to apply to the output. Can be 'mean', 'sum', 'batchmean', or 'none'. Defaults to 'mean'.
:type reduction: str, optional
:param log_target: Specifies whether `target` is given in log space. Defaults to False.
:type log_target: bool, optional
'''
self.reduction = reduction
self.log_target = log_target
def execute(self, input: jt.Var, target: jt.Var) -> jt.Var:
if not self.log_target:
loss_pointwise = target * (target.log() - input)
else:
loss_pointwise = target.exp() * (target - input)
if self.reduction == "mean":
loss = loss_pointwise.mean()
elif self.reduction == "batchmean":
loss = loss_pointwise.sum() / input.size(0)
elif self.reduction == "sum":
loss = loss_pointwise.sum()
else:
loss = loss_pointwise
return loss
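A short sketch of the new `KLDivLoss`, following the usual convention that `input` holds log-probabilities and `target` holds probabilities unless `log_target=True` (illustrative only):

```python
import jittor as jt
from jittor import nn

pred   = nn.softmax(jt.randn(4, 10), dim=1).log()   # log-probabilities
target = nn.softmax(jt.randn(4, 10), dim=1)         # probabilities

criterion = nn.KLDivLoss(reduction='mean')
print(criterion(pred, target))                      # scalar Var
```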

View File

@ -39,6 +39,7 @@ Executor exe;
EXTERN_LIB MemoryProfiler memory_profiler;
DECLARE_FLAG(int, profile_memory_enable);
DEFINE_FLAG(int, gopt_disable, 0, "Disable graph optimizer.");
DEFINE_FLAG(int, use_threading, 0, "Allow to use python threading with jittor.");
// from fetch_op.cc
EXTERN_LIB list<VarPtr> fetcher_to_free;
@ -192,7 +193,7 @@ static void top_weak_sync(vector<Var*>& vars) {
}
void Executor::run_sync(vector<Var*> vars, bool device_sync, bool weak_sync) {
if (weak_sync)
if (weak_sync && !use_threading)
top_weak_sync(vars);
auto allocator = get_allocator();
auto temp_allocator = get_allocator(true);
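Like other `DEFINE_FLAG` values, the new `use_threading` flag should be reachable from Python through `jt.flags`; a hypothetical sketch of enabling it when Jittor ops are issued from several Python threads (exact semantics follow the C++ change above):

```python
import jittor as jt

# Allow driving Jittor from Python threads; with this flag set,
# Executor::run_sync skips the weak synchronization step shown above.
jt.flags.use_threading = 1
```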

View File

@ -69,6 +69,16 @@ struct JitKey {
uint data;
explicit Oxhex2(uint data) : data(data) {}
};
struct dec1 {
uint data;
explicit dec1(uint data) : data(data) {}
};
struct dec2 {
uint data;
explicit dec2(uint data) : data(data) {}
};
};
struct __jk_int128 {
@ -173,6 +183,30 @@ inline JK& operator<<(JK& jk, const JK::Oxhex2& h) {
return jk << "0x" << JK::hex2(h.data);
}
inline JK& operator<<(JK& jk, const JK::dec2& h) {
uint8 a = h.data % 10;
uint8 b = h.data / 10;
if (b) jk << (char)(b+'0');
return jk << (char)(a+'0');
}
inline JK& operator<<(JK& jk, const JK::dec1& h) {
uint8 a = h.data % 10;
return jk << (char)(a+'0');
}
inline std::ostream& operator<<(std::ostream& os, const JK::dec2& h) {
uint8 a = h.data % 10;
uint8 b = h.data / 10;
if (b) os << (char)(b+'0');
return os << (char)(a+'0');
}
inline std::ostream& operator<<(std::ostream& os, const JK::dec1& h) {
uint8 a = h.data % 10;
return os << (char)(a+'0');
}
inline JK& operator<<(JK& jk, int c) {
if (c<0) {
c = -c;

View File

@ -67,26 +67,39 @@ void display_memory_info(const char* fileline, bool dump_var, bool red_color) {
<< "lived_vars:" << Var::number_of_lived_vars
<< "lived_ops:" << Op::number_of_lived_ops >> '\n';
#ifdef NODE_MEMCHECK
// get the oldest var
// vector<Node*> queue;
// auto t = ++Node::tflag_count;
// for (auto& vh : hold_vars)
// if (vh->var->tflag != t) {
// vh->var->tflag = t;
// queue.push_back(vh->var);
// }
// bfs_both(queue, [](Node*){return true;});
// vector<pair<int64, Node*>> nodes;
// nodes.reserve(queue.size());
// for (auto* node : queue)
// nodes.push_back({node->__id(), node});
// std::sort(nodes.begin(), nodes.end());
// log << "list of the oldest nodes:\n";
// for (int i=0; i<10 && i<nodes.size(); i++) {
// log << "ID#" >> nodes[i].first >> ":" << nodes[i].second << "\n";
// }
#endif
if (trace_py_var) {
vector<Node*> queue;
auto t = ++Node::tflag_count;
for (auto& vh : hold_vars)
if (vh->var->tflag != t) {
vh->var->tflag = t;
queue.push_back(vh->var);
}
bfs_both(queue, [](Node*){return true;});
static unordered_map<int64, int> cnt;
auto cnt_bk = cnt;
map<int,int> stat;
for (auto* node : queue) {
auto &x = cnt[node->id];
x++;
if (x == 3 && node->is_var()) {
LOGe << node;
}
stat[x]++;
}
for (auto x : cnt_bk) {
if (x.second == cnt[x.first]) {
cnt.erase(x.first);
}
}
LOGe << stat << lived_nodes_id.size();
for (auto nid : lived_nodes_id) {
if (!cnt.count(nid.first)) {
LOGe << nid;
}
}
}
if (use_stat_allocator) {
log << "stat:" << use_stat_allocator;

View File

@ -95,6 +95,21 @@ static unordered_set<string> float_ops = {
"sqrt",
"mean",
"divide",
"sin",
"asin",
"sinh",
"asinh",
"tan",
"atan",
"tanh",
"atanh",
"cos",
"acos",
"cosh",
"acosh",
"sigmoid",
"erf",
"erfinv"
};
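Presumably this set drives dtype inference, so the listed unary ops now produce a floating-point result even for integer inputs; if so, the behaviour can be checked from the Python side with a sketch like this (not part of the original tests):

```python
import jittor as jt

x = jt.array([0, 1, 2])    # int32 input
print(x.sin().dtype)       # expected: float32, since "sin" is now listed as a float op
print(x.erf().dtype)       # expected: float32
```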
static unordered_set<string> int_ops = {
"round_int",

View File

@ -100,13 +100,13 @@ VarPtr CodeOp::grad(Var* out, Var* dout, Var* v, int v_index) {
// TODO: remove unused deps
// dout -> dout
std::stringstream new_alias;
new_alias << "\n@alias(dout,in" << inputs.size() << ")\n";
new_alias << "\n@alias(dout,in" << JK::dec2(inputs.size()) << ")\n";
inputs.push_back(dout);
// _outputs[i] -> poutj
for (int i=0; i<_outputs.size(); i++) {
new_alias << "\n@alias(pout" << i << ",in" << inputs.size() << ")\n";
new_alias << "\n@alias(pout" << JK::dec2(i) << ",in" << JK::dec2(inputs.size()) << ")\n";
if (_outputs[i] == out)
new_alias << "\n@alias(pout,in" << inputs.size() << ")\n";
new_alias << "\n@alias(pout,in" << JK::dec2(inputs.size()) << ")\n";
inputs.push_back(_outputs[i]);
}
auto alias = new_alias.str();
@ -123,18 +123,18 @@ void CodeOp::jit_prepare(JK& jk) {
// forward: in0 in1 in2 -> out0 out1
// backward: in0 in1 in2 in3(pout0) in4(pout1)
jk << "«IN_SIZE=" << JK::hex(_inputs.size());
jk << "«IN_SIZE:" << JK::dec2(_inputs.size());
for (uint i=0; i<_inputs.size(); i++) {
jk << "«in" << JK::hex(i) << "_dim="
jk << "«in" << JK::dec2(i) << "_dim:"
<< JK::hex1(_inputs[i]->shape.size());
jk << "«in" << JK::hex(i) << "_type:"
jk << "«in" << JK::dec2(i) << "_type:"
<< _inputs[i]->dtype();
}
jk << "«OUT_SIZE=" << JK::hex(_outputs.size());
jk << "«OUT_SIZE:" << JK::dec2(_outputs.size());
for (uint i=0; i<_outputs.size(); i++) {
jk << "«out" << JK::hex(i) << "_dim="
jk << "«out" << JK::dec2(i) << "_dim:"
<< JK::hex1(_outputs[i]->shape.size());
jk << "«out" << JK::hex(i) << "_type:"
jk << "«out" << JK::dec2(i) << "_type:"
<< _outputs[i]->dtype();
}
string& header = flags.get(NodeFlags::_cuda) ?

View File

@ -32,9 +32,9 @@ struct IndexOp : Op {
Example::
print(jt.index([2,2], 0)())
print(jt.index([2,2], 0))
# output: [[0,0],[1,1]]
print(jt.index([2,2], 1)())
print(jt.index([2,2], 1))
# output: [[0,1],[0,1]]
*/
IndexOp(NanoVector shape, int64 dim, NanoString dtype=ns_int32);

View File

@ -44,8 +44,12 @@ void ReshapeOp::infer_shape() {
if (uncertain_dim == 0) {
CHECKop(x_items,==,y_items) << "reshape shape is invalid for input of size";
} else {
CHECK(y_items != 0 && x_items % y_items == 0) << "reshape shape is invalid for input of size " << x_items;
uncertain_dim = x_items / y_items;
if (x_items == 0) {
uncertain_dim = 0;
} else {
CHECK(y_items != 0 && x_items % y_items == 0) << "reshape shape is invalid for input of size " << x_items;
uncertain_dim = x_items / y_items;
}
yshape.clear();
for (auto a : shape)
yshape.push_back(a<0 ? uncertain_dim : a);

View File

@ -82,4 +82,8 @@ class TestAdamw(unittest.TestCase):
lt = float(loss_torch.detach().numpy())
lj = float(loss_jittor.data)
# print(abs(lt - lj))
assert abs(lt - lj) < 1e-5
assert abs(lt - lj) < 1e-5
if __name__ == "__main__":
unittest.main()

View File

@ -59,6 +59,26 @@ class TestCodeOp(unittest.TestCase):
cpu_src="out0_p[0] = ++a_global_int_var; ").item() == 124
assert jt.code([1], "int", [], cpu_header=header,
cpu_src="out0_p[0] = ++a_global_int_var; ").item() == 125
def test_ten_args(self):
a = jt.random([10])
b = jt.code([a.shape]*11, [a.dtype]*11, [jt.random([10])]*10+[a],
cpu_src='''
for (int i=0; i<in10_shape0; i++)
@out10(i) = @in10(i)*@in10(i)*2;
''',
cpu_grad_src = ['']*10+['''
for (int i=0; i<in10_shape0; i++) {
@out0(i) = @dout(i)*@in10(i)*4;
}
'''])[-1]
na, nb = jt.fetch_sync([a,b])
assert np.allclose(na*na*2, nb)
c = jt.random([10])
da = jt.grad(c*b, a)
assert np.allclose(c.data*na*4, da.data), (c.data*na*4, da.data)
def test_use_func(self):
class Func(Function):

View File

@ -0,0 +1,34 @@
# ***************************************************************
# Copyright (c) 2022 Jittor. All Rights Reserved.
# Maintainers:
# Haoyang Peng <2247838039@qq.com>
# Guowei Yang <471184555@qq.com>
# Dun Liang <randonlang@gmail.com>.
#
# This file is subject to the terms and conditions defined in
# file 'LICENSE.txt', which is part of this source code package.
# ***************************************************************
import jittor as jt
from jittor import nn
import numpy as np
import unittest
try:
import torch
has_torch = True
except:
has_torch = False
@unittest.skipIf(not has_torch, "No pytorch installation found.")
class TestInterpolation(unittest.TestCase):
def test_interpolation_area(self):
img = np.random.uniform(0, 1, (1, 3, 24, 10))
output_shape = (12, 5)
jimg = jt.array(img)
timg = torch.from_numpy(img)
joutput = nn.interpolate(jimg, output_shape, mode="area")
toutput = torch.nn.functional.interpolate(timg, output_shape, mode="area")
np.testing.assert_allclose(joutput.numpy(), toutput.numpy(), rtol=1e-7, atol=1e-7)
if __name__ == "__main__":
unittest.main()

View File

@ -191,13 +191,26 @@ print(matmul_transpose(a, b))
x = m.state_dict()
m.load_state_dict(x)
# def test_res2net(self):
# import jittor.models
# net = jittor.models.res2net50(True)
# img = jt.random((2,3,224,224))
# out = net(img)
# print(out.shape, out.sum())
# assert out.shape == [2,1000]
def test_res2net(self):
import jittor.models
net = jittor.models.res2net50(True)
img = jt.random((2,3,224,224))
out = net(img)
print(out.shape, out.sum())
jt.display_memory_info()
jt.display_memory_info()
assert out.shape == [2,1000]
def test_argmax_memleak(self):
a = jt.random([10])
_, m = jt.argmax(a, 0)
del _
m.sync()
g = jt.grad(m*10, a)
g.sync()
del a, g, m
jt.display_memory_info()
assert jt.liveness_info()["lived_ops"] == 0
if __name__ == "__main__":

View File

@ -256,14 +256,19 @@ class TestOther(unittest.TestCase):
assert (x[3]['b'] == np.array([1,2,3])).all()
def test_arctan2(self):
a = jt.arctan2(jt.array([1,1.0,0]), jt.array([1,0.0,-1]))
np.testing.assert_allclose(a.data, [0.7853982,1.5707964,3.1415927])
y = jt.random((100,))
x = jt.random((100,))
x = jt.float32([1,1,-1,-1, 1,-1,0,0,0])
y = jt.float32([-1,1,-1,1, 0,0,1,-1,0])
z = jt.arctan2(y, x)
z2 = np.arctan2(y.data, x.data)
np.testing.assert_allclose(z.data, z2, atol=1e-6)
y = jt.random((100,)) * 2 - 1
x = jt.random((100,)) * 2 - 1
z = jt.arctan2(y, x)
z2 = np.arctan2(y.data, x.data)
np.testing.assert_allclose(z.data, z2, atol=1e-6)
np.testing.assert_allclose(jt.array([1]).arctan().item(), 0.7853982)
def test_softmax_precision(self):
# jt.flags.use_cuda = 1

View File

@ -75,6 +75,13 @@ class TestReshapeOp(unittest.TestCase):
a = jt.zeros(10)
b = a.reshape(a.shape)
def test_reshape_empty(self):
a = jt.array([])
b = a.reshape(0, 1, 2)
assert b.shape == [0, 1, 2]
b = a.reshape(0, -1)
assert b.shape == [0, 0]
if __name__ == "__main__":
unittest.main()

View File

@ -0,0 +1,57 @@
# ***************************************************************
# Copyright (c) 2022 Jittor. All Rights Reserved.
# Maintainers:
# Dun Liang <randonlang@gmail.com>.
# Xiangli Li <1905692338@qq.com>
# Jiapeng Zhang <zhangjp20@mails.tsinghua.edu.cn>
#
# This file is subject to the terms and conditions defined in
# file 'LICENSE.txt', which is part of this source code package.
# ***************************************************************
import unittest
import jittor as jt
import numpy as np
skip_this_test = False
try:
jt.dirty_fix_pytorch_runtime_error()
import torch
except:
torch = None
skip_this_test = True
def test_unique_with_torch(input, dim=None):
jt0, jt1, jt2 = jt.unique(jt.array(input), True, True, dim)
torch0, torch1, torch2 = torch.unique(torch.tensor(input), True, True, True, dim)
assert np.allclose(jt0, torch0) and np.allclose(jt1, torch1) and np.allclose(jt2, torch2)
@unittest.skipIf(skip_this_test, "No Torch found")
class TestUnique(unittest.TestCase):
def test_unique(self):
test_unique_with_torch(np.array([1, 3, 2, 3, 3, 3], dtype=np.int32))
test_unique_with_torch(np.array([[1, 3], [2, 3], [1, 2]], dtype=np.int64))
def test_unique_dim(self):
test_unique_with_torch(np.array([[1, 3], [2, 3], [1, 3], [2, 3]]), 0)
test_unique_with_torch(np.array([[1, 3], [2, 3], [1, 3], [2, 3]]), 1)
@unittest.skipIf(not jt.compiler.has_cuda, "No CUDA found")
@jt.flag_scope(use_cuda=1)
def test_unique_cuda(self):
self.test_unique()
@unittest.skipIf(not jt.compiler.has_cuda, "No CUDA found")
@jt.flag_scope(use_cuda=1)
def test_unique_dim_cuda(self):
self.test_unique_dim()
if __name__ == "__main__":
unittest.main()

Binary file not shown.

View File

@ -16,7 +16,7 @@ In detail, autocompletion of the following functions are supported.
- methods of jittor.Var
Prerequisite:
- mypy for automatic stub generation
- mypy for automatic stub generation, installation: pip install mypy
Usage: python3 -m jittor.utils.gen_pyi
@ -35,7 +35,7 @@ def add_indent(s: str, n=1):
def ctype_to_python(type_str):
if type_str == "bool":
return "bool"
if type_str in ["int", "uint", "int64", "uint64", "size_t"]:
if type_str in ["int", "uint", "uint8", "int64", "uint64", "size_t"]:
return "int"
if type_str in ["float32", "float64"]:
return "float"
@ -49,6 +49,8 @@ def ctype_to_python(type_str):
return "Var"
if type_str in ["vector<VarHolder*>", "vector<VarHolder*>&&"]:
return "List[Var]"
if type_str in ["vector_to_tuple<VarHolder*>"]:
return "Tuple[Var]"
if type_str == "NanoVector":
return "Tuple[int]"
if type_str == "vector<NanoVector>&&":
@ -161,11 +163,22 @@ def gen_ops_stub(jittor_path):
if func_name == "bool":
continue
docstring = func.__doc__[:func.__doc__.find("Declaration:")]
docstring = docstring.replace("'''", '"""').strip()
declarations = re.findall(r"Declaration:\n(.+)\n", func.__doc__)
docstrings = []
declarations = []
for i, doc in enumerate(re.split(r"Declaration:\n(.+)\n", func.__doc__)):
if i % 2 == 0:
if not doc.strip() and docstrings:
# if the current docstring is empty, use the last docstring
docstrings.append(docstrings[-1])
else:
docstrings.append(doc.replace("'''", '"""').strip())
else:
declarations.append(doc)
for i in range(len(declarations)):
decl = declarations[i]
docstring = docstrings[i]
for decl in declarations:
decorators = "@overload\n" if len(declarations) > 1 else ""
return_type = ctype_to_python(decl.split(' ', maxsplit=1)[0])
param_hints = decl_to_param_hints(decl)
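The rewritten loop relies on `re.split` with a capturing group, which interleaves the text around each match with the captured declaration, so every declaration stays paired with the docstring that precedes it. A small standalone illustration of that stdlib behaviour:

```python
import re

doc = "First overload doc\nDeclaration:\nint f(int x)\n\nDeclaration:\nfloat f(float x)\n"
parts = re.split(r"Declaration:\n(.+)\n", doc)
# Even indices are the surrounding docstring chunks, odd indices are the captured declarations:
# ['First overload doc\n', 'int f(int x)', '\n', 'float f(float x)', '']
print(parts)
```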

View File

@ -19,23 +19,69 @@ def check_is_en(src):
return en_cnt == len(src)
def check_is_both(src):
if src.startswith("!"):
return True
return len(src) < 2
def splite_markdown_blocks(src):
''' Split a markdown document into text, code, and table blocks.
'''
blocks = []
block = ""
status = "text"
def commit_block():
blocks.append((block, status))
for line in src.split('\n'):
line = line + "\n"
if line.startswith("```"):
assert status in ["text", "code"]
if status == "text":
commit_block()
status = "code"
block = line
elif status == "code":
block += line
commit_block()
status = "text"
block = ""
elif line.strip().startswith('|') and line.strip().endswith('|'):
assert status in ["text", "table"]
if status == "text":
commit_block()
status = "table"
block = line
else:
block += line
else:
if status == "table":
commit_block()
status = "text"
block = line
else:
block += line
if status != "code":
commit_block()
return blocks
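For reference, an informal sketch of what `splite_markdown_blocks` returns for a small mixed document (the fence is built programmatically here only to avoid nesting literal backticks in this example):

```python
fence = "`" * 3
sample = "\n".join([
    "Some prose.",
    fence + "python",
    'print("hi")',
    fence,
    "| a | b |",
    "|---|---|",
    "| 1 | 2 |",
    "more prose.",
])
for block, kind in splite_markdown_blocks(sample):
    print(kind, repr(block))
# Block kinds in order: text, code, text (an empty block between code and table), table, text.
```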
for mdname in all_src_md:
print(mdname)
with open(mdname, "r", encoding='utf8') as f:
src = f.read()
src = src.split("```")
en_src = []
cn_src = []
for i, s in enumerate(src):
if i%2==1:
en_src.append(s)
cn_src.append(s)
src_blocks = splite_markdown_blocks(src)
en_src = ""
cn_src = ""
for block, status in src_blocks:
if status == "code" or status == "table":
en_src += block
cn_src += block
else:
en_s = []
cn_s = []
for line in s.split('\n'):
for line in block.split('\n'):
if check_is_both(line):
en_s.append(line)
cn_s.append(line)
@ -43,10 +89,9 @@ for mdname in all_src_md:
en_s.append(line)
else:
cn_s.append(line)
en_src.append("\n".join(en_s))
cn_src.append("\n".join(cn_s))
en_src = "```".join(en_src)
cn_src = "```".join(cn_src)
en_src += "\n".join(en_s)
cn_src += "\n".join(cn_s)
with open(mdname.replace(".src.md", ".md"), 'w', encoding='utf8') as f:
f.write(en_src)
with open(mdname.replace(".src.md", ".cn.md"), 'w', encoding='utf8') as f: