Dun Liang 2021-08-17 13:35:18 +08:00
commit 4e38190483
160 changed files with 11104 additions and 751 deletions

.github/ISSUE_TEMPLATE/bug_report.md

@ -0,0 +1,25 @@
---
name: Bug report
about: Create a report to help us improve
title: ''
labels: ''
assignees: ''
---
## Describe the bug
A clear and concise description of what the bug is. Writing in Chinese is also acceptable.
## Full Log
Provide a full log of the Jittor execution; Jittor logs environment information which helps us locate your bug. Providing a screenshot is also acceptable.
## Minimal Reproduce
Reproduce this error with a file or several lines of code.
If it is not possible, leave it blank.
## Expected behavior
A clear and concise description of what you expected to happen.
If you are submitting an issue for the first time, please refer to [our guideline](https://github.com/Jittor/jittor/issues/238)

.gitignore

@ -1,5 +1,6 @@
my
.refresh
.DS_Store
__pycache__
.ipynb_checkpoints/
.vscode/


@ -1,3 +1,5 @@
exclude __data__
exclude __pycache__
prune **/__data__/
prune **/__pycache__
prune *.pyc


@ -89,7 +89,7 @@ for i,(x,y) in enumerate(get_data(n)):
The Jittor framework has the following environment requirements:
* OS: **Linux** (e.g. Ubuntu/CentOS/Arch) or **Windows Subsystem of Linux (WSL)**
* OS: **Linux** (e.g. Ubuntu/CentOS/Arch), **macOS** (x86_64), or **Windows Subsystem of Linux (WSL)**
* Python version >= 3.7
* C++ compiler (at least one of the following is required)
- g++ >=5.4.0
@ -100,7 +100,9 @@ The Jittor framework has the following environment requirements:
If you do not want to configure the environment by hand, we recommend installing with Docker.
Apart from that, you can also install via pip or manually.
Note: Jittor currently runs on Windows through WSL; for how to install WSL, please refer to the [Microsoft official website](https://docs.microsoft.com/en-us/windows/wsl/install-win10). The WSL version does not support CUDA yet.
Note #1: Jittor currently runs on Windows through WSL; for how to install WSL, please refer to the [Microsoft official website](https://docs.microsoft.com/en-us/windows/wsl/install-win10). The WSL version does not support CUDA yet.
Note #2: macOS users need to install additional dependencies; see [macOS install](#macOS-install).
Jittor provides three installation methods: docker, pip, and manual installation.
@ -112,6 +114,7 @@ Jittor provides three installation methods: docker, pip, and manual installation.
## Docker install
We provide a Docker setup so you do not need to configure the environment yourself. The Docker installation method is as follows:
@ -145,6 +148,27 @@ python3.7 -m jittor.test.test_example
If the test passes, congratulations, Jittor has been installed successfully.
jittor automatically searches the path for a suitable compiler; if you want to specify the compiler manually, use the environment variables `cc_path` and `nvcc_path` (optional), as in the sketch below.
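For example, to pin the compiler before installing (a minimal sketch; the paths below are illustrative assumptions, adjust them to your system):
```bash
export cc_path=/usr/bin/g++
export nvcc_path=/usr/local/cuda/bin/nvcc  # optional, only needed for CUDA builds
python3.7 -m pip install jittor
```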
## macOS install
On macOS, please use [homebrew](https://brew.sh) to install the additional dependencies (python>=3.7, onednn).
```bash
brew install python@3.7 onednn libomp
```
Then you can install jittor through pip and test whether it runs successfully.
```bash
python3.7 -m pip install jittor
python3.7 -m jittor.test.test_example
```
Currently, jittor only supports CPU computation on macOS.
## manual install
@ -313,11 +337,11 @@ help(jt.ops)
[1]: notebook/example.src.md "example"
[2]: notebook/basics.src.md "basics"
[3]: notebook/meta_op.src.md "meta_op"
[4]: notebook/custom_op.src.md "custom_op"
[5]: notebook/profiler.src.md "profiler"
[1]: python/jittor/notebook/example.src.md "example"
[2]: python/jittor/notebook/basics.src.md "basics"
[3]: python/jittor/notebook/meta_op.src.md "meta_op"
[4]: python/jittor/notebook/custom_op.src.md "custom_op"
[5]: python/jittor/notebook/profiler.src.md "profiler"
These notebooks can be run on your own computer with `python3.7 -m jittor.notebook`.


@ -92,9 +92,10 @@ We provide some jupyter notebooks to help you quickly get started with Jittor.
Jittor environment requirements:
* System: **Linux** (e.g. Ubuntu/CentOS/Arch) (or **Windows** Subsystem of Linux)
* System: **Linux** (e.g. Ubuntu/CentOS/Arch), **macOS**, or **Windows Subsystem of Linux (WSL)**
* Python version >= 3.7
* CPU compiler (requires at least one of the following)
* g++ (>=5.4.0)
@ -105,7 +106,9 @@ Jittor environment requirements:
Note: Currently Jittor runs on the Windows operating system through WSL. For the installation method of WSL, please refer to the [Microsoft official website](https://docs.microsoft.com/en-us/windows/wsl/install-win10). WSL does not yet support CUDA.
Note #1: Currently Jittor runs on the Windows operating system through WSL. For the installation method of WSL, please refer to the [Microsoft official website](https://docs.microsoft.com/en-us/windows/wsl/install-win10). WSL does not yet support CUDA.
Note #2: macOS users have to install additional dependencies; see [macOS install](#macOS-install).
Jittor offers three ways to install: docker, pip, or manual.
@ -139,6 +142,27 @@ python3.7 -m jittor.test.test_example
```
## macOS install
Please first install additional dependencies with [homebrew](https://brew.sh).
```bash
brew install python@3.7 onednn libomp
```
Then you can install jittor through pip and run the example.
```bash
python3.7 -m pip install jittor
python3.7 -m jittor.test.test_example
```
Currently jittor only supports CPU on macOS.
## manual install
We will show how to install Jittor on Ubuntu 16.04 step by step; other Linux distributions may use similar commands.
@ -307,11 +331,11 @@ If you want to know more about Jittor, please check out the notebooks below:
[1]: notebook/example.src.md "example"
[2]: notebook/basics.src.md "basics"
[3]: notebook/meta_op.src.md "meta_op"
[4]: notebook/custom_op.src.md "custom_op"
[5]: notebook/profiler.src.md "profiler"
[1]: python/jittor/notebook/example.src.md "example"
[2]: python/jittor/notebook/basics.src.md "basics"
[3]: python/jittor/notebook/meta_op.src.md "meta_op"
[4]: python/jittor/notebook/custom_op.src.md "custom_op"
[5]: python/jittor/notebook/profiler.src.md "profiler"
Those notebooks can be started on your own computer with `python3.7 -m jittor.notebook`.
@ -366,7 +390,7 @@ Jittor is currently maintained by the [Tsinghua CSCG Group](https://cg.cs.tsingh
@article{hu2020jittor,
title={Jittor: a novel deep learning framework with meta-operators and unified graph execution},
author={Hu, Shi-Min and Liang, Dun and Yang, Guo-Ye and Yang, Guo-Wei and Zhou, Wen-Yang},
journal={Information Sciences},
journal={Science China Information Sciences},
volume={63},
number={222103},
pages={1--21},


@ -113,7 +113,7 @@ We provide some jupyter notebooks to help you quickly get started with Jittor.
The Jittor framework has the following environment requirements:
* OS: **Linux** (e.g. Ubuntu/CentOS/Arch) or **Windows Subsystem of Linux (WSL)**
* OS: **Linux** (e.g. Ubuntu/CentOS/Arch), **macOS** (x86_64), or **Windows Subsystem of Linux (WSL)**
* Python version >= 3.7
* C++ compiler (at least one of the following is required)
- g++ >=5.4.0
@ -124,13 +124,15 @@ The Jittor framework has the following environment requirements:
If you do not want to configure the environment by hand, we recommend installing with Docker.
Apart from that, you can also install via pip or manually.
Note: Jittor currently runs on Windows through WSL; for how to install WSL, please refer to the [Microsoft official website](https://docs.microsoft.com/en-us/windows/wsl/install-win10). The WSL version does not support CUDA yet.
Note #1: Jittor currently runs on Windows through WSL; for how to install WSL, please refer to the [Microsoft official website](https://docs.microsoft.com/en-us/windows/wsl/install-win10). The WSL version does not support CUDA yet.
Note #2: macOS users need to install additional dependencies; see [macOS install](#macOS-install).
Jittor provides three installation methods: docker, pip, and manual installation.
Jittor environment requirements:
* System: **Linux** (e.g. Ubuntu/CentOS/Arch) (or **Windows** Subsystem of Linux)
* System: **Linux** (e.g. Ubuntu/CentOS/Arch), **macOS**, or **Windows Subsystem of Linux (WSL)**
* Python version >= 3.7
* CPU compiler (requires at least one of the following)
* g++ (>=5.4.0)
@ -141,7 +143,9 @@ Jittor environment requirements:
Note: Currently Jittor runs on the Windows operating system through WSL. For the installation method of WSL, please refer to the [Microsoft official website](https://docs.microsoft.com/en-us/windows/wsl/install-win10). WSL does not yet support CUDA.
Note #1: Currently Jittor runs on the Windows operating system through WSL. For the installation method of WSL, please refer to the [Microsoft official website](https://docs.microsoft.com/en-us/windows/wsl/install-win10). WSL does not yet support CUDA.
Note #2: macOS users have to install additional dependencies; see [macOS install](#macOS-install).
Jittor offers three ways to install: docker, pip, or manual.
@ -183,6 +187,31 @@ python3.7 -m jittor.test.test_example
If the test passes, congratulations, Jittor has been installed successfully.
jittor automatically searches the path for a suitable compiler; if you want to specify the compiler manually, use the environment variables `cc_path` and `nvcc_path` (optional).
## macOS install
Please first install the additional dependencies (python>=3.7, onednn) with [homebrew](https://brew.sh).
```bash
brew install python@3.7 onednn libomp
```
Then you can install jittor through pip and run the example to test it.
```bash
python3.7 -m pip install jittor
python3.7 -m jittor.test.test_example
```
Currently jittor only supports CPU computation on macOS.
## manual install
@ -387,16 +416,16 @@ If you want to know more about Jittor, please check out the notebooks below:
[1]: notebook/example.src.md "example"
[2]: notebook/basics.src.md "basics"
[3]: notebook/meta_op.src.md "meta_op"
[4]: notebook/custom_op.src.md "custom_op"
[5]: notebook/profiler.src.md "profiler"
[1]: python/jittor/notebook/example.src.md "example"
[2]: python/jittor/notebook/basics.src.md "basics"
[3]: python/jittor/notebook/meta_op.src.md "meta_op"
[4]: python/jittor/notebook/custom_op.src.md "custom_op"
[5]: python/jittor/notebook/profiler.src.md "profiler"
Those notebooks can be started on your own computer with `python3.7 -m jittor.notebook`.


@ -27,6 +27,9 @@
jittor.mpi
jittor.linalg
jittor.console
jittor.distributions
jittor.attention
jittor.loss3d
.. toctree::
@ -34,7 +37,6 @@
:caption: Others:
Tutorials <https://cg.cs.tsinghua.edu.cn/jittor/tutorial/>
todo
Indices and tables
==================


@ -1,7 +1,7 @@
jittor.attention
=====================
Here is the API documentation for Jittor's data transform module; you can access the module via `from jittor import attention`.
Here is the API documentation for Jittor's attention module; you can access the module via `from jittor import attention`.
```eval_rst
.. automodule:: jittor.attention


@ -0,0 +1,10 @@
jittor.distributions
=====================
Here is the API documentation for Jittor's random distribution module; you can access the module via `from jittor import distributions`.
```eval_rst
.. automodule:: jittor.distributions
:members:
:undoc-members:
```


@ -0,0 +1,10 @@
jittor.loss3d
=====================
Here is the API documentation for Jittor's 3D loss function module; you can access the module via `from jittor import loss3d`.
```eval_rst
.. automodule:: jittor.loss3d
:members: chamfer_loss, ChamferLoss, earth_mover_distance, EarthMoverDistance
:undoc-members:
```


@ -95,6 +95,9 @@ def val(epoch):
Below is the jittor MPI API reference (a usage sketch follows the list).
* `jt.world_rank`: the total number of processes; 1 if MPI is not in use.
* `jt.rank`: the rank of the current process, in the range `[0, jt.world_rank-1]`; 0 if MPI is not in use.
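For example, a minimal sketch (launch under MPI, e.g. `mpirun -np 2 python3.7 demo.py`; the script name is illustrative):
```python
import jittor as jt

# jt.rank is 0 and jt.world_size is 1 when running without mpirun
print(f"process {jt.rank} of {jt.world_size}")
```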
```eval_rst
.. automodule:: jittor_mpi_core
:members:


@ -10,7 +10,7 @@ jittor.nn
.. automodule:: jittor.nn
:imported-members:
:members: Pool, pool, AdaptiveAvgPool2d
:members: Pool, pool, AdaptiveAvgPool2d, Pool3d, AdaptiveMaxPool2d, AdaptiveAvgPool3d, AdaptiveMaxPool3d, pool3d, AvgPool2d, AvgPool3d, avg_pool2d, MaxPool2d, MaxPool3d, max_pool2d, max_pool3d, MaxUnpool2d, MaxUnpool3d
:undoc-members:
.. autoclass:: jittor.nn.ReLU


@ -9,7 +9,7 @@
# file 'LICENSE.txt', which is part of this source code package.
# ***************************************************************
__version__ = '1.2.3.12'
__version__ = '1.2.3.92'
from jittor_utils import lock
with lock.lock_scope():
ori_int = int
@ -23,7 +23,7 @@ with lock.lock_scope():
from jittor_core import *
from jittor_core.ops import *
from . import compile_extern
from .compile_extern import mkl_ops, mpi, mpi_ops, in_mpi, rank
from .compile_extern import mkl_ops, mpi, mpi_ops, in_mpi, rank, world_size
if core.get_device_count() == 0:
has_cuda = compile_extern.has_cuda = compiler.has_cuda = False
if has_cuda:
@ -437,11 +437,11 @@ def pow(x, y):
Var.pow = Var.__pow__ = pow
def argmax(x, dim, keepdims:bool=False):
return x.arg_reduce("max", dim, keepdims)
return jt.arg_reduce(x, "max", dim, keepdims)
Var.argmax = argmax
def argmin(x, dim, keepdims:bool=False):
return x.arg_reduce("min", dim, keepdims)
return jt.arg_reduce(x, "min", dim, keepdims)
Var.argmin = argmin
def randn(*size, dtype="float32", requires_grad=True) -> Var:
@ -562,7 +562,7 @@ def randint(low, high=None, shape=(1,), dtype="int32") -> Var:
'''
if high is None: low, high = 0, low
v = (jt.random(shape) * (high - low) + low).clamp(low, high-0.5)
v = jt.floor(v)
v = jt.floor_int(v)
return v.astype(dtype)
def randint_like(x, low, high=None) -> Var:
@ -780,11 +780,16 @@ class Module:
stack = []
def callback(parents, k, v, n):
stack.append(str(k))
for k2, p in v.__dict__.items():
if k2.startswith("_"): continue
dc = v.__dict__
if isinstance(v, nn.ParameterList):
dc = v.params
for k2, p in dc.items():
if isinstance(k2, str) and k2.startswith("_"): continue
if isinstance(p, Var):
ps.append(p)
p.name(".".join(stack[1:]+[str(k2)]))
pname = ".".join(stack[1:]+[str(k2)])
if len(pname) > len(p.name()):
p.name(pname)
def callback_leave(parents, k, v, n):
stack.pop()
self.dfs([], None, callback, callback_leave)
@ -822,53 +827,101 @@ class Module:
self.dfs([], "", callback, callback_leave)
return ms
def register_forward_hook(self, func):
def requires_grad_(self, requires_grad=True):
self._requires_grad = requires_grad
self._place_hooker()
return self
def __hooked_call__(self, *args, **kw):
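# NOTE: _place_hooker swaps cls.__call__ and cls.__hooked_call__, so within this
# method self.__hooked_call__ actually invokes the module's original __call__.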
if hasattr(self, "__fhook2__"):
if len(kw):
self.__fhook2__(self, args, kw)
else:
self.__fhook2__(self, args)
if hasattr(self, "__bihook__"):
if len(kw):
LOG.w("backward hook not support kw")
args = grad_hooker(args, self.__bihook__)
if hasattr(self, "_requires_grad") and not self._requires_grad:
with jt.no_grad():
ret = self.__hooked_call__(*args, **kw)
else:
ret = self.__hooked_call__(*args, **kw)
if hasattr(self, "__bohook__"):
if len(kw):
LOG.w("backward hook not support kw")
if isinstance(ret, Var):
ret = grad_hooker((ret,), self.__bohook__)[0]
else:
ret = grad_hooker(ret, self.__bohook__)
if hasattr(self, "__fhook__"):
if len(kw):
self.__fhook__(self, args, ret, kw)
else:
self.__fhook__(self, args, ret)
return ret
def _place_hooker(self):
cls = self.__class__
self.__fhook__ = func
if hasattr(cls, "__hooked__"):
return
cls.__hooked__ = True
origin_call = cls.__call__
def new_call(self, *args, **kw):
ret = origin_call(self, *args, **kw)
if hasattr(self, "__fhook__"):
if len(kw):
self.__fhook__(self, args, ret, kw)
else:
self.__fhook__(self, args, ret)
return ret
self.__class__.__call__ = new_call
cls.__call__, cls.__hooked_call__ = \
cls.__hooked_call__, cls.__call__
def register_forward_hook(self, func):
self.__fhook__ = func
self._place_hooker()
def remove_forward_hook(self):
cls = self.__class__
if hasattr(cls,"__hooked__"):
delattr(cls,"__hooked__")
if hasattr(self,"__fhook__"):
delattr(self,"__fhook__")
def register_pre_forward_hook(self, func):
cls = self.__class__
self.__fhook2__ = func
if hasattr(cls, "__hooked2__"):
return
cls.__hooked2__ = True
origin_call = cls.__call__
def new_call(self, *args, **kw):
if hasattr(self, "__fhook2__"):
if len(kw):
self.__fhook2__(self, args, kw)
else:
self.__fhook2__(self, args)
return origin_call(self, *args, **kw)
self.__class__.__call__ = new_call
self._place_hooker()
def remove_pre_forward_hook(self):
cls = self.__class__
if hasattr(cls,"__hooked2__"):
delattr(cls,"__hooked2__")
if hasattr(self,"__fhook2__"):
delattr(self,"__fhook2__")
def register_input_backward_hook(self, func):
self.__bihook__ = func
self._place_hooker()
def remove_input_backward_hook(self):
if hasattr(self,"__bihook__"):
delattr(self,"__bihook__")
def register_output_backward_hook(self, func):
self.__bohook__ = func
self._place_hooker()
def remove_output_backward_hook(self):
if hasattr(self,"__bohook__"):
delattr(self,"__bohook__")
def register_backward_hook(self, func):
''' hook both input and output on backpropagation of this module.
Arguments of hook are defined as::
hook(module, grad_input:tuple(jt.Var), grad_output:tuple(jt.Var)) -> tuple(jt.Var) or None
`grad_input` is the original gradient of this module's input, `grad_output` is the gradient of this module's output, and the return value is used to replace the gradient of the input.
'''
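# illustrative usage (the `layer` module and the 0.5 scaling are hypothetical,
# not part of this commit):
#   def scale_grad(mod, grad_input, grad_output):
#       return tuple(g * 0.5 for g in grad_input)
#   layer.register_backward_hook(scale_grad)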
_grad_output = None
def bohook(grad_output):
nonlocal _grad_output
_grad_output = grad_output
def bihook(grad_input):
return func(self, grad_input, _grad_output)
self.register_input_backward_hook(bihook)
self.register_output_backward_hook(bohook)
def remove_backward_hook(self):
self.remove_input_backward_hook()
self.remove_output_backward_hook()
def children(self):
cd = []
def callback(parents, k, v, n):
@ -1051,7 +1104,7 @@ can store value for backward computation)::
return grad0 * self.y, grad1 * self.x
a = jt.array(3.0)
b = jt.array(4.0)
func = MyFunc()
func = MyFunc.apply
c,d = func(a, b)
da, db = jt.grad(c+d*3, [a, b])
assert da.data == 4
@ -1074,7 +1127,7 @@ can also be None)::
return grad0 * self.y, None
a = jt.array(3.0)
b = jt.array(4.0)
func = MyFunc()
func = MyFunc.apply
c,d = func(a, b)
d.stop_grad()
da, db = jt.grad(c+d*3, [a, b])
@ -1144,6 +1197,44 @@ can also be None)::
func = cls()
return func(*args, **kw)
class GradHooker(Function):
def __init__(self, hook):
self.hook = hook
def execute(self, *args):
return args
def grad(self, *grad_input):
ret = self.hook(grad_input)
if ret: grad_input = ret
return grad_input
def grad_hooker(args, hook):
hooker = GradHooker(hook)
return hooker(*args)
def register_hook(v, hook):
""" register hook of any jittor Variables, if hook return not None,
the gradient of this variable will be alter, Example::
x = jt.array([0.0, 0.0])
y = x * [1,2]
y.register_hook(lambda g: g*2)
dx = jt.grad(y, x)
print(dx)
# will be [2, 4]
"""
def _hook(grads):
g = hook(grads[0])
if g is not None:
return (g,)
return None
hooker = GradHooker(_hook)
v.swap(hooker(v)[0])
return v
Var.register_hook = register_hook
def make_module(func, exec_n_args=1):
class MakeModule(Module):
@ -1169,9 +1260,11 @@ def dirty_fix_pytorch_runtime_error():
jt.dirty_fix_pytorch_runtime_error()
import torch
'''
import os
os.RTLD_GLOBAL = os.RTLD_GLOBAL | os.RTLD_DEEPBIND
import os, platform
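# os.RTLD_DEEPBIND only exists on Linux (glibc), so guard the flag tweak by platform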
if platform.system() == 'Linux':
os.RTLD_GLOBAL = os.RTLD_GLOBAL | os.RTLD_DEEPBIND
import atexit
@ -1249,11 +1342,15 @@ def get_len(var):
Var.__len__ = get_len
int = int32
Var.int = Var.int32
Var.long = Var.int32
float = float32
Var.float = Var.float32
double = float64
Var.double = Var.float64
def is_var(v):
return isinstance(v, Var)
# __array__ interface is used for np.array(jt_var)
Var.__array__ = Var.numpy
Var.__array_priority__ = 2000
@ -1271,3 +1368,5 @@ from . import numpy2cupy
from .contrib import concat
from .misc import *
from . import sparse
from . import optim
from . import dataset

View File

@ -5,6 +5,7 @@
# file 'LICENSE.txt', which is part of this source code package.
# ***************************************************************
import os, sys, shutil
import platform
from .compiler import *
from jittor_utils import run_cmd, get_version, get_int_version
from jittor_utils.misc import download_url_to_local
@ -23,71 +24,99 @@ def search_file(dirs, name, prefer_version=()):
def install_mkl(root_folder):
# origin url is
# url = "https://github.com/intel/mkl-dnn/releases/download/v1.0.2/mkldnn_lnx_1.0.2_cpu_gomp.tgz"
url = "https://cloud.tsinghua.edu.cn/f/da02bf62b55b4aa3b8ee/?dl=1"
filename = "mkldnn_lnx_1.0.2_cpu_gomp.tgz"
# newest version for oneDNN
# url = "https://github.com/oneapi-src/oneDNN/releases/download/v2.2/dnnl_lnx_2.2.0_cpu_gomp.tgz"
# filename = "dnnl_lnx_2.2.0_cpu_gomp.tgz"
import platform
if platform.system()=="Linux":
if platform.machine()=='x86_64':
filename = "dnnl_lnx_2.2.0_cpu_gomp.tgz"
md5 = "35bbbdf550a9d8ad54db798e372000f6"
elif platform.machine()=='aarch64':
filename = "dnnl_lnx_2.2.0_cpu_gomp_aarch64.tgz"
md5 = "72cf9b0b8fd6c3c786d35a9daaee22b8"
else:
raise RuntimeError(f"platform.machine()=={platform.machine()} not support yet,"
" Please contact us on https://github.com/jittor/jittor ")
else:
raise RuntimeError(f"platform.machine()=={platform.machine()} not support yet,"
" Please contact us on https://github.com/jittor/jittor ")
url = "https://cg.cs.tsinghua.edu.cn/jittor/assets/" + filename
fullname = os.path.join(root_folder, filename)
dirname = os.path.join(root_folder, filename.replace(".tgz",""))
if not os.path.isfile(os.path.join(dirname, "examples", "test")):
if not os.path.isfile(os.path.join(dirname, "lib", "libmkldnn.so")):
LOG.i("Downloading mkl...")
download_url_to_local(url, filename, root_folder, "47187284ede27ad3bd64b5f0e7d5e730")
# newest version for oneDNN
# download_url_to_local(url, filename, root_folder, "35bbbdf550a9d8ad54db798e372000f6")
download_url_to_local(url, filename, root_folder, md5)
import tarfile
with tarfile.open(fullname, "r") as tar:
tar.extractall(root_folder)
assert 0 == os.system(f"cd {dirname}/examples && "
f"{cc_path} -std=c++14 cpu_cnn_inference_f32.cpp -Ofast -lmkldnn -I ../include -L ../lib -o test && LD_LIBRARY_PATH=../lib/ ./test")
# newest version for oneDNN
# assert 0 == os.system(f"cd {dirname}/examples && "
# f"{cc_path} -std=c++14 cnn_inference_f32.cpp -Ofast -lmkldnn -I ../include -L ../lib -o test && LD_LIBRARY_PATH=../lib/ ./test")
f"{cc_path} -std=c++14 cnn_inference_f32.cpp -Ofast -lmkldnn -I ../include -L ../lib -o test && LD_LIBRARY_PATH=../lib/ ./test")
def setup_mkl():
global mkl_ops, use_mkl
use_mkl = os.environ.get("use_mkl", "1")=="1"
mkl_ops = None
if not use_mkl: return
# pytorch mkl conflicts with jittor mkl,
# yielding errors such as "free(): invalid size" or
# "mmap error";
# importing pytorch (>1.8) first can fix this problem
# try:
# # jt.dirty_fix_pytorch_runtime_error()
# import torch
# from torch import nn
# except:
# torch = None
mkl_include_path = os.environ.get("mkl_include_path")
mkl_lib_path = os.environ.get("mkl_lib_path")
if mkl_lib_path is None or mkl_include_path is None:
mkl_install_sh = os.path.join(jittor_path, "script", "install_mkl.sh")
LOG.v("setup mkl...")
# mkl_path = os.path.join(cache_path, "mkl")
# mkl_path decouple with cc_path
from pathlib import Path
mkl_path = os.path.join(str(Path.home()), ".cache", "jittor", "mkl")
make_cache_dir(mkl_path)
install_mkl(mkl_path)
mkl_home = ""
for name in os.listdir(mkl_path):
if name.startswith("mkldnn_lnx") and os.path.isdir(os.path.join(mkl_path, name)):
mkl_home = os.path.join(mkl_path, name)
break
assert mkl_home!=""
if platform.system() == 'Linux':
if mkl_lib_path is None or mkl_include_path is None:
mkl_install_sh = os.path.join(jittor_path, "script", "install_mkl.sh")
LOG.v("setup mkl...")
# mkl_path = os.path.join(cache_path, "mkl")
# mkl_path decouple with cc_path
from pathlib import Path
mkl_path = os.path.join(str(Path.home()), ".cache", "jittor", "mkl")
make_cache_dir(mkl_path)
install_mkl(mkl_path)
mkl_home = ""
for name in os.listdir(mkl_path):
if name.startswith("dnnl") and os.path.isdir(os.path.join(mkl_path, name)):
mkl_home = os.path.join(mkl_path, name)
break
assert mkl_home!=""
mkl_include_path = os.path.join(mkl_home, "include")
mkl_lib_path = os.path.join(mkl_home, "lib")
mkl_lib_name = os.path.join(mkl_lib_path, "libmkldnn.so")
assert os.path.isdir(mkl_include_path)
assert os.path.isdir(mkl_lib_path)
assert os.path.isfile(mkl_lib_name)
LOG.v(f"mkl_include_path: {mkl_include_path}")
LOG.v(f"mkl_lib_path: {mkl_lib_path}")
LOG.v(f"mkl_lib_name: {mkl_lib_name}")
# We do not link manually; linking happens in custom ops
# ctypes.CDLL(mkl_lib_name, dlopen_flags)
mkl_lib_name = os.path.join(mkl_lib_path, "libmkldnn.so")
assert os.path.isdir(mkl_include_path)
assert os.path.isdir(mkl_lib_path)
assert os.path.isfile(mkl_lib_name)
LOG.v(f"mkl_include_path: {mkl_include_path}")
LOG.v(f"mkl_lib_path: {mkl_lib_path}")
LOG.v(f"mkl_lib_name: {mkl_lib_name}")
# We do not link manually; linking happens in custom ops
# ctypes.CDLL(mkl_lib_name, dlopen_flags)
extra_flags = f" -I'{mkl_include_path}' -L'{mkl_lib_path}' -lmkldnn -Wl,-rpath='{mkl_lib_path}' "
elif platform.system() == 'Darwin':
mkl_lib_paths = [
"/usr/local/lib/libmkldnn.dylib", # x86_64
"/opt/homebrew/lib/libmkldnn.dylib", # arm64
]
if not any([os.path.exists(lib) for lib in mkl_lib_paths]):
raise RuntimeError("Not found onednn, please install it by the command 'brew install onednn'")
extra_flags = f" -lmkldnn "
mkl_op_dir = os.path.join(jittor_path, "extern", "mkl", "ops")
mkl_op_files = [os.path.join(mkl_op_dir, name) for name in os.listdir(mkl_op_dir)]
mkl_ops = compile_custom_ops(mkl_op_files,
extra_flags=f" -I'{mkl_include_path}' -L'{mkl_lib_path}' -lmkldnn -Wl,-rpath='{mkl_lib_path}' ")
mkl_ops = compile_custom_ops(mkl_op_files, extra_flags=extra_flags)
LOG.vv("Get mkl_ops: "+str(dir(mkl_ops)))
@ -154,13 +183,20 @@ def setup_cuda_extern():
line = traceback.format_exc()
LOG.w(f"CUDA found but {lib_name} is not loaded:\n{line}")
if lib_name == "cudnn":
LOG.w(f"""Develop version of CUDNN not found,
msg = """Develop version of CUDNN not found,
please refer to CUDA offical tar file installation:
https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html#installlinux-tar
https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html#installlinux-tar"""
if platform.machine() == "x86_64":
msg += """
or you can let jittor install cuda and cudnn for you:
>>> python3.{sys.version_info.minor} -m jittor_utils.install_cuda""")
>>> python3.{sys.version_info.minor} -m jittor_utils.install_cuda
"""
LOG.w(msg)
def setup_cuda_lib(lib_name, link=True, extra_flags=""):
arch_key = "x86_64"
if platform.machine() != "x86_64":
arch_key = "aarch64"
globals()[lib_name+"_ops"] = None
globals()[lib_name] = None
if not has_cuda: return
@ -172,20 +208,26 @@ def setup_cuda_lib(lib_name, link=True, extra_flags=""):
link_flags = ""
if link:
extra_include_path = os.path.abspath(os.path.join(cuda_include, "..", "targets/x86_64-linux/include"))
extra_lib_path = os.path.abspath(os.path.join(cuda_lib, "..", "targets/x86_64-linux/lib"))
extra_include_path = os.path.abspath(os.path.join(cuda_include, "..", f"targets/{arch_key}-linux/include"))
extra_lib_path = os.path.abspath(os.path.join(cuda_lib, "..", f"targets/{arch_key}-linux/lib"))
cuda_include_name = search_file([cuda_include, extra_include_path, "/usr/include"], lib_name+".h")
# cuda11 prefer cudnn 8
nvcc_version = get_int_version(nvcc_path)
prefer_version = ()
if nvcc_version[0] == 11:
prefer_version = ("8",)
culib_path = search_file([cuda_lib, extra_lib_path, "/usr/lib/x86_64-linux-gnu"], f"lib{lib_name}.so", prefer_version)
culib_path = search_file([cuda_lib, extra_lib_path, f"/usr/lib/{arch_key}-linux-gnu", "/usr/lib"], f"lib{lib_name}.so", prefer_version)
if lib_name == "cublas":
if lib_name == "cublas" and nvcc_version[0] >= 10:
# manual link libcublasLt.so
cublas_lt_lib_path = search_file([cuda_lib, extra_lib_path, "/usr/lib/x86_64-linux-gnu"], f"libcublasLt.so", nvcc_version)
ctypes.CDLL(cublas_lt_lib_path, dlopen_flags)
try:
cublas_lt_lib_path = search_file([cuda_lib, extra_lib_path, f"/usr/lib/{arch_key}-linux-gnu", "/usr/lib"], f"libcublasLt.so", nvcc_version)
ctypes.CDLL(cublas_lt_lib_path, dlopen_flags)
except:
# some aarch64 OSes, such as UOS with the FT2000 CPU,
# ship CUDA 10 without libcublasLt.so
pass
if lib_name == "cudnn":
@ -193,7 +235,7 @@ def setup_cuda_lib(lib_name, link=True, extra_flags=""):
if nvcc_version >= (11,0,0):
libs = ["libcudnn_ops_infer.so", "libcudnn_ops_train.so", "libcudnn_cnn_infer.so", "libcudnn_cnn_train.so"]
for l in libs:
ex_cudnn_path = search_file([cuda_lib, extra_lib_path, "/usr/lib/x86_64-linux-gnu"], l, prefer_version)
ex_cudnn_path = search_file([cuda_lib, extra_lib_path, f"/usr/lib/{arch_key}-linux-gnu", "/usr/lib"], l, prefer_version)
ctypes.CDLL(ex_cudnn_path, dlopen_flags)
# dynamic link cuda library
@ -326,7 +368,7 @@ def install_nccl(root_folder):
if len(flags.cuda_archs):
arch_flag = f" -arch=compute_{min(flags.cuda_archs)} "
arch_flag += ''.join(map(lambda x:f' -code=sm_{x} ', flags.cuda_archs))
run_cmd(f"make -j8 src.build CUDA_HOME='{cuda_home}' NVCC_GENCODE='{arch_flag} --cudart=shared -ccbin=\"{cc_path}\" ' ", cwd=dirname)
run_cmd(f"CC=\"{cc_path}\" CXX=\"{cc_path}\" make -j8 src.build CUDA_HOME='{cuda_home}' NVCC_GENCODE='{arch_flag} --cudart=shared ' ", cwd=dirname)
return dirname
def setup_nccl():
@ -453,12 +495,22 @@ def setup_mpi():
if k == "mpi_test": continue
setattr(core.Var, k, warper(mpi_ops.__dict__[k]))
if os.environ.get("FIX_TORCH_ERROR", "0") == "1":
try:
import torch
except:
pass
setup_mpi()
in_mpi = inside_mpi()
rank = mpi.world_rank() if in_mpi else 0
world_size = mpi.world_size() if in_mpi else 1
setup_nccl()
setup_cutt()
setup_mkl()
try:
setup_mkl()
except Exception as e:
LOG.w("MKL install failed, msg:", e)
setup_cuda_extern()


@ -12,7 +12,9 @@ import glob
import inspect
import datetime
import threading
import platform
import ctypes
from ctypes import cdll
from ctypes.util import find_library
@ -22,6 +24,7 @@ from . import pyjt_compiler
from jittor_utils import lock
from jittor_utils import install_cuda
from jittor import __version__
import hashlib
def find_jittor_path():
return os.path.dirname(__file__)
@ -71,10 +74,11 @@ def compile(compiler, flags, inputs, output, combind_build=False):
output = os.path.join(cache_path, output)
# don't recompile object file in inputs
obj_files = []
ex_obj_files = []
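# object/archive files supplied by the caller; appended after our own objects are compiled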
new_inputs = []
for name in inputs:
if name[-1] in 'oa':
obj_files.append(name)
ex_obj_files.append(name)
else:
new_inputs.append(os.path.join(jittor_path, name))
obj_files.append(os.path.join(
@ -104,12 +108,11 @@ def compile(compiler, flags, inputs, output, combind_build=False):
cmd = cmd.replace("-Ofast", "-O2")
cmds.append(cmd)
jit_utils.run_cmds(cmds, cache_path, jittor_path, "Compiling "+base_output)
obj_files += ex_obj_files
cmd = f"\"{compiler}\" {' '.join(obj_files)} {flags} {lto_flags} {link} -o {output}"
return do_compile(cmd)
def gen_jit_tests():
# all_src = run_cmd('find -L src/ | grep "cc$"', jittor_path).splitlines()
# all_src = glob.glob(os.path.join(jittor_path,"src","**","*.cc"), recursive=True)
all_src = glob.glob(jittor_path+"/src/**/*.cc", recursive=True)
jit_declares = []
re_def = re.compile("JIT_TEST\\((.*?)\\)")
@ -159,7 +162,6 @@ def gen_jit_tests():
f.write(jit_src)
def gen_jit_flags():
# all_src = run_cmd('find -L src/ | grep "cc$"', jittor_path).splitlines()
all_src = glob.glob(jittor_path+"/src/**/*.cc", recursive=True)
jit_declares = []
re_def = re.compile("DEFINE_FLAG(_WITH_SETTER)?\\((.*?)\\);", re.DOTALL)
@ -608,7 +610,7 @@ def compile_custom_ops(
filenames,
extra_flags="",
return_module=False,
dlopen_flags=os.RTLD_GLOBAL | os.RTLD_NOW | os.RTLD_DEEPBIND,
dlopen_flags=None,
gen_name_ = ""):
"""Compile custom ops
filenames: path of op source files, filenames must be
@ -618,6 +620,11 @@ def compile_custom_ops(
return_module: return module rather than ops(default: False)
return: compiled ops
"""
if dlopen_flags is None:
dlopen_flags = os.RTLD_GLOBAL | os.RTLD_NOW
if platform.system() == 'Linux':
dlopen_flags |= os.RTLD_DEEPBIND
srcs = {}
headers = {}
builds = []
@ -650,7 +657,7 @@ def compile_custom_ops(
if gen_name_ != "":
gen_name = gen_name_
if len(gen_name) > 100:
gen_name = gen_name[:80] + "___hash" + str(hash(gen_name))
gen_name = gen_name[:80] + "___hash" + hashlib.md5(gen_name.encode()).hexdigest()
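# hashlib.md5, unlike the salted built-in hash(), is stable across processes, keeping generated names reproducible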
includes = sorted(list(set(includes)))
includes = "".join(map(lambda x: f" -I'{x}' ", includes))
@ -716,7 +723,7 @@ def get_full_path_of_executable(name):
def compile_extern():
# compile llvm passes
if cc_type != "clang":
if cc_type != "clang" or platform.system() != 'Linux':
return
global kernel_opt_flags
cache_path_llvm = os.path.join(cache_path, "llvm")
@ -777,7 +784,7 @@ def compile_extern():
LOG.vv(f"Compile extern llvm passes: {str(files)}")
def check_cuda():
if nvcc_path == "":
if not nvcc_path:
return
global cc_flags, has_cuda, core_link_flags, cuda_dir, cuda_lib, cuda_include, cuda_home
cuda_dir = os.path.dirname(get_full_path_of_executable(nvcc_path))
@ -859,11 +866,15 @@ def check_debug_flags():
cc_flags = " "
# os.RTLD_NOW | os.RTLD_GLOBAL cause segfault when import torch first
import_flags = os.RTLD_NOW | os.RTLD_GLOBAL | os.RTLD_DEEPBIND
import_flags = os.RTLD_NOW | os.RTLD_GLOBAL
if platform.system() == 'Linux':
import_flags |= os.RTLD_DEEPBIND
# if cc_type=="icc":
# # weird link problem, icc omp library may conflict and cause segfault
# import_flags = os.RTLD_NOW | os.RTLD_GLOBAL
dlopen_flags = os.RTLD_NOW | os.RTLD_GLOBAL | os.RTLD_DEEPBIND
dlopen_flags = os.RTLD_NOW | os.RTLD_GLOBAL
if platform.system() == 'Linux':
dlopen_flags |= os.RTLD_DEEPBIND
with jit_utils.import_scope(import_flags):
jit_utils.try_import_jit_utils_core()
@ -885,9 +896,21 @@ python_path = sys.executable
ex_python_path = python_path + '.' + str(sys.version_info.minor)
if os.path.isfile(ex_python_path):
python_path = ex_python_path
nvcc_path = env_or_try_find('nvcc_path', 'nvcc') or try_find_exe('/usr/local/cuda/bin/nvcc') or try_find_exe('/usr/bin/nvcc')
# if jtcuda is already installed
nvcc_path = None
if install_cuda.has_installation():
nvcc_path = install_cuda.install_cuda()
if nvcc_path:
nvcc_path = try_find_exe(nvcc_path)
# check system installed cuda
if not nvcc_path:
nvcc_path = env_or_try_find('nvcc_path', 'nvcc') or \
try_find_exe('/usr/local/cuda/bin/nvcc') or \
try_find_exe('/usr/bin/nvcc') or \
try_find_exe('/opt/cuda/bin/nvcc')
# if system has no cuda, install jtcuda
if not nvcc_path:
cuda_driver = install_cuda.get_cuda_driver()
nvcc_path = install_cuda.install_cuda()
if nvcc_path:
nvcc_path = try_find_exe(nvcc_path)
@ -897,14 +920,44 @@ gdb_path = try_find_exe('gdb')
addr2line_path = try_find_exe('addr2line')
has_pybt = check_pybt(gdb_path, python_path)
cc_flags += " -Wall -Werror -Wno-unknown-pragmas -std=c++14 -fPIC -march=native "
def check_clang_latest_supported_cpu():
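# parse `clang --print-supported-cpus` and pick the newest apple-aN target (e.g. apple-a14)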
output = run_cmd('clang --print-supported-cpus')
apple_cpus = [l.strip() for l in output.split('\n') if 'apple-a' in l]
apple_cpus_id = max([int(cpu[7:]) for cpu in apple_cpus])
return f'apple-a{apple_cpus_id}'
# cc_flags += " -Wall -Werror -Wno-unknown-pragmas -std=c++14 -fPIC "
cc_flags += " -Wall -Wno-unknown-pragmas -std=c++14 -fPIC "
# 1. Arch/CPU specific optimization
if platform.machine() == "x86_64":
cc_flags += " -march=native "
elif platform.machine() == 'arm64' and platform.system() == "Darwin":
cc_flags += f" -mcpu={check_clang_latest_supported_cpu()} "
cc_flags += " -fdiagnostics-color=always "
# 2. Non standard include path
if platform.system() == 'Darwin' and platform.machine() == 'arm64':
cc_flags += " -I/opt/homebrew/include "
# 3. User specified flags
if "cc_flags" in os.environ:
cc_flags += os.environ["cc_flags"] + ' '
link_flags = " -lstdc++ -ldl -shared "
if platform.system() == 'Darwin':
# TODO: if not using apple clang, there is no need to add -lomp
link_flags += "-undefined dynamic_lookup -lomp "
if platform.machine() == "arm64":
link_flags += " -L/opt/homebrew/lib "
core_link_flags = ""
opt_flags = ""
kernel_opt_flags = os.environ.get("kernel_flags", "") + opt_flags + " -fopenmp "
kernel_opt_flags = os.environ.get("kernel_flags", "") + opt_flags
if platform.system() == 'Darwin':
# TODO: if not using apple clang, cannot add -Xpreprocessor
kernel_opt_flags = kernel_opt_flags + " -Xpreprocessor -fopenmp "
else:
kernel_opt_flags = kernel_opt_flags + " -fopenmp "
if os.name == 'nt':
link_flags = link_flags.replace('-ldl', '')
py3_link_path = '-L"' + os.path.join(
@ -978,7 +1031,6 @@ if has_cuda:
gen_jit_flags()
gen_jit_tests()
op_headers = glob.glob(jittor_path+"/src/ops/**/*op.h", recursive=True)
# op_headers = run_cmd('find -L src/ops/ | grep "op.h$"', jittor_path).splitlines()
jit_src = gen_jit_op_maker(op_headers)
LOG.vvvv(jit_src)
with open(os.path.join(cache_path, "gen", "jit_op_maker.h"), 'w') as f:
@ -1023,54 +1075,49 @@ for file in jit_utils_core_files:
files.remove(file)
LOG.vv("compile order:", files)
if os.name != 'nt':
# manual Link omp using flags(os.RTLD_NOW | os.RTLD_GLOBAL)
# if cc_type=="icc":
# os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"
if platform.system() == 'Linux':
libname = {"clang":"omp", "icc":"iomp5", "g++":"gomp"}[cc_type]
libname = ctypes.util.find_library(libname)
assert libname is not None, "openmp library not found"
ctypes.CDLL(libname, os.RTLD_NOW | os.RTLD_GLOBAL)
# get os release
if platform.machine()=='sw_64':
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
with open("/etc/os-release", "r", encoding='utf8') as f:
s = f.read().splitlines()
os_release = {}
for line in s:
a = line.split('=')
if len(a) != 2: continue
os_release[a[0]] = a[1].replace("\"", "")
os_type = {
"ubuntu": "ubuntu",
"debian": "ubuntu",
"centos": "centos",
"rhel": "ubuntu",
"fedora": "ubuntu",
}
version_file = os.path.join(jittor_path, "version")
if os.path.isfile(version_file) and not os.path.isdir(os.path.join(jittor_path, "src", "__data__")):
with open(version_file, 'r') as f:
version = f.read().strip()
# key = f"{version}-{cc_type}-{'cuda' if has_cuda else 'cpu'}.o"
key = f"{version}-g++-cpu"
os_id = os_release["ID"]
os_key = os_type.get(os_id, "ubuntu")
if "os_key" in os.environ:
os_key = os.environ['os_key']
LOG.i("OS type:", os_id, " OS key:", os_key)
key += '-' + os_key + '.o'
# TODO: open the website
extra_obj = os.path.join(cache_path, key)
url = os.path.join("https://cg.cs.tsinghua.edu.cn/jittor/assets/build/"+key)
jit_utils.download(url, extra_obj)
files.append(extra_obj)
data_gz_path = os.path.join(jittor_path, "utils", "data.gz")
use_data_gz = os.path.isfile(data_gz_path)
if os.environ.get("use_data_gz", "1") == "0":
use_data_gz = False
if use_data_gz:
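# compile the bundled data.gz source into data.o, cached by md5 so it is rebuilt only when the archive changes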
import gzip
with gzip.open(data_gz_path, 'rb') as f:
data = f.read()
md5 = hashlib.md5(data).hexdigest()
target_md5 = None
data_gz_md5_path = os.path.join(cache_path, "data.md5")
if os.path.isfile(data_gz_md5_path):
with open(data_gz_md5_path, 'r') as f:
target_md5 = f.read()
data_o_path = os.path.join(cache_path, "data.o")
if target_md5 != md5:
data_s_path = os.path.join(cache_path, "data.cc")
with open(data_s_path, "w") as f:
f.write(data.decode("utf8"))
dflags = (cc_flags+opt_flags)\
.replace("-Wall", "") \
.replace("-Werror", "")
run_cmd(f"{cc_path} {dflags} \"-D_P(...)=\" {data_s_path} -c -o {data_o_path}")
os.remove(data_s_path)
with open(data_gz_md5_path, 'w') as f:
f.write(md5)
files.append(data_o_path)
files = [f for f in files if "__data__" not in f]
compile(cc_path, cc_flags+opt_flags, files, 'jittor_core'+extension_suffix)
# TODO: move to compile_extern.py
compile_extern()
# compile_extern()
with jit_utils.import_scope(import_flags):
import jittor_core as core


@ -206,6 +206,23 @@ def setitem(x, slices, value):
jt.Var.__getitem__ = jt.Var.slice_var = getitem
jt.Var.__setitem__ = setitem
def _merge_dtypes(dtypes):
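# promote a list of dtype strings to one common dtype: take the widest category
# (bool < uint < int < float) and the widest bit width, e.g. ["int32", "float64"] -> "float64"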
s = -1
e = -1
names = ["bool","uint","int","float"]
dbytes = ["8","16","32","64"]
for d in dtypes:
for name in names:
if d.startswith(name):
s = max(s,names.index(name))
for db in dbytes:
if d.endswith(db):
e = max(e,dbytes.index(db))
assert s>=0 and s<4 and e<4
dtype = names[s]+("" if e ==-1 else dbytes[e])
return dtype
def concat(arr, dim=0):
'''Concat Operator can concat a list of jt Vars at a specific dimension.
@ -226,12 +243,14 @@ Example::
raise ValueError("need at least one array to concat")
total_dim = 0
if dim < 0: dim += len(arr[0].shape)
dtypes = []
for a in arr:
total_dim += a.shape[dim]
dtypes.append(str(a.dtype))
cdim = 0
shape = list(a.shape)
shape[dim] = total_dim
s = jt.empty(shape, a.dtype)
s = jt.empty(shape, dtype = _merge_dtypes(dtypes))
slices = [slice(None)]*len(a.shape)
for a in arr:
if a.shape[dim] == 0:


@ -1,5 +1,6 @@
from .dataset import Dataset, ImageFolder
from .dataset import Dataset, ImageFolder, dataset_root, TensorDataset, VarDataset
from .mnist import MNIST
from .cifar import CIFAR10, CIFAR100
from .voc import VOC
from .sampler import *


@ -0,0 +1,189 @@
import os
from jittor_utils.misc import download_and_extract_archive, check_integrity
from PIL import Image
import sys, pickle
import numpy as np
from jittor.dataset import Dataset, dataset_root
class CIFAR10(Dataset):
"""`CIFAR10 <https://www.cs.toronto.edu/~kriz/cifar.html>`_ Dataset.
Args:
root (string): Root directory of dataset where directory
``cifar-10-batches-py`` exists or will be saved to if download is set to True.
train (bool, optional): If True, creates dataset from training set, otherwise
creates from test set.
transform (callable, optional): A function/transform that takes in a PIL image
and returns a transformed version. E.g, ``transforms.RandomCrop``
target_transform (callable, optional): A function/transform that takes in the
target and transforms it.
download (bool, optional): If true, downloads the dataset from the internet and
puts it in root directory. If dataset is already downloaded, it is not
downloaded again.
Example::
from jittor.dataset.cifar import CIFAR10
a = CIFAR10()
a.set_attrs(batch_size=16)
for imgs, labels in a:
print(imgs.shape, labels.shape)
break
"""
base_folder = 'cifar-10-batches-py'
url = "https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz"
filename = "cifar-10-python.tar.gz"
tgz_md5 = 'c58f30108f718f92721af3b95e74349a'
train_list = [
['data_batch_1', 'c99cafc152244af753f735de768cd75f'],
['data_batch_2', 'd4bba439e000b95fd0a9bffe97cbabec'],
['data_batch_3', '54ebc095f3ab1f0389bbae665268c751'],
['data_batch_4', '634d18415352ddfa80567beed471001a'],
['data_batch_5', '482c414d41f54cd18b22e5b47cb7c3cb'],
]
test_list = [
['test_batch', '40351d587109b95175f43aff81a1287e'],
]
meta = {
'filename': 'batches.meta',
'key': 'label_names',
'md5': '5ff9c542aee3614f3951f8cda6e48888',
}
def __init__(self, root=dataset_root+"/cifar_data/", train=True, transform=None, target_transform=None,
download=True):
super(CIFAR10, self).__init__()
self.root = root
self.transform=transform
self.target_transform=target_transform
self.train = train # training set or test set
if download:
self.download()
if not self._check_integrity():
raise RuntimeError('Dataset not found or corrupted.' +
' You can use download=True to download it')
if self.train:
downloaded_list = self.train_list
else:
downloaded_list = self.test_list
self.data = []
self.targets = []
# now load the pickled numpy arrays
for file_name, checksum in downloaded_list:
file_path = os.path.join(self.root, self.base_folder, file_name)
with open(file_path, 'rb') as f:
if sys.version_info[0] == 2:
entry = pickle.load(f)
else:
entry = pickle.load(f, encoding='latin1')
self.data.append(entry['data'])
if 'labels' in entry:
self.targets.extend(entry['labels'])
else:
self.targets.extend(entry['fine_labels'])
self.data = np.vstack(self.data).reshape(-1, 3, 32, 32)
self.data = self.data.transpose((0, 2, 3, 1)) # convert to HWC
self._load_meta()
def _load_meta(self):
path = os.path.join(self.root, self.base_folder, self.meta['filename'])
if not check_integrity(path, self.meta['md5']):
raise RuntimeError('Dataset metadata file not found or corrupted.' +
' You can use download=True to download it')
with open(path, 'rb') as infile:
if sys.version_info[0] == 2:
data = pickle.load(infile)
else:
data = pickle.load(infile, encoding='latin1')
self.classes = data[self.meta['key']]
self.class_to_idx = {_class: i for i, _class in enumerate(self.classes)}
def __getitem__(self, index):
"""
Args:
index (int): Index
Returns:
tuple: (image, target) where target is index of the target class.
"""
img, target = self.data[index], self.targets[index]
# doing this so that it is consistent with all other datasets
# to return a PIL Image
img = Image.fromarray(img)
if self.transform is not None:
img = self.transform(img)
if self.target_transform is not None:
target = self.target_transform(target)
return img, target
def __len__(self):
return len(self.data)
def _check_integrity(self):
root = self.root
for fentry in (self.train_list + self.test_list):
filename, md5 = fentry[0], fentry[1]
fpath = os.path.join(root, self.base_folder, filename)
if not check_integrity(fpath, md5):
return False
return True
def download(self):
if self._check_integrity():
print('Files already downloaded and verified')
return
download_and_extract_archive(self.url, self.root, filename=self.filename, md5=self.tgz_md5)
def extra_repr(self):
return "Split: {}".format("Train" if self.train is True else "Test")
class CIFAR100(CIFAR10):
"""`CIFAR100 <https://www.cs.toronto.edu/~kriz/cifar.html>`_ Dataset.
This is a subclass of the `CIFAR10` Dataset.
Example::
from jittor.dataset.cifar import CIFAR100
a = CIFAR100()
a.set_attrs(batch_size=16)
for imgs, labels in a:
print(imgs.shape, labels.shape)
break
"""
base_folder = 'cifar-100-python'
url = "https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz"
filename = "cifar-100-python.tar.gz"
tgz_md5 = 'eb9058c3a382ffc7106e4002c42a8d85'
train_list = [
['train', '16019d7e3df5f24257cddd939b257f8d'],
]
test_list = [
['test', 'f0ef6b0ae62326f3e7ffdfab6717acfc'],
]
meta = {
'filename': 'meta',
'key': 'fine_label_names',
'md5': '7973b15100ade9c7d40fb424638fde48',
}


@ -26,6 +26,7 @@ dataset_root = os.path.join(pathlib.Path.home(), ".cache", "jittor", "dataset")
mp_log_v = os.environ.get("mp_log_v", 0)
mpi = jt.mpi
img_open_hook = HookTimer(Image, "open")
CHECK_MEMORY = int(os.environ.get("CHECK_MEMORY", "0"))
class Worker:
def __init__(self, target, args, buffer_size, keep_numpy_array=False):
@ -48,6 +49,8 @@ class Dataset(object):
[in] drop_last(bool): if true, the last incomplete batch is dropped; otherwise it may be smaller than batch_size. Default: True.
[in] num_workers(int): number of workers for loading data.
[in] buffer_size(int): buffer size for each worker in bytes, default(512MB).
[in] keep_numpy_array(bool): return numpy array rather than jittor array, default(False).
[in] endless(bool): will this dataset yield data forever, default(False).
Example::
@ -70,8 +73,11 @@ class Dataset(object):
num_workers = 0,
buffer_size = 512*1024*1024,
stop_grad = True,
keep_numpy_array = False):
keep_numpy_array = False,
endless = False):
super().__init__()
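# allow forcing single-process loading via the DISABLE_MULTIPROCESSING env var (useful for debugging workers)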
if os.environ.get("DISABLE_MULTIPROCESSING", '0') == '1':
num_workers = 0
self.total_len = None
self.batch_size = batch_size
self.shuffle = shuffle
@ -80,7 +86,10 @@ class Dataset(object):
self.buffer_size = buffer_size
self.stop_grad = stop_grad
self.keep_numpy_array = keep_numpy_array
self.endless = endless
self.epoch_id = 0
self.sampler = None
self._disable_workers = False
def __getitem__(self, index):
raise NotImplementedError
@ -129,13 +138,16 @@ class Dataset(object):
if self.stop_grad else jt.array(x)
if isinstance(batch, np.ndarray):
return to_jt(batch)
if isinstance(batch, dict):
new_batch = {}
for k,v in batch.items():
new_batch[k] = self.to_jittor(v)
return new_batch
if not isinstance(batch, (list, tuple)):
return batch
new_batch = []
for a in batch:
if isinstance(a, np.ndarray) or \
isinstance(a, int) or \
isinstance(a, float):
if isinstance(a, np.ndarray):
new_batch.append(to_jt(a))
else:
new_batch.append(self.to_jittor(a))
@ -180,15 +192,20 @@ class Dataset(object):
while True:
# get id
with gid_lock:
while gid_obj.value >= self.batch_len or buffer.is_stop():
while buffer.is_stop() or self.idqueue.is_stop() or \
gid_obj.value >= self.batch_len:
self.num_idle.value += 1
self.num_idle_c.notify()
self.gidc.wait()
self.num_idle.value -= 1
cid = gid_obj.value
self.idmap[cid] = worker_id
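# snapshot this batch's indices while still holding the lock, since index_list may be rewritten when an endless epoch restarts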
batch_index_list = self.index_list_numpy[
cid*self.real_batch_size:
min(self.real_len, (cid+1)*self.real_batch_size)
].copy()
gid_obj.value += 1
self.gidc.notify()
with self.idqueue_lock:
self.idqueue.push(worker_id)
now = time.time()
other_time = now - start
start = now
@ -197,8 +214,8 @@ class Dataset(object):
batch = []
if mp_log_v:
print(f"#{worker_id} {os.getpid()} load batch", cid*self.real_batch_size, min(self.real_len, (cid+1)*self.real_batch_size))
for i in range(cid*self.real_batch_size, min(self.real_len, (cid+1)*self.real_batch_size)):
batch.append(self[self.index_list[i]])
for i in batch_index_list:
batch.append(self[i])
batch = self.collate_batch(batch)
now = time.time()
data_time = now - start
@ -276,10 +293,10 @@ Example::
if not hasattr(self, "workers"):
return
msg = [""]
msg.append(f"progress:{self.last_id}/{self.batch_len}")
msg.append(f"progress:{self.batch_id}/{self.batch_len}")
msg.append(f"batch(s): {self.batch_time:.3f}\twait(s):{self.wait_time:.3f}")
msg.append(f"recv(s): {self.recv_time:.3f}\tto_jittor(s):{self.to_jittor_time:.3f}")
msg.append(f"last 10 workers: {self.idmap[max(0, self.last_id-9):self.last_id+1]}")
msg.append(f"last 10 workers: {self.last_ids}")
msg.append(f"ID\twait(s)\topen(s)\tload(s)\tsend(s)\ttotal(s)")
for i in range(self.num_workers):
w = self.workers[i]
@ -291,6 +308,7 @@ Example::
# stop workers
for w in self.workers:
w.buffer.stop()
self.idqueue.stop()
# wait until all workers idle
if self.num_idle.value < self.num_workers:
with self.gid.get_lock():
@ -304,29 +322,34 @@ Example::
# clean workers' buffer
for w in self.workers:
w.buffer.clear()
self.idqueue.clear()
self.gid_obj.value = 0
def _init_workers(self):
def _init_workers(self, index_list):
jt.clean()
jt.gc()
self.index_list = mp.Array('i', self.real_len, lock=False)
workers = []
# batch id to worker id
self.idmap = mp.Array('i', self.batch_len, lock=False)
# get worker id
self.idqueue = jt.RingBuffer(2048)
self.idqueue_lock = mp.Lock()
# global token index
self.gid = mp.Value('i', self.batch_len)
self.gid.value = 0
# global token index condition
self.gidc = mp.Condition(self.gid.get_lock())
# number of idle workers
self.num_idle = mp.Value('i', 0, lock=False)
# number of idle workers condition
self.num_idle_c = mp.Condition(self.gid.get_lock())
self.index_list_numpy = np.ndarray(dtype='int32', shape=self.real_len, buffer=self.index_list)
self.index_list_numpy[:] = index_list
for i in range(self.num_workers):
w = Worker(target=self._worker_main, args=(i,),
buffer_size=self.buffer_size,
keep_numpy_array=self.keep_numpy_array)
workers.append(w)
self.workers = workers
self.index_list_numpy = np.ndarray(dtype='int32', shape=self.real_len, buffer=self.index_list)
def reset(self):
if not hasattr(self, "workers"):
@ -351,8 +374,8 @@ Example::
if self.total_len is None:
self.total_len = len(self)
return self.total_len
def __iter__(self):
def _get_index_list(self):
if self.total_len is None:
self.total_len = len(self)
# maybe rewrite by sampler
@ -381,7 +404,8 @@ Example::
world_size = mpi.world_size()
world_rank = mpi.world_rank()
index_list = np.int32(index_list)
mpi.broadcast(index_list, 0)
# TODO: mpi broadcast in subprocess has bug, fix it
# mpi.broadcast(index_list, 0)
assert self.batch_size >= world_size, \
f"Batch size({self.batch_size}) is smaller than MPI world_size({world_size})"
@ -416,71 +440,104 @@ Example::
else:
self.real_len = self.total_len
self.real_batch_size = self.batch_size
self.batch_len = self.__batch_len__()
return index_list
def _epochs(self):
if self.endless:
while True:
yield
self.epoch_id += 1
else:
yield
def __iter__(self):
if self._disable_workers:
self.num_workers = 0
index_list = self._get_index_list()
if not hasattr(self, "workers") and self.num_workers:
self._init_workers()
self._init_workers(index_list)
self.last_ids = [-1] * 10
if self.num_workers:
self._stop_all_workers()
self.index_list_numpy[:] = index_list
gid_obj = self.gid.get_obj()
gid_lock = self.gid.get_lock()
with gid_lock:
gid_obj.value = 0
self.gidc.notify_all()
start = time.time()
self.batch_time = 0
for i in range(self.batch_len):
# try not get lock first
if gid_obj.value <= i:
with gid_lock:
if gid_obj.value <= i:
if mp_log_v:
print("wait")
self.gidc.wait()
now = time.time()
self.wait_time = now - start
start = now
gid_obj = self.gid.get_obj()
gid_lock = self.gid.get_lock()
self.last_id = i
worker_id = self.idmap[i]
w = self.workers[worker_id]
if mp_log_v:
print(f"#{worker_id} {os.getpid()} recv buffer", w.buffer)
batch = w.buffer.recv()
now = time.time()
self.recv_time = now - start
start = now
for _ in self._epochs():
with gid_lock:
if self.num_idle.value:
self.gidc.notify_all()
if mp_log_v:
print(f"#{worker_id} {os.getpid()} recv", type(batch).__name__, [ type(b).__name__ for b in batch ])
batch = self.to_jittor(batch)
now = time.time()
self.to_jittor_time = now - start
start = now
for i in range(self.batch_len):
if self.num_idle.value:
with gid_lock:
if self.num_idle.value and \
gid_obj.value >= self.batch_len:
index_list = self._get_index_list()
self.index_list_numpy[:] = index_list
gid_obj.value = 0
self.gidc.notify_all()
yield batch
# get which worker has this batch
worker_id = self.idqueue.pop()
now = time.time()
self.batch_time = now - start
start = now
now = time.time()
self.wait_time = now - start
start = now
self.last_ids[i%10] = worker_id
self.batch_id = i
w = self.workers[worker_id]
if mp_log_v:
print(f"#{worker_id} {os.getpid()} recv buffer", w.buffer)
batch = w.buffer.recv()
now = time.time()
self.recv_time = now - start
start = now
if mp_log_v:
print(f"#{worker_id} {os.getpid()} recv", type(batch).__name__, [ type(b).__name__ for b in batch ])
batch = self.to_jittor(batch)
now = time.time()
self.to_jittor_time = now - start
start = now
yield batch
now = time.time()
self.batch_time = now - start
start = now
if CHECK_MEMORY and self.batch_id % CHECK_MEMORY == 0:
jt.display_memory_info()
else:
batch_data = []
for idx in index_list:
batch_data.append(self[int(idx)])
if len(batch_data) == self.real_batch_size:
for _ in self._epochs():
self.batch_id = 0
batch_data = []
for idx in index_list:
batch_data.append(self[int(idx)])
if len(batch_data) == self.real_batch_size:
batch_data = self.collate_batch(batch_data)
batch_data = self.to_jittor(batch_data)
yield batch_data
self.batch_id += 1
if CHECK_MEMORY and self.batch_id % CHECK_MEMORY == 0:
jt.display_memory_info()
batch_data = []
# depend on drop_last
if not self.drop_last and len(batch_data) > 0:
batch_data = self.collate_batch(batch_data)
batch_data = self.to_jittor(batch_data)
self.batch_id += 1
yield batch_data
batch_data = []
# depend on drop_last
if not self.drop_last and len(batch_data) > 0:
batch_data = self.collate_batch(batch_data)
batch_data = self.to_jittor(batch_data)
yield batch_data
class ImageFolder(Dataset):
@ -537,3 +594,47 @@ class ImageFolder(Dataset):
if self.transform:
img = self.transform(img)
return img, self.imgs[k][1]
class VarDataset(Dataset):
""" Dataset using Var directly, TensorDataset is alias of VarDataset, Example::
import jittor as jt
from jittor.dataset import VarDataset
x = jt.array([1,2,3])
y = jt.array([4,5,6])
z = jt.array([7,8,9])
dataset = VarDataset(x, y, z)
dataset.set_attrs(batch_size=1)
for a,b,c in dataset:
print(a,b,c)
# will print
# 1,4,7
# 2,5,8
# 3,6,9
"""
def __init__(self, *args):
super().__init__()
self.args = args
self._disable_workers = True
assert len(args), "at least one argument is required"
l = len(args[0])
for a in args:
assert l == len(a), "all arguments must have the same length"
self.set_attrs(total_len=l)
def __getitem__(self, idx):
return [ a[idx] for a in self.args ]
def collate_batch(self, batch):
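# stacking per-sample scalars taken from a 1-D Var yields shape (batch, 1); squeeze back to (batch,)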
b = collate_batch(batch)
for i in range(len(self.args)):
x = b[i]
if jt.is_var(self.args[i]) and self.args[i].ndim == 1:
x.assign(x.squeeze(-1))
return b
TensorDataset = VarDataset


@ -27,8 +27,7 @@ def collate_batch(batch):
elem = batch[0]
elem_type = type(elem)
if isinstance(elem, jt.Var):
# TODO: use jittor
temp_data = np.stack([data.data for data in batch], 0)
temp_data = jt.stack([data for data in batch], 0)
return temp_data
if elem_type is np.ndarray:
temp_data = np.stack([data for data in batch], 0)


@ -29,18 +29,7 @@ kernel(in0->num/in0->shape[in0->shape.size()-1], 0, in0_p, out0_p, in0->shape[in
class OneHotCategorical:
def __init__(self, probs=None, logits=None):
assert not (probs is None and logits is None)
if probs is None:
# cannot align to pytorch
probs = jt.sigmoid(logits)
elif logits is None:
logits = jt.log(probs)
with jt.no_grad():
self.probs = probs / probs.sum(-1, True)
self.cum_probs = simple_presum(self.probs)
self.cum_probs_l = self.cum_probs[..., :-1]
self.cum_probs_r = self.cum_probs[..., 1:]
self.logits = logits
Categorical.__init__(self, probs, logits)
def sample(self, sample_shape=()):
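# inverse-CDF sampling: draw u ~ U(0,1) and mark the cumulative-probability interval containing it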
shape = sample_shape + self.probs.shape[:-1] + (1,)
@ -48,17 +37,12 @@ class OneHotCategorical:
one_hot = jt.logical_and(self.cum_probs_l < rand, rand <= self.cum_probs_r).float()
return one_hot
def log_prob(self,x):
if len(x.shape) == 1:
x = x.unsqueeze(0)
logits = self.logits.broadcast(x.shape)
indices = jt.argmax(x, dim=-1)[0]
return logits.gather(1, indices.unsqueeze(-1)).reshape(-1)
def log_prob(self, x):
x = jt.argmax(x, dim=-1)[0]
return Categorical.log_prob(self, x)
def entropy(self):
min_real = -(math.pow(2,23)-1) / math.pow(2,22) * math.pow(2,127)
logits = jt.clamp(self.logits,min_v=min_real)
p_log_p = logits * self.probs
p_log_p = self.logits * self.probs
return -p_log_p.sum(-1)
@ -68,29 +52,32 @@ class Categorical:
if probs is None:
# cannot align to pytorch
probs = jt.sigmoid(logits)
elif logits is None:
logits = jt.log(probs)
probs = probs / probs.sum(-1, True)
if logits is None:
logits = jt.safe_log(probs)
with jt.no_grad():
self.probs = probs / probs.sum(-1, True)
self.probs = probs
self.logits = logits
self.cum_probs = simple_presum(probs)
self.cum_probs = simple_presum(self.probs)
self.cum_probs_l = self.cum_probs[..., :-1]
self.cum_probs_r = self.cum_probs[..., 1:]
def sample(self, sample_shape=[]):
def sample(self, sample_shape=()):
shape = sample_shape + self.probs.shape[:-1] + (1,)
rand = jt.rand(shape)
one_hot = jt.logical_and(self.cum_probs_l < rand, rand <= self.cum_probs_r)
index = one_hot.index(one_hot.ndim-1)
index = one_hot.index(one_hot.ndim - 1)
return (one_hot * index).sum(-1)
def log_prob(self, x):
return jt.log(self.probs)[0,x]
a = self.probs.ndim
b = x.ndim
indexes = tuple( f'i{i}' for i in range(b-a+1, b) )
indexes = indexes + (x,)
return jt.safe_log(self.probs).getitem(indexes)
def entropy(self):
min_real = -(math.pow(2,23)-1) / math.pow(2,22) * math.pow(2,127)
logits = jt.clamp(self.logits,min_v=min_real)
p_log_p = logits * self.probs
p_log_p = self.logits * self.probs
return -p_log_p.sum(-1)
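A hedged usage sketch of the refactored Categorical above (assuming the jittor.distributions module path): batched probabilities are normalized in __init__, and log_prob now indexes into the batch with a broadcast-aware getitem.
import jittor as jt
from jittor.distributions import Categorical
probs = jt.array([[0.1, 0.2, 0.7],
                  [0.5, 0.25, 0.25]])  # two distributions over three classes
dist = Categorical(probs=probs)
s = dist.sample()                      # shape [2], one class id per row
lp = dist.log_prob(s)                  # shape [2], log-probability of each sample
print(s.shape, lp.shape)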
@ -104,11 +91,11 @@ class Normal:
def log_prob(self, x):
var = self.sigma**2
log_scale = jt.log(self.sigma)
log_scale = jt.safe_log(self.sigma)
return -((x-self.mu)**2) / (2*var) - log_scale-np.log(np.sqrt(2*np.pi))
def entropy(self):
return 0.5+0.5*np.log(2*np.pi)+jt.log(self.sigma)
return 0.5+0.5*np.log(2*np.pi)+jt.safe_log(self.sigma)
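For reference, log_prob and entropy above are the Gaussian closed forms:
\log p(x) = -\frac{(x-\mu)^2}{2\sigma^2} - \log\sigma - \tfrac{1}{2}\log(2\pi),
\qquad
H = \tfrac{1}{2} + \tfrac{1}{2}\log(2\pi) + \log\sigma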
class Uniform:
@ -123,10 +110,10 @@ class Uniform:
def log_prob(self,x):
if x < self.low or x >= self.high:
return math.inf
return -jt.log(self.high - self.low)
return -jt.safe_log(self.high - self.low)
def entropy(self):
return jt.log(self.high - self.low)
return jt.safe_log(self.high - self.low)
class Geometric:
@ -138,15 +125,14 @@ class Geometric:
self.logits = logits
elif logits is None:
self.prob = p
self.logits = -jt.log(1. / p - 1)
self.logits = -jt.safe_log(1. / p - 1)
def sample(self, sample_shape):
tiny = jt.info(self.probs.dtype).tiny
u = jt.clamp(jt.rand(sample_shape),min_v=tiny)
return (jt.log(u) / (jt.log(-self.probs+1))).floor()
u = jt.rand(sample_shape)
return (jt.safe_log(u) / (jt.safe_log(-self.probs+1))).floor_int()
def log_prob(self, x):
return x*jt.log(-self.prob+1)+jt.log(self.prob)
return x*jt.safe_log(-self.prob+1)+jt.safe_log(self.prob)
def entropy(self):
return binary_cross_entropy_with_logits(jt.array(self.logits),jt.array(self.prob)) / self.prob
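The sampler above is inverse-CDF sampling: with U \sim \mathrm{Uniform}(0,1),
X = \left\lfloor \frac{\ln U}{\ln(1-p)} \right\rfloor \sim \mathrm{Geometric}(p),
\qquad
\log P(X=k) = k\log(1-p) + \log p,
counting failures before the first success, which is exactly what sample and log_prob compute via safe_log.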
@ -157,16 +143,14 @@ def kl_divergence(cur_dist, old_dist):
if isinstance(cur_dist, Normal):
vr = (cur_dist.sigma / old_dist.sigma)**2
t1 = ((cur_dist.mu - old_dist.mu) / old_dist.sigma)**2
return 0.5*(vr+t1-1-jt.log(vr))
return 0.5*(vr+t1-1-jt.safe_log(vr))
if isinstance(cur_dist, Categorical) or isinstance(cur_dist,OneHotCategorical):
t = cur_dist.probs * (cur_dist.logits-old_dist.logits)
t[jt.array((old_dist.probs == 0))] = math.inf
t[jt.array((cur_dist.probs == 0))] = 0
return t.sum(-1)
if isinstance(cur_dist, Uniform):
res = jt.log((old_dist.high - old_dist.low) / (cur_dist.high - cur_dist.low))
res = jt.safe_log((old_dist.high - old_dist.low) / (cur_dist.high - cur_dist.low))
if old_dist.low > cur_dist.low or old_dist.high < cur_dist.high:
res = math.inf
return res
if isinstance(cur_dist, Geometric):
return -cur_dist.entropy() - jt.log(-old_dist.prob+1) / cur_dist.prob - old_dist.logits
return -cur_dist.entropy() - jt.safe_log(-old_dist.prob+1) / cur_dist.prob - old_dist.logits
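The Normal branch above uses the standard closed form: with v = (\sigma_1/\sigma_2)^2 and t = ((\mu_1-\mu_2)/\sigma_2)^2,
D_{\mathrm{KL}}\big(\mathcal{N}(\mu_1,\sigma_1^2)\,\|\,\mathcal{N}(\mu_2,\sigma_2^2)\big) = \tfrac{1}{2}\,(v + t - 1 - \ln v)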

View File

@ -11,6 +11,7 @@
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include "utils/log.h"
#include "helper_cuda.h"
#include "fp16_emu.h"
#include "common.h"

View File

@ -58,6 +58,7 @@
// CUDA and CUBLAS functions
#include <helper_functions.h>
#include "utils/log.h"
#include "helper_cuda.h"
#ifndef min

View File

@ -8,6 +8,7 @@
#include <cuda_runtime.h>
#include <cudnn.h>
#include "utils/log.h"
#include "helper_cuda.h"
#include "fp16_emu.h"
#include "common.h"

View File

@ -0,0 +1,288 @@
// ***************************************************************
// Copyright (c) 2021 Jittor. All Rights Reserved.
// Maintainers:
// Dun Liang <randonlang@gmail.com>
// Guowei Yang <471184555@qq.com>
//
// This file is subject to the terms and conditions defined in
// file 'LICENSE.txt', which is part of this source code package.
// ***************************************************************
#include "mem/allocator.h"
#include "var.h"
#include "cudnn_conv3d_backward_w_op.h"
#include "cudnn_warper.h"
#include "executor.h"
#include "ops/op_register.h"
using namespace std;
namespace jittor {
#pragma GCC diagnostic ignored "-Wunused-variable"
#ifndef JIT
CudnnConv3dBackwardWOp::CudnnConv3dBackwardWOp(Var* x, Var* dy, int kd, int kh, int kw, int strided, int strideh, int stridew, int paddingd, int paddingh, int paddingw, int dilationd, int dilationh, int dilationw, int groups, string xformat)
: x(x), dy(dy), kd(kd), kh(kh), kw(kw), strided(strided), strideh(strideh), stridew(stridew), paddingd(paddingd), paddingh(paddingh), paddingw(paddingw), dilationd(dilationd), dilationh(dilationh), dilationw(dilationw), groups(groups),
xformat(move(xformat)) {
flags.set(NodeFlags::_cuda, 1);
flags.set(NodeFlags::_cpu, 0);
dw = create_output(nullptr, dtype_infer(dy->ns, x->ns));
}
void CudnnConv3dBackwardWOp::infer_shape() {
ASSERTop(x->shape.size(),==,5);
ASSERTop(dy->shape.size(),==,5);
int xn, xc, xd, xh, xw, wd, wh, ww, wci, wco, yn, yc, yd, yh, yw;
if (xformat == "ncdhw") {
x->shape.unpack(xn, xc, xd, xh, xw);
dy->shape.unpack(yn, yc, yd, yh, yw);
} else {
x->shape.unpack(xn, xd, xh, xw, xc);
dy->shape.unpack(yn, yd, yh, yw, yc);
}
wco = yc, wci = xc / groups;
wh = kh;
ww = kw;
wd = kd;
dw->set_shape(NanoVector(wco, wci, wd, wh, ww));
}
void CudnnConv3dBackwardWOp::jit_prepare(JK& jk) {
jk << _CS("[Tx:") << x->dtype();
jk << _CS("][Ty:") << dy->dtype();
jk << _CS("][Tw:") << dw->dtype();
jk << ']';
}
static auto make_conv3d = get_op_info("cudnn_conv3d")
.get_constructor<VarPtr, Var*, Var*, int, int, int, int, int, int, int, int, int, int, string>();
static auto make_backwardx = get_op_info("cudnn_conv3d_backward_x")
.get_constructor<VarPtr, Var*, Var*, int, int, int, int, int, int, int, int, int, int, int, int, int, string>();
VarPtr CudnnConv3dBackwardWOp::grad(Var* out, Var* dout, Var* v, int v_index) {
int xn, xc, xd, xh, xw, wd, wh, ww, wci, wco, yn, yc, yd, yh, yw;
if (xformat == "ncdhw") {
x->shape.unpack(xn, xc, xd, xh, xw);
dy->shape.unpack(yn, yc, yd, yh, yw);
} else {
x->shape.unpack(xn, xd, xh, xw, xc);
dy->shape.unpack(yn, yd, yh, yw, yc);
}
if (v_index == 0) {
return make_backwardx(dout, dy, xd, xh, xw, strided, strideh, stridew, paddingd, paddingh, paddingw, dilationd, dilationh, dilationw, groups, xformat);
} else {
return make_conv3d(x, dout, strided, strideh, stridew, paddingd, paddingh, paddingw, dilationd, dilationh, dilationw, groups, xformat);
}
}
// unordered_map<string, cudnnConvolutionBwdFilterAlgo_t> bwdw_algo_cache;
#else // JIT
#ifdef JIT_cuda
#pragma clang diagnostic ignored "-Wtautological-compare"
extern unordered_map<string, cudnnConvolutionBwdFilterAlgo_t> bwdw_algo_cache;
template <typename T_ELEM> __inline__ cudnnDataType_t getDataType();
template <> __inline__ cudnnDataType_t getDataType<half1>() { return CUDNN_DATA_HALF; }
template <> __inline__ cudnnDataType_t getDataType<float>() { return CUDNN_DATA_FLOAT; }
void CudnnConv3dBackwardWOp::jit_run() {
auto w = dw;
auto y = dy;
cudnnHandle_t& handle_ = cudnn_handle;
cudnnTensorDescriptor_t cudnnIdesc;
cudnnFilterDescriptor_t cudnnFdesc;
cudnnTensorDescriptor_t cudnnOdesc;
cudnnConvolutionDescriptor_t cudnnConvDesc;
checkCudaErrors(cudnnCreateTensorDescriptor( &cudnnIdesc ));
checkCudaErrors(cudnnCreateFilterDescriptor( &cudnnFdesc ));
checkCudaErrors(cudnnCreateTensorDescriptor( &cudnnOdesc ));
checkCudaErrors(cudnnCreateConvolutionDescriptor( &cudnnConvDesc ));
checkCudaErrors(cudnnSetConvolutionGroupCount( cudnnConvDesc, groups ));
int xn, xc, xd, xh, xw, wd, wh, ww, wci, wco, yn, yc, yd, yh, yw;
int sx[] = {0,0,0,0,1};
for (int i=3; i>=0; i--) sx[i] = sx[i+1] * x->shape[i+1];
int strideX[5];
if (xformat == "ncdhw") {
x->shape.unpack(xn, xc, xd, xh, xw);
int tmp[5] = {sx[0],sx[1],sx[2],sx[3],sx[4]};
memcpy(strideX, tmp, sizeof(tmp));
} else {
x->shape.unpack(xn, xd, xh, xw, xc);
int tmp[5] = {sx[0],sx[2],sx[3],sx[4],sx[1]};
memcpy(strideX, tmp, sizeof(tmp));
}
int dimX[] = {xn, xc, xd, xh, xw};
// dimX: ncdhw
checkCudaErrors(cudnnSetTensorNdDescriptor(
cudnnIdesc, getDataType<Tx>(),
5, dimX, strideX
));
auto ws = w->shape;
int dimW[] = {(int)ws[0],(int)ws[1],(int)ws[2],(int)ws[3],(int)ws[4]};
// cudnn only supports these two formats
// https://docs.nvidia.com/deeplearning/sdk/cudnn-api/index.html#cudnnSetFilterNdDescriptor
#define filterFormat_oihw CUDNN_TENSOR_NCHW
#define filterFormat_ohwi CUDNN_TENSOR_NHWC
// dimW: KCRS(oihw)
checkCudaErrors(cudnnSetFilterNdDescriptor(
cudnnFdesc, getDataType<Tw>(),
// filterFormat_@WFORMAT, 5, dimW
filterFormat_oihw, 5, dimW
));
int padA[] = {paddingd, paddingh, paddingw};
int convstrideA[] = {strided, strideh, stridew};
int dilationA[] = {dilationd, dilationh, dilationw};
// the difference between CUDNN_CONVOLUTION and CUDNN_CROSS_CORRELATION
// is the kernel rc order; currently no perf difference is observed
// between these two modes
checkCudaErrors(cudnnSetConvolutionNdDescriptor(
cudnnConvDesc, 3,
padA, convstrideA, dilationA,
CUDNN_CROSS_CORRELATION, getDataType<Ty>()
));
// using tensor core
// checkCudaErrors( cudnnSetConvolutionMathType(cudnnConvDesc, CUDNN_TENSOR_OP_MATH) );
int sy[] = {0,0,0,0,1};
for (int i=3; i>=0; i--) sy[i] = sy[i+1] * y->shape[i+1];
int strideY[5];
if (xformat == "ncdhw") {
y->shape.unpack(yn, yc, yd, yh, yw);
int tmp[5] = {sy[0],sy[1],sy[2],sy[3],sy[4]};
memcpy(strideY, tmp, sizeof(tmp));
} else {
y->shape.unpack(yn, yd, yh, yw, yc);
int tmp[5] = {sy[0],sy[2],sy[3],sy[4],sy[1]};
memcpy(strideY, tmp, sizeof(tmp));
}
int dimY[] = {yn, yc, yd, yh, yw};
checkCudaErrors(cudnnSetTensorNdDescriptor(
cudnnOdesc, getDataType<Ty>(),
5, dimY, strideY
));
cudnnConvolutionBwdFilterAlgo_t algos[] = {
CUDNN_CONVOLUTION_BWD_FILTER_ALGO_0,
CUDNN_CONVOLUTION_BWD_FILTER_ALGO_1,
CUDNN_CONVOLUTION_BWD_FILTER_ALGO_FFT,
CUDNN_CONVOLUTION_BWD_FILTER_ALGO_3,
CUDNN_CONVOLUTION_BWD_FILTER_ALGO_WINOGRAD_NONFUSED,
CUDNN_CONVOLUTION_BWD_FILTER_ALGO_FFT_TILING,
};
int num_algos = CUDNN_CONVOLUTION_BWD_FILTER_ALGO_COUNT;
int perf_count;
cudnnConvolutionBwdFilterAlgoPerf_t perf_results[num_algos];
cudnnConvolutionBwdFilterAlgo_t algo;
bool benchmark=true;
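// build a textual cache key from the input/filter shapes and conv hyper-parameters,
// so the benchmarked algorithm is reused for identical configurations instead of
// re-running cudnnFind* on every launch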
jk.clear();
jk << dimX[0] << "," << dimX[1] << "," << dimX[2] << "," << dimX[3] << "," << dimX[4] << ",";
jk << dimW[0] << "," << dimW[1] << "," << dimW[2] << "," << dimW[3] << "," << dimW[4] << ",";
jk << paddingd << paddingh << paddingw << "," << strided << strideh <<stridew << "," << dilationd << dilationh << dilationw << "," << groups << ".";
auto iter = bwdw_algo_cache.find(jk.to_string());
if (iter!=bwdw_algo_cache.end()) algo = iter->second;
else {
if (bwdw_algo_cache.size()>=max_cache_size) benchmark = false;
if (benchmark) {
size_t max_ws_size = 0;
for (int i = 0; i < num_algos; i++) {
size_t sz;
cudnnStatus_t ret = cudnnGetConvolutionBackwardFilterWorkspaceSize(handle_, cudnnIdesc, cudnnOdesc, cudnnConvDesc, cudnnFdesc, algos[i], &sz);
// skip algorithms that would use too much workspace
if (sz > mem_info.total_cuda_ram * max_workspace_ratio) continue;
if (CUDNN_STATUS_SUCCESS == ret && sz > max_ws_size) max_ws_size = sz;
}
size_t allocation;
void* ws = exe.temp_allocator->alloc(max_ws_size, allocation);
checkCudaErrors(cudnnFindConvolutionBackwardFilterAlgorithmEx(
handle_,
cudnnIdesc, x->ptr<Tx>(),
cudnnOdesc, y->ptr<Ty>(),
cudnnConvDesc,
cudnnFdesc, w->ptr<Tw>(),
num_algos,
&perf_count,
perf_results,
ws,
max_ws_size));
exe.temp_allocator->free(ws, max_ws_size, allocation);
} else {
checkCudaErrors(cudnnGetConvolutionBackwardFilterAlgorithm_v7(
handle_,
cudnnIdesc,
cudnnOdesc,
cudnnConvDesc,
cudnnFdesc,
num_algos,
&perf_count,
perf_results));
}
int best_algo_idx=-1;
for (int i = 0; i < perf_count; i++)
if (perf_results[i].status == CUDNN_STATUS_SUCCESS){
best_algo_idx=i;
break;
}
ASSERT(best_algo_idx!=-1);
algo=perf_results[best_algo_idx].algo;
if (benchmark) {
bwdw_algo_cache[jk.to_string()] = algo;
if (bwdw_algo_cache.size()==max_cache_size)
LOGw << "backward w algorithm cache is full";
}
}
// TODO: wrap workspace allocation
void *workSpace = 0;
size_t workSpaceSize;
checkCudaErrors (cudnnGetConvolutionBackwardFilterWorkspaceSize(
handle_, cudnnIdesc, cudnnOdesc, cudnnConvDesc,
cudnnFdesc, algo, &workSpaceSize));
size_t allocation;
if (workSpaceSize > 0) {
workSpace = exe.temp_allocator->alloc(workSpaceSize, allocation);
}
float alpha=1, beta=0;
checkCudaErrors(cudnnConvolutionBackwardFilter(
handle_,
(void*)(&alpha),
cudnnIdesc, x->ptr<Tx>(),
cudnnOdesc, y->ptr<Ty>(),
cudnnConvDesc,
algo,
workSpace, workSpaceSize,
(void*)(&beta),
cudnnFdesc, w->ptr<Tw>())
);
if (workSpace)
exe.temp_allocator->free(workSpace, workSpaceSize, allocation);
checkCudaErrors(cudnnDestroyTensorDescriptor( cudnnIdesc ));
checkCudaErrors(cudnnDestroyFilterDescriptor( cudnnFdesc ));
checkCudaErrors(cudnnDestroyTensorDescriptor( cudnnOdesc ));
checkCudaErrors(cudnnDestroyConvolutionDescriptor( cudnnConvDesc ));
}
#endif
#endif // JIT
} // jittor

View File

@ -0,0 +1,28 @@
// ***************************************************************
// Copyright (c) 2021 Jittor. All Rights Reserved.
// Maintainers:
// Dun Liang <randonlang@gmail.com>
// Guowei Yang <471184555@qq.com>
//
// This file is subject to the terms and conditions defined in
// file 'LICENSE.txt', which is part of this source code package.
// ***************************************************************
#pragma once
#include "op.h"
namespace jittor {
struct CudnnConv3dBackwardWOp : Op {
Var* x, * dy, * dw;
int kd, kh, kw, strided, strideh, stridew, paddingd, paddingh, paddingw, dilationd, dilationh, dilationw, groups;
string xformat;
CudnnConv3dBackwardWOp(Var* x, Var* y, int kd, int kh, int kw, int strided, int strideh, int stridew, int paddingd, int paddingh, int paddingw, int dilationd, int dilationh, int dilationw, int groups=1, string xformat="ncdhw");
const char* name() const override { return "cudnn_conv3d_backward_w"; }
VarPtr grad(Var* out, Var* dout, Var* v, int v_index) override;
void infer_shape() override;
DECLARE_jit_run;
};
} // jittor

View File

@ -0,0 +1,279 @@
// ***************************************************************
// Copyright (c) 2021 Jittor. All Rights Reserved.
// Maintainers:
// Dun Liang <randonlang@gmail.com>
// Guowei Yang <471184555@qq.com>
//
// This file is subject to the terms and conditions defined in
// file 'LICENSE.txt', which is part of this source code package.
// ***************************************************************
#include "mem/allocator.h"
#include "var.h"
#include "cudnn_conv3d_backward_x_op.h"
#include "cudnn_warper.h"
#include "executor.h"
#include "ops/op_register.h"
using namespace std;
namespace jittor {
#pragma GCC diagnostic ignored "-Wunused-variable"
#ifndef JIT
CudnnConv3dBackwardXOp::CudnnConv3dBackwardXOp(Var* w, Var* dy, int depth, int height, int width, int strided, int strideh, int stridew, int paddingd, int paddingh, int paddingw, int dilationd, int dilationh, int dilationw, int groups, string xformat)
: w(w), dy(dy), xd(depth), xh(height), xw(width), strided(strided), strideh(strideh), stridew(stridew), paddingd(paddingd), paddingh(paddingh), paddingw(paddingw), dilationd(dilationd), dilationh(dilationh), dilationw(dilationw), groups(groups),
xformat(move(xformat)) {
flags.set(NodeFlags::_cuda, 1);
flags.set(NodeFlags::_cpu, 0);
dx = create_output(nullptr, dtype_infer(dy->ns, w->ns));
}
void CudnnConv3dBackwardXOp::infer_shape() {
ASSERTop(w->shape.size(),==,5);
ASSERTop(dy->shape.size(),==,5);
int xn, xc, wd, wh, ww, wci, wco, yn, yc, yd, yh, yw;
w->shape.unpack(wco, wci, wd, wh, ww);
if (xformat == "ncdhw")
dy->shape.unpack(yn, yc, yd, yh, yw);
else
dy->shape.unpack(yn, yd, yh, yw, yc);
xn = yn, xc = wci * groups;
if (xformat == "ncdhw")
dx->set_shape(NanoVector(xn, xc, xd, xh, xw));
else
dx->set_shape(NanoVector(xn, xd, xh, xw, xc));
}
void CudnnConv3dBackwardXOp::jit_prepare(JK& jk) {
jk << _CS("[Tx:") << dx->dtype();
jk << _CS("][Ty:") << dy->dtype();
jk << _CS("][Tw:") << w->dtype();
jk << ']';
}
static auto make_conv3d = get_op_info("cudnn_conv3d")
.get_constructor<VarPtr, Var*, Var*, int, int, int, int, int, int, int, int, int, int, string>();
static auto make_backwardw = get_op_info("cudnn_conv3d_backward_w")
.get_constructor<VarPtr, Var*, Var*, int, int, int, int, int, int, int, int, int, int, int, int, int, string>();
VarPtr CudnnConv3dBackwardXOp::grad(Var* out, Var* dout, Var* v, int v_index) {
int xn, xc, wd, wh, ww, wci, wco, yn, yc, yd, yh, yw;
w->shape.unpack(wco, wci, wd, wh, ww);
if (v_index == 0) {
return make_backwardw(dout, dy, wd, wh, ww, strided, strideh, stridew, paddingd, paddingh, paddingw, dilationd, dilationh, dilationw, groups, xformat);
} else {
return make_conv3d(dout, w, strided, strideh, stridew, paddingd, paddingh, paddingw, dilationd, dilationh, dilationw, groups, xformat);
}
}
// unordered_map<string, cudnnConvolutionBwdDataAlgo_t> bwdx_algo_cache;
#else // JIT
#ifdef JIT_cuda
#pragma clang diagnostic ignored "-Wtautological-compare"
extern unordered_map<string, cudnnConvolutionBwdDataAlgo_t> bwdx_algo_cache;
template <typename T_ELEM> __inline__ cudnnDataType_t getDataType();
template <> __inline__ cudnnDataType_t getDataType<half1>() { return CUDNN_DATA_HALF; }
template <> __inline__ cudnnDataType_t getDataType<float>() { return CUDNN_DATA_FLOAT; }
void CudnnConv3dBackwardXOp::jit_run() {
auto x = dx;
auto y = dy;
cudnnHandle_t& handle_ = cudnn_handle;
cudnnTensorDescriptor_t cudnnIdesc;
cudnnFilterDescriptor_t cudnnFdesc;
cudnnTensorDescriptor_t cudnnOdesc;
cudnnConvolutionDescriptor_t cudnnConvDesc;
checkCudaErrors(cudnnCreateTensorDescriptor( &cudnnIdesc ));
checkCudaErrors(cudnnCreateFilterDescriptor( &cudnnFdesc ));
checkCudaErrors(cudnnCreateTensorDescriptor( &cudnnOdesc ));
checkCudaErrors(cudnnCreateConvolutionDescriptor( &cudnnConvDesc ));
checkCudaErrors(cudnnSetConvolutionGroupCount( cudnnConvDesc, groups ));
int xn, xc, xd, xh, xw, wd, wh, ww, wci, wco, yn, yc, yd, yh, yw;
int sx[] = {0,0,0,0,1};
for (int i=3; i>=0; i--) sx[i] = sx[i+1] * x->shape[i+1];
int strideX[5];
if (xformat == "ncdhw") {
x->shape.unpack(xn, xc, xd, xh, xw);
int tmp[5] = {sx[0],sx[1],sx[2],sx[3],sx[4]};
memcpy(strideX, tmp, sizeof(tmp));
} else {
x->shape.unpack(xn, xd, xh, xw, xc);
int tmp[5] = {sx[0],sx[2],sx[3],sx[4],sx[1]};
memcpy(strideX, tmp, sizeof(tmp));
}
int dimX[] = {xn, xc, xd, xh, xw};
// dimX: ncdhw
checkCudaErrors(cudnnSetTensorNdDescriptor(
cudnnIdesc, getDataType<Tx>(),
5, dimX, strideX
));
auto ws = w->shape;
int dimW[] = {(int)ws[0],(int)ws[1],(int)ws[2],(int)ws[3],(int)ws[4]};
// cudnn only supports these two formats
// https://docs.nvidia.com/deeplearning/sdk/cudnn-api/index.html#cudnnSetFilterNdDescriptor
#define filterFormat_oihw CUDNN_TENSOR_NCHW
#define filterFormat_ohwi CUDNN_TENSOR_NHWC
// dimW: KCRS(oihw)
checkCudaErrors(cudnnSetFilterNdDescriptor(
cudnnFdesc, getDataType<Tw>(),
// filterFormat_@WFORMAT, 5, dimW
filterFormat_oihw, 5, dimW
));
int padA[] = {paddingd, paddingh, paddingw};
int convstrideA[] = {strided, strideh, stridew};
int dilationA[] = {dilationd, dilationh, dilationw};
// the difference between CUDNN_CONVOLUTION and CUDNN_CROSS_CORRELATION
// is the kernel rc order; currently no perf difference is observed
// between these two modes
checkCudaErrors(cudnnSetConvolutionNdDescriptor(
cudnnConvDesc, 3,
padA, convstrideA, dilationA,
CUDNN_CROSS_CORRELATION, getDataType<Ty>()
));
// using tensor core
// checkCudaErrors( cudnnSetConvolutionMathType(cudnnConvDesc, CUDNN_TENSOR_OP_MATH) );
int sy[] = {0,0,0,0,1};
for (int i=3; i>=0; i--) sy[i] = sy[i+1] * y->shape[i+1];
int strideY[5];
if (xformat == "ncdhw") {
y->shape.unpack(yn, yc, yd, yh, yw);
int tmp[5] = {sy[0],sy[1],sy[2],sy[3],sy[4]};
memcpy(strideY, tmp, sizeof(tmp));
} else {
y->shape.unpack(yn, yd, yh, yw, yc);
int tmp[5] = {sy[0],sy[2],sy[3],sy[4],sy[1]};
memcpy(strideY, tmp, sizeof(tmp));
}
int dimY[] = {yn, yc, yd, yh, yw};
checkCudaErrors(cudnnSetTensorNdDescriptor(
cudnnOdesc, getDataType<Ty>(),
5, dimY, strideY
));
cudnnConvolutionBwdDataAlgo_t algos[] = {
CUDNN_CONVOLUTION_BWD_DATA_ALGO_0,
CUDNN_CONVOLUTION_BWD_DATA_ALGO_1,
CUDNN_CONVOLUTION_BWD_DATA_ALGO_FFT,
CUDNN_CONVOLUTION_BWD_DATA_ALGO_FFT_TILING,
CUDNN_CONVOLUTION_BWD_DATA_ALGO_WINOGRAD,
CUDNN_CONVOLUTION_BWD_DATA_ALGO_WINOGRAD_NONFUSED
};
int num_algos = CUDNN_CONVOLUTION_BWD_DATA_ALGO_COUNT;
int perf_count;
cudnnConvolutionBwdDataAlgoPerf_t perf_results[num_algos];
cudnnConvolutionBwdDataAlgo_t algo;
bool benchmark=true;
jk.clear();
jk << dimX[0] << "," << dimX[1] << "," << dimX[2] << "," << dimX[3] << "," << dimX[4] << ",";
jk << dimW[0] << "," << dimW[1] << "," << dimW[2] << "," << dimW[3] << "," << dimW[4] << ",";
jk << paddingd << paddingh << paddingw << "," << strided << strideh <<stridew << "," << dilationd << dilationh << dilationw << "," << groups << ".";
auto iter = bwdx_algo_cache.find(jk.to_string());
if (iter!=bwdx_algo_cache.end()) algo = iter->second;
else {
if (bwdx_algo_cache.size()>=max_cache_size) benchmark = false;
if (benchmark) {
size_t max_ws_size = 0;
for (int i = 0; i < num_algos; i++) {
size_t sz;
cudnnStatus_t ret = cudnnGetConvolutionBackwardDataWorkspaceSize(handle_, cudnnFdesc, cudnnOdesc, cudnnConvDesc, cudnnIdesc, algos[i], &sz);
// skip algorithms that would use too much workspace
if (sz > mem_info.total_cuda_ram * max_workspace_ratio) continue;
if (CUDNN_STATUS_SUCCESS == ret && sz > max_ws_size) max_ws_size = sz;
}
size_t allocation;
void* ws = exe.temp_allocator->alloc(max_ws_size, allocation);
checkCudaErrors(cudnnFindConvolutionBackwardDataAlgorithmEx(
handle_,
cudnnFdesc, w->ptr<Tw>(),
cudnnOdesc, y->ptr<Ty>(),
cudnnConvDesc,
cudnnIdesc, x->ptr<Tx>(),
num_algos,
&perf_count,
perf_results,
ws,
max_ws_size));
exe.temp_allocator->free(ws, max_ws_size, allocation);
} else {
checkCudaErrors(cudnnGetConvolutionBackwardDataAlgorithm_v7(
handle_,
cudnnFdesc,
cudnnOdesc,
cudnnConvDesc,
cudnnIdesc,
num_algos,
&perf_count,
perf_results));
}
int best_algo_idx=-1;
for (int i = 0; i < perf_count; i++)
if (perf_results[i].status == CUDNN_STATUS_SUCCESS){
best_algo_idx=i;
break;
}
ASSERT(best_algo_idx!=-1);
algo=perf_results[best_algo_idx].algo;
if (benchmark) {
bwdx_algo_cache[jk.to_string()] = algo;
if (bwdx_algo_cache.size()==max_cache_size)
LOGw << "backward x algorithm cache is full";
}
}
// TODO: wrap workspace allocation
void *workSpace = 0;
size_t workSpaceSize;
checkCudaErrors (cudnnGetConvolutionBackwardDataWorkspaceSize(
handle_, cudnnFdesc, cudnnOdesc, cudnnConvDesc,
cudnnIdesc, algo, &workSpaceSize));
size_t allocation;
if (workSpaceSize > 0) {
workSpace = exe.temp_allocator->alloc(workSpaceSize, allocation);
}
float alpha=1, beta=0;
checkCudaErrors(cudnnConvolutionBackwardData(
handle_,
(void*)(&alpha),
cudnnFdesc, w->ptr<Tw>(),
cudnnOdesc, y->ptr<Ty>(),
cudnnConvDesc,
algo,
workSpace, workSpaceSize,
(void*)(&beta),
cudnnIdesc, x->ptr<Tx>())
);
if (workSpace)
exe.temp_allocator->free(workSpace, workSpaceSize, allocation);
checkCudaErrors(cudnnDestroyTensorDescriptor( cudnnIdesc ));
checkCudaErrors(cudnnDestroyFilterDescriptor( cudnnFdesc ));
checkCudaErrors(cudnnDestroyTensorDescriptor( cudnnOdesc ));
checkCudaErrors(cudnnDestroyConvolutionDescriptor( cudnnConvDesc ));
}
#endif
#endif // JIT
} // jittor

View File

@ -0,0 +1,28 @@
// ***************************************************************
// Copyright (c) 2021 Jittor. All Rights Reserved.
// Maintainers:
// Dun Liang <randonlang@gmail.com>
// Guowei Yang <471184555@qq.com>
//
// This file is subject to the terms and conditions defined in
// file 'LICENSE.txt', which is part of this source code package.
// ***************************************************************
#pragma once
#include "op.h"
namespace jittor {
struct CudnnConv3dBackwardXOp : Op {
Var* w, * dy, * dx;
int xd, xh, xw, strided, strideh, stridew, paddingd, paddingh, paddingw, dilationd, dilationh, dilationw, groups;
string xformat;
CudnnConv3dBackwardXOp(Var* w, Var* y, int depth, int height, int width, int strided, int strideh, int stridew, int paddingd, int paddingh, int paddingw, int dilationd, int dilationh, int dilationw, int groups=1, string xformat="ncdhw");
const char* name() const override { return "cudnn_conv3d_backward_x"; }
VarPtr grad(Var* out, Var* dout, Var* v, int v_index) override;
void infer_shape() override;
DECLARE_jit_run;
};
} // jittor

View File

@ -0,0 +1,284 @@
// ***************************************************************
// Copyright (c) 2021 Jittor. All Rights Reserved.
// Maintainers: Dun Liang <randonlang@gmail.com>.
//
// This file is subject to the terms and conditions defined in
// file 'LICENSE.txt', which is part of this source code package.
// ***************************************************************
#include "var.h"
#include "cudnn_conv3d_op.h"
#include "cudnn_warper.h"
#include "executor.h"
#include "ops/op_register.h"
using namespace std;
namespace jittor {
#pragma GCC diagnostic ignored "-Wunused-variable"
#ifndef JIT
CudnnConv3dOp::CudnnConv3dOp(Var* x, Var* w, int strided, int strideh, int stridew, int paddingd, int paddingh, int paddingw, int dilationd, int dilationh, int dilationw, int groups, string xformat)
: x(x), w(w), strided(strided), strideh(strideh), stridew(stridew), paddingd(paddingd), paddingh(paddingh), paddingw(paddingw), dilationd(dilationd), dilationh(dilationh), dilationw(dilationw), groups(groups),
xformat(move(xformat)) {
flags.set(NodeFlags::_cuda, 1);
flags.set(NodeFlags::_cpu, 0);
y = create_output(nullptr, dtype_infer(x->ns, w->ns));
}
void CudnnConv3dOp::infer_shape() {
ASSERTop(x->shape.size(),==,5);
ASSERTop(w->shape.size(),==,5);
int xn, xc, xd, xh, xw, wd, wh, ww, wci, wco, yn, yc, yd, yh, yw;
if (xformat == "ncdhw")
x->shape.unpack(xn, xc, xd, xh, xw);
else
x->shape.unpack(xn, xd, xh, xw, xc);
w->shape.unpack(wco, wci, wd, wh, ww);
ASSERTop(wci * groups,==,xc);
yn = xn, yc = wco;
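// output size per spatial dim: o = (i + 2p - d*(k-1) - 1) / s + 1 (floor division),
// expanded below as (i + 2p - d*k + d - 1) / s + 1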
yd = (xd+paddingd*2-wd*dilationd+dilationd-1)/strided+1;
yh = (xh+paddingh*2-wh*dilationh+dilationh-1)/strideh+1;
yw = (xw+paddingw*2-ww*dilationw+dilationw-1)/stridew+1;
if (xformat == "ncdhw")
y->set_shape(NanoVector(yn, yc, yd, yh, yw));
else
y->set_shape(NanoVector(yn, yd, yh, yw, yc));
}
void CudnnConv3dOp::jit_prepare(JK& jk) {
jk << _CS("[Tx:") << x->dtype();
jk << _CS("][Ty:") << y->dtype();
jk << _CS("][Tw:") << w->dtype();
jk << ']';
}
static auto make_backwardx = get_op_info("cudnn_conv3d_backward_x")
.get_constructor<VarPtr, Var*, Var*, int, int, int, int, int, int, int, int, int, int, int, int, int, string>();
static auto make_backwardw = get_op_info("cudnn_conv3d_backward_w")
.get_constructor<VarPtr, Var*, Var*, int, int, int, int, int, int, int, int, int, int, int, int, int, string>();
VarPtr CudnnConv3dOp::grad(Var* out, Var* dout, Var* v, int v_index) {
int xn, xc, xd, xh, xw, wd, wh, ww, wci, wco, yn, yc, yd, yh, yw;
if (xformat == "ncdhw")
x->shape.unpack(xn, xc, xd, xh, xw);
else
x->shape.unpack(xn, xd, xh, xw, xc);
w->shape.unpack(wco, wci, wd, wh, ww);
if (v_index == 0) {
return make_backwardx(w, dout, xd, xh, xw, strided, strideh, stridew, paddingd, paddingh, paddingw, dilationd, dilationh, dilationw, groups, xformat);
} else {
return make_backwardw(x, dout, wd, wh, ww, strided, strideh, stridew, paddingd, paddingh, paddingw, dilationd, dilationh, dilationw, groups, xformat);
}
}
// unordered_map<string, cudnnConvolutionFwdAlgo_t> fwd_algo_cache;
#else // JIT
#ifdef JIT_cuda
#pragma clang diagnostic ignored "-Wtautological-compare"
extern unordered_map<string, cudnnConvolutionFwdAlgo_t> fwd_algo_cache;
template <typename T_ELEM> __inline__ cudnnDataType_t getDataType();
template <> __inline__ cudnnDataType_t getDataType<half1>() { return CUDNN_DATA_HALF; }
template <> __inline__ cudnnDataType_t getDataType<float>() { return CUDNN_DATA_FLOAT; }
void CudnnConv3dOp::jit_run() {
cudnnHandle_t& handle_ = cudnn_handle;
cudnnTensorDescriptor_t cudnnIdesc;
cudnnFilterDescriptor_t cudnnFdesc;
cudnnTensorDescriptor_t cudnnOdesc;
cudnnConvolutionDescriptor_t cudnnConvDesc;
checkCudaErrors(cudnnCreateTensorDescriptor( &cudnnIdesc ));
checkCudaErrors(cudnnCreateFilterDescriptor( &cudnnFdesc ));
checkCudaErrors(cudnnCreateTensorDescriptor( &cudnnOdesc ));
checkCudaErrors(cudnnCreateConvolutionDescriptor( &cudnnConvDesc ));
checkCudaErrors(cudnnSetConvolutionGroupCount( cudnnConvDesc, groups ));
int xn, xc, xd, xh, xw, wd, wh, ww, wci, wco, yn, yc, yd, yh, yw;
int sx[] = {0,0,0,0,1};
for (int i=3; i>=0; i--) sx[i] = sx[i+1] * x->shape[i+1];
int strideX[5];
if (xformat == "ncdhw") {
x->shape.unpack(xn, xc, xd, xh, xw);
int tmp[5] = {sx[0],sx[1],sx[2],sx[3],sx[4]};
memcpy(strideX, tmp, sizeof(tmp));
} else {
x->shape.unpack(xn, xd, xh, xw, xc);
int tmp[5] = {sx[0],sx[2],sx[3],sx[4],sx[1]};
memcpy(strideX, tmp, sizeof(tmp));
}
int dimX[] = {xn, xc, xd, xh, xw};
// dimX: ncdhw
checkCudaErrors(cudnnSetTensorNdDescriptor(
cudnnIdesc, getDataType<Tx>(),
5, dimX, strideX
));
auto ws = w->shape;
int dimW[] = {(int)ws[0],(int)ws[1],(int)ws[2],(int)ws[3],(int)ws[4]};
// cudnn only supports these two formats
// https://docs.nvidia.com/deeplearning/sdk/cudnn-api/index.html#cudnnSetFilterNdDescriptor
#define filterFormat_oihw CUDNN_TENSOR_NCHW
#define filterFormat_ohwi CUDNN_TENSOR_NHWC
// dimW: KCRS(oihw)
checkCudaErrors(cudnnSetFilterNdDescriptor(
cudnnFdesc, getDataType<Tw>(),
// filterFormat_@WFORMAT, 5, dimW
filterFormat_oihw, 5, dimW
));
int padA[] = {paddingd, paddingh, paddingw};
int convstrideA[] = {strided, strideh, stridew};
int dilationA[] = {dilationd, dilationh, dilationw};
// the difference between CUDNN_CONVOLUTION and CUDNN_CROSS_CORRELATION
// is the kernel rc order; currently no perf difference is observed
// between these two modes
checkCudaErrors(cudnnSetConvolutionNdDescriptor(
cudnnConvDesc, 3,
padA, convstrideA, dilationA,
CUDNN_CROSS_CORRELATION, getDataType<Ty>()
));
// using tensor core
// checkCudaErrors( cudnnSetConvolutionMathType(cudnnConvDesc, CUDNN_TENSOR_OP_MATH) );
int sy[] = {0,0,0,0,1};
for (int i=3; i>=0; i--) sy[i] = sy[i+1] * y->shape[i+1];
int strideY[5];
if (xformat == "ncdhw") {
y->shape.unpack(yn, yc, yd, yh, yw);
int tmp[5] = {sy[0],sy[1],sy[2],sy[3],sy[4]};
memcpy(strideY, tmp, sizeof(tmp));
} else {
y->shape.unpack(yn, yd, yh, yw, yc);
int tmp[5] = {sy[0],sy[2],sy[3],sy[4],sy[1]};
memcpy(strideY, tmp, sizeof(tmp));
}
int dimY[] = {yn, yc, yd, yh, yw};
checkCudaErrors(cudnnSetTensorNdDescriptor(
cudnnOdesc, getDataType<Ty>(),
5, dimY, strideY
));
cudnnConvolutionFwdAlgo_t algos[] = {
CUDNN_CONVOLUTION_FWD_ALGO_GEMM,
CUDNN_CONVOLUTION_FWD_ALGO_FFT,
CUDNN_CONVOLUTION_FWD_ALGO_FFT_TILING,
CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM,
CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM,
CUDNN_CONVOLUTION_FWD_ALGO_DIRECT,
CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD,
CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD_NONFUSED,
};
int num_algos = CUDNN_CONVOLUTION_FWD_ALGO_COUNT;
int perf_count;
cudnnConvolutionFwdAlgoPerf_t perf_results[num_algos];
cudnnConvolutionFwdAlgo_t algo;
bool benchmark=true;
jk.clear();
jk << dimX[0] << "," << dimX[1] << "," << dimX[2] << "," << dimX[3] << "," << dimX[4] << ",";
jk << dimW[0] << "," << dimW[1] << "," << dimW[2] << "," << dimW[3] << "," << dimW[4] << ",";
jk << paddingd << paddingh << paddingw << "," << strided << strideh <<stridew << "," << dilationd << dilationh << dilationw << "," << groups << ".";
auto iter = fwd_algo_cache.find(jk.to_string());
if (iter!=fwd_algo_cache.end()) algo = iter->second;
else {
if (fwd_algo_cache.size()>=max_cache_size) benchmark = false;
if (benchmark) {
size_t max_ws_size = 0;
for (int i = 0; i < num_algos; i++) {
size_t sz;
cudnnStatus_t ret = cudnnGetConvolutionForwardWorkspaceSize(
handle_, cudnnIdesc, cudnnFdesc, cudnnConvDesc,
cudnnOdesc, algos[i], &sz);
// skip algorithms that would use too much workspace
if (sz > mem_info.total_cuda_ram * max_workspace_ratio) continue;
if (CUDNN_STATUS_SUCCESS == ret && sz > max_ws_size) max_ws_size = sz;
}
size_t allocation;
void* ws = exe.temp_allocator->alloc(max_ws_size, allocation);
checkCudaErrors(cudnnFindConvolutionForwardAlgorithmEx(
handle_,
cudnnIdesc, x->ptr<Tx>(),
cudnnFdesc, w->ptr<Tw>(),
cudnnConvDesc,
cudnnOdesc, y->ptr<Ty>(),
num_algos,
&perf_count,
perf_results,
ws,
max_ws_size));
exe.temp_allocator->free(ws, max_ws_size, allocation);
} else {
checkCudaErrors(cudnnGetConvolutionForwardAlgorithm_v7(
handle_,
cudnnIdesc,
cudnnFdesc,
cudnnConvDesc,
cudnnOdesc,
num_algos,
&perf_count,
perf_results));
}
int best_algo_idx=-1;
for (int i = 0; i < perf_count; i++)
if (perf_results[i].status == CUDNN_STATUS_SUCCESS){
best_algo_idx=i;
break;
}
ASSERT(best_algo_idx!=-1);
algo=perf_results[best_algo_idx].algo;
if (benchmark) {
fwd_algo_cache[jk.to_string()] = algo;
if (fwd_algo_cache.size()==max_cache_size)
LOGw << "forward_ algorithm cache is full";
}
}
// TODO: wrap workspace allocation
void *workSpace = 0;
size_t workSpaceSize;
checkCudaErrors (cudnnGetConvolutionForwardWorkspaceSize(
handle_, cudnnIdesc, cudnnFdesc, cudnnConvDesc,
cudnnOdesc, algo, &workSpaceSize) );
size_t allocation;
if (workSpaceSize > 0) {
workSpace = exe.temp_allocator->alloc(workSpaceSize, allocation);
}
float alpha=1, beta=0;
checkCudaErrors(cudnnConvolutionForward(
handle_,
(void*)(&alpha),
cudnnIdesc, x->ptr<Tx>(),
cudnnFdesc, w->ptr<Tw>(),
cudnnConvDesc,
algo,
workSpace, workSpaceSize,
(void*)(&beta),
cudnnOdesc, y->ptr<Ty>())
);
if (workSpace)
exe.temp_allocator->free(workSpace, workSpaceSize, allocation);
checkCudaErrors(cudnnDestroyTensorDescriptor( cudnnIdesc ));
checkCudaErrors(cudnnDestroyFilterDescriptor( cudnnFdesc ));
checkCudaErrors(cudnnDestroyTensorDescriptor( cudnnOdesc ));
checkCudaErrors(cudnnDestroyConvolutionDescriptor( cudnnConvDesc ));
}
#endif
#endif // JIT
} // jittor

View File

@ -0,0 +1,24 @@
// ***************************************************************
// Copyright (c) 2021 Jittor. All Rights Reserved.
// Maintainers: Dun Liang <randonlang@gmail.com>.
// This file is subject to the terms and conditions defined in
// file 'LICENSE.txt', which is part of this source code package.
// ***************************************************************
#pragma once
#include "op.h"
namespace jittor {
struct CudnnConv3dOp : Op {
Var* x, * w, * y;
int strided, strideh, stridew, paddingd, paddingh, paddingw, dilationd, dilationh, dilationw, groups;
string xformat;
CudnnConv3dOp(Var* x, Var* w, int strided, int strideh, int stridew, int paddingd, int paddingh, int paddingw, int dilationd=1, int dilationh=1, int dilationw=1, int groups=1, string xformat="ncdhw");
const char* name() const override { return "cudnn_conv3d"; }
VarPtr grad(Var* out, Var* dout, Var* v, int v_index) override;
void infer_shape() override;
DECLARE_jit_run;
};
} // jittor

View File

@ -65,6 +65,7 @@
#include <assert.h>
#include <cudnn.h>
#include "utils/log.h"
#include "helper_cuda.h"
#include "fp16_dev.h"
#include "fp16_emu.h"

View File

@ -1,4 +1,5 @@
#include <cudnn.h>
#include "utils/log.h"
#include "helper_cuda.h"
const char *_cudaGetErrorEnum(cudnnStatus_t error) {

View File

@ -11,6 +11,7 @@
#include <cuda_runtime.h>
#include <cublas.h>
#include "utils/log.h"
#include "helper_cuda.h"
#include <curand.h>

View File

@ -89,7 +89,7 @@ void CuttTransposeOp::jit_run() {
reverse[i] = dim-1-new_axes[dim-1-i];
for (int i=0; i<dim; i++)
x_shape[i] = new_shape[dim-1-i];
if (dim == 1) {
if (dim == 1 || x->num==1) {
checkCudaErrors(cudaMemcpyAsync(yp, xp, x->size, cudaMemcpyDefault, 0));
return;
}
@ -105,7 +105,9 @@ void CuttTransposeOp::jit_run() {
cuttExecute(iter->second, xp, yp);
} else {
cuttHandle plan;
CHECK(0==cuttPlan(&plan, dim, x_shape.data(), reverse.data(), x->dtype().dsize(), 0));
checkCudaErrors(cudaDeviceSynchronize());
auto ret = cuttPlan(&plan, dim, x_shape.data(), reverse.data(), x->dtype().dsize(), 0);
CHECK(0==ret) << ret << jk.to_string() << x << y;
cutt_plan_cache[jk.to_string()] = plan;
cuttExecute(plan, xp, yp);
}

View File

@ -12,6 +12,7 @@
#include <cuda_runtime.h>
#include <nccl.h>
#include "utils/log.h"
#include "helper_cuda.h"
namespace jittor {

View File

@ -9,6 +9,7 @@
*
*/
#include "utils/log.h"
#include "helper_cuda.h"
#include "fp16_dev.h"

View File

@ -13,6 +13,7 @@
// These are CUDA Helper functions for initialization and error checking
#include <cuda_runtime.h>
#include "utils/log.h"
#include "helper_cuda.h"
#ifdef _CUFFT_H_

View File

@ -778,7 +778,7 @@ void simple_net(int times = 100) {
s.wait();
}
// extern "C" int mkl_test_entry();
// extern int mkl_test_entry();
int mkl_test_entry() {
try {

View File

@ -15,25 +15,25 @@ def eye(shape, dtype):
return jt.array(np.identity(shape[0])).unary(dtype)
def eye_(var):
var.assign(eye(var.shape, var.dtype))
return var.assign(eye(var.shape, var.dtype))
def constant(shape, dtype, value=0.0):
return jt.array(value).unary(dtype).broadcast(shape)
def constant_(var, value=0.0):
var.assign(constant(var.shape, var.dtype, value))
return var.assign(constant(var.shape, var.dtype, value))
def uniform(shape, dtype, low, high):
return jt.random(shape, dtype) * (low - high) + high
def uniform_(var, low, high):
var.assign(uniform(var.shape, var.dtype, low, high))
return var.assign(uniform(var.shape, var.dtype, low, high))
def gauss(shape, dtype, mean=0.0, std=1.0):
return jt.random(shape, dtype, "normal") * std + mean
def gauss_(var, mean=0.0, std=1.0):
var.assign(gauss(var.shape, var.dtype, mean, std))
return var.assign(gauss(var.shape, var.dtype, mean, std))
def invariant_uniform(shape, dtype, mode="fan_in"):
assert len(shape)>1
@ -61,7 +61,7 @@ def relu_invariant_gauss(shape, dtype, mode="fan_in"):
return gauss(shape, dtype, 0, std)
def relu_invariant_gauss_(var, mode="fan_in"):
var.assign(relu_invariant_gauss(tuple(var.shape), var.dtype, mode))
return var.assign(relu_invariant_gauss(tuple(var.shape), var.dtype, mode))
def calculate_std(var,mode,nonlinearity,param=0.01):
mode = mode.lower()
@ -100,7 +100,6 @@ def kaiming_normal_(var, a=0, mode='fan_in', nonlinearity='leaky_relu'):
return gauss_(var,0, std)
#TODO: bound = gain * math.sqrt(6.0/fan) ??
def xavier_uniform(shape, dtype, gain=1.0):
assert len(shape)>1
@ -108,11 +107,11 @@ def xavier_uniform(shape, dtype, gain=1.0):
for i in shape[2:]:
matsize *= i
fan = (shape[1] * matsize) + (shape[0] * matsize)
bound = gain * math.sqrt(1.0/fan)
bound = gain * math.sqrt(6.0/fan)
return uniform(shape, dtype, -bound, bound)
def xavier_uniform_(var, gain=1.0):
var.assign(xavier_uniform(tuple(var.shape), var.dtype, gain))
return var.assign(xavier_uniform(tuple(var.shape), var.dtype, gain))
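The bound change above restores the standard Glorot/Xavier uniform initialization; since fan is computed here as fan_in + fan_out, the weights are drawn from
W \sim \mathcal{U}\!\left(-g\sqrt{\tfrac{6}{\mathrm{fan\_in}+\mathrm{fan\_out}}},\; g\sqrt{\tfrac{6}{\mathrm{fan\_in}+\mathrm{fan\_out}}}\right)
with gain g; the previous sqrt(1.0/fan) bound was off by a factor of sqrt(6).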
def xavier_gauss(shape, dtype, gain=1.0):
assert len(shape)>1
@ -125,4 +124,4 @@ def xavier_gauss(shape, dtype, gain=1.0):
return gauss(shape, dtype, 0, std)
def xavier_gauss_(var, gain=1.0):
var.assign(xavier_gauss(tuple(var.shape), var.dtype, gain))
return var.assign(xavier_gauss(tuple(var.shape), var.dtype, gain))

View File

@ -0,0 +1,2 @@
from .chamfer import chamfer_loss, ChamferLoss
from .emd import earth_mover_distance, EarthMoverDistance

View File

@ -0,0 +1,153 @@
# Author: Zheng-Ning Liu
#
# This file implements chamfer loss on both CPU and GPU.
# The implementation does not use an extra NxM matrix to store distances, and
# thus supports large point clouds.
import jittor as jt
import jittor.nn as nn
cpu_src = '''
for (int bs = 0; bs < in0_shape0; ++bs)
for (int i = 0; i < in0_shape1; ++i) {
float min_dis = (@in0(bs, i, 0) - @in1(bs, 0, 0)) * (@in0(bs, i, 0) - @in1(bs, 0, 0)) +
(@in0(bs, i, 1) - @in1(bs, 0, 1)) * (@in0(bs, i, 1) - @in1(bs, 0, 1)) +
(@in0(bs, i, 2) - @in1(bs, 0, 2)) * (@in0(bs, i, 2) - @in1(bs, 0, 2));
@out(bs, i) = 0;
for (int j = 1; j < in1_shape1; ++j) {
float dis = (@in0(bs, i, 0) - @in1(bs, j, 0)) * (@in0(bs, i, 0) - @in1(bs, j, 0)) +
(@in0(bs, i, 1) - @in1(bs, j, 1)) * (@in0(bs, i, 1) - @in1(bs, j, 1)) +
(@in0(bs, i, 2) - @in1(bs, j, 2)) * (@in0(bs, i, 2) - @in1(bs, j, 2));
if (dis < min_dis) {
min_dis = dis;
@out(bs, i) = j;
}
}
}
'''
cuda_src = '''
__global__ void chamfer_loss_min_idx_kernel(@ARGS_DEF) {
@PRECALC
int bs = blockIdx.x;
int n = in0_shape1;
int m = in1_shape1;
for (int i = threadIdx.x; i < n; i += blockDim.x) {
float min_dis = (@in0(bs, i, 0) - @in1(bs, 0, 0)) * (@in0(bs, i, 0) - @in1(bs, 0, 0)) +
(@in0(bs, i, 1) - @in1(bs, 0, 1)) * (@in0(bs, i, 1) - @in1(bs, 0, 1)) +
(@in0(bs, i, 2) - @in1(bs, 0, 2)) * (@in0(bs, i, 2) - @in1(bs, 0, 2));
@out(bs, i) = 0;
for (int j = 1; j < m; ++j) {
float dis = (@in0(bs, i, 0) - @in1(bs, j, 0)) * (@in0(bs, i, 0) - @in1(bs, j, 0)) +
(@in0(bs, i, 1) - @in1(bs, j, 1)) * (@in0(bs, i, 1) - @in1(bs, j, 1)) +
(@in0(bs, i, 2) - @in1(bs, j, 2)) * (@in0(bs, i, 2) - @in1(bs, j, 2));
if (dis < min_dis) {
min_dis = dis;
@out(bs, i) = j;
}
}
}
}
chamfer_loss_min_idx_kernel<<<in0_shape0, 512>>>(@ARGS);
'''
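Both kernels use Jittor's code-op macro syntax: @in0(...)/@out(...) index the inputs and output, in0_shape1 expands to a shape component, and @ARGS_DEF/@PRECALC/@ARGS wire up the kernel signature. A minimal CPU-only sketch of the same pattern (a hypothetical element-wise doubling op, not part of this file):
import jittor as jt
x = jt.array([1.0, 2.0, 3.0])
y = jt.code(x.shape, x.dtype, [x],
    cpu_src='''
        for (int i = 0; i < in0_shape0; ++i)
            @out(i) = 2 * @in0(i);
    ''')
print(y.numpy())  # [2. 4. 6.]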
def chamfer_loss(pc1, pc2, reduction='mean', dims='BNC', bidirectional=False):
''' Returns the chamfer loss from pc1 to pc2.
:param pc1: input point cloud
:type pc1: jittor array
:param pc2: input point cloud
:type pc2: jittor array
:param reduction: reduction method in batches, can be 'mean', 'sum', or None. Default: 'mean'.
:type reduction: str, optional
:param dims: a string that represents each dimension, can be
'[BNC]' ([batch, number of points, xyz]), or
'[BCN]' ([batch, xyz, number of points]). Default: 'BNC'.
:type dims: str, optional
:param bidirectional: if True, also computes the loss from pc2 to pc1 and returns the sum. Default: False.
:type bidirectional: bool, optional
Example:
>>> import jittor as jt
>>> from jittor.loss3d import chamfer_loss
>>> jt.flags.use_cuda = True
>>> pc1 = jt.rand([10, 100, 3], dtype=jt.float32)
>>> pc2 = jt.rand([10, 100, 3], dtype=jt.float32)
>>> cf = chamfer_loss(pc1, pc2, dims='BNC', bidirectional=True)
>>> print('chamfer loss =', cf.item())
'''
if bidirectional:
return chamfer_loss(pc1, pc2, reduction, dims) + chamfer_loss(pc2, pc1, reduction, dims)
assert dims in ['BNC', 'BCN']
if dims == 'BCN':
pc1, pc2 = pc1.permute(0, 2, 1), pc2.permute(0, 2, 1)
batch_size_1, N, _ = pc1.shape
batch_size_2, M, _ = pc2.shape
assert batch_size_1 == batch_size_2
batch_size = batch_size_1
idx = jt.code([batch_size, N], 'int32', [pc1, pc2],
cpu_src=cpu_src,
cuda_src=cuda_src)
nearest_pts = pc2.reindex([batch_size, idx.shape[1], 3], [
'i0',
'@e0(i0, i1)',
'i2'
], extras=[idx])
chamfer_distance = (((pc1 - nearest_pts) ** 2).sum(dim=-1)).sqrt()
if reduction is None:
return chamfer_distance
elif reduction == 'sum':
return jt.sum(chamfer_distance)
elif reduction == 'mean':
return jt.mean(chamfer_distance)
class ChamferLoss(nn.Module):
''' A loss layer that computes the chamfer loss from pc1 to pc2.
:param pc1: input point cloud
:type pc1: jittor array
:param pc2: input point cloud
:type pc2: jittor array
:param reduction: reduction method in batches, can be 'mean', 'sum', or None. Default: 'mean'.
:type reduction: str, optional
:param dims: a string that represents each dimension, can be
'[BNC]' ([batch, number of points, xyz]), or
'[BCN]' ([batch, xyz, number of points]). Default: 'BNC'.
:type dims: str, optional
:param bidirectional: if True, also computes the loss from pc2 to pc1 and returns the sum. Default: False.
:type bidirectional: bool, optional
Example:
>>> import jittor as jt
>>> from jittor.loss3d import ChamferLoss
>>> jt.flags.use_cuda = True
>>> pc1 = jt.rand([10, 100, 3], dtype=jt.float32)
>>> pc2 = jt.rand([10, 100, 3], dtype=jt.float32)
>>> CF = ChamferLoss(dims='BNC', bidirectional=True)
>>> cf = CF(pc1, pc2)
>>> print('chamfer loss =', cf.item())
'''
def __init__(self, reduction='mean', dims='BNC', bidirectional=False):
''' see function @chamfer_loss
'''
super().__init__()
self.reduction = reduction
self.dims = dims
self.bidirectional = bidirectional
def execute(self, pc1, pc2):
return chamfer_loss(pc1, pc2, self.reduction, self.dims, self.bidirectional)

440
python/jittor/loss3d/emd.py Normal file
View File

@ -0,0 +1,440 @@
# Author: Zheng-Ning Liu
#
# The GPU implementation was originally provided by Haoqiang Fan and Kaichun Mo,
# <https://github.com/daerduoCarey/PyTorchEMD>.
import jittor as jt
from jittor import Function
EMD_gpu_header = '''
namespace jittor {
__device__ inline out_type dist2(out_type x1, out_type y1, out_type z1,
out_type x2, out_type y2, out_type z2) {
return (x2 - x1) * (x2 - x1) + (y2 - y1) * (y2 - y1) + (z2 - z1) * (z2 - z1);
}
}
'''
approxmatch_gpu_src = '''
__global__ void approxmatch_gpu_kernel(@ARGS_DEF) {
@PRECALC
@alias(xyz1, in0)
@alias(xyz2, in1)
@alias(match, out)
int b = in0_shape0;
int n = in0_shape1;
int m = in1_shape1;
out_type *remainL = in2_p + blockIdx.x * (n + m) * 2;
out_type *remainR = remainL + n;
out_type *ratioL = remainR + m;
out_type *ratioR = ratioL + n;
const int Block = 1024;
__shared__ out_type buf[Block * 4];
for (int i = blockIdx.x; i < b; i += gridDim.x) {
for (int j = threadIdx.x; j < n * m; j += blockDim.x)
match_p[i * n * m + j] = 0;
for (int j = threadIdx.x; j < n; j += blockDim.x)
remainL[j] = n >= m ? 1 : m / n;
for (int j = threadIdx.x; j < m; j += blockDim.x)
remainR[j] = n >= m ? n / m : 1;
__syncthreads();
for (int j = 7; j >= -2; j--) {
out_type level = j > -2 ? -powf(4.0f, j) : 0;
for (int k0 = 0; k0 < n; k0 += blockDim.x) {
int k = k0 + threadIdx.x;
out_type x1 = 0, y1 = 0, z1 = 0;
if (k < n) {
x1 = @xyz1(i, k, 0);
y1 = @xyz1(i, k, 1);
z1 = @xyz1(i, k, 2);
}
out_type suml = 1e-9f;
for (int l0 = 0; l0 < m; l0 += Block){
int lend = min(m, l0 + Block) - l0;
for (int l = threadIdx.x; l < lend; l += blockDim.x) {
buf[l * 4 + 0] = @xyz2(i, l0 + l, 0);
buf[l * 4 + 1] = @xyz2(i, l0 + l, 1);
buf[l * 4 + 2] = @xyz2(i, l0 + l, 2);
buf[l * 4 + 3] = remainR[l0 + l];
}
__syncthreads();
for (int l = 0; l < lend; l++){
out_type x2 = buf[l * 4 + 0];
out_type y2 = buf[l * 4 + 1];
out_type z2 = buf[l * 4 + 2];
out_type d = level * dist2(x1, y1, z1, x2, y2, z2);
out_type w = __expf(d) * buf[l * 4 + 3];
suml += w;
}
__syncthreads();
}
if (k < n)
ratioL[k] = remainL[k] / suml;
}
__syncthreads();
for (int l0 = 0; l0 < m; l0 += blockDim.x){
int l = l0 + threadIdx.x;
out_type x2 = 0, y2 = 0, z2 = 0;
if (l < m){
x2 = @xyz2(i, l, 0);
y2 = @xyz2(i, l, 1);
z2 = @xyz2(i, l, 2);
}
out_type sumr = 0;
for (int k0 = 0; k0 < n; k0 += Block){
int kend = min(n, k0 + Block) - k0;
for (int k = threadIdx.x; k < kend; k += blockDim.x){
buf[k * 4 + 0] = @xyz1(i, k0 + k, 0);
buf[k * 4 + 1] = @xyz1(i, k0 + k, 1);
buf[k * 4 + 2] = @xyz1(i, k0 + k, 2);
buf[k * 4 + 3] = ratioL[k0 + k];
}
__syncthreads();
for (int k = 0; k < kend; k++){
out_type x1 = buf[k * 4 + 0];
out_type y1 = buf[k * 4 + 1];
out_type z1 = buf[k * 4 + 2];
out_type d = level * dist2(x1, y1, z1, x2, y2, z2);
out_type w = __expf(d) * buf[k * 4 + 3];
sumr += w;
}
__syncthreads();
}
if (l < m){
sumr *= remainR[l];
out_type consumption = fminf(remainR[l] / (sumr + 1e-9f), 1.0f);
ratioR[l] = consumption * remainR[l];
remainR[l] = fmaxf(0.0f, remainR[l] - sumr);
}
}
__syncthreads();
for (int k0 = 0; k0 < n; k0 += blockDim.x){
int k = k0 + threadIdx.x;
out_type x1 = 0, y1 = 0, z1 = 0;
if (k < n){
x1 = @xyz1(i, k, 0);
y1 = @xyz1(i, k, 1);
z1 = @xyz1(i, k, 2);
}
out_type suml = 0;
for (int l0 = 0; l0 < m; l0 += Block){
int lend = min(m, l0 + Block)-l0;
for (int l = threadIdx.x; l < lend; l += blockDim.x){
buf[l * 4 + 0] = @xyz2(i, l0 + l, 0);
buf[l * 4 + 1] = @xyz2(i, l0 + l, 1);
buf[l * 4 + 2] = @xyz2(i, l0 + l, 2);
buf[l * 4 + 3] = ratioR[l0 + l];
}
__syncthreads();
out_type rl = ratioL[k];
if (k < n){
for (int l = 0; l < lend; l++){
out_type x2 = buf[l * 4 + 0];
out_type y2 = buf[l * 4 + 1];
out_type z2 = buf[l * 4 + 2];
out_type d = level * dist2(x1, y1, z1, x2, y2, z2);
out_type w = __expf(d) * rl * buf[l*4+3];
@match(i, l0 + l, k) += w;
suml += w;
}
}
__syncthreads();
}
if (k < n)
remainL[k] = fmaxf(0.0f, remainL[k] - suml);
}
__syncthreads();
}
}
}
approxmatch_gpu_kernel<<<32, 512>>>(@ARGS);
'''
matchcost_gpu_src = '''
__global__ void matchcost_gpu_kernel(@ARGS_DEF) {
@PRECALC
@alias(xyz1, in0)
@alias(xyz2, in1)
@alias(match, in2)
int b = in0_shape0;
int n = in0_shape1;
int m = in1_shape1;
const int Block = 1024;
__shared__ out_type allsum[512];
__shared__ out_type buf[Block * 3];
for (int i = blockIdx.x; i < b; i += gridDim.x) {
out_type subsum = 0;
for (int k0 = 0; k0 < n; k0 += blockDim.x) {
int k = k0 + threadIdx.x;
out_type x1 = 0, y1 = 0, z1 = 0;
if (k < n) {
x1 = @xyz1(i, k, 0);
y1 = @xyz1(i, k, 1);
z1 = @xyz1(i, k, 2);
}
for (int l0 = 0; l0 < m; l0 += Block) {
int lend = min(m, l0 + Block) - l0;
for (int l = threadIdx.x; l < lend * 3; l += blockDim.x)
buf[l] = xyz2_p[i * m * 3 + l0 * 3 + l];
__syncthreads();
if (k < n) {
for (int l = 0; l < lend; l++) {
out_type x2 = buf[l * 3 + 0];
out_type y2 = buf[l * 3 + 1];
out_type z2 = buf[l * 3 + 2];
out_type d = dist2(x1, y1, z1, x2, y2, z2);
subsum += d * @match(i, l0 + l, k);
}
}
__syncthreads();
}
}
allsum[threadIdx.x] = subsum;
for (int j = 1; j < blockDim.x; j <<= 1) {
__syncthreads();
if ((threadIdx.x & j) == 0 && threadIdx.x + j < blockDim.x) {
allsum[threadIdx.x] += allsum[threadIdx.x + j];
}
}
if (threadIdx.x == 0)
@out(i) = allsum[0];
__syncthreads();
}
}
matchcost_gpu_kernel<<<32, 512>>>(@ARGS);
'''
matchcost_grad1_gpu_src = '''
__global__ void matchcost_grad1_gpu_kernel(@ARGS_DEF) {
@PRECALC
@alias(grad, in0)
@alias(xyz1, in1)
@alias(xyz2, in2)
@alias(match, in3)
int b = grad_shape0;
int n = xyz1_shape1;
int m = xyz2_shape1;
for (int i = blockIdx.x; i < b ; i += gridDim.x){
for (int l = threadIdx.x; l < n; l += blockDim.x){
out_type x1 = @xyz1(i, l, 0);
out_type y1 = @xyz1(i, l, 1);
out_type z1 = @xyz1(i, l, 2);
out_type dx = 0, dy = 0, dz = 0;
for (int k = 0; k < m; k++){
out_type x2 = @xyz2(i, k, 0);
out_type y2 = @xyz2(i, k, 1);
out_type z2 = @xyz2(i, k, 2);
out_type d = @match(i, k, l) * 2;
dx += (x1 - x2) * d;
dy += (y1 - y2) * d;
dz += (z1 - z2) * d;
}
@out(i, l, 0) = dx * @grad(i);
@out(i, l, 1) = dy * @grad(i);
@out(i, l, 2) = dz * @grad(i);
}
}
}
matchcost_grad1_gpu_kernel<<<32, 512>>>(@ARGS);
'''
matchcost_grad2_gpu_src = '''
__global__ void matchcost_grad2_gpu_kernel(@ARGS_DEF) {
@PRECALC
@alias(grad, in0)
@alias(xyz1, in1)
@alias(xyz2, in2)
@alias(match, in3)
int b = grad_shape0;
int n = xyz1_shape1;
int m = xyz2_shape1;
__shared__ out_type sum_grad[256 * 3];
for (int i = blockIdx.x; i < b; i += gridDim.x) {
int kbeg = m * blockIdx.y / gridDim.y;
int kend = m * (blockIdx.y + 1) / gridDim.y;
for (int k = kbeg; k < kend; k++) {
out_type x2 = @xyz2(i, k, 0);
out_type y2 = @xyz2(i, k, 1);
out_type z2 = @xyz2(i, k, 2);
out_type subsumx = 0, subsumy = 0, subsumz = 0;
for (int j = threadIdx.x; j < n; j += blockDim.x) {
out_type x1 = x2 - @xyz1(i, j, 0);
out_type y1 = y2 - @xyz1(i, j, 1);
out_type z1 = z2 - @xyz1(i, j, 2);
out_type d = @match(i, k, j) * 2;
subsumx += x1 * d;
subsumy += y1 * d;
subsumz += z1 * d;
}
sum_grad[threadIdx.x * 3 + 0] = subsumx;
sum_grad[threadIdx.x * 3 + 1] = subsumy;
sum_grad[threadIdx.x * 3 + 2] = subsumz;
for (int j = 1; j < blockDim.x; j <<= 1) {
__syncthreads();
int j1 = threadIdx.x;
int j2 = threadIdx.x + j;
if ((j1 & j) == 0 && j2 < blockDim.x){
sum_grad[j1 * 3 + 0] += sum_grad[j2 * 3 + 0];
sum_grad[j1 * 3 + 1] += sum_grad[j2 * 3 + 1];
sum_grad[j1 * 3 + 2] += sum_grad[j2 * 3 + 2];
}
}
if (threadIdx.x == 0){
@out(i, k, 0) = sum_grad[0] * @grad(i);
@out(i, k, 1) = sum_grad[1] * @grad(i);
@out(i, k, 2) = sum_grad[2] * @grad(i);
}
__syncthreads();
}
}
}
matchcost_grad2_gpu_kernel<<<dim3(32, 32), 256>>>(@ARGS);
'''
class EarthMoverDistance(Function):
''' A loss layer that computes Earth Mover's distance from pc1 to pc2. Only supports GPU.
:param pc1: input point cloud
:type pc1: jittor array
:param pc2: input point cloud
:type pc2: jittor array
:param reduction: reduction method in batches, can be 'mean', 'sum', or None. Default: 'mean'.
:type reduction: str, optional
:param dims: a string that represents each dimension, can be
'[BNC]' ([batch, number of points, xyz]), or
'[BCN]' ([batch, xyz, number of points]). Default: 'BNC'.
:type dims: str, optional
Example:
>>> import jittor as jt
>>> from jittor.loss3d import EarthMoverDistance
>>> jt.flags.use_cuda = True
>>> pc1 = jt.rand([10, 100, 3], dtype=jt.float32)
>>> pc2 = jt.rand([10, 100, 3], dtype=jt.float32)
>>> EMD = EarthMoverDistance(dims='BNC')
>>> emd = EMD(pc1, pc2)
>>> print('EMD =', emd.item())
'''
def execute(self, pc1, pc2, reduction='mean', dims='BNC'):
assert dims in ['BNC', 'BCN']
if dims == 'BCN':
pc1, pc2 = pc1.permute(0, 2, 1), pc2.permute(0, 2, 1)
batch_size_1, N, _ = pc1.shape
batch_size_2, M, _ = pc2.shape
assert batch_size_1 == batch_size_2
batch_size = batch_size_1
temp = jt.zeros([batch_size, (N + M) * 2], pc1.dtype)
match = jt.code(
shape=[batch_size, M, N],
dtype=pc1.dtype,
inputs=[pc1, pc2, temp],
cuda_header=EMD_gpu_header,
cuda_src=approxmatch_gpu_src,
)
emd = jt.code(
shape=[batch_size],
dtype=pc1.dtype,
inputs=[pc1, pc2, match],
cuda_header=EMD_gpu_header,
cuda_src=matchcost_gpu_src,
)
self.saved_vars = (pc1, pc2, match, reduction)
if reduction is None:
return emd
elif reduction == 'sum':
return emd.sum()
elif reduction == 'mean':
return emd.mean()
def grad(self, grad):
pc1, pc2, match, reduction = self.saved_vars
if reduction == 'sum':
grad = jt.ones([pc1.shape[0]]) * grad
elif reduction == 'mean':
grad = jt.ones([pc1.shape[0]]) * grad / pc1.shape[0]
grad_pc1 = jt.code(
shape=pc1.shape,
dtype=pc1.dtype,
inputs=[grad, pc1, pc2, match],
cuda_src=matchcost_grad1_gpu_src,
)
grad_pc2 = jt.code(
shape=pc2.shape,
dtype=pc2.dtype,
inputs=[grad, pc1, pc2, match],
cuda_src=matchcost_grad2_gpu_src,
)
return grad_pc1, grad_pc2
def earth_mover_distance(pc1, pc2, reduction='mean', dims='BNC'):
''' Earth Mover's distance from pc1 to pc2. Only supports GPU.
:param pc1: input point cloud
:type pc1: jittor array
:param pc2: input point cloud
:type pc2: jittor array
:param reduction: reduction method in batches, can be 'mean', 'sum', or None. Default: 'mean'.
:type reduction: str, optional
:param dims: a string that represents each dimension, can be
'[BNC]' ([batch, number of points, xyz]), or
'[BCN]' ([batch, xyz, number of points]). Default: 'BNC'.
:type dims: str, optional
Example:
>>> import jittor as jt
>>> from jittor.loss3d import earth_mover_distance
>>> jt.flags.use_cuda = True
>>> pc1 = jt.rand([10, 100, 3], dtype=jt.float32)
>>> pc2 = jt.rand([10, 100, 3], dtype=jt.float32)
>>> emd = earth_mover_distance(pc1, pc2, dims='BNC')
>>> print('EMD =', emd.item())
'''
return EarthMoverDistance.apply(pc1, pc2, reduction, dims)

View File

@ -116,6 +116,65 @@ class CosineAnnealingLR(object):
if param_group.get("lr") != None:
param_group["lr"] = self.get_lr(self.base_lr_pg[i], param_group["lr"])
class ExponentialLR(object):
""" learning rate is multiplied by gamma in each step.
"""
def __init__(self, optimizer, gamma, last_epoch=-1):
self.optimizer = optimizer
self.gamma = gamma
self.last_epoch = last_epoch
self.base_lr = optimizer.lr
self.base_lr_pg = [pg.get("lr") for pg in optimizer.param_groups]
def get_lr(self, base_lr, now_lr):
if self.last_epoch == 0:
return base_lr
return base_lr * self.gamma ** self.last_epoch
def step(self):
self.last_epoch += 1
self.update_lr()
def update_lr(self):
self.optimizer.lr = self.get_lr(self.base_lr, self.optimizer.lr)
for i, param_group in enumerate(self.optimizer.param_groups):
if param_group.get("lr") != None:
param_group["lr"] = self.get_lr(self.base_lr_pg[i], param_group["lr"])
class StepLR(object):
def __init__(self, optimizer, step_size, gamma=0.1, last_epoch=-1):
self.optimizer = optimizer
self.step_size = step_size
self.gamma = gamma
self.last_epoch = last_epoch
self.cur_epoch = 0
def get_gamma(self):
if self.last_epoch < 0:
if (self.cur_epoch != 0 and (self.cur_epoch + 1) % self.step_size == 0):
return self.gamma
else:
if (self.cur_epoch + 1 + self.last_epoch) % self.step_size == 0:
return self.gamma
return 1.
def get_lr(self):
return self.optimizer.lr
def step(self):
self.update_lr()
self.cur_epoch += 1
def update_lr(self):
gamma = self.get_gamma()
if gamma != 1.:
self.optimizer.lr = self.optimizer.lr * gamma
for i, param_group in enumerate(self.optimizer.param_groups):
if param_group.get("lr") != None:
param_group["lr"] = param_group["lr"] * gamma
class MultiStepLR(object):
def __init__(self, optimizer, milestones=[], gamma=0.1, last_epoch=-1):
self.optimizer = optimizer

View File

@ -182,7 +182,28 @@ def chunk(x, chunks, dim=0):
jt.Var.chunk = chunk
def expand(x, shape):
def expand(x, *shape):
''' Expand and broadcast this array; -1 means the corresponding dimension is left unchanged.
Example::
a = jt.zeros((3,1))
b = a.expand(3, 4)
assert b.shape == (3,4)
b = a.expand(-1, 4)
assert b.shape == (3,4)
b = a.expand((3, 4))
assert b.shape == (3,4)
b = a.expand((-1, 4))
assert b.shape == (3,4)
'''
if len(shape) == 1 and isinstance(shape[0], (tuple,list,jt.NanoVector)):
shape = shape[0]
shape = list(shape)
for i in range(len(shape)):
if shape[i] == -1:
shape[i] = x.shape[i]
return x.broadcast(shape)
jt.Var.expand = expand
@ -634,23 +655,6 @@ def kthvalue(input, k, dim=None, keepdim=False):
jt.Var.kthvalue = kthvalue
def gather(x,dim,index):
if dim<0:
dim+=index.ndim
x_shape = list(x.shape )
i_shape = list(index.shape)
assert i_shape[dim]>0
assert x.ndim == index.ndim
i_shape[dim]=x_shape[dim]
assert i_shape == x_shape
ins = []
for i in range(index.ndim):
ins.append(jt.index(index.shape,dim=i))
ins[dim]=index
return x.reindex(ins)
jt.Var.gather = gather
def _prod(x,dim=0):
x = jt.log(x)
x = x.sum(dim=dim)
@ -708,7 +712,6 @@ def nms(dets,thresh):
return order[selected]
jt.Var.expand = jt.Var.broadcast
jt.Var.expand_as = jt.Var.broadcast_var
@ -1061,6 +1064,13 @@ def randperm(n, dtype="int32"):
return index.cast(dtype)
def set_global_seed(seed):
''' Sets the seeds of the random number generators of Python, numpy and jittor
simultaneously.
.. note::
Jittor also guarantees that each worker of jittor.dataset.Dataset holds a different seed,
which is global_seed ^ worker_id ^ 1234.
'''
import random
random.seed(seed)
jt.set_seed(seed)
@ -1197,7 +1207,7 @@ def gather(x, dim, index):
Parameters::
* input (jt.Var) the source array
* x (jt.Var) the source array
* dim (int) the axis along which to index
* index (jt.Var) the indices of elements to gather
@ -1216,3 +1226,46 @@ Example::
return x.getitem(tuple(indexes))
jt.Var.gather = gather
def roll(x, shifts, dims=None):
'''Roll the tensor along the given dimension(s).
Parameters::
* x (jt.Var) the source array
* shifts (int or tuple) shift offset of dims
* dims (int or tuple) shift dims
Examples::
x = jt.array([1, 2, 3, 4, 5, 6, 7, 8]).view(4, 2)
y = x.roll(1, 0)
assert (y.numpy() == [[7,8],[1,2],[3,4],[5,6]]).all()
y = x.roll(-1, 0)
assert (y.numpy() == [[3,4],[5,6],[7,8],[1,2]]).all()
y = x.roll(shifts=(2, 1), dims=(0, 1))
assert (y.numpy() == [[6,5],[8,7],[2,1],[4,3]]).all()
'''
if isinstance(shifts, int):
shifts = (shifts,)
if dims is None:
dims = tuple(range(len(shifts)))
elif isinstance(dims, int):
dims = (dims,)
assert len(dims) == len(shifts)
ids = [ f'i{i}' for i in range(x.ndim) ]
for i in range(len(dims)):
shift = shifts[i]
d = dims[i]
size = x.shape[d]
shift = shift % size
if shift<0: shift += size
ids[d] = f'(i{d}<{shift}?i{d}+{size-shift}:(i{d}-{shift}))'
return x.reindex(x.shape, ids)
jt.Var.roll = roll
def safe_log(x):
return jt.safe_clip(x, 1e-30, 1e30).log()
jt.Var.safe_log = safe_log

View File

@ -20,19 +20,20 @@ import math
from collections import OrderedDict
from jittor.pool import *
from jittor.optim import *
from jittor.misc import _pair
from jittor.misc import _pair, _triple
from jittor_utils import LOG
def matmul_transpose(a, b):
'''
returns a * b^T
'''
assert len(a.shape) >= 2 and len(b.shape) == 2
assert a.shape[-1] == b.shape[-1], (a.shape, b.shape)
if len(a.shape)>2:
if len(a.shape) != 2:
aa = a.reshape((-1, a.shape[-1]))
cc = matmul_transpose(aa, b)
return cc.reshape(a.shape[:-1]+(-1,))
assert len(a.shape) == 2 and len(b.shape) == 2
shape = list(a.shape)[:-1] + list(b.shape)
a = a.broadcast(shape, [len(shape)-2])
@ -44,7 +45,7 @@ def bmm_transpose(a, b):
'''
returns a * b^T
'''
if jt.flags.use_cuda:
if jt.flags.use_cuda and jt.compile_extern.cublas_ops:
return jt.compile_extern.cublas_ops.cublas_batched_matmul(a, b, 0, 1)
t = list(range(b.ndim))
t[-1], t[-2] = t[-2], t[-1]
@ -117,7 +118,7 @@ Example::
if len_a>=3 and len_a==len_b:
# bmm
# a: [..., n, m], b: [..., m, k], c:[..., n, k]
if jt.flags.use_cuda:
if jt.flags.use_cuda and jt.compile_extern.cublas_ops:
return jt.compile_extern.cublas_ops.cublas_batched_matmul(a, b, 0, 0)
shape = []
len_c = max(len_a, len_b)
@ -178,36 +179,45 @@ class ELU(Module):
class PReLU(Module):
def __init__(self, num_parameters=1, init_=0.25):
self.num_parameters = num_parameters
self.a = init.constant((num_parameters,), "float32", init_)
self.weight = init.constant((num_parameters,), "float32", init_)
def execute(self, x):
if self.num_parameters != 1:
assert self.num_parameters == x.size(1), f"num_parameters does not match input channels in PReLU"
return jt.maximum(0, x) + self.a.broadcast(x, [0,2,3]) * jt.minimum(0, x)
return jt.maximum(0, x) + self.weight.broadcast(x, [0,2,3]) * jt.minimum(0, x)
else:
return jt.maximum(0, x) + self.a * jt.minimum(0, x)
return jt.maximum(0, x) + self.weight * jt.minimum(0, x)
# TODO: dims of 4 will cause slow execution
def cross_entropy_loss(output, target, ignore_index=None):
def cross_entropy_loss(output, target, weight=None, ignore_index=None,reduction='sum'):
if len(output.shape) == 4:
c_dim = output.shape[1]
output = output.transpose((0, 2, 3, 1))
output = output.reshape((-1, c_dim))
if ignore_index is not None:
target = jt.ternary(target==ignore_index,
jt.array(-1).broadcast(target), target)
mask = jt.logical_and(target >= 0, target < output.shape[1])
target = target.reshape((-1, ))
target_weight = jt.ones(target.shape[0], dtype='float32')
if weight is not None:
target_weight = weight[target]
if ignore_index is not None:
target_weight = jt.ternary(
target==ignore_index,
jt.array(0).broadcast(target_weight),
target_weight
)
target = target.broadcast(output, [1])
target = target.index(1) == target
output = output - output.max([1], keepdims=True)
loss = output.exp().sum(1).log()
loss = loss - (output*target).sum(1)
if ignore_index is None:
return loss.mean()
logsum = output.exp().sum(1).log()
loss = (logsum - (output*target).sum(1)) * target_weight
if reduction == 'sum':
return loss.sum() / target_weight.sum()
elif reduction == 'mean':
return loss.mean() / target_weight.mean()
else:
return loss.sum() / jt.maximum(mask.int().sum(), 1)
return loss / target_weight
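A sketch of the new `weight` and `ignore_index` handling; the class weights below are arbitrary illustration values. Note that the default `reduction='sum'` actually returns a weighted mean (`loss.sum() / target_weight.sum()`):

```python
import jittor as jt
from jittor import nn

logits = jt.randn(4, 3)                # [batch, classes]
target = jt.array([0, 2, 1, 2])
w = jt.array([1.0, 2.0, 0.5])          # per-class weights

loss = nn.cross_entropy_loss(logits, target, weight=w)
# samples labeled 2 get zero weight and drop out of the average
loss_ig = nn.cross_entropy_loss(logits, target, weight=w, ignore_index=2)
```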
def mse_loss(output, target):
return (output-target).sqr().mean()
@ -273,11 +283,12 @@ def nll_loss(output,target,weight=None,ignore_index=-100,reduction='mean'):
raise ValueError(f'not support {reduction}')
class CrossEntropyLoss(Module):
def __init__(self,ignore_index=None):
def __init__(self, weight=None, ignore_index=None):
self.weight = weight
self.ignore_index = ignore_index
def execute(self, output, target):
return cross_entropy_loss(output, target,self.ignore_index)
return cross_entropy_loss(output, target, self.weight, self.ignore_index)
class MSELoss(Module):
def __init__(self):
@ -423,7 +434,16 @@ class BatchNorm(Module):
norm_x = x * w.broadcast(x, dims) + b.broadcast(x, dims)
return norm_x
BatchNorm2d = BatchNorm1d = BatchNorm
BatchNorm3d = BatchNorm2d = BatchNorm1d = BatchNorm
def batch_norm(x, running_mean, running_var, weight=1, bias=0, training=False, momentum=0.1, eps=1e-05):
dims = [0]+list(range(2,x.ndim))
assert not training
w = weight / jt.sqrt(running_var+eps)
b = bias - running_mean * w
norm_x = x * w.broadcast(x, dims) + b.broadcast(x, dims)
return norm_x
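The functional `batch_norm` added here is inference-only (note `assert not training`): it folds the running statistics into one scale and one shift per channel. A minimal sketch:

```python
import jittor as jt
from jittor import nn

x = jt.randn(2, 8, 16, 16)
y = nn.batch_norm(x, running_mean=jt.zeros((8,)), running_var=jt.ones((8,)),
                  weight=jt.ones((8,)), bias=jt.zeros((8,)))
# per channel: y = (x - running_mean) / sqrt(running_var + eps) * weight + bias
```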
class InstanceNorm(Module):
def __init__(self, num_features, eps=1e-05, momentum=0.1, affine=True, is_train=True, sync=True):
@ -447,7 +467,23 @@ class InstanceNorm(Module):
b = self.bias - xmean * w
return x * w.broadcast(x, dims) + b.broadcast(x, dims)
InstanceNorm2d = InstanceNorm1d = InstanceNorm
InstanceNorm3d = InstanceNorm2d = InstanceNorm1d = InstanceNorm
def instance_norm(x,
running_mean = None,
running_var = None,
weight = 1,
bias = 0,
momentum = 0.1,
eps = 1e-5):
dims = list(range(2,x.ndim))
xmean = jt.mean(x, dims=dims)
x2mean = jt.mean(x*x, dims=dims)
xvar = (x2mean-xmean*xmean).maximum(0.0)
w = weight / jt.sqrt(xvar+eps)
b = bias - xmean * w
return x * w.broadcast(x, dims) + b.broadcast(x, dims)
class LayerNorm(Module):
def __init__(self, normalized_shape, eps: float = 1e-5, elementwise_affine: bool = True) -> None:
@ -470,7 +506,22 @@ class LayerNorm(Module):
return x * w + b
LayerNorm2d = LayerNorm1d = LayerNorm
LayerNorm3d = LayerNorm2d = LayerNorm1d = LayerNorm
def layer_norm(x,
normalized_shape,
weight = 1,
bias = 0,
eps: float = 1e-5,
elementwise_affine: bool = True):
dims = [-i for i in range(len(normalized_shape), 0, -1)]
xmean = jt.mean(x, dims=dims, keepdims=1)
x2mean = jt.mean(x*x, dims=dims, keepdims=1)
xvar = (x2mean-xmean*xmean).maximum(0.0)
w = weight / jt.sqrt(xvar+eps)
b = bias - xmean * w
return x * w + b
class GroupNorm(Module):
def __init__(self, num_groups, num_channels, eps=1e-05, affine=True, is_train=True):
@ -506,6 +557,33 @@ class GroupNorm(Module):
x = x * w.broadcast(x, [3]) + b.broadcast(x, [3])
return x.reshape(output_shape)
def group_norm(x,
num_groups,
weight = 1,
bias = 0,
eps=1e-05):
N = x.shape[0]
C = x.shape[1]
output_shape = (N,-1)
# TODO: 3d group norm
if x.ndim==4:
output_shape = x.shape
assert C % num_groups == 0
x = x.reshape((N, num_groups, C//num_groups, -1))
xmean = jt.mean(x, dims=[2,3]).reshape((N, num_groups, 1))
x2mean = jt.mean(x*x, dims=[2,3]).reshape((N, num_groups, 1))
xvar = (x2mean-xmean*xmean).maximum(0.0)
if isinstance(weight, jt.Var):
weight = weight.reshape((1, num_groups, -1))
if isinstance(bias, jt.Var):
bias = bias.reshape((1, num_groups, -1))
weight = weight / jt.sqrt(xvar+eps)
bias = bias - xmean * weight
x = x * weight.broadcast(x, [3]) + bias.broadcast(x, [3])
return x.reshape(output_shape)
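`group_norm` reshapes to `[N, num_groups, C//num_groups, -1]` and normalizes within each group; `weight` and `bias` default to the scalars 1 and 0, so a bare call is already valid. A sketch:

```python
import jittor as jt
from jittor import nn

x = jt.randn(2, 6, 4, 4)
y = nn.group_norm(x, num_groups=3)     # 2 channels per group
# each (sample, group) slice now has ~zero mean and ~unit variance
```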
Relu = jt.make_module(relu)
ReLU = Relu
Leaky_relu = jt.make_module(leaky_relu, 2)
@ -637,6 +715,34 @@ class Conv1d(Module):
y = x.squeeze(-1)
return y
class Conv3d(Module):
def __init__(self, in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True):
self.in_channels = in_channels
self.out_channels = out_channels
self.kernel_size = kernel_size if isinstance(kernel_size, tuple) else (kernel_size, kernel_size, kernel_size)
self.stride = stride if isinstance(stride, tuple) else (stride, stride, stride)
self.padding = padding if isinstance(padding, tuple) else (padding, padding, padding)
self.dilation = dilation if isinstance(dilation, tuple) else (dilation, dilation, dilation)
self.groups = groups
assert in_channels % groups == 0, 'in_channels must be divisible by groups'
assert out_channels % groups == 0, 'out_channels must be divisible by groups'
Kd, Kh, Kw = self.kernel_size
self.weight = init.invariant_uniform([out_channels, in_channels//groups, Kd, Kh, Kw], dtype="float")
if bias:
fan=1
for i in self.weight.shape[1:]:
fan *= i
bound = 1 / math.sqrt(fan)
self.bias = init.uniform([out_channels], dtype="float", low=-bound, high=bound)
else:
self.bias = None
def execute(self, x):
return conv3d(x, self.weight, self.bias, self.stride, self.padding, self.dilation, self.groups)
def conv2d(x, weight, bias=None, stride=1, padding=0, dilation=1, groups=1):
padding = _pair(padding)
@ -694,7 +800,70 @@ def conv2d(x, weight, bias=None, stride=1, padding=0, dilation=1, groups=1):
if bias is not None:
b = bias.broadcast(y.shape, [0,2,3])
y = y + b
return y
return y
def conv3d(x, weight, bias=None, stride=1, padding=0, dilation=1, groups=1):
padding = _triple(padding)
stride = _triple(stride)
dilation = _triple(dilation)
out_channels = weight.shape[0]
if jt.flags.use_cuda and jt.cudnn:
y = jt.cudnn.ops.cudnn_conv3d(x, weight, *stride, *padding, *dilation, groups)
elif groups == 1:
N,C,D,H,W = x.shape
Kd, Kh, Kw = weight.shape[-3:]
od = (D+padding[0]*2-Kd*dilation[0]+dilation[0]-1)//stride[0]+1
oh = (H+padding[1]*2-Kh*dilation[1]+dilation[1]-1)//stride[1]+1
ow = (W+padding[2]*2-Kw*dilation[2]+dilation[2]-1)//stride[2]+1
xx = x.reindex([N,out_channels,C,od,oh,ow,Kd,Kh,Kw], [
'i0', # Nid
'i2', # Cid
f'i3*{stride[0]}-{padding[0]}+i6*{dilation[0]}', # Did+Kdid
f'i4*{stride[1]}-{padding[1]}+i7*{dilation[1]}', # Hid+Khid
f'i5*{stride[2]}-{padding[2]}+i8*{dilation[2]}', # Wid+Kwid
])
ww = weight.broadcast(xx.shape, [0,3,4,5])
yy = xx*ww
y = yy.sum([2,6,7,8]) # C, Kd, Kh, Kw
else:
N,C,D,H,W = x.shape
Kd, Kh, Kw = weight.shape[-3:]
G = groups
CpG = C // G # channels per group
oc = out_channels
od = (D+padding[0]*2-Kd*dilation[0]+dilation[0]-1)//stride[0]+1
oh = (H+padding[1]*2-Kh*dilation[1]+dilation[1]-1)//stride[1]+1
ow = (W+padding[2]*2-Kw*dilation[2]+dilation[2]-1)//stride[2]+1
xx = x.reindex([N,G,oc//G,CpG,od,oh,ow,Kd,Kh,Kw], [
'i0', # Nid
f'i1*{CpG}+i3', # Cid = Gid*CpG + in-group channel id
f'i4*{stride[0]}-{padding[0]}+i7*{dilation[0]}', # Did+Kdid
f'i5*{stride[1]}-{padding[1]}+i8*{dilation[1]}', # Hid+Khid
f'i6*{stride[2]}-{padding[2]}+i9*{dilation[2]}', # Wid+Kwid
])
xx.compile_options = {"G":G}
# w: [oc, CpG, Kd, Kh, Kw]; broadcast to xx's shape for the elementwise product
ww = weight.reindex([N, G, oc//G, CpG, od, oh, ow, Kd, Kh, Kw], [
f'i1*{oc//G}+i2',
'i3',
'i7',
'i8',
'i9'
])
yy = xx*ww
y = yy.reindex_reduce('add', [N, oc, od, oh, ow], [
'i0',
f'i1*{oc//G}+i2',
'i4',
'i5',
'i6'
])
if bias is not None:
b = bias.broadcast(y.shape, [0,2,3,4])
y = y + b
return y
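A shape check for the new 3-D convolution. The cuDNN branch is taken only when `jt.flags.use_cuda` is set and the cudnn extension is available; otherwise the reindex fallback above runs:

```python
import jittor as jt
from jittor import nn

x = jt.randn(1, 4, 8, 8, 8)                      # [N, C, D, H, W]
conv = nn.Conv3d(4, 6, kernel_size=3, padding=1)
assert conv(x).shape == (1, 6, 8, 8, 8)

w = jt.randn(6, 4, 3, 3, 3)                      # functional form takes the weight
assert nn.conv3d(x, w, padding=1).shape == (1, 6, 8, 8, 8)
```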
class ConvTranspose(Module):
def __init__(self, in_channels, out_channels, kernel_size, stride=1, \
@ -754,6 +923,45 @@ class ConvTranspose(Module):
y = y + b
return y
class ConvTranspose3d(Module):
def __init__(self, in_channels, out_channels, kernel_size, stride=1, \
padding=0, output_padding=0, groups=1, bias=True, dilation=1):
self.in_channels = in_channels
self.out_channels = out_channels
# added
self.dilation = dilation
self.group = groups
assert groups==1, "Group conv not supported yet."
self.kernel_size = kernel_size if isinstance(kernel_size, tuple) else (kernel_size, kernel_size, kernel_size)
self.stride = stride if isinstance(stride, tuple) else (stride, stride, stride)
self.dilation = dilation if isinstance(dilation, tuple) else (dilation, dilation, dilation)
# added
self.padding = padding if isinstance(padding, tuple) else (padding, padding, padding)
self.real_padding = (
self.dilation[0] * (self.kernel_size[0] - 1) - self.padding[0],
self.dilation[1] * (self.kernel_size[1] - 1) - self.padding[1],
self.dilation[2] * (self.kernel_size[2] - 1) - self.padding[2])
self.output_padding = output_padding if isinstance (output_padding, tuple) else (output_padding, output_padding, output_padding)
assert self.output_padding[0] < max(self.stride[0], self.dilation[0]) and \
self.output_padding[1] < max(self.stride[1], self.dilation[1]) and \
self.output_padding[2] < max(self.stride[2], self.dilation[2]), \
"output padding must be smaller than max(stride, dilation)"
self.weight = init.invariant_uniform((in_channels, out_channels) + self.kernel_size, dtype="float")
if bias:
fan=1
for i in self.weight.shape[1:]:
fan *= i
bound = 1 / math.sqrt(fan)
self.bias = init.uniform([out_channels], dtype="float", low=-bound, high=bound)
else:
self.bias = None
def execute(self, x):
return conv_transpose3d(x, self.weight, self.bias, self.stride, self.padding, self.output_padding, self.group, self.dilation)
def conv_transpose(input, weight, bias=None, stride=1, padding=0, output_padding=0, groups=1, dilation=1):
x = input
N,C,H,W = x.shape
@ -792,6 +1000,49 @@ def conv_transpose(input, weight, bias=None, stride=1, padding=0, output_padding
assert not bias, "Bias should be none or jittor var"
return y
def conv_transpose3d(input, weight, bias=None, stride=1, padding=0, output_padding=0, groups=1, dilation=1):
x = input
N,C,D,H,W = x.shape
i,o,d,h,w = weight.shape
assert C==i
assert groups==1, "Group conv not supported yet."
stride = stride if isinstance(stride, tuple) else (stride, stride, stride)
dilation = dilation if isinstance(dilation, tuple) else (dilation, dilation, dilation)
# added
padding = padding if isinstance(padding, tuple) else (padding, padding, padding)
output_padding = output_padding if isinstance (output_padding, tuple) else (output_padding, output_padding, output_padding)
assert output_padding[0] < max(stride[0], dilation[0]) and \
output_padding[1] < max(stride[1], dilation[1]) and \
output_padding[2] < max(stride[2], dilation[2]), \
"output padding must be smaller than max(stride, dilation)"
stride_d, stride_h, stride_w = stride
padding_d, padding_h, padding_w = padding
dilation_d, dilation_h, dilation_w = dilation
d_out = (D-1) * stride_d + output_padding[0] - 2*padding_d + 1 + (d-1)*dilation_d
h_out = (H-1) * stride_h + output_padding[1] - 2*padding_h + 1 + (h-1)*dilation_h
w_out = (W-1) * stride_w + output_padding[2] - 2*padding_w + 1 + (w-1)*dilation_w
out_shape = (N, o, d_out, h_out, w_out)
if jt.flags.use_cuda and jt.cudnn:
return jt.cudnn.ops.cudnn_conv3d_backward_x(weight, x, *out_shape[2:], *stride, *padding, *dilation, groups)
shape = (N, i, o, D, H, W, d, h, w)
xx = x.broadcast(shape, (2, 6, 7, 8)) # broadcast over o,d,h,w
ww = weight.broadcast(shape, (0, 3, 4, 5)) # broadcast over N,D,H,W
y = (ww*xx).reindex_reduce("add", out_shape, [
'i0', # N
'i2', # o
f'i3*{stride_d}-{padding_d}+i6*{dilation_d}', # Did+Kdid
f'i4*{stride_h}-{padding_h}+i7*{dilation_h}', # Hid+Khid
f'i5*{stride_w}-{padding_w}+i8*{dilation_w}', # Wid+KWid
])
if isinstance(bias, jt.Var):
b = bias.broadcast(y.shape, [0,2,3,4])
y = y + b
else:
assert not bias, "Bias should be none or jittor var"
return y
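Per axis the output size is `(D-1)*stride - 2*padding + dilation*(d-1) + 1 + output_padding`; a worked shape check:

```python
import jittor as jt
from jittor import nn

x = jt.randn(1, 4, 5, 5, 5)
w = jt.randn(4, 6, 3, 3, 3)            # [in, out, kd, kh, kw]
y = nn.conv_transpose3d(x, w, stride=2, padding=1, output_padding=1)
# (5-1)*2 - 2*1 + 1*(3-1) + 1 + 1 = 10 per spatial axis
assert y.shape == (1, 6, 10, 10, 10)
```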
conv_transpose2d = conv_transpose
def pad(x,padding, mode='constant', value=0):
@ -997,9 +1248,9 @@ def _bicubic(x, a, func):
def _interpolate(img, x, y, ids, mode):
if mode == "nearest":
return img.reindex([*ids, x.floor(), y.floor()])
return img.reindex([*ids, x.floor_int(), y.floor_int()])
if mode == "bilinear":
fx, fy = x.floor(), y.floor()
fx, fy = x.floor_int(), y.floor_int()
cx, cy = fx + 1, fy + 1
dx, dy = x - fx, y - fy
a = img.reindex_var([*ids, fx, fy])
@ -1013,7 +1264,7 @@ def _interpolate(img, x, y, ids, mode):
return o
if mode=="bicubic": # ugly ver.
n,c,h,w = img.shape
fx, fy = x.floor(), y.floor()
fx, fy = x.floor_int(), y.floor_int()
dix, diy = x - fx, y - fy
ax, ay = _bicubic(dix+1,-0.75,2), _bicubic(diy+1,-0.75,2)
bx, by = _bicubic(dix,-0.75,1), _bicubic(diy,-0.75,1)
@ -1031,23 +1282,8 @@ def _interpolate(img, x, y, ids, mode):
return o
raise (f"Not support interpolation mode: {mode}")
def resize(img, size, mode="nearest", align_corners=False):
n, c, h, w = img.shape
H, W = size
nid, cid, hid, wid = jt.index((n, c, H, W))
if align_corners:
x = hid * ((h - 1) / max(1, H - 1))
y = wid * ((w - 1) / max(1, W - 1))
else:
x = hid * (h / H) + (h / H * 0.5 - 0.5)
if H > h: x = x.clamp(0, h - 1)
y = wid * (w / W) + (w / W * 0.5 - 0.5)
if W > w: y = y.clamp(0, w - 1)
return _interpolate(img, x, y, (nid, cid), mode)
def upsample(img, size, mode="nearest", align_corners=False):
# TODO: tf_mode to another function
def resize(img, size, mode="nearest", align_corners=False, tf_mode=False):
n, c, h, w = img.shape
H, W = size
nid, cid, hid, wid = jt.index((n, c, H, W))
@ -1061,22 +1297,30 @@ def upsample(img, size, mode="nearest", align_corners=False):
x = hid * (h / H)
y = wid * (w / W)
else:
x = hid * (h / H) + (h / H * 0.5 - 0.5)
if H > h: x = x.clamp(0, h - 1)
y = wid * (w / W) + (w / W * 0.5 - 0.5)
if W > w: y = y.clamp(0, w - 1)
if (tf_mode):
x = hid * (h / H)
if H > h: x = x.clamp(0, h - 1)
y = wid * (w / W)
if W > w: y = y.clamp(0, w - 1)
else:
x = hid * (h / H) + (h / H * 0.5 - 0.5)
if H > h: x = x.clamp(0, h - 1)
y = wid * (w / W) + (w / W * 0.5 - 0.5)
if W > w: y = y.clamp(0, w - 1)
return _interpolate(img, x, y, (nid, cid), mode)
upsample = resize
def interpolate(X, size=None, scale_factor=None, mode='bilinear', align_corners=False):
def interpolate(X, size=None, scale_factor=None, mode='bilinear', align_corners=False, tf_mode=False):
if scale_factor is not None:
size = [X.shape[-2] * scale_factor, X.shape[-1] * scale_factor]
if isinstance(size, int):
size = (size, size)
if scale_factor is not None and scale_factor > 1:
return upsample(X, size, mode, align_corners)
return upsample(X, size, mode, align_corners, tf_mode)
else:
return resize(X, size, mode, align_corners)
return resize(X, size, mode, align_corners, tf_mode)
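`tf_mode` drops the half-pixel offset, matching TensorFlow-style sampling instead of the default PyTorch-style; both paths keep the requested output size:

```python
import jittor as jt
from jittor import nn

img = jt.randn(1, 3, 8, 8)
up = nn.interpolate(img, scale_factor=2, mode='bilinear')
up_tf = nn.interpolate(img, scale_factor=2, mode='bilinear', tf_mode=True)
assert up.shape == (1, 3, 16, 16) and up_tf.shape == (1, 3, 16, 16)
```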
def grid_sample_v0(input, grid, mode='bilinear', padding_mode='zeros'):
@ -1134,7 +1378,7 @@ def linspace_from_neg_one(grid,num_steps,align_corners):
return jt.array(ra,dtype=grid.dtype)
def make_base_grid_4D(theta,N,C,H,W,align_corners):
base_grid = jt.zeros((N, H, W, 3), dtype=theta.dtype);
base_grid = jt.zeros((N, H, W, 3), dtype=theta.dtype)
base_grid[...,0] = linspace_from_neg_one(theta, W, align_corners)
base_grid[...,1] = jt.unsqueeze(linspace_from_neg_one(theta, H, align_corners),-1)
base_grid[...,-1] = 1
@ -1190,7 +1434,7 @@ def reflect_coordinates(x,twice_low,twice_high):
x = (x - m).abs()
#`fmod` returns same sign as `in`, which is positive after the `fabs` above.
extra = x.mod(span)
flips = (x / span).floor()
flips = (x / span).floor_int()
result1 = extra+m
result2 = span-extra+m
con = flips%2==0
@ -1242,9 +1486,9 @@ def grid_sampler_3d(X,grid,mode,padding_mode,align_corners):
zid = z.reindex(shape,['i0','i2','i3','i4'])
if mode=='nearest':
return X.reindex([nid,cid,zid.round(),yid.round(),xid.round()])
return X.reindex([nid,cid,zid.round_int(),yid.round_int(),xid.round_int()])
elif mode=='bilinear':
fx,fy,fz = xid.floor(),yid.floor(),zid.floor()
fx,fy,fz = xid.floor_int(),yid.floor_int(),zid.floor_int()
cx,cy,cz = fx+1,fy+1,fz+1
dx,dy,dz = xid-fx,yid-fy,zid-fz
dnx,dny,dnz = cx-xid,cy-yid,cz-zid
@ -1279,10 +1523,10 @@ def grid_sampler_2d(X,grid,mode,padding_mode,align_corners):
yid = y.reindex(shape,['i0','i2','i3'])
if mode=='nearest':
return X.reindex([nid,cid,yid.round(),xid.round()])
return X.reindex([nid,cid,yid.round_int(),xid.round_int()])
elif mode=='bilinear':
#xid,yid = (xid+0.00001),(yid+0.00001)
fx,fy = (xid).floor(),(yid).floor()
fx,fy = (xid).floor_int(),(yid).floor_int()
cx,cy = fx+1,fy+1
dx,dy = xid-fx,yid-fy
dnx,dny = cx-xid,cy-yid
@ -1385,6 +1629,87 @@ class Sequential(Module):
return len(self.layers)
class ParameterList(Module):
def __init__(self, *args):
self.params = collections.OrderedDict()
for var in args:
if isinstance(var, (collections.OrderedDict, dict)):
for k, v in var.items():
self.add_param(k, v)
elif isinstance(var, list):
for v in var:
self.append(v)
else:
self.append(var)
def __getitem__(self, idx):
if idx not in self.params:
return list(self.params.values())[idx]
return self.params[idx]
def __iter__(self):
return self.params.values().__iter__()
def keys(self):
return self.params.keys()
def values(self):
return self.params.values()
def items(self):
return self.params.items()
def execute(self, x):
raise NotImplementedError("Parameters is not executable")
def append(self, var):
assert isinstance(var, jt.Var), f"argument <{type(var)}> is not jittor var"
self.params[len(self.params)] = var
def add_param(self, name, var):
assert isinstance(var, jt.Var), f"argument <{type(var)}> is not jittor var"
self.params[name]=var
def __setitem__(self, name, var):
self.add_param(name, var)
def __len__(self):
return len(self.params)
ParameterDict = ParameterList
def Parameter(data, requires_grad=True):
''' The `Parameter` interface isn't needed in Jittor; this interface
does nothing and exists only for compatibility.
A Jittor Var is a Parameter
when it is a member of a Module. If you don't want a Jittor
Var member to be treated as a Parameter, just name it with a leading
underscore `_`.
'''
LOG.w(Parameter.__doc__)
data = data.clone()
data.requires_grad = requires_grad
return data
def backward(v, *args, **kw):
''' The `backward` variable interface doesn't exist in Jittor.
Please use `optimizer.backward(loss)` or
`optimizer.step(loss)` instead.
For example, if your code looks like this::
optimizer.zero_grad()
loss.backward()
optimizer.step()
It can be changed to this::
optimizer.zero_grad()
optimizer.backward(loss)
optimizer.step()
Or more concise::
optimizer.step(loss)
The step function will automatically zero grad and backward.
'''
LOG.f(backward.__doc__)
jt.Var.backward = backward
def unfold(X, kernel_size, dilation=1, padding=0, stride=1):
assert X.ndim == 4
if not isinstance(kernel_size, tuple):
@ -1900,3 +2225,36 @@ class GRU(RNNBase):
h = (1 - z) * n + z * hidden
return h, h
def bilinear(in1, in2, weight, bias):
w = weight.transpose((1,0,2))
w = w.reshape((w.shape[0], -1))
x = jt.matmul(in1, w)
x = x.reshape(x.shape[:-1]+[weight.shape[0], weight.shape[2]])
y = in2.broadcast(x, (-2,))
z = (x*y).sum(-1)
if bias is not None:
z += bias
return z
class Bilinear(Module):
''' Bilinear transformation $out = in1^T W in2 + bias$. Example::
m = nn.Bilinear(20, 30, 40)
input1 = jt.randn(128, 20)
input2 = jt.randn(128, 30)
output = m(input1, input2)
print(output.shape)
# [128, 40]
'''
def __init__(self, in1_features, in2_features, out_features, bias=True, dtype="float32"):
bound = 1 / math.sqrt(in1_features)
self.weight = jt.init.uniform([out_features, in1_features, in2_features], dtype, -bound, bound)
self.bias = bias
if bias:
self.bias = jt.init.uniform([out_features], dtype, -bound, bound)
def execute(self, in1, in2):
return bilinear(in1, in2, self.weight, self.bias)

View File

@ -0,0 +1,11 @@
# 计图零基础入门教程60分钟
```
git clone https://github.com/Jittor/LearnJittorBasicIn60Min.git
cd LearnJittorBasicIn60Min
jupyter notebook
```
在线浏览地址:<https://nbviewer.jupyter.org/github/Jittor/LearnJittorBasicIn60Min/tree/master/>
特别感谢教程作者llt

Binary file not shown.


View File

@ -0,0 +1,138 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 计图入门教程 0 --- 介绍与安装"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"**计图 (Jittor)** 是一个以 Python 为前端语言的深度学习框架,它 \n",
"* 效率高:可作为 NumPyPyTorch 的替代品,可以使用 GPU 等其他加速器进行高效的数据运算。除此之外,计图还拥有多个创新点,旨在大幅提升其运算效率;\n",
"* 易使用:是一个用于实现神经网络的自动微分库,并集成了大量有关深度学习的函数库,方便您快速开展开发任务。\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"**通过本教程,您将**\n",
"* 学习并理解计图中基本类型的一般操作;\n",
"* 了解神经网络的一些基本概念,并学会如何利用计图进行神经网络的训练;\n",
"* 解决一个机器学习的经典实战问题。\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**本教程的适用群体:** \n",
"我们的目标是,只要您会 Python 编程,即可通过本教程学习并掌握如何使用计图进行深度学习的开发。不用担心,本教程几乎对所有的关键代码都加以注释说明。只要您耐心跟着本教程一步步学习,便一定能有所斩获。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"现在,请您开启计图快速入门之旅。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## 安装\n",
"\n",
"\n",
"Jittor框架对环境要求如下:\n",
"\n",
"\n",
"* 操作系统: **Ubuntu** >= 16.04 或 **Windows Subsystem of LinuxWSL**\n",
"* Python版本 >= 3.7\n",
"* C++编译器 (需要下列至少一个)\n",
" - g++ >=5.4.0\n",
" - clang >=8.0\n",
"* GPU 编译器可选nvcc >=10.0\n",
"* GPU 加速库可选cudnn-dev (cudnn开发版, 推荐使用tar安装方法[参考链接](https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html#installlinux-tar))\n",
"\n",
"如果您不希望手动配置环境,我们推荐使用 Docker 进行安装。\n",
"除此之外,您还可以使用 pip 安装和手动安装。\n",
"\n",
"注意目前Jittor通过WSL的方式在Windows操作系统上运行WSL的安装方法请参考[微软官网](https://docs.microsoft.com/en-us/windows/wsl/install-win10)WSL版本目前尚不支持CUDA。\n",
"\n",
"Jittor 提供了三种安装方法dockerpip和手动安装\n",
"\n",
"\n",
"\n",
"## Docker 安装\n",
"\n",
"我们提供了Docker安装方式免去您配置环境Docker安装方法如下\n",
"\n",
"\n",
"```\n",
"# CPU only(Linux)\n",
"docker run -it --network host jittor/jittor\n",
"# CPU and CUDA(Linux)\n",
"docker run -it --network host --gpus all jittor/jittor-cuda\n",
"# CPU only(Mac and Windows)\n",
"docker run -it -p 8888:8888 jittor/jittor\n",
"# Upgrade jittor docker image\n",
"docker pull jittor/jittor\n",
"docker pull jittor/jittor-cuda\n",
"```\n",
"\n",
"关于Docker安装的详细教程可以参考[Windows/Mac/Linux通过Docker安装计图](https://cg.cs.tsinghua.edu.cn/jittor/tutorial/2020-5-15-00-00-docker/)\n",
"\n",
"## Pip 安装\n",
"\n",
"\n",
"如果您没有准备好环境或者使用的不是Ubuntu操作系统 推荐使用**docker安装** 如果您已经装好编译器和对应版本的Python,我们强烈推荐您使用这种方法\n",
"(如果无法访问github, 可以通过jittor主页下载):\n",
"\n",
"```bash\n",
"sudo apt install python3.7-dev libomp-dev\n",
"python3.7 -m pip install jittor\n",
"# or install from github(latest version)\n",
"# python3.7 -m pip install git+https://github.com/Jittor/jittor.git\n",
"python3.7 -m jittor.test.test_example\n",
"\n",
"# Upgrade jittor from pip\n",
"python3.7 -m pip install jittor -U\n",
"# Upgrade jittor from github\n",
"python3.7 -m pip install git+https://github.com/Jittor/jittor.git -U\n",
"```\n",
"\n",
"如果测试运行通过,恭喜你已经安装完成.\n",
"jittor会自动在路径中寻找合适的编译器, 如果您希望手动指定编译器, 请使用环境变量 `cc_path` 和 `nvcc_path`(可选).\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

File diff suppressed because one or more lines are too long

View File

@ -37,6 +37,7 @@ class Optimizer(object):
# __zero_grad is a value for fast determ the grad is zero or not
# so we can omit 0+x
self.__zero_grad = True
self._grad_map = {}
def add_param_group(self, group):
self.param_groups.append(group)
@ -77,7 +78,7 @@ class Optimizer(object):
for pg in self.param_groups:
for p, g in zip(pg["params"], pg["grads"]):
if p.is_stop_grad(): continue
g *= clip_coef
g.update(g*clip_coef)
@property
@ -86,6 +87,14 @@ class Optimizer(object):
return { k:v for k, v in self.__dict__.items()
if k[0] != '_' and k not in exclude and not callable(v) }
def state_dict(self):
state = {"defaults": self.defaults}
return state
def load_state_dict(self, state):
for k,v in state["defaults"].items():
setattr(self, k, v)
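`state_dict` here snapshots only the scalar defaults (lr, eps, ...), not per-parameter buffers such as Adam's momentum terms. A round-trip sketch:

```python
import jittor as jt
from jittor import nn

model = nn.Linear(4, 2)
opt = nn.SGD(model.parameters(), lr=0.1)
snap = opt.state_dict()          # {"defaults": {"lr": 0.1, ...}}

opt.lr = 0.01                    # change the lr ...
opt.load_state_dict(snap)        # ... and restore it
assert opt.lr == 0.1
```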
def zero_grad(self):
self.__zero_grad = True
@ -181,6 +190,35 @@ class Optimizer(object):
p.update(p - g * lr)
self.zero_grad()
def _build_grad_map(self):
_grad_map = {}
for pg in self.param_groups:
for p, g in zip(pg["params"], pg["grads"]):
_grad_map[id(p)] = g
self._grad_map = _grad_map
def find_grad(self, v:jt.Var) -> jt.Var:
if id(v) not in self._grad_map:
self._build_grad_map()
if id(v) not in self._grad_map:
raise RuntimeError("This variable is not managed by this optimizer")
return self._grad_map[id(v)]
def opt_grad(v:jt.Var, opt:Optimizer):
''' Get grad of certain variable in optimizer, Example::
model = Model()
optimizer = SGD(model.parameters())
...
optimizer.backward(loss)
for p in model.parameters():
grad = p.opt_grad(optimizer)
'''
return opt.find_grad(v)
jt.Var.opt_grad = opt_grad
class SGD(Optimizer):
""" SGD Optimizer.
@ -325,6 +363,60 @@ class Adam(Optimizer):
self.zero_grad()
class AdamW(Optimizer):
""" AdamW Optimizer.
Example::
optimizer = nn.AdamW(model.parameters(), lr, eps=1e-8, betas=(0.9, 0.999))
optimizer.step(loss)
"""
def __init__(self, params, lr, eps=1e-8, betas=(0.9, 0.999), weight_decay=0):
super().__init__(params, lr)
self.eps = eps
self.betas = betas
self.weight_decay = weight_decay
# assert weight_decay==0, "weight_decay is not supported yet"
# initialize required arguments for each param_groups
for pg in self.param_groups:
values = pg["values"] = []
m = pg["m"] = []
for p in pg["params"]:
values.append(jt.zeros(p.shape, p.dtype).stop_grad())
m.append(jt.zeros(p.shape, p.dtype).stop_grad())
def add_param_group(self, group):
values = group["values"] = []
m = group["m"] = []
for p in group["params"]:
values.append(jt.zeros(p.shape, p.dtype).stop_grad())
m.append(jt.zeros(p.shape, p.dtype).stop_grad())
self.param_groups.append(group)
def step(self, loss=None):
if loss is not None:
self.pre_step(loss)
n = float(self.n_step)
for pg in self.param_groups:
# get arguments from each param_groups
lr = pg.get("lr", self.lr)
eps = pg.get("eps", self.eps)
weight_decay = pg.get("weight_decay", self.weight_decay)
b0, b1 = pg.get("betas", self.betas)
for p, g, v, m in zip(pg["params"], pg["grads"], pg["values"], pg["m"]):
if p.is_stop_grad(): continue
p.update(p * (1 - lr * weight_decay))
bias_correction1 = 1 - b0 ** n
bias_correction2 = 1 - b1 ** n
m.update(b0 * m + (1-b0) * g) #exp_avg
v.update(b1 * v + (1-b1) * g * g) #exp_avg_sq
denom = jt.sqrt(v) / jt.sqrt(bias_correction2) + eps
step_size = lr / bias_correction1
p.update(p - step_size * m / denom)
self.zero_grad()
class LRScheduler:
def __init__(self,optimizer, last_epoch=-1):
assert isinstance(optimizer,Optimizer)

View File

@ -73,7 +73,7 @@ class Pool(Module):
for (int q = k3; q < k3_; ++q)
if (out_value < @in0(i0, i1, p, q)) {{
out_value = @in0(i0, i1, p, q);
out_index = (p - k2) * {self.kernel_size[0]} + (q - k3);
out_index = p * in0_shape3 + q;
}}
@out(i0, i1, i2, i3) = out_value;
@out1(i0, i1, i2, i3) = out_index;
@ -99,7 +99,7 @@ class Pool(Module):
'''
if self.return_indices:
return_shapes = [[N,C,h,w]] * 2
return_dtypes = [x.dtype, 'uint8']
return_dtypes = [x.dtype, 'int32']
else:
return_shapes = [N,C,h,w]
return_dtypes = x.dtype
@ -184,6 +184,204 @@ class Pool(Module):
])
return xx.reduce(self.op, [4,5])
def _triple(x):
if isinstance(x, tuple):
assert len(x) == 3
return x
else:
return (x,x,x)
class Pool3d(Module):
def __init__(self, kernel_size, stride=None, padding=0, dilation=None, return_indices=None, ceil_mode=False, count_include_pad=True, op="maximum"):
assert dilation == None
assert return_indices == None or op == "maximum"
self.return_indices = return_indices
self.kernel_size = _triple(kernel_size)
self.op = op
stride = stride if stride else kernel_size
self.stride = _triple(stride)
self.padding = _triple(padding)
self.ceil_mode = ceil_mode
self.count_include_pad = count_include_pad and padding != 0
def execute(self, x):
N,C,D,H,W = x.shape
if self.ceil_mode == False:
d = (D+self.padding[0]*2-self.kernel_size[0])//self.stride[0]+1
h = (H+self.padding[1]*2-self.kernel_size[1])//self.stride[1]+1
w = (W+self.padding[2]*2-self.kernel_size[2])//self.stride[2]+1
use_code_op = self.op in ['maximum', 'minimum']
# some second-order avg_pool is required, so we don't use the code op here
else:
d = (D+self.padding[0]*2-self.kernel_size[0] + self.stride[0] - 1)//self.stride[0]+1
h = (H+self.padding[1]*2-self.kernel_size[1] + self.stride[1] - 1)//self.stride[1]+1
w = (W+self.padding[2]*2-self.kernel_size[2] + self.stride[2] - 1)//self.stride[2]+1
use_code_op = self.op in ['maximum', 'minimum', 'mean']
if use_code_op:
if self.op == 'mean':
if self.count_include_pad:
count = f"int count = {self.kernel_size[0]*self.kernel_size[1]*self.kernel_size[2]};"
else:
count = "int count = (k2_ - k2) * (k3_ - k3) * (k4_ - k4);"
count += "float32 rcount = 1.0f / count;"
else:
count = ""
forward_body = f'''
int k4 = i4*{self.stride[2]}-{self.padding[2]};
int k3 = i3*{self.stride[1]}-{self.padding[1]};
int k2 = i2*{self.stride[0]}-{self.padding[0]};
int k4_ = min(k4 + {self.kernel_size[2]}, in0_shape4);
int k3_ = min(k3 + {self.kernel_size[1]}, in0_shape3);
int k2_ = min(k2 + {self.kernel_size[0]}, in0_shape2);
k4 = max(0, k4);
k3 = max(0, k3);
k2 = max(0, k2);
{count}
'''
if not self.return_indices:
forward_body += f'''
@out(i0, i1, i2, i3, i4) = init_{self.op}(out_type);
for (int p = k2; p < k2_; ++p)
for (int q = k3; q < k3_; ++q)
for (int r = k4; r < k4_; ++r)
@out(i0, i1, i2, i3, i4) = {self.op}(out_type, @out(i0, i1, i2, i3, i4), @in0(i0, i1, p, q, r));
'''
else:
forward_body += f'''
auto out_value = init_{self.op}(out_type);
int out_index = -1;
for (int p = k2; p < k2_; ++p)
for (int q = k3; q < k3_; ++q)
for (int r = k4; r < k4_; ++r)
if (out_value < @in0(i0, i1, p, q, r)) {{
out_value = @in0(i0, i1, p, q, r);
out_index = p * in0_shape3 * in0_shape4 + q * in0_shape4 + r;
}}
@out(i0, i1, i2, i3, i4) = out_value;
@out1(i0, i1, i2, i3, i4) = out_index;
'''
backward_body = f'''
int k4 = i4*{self.stride[2]}-{self.padding[2]};
int k3 = i3*{self.stride[1]}-{self.padding[1]};
int k2 = i2*{self.stride[0]}-{self.padding[0]};
int k4_ = min(k4 + {self.kernel_size[2]}, in0_shape4);
int k3_ = min(k3 + {self.kernel_size[1]}, in0_shape3);
int k2_ = min(k2 + {self.kernel_size[0]}, in0_shape2);
k4 = max(0, k4);
k3 = max(0, k3);
k2 = max(0, k2);
{count}
int bo=1;
for (int p = k2; p < k2_ && bo; ++p)
for (int q = k3; q < k3_ && bo; ++q)
for (int r = k4; r < k4_ && bo; ++r) {{
{"atomicAdd(&@out(i0,i1,p,q,r), @dout(i0,i1,i2,i3,i4)/count);"
if self.op == "mean" else
f"""if (@pout(i0,i1,i2,i3,i4) == @in0(i0,i1,p,q,r)) {{
atomicAdd(&@out(i0,i1,p,q,r), @dout(i0,i1,i2,i3,i4)),
bo=0;
}}"""}
}}
'''
if self.return_indices:
return_shapes = [[N,C,d,h,w]] * 2
return_dtypes = [x.dtype, 'int32']
else:
return_shapes = [N,C,d,h,w]
return_dtypes = x.dtype
out = jt.code(return_shapes, return_dtypes, [x],
cuda_header="""
#include <ops/binary_op_defs.h>
#include <misc/cuda_limits.h>
""",
cuda_src=f'''
__global__ static void kernel1(@ARGS_DEF) {{
@PRECALC
int p4 = threadIdx.x;
int s4 = blockDim.x;
int p3 = threadIdx.y;
int s3 = blockDim.y;
int p2 = threadIdx.z + blockIdx.x * blockDim.z;
int s2 = blockDim.z * gridDim.x;
int i1 = blockIdx.y;
int i0 = blockIdx.z;
for (int i4 = p4; i4 < out_shape4; i4 += s4)
for (int i3 = p3; i3 < out_shape3; i3 += s3)
for (int i2 = p2; i2 < out_shape2; i2 += s2)
{{ {forward_body} }}
}}
int tx = min(1024, out_shape4);
int ty = min(1024 / tx, out_shape3);
int tz = min(1024 / tx / ty, out_shape2);
int bx = (out_shape2 - 1) / tz + 1;
int by = out_shape1;
int bz = out_shape0;
dim3 s1(bx, by, bz);
dim3 s2(tx, ty, tz);
kernel1<<<s1, s2>>>(@ARGS);
''',
cuda_grad_src=[f'''
__global__ static void kernel3(@ARGS_DEF) {{
@PRECALC
int p4 = threadIdx.x;
int s4 = blockDim.x;
int p3 = threadIdx.y;
int s3 = blockDim.y;
int p2 = threadIdx.z + blockIdx.x * blockDim.z;
int s2 = blockDim.z * gridDim.x;
int i1 = blockIdx.y;
int i0 = blockIdx.z;
for (int i4 = p4; i4 < out_shape4; i4 += s4)
for (int i3 = p3; i3 < out_shape3; i3 += s3)
for (int i2 = p2; i2 < out_shape2; i2 += s2)
{{ {backward_body} }}
}}
cudaMemsetAsync(out_p, 0, out->size);
int tx = min(1024, pout_shape4);
int ty = min(1024 / tx, pout_shape3);
int tz = min(1024 / tx / ty, pout_shape2);
int bx = (pout_shape2 - 1) / tz + 1;
int by = pout_shape1;
int bz = pout_shape0;
dim3 s1(bx, by, bz);
dim3 s2(tx, ty, tz);
kernel3<<<s1, s2>>>(@ARGS);
'''],
cpu_header='#include <ops/binary_op_defs.h>',
cpu_src=f'''
using namespace std;
for (int i0=0; i0<out_shape0; i0++)
for (int i1=0; i1<out_shape1; i1++)
for (int i2=0; i2<out_shape2; i2++)
for (int i3=0; i3<out_shape3; i3++)
for (int i4=0; i4<out_shape4; i4++)
{{ {forward_body} }}
''',
cpu_grad_src = [f'''
using namespace std;
std::memset(out_p, 0, out->size);
#define atomicAdd(a,b) (*a) += b
for (int i0=0; i0<pout_shape0; i0++)
for (int i1=0; i1<pout_shape1; i1++)
for (int i2=0; i2<pout_shape2; i2++)
for (int i3=0; i3<pout_shape3; i3++)
for (int i4=0; i4<pout_shape4; i4++)
{{ {backward_body} }}
'''])
return out
else:
# TODO: backward
xx = x.reindex([N,C,d,h,w,self.kernel_size[0],self.kernel_size[1],self.kernel_size[2]], [
"i0", # Nid
"i1", # Cid
f"i2*{self.stride[0]}-{self.padding[0]}+i5", # Did
f"i3*{self.stride[1]}-{self.padding[1]}+i6", # Hid
f"i4*{self.stride[2]}-{self.padding[2]}+i7", # Hid
])
return xx.reduce(self.op, [5,6,7])
class AdaptiveAvgPool2d(Module):
def __init__(self, output_size):
@ -216,8 +414,9 @@ class AdaptiveAvgPool2d(Module):
return xx.reduce("mean", [4,5])
class AdaptiveMaxPool2d(Module):
def __init__(self, output_size):
def __init__(self, output_size, return_indices=False):
self.output_size = output_size
self.return_indices = return_indices
def execute(self, x):
if isinstance(self.output_size, int):
@ -235,6 +434,10 @@ class AdaptiveMaxPool2d(Module):
self.sw = math.floor(W / ow)
self.ksh = H - (oh - 1) * self.sh
self.ksw = W - (ow - 1) * self.sw
if self.return_indices:
return MaxPool2d(
kernel_size=(self.ksh, self.ksw),
stride=(self.sh, self.sw), return_indices=True)(x)
h = (H-self.ksh)//self.sh+1
w = (W-self.ksw)//self.sw+1
xx = x.reindex([N,C,h,w,self.ksh,self.ksw], [
@ -245,9 +448,74 @@ class AdaptiveMaxPool2d(Module):
])
return xx.reduce("maximum", [4,5])
class AdaptiveAvgPool3d(Module):
def __init__(self, output_size):
self.output_size = _triple(output_size)
def execute(self, x):
od, oh, ow = self.output_size
if od == 1 and oh == 1 and ow == 1:
return x.reduce("mean", [2,3,4], keepdims=True)
N,C,D,H,W = x.shape
self.sd = math.floor(D / od)
self.sh = math.floor(H / oh)
self.sw = math.floor(W / ow)
self.ksd = D - (od - 1) * self.sd
self.ksh = H - (oh - 1) * self.sh
self.ksw = W - (ow - 1) * self.sw
d = (D-self.ksd)//self.sd+1
h = (H-self.ksh)//self.sh+1
w = (W-self.ksw)//self.sw+1
xx = x.reindex([N,C,d,h,w,self.ksd,self.ksh,self.ksw], [
"i0", # Nid
"i1", # Cid
f"i2*{self.sd}+i5", # Did
f"i3*{self.sh}+i6", # Hid
f"i4*{self.sw}+i7", # Wid
])
return xx.reduce("mean", [5,6,7])
class AdaptiveMaxPool3d(Module):
def __init__(self, output_size, return_indices=False):
self.output_size = _triple(output_size)
self.return_indices = return_indices
def execute(self, x):
od, oh, ow = self.output_size
if od == 1 and oh == 1 and ow == 1 and not self.return_indices:
return x.reduce("maximum", [2,3,4], keepdims=True)
N,C,D,H,W = x.shape
self.sd = math.floor(D / od)
self.sh = math.floor(H / oh)
self.sw = math.floor(W / ow)
self.ksd = D - (od - 1) * self.sd
self.ksh = H - (oh - 1) * self.sh
self.ksw = W - (ow - 1) * self.sw
if self.return_indices:
return MaxPool3d(
kernel_size=(self.ksd, self.ksh, self.ksw),
stride=(self.sd, self.sh, self.sw), return_indices=True)(x)
d = (D-self.ksd)//self.sd+1
h = (H-self.ksh)//self.sh+1
w = (W-self.ksw)//self.sw+1
xx = x.reindex([N,C,d,h,w,self.ksd,self.ksh,self.ksw], [
"i0", # Nid
"i1", # Cid
f"i2*{self.sd}+i5", # Did
f"i3*{self.sh}+i6", # Hid
f"i4*{self.sw}+i7", # Wid
])
return xx.reduce("maximun", [5,6,7])
def pool(x, kernel_size, op, padding=0, stride=None):
return Pool(kernel_size, stride, padding, op=op)(x)
pool2d = pool
def pool3d(x, kernel_size, op, padding=0, stride=None):
return Pool3d(kernel_size, stride, padding, op=op)(x)
class AvgPool2d(Module):
def __init__(self, kernel_size, stride=None, padding=0, ceil_mode=False, count_include_pad=True):
self.layer = Pool(kernel_size=kernel_size, stride=stride, padding=padding, ceil_mode=ceil_mode, count_include_pad=count_include_pad, op="mean")
@ -255,6 +523,13 @@ class AvgPool2d(Module):
def execute(self, x):
return self.layer(x)
class AvgPool3d(Module):
def __init__(self, kernel_size, stride=None, padding=0, ceil_mode=False, count_include_pad=True):
self.layer = Pool3d(kernel_size=kernel_size, stride=stride, padding=padding, ceil_mode=ceil_mode, count_include_pad=count_include_pad, op="mean")
def execute(self, x):
return self.layer(x)
def avg_pool2d(x, kernel_size, stride=None, padding=0, ceil_mode=False, count_include_pad=True):
return AvgPool2d(kernel_size, stride, padding, ceil_mode, count_include_pad)(x)
@ -265,54 +540,115 @@ class MaxPool2d(Module):
def execute(self, x):
return self._layer(x)
class MaxPool3d(Module):
def __init__(self, kernel_size, stride=None, padding=0, dilation=None, return_indices=None, ceil_mode=False):
self._layer = Pool3d(kernel_size=kernel_size, stride=stride, padding=padding, dilation=dilation, return_indices=return_indices, ceil_mode=ceil_mode, op="maximum")
def execute(self, x):
return self._layer(x)
def max_pool2d(x, kernel_size, stride=None, padding=0, dilation=None, return_indices=None, ceil_mode=False):
return MaxPool2d(kernel_size, stride, padding, dilation, return_indices, ceil_mode)(x)
def max_pool3d(x, kernel_size, stride=None, padding=0, dilation=None, return_indices=None, ceil_mode=False):
return MaxPool3d(kernel_size, stride, padding, dilation, return_indices, ceil_mode)(x)
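A shape sketch for the 3-D pooling entry points; with `return_indices=True` the pool also emits int32 flat offsets (`d*H*W + h*W + w`) into the input slice:

```python
import jittor as jt
from jittor import nn

x = jt.randn(1, 2, 8, 8, 8)
assert nn.max_pool3d(x, 2, stride=2).shape == (1, 2, 4, 4, 4)
assert nn.AvgPool3d(2, stride=2)(x).shape == (1, 2, 4, 4, 4)

y, idx = nn.MaxPool3d(2, stride=2, return_indices=True)(x)
assert str(idx.dtype) == 'int32'
```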
class MaxUnpool2d(Module):
''' MaxUnpool2d is the inverse of MaxPool2d with indices.
It takes the output index of MaxPool2d as input.
The element will be zero if it is not the max pooled value.
Example::
>>> import jittor as jt
>>> from jittor import nn
>>> pool = nn.MaxPool2d(2, stride=2, return_indices=True)
>>> unpool = nn.MaxUnpool2d(2, stride=2)
>>> input = jt.array([[[[ 1., 2, 3, 4,0],
[ 5, 6, 7, 8,0],
[ 9, 10, 11, 12,0],
[13, 14, 15, 16,0],
[0, 0, 0, 0, 0]]]])
>>> output, indices = pool(input)
>>> unpool(output, indices, output_size=input.shape)
jt.array([[[[ 0., 0., 0., 0., 0.],
[ 0., 6., 0., 8., 0.],
[ 0., 0., 0., 0., 0.],
[ 0., 14., 0., 16., 0.],
[ 0., 0., 0., 0., 0.]]]])
'''
def __init__(self, kernel_size, stride=None):
''' MaxUnpool2d is the inverse of MaxPool2d with indices.
It takes the output index of MaxPool2d as input.
The element will be zero if it is not the max pooled value.
Example::
>>> import jittor as jt
>>> from jittor import nn
>>> pool = nn.MaxPool2d(2, stride=2, return_indices=True)
>>> unpool = nn.MaxUnpool2d(2, stride=2)
>>> input = jt.array([[[[ 1., 2, 3, 4,0],
[ 5, 6, 7, 8,0],
[ 9, 10, 11, 12,0],
[13, 14, 15, 16,0],
[0, 0, 0, 0, 0]]]])
>>> output, indices = pool(input)
>>> unpool(output, indices, output_size=input.shape)
jt.array([[[[ 0., 0., 0., 0., 0.],
[ 0., 6., 0., 8., 0.],
[ 0., 0., 0., 0., 0.],
[ 0., 14., 0., 16., 0.],
[ 0., 0., 0., 0., 0.]]]])
'''
if isinstance(kernel_size, int):
kernel_size = (kernel_size, kernel_size)
if isinstance(stride, int):
stride = (stride, stride)
if stride is None: stride = kernel_size
assert stride == kernel_size, "Different stride and kernel is not supported yet."
self.kernel_size = kernel_size
self.stride = stride
def execute(self, x, id, output_size=None):
b, c, ph, pw = x.shape
kh, kw = self.kernel_size
sh, sw = self.stride
if output_size:
h, w = output_size[-2:]
else:
h, w = ph * kh, pw * kw
x = x.reindex(shape=[b, c, h, w],
indexes=['i0', 'i1', f'i2/{kh}', f'i3/{kw}'],
extras=[id],
overflow_conditions=[
f'((i2%{kh})*{kw}+i3%{kw}) != @e0(i0,i1,i2/{kh},i3/{kw})'],
overflow_value=0)
return x
h, w = ph * sh, pw * sw
if self.stride == self.kernel_size:
x = x.reindex(shape=[b, c, h, w],
indexes=['i0', 'i1', f'i2/{kh}', f'i3/{kw}'],
extras=[id],
overflow_conditions=[
f'(i2*yshape3+i3) != @e0(i0,i1,i2/{kh},i3/{kw})'],
overflow_value=0)
else:
x = x.reindex_reduce(
op="add",
shape=[b, c, h, w],
indexes=['i0', 'i1',
f'@e0(i0,i1,i2,i3)/xshape3',
f'@e0(i0,i1,i2,i3)%xshape3'],
extras=[id],
)
return x
class MaxUnpool3d(Module):
''' MaxUnpool3d is the inverse of MaxPool3d with indices.
It takes the output index of MaxPool3d as input.
The element will be zero if it is not the max pooled value.
'''
def __init__(self, kernel_size, stride=None):
if stride is None: stride = kernel_size
kernel_size = _triple(kernel_size)
stride = _triple(stride)
self.kernel_size = kernel_size
self.stride = stride
def execute(self, x, id, output_size=None):
b, c, pd, ph, pw = x.shape
kd, kh, kw = self.kernel_size
sd, sh, sw = self.stride
if output_size:
d, h, w = output_size[-3:]
else:
d, h, w = pd * sd, ph * sh, pw * sw
if self.stride == self.kernel_size:
x = x.reindex(shape=[b, c, d, h, w],
indexes=['i0', 'i1', f'i2/{kd}', f'i3/{kh}', f'i4/{kw}'],
extras=[id],
overflow_conditions=[
f'(i2*yshape3*yshape4+i3*yshape4+i4) != @e0(i0,i1,i2/{kd},i3/{kh},i4/{kw})'],
overflow_value=0)
else:
x = x.reindex_reduce(
op="add",
shape=[b, c, d, h, w],
indexes=['i0', 'i1',
f'@e0(i0,i1,i2,i3,i4)/(xshape4*xshape3)',
f'@e0(i0,i1,i2,i3,i4)/xshape4%xshape3',
f'@e0(i0,i1,i2,i3,i4)%xshape4'],
extras=[id],
)
return x
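`MaxUnpool3d` mirrors the 2-D version: when kernel and stride match it scatters through a masked `reindex`, otherwise through `reindex_reduce`. A round-trip sketch, assuming the flat-offset index convention above:

```python
import jittor as jt
from jittor import nn

x = jt.randn(1, 1, 4, 4, 4)
pool = nn.MaxPool3d(2, stride=2, return_indices=True)
unpool = nn.MaxUnpool3d(2, stride=2)

y, idx = pool(x)
rec = unpool(y, idx, output_size=x.shape)   # zeros except at the argmax positions
assert rec.shape == x.shape
```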

View File

@ -668,7 +668,7 @@ def compile_src(src, h, basename):
arr_func_return.append(f"return ({func_call},0)")
func_return_failed = "return -1"
else:
assert "-> void" in func_head
assert "-> void" in func_head, func_head
arr_func_return.append(f"{func_call};{before_return}return")
func_return_failed = "return"
# generate error msg when not a valid call

View File

@ -0,0 +1,31 @@
# wget https://github.com/oneapi-src/oneDNN/archive/refs/tags/v2.2.zip
# extract zip
# cd to root folder
mkdir -p build
cd build
make clean
export CC=aarch64-linux-gnu-gcc-8
export CXX=aarch64-linux-gnu-g++-8
cmake .. \
-DCMAKE_SYSTEM_NAME=Linux \
-DCMAKE_SYSTEM_PROCESSOR=AARCH64 \
-DCMAKE_LIBRARY_PATH=/usr/aarch64-linux-gnu/lib \
-DCMAKE_BUILD_TYPE=Release
# -DCMAKE_SHARED_LINKER_FLAGS=' -lm ' \
make -j8
name=dnnl_lnx_2.2.0_cpu_gomp_aarch64
mkdir -p $name
cp -r ../include ./$name/
mkdir -p ./$name/lib
cp ./src/libmkldnn.so ./$name/lib/libmkldnn.so
cp -r ../examples ./$name/
cp ./include/oneapi/dnnl/* ./$name/include/oneapi/dnnl/
tar -acvf $name.tgz ./$name/
rsync -avPu $name.tgz jittor-web:Documents/jittor-blog/assets/
ssh jittor-web Documents/jittor-blog.git/hooks/post-update
echo "https://cg.cs.tsinghua.edu.cn/jittor/assets/$name.tgz"
md5sum $name.tgz

View File

@ -10,7 +10,7 @@
#include <functional>
#include "utils/log.h"
#define JIT_TEST(name) extern "C" void jit_test_ ## name ()
#define JIT_TEST(name) extern void jit_test_ ## name ()
void expect_error(std::function<void()> func);
#define VAR_MEMBER_NAME_AND_OFFSET(name, op) { #name , offsetof(struct op, name) }

View File

@ -34,6 +34,7 @@ namespace jittor {
Executor exe;
extern MemoryProfiler memory_profiler;
DECLARE_FLAG(int, profile_memory_enable);
DEFINE_FLAG(int, gopt_disable, 0, "Disable graph optimizer.");
// from fetch_op.cc
extern list<VarPtr> fetcher_to_free;
@ -145,7 +146,7 @@ void Executor::run_sync(vector<Var*> vars, bool device_sync) {
}
}
}
if (!need_opt) break;
if (!need_opt || gopt_disable) break;
for (Node* n : bfs_q) {
if (n->flags.get(NodeFlags::_has_gopt)) {
n->op()->graph_optimize();
@ -486,9 +487,14 @@ void Executor::run_sync(vector<Var*> vars, bool device_sync) {
if (use_cuda)
checkCudaErrors(cudaDeviceSynchronize());
#endif
for (Var* var : op->outputs())
check_nan(var);
}
#ifdef JT_CHECK_NAN
for (Var* var : op->outputs())
check_nan(var);
#endif
#ifdef JT_SYNC
checkCudaErrors(cudaDeviceSynchronize());
#endif
LOGvvv << "Finished Op(" >> op->name() << rid >>
"/" >> queue.size() >> ") output:" << op->outputs();
if (is_fused_op) {

View File

@ -37,6 +37,14 @@ inline static void assign_attrs(Var* a, Var* b) {
a->flags.set(NodeFlags::_stop_fuse);
}
map<string,int> grad_breaks;
void warn_grad_break(int i, Var* v) {
if (grad_breaks.count(v->name.c_str())) return;
grad_breaks[v->name.c_str()] = 1;
LOGw << "grads[">>i>>"] '">> v->name>>"' doesn't have gradient. It will be set to zero:" << v;
}
vector<VarPtr> grad(Var* loss, vector<Var*> targets) {
LOGvv << "loss:" >> loss << "targets:" >> targets;
CHECK(loss->is_float()) << "Loss should be float";
@ -212,7 +220,7 @@ vector<VarPtr> grad(Var* loss, vector<Var*> targets) {
grad = move(grads[id]);
if (!grad) {
// TODO: better warning message
LOGw << "grads[">>i>>"] '">> var->name>>"' doesn't have gradient. It will be set to zero:" << var;
warn_grad_break(i, var);
grad = make_number(0.f, var);
assign_attrs(grad.ptr, var);
}

View File

@ -13,14 +13,27 @@ namespace jittor {
typedef void (*set_seed_callback)(int);
void init();
/**
Sets the seed of jittor random number generator. Also see @jittor.set_global_seed.
----------------
* [in] seed: a python number.
*/
// @pyjt(set_seed, seed)
void set_seed(int seed);
/**
Returns the seed of jittor random number generator.
*/
// @pyjt(get_seed)
int get_seed();
void add_set_seed_callback(set_seed_callback callback);
extern "C"
extern
std::default_random_engine* get_random_engine();
// things need to be clean before python exit

python/jittor/src/jit_compiler.cc Executable file → Normal file
View File

@ -12,6 +12,7 @@
#else
#include <dlfcn.h>
#endif
#include <mutex>
#include "jit_compiler.h"
#include "op.h"
@ -33,14 +34,19 @@ DEFINE_FLAG(int, rewrite_op, 1, "Rewrite source file of jit operator or not");
namespace jit_compiler {
std::mutex dl_open_mutex;
jit_op_entry_t load_jit_lib(string name, string symbol_name="jit_entry") {
const char* msg = "";
LOGvv << "Opening jit lib:" << name;
#ifdef _WIN32
void* handle = (void*)LoadLibrary(name.c_str());
#else
#elif defined(__linux__)
void* handle = dlopen(name.c_str(), RTLD_LAZY | RTLD_DEEPBIND | RTLD_LOCAL);
msg = dlerror();
#else
void *handle = dlopen(name.c_str(), RTLD_NOW | RTLD_LOCAL);
msg = dlerror();
#endif
CHECK(handle) << "Cannot open library" << name << ":" << msg;
@ -99,10 +105,10 @@ jit_op_entry_t compile(const string& jit_key, const string& src, const bool is_c
+ " \"" + jit_src_path + "\"" + other_src
+ cc_flags + extra_flags
+ " -o \"" + jit_lib_path + "\"";
#ifndef _WIN32
#ifdef __linux__
cmd = python_path+" "+jittor_path+"/utils/asm_tuner.py "
"--cc_path=" + cmd;
#endif
#endif
}
cache_compile(cmd, cache_path, jittor_path);
auto symbol_name = get_symbol_name(jit_key);

View File

@ -8,34 +8,33 @@
#include <sys/mman.h>
#endif
#include <sstream>
#include <unistd.h>
#include "jit_key.h"
#include "utils/str_utils.h"
namespace jittor {
const int page_size = 4*1024;
extern thread_local size_t protected_page;
#ifndef _WIN32
static size_t get_buffer_end_page(size_t buffer_end) {
// get the last complete page in buffer
// 4k align :
// | | | | |
// buffer: xxxxxxxxxxxxxxxxxxxxxxxx
// ^ buffer_end_page
size_t buffer_end_page = buffer_end - buffer_end % page_size;
if (buffer_end_page + page_size-1 > buffer_end)
buffer_end_page -= page_size;
size_t buffer_end_page = buffer_end - buffer_end % getpagesize();
if (buffer_end_page + getpagesize()-1 > buffer_end)
buffer_end_page -= getpagesize();
return buffer_end_page;
}
#endif
JitKey::JitKey() {
(void)get_buffer_end_page;
#ifndef _WIN32
auto buffer_end_page = get_buffer_end_page((size_t)&buffer[buffer_size-1]);
LOGvv << "protect page" << (void*)buffer_end_page;
ASSERT(0==mprotect((void*)buffer_end_page, page_size, PROT_NONE));
// Windows: assigning an extern thread_local var causes a fault, FIX IT
ASSERT(0==mprotect((void*)buffer_end_page, getpagesize(), PROT_NONE));
protected_page = buffer_end_page;
#endif
}
@ -44,7 +43,7 @@ JitKey::~JitKey() {
#ifndef _WIN32
auto buffer_end_page = get_buffer_end_page((size_t)&buffer[buffer_size-1]);
LOGvv << "un-protect page" << (void*)buffer_end_page;
mprotect((void*)buffer_end_page, page_size, PROT_READ|PROT_WRITE|PROT_EXEC);
mprotect((void*)buffer_end_page, getpagesize(), PROT_READ|PROT_WRITE|PROT_EXEC);
protected_page = 0;
#endif
}

View File

@ -165,8 +165,8 @@ inline JK& operator<<(JK& jk, int64 c) {
}
return jk << JK::hex(c);
}
#ifndef _WIN32
// win32 cause redefinition error
#ifdef __linux__
inline JK& operator<<(JK& jk, long long int c) {
return jk << (int64)c;
}

View File

@ -15,6 +15,9 @@ const char* AlignedAllocator::name() const {return "aligned";}
void* AlignedAllocator::alloc(size_t size, size_t& allocation) {
#ifndef _WIN32
#ifdef __APPLE__
size += 32-size%32;
#endif
return aligned_alloc(alignment, size);
#else
return _aligned_malloc(size, alignment);

View File

@ -13,6 +13,7 @@
namespace jittor {
DEFINE_FLAG(int, use_temp_allocator, 1, "Enable temp allocator");
vector<TempAllocator*> TempAllocator::temp_allocators;
TempAllocator::~TempAllocator() {
while (!cached_blocks.empty()) {

View File

@ -24,6 +24,7 @@ struct TempCachingBlock {
struct TempAllocator : Allocator {
static const size_t ALIGN_SIZE = 512;
static const size_t ID_LIMIT = 1 << 18;
static vector<TempAllocator*> temp_allocators;
Allocator* underlying;
size_t cache_blocks_limit, used_memory, unused_memory;
std::map<unsigned long long, TempCachingBlock*> cached_blocks;
@ -33,6 +34,7 @@ struct TempAllocator : Allocator {
inline TempAllocator(size_t cache_blocks_limit=2) : cache_blocks_limit(cache_blocks_limit), used_memory(0), unused_memory(0), tot_block_id(0), occupied_id_mapper(new TempCachingBlock*[ID_LIMIT]) {
temp_allocators.push_back(this);
}
inline TempAllocator(Allocator* underlying, size_t cache_blocks_limit=2) : TempAllocator(cache_blocks_limit) {
setup(underlying);

View File

@ -6,11 +6,17 @@
// ***************************************************************
#include <iomanip>
#include <algorithm>
#ifndef _WIN32
#if defined(__linux__)
#include <sys/sysinfo.h>
#else
#elif defined(__APPLE__)
#include <sys/sysctl.h>
#include <mach/host_info.h>
#include <mach/mach_init.h>
#include <mach/mach_host.h>
#elif defined(_WIN32)
#include <windows.h>
#endif
#include <unistd.h>
#include "var.h"
#include "op.h"
@ -104,16 +110,41 @@ void display_memory_info(const char* fileline, bool dump_var, bool red_color) {
>> "(" >> std::setprecision(p) >> a->unused_memory*100.0 / total >> "%)"
<< "total:" << FloatOutput{(double)total, " KMG", 1024, "B"} >> "\n";
}
if (use_temp_allocator && exe.temp_allocator) {
for (auto& a : TempAllocator::temp_allocators) {
auto total = a->used_memory + a->unused_memory;
all_total += total;
a->is_cuda() ? gpu_total += total : cpu_total += total;
log << "name:" << a->name() << "is_cuda:" << a->is_cuda()
<< "used:" << FloatOutput{(double)a->used_memory, " KMG", 1024, "B"}
>> "(" >> std::setprecision(p) >> a->used_memory*100.0 / total >> "%)"
<< "unused:" << FloatOutput{(double)a->unused_memory, " KMG", 1024, "B"}
>> "(" >> std::setprecision(p) >> a->unused_memory*100.0 / total >> "%)"
<< "total:" << FloatOutput{(double)total, " KMG", 1024, "B"} >> "\n";
}
}
log << "cpu&gpu:" << FloatOutput{(double)all_total, " KMG", 1024, "B"}
<< "gpu:" << FloatOutput{(double)gpu_total, " KMG", 1024, "B"}
<< "cpu:" << FloatOutput{(double)cpu_total, " KMG", 1024, "B"} >> '\n';
if (use_temp_allocator) {
TempAllocator* temp_allocator = (TempAllocator*)exe.temp_allocator;
log << "\nname:" << temp_allocator->name() << "\n";
log << "used_memory:" << FloatOutput{(double)temp_allocator->used_memory, " KMG", 1024, "B"} << "\n";
log << "unused_memory:" << FloatOutput{(double)temp_allocator->unused_memory, " KMG", 1024, "B"} << "\n";
size_t cpu_free = 0;
#if defined(__linux__)
cpu_free = get_avphys_pages() * sysconf(_SC_PAGESIZE);
#elif defined(__APPLE__)
{
mach_msg_type_number_t count = HOST_VM_INFO_COUNT;
vm_statistics_data_t vmstat;
if (KERN_SUCCESS == host_statistics(mach_host_self(), HOST_VM_INFO, (host_info_t)&vmstat, &count)) {
cpu_free = vmstat.free_count * sysconf(_SC_PAGESIZE);
}
}
#endif
size_t gpu_free = 0, _gpu_total = 0;
(void)gpu_free; (void)_gpu_total;
#ifdef HAS_CUDA
cudaMemGetInfo(&gpu_free, &_gpu_total);
#endif
log << "free: cpu(">>FloatOutput{(double)cpu_free, " KMG", 1024, "B"}
>> ") gpu(">>FloatOutput{(double)gpu_free, " KMG", 1024, "B"} >> ")\n";
if (dump_var) {
vector<Node*> queue;
unordered_set<Node*> visited;
@ -155,22 +186,35 @@ void display_memory_info(const char* fileline, bool dump_var, bool red_color) {
log.end();
}
extern vector<void(*)()> sigquit_callback;
void meminfo_callback() {
display_memory_info();
}
MemInfo::MemInfo() {
#ifndef _WIN32
#if defined(__linux__)
struct sysinfo info = {0};
sysinfo(&info);
total_cpu_ram = info.totalram;
#else
#elif defined(__APPLE__)
int mib[] = {CTL_HW, HW_MEMSIZE};
size_t len=sizeof(total_cpu_ram);
sysctl(mib, 2, &total_cpu_ram, &len, NULL, 0);
#elif defined(_WIN32)
MEMORYSTATUSEX statex;
GlobalMemoryStatusEx (&statex);
total_cpu_ram = statex.ullTotalPhys;
#endif
total_cuda_ram = 0;
#ifdef HAS_CUDA
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
total_cuda_ram = prop.totalGlobalMem;
#endif
sigquit_callback.push_back(&meminfo_callback);
}
MemInfo mem_info;
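The display code above queries free physical memory per platform. A compilable sketch of the same queries, using `sysconf(_SC_AVPHYS_PAGES)` in place of glibc's `get_avphys_pages()` (both report available physical pages on Linux):

```cpp
// Cross-platform "free physical memory" query, mirroring the diff above:
// Linux exposes it via sysconf, macOS via Mach host statistics.
#include <unistd.h>
#include <cstdio>
#if defined(__APPLE__)
#include <mach/mach_host.h>
#include <mach/mach_init.h>
#endif

size_t free_physical_bytes() {
    size_t free_bytes = 0;
#if defined(__linux__)
    free_bytes = (size_t)sysconf(_SC_AVPHYS_PAGES) * sysconf(_SC_PAGESIZE);
#elif defined(__APPLE__)
    mach_msg_type_number_t count = HOST_VM_INFO_COUNT;
    vm_statistics_data_t vmstat;
    if (host_statistics(mach_host_self(), HOST_VM_INFO,
                        (host_info_t)&vmstat, &count) == KERN_SUCCESS)
        free_bytes = (size_t)vmstat.free_count * sysconf(_SC_PAGESIZE);
#endif
    return free_bytes;
}

int main() { printf("free: %zu bytes\n", free_physical_bytes()); }
```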

View File

@ -10,11 +10,11 @@
namespace jittor {
__device__ inline static int floatToOrderedInt(float floatVal) {
int intVal = __float_as_int( floatVal );
return (intVal >= 0 ) ? intVal : intVal ^ 0x7FFFFFFF;
}
__device__ inline static float orderedIntToFloat(int intVal) {
return __int_as_float((intVal >= 0) ? intVal : intVal ^ 0x7FFFFFFF);
}
__global__ inline static void fix_float_kernel(float* x, int num) {
@ -24,7 +24,24 @@ __global__ inline static void fix_float_kernel(float* x, int num) {
x[i] = orderedIntToFloat(__float_as_int(x[i]));
}
inline static void fix_float(float* x, int num) {
__device__ inline static long long floatToOrderedInt(double floatVal) {
long long intVal = __double_as_longlong( floatVal );
return (intVal >= 0 ) ? intVal : intVal ^ 0x7FFFFFFFFFFFFFFF;
}
__device__ inline static double orderedIntToFloat(long long intVal) {
return __longlong_as_double((intVal >= 0) ? intVal : intVal ^ 0x7FFFFFFFFFFFFFFF);
}
__global__ inline static void fix_float_kernel(double* x, int num) {
int tid = threadIdx.x + blockIdx.x * blockDim.x;
int tnum = gridDim.x * blockDim.x;
for (int i=tid; i<num; i+=tnum)
x[i] = orderedIntToFloat(__double_as_longlong(x[i]));
}
template<class T>
inline static void fix_float(T* x, int num) {
fix_float_kernel<<<std::min((num-1)/1024+1,256), 1024>>>(x, num);
}
@ -38,6 +55,11 @@ inline float cuda_atomic_max(float* a, float b) {
return orderedIntToFloat(atomicMax((int *)a, floatToOrderedInt(b)));
}
template<> __device__
inline double cuda_atomic_max(double* a, double b) {
return orderedIntToFloat(atomicMax((long long *)a, floatToOrderedInt(b)));
}
template<class T> __device__
T cuda_atomic_min(T* a, T b) {
return atomicMin(a, b);
@ -48,6 +70,11 @@ inline float cuda_atomic_min(float* a, float b) {
return orderedIntToFloat(atomicMin((int *)a, floatToOrderedInt(b)));
}
template<> __device__
inline double cuda_atomic_min(double* a, double b) {
return orderedIntToFloat(atomicMin((long long *)a, floatToOrderedInt(b)));
}
template <class T> struct int_mapper {
typedef T src;
typedef T target;
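CUDA has no native `atomicMax`/`atomicMin` for floats, so the code above maps floats to integers with an order-preserving bijection: non-negative floats already compare correctly when reinterpreted as sign-magnitude ints, and XOR-ing the low 31 bits of negative values reverses them into the right order. An integer atomic max on mapped values is then exactly a float max. A host-side check of the monotonicity (plain C++, `memcpy` standing in for `__float_as_int`):

```cpp
// Host-side check of the floatToOrderedInt mapping: it is monotonic,
// so an integer max on mapped values equals a float max on the originals.
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstring>

static int32_t float_to_ordered_int(float f) {
    int32_t i;
    std::memcpy(&i, &f, sizeof(i));        // like __float_as_int
    return i >= 0 ? i : i ^ 0x7FFFFFFF;    // flip magnitude bits of negatives
}

static float ordered_int_to_float(int32_t i) {
    i = i >= 0 ? i : i ^ 0x7FFFFFFF;
    float f;
    std::memcpy(&f, &i, sizeof(f));
    return f;
}

int main() {
    float xs[] = {-2.5f, -0.0f, 0.0f, 1.0f, 3.75f};
    for (float a : xs)
        for (float b : xs) {
            int32_t m = std::max(float_to_ordered_int(a), float_to_ordered_int(b));
            assert(ordered_int_to_float(m) == std::max(a, b));
        }
}
```

The double overloads added in the diff apply the same trick at 64 bits, which is why the mask widens to `0x7FFFFFFFFFFFFFFF`.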

View File

@ -4,7 +4,7 @@
// This file is subject to the terms and conditions defined in
// file 'LICENSE.txt', which is part of this source code package.
// ***************************************************************
#include <bits/stdc++.h>
#include <cstring>
#include "misc/nano_string.h"
namespace jittor {
@ -59,6 +59,9 @@ static unordered_set<string> unary_ops = {
"round",
"floor",
"ceil",
"round_int",
"floor_int",
"ceil_int",
"cast",
"sin",
"asin",
@ -82,9 +85,9 @@ static unordered_set<string> unary_float_ops = {
"sqrt",
};
static unordered_set<string> unary_int_ops = {
"round",
"floor",
"ceil",
"round_int",
"floor_int",
"ceil_int",
};
static unordered_set<string> binary_ops = {

View File

@ -62,6 +62,9 @@ constexpr int ns_max_len = 16;
m(round) \
m(floor) \
m(ceil) \
m(round_int) \
m(floor_int) \
m(ceil_int) \
m(cast) \
\
m(sin) \

View File

@ -159,13 +159,8 @@ struct NanoVector {
for (auto a : v) push_back_check_overflow(a);
}
inline static NanoVector make(const int64* v, int n) {
NanoVector nv;
for (int i=0; i<n; i++) nv.push_back_check_overflow(v[i]);
return nv;
}
inline static NanoVector make(const int32* v, int n) {
template<typename TMakeV>
inline static NanoVector make(const TMakeV* v, int n) {
NanoVector nv;
for (int i=0; i<n; i++) nv.push_back_check_overflow(v[i]);
return nv;
@ -238,6 +233,21 @@ struct NanoVector {
v[i] = at(i);
return v;
}
inline void _unpack(int i) {
return;
}
template<class... Args>
void _unpack(int i, int& x, Args&&... args) {
x = this->operator[](i);
_unpack(i+1, std::forward<Args>(args)...);
}
template<class... Args>
void unpack(Args&&... args) {
_unpack(0, std::forward<Args>(args)...);
}
};
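The new `unpack` helper is classic variadic recursion: each call assigns element `i` to the first remaining reference and recurses on the rest. A self-contained sketch (`MiniVec` is a stand-in for `NanoVector`, which stores elements in packed form rather than a `std::vector`):

```cpp
// Variadic unpack: peel one reference per recursive call, assigning
// element i to the i-th argument. Base case fires when no args remain.
#include <cassert>
#include <vector>

struct MiniVec {
    std::vector<int> data;
    void _unpack(int) {}                   // base case: no args left
    template<class... Args>
    void _unpack(int i, int& x, Args&... rest) {
        x = data[i];
        _unpack(i + 1, rest...);
    }
    template<class... Args>
    void unpack(Args&... args) { _unpack(0, args...); }
};

int main() {
    MiniVec shape{{2, 3, 224, 224}};
    int n, c, h, w;
    shape.unpack(n, c, h, w);              // n=2, c=3, h=224, w=224
    assert(n == 2 && w == 224);
}
```

This lets callers write `x->shape.unpack(n, c, h, w)` instead of four separate indexing expressions.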

View File

@ -47,13 +47,11 @@ struct RingBuffer {
}
inline ~Cond() {
#ifndef _WIN32
// a dirty hack
// ref: https://stackoverflow.com/questions/20439404/pthread-conditions-and-process-termination
// cv.__data.__wrefs = 0;
#ifdef __linux__
cv.__data = {0};
#else
// cv.__data = 0;
#endif
pthread_cond_destroy(&cv);
}

View File

@ -5,12 +5,22 @@
// file 'LICENSE.txt', which is part of this source code package.
// ***************************************************************
#pragma once
#if defined(__clang__)
#include <string_view>
#elif defined(__GNUC__)
#include <experimental/string_view>
#endif
#include "common.h"
namespace jittor {
#if defined(__clang__)
using std::string_view;
#elif defined(__GNUC__)
using std::experimental::string_view;
#endif
template<class T>
struct string_view_map {

View File

@ -104,6 +104,8 @@ int OpCompiler::total_member_count() {
// array needs an extra local var
if (op->ops[i]->name()==string("array"))
member_count += 1;
if (op->ops[i]->name()==string("safe_clip"))
member_count += 2;
member_count += v.size();
i += 1;
}
@ -826,11 +828,15 @@ string OpCompiler::__get_fused_src(
const unordered_set<string> members = {
"x", "y", "z", "cond", "output", "extras"
};
const unordered_set<string> scalar_members = {
"left", "right"
};
const unordered_set<string> unchanged = {
"for", "const", "auto", "get_random_engine",
"int", "float", "bool", "CHECK", "STRINGIZE",
"void", "__restrict__", "if", "true", "false",
"Op", "Var", "Node", "itof", "assert", "ASSERT"
"Op", "Var", "Node", "itof", "assert", "ASSERT",
"float64"
};
auto not_change = [&](const string& s) -> bool {
if (unchanged.count(s)) return true;
@ -941,7 +947,8 @@ string OpCompiler::__get_fused_src(
while (l<src.size() && isvar(src[l])) l++;
auto var = src.substr(j, l-j);
if (var[0] == ':' || isdigit(var[0]) || not_change(var) || src[j-1]=='.' || src[j-1]=='>') {} else
if (members.count(var)) {
if (members.count(var) || scalar_members.count(var)) {
bool is_member = members.count(var);
string arg_name = "op" + S(oi) + "_" + var;
if (l<src.size() && src[l]=='[') {
// handle extras[...]
@ -964,7 +971,8 @@ string OpCompiler::__get_fused_src(
" = (("+name3+"Op*)(ops[" + S(oi) + "]))->" + var;
fused_kernel_args += ";\n";
kernel_args.insert(arg_name);
op_members[oi].push_back(arg_name);
if (is_member)
op_members[oi].push_back(arg_name);
}
fused_kernel += arg_name;
j = l-1;

View File

@ -476,6 +476,14 @@ VarPtr BinaryOp::grad(Var* out, Var* dout, Var* v, int v_index) {
return make_binary(ndzx, y2, ns_divide);
}
}
if (ns == ns_mod) {
if (v_index == 0)
return dout;
else {
auto a = make_unary(make_binary(x, y, ns_divide), ns_floor);
// chain rule: the y-gradient is -floor(x/y), scaled by dout
return make_unary(make_binary(dout, a, ns_multiply), ns_negative);
}
}
if (ns == ns_maximum || ns == ns_minimum) {
auto zeros = make_number(0, dout);
auto cond = make_binary(y, z, ns_equal);
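For the `ns_mod` branch above: with $z = x \bmod y = x - \lfloor x/y \rfloor\, y$ and the floor treated as locally constant, the gradients are

$$\frac{\partial z}{\partial x} = 1, \qquad \frac{\partial z}{\partial y} = -\left\lfloor \frac{x}{y} \right\rfloor,$$

so the backward pass returns `dout` unchanged for `x`, and `dout` scaled by `-floor(x/y)` for `y`.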

View File

@ -13,12 +13,12 @@ namespace jittor {
#define pow(T,a,b) ::pow(a,b)
#define maximum(T,a,b) ::max(T(a), T(b))
#define minimum(T,a,b) ::min(T(a), T(b))
#define mod(T,a,b) @if(@strcmp(@Tx,float32)==0,::fmodf(T(a),T(b)),@if(@strcmp(@Tx,float64)==0,::fmod(T(a),T(b)),((a)%(b))))
#define mod(T,a,b) @if(@strcmp(@T,float32)==0,(a-::floorf((a)/(b))*(b)),@if(@strcmp(@Tx,float64)==0,(a-::floor((a)/(b))*(b)),(a%b)))
#else // JIT_cpu
#define pow(T,a,b) std::pow(a,b)
#define maximum(T,a,b) std::max(T(a), T(b))
#define minimum(T,a,b) std::min(T(a), T(b))
#define mod(T,a,b) @if(@strcmp(@Tx,float32)==0 || @strcmp(@Tx,float64)==0,std::fmod((T)a,(T)b),((a)%(b)))
#define mod(T,a,b) @if(@strcmp(@T,float32)==0,(a-std::floor((a)/(b))*(b)),@if(@strcmp(@Tx,float64)==0,(a-std::floor((a)/(b))*(b)),(a%b)))
#endif
#define add(T,a,b) ((a)+(b))
#define subtract(T,a,b) ((a)-(b))
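The `mod` macro now computes a floor-mod, `a - floor(a/b)*b`, instead of delegating to `fmod`. The two differ when the operands have mixed signs; the new form matches Python's `%`, whose result takes the sign of the divisor, which is presumably the intent given the gradient change above. A quick demonstration (plain C++, showing only the float64 case; the macro dispatches per dtype via `@strcmp`):

```cpp
// fmod truncates toward zero; floor-mod rounds toward negative infinity,
// matching Python's % operator.
#include <cmath>
#include <cstdio>

static double floor_mod(double a, double b) { return a - std::floor(a / b) * b; }

int main() {
    printf("fmod(-7, 3)      = %g\n", std::fmod(-7.0, 3.0));  // -1
    printf("floor_mod(-7, 3) = %g\n", floor_mod(-7.0, 3.0));  //  2
}
```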

View File

@ -0,0 +1,114 @@
// ***************************************************************
// Copyright (c) 2021 Jittor. All Rights Reserved.
// Maintainers: Dun Liang <randonlang@gmail.com>.
// This file is subject to the terms and conditions defined in
// file 'LICENSE.txt', which is part of this source code package.
// ***************************************************************
#include "ops/fuse_transpose_op.h"
#include "var.h"
#include "ops/op_register.h"
#include "misc/cuda_flags.h"
namespace jittor {
#ifndef JIT
static auto make_transpose = get_op_info("fuse_transpose")
.get_constructor<VarPtr, Var*, NanoVector>();
static inline NanoVector get_reverse(NanoVector axes) {
NanoVector reverse;
reverse.reserve(axes.size(), axes.size());
for (uint i=0; i<axes.size(); i++)
reverse.set_data(axes[i], i);
return reverse;
}
FuseTransposeOp::FuseTransposeOp(Var* x, NanoVector axes_) : x(x), axes(axes_) {
OpType tp = OpType::broadcast;
if (!x->is_finished()) {
auto type = x->input()->type();
if (type==OpType::broadcast || type==OpType::element)
tp = OpType::reduce;
}
flags.set(NodeFlags::_cpu);
flags.set(NodeFlags::_cuda);
set_type(tp);
int i=0;
for (; i<axes.size(); i++)
if (i!=axes[i]) break;
if (i==axes.size() && axes.size()) {
forward(x);
return;
}
auto xdim = x->shape.size();
if (!axes.size()) {
for (int i=0; i<(int)xdim; i++)
axes.push_back(xdim-1-i);
}
y = create_output(nullptr, x->dtype());
}
void FuseTransposeOp::infer_shape() {
auto xdim = x->shape.size();
CHECK(xdim);
if (!axes.size()) {
for (int i=0; i<(int)xdim; i++)
axes.push_back(xdim-1-i);
} else {
CHECKop(axes.size(),==,xdim);
int64_t mask=0;
for (auto i : axes) mask |= 1<<i;
CHECK(mask==((1ll<<xdim)-1)) << "Invalid axes" << axes;
}
NanoVector shape;
for (uint i=0; i<xdim; i++)
shape.push_back(x->shape[axes[i]]);
y->set_shape(shape);
}
VarPtr FuseTransposeOp::grad(Var* out, Var* dout, Var* v, int v_index) {
return make_transpose(dout, get_reverse(axes));
}
void FuseTransposeOp::jit_prepare(JK& jk) {
auto bc = type()==OpType::broadcast;
auto ax = bc ? axes : get_reverse(axes);
jk << _CS("[Tx:") << x->dtype();
jk << _CS("][DIM=") << JK::hex1(axes.size());
jk << _CS("][BC:") << JK::hex1(bc);
for (uint i=0; i<ax.size(); i++)
jk << _CS("][AXES") << JK::hex1(ax[i]) << '=' << JK::hex1(i);
jk << ']';
}
#else // JIT
void FuseTransposeOp::jit_run() {
auto* __restrict__ xp = x->ptr<Tx>();
auto* __restrict__ yp = y->ptr<Tx>();
@for(i, 0, DIM, index_t yshape@i = y->shape[@i];)
@for(i, 0, DIM, index_t xshape@i = yshape@{AXES@i};)
index_t xstride@{DIM-1} = 1;
@for(i, DIM-2, -1, -1, auto xstride@i = xstride@{i+1} * xshape@{i+1};)
index_t ystride@{DIM-1} = 1;
@for(i, DIM-2, -1, -1, auto ystride@i = ystride@{i+1} * yshape@{i+1};)
@if(BC,
@for(d, 0, DIM, for (index_t i@d=0; i@d < yshape@d; i@d++)) {
auto yid = @for(d, 0, DIM, + i@d * ystride@d);
@for(d, 0, DIM, auto xid@d = i@{AXES@d};)
auto xid = @for(d, 0, DIM, + xid@d * xstride@d);
yp[yid] = xp[xid];
},
@for(d, 0, DIM, for (index_t i@d=0; i@d < xshape@d; i@d++)) {
auto xid = @for(d, 0, DIM, + i@d * xstride@d);
@for(d, 0, DIM, auto yid@d = i@{AXES@d};)
auto yid = @for(d, 0, DIM, + yid@d * ystride@d);
yp[yid] = xp[xid];
}
)
// unused var
(void)xshape0;
}
#endif // JIT
} // jittor
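`get_reverse` builds the inverse permutation, which is exactly what the gradient of a transpose needs: transposing by `axes` and then by the inverse restores the original axis order. A sketch with a worked example:

```cpp
// Inverse of a permutation p satisfies inv[p[i]] = i, so transpose(p)
// followed by transpose(inv(p)) is the identity; grad() above relies on this.
#include <cassert>
#include <vector>

static std::vector<int> inverse_permutation(const std::vector<int>& p) {
    std::vector<int> inv(p.size());
    for (size_t i = 0; i < p.size(); i++)
        inv[p[i]] = (int)i;
    return inv;
}

int main() {
    std::vector<int> axes = {2, 0, 1};     // e.g. HWC -> CHW
    auto inv = inverse_permutation(axes);  // {1, 2, 0}
    assert((inv == std::vector<int>{1, 2, 0}));
    for (size_t i = 0; i < axes.size(); i++)
        assert(inv[axes[i]] == (int)i);
}
```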

View File

@ -0,0 +1,23 @@
// ***************************************************************
// Copyright (c) 2021 Jittor. All Rights Reserved.
// Maintainers: Dun Liang <randonlang@gmail.com>.
// This file is subject to the terms and conditions defined in
// file 'LICENSE.txt', which is part of this source code package.
// ***************************************************************
#pragma once
#include "op.h"
namespace jittor {
struct FuseTransposeOp : Op {
Var* x, * y;
NanoVector axes;
FuseTransposeOp(Var* x, NanoVector axes=NanoVector());
const char* name() const override { return "fuse_transpose"; }
VarPtr grad(Var* out, Var* dout, Var* v, int v_index) override;
void infer_shape() override;
DECLARE_jit_run;
};
} // jittor

View File

@ -93,6 +93,9 @@ void GetitemOp::infer_slices(
} else
if (s.is_ellipsis()) {
auto remain_slice = vs.n-vid-1;
for (int i=vid+1; i<vs.n; i++)
if (vs.slices[i].is_none())
remain_slice--;
auto remain_idims = nin-i;
auto ellipsis_size = remain_idims - remain_slice;
ASSERT(ellipsis_size>=0) << "NDims not match";
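The ellipsis fix accounts for `None` slices, which insert new output axes without consuming input dims, so they must not count against the dims the ellipsis covers. A simplified model of the computation (assuming every non-`None` slice after the ellipsis consumes exactly one input dim; the real code generalizes this):

```cpp
// For x of ndim 4 indexed as x[..., None, 0]: two slices follow the
// ellipsis but only the 0 consumes an input dim, so the ellipsis spans
// 4 - 1 = 3 dims rather than 4 - 2 = 2.
#include <cassert>
#include <vector>

enum class Kind { Slice, None, Int, Ellipsis };

static int ellipsis_size(const std::vector<Kind>& s, int vid, int ndim) {
    int remain_slice = (int)s.size() - vid - 1;
    for (size_t i = vid + 1; i < s.size(); i++)
        if (s[i] == Kind::None) remain_slice--;  // None adds an axis, consumes none
    return ndim - remain_slice;
}

int main() {
    std::vector<Kind> s = {Kind::Ellipsis, Kind::None, Kind::Int};
    assert(ellipsis_size(s, 0, 4) == 3);
}
```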

View File

@ -0,0 +1,47 @@
// ***************************************************************
// Copyright (c) 2021 Jittor. All Rights Reserved.
// Maintainers: Dun Liang <randonlang@gmail.com>.
// This file is subject to the terms and conditions defined in
// file 'LICENSE.txt', which is part of this source code package.
// ***************************************************************
#include <cmath>
#include "var.h"
#include "ops/safe_clip_op.h"
#include "ops/op_register.h"
namespace jittor {
#ifndef JIT
SafeClipOp::SafeClipOp(Var* x, float64 left, float64 right) : x(x), left(left), right(right) {
flags.set(NodeFlags::_cpu);
flags.set(NodeFlags::_cuda);
set_type(OpType::element);
y = create_output(nullptr, x->dtype());
}
VarPtr SafeClipOp::grad(Var* out, Var* dout, Var* v, int v_index) {
return dout;
}
void SafeClipOp::infer_shape() {
y->set_shape(x->shape);
}
void SafeClipOp::jit_prepare(JK& jk) {
jk << _CS("[Tx:") << x->dtype() <<']';
}
#else // JIT
void SafeClipOp::jit_run() {
auto* __restrict__ xp = x->ptr<Tx>();
Tx left_value = (Tx)std::max((float64)std::numeric_limits<Tx>::lowest(), left);
Tx right_value = (Tx)std::min((float64)std::numeric_limits<Tx>::max(), right);
auto* __restrict__ yp = y->ptr<Tx>();
index_t num = y->num;
for (index_t i=0; i<num; i++)
yp[i] = xp[i] < left_value ? left_value : (xp[i] > right_value ? right_value : xp[i]);
}
#endif // JIT
} // jittor
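`safe_clip` clamps values in the forward pass while its `grad` returns `dout` unmodified, a straight-through estimator that keeps gradients alive in saturated regions where an ordinary clamp would zero them out (useful for, e.g., clipping inputs to `log`). In miniature:

```cpp
// safe_clip semantics: clamp forward, identity backward.
#include <algorithm>
#include <cstdio>

struct SafeClip {
    double left, right;
    double forward(double x) const { return std::min(std::max(x, left), right); }
    double backward(double dout) const { return dout; }  // straight-through
};

int main() {
    SafeClip c{0.0, 1.0};
    printf("forward(1.5) = %g\n", c.forward(1.5));   // 1
    printf("backward(1)  = %g\n", c.backward(1.0));  // 1, even though clipped
}
```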

View File

@ -0,0 +1,33 @@
// ***************************************************************
// Copyright (c) 2021 Jittor. All Rights Reserved.
// Maintainers: Dun Liang <randonlang@gmail.com>.
// This file is subject to the terms and conditions defined in
// file 'LICENSE.txt', which is part of this source code package.
// ***************************************************************
#pragma once
#include "op.h"
namespace jittor {
struct SafeClipOp : Op {
Var* x, * y;
float64 left, right;
/** Safe clip value to a range, and keep
the gradient pass through unchanged.
* [in] x: input value
* [in] left: float64 clip min value.
* [in] right: float64 clip max value.
*/
// @pybind(safe_clip)
SafeClipOp(Var* x, float64 left, float64 right);
const char* name() const override { return "safe_clip"; }
VarPtr grad(Var* out, Var* dout, Var* v, int v_index) override;
void infer_shape() override;
DECLARE_jit_run;
};
} // jittor

View File

@ -316,8 +316,11 @@ void SetitemOp::jit_run() {
checkCudaErrors(cudaMemcpyAsync(op, ip, out->size, cudaMemcpyDefault, 0));
#endif
if (data->allocation == in->allocation &&
data->allocator == in->allocator)
if (flags.get((NodeFlags::Flags(SetitemOp::_data_inplaced))) &&
// array op may move the data allocation, double check
// affects test_contrib.py
in->allocator == data->allocator &&
in->allocation == data->allocation)
return;
@for(d, 0, ODIM, for (index_t i@d=0; i@d < oshape@d; i@d++)) {

View File

@ -11,6 +11,7 @@
namespace jittor {
struct SetitemOp : Op {
static constexpr int _data_inplaced = NodeFlags::_has_vary_input + 1;
VarSlices vs;
// map i to related var slice
NanoVector i_to_vs;

View File

@ -168,7 +168,7 @@ static unordered_set<string> unary_ops = {
>>> a
jt.Var([ 2.101595 0.33055413 -0.44147047 -0.7720668 ], dtype=float32)
>>> jt.round(a)
jt.Var([ 2 0 0 -1], dtype=int32)
jt.Var([ 2.0 0.0 0.0 -1.0], dtype=float32)
*/
"round",
@ -185,7 +185,7 @@ static unordered_set<string> unary_ops = {
>>> a
jt.Var([-1.0339162 -0.7259972 -0.9220003 -0.8449701], dtype=float32)
>>> jt.floor(a)
jt.Var([-2 -1 -1 -1], dtype=int32)
jt.Var([-2.0 -1.0 -1.0 -1.0], dtype=float32)
*/
"floor",
@ -203,10 +203,63 @@ static unordered_set<string> unary_ops = {
>>> a
jt.Var([-1.0339162 -0.7259972 -0.9220003 -0.8449701], dtype=float32)
>>> jt.ceil(a)
jt.Var([-1 0 0 0], dtype=int32)
jt.Var([-1.0 0.0 0.0 0.0], dtype=float32)
*/
"ceil",
/**
Returns the closest integer of the input ``x``.
----------------
* [in] x: the input jt.Var.
----------------
Example-1::
>>> a = jt.randn(4)
>>> a
jt.Var([ 2.101595 0.33055413 -0.44147047 -0.7720668 ], dtype=float32)
>>> jt.round_int(a)
jt.Var([ 2 0 0 -1], dtype=int32)
*/
"round_int",
/**
Returns the largest integer less than or equal to the input ``x``.
----------------
* [in] x: the input jt.Var.
----------------
Example-1::
>>> a = jt.randn(4)
>>> a
jt.Var([-1.0339162 -0.7259972 -0.9220003 -0.8449701], dtype=float32)
>>> jt.floor_int(a)
jt.Var([-2 -1 -1 -1], dtype=int32)
*/
"floor_int",
/**
Returns the smallest integer greater than or equal to the input ``x``.
----------------
* [in] x: the input jt.Var.
----------------
Example-1::
>>> a = jt.randn(4)
>>> a
jt.Var([-1.0339162 -0.7259972 -0.9220003 -0.8449701], dtype=float32)
>>> jt.ceil_int(a)
jt.Var([-1 0 0 0], dtype=int32)
*/
"ceil_int",
/**
Returns the sine of the input ``x``.

View File

@ -21,6 +21,9 @@ namespace jittor {
#define round(T,x) ((T) ::roundf((x)))
#define floor(T,x) ((T) ::floorf((x)))
#define ceil(T,x) ((T) ::ceilf((x)))
#define round_int(T,x) ((T) ::roundf((x)))
#define floor_int(T,x) ((T) ::floorf((x)))
#define ceil_int(T,x) ((T) ::ceilf((x)))
#define sin(T,x) ((T) ::sinf((x)))
#define asin(T,x) ((T) ::asinf((x)))
@ -49,6 +52,9 @@ namespace jittor {
#define round(T,x) ((T)std::round((x)))
#define floor(T,x) ((T)std::floor((x)))
#define ceil(T,x) ((T)std::ceil((x)))
#define round_int(T,x) ((T)std::round((x)))
#define floor_int(T,x) ((T)std::floor((x)))
#define ceil_int(T,x) ((T)std::ceil((x)))
#define sin(T,x) ((T) std::sin((x)))
#define asin(T,x) ((T) std::asin((x)))

View File

@ -47,6 +47,8 @@ static void setitem_inplace(SetitemOp* op) {
return;
}
auto output = op->outputs().front();
// return if output is already shared
if (output->allocator) return;
output->share_with(input);
auto data = op->input(1);
@ -78,13 +80,13 @@ static void setitem_inplace(SetitemOp* op) {
VarSlice s = vs.slices[i];
if (!(s.is_slice())) return;
Slice ss = s.slice;
if (!(ss.start == 0 && ss.stop >= in_shape[i] && ss.step == 1))
if (!(ss.start == 0 && (ss.mask&2) && ss.step == 1))
return;
inplace_size *= in_shape[i];
}
VarSlice s = vs.slices[0];
if (s.is_var()) return;
if (s.is_var() || s.is_str()) return;
auto size = 0;
if (s.is_int())
@ -105,6 +107,7 @@ static void setitem_inplace(SetitemOp* op) {
}
add_dependency(data->input(), {input->node()});
data->share_with(input, size);
op->flags.set((NodeFlags::Flags(SetitemOp::_data_inplaced)));
}
struct BBox {
@ -174,7 +177,10 @@ static void getitem_inplace(GetitemOp* op) {
auto in = op->inputs().front();
auto ou = op->outputs().front();
// return if output is already inplaced
if (ou->allocator)
return;
// return if input or output's shape is variable
if (in->num <= 0 || ou->num <= 0)
return;
@ -186,12 +192,12 @@ static void getitem_inplace(GetitemOp* op) {
VarSlice s = vs.slices[i];
if (!(s.is_slice())) return;
Slice ss = s.slice;
if (!(ss.start == 0 && ss.stop >= in_shape[i] && ss.step == 1))
return;
if (!(ss.start == 0 && (ss.mask&2) && ss.step == 1))
return;
}
VarSlice s = vs.slices[0];
if (s.is_var()) return;
if (s.is_var() || s.is_str()) return;
auto size = 0;
if (s.is_int())
@ -213,7 +219,7 @@ void SetitemOp::graph_optimize() {
void GetitemOp::graph_optimize() {
// This optimize is still WIP
// LOGir << "hello getitem graph_optimize";
setitem_grad_opt(this);
// setitem_grad_opt(this);
(void)setitem_grad_opt;
// (void)getitem_inplace;
getitem_inplace(this);

View File

@ -22,7 +22,7 @@ void FloatAtomicFixPass::run() {
if (!choice) return;
unordered_map<string,int> fixed;
auto fix_float_atomic = [&](string name) {
auto fix_float_atomic = [&](string name, Var* v) {
if (fixed.count(name)) return;
fixed[name] = 1;
string namep = name+"p";
@ -38,9 +38,15 @@ void FloatAtomicFixPass::run() {
return;
// fix code a[b] = c -->
// a[b] = __int_as_float(floatToOrderedInt(c))
string new_code = namep+'['+results.at(0)->to_string(true)+
"] = __int_as_float(floatToOrderedInt(" +
results.at(1)->to_string(true) + "));";
string new_code;
if (v->dtype() == ns_float32)
new_code = namep+'['+results.at(0)->to_string(true)+
"] = __int_as_float(floatToOrderedInt(" +
results.at(1)->to_string(true) + "));";
else
new_code = namep+'['+results.at(0)->to_string(true)+
"] = __longlong_as_double(floatToOrderedInt(" +
results.at(1)->to_string(true) + "));";
LOGvvvv << "prev code" << code >> "\nreplace:" << new_code;
code = new_code;
});
@ -74,7 +80,7 @@ void FloatAtomicFixPass::run() {
}
if (!var->dtype().is_float()) return;
LOGvvvv << "find var" << var << "op" << op;
fix_float_atomic(s);
fix_float_atomic(s, var);
});
}

View File

@ -67,6 +67,10 @@ void LoopToFuncPass::run() {
args.push_back(d.get());
continue;
}
if (endswith(d->attrs["lvalue"], "_value")) {
args.push_back(d.get());
continue;
}
}
}
func->push_back(d->clone());
@ -99,7 +103,9 @@ void LoopToFuncPass::run() {
auto& fc = ir->children[i];
fc->attrs["loop_func"] = func->attrs["lvalue"];
}
// ir->remove_all_unused();
#ifdef __APPLE__
ir->remove_all_unused();
#endif
}
} // jittor

View File

@ -13,7 +13,7 @@
namespace jittor {
DECLARE_FLAG(int, para_opt_level);
DEFINE_FLAG(int, para_opt_level, 3, "para_opt_level");
void LoopVarAnalyzePass::run() {
// loop_vars: opi_xx->shape[j]
@ -130,6 +130,7 @@ void LoopVarAnalyzePass::run() {
}
loop_vars.reserve(loop_var->shape.size());
string vname = pm->oc->get_name_by_op_var(op, loop_var);
ASSERT(vname!="__fill__");
for (uint j=0; j<loop_var->shape.size(); j++)
loop_vars.emplace_back(vname+"->shape["+S(j)+"]");
break;

View File

@ -65,9 +65,12 @@ unique_ptr<MemoryChecker>* load_memory_checker(string name) {
LOGvv << "Opening jit lib:" << name;
#ifdef _WIN32
void* handle = (void*)LoadLibrary(name.c_str());
#else
#elif defined(__linux__)
void* handle = dlopen(name.c_str(), RTLD_LAZY | RTLD_DEEPBIND | RTLD_LOCAL);
msg = dlerror();
#else
void* handle = dlopen(name.c_str(), RTLD_LAZY | RTLD_LOCAL);
msg = dlerror();
#endif
CHECK(handle) << "Cannot open library" << name << ":" << msg;

View File

@ -20,6 +20,7 @@
namespace jittor {
DEFINE_FLAG(int, trace_py_var, 0, "Trace py stack max depth for debug.");
DEFINE_FLAG(int, trace_var_data, 0, "Trace var data for debug.");
Op* trace_grad_op = nullptr;
TraceData trace_data;
@ -185,6 +186,44 @@ static vector<Stack> get_stack_info() {
return stacks;
}
template<class T>
string get_str(T* t, int64 num) {
string s = "";
for (int64 i=0; i<num; i++) {
s += S(t[i]);
if (i != num-1)
s += ',';
}
return s;
}
static inline string get_var_data_str(Var* v) {
if (v->dtype() == ns_int8)
return get_str(v->ptr<int8>(), v->num);
if (v->dtype() == ns_int16)
return get_str(v->ptr<int16>(), v->num);
if (v->dtype() == ns_int32)
return get_str(v->ptr<int32>(), v->num);
if (v->dtype() == ns_int64)
return get_str(v->ptr<int64>(), v->num);
if (v->dtype() == ns_uint8)
return get_str(v->ptr<uint8>(), v->num);
if (v->dtype() == ns_uint16)
return get_str(v->ptr<uint16>(), v->num);
if (v->dtype() == ns_uint32)
return get_str(v->ptr<uint32>(), v->num);
if (v->dtype() == ns_uint64)
return get_str(v->ptr<uint64>(), v->num);
if (v->dtype() == ns_float32)
return get_str(v->ptr<float32>(), v->num);
if (v->dtype() == ns_float64)
return get_str(v->ptr<float64>(), v->num);
return "";
}
void TraceData::record_node(Node* node, bool record_stack) {
if (thread_name.size()) return;
NodeData data;
@ -255,6 +294,8 @@ void TraceData::record_exe_node(Node* node) {
data.attrs["dsize"] = S(v->dtype().dsize());
data.attrs["name"] = v->name.c_str();
data.attrs["is_var"] = "1";
if (trace_var_data && v->mem_ptr)
data.attrs["data"] = get_var_data_str(v);
} else {
auto op = node->op();
data.attrs["name"] = op->name_ex();

View File

@ -86,15 +86,19 @@ ArrayOp::ArrayOp(PyObject* obj) {
// use 32-bit by default
if ((auto_convert_64_to_32 || holder.obj)
&& args.dtype.dsize() == 8 && args.ptr) {
auto num = PyArray_Size(arr)/8;
auto size = PyArray_Size(arr);
args.buffer.reset(new char[size]);
auto pre_data = args.ptr;
args.ptr = args.buffer.get();
auto num = size/8;
if (args.dtype.is_int()) {
auto* __restrict__ i64 = (int64*)args.ptr;
auto* __restrict__ i64 = (int64*)pre_data;
auto* __restrict__ i32 = (int32*)args.ptr;
for (int i=0; i<num; i++)
i32[i] = (int32)i64[i];
args.dtype = ns_int32;
} else if (args.dtype.is_float()) {
auto* __restrict__ f64 = (float64*)args.ptr;
auto* __restrict__ f64 = (float64*)pre_data;
auto* __restrict__ f32 = (float32*)args.ptr;
for (int i=0; i<num; i++)
f32[i] = (float32)f64[i];
@ -136,7 +140,11 @@ ArrayOp::ArrayOp(PyObject* obj) {
std::memcpy(host_ptr, args.ptr, size);
} else {
// this is non-continue numpy array
#if defined(__linux__) || defined(_WIN32)
int64 dims[args.shape.size()];
#elif defined(__APPLE__)
long dims[args.shape.size()];
#endif
for (int i=0; i<args.shape.size(); i++)
dims[i] = args.shape[i];
holder.assign(PyArray_New(
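The conversion fix above matters because the old code narrowed 64-bit values in place, writing 32-bit results over the caller's buffer (for a borrowed numpy array, that corrupts the caller's data); the new code copies into a freshly allocated buffer first. The essence:

```cpp
// Narrow int64 -> int32 into a fresh buffer, leaving the source intact.
#include <cassert>
#include <cstdint>
#include <memory>

int main() {
    int64_t src[4] = {1, 2, 3, 4};          // caller-owned, must not change
    size_t bytes = sizeof(src);

    std::unique_ptr<char[]> buffer(new char[bytes]);
    auto* i64 = src;                         // read from the original
    auto* i32 = (int32_t*)buffer.get();      // write to the new buffer
    for (int i = 0; i < 4; i++)
        i32[i] = (int32_t)i64[i];

    assert(src[2] == 3);                     // source stays intact
    assert(i32[2] == 3);
}
```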

View File

@ -266,7 +266,11 @@ DEF_IS(ArrayArgs, bool) is_type(PyObject* obj) {
}
DEF_IS(ArrayArgs, PyObject*) to_py_object(const T& a) {
#if defined(__linux__) || defined(_WIN32)
int64 dims[a.shape.size()];
#elif defined(__APPLE__)
long dims[a.shape.size()];
#endif
for (int i=0; i<a.shape.size(); i++)
dims[i] = a.shape[i];
PyObjHolder obj(PyArray_SimpleNew(
@ -378,7 +382,11 @@ DEF_IS(VarHolder*, T) from_py_object(PyObject* obj, unique_ptr<VarHolder>& holde
struct DataView;
DEF_IS(DataView, PyObject*) to_py_object(T a) {
#if defined(__linux__) || defined(_WIN32)
int64 dims[a.shape.size()];
#elif defined(__APPLE__)
long dims[a.shape.size()];
#endif
for (int i=0; i<a.shape.size(); i++)
dims[i] = a.shape[i];
PyObjHolder oh(PyArray_New(

View File

@ -109,7 +109,11 @@ static void push_py_object(RingBuffer* rb, PyObject* obj, uint64& __restrict__ o
rb->push_t<NanoString>(args.dtype, offset);
rb->push(size, offset);
args.ptr = rb->get_ptr(size, offset);
#if defined(__linux__) || defined(_WIN32)
int64 dims[args.shape.size()];
#elif defined(__APPLE__)
long dims[args.shape.size()];
#endif
for (int i=0; i<args.shape.size(); i++)
dims[i] = args.shape[i];
PyObjHolder oh(PyArray_New(

View File

@ -415,8 +415,19 @@ inline Console() {
#endif
run("import jittor as jt");
make_pyjt_array = (PyObject* (*)(const vector<int64>& shape, const string& dtype, const void* data))dlsym(RTLD_DEFAULT, "_ZN6jittor15make_pyjt_arrayERKSt6vectorIlSaIlEERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEPKv");
get_pyjt_array = (void (*)(PyObject* obj, vector<int64>& shape, string& dtype, void*& data))dlsym(RTLD_DEFAULT, "_ZN6jittor14get_pyjt_arrayEP7_objectRSt6vectorIlSaIlEERNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERPv");
#ifdef __APPLE__
auto symbol_make_pyjt_array = "__ZN6jittor15make_pyjt_arrayERKNSt3__16vectorIxNS0_9allocatorIxEEEERKNS0_12basic_stringIcNS0_11char_traitsIcEENS2_IcEEEEPKv";
auto symbol_gen_pyjt_array = "__ZN6jittor14get_pyjt_arrayEP7_objectRNSt3__16vectorIxNS2_9allocatorIxEEEERNS2_12basic_stringIcNS2_11char_traitsIcEENS4_IcEEEERPv";
#else
auto symbol_make_pyjt_array = "_ZN6jittor15make_pyjt_arrayERKSt6vectorIlSaIlEERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEPKv";
auto symbol_gen_pyjt_array = "_ZN6jittor14get_pyjt_arrayEP7_objectRSt6vectorIlSaIlEERNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERPv";
#endif
make_pyjt_array = (PyObject* (*)(const vector<int64>& shape, const string& dtype, const void* data))dlsym(RTLD_DEFAULT, symbol_make_pyjt_array);
get_pyjt_array = (void (*)(PyObject* obj, vector<int64>& shape, string& dtype, void*& data))dlsym(RTLD_DEFAULT, symbol_gen_pyjt_array);
if (!make_pyjt_array || !get_pyjt_array) {
std::cerr << "get symbol failed." << std::endl;
exit(1);
}
}
inline ~Console() {

View File

@ -141,7 +141,7 @@ size_t skip_comments(const string& src, size_t i) {
return i;
}
void process(string src, vector<string>& input_names) {
void process(string src, vector<string>& input_names, string& cmd) {
for (size_t i=0; i<src.size(); i++) {
i = skip_comments(src, i);
if (i>=src.size()) break;
@ -163,6 +163,20 @@ void process(string src, vector<string>& input_names) {
input_names.push_back(inc);
}
}
if (l-k>2 && src[k] == 'J' && src[k+1] == 'T' && j-i==6 && src.substr(i,j-i) == "#ifdef") {
auto inc = src.substr(k, l-k);
auto env = getenv(inc.c_str());
if (env && string(env)!="0") {
string dflag = " -D"+inc+"="+string(env)+" -o ";
if (cmd.find(dflag) == string::npos) {
// -D flags should be inserted before the -o flag
auto cmds = split(cmd, " -o ", 2);
if (cmds.size() == 2) {
cmd = cmds[0] + dflag + cmds[1];
}
}
}
}
i=l;
}
}
@ -191,7 +205,7 @@ static inline bool is_full_path(const string& name) {
#endif
}
bool cache_compile(const string& cmd, const string& cache_path, const string& jittor_path) {
bool cache_compile(string cmd, const string& cache_path, const string& jittor_path) {
vector<string> input_names;
map<string,vector<string>> extra;
string output_name;
@ -199,29 +213,21 @@ bool cache_compile(const string& cmd, const string& cache_path, const string& ji
string output_cache_key;
bool ran = false;
output_cache_key = read_all(output_name+".key");
// string cd_cmd = cache_path.size() ? "cd " + cache_path + " && " + cmd : cmd;
string cd_cmd = cmd;
if (output_cache_key.size() == 0) {
LOGvv << "Cache key of" << output_name << "not found.";
LOGvvv << "Run cmd:" << cmd;
check_win_file(output_name);
system_with_check(cd_cmd.c_str());
ran = true;
}
string cache_key = cmd;
cache_key += "\n";
string cache_key;
unordered_set<string> processed;
auto src_path = join(jittor_path, "src");
const auto& extra_include = extra["I"];
for (size_t i=0; i<input_names.size(); i++) {
if (processed.count(input_names[i]) != 0)
continue;
if (input_names[i] == "dynamic_lookup")
continue;
processed.insert(input_names[i]);
auto src = read_all(input_names[i]);
ASSERT(src.size()) << "Source read failed:" << input_names[i];
auto hash = S(hash64(src));
vector<string> new_names;
process(src, new_names);
process(src, new_names, cmd);
for (auto& name : new_names) {
string full_name;
if (name.substr(0, 4) == "jit/" || name.substr(0, 4) == "gen/")
@ -251,11 +257,18 @@ bool cache_compile(const string& cmd, const string& cache_path, const string& ji
cache_key += hash;
cache_key += "\n";
}
cache_key = cmd + "\n" + cache_key;
if (output_cache_key.size() == 0) {
LOGvv << "Cache key of" << output_name << "not found.";
LOGvvv << "Run cmd:" << cmd;
system_with_check(cmd.c_str());
ran = true;
}
if (output_cache_key.size() != 0 && output_cache_key != cache_key) {
LOGvv << "Cache key of" << output_name << "changed.";
LOGvvv << "Run cmd:" << cmd;
check_win_file(output_name);
system_with_check(cd_cmd.c_str());
system_with_check(cmd.c_str());
ran = true;
}
if (output_cache_key != cache_key) {
@ -326,7 +339,8 @@ void test_find_nams_error(string cmd) {
void test_process(string src, vector<string> files) {
vector<string> ifiles;
jittor::jit_compiler::process(src, ifiles);
string cmd;
jittor::jit_compiler::process(src, ifiles, cmd);
CHECK(files.size() == ifiles.size());
for (size_t i=0; i<files.size(); i++)
CHECKop(files[i],==,ifiles[i]);
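The new `#ifdef JT` scan in `process()` turns environment variables that gate source code (any `#ifdef JT_xxx`) into `-D` compiler flags, spliced into the command just before `-o`; since the cache key includes the command, toggling such a variable also invalidates the cache. A sketch of the splice, using `find`/`substr` where the real code splits on `" -o "` (the helper name `inject_define` is mine):

```cpp
// Insert "-DNAME=VAL" before the " -o " flag of a compile command,
// skipping the insert if the flag is already present.
#include <cassert>
#include <string>

static std::string inject_define(std::string cmd, const std::string& name,
                                 const std::string& val) {
    std::string dflag = " -D" + name + "=" + val + " -o ";
    if (cmd.find(dflag) != std::string::npos) return cmd;  // already present
    auto pos = cmd.find(" -o ");
    if (pos != std::string::npos)
        cmd = cmd.substr(0, pos) + " -D" + name + "=" + val + cmd.substr(pos);
    return cmd;
}

int main() {
    auto cmd = inject_define("g++ a.cc -o a.so", "JT_SYNC", "1");
    assert(cmd == "g++ a.cc -DJT_SYNC=1 -o a.so");
}
```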

View File

@ -14,7 +14,7 @@ string read_all(const string& fname);
void write(const string& fname, const string& src);
bool file_exist(const string& fname);
string join(string a, string b);
bool cache_compile(const string& cmd, const string& cache_path="", const string& jittor_path="");
bool cache_compile(string cmd, const string& cache_path="", const string& jittor_path="");
} // jit_compiler
} // jittor

View File

@ -8,11 +8,11 @@
#include "pyjt/py_converter.h"
#include "pyjt/py_arg_printer.h"
#ifdef __clang__
#pragma clang diagnostic ignored "-Wdefaulted-function-deleted"
// #pragma clang diagnostic ignored "-Wdefaulted-function-deleted"
#endif
#ifdef __GNUC__
#endif
#ifndef _WIN32
#ifdef __linux__
#include <sys/prctl.h>
#endif
#include <signal.h>
@ -23,7 +23,7 @@
namespace jittor {
void init_subprocess() {
#ifndef _WIN32
#ifdef __linux__
prctl(PR_SET_PDEATHSIG, SIGKILL);
#endif
}

View File

@ -206,6 +206,8 @@ int segfault_happen = 0;
string thread_local thread_name;
static int _pid = getpid();
vector<void(*)()> cleanup_callback;
vector<void(*)()> sigquit_callback;
int64 last_q_time;
#ifdef _WIN32
void handle_signal(int signal) {
@ -214,13 +216,49 @@ void handle_signal(int signal) {
abort();
}
#else
static inline void do_exit() {
#ifdef __APPLE__
_Exit(1);
#else
std::quick_exit(1);
#endif
}
void segfault_sigaction(int signal, siginfo_t *si, void *arg) {
if (signal == SIGQUIT) {
if (_pid == getpid()) {
std::cerr << "Caught SIGQUIT" << std::endl;
int64 now = clock();
if (now > last_q_time && last_q_time+CLOCKS_PER_SEC/10 > now) {
last_q_time = now;
std::cerr << "GDB attach..." << std::endl;
breakpoint();
} else {
last_q_time = now;
for (auto f : sigquit_callback)
f();
}
}
return;
}
if (signal == SIGCHLD) {
if (si->si_code != CLD_EXITED && si->si_status != SIGTERM) {
LOGe << "Caught SIGCHLD"
<< "si_errno:" << si->si_errno
<< "si_code:" << si->si_code
<< "si_status:" << si->si_status
<< ", quick exit";
exited = true;
do_exit();
}
return;
}
if (signal == SIGINT) {
if (_pid == getpid()) {
LOGe << "Caught SIGINT, quick exit";
}
exited = true;
std::quick_exit(1);
do_exit();
}
std::cerr << "Caught segfault at address " << si->si_addr << ", "
<< "thread_name: '" << thread_name << "', flush log..." << std::endl;
@ -266,6 +304,7 @@ int register_sigaction() {
sigaction(SIGSTOP, &sa, NULL);
sigaction(SIGFPE, &sa, NULL);
sigaction(SIGINT, &sa, NULL);
sigaction(SIGCHLD, &sa, NULL);
sigaction(SIGILL, &sa, NULL);
sigaction(SIGBUS, &sa, NULL);
sigaction(SIGQUIT, &sa, NULL);
@ -364,6 +403,34 @@ but you can hot fix it by this command:
)";
}
static inline void check_cuda_gcc_version(const string& output) {
/* if such an error occurs:
error: identifier "__is_assignable" is undefined
it means your gcc version does not match your nvcc;
for example, nvcc 10 supports gcc<=7 and nvcc 11 supports gcc<=9,
https://gist.github.com/ax3l/9489132
*/
string pat = "__is_assignable";
auto id = output.find(pat);
if (id == string::npos) return;
LOGf << output << R"(
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Dear user, your nvcc and gcc versions still do not match
after the dirty hack; you should install the correct version of g++
or nvcc. For example, nvcc 10 supports g++<=7 and nvcc 11 supports g++<=9;
here is the NVCC Compatibility Matrix:
https://gist.github.com/ax3l/9489132
Please install the correct version of gcc, for example:
>>> sudo apt install g++-7
After your g++ is installed, use the environment variable `cc_path` to
tell jittor to use the correct version of g++, for example:
>>> cc_path='g++-7' python3.7 -m jittor.test.test_core
If you still have problems, please contact us:
https://github.com/Jittor/jittor/issues
)";
}
#ifdef _WIN32
int system_popen(const char *cmd) {
HANDLE g_hChildStd_OUT_Rd = NULL;
@ -444,6 +511,7 @@ int system_popen(const char *cmd) {
if (ec) {
check_cuda_unsupport_version(output);
check_cuda_gcc_version(output);
}
return ec;
}
@ -468,6 +536,7 @@ int system_popen(const char* cmd) {
}
if (ret) {
check_cuda_unsupport_version(output);
check_cuda_gcc_version(output);
}
return ret;
}

View File

@ -34,9 +34,9 @@ constexpr int32_t basename_index(const char * const path, const int32_t index =
#define PREDICT_BRANCH_NOT_TAKEN(x) (__builtin_expect(x, 0))
extern "C" uint32_t get_tid();
extern "C" bool g_supports_color;
extern "C" void print_prefix(std::ostream* out);
extern uint32_t get_tid();
extern bool g_supports_color;
extern void print_prefix(std::ostream* out);
#ifdef _WIN32
constexpr char green[] = "\x1b[1;32m";
@ -83,10 +83,10 @@ static void get_color(char level, int verbose, const char*& color_begin, const c
#endif
extern "C" void send_log(std::ostringstream&& out);
extern "C" void flush_log();
extern "C" void log_capture_start();
extern "C" void log_capture_stop();
extern void send_log(std::ostringstream&& out);
extern void flush_log();
extern void log_capture_start();
extern void log_capture_stop();
extern std::vector<std::map<string,string>> log_capture_read();
extern string thread_local thread_name;
@ -170,9 +170,10 @@ template<class T> T get_from_env(const char* name,const T& _default) {
template<> std::string get_from_env(const char* name, const std::string& _default);
#define DECLARE_FLAG(type, name) \
extern "C" type name; \
extern "C" std::string doc_ ## name; \
extern "C" void set_ ## name (const type&);
extern type name; \
extern std::string doc_ ## name; \
extern void set_ ## name (const type&);
#ifdef JIT

View File

@ -25,15 +25,15 @@ bool endswith(const string& a, const string& b) {
vector<string> split(const string& s, const string& sep, int max_split) {
vector<string> ret;
int pos = -1, pos_next;
int pos = 0, pos_next;
while (1) {
pos_next = s.find(sep, pos+1);
pos_next = s.find(sep, pos);
if (pos_next == (int)string::npos || (int)ret.size() == max_split-1) {
ret.push_back(s.substr(pos+sep.size()));
ret.push_back(s.substr(pos));
return ret;
}
ret.push_back(s.substr(pos+sep.size(), pos_next-pos-sep.size()));
pos = pos_next;
ret.push_back(s.substr(pos, pos_next-pos));
pos = pos_next + sep.size();
}
ASSERT(max_split==0);
return ret;
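The rewritten `split` fixes a separator-length bug: the old version mixed `pos+1` stepping with `pos+sep.size()` slicing, which mis-sliced for multi-character separators. A standalone copy with usage, including the `max_split` tail behavior the cache-compile `-D` splice above relies on:

```cpp
// split("a,b,c", ",", 2) -> {"a", "b,c"}: once max_split-1 pieces are
// taken, the remainder is returned whole.
#include <cassert>
#include <string>
#include <vector>

static std::vector<std::string> split(const std::string& s,
                                      const std::string& sep, int max_split = 0) {
    std::vector<std::string> ret;
    size_t pos = 0;
    while (true) {
        size_t pos_next = s.find(sep, pos);
        if (pos_next == std::string::npos || (int)ret.size() == max_split - 1) {
            ret.push_back(s.substr(pos));
            return ret;
        }
        ret.push_back(s.substr(pos, pos_next - pos));
        pos = pos_next + sep.size();
    }
}

int main() {
    auto parts = split("g++ a.cc -o a.so", " -o ");
    assert(parts.size() == 2 && parts[0] == "g++ a.cc" && parts[1] == "a.so");

    auto two = split("a,b,c", ",", 2);
    assert(two.size() == 2 && two[1] == "b,c");
}
```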

Some files were not shown because too many files have changed in this diff.